Knee Bone Models From Ultrasound

The number of total knee arthroplasties performed worldwide is on the rise. Patient-specific planning and implants may improve surgical outcomes but require 3-D models of the bones involved. Ultrasound (US) may become a cheap and nonharmful imaging modality if the shortcomings of segmentation techniques in terms of automation, accuracy, and robustness are overcome; furthermore, any kind of US-based bone reconstruction must involve some kind of model completion to handle occluded areas, for example, the frontal femur. A fully automatic and robust processing pipeline is proposed, generating full bone models from 3-D freehand US scanning. A convolutional neural network (CNN) is combined with a statistical shape model (SSM) to segment and extrapolate the bone surface. We evaluate the method in vivo on ten subjects, comparing the US-based model to a magnetic resonance imaging (MRI) reference. The partial freehand 3-D record of the femur and tibia bones deviates by 0.7–0.8 mm from the MRI reference. After completion, the full bone model shows an average submillimetric error in the case of the femur and 1.24 mm in the case of the tibia. Processing of the images is performed in real time, and the final model fitting step is computed in less than a minute. A full record took an average of 22 min per subject.


I. INTRODUCTION
REPLACING the knee joint surface with an implant is one of the most frequently conducted surgeries worldwide [1]. Statistically, one in four people in the United States will receive a total knee arthroplasty (TKA) in their lifetime [2], [3], [4]. The main indication for knee replacement is osteoarthritis, which affects 13% of women and 10% of men over the age of 60 in the United States [5].
The shape, size, and positioning of the implant are crucial factors for a functional knee replacement and acceptable patient satisfaction [6], [7]. Different anatomical structures require different implant sizes, whereas a linear scaling of a one-shape implant is not sufficient for a satisfactory result [8].
Apart from lower costs, US as an alternative medical imaging technique has the advantage of being nonharmful [24]. Moreover, it offers the possibility of visualizing various tissues or bones and real-time capability, which enables functional diagnosis [21], [25]; however, US also has some disadvantages: noise due to diffuse deflections, imaging artifacts, blurred bone surfaces, and bone shadowing [26], [27]. Probes for acquiring 2-D images can be combined with a localizer system, which can be optical, electromagnetic, or mechanical, for the 3-D imaging of large anatomical areas, such as the knee [26]. This approach is called "freehand 3-D US." Given the drawbacks mentioned above, every knee bone US scan will lack information on the inner tibiofemoral and patellofemoral joint space, which is occluded by the patella. As such, some kind of model completion is necessary to provide full bone models to the clinician.

II. RELATED WORK
The reconstruction of complete bone surface models from US data is mostly approached by a two-stage concept: first, US images are segmented for the derivation of a partial surface point cloud. Second, the partial surface scan is completed to a full bone model using statistical shape models (SSMs). Accordingly, we first present related work on US image segmentation, followed by related work on US-based model completion.

A. US Image Segmentation
Bone surfaces need to be segmented in US images for the extraction of partial bone surface points. In addition to characteristics of US images, such as noise and artifacts, one of the most challenging aspects of US segmentation is the distinction of bone interfaces from other tissue interfaces. The survey of Pandey et al. [25] provides a comprehensible introduction to the topic and gives an overview of US segmentation up to the year 2019.
Ronneberger et al. [28] established a standard for medical image segmentation in 2015 based on learned features. They introduce a convolutional neural network (CNN)-based encoder-decoder architecture with skip connections called "U-Net." Subsequent publications introduce extensions, such as the 3-D U-Net [29], the U-Net++ with redesigned skip connections [30] or the nnU-Net framework for the automated configuration of architecture and training processes [31].
In addition to U-Net and its variants, other architectures employed for US bone segmentation include the Mask R-CNN [53], Pyramid Attention Network [54], and DeepLabv3+ [55], [56], [57]; further approaches include the use of conditional generative adversarial networks, such as the pix2pix architecture introduced by Isola et al. [58] for image-to-image translation. Zhou et al. [59] employ pix2pix for direct segmentation and the improvement of U-Net segmentations. Similarly, Alsinan et al. [60] condition a conditional generative adversarial network for the generation of bone shadow maps, which are then fused with CNN-based features.

B. Reconstruction
SSMs allow the incorporation of morphological statistics for the creation of new shapes as a linear combination of the mean shape \bar{x} and the modes matrix P. Approaches for US-based model completion to date rely mostly on SSMs for shape generation but differ in the methods of building the SSM and adapting it to US data. Morooka et al. [61] and Sarkalkan et al. [62] give an overview of US-based model completion; we refer to [63] for a survey and comprehensible introduction.
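The linear shape model mentioned above is commonly written as follows. This is a sketch of the standard point distribution model formulation; the weight vector b and the eigenvalues \lambda_i are notation assumed here and do not appear in the original text.

```latex
x = \bar{x} + P\,b, \qquad b = (b_1, \dots, b_m)^{\top}
% \bar{x}: mean shape (stacked vertex coordinates)
% P: matrix whose m columns are the modes of variation
% b_i: weight of mode i, often bounded for plausibility, e.g. |b_i| \le 3\sqrt{\lambda_i}
```

Adapting the SSM to US data then amounts to estimating b together with a pose, which is where the approaches reviewed in this section differ.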
Hacihaliloglu et al. [64] propose a statistical model of the shape and pose of the spine and fit it to local phase features extracted from 3-D US data for reconstruction. Regarding model adaptation, they employ a method based on Gaussian mixture models combined with an expectation-maximization scheme or a quasi-Newton optimization [65], [66], which bypasses the need for correspondences.
Anas et al. [67] propose a very similar concept, using a statistical model of pose and shape for the intraoperative reconstruction of carpal bones from US. They first adapt pose and shape components to preoperative CT data and then adapt only the pose components for the subsequent reconstruction from segmented US images. The implementation of the model adaptation is again based on correspondence-free Gaussian mixture models, initialized with landmark-based registration.
As opposed to previous methods, some approaches rely on point correspondences for SSM adaption to US: Hänisch et al. [68] adapt an SSM to segmented 3-D US images for the reconstruction of knee bones. They employ a variant of the iterative closest point algorithm for establishing point correspondences. In a subsequent publication, correspondences are established by searching for high-intensity pixels along vertex normals in the original 3-D US image [54], [69].
The recent work of Mahfouz et al. [70] is most closely related to our work: they employ 3-D freehand US with electromagnetic tracking for image acquisition. Bone surfaces are segmented in the frequency-range data acquired and combined into a 3-D partial point cloud based on tracking information. After the outlier removal on the partial point cloud, they match it to an SSM for the completion of the distal femoral and proximal tibial bone models; however, essential details, such as the fitting method of their implementation, remain unknown.
Compared to related work, the objectives of our work are as follows.
1) The development of a new SSM adaptation algorithm based on a general-purpose optimizer (GPO).
2) The evaluation of the GPO approach against the state of the art.
3) The proposal of a fully automated workflow with real-time capability.
4) The presentation of an entire pipeline, including all the algorithms.
5) Open access to test samples and results.
6) The first in vivo evaluation in a near-clinical setting.

III. METHODS
The workflow proposed is a multistep procedure, as demonstrated in Fig. 1: landmarks were acquired using a freehand 3-D setup with a 2-D probe and a tracking system for the prepositioning of the image volume relative to the mean shape of the SSM. Next, two partial scans of the knee were acquired using the US probe. The 2-D images were segmented fully automatically in real time by a CNN. After the acquisition, the partial scans of the knee were combined with the SSM to reconstruct a full bone model of the distal femur and proximal tibia.

A. Experimental Setup
We used a Clarius L15 HD (Clarius, Vancouver, BC, Canada) probe to acquire the US images. The Clarius system provides an application programming interface (API), which we use to feed the image data into our custom software. The imaging depth was adjusted according to the approximate depth of the bone. The Clarius software provided fully automatic adjustment of imaging parameters, such as the frequency, easing the acquisition process.
The FusionTrack500 (Atracsys, Puidoux, Switzerland) localizer system, which comes with its own application programming interface, was used for tracking. We tracked the position of the probe and a dynamic reference base attached to the subject's lower leg by Velcro braces at 300 Hz. Tracking data were matched with image data based on the acquisition time recorded by our software. The beamforming and image transfer process of the US images induced a latency, which we determined in preliminary experiments to be 175 ms. Given the time stamps, we matched the image and tracking data and constructed a transformation chain from the following components. The probe calibration transformation was determined from computer-aided design data of the 3-D-printed adapter connecting the tracking body to the probe and validated using a sphere-based method. T_Probe was the pose of the probe, T_DRB was the pose of the dynamic reference base, and T_model was the transformation of the anatomical landmarks on the mean shape of the SSM to the anatomical landmarks of the subject acquired prior to the record.
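The latency-compensated matching of images to tracking samples can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that image time stamps are taken when the image arrives in the software (so the 175 ms latency is subtracted) and that tracking time stamps are sorted. The function name `match_pose_index` is hypothetical.

```python
from bisect import bisect_left

LATENCY_S = 0.175  # image latency reported in the text (175 ms)


def match_pose_index(track_times, image_time, latency=LATENCY_S):
    """Return the index of the tracking sample nearest to the corrected image time.

    track_times: sorted list of tracking time stamps in seconds.
    image_time:  arrival time stamp of the US image in seconds (assumption).
    """
    if not track_times:
        raise ValueError("no tracking samples available")
    target = image_time - latency  # estimated time the image was actually formed
    i = bisect_left(track_times, target)
    if i == 0:
        return 0
    if i == len(track_times):
        return len(track_times) - 1
    before, after = track_times[i - 1], track_times[i]
    # Pick whichever neighbor is closer in time to the corrected image time.
    return i if (after - target) < (target - before) else i - 1


# One second of tracking data sampled at 300 Hz, as in the setup described above.
track_times = [k / 300 for k in range(300)]
pose_index = match_pose_index(track_times, image_time=0.375)
```

At 300 Hz the nearest tracking sample is at most about 1.7 ms away from the corrected image time, so nearest-neighbor matching is a plausible simplification; interpolating between the two neighboring poses would be an alternative.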
The subject was put in a sitting position with the leg in 90° flexion. The patella is moved distally in this pose, minimizing the occlusion of the femur. The optical tracking camera was positioned at the subject's feet area. The clinical expert sonographer was standing to the side of the subject, enabling free movement and providing a view of the knee and our custom software simultaneously (see Fig. 2). The approximate position of the femur was determined by acquiring the anatomical landmarks, namely, the medial and lateral epicondyle, as well as a point on the distal femoral shaft. The tibial landmarks were medial and lateral points on the proximal crest, as well as a point on the proximal shaft. Note that the rather ambiguous points on the shafts were chosen due to the requirement that all points are accessible in both the prone and seated position. Alternative registration methods include pair-wise surface-based registration and registration onto the SSM mean shape. After the frontal femur and tibia had been recorded, the subject was asked to lie down in a prone position, and the scanning of the posterior femur and tibia was performed. During this acquisition process, the sonographer tried to optimize the probe orientation to the bone surface in order to obtain a distinct bone response and tried not to include bones other than the respective target anatomy in the record. After the record had been completed, both point sets were registered to the SSM mean shape using a custom MATLAB (MathWorks, Natick, MA, USA) implementation of the iterative closest point algorithm that is robust to outliers. The point sets acquired by this process were still partial point sets, as the inner parts of the knee joint spaces cannot be imaged due to bone shadowing.

B. Datasets
Prior to the experiment described above, training data for the individual pipeline components were acquired as follows.
1) US Segmentation: A separate set of 4565 images was acquired and annotated beforehand. The images depict all parts of the knee joint, including areas showing no bone surface at all. Only the thin bone response was annotated and not the bone shadow. The dataset included ten subjects, ages 26-38, seven male and three female.
The dataset was split 10:1 into a train and a validation dataset. Image height varied from 150 to 750 pixels, and image width 420-650 pixels. Accordingly, the pixel spacing varied from 0.08 to 0.12 mm per pixel. See Fig. 3 for an example image.
2) Statistical Shape Model: Two SSMs of the tibia and femur, respectively, were built from 414 anonymized bone surface datasets of patients that had undergone TKA. All patients underwent CT imaging, which was subsequently manually segmented and freed from osteophytes by experts. The mean shape surface models were aligned according to anatomical landmarks. See Fig. 4 for an example mesh of the distal femur. We refer the reader to our previous publication for additional details [54].
3) Reference: MRI: Magnetic resonance imaging (MRI) was used as a reference to evaluate the accuracy of the US-based reconstruction. Nine subjects of age 28-58 were recruited, five males and four females. A 3-D water-selective cartilage scan was recorded, and the femur and tibia were manually segmented by medical engineering students. The segmentation was validated by an expert with more than five years of clinical experience. All volumes have an image size of 720 × 720 × 266. The slice thickness is 0.75 mm, and the in-plane pixel spacing is 0.25 mm. See Fig. 5 for an example image and its annotation performed with 3D Slicer [71]. A recent high-resolution CT image was available for one subject.

C. Segmentation by CNN
We opted for DeepLabv3+ for image segmentation as it combines a fast processing speed with high accuracy. The lightweight MobileNetV2 backbone was chosen as its low model capacity enabled better generalization on our small dataset. At the same time, this enabled fast inference. We refer to our previous publication [57] and the article by Chen et al. [72] for implementation details. We did not find any benefit from employing more recent CNN or vision transformer architectures on the task of bone segmentation [57], [73]. The network was trained on a separate dataset for the experiment at hand, described in Section III-B. Hyperparameter tuning involved the learning rate and loss function. The bone surface areas inferred were skeletonized to a single pixel-wide centerline during the US image acquisition.
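The reduction of a segmented bone response to a one-pixel-wide centerline can be illustrated with a deliberately simplified stand-in. The sketch below does not perform morphological skeletonization; it assumes that each image column contains at most one bone response and takes the mean row of the foreground pixels in that column. The function name `column_centerline` is hypothetical, and the original text does not specify which skeletonization method was used.

```python
def column_centerline(mask):
    """Reduce a binary mask to one row index per column (or None if empty).

    mask: list of rows, each a list of 0/1 values (rows = depth, columns = lateral).
    This is a simplification assuming a single bone response per column.
    """
    n_rows = len(mask)
    n_cols = len(mask[0]) if mask else 0
    centerline = []
    for c in range(n_cols):
        rows = [r for r in range(n_rows) if mask[r][c]]
        if rows:
            # Mean depth of the foreground pixels in this column.
            centerline.append(round(sum(rows) / len(rows)))
        else:
            centerline.append(None)
    return centerline


# Toy 5 x 3 mask: a three-pixel-thick response in column 0, nothing in column 1,
# and a single-pixel response in column 2.
toy_mask = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
toy_centerline = column_centerline(toy_mask)
```

Each remaining pixel can then be converted to millimeters with the pixel spacing and mapped into 3-D using the tracked probe pose.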

D. Bone Model Completion by SSM
All methods were implemented with MATLAB. Several algorithms were applied to adapt the SSM to the incomplete point set: 1) the active shape model (ASM) search, originally introduced by Cootes et al. [74], alternates between establishing correspondences by nearest-neighbor search and updating the rigid transformation and modes by projecting the vertex position deltas into mode space; 2) a simultaneous least-squares (LSO) optimization of pose and shape, as suggested by Blanz et al. [75]; and 3) direct optimization of pose and shape by minimizing an objective function on the average surface distance between the shape model and the incomplete point cloud. A GPO, namely, "fminsearch" by MATLAB, was used to minimize the objective function. Hyperparameters, for example, the number of modes predicted or the number of iterations, were tuned in preliminary in silico studies.
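The third approach, direct optimization with a general-purpose optimizer, can be sketched on a toy problem. MATLAB's fminsearch implements the Nelder-Mead simplex method, so the sketch uses a compact, simplified Nelder-Mead written in plain Python. Everything else is an assumption made for illustration: a 2-D contour instead of a 3-D mesh, a single hypothetical mode, translation-only pose, and nearest-vertex distance as the objective. It is not the authors' implementation.

```python
import math


def ssm_instance(mean_shape, mode, weight, tx, ty):
    """Toy 2-D shape: mean + weight * mode, then translated by (tx, ty)."""
    return [(mx + weight * px + tx, my + weight * py + ty)
            for (mx, my), (px, py) in zip(mean_shape, mode)]


def mean_nearest_distance(points, shape):
    """Directed mean distance from partial points to the nearest model vertex."""
    return sum(min(math.dist(p, v) for v in shape) for p in points) / len(points)


def nelder_mead(f, x0, step=0.25, iters=300):
    """Simplified Nelder-Mead simplex search (reflect, expand, contract, shrink)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        reflect = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflect) < f(best):
            expand = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expand if f(expand) < f(reflect) else reflect
        elif f(reflect) < f(simplex[-2]):
            simplex[-1] = reflect
        else:
            contract = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contract) < f(worst):
                simplex[-1] = contract
            else:
                # Shrink all vertices halfway toward the best vertex.
                simplex = [best] + [[bi + 0.5 * (vi - bi) for bi, vi in zip(best, v)]
                                    for v in simplex[1:]]
    return min(simplex, key=f)


# Hypothetical toy SSM: 24 vertices on a unit circle, one mode that grows the radius.
ANGLES = [2 * math.pi * k / 24 for k in range(24)]
MEAN = [(math.cos(a), math.sin(a)) for a in ANGLES]
MODE = MEAN

TRUE_PARAMS = (0.2, 0.3, -0.1)  # (mode weight, tx, ty) used to simulate a subject
partial_scan = ssm_instance(MEAN, MODE, *TRUE_PARAMS)[:12]  # only half is "visible"


def objective(params):
    weight, tx, ty = params
    return mean_nearest_distance(partial_scan, ssm_instance(MEAN, MODE, weight, tx, ty))


start = [0.0, 0.0, 0.0]
fitted = nelder_mead(objective, start)
```

The fitted parameters define the full contour, including the half that was never observed, which mirrors how the SSM extrapolates occluded bone surface. Because the objective is directed from the scan to the model, unobserved regions are constrained only by the shape statistics.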

E. Evaluation
The performance of the methods was evaluated in terms of speed and accuracy. Computation times for segmenting 2-D image slices and the adaptation of the shape models are reported, as well as total times for image acquisition of the individual bones and subjects. Regarding accuracy, the average surface distance error (SDE) directed from 1) the partially recorded point set and 2) the full reconstructed model to the reference MRI mesh is reported. See (1) for the definition, where X and Y are the source and target point sets, respectively, N the size of the source point set, and dist(x, y) the Euclidean distance between two points in R^3:

SDE(X, Y) = \frac{1}{N} \sum_{x \in X} \min_{y \in Y} \mathrm{dist}(x, y)    (1)

As the partial set only represents part of the bone, a directed distance from the partial scan to the ground truth mesh was computed. It should be noted that errors in terms of bone surfaces not segmented by the algorithm are not quantified by this evaluation. A directed distance from the adapted model to the ground truth mesh was computed for the full reconstruction. Only those parts of the bones that are relevant for TKA planning were evaluated (see Fig. 6). In contrast to the partial scan, the adapted model is guaranteed to include the entire bone surface of the knee. Additionally, we provide the overall anterior-posterior (AP) and medial-lateral (ML) sizes of both the MRI reference and the US-based reconstruction of the femur and tibia.

IV. RESULTS
Table I presents the evaluation of the partial US records. The mean distance from the recorded frontal femoral surface to the ground truth ranges from 0.81 to 2.68 mm. The high value of 2.68 mm is due to the tibia being visible in the record, with all other values being well below this value. See Fig. 7(a) and (b) for two examples. In the case of the frontal tibia, the distance ranges from 1.4 to 3.83 mm, which appears to be much worse; however, the fibula is part of the scan in all records, as shown exemplarily in Fig. 7(c).
Note that this does not affect the reconstruction quality, as these points are outliers and far from the reconstructed geometry. When adding the posterior scan to the partial point clouds, the error increases strongly. In the case of the femur, it ranges from 1.14 to 3.86 mm, and in the case of the tibia, from 2.49 to 6.72 mm. Again, this high error is mostly due to other bones being part of the records. Several cases, however, show a clear geometry mismatch; see, for example, the medial posterior femoral condyles in Fig. 7(d).
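The directed SDE from Section III-E, which underlies the tables in this section, can be sketched as follows. This is a minimal illustration that approximates the reference mesh by its vertices (an assumption; a point-to-triangle distance would be more precise for coarse meshes). The function name `directed_sde` is hypothetical.

```python
import math


def directed_sde(source, target):
    """Mean Euclidean distance from each source point to its nearest target point."""
    if not source or not target:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for x in source:
        total += min(math.dist(x, y) for y in target)
    return total / len(source)


# Toy example: the third target point plays the role of surface that the
# source scan never covered.
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]

scan_to_reference = directed_sde(scan, reference)
reference_to_scan = directed_sde(reference, scan)
```

The asymmetry explains two observations in the text: surface that the scan misses does not raise the scan-to-reference error, whereas points recorded on neighboring bones, such as the fibula, are far from the reference and inflate it.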
In order to isolate the surface mismatch from the false positive detection of neighboring bones, we manually removed the latter from the scan of subject #2. Note that this step is not part of the regular pipeline. See Table II for the evaluation. Both the femur and tibia show errors <1 mm; however, the error increases when including the posterior part.
The directed mean surface distance of the masked reconstruction is reported in Tables III and IV. Because the bone model is evaluated, the false positive errors of neighboring bones are no longer present in this evaluation. All three algorithms are evaluated on frontal and combined partial scans of the femur and tibia, respectively. In the case of the femur, the GPO method yields the best surface errors of 0.62-1.51 mm. The mean surface error of all ten subjects is 0.96 mm. Both the LSO and the ASM methods fail to reconstruct subject #1. Their surface reconstruction errors are an average of 1.57 and 1.72 mm, respectively. A similar result is found for the tibia. Again, the GPO method yields the lowest errors, ranging from 0.87 to 1.7 mm, with an average error of 1.24 mm. The LSO and ASM methods achieve an average of 1.84 and 2.21 mm, failing to reconstruct two and three subjects, respectively. Surface distance heatmaps for subjects #2 and #9 are given in Fig. 8, for which the best and worst reconstructions were found, respectively. A closer investigation of all ten subjects reveals that errors are mainly located in the lateral posterior condyle, as can be seen from the example. Similarly, in the case of the tibia, the largest deviations can be found on the posterior part; furthermore, the tibial plateau shows a noticeable variation. See Fig. 9 for a heatmap visualization.
TABLE III. SDE ERROR OF THE MASKED RECONSTRUCTION COMPARED TO THE GROUND TRUTH FOR EACH INDIVIDUAL SUBJECT USING GPO. NOTE THAT SUBJECT #4 WAS EVALUATED USING A CT GROUND TRUTH.
When the records of the posterior parts are included, mixed results are found. All algorithms perform worse for the femur. Their rank order, however, remains the same, with the GPO being the best-performing algorithm, achieving a mean error of 1.29 mm. By contrast, performance gains can be observed for the tibia. While the ASM algorithm still fails to reconstruct several subjects, and even the error of the GPO method increases to 1.39 mm, the LSO method slightly outperforms the GPO method applied to only the frontal scan. It achieves the lowest errors of 0.71-1.66 mm, with an average of 1.18 mm.
Finally, we evaluated the overall AP and ML size as defined by [76]. The results are presented in Table V. Regarding the computation time, all methods meet the required criteria. The CNN processes an image in 27 ms, achieving a frame rate of 37 Hz. No drops in the frame rate were observed, even with the simultaneous graphical load of the 3-D view.
The total time spent per subject lies in the range of 14 to 33 min, with 22 min spent on average. This time includes the subject getting seated, fixation of the leg, recording of two scans, repositioning and refixation of the subject in a prone position, and recording of yet another two scans. Investigating only the time spent on recording the frontal femur, which is the largest of all four records, the time spent ranges from 62 to 142 s, with 96.4 s on average.
Exemplary results of subjects #2, #6, and #9 will be provided on request, including the ground truth model, the partial surface scan, and the full bone reconstruction.
[Figure caption fragment: Similarly, the intercondylar eminence shows a rather high mismatch. Scale according to Fig. 8.]
Neighboring bones were, however, also recorded, leading to false positive segmentation and high quantitative errors, especially in the case of the tibia, where the fibula is visible in every record. The actual surface match, investigated separately for one subject, shows an accurate surface reconstruction with errors well below 1 mm. This does not hold true when the frontal and posterior scans are combined, indicating an inadequate registration.
Regarding the accuracy after reconstruction of the full bone, an error below the threshold of 1 mm was achieved in the case of the femur but not of the tibia. This may be due to a large part of the tibia evaluation mask, the tibial plateau, not being imageable using US. The high errors seen in the evaluation of the original US scan could be reduced greatly, as the model is able to remove false positive segmentation of neighboring bones. In fact, even when false positive segmentation is removed manually, the reconstruction step reduces the overall error, as can be seen from the example of subject #2: the average SDE on the frontal femur scan decreases from 0.83 mm in the manually cleaned partial scan to 0.62 mm after model fitting. We hypothesize that this is due to the averaging of noise caused by segmentation errors and movement artifacts. This finding underlines that our CNN-based segmentation alone does not suffice for high-quality reconstruction and that a second model-fitting step is necessary.
Regarding the model-fitting method, the ASM and LSO approaches perform worse than the GPO approach on average and completely fail in some cases. The inclusion of the posterior surface scan showed only a small benefit in one evaluation and had a noticeable negative impact on all the others. As mentioned beforehand, this indicates inadequate registration. Landmark-based registration was insufficient and ruled out early. Registering the scans on the basis of overlapping surfaces was not robust either. We, therefore, registered both scans onto the mean shape, which, of course, differs in geometry; furthermore, the information on the AP size cannot be retained by this approach. We identify this as the biggest weakness of the proposed method. Although the posterior part may be extrapolated from only the frontal scan with surprising accuracy, it remains the area of the largest surface mismatch. Apart from the difficulties when registering the posterior part, scanning the subjects in a prone position also turned out to be difficult. In extension, the soft tissue covering the knee is under tension, making it hard to orient the probe orthogonally to the bone surface; furthermore, the maximum imaging depth of 7 cm was insufficient for some subjects.
Investigating the AP and ML size, a geometrical mismatch may occur on both sides (AP and ML). Accordingly, the error is expected to exceed the average SDE. The results confirm this hypothesis, especially for the tibia. Surprisingly, although no information about the posterior bone is provided to the algorithm, the AP error does not exceed the ML error. Given typical step sizes of at least 2 mm for off-the-shelf implant components, the method only based on frontal scans is sufficiently precise for implant templating in the case of the femur but not of the tibia.
Apart from the femoral ML size, the reconstructions tend to be larger than the MRI reference. One reason could be an underestimation of the average speed of sound. Salehi et al. [77] found an up to 4% mismatch in an ex vivo study, which could cause a shift up to 2.8 mm.
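The magnitude of the quoted shift can be checked with a short calculation. The sketch assumes that the 4% speed-of-sound mismatch scales the measured depth linearly and that it is evaluated at the maximum imaging depth of 7 cm mentioned above; the original text does not state which depth underlies the 2.8 mm figure.

```latex
\Delta d \approx 0.04 \times d_{\max} = 0.04 \times 70\,\mathrm{mm} = 2.8\,\mathrm{mm}
```

For bone surfaces at shallower depths, the same relative mismatch would produce a proportionally smaller shift.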
Comparing the errors computed for subject #4 to the average performance, we see a persistent and noticeable gap of about 0.24-0.6 mm lower errors for the CT-based evaluation. This may indicate accuracy issues in the MRI-based ground truth and an even better actual performance of our method.
In comparison to closely related previous work by Mahfouz et al. [70] and our group [54], this is the first report achieving a reconstruction of submillimetric accuracy on average; furthermore, to the best of our knowledge, it is the first small-scale in vivo evaluation. This is highly relevant, as motion artifacts may compromise accuracy, especially for larger scans. Regarding our setup, we were not able to benefit from the posterior records as their relative position to the frontal scan could not be recovered with sufficient accuracy. The main reasons for this are the low repeatability of landmark-based registration and insufficient overlap for surface-based registration. Although the setup by Mahfouz et al. [70] using electromagnetic tracking and scanning the knee in flexion could remove the need to reposition the patient, and thus, the need for registration, errors reported for electromagnetic tracking devices also have to be taken into account. Kral et al. [78] report a 0.25 mm increase in the target registration error when using an electromagnetic as opposed to an optical tracking system in a laboratory environment mimicking the operation room. Elfring [79], however, reports up to 2.69 mm in a similar setting. We hypothesize that the lower reconstruction errors observed in our investigation compared to the work of Mahfouz et al. [70] were achieved by superior segmentation and model fitting methods; however, since Mahfouz et al. [70] have not disclosed their algorithms, this must remain an assumption.
In conclusion, we proposed the first fully automatic US-based workflow that is able to achieve submillimetric in vivo reconstruction, averaged over all subjects. In future work, we will investigate the EM-based scanning procedure, as well as an automated robotic scanning approach, and quantify the individual contributing error sources to identify the potential for improvement.