A Comparative Study on Depth of Penetration Measurements in Diagnostic Ultrasounds Through the Adaptive SNR Threshold Method

Depth of penetration (DOP) has been investigated in the scientific literature as an informative parameter able to monitor over time both the sensitivity and the general performance of ultrasound (US) diagnostic systems. In common practice, this parameter may suffer from operator-related errors due to its visual assessment. Different image analysis algorithms have been proposed in the literature to address this issue. In this regard, this work evaluates the adaptive SNR threshold method (AdSTM) on six US diagnostic systems equipped with three US probe models, operating at four frequencies. Data were collected from a US phantom with two distinct zones with different attenuation coefficients. The AdSTM results were compared to the outcomes provided by the naked eye method (NEM), which was performed by five non-medical observers. Despite the small population sample of observers, the obtained results were generally consistent across methods, and suggest the implementation of a calibration procedure for AdSTM, and more extensive testing.


I. INTRODUCTION
Quality assessment is an active topic in the scientific community [1], [2], [3], [4], [5]. For medical ultrasound (US), depth of penetration (DOP) is one of the most influential parameters in quality controls (QCs), as it is regarded as a useful tool for obtaining valuable information about the progressive deterioration of US system performance over time [6]. DOP is also closely related to sensitivity [7], which is an important metrological characteristic for determining the quality of a US system. DOP has been defined in numerous studies [8], [9], [10], [11] as the greatest depth at which the US signals arising from scattering in a tissue-mimicking material (TMM) can still be distinguished from electronic noise. Currently, when periodic QCs are performed, DOP is retrieved by eye on the US system display [2], [12], [13] as the depth level beyond which a clear distinction of the displayed US speckle, due to the TMM, is prevented by the presence of background electronic noise [12]. Because of its dependency on the patient, the phantom, and the operator, such a method cannot guarantee the repeatability of DOP measurements, even when US settings and environmental conditions (e.g., lighting) are maintained throughout the measurement procedure [9]. As a result, offline image analysis methods have been proposed in the literature [6], [10], [11], [14] to address the issues related to the intrinsic subjectivity of visual tests. Nevertheless, despite the solutions deployed, the scientific community still lacks a globally shared standard for US equipment quality assessment based on this parameter. To date, the most promising approach for DOP assessment, recommended as a standard reference method in [6], is based on the estimation of the signal-to-noise ratio (SNR) at increasing depths in US images [15], [16], [17]. DOP is defined as the depth at which the estimated SNR value crosses a predetermined threshold.
However, many of the aforementioned SNR-based algorithms rely on ambiguous criteria for threshold determination [15], [18] or suffer from operator-dependent uncertainty [12]. The adaptive SNR threshold method (AdSTM), a novel SNR-based image analysis method for DOP assessment, was recently introduced to address these limitations. It is based on automatic threshold determination, thus improving the method in [14], and, as shown in previous preliminary studies [19], [20], it has the advantage of increasing the reliability and reproducibility of the results. The goal of this study is thus to test the AdSTM across a variety of US diagnostic systems and configurations available on the market, including different probe models, and to analyze the DOP variation in response to different probe operating frequencies at distinct levels of medium attenuation. DOP uncertainty was assessed using a series of Monte Carlo simulations (MCSs). Finally, the obtained results were compared to the scores assigned by independent observers using the naked eye method (NEM).

A. Adaptive SNR Threshold Method
As described in [19] and [21], the AdSTM, an SNR-based image analysis method for DOP assessment (Fig. 1), was developed by processing phantom and in-air clips (the latter obtained by decoupling the probe from the phantom surface). The method, implemented through a custom-written MATLAB function, post-processes video clips taken from phantom and air observations by averaging N consecutive frames to produce two average images ($I_{ph}$ and $I_{air}$). Then, a rectangular region of interest (ROI) of fixed width is automatically drawn on $I_{ph}$ and $I_{air}$ to compute the signal and the noise contributions, respectively [19]. The SNR curve is then estimated as a function of depth $z$, and the threshold $th_{SNR}$ is computed from three quantities: $g$, the smallest gray level difference perceivable by the human eye [19], [21], [22]; $L_s$, the maximum luminance level; and $\alpha$, the maximum sensitivity of the US system. The latter is obtained by applying the following steps [19]:
1) Determination of the sigmoidal function $f(z)$ through non-linear fitting of the SNR curve.
2) Computation of the first-order derivative of $f(z)$ to locate the depth $z_{min}$ (Fig. 2) corresponding to its minimum value $\alpha = f(z_{min})$.
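The second step can be illustrated with a minimal sketch. It assumes that a decreasing four-parameter sigmoid has already been fitted to the SNR curve; the functional form, the parameter names (`a`, `z0`, `k`, `c`), and the numerical values are hypothetical and are not taken from [19]. The sketch only shows how the depth of steepest descent can be located by scanning a central-difference derivative.

```python
import math

def sigmoid(z, a, z0, k, c):
    """Assumed decreasing sigmoid model of the SNR curve (hypothetical form)."""
    return a / (1.0 + math.exp((z - z0) / k)) + c

def steepest_descent_depth(params, z_start, z_stop, n=2001):
    """Scan a depth grid and return (z_min, slope) where the
    central-difference first derivative of the sigmoid is minimal."""
    step = (z_stop - z_start) / (n - 1)
    best_z, best_slope = None, float("inf")
    for i in range(1, n - 1):
        z = z_start + i * step
        slope = (sigmoid(z + step, *params) - sigmoid(z - step, *params)) / (2 * step)
        if slope < best_slope:
            best_z, best_slope = z, slope
    return best_z, best_slope

# Illustrative parameters only: a, z0 [mm], k [mm], c
params = (30.0, 80.0, 10.0, 1.0)
z_min, slope_min = steepest_descent_depth(params, 0.0, 160.0)
```

For this assumed model the derivative is most negative at the inflection point `z0`, where its analytic value is `-a / (4 * k)`, so the numerical scan can be checked against a closed form.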
The DOP is then automatically calculated as the depth value where the threshold $th_{SNR}$ intersects the SNR curve.
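The intersection step can be sketched as follows, assuming the SNR curve is available as sampled values at increasing depths and that linear interpolation between adjacent samples is acceptable. The function name `depth_of_penetration` and the sample values are hypothetical; the original MATLAB implementation may locate the crossing differently.

```python
def depth_of_penetration(depths, snr, threshold):
    """Return the first depth at which the sampled SNR curve falls to the
    threshold, using linear interpolation; None if no crossing is found."""
    for i in range(1, len(depths)):
        above, below = snr[i - 1], snr[i]
        if above >= threshold > below:
            frac = (above - threshold) / (above - below)
            return depths[i - 1] + frac * (depths[i] - depths[i - 1])
    return None

# Illustrative samples only (depth in mm, SNR dimensionless)
depths = [0.0, 20.0, 40.0, 60.0, 80.0]
snr = [10.0, 8.0, 5.0, 2.0, 1.0]
dop = depth_of_penetration(depths, snr, threshold=3.0)
no_crossing = depth_of_penetration(depths, snr, threshold=0.5)
```

In this sketch a `None` result means that the curve never drops to the threshold within the sampled depths, which is loosely analogous to a DOP at or beyond the imaged depth range.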
AdSTM robustness was tested in this study by comparing the obtained values to those determined by NEM, i.e., through visual examination by five separate observers (people with medical expertise were excluded from testing). The observers' judgments were made independently on the average image $I_{ph}$ in a similar fashion to [19], through a further in-house MATLAB function, while preserving the same environmental lighting and setting conditions. A sixfold repetition was chosen to test both the intra- and inter-individual variability of the observers involved in this study.

B. Experimental Setup
A multi-purpose, multi-tissue US phantom (CIRS, Model 040GSE) [23] was used to collect the phantom clips. This reference device comprises two attenuation zones (0.70 and 0.50 dB·cm⁻¹·MHz⁻¹). Data were acquired from six intermediate technology level US diagnostic systems equipped with three US probe models (linear, phased, and convex array) at both attenuation zones. Each probe was placed on the phantom scanning surface where the speckle background is visible, avoiding grayscale targets, anechoic stepped cylinders, and nylon wires embedded in the phantom. To maximize US energy transmission, a coupling gel was used, and the probes were held in place by a holder.
Both the phantom and in-air clips were collected under the same raw scanning settings (see Table I). This configuration ensures that all the US systems operate under the same conditions, therefore allowing a proper comparison of the outcomes. The AdSTM was used to post-process all the acquired video clips. In particular, the average images were obtained by averaging N = 15 consecutive frames (Fig. 3). After that, an automatically driven mask-like crop on $I_{ph}$ and $I_{air}$ excluded the superimposed setting details from the images [19]. Furthermore, based on the considerations discussed in [19], an accurate estimation of DOP can be performed by setting an ROI width of 30 px (3–5 mm for the linear array probe, 7–11 mm for the phased one, and 8–11 mm for the convex one).
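As a small worked example, the millimeter ranges quoted for the 30 px ROI imply a range of pixel spacings for each probe model. The sketch below assumes that each range reflects different pixel spacings across systems and settings; the names `ROI_WIDTH_PX`, `roi_width_mm`, and `pixel_pitch_range` are illustrative only.

```python
ROI_WIDTH_PX = 30

# ROI width ranges in mm quoted in the text for each probe model
roi_width_mm = {"linear": (3.0, 5.0), "phased": (7.0, 11.0), "convex": (8.0, 11.0)}

def pixel_pitch_range(width_mm_range, width_px=ROI_WIDTH_PX):
    """Return the (min, max) pixel spacing in mm/px implied by an ROI
    of width_px pixels spanning the given width range in mm."""
    low_mm, high_mm = width_mm_range
    return low_mm / width_px, high_mm / width_px

pitch = {probe: pixel_pitch_range(rng) for probe, rng in roi_width_mm.items()}
```

Under this assumption the linear array images would have a spacing of roughly 0.10 to 0.17 mm/px, while the phased and convex array images would be coarser, at roughly 0.23 to 0.37 mm/px and 0.27 to 0.37 mm/px, respectively.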

III. MONTE CARLO SIMULATION
MCS was chosen for AdSTM testing, as its use is supported by several studies in the literature [24], [25], [26], [27], [28]. To simulate the uncertainty caused by operator drawing, the ROI width w and shift s (both expressed as a number of pixels) were varied according to a uniform distribution.
A first series of MCSs (7·10³ cycles, to reduce the computational burden) was run with w and s as input distributions (see Table II), obtaining the $th_{SNR}$ distributions as output. It is worth noting that the SNR threshold is a dimensionless parameter whose value depends on the probe used, but it does not seem to be appreciably affected by the distributions in Table II. As a result, different mean and standard deviation (SD) values can be obtained depending on the probe. The outputs of the first series were then used as inputs to a second series of MCSs to estimate the repeatability uncertainty of the DOP (see Tables III and IV).
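The structure of such a simulation can be sketched as below. The sketch assumes discrete uniform sampling of w and s in pixels; the ranges, the seed, and the function `toy_threshold` are placeholders, since the actual limits are those of Table II and the actual output comes from the AdSTM threshold computation.

```python
import random
import statistics

def run_mcs(model, w_range, s_range, cycles=7000, seed=0):
    """Draw ROI width w and shift s from discrete uniform ranges and
    collect the model output; return (mean, SD, all outputs)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(cycles):
        w = rng.randint(*w_range)  # ROI width in pixels
        s = rng.randint(*s_range)  # ROI shift in pixels
        outputs.append(model(w, s))
    return statistics.mean(outputs), statistics.stdev(outputs), outputs

def toy_threshold(w, s):
    """Placeholder standing in for the real threshold computation (hypothetical)."""
    return 2.0 + 0.001 * (w - 30) + 0.0005 * s

# Hypothetical ranges; the actual ones are those listed in Table II
mean_th, sd_th, samples = run_mcs(toy_threshold, w_range=(25, 35), s_range=(-5, 5))
```

The resulting mean and SD summarize the output distribution, which could then parameterize the inputs of a subsequent simulation stage.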

IV. RESULTS
Tables III and IV present the DOP results obtained by the AdSTM and the NEM for the three US probes tested at four operating frequencies, for all six US systems and for both phantom attenuation zones. The AdSTM uncertainties $\delta_{AdSTM}$ were computed by combining the repeatability uncertainties, derived from the 2.5th and 97.5th percentiles of the MCS distributions, with those due to probe position on the phantom scanning surface, whose values were estimated in [29]. The NEM uncertainties $\delta_{NEM}$ were estimated as the square root of the mean squared error of the data obtained through the visual examinations, with a coverage factor of 2. One-way ANOVA (p < 0.3 for the lower attenuation zone and p < 0.1 for the higher attenuation zone) was then performed on the NEM uncertainties to check for intra- and inter-observer variability. In the cases marked with (*) in Tables III and IV, the AdSTM did not yield an SNR threshold because the DOP value is greater than or equal to the field of view (FOV). This occurs because of the US probe sensitivity combined with the limited phantom depth and the low attenuation coefficient. Nevertheless, the AdSTM and the NEM data in such cases are always consistent, indicating that the automatic method provides accurate information in such extreme cases as well.
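The uncertainty budget described above can be sketched under three assumptions that are not stated in the text: the repeatability term is taken as the half-width of the 2.5th to 97.5th percentile interval, the two AdSTM contributions are combined in quadrature, and the NEM error is computed with respect to the mean of the readings. All function names are hypothetical.

```python
import math

def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (pos - lower) * (sorted_values[upper] - sorted_values[lower])

def repeatability_half_width(samples):
    """Half-width of the 2.5th to 97.5th percentile interval (assumed definition)."""
    ordered = sorted(samples)
    return (percentile(ordered, 97.5) - percentile(ordered, 2.5)) / 2.0

def combined_uncertainty(u_repeat, u_position):
    """Root-sum-of-squares combination (assumed combination rule)."""
    return math.sqrt(u_repeat ** 2 + u_position ** 2)

def nem_uncertainty(readings, coverage=2.0):
    """Coverage factor times the root mean squared deviation from the mean."""
    mean = sum(readings) / len(readings)
    mse = sum((r - mean) ** 2 for r in readings) / len(readings)
    return coverage * math.sqrt(mse)

half_width = repeatability_half_width(list(range(101)))
delta_adstm = combined_uncertainty(3.0, 4.0)
delta_nem = nem_uncertainty([50.0, 52.0, 54.0])
```

With these illustrative inputs, the quadrature rule gives 5.0 for contributions of 3.0 and 4.0, which is a convenient check of the combination step.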
The percentage error values with respect to the FOV ($\delta_{AdSTM\%}$ and $\delta_{NEM\%}$ for the AdSTM and the NEM, respectively) were computed from the DOP uncertainties with a procedure similar to [19] and [20]. The mean percentage errors for the 0.70 and 0.50 dB·cm⁻¹·MHz⁻¹ attenuation zones are 2.2% and 2.5%, respectively, for the AdSTM, and 3.9% and 4.2%, respectively, for the NEM. No significant difference was found between the percentage error values retrieved for the two attenuations. These results confirm that the AdSTM shows a lower dispersion than the NEM, corroborating the preliminary results found in [19]. In addition, an analysis of the mean percentage error for each US probe model was performed (see Table V). The AdSTM showed its lowest mean percentage error (1.6%) with phased array probes at 0.70 dB·cm⁻¹·MHz⁻¹, while the NEM displayed its minimum (3.6%) with linear array probes at 0.70 dB·cm⁻¹·MHz⁻¹. The AdSTM showed its highest mean percentage error (3.2%) with linear array probes, while the NEM showed its maximum (4.9%) with phased array probes at 0.50 dB·cm⁻¹·MHz⁻¹. The linear array probe model showed the lowest observers' mean percentage error.

TABLE IV. DOP MEASUREMENTS FOR DIAGNOSTIC SYSTEMS IV-VI: RESULTS COMPARISON FOR THREE US PROBES ACCORDING TO THE OPERATING FREQUENCY

TABLE V. ADSTM AND NEM MEAN PERCENTAGE ERROR FOR EACH US PROBE ACCORDING TO THE ATTENUATION COEFFICIENT
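One plausible reading of a percentage error with respect to the FOV is the DOP uncertainty normalized by the FOV depth. The sketch below assumes exactly that form; the function name `percentage_error` and the numerical values are illustrative and are not taken from [19] or [20].

```python
def percentage_error(delta_mm, fov_mm):
    """Uncertainty expressed as a percentage of the field of view
    (assumed normalization)."""
    if fov_mm <= 0:
        raise ValueError("FOV must be positive")
    return 100.0 * delta_mm / fov_mm

# Illustrative values only: a 3 mm uncertainty on a 120 mm field of view
example = percentage_error(3.0, 120.0)
```

Under this assumption a 3 mm uncertainty on a 120 mm FOV corresponds to 2.5%, the same order of magnitude as the mean values reported for the AdSTM.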

V. DISCUSSION
In order to perform a compatibility analysis [30], given the small observer sample involved in this study, a compatibility criterion was applied (see Table VI). Afterward, the distributions of $\mu_{DOP\%}$ values that are strictly compatible according to [30] were used to determine the maximum acceptable differences between the DOP values, depending on the medium attenuation, $th_{C,0.70}$ and $th_{C,0.50}$, as follows:

$$th_{C,0.70} = \mu_{0.70} + 3S_{0.70}, \qquad th_{C,0.50} = \mu_{0.50} + 3S_{0.50} \tag{7}$$

where $\mu_{0.70}$ and $\mu_{0.50}$ are the mean values of $\mu_{DOP\%}$ at 0.70 and 0.50 dB·cm⁻¹·MHz⁻¹, respectively, while $S_{0.70}$ and $S_{0.50}$ are the corresponding SDs. The values obtained were $th_{C,0.70}$ = 13% and $th_{C,0.50}$ = 11%. Therefore, all the AdSTM and NEM DOP measurements for which $DOP_{\%} \le th_C$ for the corresponding attenuation zone were considered compatible. Consequently, the percentage of DOP results that showed a significant discrepancy between the AdSTM and the NEM is less than 6% for 0.70 dB·cm⁻¹·MHz⁻¹ and less than 3% for 0.50 dB·cm⁻¹·MHz⁻¹. These discrepancies could be due to US images that presented a low dynamic range (darker images) or a higher noise level under the scanning settings used. Hence, it can be concluded that, overall, the two methods applied in this study provide compatible results across probe models, operating frequencies, and attenuation coefficients.
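The threshold rule in (7) and the resulting share of discrepant measurements can be sketched as follows. The sketch assumes that S is the sample standard deviation, which the text does not specify, and uses made-up percentage differences; the function names are hypothetical.

```python
import statistics

def compatibility_threshold(values):
    """Mean plus three standard deviations, as in (7); the sample SD is assumed."""
    return statistics.mean(values) + 3 * statistics.stdev(values)

def fraction_incompatible(differences, threshold):
    """Share of percentage differences exceeding the compatibility threshold."""
    return sum(d > threshold for d in differences) / len(differences)

# Illustrative values only (percentage differences, in %)
strictly_compatible = [1.0, 2.0, 3.0, 4.0, 5.0]
th_c = compatibility_threshold(strictly_compatible)
share = fraction_incompatible([1.0, 8.0, 3.0, 9.0], th_c)
```

Here the threshold is derived only from the strictly compatible subset and is then applied to all measurements, mirroring the two-step logic described in the text.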
Finally, it must be pointed out that the values obtained with the AdSTM were systematically higher than the NEM ones. This could be related to 1) a higher sensitivity of the automatic method as compared to the human eye, 2) the limited observers' population sample, and/or 3) the observers' reduced technical expertise. In any case, this suggests that the AdSTM may need a calibration procedure to correct for the small bias that systematically affects the DOP measurements. Such a procedure could be based on assessing the SNR threshold on the actual US system dynamic range rather than on the full luminance scale.

VI. CONCLUSION
In this study, the AdSTM was tested on six intermediate technology-level US diagnostic systems, each of which was equipped with three US probe models, operating at four different frequencies. The measurements were taken with a multi-purpose, multi-tissue US phantom with two different attenuation zones (0.70 and 0.50 dB·cm⁻¹·MHz⁻¹). A first series of MCSs was performed for the uncertainty assessment of the SNR thresholds. Then, the outcomes of the first simulations were used in a second series of MCSs to estimate the repeatability DOP uncertainty of the AdSTM. DOP measurements were compared with the mean judgment outcomes of five independent observers, through the NEM implemented with an in-house MATLAB function. One-way ANOVA was performed on the NEM uncertainties to check for intra- and inter-observer variability. The obtained results were globally compatible among the methods employed despite the limited observers' population sample. Further investigations could consist of both increasing the observers' population sample size and repeating the test with a different population sample with higher technical expertise. Moreover, it would be interesting to develop a calibration procedure for the AdSTM to minimize the bias occurring systematically in DOP measurements.