Moving Beyond Simulation: Data-Driven Quantitative Photoacoustic Imaging Using Tissue-Mimicking Phantoms

Accurate measurement of optical absorption coefficients from photoacoustic imaging (PAI) data would enable direct mapping of molecular concentrations, providing vital clinical insight. The ill-posed nature of absorption coefficient recovery, compounded by the domain gap between simulation and experiment, has so far prevented PAI from achieving this goal in living systems. To bridge this gap, we introduce a collection of experimentally well-characterised imaging phantoms and their digital twins. This first-of-its-kind phantom data set enables supervised training of a U-Net on experimental data for pixel-wise estimation of absorption coefficients. We show that training on simulated data results in artefacts and biases in the estimates, reinforcing the existence of a domain gap between simulation and experiment. Training on experimentally acquired data, however, yielded more accurate and robust estimates of optical absorption coefficients. We compare the results to fluence correction with a Monte Carlo model based on reference optical properties of the materials, which yields a quantification error of approximately 20%. Application of the trained U-Nets to a blood flow phantom revealed spectral biases when training on simulated data, while application to a mouse model highlighted the ability of both learning-based approaches to recover the depth-dependent loss of signal intensity. We demonstrate that training on experimental phantoms can restore the correlation of signal amplitudes measured at depth. While the absolute quantification error remains high and further improvements are needed, our results highlight the promise of deep learning to advance quantitative PAI.


Introduction
Photoacoustic (PA) imaging (PAI) has the unique ability to provide high-resolution optical-contrast molecular imaging up to several centimetres deep in tissue, with potential clinical application areas ranging from diagnosis of breast cancer (1) or cardiovascular disease (2) to grading of skin (3,4) or inflammatory diseases (5). PAI transforms absorbed laser energy into acoustic waves (6,7) that can be measured as time-series pressure data p(t) using acoustic detectors (8,9).
A key aim of PAI is the estimation of absolute molecular concentrations from images acquired at multiple wavelengths (10,11). To achieve this goal, two inverse problems have to be solved: the acoustic inverse problem, which refers to the reconstruction of the initial pressure distribution p0 from p(t), and the optical inverse problem, which refers to the estimation of the optical absorption coefficient µa from p0. Reconstructing p0 from p(t) requires an inverse acoustic operator, denoted A⁻¹ (an image reconstruction algorithm). To obtain accurate reconstructions, the acoustic properties of the tissue, θ, and the characteristics of the acoustic detector, ϵ, must be taken into consideration: p0 ≈ A⁻¹(p(t, ϵ), θ, ϵ). The optical inverse problem, in turn, is governed by p0 = Γ · µa · ϕ, where ϕ denotes the light fluence and Γ the Grüneisen parameter.
The spatial distributions of the light fluence ϕ and the Grüneisen parameter Γ in living subjects are typically unknown but must be accounted for to accurately determine µa. Γ is often assumed constant (11). ϕ can be accounted for by iterative methods, using fixed-point iteration schemes with a priori known µs (12), or by variational approaches, which allow a range of constraints and priors (13). To date, these methods have been restricted to simulated data, except when a calibrated imaging target is included in the medium (14). Deep learning approaches have been proposed to solve both the acoustic (15,16) and optical inverse problems (17–20) in PAI. Iterative and variational methods typically work well on simulated data, but do not achieve convincing results on experimental data (21). A key limitation for application in living subjects is that the optical and acoustic properties are unknown, making validation of quantitative PAI methods extremely difficult. One indirect possibility to validate these methods is to analyse the performance of downstream methods that use the estimated µa as an input, for example, semantic image segmentation (22,23) or oximetry (24).
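The fixed-point idea behind such iterative schemes can be illustrated in one dimension: starting from an initial guess for µa, the fluence is recomputed and the absorption updated as µa ← p0/(Γ ϕ(µa)) until convergence. The sketch below is a deliberately simplified toy (Γ = 1, scattering ignored, Beer-Lambert fluence on a 1D grid), not the full schemes of refs. (12, 13):

```python
import math

def fluence(mu_a, dx):
    """Beer-Lambert fluence entering each voxel (scattering ignored)."""
    phi, attenuation = [], 0.0
    for mu in mu_a:
        phi.append(math.exp(-attenuation))
        attenuation += mu * dx
    return phi

def fixed_point_mu_a(p0, dx, n_iter=50):
    """Fixed-point iteration mu_a <- p0 / phi(mu_a), assuming Gamma = 1."""
    mu = [1.0] * len(p0)  # arbitrary initial guess
    for _ in range(n_iter):
        phi = fluence(mu, dx)
        mu = [p / max(f, 1e-12) for p, f in zip(p0, phi)]
    return mu

# Toy example: forward-simulate p0 from a known mu_a, then recover it.
mu_true = [0.5, 2.0, 1.0, 3.5]
dx = 0.1
p0 = [m * f for m, f in zip(mu_true, fluence(mu_true, dx))]
mu_est = fixed_point_mu_a(p0, dx)
```

Because the fluence at each voxel depends only on the absorption of shallower voxels, the iteration converges voxel by voxel from the illuminated surface inwards.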
Here, we bridge the gap from simulated to experimental training data by fabricating a collection of well-characterised co-polymer-in-oil tissue-mimicking phantoms (25) covering a biologically relevant range of µa and µs, to enable the use of supervised deep learning for the optical inverse problem of PAI (Fig. 1). We create digital phantom twins by characterising optical properties with a double integrating sphere (DIS) system and train U-Nets (26) to estimate µa on matching simulated and experimental PA images. We apply the trained U-Nets to a held-out set of test phantoms and compare the results to fluence correction with a Monte Carlo model based on the DIS reference measurements. Importantly, we then explore the potential of the methods on pre-clinical data from living subjects.

Methods
A. Tissue-mimicking phantoms. A total of 137 cylindrical phantoms with a diameter of 27.5 mm, a height of 65–80 mm, and an approximate volume of 40–50 mL were fabricated based on a previously published protocol (25). Briefly, 83.80 g mineral oil (Sigma Aldrich 330779-1L) was used as a base. 25.14 g polystyrene-block-poly(ethylene-ran-butylene)-block-polystyrene (SEBS, Sigma Aldrich 200557-250G) as the polymer and 2.09 g butylated hydroxytoluene (BHT, Sigma Aldrich W218405-1KG-K) as an antioxidant were added to the mineral oil base, which was heated to 150 °C and poured into a mould (for details see (25), (27), and Supplementary Figure 1). Different concentrations of titanium dioxide (TiO2, Sigma Aldrich 232033-100G) and alcohol-soluble nigrosin (Sigma Aldrich 211680-100G) were used to tune µ′s and µa, respectively. Each phantom consisted of a background cylinder into which imaging targets (referred to as inclusions) were added. Inclusions were added either by careful pouring (diameter > 6 mm) or via a vacuum pump. Optical properties were sampled from the ranges µa = 0.05–4.0 cm⁻¹ and µ′s = 5–15 cm⁻¹, defined at a reference wavelength of 800 nm. The phantoms featured various inclusion patterns with distinct µa and µ′s combinations in both the inclusions and the backgrounds (detailed in Supplementary Figure 2 and Supplementary Table 1).

B. Optical property characterisation.
For each phantom material, rectangular samples with a length of 5.9 cm, a width of 1.8 cm, and a thickness ranging between 2 mm and 4 mm were prepared for optical characterisation. Sample thicknesses were measured at five distinct locations using digital vernier calipers. We used an in-house double integrating sphere (DIS) system (25), based on the system developed by Pickering et al. (28), to determine µa and µ′s of the phantom material in the wavelength range of 600 nm to 950 nm. The recorded transmittance and reflectance values were normalised and analysed with the inverse adding-doubling (IAD) algorithm (29) to estimate the optical properties of the material. The anisotropy (g) and refractive index (n) were assumed to be g = 0.7 and n = 1.4 (30).
C. Photoacoustic imaging and data processing. The phantoms were imaged using a pre-clinical PAI system (MSOT inVision-256TF, iThera Medical GmbH, Munich, Germany) described in detail elsewhere (31). We imaged in the wavelength range from 700 nm to 900 nm in 10 nm steps using 10-frame averaging. The phantoms were mounted in the imaging system using a custom 3D-printed polylactic acid phantom holder (printed with a Prusa i3 MK3S+, Prusa Research, Czechia) and immersed in deionised water for acoustic coupling. The phantoms were imaged at different cross-sections in 2 mm steps along a 44 mm-long section. The temperature of the water bath was set to 25 °C.
Images were reconstructed using the delay-and-sum algorithm in the PATATO toolbox (32) with a bandpass filter between 5 kHz and 7 MHz, application of a Hilbert transform, time interpolation by a factor of three, and detection-element interpolation by a factor of two. The images were reconstructed onto a 300×300 pixel grid with a field of view of 32×32 mm. To enable compatibility with the U-Net, images were cropped to 288×288 pixels without changing the resolution.
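At its core, delay-and-sum reconstruction sums, for each pixel, the time samples corresponding to the acoustic time of flight from that pixel to each detector element. The following minimal 2D sketch (uniform sound speed, no apodisation, bandpass filtering and Hilbert transform omitted) illustrates the principle only; it is not the PATATO implementation:

```python
import math

def delay_and_sum(signals, elem_pos, pixels, c, fs):
    """For each pixel, sum each element's sample at the pixel-to-element time of flight."""
    image = []
    for px, py in pixels:
        value = 0.0
        for (ex, ey), s in zip(elem_pos, signals):
            delay = math.hypot(px - ex, py - ey) / c  # time of flight [s]
            idx = int(round(delay * fs))              # nearest time sample
            if 0 <= idx < len(s):
                value += s[idx]
        image.append(value)
    return image

# Toy example: a point source at (3, 4) seen by three idealised elements.
elem_pos = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
c, fs, n_t = 1.0, 10.0, 128
signals = []
for ex, ey in elem_pos:
    s = [0.0] * n_t
    s[int(round(math.hypot(3 - ex, 4 - ey) / c * fs))] = 1.0  # delta pulse
    signals.append(s)
image = delay_and_sum(signals, elem_pos, [(3.0, 4.0), (4.0, 3.0), (5.0, 5.0)], c, fs)
```

The candidate pixel at the true source location receives coherent contributions from all elements, while off-source pixels do not.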

D. Data quality assurance.
The phantom preparation and imaging process can introduce a wide range of artefacts, especially when creating diverse inclusion geometries. Sources of artefact include: undissolved polymer; material inhomogeneities; air bubbles; yellowing from oxidation; and acoustic reflections from the 3D-printed phantom holder. After image reconstruction, manual quality control was performed on each tomographic cross-section to check for the presence of artefacts that impacted data quality. Phantoms were only included in subsequent analysis if at least one cross-sectional slice contained no artefacts. Using this process, 35 of the 137 phantoms were excluded.
E. Synthetic imaging with digital twins. We manually segmented the cross-sectional images into background and target inclusion classes using the Medical Imaging Interaction Toolkit (MITK) (33); these segmentations were used to create the digital phantom twins.
We implemented a digital twin of the PAI device including both the illumination and detection geometry. The light source consists of five fibre bundle pairs radially distributed around the object and was implemented as a custom illumination in Monte Carlo eXtreme (MCX) (34). The implementation of this system can be found at https://github.com/jgroehl/mcx (last accessed 04/04/2023). The detection geometry consists of 256 toroidally focussed transducer elements covering a 270° field of view around the imaging target. The arrangement of the transducers was implemented in the k-Wave toolbox (35).
Each segmentation class was assigned the optical properties determined by the DIS measurements for the sample material at 21 wavelengths from 700 nm to 900 nm in 10 nm steps. The background medium was assumed to be pure water (36), as this is the acoustic couplant used in the imaging system. We simulated the forward models using the SIMPA toolkit (37). The light distribution was simulated in 3D using the MCX Monte Carlo model and the propagation of sound waves was simulated in 2D using a custom k-Wave script. For computational efficiency, we limited the Monte Carlo model to 10⁷ photons per simulation run and only simulated the acoustic model in 2D, assuming an infinitely symmetric initial pressure outside of the imaging plane. Despite these assumptions, the simulations correlate well with the experiment (R=0.94, Supplementary Figure 3).

F. Training and held-out data sets. The quality-assured phantoms were split into a training set and a held-out test set; training was performed in five folds (cf. Section H).

G. Absorption estimation methods. We applied four methods for calibration of the optical absorption coefficient µa against the PAI data (Fig. 2).
Linear calibration was chosen as a baseline model. To account for depth-dependent attenuation, we selected the brightest 2% of pixels in the background material of the training phantoms (i.e. the rim) and applied a linear regression to relate the PA signals (in arbitrary units) in these pixels to the reference optical absorption (Cal., Fig. 2A), giving µa = (p0 − 313)/1485, R=0.94.
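In code, this baseline amounts to fitting one straight line between reference µa values and the mean of the brightest 2% of rim pixels, then inverting the line at inference time. A minimal sketch with synthetic numbers (the pixel values below are generated from the fitted line quoted above purely for illustration; they are not the study's data):

```python
def brightest_mean(pixels, top_frac=0.02):
    """Mean of the brightest fraction of pixel values."""
    k = max(1, int(len(pixels) * top_frac))
    return sum(sorted(pixels, reverse=True)[:k]) / k

def fit_line(x, y):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

# Synthetic rim signals generated exactly on the line p0 = 1485*mu_a + 313.
mu_a_ref = [0.1, 0.5, 1.0, 2.0, 4.0]
p0_rim = [1485.0 * m + 313.0 for m in mu_a_ref]
a, b = fit_line(mu_a_ref, p0_rim)

def calibrated_mu_a(p0):
    """Invert the calibration line to map a PA signal to mu_a."""
    return (p0 - b) / a
```

For real data, `brightest_mean` would be applied per phantom to the background segmentation before fitting.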
Fluence correction with reference Monte Carlo simulations was performed as a gold-standard method. Monte Carlo simulations of the light fluence were based on the reference absorption and scattering values obtained from the DIS measurements (GT-ϕ, Fig. 2B). The resulting linear fit on the fluence-corrected data gave µa = (p0/ϕMC − 832)/8801, R=0.93.

Absorption estimation with deep learning was achieved by training a modified U-Net (26) in a supervised manner on the training data. The U-Net was used without a final activation layer, to make it suitable for regression tasks (18). To make use of the digital twin simulations during training, we defined the U-Net to have a two-channel output: the first channel to estimate µa and the second channel to estimate ϕ. Our U-Net implementation used strided convolutions with a 3×3 kernel size for downsampling and strided transpose convolutions with a 2×2 kernel size for upsampling. It used leaky rectified linear units as the non-linear activation layers, had four sampling steps with a total of 1024 channels in the bottleneck, and used the mean squared error as the loss function for both output channels.
We trained two U-Nets: one on the simulated dataset (DL-Sim, Fig. 2C) and one on the experimentally acquired training data set (DL-Exp, Fig. 2D). The training data were normalised by the mean and standard deviation of the respective training set.

H. Performance evaluation. For the final performance evaluation, the estimates of each of the five training folds were averaged. In addition to performance evaluation on the held-out test phantoms, we also applied the networks to completely different experimental data without additional training. To this end, we used a blood flow phantom experiment and in vivo imaging of healthy mice. While these experiments allow us to investigate the generalisability and limitations of the models, inaccuracies are expected due to variability in the Grüneisen parameter distribution compared to the phantoms.

Flow phantom data.
A variable-oxygenation blood flow phantom was imaged as described previously (38). We fabricated agar-based cylindrical phantoms with a radius of 10 mm and embedded a polyvinyl chloride tube (i.d. 0.8 mm, o.d. 0.9 mm) in the centre of each phantom during the agar setting process at room temperature. For the base mixture, 1.5% (w/v) agar (Sigma Aldrich 9012-36-6) was added to Milli-Q water and heated until dissolved. Once the mixture had cooled to approximately 40 °C, 2.08% (v/v) pre-warmed intralipid (20% emulsion, Merck, 68890-65-3) and 0.74% (v/v) nigrosin solution (0.5 mg/mL in Milli-Q water) were added, mixed, and poured into the mould. PA images of the phantom were taken between 700 nm and 900 nm in 20 nm steps.
In vivo mouse data. All animal procedures were conducted in accordance with project and personal licences (PPL number: PE12C2B96), issued under the United Kingdom Animals (Scientific Procedures) Act, 1986, and approved locally under compliance form CFSB2317. Nine healthy 9-week-old female C57BL/6 albino mice were imaged according to a previously established standard operating procedure (31). Imaging was performed at 10 wavelengths between 700 nm and 900 nm, averaging over 10 scans each. The kidneys, spleen, spine, and aorta were segmented in the PA images using MITK.
Performance measures. In this work, we report the relative error, defined as Rel.Error(x̂, x) = |x̂ − x|/x × 100%, and the absolute error, defined as Abs.Error(x̂, x) = |x̂ − x|, where x̂ is the estimate of a quantity of interest x. We compute the mean of the pixel-wise measure for each segmentation region. When reporting aggregate results, we then compute the median and interquartile range of the resulting distribution of mean errors per segmentation region. We also report the generalised contrast-to-noise ratio, defined as gCNR = 1 − Σk min(hi(k), hb(k)), where hi and hb are the normalised histograms of the signals in the inclusion and background, respectively, and k is the index of the histogram bin (39). The average in a background segmentation class was taken as the mean, while the average in an inclusion class was the median over the first 1.28 mm of depth inside the inclusion. The depth threshold was determined manually by sweeping from 0 mm to 5 mm to yield the best µa quantification results on the training data when using the GT-ϕ model, as GT-ϕ is the model expected to be most affected by the depth-dependent signal decrease.
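As a concrete reference for the contrast measure, the gCNR can be computed from shared-bin, normalised histograms of the two regions: a value of 1 indicates fully separable signal distributions and 0 indicates complete overlap. A minimal sketch (the bin count here is an arbitrary choice, not the study's setting):

```python
def gcnr(inclusion, background, n_bins=64):
    """Generalised contrast-to-noise ratio: 1 minus the histogram overlap."""
    lo = min(min(inclusion), min(background))
    hi = max(max(inclusion), max(background))
    width = (hi - lo) / n_bins or 1.0  # guard against identical values

    def hist(values):
        # Normalised histogram over shared bins.
        h = [0.0] * n_bins
        for v in values:
            k = min(int((v - lo) / width), n_bins - 1)
            h[k] += 1.0 / len(values)
        return h

    h_i, h_b = hist(inclusion), hist(background)
    return 1.0 - sum(min(a, b) for a, b in zip(h_i, h_b))

# Fully separated regions -> gCNR of 1; identical regions -> gCNR of 0.
separated = gcnr([10.0] * 50, [0.0] * 50)
same = gcnr([float(v % 7) for v in range(50)], [float(v % 7) for v in range(50)])
```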

I. Statistics.
The degree of linear correlation between variables is analysed using the Pearson correlation coefficient. Statistical significance is tested using a non-parametric unpaired Mann-Whitney U test. Significance is indicated by * (P < 0.05), ** (P < 0.01), *** (P < 0.001), and **** (P < 0.0001). Data are reported as mean ± standard deviation unless otherwise stated.

Results

A. Deep learning models show improved spatial distribution of absorption estimates.
We first inspected the images of estimated µa over all test phantoms. We highlight here a particularly challenging test case with a halo artefact around one inclusion, where DL-Exp achieved good agreement with the reference in the background (DL-Exp: 0.12 ± 0.07 cm⁻¹ vs reference: 0.098 cm⁻¹) and one inclusion (DL-Exp: 1.0 ± 0.12 cm⁻¹ vs reference: 1.04 cm⁻¹) but showed poorer performance in the other inclusion (DL-Exp: 1.97 ± 0.35 cm⁻¹ vs reference: 3.1 cm⁻¹) at 800 nm, illustrated using a line profile through both inclusions (Fig. 3). More examples showing success and failure cases are illustrated in Supplementary Figs 4 and 5 for a range of inclusion types and positions. The naïve linear calibration model systematically underestimated µa in both inclusions and in the background material (Cal., Fig. 3A). Fluence correction with Monte Carlo simulations yielded an overestimation of µa for the larger inclusion and towards the centre of the background material, creating distorted edges of the inclusions (GT-ϕ, Fig. 3B).
The µa estimate from DL-Sim shows a systematic overestimation in the background material at the upper rim of the phantom, presumably caused by incorrect assumptions in the 2D acoustic forward model (DL-Sim, Fig. 3C). DL-Sim also shows artefacts, estimating a high µa outside of the phantom area in the coupling medium. In contrast, DL-Exp predicts spatially smooth estimates with sharp borders around the phantom and inclusions (DL-Exp, Fig. 3D). DL-Exp also shows improved quantitative performance in the inclusions compared to DL-Sim, particularly in the right inclusion (Fig. 3E). The same findings are seen consistently in line profiles across the test phantom set (see Supplementary Figs 4 and 5), with DL-Exp consistently outperforming all methods in spatial accuracy and quantitative estimation, highlighting the substantial added value of training on experimental data.

B. Absorption coefficient estimates at depth can be recovered with deep learning.
We next analysed the quantitative performance of the models in more detail. Cal. performs extremely well in the background material at the surface of the phantoms (R=0.88), providing an accurate estimate of µa independently of µs (Fig. 4A). This is not true for signals arising from deeper within the phantoms. In the inclusions at depth within the phantoms, there was no correlation with µa (R=−0.22) (Fig. 4B), as expected due to depth-dependent optical attenuation and spectral colouring (11).
To understand the limit at which the model breaks down, we analysed the 25 training phantoms without any inclusions, showing that the correlation between PAI signal and µa systematically decreases with depth (Fig. 4C), ranging from R=0.98 at 1 mm depth to R=0.15 at 13 mm depth (Fig. 4D-E). Although we observe the same overall trend between experiment and simulation (Fig. 4C), the experimental results deviate from the monotonic exponential decay seen in the simulation. A number of factors could contribute to this observation: (1) the optical forward model assumes a literature-derived anisotropy and the applicability of the Henyey-Greenstein scattering function (40), which might differ in our phantom materials; (2) we do not model the noise present in the experimental data; and (3) we do not perform 3D acoustic forward modelling, thus not accounting for out-of-plane pressure waves or ultrasound focusing.
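The depth-resolved analysis boils down to computing a Pearson correlation between PA signal and reference µa separately for each depth bin. A self-contained sketch of the coefficient on toy data (the numbers below are illustrative, not the phantom measurements):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy depth bins: the signal tracks mu_a near the surface but is
# attenuation-dominated (nearly flat) at depth.
mu_a = [0.2, 0.8, 1.5, 2.4, 3.1]
signal_shallow = [5.0 * m + 1.0 for m in mu_a]  # linear mapping -> R close to 1
signal_deep = [1.0, 0.9, 1.1, 1.0, 0.95]        # decorrelated -> R close to 0
r_shallow = pearson_r(mu_a, signal_shallow)
r_deep = pearson_r(mu_a, signal_deep)
```

In the study's setting, this computation would be repeated per depth bin to produce curves like those in Fig. 4C.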
Having established the performance limits of the baseline model (Cal.), we then examined the performance of the U-Net estimations (complete data shown in Supplementary Fig. 6). Applying the U-Nets maintained or improved the excellent linear mapping in the background material (Table 2, Background), while recovering a linear relationship between the absorption estimates in the inclusions and the ground truth (Table 2, Inclusions). While both deep learning methods underperform compared to GT-ϕ (vs. DL-Exp p=0.012, vs. DL-Sim p=10⁻¹³), DL-Exp (median error of 29%) significantly (p=10⁻⁷) outperformed DL-Sim (median error of 37%). We also observed a systematic underestimation of high absorption coefficients for both deep learning models. Since this effect is consistent between the simulated and experimental data sets, we assume it is caused by the composition of the dataset, and a larger, more diverse data set would be required to counteract it. When evaluating inclusions with µa ≤ 2.5 cm⁻¹, a likely in vivo scenario, we find that DL-Exp begins to outperform GT-ϕ with regard to the correlation coefficient and gCNR (Table 2, µa ≤ 2.5 cm⁻¹), demonstrating the future potential of the approach in living subjects.
C. Experimentally trained deep learning models are transferable to a blood flow phantom set.
We next applied the Cal., DL-Sim, and DL-Exp µa estimation methods in an independent flow phantom experiment (Fig. 6); the GT-ϕ method could not be applied as the ground truth optical properties were unknown. The flow phantom was distinct from the training and test material, as it was fabricated from a different base material and included a tube carrying blood that was chemically deoxygenated over time. We found that DL-Exp estimated a nearly two-fold and DL-Sim a nearly three-fold higher µa value in the blood tube compared to the baseline method (Fig. 6A). Thus, both methods were able to recover a value closer to the expected reference of about 4.3 cm⁻¹ at 800 nm (41). When computing sO2 from the µa estimates, DL-Sim produced a very poor dynamic range of sO2 during the imaging time, ranging only from 60% to 45%, rendering any biological application of the derived biomarker potentially useless (Fig. 6B). DL-Exp, on the other hand, was able to reproduce sO2 at the start of the experiment and marginally increased the dynamic range towards the end of the experiment compared to the reference Cal. method.
Comparison of the mean spectra at different time points (Fig. 6C,D) with literature-derived reference spectra for the µa of haemoglobin (41), however, revealed that none of the methods were able to compensate for the spectral colouring induced by water absorption in the coupling medium, clearly visible from 850 nm to 900 nm.

D. Oxygenation maps derived from learned µa estimates show more realistic arterial oxygen saturation.
Finally, we tasked the Cal., DL-Sim, and DL-Exp methods with estimating µa in vivo (Fig. 6A,C,E) and subsequently calculated sO2 from the multispectral estimates (Fig. 6B,D,F). Despite their excellent spatial performance in phantoms, the U-Net µa estimates were not able to retain the high spatial frequencies of the in vivo PA image, which also translates to the resulting sO2 estimates (Fig. 6C,E), most likely due to the lack of high-spatial-frequency features in the training set.
We computed the relative µa ratio between the aorta and the spine (Fig. 6G) to understand whether the methods could compensate for depth-dependent decreases in signal intensity. Physiologically, we would expect a higher µa and oxygenation in the aorta compared to the spine, given the higher average blood volume and purely arterial blood in the aorta and the presence of both arteries and veins in the spine. While the Cal. estimate gave approximately the same µa for both regions, the U-Net estimates resulted in a 2-3 fold higher µa in the aorta compared to the spine. Consistent with the artefacts observed in the test phantoms, DL-Sim estimated high µa values at the upper rim of the mouse body and in the water background.
Application of linear unmixing revealed that the general trends of high and low oxygenation in the segmented organs were quite consistent across all methods. A notable exception was the left kidney, where both U-Net estimations showed relatively high oxygenation, whereas the calibrated signal estimated relatively low oxygenation. Computing the arterial sO2 in the aorta reveals an increase of 8 ± 5 percentage points (p=0.002) in blood oxygenation for the DL-Exp µa estimate compared to the calibrated signal (Fig. 6H), though all models still underestimate the expected arterial blood oxygenation of about 94%-98% in an anaesthetised mouse (42), perhaps relating to the generally lower dynamic range seen in the blood phantom.
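The linear unmixing step models the estimated µa spectrum as a linear combination of oxy- and deoxyhaemoglobin extinction spectra and reports sO2 as the oxygenated fraction. A minimal least-squares sketch with made-up extinction values (real analyses use literature spectra, e.g. ref. (41)):

```python
def unmix_so2(mu_a, eps_hbo2, eps_hb):
    """Least-squares unmixing of a mu_a spectrum into HbO2/Hb
    concentrations via the 2x2 normal equations; returns sO2."""
    a11 = sum(e * e for e in eps_hbo2)
    a12 = sum(a * b for a, b in zip(eps_hbo2, eps_hb))
    a22 = sum(e * e for e in eps_hb)
    b1 = sum(e * m for e, m in zip(eps_hbo2, mu_a))
    b2 = sum(e * m for e, m in zip(eps_hb, mu_a))
    det = a11 * a22 - a12 * a12
    c_hbo2 = (a22 * b1 - a12 * b2) / det
    c_hb = (a11 * b2 - a12 * b1) / det
    return c_hbo2 / (c_hbo2 + c_hb)  # sO2 = oxygenated fraction

# Toy spectra (made-up extinction values) and a known 80/20 mixture.
eps_hbo2 = [1.0, 2.0, 3.0, 4.0]
eps_hb = [4.0, 3.0, 2.0, 1.0]
mu_a = [0.8 * o + 0.2 * d for o, d in zip(eps_hbo2, eps_hb)]
so2 = unmix_so2(mu_a, eps_hbo2, eps_hb)
```

Spectral colouring, as discussed above, corrupts the input µa spectrum and therefore biases the unmixed sO2.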

Discussion
We developed a supervised deep learning method to address the challenge of estimating optical absorption coefficients in PAI, enabled by a collection of carefully annotated phantoms. The dataset consists of an unprecedented 102 tissue-mimicking phantoms spanning an important biological range of tissue optical properties. When assuming independence of the images at the 21 captured wavelengths from 700 nm to 900 nm, our dataset features a total of 2142 unique PA images of targets with distinct and known µa and µ′s distributions. Each image was subject to manual segmentation and quality assessment, then assigned reference optical absorption and scattering coefficients as determined by independent DIS measurements. We used digital twins of the phantoms to simulate a synthetic replica data set to analyse the differences between using simulated and experimental data sets for model training.

Discussion of Results.
We show that erroneous model assumptions influence the quantitative accuracy of the deep learning model and that training with experimentally acquired phantoms improves both the accuracy and the generalisability of the approach. Our results indicate that tissue-mimicking phantoms can be used as reliable calibration objects and that experimental training data are beneficial compared to synthetic training data, pointing to the existence of a simulation gap. At the same time, we confirm that linear calibration schemes fail because they cannot account for the exponential decay of fluence with depth. We find that the experimental data set yields improved accuracy across multiple test cases compared to the simulated data set, at least for the simulation pipeline used in this work.
The deep learning models were initially applied to a test phantom set and the results were compared with the naïve calibration model and fluence compensation using Monte Carlo modelling (assuming known optical properties). All methods recovered some degree of correlation between the estimated and reference µa. A striking result of DL-Exp is the overall improvement in the spatial accuracy of the predictions, demonstrated by smooth estimates with sharply defined borders between segmentation classes and quantified by consistently high gCNR values. On the other hand, we found a median relative quantification error of ≈ 29% when training on experimentally acquired data and ≈ 37% when training on simulated data. In comparison, fluence correction with a Monte Carlo model of the ground truth absorption and scattering properties achieved a relative absorption estimation error of ≈ 22%, but was less accurate in terms of spatial estimation at depth and less generalisable to other settings.
When evaluating only imaging targets with µa ≤ 2.5 cm⁻¹, we find that DL-Exp begins to outperform GT-ϕ on several reported measures. By extending the training data set, we hope to achieve significantly better quantification results. Improving the deep learning model to mimic traditional iterative schemes might also be a promising way to further improve quantitative performance. When moving to completely different phantom and experimental data, the U-Nets performed less well in terms of spatial predictions. For example, the µa estimate of blood in the flow phantom was significantly increased, although it was still more than a factor of two too low compared to the expected µa from the literature. We did not include the Grüneisen parameter or 3D acoustic forward modelling in the digital data twins. Hence, the U-Nets are expected to be sensitive to changes in the underlying phantom material, and the simulated images are not as accurate as they could be, which could be major reasons for the remaining discrepancies observed on the flow phantom data set. Similarly, for the in vivo application in a mouse, DL-Exp gave a much more realistic representation of the absorption difference between the aorta and spine and resulted in a substantial elevation of the estimated arterial oxygen saturation. The final results remained an underestimate compared to the expected ≥ 94%, but suggest a greater resilience to depth-dependent attenuation.
The U-Nets produced estimates with sharp edges on the blood-background interface in the flow phantom, likely because of the consistent nature of the spatial frequency content of the images.Contrary to the promising spectral performance, when applied to the mouse data, the U-Nets were not able to recover the high spatial frequency content of the image, suggesting a need to enrich the training data set with inclusions that better reflect the complexity of the in vivo imaging application in order to achieve high spatial accuracy in the general case.

Experimental limitations.
While training on experimental phantom measurements shows promise to advance quantitative µa estimates in PAI, there are also limitations that must be addressed in future work. Considering first the experimental limitations, we only use titanium dioxide to tune µ′s and nigrosin to tune µa in the phantoms. This is important to minimise the influence of the molecule- and concentration-dependent Grüneisen parameter, but might introduce a wavelength-dependent bias to the training data that may challenge our assumption of independence of images across wavelengths. We assumed images from different wavelengths to be independent in order to maximise the number of training images, even though images at neighbouring wavelengths will be highly correlated. While such an assumption could potentially introduce a wavelength-dependent bias, our results show no indication of this, suggesting that the assumption, while not ideal, did not systematically affect the learned algorithms (cf. Supplementary Fig. 7). If the assumption does not hold, it could impact the quantification of multispectrally derived biomarkers, such as blood oxygenation sO2.
A second experimental limitation of the study is the assumption that DIS measurements can be taken as ground truth (43), when in fact they may introduce substantial uncertainty (44), for example from modelling errors due to a priori assumptions for the sphere parameters, sphere corrections, or measurement error in the sample thicknesses. Accurate quantification of optical properties is highly challenging, with variations of up to 30% and 40% for µa and µ′s, respectively, when quantifying the same sample with different instruments (45). We examined the influence of our uncertainties and found a standard error of 2.5% in the thickness measurements, which translates to an approximately 5% error in the optical absorption coefficient and simulated initial pressure distribution (cf. Supplementary Fig. 8). In conjunction with the 10%-20% error we determined for the DIS system based on cross-referencing with time-resolved near-infrared spectroscopy (27), our reference optical material properties are subject to a standard error on the order of 20%. In future work, these uncertainties could be incorporated into model training by using Bayesian deep learning methods.
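The thickness contribution can be checked with a simple Monte Carlo propagation. Under a strongly simplified Beer-Lambert view, µa ≈ −ln(T)/d for a transmittance T through thickness d, so a relative error on d maps roughly one-to-one onto µa; the larger ≈5% figure quoted above additionally includes propagation through the simulated initial pressure. A sketch under these simplifying assumptions (T and d values below are arbitrary, not the study's samples):

```python
import math
import random

def propagate_thickness_error(T, d_nominal, rel_sigma_d, n_samples=20000, seed=0):
    """Monte Carlo estimate of the relative error in mu_a = -ln(T)/d
    induced by a Gaussian relative error on the thickness d."""
    rng = random.Random(seed)
    mu_nominal = -math.log(T) / d_nominal
    samples = []
    for _ in range(n_samples):
        d = d_nominal * (1.0 + rel_sigma_d * rng.gauss(0.0, 1.0))
        samples.append(-math.log(T) / d)
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / n_samples
    return (var ** 0.5) / mu_nominal  # relative standard error of mu_a

# 2.5% thickness uncertainty on a 3 mm (0.3 cm) sample with 50% transmittance.
rel_err = propagate_thickness_error(T=0.5, d_nominal=0.3, rel_sigma_d=0.025)
```

The same sampling approach extends to the sphere parameters and IAD assumptions when their uncertainty distributions are known.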
Computational limitations. We introduced a number of approximations in the Monte Carlo and k-space simulations, such as 2D acoustic modelling, the simplified laser source geometry and beam profile, and assumed homogeneity of the phantom materials (i.e. sound speed and density). We used the delay-and-sum algorithm as the acoustic inverse operator, which is known to introduce artefacts into the reconstructed images for our imaging setup (46). We assume that a combination of these modelling errors, in conjunction with the presence of measurement noise deep inside tissue, is the reason for the relatively high µa quantification error of GT-ϕ. The influence of noise is highlighted when visualising the absorption estimated within larger inclusions with high µa, leading to an increasing overestimation of µa with depth and a wavelength-dependent performance of GT-ϕ (cf. Supplementary Fig. 7).

Future directions.
When only analysing results for absorbers with µa ≤ 2.5 cm−1, our findings suggest that a learning-based approach has distinct advantages that might allow it to eventually outperform fluence correction. Learning-based models might learn to correct for effects such as measurement noise, out-of-plane signal contributions, or signal blurring caused by the point spread function of the measurement system. Unfortunately, our results also show that to achieve this, we require: (1) more and broader training data; (2) highly optimised and accurate optical and 3D acoustic forward models; (3) realistic noise models; and (4) model-based acoustic inversion schemes to obtain more faithful reconstructions of the initial pressure distributions. Finally, to validate the trained methods, diverse in vivo test data sets have to be acquired that are representative of the target applications and annotated with high-quality, high-confidence labels of the underlying physiology. Since ground truth knowledge of the underlying optical properties in vivo is rarely available, the validation of learned algorithms remains a major roadblock on the path towards quantitative photoacoustics.

Conclusion
We introduce a collection of well-characterised phantoms to enable, for the first time, supervised training and evaluation of learned quantitative PAI methods on experimental data. Leveraging the power of deep learning and computational modelling of the forward processes involved in PAI, we demonstrate that our phantoms can be used as reliable calibration objects. Our experiments confirm the limitations of linear calibration schemes in accounting for depth-dependent fluence effects and show that a deep learning method trained on experimentally acquired data sets provides better optical property estimation on previously unseen test phantoms than one trained on simulated data sets. With demonstrations in different test scenarios, including in living subjects, deep learning shows potential for real-world applications. Despite remaining limitations in spectral and spatial accuracy, our work represents an exciting advance towards accurate and reliable quantitative PAI data analysis.

Bibliography

Supplemental Materials: Sample preparation and phantom designs
This section provides additional details on the experimental setup. In particular, we describe the sample preparation and phantom designs used in our study, which were optimised to mimic the optical and acoustic properties of biological tissue.

Supplemental Result Figures
This section contains supporting result figures that provide a visual representation of the experimental data, including images of the tissue-mimicking phantoms and PAI results obtained using our deep learning approaches.

Fig. 1. Supervised deep learning approach. A library of tissue-mimicking phantoms with known optical properties was subjected to: (A) experimental photoacoustic imaging (PAI), followed by delay-and-sum reconstruction, manual image quality assessment, and segmentation; and (B) computation of light fluence and time series pressure using digital twins of the phantom and imaging system. (C) Paired experimental and computational data were used to train a supervised deep learning model to estimate optical absorption coefficients.

Fig. 2. Optical absorption estimation methods. (A) The baseline model (Cal.) is a linear mapping from the arbitrary-units PA image to reference absorption units. (B) Adding complexity, the simulated light fluence is used to correct PA images (GT-Φ) before a linear mapping is applied. (C) For supervised deep learning, a U-Net is trained on simulated input data to estimate the optical absorption from the input PAI data (DL-Sim), and (D) trained on experimentally acquired data (DL-Exp).

Fig. 3. Representative performance example from the held-out test set. (A) Baseline model (Cal., yellow); (B) fluence correction with reference simulations (GT-ϕ, green); (C) U-Net trained on simulated data (DL-Sim, blue); and (D) U-Net trained on experimental data (DL-Exp, red). DL-Exp achieved good agreement with the reference in the background (reference: 0.098 cm−1) and one inclusion (reference: 1.04 cm−1) but showed poorer performance in the other inclusion (reference: 3.1 cm−1 at 800 nm), as shown through line profiles (E) in different colours. The methods are colour coded in their dedicated colours and the ground truth optical absorption is shown as a solid black line (GT µa). The level window in (A-D) is adjusted to clearly visualise the water background, the background material, and the inclusions.

Fig. 4. PA signal calibration against reference optical absorption with a linear mapping function fails at depth. The linear mapping derived from the training set was applied to the test-set signals from the background (A) and inclusions (B). (C) Experimental and simulated PA signals are shown as a function of depth, normalised to their maximum signal intensity. The shaded areas correspond to the standard deviation of the signal intensities. Vertical dashed lines highlight depths of approximately 1 mm and 13 mm, for which (D) and (E) show the scatter plots and linear fits of all data points across all phantoms in the test set found at those respective depths.

Fig. 5. DL-Sim fails to preserve the dynamic range of the oxygenation response. (A) Cross-sectional slice through the phantom with the blood-carrying tube and the spatial absorption profiles along the dashed line. (B) Mean blood oxygen saturation (sO2) derived with linear unmixing from multispectral µa estimates for each of the methods over time, with spectra at specific time points shown in (C) and (D), respectively. A reference for the expected µa spectrum of blood (scaled down by a factor of 3 for visualisation purposes) at these time points is shown as a dashed black line.
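The linear unmixing referenced in the caption can be sketched as an ordinary least-squares fit of the multispectral µa estimates to known haemoglobin absorption spectra, followed by a ratio of the fitted concentrations. The spectra below are illustrative placeholder values, not the study's reference spectra, and the actual pipeline may use a constrained (non-negative) solver:

```python
import numpy as np

def linear_unmix_so2(mu_a, eps_hbo2, eps_hb):
    """Estimate blood oxygen saturation (sO2) by linear spectral unmixing.

    mu_a:     (n_wavelengths,) estimated absorption coefficients
    eps_hbo2: (n_wavelengths,) absorption spectrum of oxyhaemoglobin
    eps_hb:   (n_wavelengths,) absorption spectrum of deoxyhaemoglobin
    """
    # each column of A is one chromophore's spectrum
    A = np.stack([eps_hbo2, eps_hb], axis=1)
    # plain least squares for simplicity; a non-negative solver
    # (e.g. scipy.optimize.nnls) is often preferred in practice
    c, *_ = np.linalg.lstsq(A, mu_a, rcond=None)
    c_hbo2, c_hb = c
    return c_hbo2 / (c_hbo2 + c_hb)
```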

Fig. 6. Absorption estimation methods applied in living subjects. (A) Calibrated PA signal (Cal.) and (B) corresponding oxygenation (sO2). (C) µa estimates of DL-Sim and (D) corresponding sO2. (E) µa estimates of DL-Exp and (F) corresponding sO2. (G) Relative estimated µa ratios between the aorta and spinal region over nine subjects, and (H) distribution of sO2 values in the aorta. The expected sO2 of 96%±2% is shown as a green area.

Fig. 1. Sample preparation and DIS measurements. (A) The phantom preparation recipe. In brief, the following steps are taken: (1) addition of nigrosin and titanium dioxide and sonication for 60 min; (2) addition of polymers; (3) heating to 150 °C under low stirring for 60 min; (4) vacuuming for 3 min and removing surface air bubbles; (5) pouring into the mould and hardening for 60 min at room temperature; and (6) filling of inclusions using a vacuum pump. (B) A double integrating sphere system is used to measure the total reflectance and diffuse transmittance of the sample. The inverse adding-doubling (IAD) algorithm then uses these reflectance and diffuse transmittance measurements to estimate the optical absorption and reduced scattering coefficients of the sample.

Fig. 2. Histogram plots showing the distribution of optical properties in a sample. (A) Distribution of the optical absorption coefficient in the background material; (B) distribution of the optical absorption coefficient in the inclusion material; (C) distribution of the reduced scattering coefficient in the background material; and (D) distribution of the reduced scattering coefficient in the inclusion material. The histograms provide information on the variability and range of the optical properties in the sample, which is important for understanding the optical behaviour of the material and for the development of accurate models and simulations.

Fig. 4. Side-by-side plots of two test-set phantoms and the µa estimation results. (A) Baseline model (Cal., yellow); (B) fluence correction with reference simulations (GT-ϕ, green); (C) U-Net trained on experimental data (DL-Exp, red); and (D) U-Net trained on simulated data (DL-Sim, blue). Line profiles are shown for each method (E) in different colours. The methods are colour coded in their dedicated colours and the ground truth optical absorption is shown as a solid black line (GT µa). The level windows in (A-D) are adjusted to clearly visualise the water background, the background material, and the inclusions.

Fig. 5. Side-by-side plots of two test-set phantoms and the µa estimation results. (A) Baseline model (Cal., yellow); (B) fluence correction with reference simulations (GT-ϕ, green); (C) U-Net trained on experimental data (DL-Exp, red); and (D) U-Net trained on simulated data (DL-Sim, blue). Line profiles are shown for each method (E) in different colours. The methods are colour coded in their dedicated colours and the ground truth optical absorption is shown as a solid black line (GT µa). The level windows in (A-D) are adjusted to clearly visualise the water background, the background material, and the inclusions.

Fig. 6. Absorption coefficient estimates of the different methods. (A, D) Results when correcting with the fluence derived from Monte Carlo simulations (green); (B, E) results when estimating with the U-Net trained on simulations (blue); and (C, F) results with the U-Net trained on experimentally acquired data (red). (A, B, C) show the results considering only the phantom base material and (D, E, F) show the results considering only the inclusions.

Fig. 7. The µa estimation errors as a function of wavelength. (A) shows that no significant bias exists in the background estimates. (B) reveals that GT-ϕ exhibits a monotonically decreasing error with wavelength, caused by the SNR at depth increasing with wavelength. While also exhibiting a slight bias, the wavelength-dependent change in error is much less pronounced for the estimates of the learned methods.

Fig. 8. The standard error propagating through the various stages of the simulation process. It should be noted that the standard errors for the initial pressure and reconstructions are estimates based on low-resolution simulations with the pipeline described in the paper.