Photoacoustic Quantification of Tissue Oxygenation Using Conditional Invertible Neural Networks

Intelligent systems in interventional healthcare depend on the reliable perception of the environment. In this context, photoacoustic tomography (PAT) has emerged as a non-invasive, functional imaging modality with great clinical potential. Current research focuses on converting the high-dimensional, not human-interpretable spectral data into the underlying functional information, specifically the blood oxygenation. One of the largely unexplored issues stalling clinical advances is the fact that the quantification problem is ambiguous, i.e. that radically different tissue parameter configurations could lead to almost identical photoacoustic spectra. In the present work, we tackle this problem with conditional Invertible Neural Networks (cINNs). Going beyond traditional point estimates, our network is used to compute an approximation of the conditional posterior density of tissue parameters given the photoacoustic spectrum. An automatic mode detection algorithm then extracts the plausible solutions from the sample-based posterior. According to a comprehensive validation study based on both synthetic and real images, our approach is well-suited for exploring ambiguity in quantitative PAT.


I. INTRODUCTION
With artificial intelligence, and machine learning in particular, increasingly set to transform modern interventional healthcare, the development of new modes of functional imaging is becoming imperative. While conventional imaging techniques based on RGB (red, green, and blue) perception mimic the mode of operation of the human eye and only allow for the morphological analysis of the tissue surface, spectral imaging techniques use multiple bands across the electromagnetic spectrum. In the context of interventional imaging, these methods can be applied to visualize both structural and functional surface and subsurface information otherwise invisible to the naked eye. In general, spectral imaging techniques exploit the fact that different tissue components have unique optical properties for each wavelength. Knowledge of these optical properties along with multispectral measurements can potentially provide information on the molecular composition of the tissue underneath the visible surface, specifically on the concentrations of deoxygenated and oxygenated blood, water, lipids, melanin, and bilirubin. Clinically relevant functional information, such as oxygen saturation (sO 2 ), can be derived from this so-called spectral unmixing [1].
A variety of medical imaging techniques exhibit spectral sensitivity, each offering specific advantages. Photoacoustic tomography (PAT) addresses the issue of limited depth range by measuring optical properties via acoustic signals ('light in - sound out' approach) [2]. In this method, the tissue is illuminated using light pulses, leading to the absorption of photons and subsequent heating. The resulting thermoelastic expansion generates pressure waves, which can then be detected by broadband ultrasonic transducers. In this manner, tomographic images of optical properties can be produced at a high level of resolution and with a range of up to several centimeters. A shared detector array in PAT devices allows for hybrid imaging as it enables both recovery of functional tissue properties from tissue optics and imaging of structural information with conventional ultrasound. Clinical applications of PAT extend from cancer research [3] to cardiovascular [4] and inflammatory [5] diseases, and beyond [6].

Fig. 1. Uncertainty-aware approach to photoacoustic quantification of blood oxygen saturation (sO 2 ) using conditional Invertible Neural Networks (cINNs). Standard neural networks (red) that output point estimates are incapable of representing the ambiguity of the problem. As they are forced to output a single solution, they will typically output the mean of multiple plausible solutions in cases of ambiguity. In contrast, the proposed cINN (blue) makes the ambiguity fully transparent. In the provided example, the ambiguity can be removed by optimizing the acquisition pose.
For many of these applications, it is imperative to accurately quantify the molecular composition from reconstructed photoacoustic images. However, this quantification problem faces inherent ambiguities that notably hinder clinical translation. Theoretically, different configurations of tissue parameters, yielding significantly varied sO 2 values, could result in nearly identical spectra due to the ill-posedness of the underlying optical and acoustic inverse problem [7]. The ambiguity can be mainly attributed to the so-called spectral coloring phenomenon [8].
Classic approaches such as Linear Unmixing (LU) assume that the signal S of the reconstructed photoacoustic image is proportional to the optical absorption µ a [9]: S(λ) ∝ p 0 (λ) ∝ µ a (λ), with λ being the wavelength and p 0 the initial pressure distribution. In truth, this is not the case, as S is also proportional to the light fluence φ and the Grüneisen parameter Γ: S(λ) ∝ p 0 (λ) = µ a (λ) · φ(λ) · Γ. φ depends on the optical parameters of the whole tissue, i.e., it also depends on µ a [9]. This interdependency between fluence and absorption makes the quantification an ill-posed problem, potentially resulting in ambiguous solutions [7], [9]. Notably, the well-posedness of the reconstruction problem depends not only on the tissue but also on the device characteristics, such as illumination geometry, bandwidth of the detector, noise characteristics, and imperfect detectors, as well as on the acquisition protocol (see Fig. 1) [7], [10].
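The effect of spectral coloring can be illustrated with a minimal numerical sketch. The two absorption spectra and the fluence model below are illustrative toy values (not calibrated hemoglobin data), chosen only to show why inverting S(λ) ∝ µ a (λ) fails when the fluence is wavelength-dependent:

```python
import numpy as np

# Illustrative (not physiologically calibrated) absorption spectra for two
# chromophores over four wavelengths.
mu_a_hb   = np.array([1.2, 0.9, 0.8, 1.1])   # "deoxygenated" chromophore
mu_a_hbo2 = np.array([0.6, 0.7, 0.9, 1.3])   # "oxygenated" chromophore

def signal(so2, fluence):
    """p0(lambda) = mu_a(lambda) * phi(lambda) (Grueneisen term dropped)."""
    mu_a = so2 * mu_a_hbo2 + (1 - so2) * mu_a_hb
    return mu_a * fluence

flat_fluence    = np.ones(4)              # implicit assumption of linear unmixing
colored_fluence = np.exp(-0.5 * mu_a_hb)  # toy wavelength-dependent attenuation

# With a flat fluence, the normalized signal equals the normalized absorption...
s = signal(0.8, flat_fluence)
mu = 0.8 * mu_a_hbo2 + 0.2 * mu_a_hb
assert np.allclose(s / s.sum(), mu / mu.sum())

# ...but with spectral coloring, the normalized signal shape no longer matches
# the absorption shape, so inverting S ~ mu_a yields a biased sO2 estimate.
s_col = signal(0.8, colored_fluence)
shape_error = np.abs(s_col / s_col.sum() - mu / mu.sum()).max()
print(shape_error)
```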
Machine learning-based approaches have become increasingly popular for recovering functional information from PAT data. One reason is that neural networks are capable of learning any arbitrarily complex mapping between input (in this case, images) and output space (in this case, physiological parameters) [11]. Analytical solutions, on the other hand, often rely on simplifying assumptions (such as known scattering coefficients) and can be computationally expensive [9]. So far, however, the research community has focused on neural networks without specific representation of uncertainty [12], [13], [14], [15], as also highlighted in a recent review paper [16]. In particular, state-of-the-art methods typically provide point estimates as output and are inherently incapable of representing multiple (potentially equally likely) plausible solutions, as illustrated in Fig. 1. Given this gap in the literature, our contributions are: 1) Uncertainty-aware oxygenation quantification: We present a novel approach to quantitative photoacoustic tomography (qPAT) that leverages conditional invertible neural networks (cINNs) for uncertainty representation. While other uncertainty-aware methods, such as Monte Carlo dropout [17] and bootstrap aggregation [18], typically only provide summary statistics such as the standard deviation, our approach based on cINNs is capable of representing full posterior distributions, thereby capturing potential ambiguities through multiple modes. 2) Comprehensive validation: Due to the non-existence of medical imaging devices capable of determining spatially resolved oxygenation, the training and validation of quantification methods suitable for in vivo application is highly challenging in the field of photoacoustics. To overcome this bottleneck, we employ a multi-stage validation strategy. This approach involves gradually increasing the realism of the test images, albeit at the cost of reducing control over the reference. The validation is based on a total of 1,150
synthetic images generated with the concept of device digital twins and 450 real images from a total of 25 individuals. A preliminary version of this paper, for which neural networks were trained and validated on an extremely simplified toy data set, was presented at a national conference [19].

II. MATERIALS AND METHODS
In this section, we present our new quantification approach based on cINNs (Section II-A) and a strategy for the generation of more realistic annotated training and test data (Section II-B).

A. Photoacoustic Quantification of Tissue Oxygenation
Given the pixel-wise photoacoustic spectra y, our aim is to predict the underlying sO 2 . However, instead of assuming a deterministic mapping y → sO 2 , we phrase the problem in a probabilistic manner and estimate a conditional probability density p(sO 2 | y) (posterior) with cINNs. The following paragraphs provide an intuition of the approach as well as implementation details on the cINN and our method for automatic mode detection.
1) Concept Overview: While we are not aware of any related work on addressing ambiguities in PAT with specific neural network architectures, our own prior work in the field of diffuse reflectance imaging has taken initial steps toward addressing ambiguity in spectral imaging with invertible neural networks (INNs) [20]. To account for the possibility that no bijective mapping from the observable y (here: a photoacoustic spectrum) to the underlying tissue parameters x (here: sO 2 ) may exist, a latent vector z is introduced, which is concatenated with y. Intuitively, z can potentially capture any information from the input that is not contained in the observable y. Then, zero padding is performed to ensure identical sizes of the input and output vectors. The training process ensures that z follows a Gaussian distribution. Given a new observation ŷ (i.e. a spectrum), the latent variable ẑ is sampled from the Gaussian distribution and concatenated with ŷ. Passing through the network in reverse yields a corresponding value for the sO 2 . The sampling process is repeated, such that a full posterior is obtained. This approach introduces new opportunities for uncertainty analysis [21], but it also has several drawbacks. Firstly, INNs are difficult to train due to the need to balance multiple loss terms. Secondly, the number of network parameters is directly affected by the input size, making it challenging to generalize the method for image-level applications. To overcome these shortcomings, and inspired by [22], we leverage the concept of cINNs in this work. More specifically, we use the observable y as a conditioning for an INN that simply maps x to a latent vector z of identical size (see Fig. 2). This way, the size of the input/output vectors of the INN only depends on the size of x, not of y, which substantially reduces the number of network parameters. Furthermore, we get rid of the zero padding, which enables us to perform maximum likelihood training.
Formally expressed: Let f : (sO 2 , y) → z denote a cINN with parameters θ and latent variable z = (z 1 , z 2 ) ∈ R 2 ; then for every y ∈ R 16 ≥0 (corresponding to 16 wavelengths), the map f is guaranteed to be invertible in the first argument. We train f to encode the training data {(sO 2i , y i )} i as a two-dimensional standard normal distribution N (µ, Σ) in the latent space, where µ is the zero vector and Σ is the identity matrix. Using the transformation rule of probability distributions and maximum likelihood training leads to the following loss:

L(θ) = E [ ∥f (sO 2 , y)∥ 2 2 / 2 − J f ],

where J f denotes the log-Jacobi determinant of f and ∥ · ∥ 2 the L2-norm. At inference time, for a newly recorded spectrum ŷ, repeated sampling of latent variables ẑ = (ẑ 1 , ẑ 2 ) with ẑ i ∼ N (0, 1) and computation of the corresponding sO 2 values via f −1 (ẑ, ŷ) yields an approximation of the posterior p(sO 2 | ŷ) (see Fig. 3). Afterwards, automatic processing of the posterior yields human-interpretable quantities, like the number of modes, mode locations, and the interquartile range (IQR), as described below. This way, uncertainties including potential ambiguities can be visualized along with the desired output parameters, as illustrated in Fig. 4.

Fig. 2. Proposed conditional Invertible Neural Network (cINN) architecture. To obtain a two-dimensional latent space (z), the one-dimensional input (sO 2 ) is complemented by an auxiliary variable (av). The core of the architecture comprises a sequence of N = 20 identical blocks, each comprising a random permutation layer and affine coupling blocks [22], [23]. Conditioning is performed on single-pixel spectra (y).
2) cINN Implementation Details: cINNs consist of so-called coupling blocks, which are invertible by design. We base our work on affine coupling blocks as introduced in [24] and refined in [23]. We implement the networks in PyTorch using the Framework for Easily Invertible Architectures (FrEIA) [22]. Our networks consist of 20 blocks with glow-style affine coupling and randomly initialized, fixed channel permutations (see Fig. 2). The subnetworks used in the affine coupling consist of shallow fully connected networks with a single hidden dimension of width 1024, a rectified linear unit (ReLU) activation, and dropout (p = 0.5). The invertibility requires some of the subnetwork outputs to be non-zero, hence they are transformed by the exponential function. To avoid unstable training due to exploding or vanishing terms, we apply soft clamping (α = 1.0) via a tanh activation to the output to bound the values of the exponential term. A design restriction of the coupling blocks is a minimum input dimension of two because of internal channel splitting. As our only quantity of interest (sO 2 ) is one-dimensional, we add an auxiliary dimension consisting of standard Gaussian noise. Hence, the input x is a two-dimensional vector x = (sO 2 , av) with sO 2 ∈ [0, 1] and the auxiliary variable av ∼ N (0, 1). During inference, a sample-based posterior distribution of sO 2 is obtained (see Fig. 3), and the auxiliary variable can be disregarded in this step. All inputs and targets are standardized using the mean and standard deviation of the training set before being fed into the network.
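The mechanics of a single conditional affine coupling step with soft clamping can be sketched as follows. This is a minimal NumPy illustration with toy one-layer "subnetworks" (the random weights and the two-channel toy condition are assumptions for the sketch; the actual networks use 20 blocks with 1024-wide fully connected subnets and 16-channel spectral conditions):

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 1.0  # soft-clamping strength, as in the paper

# Toy subnetwork weights (in practice, trained fully connected networks).
W_s = rng.normal(size=(3,)) * 0.1   # maps (x1, condition) -> raw scale
W_t = rng.normal(size=(3,)) * 0.1   # maps (x1, condition) -> translation

def subnet(x1, y):
    inp = np.array([x1, *y])
    return W_s @ inp, W_t @ inp

def coupling_forward(x, y):
    """One conditional affine coupling step: x = (x1, x2), condition y."""
    x1, x2 = x
    s_raw, t = subnet(x1, y)
    s = ALPHA * np.tanh(s_raw)      # soft clamping bounds the exponential term
    z2 = x2 * np.exp(s) + t
    log_jac = s                     # log |det J| of this transform
    return np.array([x1, z2]), log_jac

def coupling_inverse(z, y):
    """Exact inverse: x1 passes through unchanged, so s and t are recomputable."""
    z1, z2 = z
    s_raw, t = subnet(z1, y)
    s = ALPHA * np.tanh(s_raw)
    return np.array([z1, (z2 - t) * np.exp(-s)])

x = np.array([0.7, 0.3])            # e.g. (sO2, auxiliary variable)
y = rng.normal(size=2)              # toy 2-channel "spectrum" condition
z, log_jac = coupling_forward(x, y)
x_rec = coupling_inverse(z, y)
assert np.allclose(x, x_rec)        # invertible by construction
```

Because the scale and translation depend only on the untouched half of the input (plus the condition), the inverse can recompute them exactly, which is what makes the block invertible by design.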
3) Mode Detection: To detect potential ambiguities automatically, we apply the UniDip clustering algorithm [25], which recursively applies the Hartigan Dip-test of Unimodality [26] to obtain a discrete set of clusters (modes), each of which can be interpreted as a plausible solution. As a point estimate, the median of the cluster is used. In contrast to standard methods like k-means or kernel density estimation (KDE), the method is practically parameter-free apart from a statistical significance level (here used with the default value of 0.05), which makes it robust compared to other methods. In fact, we found that it shows improved stability with respect to resampling the posterior when compared to KDE. If more than one cluster is found, we mark the posterior as multimodal, otherwise as unimodal.
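The dip-test machinery of UniDip is beyond the scope of a short snippet, but the idea of turning posterior samples into a discrete set of modes can be illustrated with the following simplified stand-in, which counts well-separated histogram peaks. This is explicitly not UniDip: the bin count and mass threshold below are illustrative parameters that the actual algorithm does not need (it only takes a significance level).

```python
import numpy as np

def count_modes(samples, bins=50, min_frac=0.05):
    """Crude stand-in for UniDip: count separated histogram peaks.

    A bin counts as a peak if it exceeds both neighbors and holds at
    least `min_frac` of the probability mass.
    """
    hist, edges = np.histogram(samples, bins=bins, range=(0.0, 1.0))
    frac = hist / hist.sum()
    modes = []
    for i in range(1, bins - 1):
        if frac[i] >= min_frac and hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]:
            modes.append(0.5 * (edges[i] + edges[i + 1]))  # bin center
    return modes

rng = np.random.default_rng(1)
unimodal = rng.normal(0.5, 0.02, 5000).clip(0, 1)
bimodal = np.concatenate([rng.normal(0.3, 0.02, 2500),
                          rng.normal(0.8, 0.02, 2500)]).clip(0, 1)
print(len(count_modes(unimodal)), len(count_modes(bimodal)))
```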
4) Results Interpretation: The further processing of multimodal solutions depends on the application. We distinguish three use cases:
• Likelihood-based selection: In some cases, a user may be interested in converting the multimodal posterior to the most likely solution by picking the largest mode. This approach comes with the risk of picking the wrong mode, as the size of a mode is affected by the distribution of the training data, and the most frequent solution does not necessarily correspond to the correct solution.
• User-based selection: In other scenarios, the user may exclude some of the solutions based on prior knowledge. For example, a clinician may know that a pixel in a reconstructed image corresponds to an artery (rather than a vein), and unreasonably low oxygenation values may hence be excluded even if they had a higher likelihood. Thus, naively choosing the mode with the highest likelihood does not always make sense.
• No selection: In many scenarios, it may be sufficient to convey the information that a current solution is ambiguous. In this case, the user may, for example, repeat image acquisition with an altered pose of the device, as illustrated in Fig. 1. Alternatively, including all possible solutions in the decision-making process can be useful. For example, when the set of possible solutions includes critical sO 2 values from a clinical perspective, additional diagnosis steps may be initiated to maximize patient safety.

B. Validation Strategy
Annotating photoacoustic images with corresponding tissue oxygenation is highly challenging, as no reference method for reliably measuring spatially resolved blood oxygenation exists to date. While we train our models exclusively with synthetic data, realistic validation data is needed to reliably assess the value of our method. We approach this bottleneck with a concept leveraging a combination of real and synthetic data, enabled by a real photoacoustic imaging system and its corresponding device digital twin. As illustrated in Fig. 5, our validation concept is based on the explicit disentanglement of factors that are relevant for image formation, namely (1) properties of the imaging device, (2) anatomical tissue properties, as well as (3) optical and acoustic tissue properties. The following paragraphs explain how to generate annotated in silico and in vivo data according to our strategy.
1) Photoacoustic Imaging Device and Digital Twin: Our study is based on the multispectral optoacoustic tomography (MSOT) Acuity Echo device (iThera Medical GmbH, Munich, Germany), capable of generating PAT images co-registered with ultrasound (US) images. Our device digital twin was built exactly according to the specifications of the real PAT system. The detector array has a total of 256 transducer elements, which are arranged on an arc with a radius of 40 mm and a pitch of 0.34 mm. The detector has a center frequency of 3.96 MHz and a bandwidth of 55 %. The detector array is located within the probe such that its focus (center) is 8 mm outside the probe. Between the detector and a 1 mm thick membrane, there is an acoustic couplant which consists mainly of heavy water. The device has a 30 mm wide slit illumination which is located 17.8 mm off the imaging plane and 43.2 mm behind the membrane. The light cone has a full width at half maximum of 8.66° and a 22.2° tilt with respect to the imaging plane such that the center of the cone intersects the imaging plane at the focus. Each photoacoustic image acquisition leads to time series data with 2030 samples recorded at a sampling frequency of 40 MHz.
2) Generation of Annotated Synthetic Data: Our framework for virtual photoacoustic imaging was developed based on the Simulation and Image Processing for Photonics and Acoustics (SIMPA) toolkit [27]. Our simulation pipeline for the generation of a synthetic photoacoustic image comprises six steps: (1) definition of a device digital twin (see previous paragraph), (2) definition of a digital tissue representation, (3) 3D Monte Carlo simulation with the Monte Carlo (MC) eXtreme (MCX) simulation toolkit [28], (4) 2D acoustic simulation with k-Wave [29], (5) noise modeling, and (6) image reconstruction. For the generation of digital tissue representations, we follow a model-based approach. More specifically, we leverage prior knowledge extracted from the literature to render plausible tissue geometries in a probabilistic manner. To this end, the forearm anatomy model described in [30] is used. Afterwards, the tissue geometry maps are assigned realistic optical and acoustic properties drawn from a probability distribution. For this, we use the comprehensive tissue library provided by SIMPA. It is based on various literature resources, e.g., [31], [32], [33]. The tissue geometries with assigned optical and acoustic properties then serve as the basis for the MC simulation. The MC method reformulates photon interactions with tissue by modeling these interactions as the process of drawing random variables from underlying probability distributions, which are then used to calculate photon step sizes and directions. By repeating this process a large number of times, the radiative transfer equation, which models this process analytically, can be approximated numerically.
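The core idea of drawing photon step sizes from an underlying probability distribution can be sketched in a few lines. For a homogeneous medium (the coefficients below are illustrative, not taken from the tissue library), the free path length until the next interaction is exponentially distributed with rate µ t = µ a + µ s and can be sampled by inverse-transform sampling:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative absorption / scattering coefficients [1/mm] of a homogeneous medium.
mu_a, mu_s = 0.1, 10.0
mu_t = mu_a + mu_s  # total interaction coefficient

# Inverse-transform sampling: step = -ln(xi) / mu_t with xi ~ U(0, 1).
xi = rng.random(1_000_000)
steps = -np.log(xi) / mu_t

# The empirical mean free path converges to 1 / mu_t.
print(steps.mean(), 1.0 / mu_t)
```

Repeating such draws for step sizes (and analogous draws for scattering directions) over many photons is what lets the MC method approximate the radiative transfer equation numerically.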
3) Generation of Annotated in Vivo Data: Our strategy for realistic validation data generation is based on the following observation: Several complementary tissue properties are relevant for image formation, but only a subset can be recovered from reconstructed photoacoustic images in a straightforward manner. Specifically, we categorize the relevant tissue parameters into three groups: Anatomical properties are related to the geometrical properties of tissue anatomy. As blood and melanin are the primary absorbers of light in tissue [34], the geometry of skin, veins, and arteries can be assumed to be the most influential factor on photoacoustic images in terms of tissue geometry. Optical properties define how the light propagates through the tissue. The relevant quantities are the optical absorption µ a , the scattering µ s , the anisotropy g, and the refractive index n (in our case set constant to one). Acoustic properties also influence the obtained photoacoustic image, mainly due to variations in the speed of sound v s and tissue density ρ across different structures. While there is no reliable reference method for recovering optical and acoustic properties, geometrical properties can be extracted from the reconstructed photoacoustic images, supported by additional information from co-registered US images. To annotate in vivo images with tissue oxygenation, we therefore propose representing tissue geometry as semantic segmentation maps that are enhanced by plausible optical and acoustic properties based on knowledge from the literature. Using a digital twin of the real photoacoustic device, an MC simulation based on the semantic segmentation maps yields realistic photoacoustic images annotated with tissue oxygenation.

Fig. 6. Reference and Learned Spectral Decoloring (LSD) [15] results relative to the conditional Invertible Neural Network (cINN) prediction for cases with bimodal posterior distribution on DS Test Synthetic. For each test case, the oxygen saturation (sO 2 ) prediction results were normalized such that 0 corresponds to the largest cluster of the cINN and 1 corresponds to the second largest cluster. The reference is mostly close to one of the two solutions of the cINN. In contrast, the state-of-the-art deep learning-based method (LSD) either predicts a value close to the largest cluster of the cINN or a value that lies in between the modes.
The data sets generated according to this strategy are introduced in Section III-A.
4) Performance Assessment: A naive validation approach would be to compare the maximum a posteriori probability (MAP) estimate (in the case of the cINN, the largest cluster) to the reference point estimate. However, the MAP estimate is not necessarily the best solution. As illustrated in Fig. 6, the smaller mode might be the correct solution and correspond to the reference. As it might be possible to exclude modes based on prior knowledge, we also consider the closest mode (see Section II-A.4). Furthermore, we complement the traditional mean absolute error-based analysis with a mode-centric analysis. Specifically, we compute the Recall (or: Sensitivity) of mode detection as the fraction of reference modes for which the point estimate / detected mode lies within a range of five percentage points (pp). For all plots and results, hierarchical aggregation is performed prior to computing descriptive statistics to account for the non-independence of spectra. Specifically, the errors of the individual pixel predictions are averaged per image.
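One possible reading of this mode-centric Recall can be sketched as follows (the helper name and toy values are hypothetical, and the exact matching rule used in the paper may differ in detail):

```python
def mode_recall(reference_modes, detected_modes, tol_pp=5.0):
    """Fraction of reference modes matched by a detected mode within tol_pp.

    Values are given in percentage points (pp) of sO2; the 5 pp tolerance
    follows the description in the text.
    """
    hits = sum(
        any(abs(r - d) <= tol_pp for d in detected_modes)
        for r in reference_modes
    )
    return hits / len(reference_modes)

# Toy example: two of the three reference modes are recovered.
ref = [30.0, 60.0, 95.0]
det = [28.0, 93.0]
print(mode_recall(ref, det))
```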

III. EXPERIMENTS AND RESULTS
Following common practice in the photoacoustics community [16], we based the validation of our photoacoustic quantification concept on human forearm data. While model training was only performed on purely synthetic data, we validated our approach in several stages featuring different degrees of realism and reliability of the reference. The following paragraphs present the data sets for this study (Section III-A), implementation details for our method and the baseline method (Section III-B), as well as the in silico (Section III-C) and in vivo (Section III-D) experiments.

A. Data Sets
In all data sets, the photoacoustic spectra consisted of wavelengths in the interval [700 nm, 850 nm] in 10 nm steps. L1-normalized spectra of veins and arteries were extracted for further analyses. Example images and a comparison of the data sets are shown in Table I.
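The wavelength grid and the L1 normalization mentioned above can be sketched as follows (the toy spectrum is an assumption for illustration; normalization removes the unknown global scaling of the photoacoustic signal before the pixel spectra are used further):

```python
import numpy as np

# 16 wavelengths from 700 nm to 850 nm in 10 nm steps.
wavelengths = np.arange(700, 851, 10)
assert len(wavelengths) == 16

def l1_normalize(spectrum):
    """Scale a pixel spectrum so its absolute values sum to one."""
    return spectrum / np.abs(spectrum).sum()

spectrum = np.exp(-0.003 * (wavelengths - 760) ** 2)  # toy pixel spectrum
norm = l1_normalize(spectrum)
print(np.abs(norm).sum())
```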

1) DS Train Synthetic: This data set was generated with our synthetic data generator (Section II-B.2) and was used for model training. Tissue classes comprised epidermis (sO 2 = 0 %) [35], muscle (sO 2 ∼ U [50 %, 80 %], blood volume fraction ∼ U [10 %, 50 %]) [36], arteries (sO 2 ∼ U [90 %, 100 %]) [32], and veins (sO 2 ∼ U [60 %, 80 %]) [31]. For references where only single values were reported, we deliberately chose broad ranges around these values to increase the variability of the training data set. Furthermore, US gel, the membrane, and the heavy water inside the device were modeled. The Grüneisen parameter was set to a constant value of approximately 0.2 for all classes. Volumes of size 75 mm (transducer dim) × 20 mm (planar dim) × 20 mm (height) were created with a spatial resolution of 0.1 mm. Inspired by [37], we recorded water bath measurements with our photoacoustic device to obtain a realistic noise model. From those measurements, we extracted the laser energy distribution and the additive device-specific noise (mainly thermal and complex parasitic noise). Our simulation pipeline then comprises the following steps: (1) optical forward simulation using MCX with 1 × 10⁸ photons, incorporating the obtained real laser energy fluctuations, (2) 2D acoustic forward modeling with k-Wave, where the digital device twin is modeled with an increased sampling frequency of 54 MHz for better numerical stability [38] and the detector elements of width 0.24 mm are represented by the off-grid kWaveArray sensor class, which natively models directional sensitivity [39], (3) bandpass filtering, (4) adding device-specific noise, and (5) reconstruction with the Delay-And-Sum (DAS) algorithm [40]. In step (4), instead of sampling from a Gaussian noise distribution, we sampled time series data from over 1,000 previously recorded water bath measurements. We removed the membrane signal from the water bath measurements, leaving only the time series data consisting of device-specific noise. We then scaled this noise and added it to
our in silico time series data with a scalar factor such that the peak signal-to-noise ratio matched that of in vivo measurements of the forearm. We multiplied the output of step (1) with the laser energy corresponding to the sampled water bath measurement and subsequently divided the output of step (4) by the laser energy to align the simulation with the processing of the in vivo data. The resulting images were cropped and downsampled to a resolution of 0.15625 mm (expected resolution of the real device) such that we ended up with multispectral photoacoustic images of dimension 16 × 128 × 256 voxels. 1,000 volumes were created, of which 100 were exclusively used for validation during model optimization.
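The noise-scaling step can be sketched as follows. The exact peak-signal-to-noise-ratio definition used in the paper is not specified, so the peak-amplitude-over-RMS variant below, the helper name, and the toy signals are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_noise_to_psnr(signal, noise, target_psnr):
    """Scale recorded noise so that peak(signal) / rms(noise) hits target_psnr.

    Assumed PSNR definition: peak absolute signal amplitude over noise RMS.
    """
    peak = np.abs(signal).max()
    rms = np.sqrt(np.mean(noise ** 2))
    return noise * (peak / (target_psnr * rms))

signal = np.sin(np.linspace(0, 20, 2030))  # toy time series (2030 samples)
noise = rng.normal(size=2030)              # stand-in for a water bath recording
scaled = scale_noise_to_psnr(signal, noise, target_psnr=10.0)
psnr = np.abs(signal).max() / np.sqrt(np.mean(scaled ** 2))
print(psnr)
```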

2) DS Test Synthetic: For an in-domain validation of our approach, we generated DS Test Synthetic, comprising 150 additional volumes generated in the same way as those in DS Train Synthetic.

3) DS Test Real: To assess the performance of our approach on real data, we acquired 150 in vivo photoacoustic images from 25 healthy human volunteers (three on each arm of each volunteer). Prior to recordings, the consent of the volunteers was obtained, and relevant regulations were adhered to during the entire process (German Clinical Trials Register reference number DRKS00023205). The images were reconstructed using DAS. The postprocessing consisted of laser pulse energy correction and cropping. Afterwards, the images were semantically annotated by domain experts using both the photoacoustic signal and the corresponding US images. The classes considered were skin, muscle, artery, vein, transducer membrane, transducer head, and US gel. To account for the fact that labeling instructions are crucial for transparency and consistency [41], the annotators received the instructions proposed in [42]. The resulting data set allowed for assessing whether the predictions of sO 2 for veins and arteries were in plausible ranges.

4) DS Test Hybrid: To go beyond plausibility analyses with the acquired in vivo data, we followed the strategy described in Section II-B.3 to convert the 150 experimental images into realistic photoacoustic images annotated with reference tissue oxygenation. A further data set, DS o.o.d. Test Real, was generated analogously to DS Test Real (i.e., same volunteers) but comprises 300 images of the human calf and neck.

B. Neural Networks
As a baseline method, we implemented the state-of-the-art method Learned Spectral Decoloring (LSD) for estimating tissue oxygenation [15]. We optimized the architecture and training procedure provided by the authors based on our data set for better performance. Our network consisted of two hidden layers of size 256, ReLU activations, and dropout (p = 0.5). In contrast to our cINN approach, this method is only capable of providing a single point estimate without an uncertainty estimate.
For all our experiments, our networks were trained for 100 epochs using the AdamW optimizer [43] (initial learning rate: 1 × 10⁻³; weight decay strength: 0.01; batch size: 1024). The learning rate was reduced by a factor of ten in epochs 80 and 90, respectively. The training loss is plotted in Fig. 8.
The UniDip clustering was applied to a posterior consisting of 5,000 samples. When a point estimate was required, the median of the largest cluster identified by the UniDip algorithm was used.

C. In Silico Validation
The purpose of the in silico experiments was to validate the value of our quantification approach in a setting where the distributions of training and test data are identical. According to our validation on DS Test Synthetic, the shapes of posteriors were highly variable (see Fig. 7), and multiple modes occurred in a substantial number of cases (13 %). The mean/median number of modes for multimodal posteriors was 3.3/3 with an IQR of 2-4. Thirty-one percent of the multimodal posteriors comprised two modes. We further analyzed the depth dependence of the ambiguity and found that pixels less than 4 mm away from the surface have a multimodal posterior in 4 % of the cases, while pixels deeper than that show multimodal solutions in 23 % of the cases. The calibration curve (see Fig. 9 (top)), suggested in [44] to assess the quality of uncertainty estimation in INNs, was close to the identity with a median absolute calibration error smaller than 1 %. As shown in Fig. 9 (bottom), the oxygenation estimates were accurate, with the absolute error increasing substantially in cases of multiple modes. However, when using the median of the closest (rather than the largest) cluster as an estimate, this effect was considerably reduced. In this case, our method also outperformed the state-of-the-art method LSD [15] with a substantial error reduction of 46 % (see Table II). Both deep learning-based methods outperformed the model-based LU by a large margin. Fig. 6 shows the prediction differences in the bimodal case between LSD and the cINN. In most of the cases, one of the predicted modes of the cINN is close to the reference, while the baseline method (LSD) usually predicts a value similar to the largest cluster of the cINN or a value that lies in between the two modes of the cINN. More specifically, in 45 % of the cases, the LSD prediction is closer to the mean of the cINN modes than to the reference solution. When considering all the reference solutions as modes to be detected by the qPAT methods, our approach increases the Recall by 16 pp compared to the LSD method (86 % vs 70 %) in the multimodal case.
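A calibration curve of the kind used here can be sketched as follows: for each credibility level q, one measures how often the reference value falls inside the central q-interval of the sampled posterior, and compares that observed frequency against q itself. The exact construction in [44] may differ in detail; the synthetic "perfectly calibrated" toy model below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_curve(posterior_samples, references, levels):
    """For each credibility level q, the fraction of cases whose reference
    value falls inside the central q-interval of the sampled posterior."""
    observed = []
    for q in levels:
        lo, hi = (1 - q) / 2, 1 - (1 - q) / 2
        inside = [
            np.quantile(s, lo) <= r <= np.quantile(s, hi)
            for s, r in zip(posterior_samples, references)
        ]
        observed.append(np.mean(inside))
    return np.array(observed)

# Toy calibrated model: references are drawn from the same distributions
# as the posterior samples, so the curve should be close to the identity.
means = rng.random(200)
posteriors = [rng.normal(m, 0.05, 5000) for m in means]
refs = rng.normal(means, 0.05)
levels = np.linspace(0.1, 0.9, 9)
obs = calibration_curve(posteriors, refs, levels)
print(np.median(np.abs(obs - levels)))  # median absolute calibration error
```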

D. In Vivo Validation
The purpose of the in vivo validation was to test our method on realistic photoacoustic data. A representative output of our method is shown in Fig. 4. As illustrated in Fig. 9 (top), the model tested on DS Test Hybrid showed a slight overconfidence in its predictions. 21 % of the samples yielded a multimodal posterior distribution. The mean/median number of modes for multimodal posteriors was 3.4/3 with an IQR of 2-4. Thirty-three percent of the multimodal posteriors comprised two modes. The mean absolute errors were higher compared to validation on the in-domain data DS Test Synthetic. Moreover, multimodal posteriors led to a higher error compared to unimodal posteriors (see Fig. 9 (bottom)). Again, using the median of the closest cluster as an estimate considerably reduced this effect, and our method outperformed LSD, with a substantial error reduction of 56 % (see Table II). The Recall in the multimodal case was 68 % for the cINN and 43 % for the LSD method.

The median estimates obtained for DS Test Real were in close agreement with values from the literature. In fact, the median sO2 computed for arteries was between 90 % and 100 %, while veins were associated with lower sO2 values around 60 %-70 % (see Fig. 10). On average, the deep learning-based methods yielded more plausible sO2 estimates for arteries (LSD: 94 %, cINN: 93 %) and veins (LSD: 78 %, cINN: 66 %) compared to the model-based method (LU: 78 %/46 %). However, it could also be observed that several vessels labeled as arteries were associated with relatively low sO2 below 80 %. As the medical expert had reported difficulties in distinguishing arteries from veins in some cases, we reduced the test set to those pixels corresponding to vessels for which the annotations could be assumed to be correct. Specifically, we selected only those vessels that were identified as either the radial or ulnar artery, or as an accompanying vein of one of them. We assumed that restricting the annotations to these vessels, which represent easily identifiable anatomical structures, would increase the quality of the reference annotations. Reducing the analyses to only highly confident annotations yielded even more realistic estimates, as shown in Fig. 10. Furthermore, the results when applying our model to the in vivo out-of-distribution data set DS o.o.d. Test Real (calf and neck images) are shown. Similarly to the results on the forearm data, the estimates for arteries and veins are mostly in plausible ranges.

IV. DISCUSSION
Accurate and reliable quantification of tissue oxygenation promises tremendous clinical impact, as pilot studies have already demonstrated applications in cancer research as well as cardiovascular and inflammatory diseases [3], [4], [5]. In this work, we proposed the use of cINNs for the quantification of tissue oxygenation based on photoacoustic imaging. Our approach is, to our knowledge, the first to not only address uncertainties in the predictions but also to systematically capture potential underlying ambiguities. As an important finding, our analysis suggests that the problem of recovering tissue oxygenation from single PAT spectra is not generally well-posed, as indicated by the high fraction of posteriors with multiple modes. Such ambiguities can potentially be resolved by optimizing acquisition protocols, as illustrated in Fig. 1. For example, a physical point may be located directly underneath a large vessel, making the reconstruction problem very challenging. Changing the probe pose may yield a configuration in which the same point is no longer located underneath a large absorber. Exactly how to perform this acquisition optimization, however, remains a subject of future work. Furthermore, our approach could be used to investigate how different factors, such as optical/acoustic heterogeneity, the bandwidth of the detector, noise, and the reconstruction algorithm, contribute to the ambiguity of the problem.
Invertible neural network architectures feature several advantages compared to other methods for recovering full posteriors. Compared to approximate Bayesian computation (ABC) [45] and related methods, the posterior generation is computationally much more efficient. A major advantage compared to neural networks that predict parametric posteriors is the power to approximate even complex distributions with multiple modes. Samples generated with conditional generative adversarial networks (cGANs), in contrast, often suffer from a lack of diversity [46]. A limitation of cINNs could be seen in the lack of dimensionality reduction, leading to a huge number of parameters in the case of high-dimensional data. However, recent advances in manifold learning might help to address this issue [47].
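To illustrate the underlying mechanism, the building block of a cINN can be sketched as a conditional affine coupling layer: half of the vector, together with the conditioning spectrum, parameterizes a scale and shift of the other half, which makes the mapping invertible in closed form while its log-determinant enters the negative log-likelihood loss. The layout below is a minimal PyTorch sketch for illustration, not the architecture used in this work:

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Minimal conditional affine coupling block (illustrative layout)."""

    def __init__(self, dim, cond_dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # subnetwork predicting log-scale s and shift t for the second half
        self.net = nn.Sequential(
            nn.Linear(self.d + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x, cond):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                      # bounded log-scale for stability
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                 # enters the NLL training loss
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z, cond):
        # exact inverse: posterior samples are drawn by pushing z ~ N(0, I)
        # through this direction, conditioned on the measured spectrum
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (z2 - t) * torch.exp(-s)
        return torch.cat([z1, x2], dim=1)
```

Stacking several such blocks (with permutations in between) and minimizing 0.5 * ||z||^2 - log_det yields a network whose inverse pass, fed with standard normal samples and a fixed spectrum, produces samples from the approximate posterior.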
Application of (c)INNs in the medical imaging literature is still sparse [48], [49]. We are aware of related work in the fields of astrophysics [50], particle physics [51], and computer vision [52]. According to our experiments, our models are well calibrated, indicating that the full posterior is valuable for assessing the risk of prediction errors. The fact that performance decreased on DS Test Hybrid can be attributed to the domain gap resulting from the fact that tissue geometries generated with our model-based approach differ from real tissue geometries. The uncertainty estimates nevertheless remain informative, as excluding multimodal posterior distributions from the prediction greatly reduces the error (see Fig. 9).
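The calibration analysis can be reproduced from posterior samples alone: for each credible level, one checks how often the reference value falls inside the corresponding central credible interval; a perfectly calibrated model traces the identity. A minimal sketch, assuming posterior samples of shape (n_cases, n_samples):

```python
import numpy as np

def calibration_curve(posterior_samples, references,
                      levels=np.linspace(0.05, 0.95, 19)):
    """For each credible level, compute the fraction of reference values
    falling inside the central credible interval of the sampled posterior."""
    observed = []
    for alpha in levels:
        lo_q, hi_q = (1 - alpha) / 2, 1 - (1 - alpha) / 2
        lo = np.quantile(posterior_samples, lo_q, axis=1)
        hi = np.quantile(posterior_samples, hi_q, axis=1)
        inside = (references >= lo) & (references <= hi)
        observed.append(inside.mean())
    observed = np.asarray(observed)
    # median absolute deviation from the identity line
    mace = np.median(np.abs(observed - levels))
    return levels, observed, mace
```

Plotting `observed` against `levels` gives a curve like Fig. 9 (top); `mace` corresponds to the median absolute calibration error reported in the results.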
Validation of our approach was based on a comprehensive pipeline for synthetic data generation, comprising the steps of tissue generation, optical simulation, acoustic simulation, noise modeling, and reconstruction. It should be noted, however, that photoacoustic image synthesis is a complex research topic of its own [53], and full realism has generally not yet been achieved [16]. One of the main problems is the limited knowledge about the anatomical, optical, and acoustic properties of the tissue. One important limitation of our synthetic data is that we only considered the epidermis in our model of the skin. Future work should include more elaborate models, such as those recently presented in [54]. Moreover, the background tissue was modeled homogeneously, which does not reflect reality. However, because our method is agnostic to the specific data generation method, it can be adapted to future developments in a straightforward manner. Furthermore, our assessment on real data compensates for the imperfections of the simulations.
The predictions on real photoacoustic data for arteries and veins were consistent with values from the literature [31], [32]. In particular, the majority of posteriors lay within the support of the prior distribution. This indicates that there is only a moderate domain shift between the synthetic training data and the real data. However, only a few venous vessels in our study showed an sO2 prediction between 70 % and 80 %, even though venous blood in the forearm was found to have an sO2 of 70 % with a large standard deviation of 20 % [55]. There was also a non-negligible fraction of vessels that were annotated as arteries but yielded an implausible estimate of sO2 < 90 %. This can be due to either a wrong prediction of the model or wrong annotations. An initial study of inter-rater variability revealed large labeling uncertainty, indicating that pixels erroneously annotated as vessels could in fact be background and that veins and arteries could have been confused in some cases. Using only refined annotations that are more certain removed most of the implausible predictions. Our method further yields mostly plausible predictions when applied to calf and neck data, on which it was never trained. However, a dedicated synthetic data set could potentially further increase the performance. In fact, although the study focuses on blood vessels, the approach can be applied to any structure given sufficient simulations.
A possibly related limitation of our approach could be seen in the fact that uncertainty estimates can only be expected to be reliable if the presented input lies within the distribution of the training data. To investigate the agreement of our training data with the real data, we followed the approach proposed in [21]. More specifically, we trained an ensemble of INNs to compute the widely applicable information criterion (WAIC) as an o.o.d. metric. According to our analysis, the spectra extracted from arteries are in much closer agreement with the training data compared to the spectra extracted from veins. This might be a possible explanation for the more plausible values obtained for tissue oxygenation in arteries. Future work should thus be directed at improving the realism of the synthetic training data even further.
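One common formulation of the WAIC-based o.o.d. score from an ensemble of likelihood models is the ensemble mean of the per-input log-likelihood penalized by its variance; whether this exact variant matches [21] is an assumption of the sketch below:

```python
import numpy as np

def waic_score(log_likelihoods):
    """Per-input WAIC score from an ensemble of log-likelihoods.

    log_likelihoods: array of shape (n_models, n_inputs), where entry
    (i, j) is log p_i(x_j) under ensemble member i. Higher values
    suggest the input lies closer to the training distribution
    (assumed formulation: mean minus variance over the ensemble).
    """
    mean_ll = log_likelihoods.mean(axis=0)
    var_ll = log_likelihoods.var(axis=0)
    return mean_ll - var_ll
```

Inputs on which the ensemble members agree (low variance) and assign high likelihood score high; spectra from veins would, under our analysis, receive lower scores than spectra from arteries.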
It should further be noted that a photoacoustic spectrum corresponding to a given location in the tissue highly depends on the 3D context due to the so-called spectral coloring effect [16]. It may thus be desirable to train image-based models rather than models that solely use single pixels as input. We explored this approach but found that the number of simulated images (1,150) was not sufficient for the image-based approach. At this point, the simulation becomes the bottleneck (more than 100 days on RTX 2080 GPUs for the data set). Nevertheless, we will continue to explore this path, as an image-based approach may potentially resolve ambiguities that occur in single-pixel approaches and provide more accurate estimates of uncertainty. Furthermore, computational acceleration of the cINN processing pipeline should be considered, as inference is currently too slow to provide real-time human-interpretable results. It is worth noting in this context that the extraction of a discrete set of plausible solutions requires two methodological challenges to be addressed: the computation of the posterior (here solved with cINNs) and the extraction of modes from the posterior (here addressed with a simple clustering algorithm). While our paper focuses on the first task, future research should aim at developing efficient mode detection methods that are both robust and fast.
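As a simple illustration of the second challenge, modes of a sample-based 1D posterior can be extracted by locating local maxima of a kernel density estimate; this is a hypothetical stand-in for the clustering algorithm used in this work, not its implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def detect_modes(samples, grid_size=200):
    """Detect posterior modes as local maxima of a kernel density
    estimate fitted to the 1D posterior samples (illustrative only)."""
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    density = gaussian_kde(samples)(grid)
    # interior grid points higher than both neighbors are local maxima
    is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
    return grid[1:-1][is_peak]
```

For a clearly bimodal posterior, the function returns two mode locations; robustness to sampling noise and bandwidth choice is exactly what makes fast, reliable mode detection a research question of its own.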
In summary, our experiments suggest that cINNs could emerge as a powerful tool for uncertainty-aware photoacoustic quantification of tissue oxygenation. Future work should be directed at exploiting the potential of the approach to generate patient benefit.

Fig. 1 .
Fig. 1. Uncertainty-aware approach to photoacoustic quantification of blood oxygen saturation (sO2) using conditional Invertible Neural Networks (cINNs). Standard neural networks (red) that output point estimates are incapable of representing the ambiguity of the problem. As they are forced to output a single solution, they will typically output the mean of multiple plausible solutions in cases of ambiguity. In contrast, the proposed cINN (blue) makes the ambiguity fully transparent. In the provided example, the ambiguity can be removed by optimizing the acquisition pose.

Fig. 3 .
Fig. 3. Visualization of the sampling process of the conditional invertible neural network (cINN). During inference, the latent space of the cINN is sampled repeatedly (n times, here shown from left to right for n = 10, 100, 5000). Using the newly recorded spectrum ŷ as a conditioning factor, traversing the cINN with different ẑ values yields an approximation of the posterior p_Θ(sO2 | ŷ). The top row shows the sampled latent vectors ẑ = (z1, z2) with the corresponding sO2 as color coding. The bottom row shows the reference sO2 value (green) and the corresponding sO2 distribution obtained from the cINN.

Fig. 4 .
Fig. 4. Representative output of our method on real data from DS Test Real. The central vessels are arteries, which should correspond to high estimated oxygen saturation (sO2) (red), while the accompanying veins should be associated with low values (blue). Implausible in vivo results, such as high sO2 (red) in veins or low sO2 (blue) in arteries, typically align with estimates of low certainty, characterized by a high interquartile range (expressed in percentage points, pp).

Fig. 5 .
Fig. 5. Validation concept based on real, synthetic, and hybrid data sets (DS). The explicit disentanglement of factors that are relevant for image formation (device, anatomical properties, optical and acoustic properties) enables us to leverage both prior knowledge encoded in the literature and real data represented by measurements. The synthetic and hybrid data sets are generated according to the strategies outlined in Section II-B.

DS o.o.d. Test Real: To show generalizability to out-of-distribution (o.o.d.) data, we validated our approach on an additional in vivo data set. It was acquired in the same way as DS Test Real.

Fig. 7 .
Fig. 7. Representative oxygen saturation (sO2) posterior distributions obtained for samples from DS Test Synthetic and the corresponding reference sO2 values (green).

Fig. 8 .
Fig. 8. Training loss for the conditional Invertible Neural Network (cINN), trained with negative log-likelihood, and Learned Spectral Decoloring (LSD), trained with mean absolute error. The decrease around the 80th epoch corresponds to a learning rate reduction.

Fig. 9 .
Fig. 9. Quantitative results on DS Test Synthetic (blue) and DS Test Hybrid (orange). The top row illustrates the calibration curve, while the bottom row displays dots representing the mean absolute error in percentage points (pp) for the individual test images. The same data is used as a basis for the box plots.

Fig. 10 .
Fig. 10. Distribution of oxygen saturation (sO2) predictions for veins and arteries on DS Test Real (forearm images) and DS o.o.d. Test Real (calf and neck images). All individual pixel predictions were aggregated per vessel using the median. The colors distinguish between the 25 volunteers. For the forearm, a subset of predictions considering only high-certainty annotations is additionally presented. The top row shows an example photoacoustic image at 800 nm for each of the body sites.

TABLE I
DATA SETS USED FOR THIS WORK. WHILE TRAINING IS PERFORMED EXCLUSIVELY WITH SYNTHETIC DATA (DS SYNTHETIC), VALIDATION IS PERFORMED ON SYNTHETIC, REAL (DS REAL), AND HYBRID DATA (DS HYBRID).