Model based learning for accelerated, limited-view 3D photoacoustic tomography

Recent advances in deep learning for tomographic reconstructions have shown great potential to create accurate and high quality images with a considerable speed-up. In this work we present a deep neural network that is specifically designed to provide high resolution 3D images from restricted photoacoustic measurements. The network is designed to represent an iterative scheme and incorporates gradient information of the data fit to compensate for limited view artefacts. Due to the high complexity of the photoacoustic forward operator, we separate training and computation of the gradient information. A suitable prior for the desired image structures is learned as part of the training. The resulting network is trained and tested on a set of segmented vessels from lung CT scans and then applied to in-vivo photoacoustic measurement data.


I. INTRODUCTION
Photoacoustic Tomography (PAT) is an emerging "Imaging from Coupled Physics" technique [1] that can obtain high resolution 3D in-vivo images of absorbed optical energy by sensing laser-generated ultrasound (US) [2], [3], [4], [5], [6], [7]. If data is obtained over a complete surface surrounding the domain of interest, and for all times over which the acoustic waves are propagating, then the inverse problem can be solved directly by several analytical or numerical algorithms [8].
The fastest of such methods just require to solve a single wave equation; see Section II-B for details. In many practical applications, restricted spatial and/or temporal sampling of the US signal is either imposed due to geometrical limitations (e.g. limited view) [9], or by the choice to utilise a compressedsensing (CS) undersampling strategy in order to accelerate data acquisition [10]. In such cases, direct reconstruction methods are not optimal to obtain high quality reconstructions as they give rise to artefacts and/or adverse noise amplification.
Recently, several groups showed that variational image reconstruction methods that iteratively minimise a penalty function involving an explicit model of the US propagation and prior constraints on the image structure can provide significantly better results in these situations [11], [12], [13], [14], [15], [16]. However, a crucial drawback of these methods is their considerably higher computational complexity and the difficulty to handcraft prior constraints that capture the spatial structure of the target accurately enough.
As the strongest contrast in biological soft tissue is given by haemoglobin, a central promise of PAT is to deliver high quality images of blood vessel networks, e.g., for assessing the vascularization of tumors [17], [18]. Consequently we assume in this study that our targets are vessel rich and hence we learn suitable prior constraints from a set of segmented vessels.

A. Deep Learning in Image Reconstruction
The huge recent success of Deep Learning methods in image processing and computer vision has seen an increasing interest in applying similar strategies to tomographic reconstruction problems. Deep Neural Networks (DNN) are especially popular due to the low latency of a forward pass through a network which leads to prospective highly efficient reconstruction algorithms.
In this paper we differentiate between two fundamentally different approaches to involve learning in image reconstruction: 1) Reconstruction followed by learning based postprocessing. In this approach image reconstruction is carried out using a simple inversion step, and post-processing is used to remove artefacts and noise. 2) Model based learning and reconstruction. In this approach the forward and adjoint operators of the imaging problem are used directly in the inverse algorithm, with a multiscale regularisation scheme whose parameters are learned in the training phase. Many applications of Deep Learning for image reconstruction have been concentrated on the first approach by using a fast and simple direct reconstruction algorithm to obtain low quality and corrupted images and then train a convolutional neural network (CNN) on removing those artefacts, see [19], [20] for an application to CT, [21] for PAT, and MRI [22].
Alternatively following the second approach by including the physical forward model into the network has been studied in [23], [24], [25], [26], [27]. However, these improved results in reconstruction quality typically come at the cost of longer computation times which are effectively limited by the repeated simulation of the physical model.
In this paper we take the second approach. In particular, we utilize our knowledge of the forward operator in the reconstruction process, but we will not invoke handcrafted prior constraints on the vessel structures that we are interested in. Instead, we will learn them from the data.

B. Compressed Sensing and Limited View PAT
In several imaging modalities the application of compressed sensing methods has been studied as a means to achieve faster acquisition speeds and/or a reduced dose when using ionising radiation [28], [29], [30]. In PAT this has been studied for example in [13], [31], [32], [33], [34]. Because these methods mandate an appropriate regularisation strategy, the involvement of Deep Learning in compressed sensing is an important topic for study.
As well as data sub-sampling, in this paper we also consider the limited-view problem. Due to geometric restrictions, one can often only access the US field on one side of the tissue. A detailed examination and discussion of sub-sampling combined with the limited view problem for PAT can be found in [13].

C. Overview of this paper
The rest of the paper is organised as follows. In Section II we discuss the physical model of photoacoustic signal generation, as well as direct reconstruction approaches, variational and the corresponding iterative reconstruction approaches, and an outline of the model based learning approach. In Section III we give a detailed description of the architecture and implementation of the model based learning approach as well as a description of its training steps. In Section IV we discuss the measurement details, generation of training data as well as post-processing, i.e. denoising/artifact removal, of direct reconstructions. Results for simulated and 3D in-vivo data are shown. Section V provides a detailed evaluation of the results. Finally in Section VI we provide some conclusions and outlook for the future.

II. PHOTOACOUSTIC TOMOGRAPHY A. Photoacoustic Signal Generation
To generate the PA signal, a short pulse of near-infrared laser light is sent into biological tissue where the photons will get scattered and absorbed by any chromophores present. Under certain conditions (see [35] for details), part of the absorbed optical energy will be thermalised, i.e., converted to heat, and the induced local pressure increase x travels through the tissue as an US wave (photoacoustic effect). Spatio-temporal measurements of these waves at the boundary of the tissue constitute the PA signal y. A common way to model the acoustic part of the signal generation is to consider the following initial value problem for the wave equation [8], [12], [35], (1) The US sensing is then modeled as a linear operator M acting on the pressure field p(r, t) restricted to the boundary of the computational domain Ω and a finite time window (see [3], [36] for details on measurement systems): Equations (1) and (2) define a linear mapping from initial pressure x to measured pressure time series y, which constitutes the forward problem in PAT. The corresponding image reconstruction step constitutes the inverse problem to (3).
Note that x is a N x × N y × N z 3D image of initial pressure and y is a N h × N v × N t volume of acquired pressure data as a function of acoustic propagation time.
In the examples used in this paper this results in dimension of A of around 7M by 4.6M which (if fully dense) would require about 123TB of memory in single precision which is intractable for currently available computational resources. Thus image reconstruction methods require either direct, or iterative "matrix-free" implementations as discussed in the next sections.

B. Direct methods for PAT Image Reconstruction
Direct methods are especially attractive in the large scale setting as they only require solution of a single wave equation; i.e., given a computational solver for (1) we can compute an inverse solution with the same computational cost [12]. In particular in this study we choose to compute the adjoint solution A * y, which is close to the inverse solution.
Here, as the wave solver we use a pseudo-spectral timedomain method [37], [38], [39] as implemented in the k-Wave Matlab Toolbox [40], which allows to run the computations on GPU cards using fast CUDA code.
Whilst direct approaches are computationally efficient they are inadequate for dealing with the sub-sampled limited-view data employed in this paper as we demonstrate next. Figure 1 illustrates the influences of limited-view and sub-sampling on a simple numerical phantom of tubes that should mimic blood vessels. From Figure 1(c), we can see that a reconstruction by A * y suffers from severe circular artefacts [41] and a systematic loss of contrast with depth. Figure 1(d) shows that these problems are accentuated with sub-sampled data.

C. Variational approach to PAT image reconstruction
Variational methods aim to recover the PA image x in (3) from the measurement y as a minimiser of a penalty function, where the fidelity term d(y, Ax ) measures the data fit and a regularising term R(x) encodes prior knowledge about the structures in the image. Often, R(x) is convex but not differentiable. A simple approach to find a solution to (4) is given by a proximal-gradient-descent scheme: with step length γ > 0 and where the proximal operator solves an image denoising problem: The drawback of the above procedure is the difficulty to choose a suitable regularisation term R(x), a regularisation parameter λ > 0, that balances data fit and the regularisation properties, and the potentially large number of iterations it takes to converge.
As shown in, e.g., [11], [12], [13], iterative image reconstruction methods of the form (5) that solve variational regularisation problems [42] like (4) can improve upon the direct image reconstruction methods. For instance, we can incorporate the physical constraint that the initial pressure increase x is always positive by choosing R(x) to be 0 if x 0 and ∞ otherwise. For this, prox R,α (y) = max(y, 0). With the canonical choice d(y, Ax ) = 1 2 Ax − y 2 2 , (4) simply becomes a non negative least squares (NNLS) solution. Figures 1(f), 1(i) demonstrate that with increasing number of iterations, both limited-view artefacts and the systematic loss of contrast disappear. However, they also show that the convergence in deeper, non-central parts of the image is considerably slower and the limited-view will still manifest in blurry edges. For the sub-sampled data case shown in Figures 1(g), 1(j) we see similar effects although in addition, noise-like artefacts remain. As examined in [13], using noise-reducing, edgepreserving regularisation like the (isotropic) total variation (TV) functional R(x) = ∇x 1 can further improve such results as can be seen in Figures 1(h), 1(k). The main problem of such iterative approaches is in terms of computation times, compared to the linear backprojection by A * y which requires the solution of one wave equation, computing 20 iterations of NNLS or TV requires in total 40 additional solutions of a wave equation.

D. Model Based Learning
Regularisation functionals like TV are popular because they often allow for a mathematical analysis of the minimisers of (4) and have been designed to perfectly recover certain aspects of x, e.g., its singularities [43]. As such, they often yield spectacular results for simple numerical or experimental phantoms like the ones shown in Figure 1. In many applications however, typical images x have a more involved structure and the prior information expressed by simple regularisers like TV does not lead to optimal results. One example is given by sub-sampled PAT measurements of vessel networks [13]. If we have a set of typical PA images of vessel networks, we could try to learn more suitable prior information and how to best incorporate it in an iterative image reconstruction approach that also utilizes Fig. 2. Diagram of one convolutional neural network, denoted as G θ k , representing one iteration of the deep gradient descent. Image size for input and output is indicated in gray. The red arrows denote a convolutional layer with 5 × 5 × 5 kernels followed by a ReLU, the resulting channels in each layer are indicated in the squares. The blue arrow denotes a convolutional layer followed by a scalar multiplication. The residual update (by the skip connection) is then projected to the positive numbers by the last ReLU. measurement information over the gradient of the data fit, at every step k. Inspired by [44], [45] we take the structure of (5) as a starting point: Each iteration consists of updating x k by combining measurement information delivered through the gradient ∇d(y, Ax k ) with an image processing step. Instead of deriving the concrete form of this combination from a fixed penalty function (4), we propose to learn instead an update function for each iteration This implies that the effect of the regularising term is now learned from the data during training. The functions G θ k correspond to CNNs with different, learned parameters θ k but with the same architecture. The network structure is kept simple and should mimic a proximal gradient update (5). Due to the representation of each update by a CNN applied to the current x k and the gradient ∇d(y, Ax k ), we call the whole algorithm deep gradient descent (DGD). In contrast to [23], [26], [45] we train the DGD layer by layer (layer corresponding here to one iterate), i.e. we learn the parameters θ k for each iteration separately. In this way we can exclude the photoacoustic operator from the training procedure. This is necessary to make the training feasible. Note, that the photoacoustic operator has complexity O(N 4 log(N )) [12], for a volume of size n = N × N × N , compared to CT and MRI where A has complexity O(N 3 log(N )) for a volume of size n = N × N × N . Therefore we think that such layer by layer training scheme is the only feasible approach for iterative high-resolution 3D PAT imaging at the present stage.

III. IMPLEMENTATION
In a CNN, each layer is of the following form: Given the input g and output h with channel index sets I, J respectively, then with a componentwise nonlinear function ϕ and convolution * . The whole parameter set θ of the network is therefore given by the biases b i ∈ R and convolutional filters ω i,j ∈ R s n (with kernel size s and spatial dimension n) of each network layer.
The specific architecture we have chosen for the CNNs performing the update in equation (8) is illustrated in Figure  2. In each iteration we input x k and ∇d(y, Ax k ) to a similar pipeline, where both are spread to 16 and then 32 channels by a convolutional layer with kernel size s = 5 and dimension n = 3, equipped with a rectified linear unit as nonlinearity, that is defined as The output of both pipelines is added together and first reduced to 16 channels, equipped with a ReLU, and then to 1 channel without a nonlinearity, but a simple scalar multiplication. The result is added to the current iterate and projected to the positive numbers by a ReLU, similar to the proximal for NNLS discussed in Section II-C.
The architecture in this study is motivated by a typical network structure consisting of an analysis/encoding and a reassembling/decoding part. In this analysis part, the number of channels is increased between layers to refine the analysis of the features extracted in the layer before. In the reassembling part, these features must be merged/thresholded to produce an output image, so the number of channels is decreased. Since the main contribution of this work is not the specific neural network architecture, we use a simple architecture following this convention. In particular, the network structure is kept rather small with the motivation in mind that each G θ k primarily learns how to combine current iterate and gradient as well as a data specific filters, in contrast to a large post-processing network. Furthermore, a compact structure is necessary to minimise the needed memory on the GPU.

A. Training of the deep gradient descent
Given a training set {y i , x i true } i , we have two options to train the parameters θ k . The first is to pre-define a maximum of iterations k max and train all θ k , for k = 0, . . . , k max − 1, together to minimise the difference between x i true and the result of the last iteration x i kmax ; that is we seek to find for some suitable norm. The second approach is to train the parameters sequentially: θ 0 is trained to minimise the difference between x i true and x i 1 given data y i , for all indices i. After that, θ 1 is trained to minimise the difference between x i true and x i 2 , given the optimal x i 1 found in the training of the first CNN G θ0 . That means the minimisation of (9) is split into k max independent optimisation problems w.r.t. disjoint subsets of parameters θ k , k = 0, . . . , k max − 1, given by (10) The first approach has the advantage that the network is more flexible to achieve the best possible result after k max iterations, but during the training, the operators A and A * need to be evaluated many times, since for each training step all x k and their corresponding gradients have to be computed to evaluate (9). While the second approach is not optimal in the sense that it does not lead to minimal training error, it has two important advantages. Firstly, the computation of the gradient A * (Ax − y) and training decouple, which is important in view of the cost of application of A and A * in PAT. Secondly, it provides an upper bound on the training error (9). In fact, (10) can be viewed as a greedy approach which seeks to obtain a minimum in each layer k given x k−1 from the previous training step. We note that this property can be used to determine the number of layers k max of the DGD in training by controlling the training error from layer to layer in contrast to choosing it a priori. Therefore, the second approach could also be used as a pre-training stage to initialize the weights for the first approach.
As the computational complexity of simulating acoustic wave propagation in 3D prohibits computing the gradient during any training scheme, we need to follow the second approach here. The whole training procedure we use is summarized in Algorithm 1 for a given number of maximum iterations k max and the reference solution x true . Compute ∇d(y, Ax k ) = A * (Ax − y) 6: function TRAINITERATE(∇d(y, Ax k ), x k , x true ) 7: Train for given accuracy 8: end function(return θ k ) 9: x k+1 ← G θ k (∇d(y, Ax k ), x k ) , the learned iterative reconstruction scheme can be evaluated as follows: The new iterate x k+1 is computed by applying the network G θ k to the current iterate x k and the gradient of the data fit, in particular this means that the gradient has to be computed in every iteration. This procedure is equivalent to Algorithm 1 without calling TRAINITERATE in line 6-8.

IV. EXPERIMENTS
In this study we are interested in reconstructing human invivo data and hence we do not have a true target available for the training of measured data. This lack of a ground truth is one of the main challenges in supervised learning. Nevertheless, we chose to train the DGD with supervised learning using simulated data and hence a meaningful data set is crucial for a successful training, for that purpose we use segmented human vessel structures from CT scans as discussed in the next section. The training and evaluation of each network G θ k has been implemented with TensorFlow [46] in Python. All computations are done on a Titan Xp GPU with 12GB memory.

A. Training on segmented lung vessels
The training data needs to be as realistic as possible to provide a meaningful basis for the algorithm. To achieve this we have used the publicly available data from ELCAP Public Lung Image Database 1 . The data set consists of 50 whole-lung CT scans, from which we have segmented about 1200 volumes of vessel structures with a Frangi vesselness filter [47], [48]. The segmented volumes were of size 40 × 120 × 120, and were then scaled up by a factor of 2 to the final target size of 80×240×240. Out of these volumes we chose 1024 as ground truth x true for the training and simulated limited-view, subsampled data using the same measurement setup used in the invivo data: We assume that each voxel has the isotropic length dx = 84.75µm and that the full data is recorded at locations on a grid with grid size 2dx on one of the two 240 × 240 sized outer planes of the volume (i.e., the scanning geometry is similar to Figure 1(a)). In time, N t = 486 pressure samples are recorded with dt = 16.6ns. The full data is then sub-sampled as illustrated in Figure 1(e) but by a sub-sampling factor of 4. We have added normally distributed noise to the measured data, such that the resulting SNR was approximately 15 for all measurements and we assumed a sound speed of c 0 = 1580m/s. In a nutshell, we obtained the data y = Ax true + ε, where ε denotes the added noise.
The training set for one CNN G θ k then consists of current iterate x k , the gradient of the data fit ∇d(y, Ax k ) = A * (Ax k − y), and the ground truth x true . We initialize the iteration with Precomputing the gradient information for each CNN takes about 10 hours. We trained the CNNs using TensorFlow's implementation of Adam [49]. For the training we used batches of size 2, since this already fills up the memory (12GB) of the GPU completely. We trained each G θ k for 25000 iterations (i.e. approximately 50 epochs) with an initial step size of 5 · 10 −5 (learning rate), The minimised loss function, i.e. the norm in (10), is chosen as the 2 -distance of new iterate to the true solution x true , For training the first CNN G θ0 we added an additional constraint to avoid the local minima of zero solutions by penalizing a small norm loss add (x) = −α min( x 2 − β, 0), with small α, β > 0. The training of each CNN G θ k took about 1 day on the GPU. We have trained 5 iterates, i.e. k max = 5, for the deep gradient descent. In total the whole training took 7 days. We note, that this could be speed up by initialisation of θ k with θ k−1 or by more advanced optimisation strategies, see for instance the review [50]. At this point we would like to note, that had we included the operator A and A * in the training and trained all 5 iterates together, then the time needed for 25000 iterations would be in the order of 70 days, and used at least 5 times more memory. The result of the DGD for simulated data is shown in Figure  3 for an example that was not included in the training set.

B. Post-processing by Deep Learning
To complement this study, we have also implemented the first approach of learning in image reconstruction, see Section I-A, viz. taking an initial direct reconstruction and train a network to remove artefacts and noise. Especially popular for improving these initial reconstructions is a CNN introduced as U-Net [51]. We refer to the original paper for the architecture, but roughly summarized its strength lies in a series of skip connections in a multilevel decomposition. For our application, we have followed the modified U-Net architecture proposed by [20] for post-processing of 2D X-ray tomography, that learns to compute an update to the initial reconstruction. We made the necessary modifications for a three-dimensional setting and implemented training and evaluation with TensorFlow. To be consistent with the previous section our direct reconstruction, which we seek to improve upon, is obtained by the application of the adjoint x 0 = A * y. The modified U-Net is then trained on the set of pairs {x i 0 , x i true } i . Due to memory restrictions we were only able to train one pair at a time. The loss function is chosen as the combination loss(x) + loss add (x), see previous section. The training is then performed with Adam for 75 epochs and a learning rate of 10 −4 ; this took 3 days. The results for simulated data will be discussed in Section V-B.

C. Application to in-vivo data
We now apply our method to in-vivo data of a human palm. The details of the measurement set-up and procedure are described in [14], [52]. All other features like spatial dimensions of reconstruction volume, temporal sampling or the sub-sampling pattern are exactly the same as for the simulated data (cf. Section IV-A).
In-vivo data has different characteristics that are not perfectly represented by the training on synthetic data and hence a direct application of the trained network does not lead to TV sub-sampled, λ = 5 · 10 −5 20 Iterations Updated U-Net TV fully-sampled data, λ = 2 · 10 −4 20 Iterations satisfactory results, as illustrated by comparing it to a TV reconstruction in Figure 4. In particular, we see that the network has not learned to effectively threshold the noiselike artefacts in the low absorption regions i.e. regions with low concentration of chromophores. To train our approach to remove these features we simulated the effect of the low absorbing background as a Gaussian random field with short spatial correlation length, clipped the negative parts, scaled it to maximal value 0.1 and added it to each segmented volume x true where ever the intensity of x true did not exceed 0.1 (i.e., the maximum intensity of x true stays 1). The synthetic CT volumes with the added background were then used for the data generation, i.e. y i back = Ax i back + ε, whereas the clean volumes x true are used as reference for the training. Here ε is again chosen (see Section IV-A) such that the resulting measurement y i back had a SNR of approximately 15. We expect the network trained on the modified pairs {y i back , x i true } i to be capable of effectively removing the background.
Furthermore, since the expected contrast in the images is crucial for the trained reconstruction procedure, we scaled the measurement as follows. First we computed the standard deviation of the measurement data for all simulated targets. Then we rescaled the sub-sampled real measured data to have a similar standard deviation. This rescaled data is then used for reconstructing with the DGD. The result after 5 iterations is shown in Figure 5.
The results can be further improved performing a transfer training of the previously trained networks G θ k . This however requires some reference reconstructions from the same or a similar system. We were able to perform such a transfer training with a set of 20 (fully sampled) measurements of a human finger, wrist, and palm from the same system. We then sub-sampled the data (fourfold) to obtain the training data y real . As reference we took weakly regularised TV reconstructions from the fully sampled data, x T V . To update the DGD we have performed an additional 10 epochs of training on the pairs {y real , x T V }, with a reduced learning rate of 10 −5 . Such transfer training takes only 90 minutes to update the entire DGD. We denote the updated CNNs by Gθ k and the resulting outputs byx k . The effect of the updated DGD is shown in Figure 5.
Additionally, for a full comparison we have performed an update training of the U-Net with the same parameters as above, i.e. 10 epochs and a reduced learning rate of 10 −5 . The update training of the network took only 20 minutes and the result is shown in Figure 5.

V. DISCUSSION OF RESULTS
The results shown in Figure 3 and Figure 5 suggest that the formulation of a gradient descent scheme as a CNN in each iteration does produce competitive results with a considerable reduction in iterations needed, as we will discuss in this section. Furthermore, it is robust in the transition to real measurement data, which is one of the most important aspects in inverse problems and image reconstruction. During the reconstruction procedure, a major improvement is achieved in the first step, as shown in Figure 6. After one iteration of the DGD the background is cleared and the contrast is mostly restored, but there are still a few noisy patches around the vessels visible. The difference image also indicates that there are still parts insufficiently recovered on the outer area close to the boundary; these are typical limited view artefacts. After the 5th iteration these artefacts are considerably reduced and the error inside the domain is mostly uniform.
In the following, we discuss some particular aspects in more detail.

A. Quantitative analysis of simulated data
For a quantitative evaluation of the performance we have computed the relative 2 -error for the simulated example shown in this study, see e.g. Figure 4. More precisely the reconstruction quality is evaluated using a scaled and unbiased relative error defined by as suggested in [20]. This unbiased error is used to not disadvantage TV and NNLS reconstructions in the comparison.
While the networks know the absolute contrast from the training data, classical iterative methods often either need many iterations to recover it from the data or suffer from systematic contrast errors. Consequently, the optimal parameters for the reconstructions of DGD and U-Net are in most cases a = 1 and b = 0 and hence err reduces to the standard relative 2error. For a full comparison, we have computed the mean error for 16 test samples that were not included in the training set. We compare the two networks, U-net and DGD, with TV and NNLS reconstructions, as described in Section II, with the regularisation parameter for TV chosen such that err(x) is minimized. The resulting errors are plotted in Figure 7. After one iteration U-Net achieves clearly the best result, but already with 2 iterations DGD achieves a smaller error down to a substantially smaller error after 5 iterations. TV and NNLS converge considerably slower, but achieve the U-Net quality after 50 iterations and will likely go lower.
The computational time is dominated by the application of A and its adjoint A * . Computing either takes about 12 seconds on the Titan Xp GPU, see Table I for the complete computation times for each reconstruction approach. Note however, that as our implementations involve communication overhead between Matlab and Python, theses timings give an indication for the methods' efficiency rather than an absolute comparison. Consequently, a reduction in iterations has a considerable impact on the total computation time. In this respect, the U-Net structure is clearly the cheapest with just one application to compute x 0 = A * y. Iterative algorithms require additionally two applications for each iterate to compute the gradient ∇d(y, Ax k ) = A * (Ax k −y). Thus, having similar results after 2 iterations with DGD and 50 iterations of TV, see Figure 7, leads to a prospective speed-up by 20 (including the initial reconstruction x 0 = A * y). We note that the computation time for U-Net can be considerably reduced by using a kspace method [53] for the initial reconstruction.   (11). The parameter for TV has been chosen such that the best error is achieved for the given iterations.

B. Comparison to post-processing by Deep Learning
First using a direct reconstruction and then applying a DNN to remove artifacts is a valid approach in many applications, especially if one is interested in fast and prospectively realtime reconstructions. This approach only needs an initial direct reconstruction and one application of the trained network. Especially for full-view data, this is a promising approach, but even in our limited-view case this approach proves to be quite powerful. A comparison of DGD and U-Net for simulated data is shown in Figure 8 (top row). The resulting image is cleaned up and many vessels are properly reconstructed. Some smaller details are missing and can not be recovered from the initial reconstruction. The difference to the true target is also shown in Figure 8 (bottom row). The differences are most pronounced in the outer parts of the domain as a consequence of the limited view geometry. In comparison the reconstruction by DGD has a much smaller overall error, but this is especially true in the center of the domain. The maximal error of the Unet reconstruction is 0.6012 (on the scale of [0, 1]) and of the DGD reconstruction 0.4081. In conclusion we can say that the U-net architecture performs very well and is even capable of removing some limited-view artefacts, but is ultimately limited by the information contained in the initial reconstruction.

C. In-vivo data
Even though the results for simulated data are very impressive, applying the DGD trained on images with a clean background is not sufficient for real data as shown in Figure  4. The reason is that the algorithm interprets all structures in the data as important and enhances them equally. Adding a background to the training data set in order to teach the DGD thresholding those structures immensely improves the results and even fine details that were not visible before are now recovered after 5 iterations, as seen in Figure 5. Nevertheless, just an adjustment of the simulated data is not sufficient as can be seen from the quantitative measures in Table II, computed with respect to the reference reconstruction from fully-sampled limited-view data. Thus, further improvement can be achieved by an update of the DGD if one has a set of similar measurements from fully sampled data available. This update training has a considerable impact on the reconstruction quality as can been seen in Table II. Both learned methods show excellent reconstruction quality after transfer training and are able to successfully remove the undesired background structures. In comparison to the iterative reconstruction with TV both learned methods achieve a higher PSNR and SSIM to the reference reconstruction from fully-sampled data. Noteworthy, the lowest unbiased relative 2 -error (err), see (11), is achieved by the classical TV minimisation with an emphasis on the data term, this is likely due to the fact that the reference is a TV reconstruction from fully-sampled data.

D. Generalisation and robustness
Deep Learning approaches are especially powerful in a fixed measurement protocol and consistent targets, as illustrated for the simulated test data. The big question is how robust these networks are with respect to perturbations of measurement procedures or targets. First experiments indicate that the iterative network allows for small perturbations in the forward operator such as varying sub-sampling patterns (of same sub-sampling rate) or deviations in sound speed, as well as slightly varying noise level in the data. However, each variation will lead to slight deterioration of reconstruction quality. In contrast, the one step approach by U-Net was found much more sensitive to variations. In particular, we have found that a change in sampling pattern leads to a mean (for 16 U-Net DGD samples) deterioration in err by 0.5% for DGD and U-Net by 5% for the simulated test data. We think that this is due to the fact, that the gradient in each iteration encodes the model variations and as such small perturbations are corrected in the iterative network. If larger changes in the measurement protocol are expected, it is recommended to either retrain the network or perform an update training, as has been done for the in-vivo data. Furthermore, the iterative method seems to be more robust with respect to structural differences between the target and the training set. This is illustrated in Figure 9, where we have tested the networks trained on the clean segmented vessels on a tumor phantom [13]. With 5 iterations we achieve a similar err as TV after 20 iterations. As it can be seen, the network does reproduce vessels with similar characteristic as in the training set, this might be due to the learned prior-like filters. Whereas the U-Net reconstruction does not perform well with the new image structures.

VI. CONCLUSIONS
In limited-view, sub-sampled photoacoustic tomography it is essential to incorporate the physical model into the reconstruction procedure to reduce artefacts with an appropriate regularisation strategy. Here we considered three possible strategies: i) iterative total variation, ii) backprojection followed by a learned denoiser, iii) learned iterative reconstruction. In terms of image quality and robustness to perturbations in the model i) and iii) were superior to ii). Method ii) was fastest at the cost of inferior image quality and flexibility. Method iii) was considerably faster than i). Thus, we believe that learned iterative reconstructions are a realistic technique for 3D PAT. The choice between learned post-processing versus learned iterative reconstruction is a matter of speed versus quality.
This study is particularly focused on method iii) and we have shown that incorporating the physical model as the gradient of the data fit and learning an iterative algorithm consisting of several convolutional neural networks leads to a superior reconstruction quality with a considerable speed-up compared to classical, and well established, iterative reconstruction schemes. With minor modifications we were able to apply the learned algorithm to experimental in-vivo data of a human wrist and obtained far more detailed reconstructions from sub-sampled data than by TV minimisation of the same data.
Additionally, we have investigated method ii) that consists of post-processing a fast and basic direct reconstruction with a CNN, in particular we implemented an architecture introduced as U-Net that has been proven to work well on medical images. In our study this approach shows promise to produce a fast and good initial reconstruction, but since many features are Phantom DGD x 5 U-Net TV not present in simple direct reconstructions, for limited-view, sub-sampled data, this approach is limited by the quality of the initial reconstruction. Even though certain features can not be recovered, post-processing with Deep Learning is promising for applications where low latency is more important than a best quality reconstruction, such as navigational tasks during surgery. Furthermore, our study suggests that iterative networks are more robust with respect to changes in the measurement setup or imaged target. As inherent in all learning approaches, the limitation of the proposed method is dictated by the quality of the training data and the possibility to perform an update training. In future research we will consider combing the U-Net architecture with a model based approach. For instance by replacing the CNNs representing one iteration in our deep gradient descent with a U-Net like structure. For high resolution 3D imaging this would need computational resources exceeding a local workstation. Consequently, if the computational resources are available including the forward operator in the training will likely improve results even further.

ACKNOWLEDGMENT
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This study was supported in part by funding from the European Union's Horizon 2020 research and innovation programme H2020 ICT 2016-2017 under grant agreement No 732411 and is an initiative of the Photonics Public Private Partnership. AH acknowledges support from the Wellcome-EPSRC project NS/A000027/1, "Image-Guided Intrauterine Minimally Invasive Fetal Diagnosis and Therapy" (GiftSurg). FL acknowledges financial support from EPSRC project EP/K009745/1 "Dynamic High Resolution Photoacoustic Tomography System" and of the Netherlands Organization for Scientific Research (NWO), project nr. 613.009.106/2383. JA acknowledges Swedish Foundation of Strategic Research grants AM13-0049 and ID14-0055 as well as support from Elekta.