Fetal MRI by Robust Deep Generative Prior Reconstruction and Diffeomorphic Registration

Magnetic resonance imaging of whole fetal body and placenta is limited by different sources of motion affecting the womb. Usual scanning techniques employ single-shot multi-slice sequences where anatomical information in different slices may be subject to different deformations, contrast variations or artifacts. Volumetric reconstruction formulations have been proposed to correct for these factors, but they must accommodate a non-homogeneous and non-isotropic sampling, so regularization becomes necessary. Thus, in this paper we propose a deep generative prior for robust volumetric reconstructions integrated with a diffeomorphic volume to slice registration method. Experiments are performed to validate our contributions and compare with ifdefined tmiformat R2.5a state of the art method methods in the literature in a cohort of 72 fetal datasets in the range of 20-36 weeks gestational age. Results suggest improved image resolution Quantitative as well as radiological assessment suggest improved image quality and more accurate prediction of gestational age at scan is obtained when comparing to a state of the art reconstruction method methods. In addition, gestational age prediction results from our volumetric reconstructions compare favourably are competitive with existing brain-based approaches, with boosted accuracy when integrating information of organs other than the brain. Namely, a mean absolute error of ${0}.{618}$ weeks ( ${R}^{{2}}={0}.{958}$ ) is achieved when combining fetal brain and trunk information.

tributions and compare with methods in the literature in a cohort of 72 fetal datasets in the range of 20-36 weeks gestational age. Quantitative as well as radiological assessment suggest improved image quality and more accurate prediction of gestational age at scan is obtained when comparing to state of the art reconstruction methods. In addition, gestational age prediction results from our volumetric reconstructions are competitive with existing brain-based approaches, with boosted accuracy when integrating information of organs other than the brain. Namely, a mean absolute error of 0.618 weeks (R 2 = 0.958) is achieved when combining fetal brain and trunk information.
Index Terms-Fetal magnetic resonance imaging, slice to volume reconstruction, generative image priors, diffeomorphic image registration, gestational age prediction.

I. INTRODUCTION
M AGNETIC Resonance Imaging (MRI) is indicated when both Central Nervous System (CNS) and non CNS fetal anomalies are suspected on ultrasound [1], [2]. When compared to ultrasound, MRI is a unique instrumental technique for studying fetal development due to enlarged Field Of View (FOV), superior soft tissue contrast, and lack of shadowing. Basic examination protocols involve collecting T 2 -weighted slices along the main axes of the fetal organs. In the case of brain imaging, aspirations for more quantitative imaging motivated the development of Slice to Volume (SV) reconstruction techniques aiming to obtain a volumetric representation of the fetal brain from the set of acquired slices [3], [4]. If SV reconstruction is available, acquisitions may be redesigned for collecting the most diverse and efficient set of orientations considering both scanning limitations and reconstruction properties. Free reformatting of the imaging plane may be even more important for non-brain applications, where, due to motion, limited resolution, Signal to Noise Ratio (SNR), and scanning time, obtaining the principal axes of various target organs while planning the scans may be difficult or infeasible. However, 3D reconstruction of the whole uterus faces an increased complexity due to different non-rigid sources of motion, consequently rendering rigid motion models [3], [4], [5], [6], [7], [8] ill suited.
Despite non-rigid motion transformation models for SV registration problems have been proposed some time ago [9], This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the predominance of fetal brain applications has delayed their incorporation to whole body fetal image reconstruction algorithms. After some preliminary works applying rigid correction models outside the brain, [10] proposed a method using patchbased rigid registration to approximate non-rigid deformations. Although this approach may benefit from a flexible motion model, continuity, smoothness and invertibility of motion cannot be guaranteed as no mechanism is provided for assembling per-patch estimates. Improved performance was shown in [11] when using deformable registration based on free-form deformations on a hierarchical B-Spline grid, reconstruction based on super-resolved weighted Gaussian interpolation, bias correction, and global and local outlier rejection based on normalized cross correlation and structural similarity indexes respectively. In this case, a continuous and smooth motion model is proposed but invertibility is not guaranteed, which could compromise the physical plausibility of the estimated motion fields. In addition, the regularization scheme is based on an edge-preserving term relying on a piecewise constant image assumption, which is prone to spurious boundaries or non-natural image texture due to noise amplification effect or when this assumption is violated. As reported in [9], other explored alternatives for deformable motion models in SV registration include usage of thin plate splines and finite element meshes.
On the other hand, application of Deep Learning (DL) methodologies in the orbit of fetal SV reconstruction includes methods for rigid alignment of slices to brain templates [5], [6], rigid motion tracking [8], automatic localization of the fetal brain [7], or image quality assessment [12]. Distinctly, in this work we focus on the integration of DL architectures within the reconstruction formulation. Most efforts on DL for inverse problems have focused on learning a mapping between an approximate inverse of a fully characterized measurement operator and a Ground Truth (GT) reconstruction, which is used at test time to mitigate spurious residuals in the reconstruction, typically by unrolled schemes [13]. However, generation of GT reconstructions is problematic in our application as it would involve the acquisition of oversampled datasets in a sensible population. In addition, as the measurements are affected by motion, learning may be biased by the reconstruction method employed to build the training data.
Difficulties with the GT are also an issue for validation, differences between reconstruction methods are often subtle or difficult to summarize, and improved reconstructions may not necessarily impact a particular clinical application. For these reasons, we have combined generic measures of data quality and radiological scoring with a task-based validation strategy based on assessing the Gestational Age (GA) prediction performance. This is identified as a clinically relevant task because GA knowledge is critical for fetal development characterization from imaging and errors or uncertainties in GA predictions could be indicative of developmental abnormalities [14].
In this work we introduce the Robust Generative Diffeomorphic Slice to Volume Reconstruction (RGDSVR) method which is devised for multi-slice MRI scans with data acquired along different slice orientations and tested for a single-shot Fast Spin Echo (FSE) T 2 -weighted sequence, as typically used in fetal MRI. Our contributions include an efficient version of the Large Deformation Diffeomorphic Metric Mapping (LDDMM) [15] framework adapted to the computational requirements of joint deformable registration and reconstruction problems which guarantees smooth and invertible motion fields. In addition, we propose a robust explicit inverse formulation of the reconstruction with regularization based on an untrained generative model able to preserve both the edges of the images and structured image statistics, the so-called Deep Decoder (DD) [16]. Finally, we show that free reformatting of whole body fetal SV reconstructions can be leveraged for accurate estimates of GA at scan. The source code and exemplary data required to reproduce the main results of the paper is made available at https://github.com/ lcorgra/RGDSVR/releases/tag/1.0.0.

II. METHODS
Common structural fetal MRI protocols are based on the acquisition of a series of stacks of slices along different orientations (see Fig. 1). Thus, we start from a data array y = (y m l 1 m l 2 m l 3 l ) encompassing several stacks 1 ≤ l ≤ N l , where m l 1 m l 2 m l 3 indexes a voxel in stack l, respectively along the readout, phase encode and slice directions, and N l denotes the total number of stacks in the acquisition. Then, we formulate the reconstruction problem as the recovery of a volume x from y via (1) with f a robust loss function, g a reconstruction regularizer, and h a motion regularizer. In this formulation, we model the image formation by a measurement operator A, the fetal motion by a diffeomorphic image warping operator φ with deformation described by the parameter vector φ, and use a generative DD network operator θ with learnable parameters θ and fixed random input z for regularization.
In § II-A we describe the robust loss function f , in § II-B the measurement operator A, in § II-C the temporal structure of the acquisition, in § II-D the registration methodology to obtain φ , in § II-E the deep generative prior θ z used as a regularizer, in § II-F the fetal MRI cohort used to test our reconstructions, in § II-G the adopted reconstruction implementation, and in § II-H the GA estimation procedure.

A. Robust Reconstruction
The residuals of the reconstruction are denoted by The corresponding squared residuals map is denoted by r 2 = (r 2 m l 1 m l 2 m l 3 l ). We use a smooth version of the Welsch metric due to its strong robustness to outliers [17]: Corresponding readout, phase encoding and slice axes and planes are color coded in green, blue and yellow, respectively. b) Slices collection follows different interleaving configurations in time. c) Acquired data may include corrupted slices and appear discontinuous in the slice direction due to motion. d) Volumetric reconstructions achieving a dense representation of the fetal anatomy can be obtained by accounting for these measurement factors. e) For this sake, the algorithm alternates between robust motion compensated SV reconstruction, multi-scale registration of reconstructed volume to acquired slices for motion refinement, projection of the reconstruction solution to a natural image generative representation, and regularization of the reconstruction using this projection.
with G an in-plane Gaussian smoothing operator to weight the contribution of the observation in a voxel according to the residuals in its neighborhood, σ a robust estimation of scale, σ 2 = 2.1981 median(Gr 2 ) [17], and k a tuning constant. We can solve for x using Iteratively ReWeighted Least Squares (IRWLS) [17], [18]. Defining the encoding operator as E = A, the update at iteration i + 1, which can be computed via conjugate gradient, iŝ with W (i) a diagonal matrix of weights. The diagonal entries for the weights, w (i) , are obtained from the squared residuals and the updated scale at iteration i , r 2 (i) and σ 2 (i) as [17], [18]: To improve IRWLS convergence we use homotopy continuation [19]. The computed weights are modified for reconstruction according to: so an Ordinary Least Squares (OLS) problem is solved in the first iteration (τ = 0, W = I, the identity matrix) and the target formulation is gradually reached after N i iterations (τ = 1). In Fig. 2 we show an example including a sampled slice, residuals at first and last iterations of the reconstruction and corresponding weights.

B. Measurement Operator
The measurement operator A, is defined for each stack l as A l = D l T l , where T l is a rigid transformation accounting for the orientation of the stack l and D l is the slice sampling operator. We model blurring and discretization in D l using a Gaussian slice profile and according to the slice thickness and slice separation of stack l [20]. Modeling and estimation of image inhomogeneities had a small impact in the tested data, so it is left out of this manuscript. However, inhomogeneities may become significant at higher field strengths.

C. Temporal Structure of the Scan
Multi-slice scans are typically collected in a non-sequential manner where consecutively acquired slices are located distant Fig. 2. a) Measured data y, residuals r at b) first and c) last iterations of the reconstruction, and weights w at d) first and e) last iterations of the reconstruction. The green ellipse encloses an area around the the placenta and amniotic sac boundary with high residuals when starting the reconstruction -see b)-. These become smaller at last iteration -see c)-due to corrected motion, which increases the reliability of the data in that area, as shown by the corresponding weights. The blue ellipse encloses an area where flow artifacts cause a relative enhancement of the amniotic fluid so, driven by anomalously high residuals, weights are kept low in the last iteration -see e)-. to each other. An example of the particular slice order used in a given stack of the acquisition is shown in Fig. 3a. We define an interleave as a series of slices acquired within a single FOV sweep, so in this example we have 20 interleaves. As illustrated in Fig. 3b, we can also group our slices into socalled packages, constructed by considering the gap between the first slices in two consecutive sweeps, with 4 packages per stack in our sequence.
For motion estimation we define an operator P ( j ) s , 1 ≤ s ≤ N j s , to extract the slices associated to the motion state s. Motion states are constructed to correspond to a given stack, package or interleave depending on the multi-scale motion estimation level j ∈ {STACK, PACKAGE, INTERLEAVE} at which we are operating, with N j s the number of states at level j . We start by estimating a deformation per stack. Then, we propagate these estimates as starting transformations for estimations at the package level. Finally, we propagate the latter to estimate a deformation per-interleave, which serves to accommodate different slice deformations because subsequent slices are acquired with a substantial gap.

D. Diffeomorphic Registration
In the LDDMM framework [15] a transformation between two images ϕ 1 is given as the end point of the flow of a vector field v t (ϕ t ) = ϕ t , where ϕ 0 = id with id the identity function. The diffeomorphic registration between the source and target volumes I 0 and I 1 is posed as the variational minimization of: where σ I represents the image noise variance and norms are taken in the spaces of square integrable functions L 2 and velocity fields V . Then, by enforcing a certain smoothness on V , ϕ t is guaranteed to lie in the space of diffeomorphisms. In practice, smoothness is promoted by an operator L = (−β + id) γ with the Laplacian and β and γ controlling the level and properties of smoothness respectively.
Ensuring smooth and invertible mappings by the LDDMM framework may confront numerical difficulties [21]. For efficiency, we adopt single-step temporal integration and track the invertibility based on the sufficient Lipschitz condition [22]: with p, p a pair of coordinates and Lipschitz constant a < 1.
A risk of local non-invertibility (with maximum for p taken in a discretized neighborhood of p) is used in each step to modulate the registration gradient descent step size by max (a − max p (q( p)), 0). Hence, in our joint SV reconstruction and diffeomorphic registration formulation, we face a series of volume to slice [11] registration subproblems from source reconstructions to target measurements: s .
s is the image warping operator, implemented by linear interpolation according to the diffeomorphism induced by φ ( j ) s , which parametrizes a band-limited field [23]. L represents the discrete version of the smoothness operator L. (9) corresponds . DD architecture used for reconstruction regularization. Considering a network with D scales, scaling ratio U and channel compression ratio per-scale C, the input is a random array of size N/U D−1 × C D K, with N = (N 1 , N 2 , N 3 ) the grid dimensions of the output volume, K the number of channels at the finest scale, and · the ceiling function. The coarsest scale comprises simply a linear layer. The remaining scales are connected by sinc upsampling layers and formed by blocks of batch normalization, activation and linear layers. At the finest scale, we use two such blocks followed by sigmoid activation. Linear layers at scale ¼ ≤ d < D map C d+1 K input to C d K output features but for the additional layer at full resolution that performs the final K → 1 mapping.
to the optimization with respect to the motion parameters of (1) with τ = 0 in (6) and regularization Although the robust cost function presented in § II-A could also be used for motion estimation, we have observed that OLS are effective in escaping local optima of the motion parameters. The Hilbert gradient in the space of diffeomorphisms V [15] is obtained by:

E. DD Regularization
The DD [16] is a deep architecture designed for the efficient representation of natural images without using training data. The network is based on the concatenation of upsampling, convolution, activation and batch normalization layers where 1 × 1 convolutions (i.e., linear layers) are shown to be the most cost-effective. Its parameters are optimized by fitting to a particular cost function using fixed random inputs z, and it has shown competitive results when compared to supervised methods in applications such as image compression, denoising, inpainting, and reconstruction [24].
The proposed DD architecture is depicted in Fig. 4. We have performed a series of modifications to the original network. First, swish units are used instead of rectified linear units, according to the results in [25]. Second, batch normalization is applied before rather than after activation, following the recommendations in [26]. Third, we use sinc rather than bilinear upsampling to better preserve fine detailed structures. Of independent interest, these changes have been tested for the 2D image compression experiment in [16] showing noticeable improvements for a wide range of compression rates. Finally, we parametrize our architecture using the number of scales D, the scaling ratio U , the channel compression ratio per-scale C (defined as the ratio of input and output features of the linear layers), and the number of output channels of the first linear layer at full resolution K . Similar to [16], we empirically fix D = 5 and U = 2, and K is determined by prescribing a given minimum compression rate S for the network. However, we use C = 3 instead of C = 1, as this provides a thinner network at the finest scale, which reduces peak memory consumption, a limiting factor in 3D. Namely, this modification allows the 3D problem to be fitted in the memory of our GPU card with minimum loss of detail.
The DD output is used as the expected value of the reconstruction for the following generalized Tikhonov regularizer: For a series of fixed parameters of the network at reconstruction iteration i ,θ (i) , we can compute the DD output mapped to match the dynamic range of the reconstruction at previous iteration, and solve the reconstruction problem by extending the solution in (4) with the Tikhonov term. Then, the parameters of the network can be refined according to: This scheme penalizes the deviation of the reconstruction from the space of natural or structured images (like medical images) because the DD fits structured images faster than noise (the socalled implicit bias of the architecture) [27], and corresponds to a relaxed version of the formulation in [16] and [24]. The Tikhonov regularizer in (12) is consistent with the fitting cost in (15) and, when compared with more complicated regularizers, does not introduce significant computational overload to the reconstruction.

F. Materials
We have tested our reconstruction algorithm in a cohort of 72 fetal cases consented as participants in the iFIND project (ISRCTN16542843) [28] with GA distribution ranging from 20 + 2 to 36 + 0 weeks shown in Fig. 5a. The cohort includes controls, fetuses with suspected abnormalities, and cases with incidental findings. In Fig. 5b we show the distribution of abnormalities, including controls or cases without any detected anomaly; with renal, urinary or genital abnormalities; with gastrointestinal abnormalities including abdominal wall defects; with chest abnormalities including respiratory, cardiac and thoracic; and with multiple abnormalities including skeletal and CNS anomalies.
Images were acquired on a PHILIPS INGENIA 1.5 T with a 24-channel receive coil using a single-shot FSE sequence. For each subject N l = 5 stacks were acquired, including axial (one repeat), sagittal (two repeats) and coronal (two repeats) orientations. Data was collected with an in-plane resolution of 1.25 mm isotropic, 2.5 mm slice thickness and 1.25 mm slice separation. Sensitivity encoding acceleration was set to 2 with half scan 0.575 and echo time T E = 80 ms. Number of slices was variable for adequate coverage of the targeted FOV in the range [100, 160] with number of packages and interleaves in the order of those in Fig. 3a. Total acquisition time was on average T A = 11 20 .

G. Implementation Details
We start by performing an OLS reconstruction of the acquired stacks on a 2.5 mm grid. A Region Of Interest (ROI) containing the whole uterus is drawn on the resulting volumes by the 3D implicit model tool [29] in the SEG3D software [30]. This is used to define the FOV of the reconstruction and has no effect on the quality of the final result. Then, the pipeline proceeds as described in Algorithm 1. At each motion estimation scale j the IRWLS is reset (τ = 0) and the method alternates the estimation of the reconstructed image x, DD parameters θ , priorx, motion φ, and weights w and W. This is repeated till reaching the target formulation (τ = 1) and convergence of motion estimates as dictated by maximum update below a threshold δ φ .
The algorithm confronts different subproblems for which a series of parameters need to be selected. First, for robust while 1 do 6: x (14) ← − − W,x, A, φ ( j ) , x, y (reconstruction) 7: θ (15) ← − − z, θ , x (DD fitting) 8:x (13) ← − − z, θ (DD inference) 9: if τ = 1 and update in φ ( j ) lower than δ φ then 10: break 11: end if 12: φ (9) ← − P ( j ) , L, A, φ ( j ) , x, y (motion estimation) 13: w (5) ← − G, A, φ ( j ) , x, y (weight computation) 14: W, τ (6) ← − w, τ (weight relaxation) 15: end while 16: end for reconstruction we use a full width half maximum of 2.5 mm for G, appropriate to reduce the variance in data reliability estimation and keep good localization properties, with the simple choice k = √ 2 providing strong suppresion of outliers and acceptable SNR penalty. Second, for registration we use β = 2.5 and γ = 2.5, in agreement with values suggested in the literature [15], [23], a STACK = 0.45, a PACKAGE = 0.65 and a INTERLEAVE = 0.85, which have been observed to prevent early saturation of convergence and guarantee numerical invertibility, and σ I = 0.1, chosen by visual assessment of plausibility of deformations. Third, for regularization we use S = 2, achieving a strong suppression of spurious structures while maintaing good data fidelity, visually tuned λ = 0.4, and run 180 epochs using the Adam optimizer with learning rate 5 · 10 −3 for refining the network parameters. Finally, N i = 5 iterations are enough for improved IRWLS convergence by homotopy continuation and δ φ set to half the in-plane acquisition resolution provides full motion estimation convergence. We refer the reader to the source code for further details.

H. GA Prediction
We propose to estimate the GA from the reconstructions by the following steps: 1) Obtain a set of slices by uniformly reformatting a reconstructed ROI using N r rotations evenly distributed in the 3D rotation group. Rotations are performed around the ROI center after windowing the corresponding volumes to prevent boundary artifacts, and three centered slices are extracted along the main reoriented planes. 2) Extract a set of deep features using a given pretrained model. We have considered ShuffleNet [31], ResNet-18 [32], GoogLeNet [33], ResNet-50 [32], MobileNet-v2 [34], ResNet-101 [32], and DenseNet-201 [35], all trained in the ImageNet database [36]. Slices extracted in previous step are spatially zero-padded, replicated in the channel dimension to match the input sizes of the models, and z-score normalized. 3) Taking the deep features as predictors, use zerocorrelation constrained linear GA regression [37]. 4) At inference time, ensemble the estimates for the 3N r slices into a final GA prediction by taking their median.

III. EXPERIMENTS AND RESULTS
In § III-A we run synthetic experiments analyzing the regularization and registration schemes introduced in this paper. In § III-B we quantify the relative impact of the main constituents of our proposal by an ablation study. In § III-C we compare with existing approaches using quantitative image quality metrics, resolution analysis, and radiological rating. In § III-D we characterize the performance for variying levels of motion. Finally, potential clinical utility is studied in § III-E by presenting comparative results on GA estimation from the reconstructions.

A. Synthetic Experiments
The reconstruction with minimum data corruption as assessed by the average Normalized Cross Correlation (NCC) of adjacent input slices [11] is used as a synthetic GT x * . This NCC indicator, which has been reported to agree with human observer quality ratings, is used hereafter to rank the acquired image quality of cases in our cohort. First, we study the DD regularization scheme in § II-E using L-curve analysis [38]. We simulate the acquired data using a motionless forward model and adding Gaussian noise for a SNR of 30 dB in the acquired stacks, in gross agreement with the SNR levels observed in real images. In Fig. 6a we show an L-curve traced from the regularization x −x 2 2 and residual term r 2 2 norms for different values of the regularization parameter λ -see (14)-when using an OLS reconstruction. Reconstructions are repeated 5 times with different noise realizations and average results are reported. In Fig. 6b we show the curvature of the L-curve κ and in Fig. 6c we present the reconstruction errors with respect to the GT x − x * 2 2 , both as a function of λ. We observe that the maximum curvature λ = 0.2 matches that of minimum error. Thus, L-curve may be appropriate for predicting the regularization parameter for SV reconstruction, although increased computational burden is a limiting factor. In addition, our formulation involves dynamically modifying the system matrix and uses IRWLS instead of OLS, scenarios where the properties of the L-curve are less documented. Moreover, these differences may explain the discrepancies between predicted λ and that determined visually, λ = 0.4.
Second, we study motion estimation at different levels of motion by synthesizing random diffeomorphisms with independent and identically distributed Gaussian entries in the space of velocities V and windowed rigid motion (elliptic Tukey window, taper ratio 0.5) with random rotation axes and center of rotation in the center of the FOV. We simulate the acquired data by applyling our forward model for different slice subsampling rates corresponding to the whole stack  (no subsampling), package (4× subsampling) and interleave (20× subsampling) levels. In Fig. 7 we show the Root Mean Square (RMS) error for motion estimation using the GT reconstruction to match the simulated slices, including the mean values when warping to 5 stacks simulating the orientations in our actual data and corresponding error bars. Maximum errors were also computed with similar appearance to RMS curves. Simulations at different deformation and rotation levels, with averaged maximum motion excursions and minimum Jacobian respectively around 6 mm and 0.7 for random diffeomorphisms and 16 mm and 0.5 for rigid motion, may span some typically observed motion ranges, as larger levels start to introduce unrealistic deformations as assessed from visual checks. We see that for both motion models stable motion estimates can be obtained for motion states defined at high slice subsampling rates, as those given by the interleave pattern in Fig. 3a. However, the higher the subsampling, the smaller the accuracy, which justifies the temporal multiscale motion estimation scheme described in II-C. Comparing appearance of errors when diffeomorphism generation and estimation models are matched (Fig. 7a) and when using rigid motion for motion generation (Fig. 7b), we observe close to linear estimation error in the former and supralinear trends in the latter, especially at the interleave level.

B. Ablation Study
To assess the impact of the main components of the proposed reconstruction scheme, we compare the reconstructions using the full model (x) versus not using regularization -i.e., λ = 0 in (14)-(x λ=0 ); using a handcrafted regularizer g(x) = Hx 2 2 where H is the superposition of finite difference penalizers of different orders (0 to 32) whose weights have been visually tuned for trading off noise suppression and resolution loss (x H ); not using the robust formulation -i.e., fixing τ = 0 in (6)-(x τ =0 ); and not using motion correction -i.e., σ I → ∞ in (11)-(x σ I →∞ ). The convergence criterion has been adjusted for comparable number of iterations for all alternatives. Table I shows results of Leave One Out Cross-Validation (LOOCV), where we measure the agreement between the stack with minimum motion and the propagation to the acquisition space of the reconstructions obtained by completely excluding this stack [4]. Note this is a challenging scenario as we are left with 4 most corrupted stacks only. Results of Peak SNR (PSNR), in dB, NCC and Structural SIMilarity Index (SSIM) are shown for all 72 cases in our cohort. Results without motion correction (x σ I →∞ ) are clearly worse for all compared metrics and removing the robust formulation (x λ=0 ) also shows an impact on performance as assessed by all considered metrics. Differences are smaller for different regularization choices although we observe that the DD regularizer (x) provides improvements over the handcrafted regularizer (x H ) or not using regularization (x λ=0 ). However, improvements of the full model versus any of the alternatives are highly significant when performing two-tailed sign tests for all metrics used for comparison ( p < 1 · 10 −7 ). To aid interpretation of Table I we report in Fig. 8 visual results for the scan with median level of acquired data corruption [11]. Additional results for representative subjects with different levels of data corruption are included in a Supplemental Video showing reconstructions in 3D. We observe that reconstructions without regularization (Fig. 8b) present a noisy appearance. Noise can be mitigated either by the regularizer penalizing high-frequency content (Fig. 8c) or by our proposed DD regularizer (Fig. 8a), but resolution is better preserved when using the DD regularizer, for instance at the boundaries between the fetal body and the amniotic fluid in the area enclosed in blue. We observe that differences are mainly affecting fine details of the reconstruction, which may not be captured in the LOOCV experiment due to imperfect mapping of the reconstructions or noise and artifacts in the excluded stack. This hypothesis is supported by the L-curve experiment in § III-A, where we have observed improvements approaching 5 dB when introducing DD regularization, much higher than differences with no regularization in Table I. Ability to resolve fine detailed structures is evident when comparing to results without motion compensation (Fig. 8e). Non-robust reconstructions (Fig. 8d) look similar to robust reconstructions (Fig. 8a). However, impact of non-suppressed artifacts is noticeable locally, as in the area within the blue ellipse, where more uniform fluid background is observed when the robust formulation is adopted.
In Fig. 9a we show the Power Spectral Density (PSD) averaged along the three axes of the reconstruction grid for the aforementioned reconstruction alternatives applied to the same subject. Reconstructions without regularization (x λ=0 ) present the largest power at high spatial frequencies with disruption of power law of attenuation above approximately 1.5 mm, probably stemming from a Gibbs ringing filter applied as part of the scanner k-space reconstructions. However, we know from image inspection in Fig. 8b that a significant amount of the energy at high frequencies is contributed by noise. Noise reduction was effective when using the handcrafted regularizer, but we observe here (x H ) that this comes at the price of strong suppression of high frequency signal components. The DD-based regularized reconstructions (x) lie somewhere in between both scenarios, likely with better signal preservation at high frequencies and strong noise reduction. Small differences are observed between non-robust (x τ =0 ) and robust reconstructions, but the non-robust version presents slower PSD decay rates consistent with reduced suppression of artifactual structures. Motion compensation has an impact  in moderate to high spatial frequencies, with power enhancements versus non-compensated reconstructions (x σ I →∞ ) above 5 dB at 2 mm, for instance. Table II reports results of LOOCV for Slice to Volume Reconstruction (SVR) [4], Patch to Volume Reconstruction (PVR) [10], Super-Resolution Reconstruction (SRR) [7], Deformable Slice To Volume Reconstruction (DSVR) [11], and our RGDSVR method, again using all cases in our cohort. We observe that RGDSVR generally provides higher metrics than other approaches. Statistical significance is assessed by two-tailed sign tests which have shown highly significant differences when comparing RGDSVR with any of the alternatives for all metrics used for comparison ( p < 5 · 10 −10 ). As for the remaining approaches, results are in line with those reported in [11], with DSVR generally overtaking previous methods. Therefore, in what follows experiments are focused on RGDSVR and DSVR.

C. Comparison With Other Methods
We compare the resolution provided by RGDSVR and DSVR. The reconstruction grid orientation can be different for both methods, so we have used sinc interpolation to reformat our datasets into the grid of the reference method. In Fig. 9b we show the mean±std PSD of both methods across our whole cohort. The curves suggest better preserved moderate to high spatial frequency information when using RGDSVR, with highly significant differences as assessed by a two-tailed sign test for structures below 10 mm ( p < 10 −8 ).
A radiologist with 11 years' experience in fetal MRI was asked to select the higher quality image in each of three orthogonal pairs of matched slices extracted around the center of the trunk FOV from DSVR and RGDSVR. Preferences were averaged per-subject and this resulted in 62 cases were RGDSVR ranked better (86.11 %) and 10 cases were DSVR ranked better (13.89 %). In Fig. 10 we visually illustrate the differences between DSVR and RGDSVR by selecting areas where these can be clearly perceived. Despite reconstructions are not spatially matched due to arbitrary definitions of reference deformations, we observe that RGDSVR tends to provide sharper results for comparable levels of noise and artifacts. This is illustrated by the ellipses on top of the main vessels, the bowel, and the neck, respectively in the top, central and bottom panels.

D. Performance for Variying Levels of Motion
In Fig. 11 we show examples of input data and reconstructions for low (Fig. 11a) and high (Fig. 11b) motion cases, corresponding respectively to subjects with 25-and 75-percentile acquired data degradation. Visual results are presented for the 5 acquired stacks along planes containing the slice direction and centered on relevant information about the fetal anatomy. We observe that reconstructions (bottom rows) are able to suppress the inconsistencies in the input data (upper rows) and produce sharp results for most imaged structures.
We have computed the Pearson correlation ρ of the PSNR, NCC and SSIM metrics considered in the LOOCV experiments, assumed to be related to the reconstruction quality, against the average NCC of adjacent input slices, which is related to acquired data quality, for reconstructions without and with motion correction. We observe stronger correlation between input and output data quality without (ρ PSNR = 0.349, ρ NCC = 0.371 and ρ SSIM = 0.556) than with motion correction (ρ PSNR = 0.330, ρ NCC = 0.303 and ρ SSIM = 0.439). Thus, correlation is still important but becomes weaker when introducing motion correction, which may indicate that resulting reconstructions are less sensitive to the artifacts and inconsistencies in the original data.

E. Clinically-Oriented Application: GA Prediction
Despite potential gains of proposed reconstructions are suggested by the experiments in § III-C, lack of GT makes direct comparison of methods very challenging. In addition, it is very difficult to faithfully reproduce the different sources of corruption in real data as well as complex patterns of fetal and mother motion by simulations. Therefore, we have resorted to a task-based validation where we perform comparisons on a clinically-oriented GA prediction problem. To ensure comparisons are performed using the same FOV, we assess the performance of RGDSVR and DSVR for GA estimation using a trunk ROI corresponding to the FOV used as the target reconstruction volume in the experiments in [11]. As existing methods for fetal GA estimation from MRI [14], [39], [40] are based on brain data, we also test the GA estimation performance using a brain ROI from our reconstructions. In all cases we use the GA estimation method described in § II-H with 3D space spanned by slices from N r = 200 random volume reorientations. Finally, to investigate the added value of non-brain features, we combine brain and trunk ROIs by using N r = 100 random reorientations each. In this case, common regression and z-score normalization weights are computed at training using slices from both ROIs, and joint median ensemble of estimates is used at inference.
We perform a 6-fold cross validation using the 72 cases in our cohort. Different GA regression alternatives are compared by computing the Mean Absolute Error (MAE) and coefficient of determination R 2 with results reported in Table III. GA predictions using RGDSVR consistently outperform predictions using DSVR for comparable trunk ROI. Best results in both cases are obtained using DenseNet-201, respectively with MAE (R 2 ) 0.931 (0.918) and 1.045 (0.888). We also observe consistently better results for all models when using the brain rather than the trunk ROI, with best figures 0.683 (0.950) provided again by DenseNet-201. These results compare favourably with those reported by [39], 0.751 (0.947), and [14], 0.767 (0.920), and unfavourably with those in [40], 0.508 (0.992). Further to this, with the exception of R 2 for poorly performing MobileNet-v2 and GoogLeNet, additional improvements are consistently observed for all models when combining volumetric brain and trunk information, a unique feature of the proposed technique, with MAE (R 2 ) 0.618 (0.958) for DenseNet-201. However, GT GA is obtained from the clinical records which is not free from errors. Therefore, in Fig. 12 we provide Bland-Altman plots of agreement [41] between DenseNet-201 based predictions (GA DN ) and GT GA (GA GT ) for the considered reconstructions and ROIs combinations. Results are color coded according to the abnormality categories legend in Fig. 5b. We observed no statistical significance at p = 0.05 in GA discrepancies between both methods when comparing controls and cases with anomalies with an unequal variances t-test for any of the alternatives considered in Fig. 12.

IV. DISCUSSION
We have proposed a novel methodology for robust wholebody fetal MRI reconstruction relying on diffeomorphic motion estimation to capture plausible deformations of the fetal organs and high-quality regularization using a deep generative model. We have quantitatively and qualitatively characterized the impact of these components in the reconstructions. Quantitative comparisons with other fetal MRI reconstruction methods have shown improved performance for PSNR, NCC and SSIM metrics when using stack-wise LOOCV, which has been replicated by radiological rating. Noticeable differences in reconstruction sharpness have been demonstrated by resolution analysis and a potentially strong impact of the reconstruction method in the clinical utility of fetal MRI has been showcased by a GA prediction task. Finally, we have provided a conceptually simple GA prediction method based on free reformatting the 3D reconstructions for 2D deep feature extraction and correlation constrained linear regression, showing competitive accuracy with respect to existing approaches. Similar to [3], we have built our cost function using the robust regression via M-estimators framework, so weights for outlier mitigation are directly derived from the cost function [18], instead of using independent metrics as in [11]. In addition, the introduction of the smoothing operator G simplifies previous combinations of voxelwise and slicewise weights [3], [11] by assuming that motion-induced degradation of magnetization as well as uncorrected non-rigid motion have a regional nature.
In the non-rigid registration setting, reproducible morphometry may be compromised by geometric distortions introduced by the algorithm, which can be alleviated by the diffeomorphic constraint. Of particular interest, we should highlight the brain; despite the prominent use of rigid motion models in the past, non-rigid components may become appropriate to model nonlinearities of the scanner gradient fields [42]. Although our motion model should probably be refined for the brain, for instance via decoupled motion models [43], distortion levels in the reconstructions are small enough so as to lead to accurate GA estimates.
Taking into account the limitations for generating GT datasets, we have opted for an unsupervised application of deep architectures for regularizing our reconstructions. However, there are alternatives for integration of DL into reconstruction problems [13] and DL could also be applied for extending the capture range of registration or refining the characterization of outliers. On the other hand, the memory footprint of the 3D DD architecture is a strong limitation, so deep network regularizers based on implicit representations [44] could be considered.
3D reconstruction errors in whole-body fetal MRI may arise from multiple causes. We may encounter errors due to (a) small inaccuracies in motion estimation, (b) inconsistencies in magnetization of different slices, (c) multiple poses of the fetal body throughout the scan, and (d) the fetus moving continuously across the examination. We believe that in cases (a) and (b) information coming from complementary stacks can resolve the ambiguities in many instances, so the robust formulation with deep generative regularization generally provides satisfactory solutions. However, artifacts in the reconstructions may be strong in cases (c) and (d), as the method may struggle to find a good direction for high quality convergence.
Our reconstruction tool can be applied to different acquisition settings. Its flexibility has been confirmed by preliminary application to cardiac and animal brain studies. However, results could show variable quality. In this regard, reconstruction quality is ultimately determined by available scan time, as enlarged sampling redundancy gives more flexibility to implement robust reconstructions with improved SNR. For a fixed total acquisition time, there are different acquisition choices that may impact the reconstruction reliability. Importantly, [45] studies the comparative performance of overlapped single orientation scans versus multi-oriented scans in terms of resolution retrieval, with clear benefits observed in the latter (see also [46]). In our context, this result may suggest replacing the repeated sagittal and coronal stacks by new orientations, perhaps whilst changing the overlapping factors. However, implementation of optimal sampling schemes is often limited by hardware specifications of the scanner, inherent complexity of fetal imaging in vivo, computational requirements of reconstruction algorithms, or need of harmonization with protocols currently in place.
Our GA estimation method leverages free reformatting of volumetric reconstructions to obtain a dense set of slices covering the fetal structures in the variable spatial configurations they can adopt due to fetal motion. We have shown that deep feature extraction using pre-trained models combined with correlation constrained linear regression provides accurate results for this task. Our results look competitive to existing methods [14], [39], [40], particularly when complementing brain features with trunk information, but there are differences in the cohorts considered. Most notably, existing methods use larger cohorts, including single-sequence data from 220 [40] and 289 subjects [39] and multi-sequence data from 764 subjects [14]. Small sample sizes in our case precluded isolation of a subset of subjects for testing, a potentially important limitation when compared to [14] and [40].
In the future, we plan to follow a practical roadmap to facilitate the application of our algorithm in clinical scenarios. This may include automated ROI extraction, refining and further testing the algorithm and GA estimation using additional cohorts and acquisition protocols, and move towards a comprehensive analysis pipeline by integrating techniques for whole-body fetal segmentation and atlas construction [47].

V. CONCLUSION
We have proposed a method for robust whole-body fetal and placenta MRI based on diffeomorphic registration and deep generative regularization. Volumetric reconstructions are obtained from a set of motion-affected and possibly corrupted single-shot slices. Our proposal provides alternative solutions to existing methods for the different subproblems faced in this application. These are validated by an ablation study and improved quantitative metrics, radiological scoring and conspicuity have been observed when comparing with existing methods in a cohort of 72 fetal subjects. A GA estimation task is defined to assess the clinical utility of our technique, for which we propose a simple method leveraging the 3D information, which produced competitive results. For usual levels of motion, our reconstructions provide dense and consistent representations of the fetal anatomy. Therefore, the proposed methods may find application in 3D fetal MRI morphometry, developmental assessment, or fetal surgery planning.