3-D Reconstruction in Canonical Co-Ordinate Space From Arbitrarily Oriented 2-D Images

Limited capture range, and the requirement to provide high-quality initialization for optimization-based 2-D/3-D image registration methods, can significantly degrade the performance of 3-D image reconstruction and motion compensation pipelines. Challenging clinical imaging scenarios, which contain significant subject motion, such as fetal in-utero imaging, complicate the 3-D image and volume reconstruction process. In this paper, we present a learning-based image registration method capable of predicting 3-D rigid transformations of arbitrarily oriented 2-D image slices, with respect to a learned canonical atlas co-ordinate system. Only image slice intensity information is used to perform registration and canonical alignment; no spatial transform initialization is required. To find image transformations, we utilize a convolutional neural network architecture to learn the regression function capable of mapping 2-D image slices to a 3-D canonical atlas space. We extensively evaluate the effectiveness of our approach quantitatively on simulated magnetic resonance imaging (MRI) fetal brain imagery with synthetic motion and further demonstrate qualitative results on real fetal MRI data, where our method is integrated into a full reconstruction and motion compensation pipeline. Our learning-based registration achieves an average spatial prediction error of 7 mm on simulated data and produces qualitatively improved reconstructions for heavily moving fetuses with gestational ages of approximately 20 weeks. Our model provides a general and computationally efficient solution to the 2-D/3-D registration initialization problem and is suitable for real-time scenarios.


I. INTRODUCTION
RECONSTRUCTING a 3D volume from misaligned and motion-corrupted 2D images is a challenging task. The process often involves labour-intensive pre-processing steps, such as manual landmark matching. Such steps exhibit both inter- and intra-observer variance. Pre-processing of this nature is, however, necessary to produce acceptable input for the subsequent intensity-based pose optimization step of a volume reconstruction process. The optimization facilitates alignment and combination of intensity data from multiple image sources into a common co-ordinate system.
In this context, slice-to-volume registration methods [1]-[7] are effective in cases where a coarsely aligned initialization of the 3D volume is provided to initialize the reconstruction process. This initial 3D reference volume is used as a 2D/3D registration target to seed the iterative estimation of the slice orientations and the combination of intensity data. Reasonably good initial coarse alignment of the 2D image slices is critical to form the seed reference volume. Arbitrary subject motion can invalidate slice alignment assumptions that are based on the scanner co-ordinate system, and manual intervention may then be necessary. Manual correction often becomes unfeasible in practice due to the volume of image data involved.
Image registration is also required for applications such as atlas-based segmentation [8], tracking [9], image fusion from multiple modalities [10] and clinical analysis of images visualized in an anatomical standard co-ordinate system [11]. All of these applications suffer from poor initialization of automatic registration methods, which must be alleviated by manual pre-processing.
Two distinct registration strategies in previous work can be categorized as volume-to-slice and slice-to-volume techniques. The former aligns a volume to a given image, e.g., aligning an intra-operative C-Arm X-Ray image to a pre-operative volumetric scan, while the latter aligns multiple misaligned 2D slices into the unique co-ordinate system of a reference volume. A recent review of slice-to-volume registration techniques is given in [12].
In practice, volume-to-slice registration is often more feasible than slice-to-volume registration. Manual alignment of one or a few 3D volumes to a single 2D slice or projection is less time consuming than manual alignment of hundreds of individual slices into a common co-ordinate system. Landmark-based techniques can help to automate this process, but this approach depends heavily on detection accuracy, on the robustness of the calculated homography between locations, and on the descriptive power of the chosen landmark encoding. Moreover, 2D slices do not provide the 3D information required to establish robust landmark matching; this technique therefore cannot be used in applications such as motion compensation in fetal imaging.
Traditional intensity-based 2D to 3D slice-to-volume registration methods [3] involve solving the inverse problem of super-resolution from slice acquisitions, shown in Eq. 1:

y_i = D_i B_i S_i M_i x + n_i,   (1)

where y_i denotes the i-th low-resolution image obtained during scan time, D_i is the down-sampling matrix, B_i is the blurring matrix, S_i is the slice selection matrix, M_i is a matrix of motion parameters, x is the high-resolution 3D volume, and n_i is a noise vector. More commonly, D_i, B_i and S_i are grouped together into a single matrix W_i. Obtaining the true high-resolution volume x is ill-posed, as it requires inversion of the large, ill-defined matrix W_i.
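As a toy illustration of this acquisition model, Eq. 1 can be simulated with small dense matrices (a minimal numpy sketch; in practice the operators are large and sparse, and the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

n_hi, n_mid, n_lo = 64, 32, 16   # toy sizes: volume, selected slice, low-res slice
x = rng.random(n_hi)             # high-resolution 3D volume, flattened

# Toy stand-ins for the acquisition operators of Eq. 1.
M_i = np.eye(n_hi)                                   # rigid motion (identity: no motion)
S_i = np.eye(n_mid, n_hi)                            # slice selection
B_i = np.eye(n_mid)                                  # blurring / point-spread function
D_i = np.zeros((n_lo, n_mid))
D_i[np.arange(n_lo), np.arange(0, n_mid, 2)] = 1.0   # down-sampling by a factor of 2
n_i = 0.01 * rng.standard_normal(n_lo)               # acquisition noise

W_i = D_i @ B_i @ S_i            # the commonly grouped matrix W_i
y_i = W_i @ M_i @ x + n_i        # Eq. 1: observed low-resolution slice
```

Even in this toy setting, W_i maps 64 unknowns to 16 observations, which illustrates why recovering x from the y_i alone is ill-posed.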
The optimization methods employed in this problem domain typically do not guarantee a globally optimal registration solution from arbitrarily seeded slice alignment. The function that maps each 2D slice to its correct anatomical position in 3D space may be subject to local minima, so small initial misalignment typically improves result quality. Previous work has attempted to make this optimization robust by introducing appropriate objective functions and outlier rejection based on robust statistics [5], [13]. Despite these efforts, good reconstruction quality still depends on good initial alignment of adjacent and intersecting slices.
The robustness of (semi-)automatic 2D-3D registration methods is characterized by their capture range, which is the maximum transformation misalignment from which a method can recover good spatial alignment. When the available data is limited, as is common in the 2D-3D case, this task becomes more challenging.
We provide a solution to the capture range problem, which is applicable to many medical imaging scenarios and can be used with any registration method. We demonstrate the capabilities of the method using in-utero fetal MRI data as a working example. During gestation, high-quality and high-resolution slices can be acquired through fast scanning techniques such as Single Shot Fast Spin Echo (ssFSE) [2]. Slices can be obtained in a fraction of a second, thus freezing the in-plane motion of the subject. Random motion (e.g., a fetus that is awake and active during a scan) or oscillatory motion (e.g., maternal breathing) is likely to cause adjacent slices to become incoherent and to corrupt a 3D scan volume. State-of-the-art slice-to-volume registration (SVR) methods [5]-[7] are able to compensate for this motion and to reconstruct a consistent volume from overlapping, motion-corrupted, orthogonal stacks of 2D slices. These methods tend to fail for large initial misalignment between the 2D input slices. SVR therefore works better for neonates and older fetuses, who have less space to move, than for very young fetuses. However, early diagnostics require detailed anatomical reconstructions from an early age (younger than 25 weeks). Current volume reconstruction pipelines are often unable to fully automate the 3D fetal organ reconstruction process for young and highly mobile fetuses.
Related Work: Image registration methods that can compensate for large initial alignment offsets usually require robust automatic or manual anatomical landmark detection [14]-[16] with subsequent 3D location matching. This often relies on fiducial markers [16]-[19], involving special equipment and/or invasive procedures. Manual annotation of landmarks by a domain expert is the current clinical practice to initialize automatic image registration [16]. Fully automatic methods are difficult to integrate into the clinical workflow because of their limited reliability, complex nature, long computation times, susceptibility to error and lack of generalization.
Recent work [9], related to the proposed system, uses Convolutional Neural Networks (CNNs) to automatically estimate the spatial arrangement of landmarks in projection images. The method requires a robust landmark detection algorithm, which is domain- and scanner-specific, and detection quality degrades for motion-corrupted data. An early version of this work [20] demonstrated the potential of CNNs for tackling the volume initialization problem for slice-to-volume 3D image reconstruction. The network architecture in [20] showed promising results for initializing scan slices for fetal brain in-utero volume reconstruction, and for pose estimation of Digitally Reconstructed Radiograph (DRR) scan images. However, it provides no means of identifying incorrect predictions. Failing to account for grossly misaligned slices, which constitute outlying samples, hinders reconstruction performance and may even result in volume reconstruction failure.
We extend the previous work of [20] by rigorously evaluating several network architectures for the volume initialization problem and by introducing Monte Carlo Dropout [21] to provide a prediction confidence metric.
Contributions: In this paper, we introduce a learning-based approach that automatically learns the transformation model of arbitrarily sampled slices relative to a canonical co-ordinate system (i.e., our approach learns the function that maps each slice to a volumetric atlas). This is achieved using only the intensity information encoded in the slice, without relying on image transformations in scanner co-ordinates. Our CNN predicts 3D rigid transformations, which are elements of the Special Euclidean group SE(3), to re-orient the slices back into atlas space. This provides an accurate initialization for subsequent automatic registration refinement using intensity-based optimization methods. New statistical analysis and metrics [22], specific to Lie groups, are incorporated to give a more accurate measure of how far apart a predicted slice is from the ground truth. This is combined with traditional image similarity metrics such as Cross Correlation (CC) and the Structural Similarity Index (SSIM).
We evaluate the prediction performance of several CNN architectures on real and synthetic 2D slices that are corrupted with extreme motion for quantitative comparison. Synthetic slices, with known ground truth locations, are extracted from 3D MRI fetal brain volumes of approximately 20 weeks Gestational Age (GA). We additionally evaluate the approach qualitatively on heavily motion-corrupted fetal MRI where ground truth slice transformations and/or 3D volumes are not available. By providing 2D slices that are aligned using our model to initialize the subsequent reconstruction and motion compensation steps, we can qualitatively assess the improvement our method provides for the volume reconstruction task.
We implemented Monte Carlo dropout sampling at inference to show the model's epistemic uncertainty, and provide a prediction confidence for each slice.This can be used as a metric for outlier rejection (i.e., if the model is not confident about a prediction, then it can be discarded before subsequent use during reconstruction).
Our approach can be generalized to 3D-3D volumetric registration by predicting the transformation parameters of a few selected slices to be used as landmarks. This is demonstrated in [20] by predicting thorax phantom slices, where no organ-specific segmentation is performed. It is also applicable to projective 2D images, which is highly valuable for X-Ray/CT registration.

II. METHOD
To fully evaluate and assess the performance of 2D/3D registration via a learning-based approach, we incorporated it into a full 3D reconstruction pipeline. This features three modular components: (I) approximate organ localization, (II) canonical slice orientation estimation, and (III) intensity-based 3D volume reconstruction refinement. Organ localization (I) is concerned with localizing a Region of Interest (RoI). This can be achieved through rough manual segmentation, organ-focused scan sequences or automatic methods [23]-[25].
For 3D reconstruction refinement (III), and to compensate for the remaining small misalignments between single slices caused by prediction inaccuracies, we use a modified iterative SVR method [6] incorporating super-resolution image reconstruction techniques [6], [7], [26]. To provide a sufficiently high number of samples for SVR, multiple stacks of 2D slices need to be acquired, ideally in orthogonal orientations. We modified [6] so that, instead of generating an initial reference volume from all slices oriented in scanner co-ordinates, the acquired slice orientations are replaced with predicted canonical atlas co-ordinates, and the iterative intensity-based reconstruction process continues from there. Since (I) and (III) can be achieved using state-of-the-art techniques, we focus in the remainder of this section on canonical slice orientation estimation (II), specifically the learning and prediction of 3D rigid transforms using a variety of network architectures.
At its core, our method uses a CNN, called SVRNet, to regress and predict transformation parameters T_i, such that T_i = ψ(ω_i, Θ), where Θ are the learned network parameters and ω_i is a 2D image extracted from a series of slices acquired from a 3D volume Ω. Each Ω encloses a desired RoI, organ or particular use case scenario, such that ω_i ∈ Ω.
To train SVRNet, slices of varying orientations and their corresponding ground truth locations are required. These can be obtained from existing organ atlases or from collections of motion-free 3D volumes, e.g., pre-interventional scans or successfully (partly manually) motion-compensated subjects.
Rigid Body Transformation Parameterization: The motion of a rigid body in 3D has six Degrees of Freedom (6 DoF), namely three parameters for translation (T_x, T_y, T_z) and three for rotation (R_x, R_y, R_z). To model how each slice can move in 3D space, we divide the parameters into two categories: in-plane transformations T_x, T_y and R_z, and out-of-plane transformations T_z, R_x and R_y (see Fig. 1). If each DoF were sampled at ten intervals, this would result in 10^6 slices per fetal volume. The content of a slice (e.g., a segmented brain) is always center-aligned during organ localization, effectively eliminating the in-plane translation parameters T_x and T_y. Similar to [9], we reduce the number of slices required to create the training and validation datasets by constraining the sample space to the parameters T_z, R_x, R_y and R_z. We can further discard the portion of slices that yields little or no content at the extremities T_z of the volume.
Pre-processing and Generating Data: Pre-processing the image intensities, via standardization or Z-score normalization, is a necessary step when training neural networks. It keeps the features of an image on a common scale, with a mean of zero and a standard deviation of one, and keeps similar features consistent across different images.
During scan time, the intensity and bias ranges are set by the radiologists based on visual appeal for diagnostic purposes. However, this causes each scanned volume to be biased differently, with an intensity range that varies from scan to scan. Ideally, all volumes should be mean-centered and have a standard deviation of 1. This can, however, be problematic when dealing with a specific RoI: standardizing a RoI requires standardizing only the desired content, not the background, and would therefore require an accurate mask for RoI segmentation, created either manually or through automatic methods. Alternatively, a simpler and quicker approximation can be made by performing only intensity normalization, provided that the intensity histograms of all volumes have similar ranges. This method does not require a mask in RoI cases.
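The standardization step described above can be sketched as follows (a minimal numpy sketch; the helper name and mask convention are illustrative, not taken from the pipeline itself):

```python
import numpy as np

def standardize(volume, mask=None, eps=1e-8):
    """Z-score normalize a volume. If a RoI mask is given, compute the
    statistics from masked voxels only and leave the background untouched."""
    v = volume.astype(np.float64)
    if mask is None:
        return (v - v.mean()) / (v.std() + eps)
    roi = v[mask > 0]
    out = v.copy()
    out[mask > 0] = (roi - roi.mean()) / (roi.std() + eps)
    return out

vol = np.random.default_rng(1).random((8, 8, 8)) * 255
z = standardize(vol)                  # whole-volume standardization
```

With a mask, only the RoI content ends up with zero mean and unit standard deviation, matching the masked-standardization case discussed above.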
Every Ω used in our experiments has been intensity normalized and standardized to a cubic volume of side length L with isotropic spacing. For training and validation, ω_i is extracted from 3D motion-corrected volumes, which are of similar intensity range and bias as a result of the reconstruction process [6]. L/4 or L/2 sampling planes of size L × L are evenly distributed along the Z-axis. For testing, ω_i is a raw slice from a motion-corrupted 3D scanned image.
For a 3D volume containing an accurately delineated organ or RoI, candidate slices (whether or not they are included in the training and validation set) are thresholded by their intensity variance σ²(ω_i): slices whose variance falls below a threshold t, controlled by a factor K, are omitted. Geometrically, a larger K value restricts ω_i to the central portion of the volume. In our experiments, K ≈ 0.2, which samples the central 80% of the volume.
For volumes with an inaccurately delineated RoI, z is instead constrained by distance so that approximately the central 70% of the volume is sampled; ω_i are therefore extracted between −0.7 × L/2 ≤ z ≤ 0.7 × L/2. In such cases we choose to threshold more aggressively in order to reduce the inclusion of edge-case slices, as their content can be recovered from intersecting orthogonal slices.
To create a comprehensive training set, ω_i needs to cover a vast number of different transformation permutations in Ω. Since the transformation parameters are constrained to varying T_z, R_x, R_y and R_z, it is only necessary to rotate the entire stack of sampling planes into many orientations, where the offset along the Z-axis accounts for varying T_z.
A straightforward method is to iterate the rotation through Euler angles, R(x, y, z) with x, y, z ∈ U(−π/2, π/2). However, this may not give the best balance of training slice distribution, as seen in Fig. 2b, where each dot represents the normal vector of the stack of sampling planes from the origin. A second method is to use polar co-ordinates, P(φ, θ). Uniform sampling of polar co-ordinates causes an imbalance of samples, with a higher density near the poles, as seen in Fig. 2a. A better alternative is Fibonacci Sphere Sampling [27], which allows each point to represent approximately the same area, as seen in Fig. 2c. The sampling normals are calculated using the golden ratio, Φ = (√5 + 1)/2. The rotation of the stack of sampling planes is then the rotation required to transform a 3D vector A onto a 3D vector B, where A is the starting normal, a unit vector in z (i.e., (0, 0, 1)), and B is the target sampling normal. However, this does not account for in-plane rotation R_z, so the entire stack is further rotated around the z-axis through a uniform distribution of angles.
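Fibonacci Sphere Sampling of the normals can be sketched as below (a common formulation built on the golden ratio Φ; the exact indexing convention used in [27] may differ slightly):

```python
import numpy as np

def fibonacci_sphere(n):
    """Approximately area-uniform sampling normals on the unit sphere,
    using golden-angle increments in azimuth and even spacing in z."""
    phi = (np.sqrt(5.0) + 1.0) / 2.0        # golden ratio
    i = np.arange(n)
    z = 1.0 - 2.0 * (i + 0.5) / n           # even spacing in z, poles excluded
    theta = 2.0 * np.pi * i / phi           # golden-angle azimuth increment
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

normals = fibonacci_sphere(300)             # 300 normals, as used in Sec. III
```

Each returned row is a unit vector B onto which the stack's starting normal A = (0, 0, 1) is rotated.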
For the validation set, slices are generated with the polar co-ordinates method using random normals and random in-plane rotation angles that lie within the bounds of the training set. This simulates continuous motion, as test slices will not lie on a discrete training sampling interval, as shown in Fig. 2d.
Loss Functions and Ground Truth Labels: In the literature [28], [29], the most commonly used loss functions for regression problems are Euclidean norms, ‖x − x̂‖. However, these may not be suitable when the regression target variables lie in a non-Euclidean space, such as a manifold. In our proposed method, the slices are transformed rigidly in 3D space, so the parameterization of each slice lies within the bounds of the SE(3) Lie group. This includes both a rotation and a translation component. There are numerous ways of representing this transformation; for the rotation alone: Euler angles, quaternions, or rotation matrices.
To address the aforementioned challenge, Kendall et al. [30] proposed the loss in Eq. 2 for PoseNet, which is used to regress the 6-DoF pose of a camera:

loss = ‖x̂ − x‖_2 + β ‖q̂ − q/‖q‖‖_2,   (2)

where x̂ and q̂ are the predicted Cartesian translation and quaternion rotation parameters, whilst x and q are the respective ground truth values. The equation combines the Euclidean distance of the translational loss with a weighted Euclidean distance of the rotational loss. β is a tuning parameter that determines the relative contribution of the two losses by normalizing quantities with different scales: quaternions represent a rotation with four numbers between −1 and +1, whereas Cartesian co-ordinates span from −∞ to +∞.
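This combined loss can be sketched on raw numpy arrays as follows (the default value of `beta` is illustrative; in PoseNet it is tuned per dataset):

```python
import numpy as np

def posenet_loss(x_hat, q_hat, x, q, beta=500.0):
    """Combined translation + weighted quaternion rotation loss of Eq. 2.
    beta balances the differently scaled translation and rotation terms."""
    q_hat = q_hat / np.linalg.norm(q_hat)           # keep prediction unit-norm
    trans_loss = np.linalg.norm(x_hat - x)
    rot_loss = np.linalg.norm(q_hat - q / np.linalg.norm(q))
    return float(trans_loss + beta * rot_loss)
```

A perfect prediction yields zero loss; any translation offset contributes directly, while rotation errors are amplified by β.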
Xu et al. [31] proposed a framework based on separate loss functions, which has the advantage of alleviating over-fitting. The fully connected layer of the network is split into several branches, each terminated with a separate loss function. Instead of manually tuning the contribution of discrete components in a combined loss function, the network incorporates the tuning parameter within the connection weights. As a result, the network is able to learn from multiple representations of the ground truth labels, for instance Euler-Cartesian parameters (3 for rotation and 3 for translation) and Quaternion-Cartesian parameters (4 for rotation and 3 for translation).
We introduce a novel labelling system in which the rotation and translation components of the labels are combined. Any three points in 3D Euclidean space define a plane, and their order defines its orientation. We therefore call them Anchor Points.
These three Anchor Points can be defined anywhere on the L × L 2D slice, as long as the definition is consistent throughout all slices in the dataset. For simplicity, we define P_1 to be the bottom-left corner (−L, −L, z), P_2 the origin (0, 0, z) and P_3 the bottom-right corner (L, −L, z), where z corresponds to the slice's offset along the Z-axis (see Fig. 3). Each point is multiplied with the rotation matrix used to orient the stack of sampling planes. This moves each point, and therefore the slice, to its appropriate location in world space with respect to the volume. Consequently, an Anchor Point label consists of 9 parameters: P_1(x, y, z), P_2(x, y, z) and P_3(x, y, z). As each point is Cartesian, its loss can be calculated with the standard L2-norm loss function. Incorporating the multi-loss framework [31], the losses for P_1, P_2 and P_3 are calculated independently.
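Label generation for one slice can be sketched as follows (a minimal sketch using the corner conventions above; `R` is the rotation applied to the stack of sampling planes):

```python
import numpy as np

def anchor_point_labels(R, z, L):
    """9-parameter Anchor Point label for a slice at Z-offset z, oriented by
    the stack rotation matrix R. Corner conventions follow the text:
    P1 bottom-left, P2 origin, P3 bottom-right."""
    P1 = np.array([-L, -L, z], dtype=float)
    P2 = np.array([0.0, 0.0, z])
    P3 = np.array([L, -L, z], dtype=float)
    return np.concatenate([R @ P1, R @ P2, R @ P3])   # 9 values
```

With the identity rotation and z = 0, the label is simply the three un-rotated corner points.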
Modeling Network Uncertainties: In this work, we make use of different state-of-the-art network architectures, namely CaffeNet [32], GoogLeNet [33], Inception [34], NIN [35], ResNet [36] and VGGNet [37]. These networks are treated as black boxes, taking ω_i as input whilst computing the loss of the various labelling methods. We then evaluate the performance of each architecture on regressing the proposed Anchor Point labels in Sec. IV.
A common strategy during the training of such large, state-of-the-art networks involves the use of dropout [38]. This entails randomly muting components of the signal provided to individual neurons. The technique essentially provides a form of model averaging, and dropout constitutes a well-understood regularization technique to reduce overfitting. As a result of dropout, neurons can produce different outputs upon successive activations. During inference, dropout is usually disabled so that network consistency is not undermined. Regression networks are therefore commonly deterministic models at inference time and do not allow for the modeling of uncertainty. Implementing fully probabilistic models that account for uncertainty in both (1) the data and (2) the model parameters (Aleatoric and Epistemic uncertainty, respectively [39]) may introduce high computational cost [40].
Gal et al. [21] recently showed that dropout layers in neural networks can be interpreted as a Bayesian approximation to probabilistic models, implemented by applying dropout before every weight layer. This is shown to be mathematically equivalent, as it approximately integrates over the model's weights. The authors of [41] further show that, for the same input, performing multiple predictions at test time with dropout enabled and taking the mean of the predictions improves results for CNN-based networks. This process of performing multiple predictions from the same input using dropout layers is called Monte Carlo Dropout sampling, and it also provides a model uncertainty for the given input data.
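The sampling procedure itself is framework-agnostic and can be sketched as below; the stand-in `dummy_net` is hypothetical and merely mimics a network whose dropout layers remain active at test time. Note that for rigid-transformation outputs the naive arithmetic mean shown here should be replaced by the Riemannian mean on SE(3) described in this section.

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, x, n_samples=50):
    """Monte Carlo Dropout: run the network n_samples times with dropout kept
    active at test time. The sample mean is the prediction; the sample
    variance estimates epistemic uncertainty."""
    samples = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

# Hypothetical stand-in for a dropout network: true signal plus dropout noise.
rng = np.random.default_rng(2)
dummy_net = lambda x: x + 0.1 * rng.standard_normal(x.shape)

mean, var = mc_dropout_predict(dummy_net, np.zeros(9), n_samples=200)
```

The per-parameter variance gives the confidence measure used later for outlier rejection: slices with high predictive variance can be discarded before reconstruction.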
Using this technique, our experimental work in the following section takes Epistemic uncertainty into consideration in order to gauge alignment prediction confidence in real-world test cases. Slice alignment requires high precision, and we investigate the idea that a measure of prediction confidence is important to aid reconstruction quality.
Metrics on non-Euclidean Manifolds: Computing the mean prediction of the Monte Carlo Dropout samples requires an accurate method of averaging the network output which, in our case, is a rigid transformation. However, rigid transformations do not lie in a standard Euclidean space but on a manifold. In particular, the rigid transformations form the Special Euclidean group SE(3), a smooth manifold on which an intrinsic mean and the corresponding variance can be computed [42].
Consider N rigid transformations predicted by the network, {x_i}, i = 1...N. We can compute a Riemannian center of mass that minimizes the geodesic distances to all data points:

m = argmin_{x ∈ M} Σ_{i=1}^{N} dist(x, x_i)²,

where M is a Riemannian manifold and dist(x, y) defines a geodesic distance between two points x and y on this manifold. A Gauss-Newton iterative algorithm on rigid transformations can be used to compute such a mean, m, from the available data points x_i using a left-invariant metric [43]:

m_{t+1} = m_t ∘ exp_Id( (1/N) Σ_{i=1}^{N} log_Id(m_t^{-1} ∘ x_i) ),

where exp_Id and log_Id are the exponential and logarithmic mappings at the identity defined with the left-invariant Riemannian metric, and ∘ is the group composition operator. Once the mean is computed, the variance is straightforward to compute as σ² = E[(x − m)²], where the logarithmic operator defines x − m as log_m(x). [43] provides a detailed treatment of these notions and algorithms.
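This iteration can be sketched in a few lines, restricted here to the rotation part SO(3) for brevity (an assumption: the full SE(3) case additionally carries the translation component; `scipy` supplies the matrix exponential and logarithm):

```python
import numpy as np
from scipy.linalg import expm, logm

def rotation_karcher_mean(Rs, iters=20):
    """Gauss-Newton style iteration for the Riemannian (Karcher) mean of
    rotation matrices: average the log-maps in the tangent space at the
    current estimate m, then map the average back through exp."""
    m = Rs[0].copy()
    for _ in range(iters):
        # log of m^{-1} * R_i, averaged over all samples (np.real strips
        # numerical imaginary residue from logm)
        delta = np.mean([np.real(logm(m.T @ R)) for R in Rs], axis=0)
        m = m @ expm(delta)
    return m

def Rz(a):
    """Rotation about the z-axis by angle a (helper for the demo)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

m = rotation_karcher_mean([Rz(0.3), Rz(-0.3)])  # mean of two opposite rotations
```

For two equal and opposite rotations about the same axis, the intrinsic mean is the identity, which the iteration reaches after a single step.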
Transformation Recovery: We transform the predicted Anchor Point, Euler-Cartesian and Quaternion-Cartesian parameters to a rotation matrix and a Cartesian offset, and then transform the corresponding slice to its proper location in 3D space. However, the network may introduce a prediction error, which causes the Anchor Points not to be in a perfect triangular formation. To overcome this problem, we assume:
• P_2 defines the Cartesian offset of the slice, i.e., the centre point of the slice in world space.
• The vector joining points P_1 and P_3 aligns with the bottom edge of the slice (this defines the in-plane rotation).
• The Anchor Points together define the plane that contains the slice, which is used to calculate the rotation matrix to reorient the slice (see Fig. 3).
The Cartesian offset is P_2. For the orientation, we calculate three orthogonal vectors that define the new co-ordinate system; concatenating these vectors gives the rotation matrix, which transforms an identity plane to the newly predicted orientation. v_1 = P_3 − P_1 and v_2 = P_2 − P_1 are two intermediate vectors. v_1 cross v_2 gives the normal of the plane, n_1, which defines the new Z-axis. n_1 cross v_1 gives the new Y-axis, n_2, and v_1 directly defines the new X-axis. Finally, v_1, n_2 and n_1 are concatenated to form the rotation matrix.
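The recovery procedure can be sketched directly from these assumptions:

```python
import numpy as np

def recover_transform(P1, P2, P3):
    """Recover the slice pose from three (possibly noisy) Anchor Points:
    P2 is the Cartesian offset; P3 - P1 gives the X-axis; cross products
    give the plane normal (Z-axis) and the Y-axis."""
    v1 = P3 - P1                  # aligns with the slice's bottom edge
    v2 = P2 - P1
    n1 = np.cross(v1, v2)         # plane normal -> new Z-axis
    n2 = np.cross(n1, v1)         # new Y-axis
    R = np.stack([a / np.linalg.norm(a) for a in (v1, n2, n1)], axis=1)
    return R, P2                  # rotation matrix (columns X, Y, Z) and offset

# Unrotated slice at the origin: the recovered pose is the identity.
R, t = recover_transform(np.array([-1.0, -1.0, 0.0]),
                         np.array([0.0, 0.0, 0.0]),
                         np.array([1.0, -1.0, 0.0]))
```

Because the axes are built from cross products, the recovered frame is orthonormal even when prediction noise breaks the perfect triangular formation of the points.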
Evaluation Metrics: The first and most naïve method is to monitor the training and validation loss to ensure that the network is learning. To examine a slice in detail, we present the network with an ω_i extracted from Ω. Using the parameters obtained from the network during inference, we extract a new slice from the same Ω and compare it to ω_i. This is done via the image similarity metrics defined below.
Normalized Cross Correlation (NCC) — measures the similarity of two series (or images) as a function of the displacement of one relative to the other. This searches for features that are similar in both images by examining the pixel intensities:

NCC(f, g) = (1/N) Σ_i (f_i − µ_f)(g_i − µ_g) / (σ_f σ_g),

where N is the number of pixels in the image, and µ and σ are the mean and standard deviation of the respective images.

Peak Signal-to-Noise Ratio (PSNR) — based on the MSE metric, is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation; here, the MSE captures the residual error of the network's regression:

PSNR = 20 log_10(MAX_I) − 10 log_10(MSE),

where MAX_I is the maximum possible intensity (pixel value) of the image.

The Structural Similarity (SSIM) index — an improved metric over PSNR and MSE, uses a combination of luminance, contrast and structure to assess image quality. Metrics such as MSE can assign similar scores to degraded images with drastically different perceptual quality, which is an undesirable trait:

SSIM(f, g) = (2 µ_f µ_g + c_1)(2 σ_fg + c_2) / ((µ_f² + µ_g² + c_1)(σ_f² + σ_g² + c_2)),

where µ_f and µ_g are the average pixel intensities of images f and g respectively; σ_f² and σ_g² are the variances of f and g respectively; σ_fg is the co-variance of f and g; c_1 = (k_1 L)² and c_2 = (k_2 L)² are two variables that stabilize the division in cases of a weak denominator; L is the maximum possible intensity of the image; and k_1 = 0.01 and k_2 = 0.03 are default constants.

The geodesic distance in the prediction manifold — provides an intrinsic measure of how far the predicted rigid transformation is from the ground truth transformation. If x is a ground truth rigid transformation and y is a predicted rigid transformation, a left-invariant geodesic distance with a metric at the tangent space of x can be computed as [43]:

dist(x, y) = ‖log_x(y)‖_x,   (10)

where log uses the same notation as described in 'Modeling Network Uncertainties'.
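The image similarity metrics above can be sketched as follows (a minimal sketch; the SSIM here is the global, single-window form matching the stated formula, whereas production implementations typically use a sliding window):

```python
import numpy as np

def ncc(f, g):
    """Normalized cross correlation of two equally-sized images."""
    fz, gz = f - f.mean(), g - g.mean()
    return float((fz * gz).sum() / (np.sqrt((fz**2).sum() * (gz**2).sum()) + 1e-12))

def psnr(f, g, max_i=255.0):
    """Peak signal-to-noise ratio in dB, based on the MSE."""
    mse = float(np.mean((f - g) ** 2))
    return 20.0 * np.log10(max_i) - 10.0 * np.log10(mse + 1e-12)

def ssim(f, g, max_i=255.0, k1=0.01, k2=0.03):
    """Global (single-window) SSIM with the default stabilizing constants."""
    c1, c2 = (k1 * max_i) ** 2, (k2 * max_i) ** 2
    mu_f, mu_g = f.mean(), g.mean()
    var_f, var_g = f.var(), g.var()
    cov = ((f - mu_f) * (g - mu_g)).mean()
    return float((2 * mu_f * mu_g + c1) * (2 * cov + c2) /
                 ((mu_f**2 + mu_g**2 + c1) * (var_f + var_g + c2)))
```

An image compared with itself gives NCC and SSIM of 1 and a very large PSNR; an inverted copy gives NCC close to −1.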

III. IMPLEMENTATION
All CNNs were trained using the Caffe [32] package, on a computer equipped with an Intel i7 6700K CPU and a Nvidia Titan X Pascal GPU.
33 randomly chosen fetal volumes were used in our experiments. The volumes were intensity normalized between 0 and 255, and resampled to a dimension of 120 × 120 × 120 voxels with an isotropic spacing of 0.75 × 0.75 × 0.75 mm. Of the 33 volumes, 28 were selected for the training set, with the remaining 5 used for validation.
All slices are of size 120 × 120 (i.e., L = 120). Six separate datasets were generated to cover the combinations of slice generation methods and label representations: three were generated via Euler angle iteration, with the remaining three generated using the Fibonacci Sphere Sampling method. For each slice generation method, there is a dataset with Euler-Cartesian labels, Quaternion-Cartesian labels and Anchor Point labels.
For the Euler generation method, angles are iterated in 18° intervals from −90° to +90°. This gives 10 sampling intervals for each axis. As the fetal volumes are only roughly delineated, variance thresholding is not possible, as slices containing only maternal tissue may be included in the dataset. 40 slices are taken along the z-axis, between −40 and +40 with a 2 mm interval. This samples the middle 66% of the volume. In total, the Euler iteration method generates 2.24M images for the training set.
For the Fibonacci Sphere Sampling method, 300 sampling normals were chosen, giving approximately 8° of separation between neighbouring normals. An additional 10 images, between 0° and 180° at 18° intervals, were generated at each normal to account for in-plane rotation. Slices are again taken between −40 and +40 along the Z-axis, with a 4 mm interval. This generates in total 3.36M images for the training set.
The validation slices are generated in similar fashion, except at random intervals within the bounds of the training set via random sampling to model continuous fetal motion.
We also adapted code from [22] into Matlab and Python for the mean, variance and geodesic distance computations on SE(3) groups of rigid transformations.

IV. EXPERIMENTATION AND EVALUATION
Network Architecture Performance: The six networks described in Section II were explored to examine whether their architectures affect regression accuracy and, if so, which architecture gives the best performance-to-training-time ratio. All networks were trained using Caffe with the Adam optimizer, a maximum of 200,000 iterations, a batch size of 64, a learning rate of 0.0001, a momentum of 0.9, and a 10% drop in learning rate every 20,000 iterations. Only ResNet used a batch size of 32, as it otherwise did not fit into GPU memory.
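For reference, these hyperparameters correspond roughly to a Caffe solver configuration of the following form (a sketch; file names and snapshot settings are placeholders, and the batch size lives in the net definition):

```protobuf
net: "train_val.prototxt"   # batch size 64 (32 for ResNet) set in the net file
type: "Adam"
base_lr: 0.0001
momentum: 0.9
momentum2: 0.999
lr_policy: "step"
gamma: 0.9                  # 10% learning-rate drop...
stepsize: 20000             # ...every 20,000 iterations
max_iter: 200000
snapshot: 20000
snapshot_prefix: "snapshots/svrnet"
solver_mode: GPU
```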
Each network is trained with the dataset generated by the Euler angle iteration method with Anchor Point labels. Fig. 4 shows the training and testing loss during the training process, where each curve represents the loss of one Anchor Point. Table II shows the number of parameters in each network and the training time for 200,000 iterations.
We further analyze the performance of the network with image similarity metrics, as well as the geodesic distance from the predicted point to the ground truth location. Since the ground truth location is known, incorrectly predicted slices (e.g., a slice predicted outside the volume) are first discarded. We define such slices as those whose geodesic-metric value is more than three scaled Median Absolute Deviations (MAD) from the median.
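This outlier definition can be sketched as follows (a minimal sketch; 1.4826 is the usual consistency constant that scales the MAD to a standard deviation under normality):

```python
import numpy as np

def mad_outlier_mask(errors, n_mad=3.0):
    """Flag predictions whose geodesic error lies more than n_mad scaled
    Median Absolute Deviations from the median. Returns a boolean mask
    where True marks a slice to discard."""
    errors = np.asarray(errors, dtype=float)
    med = np.median(errors)
    mad = 1.4826 * np.median(np.abs(errors - med))  # scaled MAD
    return np.abs(errors - med) > n_mad * mad
```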
Table I shows the average error over 1000 randomly selected validation slices from each of the 5 test subjects. It can be seen that VGGNet attained the smallest geodesic distance error, as well as the best correlation and MSE. This is closely followed by GoogLeNet, which attained the best PSNR, SSIM and average Euclidean distance error (the average prediction error of a single Anchor Point relative to its ground truth location). NIN is the worst-performing network, with much greater errors. For efficiency, GoogLeNet is the ideal choice, as it attains nearly the same performance as VGGNet in half the time. If accuracy requirements are more relaxed, CaffeNet can also be used.
ResNet, Inception and NIN, however, take much longer to train and perform worse than VGGNet. Fig. 5 shows some example predictions made by the network.
Regression Labels: In this experiment, we use GoogLeNet for its efficiency. Here, we test the performance of the various datasets, in particular the different types of label parameterization. All six datasets adopt the Multi-Loss framework and, as before, the same evaluation methodology is used. 5000 slices from the validation set (1000 per fetal subject) were randomly selected and passed through the network. Table III shows the average errors for each dataset.
In every metric, Anchor Point labels achieved greater accuracy than the alternative parameterizations (Table III).

Synthetic Reconstructions: Here, we evaluate the proposed pipeline for reconstructing a 3D MRI fetal brain from synthetically motion-corrupted slices. 200 random slices are extracted from the validation dataset in order to initialize the SVR [6]. For all five fetal subjects, we calculate the PSNR between the original and reconstructed 3D volumes after four SVR iterations.

Fetal Reconstruction: We test SVRNet on a real case where both manual and automatic reconstruction methods have failed. With no ground truth target to compare to, reconstruction quality can only be validated qualitatively. Fig. 8a, b, c show the raw scan stacks and the degree of motion corruption.
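The PSNR used for this volume comparison can be computed with the standard definition (assuming an intensity peak of 255, matching the normalization used in training):

```python
import numpy as np

def psnr(vol_ref, vol_test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference volume and a
    reconstructed volume with intensities in [0, peak]."""
    mse = np.mean((vol_ref.astype(float) - vol_test.astype(float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```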
Excessive motion can cause ambiguity for previous automatic methods. Fig. 6 shows a sequential sagittal stack of slices in which the fetus has turned its head almost 90°, causing a sagittal slice to become a coronal view. This unexpected slice does not fit in the stack and is normally rejected by the robust statistics implemented in SVR. Rejecting too many slices causes a lack of scan data, while accepting too many causes a corrupted reconstruction volume, as seen in Fig. 8d.
As these slices are no longer synthetic, they may be more prone to image artifacts, speckle noise and shadows. All these factors, combined with the absence of bias correction and smoothing from the reconstruction process, may induce a greater prediction error. Test slices are intensity-scaled to [0, 255] to match the intensity range known by the network; no further preprocessing is performed.
Monte Carlo dropout sampling is used in this experiment as a measure of the network's prediction confidence. Each slice is fed through the network 100 times. The final prediction is obtained by computing the Riemannian center of mass of all 100 predicted transformations. The variance can then inform the decision of whether or not to include the slice in subsequent reconstruction methods, as a large variance suggests the network is not confident in its prediction. Fig. 7 shows the prediction confidence of each slice in the 4 stacks obtained from the scanner (1 coronal, 1 sagittal and 2 axial). As the training slices were generated from the central portion of the fetal volume, this is reflected in the prediction confidence: the network is less confident on edge-case slices than on central slices. The decision of whether to include a slice in subsequent reconstruction depends on the prediction confidence and the robustness of the reconstruction algorithm chosen for (III). Prediction confidence can be thresholded and, if the reconstruction algorithm is very robust, like [6], we can make multiple predictions per slice and let the reconstruction algorithm handle further outlier rejection, which allows for a greater margin of error (see Fig. 8).
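A hedged sketch of this aggregation step, using SciPy's chordal rotation mean as a stand-in for the Riemannian center of mass (a close approximation when the samples are concentrated; function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def mc_dropout_aggregate(pred_rotvecs, pred_translations):
    """Aggregate T stochastic forward passes (dropout active at test time).
    Rotations (as rotation vectors) are averaged with Rotation.mean; the
    spread of geodesic/translation deviations from the mean serves as a
    confidence score, with large variance signalling low confidence."""
    pred_translations = np.asarray(pred_translations, dtype=float)
    rots = Rotation.from_rotvec(np.asarray(pred_rotvecs, dtype=float))
    mean_rot = rots.mean()
    mean_t = pred_translations.mean(axis=0)
    # Geodesic deviation of each sampled rotation from the mean rotation.
    dev = np.linalg.norm((mean_rot.inv() * rots).as_rotvec(), axis=1)
    variance = (np.var(dev)
                + np.var(np.linalg.norm(pred_translations - mean_t, axis=1)))
    return mean_rot, mean_t, variance
```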

V. DISCUSSION
In this paper, we show that a learning-based approach (SVRNet) is able to greatly increase the capture range for 2D-3D image registration and can provide a robust initialization for subjects with extreme motion. The accuracy of the network predictions is influenced by efficient and novel parameterization of labels and loss functions, in particular when dealing with parameters that do not lie on a Euclidean manifold. For instance, we found that Euler- and Quaternion-Cartesian labels attain similar performance; however, with a unique parameterization combined with the introduction of Anchor Points, we were able to further increase this accuracy.
During the training of the network, we were able to train six different architectures with fixed hyperparameter configurations, each achieving satisfactory accuracy. This suggests that, for regressing transformation parameters, hyperparameter searching can be extremely time consuming and computationally expensive whilst providing little improvement in prediction accuracy. Generating synthetic slices for training is very challenging due to the extensive search through a very large (6-DoF) parameter space. We constrain the parameter space by leveraging the effects of in-plane transformations, based on the center-aligned content that results from organ localization. Training a network for different organs, use-case scenarios or modalities (e.g., MRI T1, T2, X-ray exposure, etc.) can be particularly problematic without organ atlases or existing 3D reference volumes. SVRNet requires test images to be formatted in the same way as the training data; this includes identical intensity ranges, spacing and translation-offset removal when pre-processing 3D volumes. In our experiments, we show that the training region does not need to be delineated accurately. Our method is not restricted with respect to imaging modality or scenario, as seen in [20], where the network is trained on DRR images as well as whole-thorax phantoms. This is valuable for 3D-to-2D alignment, as the whole volume can be aligned to individual 2D slices.
Another issue, which is difficult even for human experts, is determining the left-right asymmetry of a given slice without additional information. To mitigate this, oversampling, i.e., capturing many slices during scan time, allows a greater margin for mispredicted slices, and robust statistics [6] can reject slices predicted in the wrong hemisphere.
The proposed method could alternatively be formulated as a classification task, where each rigid transformation is quantized as a class. Quantizing the permutations used in our experiments would result in 40K to 50K classes. With only 28 fetal examples per class, this would lead to high class imbalance, introducing new difficulties in training. Reducing the number of classes, however, would decrease the prediction resolution.

VI. CONCLUSION
SVRNet is able to predict slice transformations relative to a canonical atlas co-ordinate system using only the intensity information in the slice. This allows motion compensation for highly motion-corrupted scans, e.g., MRI scans of very young fetuses. It also allows images acquired at any point during an examination to be incorporated; temporal proximity is not essential for good initialization of intensity-based registration methods, as is the case in state-of-the-art methods.
We have evaluated a wide range of state-of-the-art and popular network architectures to examine their prediction accuracy. We found that VGGNet, in our experiments, attained the smallest regression errors. However, GoogLeNet is more efficient to train for repeated experiments: it can achieve similar results to VGGNet in half the training time, whilst occupying 85% less memory.
Our work leverages a computational framework for statistics on the SE(3) Lie group, performing Bayesian inference via Monte Carlo dropout sampling on the rigid transformation predictions of the network. This approach can also be beneficial in other applications, such as [30], where CNNs are trained to produce output transformations that do not lie in Euclidean space. We have shown that calculating geodesic distances of rigid 3D transformations on a non-Euclidean manifold provides a means to assess the predicted transformation parameters more accurately. This paves the way for future work propagating uncertainty downstream in a pipeline that uses the network output to perform other tasks.
Fig. 1. Transformation parameters covering 6 degrees of freedom.

Fig. 2. Slice-plane normals w.r.t. the origin via different generation methods. The vector from the origin to each point passes through each slice origin. Note that in-plane rotation cannot be shown in this visualization.

Fig. 5. Top: validation slices presented to the network. Bottom: slices extracted from the respective fetal volume using parameters predicted by the network. (a) to (f) compare the ground truth slices with the predicted slices in order of increasing geodesic distance error.
In all test cases, Anchor Point labels yielded greater accuracy than Euler-Cartesian and Quaternion-Cartesian labels. A two-tailed independent t-test was conducted to examine the statistical significance of Anchor Point labels versus Euler-Cartesian and Quaternion-Cartesian labels, as shown in Table IV. As there are 5000 samples in each dataset, the degrees of freedom are regarded as infinite; the p-values for all tests are therefore vanishingly small.

Fig. 6. Sequential scan slices from a sagittal image stack of a fetus with extreme motion; the fetus has rotated its head 90°, causing slice #14 to be a coronal view.

Fig. 7. Slice prediction confidence of the 4 stacks obtained from the scanner.

Fig. 8. Reconstruction attempt of a fetal brain at approx. 20 weeks GA, from heavily motion-corrupted stacks of T2-weighted ssFSE scans. It was not possible to reconstruct this scan, even with a significant amount of manual effort by two of the most experienced data scientists in the field using state-of-the-art methods. (a), (b) and (c) are three example orthogonal input stacks, with scan-plane direction coronal, axial and sagittal respectively. The view direction is initialized to Stack0 as shown in (a). Stacks (b) and (c) are not perfectly orthogonal, as they were taken at different time points and the fetus moved. (d) Patch to Volume Registration (PVR) [7] with large patches, i.e., improved SVR, using the input stacks directly. (e) Gaussian average of all slices predicted and realigned to atlas space by SVRNet. (f) Reconstructed volume after 4 iterations of Slice to Volume Registration (SVR), initialized with slices predicted by SVRNet. The arrow points to an area where insufficient data has been acquired: the fetus moved in the scan direction, so slices necessary for an accurate reconstruction are missing in this area. (g) Training-atlas representation of the slices in (e)-(f). Note that (d) is reconstructed in patient space, whereas (e) and (f) are reconstructed in atlas space.

TABLE II. PARAMETER COUNT AND TRAINING DURATION FOR DIFFERENT NETWORKS.

TABLE III. AVERAGE ERRORS FOR DIFFERENT FETAL DATASETS.