Cross-Modality Image Registration using a Training-Time Privileged Third Modality

In this work, we consider the task of pairwise cross-modality image registration, which may benefit from exploiting additional images available only at training time from an additional modality that is different to those being registered. As an example, we focus on aligning intra-subject multiparametric Magnetic Resonance (mpMR) images, between T2-weighted (T2w) scans and diffusion-weighted scans with high b-value (DWI$_{high-b}$). For the application of localising tumours in mpMR images, diffusion scans with zero b-value (DWI$_{b=0}$) are considered easier to register to T2w due to the availability of corresponding features. We propose a learning from privileged modality algorithm, using a training-only imaging modality DWI$_{b=0}$, to support the challenging multi-modality registration problems. We present experimental results based on 369 sets of 3D multiparametric MRI images from 356 prostate cancer patients and report, with statistical significance, a lowered median target registration error of 4.34 mm, when registering the holdout DWI$_{high-b}$ and T2w image pairs, compared with that of 7.96 mm before registration. Results also show that the proposed learning-based registration networks enabled efficient registration with comparable or better accuracy, compared with a classical iterative algorithm and other tested learning-based methods with/without the additional modality. These compared algorithms also failed to produce any significantly improved alignment between DWI$_{high-b}$ and T2w in this challenging application.


I. INTRODUCTION
MULTIPARAMETRIC Magnetic Resonance (mpMR) imaging is now recommended by international guidelines for the initial detection of prostate cancer for men with suspected disease [1]-[3]. Most subtypes of prostate cancer diagnosed on mpMR manifest themselves as low signal on T2-weighted MRIs and apparent diffusion coefficient (ADC) images, and as high signal on high b-value diffusion MRI (DWI). As shown in both recent radiological and technical studies [4]-[6], mpMR images can lead to more accurate prostate cancer detection and staging, compared to using only single-modality MR imaging [7]-[11]. In particular, the T2-weighted and the diffusion-weighted scans have been recommended as two necessary modalities to include in any mpMR examination [12]. Jointly assessing mpMR scans can usually be expedited by accurate alignment [7], [13], especially as localisation of the pathological regions has become increasingly important for follow-up monitoring, diagnosis and treatment. In real-world clinical data, spatial differences exist between mpMR scans, usually caused by patient movement during image acquisition, internal organ movement and distortions due to imperfect magnetic fields during image acquisition. As these factors are difficult to decouple between different scans, scanner coordinates are often the only geometric reference after acquisition-time magnetic-field correction [14], [15]. However, registering multimodal mpMR images that are designed to provide complementary information is challenging.
For instance, echo-planar imaging using a high diffusion weighting (DWI$_{high-b}$) is considered sensitive for detecting prostate lesions in both the peripheral and central gland, but it can also spatially differ from the T2-weighted (T2w) scans, which provide not only a spatial reference for localising tumours of interest but also significant diagnostic value [16]. In many cases, it is visibly evident that registration between the two is required due to the coupled distortion and unknown patient/organ motion. The low signal-to-noise ratio (SNR) in DWI$_{high-b}$ and the lack of spatially-corresponding features between the two make this registration task challenging for both classical algorithms and recent deep-learning-based methods. Feature-based or semi-automated registration methods have been proposed for this task [17], [18]. In Section IV, we provide quantitative evidence to demonstrate the need for, and the difficulty of, direct registration between DWI$_{high-b}$ and T2w scans.
DWI scans with low b-value (DWI$_{low-b}$), on the other hand, are less frequently used directly for diagnosis due to their diminished added clinical benefit in the presence of both DWI$_{high-b}$ and T2w scans. For example, a time-critical imaging protocol for high-throughput application, e.g. [19], may suggest excluding ADC maps, which normally require DWI$_{low-b}$ to calculate. However, DWI$_{low-b}$ scans in general have higher SNR than DWI$_{high-b}$ and better tissue contrast, closer to T2w scans, as shown in Fig. 1. Meanwhile, DWI$_{low-b}$ and DWI$_{high-b}$ scans share similar distortion patterns, and we have also observed a smaller spatial difference between DWI scans with different b-values, compared to the difference between DWI and T2w scans [20]. Quantitative results supporting this observation are reported in Section IV. The smaller difference is probably because DWI$_{low-b}$ is less prone to artifacts and distortion. In this study, we use DWI$_{b=0}$ as an example of DWI$_{low-b}$ to facilitate the registration between T2w and DWI$_{high-b}$ images. Although DWI$_{b=0}$ may be omitted in some simplified imaging protocols, it can still be acquired readily for study purposes, such as neural network training. In addition, DWIs with b-values within a range of 0-100 sec/mm$^2$ can be used as alternatives to DWI$_{b=0}$ images, as suggested by [12]. In fact, in many existing mpMR imaging protocols for prostate cancer, DWI$_{b=0}$ data have been available for model training purposes. In this study, the DWI$_{b=0}$ data are used as the privileged information for training registration models and are not required at the inference stage. The possibility and the potential of using other diffusion scans with low b-values may also be interesting in different clinical contexts, but will not be discussed further in this work.
This work has thus been motivated by a) the above-described clinical scenarios that can take advantage of bi-parametric imaging with DWI$_{high-b}$ and T2w; and b) the hypothesised benefits of using DWI$_{low-b}$ in aiding the cross-modality registration. We investigate deep learning algorithms that incorporate the DWI$_{b=0}$ scans in training registration networks that, once trained, take only DWI$_{high-b}$ and T2w images as network input to register them, a case of learning using privileged information [21], [22]. In Section II, we describe a training strategy to facilitate the use of such a privileged third modality; we then compare its performance to alternative learning-based and non-learning methods. In Section IV, we report experimental results using independent landmarks identified on holdout image pairs of registration interest in this work, i.e. DWI$_{high-b}$ and T2w scans.
The aim of this work is to develop new learning methodologies and test their feasibility in improving the registration performance by incorporating an extra imaging modality only in training. Learning-based registration methods have been proposed [23]-[33], in particular taking advantage of highly efficient deep registration networks during inference, with or without graphics processing units (GPUs). Learning-based registration, being formulated as a machine learning task, can readily accommodate other observed latent variables to model additional information, such as the privileged third modality that is of interest in this study. The work aims to show quantitative registration results on real clinical data and to highlight the proposed methods utilising privileged images in this prostate cancer imaging application. However, we also envisage that this type of algorithm may be of wider applicability to other medical image registration problems. Examples include longitudinal image registration, when training data are available at more time points from retrospective subjects than those that need registration, or an interventional image registration task with a missing reference image that is easier to register due to a larger field-of-view or better image quality. The experiments presented in this work focus on unsupervised registration [34]-[38], due to the challenges in identifying a substantial number of corresponding region-of-interest (ROI) labels for weak supervision [23]-[25]. Other approaches using deep features for multi-modal registration [39] also require anatomical annotations to learn registration-useful representations. When such labels are available, they may further aid cross-modality registration with the additional modality, but this is considered outside the scope of this work.
We summarise the contributions in this work: 1) we propose to use additional images only in training to assist a challenging mpMR image registration task; 2) we propose and compare registration network training strategies using the privileged images; 3) we present experimental results using clinical imaging data from 356 prostate cancer patients; and 4) we provide quantitative results comparing the proposed methods with other learning-based registration methods with and without using the privileged third modality, in addition to a comparison to a non-learning algorithm, and report improved or non-inferior registration performance from the proposed registration algorithm.

II. METHODS
In this section, we describe a strategy to train a registration network $f_\theta^{M\rightarrow F}(X^M, X^F)$, with network parameters $\theta$, that takes a moving and fixed image pair $(X^M, X^F)$ as input, given a set of training image trios $\{(x_n^M, x_n^F, x_n^P), n = 1, 2, ..., N\}$, where $N$ is the total number of MR studies. $x_n^M$, $x_n^F$ and $x_n^P$ are the moving, fixed and privileged images available during training, respectively. The registration network $f_\theta$ takes only two images as input and predicts the transformation, e.g. $\mu_n^{M\leftarrow F} = f_\theta^{M\rightarrow F}(x_n^M, x_n^F)$. Here $\mu_n^{M\leftarrow F}$ is a dense displacement field (DDF) that can be used to obtain the warped moving image $x_n^M \circ \mu_n^{M\leftarrow F}$, with $\circ$ representing the resampling operation.
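The resampling operation can be illustrated with a minimal NumPy sketch. For brevity it works in 2D with bilinear interpolation and zero padding, whereas the experiments in this work use 3D volumes; the function name `warp_image` and the pixel-unit displacement convention are assumptions made for illustration, not the implementation used in this work.

```python
import numpy as np

def warp_image(moving, ddf):
    """Resample a 2D `moving` image on the fixed grid displaced by `ddf`.

    ddf has shape (2, H, W): per-pixel (row, column) displacements in pixels.
    Bilinear interpolation is used; samples outside the image read as zero.
    """
    h, w = moving.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = rows + ddf[0]
    c = cols + ddf[1]
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    fr, fc = r - r0, c - c0          # fractional parts used as weights
    padded = np.pad(moving, 1)       # one-pixel zero border

    def sample(ri, ci):
        # Shift by one for the border; out-of-range indices land on zeros.
        ri = np.clip(ri + 1, 0, h + 1)
        ci = np.clip(ci + 1, 0, w + 1)
        return padded[ri, ci]

    return ((1 - fr) * (1 - fc) * sample(r0, c0)
            + (1 - fr) * fc * sample(r0, c0 + 1)
            + fr * (1 - fc) * sample(r0 + 1, c0)
            + fr * fc * sample(r0 + 1, c0 + 1))

image = np.arange(12, dtype=float).reshape(3, 4)
shift = np.zeros((2, 3, 4))
shift[1] = 1.0                       # read one pixel to the right everywhere
warped = warp_image(image, shift)
```

In this toy case each output pixel takes the value of its right-hand neighbour in the moving image, and the last column falls outside the image and is zero-filled.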

A. Learning from privileged supervision
First, we describe a formulation that enables training registration networks using a third modality that is not required during inference: the network does not take the third image modality $x_n^P$ as input, which is instead considered as a special type of supervision. This is conceptually similar to weak supervision [23], where segmentation labels of regions of interest have been proposed for weakly supervising registration networks. As illustrated in Fig. 2, the proposed registration network $f_\theta^{M\rightarrow F}$ accepts the same input image pairs, $x_n^M$ and $x_n^F$, but is trained by maximising the image similarity between the warped privileged images $x_n^P \circ \mu_n^{M\leftarrow F}$ and the fixed images $x_n^F$, as opposed to the similarity measure used in an unsupervised approach between the warped moving images $x_n^M \circ \mu_n^{M\leftarrow F}$ and the fixed images. This formulation can also be considered as using the privileged-image-generated DDFs $\hat{\mu}_n^{P\leftarrow F}$ as noisy labels for $\mu_n^{M\leftarrow F}$. It is important to highlight that the necessary condition for an unbiased estimate of $\mu^{M\leftarrow F}$ is agreement in expectation over the image pairs, rather than the stringent sufficient condition $\mu_n^{M\leftarrow F} = \hat{\mu}_n^{P\leftarrow F}, \forall n$, since the method aims to provide a good estimate of the expected (average) deformation among all image pairs, rather than a precise estimate for individual training image pairs.

B. Monte-Carlo resampling for bias reduction
Second, we develop a simple yet effective numerical resampling procedure to maximise the benefits of the privileged images during registration network training described in Section II-A.
To reduce the bias, it is sufficient to spatially align the moving and the privileged images, $x_n^M$ and $x_n^P$, which results in algorithms that are similar to the joint training (Section II-D.1). As argued earlier, aligning $x_n^M$ and $x_n^P$ is itself a multimodal image registration that can be challenging or unreliable in practice. We propose a simple Monte Carlo update step [40] to reduce the upper bound of this bias, using affine-transformed privileged images $\hat{x}_n^P$ before they are warped by the network-generated DDF. Each $\hat{x}_n^P$ is chosen from the privileged image $x_n^P$ transformed by a set of $I = 5$ randomly generated affine transformations, as the candidate closest to the moving image $x_n^M$. A proof of this bias upper bound is provided in the following section. The "standard" unsupervised loss is used here between the warped privileged images and the fixed images, with a weighted deformation regularisation term $C$.
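The selection step of this Monte Carlo procedure can be sketched as follows. This is only an illustration under stated assumptions: integer shifts stand in for the random affine transformations, the mean squared difference stands in for the MI-based measure, and the names `select_surrogate` and `mean_squared_difference` are hypothetical.

```python
import numpy as np

def select_surrogate(privileged, moving, transforms, dissimilarity):
    """Return the transformed privileged image closest to `moving`, and its index."""
    candidates = [transform(privileged) for transform in transforms]
    scores = [dissimilarity(candidate, moving) for candidate in candidates]
    best = int(np.argmin(scores))
    return candidates[best], best

def mean_squared_difference(a, b):
    # Stand-in dissimilarity (assumption); the paper adopts an MI-based measure.
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
privileged = rng.random((16, 16))
moving = np.roll(privileged, shift=2, axis=1)   # toy moving image: a shifted copy

# I = 5 candidate transformations; shifts stand in for random affine transforms.
shifts = [-3, -1, 0, 2, 4]
transforms = [lambda x, s=s: np.roll(x, shift=s, axis=1) for s in shifts]

surrogate, best = select_surrogate(privileged, moving, transforms,
                                   mean_squared_difference)
```

The selected candidate then replaces the privileged image in the similarity term for that training step, so the network never needs a separately optimised alignment between the moving and privileged images.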

C. Effectiveness of surrogate supervision
Here, we provide an analysis to show that the Monte-Carlo procedure, described in Section II-B, is effective in reducing the bias between the warped privileged image and the ground-truth image without access to the ground truth.
Denote a set of training image trios $(x^M, x^F, x^P) \in \mathcal{X}^3$, representing the moving, fixed and privileged images, respectively, where $\mathcal{X}$ is the vector space of images. Given a dense displacement field $\mu^{M\leftarrow F}$, denote $T: \mathcal{X} \rightarrow \mathcal{X}$ as the corresponding warping, $T(x) = x \circ \mu^{M\leftarrow F}$. The registration task is thereby to minimize the difference between the warped moving image and the fixed image: $J = d(T(x^M), x^F)$, where $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ is a distance on $\mathcal{X}$. The proposed method uses an affine-transformed privileged image $\hat{x}^P$ as the surrogate of $x^M$, therefore it minimizes a different objective: $J_{surrogate} = d(T(\hat{x}^P), x^F)$. Using the triangle inequality [41], the difference between the two objectives has the following upper bound: $|J - J_{surrogate}| \leq d(T(x^M), T(\hat{x}^P))$. If $T$ is Lipschitz continuous, then there exists a constant $K$ such that $d(T(x^M), T(\hat{x}^P)) \leq K \, d(x^M, \hat{x}^P)$. Thus, the surrogate objective $J_{surrogate}$ approximates the target objective $J$ when $d(\hat{x}^P, x^M)$ is minimised. This justifies the proposed update step, in which multiple random affine transformations are applied to the privileged image and the one closest to the moving image is then selected. However, in this application, the adopted mutual-information-based dissimilarity is not strictly a metric on $\mathcal{X}$. Complications of the use of MI may warrant further investigation, but in practice the above-described Monte-Carlo procedure almost always found a resampled image that lowered the MI-based dissimilarity to the moving image with as few as 5-10 samples.
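The bound can be checked numerically in a simplified setting. The sketch below assumes a Euclidean distance for d and a toy linear map standing in for the warping T, so that the spectral norm of the matrix is a valid Lipschitz constant K; neither choice is the MI-based measure or the DDF resampling used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
A = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # toy linear "warp": T(x) = A @ x
K = np.linalg.norm(A, 2)                        # spectral norm, a Lipschitz constant of T

def d(a, b):
    """Euclidean distance, used here as a true metric on the image space."""
    return float(np.linalg.norm(a - b))

def bound_terms(x_m, x_f, x_p_hat):
    """Return (|J - J_surrogate|, K * d(x_m, x_p_hat))."""
    j = d(A @ x_m, x_f)
    j_surrogate = d(A @ x_p_hat, x_f)
    return abs(j - j_surrogate), K * d(x_m, x_p_hat)

x_m, x_f = rng.normal(size=dim), rng.normal(size=dim)
gaps = []
for noise in (1.0, 0.1, 0.01):
    # A surrogate that gets progressively closer to the moving image.
    x_p_hat = x_m + noise * rng.normal(size=dim)
    gaps.append(bound_terms(x_m, x_f, x_p_hat))
```

Each pair in `gaps` holds the objective difference and its upper bound; the bound tightens as the surrogate approaches the moving image, which is the behaviour the Monte-Carlo selection relies on.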

D. Alternative methods for utilising the third modality
Last but not least, it is important to test other, arguably simpler, approaches for training registration networks that can utilise the privileged information from the latent third modality. We describe two such alternatives below, in which the third images are used only in training and are not required during inference. To train the latter two networks, an image dissimilarity loss, such as mutual information (MI), can be used between $(x_n^P, x_n^M \circ \mu_n^{M\leftarrow P})$ and between $(x_n^F, x_n^P \circ \mu_n^{P\leftarrow F})$.
With feature-rich moving images, a variant of the joint training can be implemented by maximising the image similarity between the two transformed moving images, without explicitly minimising the loss on the DDF difference, as the alignment between the two transformed moving images can be effectively measured by MSD. Results presented in this work are based on the joint training loss in its general form (Eq. 1), in which the image similarity and a deformation regularisation term $C(\mu_n^{A\leftarrow B})$ are weighted by $\alpha$ and $\beta$, respectively, with shared values between the terms in Eq. 1. The $L^2$-norm of the DDF gradient is used as the regularisation term in this work.
2) Mixed sampling: Rather than using $x_n^P$ as an intermediate imaging modality as in Section II-D.1, we consider learning a shared registration network to predict the DDFs from both pairs of images, $(x_n^M, x_n^F)$ and $(x_n^P, x_n^F)$. An unsupervised registration network can be trained by sampling moving and fixed image pairs from the mixed set of these two types of pairs. In the loss function, hyper-parameters $\alpha$ and $\beta$ specify the weights on the intensity dissimilarity and deformation regularisation, respectively. This unsupervised approach utilises $x_n^P$ during training, but still uses an image dissimilarity measure between transformed moving images and the fixed images. The lack of a reliable and robust similarity measure between the two has not been addressed directly. Methods such as domain adaptation and semi-supervised learning may make use of the similarity between modalities $x_n^P$, $x_n^M$ and $x_n^F$, such that the registration network predicts reasonable DDFs without a robust measure between $x_n^M$ and $x_n^F$. These remain interesting future research, although it might also be further complicated by the distribution shift between training sets. Nevertheless, the described mixed sampling presents a reference performance from a single registration network, with quantitative results reported in Section IV.
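A regulariser based on the L2-norm of the DDF gradient can be sketched as below. The finite-difference scheme, the averaging over voxels and components, and the function name `ddf_gradient_l2` are assumptions for illustration; the exact normalisation used in this work is not specified here.

```python
import numpy as np

def ddf_gradient_l2(ddf):
    """Mean squared spatial gradient of a DDF with shape (3, D, H, W).

    Each displacement component is differentiated along every spatial axis
    with finite differences, and the squared values are averaged.
    """
    total = 0.0
    for component in ddf:
        for grad in np.gradient(component):
            total += float(np.mean(grad ** 2))
    return total / ddf.shape[0]

def weighted_loss(dissimilarity, ddf, alpha, beta):
    """Combine an image dissimilarity value with the weighted regulariser."""
    return alpha * dissimilarity + beta * ddf_gradient_l2(ddf)

constant_ddf = np.full((3, 4, 4, 4), 2.5)        # rigid shift: no penalty
ramp_ddf = np.zeros((3, 4, 4, 4))
ramp_ddf[0] = 2.0 * np.arange(4.0)[:, None, None]  # slope of 2 along the first axis
```

A constant displacement is not penalised, while stretching along one axis is, which is why this kind of term discourages non-smooth deformation without penalising global translation.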

E. Evaluation
All the registration networks described in Section II aim to register the moving and fixed images, without using the privileged images at test time. Anatomical and pathological landmarks, including patient-specific tumors, urethra, prostate glands and zonal structures, were manually identified and labelled volumetrically as binary masks. The root-mean-square distance was computed as the target registration error (TRE) between the centers of mass of the corresponding landmarks, independently defined on the fixed and network-warped moving images on the holdout data set.
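A landmark-based TRE of this kind can be sketched as follows, assuming each landmark is a binary mask on a grid with known voxel spacing and that the root-mean-square is taken over the landmarks supplied. The function names are hypothetical.

```python
import numpy as np

def centre_of_mass(mask, spacing=(1.0, 1.0, 1.0)):
    """Centre of mass of a binary mask, in millimetres."""
    coords = np.argwhere(mask > 0)
    return coords.mean(axis=0) * np.asarray(spacing)

def target_registration_error(fixed_masks, warped_masks, spacing=(1.0, 1.0, 1.0)):
    """Root-mean-square distance between corresponding landmark centres."""
    squared = [
        np.sum((centre_of_mass(f, spacing) - centre_of_mass(w, spacing)) ** 2)
        for f, w in zip(fixed_masks, warped_masks)
    ]
    return float(np.sqrt(np.mean(squared)))

def single_voxel_mask(index, shape=(8, 8, 8)):
    mask = np.zeros(shape, dtype=np.uint8)
    mask[index] = 1
    return mask

fixed = [single_voxel_mask((2, 2, 2)), single_voxel_mask((4, 4, 4))]
warped = [single_voxel_mask((2, 5, 6)), single_voxel_mask((4, 4, 4))]
tre = target_registration_error(fixed, warped)
```

Here the first landmark is displaced by 5 voxels and the second is perfectly aligned, so the value reflects both.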
Experiment details are described in Section III, in which the intra-subject DWI$_{high-b}$, T2w and DWI$_{b=0}$ are used as the moving, fixed and privileged images, respectively. When appropriate, MI is also reported, which may be less relevant to the quality of registration than the TREs on independent landmarks, but provides a quantitative measure of how the optimisation during training and the generalisation during inference perform. When comparisons are made, p-values are reported from paired two-sided t-tests at a significance level of α = 0.05.

A. Data and preprocessing
A total of 369 mpMR image studies were acquired from 356 prostate cancer patients at University College London Hospitals. One or two studies of mpMR images were available for each patient. The mpMRIs were acquired on 1.5T SIEMENS MR scanners, with original voxel resolutions of 0.625 × 0.625 × 1.0 mm$^3$ and 1.0 × 1.0 × 5.0 mm$^3$ for the T2w and DWIs, respectively. All the image volumes were resampled to a voxel dimension of 1.0 × 1.0 × 1.0 mm$^3$ and centre-cropped to a volume of 104 × 104 × 92 voxels, with the intensity normalised to a range of [0, 1]. In order to validate the registration performance, 35 pairs of mpMRIs from 35 patients with obviously large initial misalignment were selected as the holdout set. The rest of the data set was split into 302 and 32 MRI studies, from 289 and 32 patients, for training and validation sets, respectively. Up to three pairs of landmarks were identified for each study and a total of 50 pairs of landmarks were labelled for the holdout set. The annotation of the landmarks was performed by two biomedical imaging researchers, who have completed a BAUS-accredited MRI course on prostate cancer. The landmarks were labelled by one observer before being checked by the other. To investigate the intra-observer variance, the holdout test set was annotated again, two months after, and blind to, the first annotation. An intra-observer landmark localization error of 1.08±0.54 mm was obtained.
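The cropping and intensity normalisation steps can be sketched as below, assuming the volume has already been resampled to 1 mm isotropic voxels and is at least as large as the crop along every axis. The axis order of the 104 × 104 × 92 crop and the min-max normalisation are assumptions; the function names are hypothetical.

```python
import numpy as np

def centre_crop(volume, size=(104, 104, 92)):
    """Crop the central `size` region; assumes the volume is at least that large."""
    starts = [(dim - s) // 2 for dim, s in zip(volume.shape, size)]
    slices = tuple(slice(start, start + s) for start, s in zip(starts, size))
    return volume[slices]

def normalise_intensity(volume):
    """Linearly rescale intensities to the range [0, 1]."""
    low, high = float(volume.min()), float(volume.max())
    if high == low:
        return np.zeros_like(volume, dtype=float)
    return (volume - low) / (high - low)

rng = np.random.default_rng(0)
resampled = 50.0 * rng.random((120, 110, 100))   # stand-in for a 1 mm volume
cropped = centre_crop(resampled)
preprocessed = normalise_intensity(cropped)
```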
Two additional data sets were used for external validation. Data Set A was acquired from a different hospital, with an approved Institutional Review Board protocol designed at the University College London Hospital (UCLH). The original voxel resolutions were 0.625 × 0.625 × 1.0 mm$^3$ and 2.0 × 2.0 × 5.0 mm$^3$ for the T2w and diffusion-weighted images, respectively. Data Set B was obtained from the Cancer Imaging Archive [42], with original voxel resolutions of 0.27 × 0.27 × 3.0 mm$^3$ and 0.7 × 0.7 × 4.0 mm$^3$ for the T2w and DWIs, respectively. The mpMRIs in Data Set B were acquired on a 3T GE MR scanner, with an endorectal coil. In this public data set, we only have access to the DWI$_{high-b}$ with b=1400 sec/mm$^2$. The same image preprocessing and landmark annotation were used as on the UCLH data set. A total of 30 patients with 42 pairs of landmarks and 20 patients with 21 pairs of landmarks were used in Data Sets A and B, respectively, for assessing the registration performance on external data sets.
MI was adopted as the similarity measure, as suggested by a previous study [43]. MI was also used as the validation metric for the hyperparameter search, which set the weightings of the loss terms $\alpha$ and $\beta$ to 0.5 and $1 \times 10^{3}$, respectively. Further fine-tuning of these hyper-parameters by, for example, systematic or automated hyperparameter search may be beneficial and is a subject of future studies.
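For reference, a simple histogram-based MI estimate between two images can be sketched as follows. This is an evaluation-style estimate with an assumed bin count; a training loss would need a differentiable formulation, which is not shown here, and the function name is hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram estimate of the mutual information (in nats) between two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
image = rng.random((64, 64))
unrelated = rng.random((64, 64))
inverted = 1.0 - image                    # different contrast, same structure

mi_self = mutual_information(image, image)
mi_inverted = mutual_information(image, inverted)
mi_unrelated = mutual_information(image, unrelated)
```

The inverted image scores far higher than the unrelated one despite having opposite contrast, which illustrates why MI is a common choice for comparing images across modalities.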

B. Network training
An encoder-decoder registration network [23] was used for DDF prediction in all the models in this work. Random affine transformations were added to the input of the network, both for data augmentation and for the Monte-Carlo resampling. The method of the random affine transformation is adapted from the open-source code DeepReg [44]: transformations are generated by randomly resampling the image corners from a uniform distribution, in order to keep minimal sampling outside the original image. The image warping is implemented using a standard grid sampling method with trilinear interpolation and zero-padding [44]. The network training was implemented with PyTorch [45] and made open-source at https://github.com/QianyeYang/mpmrireg. The Adam optimizer with an initial learning rate of $10^{-5}$ was used. The "privileged supervision" networks described in Section II were trained on Nvidia Tesla V100 GPUs with a minibatch of 4 sets of image data, each containing a trio of intra-subject DWI$_{b=2000}$, T2w and DWI$_{b=0}$ images. Each network was trained for 600,000 iterations, taking approximately 50 hours. All registration networks were trained using the same training strategy unless otherwise specified.
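One way to generate a random affine transformation from perturbed image corners is sketched below. It is an illustrative reading of the corner-resampling idea, not the DeepReg API or the code released with this work: the normalised cube coordinates, the `scale` parameter and the function name are all assumptions.

```python
import numpy as np

def random_affine_from_corners(rng, scale=0.1):
    """Sample a 3D affine transform by jittering the corners of a [-1, 1]^3 cube.

    Returns a (4, 3) matrix `theta` such that new_xyz = [x, y, z, 1] @ theta.
    """
    corners = np.array([[x, y, z, 1.0]
                        for x in (-1.0, 1.0)
                        for y in (-1.0, 1.0)
                        for z in (-1.0, 1.0)])
    jittered = corners[:, :3] + rng.uniform(-scale, scale, size=(8, 3))
    # Least-squares affine fit mapping the original corners to the jittered ones.
    theta, *_ = np.linalg.lstsq(corners, jittered, rcond=None)
    return theta

identity = random_affine_from_corners(np.random.default_rng(0), scale=0.0)
theta = random_affine_from_corners(np.random.default_rng(0), scale=0.1)
```

Because each corner moves by at most `scale` in normalised coordinates, the sampled transform stays close to the identity, which limits how much of the resampled image falls outside the original field of view.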

C. Other learning-based registration
The "joint training" and "mixed sampling" networks, implementing the methods described in Section II-D.1 and Section II-D.2, respectively, were also trained to test these alternative approaches to incorporating the third image modality. As in evaluating the privileged supervision network, these two networks were trained with the trio of intra-subject images, but only took T2w and DWI$_{b=2000}$ images as input during the test stage using the holdout set.
In addition, a learning-based registration method was compared for directly aligning T2w and DWI scans with a b-value of 2000, DWI$_{b=2000}$. The registration network was trained using the unsupervised learning algorithm, similar to the one used in Section II-D.2, but with only T2w and DWI$_{b=2000}$ sampled in training, without DWI$_{b=0}$. This is referred to as the "Direct" method. To further understand the role of DWI$_{b=0}$ scans and the potential benefits of adding the bias-reducing Monte-Carlo resampling described in Section II-B, another unsupervised registration network was trained using only T2w and DWI$_{b=0}$ in training, without DWI$_{b=2000}$. These two registration networks were both tested on registering the T2w and DWI$_{b=2000}$ images on the holdout set.
Weakly supervised registration methods [23], [24] have also been shown to be effective for multi-modal registration problems. However, the gland masks of the DWI$_{b=2000}$ in this study are not available and are arguably much more difficult to annotate accurately. For example, rectal gas is known to generate magnetic distortion around posterior regions of the prostate gland, which complicates determining capsule boundaries; and DWI$_{high-b}$ has high sensitivity to certain types of pathology but lacks contrast in the gland itself. This study investigates how much improvement is feasible using unsupervised learning methods with unlabelled image data, which are in practice more feasible to obtain.

D. Non-learning registration
Learning-based registration methods in general provide superior efficiency, compared with the alternative classical registration algorithms based on iterative optimisation, especially for large 3D volumetric medical images [46]. However, it is useful to report the performance of the classical methods, which have been developed over the last two decades for multimodal image registration in similar applications [47]-[51].
For its fast GPU implementation, the NiftyReg package was used to compare a non-rigid B-spline-based free-form deformation algorithm with the above learning-based methods, by directly registering the T2w and DWI$_{b=2000}$. The NiftyReg package was used as an example of non-learning algorithms, with normalised mutual information and other parameter values following a previous prostate MR registration study [52]. In our experiment, MI was used as the similarity measure for comparison purposes, with a bending energy weight of 0.005, among other default configurations. These parameters may not be directly comparable to those used in the learning-based algorithms, due to the difference between pairwise optimisation and the stochastic-gradient-based learning process, in addition to varying implementation choices. The MI values before and after the respective algorithms are reported in Section IV.
The results from the non-learning registration are not reported to compare registration accuracy, as substantially more comprehensive experiments would be required to draw a convincing conclusion, which may also depend on the application and the experimental data used. Rather, they provide a reference for the registration performance of a readily available, non-learning registration algorithm that does not require the third modality in this specific prostate cancer application.

A. Registration performance on holdout set
The TRE and MI results on the holdout set are summarised in Table I. The proposed privileged supervision increased the median MI from 0.06 to 0.20 and lowered the median TRE from 7.96 mm to 4.34 mm, compared with those before registration, both with statistical significance (p-values<0.001). All the other tested methods also showed improved TREs in this application with statistical significance (NiftyReg: p-value=0.03, others: p-value<0.001). The Jacobian determinant of each predicted DDF was also computed and, for all proposed methods in this study, no negative values were found.
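The Jacobian determinant check can be sketched as below, assuming the DDF is expressed in voxel units on a regular grid and that derivatives are approximated with finite differences. The function name is hypothetical.

```python
import numpy as np

def jacobian_determinant(ddf):
    """Jacobian determinant of the mapping x -> x + ddf(x).

    ddf has shape (3, D, H, W), with displacements in voxel units.
    Non-positive values indicate local folding of the deformation.
    """
    # grads[i, j] approximates d(ddf_i) / d(x_j); shape (3, 3, D, H, W).
    grads = np.stack([np.stack(np.gradient(ddf[i]), axis=0) for i in range(3)],
                     axis=0)
    jacobian = grads + np.eye(3)[:, :, None, None, None]
    jacobian = np.moveaxis(jacobian, (0, 1), (-2, -1))   # (D, H, W, 3, 3)
    return np.linalg.det(jacobian)

grid_x = np.arange(4.0)[:, None, None] * np.ones((4, 4, 4))

identity_ddf = np.zeros((3, 4, 4, 4))
stretch_ddf = np.zeros((3, 4, 4, 4))
stretch_ddf[0] = 0.5 * grid_x        # 50% expansion along the first axis
folding_ddf = np.zeros((3, 4, 4, 4))
folding_ddf[0] = -2.0 * grid_x       # flips the first axis: folding

has_folding = bool(np.any(jacobian_determinant(folding_ddf) <= 0))
```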
Results from two groups of cases are also summarised in Table II: those with the largest initial misalignment, measured by landmark distance before registration, and those with the most improvement by registration, measured by TRE. It is noteworthy that the latter group was selected by the improvement after the registration results were obtained, which would not be available prior to registration, and therefore only provides a selective reference for measuring the potential contribution. Together with the cases with the largest misalignment, these two subgroup results represent the comparison on those cases that need the registration the most.
The registration results have been visually assessed and examples are provided in Fig. 3. In the first three cases, the morphology and the location of the tumor are more consistent with the fixed T2w images after registration. The fourth and the fifth cases are challenging cases with larger initial misalignment. Case 4 shows a visible improvement in the morphology of the central gland. The registration compensated for the distorted region near the rectum. Meanwhile, the increased hyper-intensity area at the top of the warped privileged image indicates an improved alignment of the bladder. In Case 5, although the registration could be further improved, the registered locations of the prostate gland and the tumor indicate that the predicted DDF contributed to visibly reducing the misalignment. Figure 4 provides further examples that demonstrate the potential benefits from the third modality during training, DWI$_{b=0}$ in this case. Case 1 is an example with minor misalignment. A suspected tumour was found in the central gland in the zoomed-in ROI, with the contoured tumor being visually better aligned after registration by all methods. Case 2 presents a relatively severe misalignment of the whole prostate gland, with the gland center being aligned after registration. Case 3 shows an example of a well-aligned local area of the urethra using the proposed method. Case 4 demonstrates a highly severe misalignment of both the prostate gland and the contoured tumor. From Case 1 to Case 4, it is visually recognisable that the Privileged method outperforms the others tested. For Case 4, although a minor misalignment still exists after registration, the Privileged method transformed the tumor closer to the target location and shape, with respect to the reference bounding-box ROI, while the other methods fell short to varying degrees. Case 5 shows an example of the distortion in the posterior region of the gland, which was reduced by the registration.
To further investigate the performance, a set of Bland-Altman plots for each proposed method is provided in Fig. 5 to show the differences in MI and TREs before and after registration. Each point represents a pair of landmarks in the holdout set, where the x axis represents the TRE before registration and the y axis represents the difference in TRE after registration. For each of the proposed methods, improvements are observed in both MIs and TREs. The Privileged method outperforms the others with improvements of 3.88 mm and 0.14 in TREs and MIs, respectively.
For inference, the proposed method took 0.12 s while NiftyReg took 10.19 s to register each pair of 3D images, both with GPU acceleration.

B. The need for registering T2w and DWI$_{b=2000}$
The MI and TREs on the test data set are computed to indicate the original difference between the two images without registration. All registration methods have made positive contributions to aligning the images, based on the increased MI values. All of the tested registration methods reduced TREs. This is an indication that registration in general would help align the T2w and DWI$_{b=2000}$ scans in this application. Table II provides results from the 10% and 20% of cases with the largest initial misalignment and the 10% and 20% of cases with the most improvement observed after registration. The results from both these subgroups showed a larger initial misalignment and arguably more substantial improvements from the registration. For example, for the 20% of cases with the largest initial misalignment, the proposed privileged supervision network improved the mean TRE from 12.93 mm to 6.77 mm.
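The subgroup summary can be sketched as follows, assuming one TRE value per case before and after registration. The rounding rule for the subgroup size and the function name are assumptions, and the numbers below are toy values, not results from this work.

```python
import numpy as np

def largest_misalignment_subgroup(tre_before, tre_after, fraction):
    """Mean TRE before and after registration for the worst-aligned cases.

    Cases are ranked by TRE before registration; the top `fraction` is kept.
    """
    tre_before = np.asarray(tre_before, dtype=float)
    tre_after = np.asarray(tre_after, dtype=float)
    order = np.argsort(tre_before)[::-1]              # largest first
    count = max(1, int(round(fraction * len(order))))
    selected = order[:count]
    return float(tre_before[selected].mean()), float(tre_after[selected].mean())

# Toy values for illustration only.
toy_before = [1.0, 9.0, 3.0, 7.0, 5.0, 2.0, 8.0, 4.0, 6.0, 10.0]
toy_after = [value / 2.0 for value in toy_before]
mean_before, mean_after = largest_misalignment_subgroup(toy_before, toy_after, 0.2)
```

Selecting by initial misalignment uses only information available before registration, unlike the subgroup selected by improvement, which can only be identified after the results are known.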
We also report a set of selective results only for inspecting the extent of the registration error, from the 10% and 20% of cases with the most improvement by registration, measured by TRE. However, identifying either of these subgroups of scans that need registration the most remains an interesting open research question, as it may be that, based on the clinical data set used in this work, the proposed method would be of increased clinical value when applied to this subset of patient studies. Table I and Table II. The proposed methods outperformed the alternative joint training and mixed sampling methods in terms of TREs. The advantage is both consistent and statistically significant (p-values<0.001). This set of results demonstrates the effectiveness of the proposed privileged learning method in aligning the T2w and DWI$_{b=2000}$ in this application. These results suggest that adding images from a different modality to training may not be trivial and, without appropriate adaptation, may reduce the registration performance. In addition, the same network was trained with the Privileged method with 10 random affine transformations in the Monte-Carlo resampling (i.e., I = 10), described in

D. Comparison to other learning-based registration
Direct registration marginally lowered the mean TRE, although with significance (p-value<0.001). The proposed privileged learning method obtained a lower mean TRE than the direct registration method, with statistical significance (p-value<0.001). This is consistent with the observations from the qualitative results in Fig. 4, which indicate that the privileged learning method was itself effective in registration (Section IV-A) whilst the Direct method was not. Interestingly, the Direct method produced a relatively high MI. This may be expected, as the Direct algorithm was trained to maximise MI directly, but optimisation influenced by heavy noise can lead to inferior TREs without the potential benefits from the added DWI$_{b=0}$ images, as discussed in Section IV-A. It is also interesting to report that, using T2w and DWI$_{b=0}$ as network input both in training and testing (Section III-C), the warped DWI$_{b=2000}$ also led to a mean TRE of 4.59±2.01 mm, outperforming the Direct, Joint and Mixed methods (p-value<0.001). These results highlight the non-trivial difficulties in directly registering T2w and DWI$_{b=2000}$.

E. Comparison to non-learning registration
NiftyReg also improved the mean TRE with statistical significance (p-value=0.03), but the improvement is very limited. The mean and median TREs from privileged supervision are improved over those from NiftyReg (p-value<0.001). Results from the two subgroups are summarised in Table II. It may be interesting to report that the selective group (the lower two columns in Table II), on which registration provided the most improvement, has a larger misalignment with the privileged modality than with NiftyReg. This perhaps indicates the potential utilisation of the extra anatomical and pathological information retained in the privileged images.
Table III summarises the registration performance of each method on two external validation data sets. On both data sets, our proposed Privileged method outperforms the results before registration and those from the other methods (all p-values≤0.01). Compared with the original data set, Data Set A has a larger initial misalignment, with a mean TRE of 11.01 mm. Although Data Set B has a smaller initial misalignment, it was acquired with a larger difference in acquisition protocols. For example, the MRIs from Data Set B were acquired with an endorectal coil and the b-value of the DWI$_{high-b}$ is only 1400 sec/mm$^2$. The Direct and Joint methods show smaller improvements in TRE on both data sets compared with the proposed method, albeit with an arguably smaller difference in the optimised MI. The Mixed method achieved relatively competitive performance on both data sets, second to the Privileged method. This is probably because the mixed sampling introduced more training data and thus increased its generalisability. It is also interesting to report that NiftyReg achieved lower TREs than the Direct and Joint methods in the external validation, although statistical significance was not found in these cases (p-values=0.83 and 0.27, respectively).
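The TRE-based subgroup analysis reported in Table II can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: TRE is taken here as the mean Euclidean distance in mm between corresponding 3D landmark coordinates, and the function names `tre` and `most_improved` are hypothetical.

```python
import math


def tre(fixed_pts, moving_pts):
    """Mean Euclidean distance (mm) between corresponding 3D landmarks.

    Both arguments are assumed to be equal-length lists of (x, y, z)
    coordinates already expressed in millimetres.
    """
    if len(fixed_pts) != len(moving_pts) or not fixed_pts:
        raise ValueError("landmark lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) for p, q in zip(fixed_pts, moving_pts))
    return total / len(fixed_pts)


def most_improved(tre_before, tre_after, fraction):
    """Indices of the cases with the largest TRE reduction, best first.

    `fraction` is the share of cases to keep, e.g. 0.1 or 0.2; at least
    one case is always returned.
    """
    if len(tre_before) != len(tre_after) or not tre_before:
        raise ValueError("TRE lists must be non-empty and of equal length")
    improvement = [b - a for b, a in zip(tre_before, tre_after)]
    k = max(1, round(fraction * len(improvement)))
    order = sorted(range(len(improvement)),
                   key=lambda i: improvement[i], reverse=True)
    return order[:k]
```

Note that this selection uses the TRE after registration, which requires landmark annotations. It is therefore only suitable for retrospective inspection and does not identify, before registration, the scans that would benefit most.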

V. DISCUSSION
The proposed use of the third modality not only helped many cases in our application, but also provides an interesting new mechanism beyond improving registration performance, by bringing in a potentially more intuitive and radiologically interpretable modality for future registration studies, such as investigating registration error distribution, local loss function design and evaluation methodologies.
This work examined a particularly challenging cross-modality task of registering T2w and high b-value diffusion scans from prostate cancer patients. From the investigation, we summarise the difficulties as follows: 1) high variance exists in both the imaging and the annotation of validation landmarks; in particular, clinical data contain variable and unknown misalignment between patients; 2) consistent and robust similarity measures that can serve as a loss function between the two complementary imaging modalities are lacking.
Largely motivated by the high efficiency of recent learning-based registration methods, we developed and compared registration networks and their associated training strategies. More interestingly, we proposed using an image from a third modality, arguably "closer" to both images being registered, to help the training procedure. The experimental results show that such an addition could indeed help the registration in a number of scenarios, with consistent and statistically significant advantages on a moderately sized multimodal image data set from clinical practice.
In summary, the presented experimental results confirmed that the proposed registration network training method can benefit from an additional modality during training. The improvement over other learning-based methods, which either use the "privileged modality" in different ways or do not use it at all, is effective and consistent, especially for the subset of patient cases with the largest misalignment, which therefore need registration the most.
We have demonstrated the proposed registration method using a privileged modality in a specific prostate cancer imaging application. While this method has potential for training registration networks using other types of available images in wider clinical applications, including and beyond the potential applications discussed in Section I, these require further investigation and validation.

VI. CONCLUSION
We have proposed strategies for using images from a third modality to aid the training of bi-modality image registration networks. Competitive registration accuracy has been experimentally demonstrated on mpMR data from prostate cancer patients. The proposed methodology may be generally applicable to a wide range of clinical image registration tasks.