Study on Correlation Between Subjective and Objective Metrics for Multimodal Retinal Image Registration

Retinal imaging is crucial in diagnosing and treating retinal diseases, and multimodal retinal image registration constitutes a major advance in understanding retinal diseases. Although many methods have been proposed for the registration task, the evaluation metrics for successful registration have not been thoroughly studied. In this article, we present a comprehensive overview of the existing evaluation metrics for multimodal retinal image registration, and compare various objective metrics against the subjective grades of ophthalmologists. The Pearson correlation coefficient and the corresponding confidence interval are used to evaluate metric similarity. We find that the binary and soft Dice coefficients on the segmented vessels achieve the highest correlation with the subjective grades among the keypoint-supervised and unsupervised metrics considered. The paper thus establishes an objective metric that is highly correlated with the subjective evaluation of ophthalmologists, a question that has not been studied before. The experimental results build a connection between the ophthalmology and image processing literature, and the findings may provide insights for researchers investigating retinal image registration, retinal image segmentation, and image domain transformation.


I. INTRODUCTION
Retinal diseases, including age-related macular degeneration, diabetic retinopathy, and vascular occlusion, are leading causes of multiple retinal pathologies and have systemic implications. Imaging plays a crucial role in diagnosing and treating retinal diseases [1], and ophthalmologists have access to a large variety of retinal imaging devices, including color fundus (CF) imaging, scanning laser ophthalmoscopy (SLO), ultra-wide-field imaging, optical coherence tomography (OCT) angiography, and dye-based angiograms. Each device uses different wavelengths, angiographic dyes, and optical systems, and produces images with different appearances and angles of view.
A major advance in understanding retinal disease would require the ability to register the multimodal images and integrate such functional and structural evaluations into one co-localizable database [2]. Critical information is available on the different instruments, and the diagnostic benefit of accurately overlaying different imaging modalities has recently been demonstrated by several groups [3]-[5]. However, this is complicated by the fact that the fields of view, lens systems, light sources, and manufacturers of such devices all differ. Fortunately, the retinal vessels can be observed, in different ways, by all of these instruments, and are key features for aligning and overlaying the various diagnostic modalities.
Conventional approaches for multimodal retinal image registration can be roughly divided into three categories: area-based methods, feature-based methods, and learning-based methods. The area-based methods are designed to maximize the mutual information [6] or the entropy correlation coefficient [7] between the images to be aligned, but they are sensitive to texture variations across modalities and also require intensive computation.
The feature-based methods overcome this limitation by detecting sparse feature correspondences in both images to estimate the transformation. GDB-ICP [8] and ED-DB-ICP [9] were initially applied to multimodal retinal image registration, but they were sensitive to changes in scale. Keypoint detection and feature description were later improved by descriptors designed for retinal images [10]-[12] or domain-specific landmarks [13], combined with better vessel extraction algorithms [11], [13], [14]. The robustness of keypoint matching was also improved by graph-based matching [15], a spherical model [16], a robust point matching algorithm [17], an asymmetric Gaussian mixture model [18], and so on. However, these conventional methods are still not robust enough for image pairs affected by disease or poor imaging quality. The recently proposed learning-based methods of the third category utilize convolutional neural networks (CNNs) to achieve improved robustness and accuracy over conventional methods [19]-[22].
However, the evaluation metrics for multimodal retinal image registration have not been thoroughly studied and compared. Ophthalmologists use their own subjective grade to assess the accuracy of registration based on the aligned images, while most papers in the image processing literature adopt several objective metrics for their fairness and simplicity. This divergence in evaluation metrics may cause performance variations from laboratory to clinical application, and poses a potential barrier to improving current registration methods.
In this article, we propose a method to mathematically compare the similarity of the subjective grade and the commonly used objective evaluation metrics, and establish an objective evaluation metric that is most correlated with the subjective evaluation of the ophthalmologists. To the best of our knowledge, this work is the first extensive study on various evaluation metrics for multimodal retinal image registration. We will review the subjective and objective metrics in section II, and then introduce our method for comparing their similarity in section III. In section IV, we will compare these metrics, and provide extensive analysis. Finally, section V concludes the paper.

II. RELATED WORKS
A. SUBJECTIVE
Ophthalmologists adopt a subjective grading method, where the aligned multimodal images are analyzed in 5 × 5 blocks and overlaid in two forms of checkerboard, as shown in Fig. 1. The neighboring blocks in the checkerboard image show the registered images from two different modalities. If the images are perfectly aligned, the vessels are expected to be continuous across the edges between neighboring blocks. In this subjective grading method, a score from 0 to 5 is assigned to each block in the checkerboard image, characterizing the ratio of overlap in vessels on the edge closest to the optic nerve, as specified in Fig. 1. The grading criteria are shown in Table 1 and illustrated in Fig. 2, where grades 1 and 2 are considered poor alignment, 3 reasonable, and 4 and 5 good/excellent matches; grade 0 is assigned to ungradable blocks due to the absence of vessels, or no visible vessels to detect because of noise or defocus. Note that in Figs. 1 and 2, the deformations on the edges of the CF image are caused by the deformable registration method applied to align the images, which will be explained in section IV.
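The checkerboard composite described above can be sketched in a few lines of numpy. This is an illustrative sketch only: the function name and nearest-edge handling are ours, and the paper's second overlay form corresponds to swapping which modality fills the even blocks.

```python
import numpy as np

def checkerboard_overlay(img_a, img_b, blocks=5):
    """Compose two aligned single-channel images into a checkerboard
    whose neighboring blocks alternate between the two modalities."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape
    out = img_a.copy()
    ys = np.linspace(0, h, blocks + 1, dtype=int)
    xs = np.linspace(0, w, blocks + 1, dtype=int)
    for i in range(blocks):
        for j in range(blocks):
            if (i + j) % 2 == 1:  # odd blocks come from the second modality
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = \
                    img_b[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
    return out
```

Swapping `img_a` and `img_b` in the call produces the complementary checkerboard, so every block edge is graded in both overlay forms.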

B. OBJECTIVE
Several objective metrics have been used in the image processing literature, and they can be divided into two groups: keypoint-supervised metrics, which require manually labeled keypoint correspondences, and unsupervised metrics, which operate directly on the registered images.
1) SUPERVISED
Denote keypoint coordinates with $p = (x, y)^T$ and the set of $M$ pairs of manually labeled point correspondences with $P = \{(p_1, q_1), \ldots, (p_M, q_M)\}$. The MAE (maximum error, different from the mean absolute error used in other fields) calculates the maximum L2 norm error over the selected point correspondences,
$$\mathrm{MAE} = \max_{1 \le i \le M} \|F(p_i) - q_i\|_2, \quad (1)$$
where $F(p)$ warps the source keypoint $p$ towards the target location using the transformation $F(\cdot)$ estimated by the registration algorithm.
Using similar notation, MEE calculates the median L2 norm error on the selected points,
$$\mathrm{MEE} = \operatorname*{median}_{1 \le i \le M} \|F(p_i) - q_i\|_2, \quad (2)$$
and RMSE calculates the root mean square error on the selected points,
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \|F(p_i) - q_i\|_2^2}. \quad (3)$$
PCK sets a threshold $T$ on the L2 norm to determine whether a pair of keypoints is correctly matched, and calculates the percentage of correct keypoints,
$$\mathrm{PCK} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left(\|F(p_i) - q_i\|_2 < T\right). \quad (4)$$
The choice of threshold $T$ is task dependent. For retinal image registration tasks, an RMSE of less than 5 pixels is usually considered successful registration [11], [12], [17], [18], so the threshold $T$ can be set to 5 pixels. To compute these metrics, we first need to manually label pairs of keypoint correspondences (generally 6 or more [10]-[12], [17], [18]) for all the multimodal images, where the keypoint locations should lie accurately on salient landmarks such as vessel bifurcations and be uniformly distributed in the overlapping area. However, this is not an easy task. Even with the help of user-friendly software (e.g., a GUI), it is still difficult for a human to select points with pixel-level accuracy and to make the points uniformly distributed in the overlapping area. Besides, labeling point correspondences by hand is very time-consuming for larger datasets with more than hundreds of image pairs.
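Given the warped keypoints $F(p_i)$ and their targets $q_i$ stored as (M, 2) coordinate arrays, the four keypoint-supervised metrics reduce to a few lines of numpy. The function and argument names below are illustrative, not from the paper:

```python
import numpy as np

def keypoint_metrics(warped_pts, target_pts, threshold=5.0):
    """Compute MAE (maximum error), MEE (median error), RMSE, and PCK
    from warped source keypoints F(p_i) and their targets q_i.
    Both inputs are (M, 2) arrays of pixel coordinates."""
    err = np.linalg.norm(warped_pts - target_pts, axis=1)  # per-pair L2 error
    return {
        "MAE": err.max(),
        "MEE": float(np.median(err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "PCK": float(np.mean(err < threshold)),  # fraction under the threshold
    }
```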

2) UNSUPERVISED
Unsupervised evaluation metrics, on the contrary, do not require manually labeled keypoint correspondences, and only take the registered images as input.
Denote the aligned images with $I_1, I_2 \in \mathbb{R}^{H \times W}$; then the mean square error (MSE) is defined as
$$\mathrm{MSE} = \frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} \left(I_1(x, y) - I_2(x, y)\right)^2. \quad (5)$$
The structural similarity index (SSIM) [24] is designed to improve upon MSE, and it is averaged over all the windowed patches $W_1, W_2$ from the two images $I_1, I_2$,
$$\mathrm{SSIM}(W_1, W_2) = \frac{(2\mu_1\mu_2 + c_1)(2\sigma_{12} + c_2)}{(\mu_1^2 + \mu_2^2 + c_1)(\sigma_1^2 + \sigma_2^2 + c_2)}, \quad (6)$$
where $\mu_j$ is the mean value of window $W_j$ ($j = 1, 2$), and $\sigma_j^2$ and $\sigma_{12}$ are the variance of $W_j$ and the covariance of $W_1$ and $W_2$, respectively. $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two variables to stabilize the division with a small denominator, with $L$ denoting the dynamic range of pixel intensity and $k_1 = 0.01$, $k_2 = 0.03$ by default. SSIM is a value between 0 and 1, and a higher SSIM indicates higher similarity between the two images. Note that SSIM is only designed for single-channel images, so RGB images need to be first converted to grayscale.
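The two intensity-based metrics can be sketched as follows. This is a simplified sketch that averages eq. (6) over non-overlapping square windows; the reference SSIM implementation uses a sliding Gaussian-weighted window, so exact values will differ slightly.

```python
import numpy as np

def mse(i1, i2):
    """Mean square error between two aligned single-channel images."""
    return float(np.mean((i1.astype(float) - i2.astype(float)) ** 2))

def ssim(i1, i2, win=8, L=255.0, k1=0.01, k2=0.03):
    """Mean SSIM over non-overlapping win x win patches (simplified)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    h, w = i1.shape
    scores = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            w1 = i1[y:y + win, x:x + win].astype(float)
            w2 = i2[y:y + win, x:x + win].astype(float)
            mu1, mu2 = w1.mean(), w2.mean()
            v1, v2 = w1.var(), w2.var()
            cov = ((w1 - mu1) * (w2 - mu2)).mean()
            scores.append(((2 * mu1 * mu2 + c1) * (2 * cov + c2))
                          / ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2)))
    return float(np.mean(scores))
```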
The Dice coefficient is frequently used for evaluating the overlapping region between two segmentation maps [14], [19], [20], [22]. With the aid of vessel segmentation algorithms, the Dice coefficient can also be used to evaluate the accuracy of registration. Let $S_j$ denote the binary vessel segmentation of image $I_j$ ($j = 1, 2$), with 1 assigned to vessels and 0 assigned to background; the Dice coefficient for binary segmentation is defined as
$$\mathrm{Dice} = \frac{2 \sum (S_1 \odot S_2)}{\sum S_1 + \sum S_2}, \quad (7)$$
where $\odot$ denotes the element-wise product and the sums run over all pixels. The Dice coefficient ranges between 0 and 1, and a higher number indicates larger overlap.
The soft Dice coefficient [20] introduces a differentiable counterpart of the binary Dice coefficient for vessel probability maps, and it is defined as
$$\mathrm{Dice}_{\mathrm{soft}} = \frac{2 \sum (P_1 \odot P_2)}{\sum P_1 + \sum P_2}, \quad (8)$$
with $P_j$ denoting the vessel probability map of $I_j$ ($j = 1, 2$). The Python code for the binary and soft Dice coefficients can be found in Table 2.
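Table 2 is not reproduced in this text; the following is a minimal numpy sketch consistent with eqs. (7) and (8) (the small `eps` guard against empty masks is our addition, not from the paper):

```python
import numpy as np

def dice_binary(s1, s2, eps=1e-8):
    """Binary Dice coefficient on two {0, 1} vessel segmentation maps."""
    inter = np.sum(s1 * s2)  # element-wise product counts overlapping pixels
    return float(2.0 * inter / (np.sum(s1) + np.sum(s2) + eps))

def dice_soft(p1, p2, eps=1e-8):
    """Soft (differentiable) Dice on two vessel probability maps in [0, 1]."""
    inter = np.sum(p1 * p2)
    return float(2.0 * inter / (np.sum(p1) + np.sum(p2) + eps))
```

On hard {0, 1} maps the two functions coincide, which is consistent with the later observation that the binary and soft variants are nearly perfectly correlated when the segmentation output is close to binary.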

III. METHOD
In order to build a connection between ophthalmology and image processing literature, in this article, we compute the Pearson correlation coefficient [25] between the subjective grade of ophthalmologists and various objective evaluation methods to compare their degree of similarity.

A. PEARSON CORRELATION COEFFICIENT
The Pearson correlation coefficient (PCC) for random variables $X$ and $Y$ is defined as
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \quad (9)$$
where the covariance of $X$ and $Y$ is
$$\mathrm{cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right], \quad (10)$$
and $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively. For finite $N$ samples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, the PCC can be estimated by
$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}, \quad (11)$$
where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ denotes the sample mean, and similarly for $\bar{y}$.
The PCC value $\rho$ ranges within $[-1, 1]$, where a higher absolute value $|\rho|$ implies higher linear dependency. A positive $\rho$ implies that $Y$ increases as $X$ increases, while a negative value implies that $Y$ decreases as $X$ increases. For example, the correlation between MAE and the subjective grade should be negative, while the correlation between PCK and the subjective grade should be positive. Based on Cohen's interpretation [26], an absolute PCC value of 0.1 is small, 0.3 is medium, and 0.5 is large. An important property of the PCC is that it is invariant to separate linear transforms of $X$ and $Y$,
$$\rho_{X,Y} = \rho_{(aX+b),(cY+d)}, \quad \text{where } ac > 0,\; a, b, c, d \in \mathbb{R}, \quad (12)$$
which means that the PCC can still be applied even when the scale and range differ across criteria.
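The sample estimate of eq. (11), and the invariance property of eq. (12), can be checked numerically with a short numpy sketch (the function name is ours):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, eq. (11)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()          # center both samples
    return float((xm * ym).sum()
                 / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))
```

For any $a, c > 0$, `pearson_r(x, y)` equals `pearson_r(a * x + b, c * y + d)` up to floating-point error, which is exactly the invariance used to compare metrics with different scales and ranges.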

B. CONFIDENCE INTERVAL
In order to quantify the uncertainty in the estimated PCC value $r$ in eq. (11), we further calculate its confidence interval to indicate the possible range of the true PCC value $\rho$ in eq. (9). The confidence interval for the population's Pearson $\rho$ can be computed in three steps [27].
Firstly, a z-score is computed by applying the Fisher transformation $F(\cdot)$,
$$z = F(r) = \frac{1}{2} \ln \frac{1 + r}{1 - r} = \operatorname{artanh}(r). \quad (13)$$
Secondly, given a confidence level, typically set to 95% (i.e., a significance level $\alpha = 0.05$), the critical z-score $z_{\alpha/2}$ for a two-tailed test can be obtained from a look-up table. Then the confidence interval for $F(\rho)$ is
$$\left[z - z_{\alpha/2}\,\mathrm{SE},\; z + z_{\alpha/2}\,\mathrm{SE}\right], \quad (14)$$
where SE is the standard error
$$\mathrm{SE} = \frac{1}{\sqrt{N - 3}}. \quad (15)$$
Finally, the interval endpoints are converted back using the inverse Fisher transformation,
$$F^{-1}(z) = \tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}. \quad (16)$$
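The three steps can be sketched as follows, with the critical value $z_{\alpha/2} = 1.96$ hard-coded for a two-tailed 95% interval (the function name and default are ours):

```python
import numpy as np

def pearson_ci(r, n, z_crit=1.96):
    """Confidence interval for a Pearson r estimated from n samples,
    via the Fisher z-transformation (z_crit = 1.96 for a two-tailed
    95% interval)."""
    z = np.arctanh(r)              # step 1: Fisher transform F(r)
    se = 1.0 / np.sqrt(n - 3)      # standard error of z
    lo = z - z_crit * se           # step 2: interval in z-space
    hi = z + z_crit * se
    return float(np.tanh(lo)), float(np.tanh(hi))  # step 3: inverse transform
```

As expected, the interval always contains the point estimate and tightens as the number of samples grows.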

IV. EXPERIMENT
In this experiment, we test the Pearson correlation between various objective metrics and the subjective grade on multimodal retinal image registration results. We select two automatic deformable registration methods to obtain aligned images for evaluation: the first is a conventional method, MIND [28], and the second is a learning-based method [20]. As the deformable registration methods are designed to correct local misalignment, we generate the input image pairs by applying random deformations to the ground truth alignment [19], [20], as shown in Fig. 3. For the images aligned by the two methods, ophthalmologists grade them for the subjective metric, and we compute the objective metrics MAE, MEE, RMSE, PCK, SSIM, and the Dice coefficients. Note that we use the two automatic deformable registration methods [20] and [28] not to compare the performance of the two methods, but to verify that the correlation between the subjective and objective metrics is consistent for both conventional and learning-based methods.

A. DATASET AND GROUND TRUTH
The dataset consists of 109 pairs of color fundus images (TRC-50DX, Topcon) and infrared scanning laser ophthalmoscope images (Spectralis SLO, Heidelberg Engineering). The dataset includes a variety of pathologies, including hemorrhages, diabetes, and macular degeneration.
The multimodal images used in the experiment include three different types of content change. Firstly, pathology has a different appearance in the two imaging modalities, as illustrated in Fig. 4. Secondly, the images are not necessarily taken at the same time, which also makes a difference as the disease progresses. Thirdly, there are different artifacts, such as defocus, reflection, and over- or under-exposed regions, in the two modalities.
The source CF images are 24-bit RGB with a resolution of 3000 × 2672, and the target SLO images are 8-bit grayscale with a resolution of 768 × 768 or 1536 × 1536. Both are padded to square shape and resized to 768 × 768 before registration, and the ground truth transformation matrices are derived from manually selected correspondences between the source and target images. We use the SuperPoint network [29] to detect keypoints on the segmentation result of [20] and select all possible pairs of point correspondences.

B. EXPERIMENT SETTING
As shown in Fig. 3, the input is generated by randomly deforming the image pair aligned by the ground truth transformation matrix, where the random deformation field is upsampled from a 4 × 4 × 2 matrix drawn from a normal distribution with zero mean and a standard deviation of 5 pixels. The input pair is then registered by each of the two registration methods to generate the output pair, where MIND [28] is implemented in MATLAB and [20] is implemented in PyTorch.
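A dense deformation field of this kind can be sketched as follows. The paper does not specify the upsampling interpolation, so nearest-neighbor upsampling is used here purely for simplicity, and the function and parameter names are ours:

```python
import numpy as np

def random_deformation(h, w, grid=4, sigma=5.0, seed=0):
    """Dense (h, w, 2) displacement field upsampled from a grid x grid x 2
    matrix of N(0, sigma^2) pixel offsets, used to perturb an aligned pair.
    Nearest-neighbor upsampling is used here for simplicity; a smoother
    interpolation (e.g. bilinear) could be substituted."""
    rng = np.random.default_rng(seed)
    coarse = rng.normal(0.0, sigma, size=(grid, grid, 2))
    # repeat each coarse cell to cover its block of the full-resolution field
    return np.repeat(np.repeat(coarse, h // grid, axis=0), w // grid, axis=1)
```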
The outputs of the registration methods are graded independently by two retina specialists and three medical students. Each pair of aligned images is graded twice, based on the two forms of checkerboard overlay shown in Fig. 1. We take the average of the two grades assigned to the same block if both are non-zero, take the non-zero grade if only one grade is zero, and keep the grade zero if both grades are zero. To verify the reliability of the grades, each of the five graders first graded the same subset of 10 images; the average intraclass correlation coefficient (ICC) on this subset, calculated with SPSS [30], is 0.903, which indicates excellent agreement among graders.
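The per-block merging rule above can be written directly (a trivial sketch; the function name is ours):

```python
def merge_grades(g1, g2):
    """Combine the two grades a block receives from the two checkerboard
    forms: average if both are non-zero, keep the non-zero grade if only
    one is zero, and keep zero (ungradable) if both are zero."""
    if g1 and g2:
        return (g1 + g2) / 2.0
    return g1 or g2  # the non-zero grade, or 0 when both are zero
```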
In the comparison, each block is treated as an individual sample, as illustrated in Fig. 5. MSE, SSIM, and the Dice coefficients are computed on the image patch within each block, where the black area in the CF image is excluded. MAE, MEE, RMSE, and PCK are calculated on all correspondences within the block, as shown in Fig. 6. In particular, blocks with no valid correspondences or marked as ungradable (grade 0) are excluded when calculating the correlation coefficient, as they may otherwise strongly distort the correlation coefficients.
We set the threshold T in eq. (4) to 5 pixels for PCK, which yields the maximum correlation with the grade at 0.1539, as shown in Fig. 7. MSE and SSIM are computed on the aligned images after converting them to grayscale. Two segmentation methods, [31] and [20], are tested for the binary and soft Dice coefficients. When calculating the binary Dice coefficient, we use the optimal thresholds of 0 and 0.5 for [31] and [20] respectively, which yield the largest correlation with the grade. Fig. 8 shows the correlations between the subjective grade and the various objective metrics, together with their 95% confidence intervals, evaluated on each block or image pair. The soft Dice coefficient using segmentation [20] achieves the highest correlation with the subjective grade at 0.35, which can be considered moderate based on Cohen's interpretation [26], and the binary Dice coefficient using [20] has a slightly lower correlation. The binary and soft Dice coefficients using [31] also demonstrate small correlations with the subjective grade. This coincides with the fact that [20] obtains more accurate vessel segmentation maps than [31]. PCK and SSIM show small positive correlations, while the error terms, including MAE, MEE, RMSE, and MSE, show smaller negative correlations. The main reason that the binary and soft Dice coefficients show the highest correlation with the subjective grade may lie in their common emphasis on retinal vessels: the Dice coefficient measures the overlap of vessel segmentations, and the subjective grade also depends on the continuity of vessels. Meanwhile, the other metrics do not explicitly depend on vessels, which leads to lower correlation with the subjective grade.

C. RESULTS
In order to evaluate the intrinsic correlation between different metrics, we calculate the absolute PCC value between each pair of metrics, shown in Fig. 9. We observe that the correlations among the supervised metrics are very high, because they all operate on the selected keypoint correspondences. MSE and SSIM also have a very large correlation, as they both take the grayscale images as input. The binary and soft Dice coefficients for segmentation [20] are nearly perfectly correlated, as the segmentation map obtained by [20] is already close to binary. Meanwhile, the binary and soft Dice coefficients differ for segmentation [31], whose vessel segmentation is closer to a probability map.
We further investigate the influence of choosing a different registration method. In the following experiment, the correlation is calculated separately for the two registration methods [20], [28]. As shown in Fig. 10, the soft Dice coefficient under segmentation [20] still has the highest absolute correlation for both methods. Specifically, the soft Dice coefficient under segmentation [20] has a higher correlation using only registration method [28], whose grade distribution, shown in Fig. 11, better covers both higher and lower grades. This experiment demonstrates that adopting different registration methods leads to similar conclusions.

V. CONCLUSION
In this article, we presented a comprehensive overview of the existing evaluation metrics for multimodal retinal image registration, and computed the Pearson correlation coefficient between the ophthalmologists' subjective grade and several commonly used objective evaluation metrics. We found that the soft Dice coefficient using the segmentation method in [20] achieved higher correlation with the subjective grades than many commonly used keypoint-supervised metrics. The paper thus established an objective metric that is highly correlated with the subjective evaluation of ophthalmologists. The experimental results build a connection between the ophthalmology and image processing literature, and the findings may provide insights for researchers investigating retinal image registration, retinal image segmentation, and image domain transformation.
There are some limitations of the experiment that can be addressed by future work. For example, the dataset consists of 108 image pairs with two imaging modalities (color fundus and infrared SLO), and we evaluated two deformable registration methods [20], [28]. Although our dataset is one of the largest multimodal datasets in the literature, future work could improve the experiment by including a larger dataset, more imaging modalities, and more registration methods.

He received a postdoctoral fellowship from the University of California at San Diego. He is currently an Associate Adjunct Professor and the Co-Director of the Jacobs Retina Center. His research interests include retinal imaging; confocal and non-confocal scanning laser imaging; optical coherence tomography (OCT); indocyanine green and fluorescein angiography; and tomographic reconstruction of the posterior pole in patients with various retinal diseases such as age-related macular degeneration, diabetes, and HIV-related complications.
WILLIAM R. FREEMAN is currently a Distinguished Professor of Ophthalmology, the Director of the UCSD Jacobs Retina Center, and the Vice Chair of the UCSD Department of Ophthalmology. He is a full-time retina surgeon and also a researcher who has held NIH grants for nearly 30 years. He works closely with imaging groups in the Department of Ophthalmology and the UCSD School of Engineering. He has over 600 peer-reviewed publications.

TRUONG Q. NGUYEN (Fellow, IEEE) is currently a Professor with the ECE Department, UC San Diego. His current research interests include 3D video processing and communications and their efficient implementation. He is the coauthor (with Prof. Gilbert Strang) of a popular textbook, Wavelets and Filter Banks (Wellesley-Cambridge Press, 1997), and the author of several MATLAB-based toolboxes on image compression, electrocardiogram compression, and filter bank design. He has over 400 publications.
He received the IEEE TRANSACTIONS ON SIGNAL PROCESSING Paper Award (Image and Multidimensional Processing area) for the article he co-wrote with Prof. P. P. Vaidyanathan on linear-phase perfect-reconstruction filter banks (1992).

CHEOLHONG AN received the B.S. and M.S. degrees in electrical engineering from Pusan National University, Busan, South Korea, in 1996 and 1998, respectively, and the Ph.D. degree in electrical and computer engineering in 2008. He is currently an Assistant Adjunct Professor of electrical and computer engineering at the University of California at San Diego. Earlier, he worked at Samsung Electronics, South Korea, and Qualcomm, USA. His current research interests include medical image processing and real-time bio-image processing, as well as 2D and 3D image processing with machine learning and sensor technology.