Introduction
Image quality assessment (IQA) is a research area concerned with constructing accurate computational models to predict the perception of image quality by human subjects, the ultimate consumers of most image processing applications [1].
The growing popularity of image enhancement and image generation algorithms increases the need for quality assessment of their performance. This demand has led to an abundance of IQA methods emerging over the last decades. The well-known full-reference (FR) metrics, such as MSE, PSNR, and SSIM [2], [3], have become a de-facto standard in many computer vision applications. The more recent no-reference (NR) metrics, such as BRISQUE [4], have also found their use, especially when the ground truth images are absent or hard to access. Yet another class, distribution-based (DB) metrics, has earned the community’s attention thanks to the advent of generative adversarial networks (GANs), enabling quality assessment using distributions of thousands of images instead of gauging them individually. Popular DB IQA methods include Inception Score [5], FID [6], KID [7], MSID [8], and many others. Despite being widely used, DB metrics were included neither in the recent large-scale general-domain reviews [9], [10], nor in the medical ones [11].
IQA measures are applied to estimate the quality of image processing algorithms and systems. For example, when several image denoising and restoration algorithms are available to recover images distorted by blur and noise, an objective perceptual IQA method could help pick the one that yields the best perceptual image quality after restoration. To do that reliably, Image Quality Metrics (IQMs) need to show a high correlation with the perceptual estimates of quality reported by human subjects for a given image processing algorithm. However, IQA algorithms are often evaluated on non-realistic distortions, such as added noise or artificial blurring [9], [10], [11]. Such a discrepancy between the synthetic evaluation and the practical use may produce misleading results.
While most metrics are designed to predict image quality in the general domain, Magnetic Resonance Imaging (MRI) provides gray-scale data with the content and style noticeably different from the natural images. Hence, the applicability of the IQMs in the MRI domain must be validated.
Moreover, IQMs trained on natural images attempt to describe the overall perception of the quality of an entire scene. On the contrary, an MRI scan can be perceived as high-quality when specific characteristics, responsible for the scan’s value, are deemed adequate. These characteristics are regarded as important components of radiographic image quality [12], including the perceived level of noise (signal-to-noise ratio, SNR), the perceived soft tissue contrast (contrast-to-noise ratio, CNR), and the presence of artefacts. Unfortunately, none of the previous IQM studies considered them. Besides, these specific quality criteria are coupled. For example, some denoising algorithms tend to introduce additional blurring (lowering CNR) in exchange for increased SNR, and some motion correction approaches tend to introduce noticeable artefacts. Therefore, a more detailed evaluation of an IQM’s ability to express separate MRI quality criteria is required.
The remainder of this paper is structured as follows. After discussing the related work, we describe how we generate an image library that consists of disrupted and reference MRI image pairs. In Section IV-A, we provide a detailed description of data selection, corruption, and restoration processes that populate the image library with realistic yet diverse data. We then use the image library to survey expert radiologists and collect a set of labels to be then correlated with IQM values in Section IV-C. Finally, we report and discuss the results in Sections V and VI, where we indicate the top-performing metrics, and provide insights about their performance for different distortions, robustness to the domain shift, anatomies, and quality criteria. Section VII concludes the work by proposing the best IQA approaches for MRI. Appendices include a list of abbreviations (A), reconstruction examples (B), and a screenshot of the labeling user interface (C).
The main contributions of this paper are:
The most extensive study of IQA in medical imaging, in general, and in MRI, in particular (14,700 subjective scores collected). Unlike previous metric evaluation studies, we avoid artificially added distortions and assess the outputs of popular image restoration models instead. The assessment is based on three proposed criteria and allows us to draw well-grounded conclusions about what modern metrics can capture and when exactly they should be used.
To the best of our knowledge, we provide the first thorough study of the application of DB metrics for objective IQA of both natural and medical images. We evaluate their performance and show when they give an advantage over the common FR and NR metrics. We study the robustness of the metrics’ performance across these two vastly different domains and show that the best performing IQMs produce valuable results even when the data distribution drastically changes.
Related Work
The evaluation of metrics for IQA in the domain of natural images started from the early task-specific works that considered FR methods to characterize color displays and half-toning optimization methods [13].
More recent task-specific studies explored IQA for the images of scanned documents [14] and screen content [15]. Likewise, fused images [16], smartphone photographs [17], remote sensing data [18], and climate patterns [19] demanded the development of targeted IQA approaches. Historically, many of these works focused on the quality degradation caused by compression algorithms [19], [20], [21], [22], with relatively small datasets appearing publicly for the IQ evaluation. However, the small dataset sizes and the excessive re-use of the same test sets have led to the promotion of IQMs that generalize poorly to unseen distortions.
This was recognized as a major problem, stimulating the emergence of large-scale studies [9], [23]. Among the large-scale evaluations, the majority compared multiple FR metrics, ranging from just a handful [24], [25], [26], [27] to several dozens [9], [28] of IQMs analyzed on popular datasets.
The medical domain stands out from the others by a special sense of what is deemed informative and acceptable in the images [29]. Resulting from years of training and practice, the perception of medical scan quality by adept radiologists relies on a meticulous list of anatomy-specific requirements, on their familiarity with particular imaging hardware, and even on their intuition.
Given that the majority of IQMs were not designed for the healthcare domain, some recent works have been dedicated to this niche. One small-scale study considered a connection between IQA of natural and medical images via SNR estimation [30]. Others assessed common FR IQMs using non-expert raters [31], [32]. While sufficient for a general audience, these methods proved incapable of reflecting the fine-tuned perception of radiologists [33].
Expert raters were then engaged in [11] and [34]. The former studied only IQMs from the SSIM family and the latter assessed 10 FR IQMs, reporting that VIF [35], FSIM [36], and NQM [37] yield the highest correlation with the radiologists’ opinions.
On the other hand, Crow et al. argue that NR IQMs are preferable for assessing medical images because there may be no perfect reference image in real-world medical imaging [38]. To address this issue, several recent studies also propose new NR IQMs for MRI image quality assessment [39], [40], [41], [42], [43], [44], [45], [46], [47], [48]. The recent survey [49] overviews MRI-specific IQMs and concludes that the number of available metrics is relatively low and that their development is hindered by the lack of publicly available datasets. Also, none of these new metrics have an open-source implementation, making verification of the claimed results problematic.
Image Quality Metrics Considered
In this work, we evaluate the most widely used and publicly available general-purpose FR, NR, and DB IQMs to find the best algorithms for the quality assessment on arguably the most important MRI-related image-to-image tasks: scan acceleration, motion correction, and denoising. Instead of modeling the disrupted images, we use outputs of trained neural networks and compare them with the clean reference images from the fastMRI dataset.
Our study includes the following 35 metrics: 17 Full-Reference IQMs (PSNR, SSIM [2], MS-SSIM [50], IW-SSIM [51], VIF [35], GMSD [52], MS-GMSD [53], FSIM [36], VSI [54], MDSI [55], HaarPSI [56], Content and Style Perceptual Scores [57], LPIPS [58], DISTS [59], PieAPP [60], DSS [61]), 3 No-Reference IQMs (BRISQUE [4], PaQ-2-PiQ [62], MetaIQA [63]), and 15 Distribution-Based IQMs (KID [7], FID [6], GS [64], Inception Score (IS) [5], MSID [8], all implemented with three different feature extractors: Inception Net [65], VGG16, and VGG19 [66]). For brevity of the presentation, we will showcase only the analysis of the best performing four metrics, in the order of their ranking: VSI [54], HaarPSI [56], DISTS [59], and FIDVGG16 [6]. All metrics were re-implemented in Python to enable a fair comparison, with the PyTorch Image Quality (PIQ) [67] chosen as the base library for implementing all metrics. The resulting implementations were verified to be consistent with the original implementations proposed by the authors of each metric.
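For illustration, the snippet below sketches how a reconstructed-reference pair can be scored with a few of these metrics via PIQ. The tensor shapes, the value range, and the replication of the grayscale slice to three channels are our assumptions about the preprocessing; the exact interfaces should be checked against the PIQ documentation.

```python
import torch
import piq

# x: reconstructed slice, y: artefact-free reference; float32 values in [0, 1].
# The grayscale slice is replicated to 3 channels because most general-purpose
# IQMs expect RGB-like input (our assumption about the preprocessing).
x = torch.rand(1, 1, 320, 320).repeat(1, 3, 1, 1)
y = torch.rand(1, 1, 320, 320).repeat(1, 3, 1, 1)

scores = {
    "PSNR":    piq.psnr(x, y, data_range=1.0).item(),
    "SSIM":    piq.ssim(x, y, data_range=1.0).item(),
    "HaarPSI": piq.haarpsi(x, y, data_range=1.0).item(),
    "VSI":     piq.vsi(x, y, data_range=1.0).item(),
    "DISTS":   piq.DISTS()(x, y).item(),  # deep FR metric; lower is better
}
print(scores)
```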
It is noteworthy that, in our survey, we dismissed some recent results reported for the PIPAL dataset [68] during the 2021 NTIRE challenge [69], because the winners [70], [71], [72] had released no official implementations or model weights at the time of our experiments.
For the comparison, we collect 14,700 ratings from 7 trained radiologists to evaluate the quality of reconstructed images based on three main criteria of quality: perceived level of noise (SNR), perceived soft tissue contrast (CNR), and the presence of artefacts, making this work the most comprehensive study of MRI image quality assessment to date.
Medical Evaluation
The key goal of this study is to evaluate popular selected IQMs on MRI data. Previous works [11], [34] evaluated the ability of certain IQMs to assess the overall quality of data subjected to various types of artificial distortions.1 However, in practice, the overall image quality (IQ) rating may be insufficient due to its ambiguity: e.g., one could not truly interpret the reasons for poor or good scoring. At the same time, asking the medical experts these general questions may be challenging because of many factors, ranging from the specifics of certain clinical workflows to personal preferences.
In this work, we aspire to solve these problems by proposing the following study. First, we evaluate IQMs with regard to their ability to reflect radiologists’ perception of the quality of distorted images, comparing them to the fully-sampled artefact-free ones. We rank the metrics based on three IQ criteria that are crucial for making clinical decisions: perceived level of noise (SNR), perceived level of contrast (CNR), and the presence of artefacts. Second, instead of corrupting images with artificial perturbations, for the first time in the community, we validate these metrics using the actual outputs of deep learning networks trained to solve common MRI-related tasks. As such, the artefacts originate from the imperfect solutions to the common real-world problems of motion correction, scan acceleration, and denoising.
A group of trained radiologists rated the quality of distorted images compared to the clean reference images on a scale from 1 to 4 for the three IQ criteria. Unlike the five-point Likert scale, the simplified scale balances the descriptiveness of the score against the noise in the votes of the radiologists. Our mock experiments showed that the respondents found selecting between too many options difficult, with the five-point scale having a diluted difference between the options, whereas the three-point scale was deemed insufficiently descriptive.
After the evaluation, the aggregated results were compared with the values of selected IQA algorithms to identify the top performers – the metrics that correlate the highest with the radiologists’ votes.
A. Image Library Generation
As a data source, we use the largest publicly available repository of raw multi-coil MRI data, the fastMRI dataset, which comprises knee and brain acquisitions.
The knee subset is divided into 4 categories: train (973 volumes), validation (199 volumes), test (118 volumes), and challenge (104 volumes). Only the multi-coil scans were selected for this study, omitting the single-coil data.
The brain subset includes 6,970 1.5 and 3 Tesla scans collected on Siemens machines using T1, T1 post-contrast, T2, and FLAIR acquisitions. Unlike the knee subset, these data come in a wide variety of reconstruction matrix sizes. For the purpose of de-identification, the authors of the dataset limited the data to only 2D axial images.
Starting with the clean knee and brain data, we first generate images corrupted with three types of distortions: scan acceleration, motion, and noise. The examples of the distorted images are presented in Fig. 1. After that, we train two reconstruction models for each type of distortion using PyTorch [75]. The first model is trained until the validation loss stabilizes. The second model is trained for half as long to purposely produce the imperfect reconstructions oftentimes encountered in practice. Examples of corrupted images and the corresponding reconstructions can be found in Appendix B. The reduced training time was a conscious choice, enabling the model to produce some visible reconstruction errors. More specifically, we interrupted the training when 90% of the loss plateau was reached, which yields well-performing models with the imperfections we wanted to test for.2 Finally, we use the trained models to reconstruct the corrupted images in the validation subset of fastMRI, from which we generate the labeling dataset.
Distortions introduced to initially artefact-free scans during training and inference.
1) Scan Acceleration
Scan-acceleration data are generated from the ground truth images by retrospectively undersampling the corresponding multi-coil k-space data.
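As a rough illustration of the retrospective undersampling step, the sketch below applies a one-dimensional random Cartesian mask with a fully sampled central band, loosely following fastMRI-style masking; the function name and all parameter values are illustrative.

```python
import torch

def undersample(kspace, accel=4, center_frac=0.08, seed=0):
    """Retrospectively undersample k-space with a 1D random Cartesian mask.

    kspace : complex tensor (..., H, W); columns (last dim) are phase encodes.
    accel  : target acceleration factor (e.g., 2, 4, or 6).
    The fully sampled low-frequency band plus random columns only roughly
    follow the fastMRI-style masking; exact parameters are illustrative.
    """
    g = torch.Generator().manual_seed(seed)
    num_cols = kspace.shape[-1]
    num_center = int(round(num_cols * center_frac))
    # probability of keeping each remaining column to hit the target rate
    prob = (num_cols / accel - num_center) / (num_cols - num_center)
    mask = torch.rand(num_cols, generator=g) < prob
    pad = (num_cols - num_center) // 2
    mask[pad:pad + num_center] = True            # fully sampled center band
    return kspace * mask.to(kspace.dtype), mask  # zero out unselected columns
```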
To compensate for the undersampling, we used the 2019 fastMRI challenge winner, the Adaptive-CS-Net model [77]. Based on the Iterative Shrinkage-Thresholding Algorithm (ISTA) [78], this model consists of several trainable convolutional multi-scale transform blocks, between which several prior knowledge-based computations are implemented. For scalability reasons and without substantially impacting the reconstruction results, in this study, we trained a simplified light-weight version of the Adaptive-CS-Net model. The resulting model consists of only 10 trainable blocks and 267k parameters. Unlike the full Adaptive-CS-Net model with three MRI-specific physics-inspired priors, the simplified version has only one prior module between the reconstruction blocks: the soft data consistency step. Specifically, the update performed by each block is \begin{equation*} B_{i+1}(\textbf{x}_{i}) = \textbf{x}_{i} + \hat{\mathcal{U}}_{i}\left(\textit{soft}\left(\mathcal{U}_{i}(\textbf{x}_{i}, \textbf{e}_{i}), \lambda_{s, f_{s}}\right)\right), \tag{1}\end{equation*}
where the data-consistency error $\textbf{e}_{i}$ is computed in k-space as \begin{equation*} \textbf{e}_{i} = \mathcal{F}^{-1}(M\mathcal{F}\textbf{x}_{i} - M\textbf{y}), \tag{2}\end{equation*} with $M$ denoting the undersampling mask, $\mathcal{F}$ the Fourier transform, and $\textbf{y}$ the measured k-space data.
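A minimal sketch of the soft data-consistency computation from Eqs. (1)-(2), written for a single coil for clarity (the function and variable names are ours):

```python
import torch

def data_consistency_error(x, y, mask):
    """e_i = F^{-1}(M F x_i - M y) from Eq. (2), single-coil for clarity.

    x    : current complex-valued image estimate, shape (H, W)
    y    : measured (undersampled) k-space, shape (H, W)
    mask : sampling mask M (float), broadcastable to (H, W)
    """
    k_err = mask * torch.fft.fft2(x) - mask * y   # M F x_i - M y
    return torch.fft.ifft2(k_err)                 # back to image space

def soft(z, lam):
    """Complex soft-thresholding: shrink magnitudes by lam, keep the phase."""
    mag = torch.clamp(torch.abs(z) - lam, min=0.0)
    return torch.polar(mag, torch.angle(z))
```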
We trained the simplified Adaptive-CS-Net model using the RMSprop optimizer [79] to minimize the L1 loss between the reconstruction estimate and the ground truth image obtained from the fully sampled data. We used a step-wise learning rate decay schedule.
2) Motion Correction
The in-plane motion artefacts, including rigid translation and rotation, were introduced into the Fourier-transformed data following the procedure described in [80]. For each input image, the assumed echo-train length of the turbo spin-echo readout was chosen randomly in the range of 8 to 32. Similarly, the assumed extent of zero-padding of the k-space was also varied randomly.
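A simplified, magnitude-only sketch of this segmented k-space motion simulation is given below; the segmentation into contiguous echo trains and all parameter values are illustrative rather than a faithful re-implementation of [80].

```python
import numpy as np
from scipy import ndimage

def simulate_motion(image, etl=16, max_rot=2.0, max_shift=3.0, seed=0):
    """Fill each block of k-space lines from a randomly rotated/translated
    copy of the image, emulating one rigid pose per echo train.
    image : real-valued (magnitude) 2D array; parameter names are ours.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    corrupted_k = np.zeros((h, w), dtype=complex)
    for start in range(0, h, etl):                        # one pose per train
        angle = rng.uniform(-max_rot, max_rot)            # degrees
        shift = rng.uniform(-max_shift, max_shift, size=2)  # pixels
        moved = ndimage.shift(
            ndimage.rotate(image, angle, reshape=False, order=1), shift, order=1)
        k = np.fft.fftshift(np.fft.fft2(moved))
        corrupted_k[start:start + etl, :] = k[start:start + etl, :]
    return np.abs(np.fft.ifft2(np.fft.ifftshift(corrupted_k)))
```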
To compensate for the motion artefacts of various extent, we trained U-Net models [81] with 209k parameters. While more advanced architectures exist, we found the basic U-Net to be more than sufficient for the scope of the proposed IQA study, as it readily produces the kinds of imperfections often generated by deep learning models.
The model received the motion-corrupted data as the input and learned to predict the motion artefacts in a residual manner, i.e., the output of the model was a predicted image of the motion artefacts present in the input data. The model was trained to minimise the L1 loss between the ground-truth and the predicted residuals with the Adam [82] optimizer using a step-wise learning rate decay schedule.
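A minimal sketch of one training step under this residual formulation (assuming `unet` is any image-to-image network and `opt` an Adam optimizer; the names are ours):

```python
import torch
import torch.nn as nn

def train_step(unet, opt, corrupted, clean):
    """One optimization step of residual artefact prediction."""
    target_residual = corrupted - clean        # ground-truth artefact image
    pred_residual = unet(corrupted)            # network predicts the artefacts
    loss = nn.functional.l1_loss(pred_residual, target_residual)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# At inference time the corrected image is recovered by subtraction:
#   corrected = corrupted - unet(corrupted)
```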
3) Denoising
In our study, noisy magnitude images are generated from the complex-valued data by adding zero-mean Gaussian noise of varying amplification.
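A possible reading of this noise simulation is sketched below; the noise level parameterization and the way the amplification factor enters are our assumptions.

```python
import numpy as np

def add_noise(complex_image, amp=2, base_sigma=0.01, seed=0):
    """Add zero-mean Gaussian noise to the real and imaginary parts of a
    complex image and return the noisy magnitude; base_sigma and the use of
    the amplification factor are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    sigma = amp * base_sigma * np.abs(complex_image).max()
    noise = rng.normal(0, sigma, complex_image.shape) \
          + 1j * rng.normal(0, sigma, complex_image.shape)
    return np.abs(complex_image + noise)  # magnitude with Rician-like noise
```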
To compute the denoised images, we trained DnCNN models [84] with 556k parameters on the brain multi-coil training data using the RMSprop optimizer [79] and a step-wise learning rate decay schedule.
4) Final Dataset for Labeling
We started the formation of the labeling dataset from the clean volumes of the validation subsets of the brain and knee fastMRI datasets; hence, these scans were not used to train the artefact correction models. In total, both validation subsets contain 1,577 volumes, resulting in 28,977 images: 199 knee volumes with 7,135 slices and 1,378 brain volumes with 21,842 slices. In each brain volume, the bottom 2 and top 3 slices were discarded to restrict the analysis to clinically relevant parts of the scan. In each knee volume, the first 3 slices were discarded for the same reason. To limit the number of data points and decrease the overall variability of data types, we selected only T1-weighted (T1, T1-PRE, and T1-POST) brain volumes and proton-density-weighted without fat suppression (PD) knee volumes.
The data generation pipeline is summarized in Fig. 2.
Formation of Labeling and Re-labeling datasets for annotation (left) and the content of each group of images for the medical evaluation and labeling by radiologists in the Labeling dataset (right). Starting from the clean validation data, we first generate corrupted data with the acceleration artefacts, the motion artefacts, and the Gaussian noise. Then, we reconstruct the corrupted data using trained neural network models and randomly select scans to form labeling and re-labeling pairs for the experts to grade.
Using the selected subset of clean validation data, we simulated images for the reconstruction:
For the scan acceleration task, we simulated acceleration artefacts for undersampling rates of 2×, 4×, and 6×, following the data generation process from the fastMRI challenge;
For the motion correction task, we simulated motion artefacts of three different strengths using the rigid motion simulation framework described above;
For the denoising task, we simulated Gaussian noise with amplification factors of 1, 2 and 3 using the noise generation procedure described above.
After that, all generated corrupted data were reconstructed using the reconstruction models trained for the corresponding tasks. Note that we deliberately generated a fraction of data with parameters different from the ones used to train the reconstruction models. We found this approach yields various levels of artefacts typically appearing after the reconstruction process.
From the large pool of reconstructed images, we select 100 pairs of images (clean - reconstructed) for each task (scan acceleration, motion correction, denoising) and each anatomy (knee, brain), evenly distributing the data to represent each reconstruction parameter (e.g., the acceleration rate for the scan acceleration task). This strategy results in a labeling dataset of 600 pairs of images in total (3 tasks × 2 anatomies × 100 pairs).
To reach the goal labeling dataset size, we utilized the following data selection procedure:
Compute values of IQMs for all reconstructed images (for NR IQMs) or image pairs (for FR IQMs);
Normalize each IQM value to the [0, 1] range;
Sort all items by the value of variance;
Select 25% of data for each task-anatomy combination from the data items with the highest variance, assuming that items with the biggest disagreement between IQMs are the most informative;
Select the remaining 75% of the data pseudo-randomly (preserving the distribution of reconstruction parameters) to avoid introducing any bias from the variance computation; a compact sketch of this selection logic is given after the list.
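The sketch below illustrates the selection logic for one task-anatomy combination, assuming the IQM values are stored in a pandas dataframe with one column per metric and a 'param' column holding the reconstruction parameter (the layout and column names are ours):

```python
import pandas as pd

def select_items(df, metric_cols, n_total=100, frac_high_var=0.25, seed=0):
    """df: one row per reconstructed image of a given task-anatomy combination;
    metric_cols: the columns holding the IQM values."""
    # Normalize each IQM to [0, 1] and measure disagreement via the variance.
    normed = (df[metric_cols] - df[metric_cols].min()) / \
             (df[metric_cols].max() - df[metric_cols].min())
    ranked = df.assign(var=normed.var(axis=1)).sort_values("var", ascending=False)

    n_var = int(n_total * frac_high_var)
    top = ranked.head(n_var)                 # items with the largest disagreement
    rest = ranked.iloc[n_var:]
    per_param = (n_total - n_var) // rest["param"].nunique()
    rand = rest.groupby("param", group_keys=False).apply(
        lambda g: g.sample(min(per_param, len(g)), random_state=seed))
    return pd.concat([top, rand])            # 25% high-variance + 75% pseudo-random
```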
Lastly, we deliberately duplicated 100 of the 600 prepared items for the purpose of verification of radiologists’ self-consistency, resulting in 700 image pairs to be labelled by each radiologist.
B. Experiment Setup
Within the paradigm of the model observer framework [85], the quality of a medical image can be defined as how well a clinical task (e.g., diagnostics) can be performed on it [86]. This means that the perfect MRI IQM would be some task-based score, such as the diagnostic accuracy. However, such a metric is difficult to implement due to a great diversity of diagnostic outcomes that radiologists deal with in practice. Because of that, the convention is to use a subjective estimation of the overall diagnostic value instead [11].
However, we argue that a single score is not sufficient to reflect the abundance of anatomies, pathologies, and artefactual cases that the radiologists work with. Instead, we propose to subdivide the score of the overall diagnostic quality into three main criteria that can be important for a clinical practitioner to make their decision: i) perceived level of noise, ii) perceived level of soft-tissue contrast, and iii) presence of artefacts.
1) Subjective Evaluation
Seven trained radiologists with 7 to 20 years of experience took part in this study. The participants were asked to score pairs of reconstructed-reference images using three main IQ criteria. For each image pair and each criterion, radiologists scored the perceived diagnostic quality of the reconstructed image compared to the ground-truth using a four-point scale: not acceptable (1), weakly acceptable (2), rather acceptable (3), and fully acceptable (4). The four-point scale was selected over the five-point Likert scale, previously used in [11].
Each participant performed the labeling individually using a dedicated instance of the Label Studio [89] software accessible via a web interface. The experts were asked to make all judgments about the image quality with regard to a particular diagnostic task that they would normally perform in their practice (e.g., the ability to discriminate relevant tissues, the confidence in using the image to detect a pathology, etc.). The interface provided additional functionality of scaling (zooming) the images to mimic the real-life workflow more closely. The pairs of images were displayed in a random order until all pairs were labelled. Participants had an opportunity to re-label the pairs they had already scored at any point until the experiment was finished.
During the main part of the experiment, each participant labelled 600 pairs of images based on the 3 quality criteria, resulting in 4,200 annotated pairs and 12,600 labels in total. The results of the main labeling session were used for further evaluation of the IQMs. After finishing the main part of the experiment, the participants were asked to additionally label 100 randomly selected pairs from the same dataset, yielding additional 2,100 labels. The results of this additional re-labeling were used to evaluate the self-consistency of each annotator.
2) Metrics Computation
Unlike FR and NR IQMs, designed to compute an image-wise distance, the DB metrics compare distributions of sets of images. This makes them less practical for traditional IQA, the goal of which is to compute a score for a given image pair. Moreover, the need to have sets of images hinders the vote-based evaluation via the mean subjective opinion scores.
To address these problems, we adopt a different way of computing the DB IQMs. Instead of extracting features from the whole images, we crop them into overlapping tiles and compare the distribution of features extracted from the tiles of the reconstructed image against that of the reference image.
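A minimal sketch of the tiling step is shown below; the tile size and stride are illustrative, and the resulting tile sets of the reconstructed and the reference images are subsequently passed to a feature extractor and compared by the chosen DB metric.

```python
import torch

def extract_tiles(img, tile=96, stride=32):
    """Crop an image into overlapping tiles so that a single slice yields a
    *set* of samples for distribution-based metrics (FID, KID, ...).
    img: tensor (C, H, W); tile size and stride here are illustrative.
    Returns a tensor of shape (N, C, tile, tile)."""
    patches = img.unfold(1, tile, stride).unfold(2, tile, stride)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, img.shape[0], tile, tile)

# The two tile sets (reconstruction vs. reference) are then fed through a
# feature extractor (e.g., VGG16) before computing the DB metric.
```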
C. Data Analysis
Here, we adapt the analysis of the scoring data proposed in [24] to the multiple IQ criteria. The voting scores for each scoring criterion are not analyzed in their raw format. Instead, they are converted to z-scores (averaged and re-scaled to the range from 0 to 100 for each radiologist to account for their different scoring behavior):\begin{equation*} z_{nmk} = (D_{nmk} - \mu_{mk})/\sigma_{mk}, \tag{3}\end{equation*}
where $D_{nmk}$ is the difference between the score assigned to the reference and the score assigned to image $n$ by radiologist $m$ for criterion $k$,\begin{equation*} D_{nmk} = s_{mk,\text{ref}} - s_{nmk}, \tag{4}\end{equation*}
and $\mu_{mk}$ and $\sigma_{mk}$ are the mean and the standard deviation of $D_{nmk}$ over all images for a given radiologist and criterion.
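A minimal NumPy sketch of Eqs. (3)-(4) is given below; the array layout and names are ours, and the subsequent averaging over radiologists and rescaling to 0-100 are omitted.

```python
import numpy as np

def z_scores(s, s_ref):
    """Eqs. (3)-(4): per-radiologist standardization of score differences.

    s     : array (N, M, K) of raw scores; image n, radiologist m, criterion k
    s_ref : array (M, K) of reference scores per radiologist and criterion
    """
    d = s_ref[None, :, :] - s                 # D_{nmk}, Eq. (4)
    mu = d.mean(axis=0, keepdims=True)        # mu_{mk}
    sigma = d.std(axis=0, keepdims=True)      # sigma_{mk}
    return (d - mu) / sigma                   # z_{nmk}, Eq. (3)
```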
After standardizing the expert votes by Eq. (3), their correlation statistics with each IQM were computed in the form of the SRCC and KRCC coefficients, defined as\begin{equation*} \text{SRCC} = 1 - \frac{6 \sum_{i=1}^{n} d_{i}^{2}}{n (n^{2} - 1)}, \tag{5}\end{equation*}
where $d_{i}$ is the difference between the ranks of the $i$-th item in the subjective and the objective orderings and $n$ is the number of items, and\begin{equation*} \text{KRCC} = \frac{2}{n(n-1)}\sum_{i<j} \text{sign}(x_{i}-x_{j})\,\text{sign}(y_{i}-y_{j}), \tag{6}\end{equation*}
where $x_{i}$ and $y_{i}$ denote the subjective and the objective scores of the $i$-th item, respectively.
We use SRCC as the main measure of an IQM’s performance, due to the non-linear relationship between the subjective and the objective scores.4
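In practice, both coefficients can be computed directly with SciPy; note that SciPy's Kendall implementation defaults to the tau-b variant, which additionally corrects for ties. Dummy data and variable names below are purely illustrative.

```python
import numpy as np
from scipy import stats

# z: processed subjective scores, q: values of one IQM (dummy data here).
rng = np.random.default_rng(0)
z = rng.normal(size=100)
q = 0.8 * z + rng.normal(scale=0.5, size=100)

srcc, _ = stats.spearmanr(z, q)    # Eq. (5)
krcc, _ = stats.kendalltau(z, q)   # Eq. (6); SciPy's default tau-b handles ties
```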
The sizes of each batch of data are described in Fig. 2 (right).
A non-linear regression was performed to map the IQM scores onto the subjective quality scores using the five-parameter logistic function\begin{equation*} Q(x) = \beta_{1} \left({\frac{1}{2} - \frac{1}{1 + \exp(\beta_{2} (x - \beta_{3}))}}\right) + \beta_{4} x + \beta_{5},\quad \tag{7}\end{equation*}
where $x$ is the IQM value and $\beta_{1}, \ldots, \beta_{5}$ are the fitted parameters.
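A minimal sketch of fitting Eq. (7) with SciPy (the initial guess and the dummy data are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def q_fit(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic-plus-linear mapping from Eq. (7)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# x: IQM values, z: processed subjective scores (dummy data for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
z = 80 * x + rng.normal(scale=5, size=200)

p0 = [z.max(), 10.0, x.mean(), 0.0, z.mean()]          # illustrative initial guess
beta, _ = curve_fit(q_fit, x, z, p0=p0, maxfev=10000)
z_hat = q_fit(x, *beta)                                # fitted mapping
```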
Results
Figs. 3 and 4 and Table 1 summarize the correlation study between the radiologists’ scores and the IQM values for the three proposed evaluation criteria. The figures also show the results for the natural image domain. Top 4 performers in each category are marked in bold. The best and the worst examples of the reconstructions, as judged by different metrics, are presented in Fig. 5, and the aggregate scores for the top-performing metrics in each application in Fig. 6.
Performance of IQMs on different MRI tasks and on Natural Images (NI), compared by their correlation with the expert votes (SRCC values, left) and sorted top-to-bottom by their rank (right). The ordering reflects the performance on the MRI data only. The same color-coding is used in both plots. NI-generic scores are the average between TID2013 [87] and KADID-10k [88] datasets. Note higher correlation of IQMs on NI and poor translation of ranking to MRI domain. Refer to data in Table 1 for numerical values.
Relationship between processed subjective scores and IQM values for 3 evaluation criteria, 3 target tasks, and 2 anatomies (600 annotated image pairs in total). The solid lines are fits, plotted using the non-linear regression (7) on the subsets of images split by the tasks. The top 4 metrics (along with PSNR and SSIM, as the most commonplace) are shown in the decreasing order left to right, using SRCC to gauge the performance.
The best and the worst reconstruction-reference pairs according to different metrics (their values are shown in yellow). Note how the top 4 metrics (first four columns) reflect the actual reconstruction quality better than PSNR and SSIM (which are prone to misjudging a simple shift of brightness or a blur). The brightness is adjusted for viewer’s convenience.
Aggregate relationship between the objective and the subjective scores for 3 evaluation criteria (rows), 2 anatomies (columns), and 3 tasks: scan acceleration, motion correction, and denoising.
Discussion
The visual inspection of the outputs of the models in Fig. 5 makes it evident how the top metrics are superior to the conventional PSNR and SSIM in reflecting the actual reconstruction quality. The latter are known to misjudge shifts of brightness or blur, indicating high quality for bad images, whereas the more advanced FR and DB IQMs correlate well with the visual perception and the subjective scores. Hence, out of the 35 metrics considered, we only discuss the best ones, according to their rank in the correlation study (VSI, HaarPSI, DISTS, FIDVGG16), and the widely used PSNR and SSIM.
As the key observation in the first systematic study of the DB metrics, we affirm that the choice of the feature extractor plays a crucial role. In particular, the correlation scores show that the Inception-based features are almost always worse than those from VGG16 (except for the MSID metric). Moreover, we see that, despite having been designed for evaluating the realism of data produced by generative models, FID shows competitive SRCC scores, thus becoming a newly recommended metric for MRI image assessment tasks.
The non-linear relationship between the subjective and the objective scores, seen in Fig. 4, portrays intricate behavior with an evident dependence on the anatomy and the target task, as well as a clear clustering of the points, instrumental for selecting a proper metric in a particular application. Notable are the generally lower IQM correlation scores when the difficulty of the reconstruction routine increases (compare the trends in the scan acceleration data to those in the more complex denoising and motion correction models). Also, the evaluation values for the knee reconstruction are generally lower, which could be caused by the greater variety of anatomical structures present in the knee data, as well as the stricter pertinent medical evaluation criteria [33].
Fig. 6 aggregates the outcomes per task, anatomy, and evaluation criterion studied in our work, with the relation between the subjective and the objective scores highlighting the differences in the average performance of the top metrics. Notably, these selected IQMs have the highest correlation with expert judgment in the scan acceleration task. However, all metrics equally struggle to reflect the opinion of the radiologists in the denoising and, sometimes, in the motion correction tasks, especially on brain data. We also observe that some metrics perform consistently in terms of all three evaluation criteria and all tasks for a given anatomy. For instance, GMSD and DISTS, despite not being of the highest SRCC rank overall, still show consistently high correlation scores on knee data, which suggests both of them as universal choices for IQA in orthopedic applications. On the other hand, HaarPSI consistently rates the highest for both anatomies in the scan acceleration task, an instrumental fact to know when a single machine is used to scan various body parts or when the pertinent cross-anatomy inference [91] is performed.
A. Natural vs. MRI Images
A frequent IQA-related question is how generalizable are the performance benchmarks across different datasets and image domains. To study that, we analyzed the applicability of all 35 IQMs considered herein both in the MRI and the natural image (NI) domains (Table 1). For the latter, the popular TID2013 [87] and KADID-10k [88] datasets of NIs were used. Fig. 3 illustrates the effect of the shift between the NI and the MRI domains, featuring an expected drop of the correlation values for most metrics.5 However, the domain shift affects the ranks of the IQMs differently. Some top NI metrics, such as MDSI and MS-GMSD, naturally take lower standings in the MRI domain; however, others, such as HaarPSI and VSI, remain well-correlated with the radiologists’ perception of quality. Further examples of IQMs robust to the domain shift are DISTS and FIDVGG16.
B. Labeling Discrepancies and Self-Consistency Study
Another IQA-related question encountered in survey-based studies is the trustworthiness of the votes themselves. Given that only reputable radiologists were engaged in our labeling routine, we have no grounds for doubting their annotations as far as the domain knowledge is concerned. Therefore, feasible discrepancies among their votes can be assumed to originate either from such factors as the study design, its duration, and fatigue, or from prior experience, which sometimes forms an a posteriori intuition and, arguably, leads experts to decisions that differ from those of their peers.
While the latter is too subjective and difficult to regulate, the former could be controlled. We put effort into simplifying the user experience and allowed the radiologists to approach the labeling assignment in batches at their own pace (see Appendix C). The average lead time spent labeling a pair of images,6 an arguable indicator of the scrupulousness of an annotator, is plotted in Fig. 7, where we also summarize the results of the self-consistency study. The study reports Weighted Cohen’s Kappa scores, computed between the votes provided in the main and in the additional re-labeling experiments on the same data. Interestingly, there is no significant correlation between self-consistency and the labeling time, placing other factors mentioned above, such as individual experience, at the forefront.
Correlation between the subjective scores in labeling and re-labeling sessions on the same data, with each column/color corresponding to an individual radiologist. This plot shows scoring self-consistency of the experts and the average time spent labeling one pair of images. Apparently, the time spent on labeling is not the major factor affecting the self-consistency of experienced radiologists.
We also opted for evaluating the agreement between the radiologists’ opinions by assessing the monotonic correlation between the scores of each pair of experts.
Pair-wise Spearman’s rank correlation coefficient between the scores of individual radiologists.
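For reference, a minimal sketch of how such agreement statistics can be computed is shown below; the dummy data and the choice of linear weighting for Cohen's Kappa are our assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# votes_main, votes_relabel: one radiologist's scores (1-4) on the 100
# duplicated pairs; votes_a, votes_b: scores of two different radiologists.
rng = np.random.default_rng(0)
votes_main = rng.integers(1, 5, size=100)
votes_relabel = np.clip(votes_main + rng.integers(-1, 2, size=100), 1, 4)
votes_a, votes_b = rng.normal(size=(2, 600))

kappa = cohen_kappa_score(votes_main, votes_relabel, weights="linear")  # self-consistency
srcc, _ = stats.spearmanr(votes_a, votes_b)                             # inter-rater agreement
```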
In Fig. 7, the Weighted Cohen’s Kappa values correspond to moderate to substantial consistency of scoring (according to [93]). And, according to [92], the SRCC range in Fig. 8 corresponds to a strong agreement. Given the sufficiently trustworthy labeling, the spread of the correlation scores for the modern IQA metrics in Fig. 6, and the non-trivial correlation patterns in Fig. 4, one can conclude that the optimal MRI metric is yet to be devised.
Besides a blunt umbrella metric aggregating the top-performing predictions (e.g., those of VSI, HaarPSI, DISTS, and FIDVGG16), the future effort should be dedicated to additional forays into modeling the MRI-specific perception of the radiologists and to interpreting their assessment using formalized rules taken from the medical textbooks. Such interpretable metrics will be especially in demand, given the recent appearance of MRI sampling approaches aimed towards optimizing downstream tasks [94], including the recently annotated FastMRI dataset [95]. Another line of future work could be ‘borrowed’ from the NI domain, where the abundance of data has led to the emergence of several NR IQMs. Although, in our study, all such metrics (the classic BRISQUE [4] and the more recent PaQ-2-PiQ [62] and MetaIQA [63]) showed equally mediocre performance compared to the other IQMs, we believe their value in the MRI domain is bound to improve with the growth of available data.
Conclusion
This manuscript reports the most extensive study of the image quality metrics for Magnetic Resonance Imaging to date, evaluating 35 modern metrics and using 14,700 subjective votes from experienced radiologists.
The applicability of full-reference, no-reference, and distribution-based metrics is discussed from the standpoint of MRI-specific image reconstruction tasks (scan acceleration, denoising, and motion correction). Unlike previous IQA studies analyzing IQMs with manual distortions, we use the outputs of neural network models trained to perform these particular tasks, enabling a realistic evaluation. Unlike natural images, MRI scans are proposed to be assessed according to the most diagnostically influential criteria of MRI scan quality: signal-to-noise ratio, contrast-to-noise ratio, and the presence of artefacts.
The top performers – DISTS, HaarPSI, VSI, and FIDVGG16 – are found to be efficient across three proposed quality criteria, for all considered anatomies and the target tasks.
ACKNOWLEDGMENT
The authors acknowledge the effort of radiologists from the Philips Clinical Application Team (PD CEER) for their help with data labeling and thank the supporters of their GitHub Project (https://github.com/photosynthesis-team/piq/), where each metric was independently implemented and tested.
They also declare no conflict of interest and no personal bias towards particular image quality metrics.
Appendix B
Reconstruction Examples
During the labeling experiment, experts were asked to label pairs of images. Each pair contained a low-quality image placed side-by-side with a corresponding high-quality reference. Each low-quality image was obtained by, first, corrupting the corresponding reference and, then, by reconstructing it with the models trained to solve one of the tasks described in the main text (scan acceleration, motion compensation, or denoising). Fig. 9 showcases typical pairs of images used in the experiment. The following corruption parameters were used to generate the images: acceleration factor of 4, motion amplification factor of 0.6, noise amplification factor of 2. These parameters correspond to medium strength corruptions, showcasing possible imperfect reconstruction results.
Examples of corrupted images used as inputs to the reconstruction models (left column), the reconstruction results (middle column), and the artefact-free reference images (right column). Examples with medium strength corruptions are displayed to showcase possible imperfect reconstruction results.
Appendix C
Labeling User Interface
During the labeling experiment, the participants were asked to score pairs of reconstructed-reference images presented to them side-by-side in a web interface of the Label Studio [89]. The web interface is shown in Fig. 10.
Web interface of the Label Studio software released to the expert radiologists to perform the labeling. The participants selected their answers using the proposed scale from 1 to 4, rating the images based on each proposed IQA criterion.
The labeling was done using three main IQ criteria: the presence of artefacts, the perceived level of noise, and the perceived level of soft-tissue contrast. The participants were able to select their answers using the mouse pointer or keyboard shortcuts. During the quality assessment process, the participants were able to zoom images, re-label previously labeled examples, pause, and divide their evaluation session into as many labeling rounds as they wished. All labeling results were continuously saved on a remote server to eliminate the possibility of data loss. After the complete labeling process, the participants were offered a last chance to fix the scoring of the borderline examples.