
Image Quality Assessment for Magnetic Resonance Imaging


We study which image quality metrics are the best for assessing MRI scans.

Abstract:

Image quality assessment (IQA) algorithms aim to reproduce the human perception of image quality. The growing popularity of image enhancement, generation, and recovery models instigated the development of many methods to assess their performance. However, most IQA solutions are designed to predict image quality in the general domain, with the applicability to specific areas, such as medical imaging, remaining questionable. Moreover, the selection of these IQA metrics for a specific task typically involves intentionally induced distortions, such as manually added noise or artificial blurring; yet, the chosen metrics are then used to judge the output of real-life computer vision models. In this work, we aspire to fill these gaps by carrying out the most extensive IQA evaluation study for Magnetic Resonance Imaging (MRI) to date (14,700 subjective scores). We use outputs of neural network models trained to solve problems relevant to MRI, including image reconstruction in scan acceleration, motion correction, and denoising. Our emphasis is on reflecting the radiologist's perception of the reconstructed images, gauging the most diagnostically influential criteria for the quality of MRI scans: signal-to-noise ratio, contrast-to-noise ratio, and the presence of artefacts. Seven trained radiologists assess these distorted images, with their verdicts then correlated with 35 different image quality metrics (full-reference, no-reference, and distribution-based metrics considered). The top performers – DISTS, HaarPSI, VSI, and FIDVGG16 – are found to be efficient across the three proposed quality criteria, for all considered anatomies and the target tasks.
Published in: IEEE Access ( Volume: 11)
Page(s): 14154 - 14168
Date of Publication: 08 February 2023
Electronic ISSN: 2169-3536

This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
SECTION I.

Introduction

Image quality assessment (IQA) is a research area concerned with constructing accurate computational models to predict the perception of image quality by human subjects, the ultimate consumers of most image processing applications [1].

The growing popularity of image enhancement and image generation algorithms increases the need for a quality assessment of their performance. The demand has led to an abundance of IQA methods emerging over the last decades. The well-known full-reference (FR) metrics, such as MSE, PSNR, and SSIM [2], [3], became a de-facto standard in many computer vision applications. The more recent no-reference (NR) metrics, such as BRISQUE [4], have also found their use, especially when the ground truth images are absent or hard to access. Yet another class, the distribution-based (DB) metrics, earned the community's attention thanks to the advent of generative adversarial networks (GANs), enabling the quality assessment using distributions of thousands of images instead of gauging them individually. The popular new DB IQA methods include such metrics as Inception Score [5], FID [6], KID [7], MSID [8], and many others. Despite being widely used, the DB metrics were included neither in the recent large-scale general-domain reviews [9], [10] nor in the medical ones [11].

IQA measures are applied to estimate the quality of image processing algorithms and systems. For example, when several image denoising and restoration algorithms are available to recover images distorted by blur and noise contamination, a perceptual objective IQA could help pick the one that generates the best perceptual image quality after the restoration. To do that reliably, Image Quality Metrics (IQMs) need to show a high correlation with the perceptual estimates of the quality reported by human subjects for a given image processing algorithm. However, IQA algorithms are often evaluated on non-realistic distortions, such as added noise or artificial blurring [9], [10], [11]. Such a discrepancy between the synthetic evaluation and the practical use may cause misleading results.

While most metrics are designed to predict image quality in the general domain, Magnetic Resonance Imaging (MRI) provides gray-scale data with the content and style noticeably different from the natural images. Hence, the applicability of the IQMs in the MRI domain must be validated.

Moreover, IQMs trained on natural images attempt to describe the overall perception of the quality of an entire scene. On the contrary, an MRI scan can be perceived as high-quality when specific characteristics, responsible for the scan's value, are deemed adequate. These characteristics are deemed important components of the radiographic image quality [12], including the perceived level of noise (signal-to-noise ratio, SNR), the perceived soft tissue contrast (contrast-to-noise ratio, CNR), and the presence of artefacts. Unfortunately, none of the previous IQM studies considered them. Besides, these specific quality criteria are coupled. For example, some denoising algorithms tend to introduce additional blurring (lowering the CNR) in exchange for an increased SNR, and some motion correction approaches tend to introduce noticeable artefacts. Therefore, a more detailed evaluation of the IQMs' ability to express separate MRI quality criteria is required.

The remainder of this paper is structured as follows. After discussing the related work, we describe how we generate an image library that consists of disrupted and reference MRI image pairs. In Section IV-A, we provide a detailed description of data selection, corruption, and restoration processes that populate the image library with realistic yet diverse data. We then use the image library to survey expert radiologists and collect a set of labels to be then correlated with IQM values in Section IV-C. Finally, we report and discuss the results in Sections V and VI, where we indicate the top-performing metrics, and provide insights about their performance for different distortions, robustness to the domain shift, anatomies, and quality criteria. Section VII concludes the work by proposing the best IQA approaches for MRI. Appendices include a list of abbreviations (A), reconstruction examples (B), and a screenshot of the labeling user interface (C).

The main contributions of this paper are:

  • The most extensive study of IQA in medical imaging, in general, and in MRI, in particular (14,700 subjective scores collected). Unlike previous metric evaluation studies, we avoid artificially added distortions and assess the outputs of popular image restoration models instead. The assessment is based on the three proposed criteria and allows us to draw profound conclusions on what modern metrics can capture and when exactly they should be used.

  • To the best of our knowledge, we provide the first thorough study of the application of DB metrics for objective IQA of both natural and medical images. We evaluate their performance and show when they give an advantage over the common FR and NR metrics. We study the robustness of the metrics' performance across these two vastly different domains and show that the best-performing IQMs produce valuable results even when the data distribution drastically changes.

SECTION II.

Related Work

The evaluation of metrics for IQA in the domain of natural images started from the early task-specific works that considered FR methods to characterize color displays and half-toning optimization methods [13].

More recent task-specific studies explored IQA for the images of scanned documents [14] and screen content [15]. Likewise, fused images [16], smartphone photographs [17], remote sensing data [18], and climate patterns [19] demanded the development of targeted IQA approaches. Historically, many of these works have been focusing on the quality degradation caused by the compression algorithms [19], [20], [21], [22], with relatively small datasets appearing publicly for the IQ evaluation. However, the small dataset size and the excessive re-use of the same test sets have led to the promotion of the IQMs poorly generalizable to the unseen distortions.

This was recognized as a major problem, stimulating the emergence of large-scale studies [9], [23]. Among the large-scale evaluations, the majority compared multiple FR metrics, ranging from just a handful [24], [25], [26], [27] to several dozens [9], [28] of IQMs analyzed on popular datasets.

The medical domain stands out from the others by a special sense of what is deemed informative and acceptable in the images [29]. Resulting from years of training and practice, the perception of medical scan quality by adept radiologists relies on a meticulous list of anatomy-specific requirements, on their familiarity with particular imaging hardware, and even on their intuition.

Given that the majority of IQMs were not designed for the healthcare domain, some recent works have been dedicated to this niche. One small-scale study considered a connection of IQA of natural and medical images via SNR estimation [30]. Others assessed common FR IQMs using non-expert raters [31], [32]. Sufficient for the general audience, these methods proved incapable of reflecting the fine-tuned perception of the radiologists [33].

Expert raters were then engaged in [11] and [34]. The former studied only IQMs from the SSIM family and the latter assessed 10 FR IQMs, reporting that VIF [35], FSIM [36], and NQM [37] yield the highest correlation with the radiologists’ opinions.

On the other hand, Crow et al. argue that NR IQA is preferable for assessing medical images because there may be no perfect reference image in real-world medical imaging [38]. To address this issue, several recent studies also propose new NR IQMs for MRI image quality assessment [39], [40], [41], [42], [43], [44], [45], [46], [47], [48]. The recent survey [49] overviews MRI-specific IQMs and concludes that the number of available metrics is relatively low and that their development is hindered by the lack of publicly available datasets. Also, none of these new metrics have an open-source implementation, making verification of the claimed results problematic.

SECTION III.

Image Quality Metrics Considered

In this work, we evaluate the most widely used and publicly available general-purpose FR, NR, and DB IQMs to find the best algorithms for the quality assessment on arguably the most important MRI-related image-to-image tasks: scan acceleration, motion correction, and denoising. Instead of modeling the disrupted images, we use outputs of trained neural networks and compare them with the clean reference images from the fastMRI dataset.

Our study includes the following 35 metrics: 17 Full-Reference IQMs (PSNR, SSIM [2], MS-SSIM [50], IW-SSIM [51], VIF [35], GMSD [52], MS-GMSD [53], FSIM [36], VSI [54], MDSI [55], HaarPSI [56], Content and Style Perceptual Scores [57], LPIPS [58], DISTS [59], PieAPP [60], DSS [61]), 3 No-Reference IQMs (BRISQUE [4], PaQ-2-PiQ [62], MetaIQA [63]), and 15 Distribution-Based IQMs (KID [7], FID [6], GS [64], Inception Score (IS) [5], MSID [8], all implemented with three different feature extractors: Inception Net [65], VGG16, and VGG19 [66]). For brevity of presentation, we will showcase only the analysis of the four best-performing metrics, in the order of their ranking: VSI [54], HaarPSI [56], DISTS [59], and FIDVGG16 [6]. All metrics were re-implemented in Python to enable a fair comparison, with PyTorch Image Quality (PIQ) [67] chosen as the base library for implementing all metrics. The resulting implementations were verified to be consistent with the original implementations proposed by the authors of each metric.
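As an illustration, the snippet below is a minimal sketch (not the authors' exact evaluation pipeline) of computing several of the discussed FR metrics with the PIQ library; the tensor shapes and the grayscale-to-RGB repetition for the VGG-based metrics are our assumptions for single-channel MRI slices.

```python
import torch
import piq

x = torch.rand(1, 1, 320, 320)  # reconstructed slice, scaled to [0, 1]
y = torch.rand(1, 1, 320, 320)  # fully sampled reference slice

# DISTS (VGG-based) and VSI expect 3-channel inputs, so the grayscale
# slice is repeated along the channel axis.
x3, y3 = x.repeat(1, 3, 1, 1), y.repeat(1, 3, 1, 1)

scores = {
    "PSNR": piq.psnr(x, y, data_range=1.0),
    "SSIM": piq.ssim(x, y, data_range=1.0),
    "HaarPSI": piq.haarpsi(x, y, data_range=1.0),
    "VSI": piq.vsi(x3, y3, data_range=1.0),
    "DISTS": piq.DISTS()(x3, y3),  # a distance: lower means more similar
}
print({name: float(value) for name, value in scores.items()})
```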

Of note, in our survey we dismissed some recent results reported for the PIPAL dataset [68] during the 2021 NTIRE challenge [69], because the winners [70], [71], [72] had released no official implementations or model weights at the time of our experiments.

For the comparison, we collect 14,700 ratings from 7 trained radiologists to evaluate the quality of reconstructed images based on three main criteria of quality: perceived level of noise (SNR), perceived soft tissue contrast (CNR), and the presence of artefacts, making this work the most comprehensive study of MRI image quality assessment to date.

SECTION IV.

Medical Evaluation

The key goal of this study is to evaluate popular selected IQMs on MRI data. Previous works [11], [34] evaluated the ability of certain IQMs to assess the overall quality of data subjected to various types of artificial distortions.1 However, in practice, the overall image quality (IQ) rating may be insufficient due to its ambiguity: e.g., one could not truly interpret the reasons for poor or good scoring. At the same time, asking the medical experts these general questions may be challenging because of many factors, ranging from the specifics of certain clinical workflows to personal preferences.

In this work, we aspire to solve these problems by proposing the following study. First, we evaluate IQMs with regard to their ability to reflect radiologists' perception of the quality of distorted images, comparing them to the fully-sampled artefact-free ones. We rank the metrics based on three IQ criteria that are crucial for making clinical decisions: the perceived level of noise (SNR), the perceived level of contrast (CNR), and the presence of artefacts. Second, instead of corrupting images with artificial perturbations, for the first time in the community, we validate these metrics using the actual outputs of deep learning networks trained to solve common MRI-related tasks. As such, the artefacts originate from the imperfect solutions to the common real-world problems of motion correction, scan acceleration, and denoising.

A group of trained radiologists rated the quality of distorted images compared to the clean reference images on a scale from 1 to 4 for the three IQ criteria. Unlike the five-point Likert scale, the simplified scale balances the descriptiveness of the score with the noise in the votes of the radiologists. Our mock experiments showed that the respondents considered the selection between too many options difficult, with the five-point scale having a diluted difference between the options, whereas the three-point scale was deemed insufficient.

After the evaluation, the aggregated results were compared with the values of selected IQA algorithms to identify the top performers – the metrics that correlate the highest with the radiologists’ votes.

A. Image Library Generation

As a data source, we use the largest publicly available repository of raw multi-coil MRI k-space data – the FastMRI dataset, containing the knee and the brain scans [73], [74]. The knee subset of FastMRI contains 1,500 fully sampled MRIs acquired with a 2D protocol in the coronal direction with a 15-channel knee coil array on 3 and 1.5 Tesla Siemens MRI machines. The data consist of an approximately equal number of scans acquired using proton density weighting with (PDFS) and without (PD) fat suppression pulse sequences, with a pixel size of 0.5 mm × 0.5 mm and a slice thickness of 3 mm.

The knee subset is divided into 4 categories: train (973 volumes), validation (199 volumes), test (118 volumes), and challenge (104 volumes). Only the multi-coil scans were selected for this study, omitting the single-coil data.

The brain subset includes 6,970 1.5 and 3 Tesla scans collected on Siemens machines using T1, T1 post-contrast, T2, and FLAIR acquisitions. Unlike the knee subset, these data come in a wide variety of reconstruction matrix sizes. For the purpose of de-identification, the authors of the dataset limited the data to 2D axial images only and replaced the k-space slices ≳5 mm below the orbital rim with zero matrices. The brain subset is divided into 6 categories: train (4,469 volumes), validation (1,378 volumes), test 4× (281 volumes), test 8× (277 volumes), challenge 4× (303 volumes), and challenge 8× (262 volumes).

Starting with the clean knee and brain data, we first generate images corrupted with three types of distortions: scan acceleration, motion, and noise. Examples of the distorted images are presented in Fig. 1. After that, we train two reconstruction models for each type of distortion using PyTorch [75]. The first model is trained until the validation loss stabilizes. The second model is trained for half as long to purposely produce the imperfect reconstructions oftentimes encountered in practice. Examples of corrupted images and the corresponding reconstructions can be found in Appendix B. The reduced training time was a conscious choice, enabling the model to produce some visible reconstruction errors. More specifically, we interrupted the training when 90% of the loss plateau was reached, which yields well-performing models with the imperfections we wanted to test for.2 Finally, we use the trained models to reconstruct the corrupted images in the validation subset of the FastMRI, from which we generate the labeling dataset.

FIGURE 1. Distortions introduced to initial artefact-free scans during training and inference. Using the raw k-space data of the reference images, we undersample them with the acceleration factor of 4, impose rigid motion of a moderate amplitude, and introduce mild Gaussian noise. Note how the distortions differ from those in the natural images, on which the common IQMs were developed. We adjusted brightness for the viewer's convenience.

1) Scan Acceleration

Scan-acceleration data are generated from the ground truth images by undersampling the k-space data. To train the model, we selected only T1-weighted scans (T1, T1-PRE, and T1-POST) from the train category of the FastMRI brain data. The same subset of data was used for training the motion correction and denoising models. The k-space data were subsampled using a Cartesian mask, where k-space lines are set to zero in the phase-encoding direction. The sampled lines are selected randomly, with the total sampling density depending on the chosen acceleration rate. Following the data generation process from the FastMRI challenge [73], all masks are fully sampled in the central area of k-space (the low frequencies). For the 4× accelerated scans, this corresponds to 8% of the lines, and for the 8× acceleration, it equals 4%. Besides making the reconstruction problem easier to solve, such lines allow computing the low-pass filtered versions of the images for estimating the coil sensitivity maps.
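A minimal sketch of this masking scheme follows, assuming a NumPy k-space array with the phase-encoding dimension last; the helper name and shapes are illustrative and do not reproduce the official fastMRI code.

```python
import numpy as np

def cartesian_mask(n_lines, acceleration=4, center_fraction=0.08, rng=None):
    """Random phase-encoding mask with a fully sampled low-frequency band."""
    rng = rng or np.random.default_rng()
    n_center = int(round(n_lines * center_fraction))
    # Probability for the remaining lines so that the overall sampling
    # density matches the requested acceleration rate.
    prob = (n_lines / acceleration - n_center) / (n_lines - n_center)
    mask = rng.random(n_lines) < prob
    center_start = (n_lines - n_center) // 2
    mask[center_start:center_start + n_center] = True  # keep the k-space center
    return mask

kspace = np.random.randn(15, 640, 368) + 1j * np.random.randn(15, 640, 368)
mask = cartesian_mask(kspace.shape[-1], acceleration=4)
undersampled = kspace * mask  # broadcasting zeroes the masked PE lines
```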

To compensate for the undersampling, we used the 2019 FastMRI challenge winner, the Adaptive-CS-Net model [77]. Based on the Iterative Shrinkage-Thresholding Algorithm (ISTA) [78], this model consists of several trainable convolutional multi-scale transform blocks, between which several prior knowledge-based computations are implemented. For scalability reasons and without substantially impacting the reconstruction results, in this study, we trained a simplified light-weight version of the Adaptive-CS-Net model. The resulting model consists of only 10 trainable blocks and 267k parameters. Unlike the full Adaptive-CS-Net model with three MRI-specific physics-inspired priors, the simplified version has only one prior module between the reconstruction blocks – the soft data consistency step. Specifically, the update for the block $B_{i+1}$ in the simplified Adaptive-CS-Net model is defined as follows:
\begin{equation*} B_{i+1}(\mathbf{x}_{i}) = \mathbf{x}_{i} + \hat{\mathcal{U}}_{i}(\mathit{soft}(\mathcal{U}_{i}(\mathbf{x}_{i}, \mathbf{e}_{i}), \lambda_{s, f_{s}})), \tag{1}\end{equation*}
where $\mathbf{x}_{i}$ denotes the $i$-th estimate of the reconstruction, and $\mathcal{U}_{i}$ and $\hat{\mathcal{U}}_{i}$ are the multi-scale transform and its inverse, consisting of 2D convolutions and a nonlinearity in the form of Leaky-ReLU. The feature maps produced at the different scales are thresholded using the soft-max function $\mathit{soft}(\cdot)$,3 parameterized by a learned parameter $\lambda_{s, f_{s}}$ for each feature channel $f_{s}$ and scale $s$. In Eq. (1), the soft data consistency step $\mathbf{e}_{i}$ is defined as follows:
\begin{equation*} \mathbf{e}_{i} = \mathcal{F}^{-1}(M\mathcal{F}\mathbf{x}_{i} - M\mathbf{y}), \tag{2}\end{equation*}
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, and $M\mathbf{y}$ is the data measured with the sampling mask $M$.
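For illustration, below is a sketch of the soft data-consistency step of Eq. (2) for single-coil 2D data; multi-coil handling and coil sensitivity maps are omitted for brevity.

```python
import torch

def data_consistency(x_i: torch.Tensor, y: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x_i: complex image estimate, y: measured k-space, mask: 0/1 sampling mask."""
    k_est = torch.fft.fftshift(torch.fft.fft2(torch.fft.ifftshift(x_i)))
    residual = mask * k_est - mask * y  # mismatch on the sampled lines only
    return torch.fft.fftshift(torch.fft.ifft2(torch.fft.ifftshift(residual)))
```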

We trained the simplified Adaptive-CS-Net model using the RMSprop optimizer [79] to minimize the L1 loss function between the reconstruction estimate and the ground truth image obtained from the fully sampled data. We used a step-wise learning rate decay of $10^{-4}$ and a batch size of 8 to reconstruct the data for various acceleration factors (from 2× to 8×).

2) Motion Correction

The in-plane motion artefacts, including rigid translation and rotation, were introduced into the Fourier-transformed data following the procedure described in [80]. For each input image, the assumed echo-train length of the turbo spin-echo readout was chosen randomly in the 8–32 range. Similarly, the assumed extent of zero-padding in k-space was chosen randomly in the range of 0–100. The motion trajectories (translation/rotation vectors as a function of scan time) were generated randomly to simulate realistic artefacts. In this study, we utilized the protocol for "sudden motion" simulation. Here, the subject is assumed to lie still for a large part of the examination, until a swift translation or rotation of the head occurs. The time point of the sudden motion was taken randomly as a fraction of the total scan time in the range of one-third to seven-eighths. The maximum magnitude of the motion was chosen randomly from the range of [1, 4] pixels for the translation and [0.5, 4.0] degrees for the rotation artefacts. The center of rotation was also varied randomly in the range of [0, 100] pixels in each direction. These parameter ranges were selected empirically to generate a large variety of realistic artefacts and were used consistently in the training and in the validation runs.
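The snippet below is a heavily simplified sketch of the "sudden motion" idea: k-space lines acquired after a random time point come from a translated copy of the image. Echo-train grouping, rotations, and the k-space zero-padding from [80] are omitted, and the sequential line-acquisition order is an assumption.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def sudden_motion(image, rng=None):
    rng = rng or np.random.default_rng()
    n_pe = image.shape[1]                       # phase-encoding lines
    t = int(n_pe * rng.uniform(1 / 3, 7 / 8))   # moment of the sudden motion
    dx, dy = rng.uniform(1, 4, size=2)          # translation magnitude, pixels
    moved = nd_shift(image, (dx, dy), order=1)
    k_still, k_moved = np.fft.fft2(image), np.fft.fft2(moved)
    # Lines before the motion come from the still image, the rest from the moved one.
    k = np.concatenate([k_still[:, :t], k_moved[:, t:]], axis=1)
    return np.abs(np.fft.ifft2(k))
```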

To compensate for the motion artefacts of various extent, we trained U-Net models [81] with 209k parameters. While more advanced architectures exist, we found the basic U-Net to be more than sufficient for the scope of the proposed IQA study, as it is enough to capture imperfections which are often generated by deep learning models.

The model received the motion-corrupted data as the input and learned to predict the motion artefacts in a residual manner, i.e., the output of the model was a predicted image of the motion artefacts present in the input data. The model was trained to minimize the L1 loss between the ground truth and the predicted residual with the Adam optimizer [82], using a step-wise learning rate decay of $10^{-4}$ and a batch size of 8. Preserving the same nature of artefacts, we trained our models for a range of amplification factors (from 1 to 3). For that, throughout the training, the motion amplitude was scaled by the amplification factor, yielding a consistently diverse appearance of the motion artefacts that could be met in practice.

3) Denoising

In our study, noisy magnitude images are generated from the complex k-space data, with the Gaussian distribution taken as the representative noise model. Below, the standard deviation of the Gaussian noise is reported for a region of interest in the background of the magnitude image, as proposed in [83]. The parameters of the noise distribution for each volume are drawn from the last slice of this volume. Then, Gaussian noise with the estimated distribution parameters is generated, scaled by an amplification factor, and added to all images of the volume. We used the amplification factor of 2 for the training and the amplification factors of 1, 2, and 3 for the test data generation to enrich the variety of the tested image qualities in the resulting dataset.
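A simplified magnitude-domain sketch of this noise injection follows; the background ROI coordinates are assumptions, and the complex k-space handling of the actual pipeline is omitted.

```python
import numpy as np

def add_scaled_noise(volume, amplification=2.0, rng=None):
    """volume: (slices, H, W) magnitude images."""
    rng = rng or np.random.default_rng()
    background = volume[-1, :32, :32]   # assumed background corner ROI, last slice
    sigma = background.std()            # noise level estimate, in the spirit of [83]
    return volume + rng.normal(0.0, amplification * sigma, size=volume.shape)
```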

To compute the denoised images, we trained DnCNN models [84] with 556k parameters on the brain multi-coil train data, using the RMSprop optimizer [79] and a step-wise learning rate decay of $10^{-4}$ with a batch size of 8. Similarly to the other tasks considered herein, we are not looking for the most powerful denoising algorithm but consider the commonplace DnCNN model instead, merely to rank the modern IQA metrics for the specific task of denoising.

4) Final Dataset for Labeling

We started the formation of the labeling dataset from the clean volumes from the validation subsets of brain and knee FastMRI datasets; hence, these scans were not used to train the artefacts correction models. In total, both validation subsets contain 1,577 volumes, resulting in 28,977 images: 199 knee volumes with 7,135 slices and 1,378 brain volumes with 21,842 slices. In each brain volume, the lower 2 and top 3 slices were discarded to restrict the analysis to clinically relevant parts of the scan. In each knee volume, the first 3 slices were discarded for the same reason. To limit the number of data points and decrease the overall variability of data types, we selected only T1-weighted (T1, T1-PRE and T1-POST) brain volumes and proton-density weighted without fat suppression (PD) knee volumes.

The data generation pipeline is summarized in Fig. 2.

FIGURE 2. Formation of Labeling and Re-labeling datasets for annotation (left) and the content of each group of images for the medical evaluation and labeling by radiologists in the Labeling dataset (right). Starting from the clean validation data, we first generate corrupted data with the acceleration artefacts, the motion artefacts, and the Gaussian noise. Then, we reconstruct the corrupted data using trained neural network models and randomly select scans to form labeling and re-labeling pairs for the experts to grade.

Using the selected subset of clean validation data, we simulated images for the reconstruction:

  • For the scan acceleration task, we simulated acceleration artefacts for undersampling rates of 2×, 4×, and 6×, following the data generation process from the FastMRI challenge;

  • For the motion correction task, we simulated motion artefacts of three different strengths using the rigid motion simulation framework described above;

  • For the denoising task, we simulated Gaussian noise with amplification factors of 1, 2 and 3 using the noise generation procedure described above.

After that, all generated corrupted data were reconstructed using the reconstruction models trained for the corresponding tasks. Note that we deliberately generated a fraction of data with parameters different from the ones used to train the reconstruction models. We found this approach yields various levels of artefacts typically appearing after the reconstruction process.

From the large pool of reconstructed images, we select 100 pairs of images (clean – reconstructed) for each task (scan acceleration, motion correction, denoising) and each anatomy (knee, brain), evenly distributing the data to represent each reconstruction parameter (e.g., the acceleration rate for the scan acceleration task). This strategy results in a labeling dataset of 600 pairs of images in total (3 tasks × 2 anatomies).

To reach the target labeling dataset size, we utilized the following data selection procedure (a code sketch follows the list):

  1. Compute values of IQMs for all reconstructed images (for NR IQMs) or image pairs (for FR IQMs);

  2. Normalize each IQM value to [0, 1];

  3. Compute the variance across the normalized IQM values for each item;

  4. Sort all items by the value of variance;

  5. Select 25% of data for each task-anatomy combination from the data items with the highest variance, assuming that items with the biggest disagreement between IQMs are the most informative;

  6. Select the remaining 75% of the data pseudo-randomly (preserving the distribution of reconstruction parameters) to avoid introducing any bias from the variance computation.
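Below is a sketch of steps 1–6: per-IQM min-max normalization, per-item variance across IQMs as a disagreement measure, the top quartile taken by disagreement, and a pseudo-random remainder. `iqm_values` is an assumed (n_items, n_metrics) array of pre-computed metric values, and the stratification by reconstruction parameters in step 6 is omitted for brevity.

```python
import numpy as np

def select_items(iqm_values, n_select, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = iqm_values.min(axis=0), iqm_values.max(axis=0)
    normalized = (iqm_values - lo) / (hi - lo)   # step 2: scale each IQM to [0, 1]
    disagreement = normalized.var(axis=1)        # step 3: variance across IQMs
    order = np.argsort(disagreement)[::-1]       # step 4: sort by disagreement
    n_top = n_select // 4                        # step 5: 25% with highest variance
    rest = rng.choice(order[n_top:], size=n_select - n_top, replace=False)  # step 6
    return np.concatenate([order[:n_top], rest])
```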

Lastly, we deliberately duplicated 100 of the 600 prepared items for the purpose of verification of radiologists’ self-consistency, resulting in 700 image pairs to be labelled by each radiologist.

B. Experiment Setup

Within the paradigm of the model observer framework [85], the quality of a medical image can be defined as how well a clinical task (e.g., diagnostics) can be performed on it [86]. This means that the perfect MRI IQM would be some task-based score, such as the diagnostic accuracy. However, such a metric is difficult to implement due to a great diversity of diagnostic outcomes that radiologists deal with in practice. Because of that, the convention is to use a subjective estimation of the overall diagnostic value instead [11].

However, we argue that a single score is not sufficient to reflect the abundance of anatomies, pathologies, and artefactual cases that the radiologists work with. Instead, we propose to subdivide the score of the overall diagnostic quality into three main criteria that can be important for a clinical practitioner to make their decision: i) perceived level of noise, ii) perceived level of soft-tissue contrast, and iii) presence of artefacts.

1) Subjective Evaluation

Seven trained radiologists with 7 to 20 years of experience took part in this study. The participants were asked to score pairs of reconstructed-reference images using three main IQ criteria. For each image pair and each criterion, radiologists scored the perceived diagnostic quality of the reconstructed image compared to the ground-truth using a four-point scale: not acceptable (1), weakly acceptable (2), rather acceptable (3), and fully acceptable (4). The four-point scale was selected over the five-point Likert scale, previously used in [11].

Each participant performed the labeling individually using a dedicated instance of the Label Studio [89] software accessible via a web interface. The experts were asked to make all judgments about the image quality with regard to a particular diagnostic task that they would normally perform in their practice (e.g., the ability to discriminate relevant tissues, the confidence in using the image to detect a pathology, etc.). The interface provided the additional functionality of scaling (zooming) the images to more closely mimic the real-life workflow. The pairs of images were displayed in a random order until all pairs were labelled. Participants had an opportunity to re-label the pairs they had already scored at any point until the experiment was finished.

During the main part of the experiment, each participant labelled 600 pairs of images based on the 3 quality criteria, resulting in 4,200 annotated pairs and 12,600 labels in total. The results of the main labeling session were used for further evaluation of the IQMs. After finishing the main part of the experiment, the participants were asked to additionally label 100 randomly selected pairs from the same dataset, yielding additional 2,100 labels. The results of this additional re-labeling were used to evaluate the self-consistency of each annotator.

2) Metrics Computation

Unlike FR and NR IQMs, designed to compute an image-wise distance, the DB metrics compare distributions of sets of images. This makes them less practical for traditional IQA, the goal of which is to compute a score for a given image pair. Moreover, the need to have sets of images hinders the vote-based evaluation via the mean subjective opinion scores.

To address these problems, we adopt a different way of computing the DB IQMs. Instead of extracting features from the whole images, we crop them into overlapping tiles of size 96×96 with a stride of 32. This pre-processing allows us to treat each pair of images as a pair of distributions of tiles, enabling further comparison. The other stages of computing the DB IQMs are kept intact.
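A minimal sketch of this tiling step follows, using PyTorch's unfold; the helper is illustrative and covers only the cropping logic, with the feature extraction and the DB metric computation unchanged downstream.

```python
import torch

def to_tiles(img: torch.Tensor, size: int = 96, stride: int = 32) -> torch.Tensor:
    """img: (1, C, H, W) -> (n_tiles, C, size, size) of overlapping crops."""
    tiles = img.unfold(2, size, stride).unfold(3, size, stride)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, img.shape[1], size, size)

x_tiles = to_tiles(torch.rand(1, 1, 320, 320))  # reconstructed image as a tile set
y_tiles = to_tiles(torch.rand(1, 1, 320, 320))  # reference image as a tile set
```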

C. Data Analysis

Here, we adapt the analysis of the scoring data proposed in [24] to the multiple IQ criteria. The voting scores for each scoring criterion are not analyzed in their raw format. Instead, they are converted to z-scores (averaged and re-scaled from 0 to 100 for each radiologist to account for their different scoring):
\begin{equation*} z_{nmk} = (D_{nmk} - \mu_{mk})/\sigma_{mk}, \tag{3}\end{equation*}
where $\mu_{mk}$ and $\sigma_{mk}$ are the mean and the standard deviation of the difference scores of the $m$-th radiologist on the $k$-th scoring criterion, and $D_{nmk}$ are the difference scores for the $n$-th degraded image, defined as follows:
\begin{equation*} D_{nmk} = s_{mk,\text{ref}} - s_{nmk}. \tag{4}\end{equation*}
In Eq. (4), $s_{mk,\text{ref}}$ is the raw score of the $m$-th radiologist on the $k$-th scoring criterion for the reference image corresponding to the $n$-th degraded image, and $s_{nmk}$ is the raw score of the $m$-th radiologist on the $n$-th degraded image on the $k$-th scoring criterion. Note that in this study, the radiologists were asked to perform a pair-wise comparison between degraded and reference images. Hence, it is possible to treat the raw labeling scores as the difference scores $D_{nmk}$.
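For concreteness, a sketch of Eq. (3) follows; `scores` is an assumed array of shape (n_images, n_radiologists, n_criteria) holding the pair-wise labels treated as difference scores.

```python
import numpy as np

def z_scores(scores):
    mu = scores.mean(axis=0, keepdims=True)    # mean over images, per (m, k)
    sigma = scores.std(axis=0, keepdims=True)  # std over images, per (m, k)
    return (scores - mu) / sigma
```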

After standardizing the expert votes by Eq. (3), their correlation statistics with each IQM were computed in the form of SRCC and KRCC coefficients, defined as follows:
\begin{equation*} \text{SRCC} = 1 - \frac{6 \sum_{i=1}^{n} d_{i}^{2}}{n (n^{2} - 1)}, \tag{5}\end{equation*}
where $d_{i}$ is the difference between the $i$-th image's ranks in the objective and the subjective ratings and $n$ is the number of observations;
\begin{equation*} \text{KRCC} = \frac{2}{n(n-1)}\sum_{i<j} \text{sign}(x_{i}-x_{j})\,\text{sign}(y_{i}-y_{j}), \tag{6}\end{equation*}
where $(x_{1}, y_{1}),\ldots,(x_{n}, y_{n})$ are the observations: the objective and the subjective score pairs.

We use SRCC as the main measure of an IQM performance, due to the non-linear relationship between the subjective and the objective scores.4
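In practice, Eq. (5) is the standard Spearman coefficient and Eq. (6) the Kendall coefficient; a sketch with SciPy follows. Note that scipy.stats.kendalltau computes the tie-corrected tau-b variant, which coincides with Eq. (6) in the absence of ties; the arrays below are placeholders for a metric's values and the averaged z-scores.

```python
import numpy as np
from scipy import stats

iqm_scores = np.random.rand(600)   # objective scores for 600 image pairs (placeholder)
subj_scores = np.random.rand(600)  # averaged subjective z-scores (placeholder)

srcc, _ = stats.spearmanr(iqm_scores, subj_scores)
krcc, _ = stats.kendalltau(iqm_scores, subj_scores)
```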

The sizes of each batch of data are described in Fig. 2 (right).

A non-linear regression was performed on the IQM scores according to the quality $Q$ to fit the subjective votes:
\begin{equation*} Q(x) = \beta_{1} \left({\frac{1}{2} - \frac{1}{1 + \exp(\beta_{2} (x - \beta_{3}))}}\right) + \beta_{4} x + \beta_{5}, \tag{7}\end{equation*}
where $x$ are the original IQM scores and $\beta_{1}, \ldots, \beta_{5}$ are the fitting coefficients.
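A sketch of fitting the mapping of Eq. (7) with SciPy's curve_fit; the initial parameter guesses and the input arrays are placeholders, not the study's actual fitting configuration.

```python
import numpy as np
from scipy.optimize import curve_fit

def q(x, b1, b2, b3, b4, b5):
    # Logistic term plus a linear term, as in Eq. (7).
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

x = np.random.rand(600)                   # IQM scores (placeholder)
y = 2.0 * x + 0.1 * np.random.randn(600)  # subjective scores (placeholder)
beta, _ = curve_fit(q, x, y, p0=[1.0, 1.0, float(np.median(x)), 1.0, 0.0], maxfev=10000)
```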

SECTION V.

Results

Figs. 3 and 4 and Table 1 summarize the correlation study between the radiologists’ scores and the IQM values for the three proposed evaluation criteria. The figures also show the results for the natural image domain. Top 4 performers in each category are marked in bold. The best and the worst examples of the reconstructions, as judged by different metrics, are presented in Fig. 5, and the aggregate scores for the top-performing metrics in each application in Fig. 6.

TABLE 1. SRCC Values of all 35 Metrics on Natural and MRI Data. Top 4 Performers in all Categories are Marked in Bold. * Denotes Values Taken Directly From [90]
FIGURE 3. Performance of IQMs on different MRI tasks and on Natural Images (NI), compared by their correlation with the expert votes (SRCC values, left) and sorted top-to-bottom by their rank (right). The ordering reflects the performance on the MRI data only. The same color-coding is used in both plots. NI-generic scores are the average between the TID2013 [87] and KADID-10k [88] datasets. Note the higher correlation of IQMs on NI and the poor translation of their ranking to the MRI domain. Refer to Table 1 for numerical values.

FIGURE 4. Relationship between processed subjective scores and IQM values for 3 evaluation criteria, 3 target tasks, and 2 anatomies (600 annotated image pairs in total). The solid lines are fits, plotted using the non-linear regression (7) on the subsets of images split by the tasks. The top 4 metrics (along with PSNR and SSIM, as the most commonplace) are shown in decreasing order left to right, using SRCC to gauge the performance.

FIGURE 5. The best and the worst reconstruction-reference pairs according to different metrics (their values are shown in yellow). Note how the top 4 metrics (first four columns) reflect the actual reconstruction quality better than PSNR and SSIM (which are prone to misjudging a simple shift of brightness or a blur). The brightness is adjusted for the viewer's convenience.

FIGURE 6. Aggregate relationship between the objective and the subjective scores for 3 evaluation criteria (rows), 2 anatomies (columns), and 3 tasks: scan acceleration (blue circles), denoising (orange triangles), and motion correction (green squares). The IQMs are ordered by decreasing average SRCC for the artefacts criterion on the brain data. This order is kept throughout all results for consistency. Note the tendency of the metrics to perform poorly in some task-anatomy combinations, e.g., in denoising the brain data.

SECTION VI.

Discussion

The visual inspection of the outputs of the models in Fig. 5 makes it evident how the top metrics are superior in reflecting the actual reconstruction quality over the conventional PSNR and SSIM. The latter are known to misjudge shifts of brightness or a blur, indicating high quality for bad images, whereas the more advanced FR and DB IQMs correlate well with the visual perception and the subjective scores. In what follows, out of the 35 metrics considered, we only discuss the best ones, according to their rank in the correlation study (VSI, HaarPSI, DISTS, FIDVGG16), and the widely used PSNR and SSIM.

As the key observation in the first systematic study of the DB metrics, we affirm that the choice of the feature extractor plays a crucial role. In particular, the correlation scores show that the Inception-based features are almost always worse than those from VGG16 (except for the MSID metric). Moreover, we see that, despite having been designed for evaluating the realism of generative models' outputs, FID shows competitive SRCC scores, thus becoming a newly recommended metric for MRI image assessment tasks.

The non-linear relationship between the subjective and the objective scores, seen in Fig. 4, portrays intricate behavior with evident dependence on the anatomy and the target task, as well as a clear clustering of the points, instrumental for selecting a proper metric in a particular application. Notable are the generally lower IQM correlation scores when the difficulty of the reconstruction routine increases (compare the trends in the scan acceleration data to those in the more complex denoising and motion correction models). Also, the evaluation values for the knee reconstruction are generally lower, which could be caused by the greater variety of anatomical structures present in the knee data, as well as by the stricter pertinent medical evaluation criteria [33].

Fig. 6 aggregates the outcomes per each task, anatomy, and evaluation criterion studied in our work, with the relation between the subjective and the objective scores highlighting the differences in the average performance of the top metrics. Notably, these selected IQMs have the highest correlation with expert judgment in the scan acceleration task. However, all metrics equally struggle to reflect the opinion of the radiologists in the denoising and, sometimes, in the motion correction tasks, especially on the brain data. We also observe that some metrics perform consistently in terms of all three evaluation criteria and all tasks for a given anatomy. For instance, GMSD and DISTS, despite not being of the highest SRCC rank overall, still show consistently high correlation scores on the knee data, which proffers both of them as universal choices for the IQA in orthopedic applications. On the other hand, HaarPSI consistently rates the highest for both anatomies in the scan acceleration task, an instrumental fact to know when a single machine is used to scan various body parts or when the pertinent cross-anatomy inference [91] is performed.

A. Natural vs. MRI Images

A frequent IQA-related question is how generalizable are the performance benchmarks across different datasets and image domains. To study that, we analyzed the applicability of all 35 IQMs considered herein both in the MRI and the natural image (NI) domains (Table 1). For the latter, the popular TID2013 [87] and KADID-10k [88] datasets of NIs were used. Fig. 3 illustrates the effect of the shift between the NI and the MRI domains, featuring an expected drop of the correlation values for most metrics.5 However, the domain shift affects the ranks of the IQMs differently. Some top NI metrics, such as MDSI and MS-GMSD, naturally take lower standings in the MRI domain; however, others, such as HaarPSI and VSI, remain well-correlated with the radiologists’ perception of quality. Further examples of IQMs robust to the domain shift are DISTS and FIDVGG16.

B. Labeling Discrepancies and Self-Consistency Study

Another IQA-related question encountered in survey-based studies is the trustworthiness of the votes themselves. Given that only reputable radiologists were engaged in our labeling routine, we have no grounds for doubting their annotations as far as the domain knowledge is concerned. Therefore, feasible discrepancies among their votes can be assumed to originate either from such factors as the study design, its duration, and fatigue, or from a previous experience which sometimes forms a posteriori intuition and, allegedly, influences the experts to make decisions different from the others.

While the latter is too subjective and difficult to regulate, the former could be controlled. We put effort into simplifying the user experience and allowed the radiologists to approach the labeling assignment in batches at their own pace (see Appendix C). The average lead time spent labeling a pair of images,6 an arguable indicator of the scrupulousness of an annotator, is plotted in Fig. 7, where we also summarize the results of the self-consistency study. The study reports Weighted Cohen's Kappa scores, computed between the votes provided in the main and in the additional re-labeling experiments on the same data. Interestingly, there is no significant correlation between self-consistency and the labeling time, placing the other factors mentioned above, such as individual experience, at the forefront.
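For reference, the per-annotator self-consistency can be computed as sketched below, using scikit-learn's Weighted Cohen's Kappa between the main-session and re-labeling-session scores on the same pairs; the quadratic weighting and the placeholder labels are assumptions.

```python
from sklearn.metrics import cohen_kappa_score

main_session = [4, 3, 3, 2, 4, 1]     # 4-point scores, main session (placeholder)
relabel_session = [4, 3, 2, 2, 4, 2]  # same pairs, re-labeling session (placeholder)
kappa = cohen_kappa_score(main_session, relabel_session, weights="quadratic")
```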

FIGURE 7. Correlation between the subjective scores in labeling and re-labeling sessions on the same data, with each column/color corresponding to an individual radiologist. This plot shows the scoring self-consistency of the experts and the average time spent labeling one pair of images. Apparently, the time spent on labeling is not the major factor affecting the self-consistency of experienced radiologists.

We also opted for evaluating the agreement between the radiologists' opinions by assessing the monotonic correlation between the z-scores computed earlier, which should account for the individual scoring preferences. The SRCC correlation values, shown in Fig. 8, never drop below 0.50, with a mean of 0.55 and a median of 0.53 (corresponding to a strong relationship between the variables).

FIGURE 8. Pair-wise Spearman's rank correlation coefficient between z-scores from seven radiologists participating in the survey. According to [92], this pattern corresponds to a strong agreement between the experts.

In Fig. 7, the Weighted Cohen’s Kappa values correspond to moderate to substantial consistency of scoring (according to [93]). And, according to [92], the SRCC range in Fig. 8 corresponds to a strong agreement. Given the sufficiently trustworthy labeling, the spread of the correlation scores for the modern IQA metrics in Fig. 6, and the non-trivial correlation patterns in Fig. 4, one can conclude that the optimal MRI metric is yet to be devised.

Besides a blunt umbrella metric aggregating the top-performing predictions (e.g., those of VSI, HaarPSI, DISTS, and FIDVGG16), the future effort should be dedicated to additional forays into modeling the MRI-specific perception of the radiologists and to interpreting their assessment using formalized rules taken from the medical textbooks. Such interpretable metrics will be especially in demand, given the recent appearance of the MRI sampling approaches aimed towards optimizing downstream tasks [94], including the recently annotated FastMRI dataset [95]. Another line of future work could be 'borrowed' from the NI domain, where the abundance of data has led to the emergence of several NR IQMs. Although in our study all such metrics (the classic BRISQUE [4] and the more recent PaQ-2-PiQ [62] and MetaIQA [63]) showed equally mediocre performance compared to the other IQMs, we believe their value in the MRI domain is bound to improve with the growth of available data.

SECTION VII.

Conclusion

This manuscript reports the most extensive study of the image quality metrics for Magnetic Resonance Imaging to date, evaluating 35 modern metrics and using 14,700 subjective votes from experienced radiologists.

The applicability of full-reference, no-reference, and distribution-based metrics is discussed from the standpoint of MRI-specific image reconstruction tasks (scan acceleration, denoising, and motion correction). Unlike previous IQA studies analyzing IQMs with manual distortions, we use the outputs of neural network models trained to perform these particular tasks, enabling a realistic evaluation. Unlike natural images, the MRI scans are proposed to be assessed according to the most diagnostically influential criteria for the quality of MRI scans: signal-to-noise ratio, contrast-to-noise ratio, and the presence of artefacts.

The top performers – DISTS, HaarPSI, VSI, and FIDVGG16 – are found to be efficient across three proposed quality criteria, for all considered anatomies and the target tasks.

ACKNOWLEDGMENT

The authors acknowledge the effort of radiologists from the Philips Clinical Application Team (PD CEER) for their help with data labeling and thank the supporters of their GitHub Project (https://github.com/photosynthesis-team/piq/), where each metric was independently implemented and tested.

They also declare no conflict of interest and no personal bias towards particular image quality metrics.

Appendix A

Abbreviations

A list of abbreviations is provided in Table 2.

TABLE 2. List of Abbreviations

Appendix B

Reconstruction Examples

During the labeling experiment, experts were asked to label pairs of images. Each pair contained a low-quality image placed side-by-side with a corresponding high-quality reference. Each low-quality image was obtained by first corrupting the corresponding reference and then reconstructing it with the models trained to solve one of the tasks described in the main text (scan acceleration, motion compensation, or denoising). Fig. 9 showcases typical pairs of images used in the experiment. The following corruption parameters were used to generate the images: an acceleration factor of 4, a motion amplification factor of 0.6, and a noise amplification factor of 2. These parameters correspond to medium-strength corruptions, showcasing possible imperfect reconstruction results.

FIGURE 9. Examples of corrupted images used as inputs to the reconstruction models (left column), the reconstruction results (middle column), and the artefact-free reference images (right column). Examples with medium-strength corruptions are displayed to showcase possible imperfect reconstruction results.

Appendix C

Labeling User Interface

During the labeling experiment, the participants were asked to score pairs of reconstructed-reference images presented to them side-by-side in a web interface of the Label Studio [89]. The web interface is shown in Fig. 10.

FIGURE 10. Web interface of the Label Studio software released to the expert radiologists to perform the labeling. The participants selected their answers using the proposed scale from 1 to 4, rating the images based on each proposed IQA criterion.

The labeling was done using three main IQ criteria: the presence of artefacts, the perceived level of noise, and the perceived level of soft-tissue contrast. The participants were able to select their answers using the mouse pointer or keyboard keys. During the quality assessment process, the participants were able to zoom images, re-label previously labeled examples, pause, and divide their evaluation session into as many labeling rounds as they wished. All labeling results were continuously saved on a remote server to eliminate the possibility of data loss. After the complete labeling process, the participants were offered a final opportunity to fix the scoring of the borderline examples.

13.
Xiaoting Kang, Bei Zhang, "Quality Prediction Analysis of Computer Image Visual Psychology Based on Analysis of the Color Composition", International Journal of High Speed Electronics and Systems, vol.34, no.02, 2025.
14.
Eyob Mersha Woldamanuel, "Hybrid Simulated Annealing‐Evaporation Rate‐Based Water Cycle Algorithm Application for Medical Image Enhancement", Journal of Electrical and Computer Engineering, vol.2024, no.1, 2024.
15.
Hailong Liu, Yanxia Chen, Meng Zhang, Han Bu, Fenghuan Lin, Jun Chen, Mengqiang Xiao, Jie Chen, "Feasibility of knee magnetic resonance imaging protocol using artificial intelligence-assisted iterative algorithm protocols: comparison with standard MRI protocols", Frontiers in Medicine, vol.11, 2024.
16.
Diana L. Giraldo, Hamza Khan, Gustavo Pineda, Zhihua Liang, Alfonso Lozano-Castillo, Bart Van Wijmeersch, Henry C. Woodruff, Philippe Lambin, Eduardo Romero, Liesbet M. Peeters, Jan Sijbers, "Perceptual super-resolution in multiple sclerosis MRI", Frontiers in Neuroscience, vol.18, 2024.
17.
Ziad Al-Haj Hemidi, Christian Weihsbach, Mattias P. Heinrich, "IM-MoCo: Self-supervised MRI Motion Correction Using Motion-Guided Implicit Neural Representations", Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol.15007, pp.382, 2024.
18.
Zichen Xiong, Jingyi Hu, Nan Li, "Comprehensive quality evaluation method of ultrasound tomography algorithms based on entropy method", Measurement Science and Technology, vol.35, no.11, pp.115403, 2024.
19.
Diana L. Giraldo, Hamza Khan, Gustavo Pineda, Zhihua Liang, Alfonso Lozano, Bart Van Wijmeersch, Henry C. Woodruff, Philippe Lambin, Eduardo Romero, Liesbet M. Peeters, Jan Sijbers, , 2024.
20.
Al-yhuwert Murcia Tapias, Diana Giraldo Franco, Eduardo Romero, "Synthesizing fractional anisotropy maps from T1-weighted magnetic resonance images using a simplified generative adversarial network", Medical Imaging 2024: Clinical and Biomedical Imaging, pp.94, 2024.
21.
Katerina Nikiforaki, Ioannis Karatzanis, Aikaterini Dovrou, Maciej Bobowicz, Katarzyna Gwozdziewicz, Oliver Díaz, Manolis Tsiknakis, Dimitrios I. Fotiadis, Karim Lekadir, Kostas Marias, "Image Quality Assessment Tool for Conventional and Dynamic Magnetic Resonance Imaging Acquisitions", Journal of Imaging, vol.10, no.5, pp.115, 2024.
22.
Sabina Umirzakova, Sevara Mardieva, Shakhnoza Muksimova, Shabir Ahmad, Taegkeun Whangbo, "Enhancing the Super-Resolution of Medical Images: Introducing the Deep Residual Feature Distillation Channel Attention Network for Optimized Performance and Efficiency", Bioengineering, vol.10, no.11, pp.1332, 2023.
23.
Sam Keaveney, Alina Dragan, Mihaela Rata, Matthew Blackledge, Erica Scurr, Jessica M. Winfield, Joshua Shur, Dow-Mu Koh, Nuria Porta, Antonio Candito, Alexander King, Winston Rennie, Suchi Gaba, Priya Suresh, Paul Malcolm, Amy Davis, Anjumara Nilak, Aarti Shah, Sanjay Gandhi, Mauro Albrizio, Arnold Drury, Guy Pratt, Gordon Cook, Sadie Roberts, Matthew Jenner, Sarah Brown, Martin Kaiser, Christina Messiou, "Image quality in whole-body MRI using the MY-RADS protocol in a prospective multi-centre multiple myeloma study", Insights into Imaging, vol.14, no.1, 2023.
24.
Ovidijus Grigas, Rytis Maskeli?nas, Robertas Damasevicius, "Improving Structural MRI Preprocessing with Hybrid Transformer GANs", Life, vol.13, no.9, pp.1893, 2023.
25.
Kohei Ohashi, Yukihiro Nagatani, Makoto Yoshigoe, Kyohei Iwai, Keiko Tsuchiya, Atsunobu Hino, Yukako Kida, Asumi Yamazaki, Takayuki Ishida, "Applicability Evaluation of Full-Reference Image Quality Assessment Methods for Computed Tomography Images", Journal of Digital Imaging, 2023.
26.
Artem Razumov, Oleg Rogov, Dmitry V. Dylov, "Optimal MRI undersampling patterns for ultimate benefit of medical vision tasks", Magnetic Resonance Imaging, 2023.

References

References is not available for this document.