NoR-VDPNet++: Real-Time No-Reference Image Quality Metrics

Efficiency and efficacy are desirable properties for any evaluation metric dealing with Standard Dynamic Range (SDR) or High Dynamic Range (HDR) imaging. However, it is a daunting task to satisfy both properties simultaneously. On the one hand, existing evaluation metrics like HDR-VDP 2.2 can accurately mimic the Human Visual System (HVS), but this typically comes at a very high computational cost. On the other hand, computationally cheaper alternatives (e.g., PSNR, MSE, etc.) fail to capture many crucial aspects of the HVS. In this work, we present NoR-VDPNet++, a deep learning architecture for converting accurate full-reference metrics into no-reference metrics, thus reducing the computational burden. We show that NoR-VDPNet++ can be successfully employed in different application scenarios.


I. INTRODUCTION
The quality of natural/synthetic images is commonly assessed either through user studies or through objective metrics. This step is especially important when assessing the output of a compression, restoration, or enhancement algorithm.
Although a user study is very reliable in terms of the quality of its results, it is rather cumbersome to run: a considerable amount of time (from weeks to months in some cases) is often required, given that a large number of participants and images should be involved. Furthermore, such studies require a careful design to avoid biases, and they can be very expensive since money has to be invested to attract participants. As a result, objective metrics are typically preferred for assessing image quality; the monetary cost is greatly reduced while, at the same time, these metrics can be employed to evaluate real-time applications. Although such metrics do not require users, they represent fairly reliable solutions, especially when they take into account different aspects of the human visual system (HVS) or when they provide an accurate simulation of relevant aspects of it. An example of this last case is HDR-VDP 2.2 [32], which is, by now, a well-established metric for HDR and SDR imaging used in standardization committees. Unfortunately, HDR-VDP 2.2 presents two main limitations: i) its high computational cost prevents its use in real-time applications or on large databases (e.g., standardization); ii) it requires a reference image, which may not be available in many cases (e.g., live TV broadcasting). Recently, some efforts have been made to design more computationally efficient metrics for specific problems. However, the most popular and reliable metrics, such as TMQI [27], [37] for assessing the quality of tone-mapped images, still require a reference image, which is a severe limiting factor.
The associate editor coordinating the review of this manuscript and approving it for publication was Yun Zhang.
All the above-mentioned problems make evident the need for new objective metrics that (i) can be run in real time, (ii) do not require a reference image or a ground truth, and (iii) accurately mimic the original reference-based metrics. In this paper, we propose NoR-VDPNet++, an efficient deep learning architecture for converting accurate full-reference metrics into no-reference metrics.
The rest of this paper is organized as follows. In Section II, we review previous efforts in the field. In Section III, we explain our NoR-VDPNet++ architecture in detail. In Section IV, we turn to describe the dataset and the training strategy we use, while in Section V, we report our experiments. In Section VI, we demonstrate how our system fares in real applications. Finally, Section VII wraps up, also offering pointers to potential directions for future work.

II. RELATED WORK
Nowadays, image quality evaluation through objective metrics has become highly important. Objective metrics are not only used for quality assessment in benchmark studies but also to monitor or guide the performance of algorithms such as 3D renderers, encoders, enhancement methods, deep-learning training, etc. In this work, we consider Image Quality Metrics (IQMs) that predict a single global quality score for the entire image.
IQMs can be divided into Full-Reference (FR) and No-Reference (NR) methods. While FR metrics receive as input a pair of images (i.e., the ground truth and the distorted image), NR metrics have only the distorted image as input. In this section, we focus on state-of-the-art NR IQM approaches only, which is the scenario of our study.
Typically, NR IQMs are based on statistical information derived from the distorted image [19], [31], [35], [40]. For example, NIQE [31] first computes 36 highly regular natural scene statistics from an input image, and then computes the distance between these and a multivariate data-driven Gaussian fit. Recently, NR metrics based on machine learning have made their appearance. Mittal et al. [30] proposed to extract locally normalized luminance coefficients (LNLCs) to quantify possible losses of naturalness in the image due to the presence of distortions. Subsequently, a support vector regressor (SVR) is trained to predict, from LNLCs, a proxy of human perception called the BRISQUE index. Similarly, Kundu et al. [26] introduced Higrade, an NR metric for tone-mapped images based on the extraction of log-domain gradients and an SVR. Regarding convolutional neural networks (CNNs), Kang et al. [22] proposed one of the first approaches, presenting a simple NR CNN architecture for predicting quality scores that correlate with user experiments. Kottayil et al. [24] introduced an NR-IQA deep learning scheme for HDR images that correlates with mean opinion scores. Kim and Lee [23] deal with the absence of ground truth by employing local quality maps derived by FR-IQMs as intermediate regression targets. This approach requires pre-training the FR-IQM model on data where the ground truth is available. The approach by Bosse et al. [11] is purely data-driven and does not rely on hand-crafted features or other types of prior domain knowledge about the HVS or image statistics. Zhu et al. [39] proposed MetaIQA to improve the prediction capabilities of a CNN-based metric through pre-trained architectures. Here, the meta-knowledge shared by people when evaluating the quality of images with various distortions is learned and adapted to unknown distortions.
Recently, CNN-based architectures have been employed to transfer the knowledge of an algorithm into the parameters of a convolutional network able to produce real-time predictions. This was achieved for both the FR scenario [2] (i.e., DIQM, which mimics HDR-VDP 2.2 [32] and DRIIM [5] and uses a reference) and the NR scenario [9] (i.e., NoR-VDPNet, which mimics HDR-VDP 2.2 without a reference).
In this work, we present NoR-VDPNet++, an improved variant of NoR-VDPNet [9] that achieves higher accuracy while maintaining its real-time nature. In particular, we present the following contributions with respect to previous art:
• NoR-VDPNet++ is an NR version of the FR CNN-based metric [2], which distills HDR-VDP 2.2 and DRIIM [5] with high accuracy and efficiency. In this work, we extend the NoR-VDPNet architecture by testing normalization layers.
• We apply NoR-VDPNet and NoR-VDPNet++ to obtain a no-reference TMQI [37] to assess the quality of tone-mapped images and a no-reference HDR-VDP 2.2 to assess the quality of inverse tone-mapped images.
• We present two novel datasets: the former composed of tone-mapped HDR images using different tone mapping operators, and the latter composed of inverse tone-mapped images using different inverse tone mapping operators.

III. DISTILLING IMAGE QUALITY METRICS
NoR-VDPNet [9] accomplishes the conversion of HDR-VDP 2.2 [32] into an NR model encoded as a CNN. This is attained by training a CNN architecture (see the left-most architecture in Figure 1) using a medium-sized dataset (e.g., more than 50,000-100,000 examples with/without reference) of SDR/HDR images for different scenarios such as the detection of SDR distortions (blur, noise, quantization, etc.), JPEG-XT [3] compression artifacts, tone/inverse tone mapping evaluation, etc. Each example pair consists of a distorted image and the ground truth quality value that HDR-VDP 2.2 or TMQI calculates using its reference; see Figure 2. Note that the key to distilling HDR-VDP 2.2 or TMQI into a no-reference metric comes down to omitting the reference during training.
In this work, we explore techniques aimed at improving the stability of the training phase of the previous version, NoR-VDPNet, and at increasing accuracy at inference time. The resulting model, which we dub NoR-VDPNet++, comes in two flavors: one that uses Batch Normalization [21] as a way to counter the internal covariate shift, and another that instead uses the more recent ReZero [6] normalization layer to speed up training convergence. We experiment with both variants and discuss the merits of each.

FIGURE 2. The process for creating a sample for our NR datasets. We use an FR metric for computing the quality value between the ground truth and the distorted images. Then, the sample is created by discarding the ground truth; the input for the network is the distorted image and the target output to minimize is the computed quality value Q.
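The sample-creation process described in the caption above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `toy_fr_metric` is a hypothetical stand-in for an FR metric (it is not HDR-VDP 2.2 or TMQI), and images are toy nested lists of pixel values.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


def toy_fr_metric(reference: Image, distorted: Image) -> float:
    """Hypothetical stand-in for a full-reference metric.

    NOT HDR-VDP 2.2 or TMQI: it only maps the mean squared difference
    to a score in (0, 100], where higher means closer to the reference.
    """
    diffs = [
        (r - d) ** 2
        for ref_row, dis_row in zip(reference, distorted)
        for r, d in zip(ref_row, dis_row)
    ]
    mean_sq = sum(diffs) / len(diffs)
    return 100.0 / (1.0 + mean_sq)


def make_nr_samples(
    pairs: Sequence[Tuple[Image, Image]],
    fr_metric: Callable[[Image, Image], float],
) -> List[Tuple[Image, float]]:
    """Turn (reference, distorted) pairs into no-reference training samples."""
    samples = []
    for reference, distorted in pairs:
        q = fr_metric(reference, distorted)  # the reference is used only here
        samples.append((distorted, q))       # ...and then discarded
    return samples
```

The network is then trained to regress `q` from the distorted image alone, so the reference is never needed at inference time.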
Batch Normalization [21] has been shown to effectively help reduce the covariate shift between layers and to allow for faster and more robust training. Batch Normalization comes down to independently re-centering and re-scaling the dimensions of data tensors by using an approximation of the mean and standard deviation computed on the batch of examples. Equation 1 describes the computation for the kth dimension of a vector x = (x^(1), ..., x^(m)):

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} \tag{1}$$

where µ_B^(k) and σ_B^(k) are the mean and standard deviation of the kth dimension computed over the batch, and ε is a small constant added for numerical stability.
ReZero [6], on the other hand, was recently proposed as a novel way of reducing the problems of vanishing and exploding gradients typical of deep learning training with residual layers. As a residual block, it allows deep architectures to become deeper while at the same time being much more efficient than other normalization techniques. The computation of ReZero between two subsequent layers (l and l + 1) is described by Equation 2 and comes down to a residual connection with a trainable parameter (α_l) used to modulate the transformation F of the data tensor:

$$x_{l+1} = x_l + \alpha_l F(x_l) \tag{2}$$
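To make the two operations concrete, the following sketch implements them on plain Python lists. It is an illustration only, under these assumptions: the learnable scale and shift that usually follow Batch Normalization are omitted, `eps` is an assumed stability constant, and `transform` is a hypothetical placeholder for the block's transformation F.

```python
import math
from typing import Callable, List

Vector = List[float]


def batch_normalize(batch: List[Vector], eps: float = 1e-5) -> List[Vector]:
    """Re-center and re-scale each dimension using batch statistics.

    The learnable scale/shift parameters are left out for brevity.
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        column = [x[k] for x in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        scale = math.sqrt(var + eps)
        for i in range(n):
            out[i][k] = (batch[i][k] - mean) / scale
    return out


def rezero_step(x: Vector, transform: Callable[[Vector], Vector], alpha: float) -> Vector:
    """Residual connection whose branch is modulated by a trainable scalar alpha."""
    fx = transform(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


if __name__ == "__main__":
    print(batch_normalize([[1.0, 10.0], [3.0, 30.0]]))
    # With alpha = 0 (ReZero's initial value) the block is the identity map.
    print(rezero_step([1.0, 2.0], lambda v: [10.0 * t for t in v], alpha=0.0))
```

Because ReZero only adds one scalar multiply and one addition per block, while Batch Normalization needs per-dimension statistics over the batch, the sketch also hints at why the former can be cheaper to evaluate.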
Both variants of NoR-VDPNet++ achieve a lower prediction error than the original NoR-VDPNet and still preserve real-time performance. When equipped with the ReZero connections, NoR-VDPNet++ produces lower errors in some scenarios; Figure 1 shows NoR-VDPNet before (left) and after (right) these changes.
For the TMO dataset, we applied 18 TMOs (see Figure 6) to all images in I_HDR using the HDR Toolbox [8]. Then, we ran TMQI using the original HDR images and their tone-mapped versions, storing the TMQI score as the target output. The no-reference dataset comprises the tone-mapped images, stored at 8 bits per color channel in the sRGB color space, and their TMQI scores.
Regarding ITMO, we applied six inverse tone mapping operators (ITMOs) to the SDR versions (i.e., with an f-stop that maximizes the total number of well-exposed pixels) of the HDR images in I_HDR. These operators are: Akyuz et al. [1], Huo et al. [20], Kovaleski and Oliveira [25], Masia et al. [29], Eilertsen et al. [13], and Santos et al. [33]. We ran HDR-VDP 2.2 between the original HDR images and their inverse tone-mapped versions, storing the HDR-VDP 2.2 Q value. The no-reference dataset comprises the inverse tone-mapped images, stored at 32 bits per color channel in the sRGB color space, and their HDR-VDP 2.2 scores. To further stress-test different input conditions, we applied an exposure augmentation; i.e., we applied a +1.5-stop and a +3.0-stop increase from a well-exposed input image (with only clipped highlights); see Figure 5.
Given that the same HDR image is tone/inverse tone mapped with different TMOs/ITMOs, this is actually equivalent to performing data augmentation. Therefore, to prevent the same content from appearing in more than one split, all the tone/inverse tone mapped versions of a given image are placed together in exactly one of the training set, the evaluation set, or the test set.
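A split that keeps all versions of the same source image together can be sketched as below. The split fractions, the seed, and the `(source_id, sample)` layout are assumptions made for illustration; the text does not state the proportions actually used.

```python
import random
from typing import Any, Dict, List, Sequence, Tuple

Sample = Tuple[str, Any]  # (source image id, tone-mapped variant)


def split_by_source(
    samples: Sequence[Sample],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed, not from the paper
    seed: int = 0,
) -> Dict[str, List[Sample]]:
    """Assign every variant of a source image to exactly one split."""
    source_ids = sorted({sid for sid, _ in samples})
    random.Random(seed).shuffle(source_ids)

    n_train = int(fractions[0] * len(source_ids))
    n_eval = int(fractions[1] * len(source_ids))

    assignment = {}
    for i, sid in enumerate(source_ids):
        if i < n_train:
            assignment[sid] = "train"
        elif i < n_train + n_eval:
            assignment[sid] = "eval"
        else:
            assignment[sid] = "test"

    splits: Dict[str, List[Sample]] = {"train": [], "eval": [], "test": []}
    for sid, variant in samples:
        splits[assignment[sid]].append((sid, variant))
    return splits
```

Splitting on source identifiers rather than on individual samples is what prevents near-duplicate content from leaking between training and test data.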
Note that for HDR-C and SDR-D, we extended Scenario 1 and Scenario 2 from Artusi et al.'s work [2] by increasing the number of samples by 3.8 times and 7 times, respectively. We achieved that using images from [12] for SDR-D, and the new images from I_HDR for HDR-C.
To further increase the size of the dataset, we performed further data augmentation by applying 90°/180°/270° rotations and horizontal/vertical image flips. Note that HDR-VDP 2.2 requires physical values in order to obtain meaningful results, so images were converted from relative values to display-referred values. For the SDR-D dataset, the reference display had the characteristics of a present-day standard 8-bit display; i.e., the display peak brightness and black level were set to 250 cd/m² and 0.5 cd/m², respectively. Regarding the ITMO and HDR-C datasets, the reference HDR display was the DisplayHDR1400 standard, with a peak luminance of 1,400 cd/m² and a black level of 0.02 cd/m². The TMO dataset had no reference display because TMQI [37] works on normalized values for both the HDR and tone-mapped images.
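The geometric augmentation mentioned above can be illustrated on toy images stored as nested lists. This is only a sketch: the function names are hypothetical, and how the target quality value is assigned to each augmented copy is not stated in the text, so it is left out here.

```python
from typing import List

Image = List[List[float]]  # toy image: rows of pixel values


def rotate90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]


def flip_vertical(img: Image) -> Image:
    """Mirror the image top-to-bottom."""
    return [list(row) for row in img[::-1]]


def augment(img: Image) -> List[Image]:
    """Return the 90/180/270-degree rotations and the two flips of an image."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [r90, r180, r270, flip_horizontal(img), flip_vertical(img)]
```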

TABLE 2. Performance evaluation in terms of MSE (lower is better). Boldface indicates the best method overall for each scenario. The superscript ‡ denotes the method (if any) whose MSE score is not statistically significantly different from the best one according to a two-tailed t-test on the differences in performance: the symbol ‡ indicates 0.01 < p-value; i.e., the methods behave similarly with very high confidence.

code of NoR-VDPNet that uses the PyTorch 1.3.1 deep-learning framework. For ResNet-18, we employed the PyTorch implementation, starting from its original weights and fine-tuning them on the SDR-D, TMO, HDR-C, and ITMO training sets. During training, we employed Adam as the optimizer with default parameters and a learning rate initialized to 10^-5; we halved the learning rate whenever a plateau was reached. We trained all our networks for 100 epochs and verified that the optimization converged in all cases. Typically, convergence is reached after 60 or 70 epochs.
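The learning-rate schedule described above (start at 10^-5, halve on a plateau) can be sketched in a framework-independent way as follows. The `patience` and `min_delta` values are assumptions, since the text does not say how a plateau was detected; in PyTorch, similar behavior is commonly obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau` using `factor=0.5`.

```python
class HalveOnPlateau:
    """Halve the learning rate when the monitored loss stops improving.

    `patience` and `min_delta` are illustrative assumptions, not values
    reported in the paper.
    """

    def __init__(self, lr: float = 1e-5, patience: int = 5, min_delta: float = 0.0):
        self.lr = lr
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Update the schedule with the latest validation loss; return the current lr."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= 0.5      # plateau reached: halve the learning rate
                self.bad_epochs = 0
        return self.lr
```

A training loop would call `step` once per epoch with the validation loss and pass the returned value to the optimizer.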

V. RESULTS
In order to assess the quality of the predictions that our new NR model yields, we compared the Mean Squared Error (MSE) of the predictions against the FR target quality values (as produced by HDR-VDP 2.2 or TMQI) for the test datasets of SDR-D, HDR-C, TMO, and ITMO. Table 2 reports performance comparisons in terms of MSE between the original NoR-VDPNet [9], ResNet-18, and the new variants of NoR-VDPNet++ when equipped with Batch Normalization (BN) or with ReZero (RZ), for SDR-D, HDR-C, TMO, and ITMO. Statistical significance of the averaged scores is tested according to a two-tailed t-test at different significance levels (α = 0.01 and α = 0.001).
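As an illustration of the evaluation protocol, the sketch below computes the MSE between predictions and targets and the statistic of a paired t-test on per-sample errors of two methods. Treating the test as paired is an assumption made here; the p-value, which requires the Student t distribution, is not computed.

```python
import math
import statistics
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target quality values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def paired_t_statistic(errors_a: Sequence[float], errors_b: Sequence[float]) -> float:
    """t statistic of the per-sample error differences between two methods.

    Assumes a paired design (both methods evaluated on the same test samples).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))
```

The resulting statistic would be compared against the two-tailed critical value with n - 1 degrees of freedom at the chosen significance level.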
These results reveal some interesting facts. First of all, there is a clear advantage (i.e., a statistically significant improvement), in terms of error score, when equipping the network with sophisticated normalization layers, compared to the classical NoR variant; see Table 2. Another aspect that stands out is that NoR++BN and NoR++RZ both perform substantially better, in a statistically significant sense, than ResNet-18 in terms of error score. Interestingly enough, this improvement does not come at an extra cost. Indeed, ResNet-18 requires 58 hours for training on the SDR-D dataset, while NoR++RZ requires only 11 hours on the same dataset. Figure 4 shows the error distributions for the testing datasets. Note that, amongst all methods, NoR-VDPNet++RZ displays the narrowest histogram centered around 0 for the majority of scenarios.
FIGURE 8. An example in which NoR-VDPNet++ is used to choose high-quality images from an image collection. NoR-VDPNet++ predicts a high Q-score (i.e., Q > 70) for sharp images, and a low one (i.e., Q < 60) for blurred images.

For a clearer picture, Figure 3 shows the scatter plots between the predicted value Q̂ and its ground truth Q, also reporting the Pearson correlation coefficient ρ. The scatter plots exhibit a linear relationship between the ground-truth and the predicted values, which tend to lie close to the main diagonal. From these plots, we can notice that ITMO is the most difficult case overall. This is due to the fact that an inverse tone mapping operator (both classic methods and especially deep-learning-based ones) applies many different processing operations at the same time on the same image.
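For reference, the Pearson correlation coefficient ρ between predicted and ground-truth quality values can be computed as in the following minimal sketch (the function name is illustrative):

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Values of ρ close to 1 correspond to points lying near a straight line with positive slope. Note that ρ alone does not guarantee that this line is the main diagonal, which is why it is read together with the scatter plots and the MSE.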
Training times are reported in Table 3. It is worth noting that NoR++RZ displays training times comparable to those of NoR, while yielding better performance in terms of quality; see Table 2. In terms of computational efficiency at inference time, the new architectures maintain real-time performance; i.e., both variants BN and RZ can issue predictions for 4-MPixel images in less than 24 ms; see Figure 7. In our implementation, RZ is 44% faster than BN at high resolutions (i.e., >2 MPixel) because the implementation of Equation 1 is computationally more expensive than that of Equation 2.

VI. APPLICATIONS
NoR-VDPNet++ is a real-time metric, meaning that it can be employed in several optimization-based applications in which the parameters need to be optimized for a specific quality metric. A straightforward application of our work is the selection of high-quality images from an image collection; see Figure 8. This might be particularly useful for sorting vacation photographs or eliminating low-quality images in computer vision applications such as Structure-from-Motion [10] (e.g., removing blurred frames in a 3D reconstruction). Another interesting application is to use our metric trained on TMQI to optimize the parameters of tone mapping operators. To demonstrate this possibility, we built an application that tries to optimize the parameters of this sigmoid TMO: where C_w and C_d are, respectively, an HDR and an SDR color channel; L_w and L_d are, respectively, the HDR and SDR luminance; α and µ are tone-curve parameters, and γ is a color saturation parameter. Figure 9 shows tone-mapped images obtained with this optimization process, displaying the TMQI predicted by the network and its corresponding real value. The proposed tone mapping optimization can also be employed for selecting TMO parameters for JPEG-XT [28] compression using the HDR-C results. In a similar way, our metric trained on ITMO can be employed to optimize inverse tone mapping operators (be they neural or non-neural implementations).
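The idea of optimizing operator parameters against a fast no-reference predictor can be sketched as follows. This is not the authors' optimizer: an exhaustive grid search is used here purely for illustration, and `tone_map` and `predict_quality` are hypothetical callables standing in for a TMO and for the trained network, respectively.

```python
import itertools
from typing import Any, Callable, Dict, Sequence, Tuple


def optimize_parameters(
    image: Any,
    tone_map: Callable[..., Any],            # hypothetical: tone_map(image, **params)
    predict_quality: Callable[[Any], float],  # hypothetical: NR predictor, higher is better
    grid: Dict[str, Sequence[float]],
) -> Tuple[Dict[str, float], float]:
    """Return the parameter combination with the highest predicted quality."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = predict_quality(tone_map(image, **params))
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score
```

Because each evaluation requires only one forward pass of the predictor and no reference image, even a brute-force search of this kind stays cheap; a gradient-free or gradient-based optimizer could be swapped in without changing the interface.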

VII. DISCUSSION AND CONCLUSION
We have shown that CNN architectures can successfully distill the knowledge of existing reference metrics like HDR-VDP 2.2 [32] and TMQI [37]. In this work, we have presented NoR-VDPNet++, an improved variant of NoR-VDPNet [9]. This variant achieves more reliable results in general, and also in a newly introduced scenario, i.e., the evaluation of inverse tone mapped images. We also showed NoR-VDPNet++ outperforms other comparatively more complex networks like ResNet-18, while at the same time requiring less time to train, and being faster at inference time.
NoR-VDPNet++ maintains real-time performance, allowing it to be employed in real-time constrained applications such as optimization processes for parameter selection (e.g., for tone mapping), image selection from collections of photographs, or Structure-from-Motion tasks, to name a few.
Recent efforts have been made to better understand how intermediate feature maps of pre-trained CNNs can be used to predict image distortion similarly to how humans do. For example, Zhang et al. [38] present a systematic study on how to evaluate feature maps across different CNN architectures, obtaining important improvements with respect to classical objective metrics. Tariq et al. [34] have shown that the capability of pre-trained CNN features to optimize perceptual quality correlates with their accuracy in capturing basic characteristics of human visual perception. Altogether, this suggests that more effort has to be devoted to better understanding the potential benefits that using feature maps from pre-trained CNNs as an objective metric can bring to image/video evaluation. In future work, we plan to carry out a systematic study in this direction, analyzing ways of employing these feature maps in NoR-VDPNet++ in an effective manner.