NIMA: Neural Image Assessment

Automatically learned quality assessment for images has recently become a hot topic due to its usefulness in a wide variety of applications, such as evaluating image capture pipelines, storage techniques, and sharing media. Despite the subjective nature of this problem, most existing methods only predict the mean opinion score provided by datasets such as AVA [1] and TID2013 [2]. Our approach differs from others in that we predict the distribution of human opinion scores using a convolutional neural network. Our architecture also has the advantage of being significantly simpler than other methods with comparable performance. Our proposed approach relies on the success (and retraining) of proven, state-of-the-art deep object recognition networks. The resulting network can be used not only to score images reliably and with high correlation to human perception, but also to assist with adaptation and optimization of photo editing/enhancement algorithms in a photographic pipeline. All of this is done without the need for a "golden" reference image, consequently allowing for single-image, semantic- and perceptually-aware, no-reference quality assessment.


I. INTRODUCTION
Quantification of image quality and aesthetics has been a long-standing problem in image processing and computer vision. While technical quality assessment deals with measuring low-level degradations such as noise, blur, and compression artifacts, aesthetic assessment quantifies semantic-level characteristics associated with emotions and beauty in images. In general, image quality assessment can be categorized into full-reference and no-reference approaches. While availability of a reference image is assumed in the former (metrics such as PSNR, SSIM [3], etc.), blind (no-reference) approaches typically rely on a statistical model of distortions to predict image quality. The main goal of both categories is to predict a quality score that correlates well with human perception. Yet, the subjective nature of image quality remains the fundamental issue. Recently, more complex models such as deep convolutional neural networks (CNNs) have been used to address this problem [4]-[11]. The emergence of human-rated labeled data has encouraged these efforts [1], [2], [12]-[14]. In a typical deep CNN approach, weights are initialized by training on classification-related datasets (e.g. ImageNet [15]), and then fine-tuned on annotated data for perceptual quality assessment tasks.

A. Related Work
Machine learning has shown promising success in predicting the technical quality of images [4]-[7]. Kang et al. [5] show that extracting high-level features using CNNs can result in state-of-the-art blind quality assessment performance. It appears that replacing hand-crafted features with an end-to-end feature learning system is the main advantage of using CNNs for pixel-level quality assessment tasks [5], [6]. The proposed method in [5] is a shallow network with one convolutional layer and two fully-connected layers, and input patches are of size 32 × 32. Bosse et al. [6] use a deep CNN with 12 layers to improve on the image quality predictions of [5]. Given the small input size (32 × 32 patches), both methods require score aggregation across the whole image. Bianco et al. [7] propose a deep quality predictor based on AlexNet [15]: multiple CNN features are extracted from image crops of size 227 × 227, and then regressed to the human scores.
The success of CNNs on object recognition tasks has significantly benefited research on aesthetic assessment. This seems natural, as semantic-level qualities are directly related to image content. Recent CNN-based methods [8]-[11] show a significant performance improvement compared to earlier works based on hand-crafted features [1]. Murray et al. [1] is the benchmark on aesthetic assessment: they introduce the AVA dataset and propose a technique that uses manually designed features for style classification. Later, Lu et al. [8], [16] show that deep CNNs are well suited to the aesthetic assessment task. Their double-column CNN [16] consists of four convolutional and two fully-connected layers, and its inputs are the resized image and cropped windows of size 224 × 224. Predictions from these global and local image views are aggregated to an overall score by a fully-connected layer. Similar to Murray et al. [1], in [16] images are also categorized into low and high aesthetics based on mean human ratings. An AlexNet-inspired architecture with a regression loss is used in [9] to predict the mean scores. In a similar approach to [9], Bin et al. [11] fine-tune a VGG network [17] to learn the human ratings of the AVA dataset, using a regression framework to predict the histogram of ratings. More recently, [10] uses adaptive spatial pooling to allow feeding multiple scales of the input image, with fixed aspect ratios, to their CNN. This work presents a multi-net approach (each network a pre-trained VGG) which extracts features at multiple scales, and uses a scene-aware aggregation layer to combine predictions of the sub-networks. Similarly, Ma et al. [18] propose a layout-aware framework in which a saliency map is used to select patches with the highest impact on the predicted aesthetic score. Overall, none of these methods reported the correlation of their predictions with respect to the ground truth ratings. Recently, Kong et al.
in [14] proposed a method to aesthetically rank photos by training on AVA with a rank-based loss function. They trained an AlexNet-based CNN to learn the difference of the aesthetic scores of two input images, and as a result, indirectly optimize for rank correlation. To the best of our knowledge, [14] is the only work that performed a correlation evaluation against AVA ratings.

B. Our Contributions
In this work, we introduce a novel approach to predict both technical and aesthetic qualities of images. We show that models with the same CNN architecture, trained on different datasets, lead to state-of-the-art performance for both tasks. Since we aim for predictions with higher correlation to human ratings, instead of classifying images into low/high score classes or regressing to the mean score, the distribution of ratings is predicted as a histogram. To this end, we use the squared EMD (earth mover's distance) loss proposed in [19], which shows a performance boost in classification with ordered classes. Our experiments show that this approach also leads to more accurate prediction of the mean score. Also, as shown in the aesthetic assessment case [1], the non-conventionality of images is directly related to score deviations, and our proposed paradigm allows for predicting this metric as well. This paper begins by reviewing two widely used datasets for quality assessment. Then, our proposed method is explained in more detail. Finally, the performance of this work is quantified and compared to existing methods.
C. A Large-Scale Database for Aesthetic Visual Analysis (AVA) [1]
The AVA dataset contains about 255,000 images, rated on their aesthetic qualities by amateur photographers. Each photo is scored by an average of 200 people in response to photography contests. Each image is associated with a single challenge theme, with nearly 900 different contests in AVA. The image ratings range from 1 to 10, with 10 being the highest aesthetic score. Histograms of AVA ratings are shown in Fig. 1. As can be seen, mean ratings are concentrated around the overall mean score (≈5.5). Also, ratings of roughly half of the photos in the AVA dataset have a standard deviation greater than 1.4. As pointed out in [1], presumably images with high score variance tend to be subject to interpretation, whereas images with low score variance seem to represent conventional styles or subject matter. A few examples with ratings associated with different levels of aesthetic quality and unconventionality are illustrated in Fig. 2. It appears that the aesthetic quality of a photograph can be represented by the mean score, while its unconventionality closely correlates with the score deviation. Given the distribution of AVA scores, training a model on AVA data typically results in predictions with small deviations around the overall mean (5.5).
It is worth mentioning that the joint histogram in Fig. 1 shows higher deviations for very low/high ratings (compared to the overall mean 5.5 and mean standard deviation 1.43). In other words, divergence of opinion is stronger for AVA images with extreme aesthetic qualities. As discussed in [1], distributions of ratings with mean values between 2 and 8 can be closely approximated by Gaussian functions, and highly skewed ratings can be modeled by Gamma distributions.
D. TID2013 [2]
TID2013 is curated for evaluation of full-reference perceptual image quality. It contains 3,000 images, generated from 25 reference (clean) images (Kodak images [20]) and 24 types of distortions, each with 5 levels. This leads to 120 distorted images per reference image, covering distortions such as compression artifacts, noise, blur and color artifacts.
Human ratings of TID2013 images are collected through a forced-choice experiment, where observers select the better image between two distorted choices. The setup of the experiment allows raters to view the reference image while making a decision. In each experiment, every distorted image is used in 9 random pairwise comparisons. The selected image gets one point, and the other image gets zero points. At the end of the experiment, the sum of the points is used as the quality score associated with an image (this leads to scores ranging from 0 to 9). To obtain the overall mean scores, a total of 985 experiments were carried out.
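The scoring protocol can be illustrated with a toy simulation. This is a hypothetical, simplified observer model (each image here initiates its own 9 comparisons, and the rater always prefers the truly better image), not the actual TID2013 procedure:

```python
import random

def run_experiment(true_quality, n_comparisons=9, seed=0):
    """Toy forced-choice scoring: each image enters 9 random pairwise
    comparisons; the image judged better earns one point, so final
    scores range from 0 to 9."""
    rng = random.Random(seed)
    images = list(true_quality)
    points = {img: 0 for img in images}
    for img in images:
        for _ in range(n_comparisons):
            rival = rng.choice([x for x in images if x != img])
            # Idealized rater: always prefers the higher-quality image.
            if true_quality[img] > true_quality[rival]:
                points[img] += 1
    return points

# Three hypothetical distortion levels of one reference image.
scores = run_experiment({"level1": 5, "level3": 3, "level5": 1})
```

With this idealized rater, the least distorted image wins all 9 of its comparisons and the most distorted one wins none, matching the 0-to-9 score range described above.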
Mean and standard deviation of TID2013 ratings are shown in Fig. 3. As can be seen in Fig. 3(c), the mean and score deviation values are weakly correlated. A few images from TID2013 are illustrated in Fig. 4 and Fig. 5. All five levels of JPEG compression artifacts and the respective ratings are illustrated in Fig. 4. Evidently, a higher distortion level leads to a lower mean score. The effect of contrast compression/stretching on human ratings is demonstrated in Fig. 5. Interestingly, contrast stretching (Fig. 5(c) and Fig. 5(e)) leads to relatively higher perceptual quality.
Unlike AVA, which includes distribution of ratings for each image, TID2013 only provides mean and standard deviation of the opinion scores. Since our proposed method requires training on score probabilities, the score distributions are approximated through maximum entropy optimization [21].
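One way to sketch such a reconstruction (a hypothetical dual-ascent solver; [21] describes the actual method): among all pmfs over the score buckets with a given mean and variance, the maximum-entropy one has the form p_i ∝ exp(λ1·s_i + λ2·(s_i − μ)²), and λ1, λ2 can be found by matching the first two moments:

```python
import numpy as np

def max_entropy_pmf(mu, sigma, scores, iters=50000, lr=1.0):
    """Maximum-entropy pmf over discrete score buckets with a given mean
    and standard deviation: p_i ∝ exp(l1*s_i + l2*(s_i - m)^2), where
    l1, l2 are found by ascent on the concave dual so that the first two
    moments match (mu, sigma)."""
    s = np.asarray(scores, dtype=float)
    lo, span = s.min(), s.max() - s.min()
    s = (s - lo) / span                    # normalize buckets to [0, 1]
    m = (mu - lo) / span                   # target mean (normalized)
    v = (sigma / span) ** 2                # target variance (normalized)
    l1 = l2 = 0.0
    for _ in range(iters):
        logits = l1 * s + l2 * (s - m) ** 2
        p = np.exp(logits - logits.max())  # numerically stable soft-max
        p /= p.sum()
        l1 += lr * (m - (p * s).sum())     # moment-matching updates
        l2 += lr * (v - (p * (s - m) ** 2).sum())
    return p

scores = np.arange(10)                     # TID2013 buckets 0..9
p = max_entropy_pmf(mu=4.5, sigma=1.5, scores=scores)
```

The returned pmf sums to one and reproduces the requested mean and standard deviation, giving a full training target where TID2013 provides only two summary statistics.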

II. PROPOSED METHOD
Our proposed quality and aesthetic predictor builds on image classifier architectures. More explicitly, we explore a few different classifier architectures such as VGG16 [17], Inception-v2, and MobileNet. We replace the last layer of the baseline CNN with a fully-connected layer with 10 neurons followed by soft-max activations (shown in Fig. 6). Baseline CNN weights are initialized by training on the ImageNet dataset [15], and then an end-to-end training on quality assessment is performed. In this paper, we discuss the performance of the proposed model with various baseline CNNs.
In training, input images are rescaled to 256 × 256, and then a crop of size 224 × 224 is randomly extracted. This lessens potential over-fitting issues, especially when training on relatively small datasets (e.g. TID2013). It is worth noting that we also tried training with random crops without rescaling; however, the results were not compelling, due to the inevitable change in image composition. Another random data augmentation in our training process is horizontal flipping of the image crops.
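A minimal numpy sketch of this augmentation (assuming the image has already been rescaled to 256 × 256; the paper's actual pipeline is implemented in TensorFlow):

```python
import numpy as np

def augment(image, crop=224, rng=None):
    """Randomly crop a 224x224 patch from a rescaled image and randomly
    flip it horizontally, as described in the training procedure."""
    rng = rng or np.random.default_rng(0)
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]            # horizontal flip
    return patch

img = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a rescaled image
out = augment(img)
```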
Our goal is to predict the distribution of ratings for a given image. The ground truth distribution of human ratings of a given image can be expressed as an empirical probability mass function p = [p_{s_1}, ..., p_{s_N}] with s_1 ≤ s_i ≤ s_N, where s_i denotes the ith score bucket and N denotes the total number of score buckets. In both the AVA and TID2013 datasets N = 10; in AVA, s_1 = 1 and s_N = 10, and in TID2013, s_1 = 0 and s_N = 9. Since Σ_{i=1}^{N} p_{s_i} = 1, p_{s_i} represents the probability of a quality score falling in the ith bucket. Given the distribution of ratings p, the mean quality score is defined as μ = Σ_{i=1}^{N} s_i × p_{s_i}, and the standard deviation of the score is computed as σ = (Σ_{i=1}^{N} (s_i − μ)² × p_{s_i})^{1/2}. As discussed in the previous section, one can qualitatively compare images by the mean and standard deviation of their scores.
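The mean and standard deviation defined above can be computed directly from a rating histogram; a small numpy sketch:

```python
import numpy as np

def score_stats(p, scores):
    """Mean and standard deviation of a rating pmf over score buckets."""
    p, s = np.asarray(p, dtype=float), np.asarray(scores, dtype=float)
    mu = (s * p).sum()                          # first moment
    sigma = np.sqrt(((s - mu) ** 2 * p).sum())  # spread of opinions
    return mu, sigma

# AVA-style buckets 1..10; a hypothetical distribution concentrated near 5-6.
mu, sigma = score_stats([0, 0, 0, 0.1, 0.4, 0.4, 0.1, 0, 0, 0],
                        np.arange(1, 11))
```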
Each example in the dataset consists of an image and its ground truth (user) ratings p. Our objective is to find the probability mass function p̂ that is an accurate estimate of p. Next, our training loss function is discussed.

Fig. 6: Modified baseline image classifier network used in our framework. The last layer of the classifier network is replaced by a fully-connected layer that outputs 10 classes of quality scores. Baseline network weights are initialized by training on the ImageNet dataset [15], and the added fully-connected weights are initialized randomly.

A. Loss Function
Soft-max cross-entropy is widely used as a training loss in classification tasks. This loss can be represented as Σ_{i=1}^{N} −p_{s_i} log(p̂_{s_i}) (where p̂_{s_i} denotes the estimated probability of the ith score bucket), maximizing the predicted probability of the correct labels. However, in the case of ordered classes (e.g. aesthetic and quality estimation), cross-entropy loss ignores the inter-class relationships between score buckets. One might argue that ordered classes can be represented by a real number, and consequently can be learned through a regression framework. Yet, it has been shown that for ordered classes, classification frameworks can outperform regression models [19], [25]. Hou et al. [19] show that training on datasets with intrinsic ordering between classes can benefit from EMD-based losses. These loss functions penalize mis-classifications according to class distances.
For image quality ratings, classes are inherently ordered as s_1 < · · · < s_N, and the r-norm distance between classes is defined as ‖s_i − s_j‖_r, where 1 ≤ i, j ≤ N. EMD is defined as the minimum cost to move the mass of one distribution to another. Given the ground truth and estimated probability mass functions p and p̂, with N ordered classes of distance ‖s_i − s_j‖_r, the normalized Earth Mover's Distance can be expressed as [26]:

EMD(p, p̂) = ( (1/N) Σ_{k=1}^{N} |CDF_p(k) − CDF_p̂(k)|^r )^{1/r},    (1)

where CDF_p(k) is the cumulative distribution function Σ_{i=1}^{k} p_{s_i}. It is worth noting that this closed-form solution requires both distributions to have equal mass, i.e. Σ_{i=1}^{N} p_{s_i} = Σ_{i=1}^{N} p̂_{s_i}. As shown in Fig. 6, our predicted quality probabilities are fed to a soft-max function to guarantee that Σ_{i=1}^{N} p̂_{s_i} = 1. Similar to [19], in our training framework r is set to 2, penalizing the Euclidean distance between the CDFs; r = 2 also allows easier optimization when working with gradient descent.
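The normalized EMD reduces to a simple expression in terms of CDF differences; a minimal numpy sketch with r = 2 as used in training:

```python
import numpy as np

def emd_loss(p, p_hat, r=2):
    """Normalized Earth Mover's Distance between two pmfs over N ordered
    score buckets: the r-norm of the differences between their CDFs."""
    cdf_diff = np.cumsum(p) - np.cumsum(p_hat)
    return float(np.mean(np.abs(cdf_diff) ** r) ** (1.0 / r))

p    = np.array([0.0, 0.2, 0.6, 0.2, 0.0])  # ground truth ratings
near = np.array([0.0, 0.3, 0.5, 0.2, 0.0])  # 0.1 of mass moved one bucket
far  = np.array([0.3, 0.2, 0.3, 0.2, 0.0])  # 0.3 of mass moved two buckets
```

Unlike plain cross-entropy, the penalty grows with how far probability mass is displaced across the ordered buckets, so `emd_loss(p, near) < emd_loss(p, far)`.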

III. EXPERIMENTAL RESULTS
We train two separate models for aesthetic and technical quality assessment, on AVA and TID2013 respectively. For each case, we split the dataset into train and test sets, such that 20% of the data is used for testing. In this section, the performance of the proposed models on the test sets is discussed and compared to existing methods. Then, applications of the proposed technique in photo ranking and image enhancement are explored. Before moving forward, the details of our implementation are explained.
The CNNs presented in this paper are implemented using TensorFlow [27], [28]. The baseline CNN weights are initialized by training on ImageNet [15], and the last fully-connected layer is randomly initialized. The weight and bias momentums are set to 0.9, and a dropout rate of 0.75 is applied on the last layer of the baseline network. The learning rates of the baseline CNN layers and the last fully-connected layer are set to 3 × 10^−7 and 3 × 10^−6, respectively. We observed that setting a low learning rate on the baseline CNN layers results in easier and faster optimization when using stochastic gradient descent. Also, after every 10 epochs of training, an exponential decay with decay factor 0.95 is applied to all learning rates.
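The learning-rate schedule described above can be written out as a small sketch of the stated hyperparameters (the function name is ours, not from the paper's code):

```python
def learning_rate(base_lr, epoch, decay=0.95, every=10):
    """Exponential decay applied to a learning rate every 10 epochs,
    as described in the implementation details."""
    return base_lr * decay ** (epoch // every)

lr_base = learning_rate(3e-7, epoch=25)   # baseline CNN layers
lr_head = learning_rate(3e-6, epoch=25)   # last fully-connected layer
```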

A. Performance Comparisons
Accuracy, correlation and EMD values of our evaluations of the aesthetic assessment model on AVA are presented in Table I. Most methods in Table I are designed to perform binary classification on the aesthetic scores, and as a result, only accuracy evaluations of two-class quality categorization are reported for them. In the two-class aesthetic categorization task, results from [18] and NIMA(Inception-v2) show the highest accuracy. Also, in terms of rank correlation, NIMA(VGG16) and NIMA(Inception-v2) outperform [14]. NIMA is also much cheaper: [18] applies multiple VGG16 nets on image patches to generate a single quality score, whereas the computational complexity of NIMA(Inception-v2) is roughly one pass of Inception-v2 (see Table III).
Our technical quality assessment model on TID2013 is compared to other existing methods in Table II. While most of these methods regress to the mean opinion score, our proposed technique predicts the distribution of ratings as well as the mean opinion score. Correlation between the ground truth and the results of NIMA(VGG16) is close to the state-of-the-art results in [29] and [7]. It is worth highlighting that Bianco et al. [7] feed multiple image crops to a deep CNN, whereas our method takes only the rescaled image.

Table I: Performance on AVA [1] compared to the state-of-the-art. Reported accuracy values are based on classification of photos into two classes (column 2). LCC (linear correlation coefficient) and SRCC (Spearman's rank correlation coefficient) are computed between predicted and ground truth mean scores (columns 3 and 4) and standard deviations of scores (columns 5 and 6). EMD measures closeness of the predicted and ground truth rating distributions with r = 1 in Eq. 1.

B. Photo Ranking
Predicted mean scores can be used to rank photos aesthetically. Some test photos from the AVA dataset are ranked in Fig. 7 and Fig. 8, with predicted NIMA scores and ground truth AVA scores shown below each image. Results in Fig. 7 suggest that in addition to image content, other factors such as tone, contrast and composition are important aesthetic qualities. Also, as shown in Fig. 8, besides image semantics, framing and color palette are key qualities in these photos. These aesthetic attributes are closely predicted by our models trained on AVA.
Predicted mean scores are also used to qualitatively rank photos in Fig. 9. These images are part of our TID2013 test set, which contains various types and levels of distortions. Comparing ground truth and predicted scores indicates that our model trained on TID2013 accurately ranks the test images.

C. Image Enhancement
Quality and aesthetic scores can be used to perceptually tune image enhancement operators. In other words, maximizing NIMA score as a prior can increase the likelihood of enhancing perceptual quality of an image. Typically, parameters of enhancement operators such as image denoising and contrast enhancement are selected by extensive experiments under various photographic conditions. Perceptual tuning could be quite expensive and time consuming, especially when human opinion is required. In this section, our proposed models are used to tune a tone enhancement method [37], and an image denoiser [38].
The multi-layer Laplacian technique [37] enhances local and global contrast of images. Parameters of this method control the amount of detail, shadow, and brightness of an image. Fig. 10 shows a few examples of the multi-layer Laplacian with different sets of parameters. We observed that the predicted aesthetic ratings from training on the AVA dataset can be improved by contrast adjustments. Consequently, our model is able to guide the multi-layer Laplacian filter to find aesthetically near-optimal settings of its parameters. Examples of this type of image editing are presented in Fig. 11, where a combination of detail, shadow and brightness changes is applied to each image. In each example, 6 levels of detail boost, 11 levels of shadow change, and 11 levels of brightness change account for a total of 726 variations. The aesthetic assessment model tends to prefer high-contrast images with boosted details, consistent with the ground truth results from AVA illustrated in Fig. 7.
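The exhaustive sweep described above can be sketched as follows. The grid sizes match the counts in the text, but the scoring function is a toy stand-in; the actual predictor is the trained NIMA model applied to each filtered image:

```python
from itertools import product

# Hypothetical discretized parameter grid matching the counts above:
# 6 detail-boost levels, 11 shadow levels, 11 brightness levels.
detail     = range(6)
shadow     = range(11)
brightness = range(11)

def best_setting(score_fn):
    """Score every parameter combination of the enhancement operator and
    keep the one with the highest predicted aesthetic score."""
    grid = list(product(detail, shadow, brightness))
    assert len(grid) == 726          # 6 * 11 * 11 variations per image
    return max(grid, key=score_fn)

# Toy stand-in for the NIMA score, peaking at settings (3, 5, 7).
setting = best_setting(lambda s: -(s[0] - 3) ** 2
                                 - (s[1] - 5) ** 2
                                 - (s[2] - 7) ** 2)
```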
Turbo denoising [38] is a technique which uses the domain transform [39] as its core filter. Performance of Turbo denoising depends on spatial and range smoothing parameters, and consequently, proper tuning of these parameters can effectively boost the performance of the denoiser. We observed that varying the spatial smoothing parameter makes the most significant perceptual difference, and as a result, we use our quality assessment model trained on the TID2013 dataset to tune this parameter. Application of our no-reference quality metric as a prior in image denoising is similar to the work of Zhu et al. [40], [41]. Our results are shown in Fig. 12. Additive white Gaussian noise with standard deviation 30 is added to each clean image, and Turbo denoising with various spatial parameters is used to denoise the noisy image. To reduce the score deviation, 50 random crops are extracted from each denoised image, and their scores are averaged to obtain the plots illustrated in Fig. 12. As can be seen, although the same amount of noise is added to each image, the maximum quality scores correspond to different denoising parameters in each example. For relatively smooth images such as (a) and (g), the optimal spatial parameter of Turbo denoising is higher (which implies stronger smoothing) than for the textured image in (j). This is probably due to the relatively high signal-to-noise ratio of (j). In other words, the quality assessment model tends to respect textures and avoid over-smoothing.

Table II: Performance on TID2013 [2] compared to the state-of-the-art. LCC (linear correlation coefficient) and SRCC (Spearman's rank correlation coefficient) are computed between predicted and ground truth mean scores (columns 2 and 3) and standard deviations of scores (columns 4 and 5). EMD measures closeness of the predicted and ground truth rating distributions with r = 1 in Eq. 1.
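The crop-averaged parameter sweep used for Fig. 12 can be sketched as follows. Here `denoise` and `score_fn` are hypothetical stand-ins for Turbo denoising and the trained quality model:

```python
import numpy as np

def tune_spatial_sigma(noisy, denoise, score_fn, sigmas, n_crops=50, crop=224):
    """Sweep the spatial smoothing parameter of a denoiser, average the
    predicted quality score over random crops of each result (to reduce
    score deviation), and return the best-scoring parameter."""
    rng = np.random.default_rng(0)
    h, w = noisy.shape[:2]
    best_sigma, best_score = None, -np.inf
    for sigma in sigmas:
        out = denoise(noisy, sigma)
        scores = []
        for _ in range(n_crops):
            t = rng.integers(0, h - crop + 1)
            l = rng.integers(0, w - crop + 1)
            scores.append(score_fn(out[t:t + crop, l:l + crop]))
        if np.mean(scores) > best_score:
            best_sigma, best_score = sigma, np.mean(scores)
    return best_sigma

# Toy stand-ins: `denoise` returns a constant image, and the "quality
# model" prefers crops whose mean is closest to 2.0.
noisy = np.zeros((256, 256))
best = tune_spatial_sigma(noisy,
                          denoise=lambda img, s: np.full_like(img, s),
                          score_fn=lambda c: -abs(c.mean() - 2.0),
                          sigmas=[1.0, 2.0, 3.0])
```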

D. Computational Costs
Computational complexity of the NIMA models is compared in Table III. Our TensorFlow inference implementation is tested on an Intel Xeon CPU @ 3.5 GHz with 32 GB memory and 12 cores, and an NVIDIA Quadro K620 GPU. Timings of one pass of the NIMA models on an image of size 224 × 224 × 3 are reported in Table III. Evidently, NIMA(MobileNet) is significantly lighter and faster than the other models, at the expense of a slight performance drop (shown in Table I and Table II).

IV. CONCLUSION
In this work we introduced a CNN-based image assessment method which can be trained on both aesthetic and pixel-level quality datasets. Our models effectively predict the distribution of quality ratings, rather than just the mean scores. This leads to superior results. As part of our future work, we will apply the trained models to other image enhancement applications. Our current experimental setup requires the enhancement operator to be evaluated multiple times, which limits real-time application of the proposed method. One might argue that in the case of an enhancement operator with well-defined derivatives, using NIMA as the loss function would be a more efficient approach.