A Survey of DNN Methods for Blind Image Quality Assessment

Blind image quality assessment (BIQA) methods aim to predict the quality of images as perceived by humans without access to a reference image. Recently, deep learning methods have gained substantial attention in the research community and have proven useful for BIQA. Although deep neural network (DNN) methods have been surveyed before, a number of recently proposed DNN methods have not yet been summarized for BIQA. In this paper, we provide a survey covering various DNN methods for BIQA. First, we systematically analyze the existing DNN-based quality assessment methods according to the role of the DNN. Then, we compare the prediction performance of various DNN methods on synthetic databases (LIVE, TID2013, CSIQ, LIVE multiply distorted) and authentic databases (LIVE challenge), providing information that can help in understanding the underlying properties of different DNN methods for BIQA. Finally, we describe some emerging challenges in designing and training DNN-based BIQA methods, along with a few directions that are worth further investigation.


I. INTRODUCTION
WITH the development of social media and the increasing demand for imaging services, an enormous amount of visual data is making its way to consumers. Digital images are inevitably degraded somewhere between content generation and consumption. The acquisition, processing, compression, transmission, and storage of images are all subject to various distortions, resulting in degraded visual quality. Therefore, methods for image quality assessment (IQA) have been extensively studied for the purpose of maintaining, controlling, and enhancing perceived image quality.
In principle, subjective assessment is the most reliable way to evaluate the visual quality of images [1], [2]. But this method is time-consuming, expensive, and difficult to implement in real-world systems. Therefore, objective assessment of image quality has gained growing attention in recent years. Depending on the extent to which a reference image is used for quality assessment, existing objective IQA methods can be classified into three categories: full-reference (FR), reduced-reference (RR), and no-reference/blind (NR/B) methods. The FR IQA methods make full use of the undistorted reference images, comparing them with the distorted images and measuring the difference between them [3]-[5], while the RR IQA methods use partial information from the reference images [6]-[8]. However, in many practical applications, it is difficult to obtain a reference image for the distorted image to be assessed, making powerful FR and RR IQA methods inapplicable. In contrast, the BIQA methods evaluate image quality without access to the reference images [9], [10]. Thus, it has become increasingly important to develop effective BIQA methods that can predict image quality without any additional information.
Most existing BIQA methods follow the flowchart shown in Fig. 1. Some BIQA methods are developed based on classical regression methods [11]. Researchers design hand-crafted features that can discriminate distorted images, and then train a regression model to predict image quality. Early BIQA methods follow a distortion-specific approach [78], [79], which commonly uses prior knowledge of the distortion type for quality prediction. In this approach, the distortion-specific features relevant to quality perception are extracted and used for quality estimation. Li et al. [78] propose a BIQA method for blur distortion. They first calculate the gradient image to characterize the blur distortion. Then, they divide the gradient image into blocks and extract from each block the energy features relevant to the blur distortion. Finally, the image quality is obtained by normalizing the moment energy. However, when an image is distorted via unknown distortion channels, it becomes much more difficult to find specific features to measure image quality.
Recently, in order to assess image quality without prior knowledge of the distortions, non-distortion-specific BIQA methods have been developed. The natural scene statistics (NSS)-based methods are widely used to extract reliable features; they assume that natural images share certain statistics and that the occurrence of distortions may change these statistics [14]-[16], [80]-[82]. The methods in [14], [16] utilize NSS models, including the multivariate Gaussian (MVG) model [14] and the generalized Gaussian distribution (GGD) model [16], to extract low-level image features for quality prediction. Although those methods have greatly improved BIQA performance, there still exists a large gap between prediction scores and subjective scores. In order to further improve prediction performance, Wu et al. [15] use multi-channel fused image features to simulate the hierarchical and trichromatic properties of human vision. Then, a k-nearest-neighbor (KNN)-based model is used to evaluate image quality. Similarly, Ji et al. [80] assume that image quality is highly correlated with the degraded visual information. Therefore, they use the joint entropy of degraded features, which simulates the visual information of the images, to assess image quality. Instead of studying quality-relevant image features, Wu et al. [81] focus on exploring efficient learning models. They propose a novel local learning method to improve the prediction performance, which is beneficial for training on complex and large data sets.
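To make the role of the GGD model in NSS-based feature extraction more concrete, the following minimal sketch estimates the GGD shape parameter of a set of zero-mean coefficients by moment matching, a fitting procedure commonly used with NSS models. It is an illustration under stated assumptions rather than the procedure of [16]: the function names, the grid range, and the use of synthetic samples in place of real image coefficients are all hypothetical choices.

```python
import math
import random


def ggd_moment_ratio(shape):
    """Theoretical E[x^2] / (E|x|)^2 of a zero-mean GGD with the given shape."""
    return (math.gamma(1.0 / shape) * math.gamma(3.0 / shape)
            / math.gamma(2.0 / shape) ** 2)


def estimate_ggd_shape(samples):
    """Estimate the GGD shape parameter of zero-mean samples by moment matching."""
    n = len(samples)
    mean_abs = sum(abs(x) for x in samples) / n
    mean_sq = sum(x * x for x in samples) / n
    target = mean_sq / (mean_abs ** 2)
    # Candidate shapes 0.20, 0.21, ..., 10.00 (the grid range is an assumption).
    candidates = [0.2 + 0.01 * i for i in range(981)]
    return min(candidates, key=lambda a: abs(ggd_moment_ratio(a) - target))


# Synthetic stand-in for image coefficients: Gaussian data has shape close to 2.
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(round(estimate_ggd_shape(gaussian_like), 2))
```

In an NSS-based method, the estimated shape (and scale) would serve as low-level features; a distortion that changes the coefficient statistics would shift these estimates away from the values observed for natural images.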
However, the obvious limitation of those BIQA methods is that hand-crafted features may not adequately represent complex image structures and distortions. Therefore, to improve prediction performance, attempts have recently been made to adopt deep BIQA methods. The motivation is that a deep neural network (DNN) can automatically capture deep features relevant to quality assessment and optimize these features through back-propagation to improve prediction performance. Therefore, DNNs can be applied to various image quality assessment (IQA) methods [83], [84] and provide a very promising option for addressing the challenging BIQA task.
It is well known that deep learning techniques have achieved great success in solving various image recognition and object detection tasks [17]-[20]. This success relies heavily on large-scale annotated data, such as the image-recognition-oriented ImageNet [21] dataset. Unfortunately, for the BIQA task, there is a lack of sufficient IQA data with ground-truth labels for training, so it is difficult to apply DNNs to BIQA directly. With so little data, a DNN is prone to overfitting, which means the trained model performs nearly perfectly on the training data but unreliably on unseen data. Therefore, researchers in the image quality community have paid increasing attention to DNN-based methods that address this problem.
Previous surveys have summarized BIQA methods, including classical methods [22]-[24] and DNN methods [25], [32]. However, the surveys of classical methods [22]-[24] lack an analysis of the popular DNN methods. And although some DNN methods are reviewed in [25], that review covers only the case where the DNN input is an image patch. At present, there are still many novel DNN methods that have not been summarized [26]-[31]. In addition, a simple comparison of different DNN methods is presented in our previous work [32], but it does not provide a comprehensive analysis and evaluation of the various DNN methods, including the design strategy, network architecture, network complexity, advantages, and disadvantages.
Therefore, in this paper, we systematically analyze the various DNN methods, aiming to summarize the intrinsic relationships among them. First, according to the role of the DNN, we divide the DNN methods into two categories, which makes different DNN methods easy to distinguish. One is the support vector regression (SVR)-based BIQA methods, which use a DNN to extract deep features and SVR to predict image quality. The other is the DNN-based BIQA methods, which take full advantage of the back-propagation capability of the DNN to optimize prediction accuracy. Moreover, we analyze the first type of DNN methods according to whether the input of the DNN is low-level features or image/image patch data. Similarly, we analyze the second type of DNN methods according to the DNN output. Fig. 2 shows the classification of the different DNN methods. Finally, we summarize useful findings and discuss the challenges of DNN methods for BIQA. We hope that this study will help researchers better understand this field.
Our contributions can be summarized as follows.
1) According to the different roles of the DNN, we propose a new classification method, which distinguishes different DNN methods and improves the understanding of them.
2) We analyze the DNN methods proposed in recent years in terms of their contributions, network architecture, complexity, advantages, and disadvantages. In particular, we also summarize many novel DNN methods that have not been discussed in previous literature surveys.
3) We systematically evaluate the prediction performance of different DNN methods and obtain some interesting conclusions. We also discuss some potential challenges and solutions for future research.
The rest of this paper is organized as follows. In Sec. II, we review the methods of SVR-based image quality prediction

II. SVR-BASED IMAGE QUALITY PREDICTION USING DEEP FEATURES EXTRACTED BY DNN
Since the deep features from a DNN can capture more useful information related to image distortions and human perception [25], the straightforward approach to employing DNN models is to extract discriminative deep features for various distorted images, and then evaluate the image quality using the conventional SVR method. Recent work using DNNs to extract deep features can be classified into two major schemes: 1) extracting from low-level image features and 2) extracting from image/image patch data. Figure 3 shows the flow diagram of these methods [33]-[35], [37]-[39].

A. DEEP FEATURES EXTRACTED FROM IMAGE LOW-LEVEL FEATURES
This kind of method feeds a large number of low-level image features relevant to quality perception into a DNN to evaluate image quality. Commonly, the low-level features are based on NSS and other complementary features, which describe the structural features of distorted images. These low-level features are then fed into a pre-trained DNN, such as a deep belief network (DBN) or a stacked auto-encoder (SAE) network [33]-[35], to extract deep features. In particular, an unsupervised training method [36] is adopted to pre-train the DBN or SAE network. The goal is to overcome the small IQA database problem and to initialize the parameters of each layer of the DBN or SAE network. Afterwards, the parameters of the entire network are fine-tuned with the labeled image features. Finally, the deep features extracted from the DBN or SAE model, along with the corresponding subjective scores, are used to evaluate image quality by the SVR method. Table 1 shows the details of these methods.
Tang et al. [33] extract three types of low-level features: NSS, texture, and blur/noise features. The NSS and texture features include the univariate and cross-scale histograms and statistics of the complex wavelet transform of images (the real part, absolute value, and phase). These features aim to describe global and local image distortions. The blur/noise features include the patch PCA singularity [86], the two-color model coefficient histograms [87], and the step-edge-based blur/noise estimation [88]. The blur/noise features are added because these distortions are fundamental to various distortion types. Then, all of these low-level features are used to pre-train each layer of the DBN, and the low-level features of an IQA database with ground-truth scores are used to fine-tune the entire DBN. Finally, Gaussian process regression is used to obtain the quality score of synthetically distorted images.
Ghadiyaram et al. [34] further extend this work by combining a DBN with SVR to predict the quality of authentically distorted images. They adopt the FRIQUEE method to extract low-level features of authentic images. FRIQUEE [77] first constructs several feature maps in multiple color spaces and transform domains, including luminance feature maps, LAB feature maps, and LMS feature maps. Then, the GGD, asymmetric GGD (AGGD), and wrapped Cauchy models are used to fit the feature maps and extract statistical features. Finally, these low-level features are fed into a DBN model to extract deep features, and image quality scores are predicted using the SVR method.
In addition, Lv et al. [35] further improve the prediction accuracy and generalization ability. The authors select multi-scale difference of Gaussian (DoG) features that are highly correlated with perceptual quality. This is because DoG is believed to simulate the neural processing by which the eye extracts details from images and conveys them to the brain. Then, the SAE model is used to obtain deep features. Finally, these deep features are used to train an SVM regression model to predict image quality.
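As an illustration of what multi-scale DoG responses look like in practice, the sketch below blurs a grayscale image at several scales and subtracts adjacent scales. It is a simplified example and not the feature extractor of [35]: the scale values, the three-sigma kernel radius, the border clamping, and all function names are assumptions made for this sketch.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a 2-D list, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass followed by a vertical pass.
    rows = [[sum(kernel[k + radius] * row[clamp(x + k, width)]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for row in image]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]


def dog_bands(image, sigmas=(1.0, 1.6, 2.56)):
    """Difference of Gaussian bands between each pair of adjacent scales."""
    blurred = [blur(image, s) for s in sigmas]
    return [[[a - b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(fine, coarse)]
            for fine, coarse in zip(blurred, blurred[1:])]


# A vertical step edge: the DoG response is concentrated around the edge.
step = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
bands = dog_bands(step)
print(len(bands), "bands of size", len(bands[0]), "x", len(bands[0][0]))
```

Statistics of such bands (for example, their energy per scale) could then serve as the low-level input to an SAE; which statistics are actually used is specific to each method and is not reproduced here.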
Compared with traditional BIQA methods, the major advantage is that the deep features extracted from low-level features are highly related to quality degradation. The limitation is that hand-crafted low-level features need to be carefully designed as the input to the DNN, which does not take full advantage of the DNN.

B. DEEP FEATURES EXTRACTED FROM IMAGE/IMAGE PATCHES
It is also observed that deep features can be effectively mined by feeding image or image patch data into a DNN pre-trained for classification or recognition tasks [37]-[39], such as AlexNet [17], GoogLeNet [18], ResNet [19], or VGGNet [20]. Since IQA involves the human visual perception of high-level semantics [40], methods that take images or image patches as the DNN input avoid the difficulty of selecting low-level features that accurately represent high-level image semantics.
More specifically, some methods use image patches to extract deep features, and the deep features derived from the image patches are aggregated or pooled. Then, the predicted quality of images is obtained by the SVR method. In [37], the authors use multiple overlapping image patches as input to represent the whole image. They select the optimal layer of the pre-trained DNN model to extract deep features of each patch. Then, three kinds of statistical methods are adopted to aggregate the high-level semantic features of the different patches. These aggregated features, which relate to the whole image, are fed into a linear regression model to predict image quality.
FIGURE 2: The classification of different DNN methods for BIQA. Extracting from low-level features [33]-[35]; extracting from image/image patch [37]-[39]; predicting image quality categories [43]-[47]; predicting image quality scores [26]-[31], [48]-[57], [61], comprising the patch-input methods (SS as patch label [30], [48]-[53]; FR as patch label [54]-[57]) and the image-input methods (expanding distorted images [26]-[27], [29], [31], [61]; expanding reference images [28]).
TABLE 1: Details of the methods that extract deep features from low-level image features.
Method | Low-level features | DNN | Regression
[33] | NSS, texture, and blur/noise features | DBN | Gaussian process
[34] | Statistical features of GGD, AGGD, and wrapped Cauchy models | DBN | SVR
[35] | Multi-scale difference of Gaussian (DoG) features | SAE | SVR
In addition, deep features carrying high-level semantic information of images are often used to evaluate image quality [38], [39], which is more consistent with human perception of images [41]. Sun et al. [38] propose a BIQA framework inspired by the observation that human visual perception of image quality involves the integrated analysis of global high-level semantics and local low-level characteristics. They use the first fully connected (FC) layer of the pre-trained AlexNet architecture to extract deep features, which represent high-level semantic features associated with the global image content. In addition to the high-level semantics, they also utilize saliency detection and Gabor filters to extract local low-level features relevant to the local image content. These features are combined to evaluate the overall image quality using the SVR method. Similarly, Wu et al. [83] hypothesize that different levels of distortion generate individual degradations on hierarchical features. Therefore, they propose a BIQA framework based on hierarchical feature degradation. They first extract low-level image features based on the orientation selectivity mechanism in the primary visual cortex, and then use the last layer of the residual network (ResNet50) to extract deep features of the visual content. Combining the low-level image features with the deep features, the image quality score is predicted by the SVR method. To further improve the prediction accuracy, Gao et al. [39] exploit a multi-level deep feature fusion method to evaluate image quality. They assume that using only the deep features of the last few layers may unduly generalize over local artifacts, so a multi-level feature representation compensates for local degradations. A DNN model formed by the pre-trained VGGNet is used to extract deep image features at each layer. Afterwards, they utilize the SVR method to estimate a quality score from each layer's feature vector. The image quality is estimated by averaging the layer-wise predicted scores.
Considering that training a deep network is typically difficult on a small IQA database, these methods tackle the shortage of IQA data by extracting deep features from a pre-trained DNN model. Meanwhile, instead of using selected low-level features as network input, these methods extract deep features directly from image or image patch data, which is more accurate. However, since the pre-trained DNN was trained for classification or recognition tasks, not all of its deep features may be useful when applied directly to the IQA task.

III. DNN-BASED BIQA USING DEEP FEATURES AND QUALITY PREDICTION TOGETHER
Instead of using DNN models only to extract deep features related to quality degradation, this kind of method directly uses the DNN model to predict image quality. According to the form of the predicted output, there are two kinds of popular methods in recent years: predicting image quality categories and predicting image quality scores.
FIGURE 3: The flowchart of the methods for extracting deep features from a DNN in [33]-[35], [37]-[39].

A. PREDICTING IMAGE QUALITY CATEGORIES
These DNN methods assign an image to a quality category, such as excellent, good, fair, poor, or bad [42]. These labels have explicit semantic meanings in different quality ranges, so the category results can be used directly to describe the image quality. Meanwhile, categorical quality assessment is a natural and viable way for humans to judge quality and can potentially reduce the randomness of the quality scores. Therefore, this kind of method treats BIQA as a classification problem to match human visual behavior [43]-[47]. The general flowchart of these methods is shown in Figure 4.
Hou et al. [43] design a deep network to classify images into five grades (excellent, good, fair, poor, or bad), corresponding to the five-point quality scale recommended by the International Telecommunication Union. Low-level NSS features of gray images are extracted in the wavelet domain and fed into the DBN for layer-by-layer pre-training. Then, they recast image quality into five grades using a subjective method. Finally, they fine-tune the DBN to classify image grades by maximizing the probabilistic distribution. Further, considering that not every region contributes to image quality perception, Hou et al. [44] also propose a saliency-guided deep framework to improve prediction performance. First, they extract salient patches of natural images and adopt the independent component analysis (ICA) method to learn basic filters. The same procedure is applied to encode the salient patches of distorted images. The image-level feature is a histogram that represents the frequency of the learned ICA filters. Second, the DBN is pre-trained by a layer-wise learning method and fine-tuned by a discriminative learning method, which enables the deep network to classify image grades.
The previous works focus on how to construct the deep network but do not provide a clear understanding of why their frameworks perform so well. In [45], the authors not only propose an SAE method to classify image grades but also try to give a visual explanation of how it works and why it works well. This is the first work to analyze and visualize the deep network framework. Similar to the methods in [43], [44], they derive NSS-based features from shearlet-transformed RGB images and use the SAE model to classify images into seven quality grades, with a training process similar to that of the DBN. In addition, they visualize the progression of the training features to understand the DNN framework in the fine-tuning stage.
The disadvantage of these methods is that hand-crafted features used as network input cannot completely represent image distortions and contents. In order to overcome this problem, Bianco et al. [46] propose an end-to-end DNN framework to improve the prediction performance. They first pre-train AlexNet for the classification task using 3.5 million images from the ImageNet and Places databases. The pre-trained AlexNet is then fine-tuned to classify the five image quality grades. The reported prediction performance is better than that of the previous methods [43], [44].
In [47], a vector regression DNN model is proposed to obtain image quality grades. The authors divide image scores into five ordered intervals corresponding to five different grades. A belief score vector is computed by (1) to describe the probabilities of an image being assigned to the different quality grades.
In (1), S is the belief score vector, which collects the belief scores of the five quality grades; s_k is the belief score defined for the k-th quality grade; and y is the mean opinion score (MOS) of an image.
The DNN is trained to predict the associated belief score vector. The smaller the value of |s_k|, the closer the image quality is to the k-th grade. Finally, they propose an object pooling strategy to convert the image quality grade into a score, which takes into account the influence of salient objects on image quality.
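The step of dividing scores into five ordered intervals can be sketched as follows. This example only illustrates the binning and a naive grade-to-score conversion; it does not implement the belief score vector of (1) or the object pooling strategy of [47]. The equal-width intervals on a 0 to 100 MOS scale, the interval-midpoint conversion, and the function names are assumptions made for illustration.

```python
GRADES = ["bad", "poor", "fair", "good", "excellent"]


def mos_to_grade(mos, lowest=0.0, highest=100.0):
    """Map a MOS to one of five ordered grades using equal-width intervals."""
    if not lowest <= mos <= highest:
        raise ValueError("MOS outside the assumed score range")
    width = (highest - lowest) / len(GRADES)
    # The top of the range belongs to the last interval.
    index = min(int((mos - lowest) / width), len(GRADES) - 1)
    return GRADES[index]


def grade_to_score(grade, lowest=0.0, highest=100.0):
    """Convert a grade back to a scalar using the midpoint of its interval."""
    width = (highest - lowest) / len(GRADES)
    return lowest + (GRADES.index(grade) + 0.5) * width


print(mos_to_grade(72.34), grade_to_score("good"))
```

A different choice of interval boundaries or of the conversion rule changes both the training labels and the final numerical scores, so results obtained with different definitions are not directly comparable.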
Although grade prediction methods are a more natural way to evaluate image quality, the drawback is that different definitions of the grades of subjective opinions can significantly affect the prediction performance of the algorithms. Meanwhile, in order to make a fair comparison with other algorithms, the qualitative evaluations are converted into numerical scores by using different methods. Different conversion methods will also affect the final evaluation performance.
FIGURE 4: The flowchart of the methods for predicting image quality categories in [43]-[46] (input labels in the figure: saliency features, image data).

B. PREDICTING IMAGE QUALITY SCORES
The methods of predicting image quality scores are the most popular for BIQA. These methods are purely data-driven and allow for end-to-end optimization of feature extraction and regression. They predict a specific scalar as the quality score of an image, such as DMOS = 72.34 or DMOS = 25.2. Most DNN methods adopt this approach because many IQA databases use scalar scores to describe image quality. Therefore, in order to keep the predicted results consistent with the IQA databases, this kind of method treats BIQA as a regression problem. Although previous work has summarized this kind of method [25], it only introduces the methods that use image patches as DNN input, and some novel DNN methods that have appeared recently are not analyzed [26]-[31], [53], [54], [57]. Thus, we systematically summarize and analyze the existing methods. According to the DNN input, we classify them into the patch-input methods and the image-input methods.

1) The patch-input methods
The performance of a DNN heavily depends on the amount of training data. However, the currently available IQA databases are much smaller than those for classification or recognition tasks [17], [18]. Moreover, obtaining large-scale reliable human subjective labels is very difficult. To expand the training database, the patch-input methods divide an image into multiple patches as DNN input to increase the number of training samples.
There are many methods based on image patches as DNN input. According to the labels of the training patches, we discuss these methods in two groups. One uses the image subjective score (SS) as the image patch label [30], [48]-[53]. The other uses an FR metric as the image patch label [54]-[57].

a: SS as image patch label methods
The method in [48] is the earliest to integrate feature learning and patch quality prediction into an end-to-end network. The authors divide gray images into 32 × 32 patches. Each image patch, labeled with the subjective score of its source image, is used to train a DNN that consists of 1 convolutional (C), 2 pooling (P), and 3 fully connected (FC) layers. The image quality is estimated by the average score of all image patches. Nevertheless, this approach ignores the fact that the visual quality of different local regions often differs and that humans tend to concentrate on salient regions when evaluating an image. Therefore, the following methods [49]-[51] consider the salient patches of images to predict image quality.
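The data preparation and pooling steps of this patch-input scheme can be sketched as follows. The DNN itself is omitted; the example only shows how an image is split into 32 × 32 patches, how each patch inherits the image subjective score, and how patch predictions are averaged. The non-overlapping tiling, the dropping of incomplete border patches, and the function names are assumptions of this sketch, not details confirmed for [48].

```python
def split_into_patches(image, size=32):
    """Split a 2-D grayscale image (list of rows) into non-overlapping patches.

    Incomplete patches at the right and bottom borders are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patch = [row[left:left + size] for row in image[top:top + size]]
            patches.append(patch)
    return patches


def label_patches(image, subjective_score, size=32):
    """Pair every patch with the subjective score of its source image."""
    return [(patch, subjective_score)
            for patch in split_into_patches(image, size)]


def average_pool(patch_scores):
    """Image-level score as the plain average of the predicted patch scores."""
    if not patch_scores:
        raise ValueError("at least one patch score is required")
    return sum(patch_scores) / len(patch_scores)


# A 64 x 96 image yields 2 x 3 = 6 training pairs with the same label.
image = [[0] * 96 for _ in range(64)]
pairs = label_patches(image, subjective_score=72.34)
print(len(pairs), average_pool([70.0, 75.0, 71.0]))
```

The sketch also makes the weakness of this labeling visible: every patch receives the same label regardless of its local content.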
In [49], the authors design a seven-layer DNN architecture for patch-level quality prediction on color images. They then perform saliency detection with a free-energy-based neural theory to obtain the image saliency map [58]. After that, they define the weights of the image patches from the corresponding saliency map. The final image quality score is the weighted average of the patch scores. To further improve prediction performance, the methods in [50], [51] consider only the salient patches when evaluating the image quality score. First, they also split the image into patches and use typical saliency detection methods to obtain the image saliency map. Then, they apply a threshold to remove non-salient patches. The weights of the remaining salient patches are rescaled into the range [0, 1]. The whole-image quality score is calculated as the weighted average over the salient patches. The general flowchart is shown in Figure 5.
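The two pooling strategies described above can be written compactly. In the sketch below, the patch scores and per-patch saliency values are taken as given inputs; the threshold value, the rescaling rule (division by the largest remaining saliency value), and the function names are assumptions, since the exact rules of [49]-[51] are not reproduced here.

```python
def weighted_pool(patch_scores, weights):
    """Weighted average of patch scores: sum(w_i * q_i) / sum(w_i)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * q for w, q in zip(weights, patch_scores)) / total


def salient_pool(patch_scores, saliency, threshold=0.5):
    """Pool only the patches whose saliency reaches the threshold."""
    kept = [(q, s) for q, s in zip(patch_scores, saliency) if s >= threshold]
    if not kept:
        raise ValueError("no patch passes the saliency threshold")
    scores = [q for q, _ in kept]
    # Rescale the remaining saliency values so that the weights lie in [0, 1];
    # dividing by the maximum is one possible rule, assumed here.
    peak = max(s for _, s in kept)
    weights = [s / peak for _, s in kept] if peak > 0 else [1.0] * len(kept)
    return weighted_pool(scores, weights)


scores = [10.0, 20.0, 30.0]
saliency = [0.1, 0.5, 1.0]
print(weighted_pool(scores, saliency), salient_pool(scores, saliency))
```

With these inputs, the first patch is discarded by the threshold, so the pooled score is driven by the two salient patches only.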
However, in the previous methods the saliency-map weights are set by hand, which may not accurately reflect image quality. Some methods study the use of an end-to-end DNN to simultaneously obtain patch scores and the corresponding weights. The weights learned by the DNN respond more accurately to image perception. In [52], the distorted image patches are fed into a DNN that consists of 9 C layers and 5 P layers for feature extraction and 2 FC layers for regression. The first FC layer is used to learn the patch weights, and the second FC layer is used to learn the patch scores. The image quality score is calculated as the weighted average of all patch scores. Compared with models employing simple average pooling or hand-set weight pooling, this method improves prediction accuracy and generalizes well. Similarly, in [53], the authors divide the image into 100 image patches and feed them into the DNN to obtain patch scores and weights. Considering the relationship between image contents and patch weights, a global regression layer is used to optimize the predicted image score.
In addition, in order to learn the complicated relationship between visual appearance and perceived quality, Yan et al. [30] propose a novel two-stream DNN architecture, which takes the raw image and the gradient image as input via two sub-networks. The motivation of this design is to integrate input information from different domains to represent the quality of distorted images. Each image is divided into patches that serve as the inputs of the image-stream sub-network. Each sub-network consists of ten layers to extract image features. In particular, a region-based fully convolutional layer is used to handle the locally non-uniform distortions of images. The gradient-stream sub-network is similar to the image stream, and its input is gradient patches. Then, a concatenation layer fuses the features from the two streams, and the following three FC layers predict the patch quality. Finally, the quality score of the whole image is calculated by averaging all patch scores. The overall framework of the algorithm is presented in Figure 6.
Table 2 compares the implementations of the reported patch-input algorithms in which the patch label is the ground-truth score. Note that C, P, and F denote the convolutional layer, pooling layer, and fully connected layer, respectively; w_i is the weight of the i-th patch; M is the number of all patches of an image; K is the number of salient patches of an image; and q_i is the patch score predicted by the DNN model. Table 2 shows that, because of the increase in training samples, the patch-input algorithms can use a deeper network to evaluate the image quality score. Meanwhile, these methods mainly pay attention to the effect of salient patches on image quality. However, labeling image patches with the whole-image subjective score is problematic, because a ground-truth score for each patch does not exist. In addition, the whole-image quality score is calculated by a simple mathematical method, which may affect the accuracy of image quality prediction.

b: FR as image patch label methods
To overcome the problem of inaccurate patch labels, the strategy of using FR methods to calculate proxy scores for image patches has been studied [54]-[57]. Figure 7 shows the flowchart of these methods.
The method in [54] is a novel, completely blind DNN method. Taking a large set of image patches as the training set, the authors design a DNN that fuses features from different layers and use FSIM as the label to train the DNN architecture. The DNN consists of 6 C layers, 1 P layer, 2 sum (SU) layers, and 2 FC layers. The role of the sum layers is to fuse features from different layers to prevent gradient vanishing [19]. In particular, the training patch label is calculated using the FR method, which is an accurate way to obtain patch labels without subjective scores.
In [55], J. Kim et al. propose a two-stage DNN-based method to evaluate image quality. The patch quality scores generated by the FSIM method are used as proxy patch labels in the first stage of training. In the second stage, the feature vectors obtained from the image patches are aggregated using statistical moments, and then a global regression layer is used to predict the image quality score. Rather than using a complex DNN to produce proxy scores, the same authors develop a novel DNN that regresses onto objective error maps [56]. In the first stage, the objective error maps, calculated as the absolute difference between the reference image patch and the distorted image patch, are used as proxy regression targets to train the DNN. In the second stage, the feature maps extracted from the DNN are fed into a global average pooling layer and then regressed onto the ground-truth scores using two fully connected layers. The prediction accuracy is competitive with the state-of-the-art methods.
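The proxy target of the first stage and the pooling operation of the second stage are simple to state in code. The sketch below computes a pixel-wise absolute error map between a reference patch and a distorted patch, and reduces a 2-D map to a scalar by global average pooling. The network that is trained on these targets is not shown, and the function names and toy pixel values are hypothetical.

```python
def error_map(reference_patch, distorted_patch):
    """Pixel-wise absolute difference between reference and distorted patches."""
    return [[abs(r - d) for r, d in zip(ref_row, dist_row)]
            for ref_row, dist_row in zip(reference_patch, distorted_patch)]


def global_average_pool(feature_map):
    """Reduce a 2-D map to a single value by averaging all of its entries."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


reference = [[10, 20], [30, 40]]
distorted = [[12, 20], [25, 40]]
target = error_map(reference, distorted)
print(target, global_average_pool(target))
```

Because the error map needs the reference patch, it can only be computed at training time; at test time the trained network must predict quality from the distorted patch alone.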
To further improve prediction performance, Pan et al. [57] propose a novel framework for BIQA, which consists of a generative quality map network and a quality pooling network. They employ MDSI [59] to generate patch quality maps as labels and select U-Net [60] as the basis of the generative network that learns the patch quality maps. The output quality maps are fed directly into the pooling network to regress the patch scores. Finally, the score of the whole image is obtained by averaging the scores of all image patches.
Table 3 compares the algorithms that obtain patch labels using FR methods. Compared with methods that use the subjective score as the patch label, these methods use FR metrics as intermediate local targets for each image patch, which reduces the error introduced by assigning the subjective score of the whole image to every patch. In addition, compared with obtaining the image quality score through a simple mathematical calculation, global optimization is more accurate for the DNN. The disadvantage of using FR methods for patch labels, however, is that the reference images required by the FR metrics are very hard to obtain in many practical applications.
2) The image-input methods
Rather than using image patches as input, the image-input methods train a prediction model on whole images and their associated ground truth, which avoids the difficulty of obtaining ground truth for image patches. However, there has been limited effort towards end-to-end optimized BIQA using DNN, primarily due to the lack of sufficient ground truth labels for images.
Recently, several image-input methods have been developed [26]-[29], [31], [61]. Their novelty is that, despite the small size of IQA databases, a DNN that takes the whole image as input can still evaluate image quality very well, because image expansion techniques compensate for the insufficient IQA data. According to the object being expanded, we classify these methods into two sub-categories: expanding distorted images and expanding reference images. For the methods that expand distorted images, there are two sources of additional images: large databases, such as ImageNet [21] and Places2 [62], and artificially generated images [26]-[29].
With large databases, the DNN is trained by transfer learning [63], a common way to overcome the small-database problem. The distorted images from the large database are used to pre-train a DNN, and the small IQA database is then used to fine-tune the pre-trained DNN to predict image quality scores. In [61], Li et al. utilize Network in Network (NIN) [64] and transfer learning to deal with the BIQA problem. In the first step, the NIN is pre-trained for the classification task on the large-scale ImageNet database. This pre-training provides good initial weights, which are much better than randomly initialized weights. In the second step, they modify the pre-trained NIN architecture by replacing the final layer with regression layers. In the third step, only the small set of IQA images with ground truth scores is used to fine-tune the pre-trained NIN. However, for synthetic IQA databases, such as LIVE [65], TID2013 [66], CSIQ [67] and LIVE multiply distorted (MD) [68], the prediction is not accurate. This is because the pre-trained NIN learns the features of the authentic distortions in the ImageNet database, which differ from synthetic distortions.
In [31], the authors assume that the various kinds of distortions in different IQA databases require features at different levels to predict visual quality. Therefore, they propose a DNN model that uses multiple levels of features simultaneously to achieve consistent performance over different IQA databases. The ResNet-50 [19] model pre-trained on the ImageNet database is adopted as the baseline. In the fine-tuning stage, they divide all ResNet blocks into four groups and extract each group's features. Then, they define an encoder layer to unify the sizes of the features from different levels. Finally, these multi-level features are combined and fed into the FC layer to predict the image quality score. This method shows state-of-the-art accuracy on different IQA databases.
Besides, the artificial generation method [26], [27], [29] can be used to construct large-scale sets of distorted images for pre-training that are similar to those in the IQA databases. The challenge of this approach is how to obtain ground truth labels for the generated images in the pre-training stage, since it is far from realistic to carry out a full subjective test to obtain a MOS or difference MOS (DMOS) for each image.
To overcome this problem, Rank [26] introduces a new strategy to generate large-scale distorted images without laborious human labeling. Following the rule that image quality decreases as the distortion level increases, the authors synthetically generate ranked image pairs with five different distortion levels from the Waterloo Exploration database [69], which contains 4744 pristine images covering various image contents. Notably, the generated distorted image pairs are similar to those in the IQA databases. For the LIVE database, they exclude the fast fading (FF) distortion type and generate the remaining four distortion types: JPEG compression (JPEG), JPEG2000 compression (JPEG2000), additive white Gaussian noise (WN) and Gaussian blur (GB). For the TID2013 database, they generate 17 out of a total of 24 distortion types. Consequently, for any pair of generated images it is known which one is of higher quality. Using the ranked image pairs, they pre-train a Siamese network [70] to learn image distortion levels with the proposed Siamese back-propagation technique. Finally, they fine-tune one branch of the Siamese network to predict image scores, which transfers the learned distortion levels to quality scores. Figure 8 shows the flowchart of the Rank method. Compared with existing BIQA methods, its prediction performance is the best on the LIVE database and even outperforms state-of-the-art FR methods. However, the limitation of the Rank method is that it can only simulate the distorted images of synthetic IQA databases and is difficult to apply to authentic IQA databases, because no prior information about authentic distortions is available.

Therefore, to improve performance across different IQA databases, Zhang et al. design an end-to-end DB-CNN solution for BIQA that works for both synthetically and authentically distorted images [27]. First, they describe how the large-scale database for the pre-training step is generated. They use two large-scale databases, the Waterloo Exploration database and PASCAL VOC 2012 [71], to generate distorted images. Considering the distortion types of the synthetic IQA databases, they produce nine distortion types related to the LIVE, TID2013, CSIQ and LIVE MD databases, i.e., JPEG, JPEG2000, WN, GB, pink noise, contrast stretching, image quantization with color dithering (ICQD), over-exposure and under-exposure. Notably, the first six distortion types cover the entire CSIQ database. They synthesize distorted images at five distortion levels, except for over-exposure and under-exposure, for which only two levels are generated. In total, the pre-training database contains 852891 distorted images. The ground truth label is a 39-class indicator vector that encodes the underlying distortion type at a specific distortion level. The dimension of 39 follows from seven distortion types with five levels and two distortion types with two levels (7 × 5 + 2 × 2 = 39).
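The construction of the 39-class indicator vector can be made concrete with a short sketch. The distortion names follow the text; the ordering of the classes and the helper names are assumptions made for illustration only.

```python
# Distortion names follow the text; the class ordering is an assumption.
FIVE_LEVEL_TYPES = ["JPEG", "JPEG2000", "WN", "GB", "pink noise",
                    "contrast stretching", "ICQD"]
TWO_LEVEL_TYPES = ["over-exposure", "under-exposure"]

def build_class_index():
    """Map every (distortion type, level) pair to a class index."""
    classes = []
    for name in FIVE_LEVEL_TYPES:
        classes += [(name, level) for level in range(1, 6)]
    for name in TWO_LEVEL_TYPES:
        classes += [(name, level) for level in range(1, 3)]
    return {pair: i for i, pair in enumerate(classes)}

CLASS_INDEX = build_class_index()

def indicator_vector(distortion, level):
    """One-hot label for a distorted image of the given type and level."""
    vec = [0] * len(CLASS_INDEX)
    vec[CLASS_INDEX[(distortion, level)]] = 1
    return vec

print(len(CLASS_INDEX))                  # 39 = 7 * 5 + 2 * 2
print(indicator_vector("GB", 3).index(1))
```

Because the labels are known by construction, no subjective test is needed for the pre-training stage.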
Then, they design the S-CNN architecture for synthetically distorted images, which consists of 9 C layers, 1 P layer, 3 FC layers and a softmax (S) layer. It outputs the probability of each distortion type at each degradation level. Because this DNN model is not beneficial for authentic IQA databases, they select the VGG-16 network pre-trained for the classification task on ImageNet as another branch to extract relevant features for authentically distorted images. This is because the distortions in ImageNet occur as a natural consequence of photography rather than simulation. Finally, in the fine-tuning step, they tailor the pre-trained S-CNN and VGG-16 and introduce a bilinear pooling module that combines the S-CNN for synthetic distortions and the VGG-16 for authentic distortions into a single model, so that both synthetic and authentic distortions are represented. An FC layer follows the bilinear pooling layer to predict the image quality score. The flowchart of DB-CNN is shown in Figure 9.
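To give an intuition for the bilinear pooling step, the following sketch combines the features of two branches by summing the outer products of their channel vectors over all spatial locations. This is the generic form of bilinear pooling and is offered as an assumption; any normalization or other details of the actual DB-CNN module are not reproduced, and the function name is hypothetical.

```python
def bilinear_pool(feats_a, feats_b):
    """Generic bilinear pooling of two branches.

    feats_a[k] and feats_b[k] are the channel vectors of branch A and
    branch B at the same spatial location k. The result is the sum over
    locations of the outer product a * b^T, flattened row by row.
    """
    ca, cb = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * cb for _ in range(ca)]
    for a, b in zip(feats_a, feats_b):
        for i in range(ca):
            for j in range(cb):
                pooled[i][j] += a[i] * b[j]
    return [v for row in pooled for v in row]

# Two spatial locations; branch A has 2 channels, branch B has 1 channel.
branch_a = [[1.0, 2.0], [0.0, 1.0]]
branch_b = [[3.0], [4.0]]
print(bilinear_pool(branch_a, branch_b))  # [3.0, 10.0]
```

The pooled vector has one entry per pair of channels, so every feature of one branch can interact with every feature of the other before the final FC layer.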
A closely related work to DB-CNN [27] is MEON [29], a cascaded multi-task DNN framework for BIQA. This method also considers the influence of distortion types and levels on quality degradation. Figure 10 shows the flowchart of the MEON method. Sub-task I pre-trains a distortion type identification network, for which large-scale training samples are readily available. The authors select 840 high-resolution natural images and generate distorted images of C distortion types, each at five distortion levels. The ground truth label is a C-dimensional vector that encodes the distortion type. This network consists of 4 C layers, 4 P layers, 2 FC layers and 1 S layer. Notably, they choose the biologically inspired generalized divisive normalization (GDN) instead of the rectified linear unit as the activation function of the C and FC layers. The sub-task II network appends two FC layers to the DNN layers shared with sub-task I. Then, they define a fusion layer (FS) that combines the distortion type features from sub-task I and the distortion level features from sub-task II to yield an overall quality score.
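As an illustration of the GDN activation mentioned above, the sketch below applies the commonly used form y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2) across the channels at one spatial location. The text does not give the formula, so this form, the parameter names and the toy values are assumptions; in a trained network beta and gamma are learned.

```python
import math

def gdn(x, beta, gamma):
    """Generalized divisive normalization across channels (assumed form).

    x     : channel responses at one spatial location
    beta  : per-channel positive offsets
    gamma : gamma[i][j] weights the contribution of channel j to the
            normalization pool of channel i
    """
    out = []
    for i, xi in enumerate(x):
        pool = beta[i] + sum(g * xj * xj for g, xj in zip(gamma[i], x))
        out.append(xi / math.sqrt(pool))
    return out

# With gamma = 0 and beta = 1 the layer is the identity.
print(gdn([3.0, 4.0], [1.0, 1.0], [[0.0, 0.0], [0.0, 0.0]]))
# With beta = 0 and gamma = 1 every channel is divided by the vector norm.
print(gdn([3.0, 4.0], [0.0, 0.0], [[1.0, 1.0], [1.0, 1.0]]))
```

Unlike a rectified linear unit, each output depends on the responses of the other channels, which is the divisive normalization behaviour the name refers to.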
Table 4 summarizes the methods that expand distorted images. LM denotes the learning method, GT denotes the ground truth of the generated images and NGI denotes the number of generated images. We clearly see that transfer learning is used to overcome the small size of IQA databases. The pre-training stage solves a classification problem, because its ground truth labels are known by construction and do not require human subjective judgment. Notably, the depth of the network increases with the number of pre-training samples. Moreover, in order to deal with authentic images, some methods add a sub-network to meet the prediction needs of authentic IQA databases.

Using a generative adversarial network (GAN) to augment images is a novel topic. Since distorted images and their corresponding undistorted reference images are typically absent in IQA databases, the prediction of image quality is not accurate. Thus, the HIQA method [28] addresses this problem by combining a GAN with a GAN-guided quality regression (R) network. Fig. 11 shows the flowchart of the HIQA method. First, the quality-aware generative (G) network compensates for the absence of a reference image by generating a hallucinated reference image I_h conditioned on the distorted image I_d. To reduce the difference between the hallucinated image and the corresponding reference image, the loss function of the G network combines a pixel-wise error and a perception-wise difference. Second, they propose an IQA-discriminator (D) network that adjusts the loss of G so that it produces outputs of high perceptual quality; even when G fails to generate a good hallucinated image, the scores predicted by the R network should remain reasonable. Finally, the distorted images and the discrepancy maps between the hallucinated images and the corresponding distorted images are fed into the R network, and a high-level feature fusion scheme is adopted to optimize the R network. The training strategy is as follows. The GAN is first trained to generate a large number of hallucinated images that are similar to the reference images in the IQA database, and the R network is then trained to predict image quality scores. Within the GAN, the D network is first trained to distinguish the fake reference images from the real reference images of the IQA database. Then, the G network is trained to generate images that are similar to the real reference images. Finally, the image quality score is predicted by optimizing the loss of the R network.

A. DESCRIPTION OF PUBLIC DATABASES AND EVALUATION METRICS
The choice of training database is important for deep-learning-based models, since their performance depends heavily on the size of the training set. We briefly describe several popular public databases for BIQA: LIVE [65], TID2013 [66], CSIQ [67], LIVE MD [68] and the LIVE In the Wild Image Quality Challenge Database (LIVEC) [72].
1) The LIVE database [65] includes 29 reference images and 779 distorted images degraded by five types of distortions (JPEG, JP2K, WN, GB, Rayleigh fast-fading channel distortion (FF)). Subjective quality scores are provided in the form of difference mean opinion score (DMOS) ranging from 0 to 100, where a lower score indicates better image quality.
2) The TID2013 database [66] contains the largest number of distorted images. It consists of 25 reference images and 3000 distorted images with 24 different distortion types at five levels of degradation. The database also provides the MOS, ranging from 0 to 9. A higher value of MOS indicates higher quality. The distortion types include a range of noise, compression, and transmission artifacts.

FIGURE 11: The flowchart of the HIQA method in [28] (blocks: GAN, regression network, reference images, discrepancy maps, IQA database)
3) The CSIQ database [67] consists of 30 reference images and 866 distorted images corrupted by six types of distortions: JPEG, JP2K, WN, GB, pink Gaussian noise and global contrast decrements. Each image is distorted at five different distortion levels. Subjective quality scores are provided in the form of DMOS ranging from 0 to 1.
4) The LIVE MD database [68] was the first to include multiply distorted images. Images are distorted by two types of distortions in two combinations: simulated GB followed by JPEG and GB followed by additive WN. It contains 15 reference images and 450 distorted images, and the DMOS of each distorted image is provided, ranging from 0 to 100.
5) The LIVE In the Wild Image Quality Challenge Database (LIVEC) [72]

ground truth quality of each patch does not exist. For image-input methods, we clearly see that on the synthetic IQA databases, the methods of expanding distorted images [26]-[28] are more beneficial than the methods that directly use large databases [31]. On the LIVE database, the RANK and DB-CNN methods outperform MFIQA, because the artificial generation method can simulate images with distortion types and levels similar to those in the synthetic IQA databases. Hence, the DNN can roughly learn the features of such distorted images in the pre-training stage. In contrast, on the LIVEC database, MFIQA is better than RANK, because a DNN pre-trained on a large database learns the features of real distortions. However, because synthetically distorted images are limited, methods based on them cannot meet the needs of various databases, which leads to poor generalization across databases. To overcome this problem, the DB-CNN method designs two sub-networks that handle both synthetic and authentic distortions, thus improving prediction accuracy. In addition, the methods that expand distorted images are competitive with those that expand reference images. It is nevertheless worth noting that the latter are the first to use the popular GAN method to address the problem of insufficient IQA data.

C. PERFORMANCE ON CROSS-DATABASE
A robust BIQA model that has been trained on one image quality database is expected to accurately assess the quality of images in other databases. Therefore, in Table 8, we compare the generalizability of the classic BIQA methods and the DNN methods on the synthetic distortion databases only. We do not consider training the DNN models on the authentic distortion database (LIVEC). On the one hand, some DNN methods, such as DIQA [56] and RANK [26], need reference images or simulated distortions to train the DNN model, while LIVEC contains authentically distorted images without reference images or prior knowledge of the distortion types. On the other hand, because of the large difference between synthetic and authentic images, many DNN methods do not report cross-database tests between synthetic and authentic databases. Therefore, the compared BIQA methods are trained using all the images from one synthetic database and then tested on another database. For the CSIQ and TID2013 databases, only the four overlapping distortion types (WN, GB, JPEG, JP2K) are used.
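The cross-database protocol described above can be sketched as follows. The record layout (a dictionary with a "type" key) and the helper names are hypothetical; only the four overlapping distortion types and the train-on-one, test-on-another scheme come from the text.

```python
# Distortion types shared by the synthetic databases, as listed in the text.
SHARED_TYPES = {"WN", "GB", "JPEG", "JP2K"}

def shared_subset(samples):
    """Keep only the images whose distortion type is shared across databases."""
    return [s for s in samples if s["type"] in SHARED_TYPES]

def cross_database_pairs(names):
    """All ordered (train, test) pairs with different databases."""
    return [(train, test) for train in names for test in names if train != test]

# Hypothetical records standing in for entries of a database.
csiq_like = [
    {"image": "a.png", "type": "JPEG"},
    {"image": "b.png", "type": "pink noise"},
    {"image": "c.png", "type": "GB"},
]
print([s["image"] for s in shared_subset(csiq_like)])   # ['a.png', 'c.png']
print(cross_database_pairs(["LIVE", "CSIQ", "TID2013"]))
```

Each pair corresponds to one cell of a cross-database table: the model is fitted on the first database and evaluated, without further tuning, on the second.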
Table 8 shows that the DNN methods achieve the best performance when the models are trained on the LIVE database and tested on the subsets of the other databases. MFIQA and DIQA obtain better performance than the other methods when trained on the CSIQ subset and the TID2013 subset, respectively. Therefore, the generalization ability of the end-to-end DNN methods is generally better than that of the classic BIQA methods and of the BIQA methods based on extracted deep features. This is because the end-to-end methods can learn deep features from images or image patches and reduce prediction errors through back-propagation. The classic BIQA methods, however, are limited to hand-crafted features, which cannot completely represent image structures and distortions. Meanwhile, the prediction performance of shallow regression, such as SVR, is not as good as that of a deep regression network. Similarly, although the extracted-deep-feature methods can further extract deep features from the limited low-level features, the shallow regression restricts their generalization ability.
Furthermore, among the DNN methods, the generalizability of the patch-input methods [48], [52], [56] is better than that of the image-input methods [26], [31]. The main reason is that the patch-input methods use the images of the IQA database itself to expand the training samples, whereas the image-input methods expand the IQA database with external images. These external images are fitted to resemble IQA images. The difference between the fitted images and the real IQA images reduces the generalization ability of the DNN model.

D. THE COMPLEXITY OF DIFFERENT DNN METHODS
We report the complexity of different DNN methods, including CNN, DIQaM, BIECON, RANK and DB-CNN, in Table 9. WPs and BPs denote the weight parameters and bias parameters, respectively. ATPs denotes the total parameters of the DNN. CTs denotes the parameters of all C layers and FTs denotes the parameters of all F (fully connected) layers. Since the learnable parameters that training updates reside in the C and F layers, the complexity of an algorithm is closely related to the parameters of these layers. In Table 9, we clearly see that the complexity of CNN is lower than that of DIQaM, BIECON, RANK and DB-CNN, because it has fewer layers than the other methods. Further, the complexity of the F layers is higher than that of the C layers, except for DIQaM. Notably, in DB-CNN and RANK, although the number of F layers is much smaller than the number of C layers, the complexity of the F layers is still higher. This is because an F layer connects all features jointly, while a C layer only operates on local features. Comparing DIQaM and BIECON, since the number of FC layers in BIECON is much larger than in DIQaM, it is easy to understand that the complexity of BIECON is higher than that of DIQaM. Therefore, the F layers have a greater effect on DNN complexity than the C layers. When designing a deep network, it is thus worth considering both the number of layers and the proportion of C and F layers.
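Why a few F layers can outweigh many C layers becomes clear from the standard parameter-count formulas. The sketch below assumes square kernels and one bias per output channel or unit; the layer sizes are illustrative values in the style of VGG-16 and are not taken from Table 9.

```python
def conv_params(kernel, c_in, c_out):
    """(weights, biases) of a convolutional layer with a square kernel."""
    return kernel * kernel * c_in * c_out, c_out

def fc_params(n_in, n_out):
    """(weights, biases) of a fully connected layer."""
    return n_in * n_out, n_out

# Illustrative sizes (assumed, not from Table 9):
# a 3x3 convolution with 64 input and 64 output channels ...
conv_w, conv_b = conv_params(3, 64, 64)
# ... and an FC layer mapping a flattened 512 x 7 x 7 feature map to 4096 units.
fc_w, fc_b = fc_params(512 * 7 * 7, 4096)

print(conv_w + conv_b)   # 36928
print(fc_w + fc_b)       # 102764544
```

A convolution shares its small kernel over all spatial positions, whereas an FC layer needs one weight for every input-output pair, so a single FC layer on a large feature map can dominate the total count.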

E. DISCUSSION OF DIFFERENT DNN METHODS
As shown in Table 10, we compare the implementations of different DNN methods. The first three DNN models are patch-input methods and the last two are image-input methods. SS denotes the image subjective score, and DL and DT denote the distortion level and distortion type, respectively. The comprehensive performance is presented on five different databases (LIVE, TID2013, CSIQ, LIVE MD, LIVEC). In Table 10, we find that the prediction performance is not

V. CHALLENGES OF DNN METHODS
In the previous sections, we presented a comprehensive review of the recent literature on DNN models for BIQA. Although DNN-based BIQA methods can achieve outstanding performance due to their strong representation capability, several challenges remain. We also suggest possible solutions to these challenges.
1) Creating the large-scale IQA database
The number of training samples is critical to the success of DNN models, and the lack of large training data sets is often mentioned as a challenge. Although both the image-input methods and the patch-input methods overcome the problem of insufficient IQA data to some extent, each has shortcomings in the accuracy of the labels of the generated images. Therefore, how to create reliable, very large-scale databases is still an open question. Online crowdsourcing is one possible solution, as it can gather very rich human data through subjective testing. In addition, if a large social media company were to engage its customers to provide image quality scores, this would also help ensure the aggregate quality of the collected human data.
2) Exploring unsupervised DNN methods
The current DNN models mainly use supervised end-to-end optimization to evaluate image quality. However, the lack of sufficient ground truth labels is a serious problem for BIQA. Therefore, we expect that training an end-to-end DNN model in a completely unsupervised manner is worth further investigation. This is because obtaining large amounts of unlabeled data is generally much easier than obtaining labeled data, and human learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the specific labels. Thus, one could try to design a two-branch network for unsupervised learning, in which one branch learns the features of reference images and the other learns the features of distorted images. The most important step is then to establish a loss mechanism that quantifies the difference between the two branches. In addition, a proxy mechanism may be designed to replace the image subjective scores.
3) Explaining the theoretical basis of DNN methods
Although a DNN can fit the data distribution and produce good results, there is no theoretical analysis that explains to humans why a designed DNN architecture works well or how to further improve its prediction performance. Therefore, it is meaningful to explore theoretical guarantees for DNN models in order to guide further research in this field. Two approaches may be used to explain DNN algorithms. One is to analyze the DNN architecture using visualization methods [85]. Visualizing the features layer by layer helps understand how the DNN learns useful features for the IQA task. The other is to explain the functions of the DNN in terms of the requirements of the algorithm, so that these functions can be related to the IQA problem.

VI. CONCLUSION
This paper presents a systematic survey of various DNN-based methods for BIQA. We discussed and analyzed the state-of-the-art DNN methods according to the different strategies of the DNN models. This classification explicitly shows the characteristics, advantages and disadvantages of different DNN methods for BIQA. Especially, some novel

VOLUME 4, 2016
This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/.This article has been accepted for publication in a future issue of this journal, but has not been fully edited.Content may change prior to final publication.Citation information: DOI 10.1109/ACCESS.2019.2938900,IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

FIGURE 5: The general flowchart of SS as image patch label methods in [49]-[51]

FIGURE 8: The flowchart of the Rank method in [26]

FIGURE 10: The flowchart of the MEON method in [29]

The LIVEC database [72] comprises 1162 images, which are captured using modern mobile devices and contain diverse authentic image distortions. In addition, no undistorted reference images are available in LIVEC. Subjective scores are obtained in the form of MOS on an online crowdsourcing platform. MOS values lie in the range [0, 100]. The summary of the above databases is shown in Table 5. Ref denotes the number of reference images, Dist denotes the number of distorted images and DT denotes the number of distortion types. SST and SR denote the type and range of the subjective scores.

TABLE 3: The comparison of DNN methods by using FR as patch label

a: Expanding distorted images' methods

TABLE 5: Comparison of different IQA databases

PLCC) are used for performance evaluation. These metrics measure the correlation between a set of estimated visual quality scores Q_est and a set of human subjective quality scores Q_sub, as:
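A minimal sketch of such correlation measures is given below, assuming the standard definition of the Pearson linear correlation coefficient (PLCC). The Spearman rank-order correlation coefficient (SROCC) is included as the rank-based counterpart that is commonly reported alongside PLCC; its inclusion, the function names and the toy scores are assumptions.

```python
import math

def plcc(q_est, q_sub):
    """Pearson linear correlation between predicted and subjective scores."""
    n = len(q_est)
    m_est, m_sub = sum(q_est) / n, sum(q_sub) / n
    cov = sum((a - m_est) * (b - m_sub) for a, b in zip(q_est, q_sub))
    s_est = math.sqrt(sum((a - m_est) ** 2 for a in q_est))
    s_sub = math.sqrt(sum((b - m_sub) ** 2 for b in q_sub))
    return cov / (s_est * s_sub)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def srocc(q_est, q_sub):
    """Spearman rank-order correlation: PLCC computed on the ranks."""
    return plcc(ranks(q_est), ranks(q_sub))

# Toy scores: predictions are monotone in the subjective scores but not linear.
q_sub = [1.0, 2.0, 3.0, 4.0]
q_est = [1.0, 4.0, 9.0, 16.0]
print(round(srocc(q_est, q_sub), 6))   # 1.0, the ordering is preserved
print(plcc(q_est, q_sub) < 1.0)        # True, the relation is not linear
```

PLCC reflects prediction accuracy in a linear sense, while SROCC only depends on the ordering of the scores and therefore reflects prediction monotonicity.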