End-to-End Image Patch Quality Assessment for Image/Video With Compression Artifacts

In this paper, we present an experimental image quality assessment (IQA) method for image/video patches with compression artifacts. Using the High Efficiency Video Coding (HEVC) standard, we create a new database of image patches with compression artifacts. Then, we conduct a complete subjective testing process to obtain 'ground truth' quality scores for this database. Finally, we employ an end-to-end learning method to estimate the IQA model for patches with HEVC compression artifacts. In the proposed method, a modified convolutional neural network (CNN) architecture is exploited for feature extraction, while an adaptive moment estimation (ADAM) optimizer is used to perform the training. Experimental results show that the proposed end-to-end IQA method significantly outperforms the relevant IQA benchmarks, especially when compression artifacts are strongly visible in image/video patches. The proposed IQA method is expected to drive a new set of compression solutions in future image/video coding and transmission.


I. INTRODUCTION
Image quality assessment (IQA) plays a critical part in image and video communications and is a basic requirement in encoding images and videos [1]. Generally, either subjective or objective methods can be used to evaluate image quality.
Subjective assessment methods are highly effective, but they can be infeasible in real time and at large scale. They require the engagement of a number of human viewers who give their opinions on image/video quality under a variety of test conditions. Testing conditions must therefore be closely monitored, with careful selection of observers and processing of the findings to ensure their consistency and statistical significance. Consequently, subjective methods are costly and time consuming.
Unlike the subjective method, objective quality assessment adopts criteria that attempt to imitate the ability to perceive images via the human visual system (HVS). In some conventional methods, the absolute or squared difference between distorted images and their original ones is utilized.
Traditional image compression methods mainly use signal-fidelity quality metrics, such as MAE (mean absolute error), MSE (mean square error), PSNR (peak signal-to-noise ratio) and their derivatives, which correlate poorly with the quality perceived by humans [2]. While these metrics have many positive features, e.g., clear physical meaning and high computational efficiency, they adversely affect compression efficiency because they fail to exclude visual redundancies that are irrelevant to human visual perception.
A number of perceptual quality metrics have been introduced in recent years to obtain measures more consistent with human visual perception. One class of these algorithms, including SSIM [3], FSIM [4] and RFSIM [5], applies handcrafted features designed to capture relevant factors affecting image quality. Although they are widely accepted and applied, the accuracy with which they reproduce human perception of image quality still needs to be improved.
Another set of methods adopts convolutional neural network (CNN) based approaches [6], [7]. In this approach, features are extracted directly from the original image's pixels and are automatically learnt and embedded within the network. Several image quality databases have been introduced in the literature, including LIVE Image [8], TID2008 [9], TID2013 [10], CSIQ [11], IVC [12] and MICT [13]. Generally, these methods estimate the quality of image patches and propose most-apparent-artifact IQA metrics that perform well only on some typical image/video databases. Their architecture is more suitable for evaluating the quality of block-sized images in a rate-distortion optimization manner, which shares a common block-based hybrid coding framework. However, because these IQA metrics do not incorporate video compression features, they fail to assess the quality of block-sized video frame patches. Jin [14] stated that there is no comprehensive method that matches human perception exactly and can be applied equally well in different areas. Therefore, a subjective rating database of image/video patches with compression artifacts is necessary to constitute the ground truth needed for training, testing and benchmarking in video coding.
Based on these observations, we propose in this paper a large-scale Image-Patch Quality Assessment database with video compression distortion. We analyse state-of-the-art IQA methods on the proposed database and show that their accuracy in predicting image quality can be improved. Finally, we propose a full-reference (FR) Image-Patch model based on a CNN architecture to determine the quality of image patches.
In summary, this paper includes the following contributions:

1) A COMPRESSED IMAGE-PATCH DATABASE
To the best of our knowledge, this database is the first to be constructed as a benchmark for assessing the quality of compressed image and video frame patches, and it is beneficial for perceptually driven image and video compression. Existing databases with coarse-grained quality levels are insufficient for evaluating IQA algorithms, especially patch-based methods, on images with fine-grained quality differences. One of the problems in perceptual image and video compression is selecting the optimal coding mode for each coding block according to its rate-distortion. The proposed database can therefore help researchers in the field of image compression select the most appropriate IQA method for perceptually based image optimization.

2) A DEEP IMAGE-PATCH NEURAL NETWORK DESIGN
We also investigate different FR methods to model the relationship between an image patch and its quality score. After multiple experiments, Deep Image-Patch Quality Assessment is proposed to address the problem of end-to-end optimization. We adapt the concept of Siamese networks used in [15], [16] to extract features from the reference and distorted patches with a deep convolutional neural network. This paper is organized as follows: Section II reviews related studies on IQA databases and deep learning IQA methods. Section III describes the created database of image/video patches with compression artifacts and the proposed method. Extensive experimental results are presented in Section IV. Finally, conclusions are given in Section V.

II. RELATED WORKS
A. IQA DATABASES
A brief overview of the test material and experimental details of existing databases is presented in this part. As can be seen from Table 1, experimental data are generated from original images. Various distortions are added to these images at different levels to form testing images whose quality is assessed via subjective rating by a certain number of observers (usually from 15 to 30). The testing methods frequently used are ''Double stimulus categorical rating'' and ''Single stimulus categorical rating'' (e.g., Absolute Category Rating (ACR)). The assessment scores used are the differential mean opinion score (DMOS) or the mean opinion score (MOS), in combination with the standard deviation. Among these databases, LIVE [8], CSIQ [11], TID2008 [9] and TID2013 [10] are widely utilized for benchmarking, testing and training of IQA. Release 2 of LIVE [8], [17] consists of 779 testing images generated from 29 original images to which five different distortions are added: JPEG, JPEG2000, white noise, Gaussian blur, and a simulated Rayleigh fading channel (JPEG2000 bitstream), each with 7-8 levels. In the Categorical Subjective Image Quality (CSIQ) database [11], each of 30 original images is processed with six types of distortion: JPEG compression, JPEG-2000 compression, global contrast decrements, additive pink Gaussian noise, additive white Gaussian noise, and Gaussian blurring; for each distortion type, four to five levels are used, producing 866 distorted images. In the Tampere Image Database (TID2008), 1700 distorted images are produced from 25 reference images using 17 distortion types at 4 levels of degradation each. TID2013 [10], [18], extended from TID2008, comprises 3000 distorted images obtained using 24 distortions instead of 17; it is currently the public image quality database with the largest number of both testing images and subjects. Liu et al. [19] recently introduced the PDAP-HDDS image quality database with 2,000 high-definition test images, comprising a total of 12,000 MOS values calculated from 360,000 opinion scores given by 38 observers. Because most existing IQA databases offer only a limited number of distortion levels (4-8) covering the whole quality range from ''Bad'' to ''Excellent'', images at adjacent distortion levels are obviously different and easy to rank. Moreover, these databases contain some distortion types that do not occur in modern image or video compression.
All existing databases evaluate subjective quality for the full image, whereas the quality in different regions can differ considerably. When an image is compressed or corrupted by noise, each region produces its own characteristic artifacts, to which the HVS has different sensitivity. Moreover, there is a minimum visibility threshold below which no change can be perceived [26]. Figure 1 shows that distortions in the houses and sky regions (1) and in the edge region (5) are easily observable, whereas those in the flat region (2) and the textural regions (3, 4) are less noticeable. In addition, the HVS can effortlessly identify salient objects even in a complex scene by exploiting its inherent visual attention mechanism. As stated in [27], [28], many physiological experiments have proved that only the significant portion of the scene projected onto the retina is thoroughly processed by the human brain for semantic understanding. Therefore, it is inappropriate to apply a subjective rating of the whole image to all areas within it.

B. DEEP LEARNING IQA METHODS
In deep learning IQA methods, a number of patches are usually selected from images and fed into a CNN model, depending on the distortion type and level. The CNN model then extracts features from each selected patch and evaluates its quality score, and all patch scores are weighted to obtain the quality of the image. Bosse et al. [6] randomly selected patches and computed the image quality in two ways: simply averaging the patch scores, or using learned weights. Li and Yue [29] used visual saliency to calculate the weight of each patch in an image and selected the top-weighted patches. Kim and Lee [30] pre-trained a CNN model using numerous patches with proxy quality scores provided by a full-reference IQA model. Recently, Zhang et al. [31] proposed a deep bilinear model for blind image quality assessment (BIQA) that handles both synthetic and authentic distortions. Last but not least, the authors of [32] proposed a multi-task CNN that predicts the distortion type and the image quality from the last fully connected layer of the network. In summary, the afore-mentioned methods partially address the shortage of patch-based training data, but they are difficult to extend while preserving the subjectivity of the original databases. To overcome this weakness, Wu et al. [33] deployed a new large-scale training dataset (80,000 images labeled using an advanced FR-IQA metric) to develop a novel no-reference (NR) model for assessing the perceptual quality of screen content pictures.
Generally, image quality assessment (IQA) methods are divided into three types: full-reference (FR), no-reference (NR) and reduced-reference (RR). FR approaches have full access to the reference images, whereas NR approaches only use the distorted images. The majority of the aforementioned deep learning based methods are of the NR type, with the exception of the method proposed in [6], which is an FR IQA method. NR approaches perform well when cross-tested within their training database, but show poor results when tested on others [6], [31].
SSIM [3] can be considered the most popular perceptual approach in FR IQA. It is determined by pooling luminance similarity, contrast similarity and structural similarity. SSIM has not only been developed into the MS-SSIM metric [34], but has also inspired other FR IQA metrics such as FSIM [4], SRSIM [35] and RFSIM [5], which obtain promising results. However, testing conducted in [6] indicates that the WaDIQaM-FR method outperforms the above-mentioned methods. Other FR IQA approaches, such as [36], use an adaptive representation of local patch structure and yield rewarding results, but they can only be applied to certain distortion types. Therefore, this study compares performance only against the FR IQA approaches in [6].
In a recent work [37], we proposed a quality assessment database for image patches, with the aim of creating a new perception-based metric applicable to each region. We then presented a coding distortion modelling method for local image perception, able to predict the objective evaluation from the perceptual point of view of local image content. Experimental results showed that compressed image quality decreases depending on the visual features of the image. However, that testing image database contains only 600 samples, which is not enough to cover all features of human visual perception.

III. PROPOSED IMAGE PATCH QUALITY ASSESSMENT
Given the necessity of an efficient IQA method for image/video patches with compression artifacts, we present in this section a novel end-to-end IQA method. After discussing the characteristics of image/video patches with compression artifacts, we introduce the architecture of the proposed method. Finally, we present the feature engineering and training optimization.

A. IMAGE/VIDEO PATCHES WITH COMPRESSION ARTIFACTS
All available image quality benchmark databases are only suitable for evaluating the quality of images as a whole; they cannot reveal which part of a testing image contributes to the testing results, nor the score of a particular image patch. In this work, we set up an experimental database to evaluate the quality that humans perceive for each image patch. The testing data are preprocessed to remove noise and outliers.

1) GROUND TRUTH SCORE ACHIEVEMENT
The goal of our study is to create a testing image database for local image perception. Because the research is oriented towards video coding, testing images are extracted from video test sequences, and distortions are introduced by compressing the original videos with H.265/HEVC before extraction. There are 40 original source videos in high-definition (1280 × 720) and full high-definition (1920 × 1080), compressed with quantization parameters (QPs) ranging from 2 to 50. For each video sequence, a number of original frames, depending on the length of the video, are selected as reference images, as summarized in Table 2. The reference frames are selected evenly throughout the video sequence to diversify the content. For each reference image, 200 pairs of 128 × 128 patches are cropped at random positions and with random QPs from the compressed frame that has the same index as the reference image. We also crop the central 64 × 64 patch from each original 128 × 128 pair for use in the experiments. Finally, we obtain 246,400 images: 61,600 pairs of 64 × 64 patches and 61,600 pairs of 128 × 128 patches. All patches are annotated with their position.
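To make the cropping scheme concrete, the following Python sketch samples co-located patch pairs from a reference frame and its HEVC-compressed counterpart. The frames are assumed to be NumPy arrays, and `sample_patch_pairs` is a hypothetical helper rather than the tool actually used to build the database.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patch_pairs(ref_frame, dist_frame, n_pairs=200, size=128, center=64):
    """Sample co-located patch pairs from a reference frame and the
    corresponding compressed frame (hypothetical helper).  Returns
    128x128 pairs and their 64x64 center crops, each annotated with
    its top-left position."""
    h, w = ref_frame.shape[:2]
    pairs_128, pairs_64 = [], []
    for _ in range(n_pairs):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        r = ref_frame[y:y + size, x:x + size]
        d = dist_frame[y:y + size, x:x + size]
        pairs_128.append(((x, y), r, d))
        off = (size - center) // 2                    # 32-pixel margin
        pairs_64.append(((x + off, y + off),
                         r[off:off + center, off:off + center],
                         d[off:off + center, off:off + center]))
    return pairs_128, pairs_64
```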

2) TESTING METHODOLOGY
Observers may be experts or non-experts, depending on the specific requirements of the test. Studies have found that systematic differences may occur among different laboratories conducting similar tests [38]. One reason is that expert observers have different views compared with non-experts; other factors may include gender, age and occupation. However, the majority of consumers in reality are non-experts; hence, for the purpose of this study, non-expert observers were recruited. Before the final selection round, all candidates were checked to ensure that they possess normal visual acuity, including normal color vision. In [38], it is recommended that observers should not be directly involved in image quality evaluation and should not be experienced assessors. In this study, more than 2,000 people (2,159 undergraduates, 20 graduate students, 3 researchers and 7 lecturers) participated.
For subjective testing methodology, the International Telecommunication Union defined the ITU-R BT.500-11 standard, which includes several popular subjective testing methodologies such as ''Single stimulus categorical rating'', ''Double stimulus categorical rating'', ''Ordering by forced-choice pairwise comparison'' and ''Pairwise similarity judgments''. Double stimulus categorical rating is chosen in this test. In this method, both the testing and the reference images, appearing in random order, are viewed by observers for a fixed period of time. The observers are then asked to rate the quality of the testing image on a five-category scale: ''excellent'', ''good'', ''fair'', ''poor'' or ''bad''. Prior to each session, the observers were instructed about the assessment modes, the assessment scale and the procedure (reference image, grey, test image, voting period).
Since the image quality assessment methods stated in [38] are only suitable for assessing the quality of an image as a whole, they cannot be directly applied to our testing experiments. Therefore, we modify the image presentation in the standard so that users focus on and assess only the local image patch instead of the whole image. Figure 3 shows the testing software created for the experiment. Each pair is assessed following the procedure in Fig. 2: the subject observes the original image for a time T1 of at least 5 s, then clicks on the image patch under observation to view the compressed patch for a time T2. After watching each image at least twice, observers are requested to score it on the five-level scale.

3) DATA PREPROCESSING
In our experiment, 2,189 different subjects rate 61,600 image patch pairs, of which each pair p is observed by S_p (up to 20) subjects. The differential mean opinion score (DMOS) of each patch pair is calculated by
$$\mathrm{DMOS}_p = \frac{1}{S_p} \sum_{s=1}^{S_p} y_{p,s},$$
where y_{p,s} is the differential opinion score given by subject s for patch pair p. Let Y_o denote the raw data, in which each image patch pair (R_p, D_p) is evaluated by at least 15 observers:
$$Y_o = \big\{(R_p, D_p, y_{p,1}, \ldots, y_{p,S_p})\big\}, \quad S_p \ge 15.$$
The raw score database is not entirely reliable because some observers evaluate carelessly. To remove outliers from these data, we use the z-score. The z-score of a subjective rating for patch pair p is calculated as
$$z_{p,s} = \frac{y_{p,s} - \mu_p}{\sigma_p},$$
where μ_p and σ_p denote the mean and the standard deviation of the ratings of pair p. Fig. 4 shows that the distribution of z-scores closely follows the standard normal distribution. According to the empirical rule, 95%, 98.7% and 99.7% of the values lie within 2σ, 2.5σ and 3σ, respectively. Applying this rule yields the results in Table 4. We select 2σ to minimize the error, which removes 422 image patch pair scores from the database. Fig. 6 shows the standard deviation of subjective ratings before and after outlier rejection; most samples with deviations greater than 0.5 have been removed. After filtering, each pair of patches is assigned the average of its cleaned scores from at least 15 observers (on the five-level scale), giving the filtered data
$$Y_f = \big\{(R_p, D_p, \mathrm{DMOS}_p)\big\}.$$
N = |Y_f| = 40,286 cleaned pairs are kept to build the two final HMII (Human Machine Interaction Image) databases (Table 3). Each database comprises 40,286 quality-annotated images based on 40,286 source reference image patches subject to different compression distortion levels, as listed in the table. Fig. 5 shows an example image patch pair at the two different sizes. The DMOS for this dataset is computed for each pair and ranges from 10 to 50.
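The outlier-rejection step can be summarized by the following sketch, which assumes the raw ratings of one patch pair are available as a plain list; the 15-rating threshold and 2σ cut-off follow the text above.

```python
import numpy as np

def reject_outliers(scores, threshold=2.0):
    """Remove ratings whose z-score exceeds the threshold.
    With threshold=2.0, roughly 95% of ratings are kept under
    the empirical rule for a normal distribution."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    if sigma == 0:                       # all observers agreed
        return scores
    z = (scores - mu) / sigma
    return scores[np.abs(z) <= threshold]

# A pair is kept only if at least 15 cleaned ratings remain;
# its DMOS is then the mean of the cleaned ratings.
raw = [4, 4, 5, 3, 4, 4, 1, 4, 5, 4, 4, 3, 4, 4, 5, 4]   # toy ratings
clean = reject_outliers(raw)
dmos = clean.mean() if len(clean) >= 15 else None
```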

B. BENCHMARK ANALYSES
1) HMII DATABASE BENCHMARK ANALYSES
We implement seven state-of-the-art algorithms (PSNR, UQI, VSI, SSIM, RFSIM, FSIM and SRSIM) and two recent methods (DIQaM-FR and WaDIQaM-FR) [6] to predict objective scores for the entire HMII database. Table 5 reports the consistency with subjective preference, evaluated using the classic correlation coefficients SRCC and PLCC. The SRCC and PLCC values are averaged over the testing image patches sharing the same reference patch, and the top two correlation coefficient values are in bold. PSNR and UQI correlate weakly with human-perceived quality, and are sometimes even contrary to the subjective results. This defective performance of PSNR is also mentioned in the work of Zhang et al. [39] on fine-grained quality assessment. Although VSI incorporates HVS features and achieves more consistent results than PSNR in global image assessment, it correlates poorly with perceptual quality in fine-grained patch quality assessment. In general, FSIM achieves top-two performance in all cases, SSIM performs better in terms of PLCC, and SRSIM performs better in terms of SRCC. The above-mentioned IQA methods show quite similar behaviour for the two correlation coefficients, while the two machine-learning-based methods fail on the proposed database. The reason is that the structure of a machine learning model is tuned to its own training database.

2) SIMPLE IMAGE-PATCH MODELS
In this experiment, different models are compared to find the best 'ground truth' predictor for patch quality. We use the following models with our database:
• IPM: Zhang [39] assumes that the curve model predicting image-patch quality is a cubic polynomial:
$$q_p = a_1\,\phi(R_p)^3 + a_2\,\phi(R_p)^2 + a_3\,\phi(R_p) + a_4,$$
where θ = (a_1, a_2, a_3, a_4) is the parameter vector of the non-linear Image-Patch model and φ(R_p) is the feature computed for reference patch R_p. MSE and SSIM are chosen as features in [39]; in our work, we try the top three FR-IQA methods: SSIM, FSIM and SRSIM.
• DIQaM: Bosse [6] presents a CNN model for image quality assessment that obtains superior performance on different IQA benchmarks. We utilize the extractor architecture from this paper to train a deep neural network on our database. For IPM, we first use the previous works to extract SSIM, FSIM and SRSIM features, and then fit the above curve model with the least-mean-square method to obtain the coefficients that best fit the database (a fitting sketch follows this list). DIPQA is developed from DIQaM's extractor architecture and shares the same regression part; with DIPQA, we use VGGNet as the feature extractor, trained together with the entire network. Table 6 shows the performance of the simple models on the HMII database. For both correlation coefficients, DIPQA (VGG extractor) achieves performance superior to the others'. The results of this experiment also show that objective models assess image-patch quality more accurately for larger patch sizes.
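As a minimal illustration of the IPM fitting procedure, the sketch below fits the cubic model with ordinary least squares; the feature and DMOS values are toy numbers, not taken from the HMII database.

```python
import numpy as np

# x: FR-IQA feature per patch pair (e.g. SSIM, FSIM or SRSIM values),
# y: subjective DMOS.  Both are illustrative toy arrays.
x = np.array([0.98, 0.95, 0.90, 0.82, 0.70, 0.55])
y = np.array([12.0, 17.0, 24.0, 31.0, 40.0, 47.0])

theta = np.polyfit(x, y, deg=3)        # least-squares fit of (a1..a4)
predict = np.poly1d(theta)
print(predict(0.88))                   # predicted DMOS for a new patch
```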

C. PROPOSED IQA METHOD
1) ARCHITECTURE OF THE PROPOSED METHOD
Known as an architecture designed to learn the similarity relations between two given inputs, the Siamese network has been applied to face verification [38] and signature verification [42] tasks. Its main concept is to run two networks that share the same architecture and weights in parallel. In this work, we employ a Siamese network for feature extraction. Before feeding the feature vector into the regression layers, the two extracted feature vectors are combined in a fusion step. The architecture of the proposed IQA method is sketched in Fig. 7.

For training our IQA method, the data set Y_f is randomly split by reference image. The training set is based on H pairs of reference and distorted images and the testing set on the remaining N − H pairs:
$$Y_{\mathrm{train}} = \{(R_p, D_p, y_p)\}_{p=1}^{H}, \qquad Y_{\mathrm{test}} = \{(R_p, D_p, y_p)\}_{p=H+1}^{N}.$$

Let us now describe the notation for the CNN feature extractor, which consists of a stack of convolutional layers, pooling layers and fully connected layers. Let l denote the l-th layer, where L is the number of layers, and let H × W (pixels) be the height and width of the input image patch x. Let w^l_{m,n} denote the weight matrix between the neurons of layer l and the neurons of layer l − 1. The convolved data stream at layer l, plus the bias unit b^l, is defined as
$$x^l_m = f\Big(\sum_n w^l_{m,n} * x^{l-1}_n + b^l_m\Big),$$
where f(·) denotes the activation function. The activation applied to the convolved input at layer l is given by
$$f(x^l_{i,j}) = \max(0, x^l_{i,j}) + a_i \min(0, x^l_{i,j}),$$
where a_i is the coefficient controlling the slope of the negative part. All layers of the feature extractor are activated by a rectified linear unit (ReLU) [43], i.e., a_i = 0, so the activation function reduces to f(x^l_{i,j}) = max(0, x^l_{i,j}).

Let y_p denote the ground-truth score. The feature extractor produces the reference-patch feature vector F_r = [F_{r,1}, . . . , F_{r,P}] and the distorted-patch feature vector F_d = [F_{d,1}, . . . , F_{d,P}]. Denoting the weights of the last convolution filter in the feature extractor by w^L, we define the feature vector functions as
$$F_r = f(w^L * R^L_p) \quad \text{and} \quad F_d = f(w^L * D^L_p),$$
where R^L_p and D^L_p represent the feature maps at layer L of the input patches R_p and D_p, respectively.
The feature extraction layers produce F_r and F_d, the feature vectors of the reference and distorted patch, respectively. The regression layers require the network to combine these two vectors in a feature fusion step. The simplest strategy is to concatenate F_r and F_d into a single vector (F_r, F_d). In addition, F_r − F_d is a meaningful representation of distance in feature space, so concatenating F_r − F_d is expected to help the network learn relations between the reference and distorted patches. The final output of this stage is
$$v = \mathrm{concat}(F_r,\; F_d,\; F_r - F_d).$$
Finally, the fused features are regressed by a sequence of two fully connected layers, FC-512 and FC-1. The inference of the fully connected layers can be represented by
$$q_p = w^{FC} * v + b^{FC},$$
where q_p is the output of the IQA method and * is the convolutional operation.
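A minimal PyTorch sketch of this Siamese extraction, fusion and regression pipeline is given below. The small convolutional extractor is a stand-in for the deeper extractors evaluated later (VGG-style, ResNeXt-50, Xception, Inception); only the shared-weight structure, the (F_r, F_d, F_r − F_d) fusion, and the FC-512/FC-1 regression follow the description above.

```python
import torch
import torch.nn as nn

class DeepImagePatchQA(nn.Module):
    """Sketch of the Siamese FR architecture described above, with a
    toy stand-in extractor (the paper evaluates VGG-style, ResNeXt-50,
    Xception and Inception extractors instead)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.extractor = nn.Sequential(            # shared weights
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.regressor = nn.Sequential(            # FC-512 -> FC-1
            nn.Linear(3 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, ref, dist):
        f_r = self.extractor(ref)                  # reference features
        f_d = self.extractor(dist)                 # distorted features
        fused = torch.cat([f_r, f_d, f_r - f_d], dim=1)  # fusion step
        return self.regressor(fused)               # predicted quality q_p

model = DeepImagePatchQA()
q = model(torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128))
```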

2) EVALUATION CRITERIA
The performance of an IQA estimation algorithm is measured through the deviation between the estimated and actual values. A common way to test the efficiency of IQA estimation algorithms is the Mean Absolute Error (MAE): the smaller the MAE, the smaller the prediction error. Let y_p denote the subjective testing score and q_p the predicted score of pair p. The MAE is the average of the absolute differences between the two IQA variables, as in equation (15):
$$\mathrm{MAE} = \frac{1}{N-H} \sum_{p=H+1}^{N} |y_p - q_p|, \tag{15}$$
where N − H is the number of testing patch pairs.

3) TRAINING OPTIMIZATION
Our network is trained end-to-end by backpropagation over a number of epochs. The adaptive moment estimation optimizer (ADAM) is used instead of the classical stochastic gradient descent procedure to iteratively update the network weights on the training data. The optimization problem is to minimize the cost function J(·), as in equation (17):
$$w^{*} = \arg\min_{w} J(w), \tag{17}$$
where w is the weight vector. ADAM uses exponentially decaying averages of past gradients, m_k (first moment), and of past squared gradients, v_k (second moment), as given in equations (18) and (19), respectively:
$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\,\nabla J(w_k), \tag{18}$$
$$v_k = \beta_2 v_{k-1} + (1-\beta_2)\,\big(\nabla J(w_k)\big)^2. \tag{19}$$
With the bias-corrected moments $\hat{m}_k = m_k/(1-\beta_1^k)$ and $\hat{v}_k = v_k/(1-\beta_2^k)$, the ADAM weight update in equation (20) is
$$w_{k+1} = w_k - \frac{\eta\,\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}. \tag{20}$$
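For reference, a minimal NumPy sketch of one ADAM step following Eqs. (18)-(20) is shown below; the hyperparameter defaults (η = 1e-4, β₁ = 0.9, β₂ = 0.999) are common choices and are assumptions, as the paper's exact settings are not stated in this section.

```python
import numpy as np

def adam_step(w, grad, m, v, k, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: exponentially decaying first moment m_k,
    second moment v_k, bias correction, then the weight update."""
    m = b1 * m + (1 - b1) * grad                   # Eq. (18)
    v = b2 * v + (1 - b2) * grad ** 2              # Eq. (19)
    m_hat = m / (1 - b1 ** k)                      # bias-corrected moments
    v_hat = v / (1 - b2 ** k)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)    # Eq. (20)
    return w, m, v
```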

4) FEATURE ENGINEERING
With the general architecture in Fig. 7, we select one of five different feature extractors, described below.
1) VGGnet: Given its successful adaptation to various computer vision tasks [45], [46], and especially to image quality assessment [6], VGGnet [47] is chosen as a base network for feature extraction. The input of the VGG network is 224 × 224 pixels. To adapt the network to 64 × 64 and 128 × 128 pixel inputs, we experimented with several changes to the VGG architecture: extending the network by 3 layers [6], cutting the last 3 layers or the last 6 layers, or even replacing VGG with ResNet. Finally, cutting the last 3 layers of VGGnet achieves the best performance compared with the other approaches (Fig. 8).
Our VGGnet-inspired DCNN comprises 12 weight layers: a feature extraction module and a regression module. The features are extracted by a series of conv3-64, conv3-64, maxpool, conv3-128, conv3-128, maxpool, conv3-256, conv3-256, maxpool, conv3-512, conv3-512 and maxpool layers (a sketch of this stack follows this list). All convolutional layers apply 3 × 3 convolution kernels, and the network has about 17.3 million trainable parameters.
2) ResNeXt-50: ResNeXt, also known as the Aggregated Residual Transform Network, is an improvement over the Inception network. Xie et al. [48] exploited the split-transform-merge topology in a powerful but simple way by introducing the new term ''cardinality''. The model consists of one 7 × 7 convolutional layer, one maxpooling layer, and four convolutional blocks alternating with four groups of identity blocks. We modify this network for feature extraction by removing the last average pooling layer and adding a single convolutional layer before feature fusion, as shown in Fig. 9.
3) Xception: Chollet [49] revised the idea of Inception modules and proposed depthwise separable convolutions, maximizing the number of towers in a module. We basically use the original network; however, similarly to ResNeXt-50, the last part is changed as shown in Fig. 10.
4) Inception-v4: Inception-v4 [50] is a well-known architecture developed on the GoogLeNet platform; its input is a 299 × 299 pixel image patch and its output depends on the number of target classes. In the pre-trained model used in this research, we keep the average pooling layer and add a single convolutional layer before feature fusion (Fig. 11).
5) Inception-ResNet: Inception-ResNet-v2 combines two recent ideas: Inception modules and residual connections [51]. The Inception models are famous for their multi-branch architectures, with a set of filters (1 × 1, 3 × 3, 5 × 5, etc.) whose outputs are merged by concatenation in each branch. The split-transform-merge architecture of the Inception module provides powerful representational ability in its dense layers. The hybrid Inception-ResNet-v2 network is shown in Fig. 12.
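The truncated VGG-style extractor in item 1) can be written down directly. The PyTorch sketch below builds exactly the conv3-64 … conv3-512 stack listed above (our reading of the layer list; the regression module and training details are omitted).

```python
import torch.nn as nn

def make_vgg_extractor():
    """Truncated VGG-style feature extraction stack, following the
    layer list above: 'M' marks a 2x2 max-pooling layer, integers are
    output channel counts of 3x3 convolutions."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M']
    layers, in_ch = [], 3
    for c in cfg:
        if c == 'M':
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(in_ch, c, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = c
    return nn.Sequential(*layers)
```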

IV. EXPERIMENTAL RESULTS
In this section, we introduce the experimental setup, including the datasets, the evaluation metrics and the implementation details of the IQA model. The best feature extractor for the proposed model (HMI-IQA) is chosen from five CNN architectures. In addition, we test the proposed model on four common datasets: LIVE [8], TID2008 [9], TID2013 [10] and CSIQ [11].

A. EXPERIMENTAL SETUP
1) DATASETS
Two types of datasets are used in the experiments. The first, the training datasets, is used to optimize the proposed IQA deep neural network model; the other, the cross-evaluation datasets, is used for independent evaluation of the proposed model:
• Training Datasets: We use the HMII_64 and HMII_128 datasets presented in Table 3 and Fig. 5 to optimize our IQA deep neural network model. Reported results are the average performance over ten-fold cross-validation, in which a HMII database is randomly split into training, validation and testing sets at an 8:1:1 ratio (split by reference image; a split sketch follows this list). The deep learning models converge after 50 epochs for each dataset.
• Cross-evaluation Datasets: To evaluate the generalization ability of the proposed IQA model after training, we use verified public IQA databases. We choose the four comprehensive databases mentioned above as benchmarks: LIVE [8], CSIQ [11], TID2008 [9] and TID2013 [10]. All perceptual quality scores of these datasets are normalized to the range [10, 50].
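The 8:1:1 split is performed by reference image so that patches from the same source frame never appear in more than one of the training, validation and testing sets. A minimal sketch, assuming the pairs are grouped in a dictionary keyed by reference-image id (a hypothetical layout; the released database may be organized differently):

```python
import random

def split_by_reference(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split patch pairs into train/val/test by reference image, so
    patches from one source frame never leak across splits.
    `pairs` maps a reference-image id to its list of patch pairs."""
    refs = sorted(pairs)
    random.Random(seed).shuffle(refs)
    n = len(refs)
    a, b = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    pick = lambda ids: [p for r in ids for p in pairs[r]]
    return pick(refs[:a]), pick(refs[a:b]), pick(refs[b:])
```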

2) EVALUATION METRICS
To evaluate the performance of the IQA algorithms, two measures are used: Spearman's rank-order correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC). PLCC measures the linear dependence between two quantities, while SRCC measures how well one quantity can be described as a monotonic function of the other.
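Both coefficients are available in SciPy, so computing them for a set of predictions takes one line each; the score arrays below are illustrative toy values.

```python
from scipy.stats import pearsonr, spearmanr

# y_true: subjective DMOS, y_pred: model predictions (toy values)
y_true = [14.2, 22.5, 31.0, 38.4, 45.1]
y_pred = [15.0, 20.9, 33.2, 37.8, 46.0]

plcc, _ = pearsonr(y_true, y_pred)    # linear correlation
srcc, _ = spearmanr(y_true, y_pred)   # rank-order (monotonic) correlation
print(f"PLCC={plcc:.3f}, SRCC={srcc:.3f}")
```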

1) PERFORMANCE OF PROPOSED METHOD
In this experiment, the VGG-16 extractor from the first experiment is replaced in turn by the four other feature extractors. Only the HMII dataset with 128 × 128 × 3 patches is used to learn the models, with the same number of epochs and the same criteria as in the first experiment. The default input size is 224 × 224 for ResNeXt and VGGnet and 299 × 299 for Inception-v4, Inception-ResNet-v2 and Xception, which differ from the patch sizes in our database; the architecture of the feature extractors is therefore adjusted to fit our inputs while still producing feature vectors. Features are extracted from the distorted patch and the reference patch by a CNN and fused as the difference, the concatenation, or the concatenation supplemented with the difference vector. Table 7 shows the results of the two experiments comparing the five models with different feature extractors. The model using ResNeXt-50, named HMI-IQA, has the best performance and is shown in bold.

2) CROSS-DATABASE EVALUATION
We train HMI-IQA models on the HMII database and test them on the four above-mentioned evaluation datasets. Each reference image is divided into 64 × 64 patches, which are enlarged to 128 × 128 and used to predict the DMOS score of the co-located distorted image patch. The DMOS value of the distorted image is then calculated in two ways: as the average (HMI-IQA Aver) or as the saliency-weighted combination (HMI-IQA Sal) of the local qualities. In the first way, q is estimated by averaging the local visual qualities q_i:
$$q = \frac{1}{N_p} \sum_{i=1}^{N_p} q_i,$$
where N_p denotes the number of patches in the image. The second way combines visual saliency models with IQA by weighting the local quality q_j of region j with a corresponding local weight w_j. To determine w_j, the framework in [52] is used to detect regional saliency, and the weight is measured by counting the pixels in that region. The overall image quality q is then
$$q = \frac{\sum_j w_j q_j}{\sum_j w_j}.$$
Fig. 14 shows an image from TID2008 subjected to JPEG compression. It can be seen in Fig. 13 and Fig. 14 that the predicted qualities of the patches are spatially correlated; as a result, interpolation can be used to compute the quality of smaller patches.
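Both pooling rules reduce to a few lines of NumPy. In the sketch below, passing `weights=None` gives the plain average of HMI-IQA Aver, while passing per-region saliency pixel counts gives the weighted pooling of HMI-IQA Sal.

```python
import numpy as np

def pool_image_quality(patch_scores, weights=None):
    """Pool local patch qualities into one image score.
    weights=None reproduces the plain average (HMI-IQA Aver);
    saliency-derived pixel counts per region reproduce the
    weighted pooling (HMI-IQA Sal)."""
    q = np.asarray(patch_scores, dtype=float)
    if weights is None:
        return q.mean()
    w = np.asarray(weights, dtype=float)
    return (w * q).sum() / w.sum()
```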
The proposed model is tested on specific distortion types, with results on TID2013, TID2008, LIVE and CSIQ shown in Tables 8, 9, 10 and 11, respectively. These tables show that HMI-IQA Sal performs better than HMI-IQA Aver and is among the top-performing models in 28 out of 52 cases. In particular, the HMI-IQA model excels on JPEG and JPEG2000 compression distortions thanks to the effectiveness of pre-training on the HMII database. We observed that the HMI-IQA method also performs well on unseen distortion types, including lossy compression of noisy images, sparse sampling and reconstruction, spatially correlated noise, and comfort noise. On the other hand, the proposed model fails on three distortion types in TID2013, namely compression transmission errors, local block-wise distortions and contrast change, whose characteristics are difficult to model. Overall, the performance of the HMI-IQA model is only equivalent to WaDIQaM and only slightly better than the other methods. The reason is that this study focuses on patch quality without considering each patch's contribution to the whole image quality; the HMI-IQA Sal method uses a simple saliency model only to test the effectiveness of the patch-based approach. To tackle this issue, we will use whole-image quality databases to develop a weighted estimation method rather than simply replacing existing visual saliency models.
The proposed HMI-IQA model is also evaluated on subsets of CSIQ, LIVE, TID2008 and TID2013 containing only the four distortion types shared among the four databases (JPEG, JP2K, Gaussian blur and white noise). Table 12 shows that the performance of the proposed model is relatively stable across all subsets. Fig. 15 shows the scatter distributions of subjective DMOS versus the scores predicted by HMI-IQA Sal on the four subset databases. The proposed model predicts several other common distortion types quite well, since various distortions are generated in the training database during video compression. However, the prediction is not sufficiently good at high noise levels: as can be seen in Fig. 15, the lower the image quality, the less accurate the prediction. In further studies, we will collect more experimental data for other distortion types to enhance the accuracy.

V. CONCLUSION
In this paper, we present an experimental image quality assessment solution for image/video with compression artifacts. First, a subjective quality rating database for image patch quality assessment of image/video with compression artifacts is introduced. Given the lack of 'ground truth' quality scores for patches, we expect the proposed image patch database to be useful for further investigation. Second, we introduce an efficient deep neural network based image patch quality assessment solution with several feature engineering options. Experimental results on a rich set of image/video databases show that the proposed IQA method is particularly suitable for image/video with compression artifacts, not only under HEVC video compression but also with the JPEG and JPEG-2000 image compression standards. In future work, we will exploit the proposed IQA model to improve image/video compression efficiency by integrating it directly into those standards.

He is currently working with the Medical Imaging Department, Vingroup Big Data Institute (VinBDI), Hanoi, Vietnam, aiming to conduct research on important areas of big data. His research interests include computer vision, computer graphics, artificial intelligence, and human-computer interaction.
DUONG TRIEU DINH received the B.S. and M.S. degrees in electrical engineering from the College of Technology, Vietnam National University, Hanoi, and the Ph.D. degree from Korea University, South Korea, in 2010.
He is currently a Lecturer with the Faculty of Electronics and Telecommunications, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam. His research interests include telecommunications, video coding, and communication.
LE THANH HA received the B.S. and M.S. degrees in information technology from the College of Technology, Vietnam National University, Hanoi, and the Ph.D. degree, in 2010.
In 2005, he received a Korean Government Scholarship for the Ph.D. program at the Department of Electronics Engineering, Korea University. He is currently an Assistant Professor with the Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi. His research interests include computer vision, computer graphics, artificial intelligence, and human-computer interaction. He has been the principal or a main investigator of many fundamental research and technology development projects funded by both domestic and international organizations. He also contributes to many domestic and international ICT academic conferences, including KSE, NICS, ATC, SoICT, and ICEIC. In addition, he is a member of the Institute of Electrical and Electronics Engineers (IEEE), the Institute of Electronics, Information and Communication Engineers (IEICE), and the Vietnamese Association for Pattern Recognition (VAPR).