Blind Stereoscopic Image Quality Assessment Accounting for Human Monocular Visual Properties and Binocular Interactions

The human visual perceptual model is a key factor in evaluating stereoscopic image quality. This paper focuses on the contributions of monocular and binocular properties to quality perception and proposes a novel blind stereoscopic image quality assessment model by comprehensively exploring the relationship between visual features and quality perception. Statistical quality-aware monocular features are extracted from both the left view and the right view to reveal monocular quality perception, including color statistical features, which are missing from most previous models, while multiple features of the summation signal and entropy features of the difference signal are extracted to quantify binocular quality perception. Finally, support vector regression (SVR) is utilized to train a regression model based on the extracted features and the subjective scores. Three public databases, LIVE 3D Phase I, LIVE 3D Phase II, and the MCL 3D Database, are adopted to verify the effectiveness of the proposed model. Experimental results demonstrate that the proposed model is superior to other existing state-of-the-art quality metrics.


I. INTRODUCTION
Over recent years, stereoscopic content, including stereoscopic film, virtual reality, and 3D-TV, has grown explosively, which drives the demand for and development of stereoscopic image quality assessment (SIQA). Consequently, the quality assessment of stereoscopic content has attracted much attention and become an important research topic. Unlike a two-dimensional (2D) image, a three-dimensional (3D) image carries depth information, which makes stereoscopic image quality assessment more difficult. Many types of distortion, such as JPEG compression (JPEG), JPEG2000 compression (JP2K), additive white noise (WN), fast fading (FF), and Gaussian blur (Blur), may be introduced into 3D images and affect not only perceived visual quality but also depth perception.
The associate editor coordinating the review of this manuscript and approving it for publication was You Yang.
In general, SIQA methods can be divided into three categories based on the availability of the reference stereoscopic image: full-reference (FR), reduced-reference (RR), and no-reference (NR) methods. The FR method, which requires all information of the reference image, usually achieves the best results among the three. At an early stage, FR SIQA models were built on 2D image quality assessment (IQA) models [1]- [3]. The overall performance of these 2D-extended models on 3D images is not as good as that on 2D images. Since the human observer is the final judge, the characteristics of the human visual system (HVS) should be considered. Many studies of how human eyes perceive 3D images and obtain depth perception have been carried out in the SIQA area. These works showed that there exists a visual channel that processes the information received from the two eyes, and that a single combined image (called the ''cyclopean image'') is formed in the human brain to produce depth perception. Chen et al. [4] proposed an FR SIQA method to model the whole procedure and evaluate 3D image quality. Bensalma and Larabi [5], considering this binocular theory, then built an effective quality index based on the complex wavelet transform (CWT). By investigating the human visual cortex and binocular visual properties, Lin and Wu [6], Shao et al. [7], and Shao et al. [8] presented effective models, which further demonstrated the value of human visual characteristics in the SIQA area.
The RR method, which relies on partial information of the reference image, generally ranks second to the FR method and is mainly applied to assess image distortion introduced during transmission. Because the reference image is not always available in practice, the NR method, which uses no information from the reference image, has received increasing attention. The NR method is also referred to as the blind SIQA method. Inspired by the human binocular perception model, Ryu and Sohn [9] proposed a binocular perception metric that measures the blurriness and the blockiness of the image views. Zhou et al. [10] proposed an NR SIQA metric based on two binocular combination models: the Eye-Weighting model and the Contrast Gain-Control model. Considering the importance of depth perception, Chen et al. [11] proposed a cyclopean image combination model that utilizes 3D cues in the disparity map to estimate 3D image quality, which yields better performance than models that ignore depth information. With the development of machine learning technology, researchers have focused on modeling human binocular visual combination behavior to solve the SIQA problem. Zhou et al. [12] built a blind SIQA metric by utilizing the self-similarity of binocular features and applied a support vector regression (SVR) model to derive the overall quality score. The above models are built on the visual single-channel theory, while the double-channel model has been shown to be more plausible and has proved effective in our previous SIQA works [13], [14]. Based on the human double-channel theory, Yang et al. [15] then modeled the human visual summation and difference channels to extract image features and utilized SVR to predict the quality of stereoscopic images. These models perform better than previous works and suggest a way to extract quality-sensitive visual features for building an SIQA method.
Since one stereoscopic image contains two views (left and right views), the monocular view quality and the interactions between these two views inevitably affect the overall quality perception. Various factors should be considered, including monocular image quality, binocular interaction (binocular rivalry, suppression, etc.), and depth information. To cope with this challenge, many works have been proposed. Shao et al. [16] proposed a blind SIQA method using joint sparse representation, in which stereoscopic image quality is predicted based on the weighted monocular quality score. To handle the asymmetric distortion, Shao et al. [17] designed an NR metric by utilizing both monocular and binocular properties for quality assessment. Likewise, Liu et al. [18] combined the monocular and binocular information and proposed a blind SIQA model by using the SVM method to fuse the quality score.
Considering the promising prospects of NR SIQA models in practical applications, we propose a novel blind SIQA model by comprehensively exploring the intrinsic relationship between human visual properties and quality perception. Motivated by the above works, we extract both monocular and binocular visual features, which correlate well with human subjective observation, especially when evaluating asymmetric degradation of 3D images. Finally, SVR is utilized to predict the final quality score. The advantages of this work are as follows:
(1) For the monocular features, three types of quality-aware features are extracted to reveal image quality under different types and degrees of distortion, including color statistical features, which have a significant influence on quality perception but are ignored in most SIQA models.
(2) Multi-scale and multi-orientation visual properties are deployed to address the summation visual features and quantify the naturalness of the stereoscopic images, which can improve the accuracy of the proposed model.
(3) The difference signal implicitly contains the disparity and depth information, and its pixels exhibit strong dependence that affects visual perception. Two types of entropies are adopted to estimate the dependence level between pixels and the depth degradation, which is a simple way to avoid explicit depth estimation and keeps the computational complexity low.
The rest of this paper is organized as follows. Section II presents the motivations of this work. Section III describes the proposed framework. The experimental results and discussion are presented in Section IV, and the conclusions are drawn in Section V.

II. MOTIVATIONS
For the problem of SIQA, the most intuitive approach is to simulate the characteristics of human visual perception and build a model that mimics subjective assessment. Therefore, a number of computational models have been proposed to explain the possible mechanisms of the human visual system [6], [19]- [21]. However, these models mainly focus on the binocular interaction in the human brain, which is not enough to simulate human visual perception in image quality assessment. Seuntiens et al. [22] reported that the overall 3D image quality is affected not only by the binocular information but also by the information of the left view and the right view (especially for asymmetrically distorted 3D images). Lv et al. [23] summarized that 3D image quality assessment should account for two types of distortion related to 3D perception quality: monocular distortion and binocular distortion. The distortion suffered by each individual view of a 3D image is called monocular distortion and causes monocular quality degradation, while the symmetric or asymmetric distortion suffered by both views is called binocular distortion and causes binocular confusion, visual discomfort, and so on. Wang et al. [24] also found, based on subjective experiments, that monocular cues have a strong impact on depth perception. Jiang et al. [25] addressed both monocular and binocular quality issues and built an effective NR quality evaluation method for stereoscopic images using a three-column nonnegativity-constrained sparse auto-encoder framework, which shows that 3D image quality perception is determined not only by the quality of each view but also by the interactions between the views.
Based on the neural processing pathways in the visual cortex, Shao et al. [26] modeled the perceived stereoscopic image quality $\varsigma$ as a posterior probability given the left view $I_L$ and the right view $I_R$. By simulating the human visual cortex, they divided the quality index into two independent parts: the monocular perception index $P_M$ and the binocular perception index $P_B$. The work [26] demonstrated the reasonableness and effectiveness of combining monocular and binocular information for 3D image quality evaluation. Therefore, we follow the above conclusions and build a blind SIQA model based on two indexes: the monocular and binocular perception indexes.

A. MONOCULAR PERCEPTION
Many effective assessment methods for single-view images have been proposed based on the HVS [27]- [29]. With the development of the convolutional neural network (CNN), many NR IQA models have been built by learning image features and a regression model [30]- [32]. However, color information, which plays an important role in visual perception, is ignored in the above models. Zhang et al. [33] extended their FR IQA framework to color information and found that the extended model has better overall performance than the original model. Wang et al. [34], considering the importance of color information in quality perception, studied an NR-IQA model using natural color statistical features. Later, Ziaei Nafchi et al. [35] and Sun et al. [36] treated image color information as an important part of the IQA model and obtained good results. Besides, color-related artifacts also play an important role in visual discomfort estimation for stereoscopic image quality perception and should be considered in stereoscopic image quality assessment [37], [38]. The above works reveal that color information embodies important quality information and can be used effectively to learn quality-sensitive features, so it is adopted to address the monocular quality in our model. Since the left view is similar to the right view, we extract the monocular color features from the left view only, as one set of the monocular features. The natural scene statistics (NSS) features of an image reflect the naturalness of distorted images and have proved effective in the IQA area [39]- [43]. This paper takes the NSS features of each view's luminance information as the other two sets of monocular features, which contain important asymmetric degradation information for quality perception.

B. BINOCULAR PERCEPTION
As the supplement of the monocular perception, the binocular information is considered to study binocular perception quality in our proposed model. Many biological visual models have been proposed to explain human binocular behavior, which can be divided into two categories: single-channel model and double-channel model. The former predicts that human eyes collect the individual eye view separately, and then combine them to generate the unique stereopsis in the human brain [44]. An alternative way, the double-channel model, suggests that there are two separate visual signals in the human brain to obtain the stereopsis: the summation signal and the difference signal [45], [46], shown in Fig.1. Kingdom [47] proved the existence of these two adaptable binocular signals in human binocular perception. May et al. [48] and May and Zhaoping [49] later gave strong support to this theory and proved that human brain contains the summation channel and the difference channel that adapt to prevailing binocular statistics, which provide an optimal way for visual system to transmit the binocular information.
At the physiological level, it has been shown that V1 neurons receive the multiplexed signals from the summation and difference channels, so that they can tune to different disparities [50]. Besides, the summation and difference signals have been applied to evaluate stereoscopic image quality and achieved better performance than previous models [43], [51]. Considering that the binocular difference channel is missing from the single-channel model, we adopt the double-channel model to simulate human binocular interaction behavior.
For the summation signal, several biological models have been proposed to describe it [52]- [54]. Ding and Sperling [55] proposed a Gain-Control model, which explains well an early stage of human binocular visual behavior, such as Fechner's paradox [56] and cyclopean perception (including binocular fusion and binocular rivalry). Thus, we utilize the Gain-Control cyclopean model to synthesize the summation signal. Since the local structural information of an image is sensitive to distortion, and the image gradient has a strong ability to represent structural distortion, we obtain the gradient-related information of the cyclopean signal to get the binocular features.
The difference signal is obtained by subtracting the right image from the left image, and it has been shown that this difference information explains disparity perception well in the SIQA area [15], [57]. Works [58], [69] revealed that the difference signal is sensitive to disparity and carries information critical for stereo perception. Image entropy, a tool for measuring the amount of information, captures statistical information over scales [60] and is significantly sensitive to quality degradation and distortion [39]. There are two types of image entropies: spatial entropy and spectral entropy. The former reveals the statistical characteristics of local pixels, while the latter reflects those of local DCT coefficients. Both are sensitive to the types and degrees of distortion [43], [61]. We hypothesize that quantifying the changes of these two entropies makes it possible to predict the type and degree of the distortions affecting the difference signal. Thus, we derive the features of the difference signal from the spatial entropy and the spectral entropy.

III. PROPOSED MODEL
In this paper, we build a blind SIQA model by considering both monocular and binocular interaction features. The whole framework is shown in Fig.2. First, the luminance NSS features of the two views and the color features of the left view are extracted to estimate the monocular quality. Second, the binocular quality index is computed based on the multi-scale and multi-orientation features extracted from the summation signal (the cyclopean image) and the entropy features extracted from the difference signal. Finally, the overall quality score is obtained with a two-stage regression method consisting of a training stage and a testing stage. In the training stage, a regression model is learned via SVR to map the above features to subjective scores. In the testing stage, the final quality score is estimated by inputting all the extracted features into the trained model.

A. MONOCULAR FEATURES
For the monocular features, we first extract the luminance NSS features of the two views in the spatial domain. Taking the left view as an example, we apply divisive normalization to its luminance information $I_L$ as follows:
$$\hat{I}_L(i,j) = \frac{I_L(i,j) - \mu_L(i,j)}{\sigma_L(i,j) + 1}, \tag{2}$$
where $\mu_L(i,j)$ and $\sigma_L(i,j)$ are the local mean and the local standard deviation of the left view's luminance information, computed with a 2D Gaussian kernel of size 3 × 3 as in [62], and the constant 1 in the denominator prevents division by zero. For natural images, the coefficients of Eq. (2) follow a Gaussian-like distribution, which can be modeled by a zero-mean generalized Gaussian distribution (GGD):
$$f(x; \alpha, \sigma^2) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right), \qquad \beta = \sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}, \tag{3}$$
where $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt$, $x > 0$. The parameters $\alpha$ and $\sigma^2$ reflect the shape and the variance of the distribution, respectively. When the signal suffers distortion, this distribution changes. These changes can be quantified by the two parameters $\alpha$ and $\sigma^2$, which are related to the distortion types and degrees and can be used to predict the image quality [63]. Here, we fit the locally mean subtracted and contrast normalized (MSCN) coefficients to obtain these two parameters and build the first set of luminance NSS features $f_{L1} = (\alpha, \sigma^2)$.
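The MSCN normalization and the GGD fit described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Gaussian width `sigma=0.5`, the stabilizing constant `c=1.0`, the reflect padding, and the moment-matching grid search for the shape parameter are all assumptions, and the function names are hypothetical.

```python
import math
import numpy as np

def gaussian_kernel1d(size=3, sigma=0.5):
    # 1D Gaussian weights; sigma is an assumed value, not taken from the paper.
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def local_filter(img, k):
    # Separable 2D filtering with reflect padding (output has the input's shape).
    pad = len(k) // 2
    p = np.pad(img, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, p)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def mscn(lum, size=3, sigma=0.5, c=1.0):
    # Locally mean subtracted and contrast normalized coefficients.
    lum = np.asarray(lum, dtype=np.float64)
    k = gaussian_kernel1d(size, sigma)
    mu = local_filter(lum, k)
    var = local_filter(lum * lum, k) - mu * mu
    sd = np.sqrt(np.maximum(var, 0.0))
    return (lum - mu) / (sd + c)

def fit_ggd(x):
    # Moment matching: E[x^2] / (E|x|)^2 = G(1/a) G(3/a) / G(2/a)^2 for a GGD.
    x = np.ravel(x)
    sigma_sq = float(np.mean(x ** 2))
    rho = sigma_sq / float(np.mean(np.abs(x)) ** 2)
    shapes = np.arange(0.2, 10.0, 0.001)
    ratio = np.array([math.gamma(1 / a) * math.gamma(3 / a) / math.gamma(2 / a) ** 2
                      for a in shapes])
    alpha = float(shapes[np.argmin(np.abs(ratio - rho))])
    return alpha, sigma_sq
```

Calling `fit_ggd(mscn(luminance))` on one view at one scale would give the two values that play the role of $(\alpha, \sigma^2)$ in the first feature set.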
Besides, we extract a second set of luminance NSS features of the left view from the distribution of the products of pairs of adjacent MSCN coefficients, which also carries image quality information [63]. The products are computed along four orientations: horizontal, vertical, main-diagonal, and secondary-diagonal. Similar to the MSCN coefficients, these products can be modeled by a zero-mode asymmetric generalized Gaussian distribution (AGGD) [64]:
$$f(x; \gamma, \beta_l, \beta_r) = \begin{cases} \dfrac{\gamma}{(\beta_l + \beta_r)\,\Gamma(1/\gamma)} \exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\gamma}\right), & x < 0, \\ \dfrac{\gamma}{(\beta_l + \beta_r)\,\Gamma(1/\gamma)} \exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\gamma}\right), & x \ge 0. \end{cases}$$
The mean of the AGGD is given by
$$\eta = (\beta_r - \beta_l)\,\frac{\Gamma(2/\gamma)}{\Gamma(1/\gamma)}.$$
The parameters $(\gamma, \beta_l, \beta_r, \eta)$ are extracted as the second set of luminance NSS features of the left view, $f_{L2}$, which gives 16 parameters (4 parameters × 4 orientations). Together with the 2 GGD parameters of $f_{L1}$, this yields 18 features per scale. To capture the multi-scale information in the image, we extract the left-view monocular features at two scales: the original image scale and a reduced scale obtained by down-sampling the image by a factor of 2. Thus, a total of 36 features (18 features × 2 scales) are used to build the left-view monocular feature map. Similarly, we extract the above two sets of luminance features $(f_{R1}, f_{R2})$ from the right view to get the right-view monocular feature map.
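The paired-product step and the AGGD fit can be illustrated as follows. This is a sketch under stated assumptions: the moment-matching estimator is the one commonly used with BRISQUE-style features, which the paper does not confirm as its fitting procedure, and the names `paired_products` and `fit_aggd` are hypothetical.

```python
import math
import numpy as np

def paired_products(m):
    # Products of adjacent coefficients along the four orientations.
    return {
        "horizontal": m[:, :-1] * m[:, 1:],
        "vertical": m[:-1, :] * m[1:, :],
        "main_diagonal": m[:-1, :-1] * m[1:, 1:],
        "secondary_diagonal": m[:-1, 1:] * m[1:, :-1],
    }

def fit_aggd(x):
    # Moment-matching AGGD fit (assumed estimator); returns (gamma, beta_l, beta_r, eta).
    x = np.ravel(x)
    sd_l = math.sqrt(np.mean(x[x < 0] ** 2))
    sd_r = math.sqrt(np.mean(x[x > 0] ** 2))
    g = sd_l / sd_r
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    big_r = r_hat * (g ** 3 + 1) * (g + 1) / (g ** 2 + 1) ** 2
    shapes = np.arange(0.2, 10.0, 0.001)
    rho = np.array([math.gamma(2 / a) ** 2 / (math.gamma(1 / a) * math.gamma(3 / a))
                    for a in shapes])
    gamma = float(shapes[np.argmin(np.abs(rho - big_r))])
    scale = math.sqrt(math.gamma(1 / gamma) / math.gamma(3 / gamma))
    beta_l, beta_r = sd_l * scale, sd_r * scale
    eta = (beta_r - beta_l) * math.gamma(2 / gamma) / math.gamma(1 / gamma)
    return gamma, beta_l, beta_r, eta
```

Applying `fit_aggd` to each of the four product maps gives the 4 × 4 = 16 values per scale counted in the text.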
As pointed out in [34], the above two types of NSS features cannot be applied directly to color information. Hurvich [65] suggested that there are three types of cones in the human retina, each of which has two opponent color members. Later, Ruderman et al. [66] indicated that logarithmic-scale color information follows a Gaussian probability model, which provides a way to capture color statistical features. We extract the color features from the left view in RGB space based on the logarithmic-scale transformation
$$\hat{X}(i,j) = \log X(i,j) - \overline{\log X}, \qquad X \in \{R, G, B\},$$
where $\overline{\log X}$ is the mean of $\log X$ over the image. To fix the axes of the new logarithmic space, an orthogonal transformation is conducted as follows:
$$C_1 = \frac{\hat{R} + \hat{G} + \hat{B}}{\sqrt{3}}, \qquad C_2 = \frac{\hat{R} + \hat{G} - 2\hat{B}}{\sqrt{6}}, \qquad C_3 = \frac{\hat{R} - \hat{G}}{\sqrt{2}}.$$
The coefficients along these unit direction vectors $(C_1, C_2, C_3)$ follow a Gaussian probability law, which can be modeled as
$$f(x; \varsigma, \rho^2) = \frac{1}{\sqrt{2\pi}\,\rho} \exp\left(-\frac{(x - \varsigma)^2}{2\rho^2}\right).$$
The parameters $(\varsigma, \rho^2)$ of the three color components $(C_1, C_2, C_3)$ at two scales are extracted as the color statistical features $f_C$ (3 color components × 2 parameters × 2 scales = 12 color features). This completes the extraction of the monocular visual features (36 × 2 + 12 = 84 features in total).
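A minimal sketch of the color feature extraction is given below, assuming the log-opponent transform of Ruderman et al. as the orthogonal transformation, a sample mean and variance as the Gaussian fit, an offset `eps` to avoid `log(0)`, and plain decimation for the second scale. All of these choices and the function names are assumptions for illustration.

```python
import numpy as np

def opponent_color_stats(rgb, eps=1.0):
    # Log transform per channel, centred on each channel's mean log value.
    rgb = np.asarray(rgb, dtype=np.float64)
    logs = np.log(rgb + eps)  # eps avoids log(0); assumed, not from the paper
    logs = logs - logs.mean(axis=(0, 1), keepdims=True)
    r, g, b = logs[..., 0], logs[..., 1], logs[..., 2]
    channels = (
        (r + g + b) / np.sqrt(3.0),
        (r + g - 2.0 * b) / np.sqrt(6.0),
        (r - g) / np.sqrt(2.0),
    )
    feats = []
    for c in channels:
        # Gaussian fit by sample mean and variance.
        feats.extend([c.mean(), c.var()])
    return np.array(feats)

def color_features(rgb):
    # Two scales: the original image and a factor-2 decimated copy (assumed resampling).
    half = np.asarray(rgb)[::2, ::2]
    return np.concatenate([opponent_color_stats(rgb), opponent_color_stats(half)])
```

Note that with image-wide mean centring, as assumed here, the fitted means are close to zero by construction, so in this sketch most of the discriminative information sits in the variances.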

B. BINOCULAR FEATURES
As pointed out in [49], [50], the binocular features are extracted from the summation signal and the difference signal, both of which play an important role in binocular interaction and quality perception. First, we compute the cyclopean image based on the Gain-Control model [55] to represent the summation signal in the human brain. In this model, the left and right views are combined with weights determined by $E_L$ and $E_R$, the normalized energy responses of the left and right views, respectively, which are computed from the local energy of the log-Gabor filter responses. More details can be found in [33]. The 2D log-Gabor filter is defined in the frequency domain in terms of the radial frequency $\omega$ and the center frequency $\omega_s$.
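For illustration, one widely used simplified form of the Ding and Sperling gain-control combination weights each view by (1 + its own energy) divided by (1 + the sum of both energies). The sketch below assumes that form, takes the energy maps as given inputs, and ignores disparity compensation; it is not confirmed to be the exact expression used in this paper, and the function name is hypothetical.

```python
import numpy as np

def gain_control_cyclopean(left, right, e_left, e_right):
    # Assumed simplified gain-control weights; energies must be non-negative.
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    denom = 1.0 + e_left + e_right
    w_left = (1.0 + e_left) / denom
    w_right = (1.0 + e_right) / denom
    return w_left * left + w_right * right
```

Under this form, a view with much higher local energy dominates the combination, which is how the model accounts for the stronger influence of the higher-contrast eye.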
The local energy is then obtained from the log-Gabor filter responses. Since distortion affects image quality across multiple scales and multiple orientations, we apply a multi-scale log-Gabor filter bank to extract the binocular NSS features; such filters explain well the spatial-frequency response of visual cortical cells in each eye's receptive field [67], [68]. Given that gradient information has a strong ability to reflect local structural distortion, the binocular features of the summation signal are extracted from the gradient map of the cyclopean image. By applying log-Gabor filters with 3 different center frequencies and 4 different orientations to the cyclopean image, 12 response maps are obtained. The related parameters are set as follows: $\sigma_s = 0.6$ and $\sigma_o = 0.71$. Define the response of the real part as $R_{m,n}$ and that of the imaginary part as $IM_{m,n}$ (m = 0, 1, 2; n = 0, 1, 2, 3). We first adopt the GGD model to extract the best-fit model parameters ($\alpha$ and $\sigma^2$) of the above two response maps and of their smoothed directional gradient components as the first set of NSS features of the summation signal $f_{S1}$, and a total of 144 features are extracted.
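A bank of 3 center frequencies × 4 orientations can be built in the frequency domain as sketched below. The sketch uses the standard log-Gabor form (a Gaussian on the log radial frequency multiplied by a Gaussian on the angular distance). Treating the two values 0.6 and 0.71 from the text as the radial and angular bandwidths is an assumption, as are the center frequencies and the function names.

```python
import numpy as np

def log_gabor_bank(shape, center_freqs=(1 / 6, 1 / 12, 1 / 24), n_orient=4,
                   sigma_s=0.6, sigma_o=0.71):
    # Frequency-domain log-Gabor filters; center_freqs are assumed values.
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0  # placeholder to avoid log(0); DC gain is zeroed below
    theta = np.arctan2(-fy, fx)
    bank = []
    for w0 in center_freqs:
        radial = np.exp(-(np.log(radius / w0) ** 2) / (2.0 * sigma_s ** 2))
        radial[0, 0] = 0.0
        for n in range(n_orient):
            angle = n * np.pi / n_orient
            # Wrapped angular distance to the filter orientation.
            d = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(d ** 2) / (2.0 * sigma_o ** 2))
            bank.append(radial * angular)
    return bank

def filter_responses(img, bank):
    # Complex responses: the real and imaginary parts are the even and odd outputs.
    spectrum = np.fft.fft2(np.asarray(img, dtype=np.float64))
    return [np.fft.ifft2(spectrum * g) for g in bank]
```

The real and imaginary parts of each returned map would correspond to $R_{m,n}$ and $IM_{m,n}$ in the notation of the text.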
Then a Weibull distribution is applied to model the gradient magnitude of the above response maps to obtain additional NSS features of the summation signal:
$$f(x; a, b) = \frac{a}{b}\left(\frac{x}{b}\right)^{a-1} \exp\left(-\left(\frac{x}{b}\right)^{a}\right), \qquad x \ge 0.$$
The parameter $a$ reflects the texture of the gradient magnitude map, while $b$ reflects its local contrast. Hence, we take these two best-fit model parameters as the additional summation NSS features $f_{S2} = (a, b)$, and a total of 48 features are extracted. To illustrate the above features, the responses of the filter bank with 3 scales and 4 orientations for a cyclopean image are shown in Fig.3. Note that images with different content behave similarly, so only one image is presented as an example. Fig.3(a) is the cyclopean image, and (b)-(d) are the log-Gabor responses, the directional gradient components, and the gradient magnitudes, respectively. The responses at different scales and orientations differ from each other, and different components carry different information, which supports extracting multi-scale and multi-orientation features as quality-aware features.
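The Weibull fit can be sketched with a maximum-likelihood estimate of the shape parameter found by bisection. This is an illustrative estimator, not necessarily the fitting procedure used by the authors; the central-difference gradient in `gradient_magnitude` stands in for the smoothed gradient operator, and both function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(resp):
    # Central-difference gradient magnitude (assumed gradient operator).
    gy, gx = np.gradient(np.asarray(resp, dtype=np.float64))
    return np.hypot(gx, gy)

def fit_weibull(x, lo=0.05, hi=20.0, iters=60):
    # Maximum-likelihood fit; returns (a, b) = (shape, scale).
    x = np.ravel(np.asarray(x, dtype=np.float64))
    x = x[x > 0]
    peak = x.max()
    z = x / peak  # rescaling keeps powers of z numerically stable
    log_z = np.log(z)
    mean_log_z = log_z.mean()

    def score(a):
        za = z ** a
        return (za * log_z).sum() / za.sum() - 1.0 / a - mean_log_z

    # score(a) increases with a, so bisection finds its single root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    a = 0.5 * (lo + hi)
    b = peak * np.mean(z ** a) ** (1.0 / a)
    return a, b
```

Fitting the gradient magnitude of the real and imaginary parts of each of the 12 response maps gives 12 × 2 × 2 = 48 values, which matches the count in the text.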
The difference signal represents the difference between the two views, and the amount of information within it reveals the depth perception and the asymmetric distortion. Entropy is an effective way to compute the amount of information and is highly sensitive to the degree and type of image distortion [43], [61], which provides a way to extract the features of the difference signal. Different from the NSS features based on pixel relations, we compute the statistical features of the difference signal on local regions (8 × 8 image patches), which can represent the local structural information and quality perception of the image. Both spatial and spectral entropies are considered in this paper. To visualize the behavior of these two entropies against the types and degrees of distortion, images affected by different degrees and types of distortion and the corresponding histograms of the two entropies are shown in Fig.4 and Fig.5. Note that images with different content behave similarly, so only one image is presented as an example. Fig.4 shows the left view of four 3D images affected by four different degrees of WN distortion from LIVE 3D Phase I [69]: (a) Figure d1 with the minor degree, (d) Figure d4 with the most severe degree, and (b)(c) with the middle degrees. Fig.5(a) shows the spatial entropy histograms of the original difference signal and of Fig.4(a)-(d). Fig.5(b) shows the spectral entropy histograms of the original difference signal and of Fig.4(a)-(d). Fig.5(c)(d) show the spatial entropy histograms and the spectral entropy histograms, respectively, of the original difference signal and of five difference signals with five types of distortion. Fig.5(e)(f) show the spatial entropy histograms and the spectral entropy histograms of the original difference signal and of two distorted difference signals affected by symmetric and asymmetric distortions.
Fig.5(a) reveals that the distortions in Figures d1, d2, and d3 increase the mean and skew the spatial entropy histogram to the left, while the distortion in Figure d4 reduces the mean and skews the histogram to the right. In Fig.5(b), the distortions in Figures d1 and d2 increase the mean and skew the spectral entropy histogram to the left, while the distortions in Figures d3 and d4 reduce the mean and skew the histogram to the right. It can be concluded that the spatial entropy and the spectral entropy are sensitive to the degree of distortion. Fig.5(c)(d) reveal that the types of distortion are related to the distributions of the spatial and spectral entropies. The mean and the skew vary with the type of distortion. For example, WN increases the mean and skews the spatial entropy histogram to the left, while the other four types of distortion reduce the mean and skew the histogram to the right. The spectral entropy of the original difference image has a larger mean and skew than that of any of the distortion types. Fig.5(e)(f) indicate that the distributions of the spatial and spectral entropies can reflect whether the distortion is symmetric or asymmetric. Overall, spatial entropy and spectral entropy are both sensitive to the type and the degree of the distortion and can be utilized to extract the features of the difference signal.
Since the mean and the skew of the histogram change with the type and the degree of the distortion, we extract the mean and the skew of the entropies of the difference signal as its quality features. Specifically, the difference signal is first partitioned into 8 × 8 blocks, which serve as the basic units for computing the entropy. The spatial entropy of a block is given by
$$E_s = -\sum_{x} p(x) \log_2 p(x),$$
where $x$ denotes the pixel values within the block and $p(x)$ is their probability. Then we compute the spectral entropy of each block of the difference signal based on the normalized DCT coefficient matrix $c$ as follows:
$$E_f = -\sum_{i}\sum_{j} c(i,j) \log_2 c(i,j),$$
where $1 \le i, j \le 8$ and the DC coefficient, $(i, j) = (1, 1)$, is excluded. Then we take the mean and skew of the spatial and spectral entropies at two scales as the features of the difference signal $f_D$, and a total of 8 features (2 statistics × 2 entropies × 2 scales) are extracted. Overall, all quality-aware visual features are extracted, as tabulated in Table 1.
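The block-wise entropy features can be sketched as follows. The sketch assumes that the normalized DCT matrix is the squared block DCT coefficients divided by their sum with the DC term removed (as in common spatial-spectral entropy IQA methods), that pixel probabilities come from the histogram of rounded values in each block, and that blocks do not overlap. Function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis, so the block DCT is D @ block @ D.T.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def block_entropies(diff, n=8):
    # Spatial and spectral entropy of each non-overlapping n x n block.
    d = dct_matrix(n)
    h, w = diff.shape
    spatial, spectral = [], []
    for r in range(0, h - n + 1, n):
        for c in range(0, w - n + 1, n):
            blk = np.asarray(diff[r:r + n, c:c + n], dtype=np.float64)
            _, counts = np.unique(np.round(blk), return_counts=True)
            p = counts / counts.sum()
            spatial.append(float(-(p * np.log2(p)).sum()))
            power = (d @ blk @ d.T) ** 2
            power[0, 0] = 0.0  # exclude the DC coefficient
            total = power.sum()
            q = power[power > 0] / total if total > 0 else np.array([1.0])
            spectral.append(float(-(q * np.log2(q)).sum()))
    return np.array(spatial), np.array(spectral)

def mean_and_skew(v):
    mu, sd = v.mean(), v.std()
    skew = 0.0 if sd == 0 else float(np.mean(((v - mu) / sd) ** 3))
    return float(mu), skew

def difference_features(left, right):
    # Four features at one scale: mean and skew of each entropy map.
    diff = np.asarray(left, dtype=np.float64) - np.asarray(right, dtype=np.float64)
    spatial, spectral = block_entropies(diff)
    return np.array(mean_and_skew(spatial) + mean_and_skew(spectral))
```

Running `difference_features` on the original pair and on a factor-2 down-sampled pair gives the 8 values of $f_D$.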

C. QUALITY PREDICTION
After extracting all the above features, the LIBSVM package is utilized to solve the quality prediction problem [70]. In this paper, we adopt SVR with a radial basis function (RBF) kernel to train the prediction function, which has been used effectively in other NR image quality assessment models [16], [43], [51]. We first train a regression model based on the extracted features and the corresponding subjective scores. As in other SIQA works, each database is divided randomly into two non-overlapping parts: 80% for training and 20% for testing. 1000 iterations are performed, and the mean value is reported for the parameter selection during the training procedure. At the testing stage, we use the regression model to map the extracted features to the final objective 3D quality score.
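The repeated random 80%/20% protocol can be sketched as below. To keep the example self-contained, an RBF-kernel ridge regressor written in NumPy stands in for the LIBSVM SVR used in the paper; it is a different regressor, and `gamma`, `lam`, the correlation score, and the function names are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Squared Euclidean distances turned into an RBF kernel matrix.
    d2 = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_ridge_fit_predict(x_tr, y_tr, x_te, gamma=0.1, lam=1e-2):
    # Stand-in for SVR: RBF-kernel ridge regression (NOT the paper's regressor).
    k = rbf_kernel(x_tr, x_tr, gamma)
    coef = np.linalg.solve(k + lam * np.eye(len(x_tr)), y_tr)
    return rbf_kernel(x_te, x_tr, gamma) @ coef

def repeated_holdout(x, y, n_rounds=1000, train_frac=0.8, seed=0):
    # Random non-overlapping train/test splits; returns the mean test correlation.
    rng = np.random.default_rng(seed)
    n = len(y)
    n_tr = int(round(train_frac * n))
    scores = []
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        tr, te = perm[:n_tr], perm[n_tr:]
        pred = kernel_ridge_fit_predict(x[tr], y[tr], x[te])
        scores.append(np.corrcoef(pred, y[te])[0, 1])
    return float(np.mean(scores))
```

Here `x` would hold one feature vector per stereoscopic image and `y` the corresponding DMOS values; swapping the stand-in for an actual SVR only changes `kernel_ridge_fit_predict`.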

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATABASES AND PERFORMANCE CRITERIA
In this section, three publicly available 3D IQA databases are utilized to verify the performance of the proposed model: LIVE 3D Phase I [69], LIVE 3D Phase II [11], and the MCL 3D Database [71]. All these databases provide both reference images and distorted images with corresponding subjective scores.
• LIVE 3D Phase I contains 365 distorted images with co-registered human scores in the form of the difference mean opinion score (DMOS). They are created from 20 reference images and affected symmetrically by five types of distortions: JP2K (80), JPEG (80), WN (80), FF (80), and Blur (45).
• LIVE 3D Phase II is created from 8 reference images. Every reference stereopair was processed to create 3 symmetric distorted images and 6 asymmetric distorted images. Altogether, there are 360 distorted images with co-registered human scores in the form of DMOS.
• MCL 3D Database consists of 20 reference images and 648 symmetrically distorted stereoscopic images affected by 6 types of distortions (JPEG, JP2K, Gaussian blur (GB), WN, down-sampling blur (SB), and transmission error (TE)) at four distortion levels.
In this paper, only the distorted images are used, and three commonly used performance criteria are utilized to verify the performance of the proposed model: the Pearson linear correlation coefficient (PLCC), the Spearman rank-order correlation coefficient (SROCC), and the root mean square error (RMSE). The higher the values of PLCC and SROCC, the better the model, while the smaller the value of RMSE, the better the model. PLCC = SROCC = 1 and RMSE = 0 indicate perfect performance.
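The three criteria can be computed as in the following NumPy sketch. It applies the definitions directly to the predicted scores; any nonlinear mapping of the predictions before computing PLCC and RMSE, which some IQA studies apply, is not included here, and the function names are hypothetical.

```python
import numpy as np

def average_ranks(v):
    # Ranks starting at 1, with tied values sharing their average rank.
    v = np.asarray(v, dtype=np.float64)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(v.size)
    ranks[order] = np.arange(1, v.size + 1)
    for value in np.unique(v):
        tie = v == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def plcc(pred, mos):
    # Pearson linear correlation coefficient.
    return float(np.corrcoef(pred, mos)[0, 1])

def srocc(pred, mos):
    # Spearman correlation = Pearson correlation of the ranks.
    return plcc(average_ranks(pred), average_ranks(mos))

def rmse(pred, mos):
    pred = np.asarray(pred, dtype=np.float64)
    mos = np.asarray(mos, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - mos) ** 2)))
```

A prediction that is monotonic but nonlinear in the subjective score reaches SROCC = 1 while its PLCC stays below 1, which is why the two correlations are reported together.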
From Table 2, the 2D-extended metrics, especially the ADD-GSIM model, achieve competitive performance on LIVE 3D Phase I, but they fail to predict the quality of asymmetrically distorted images because binocular information is missing. The proposed model ranks in the top two on all three databases, which demonstrates its effectiveness in predicting 3D image quality. Specifically, on LIVE 3D Phase I, the proposed model has the best PLCC and RMSE performance and obtains competitive SROCC performance compared with Ma2018's work. On LIVE 3D Phase II, the proposed model has better overall performance than Ma2018's work, which suggests that the proposed model performs better on asymmetrically distorted images than on symmetrically distorted ones. Besides, the proposed model performs better than Liu2019's work on LIVE 3D Phase I but worse than Liu2019's model on LIVE 3D Phase II. Note that the performance values of the proposed model on Phase II are very close to those of the top-ranked method, Liu2019's work, which indicates that the proposed model shows high correlation with human perception for symmetrically distorted images. Based on the above analysis, the proposed model correlates well with human subjective perception for both symmetrically and asymmetrically distorted images.
For the MCL 3D Database, the proposed model ranks first among all the models in terms of PLCC and SROCC, and its RMSE is very close to the best value. The proposed model shows robust and accurate performance, which further demonstrates its effectiveness in assessing the quality of symmetrically distorted images. Overall, by considering monocular and binocular visual properties, the proposed model correlates well with human subjective perception and evaluates the quality of both symmetrically and asymmetrically distorted stereoscopic images well, which indicates that the proposed metric is a consistent, stable, and accurate model that is better suited to practical applications than other works.

C. PERFORMANCE ON INDIVIDUAL DISTORTION TYPE
To validate the effectiveness of the proposed model on specific types of distortion, we test the proposed model and other representative SIQA models on each distortion type separately. Since the MCL 3D Database, like LIVE 3D Phase I, contains only symmetric distortions, and LIVE 3D Phase I and Phase II share five common distortion types, we conduct this comparison on the LIVE 3D databases. To save space, we only present the PLCC and SROCC results. The results are listed in Table 4 and Table 5, in which the top results are highlighted in boldface.
The proposed model yields the best PLCC performance on JP2K and WN in LIVE 3D Phase I, and on Blur and FF in LIVE 3D Phase II. Although it is not the best model on the remaining distortion types, it shows competitive, robust, and stable prediction accuracy across all the distortion types. For the SROCC performance, our model yields the best performance on JP2K and WN in [...]
Fig. 6 shows the scatter plots of the subjective scores versus the objective scores predicted by the proposed model. The degree of convergence of the points reflects the overall performance: the tighter the convergence, the better the performance. The vertical axis denotes the provided DMOS values and the horizontal axis represents the predicted objective quality scores. Fig. 6(a) and (b) show the overall performance and the per-distortion performance on LIVE 3D Phase I, and Fig. 6(c) and (d) show those on LIVE 3D Phase II. Fig. 6 shows that the proposed model converges well on both databases and on all five types of distortion, which demonstrates the consistency between the subjective scores and the objective scores. In summary, the proposed model can effectively and accurately evaluate stereoscopic images affected by different types of distortion, which indicates that the performance of the proposed metric depends only weakly on the distortion type.
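The per-distortion evaluation amounts to splitting the test samples by distortion label and computing the correlation criteria within each group. The sketch below illustrates this under stated assumptions: the `(distortion, predicted, dmos)` tuple layout, the function names, and the numbers are all hypothetical and are not taken from the paper or the databases.

```python
import math
from collections import defaultdict

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def srocc(x, y):
    """Spearman rank-order correlation."""
    return plcc(ranks(x), ranks(y))

def per_distortion(samples):
    """Group (distortion, predicted, dmos) tuples and score each group."""
    groups = defaultdict(lambda: ([], []))
    for kind, pred, dmos in samples:
        groups[kind][0].append(pred)
        groups[kind][1].append(dmos)
    return {
        kind: {"PLCC": plcc(p, d), "SROCC": srocc(p, d)}
        for kind, (p, d) in groups.items()
    }

# Made-up samples for illustration only.
samples = [
    ("JPEG", 20, 22), ("JPEG", 35, 30), ("JPEG", 50, 55),
    ("WN", 10, 15), ("WN", 40, 38), ("WN", 30, 42),
]
table = per_distortion(samples)
for kind, scores in table.items():
    print(kind, scores)
```

Each entry of `table` corresponds to one cell pair of a per-distortion table such as Table 4 or Table 5, computed on that distortion type alone.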

D. COMPARISON AMONG DIFFERENT FEATURES
In this paper, both monocular and binocular visual features are extracted to build the NR SIQA model. To understand the role of each group of features in the proposed model, we design three schemes for performance comparison. Scheme A uses only the monocular features. Scheme B uses only the binocular features. Scheme C uses all the features except the monocular luminance NSS features of the left view. The corresponding results are listed in Table 6. The results show that neither the monocular features alone nor the binocular features alone achieve the best performance. The monocular and binocular features are complementary, and combining them yields a more reasonable and accurate quality evaluation model. Besides, scheme C performs worse than the full proposed model, which indicates that the monocular luminance features of each view play an important role in quality assessment.
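The three schemes differ only in which columns of the feature vector are passed to the regressor. The following sketch shows one way to express that selection. The block names and block widths in `LAYOUT` are hypothetical placeholders, since the actual feature dimensions are not restated in this section; only the inclusion rules for schemes A, B, and C follow the text.

```python
# Hypothetical feature layout: (block name, number of columns).
# Names and widths are assumptions made up for this illustration.
LAYOUT = [
    ("mono_luma_left", 4),
    ("mono_luma_right", 4),
    ("mono_color_left", 3),
    ("mono_color_right", 3),
    ("bino_sum", 5),
    ("bino_diff", 2),
]

# Inclusion rule per scheme, as described in the text.
SCHEMES = {
    "A": lambda name: name.startswith("mono_"),   # monocular features only
    "B": lambda name: name.startswith("bino_"),   # binocular features only
    "C": lambda name: name != "mono_luma_left",   # drop left-view luminance NSS
    "full": lambda name: True,                    # the complete model
}

def block_columns(layout):
    """Map each block name to the column indices it occupies."""
    cols, start = {}, 0
    for name, width in layout:
        cols[name] = list(range(start, start + width))
        start += width
    return cols

def scheme_columns(scheme, layout=LAYOUT):
    """Column indices kept by the given scheme."""
    keep = SCHEMES[scheme]
    return [c for name, cols in block_columns(layout).items()
            if keep(name) for c in cols]

def select(features, scheme):
    """Keep only the scheme's columns in every feature row."""
    idx = scheme_columns(scheme)
    return [[row[i] for i in idx] for row in features]

for scheme in SCHEMES:
    print(scheme, len(scheme_columns(scheme)), "columns")
```

The reduced feature matrix returned by `select` would then go through the same training and testing procedure as the full model, so that any performance difference can be attributed to the removed features.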

E. CROSS-DATABASE PERFORMANCE EVALUATION
We also perform a cross-database evaluation of the proposed model to validate its generality and stability. Since the subjective scores of the LIVE 3D databases are provided in the form of DMOS, whereas the subjective scores of the MCL 3D database are provided in the form of MOS, cross-database experiments between the LIVE 3D databases and the MCL 3D database are not appropriate. Therefore, in this paper, we only conduct the cross-database test on the LIVE 3D databases. The training samples and testing samples come from different databases, and the experimental procedure is the same as that described in Section III.C. The experimental results are shown in Table 7, in which the top two performances are highlighted in boldface. Table 7 reveals that the overall performance of every model is lower than in the former experiments, in which the training and testing samples come from the same database. Although Yang2018's model, trained on LIVE II and tested on LIVE I (denoted LIVE II/LIVE I), achieves better performance than the others, it does not perform well under the LIVE I/LIVE II strategy. Liu2019's model performs well in both cross-database tests, and the proposed model likewise delivers competitive and stable performance across the two strategies. Besides, the proposed model performs better under LIVE II/LIVE I than under LIVE I/LIVE II. A probable reason is that LIVE II contains both symmetrically and asymmetrically distorted images: a model trained on LIVE II has seen both kinds of distortion, whereas a model trained on LIVE I has not seen the asymmetric distortions that appear in the LIVE II testing samples. Overall, the results indicate that the proposed model depends only weakly on image content and shows good robustness and generality.
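The cross-database protocol can be sketched as a loop that trains on one database and tests on the other, in both directions. The example below is an illustration under stated assumptions rather than the authors' code: `LineFit` is a deliberately tiny stand-in regressor (one-feature least squares), not the SVR used in the paper, and the feature rows and scores are made up. RMSE is used as the reported criterion only to keep the sketch short.

```python
import math

class LineFit:
    """Stand-in regressor (one-feature least squares), NOT the paper's SVR."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - mx) ** 2 for x in xs)
        self.slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, X):
        return [self.intercept + self.slope * row[0] for row in X]

def rmse(pred, target):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def cross_database(make_model, databases):
    """Train on each database and test on every other one.

    databases maps a name to (feature_rows, subjective_scores).
    Result keys follow the 'train/test' notation, e.g. 'LIVE II/LIVE I'.
    """
    results = {}
    for train_name, (x_train, y_train) in databases.items():
        model = make_model().fit(x_train, y_train)
        for test_name, (x_test, y_test) in databases.items():
            if test_name != train_name:
                key = f"{train_name}/{test_name}"
                results[key] = rmse(model.predict(x_test), y_test)
    return results

# Made-up features and scores for illustration only.
databases = {
    "LIVE I": ([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0]),
    "LIVE II": ([[1.0], [2.0], [3.0]], [12.0, 22.0, 32.0]),
}
results = cross_database(LineFit, databases)
for strategy, error in results.items():
    print(strategy, error)
```

Any regressor exposing `fit(X, y)` that returns the fitted object and `predict(X)` can be passed as `make_model`; scikit-learn's `sklearn.svm.SVR` follows that interface, so it could replace the stand-in if scikit-learn is available.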

V. CONCLUSION
This paper proposes a novel blind stereoscopic image quality assessment model that considers both human monocular and binocular visual properties. In particular, color quality-aware features, which are ignored in most previous works, are extracted to complement the monocular quality index. The binocular part adopts a more reasonable visual model to obtain the binocular features, in which multiple binocular gradient features are extracted for 3D quality assessment for the first time. The difference signal and its features provide a simple way to account for depth perception with acceptable computational complexity. Experiments conducted on three popular public databases, LIVE 3D Phase I, LIVE 3D Phase II, and the MCL 3D Image Quality Database, demonstrate that the proposed model is more robust and more consistent with human subjective perception than previous state-of-the-art models, which supports its effectiveness in evaluating stereoscopic image quality. In the future, we will focus on using deep learning tools to learn more effective quality-aware visual features for predicting stereoscopic image quality.
His research interests focus on image processing and pattern recognition.
BAOQING HUANG is currently pursuing the M.S. degree with the College of Information, Liaoning University.
His research interests focus on stereo vision research, image processing, and pattern recognition.
HONGWEI YU is currently pursuing the M.S. degree with the College of Information, Liaoning University.
His research interests include image quality evaluation and pattern recognition.