Reduced-Reference Stereoscopic Image Quality Assessment Using Gradient Sparse Representation and Structural Degradation

Reduced-reference stereoscopic image quality assessment (RRSIQA) models evaluate the quality degradation of stereoscopic images using only partial information about the “ideal-quality” reference stereopair. On one hand, recent theoretical studies of visual cognition have shown that sparse representation resembles the strategy used to represent natural images in the primary visual cortex. On the other hand, the joint statistics of gradient magnitude (GM) and Laplacian of Gaussian (LOG) features are widely used to characterize image semantic structures. Motivated by these findings, we present a new RRSIQA metric using gradient sparse representation and structural degradation. Concretely, the proposed metric is based on two main tasks: the first extracts the distribution statistics of visual primitives by gradient sparse representation, while the second measures the structural degradation of the stereoscopic image caused by distortion by extracting the joint statistics of GM and LOG features. The former, called the binocular perceptual visual information (PVI), aims to effectively integrate the gradient maps, which are sparser than the image itself. In particular, the process of binocular fusion is simulated by using the mutual information of the gradient-based visual primitives between the left-view and right-view images as the binocular cue. Furthermore, the perceptual loss vector is formed from the differences in binocular perceptual visual information and structural degradation between the reference and distorted stereopairs. Finally, the perceptual loss vector is mapped to a quality score by a prediction function trained using kernel ridge regression (KRR). Experiments are performed on the popular LIVE 3D IQA databases and Waterloo IVC 3D databases, and the results show performance that is highly competitive with state-of-the-art algorithms. Moreover, in some challenging cases with particular asymmetric distortion types, the proposed metric achieves the best quality prediction accuracy on LIVE 3D Phase II and Waterloo IVC 3D Phase II.


I. INTRODUCTION
During the past two decades, various three-dimensional (3D) technologies (such as 3D image coding, reconstruction, enhancement, and monitoring) have advanced rapidly and drastically changed the way people view their world. However, since current 3D technologies are still immature, various levels and types of distortion will inevitably be introduced into 3D content, which may degrade 3D visual quality. For this reason, there is an urgent demand for an effective 3D content quality evaluation method. In general, the most direct and reliable way to estimate image quality is subjective assessment. However, subjective methods are regarded as inconvenient, expensive, and time consuming [1]. These drawbacks motivate the development of efficient and fast objective stereoscopic image quality assessment (SIQA) metrics. Objective SIQA is a key link in 3D image processing systems, yet establishing an effective SIQA metric has always been a difficult challenge, particularly when the two monocular images of a stereopair have different levels and types of distortion, which is called asymmetric distortion [2]. To address this issue, some existing studies [3][4][5] assume that the human visual system (HVS) may employ both the quality of the two monocular images and the quality of the depth/disparity map to evaluate the quality of a stereo image. However, ground-truth depth/disparity maps are not always available, and the depth/disparity information may not be directly related to the quality of the stereoscopic image. Hence, there is still considerable room for research on objective SIQA. In general, based on the reference information provided to the calculation model, SIQA methods can be divided into three categories: full-reference (FR), reduced-reference (RR), and no-reference (NR, or blind) SIQA metrics. The FR metric operates on a distorted stereopair with a reference stereopair available for comparison, while NR SIQA methods do not use reference information at all. As a compromise, the RR metric uses only partial information or a handful of features extracted from the reference stereopair [3]. The general framework of the RRSIQA system is shown in Fig. 1; it includes two parts, the sender side and the receiver side. At the sender side, a feature extraction process is performed on the reference stereopairs. Likewise, at the receiver side, the same procedure is applied to the distorted stereopairs. In this study, we focus on the RRSIQA method, which is widely used to guide the optimization of 3D content production.
Fig. 1. The general framework for the SIQA system.
Since the ultimate receiver of an image is the visual cortex of the brain, the key to objective image quality assessment (IQA) is to match the characteristics of the HVS. In [6], Field and Olshausen showed that natural images can be sparsely represented by an overcomplete set of simple atoms. Furthermore, as a supplement to the distribution-based statistical description of natural images, the basic structure of natural images is reflected in the receptive fields of retinal and cortical neurons [6]. In [7], Z. Wang et al. validated that human eyes are very sensitive to changes in the structural information of the input scene. Besides, the work of [8] stated that the efficiency of neural coding depends both on the transformations that map the input to the neural response and on the statistics of the input. Thus, the evaluation of image quality by human eyes depends not only on the statistical characteristics of the image but also on the visual characteristics of the HVS. However, most previous SIQA metrics are derived from either the visual characteristics of the HVS or the statistical characteristics of the image alone, and therefore do not assess stereoscopic image quality accurately. In this paper, we combine the visual characteristics of the HVS with the statistical characteristics of images to overcome the shortcomings of a single strategy.
Sparse representation originates from, and is directly related to, compressed sensing (CS) [9]. Sparse representation theory shows that sparse or compressible signals can be accurately reconstructed from a small number of basis atoms in a certain subspace [10]. With advancements in mathematics, sparse representation methods now span a wide variety of applications, especially in image processing, such as image segmentation [11], image denoising [12], visual tracking [13], and image super-resolution [14]. Meanwhile, sparse representation also shows great potential for IQA [15][16][17]. Almost all existing sparse representation-based IQA methods follow a three-stage framework: dictionary learning (DL), quality-aware feature extraction, and regression model learning from subjective opinions. For DL, the K-SVD algorithm [18] has proven to be an effective method. In the quality-aware feature extraction stage, the concept of entropy of primitive (EoP) has been proposed to measure image visual information [19][20]. Several SIQA metrics have since been built on the concept of EoP. For instance, Qi et al. [21] presented an RRSIQA metric using binocular perceptual information, and Wan et al. [22] proposed an RRSIQA method using sparse representation and natural scene statistics (NSS). Furthermore, in the regression analysis phase, the most commonly used regression model is support vector regression (SVR) with a radial basis function (RBF) kernel. Although these RRSIQA metrics achieve relatively good evaluation results, their performance is still limited. Therefore, an interesting question is whether it is more effective to train the dictionary in a transform domain. A related work on this question was recently proposed by Liu et al., where the dictionary is learned from patches extracted in the gradient domain [23]. It suggests that the sparser the training samples/patches, the more powerful the learned dictionary. However, the works in [21][22] are restricted to the original pixel domain, and extending them to the sparser gradient domain is appealing.
Apart from the sparse representation theory used for IQA, a number of NSS-oriented IQA measurement theories have appeared in the last couple of years. For instance, the most widely accepted method is to use the generalized Gaussian density (GGD) to model the marginal distributions of luminance wavelet coefficients [24]. However, because the GGD is applied to discrete wavelet transform (DWT) or discrete cosine transform (DCT) coefficients, most existing NSS-based SIQA methods suffer from two drawbacks: limited representation of image semantic structure and the use of computationally expensive image transformations. Both issues are of great concern in IQA.
Compared to the existing NSS-based IQA models, low-order Gaussian derivative operators, such as GM and LOG, are very suitable for designing high-performance NR IQA metrics [25]. The reasons why the GM and LOG features are so effective can be divided into two parts: a direct cause and an essential cause. The direct cause is that the GM feature measures the strength of local luminance change, while the LOG operator responds to intensity contrast in a small spatial neighborhood. The essential cause is that the LOG operator is a good model of the receptive field of retinal ganglion cells [26][27]. In fact, such low-order Gaussian derivative operators have been widely applied in computer vision [28][29][30]. The effectiveness of the joint statistics of GM and LOG features in the work of [25] motivated us to introduce them into the RRSIQA task.
Inspired by the above analysis, we propose a new RRSIQA metric using gradient sparse representation and structural degradation. First, the binocular perceptual visual information (PVI) is extracted by using gradient-based sparse representation. To be specific, the entropy of gradient primitives (EGP) of each view image is used as the monocular cue, while the mutual information of gradient primitives (MIGP) between the two monocular images is regarded as the binocular cue. Then, since the HVS is very sensitive to the structural degradation of natural images, this paper uses the joint statistics of GM and LOG features to measure the structural degradation of each view image of the distorted stereopair, which supplements the monocular cue EGP. Besides, compared with the SVR model, kernel ridge regression (KRR) can be fitted in closed form and is usually faster for medium-size datasets. Therefore, we use KRR to establish the nonlinear relationship between quality-aware features and the stereoscopic image quality index. A preliminary version of this study is published in [71], which does not consider the structural degradation of the stereopair. It is therefore a sub-optimal model, since not all of the available stereoscopic image information is exploited. In this paper, we add new insights and innovations to the initial version in the following ways so that the proposed model has higher accuracy and versatility.
The novelties of this study are summarized as follows.
(1) We propose a new RRSIQA model based on two complementary components: the sparsity properties of the HVS and the joint statistics of image semantic structural degradation. These two components are used to quantify the perceived quality degradation of each view image of a stereoscopic image. Compared with previous works, we demonstrate that the use of gradient sparse representation and the joint statistics of GM and LOG features results in higher consistency with subjective opinions.
(2) We introduce the concepts of EGP and MIGP to achieve a perceptual visual signal representation of the distorted stereoscopic image. More importantly, this study opens a new avenue for studying how the sparse representation model in the gradient domain can be used in RRSIQA framework design.
(3) Through comprehensive verification, we find that the proposed model achieves high consistency with subjective scores on both symmetric and asymmetric distortion. In addition, because the proposed model does not use the depth/disparity information of stereopairs, its computational complexity is low enough to meet the requirements of real-time applications.
The rest of this study is organized as follows. Section II provides an overview of the related work. Section III introduces the proposed RRSIQA model. Section IV demonstrates and analyzes the experimental results. Section V summarizes this article.

II. RELATED WORK
With the booming market for stereoscopic images across a variety of 3D applications, research on efficient and effective SIQA techniques is essential. Currently, existing SIQA methods generally fall into four categories: SIQA models extended from typical 2D IQA models, SIQA methods developed by simulating the characteristics of the HVS, SIQA models designed by extracting the regularities of NSS, and SIQA models based on deep learning.

A. SIQA MODEL EXTENDED FROM THE TYPICAL 2D IQA MODEL
In earlier studies, because a stereoscopic image consists of a left image and a right image, some researchers attempted to extend typical 2D IQA models to SIQA. For simplicity, this kind of method usually processes each view image of the stereopair independently and combines the quality scores of the two views to yield the quality index of the distorted stereoscopic image. For instance, You et al. [4], Benoit et al. [5], Campisi et al. [31], and Gorley et al. [32] extended existing 2D IQA models to SIQA models in a simple and direct manner. Furthermore, in [33], the relevance between subjective scores and three 2D IQA quality indexes, namely PSNR, video quality model (VQM) [34], and SSIM [7], was investigated for stereoscopic video. However, since these models evaluate the quality of each view image individually without considering the binocular perceptual characteristics of the HVS, the resulting performance is unlikely to be satisfactory.

B. SIQA METHOD DEVELOPED BY SIMULATING THE CHARACTERISTICS OF HVS
As the study of SIQA moves forward, particularly with an increasing understanding of binocular perceptual mechanisms such as binocular fusion, binocular rivalry, and depth perception in the HVS, many scholars have tried to incorporate these characteristics into SIQA design. In [35], an FR SIQA model based on the binocular fusion process was proposed. In [36], an FR SIQA model based on the cyclopean image was presented. In [37], an FR SIQA metric using binocular combination and binocular frequency integration was developed. The work of [38] simulated simple and complex cells [...]
Appina et al. [55] proposed a blind video quality index by measuring the statistical dependencies between motion and disparity subband coefficients of stereoscopic video. Nevertheless, most of these SIQA models need to use disparity/depth information, while disparity estimation is still a fundamental, unsolved problem in stereo-related research. On the other hand, traditional NSS-based models usually require computationally expensive image transformations. These shortcomings are the bottlenecks that restrict their widespread application.

III. THE PROPOSED METHOD
As discussed previously, we explore fused features that incorporate gradient-based sparse representation features and the joint statistics of GM and LOG features to build an efficient RRSIQA model. The proposed framework for the RRSIQA metric is shown in Fig. 2. First, following the general framework of the RRSIQA system, an adaptive dictionary is trained offline in the gradient domain at the sender side to build the sparse representation of stereoscopic images. Note that the process of gradient-based dictionary learning is independent of the test stereopairs. Afterwards, the GM maps and the LOG responses of the monocular images of the reference and distorted stereopairs are calculated by using the low-order Gaussian derivative operators. For each reference and distorted stereopair, the binocular PVI and the joint statistics of structural degradation are used for quality prediction. The binocular PVI is extracted by using gradient-based sparse representation. More specifically, the EGP of each view is calculated as the monocular cue, and the MIGP of the left and right views is derived as the binocular cue. Moreover, the structural degradation is represented by the joint statistics of GM and LOG features. A perceptual loss vector is obtained by calculating the differences in binocular PVI and structural degradation between the reference and distorted stereopairs. Finally, the perceptual loss vector is fed into the KRR to build the mapping between quality-aware features and objective quality scores of the test stereopairs. We describe our approach below.

A. GRADIENT SPARSE REPRESENTATION
The most popular interpretation of the sparse representation model assumes that a natural signal, represented by the vector $x \in \mathbb{R}^n$, can be synthesized as a linear combination of only a few primitives, or atoms, from a matrix $D \in \mathbb{R}^{n \times k}$, termed a dictionary. Formally, sparse approximation can be stated as: $\exists\, a \in \mathbb{R}^k$ such that $x \approx Da$ and $\|a\|_0 \ll n$, where $\|\cdot\|_0$ is the $\ell_0$ norm and the vector $a \in \mathbb{R}^k$ is sparse, i.e., only a few of its entries are non-zero. We typically assume $k > n$, implying that $D$ is redundant with respect to $x$.
Since gradients are sparser than the image itself, a dictionary learned in the gradient domain may yield a sparser representation than one learned from the pixel-domain image [23]. This finding motivates us to learn the dictionary in the gradient domain. The process of gradient-based dictionary learning is shown in Fig. 3. Specifically, we choose 55 reference images from the LIVE 2D IQA dataset [7] and the IEEE Stereo IQA dataset [62] as image samples. For a given training image $I$, its gradient map $I_{GM}$ can be defined by:
$I_{GM} = \sqrt{(I \otimes h_x)^2 + (I \otimes h_y)^2}$ (1)
where $\otimes$ refers to the linear convolution operator and $h_d$, $d \in \{x, y\}$, denotes the Gaussian partial derivative filter applied along the horizontal ($x$) or vertical ($y$) direction:
$h_d(x, y \mid \sigma) = \frac{\partial}{\partial d} g(x, y \mid \sigma) = -\frac{1}{2\pi\sigma^2} \frac{d}{\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$ (2)
where $g(x, y \mid \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$ is the isotropic Gaussian function with scale parameter $\sigma$.
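As a minimal sketch of the gradient map computation described above, the following standard-library Python code samples the Gaussian partial derivative filters and combines the two filter responses into a GM map. The scale `sigma=0.5`, the kernel radius of three standard deviations, and the replicated-border handling are assumptions made for illustration; the text does not specify them, and all function names are hypothetical.

```python
import math

def gaussian_derivative_kernels(sigma=0.5):
    """Sample h_x and h_y, the partial derivatives of an isotropic Gaussian."""
    radius = max(1, int(math.ceil(3 * sigma)))  # assumed truncation radius
    hx, hy = [], []
    for y in range(-radius, radius + 1):
        row_x, row_y = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            row_x.append(-x / sigma ** 2 * g)  # d/dx of g
            row_y.append(-y / sigma ** 2 * g)  # d/dy of g
        hx.append(row_x)
        hy.append(row_y)
    return hx, hy

def convolve2d(img, kernel):
    """Same-size linear convolution; borders are replicated (an assumption)."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    ii = min(max(i - u, 0), h - 1)
                    jj = min(max(j - v, 0), w - 1)
                    acc += kernel[u + r][v + r] * img[ii][jj]
            out[i][j] = acc
    return out

def gradient_magnitude(img, sigma=0.5):
    """GM map: square root of the summed squared filter responses."""
    hx, hy = gaussian_derivative_kernels(sigma)
    gx = convolve2d(img, hx)
    gy = convolve2d(img, hy)
    return [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]

# Toy 8x8 image with a vertical step edge between columns 3 and 4.
img = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
gm = gradient_magnitude(img)
```

On this toy image the GM response is concentrated at the step edge and vanishes in the flat region, which is the sparsity property that motivates learning the dictionary on gradient maps rather than on pixels.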
Given the gradient map $I_{GM}$, a set of $k$ random, possibly overlapping patches, each of dimension $\sqrt{m} \times \sqrt{m}$, is extracted from $I_{GM}$. Then, every patch is converted to a vector of length $m$, and the patch vectors are concatenated to form a matrix $B_x \in \mathbb{R}^{m \times k}$. Furthermore, we learn an overcomplete dictionary $D_x \in \mathbb{R}^{m \times n}$ that has $n$ atoms ($m < n$) using the local patches in $B_x$ as input. Our goal is to learn $D_x$ such that each patch (column) $b_{xj} \in B_x$ can be closely approximated as a linear superposition of a small number of atoms in $D_x$. This is achieved by solving the following sparse optimization problem:
$\min_{D_x, \{a_{xj}\}} \sum_j \|a_{xj}\|_p \quad \text{s.t.} \quad \|b_{xj} - D_x a_{xj}\|_2^2 \le \varepsilon$ (3)
where the vector $a_{xj} \in \mathbb{R}^n$ is the sparse representation of the patch $b_{xj} \in \mathbb{R}^m$. The value of $p$ is typically 0 or 1, and $\varepsilon$ refers to the reconstruction error tolerance set by the user. The patch size on $I_{GM}$ is set to 8 × 8, and the number of primitives in the trained gradient dictionary is set to 256. The classic K-SVD model [18] is used to compute the gradient dictionary $D_x$. To illustrate the benefit of sparse representation in the gradient domain, a visual comparison between traditional pixel-based dictionaries and gradient-based dictionaries is shown in Fig. 4. The learned dictionaries in the pixel domain [...]
For each patch (column) $b_{xj} \in B_x$, the process of calculating its sparse representation vector $a_{xj}$ with respect to the dictionary $D_x$ is called sparse coding, which can be formulated as follows:
$\min_{a_{xj}} \|b_{xj} - D_x a_{xj}\|_2^2 \quad \text{s.t.} \quad \|a_{xj}\|_0 \le L$ (4)
where $L$ is the number of primitives used to represent each patch, i.e., its sparsity level. Although problem (4) is NP-hard in general, it can be approximated by various techniques. In this study, because of the simplicity and effectiveness of the orthogonal matching pursuit (OMP) [63] algorithm, we use it to solve problem (4). According to [20], the EoP can be used to measure the amount of visual information in an image.
In this section, to show that the proposed EGP represents image visual information more sparsely, the EoP/EGP, PSNR, and SSIM comparison curves for the test images Airplane and Lena with respect to image primitives and gradient primitives are shown in Fig. 5. As can be clearly seen from Fig. 5 (a) and (d), the values of EoP and EGP converge almost simultaneously when the number of primitives L is equal to 60. We also observe that the value of EGP is less than the value of EoP. These findings are strong evidence that the sparse representation vector extracted with respect to the gradient dictionary is sparser, which is of great benefit for measuring image visual information. That means, at L = 60, the image reconstructed with respect to the gradient-based dictionary is closer to the original image in visual perception.
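The sparse coding step can be illustrated with a small orthogonal matching pursuit routine. This is a sketch under stated assumptions rather than the implementation used in the experiments: the dictionary below is a hypothetical four-atom toy in three dimensions (not a K-SVD dictionary with 256 atoms), atoms are assumed to have unit norm, and the least-squares update is solved through the normal equations for brevity.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_least_squares(cols, b):
    """Least-squares coefficients of b on the given columns (normal equations)."""
    k = len(cols)
    G = [[dot(cols[i], cols[j]) for j in range(k)] + [dot(cols[i], b)] for i in range(k)]
    for p in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(p, k), key=lambda r: abs(G[r][p]))
        G[p], G[piv] = G[piv], G[p]
        for r in range(p + 1, k):
            f = G[r][p] / G[p][p]
            for c in range(p, k + 1):
                G[r][c] -= f * G[p][c]
    coef = [0.0] * k
    for p in reversed(range(k)):
        s = G[p][k] - sum(G[p][c] * coef[c] for c in range(p + 1, k))
        coef[p] = s / G[p][p]
    return coef

def omp(atoms, b, L, tol=1e-10):
    """Greedy approximation of problem (4): at most L non-zero coefficients."""
    residual, support, coef = list(b), [], []
    for _ in range(L):
        if dot(residual, residual) <= tol:
            break
        # Select the unused atom most correlated with the current residual.
        k = max((i for i in range(len(atoms)) if i not in support),
                key=lambda i: abs(dot(atoms[i], residual)))
        support.append(k)
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        coef = solve_least_squares([atoms[i] for i in support], b)
        residual = [bi - sum(c * atoms[i][t] for c, i in zip(coef, support))
                    for t, bi in enumerate(b)]
    a = [0.0] * len(atoms)
    for c, i in zip(coef, support):
        a[i] = c
    return a

s = 2 ** -0.5
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]  # toy dictionary
b = [3.0, 0.0, -1.0]
a = omp(atoms, b, L=2)
```

Here the signal is an exact combination of two atoms, so two iterations recover it and the remaining coefficients stay zero. In the setting of the text, each vectorized 8 × 8 gradient patch would play the role of `b`.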

B. PERCEPTUAL VISUAL INFORMATION EXPRESSION
During natural vision, the classical and nonclassical receptive fields function together to form a sparse representation of the visual world [64]. In [6], Field et al. stated that the bases or primitives obtained by sparse representation are spatially localized, bandpass, and oriented, which closely matches the characteristics of the receptive fields of simple cells. Additionally, in [23], sparse representation in the gradient domain provides a good solution for image recovery. Therefore, in this study, we hypothesize that visual gradient primitives are a good representation of the basic units of visual perception, analogous to the receptive fields of simple cells in the visual cortex.
Among previous studies, the work of [19] first proposed the concept of EoP to measure the amount of image visual information. In this model, the coefficients of the primitives are considered, giving the $\ell_0$ norm based EoP. In [65], Shi et al. showed that the $\ell_1$ norm based EoP is superior to the $\ell_0$ norm based one in measuring image visual information. Moreover, Wan et al. [22] proposed the entropy of classified primitives (ECP) to measure monocular visual information. In this paper, we further explore the concept of EoP and propose a new concept based on image gradient primitives, namely the entropy of gradient primitives (EGP), which is used for measuring image visual information.
Generally, a stereoscopic image consists of a left view image $I_L$ and a right view image $I_R$. Given a gradient dictionary $D_x$, we can compute the sparse representation matrices $A_{X_L}$ and $A_{X_R}$ for $I_L$ and $I_R$, respectively. Let $d_k$ be the $k$-th visual primitive of $D_x$. The probability density $p_L^k$ of visual primitive $d_k$ for $I_L$ is calculated from the coefficients assigned to $d_k$ in $A_{X_L}$. Then, based on Shannon theory, the EGP of the left view image $I_L$ is calculated as the entropy of this distribution, $-\sum_k p_L^k \log p_L^k$. The probability density of the visual gradient primitives and the EGP for the right view image are calculated by the same process.
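The EGP computation can be sketched as follows. Two choices here are assumptions for illustration, since the corresponding equations are not recoverable from the text: the probability of each primitive is taken as its share of the total absolute coefficient mass (an $\ell_1$-style weighting), and the entropy uses a base-2 logarithm. The function names and the toy coefficient matrix are hypothetical.

```python
import math

def primitive_probabilities(A):
    """A[k][j] is the coefficient of primitive k in patch j.

    Assumption: the probability of primitive k is its share of the total
    absolute coefficient mass over all patches.
    """
    usage = [sum(abs(c) for c in row) for row in A]
    total = sum(usage)
    if total == 0:
        raise ValueError("all coefficients are zero")
    return [u / total for u in usage]

def entropy_of_primitives(p):
    """Shannon entropy in bits; unused primitives (p_k = 0) contribute nothing."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

# Toy sparse representation matrix: 4 primitives, 2 patches.
A_left = [[1.0, 1.0],
          [0.0, 2.0],
          [0.0, 0.0],
          [-4.0, 0.0]]
p_left = primitive_probabilities(A_left)   # [0.25, 0.25, 0.0, 0.5]
egp_left = entropy_of_primitives(p_left)   # 1.5 bits
```

A distribution concentrated on fewer primitives gives a lower entropy, which is consistent with the observation that EGP values are lower than EoP values when the representation is sparser.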
Since the HVS relies on both monocular and binocular cues to obtain effective stereoscopic perception, both cues should be considered simultaneously in SIQA design. To this end, we use the MIGP as the binocular cue. First, we accumulate the coefficients with which $d_k$ is used to reconstruct both the $i$-th patch in the left view and the $j$-th patch in the right view. From this sum, the joint probability density $p^k$ of visual primitive $d_k$ for the left image $I_L$ and the right image $I_R$ is obtained. With the probability density distributions $p_L$, $p_R$, and $p$, the MIGP is defined as the mutual information between the gradient primitives of the two views. Note that $k \notin \Omega$, where $\Omega = \{k \mid p_L^k \times p_R^k = 0\}$. Finally, the PVI of a test stereoscopic image is described by Eq. (11), where $i$ refers to the $i$-th distorted stereopair.

C. STRUCTURAL DEGRADATION DESCRIPTION
In addition to the sparsity properties of the HVS, it is believed that the HVS learns, through evolution and experience over the lifespan, to exploit the statistical structure of natural images when performing visual tasks [66]. The local spatial contrast features of images convey important structural information and are closely related to perceived image quality. In this paper, the local contrast features, namely GM maps and LOG responses, are used to measure the structural degradation of stereopairs caused by distortion. The reasons for this choice are twofold. On one hand, in designing an RRSIQA model it is critical to choose quality-aware features that have low computational requirements. To meet that need, we use quality-aware features from the spatial domain in order to avoid the costly computation introduced by image transforms, i.e., transforming the spatial-domain image to the frequency or wavelet domain to obtain features. On the other hand, bandpass image responses, especially Gaussian derivative responses, can be employed for characterizing all kinds of image semantic structures, including lines, blobs, corners, and edges. These semantic structures are closely related to human perception of image quality. Therefore, following [25], the joint statistics of GM and LOG features are extracted from each view image of the stereopair in order to measure changes in image semantic structures.
To be specific, each view image of the stereopair is decomposed into just two channels, the GM map channel and the LOG response channel. The GM map is computed based on formulas (1) and (2). Meanwhile, the LOG response of each view image $I$ of the stereopair is defined by $L_I = I \otimes h_{LOG}$, where $h_{LOG}(x, y \mid \sigma) = \frac{\partial^2}{\partial x^2} g(x, y \mid \sigma) + \frac{\partial^2}{\partial y^2} g(x, y \mid \sigma)$. Then, the GM and LOG coefficients are normalized to obtain stable statistical image representations. The locally adaptive normalization factor $N_I$ in formulas (12) and (13) is computed at each location $(i, j)$ over a local window $\Omega_{i,j}$ centered at $(i, j)$, where $\omega(l, k)$ are positive weights satisfying $\sum_{l,k} \omega(l, k) = 1$, and $T_I(i, j) = G_I^2(i, j) + L_I^2(i, j)$. The marginal probability functions of $G_I$ and $L_I$, denoted by $P_G$ and $P_L$, respectively, are defined by $P_G(G_I = g_m) = \sum_{n=1}^{N} K_{m,n}$ and $P_L(L_I = l_n) = \sum_{m=1}^{M} K_{m,n}$, where $K_{m,n} = P(G_I = g_m, L_I = l_n)$, $m = 1, \ldots, M$; $n = 1, \ldots, N$, is the joint empirical probability function of $G_I$ and $L_I$, and $m$ and $n$ index the quantization levels of $G_I$ and $L_I$. Because the GM and LOG features are mutually dependent, two further quality-aware features, $Q_G$ and $Q_L$, are defined to measure the dependency between them. As a result, the feature vector SD is obtained by concatenating these four types of features to measure the structural degradation of a test stereoscopic image: $SD = [P_{G_I}, P_{L_I}, Q_{G_I}, Q_{L_I}]$. To illustrate how the distortions of stereoscopic images affect the distribution of the feature vector SD, the joint normalized histograms of SD at different DMOS levels for five distortion types are shown in Fig. 6. Intuitively, the shapes of the joint normalized histograms resemble each other across images with the same type of distortion. This means that the joint normalized histogram behaves in a content-independent manner, and the feature vector SD is a stable and dependable statistical feature for the RRSIQA task. Furthermore, Fig. 6 also demonstrates that the histograms change with the level of distortion: the more serious the distortion, the greater the change in histogram shape. This reveals that the histogram shape is closely related to the distortion level. Consequently, the feature vector SD serves as a good discriminatory feature for measuring the structural degradation of distorted stereopairs.
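The joint statistics can be sketched from quantized GM and LOG values as follows. The marginals follow directly from the joint empirical probability $K_{m,n}$. The form of the dependency measures is an assumption: the code averages the conditional probabilities over the levels of the other feature, which is one common choice, since the equations defining $Q_G$ and $Q_L$ are not recoverable from the text. The function name and the toy inputs are hypothetical.

```python
def joint_statistics(g_levels, l_levels, M, N):
    """Build the SD-style features from per-pixel quantization indices.

    g_levels[i] in 0..M-1 and l_levels[i] in 0..N-1 are the quantized
    (normalized) GM and LOG values at the same pixel i.
    """
    n_px = len(g_levels)
    K = [[0.0] * N for _ in range(M)]            # joint empirical probability
    for g, l in zip(g_levels, l_levels):
        K[g][l] += 1.0 / n_px
    P_G = [sum(K[m]) for m in range(M)]                         # marginal of GM
    P_L = [sum(K[m][n] for m in range(M)) for n in range(N)]    # marginal of LOG
    # Assumed dependency measures: averaged conditional probabilities.
    Q_G = [sum(K[m][n] / P_L[n] for n in range(N) if P_L[n] > 0) / N
           for m in range(M)]
    Q_L = [sum(K[m][n] / P_G[m] for m in range(M) if P_G[m] > 0) / M
           for n in range(N)]
    return P_G, P_L, Q_G, Q_L

# Toy example with 4 pixels and 2 quantization levels per feature.
P_G, P_L, Q_G, Q_L = joint_statistics([0, 0, 1, 1], [0, 1, 1, 1], M=2, N=2)
SD = P_G + P_L + Q_G + Q_L   # concatenated feature vector
```

In practice the number of quantization levels would be larger than in this toy example, and the inputs would come from the normalized GM and LOG maps of each view.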

D. QUALITY PREDICTION
In the quality prediction stage, we assume that the perceptual visual information loss and the joint statistics of structural degradation objectively reflect the quality difference between reference and distorted stereopairs. To be specific, the EGP and MIGP together form the binocular perceptual visual information, serving as the monocular cue and the binocular cue, respectively. The joint statistics of GM and LOG features, SD, represent structural degradation and are complementary to the monocular cue EGP. The differences in PVI and SD between the reference stereopairs and their distorted versions are computed as the loss vector $F$, in which $O$ and $D$ denote the original and distorted stereoscopic images, respectively, and $V \in \{L, R\}$, with $L$ and $R$ referring to the left image and right image of a stereoscopic image, respectively.
To obtain the quality index of a stereoscopic image, the KRR framework is used to build a map from the loss vector $F$ to the perceived image quality. More specifically, given a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^m \times \mathbb{R}$, the classic approach is to minimize the quadratic cost $\sum_{i=1}^{n} (y_i - w^T x_i)^2$. However, when we move to the feature space by substituting $x_i \to \Phi(x_i)$, there is a risk of over-fitting. To avoid this, it is necessary to regularize the problem and select a mapping $C: \mathbb{R}^m \to \mathbb{R}$ that minimizes the following cost function:
$\min_w \sum_{i=1}^{n} \left(y_i - w^T \Phi(x_i)\right)^2 + \lambda \|w\|^2$ (25)
where $\lambda \|w\|^2$ is a regularization term used to stabilize the inverse numerically [65]. In this paper, $x_i$ denotes the loss vector, and $y_i$ is the subjective score of the $i$-th stereoscopic image. In accordance with [67], the solution of Eq. (25) can be written as:
$w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$ (26)
Substituting Eq. (26) into Eq. (25) converts the problem into finding the optimal coefficient vector $\alpha$, which gives Eq. (27). The inner products in the feature space are then expressed through a kernel function, $k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, which is substituted into Eq. (27) to obtain Eq. (28). For simplicity, we arrange Eq. (28) in matrix form:
$\alpha = (K + \lambda I)^{-1} Y$
where $K$ is the reproducing kernel matrix and $Y = [y_1, y_2, \ldots, y_n]^T$. The Gaussian kernel is adopted in this paper. When the KRR algorithm is used to build the map between the loss vector and subjective scores, we need to set the regularization parameter $\lambda$ and the Gaussian kernel parameter $\sigma$. For this purpose, we use a 2D grid search with 200 rounds of cross-validation to find the optimal values of $(\lambda, \sigma)$. As illustrated in Fig. 7, the optimal values of $(\lambda, \sigma)$ are set to (1.0e-04, 0.002), (5.0e-05, 0.015), (3.5e-05, 0.01), and (1.0e-05, 0.002) on the LIVE 3D IQA database Phase I [68], LIVE 3D IQA database Phase II [36], Waterloo IVC 3D database Phase I [69], and Waterloo IVC 3D database Phase II [70], respectively. We use them in the following experiments.
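The closed-form KRR fit can be sketched in a few lines. This is an illustrative sketch, not the trained predictor: the kernel parametrization $\exp(-\|x - z\|^2 / (2\sigma^2))$ is an assumption (the text only states that a Gaussian kernel with parameter $\sigma$ is used), the one-dimensional toy data stand in for loss vectors and subjective scores, and the values of `lam` and `sigma` below are arbitrary rather than the tuned values reported above.

```python
import math

def gaussian_kernel(x, z, sigma):
    """Assumed parametrization: exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for p in range(n):
        piv = max(range(p, n), key=lambda r: abs(M[r][p]))
        M[p], M[piv] = M[piv], M[p]
        for r in range(p + 1, n):
            f = M[r][p] / M[p][p]
            for c in range(p, n + 1):
                M[r][c] -= f * M[p][c]
    x = [0.0] * n
    for p in reversed(range(n)):
        x[p] = (M[p][n] - sum(M[p][c] * x[c] for c in range(p + 1, n))) / M[p][p]
    return x

def krr_fit(X, y, lam, sigma):
    """Dual coefficients alpha = (K + lam * I)^{-1} y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def krr_predict(X_train, alpha, x, sigma):
    """Prediction f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, X_train))

# Toy data: hypothetical 1-D "loss vectors" and scores.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
alpha = krr_fit(X, y, lam=1e-6, sigma=1.0)
preds = [krr_predict(X, alpha, x, sigma=1.0) for x in X]
```

With a very small regularization weight the fit nearly interpolates the training scores; larger values of `lam` trade training accuracy for smoothness, which is why the pair $(\lambda, \sigma)$ is selected by grid search.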

IV. EXPERIMENTAL RESULTS
To verify the performance of the proposed RRSIQA model, we analyze its ability to evaluate symmetrically and asymmetrically distorted stereoscopic images in terms of prediction accuracy, monotonicity, and consistency on four popular SIQA databases. A more detailed description is given in the following sections.
The Waterloo IVC 3D database also contains two phases, in which three distortion types and four distortion levels are provided. Specifically, the Waterloo IVC 3D Phase I [69] contains 6 reference stereoscopic images and 324 distorted versions (72 symmetrically and 252 asymmetrically distorted stereopairs). The Waterloo IVC 3D Phase II [70] contains 10 reference stereopairs and 450 distorted versions (120 symmetrically and 330 asymmetrically distorted stereopairs). Compared with the LIVE 3D IQA database, the Waterloo IVC 3D database contains mixed distortion types and distortion levels in asymmetrically distorted stereoscopic images.
Three popular performance criteria, namely PLCC (Pearson Linear Correlation Coefficient), SRCC (Spearman Rank-order Correlation Coefficient), and RMSE (Root Mean Square Error), are utilized to evaluate quality prediction performance. PLCC and SRCC values closer to 1 and RMSE values closer to 0 indicate better performance. Before calculating these criteria, a nonlinear regression between the objective quality scores and the subjective opinions is performed using a five-parameter logistic function [71], where t and f(t) refer to the predicted objective quality score and the nonlinearly fitted score, and a_i, i = 1, 2, ..., 5, are the regression parameters to be fitted. To assess the correlation performance of the proposed model, each database is divided into two non-overlapping sets: the training set and the test set. To be specific, we first randomly select 80% of the reference stereoscopic images, together with their distorted versions, as the training set and the remaining 20% as the test set. In this way, we ensure that there is no content overlap between the training set and the test set. Next, the proposed metric trained on the training set is examined on the test set, and this procedure is repeated 1000 times. Finally, the median PLCC, SRCC, and RMSE over the 1000 train-test iterations are reported as the final prediction performance.
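The three criteria above can be sketched in a few lines of Python. The logistic mapping below uses the form f(t) = a_1 (1/2 - 1/(1 + exp(a_2 (t - a_3)))) + a_4 t + a_5, which is the five-parameter function commonly used in IQA studies and is assumed here to be the one meant by [71]. The function names and the toy scores are hypothetical, and fitting the parameters a_i (typically done with a nonlinear least-squares routine) is left out.

```python
import math

def logistic5(t, a1, a2, a3, a4, a5):
    """Five-parameter logistic mapping (assumed standard form)."""
    return a1 * (0.5 - 1.0 / (1.0 + math.exp(a2 * (t - a3)))) + a4 * t + a5

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))

def rmse(x, y):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy scores (made up): objective predictions and subjective opinions.
objective = [0.1, 0.4, 0.35, 0.8, 0.9]
mos = [20.0, 38.0, 41.0, 70.0, 85.0]
print(plcc(objective, mos), srcc(objective, mos))
```

In the protocol described above, PLCC and RMSE would be computed between the subjective scores and the logistic-mapped predictions, whereas SRCC depends only on rank order and is unaffected by the monotonic mapping.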

B. OVERALL PERFORMANCE COMPARISON
To investigate the performance of the proposed RRSIQA metric for all distortion types, twelve representative advanced SIQA metrics are selected for comparison. Most of them fall into two categories: one consists of four FR SIQA models, namely Liu's metric [46], Liu's metric [47], Jiang's metric [59], and Ma's metric [45]; the other consists of six RR SIQA models, namely Wang's metric [51], Ma's metric [52], Qi's metric [21], Wan's metric [22], Ma's metric [72], and Ma's metric [73]. Among these models, Liu's metric [46] simulates the binocular behaviors of the HVS; Liu's metric [47] considers the depth and integral color information of the stereoscopic image; Jiang's metric [59] and Sun's metric [61] are based on deep learning; Ma's metric [54] is based on NSS models; Wan's metric [22] and Ma's metric [72] are fusion approaches that jointly consider NSS and HVS models. The metric in [73] is a preliminary version of this study based on the entropy of gradient primitives. It is well known that FR SIQA metrics are expected to perform better than RR/NR SIQA methods because they use the whole reference information of the stereoscopic image. Nevertheless, we include some FR SIQA methods in the comparison to demonstrate the superior performance of the proposed RRSIQA method. In addition, because Shao's metric [43] focuses on asymmetrically distorted stereopairs based on sparse representation, and Sun's metric [61] uses a CNN to learn deeper local quality-aware structures for stereoscopic images, we also include them in the comparison.
The overall performance of the proposed RRSIQA metric on the LIVE 3D IQA Phase I and LIVE 3D IQA Phase II databases is tabulated in Table 1, where the best performing metrics for each database are highlighted in boldface. As can be seen from Table 1, the proposed metric achieves high consistency with human evaluation, especially for asymmetrically distorted stereopairs. To be specific, most of the SIQA models perform relatively well on symmetric distortion but degrade on asymmetric distortion. One likely reason is that these models do not fully consider the statistical characteristics of natural scenes and the perceptual properties of the HVS. For example, Liu's metric [46] only considers the binocular behaviors of the HVS, and Ma's metric [54] extracts NSS-based quality-aware features without taking into account the perceptual properties of the HVS. Interestingly, the statistical characteristics of natural images and the perceptual properties are considered simultaneously in Wan's metric [22] and Ma's metric [72], resulting in improved performance. However, the general drawbacks of these models are that the sparse representation is restricted to the original pixel domain and that some computationally expensive image transformations are adopted. Therefore, all of them achieve only limited performance improvement.
To overcome the shortcomings of the above-mentioned models, this study considers more comprehensively the sparse properties of the HVS and the joint statistics of structural degradation of the stereoscopic image. The experimental results confirm our hypothesis and show that the predictions are more consistent with human opinions. Moreover, the scatter plots of the predicted objective scores against the subjective mean opinion scores (MOS) on the two LIVE 3D IQA databases are shown in Fig. 8. As can be seen from Fig. 8(a) and (b), the proposed RRSIQA model achieves high consistency with the subjective scores. Therefore, the sparse representation in the gradient domain and the joint statistics of image semantic structural degradation are two complementary components for measuring the degradation of stereo image quality. Based on the above analysis, we can conclude that the proposed RRSIQA model can be utilized to quantify and assess the symmetric and asymmetric distortions of stereoscopic images.

C. PERFORMANCE COMPARISON ON INDIVIDUAL DISTORTION TYPES
Because different types of image distortion result in different viewing experiences, it is necessary to show the universality of the proposed model for individual distortion types. Therefore, eleven typical schemes are selected and compared with the proposed method on each individual distortion type. To save space, only the PLCC results are tabulated in Table 2, where the best metrics are highlighted in boldface. As can be seen from Table 2, the proposed metric achieves the best PLCC 4 times on specific distortion types, followed by Wan's metric [22] with 3 times, Ma's metric [45] with 2 times, and Sun's metric [61] with 1 time. In particular, the proposed model achieves an impressive performance for JPEG distortion under both symmetric and asymmetric distortions. The principal reason is that JPEG distortion mainly comes from image blurring caused by high-frequency attenuation, which in turn leads to the degradation of image structure. Moreover, we have also found that the performance of the proposed model is close to the best for all individual distortion types. Therefore, we can conclude that the proposed RRSIQA model is comparable to the best-performing models for individual distortion types across both symmetric and asymmetric distortions.

D. PERFORMANCE COMPARISON ON SYMMETRIC AND ASYMMETRIC DISTORTION TYPES
To further verify the effectiveness of the proposed method for asymmetrically distorted stereoscopic images, we also conduct experiments on the Waterloo IVC 3D Phase I and Waterloo IVC 3D Phase II databases. The PLCC, SRCC, and RMSE results for symmetric and asymmetric distortions are listed in Table 3 and Table 4, respectively, where the best performing metrics for each database are highlighted in bold. From Table 3 and Table 4, it can be observed that the proposed scheme performs better than the other schemes on both symmetrically and asymmetrically distorted stereoscopic images. For instance, the PLCC and SRCC values of the proposed metric are about 0.15 and 0.21 higher than those of [21] on the Waterloo IVC 3D Phase I database. Likewise, the PLCC and SRCC values of the proposed scheme are respectively 0.26 and 0.32 higher than those of [21] on the Waterloo IVC 3D Phase II database. The reason is that sparse representation in the original pixel domain and a single strategy do not provide the best performance in all situations. Based on the above observations, we can conclude that the proposed method is in good agreement with subjective judgments on symmetrically and asymmetrically distorted stereoscopic images.

E. IMPACT OF EACH COMPONENT IN THE PROPOSED SCHEME
Since human perception of image quality is sparse and sensitive to image structure degradation, both visual characteristics should be considered simultaneously when designing an SIQA model. To further understand how combining sparse representation and structural degradation improves the prediction performance of the proposed method, feature analyses and ablation experiments are carried out. Three sets of quality-aware features are used in the proposed scheme: one binocular cue, MIGP, and two monocular cues, EGP and SD. On the LIVE 3D IQA Phase I database and the Waterloo IVC 3D Phase II database, the PLCC and SRCC of the individual feature groups and of their combinations are compared, and the results are shown in Fig. 9. The MIGP represents the binocular visual information and the EGP represents the monocular visual information, both of which are extracted from sparse representation in the gradient domain. Since the sparse representation reflects the sparse characteristics of the HVS, each of them performs well on its own. The SD is the joint statistics of LOG and GM features, which can be used to measure the structural degradation of natural images. The feature combinations considered are EGP+MIGP, EGP+SD, and EGP+MIGP+SD. As can be seen from Fig. 9, the prediction performance of the proposed scheme can be further improved by properly combining the feature sets. Interestingly, the feature EGP achieves better performance than the feature MIGP. The most likely reason is that monocular cues mainly emphasize the characteristics of visual stimuli, while binocular cues emphasize the role of feedback information generated by the coordinated activities of the two eyes. Furthermore, we can also observe that the SD features achieve better performance than EGP, MIGP, and EGP+MIGP. There are two main reasons for this. One is that, when the human eye evaluates image quality, the statistical properties of the HVS may take precedence over its perceptual properties. The other is that the HVS is very sensitive to structural degradation in an image. In addition, Fig. 9(a) shows that the features EGP+SD achieve better performance than the features EGP+MIGP+SD. A logical explanation is that symmetrically distorted stereoscopic images do not trigger binocular rivalry. In general, it can be seen from Fig. 9 that the combination of the three feature groups performs better than each individual feature group, which demonstrates the complementarity and effectiveness of the feature groups.

F. IMPACT OF PROPORTION OF TRAINING SET
To show that the proposed scheme is not highly dependent on the size of the training set, 1000 cross-validation experiments are conducted to test its performance under different proportions of training and testing data. Note that the same KRR parameters are used for each database across all proportions. The results are shown in Fig. 10. From Fig. 10, we can observe that a stable SIQA model can be derived from a small set of stereoscopic images. As expected, the PLCC and SRCC decrease slightly as the proportion of training data is reduced, but the performance remains stable when the training proportion is above 30% on the LIVE 3D IQA Phase I database, 40% on the LIVE 3D IQA Phase II database, and 50% on the Waterloo IVC 3D Phase I and Phase II databases. Therefore, the proposed scheme is largely insensitive to the size of the training set.
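The content-disjoint splitting that underlies this experiment can be sketched as follows. This is only an illustration of the protocol: the function `split_by_reference`, the rounding rule for the number of training references, and the toy layout of 10 references with 5 distorted versions each are assumptions, not details given in the paper.

```python
import random

def split_by_reference(ref_ids, train_ratio, rng):
    """Split sample indices so that training and test sets share no reference content.

    ref_ids[i] is the reference-image identifier of distorted sample i.
    """
    refs = sorted(set(ref_ids))
    rng.shuffle(refs)
    # Keep at least one reference on each side when possible (assumed rule).
    n_train = max(1, min(len(refs) - 1, round(train_ratio * len(refs))))
    train_refs = set(refs[:n_train])
    train = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train, test

# Toy layout (made up): 10 reference stereopairs, 5 distorted versions each.
rng = random.Random(0)
ref_ids = [k // 5 for k in range(50)]
sizes = {}
for ratio in (0.3, 0.5, 0.8):
    train, test = split_by_reference(ref_ids, ratio, rng)
    sizes[ratio] = (len(train), len(test))
print(sizes)
```

Repeating such a split many times for each proportion, training on `train`, evaluating on `test`, and taking the median of the resulting correlations would reproduce the structure of the experiment described above.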

G. CROSS-DATABASE PERFORMANCE PREDICTION
The prediction strategy utilized in Section IV.A-F is inadequate to evaluate the generalization ability and robustness of the proposed metric, because the training subset and the testing subset contain the same distortions selected from the same database. Therefore, we conduct cross-database validation experiments on the LIVE 3D IQA Phase I [68] and LIVE 3D IQA Phase II [36] databases. Note that the KRR parameters are set to (1.0e-04, 0.002) and (5.0e-05, 0.015), respectively. Specifically, the proposed model is trained on one dataset and then tested on the other. The results are shown in Fig. 11. From Fig. 11, we can find that the cross-database prediction performance is better when the proposed model is trained on the LIVE 3D Phase II dataset than when it is trained on the LIVE 3D Phase I dataset. The most likely reason is that the LIVE 3D Phase II dataset includes not only symmetrically distorted stereopairs but also asymmetrically distorted stereopairs, which improves the generalization ability and robustness of the learned model. Moreover, the scatter plots of the cross-database prediction scores against MOS on the two LIVE 3D IQA databases are shown in Fig. 12. As can be seen from Fig. 12(a) and (b), the proposed model achieves high consistency with subjective ratings in these cross-database validation experiments. Therefore, we believe that the proposed model can deliver high generalization ability and robustness.

H. STATISTICAL EVALUATION
To verify whether one metric is statistically superior to another, we conduct a one-sided t-test between the correlation scores generated by the algorithms across the 1000 train-test trials. Note that our analysis here is based on the mean SRCC values across all distortions over the 1000 test sets. Fig. 13 shows the t-test results between any two SIQA methods on the two LIVE 3D IQA databases. A value of "1" indicates that the algorithm (row) is statistically superior to the algorithm (column). A value of "0" indicates statistical equivalence between the row and the column, while a value of "-1" indicates that the algorithm (row) is statistically inferior to the algorithm (column). The results clearly show that the proposed metric is statistically better than almost all existing SIQA schemes, especially for asymmetric distortion of stereoscopic images.
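The 1/0/-1 encoding can be illustrated with the following sketch. The paper does not state the t-test variant or the significance level, so the code assumes Welch's unequal-variance t statistic, a 5% level, and a normal approximation to the critical value (reasonable only for large samples such as 1000 trials). The metric names and SRCC samples are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, variance

def one_sided_compare(a, b, alpha=0.05):
    """Return 1 if a is significantly larger than b, -1 if smaller, else 0.

    Welch's t statistic with a normal approximation to the critical value
    (both are assumptions; suitable for large sample sizes only).
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0.0:
        return 0 if diff == 0 else (1 if diff > 0 else -1)
    t = diff / se
    crit = NormalDist().inv_cdf(1.0 - alpha)
    if t > crit:
        return 1
    if t < -crit:
        return -1
    return 0

# Synthetic SRCC samples over 1000 trials for two hypothetical metrics.
rng = random.Random(0)
scores = {
    "metric_A": [rng.gauss(0.93, 0.01) for _ in range(1000)],
    "metric_B": [rng.gauss(0.90, 0.01) for _ in range(1000)],
}
names = list(scores)
# Rows are compared against columns, as in the significance table.
table = [[one_sided_compare(scores[r], scores[c]) for c in names] for r in names]
print(table)
```

Each entry compares the row metric against the column metric, so the table is antisymmetric with zeros on the diagonal.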

I. COMPUTATIONAL COMPLEXITY ANALYSIS
Computational complexity is another key factor for assessing the feasibility of the proposed metric. Therefore, we compare the computational complexity of all competing methods in the testing stage, measured as the average running time for testing a stereoscopic image pair with a resolution of 1920×2780 from the Waterloo IVC 3D Phase I database. All experiments are implemented in MATLAB R2020a on a server with an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz, 32GB RAM, and an NVIDIA GeForce GTX 1650. The comparison results are shown in Table 5. We can see that the proposed metric is superior to Chen's metric [36] and Ma's metric [72] in running time. The reason is that, for the proposed metric, the testing time complexity is very low once the PVI and SD are calculated. Overall, the proposed metric provides a low-complexity solution to high-performance RRSIQA.

V. CONCLUSION
In this study, we propose an RRSIQA method that combines gradient sparse representation and image semantic structure extraction, both of which operate at the initial stage of stereoscopic vision. The gradient sparse representation is used to extract binocular visual information: the gradient primitive entropy (EGP) of each view image is used as the monocular cue, and the gradient primitive mutual information (MIGP) between the left and right view images is used as the binocular cue. The joint statistics of GM and LOG features are taken into account to measure the structural degradation of each view image of the distorted stereopair, which is complementary to the monocular cue EGP. The novelty of this study is that we jointly consider the sparsity properties of the HVS in the gradient domain and the statistical characteristics of image semantic structure. Different from most of the existing SIQA metrics, the proposed metric has two advantages in practical 3D multimedia applications. The first is that the proposed model performs well without disparity/depth information and traditional NSS-based image transformations, which greatly reduces the complexity of the SIQA algorithm. The second is that the proposed model not only requires very few quality-aware features, but also significantly improves consistency with subjective ratings. We look forward to extending this concept to RR 3D video quality measurement in future work.