Stereoscopic Video Quality Assessment Using Oriented Local Gravitational Force Statistics

We develop a new no-reference (NR) stereoscopic video quality assessment model that adopts oriented local gravitational force (OLGF) statistics in the space-time domain. The OLGF is a novel extension of an existing local gravitational force descriptor and includes two new components: relative local gravitational force magnitude and relative local gravitational force orientation. The resulting algorithm, called Stereoscopic Video Integrity Predictor using OLGF Statistics (SVIPOS), first uses our previous work to generate a cyclopean image and a product image in the spatial domain from a stereoscopic video sequence pair, and calculates a frame difference image in the temporal domain from the left video sequence. Then the OLGF model is deployed to compute various local gravitational force responses on the cyclopean image, the product image and the frame difference image. Finally, a support vector regression is used to map statistical features extracted from the computed gravitational force responses to stereoscopic video quality predictions. SVIPOS is thoroughly tested on three publicly available stereoscopic video quality databases. The experimental results show that SVIPOS outperforms state-of-the-art stereoscopic video quality methods.


I. INTRODUCTION
In recent years, stereoscopic videos have been widely used in movies, games, education and other industries due to their immersive experience and broad application prospects [1], [2]. However, distortions are inevitably generated during stereoscopic video processing [3], and these distortions adversely affect human judgements of video content [4]. Therefore, it remains important to design effective algorithms for measuring stereoscopic video quality [5].
Here we focus on addressing the problem of no-reference (NR) stereoscopic video quality assessment (SVQA). Recently, some researchers have proposed NR SVQA algorithms that deliver good prediction performance. Yang et al. [6] used sparse representation and a spatiotemporal saliency model to train a dictionary, and predicted the quality score of stereoscopic videos using a stacked autoencoder and a support vector machine. Chen et al. [7] proposed an NR SVQA method that combines natural scene statistics with a disparity entropy measurement based on autoregressive prediction. Qi et al. [8] considered a just-noticeable difference model to evaluate the quality of stereoscopic videos. Jiang et al. [9] considered motion information extracted in the tensor decomposition domain to evaluate stereoscopic video quality. Fang et al. [10] designed a 2D-to-3D video quality assessment model, which uses edge and frame difference information in stereoscopic video to weight 2D image quality assessment (IQA)/video quality assessment (VQA) scores. Yang et al. [11] proposed an NR SVQA algorithm using the curvelet transform and an optical flow method to extract spatial and temporal features. However, just using left and right view summation or difference maps to simulate the properties of binocular vision is not sufficient for representing stereoscopic video content. A way to fully consider the fused perception of a stereoscopic scene contributes to video quality prediction. Galkandage et al. [12] considered binocular suppression, used an IQA method to evaluate the quality of each frame, and proposed an optimized pooling method to evaluate stereoscopic video quality. Appina et al. [13] proposed an NR SVQA algorithm using joint statistical modeling of binocular disparity and motion components. Wang et al. 
[14] considered the binocular spatio-temporal internal mechanism and free-energy theory to compute binocular difference maps, and used multi-channel natural scene statistics to evaluate 3D video quality. Although these algorithms have achieved some success, due to the complexity of stereoscopic video content, binocular perception models and effective spatial and temporal feature representations are required to comprehensively integrate the various aspects of stereoscopic videos [15].
The image gradient carries rich information about image structures and edges, and has been widely used as a promising image feature in the field of IQA/VQA [16], [17]. Xue et al. [18] calculated gradient similarity to evaluate image quality. Lu et al. [19] extended gradient descriptors to the spatio-temporal domain and synthesized local spatial and temporal information to obtain the quality of a stereoscopic video. Li et al. [20] generated new local descriptors for evaluating image quality by using local binary patterns to capture image features contained in gradient maps. Yang et al. [21] trained gradient and phase dictionaries to learn the quality-influencing features in stereo images and used a support vector regression to obtain the final quality score. Ji et al. [22] used the gradient map, the intensity map, and orientation-selectivity-based visual patterns to jointly predict image quality. However, these gradient-based methods only use the difference of two adjacent pixels in the horizontal and vertical directions to measure image gradient magnitude and orientation, while ignoring pixel information in other orientations. Very recently, Bhattacharjee et al. [23] proposed a local gravitational force (LGF) descriptor that considers the relationships between all of the surrounding image pixels. Thus this descriptor can better capture local information than conventional gradient methods. There are two components in the descriptor: local gravitational force magnitude, which measures image interest areas (edges and textures), and local gravitational force orientation, which conveys local geometric structure. LGF has been successfully used for image recognition, which makes us believe that the descriptor has good potential and can be further applied to other fields, especially visual quality assessment. 
Inspired by this, we study the principle of this descriptor and further explore the discriminative power of the local gravitational force magnitude and orientation feature components. Here we construct a novel gravitational force based descriptor by considering the masking effect and intensity changes on local structures. A binocular perception model, as an important element of stereo vision, can effectively measure the perceived quality of stereo views. Shao et al. [24] constructed a binocular perception model based on binocular and monocular energy responses to evaluate the quality of stereo images. Since a cyclopean image can reflect the combined stimulus and binocular perception of the left and right views, it is often used to calculate the perceived quality of three-dimensional scenes [15]. Shen et al. [25] proposed an NR 3D IQA method, which uses a new disparity search algorithm to compute a cyclopean image. Liu et al. [26] computed a cyclopean image by utilizing a 3D saliency map to weight the left and right views. Yue et al. [27] used 2D log-Gabor filters to extract energy and homogeneity features from the cyclopean image. In our previous work [28], we proposed a binocular spatial activity model that thoroughly considers the impact of binocular rivalry and suppression. Here, we use this model to synthesize a cyclopean image due to its good performance. A product image [28] is a simple way of measuring the empirical correlation between binocular pixels and is also suitable for representing spatial information in stereoscopic videos. Therefore, the cyclopean image and product image are used to characterize spatial stereoscopic video content.
Video contains rich temporal information, which has a significant impact on visual perception [5]. Several researchers have made good use of temporal information in video quality prediction. Chen et al. [29] proposed a novel Recurrent-In-Recurrent Network, which can express the spatio-temporal distortion in a video more effectively by encoding the motion information. Wu et al. [30] used optical flow to calculate motion trajectories as the temporal information of a video, and then computed the similarity of motion trajectories to evaluate video quality. Huang et al. [31] represented the temporal domain information by comprehensively analyzing the ratio of freezing duration, the entropy of an optical flow map, and the change of light intensity. Zhang et al. [32] utilized 3D-DCT coefficients to extract the spatio-temporal features of distorted videos, and then used a convolutional neural network to obtain a quality score. Although these methods represent the temporal information of videos well, due to their model complexity they generally spend a lot of time characterizing the temporal information. Alternatively, the temporal frame difference, as a simple and efficient feature representation, has been widely used in the VQA field. Saad et al. [33] extracted DCT coefficients of frame differences and the motion vector between frames, and then used a regression model to obtain a video quality score. Soundararajan et al. [34] studied the statistical model of wavelet coefficients between video frames and designed entropic differencing indices to evaluate video quality. Liu et al. [35] proposed a video quality assessment model using space-time slice (STS) mappings, which uses frame differences to characterize the STS temporal representation. These studies have shown that video frame differences follow a certain statistical distribution and can represent the moving edges in a video. 
Here, we extract motion information by computing a frame difference image. Although the above-mentioned cyclopean image, product image and frame difference image have been used in the IQA/VQA field, we collectively compute them to more effectively characterize spatial and temporal stereoscopic content due to their good performance in binocular perception. Further, the proposed gravitational force based descriptor is used to extract quality-predictive information embedded in these feature images.
Here we develop a no-reference stereoscopic video quality assessment model, called SVIPOS. The resulting model first computes a cyclopean image, a product image and a frame difference image from the left and right videos, then uses a novel local gravitational force descriptor to generate gravitational force maps from the computed images. Lastly, statistical features extracted from these gravitational force responses are used to predict quality scores. The contributions of this work are as follows. (1) We propose a novel Oriented Local Gravitational Force descriptor that considers the contributions of the relative information of the local gravitational force magnitude and orientation feature components. This gravitational force based descriptor is deployed, for the first time, in the field of visual quality assessment. (2) We collectively use the cyclopean image, the product image and the frame difference image to represent spatio-temporal information in stereoscopic video. (3) The OLGF statistics are deployed to efficiently represent the computed spatio-temporal information, and are integrated into the proposed algorithm. (4) SVIPOS delivers highly competitive performance relative to state-of-the-art stereoscopic video quality methods.
The rest of this work is organized as follows. Section II describes the proposed OLGF descriptor. In Section III, the proposed SVIPOS model is described in detail. We evaluate the performance of our proposed model in Section IV, and conclude our paper in Section V.

II. ORIENTED LOCAL GRAVITATIONAL FORCE DESCRIPTOR
The local gravitational force descriptor is motivated by the law of universal gravitation, and can better capture local information [36]. Following this descriptor, we propose a novel oriented local gravitational force descriptor that incorporates the advantages of the LGF descriptor and additionally considers the contributions of the relative information of its feature components.

A. LOCAL GRAVITATIONAL FORCE DESCRIPTOR
The law of universal gravitation states that any two objects in nature attract each other with a force that is proportional to the product of their masses and inversely proportional to the square of the distance between them [36]. For two objects having masses m_1 and m_2, the gravitational force is
F = C_1 \frac{m_1 m_2}{r^2},  (1)
where r is the distance between the two objects, and C_1 is the gravitational constant. An image pixel is quite analogous to an object. The horizontal direction component F_x and vertical direction component F_y of the force are then expressed as:
F_x = F \cos\theta,  (2)
F_y = F \sin\theta,  (3)
where \theta is the angle between the two pixels with respect to the horizontal direction. Fig. 1 shows the gravitational force exerted by eight neighboring pixels on the central pixel in a 3 x 3 window. The F_x and F_y of the resultant force acting on the central pixel are computed as:
F_x = \sum_{i=1}^{p} C_1 \frac{I_a I_i}{r_i^2} \cos\theta_{ai},  (4)
F_y = \sum_{i=1}^{p} C_1 \frac{I_a I_i}{r_i^2} \sin\theta_{ai},  (5)
where I_a and I_i are the intensities of the central pixel and the ith adjacent pixel respectively, \theta_{ai} is the angle between these two pixels with respect to the horizontal direction, p is the number of adjacent pixels, and r_i is the Euclidean distance between the two pixels. The force has gravitational force magnitude and orientation components. More specifically, the local gravitational force descriptor is composed of the local gravitational force magnitude (LGFM) and the local gravitational force orientation (LGFO). The gravitational force magnitude exerted on the central pixel is
F_M = \sqrt{F_x^2 + F_y^2},  (6)
and the gravitational force orientation is
F_O = \arctan\left(\frac{F_y}{F_x}\right).  (7)
The LGFM feature measures image interest areas (edges and textures), and the LGFO represents local geometric structure. In the following subsection, we expand on the LGF descriptor and construct a novel gravitational force based descriptor by considering the relative information of its feature components.
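The per-pixel force computation above can be sketched as follows. This is an illustrative implementation under our own assumptions: the constant C_1 is set to 1, image borders are handled with wrap-around shifts for brevity, and all function names are ours, not the paper's.

```python
import numpy as np

# Offsets (dy, dx) of the 8 neighbours in a 3x3 window around the central pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def local_gravitational_force(img):
    """Return per-pixel resultant force components (Fx, Fy), Eqs. (4)-(5) with C1 = 1."""
    img = img.astype(np.float64)
    h, w = img.shape
    fx = np.zeros((h, w))
    fy = np.zeros((h, w))
    for dy, dx in OFFSETS:
        r2 = float(dy * dy + dx * dx)      # squared Euclidean distance r_i^2
        r = np.sqrt(r2)
        cos_t, sin_t = dx / r, dy / r      # angle of the pair w.r.t. the horizontal
        # Intensity of the neighbour at offset (dy, dx), via a wrap-around shift.
        neighbour = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        force = img * neighbour / r2       # I_a * I_i / r_i^2
        fx += force * cos_t
        fy += force * sin_t
    return fx, fy

def lgf_magnitude_orientation(img):
    """LGFM and LGFO maps, Eqs. (6)-(7)."""
    fx, fy = local_gravitational_force(img)
    return np.hypot(fx, fy), np.arctan2(fy, fx)
```

On a constant image the eight neighbour forces cancel by symmetry, so the magnitude map is zero everywhere, which is a quick sanity check of the implementation.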

B. ORIENTED LOCAL GRAVITATIONAL FORCE DESCRIPTOR
Since visual cortical neurons are highly sensitive to local information [37], distortion may change the anisotropy of local areas and affect perceived quality. Accounting for local gravitational force magnitude and orientation responses and their relative information may be a useful way to enhance the performance of the LGF descriptor. The relative gradient orientation and magnitude have been widely used in the field of IQA because they convey information concerning changes in local structures and areas [16]. They have a good ability to capture textural changes and structure shifts arising from distortion. Inspired by this and the advantages of the LGF descriptor [23], we propose a novel gravitational force descriptor, called the Oriented Local Gravitational Force (OLGF), to explore the relative contributions of local gravitational force magnitude and orientation. Specifically, OLGF is composed of LGFM, LGFO, and two novel components: relative local gravitational force magnitude (RLGFM) and relative local gravitational force orientation (RLGFO). The RLGFO feature captures the deviation caused by local structure degradations, and the RLGFM feature measures the intensity changes in local structure. The relative magnitude and orientation in the local neighborhood capture the changes of local video content. Thus, the relative gravitational force magnitude and orientation responses complement the LGF descriptor in a relative manner. Here, we calculate the relative gravitational force orientation F_RO by subtracting the average gravitational force orientation exerted on an image pixel:
F_{RO} = F_O - \hat{F}_O,  (8)
where \hat{F}_O represents the local average orientation and is defined as
\hat{F}_O = \arctan\left(\frac{\hat{F}_y}{\hat{F}_x}\right),  (9)
using the average gravitational force components
\hat{F}_x = \frac{1}{n} \sum_{k=1}^{n} F_x(k),  (10)
\hat{F}_y = \frac{1}{n} \sum_{k=1}^{n} F_y(k),  (11)
where n is the number of pixels in the neighborhood.
To reduce descriptor complexity, the neighborhood size is set at 3 x 3. Likewise, the relative gravitational force magnitude F_RM is defined as
F_{RM} = F_M - \hat{F}_M,  (12)
where \hat{F}_M = \sqrt{\hat{F}_x^2 + \hat{F}_y^2} is the local average magnitude. Fig. 2 shows various gravitational force representations of a video frame from the NAMA3DS1-COSPAD1 database [38]. It can be seen that there is an apparent difference in local structure between the gravitational force orientation and the relative gravitational force orientation maps, and that the gravitational force magnitude and relative gravitational force magnitude differ slightly in edges and textures (see Figs. 2b and d). Although the differences between these four maps are not always apparent, each of them still exhibits distinctive characteristics to a certain extent. Therefore, the combination of these feature components may be suitable for a variety of applications (such as texture recognition). Images and videos contain rich local structures and textures. We believe that our OLGF descriptor, which represents the changes of local structures and degradations, can perform well in VQA. In the following we use the proposed OLGF descriptor to effectively represent spatial and temporal stereoscopic video content.
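The two relative components can be sketched on top of precomputed force maps. This is a minimal illustration under our assumptions: `fx` and `fy` are the per-pixel resultant force components, the 3 x 3 neighborhood averages are taken with a uniform filter, and the local average magnitude is derived from the averaged components.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def olgf_relative(fx, fy, size=3):
    """RLGFM and RLGFO maps from per-pixel force components (our naming)."""
    fo = np.arctan2(fy, fx)                  # gravitational force orientation
    fm = np.hypot(fx, fy)                    # gravitational force magnitude
    fx_avg = uniform_filter(fx, size=size)   # neighbourhood means of Fx, Fy
    fy_avg = uniform_filter(fy, size=size)
    fo_avg = np.arctan2(fy_avg, fx_avg)      # local average orientation
    fm_avg = np.hypot(fx_avg, fy_avg)        # local average magnitude (assumed form)
    return fm - fm_avg, fo - fo_avg          # RLGFM, RLGFO
```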

III. STEREOSCOPIC VIDEO INTEGRITY PREDICTOR
Next we describe the proposed SVIPOS method in detail. Fig. 3 shows the framework of the method. Along a top path that uses the left and right videos as inputs, a cyclopean image is computed using an existing binocular spatial activity model, and a product image is generated by measuring empirical correlation between binocular pixels. Both computed images are used to characterize spatial stereoscopic video content. All four components of the proposed OLGF descriptor are used to extract quality-predictive information embedded in the computed images, yielding eight gravitational force maps. Along the bottom path, we extract temporal information from the left video by computing a frame difference image and then only use the LGFM of the OLGF descriptor to capture the motion information on the computed frame difference image, yielding one gravitational force map. Since all the generated gravitational force responses show obvious statistical characteristics, we extract statistical features from all nine generated responses. Finally, a support vector regression is used to map generated statistical features to stereoscopic video quality predictions.

A. BINOCULAR PERCEPTION MODEL AND SPATIAL INFORMATION
Stereoscopic video is usually composed of left and right channels of video, so it is very important to consider the interaction between two monocular videos in the reconstruction of stereoscopic vision [39]. Generally, synthesizing cyclopean image using binocular perception model is an effective way to predict the perceived quality of stereo video content. Since the amount of spatial activity contained in a stereo pair is closely related to its perceived quality, we here use a binocular perception model [28] based on a measure of spatial activity to simulate the stereoscopic binocular perception response.
The cyclopean image (CI) is computed as
CI(k) = \frac{\varepsilon_{S_L(k)} I_L(k) + \varepsilon_{S_R(k+d)} I_R(k+d)}{\varepsilon_{S_L(k)} + \varepsilon_{S_R(k+d)} + C_2},  (13)
where S_L(k) is a moving 17 x 17 window centered at coordinate k, \varepsilon_{S_L(k)} and \varepsilon_{S_R(k+d)} represent the spatial activity within the neighborhood S_L(k) of the left view I_L(k) and that of the disparity-corrected right view I_R(k+d) respectively, d represents the disparity at k, and C_2 is a constant to ensure stability. The disparity map is generated using an existing stereo matching algorithm [40], in which the SSIM metric is used to choose the best match. The spatial activity \varepsilon(\cdot) is given by:
\varepsilon_{S_L(k)} = \log_2\left(1 + \sigma_L^2(k)\right),  (14)
where \sigma_L^2(k) is the variance of S_L(k). The measure of spatial activity is able to determine whether binocular rivalry or suppression exists between the left and right views, and to further gauge the relative degree of influence of each view when binocular rivalry exists. The product image is an estimate of disparity-compensated correlation. Using it can achieve good performance at low time complexity [28]. We also use this simple product estimate to express the empirical correlation between binocular pixels. The product image is given by:
P(k) = I_L(k) \cdot I_R(k+d).  (15)
The computed cyclopean and product images capture a considerable amount of information. Using the spatial activity model or the bivariate product to generate a feature map has low time complexity, which is useful for stereoscopic video applications with limited computing resources. Fig. 4 shows the left view, the right view, the cyclopean image and the product image.
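The spatial-activity-weighted fusion can be sketched as below. This is a hedged illustration under our assumptions: the disparity map is precomputed, the spatial activity is a log-variance measure, horizontal-only disparity shifts are applied by index lookup, and all names are ours.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_activity(img, size=17):
    """Log-variance spatial activity over a size x size moving window (assumed form)."""
    mean = uniform_filter(img, size=size)
    var = uniform_filter(img * img, size=size) - mean * mean
    return np.log2(1.0 + np.clip(var, 0.0, None))

def cyclopean_and_product(left, right, disparity, C2=1e-3):
    """Cyclopean image (activity-weighted fusion) and product image."""
    h, w = left.shape
    cols = np.arange(w)[None, :] + disparity           # disparity-corrected columns k + d
    cols = np.clip(cols, 0, w - 1).astype(int)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    right_c = right[rows, cols]                        # I_R(k + d)
    eL, eR = spatial_activity(left), spatial_activity(right_c)
    ci = (eL * left + eR * right_c) / (eL + eR + C2)   # weighted cyclopean image
    prod = left * right_c                              # product image
    return ci, prod
```

With zero disparity the product image reduces to the element-wise product of the two views, which provides a simple sanity check.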

B. SPATIO-TEMPORAL GRAVITATIONAL FORCE RESPONSES
As mentioned above, the synthesized cyclopean and product images can effectively simulate the human stereoscopic visual perception system. Therefore, we deploy the binocular perception models on each frame of the stereoscopic videos and use these synthesized images as the spatial representation of the stereoscopic videos.
The temporal information in stereoscopic video also has a great impact on human perception, and considering it contributes to the quality prediction of stereoscopic videos. The frame difference, as a widely used temporal representation, can express the motion information in videos. The neural responses of the visual cortex are approximately separable in the time domain [41]. Thus we also extract a frame difference image in the time domain to represent motion information in stereoscopic videos. The frame difference is given by:
D_t(k) = I_{t+1}(k) - I_t(k),  t = 1, 2, \ldots, T - 1,  (16)
where I_t represents the tth frame, and T is the number of frames in a video. The frame difference and the above-mentioned cyclopean and product images collectively represent spatial and temporal stereoscopic video content. The proposed OLGF descriptor is then used to extract quality-predictive information embedded in these three feature images, yielding gravitational force based feature responses. Specifically, all four components of the OLGF are deployed to capture discriminative features from the cyclopean and product images, while the frame difference serves as complementary quality information and only the local gravitational force magnitude is computed from it in order to reduce computational complexity. The four gravitational force responses to each of the cyclopean and product images and the local gravitational force magnitude response to the frame difference are then used for feature extraction.
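The temporal step above amounts to a one-frame difference over the clip. A minimal sketch, assuming the left view is stacked into a T x H x W array (our variable names):

```python
import numpy as np

def frame_differences(frames):
    """D_t = I_{t+1} - I_t for t = 1..T-1, per Eq. (16); frames has shape (T, H, W)."""
    frames = frames.astype(np.float64)
    return frames[1:] - frames[:-1]
```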

C. STATISTICAL FEATURES
Natural scene statistics (NSS) features have achieved good quality-predictive power. Zhou et al. [42] used NSS features with a multivariate Gaussian model to construct a blind stereoscopic image quality metric. Gu et al. [43] proposed an NR IQA method via multiscale NSS analysis. Here we also extract NSS features from the computed gravitational force based maps. A normalization process [44] is first applied to reduce spatial redundancy on a gravitational force based map F_j as follows:
\hat{F}_j(k) = \frac{F_j(k) - \mu_j(k)}{\sigma_j(k) + C_3},  (17)
where \sigma_j and \mu_j are the local standard deviation and local mean of F_j respectively, and C_3 is a constant to ensure stability. The histograms of the four normalized gravitational force based maps of the cyclopean image computed on an original stereoscopic video and different distorted versions of it from the NAMA3DS1-COSPAD1 database [38] are shown in Fig. 5. We can see that the histograms of the normalized coefficients of the various gravitational force based maps have a slightly asymmetric Gaussian appearance, while the distribution curves of the various distortions show obvious differences. We also show the histograms of a normalized local gravitational force orientation map of the cyclopean image computed on the H.264 and JPEG2000 distorted versions of an original stereoscopic video at different distortion degrees in Fig. 6. The results again show that our gravitational force feature components are highly sensitive to distortions. We use an asymmetric generalized Gaussian distribution (AGGD) [44] to quantify the empirical distribution of the normalized coefficients of a gravitational force based map \hat{F}_j. The AGGD model with zero mean is
f(x; \gamma, \sigma_l^2, \sigma_r^2) = \begin{cases} \frac{\gamma}{(\beta_l + \beta_r)\Gamma(1/\gamma)} \exp\left(-\left(\frac{-x}{\beta_l}\right)^{\gamma}\right), & x < 0 \\ \frac{\gamma}{(\beta_l + \beta_r)\Gamma(1/\gamma)} \exp\left(-\left(\frac{x}{\beta_r}\right)^{\gamma}\right), & x \geq 0 \end{cases}  (18)
where
\beta_l = \sigma_l \sqrt{\frac{\Gamma(1/\gamma)}{\Gamma(3/\gamma)}},  \beta_r = \sigma_r \sqrt{\frac{\Gamma(1/\gamma)}{\Gamma(3/\gamma)}},  (19)
\gamma is the shape parameter, and \sigma_l and \sigma_r are the scale parameters of the left and right sides respectively. 
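The normalization step and the side scales can be sketched as follows. This is an illustrative version under our assumptions: the local mean and deviation are computed with a Gaussian window (the span is our choice), and the side scales are moment-matching estimates rather than necessarily the paper's exact AGGD fit.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_map(F, sigma=7.0 / 6.0, C3=1.0):
    """Divisive normalization of a force map F, per Eq. (17)."""
    mu = gaussian_filter(F, sigma)                    # local mean mu_j
    var = gaussian_filter(F * F, sigma) - mu * mu
    sd = np.sqrt(np.clip(var, 0.0, None))             # local standard deviation sigma_j
    return (F - mu) / (sd + C3)

def aggd_side_scales(x):
    """Moment-based estimates of the left/right scale parameters sigma_l, sigma_r."""
    left, right = x[x < 0], x[x >= 0]
    sl = float(np.sqrt(np.mean(left * left))) if left.size else 0.0
    sr = float(np.sqrt(np.mean(right * right))) if right.size else 0.0
    return sl, sr
```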
We use the best-fit parameters (\eta, \gamma, \sigma_l, \sigma_r) as statistical features, where
\eta = (\beta_r - \beta_l) \frac{\Gamma(2/\gamma)}{\Gamma(1/\gamma)}.  (20)
We also consider the multi-channel characteristics of human vision [45], and extract these statistical features from all gravitational force based maps at two scales, namely the original scale and a coarse scale.

D. QUALITY EVALUATION
In order to predict stereoscopic video quality, it is necessary to select a regression model to generate a quality score.
Here we choose a support vector regression (SVR) with the radial basis function kernel to train regression models, since it achieves good performance on high-dimensional regression problems [46]. The LIBSVM package [47] is utilized to build a mapping from the extracted statistical features to the quality score. First, we form a regression training set by selecting a proportion of all the distorted stereoscopic videos. The regression training set consists of the extracted features and their quality scores. Then, we apply the SVR to it to train a regression model, yielding a finite-dimensional quality evaluation vector h and a bias b. The final quality score Q is given by:
Q = h^T V_f + b,  (21)
where V_f is the stereoscopic video feature vector obtained by average pooling. We summarize the commonly used symbols in Table 1.
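The regression stage can be sketched with scikit-learn's SVR as a stand-in for LIBSVM. This is a toy illustration: the feature dimension, hyperparameters, and random data are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-ins for per-video pooled feature vectors and MOS-like targets.
rng = np.random.default_rng(0)
X_train = rng.random((80, 36))   # e.g. 36 statistical features per video (assumed)
y_train = rng.random(80) * 5.0

# RBF-kernel SVR maps feature vectors to quality scores.
model = SVR(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
scores = model.predict(rng.random((20, 36)))
```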

IV. EXPERIMENTAL RESULTS
Our proposed model was validated on three publicly available stereoscopic video quality databases: the NAMA3DS1-COSPAD1 database [38], the SVQA database [8] and the Waterloo IVC Phase I database [48]. The NAMA3DS1-COSPAD1 database consists of 10 reference stereoscopic video sequences and 100 symmetrically distorted video sequences at a frame rate of 25 frames per second (fps). Each database was randomly divided into 80% training and 20% testing subsets in our experiments. We repeated the training-testing procedure 1000 times and utilized four standard criteria to evaluate model performance: the Pearson linear correlation coefficient (PLCC), the Spearman rank-order correlation coefficient (SRCC), the Kendall rank-order correlation coefficient (KRCC) and the root mean squared error (RMSE). Higher SRCC, KRCC and PLCC values and lower RMSE values indicate better consistency between human judgements and prediction scores. Several widely used 2D and 3D IQA/VQA algorithms were selected for comparison, including the 2D IQA algorithms PSNR and SSIM [49], the 3D IQA algorithm SINQ [28], the 2D VQA algorithms ViS3 [51] and SpEED [50], and the 3D VQA algorithms SJND [8], MNSVQM [9], Modi3d [52], Wang's model [14] and BSVQE [7]. The IQA methods were applied to each video frame and mean pooling was used to generate a final score, while the VQA methods were run using the settings reported in the original papers. The software release of each of the selected algorithms except SJND [8], MNSVQM [9], Modi3d [52] and Wang's model [14] is publicly available.
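The four criteria can be computed directly with SciPy; the arrays below are toy stand-ins for MOS values and model predictions, not data from the databases.

```python
import numpy as np
from scipy import stats

# Toy MOS values and predicted scores for one test split (illustrative only).
mos = np.array([3.1, 4.2, 2.5, 3.8, 1.9, 4.6])
pred = np.array([3.0, 4.0, 2.8, 3.5, 2.1, 4.4])

plcc = stats.pearsonr(mos, pred)[0]                    # linear correlation
srcc = stats.spearmanr(mos, pred)[0]                   # rank-order correlation
krcc = stats.kendalltau(mos, pred)[0]                  # rank-order concordance
rmse = float(np.sqrt(np.mean((mos - pred) ** 2)))      # prediction error
```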

A. PERFORMANCE ON THREE DATABASES
We tested the performance of SVIPOS against several widely used IQA/VQA algorithms, except SJND [8], MNSVQM [9], Modi3d [52] and Wang's model [14], on the NAMA3DS1-COSPAD1, SVQA, and Waterloo IVC Phase I databases, as the codes for these four methods are not publicly available. Accordingly, we list their results as reported in the original papers in the tables below. The median PLCC, SRCC, KRCC and RMSE values across 1000 random runs are tabulated in Table 2. The top-performing algorithm is highlighted in bold type. It can be seen that our proposed SVIPOS performed better than all the other compared methods on the NAMA3DS1 and SVQA databases, and that it delivered better performance than all the other compared algorithms except the BSVQE model on the Waterloo IVC Phase I database. Compared with the BSVQE algorithm, SVIPOS still achieved competitive performance on the Waterloo IVC Phase I database. This may be because the cyclopean and product images computed on the left and right videos are more effective for binocular perception than a simple summation or difference of the left and right videos, and because our method uses the AGGD model when computing NSS features, which may be only partially suitable for representing information in synthetic stereoscopic videos of unnatural scenes [52] from the Waterloo IVC Phase I database. These results show that SVIPOS performed best in 10 of the 12 performance comparisons. We also show the scatter plots between the MOS (mean opinion score) values and the predicted scores of SVIPOS on the three databases in Fig. 7. Clearly, the prediction scores of our proposed method match well with human subjective judgements. The results shown in Table 2 and Fig. 7 indicate that SVIPOS is able to effectively measure the perceptual quality of stereoscopic video sequences.

B. PERFORMANCE ON EACH DISTORTION TYPE
Since the NAMA3DS1 and SVQA databases contain different distortion types, we conducted an experiment on each distortion type from these two databases. The SRCC values are listed in Table 3. The two top-performing metrics are highlighted in bold type. The ''Downsampling and Sharpening'' category combines three distortions (reduction of resolution, image sharpening, and downsampling with sharpening) because each of these distortions contains only 10 samples, and such a small amount of data would lead to unstable results.
From Table 3, we can see that SVIPOS delivered competitive performance on each distortion type from the two databases in most cases, and that other 3D VQA algorithms also achieved good results on each distortion. In particular, the 2D VQA methods ViS3 and SpEED achieved good performance on several distortion types from the NAMA3DS1 database, while they performed significantly worse on each distortion type from the SVQA database. Since the NAMA3DS1 database contains only symmetric distortions, while the SVQA database contains both symmetric and asymmetric distortions, these results may indicate that our proposed method is also able to effectively measure the perceptual severity of both asymmetric and symmetric distortions in stereoscopic video sequences.

C. STATISTICAL SIGNIFICANCE
To verify the statistical significance of the performance differences among all the compared models except SJND [8], MNSVQM [9], Modi3d [52] and Wang's model [14], we implemented statistical tests [53] on the NAMA3DS1-COSPAD1, SVQA and Waterloo IVC Phase I databases. Fig. 8 shows the box plots of the SRCC values of the models across 1000 runs on the three databases. It can be seen that SVIPOS achieves the best accuracy and stability among all methods. We also utilized the t-test [54] to measure the significance of the differences in SRCC among all methods. The null hypothesis is that the mean correlation value of the method in the row is equal to that of the method in the column at the 95% confidence level. We show the results of the t-test on the three databases in Fig. 9, where the indicator ''1'' or ''−1'' means that the method in a row is statistically better, or statistically worse, than the method in a column, while the indicator ''0'' denotes no statistical difference between them. The results show that SVIPOS is stable and statistically superior to all other methods.

D. CONTRIBUTIONS OF DIFFERENT FEATURE IMAGES
To verify the contribution of each feature image, we also tested the performance of the different feature images on the NAMA3DS1-COSPAD1 database. The median SRCC, PLCC, KRCC and RMSE values are listed in Table 4. We found that although the features extracted from the frame difference performed slightly worse than those from the cyclopean image or the product image, they still yielded a modest improvement in performance. These results indicate that the cyclopean image and product image may be complementary to each other, and that both of them effectively represent spatial information in stereoscopic video content. The best performance was achieved using the features from all three parts. Clearly, all three sources are necessary and contribute to the performance of our method.

E. VARIATION WITH WINDOW SIZE
To study the impact of the window size of the OLGF descriptor on algorithm performance, we tested SVIPOS with different window sizes on the NAMA3DS1 database. The results are listed in Table 5.
The results show that the performance of SVIPOS tends to decrease as the window size increases. This may be because more pixels lie farther from the center pixel and carry smaller weights, so the computations over larger windows tend to become non-local.
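This non-locality argument can be made concrete with a small sketch. We assume, purely for illustration, that the weight of a pixel at distance d from the window centre decays as 1/d² (a gravity-like law, not necessarily the exact OLGF weighting), and measure what fraction of the total window weight comes from pixels outside the central 3 × 3 region:

```python
import numpy as np

def peripheral_weight_fraction(win, inner=3):
    """Fraction of the window's total weight contributed by pixels
    outside the central `inner` x `inner` region, assuming an
    inverse-square weight 1/d^2 with distance d from the centre
    pixel (an illustrative assumption)."""
    r = win // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    d2 = (ys**2 + xs**2).astype(float)
    w = np.zeros_like(d2)
    w[d2 > 0] = 1.0 / d2[d2 > 0]        # centre pixel excluded
    outer = (np.abs(ys) > inner // 2) | (np.abs(xs) > inner // 2)
    return w[outer].sum() / w.sum()
```

The fraction grows monotonically with the window size: an ever larger share of the response is driven by weakly weighted distant pixels, which is consistent with the performance drop in Table 5.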

F. ADVANTAGES OF ORIENTED LOCAL GRAVITATIONAL FORCE DESCRIPTOR
To further analyze the advantages of the proposed OLGF descriptor, we tested the performance of different gravitational force feature components on the NAMA3DS1 database. For simplicity, we divided the feature components into three groups and only tested the performance of each group. The results are tabulated in Table 6. We can see that the relative local gravitational force magnitude and orientation responses achieved better performance than the local gravitational force magnitude and orientation (i.e., the LGF descriptor). The best performance was delivered by combining all four components. These results confirm the efficacy of our proposed OLGF descriptor.

G. DATABASE INDEPENDENCE
We tested the database independence of SVIPOS on the NAMA3DS1-COSPAD1, SVQA, and Waterloo IVC Phase I databases. Two available models, SINQ and BSVQE, were also tested. Table 7 tabulates the median SRCC values of the three algorithms. The results of SVIPOS were highly competitive on all three databases.

H. COMPUTATIONAL COMPLEXITY
We also compared the computational complexity of several algorithms on a stereoscopic video from the NAMA3DS1 database with a resolution of 1920 × 1080 pixels, a frame rate of 25 fps, and a sixteen-second duration (i.e., the input stereoscopic video contains V = 1920 × 1080 × 400 pixels). Let g denote the window size used in feature computation. For example, g in ViS3 is the size of the convolution matrix, in BSVQE it is the span of the Gaussian operator, and in SVIPOS it is the neighborhood size. The results are shown in Table 8. We can see that the 2D methods PSNR, SSIM, SpEED, and ViS3 are faster than the 3D methods SINQ, BSVQE, and SVIPOS, because the 2D algorithms only consider spatial factors and do not account for binocular perception mechanisms such as binocular disparity, which consume a large share of the run time in 3D IQA/VQA methods. SINQ is a 3D IQA method that does not process temporal information, so it is faster than the SVQA methods SVIPOS and BSVQE. Although our proposed SVIPOS is slower than the compared 2D methods and the 3D IQA method SINQ, it is faster than the SVQA method BSVQE by a large margin.
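The role of the window size g can be summarized with a rough cost model. Assuming, for illustration only, that a sliding-window descriptor visits a g × g neighbourhood at every pixel, the feature-computation cost scales as O(V · g²) (this is our assumed model, not the paper's measured run times):

```python
def window_feature_ops(width, height, frames, g):
    """Rough O(V * g^2) operation count for a sliding-window
    descriptor: each of the V = width * height * frames pixels
    visits a g x g neighbourhood. An illustrative cost model only,
    not a measured run time."""
    return width * height * frames * g * g

# The 1080p, 25 fps, 16 s clip from the complexity experiment,
# with an assumed 3 x 3 neighbourhood.
ops = window_feature_ops(1920, 1080, 400, 3)
```

Under this model, doubling g quadruples the feature-computation cost, which is why methods with small windows sit at the fast end of Table 8.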

V. CONCLUSION
We have developed a new NR SVQA model built on a novel oriented local gravitational force descriptor. The proposed OLGF effectively represents discriminative information in stereoscopic video content, and accounts for local gravitational force magnitude and orientation as well as their relative counterparts. Deploying its two novel feature components further boosts performance. Compared with other methods that only use the summation or difference of the left and right views, we use the cyclopean and product images to measure correlations between the left and right videos more effectively. Experimental results on three stereoscopic video databases show that SVIPOS delivers outstanding performance. We believe that the relative local gravitational force magnitude and orientation features, together with a more effective motion model, merit further research. In the future, we will incorporate local entropy to build a general motion model for further performance improvement, and apply the proposed OLGF to other image and video processing applications, especially blind image quality prediction.
YONGMEI ZHANG received the B.S. degree from Peking University in 1990 and the M.Sc. and Ph.D. degrees from the North University of China. She was a Postdoctoral Researcher at Beihang University in 2008. In 2012, she was a Visiting Scholar at Peking University. She is currently a Professor with the North China University of Technology. Her research interests include artificial intelligence, image processing, and so on. She is a Senior Member of CCF.
QINGBING SANG received the B.S. degree in computer science from the China University of Geosciences, Wuhan, China, in 1996, and the M.S. and Ph.D. degrees in pattern recognition from Jiangnan University, Wuxi, China, in 2005 and 2013, respectively. He was a Visiting Scholar with the LIVE Laboratory, The University of Texas at Austin, from August 2010 to August 2011. He is currently an Associate Professor with the School of Artificial Intelligence and Computer, Jiangnan University. His research interests include image processing, computer vision, and machine learning. He has published more than 30 technical articles in these areas.