No-Reference Light Field Image Quality Assessment Exploiting Saliency

In the near future, the broadcasting scenario will be characterized by immersive content. One of the systems for capturing the 3D content of a scene is light field imaging. The huge amount of data and the specific transmission scenario impose strong constraints on services and applications. Among others, the evaluation of the quality of the received media cannot rely on the original signal but should be based only on the received data. In this direction, we propose a no-reference quality metric for light field images which is based on spatial and angular characteristics. In more detail, the estimated saliency and cyclopean maps of light field images are exploited to extract the spatial features. The angular consistency features are, instead, measured with the use of the Global Luminance Distribution knowledge and the Weighted Local Binary Patterns operator on Epipolar Plane Images. The effectiveness of the proposed metric is assessed by comparing its performance with state-of-the-art quality metrics using 4 datasets: SMART, Win5-LID, VALID 10-bit, and VALID 8-bit. Furthermore, the performance is analyzed in cross-dataset tests, with different distortions, and for different saliency maps. The achieved results show that the proposed model outperforms state-of-the-art approaches and performs well for different distortion types and with various saliency models.

Virtual Reality (VR), Augmented Reality (AR), and eXtended Reality (XR), a new transmission architecture must be devised, according to CISCO white papers [2]. For this reason, the research community has started investigating the Future Generation Wireless Network (FGWN), i.e., 6G, to satisfy new consumer demands. Its launch is scheduled for 2030 and it will allow the broadcasting of new immersive multimedia such as Light Field Images (LFIs).
Light field imaging has been proposed as a system for capturing three-dimensional (3D) scenes for next-generation broadcast multimedia services [1]. As highlighted in [3], this technology is still in its early stages, and there are several challenges to cope with. One of the main problems is defining a strategy for efficiently storing and sharing this type of content. In fact, a single LFI can be several gigabytes in size, making it difficult to handle. Therefore, new models are needed for data preprocessing, compression, content editing, rendering, and display. These operations can be perceptually relevant, thus impacting the subjective quality score [4]. In particular, the transmission of LFIs through noisy telecommunication channels can degrade their quality. Hence, Light Field Quality Metrics (LFQMs) are required for evaluating the impact of each transmission step.
In general, to assess the perceptual LFI quality, subjective experiments and objective quality assessment metrics are adopted. Subjective assessment is reliable, but it is a costly and time-consuming process. On the other hand, objective metrics are complex to design and often only partially match the subjective judgment. Full-reference [5], [6], [7] and reduced-reference [8], [9] metrics generally use all or part of the spatial and angular features of the original LFI. However, this information may be difficult to obtain, especially in a broadcasting scenario, thus motivating the development of no-reference metrics. In the literature, few efforts [10], [11], [12], [13], [14], [15], [16] have been devoted to devising no-reference metrics.
In this work, a novel no-reference metric exploiting spatial and angular characteristics of the light field is presented. Unlike state-of-the-art approaches, the proposed method combines the information from the saliency and cyclopean maps without the use of learning-based convolutional filters, thus providing an explainable and robust feature extractor for light field quality estimation. The joint use of the information extracted from these maps can improve the performance of a quality model inspired by the human visual system. In [17], the authors successfully exploited these models to evaluate the quality of stereo images. This framework has been adopted since light field images can be regarded as a set of stereo pairs.
The major contributions of the article are:
• Analysis of the impact of distortions on the estimated saliency map: distortions have been studied to analyze their impact on the saliency map. The obtained results provide the rationale for using the saliency map as a spatial feature [18];
• Design of a new angular feature: angular irregularities of the light field impair the judgement of the human eyes on the overall quality perception [19]. For this reason, the feature set Global Luminance Distribution Pattern (GLDP) has been proposed;
• Definition of a new no-reference light field quality metric: a new metric which exploits both spatial and angular features of light field images has been designed. The performance of the proposed model has been analyzed on different datasets, cross-datasets, distortion types, and saliency maps.
The rest of the paper is organized as follows: the related works are presented in Section II. The learning-based quality assessment model is presented in Section III while, in Section IV, the objective experiment process is reported. Results analysis and discussions are carried out in Section V. Finally, in Section VI, the conclusions are drawn.

II. RELATED WORK
Based on the availability of original or reference information, an objective quality metric for LFIs can be classified as full-reference, reduced-reference, or no-reference. Full-reference metrics exploit the complete original data and measure the similarity between the reference and the distorted image. Reduced-reference models utilize only partial information about the reference image, while no-reference models evaluate image quality without any information about the original data. Several efforts have been made in recent years to develop such LFI quality metrics.

A. Full and Reduced-Reference Quality Metrics
As a first attempt, 2D image quality metrics were used for LFI quality assessment [20], [21], [22], [23], [24]. These methods usually average the scores obtained by applying the 2D full-reference metrics to single views or sub-aperture images of the LFI. Recently, specifically designed LFI full-reference quality metrics have been proposed [5], [6], [7]. A Log-Gabor-based model is proposed in [5] where the saliency features are extracted from the reference and distorted LFIs, employing the multi-scale and single-scale Gabor wavelet. A full-reference approach is proposed in [7] based on images obtained from the focus stacking of LFIs. Regarding reduced-reference approaches, in [8] a metric based on depth map estimation of LFIs is proposed by exploiting a multi-resolution approach. Then, the Structural Similarity Index Measurement (SSIM) metric is used to analyze the distortion level of depth maps. Similarly, in [25], depth information considering multiple views is used to predict the subjective quality.
However, these techniques have limited use in a broadcasting scenario because of the required reference data.

B. No-Reference Quality Metrics
A no-reference model is presented in [12] by combining the 2D and 3D characteristics of the LFI with a Support Vector Regressor (SVR). The hue, saturation, and value components of each Sub Aperture Image (SAI) are considered as 2D features. The 3D features are obtained using sparse depth maps of horizontal and vertical Epipolar Plane Images (EPIs). Finally, the 2D and 3D features are concatenated and given as input to the SVR. However, limited-size datasets are used for performance analysis, making the model prone to overfitting.
Shi et al. [10] proposed a metric that evaluates spatial and angular degradations in an LFI. The spatial degradation is measured by capturing the naturalness distribution of the cyclopean map. The angular consistency, instead, is estimated by applying the Weighted Local Binary Pattern (WLBP) operator on EPIs.
A tensor-based quality metric is proposed in [11] where the light field is regarded as a low-rank 4D tensor. The principal components of four oriented sub-aperture view stacks are obtained via Tucker decomposition. Then, the spatial quality of the LFI is measured by considering the global naturalness and local frequency properties. In the final step, the tensor angular variation index is proposed to measure the angular consistency quality by analyzing the structural similarity distribution between the first principal component and each view in the stack.
The Visualization-based Blind Light Field Image (VBLFI) model, proposed by Xiang et al. [16], exploits LFI visualization features. In more detail, the approach employs the Mean Difference Images, which are obtained by applying partial derivatives to the LFI, highlighting the depth and structural information. In [13], an extended version of [16] is proposed where SAIs of the distorted LFI are the input data. However, only two datasets are used to evaluate the model performance, without any distortions.
In [15], a Convolutional Neural Network (CNN)-based metric is proposed. Its two novelties are the use of discriminative EPI patches for the training of a CNN and multi-task learning. Qu et al. proposed ALAS-DADS [14], a quality metric that exploits a CNN to extract both spatial and angular features. The neural network is composed of 3 modules. The first module extracts the spatial features from a light field by exploiting separable convolutions. Similarly, the second module captures angular consistencies. Finally, the third module fuses both spatial and angular features to predict the quality score. However, the experimental results are obtained on only two state-of-the-art datasets.
To the best of our knowledge, previous works did not take into account compression-based distortions for evaluating the performance of their models. Moreover, from the analysis of related works, we highlight the lack of a no-reference model that is validated on multiple LFI datasets. Therefore, a no-reference LFI quality metric has been designed and validated on multiple datasets. The proposed model is fully explainable and of lower computational complexity with respect to learning-based methods. Moreover, the adoption of the estimated saliency map improves the effectiveness of the prediction of the subjective quality score. In addition, the proposed approach exploits the Global Luminance Distribution knowledge in the design of the feature set GLDP. To the best of our knowledge, this is the first attempt in the literature. Finally, this work demonstrates that the combination of saliency- and cyclopean-based features can lead to state-of-the-art performance without the use of learned convolutional filters.

III. PROPOSED MODEL
This section describes the proposed light field quality metric, depicted in Figure 1. It is composed of two parallel units used for extracting spatial and angular LFI characteristics. These properties are relevant for the subjective quality assessment of light field images [26]. Finally, the obtained features are concatenated and fed to an SVR to predict the overall quality score. We denote by (u, v, s, t) the coordinate vector of an LFI. Moreover, let (U, V, S, T) be the maximum values of u, v, s, and t, respectively. Here, u and v are the angular coordinates along the horizontal and vertical directions of an LFI, while t and s are the spatial coordinates along the vertical and horizontal directions of the SAI. The horizontal EPIs are obtained by fixing the vertical coordinates v and t. Similarly, the vertical EPIs are extracted by fixing the horizontal coordinates u and s.
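The EPI extraction described above can be illustrated with a minimal NumPy sketch. The array layout `lf[u, v, s, t]`, the toy dimensions, and the function names are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def horizontal_epi(lf, v, t):
    """Horizontal EPI: fix the vertical coordinates (v, t); result has shape (U, S)."""
    return lf[:, v, :, t]

def vertical_epi(lf, u, s):
    """Vertical EPI: fix the horizontal coordinates (u, s); result has shape (V, T)."""
    return lf[u, :, s, :]

# Toy grayscale light field, indexed as lf[u, v, s, t] (assumed layout).
U, V, S, T = 5, 5, 32, 24
lf = np.random.default_rng(0).random((U, V, S, T))

# One horizontal EPI per (v, t) pair and one vertical EPI per (u, s) pair.
h_epis = [horizontal_epi(lf, v, t) for v in range(V) for t in range(T)]
v_epis = [vertical_epi(lf, u, s) for u in range(U) for s in range(S)]
```

Under this layout there are V · T horizontal EPIs of size U × S and U · S vertical EPIs of size V × T.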

A. Spatial Features
In broadcasting applications, each processing step, such as lossy compression, transmission, and rendering [10], can affect the spatial quality of the LFI. Following [26], in the proposed metric spatial LFI characteristics are considered by analyzing both saliency and cyclopean maps.
1) Saliency Map: The saliency map conveys information on perceptually important areas of the input image [27]. In the proposed metric, to identify salient regions, we exploit the Itti model [28] applied to the Extended Depth Of Field (EDOF) image, that is, a light field representation in which all objects are in focus. In more detail, an EDOF image is obtained by performing a multi-focus fusion and refocusing the LFI at multiple depth planes, followed by a wavelet decomposition-based stacking [29].
Let $S_a \in \mathbb{R}^{T \times S}$ be the saliency map of the EDOF image, where T and S are the height and width in pixels, respectively. From the distribution of $S_a$, we compute 8 statistical parameters [30], [31], [32], [33], one of which is the entropy (w). The significance of each statistical parameter is detailed in Table I.
Then, the feature vector of the saliency map, $f_{S_a} \in \mathbb{R}^{8}$, is defined as the concatenation of these 8 parameters, with $\oplus$ denoting the concatenation operator.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
2) Cyclopean Map: The cyclopean map estimates the disparity from each SAI pair, evaluating angular inconsistencies [10], [34], [35]. In more detail, the cyclopean map makes it possible to measure the naturalness, i.e., the impact of spatial alterations, by comparing adjacent sub-apertures. This phenomenon is analyzed by computing the Mean Subtracted Contrast Normalized (MSCN) coefficients [36]. In fact, when no distortion is applied, the distribution of the pixels belonging to the MSCN coefficients map shows a Gaussian trend. An example of this behavior is shown in Figure 2, where the pixel distribution of the cyclopean map and of the corresponding MSCN map are depicted. We can see that the MSCN distribution changes when a distortion is applied to an LFI [36]. To compute the MSCN coefficients, the binocular fusion and rivalry features between adjacent SAIs in both directions are mimicked to obtain the sub-cyclopean map [10], [37], [38]. In the practical realization, stereo disparity estimation [39] and the spatial activity map [10] are applied. Then, all the SAI sub-cyclopean maps are superimposed, i.e., averaged, to extract the MSCN coefficients.
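As an illustration of the MSCN step, the sketch below follows the commonly used formulation in which each pixel is normalized by a Gaussian-weighted local mean and standard deviation. The window size, the Gaussian width, and the stabilizing constant `c` are assumed values, and the input stands in for an already computed cyclopean map; none of these settings are confirmed by the text.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=7.0 / 6.0):
    """1D Gaussian window normalized to unit sum (size and sigma are assumptions)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Separable filtering with edge replication; output has the input shape."""
    pad = len(kernel) // 2
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def mscn(img, c=1.0):
    """MSCN coefficients: (I - local mean) / (local std + c)."""
    img = np.asarray(img, dtype=float)
    k = gaussian_kernel()
    mu = blur(img, k)
    var = blur(img * img, k) - mu * mu
    sigma = np.sqrt(np.maximum(var, 0.0))  # clip tiny negative values
    return (img - mu) / (sigma + c)

# Stand-in for a cyclopean map with values in [0, 255].
cyclopean = np.random.default_rng(0).random((32, 32)) * 255.0
coeffs = mscn(cyclopean)
```

The constant `c = 1.0` assumes an intensity range of roughly 0 to 255; a different range would call for a rescaled constant.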
To analyze the Gaussian trend of the MSCN coefficients, a zero-mean Asymmetric Generalized Gaussian Distribution (AGGD) fitting procedure is utilized [40], [41]. The algorithm focuses on spatially asymmetric distributions of pixels in the MSCN map [42], [43], providing four statistical metrics. More precisely, 2 shape parameters ($\alpha$, $\eta$) and 2 scale coefficients along the left and right directions ($\sigma_{le}$, $\sigma_{ri}$) are estimated. By doing so, a parametrized approximation of the distribution of the MSCN coefficients represents the angular characteristics of the LFI.
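A common way to estimate AGGD parameters is moment matching, as popularized by blind 2D image quality metrics. The sketch below follows that approach as an assumption about how the fit could be performed. In particular, `eta` is computed here as the mean term of the AGGD, which is only one possible reading of the symbol $\eta$ used in the text, and the search grid for `alpha` is arbitrary.

```python
import math
import numpy as np

def fit_aggd(x):
    """Moment-matching AGGD fit; returns (alpha, eta, sigma_left, sigma_right)."""
    x = np.asarray(x, dtype=float).ravel()
    left, right = x[x < 0], x[x >= 0]      # assumes both sides are non-empty
    sigma_l = math.sqrt(np.mean(left ** 2))
    sigma_r = math.sqrt(np.mean(right ** 2))
    gamma_hat = sigma_l / sigma_r
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    big_r = r_hat * (gamma_hat ** 3 + 1) * (gamma_hat + 1) / (gamma_hat ** 2 + 1) ** 2

    # Grid search for the shape alpha such that rho(alpha) matches big_r, with
    # rho(a) = Gamma(2/a)^2 / (Gamma(1/a) * Gamma(3/a)).
    alphas = np.arange(0.2, 10.0, 0.001)
    log_rho = np.array([2 * math.lgamma(2 / a) - math.lgamma(1 / a) - math.lgamma(3 / a)
                        for a in alphas])
    alpha = float(alphas[np.argmin((np.exp(log_rho) - big_r) ** 2)])

    # Mean term of the fitted distribution (assumed meaning of eta).
    scale = math.exp(0.5 * (math.lgamma(1 / alpha) - math.lgamma(3 / alpha)))
    beta_l, beta_r = sigma_l * scale, sigma_r * scale
    eta = (beta_r - beta_l) * math.exp(math.lgamma(2 / alpha) - math.lgamma(1 / alpha))
    return alpha, eta, sigma_l, sigma_r

# Sanity check on synthetic data: Gaussian samples should give alpha close to 2.
samples = np.random.default_rng(0).standard_normal(200_000)
alpha, eta, sigma_l, sigma_r = fit_aggd(samples)
```

For undistorted content, whose MSCN coefficients follow a roughly Gaussian trend, the fitted shape would be expected near 2 with similar left and right scales; distortions shift these values, which is what makes them usable as features.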
Finally, the cyclopean-based features $f_{C_y} \in \mathbb{R}^{12}$ are computed from $C_y$ and $C_{y_2}$, where $C_y \in \mathbb{R}^{T \times S}$ is the MSCN-coefficients map of the original cyclopean map and $C_{y_2}$ is its version downsampled by a factor of 2. This design choice considers that distortions and artifacts affect the image across different scales [44]. Finally, the spatial features $f_{Spa} \in \mathbb{R}^{20}$ are obtained by concatenating the saliency-based and cyclopean-based features as follows:

$$f_{Spa} = f_{S_a} \oplus f_{C_y} \qquad (3)$$

B. Angular Consistency
Artifacts in light fields can be caused by angular inconsistencies, due to transmission, reconstruction, or rendering [45]. Using angular consistency features can help improve the performance of quality estimation. For this reason, EPIs are analyzed since they are the main sources of LFI angular consistency [10], [16]. In fact, the luminance of each pixel in an EPI is determined by the angular resolution of the light field camera [46], [47]. From this evidence, a new feature set, called Global Luminance Distribution Pattern (GLDP), is proposed to account for the angular consistency by exploiting the EPI luminance distribution. In addition, another source of angular consistency of an LFI is related to the difference between SAIs, where each SAI belongs to different angular coordinates. This feature can help analyze the relative relationship among pixels for measuring the change in angular consistency. We explore this feature through the Weighted Local Binary Pattern operator.
1) Global Luminance Distribution Pattern: As demonstrated in [48] and [49], different luminance distributions have an impact on human perception, thus affecting the subjective quality evaluation. Inspired by these results, we apply the same concept to LFIs. More precisely, we analyze the luminance distribution of each EPI. In Figure 3 some examples are shown. It can be noticed that all the Probability Density Functions (PDFs) follow an asymmetric Gaussian distribution. We account for this behavior with the proposed feature set GLDP, extracting statistical parameters from luminance PDFs.
First, a Gaussian low-pass filter is applied to each EPI to remove high-frequency components. Then, four statistical features are extracted from each PDF: $\mu$, $w$, $\mu_3$, and $\mu_4$ [50]. However, there are many horizontal and vertical EPIs in a light field. Hence, concatenating all the features leads to excessive space and time complexity. To cope with this issue, we evaluate the mean and the variance ($\mu$ and $\sigma$) of the features for the horizontal and vertical EPI sets, separately.
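In more detail, let $\mathcal{H} = \{H_{r_i} \in \mathbb{R}^{U \times S}, i = 1, \dots, V \cdot T\}$ be the set of all horizontal EPIs in grayscale with height U and width S. The GLDP features of horizontal EPIs, $f^{H}_{Gl} \in \mathbb{R}^{8}$, are obtained by concatenating the mean and the variance, over $\mathcal{H}$, of the four statistical features of each EPI. Similarly, the GLDP features of vertical EPIs, $f^{V}_{Gl} \in \mathbb{R}^{8}$, are computed, where $\mathcal{V} = \{V_{r_i} \in \mathbb{R}^{V \times T}, i = 1, \dots, U \cdot S\}$ is the set of all vertical EPIs in grayscale with height V and width T.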
The final feature vector of GLDP for both horizontal and vertical EPIs, $f_{Gl} \in \mathbb{R}^{16}$, is then arranged as:

$$f_{Gl} = f^{H}_{Gl} \oplus f^{V}_{Gl}$$

GLDP coefficients can be inspected in Figure 4. The proposed feature set is able to highlight and quantify the impact of a distortion on an LFI.
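The GLDP computation can be sketched as follows. Several choices here are assumptions made only for illustration: a 3-tap binomial filter stands in for the Gaussian low-pass filter, the third and fourth moments are taken as standardized central moments, the entropy is computed from a 256-bin histogram over [0, 1], and the toy EPIs are random arrays.

```python
import numpy as np

def smooth(epi, kernel=(0.25, 0.5, 0.25)):
    """Small separable low-pass filter (stand-in for the Gaussian filter)."""
    k = np.asarray(kernel)
    padded = np.pad(np.asarray(epi, dtype=float), 1, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def pdf_stats(epi, bins=256, value_range=(0.0, 1.0)):
    """Mean, entropy, and third/fourth standardized moments of one EPI."""
    x = smooth(epi).ravel()
    mu, sd = x.mean(), x.std()
    hist, _ = np.histogram(x, bins=bins, range=value_range)
    p = hist[hist > 0] / hist.sum()
    entropy = -np.sum(p * np.log2(p))
    z = (x - mu) / sd if sd > 0 else np.zeros_like(x)
    return np.array([mu, entropy, np.mean(z ** 3), np.mean(z ** 4)])

def gldp(epis):
    """Mean and variance of the four statistics over a set of EPIs (8 values)."""
    stats = np.stack([pdf_stats(e) for e in epis])  # shape (num_epis, 4)
    return np.concatenate([stats.mean(axis=0), stats.var(axis=0)])

# Toy EPI sets with luminance in [0, 1]; real EPIs would come from the light field.
rng = np.random.default_rng(0)
h_epis = [rng.random((9, 32)) for _ in range(20)]
v_epis = [rng.random((9, 24)) for _ in range(20)]
f_gl = np.concatenate([gldp(h_epis), gldp(v_epis)])  # 16 values
```

Summarizing each statistic by its mean and variance keeps the descriptor at a fixed length of 16, independently of how many EPIs the light field contains.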
2) Weighted Local Binary Patterns: In [51], [52] the Local Binary Pattern (LBP) operator is introduced. This operator has been successfully applied to the quality assessment of 2D images in [53] since it is able to detect the statistics of local structure primitives at the early stage of vision. More recently, it has also been applied to light fields in [10]. The LBP operator uses circular neighborhoods (with radius r and number of pixels p) of different sizes. It is possible to apply a Weighted LBP operator (WLBP) to reduce the number of LBP features by adopting a weighted rotation invariant operator, which has shown effective performance with many 2D quality metrics [54], [55]. Moreover, in our implementation, we apply the WLBP computation three times with the values $[r, p] \in \{[1, 8], [2, 16], [3, 24]\}$, following the same rationale of [44]. Since the rotation invariant uniform operator yields p + 2 distinct patterns for p neighbors, the process extracts 10 + 18 + 26 = 54 WLBP coefficients.
The general expression for extracting the WLBP features of horizontal EPIs, $f^{r,p}_{H} \in \mathbb{R}^{54}$, is defined in terms of $L^{r,p}_{v,t}$, which represents the rotation invariant uniform LBP operator, and $w^{r,p}_{v,t}$, which is the entropy of the corresponding EPI. To estimate the impact of distortions on the image structure across scales, we compute the WLBP coefficients at two resolutions: the original one and the one reduced by a factor of 2.
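A possible realization of the WLBP features is sketched below. It assumes that the per-EPI rotation invariant uniform LBP histograms are averaged with the EPI entropies as weights; this normalization is an assumption, since the exact expression is not reproduced here. Neighbors are sampled at the nearest pixel instead of being interpolated, which is a simplification.

```python
import numpy as np

def lbp_riu2(img, r, p):
    """Rotation invariant uniform LBP codes in 0..p+1 for interior pixels."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[r:h - r, r:w - r]
    bits = []
    for k in range(p):
        angle = 2.0 * np.pi * k / p
        dy = int(round(-r * np.sin(angle)))  # nearest-pixel sampling (simplification)
        dx = int(round(r * np.cos(angle)))
        neigh = img[r + dy:h - r + dy, r + dx:w - r + dx]
        bits.append((neigh >= center).astype(np.int64))
    bits = np.stack(bits)
    # Number of 0/1 transitions along the circular pattern.
    transitions = np.sum(bits != np.roll(bits, 1, axis=0), axis=0)
    ones = bits.sum(axis=0)
    # Uniform patterns are labeled by their number of ones; all others get p + 1.
    return np.where(transitions <= 2, ones, p + 1)

def lbp_histogram(img, r, p):
    codes = lbp_riu2(img, r, p)
    return np.bincount(codes.ravel(), minlength=p + 2) / codes.size

def entropy(img, bins=256, value_range=(0.0, 1.0)):
    hist, _ = np.histogram(img, bins=bins, range=value_range)
    q = hist[hist > 0] / hist.sum()
    return float(-np.sum(q * np.log2(q)))

def wlbp(epis, configs=((1, 8), (2, 16), (3, 24))):
    """Entropy-weighted average LBP histogram per (r, p), concatenated (54 values)."""
    weights = np.array([entropy(e) for e in epis])
    feats = []
    for r, p in configs:
        hists = np.stack([lbp_histogram(e, r, p) for e in epis])
        feats.append(weights @ hists / weights.sum())
    return np.concatenate(feats)

# Toy horizontal EPIs with luminance in [0, 1].
rng = np.random.default_rng(0)
epis = [rng.random((9, 32)) for _ in range(10)]
f_h = wlbp(epis)
```

Each (r, p) configuration contributes p + 2 histogram bins, so the three configurations together give 10 + 18 + 26 = 54 values per EPI set and resolution.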
Horizontal and vertical WLBP coefficients for both resolutions are concatenated to obtain the final feature vector $f_{Wl} \in \mathbb{R}^{216}$:

$$f_{Wl} = f^{r,p}_{H} \oplus f^{r,p}_{V} \oplus f^{r,p}_{H^*} \oplus f^{r,p}_{V^*}$$

where $f^{r,p}_{H^*} \in \mathbb{R}^{54}$ and $f^{r,p}_{V^*} \in \mathbb{R}^{54}$ are the WLBP features of $H_r^* \in \mathbb{R}^{\frac{U}{2} \times \frac{S}{2}}$ and $V_r^* \in \mathbb{R}^{\frac{V}{2} \times \frac{T}{2}}$, respectively. The features obtained from GLDP and WLBP are concatenated to compose the angular consistency features $f_{Ang} \in \mathbb{R}^{232}$:

$$f_{Ang} = f_{Gl} \oplus f_{Wl}$$

Finally, $f_{Spa}$ and $f_{Ang}$ are concatenated to obtain the final feature vector, $f_{Spa} \oplus f_{Ang}$, which is given as input to the SVR [56] that outputs the estimated LFI quality.
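The dimensions stated in this section can be checked with a short sketch. The vectors are zero-filled placeholders and the variable names are hypothetical; only the sizes are taken from the text.

```python
import numpy as np

# Placeholder vectors with the dimensions stated in the text (values are dummies).
f_sa = np.zeros(8)        # saliency statistics
f_cy = np.zeros(12)       # cyclopean-based statistics
f_gl = np.zeros(16)       # GLDP: 8 horizontal + 8 vertical
f_wl = np.zeros(4 * 54)   # WLBP: {horizontal, vertical} x {full, half resolution}

f_spa = np.concatenate([f_sa, f_cy])            # spatial features, 20 values
f_ang = np.concatenate([f_gl, f_wl])            # angular features, 232 values
features = np.concatenate([f_spa, f_ang])       # regressor input, 252 values
```

With these sizes the regressor would receive a 252-dimensional vector per light field, of which 232 entries describe angular consistency.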

IV. EXPERIMENT DESIGN
In this section we provide details on the datasets, experiment setup, and metrics used for evaluating the performance of the proposed approach. Table II illustrates the datasets available at the time of writing and their characteristics.

A. Dataset
In this study, 4 publicly available datasets have been used. These datasets have been selected based on the range of image contents and distortion types. The first dataset is SMART [9], which contains 16 original and 256 distorted LFIs. The distorted images are obtained by introducing 4 types and 4 levels of distortions (HEVC-intra, JPEG, JPEG2K, and Sparse Set and Disparity Coding (SSDC)). The pair-wise comparison method is adopted to collect the subjective Bradley-Terry ratings for each LFI. A higher Bradley-Terry score is related to a higher preference rate. Two VALID datasets are considered in this work: VALID 10-bits and VALID 8-bits [34]. Both VALID datasets rely on the same 5 original LFIs. In the VALID 10-bits dataset, 100 distorted LFIs are available, obtained by applying the HEVC [62], P3 [63], P4 [64], P5 [65], and VP9 [66] compression algorithms with 4 severity levels. In the VALID 8-bits dataset, instead, 40 distorted LFIs are available with the HEVC [62] and VP9 [66] codecs. Both VALID datasets provide subjective scores in terms of Mean Opinion Score (MOS). In the Win5-LID dataset, 10 images (6 real and 4 synthetic) are available. The total number of distorted LFIs is 220, using the HEVC and JPEG2K codecs. In addition, reconstruction distortions are present in the dataset, such as Linear Interpolation (LI), Nearest Neighbor Interpolation (NNI), CNN1 [67], and CNN2 [68]. CNN1 and CNN2 are learning-based reconstruction distortions which extract features from EPIs by a CNN. The perceptual quality of the LFIs is measured in terms of MOS.

B. Implementation Details
EDOF images for each dataset are obtained by using the approach proposed in [29]. To train and test the SVR model, each dataset has been randomly split into train and test sets with an 80%:20% ratio. The performance on VALID 10-bits, VALID 8-bits, and Win5-LID is measured with a linear kernel function, whereas the SMART dataset is evaluated with a polynomial one. The choice of each kernel function is based on random hyperparameter optimization [69].

C. Performance Evaluation
To evaluate the performance of the proposed model, Spearman's Rank-order Correlation Coefficient (SRCC), Pearson's Linear Correlation Coefficient (PLCC), Kendall's Rank-order Correlation Coefficient (KRCC), and Root Mean Square Error (RMSE) have been employed. SRCC, PLCC, and KRCC measure monotonic, linear, and ordinal relationships between predicted and ground truth quality scores, respectively. Similarly, the RMSE is used to measure the distance between the predicted and the ground truth quality scores. Following [10], [11], a nonlinear activation function $g : \mathbb{R} \to \mathbb{R}$ has been applied to the model output $x$, where $\beta = [\beta_1, \dots, \beta_5]$ are the coefficients optimized in the nonlinear data fitting step.
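The four indices can be computed with a few lines of NumPy, as in the sketch below. The five-parameter logistic shown for g is the form commonly used in the quality assessment literature and is assumed here, since the exact expression is not reproduced in this section; the function names are illustrative.

```python
import numpy as np

def rank_average(a):
    """Ranks starting at 1, with ties replaced by their average rank."""
    a = np.asarray(a, dtype=float)
    ranks = np.empty(len(a))
    ranks[np.argsort(a, kind="mergesort")] = np.arange(1, len(a) + 1)
    for value in np.unique(a):
        mask = a == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def plcc(x, y):
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def srcc(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return plcc(rank_average(x), rank_average(y))

def krcc(x, y):
    """Kendall tau-b computed from pairwise sign agreements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

def rmse(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def logistic5(x, beta):
    """Assumed five-parameter logistic mapping of the raw model output x."""
    b1, b2, b3, b4, b5 = beta
    x = np.asarray(x, dtype=float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# Toy predictions and subjective scores (made-up numbers).
pred = np.array([0.2, 0.4, 0.5, 0.9])
mos = np.array([1.5, 2.5, 3.0, 4.5])
scores = {"SRCC": srcc(pred, mos), "PLCC": plcc(pred, mos),
          "KRCC": krcc(pred, mos), "RMSE": rmse(pred, mos)}
```

In practice the coefficients of `logistic5` would be fitted by nonlinear least squares on the predicted and subjective scores before computing PLCC and RMSE; the rank-based SRCC and KRCC are unaffected by a monotonic mapping.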

V. RESULTS ANALYSIS AND DISCUSSIONS
First, to demonstrate the effectiveness of using the saliency map as a spatial feature, an analysis of the impact of distortions on the saliency map is reported. Then, the proposed metric is tested on LFI datasets and the results are compared with state-of-the-art models. In addition, the model's performance in cross-dataset tests is reported. Finally, an ablation study related to the features adopted in our model is provided.

A. Analysis of the Relationship Among Distortion, Saliency, and Perceived Quality
As highlighted in Section IV-A, LFIQA datasets are composed of different numbers of reference images (n), distortion types (z), and distortion levels (m). To study whether a distortion of the LFI causes a degradation of its saliency map, the pipeline in Figure 5 has been designed. The SMART dataset has been considered for this analysis, thus z, m, and n are 4, 4, and 16, respectively.
In more detail, let $I_{ed} \in \mathbb{R}^{3 \times T \times S}$ and $I_{er} \in \mathbb{R}^{3 \times T \times S}$ be the distorted and reference EDOF images, respectively. Then, $I_{ed}$ and $I_{er}$ undergo a saliency extraction step with the function $f(\cdot)$, where the normalized image or one of the five saliency map models (Itti [28], GBVS [70], Geometry [71], BMS [72], and EBMS [73]) is computed. The image pairs are obtained as follows:

$$I^{f}_{ed} = f(I_{ed}), \qquad I^{f}_{er} = f(I_{er})$$

To measure the similarity between $I^{f}_{ed}$ and $I^{f}_{er}$, we compute the correlation $cc(\cdot) : \mathbb{R}^{2 \times T \times S} \to \mathbb{R}$, described in [74]. This process is iterated for the n image pairs and for each distortion level m to obtain an m×n correlation matrix. A comparison between the obtained matrix and the corresponding subjective quality scores is computed by means of the SRCC and PLCC metrics.
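The analysis pipeline can be sketched as follows. The sketch assumes that cc is the linear correlation coefficient between two maps, uses the normalized grayscale image as the extraction function f (one of the options mentioned in the text), and replaces the EDOF images with random arrays corrupted by additive noise of increasing strength. All of these are illustrative assumptions.

```python
import numpy as np

def normalized_image(edof):
    """f(.): channel-averaged image rescaled to [0, 1] (input shape 3 x T x S)."""
    gray = np.asarray(edof, dtype=float).mean(axis=0)
    span = gray.max() - gray.min()
    return (gray - gray.min()) / span if span > 0 else np.zeros_like(gray)

def cc(map_a, map_b):
    """Linear correlation coefficient between two maps of equal size."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def correlation_matrix(distorted, reference, f):
    """distorted[j][i] is content i at distortion level j; returns an m x n matrix."""
    return np.array([[cc(f(d), f(ref)) for d, ref in zip(level, reference)]
                     for level in distorted])

# Toy data: n = 3 contents, m = 4 levels of additive noise (stand-in distortion).
rng = np.random.default_rng(0)
noise_levels = (0.05, 0.1, 0.2, 0.4)
reference = [rng.random((3, 16, 16)) for _ in range(3)]
distorted = [[ref + rng.normal(0.0, s, ref.shape) for ref in reference]
             for s in noise_levels]
matrix = correlation_matrix(distorted, reference, normalized_image)
```

Each row of `matrix` corresponds to one distortion level and each column to one content; the entries would then be compared with the subjective scores of the same stimuli through SRCC and PLCC.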
The results of the analysis are reported in Table III. The strong correlation between the distortion and the saliency map demonstrates the effectiveness of this representation for quality assessment purposes. Hence, statistics of the saliency map are employed in the proposed metric as spatial features.

B. Selection of the Saliency Model
To select the saliency model to be used in the proposed LFI quality metric, samples from VALID 10-bits are randomly selected. For this purpose, the saliency models are compared. The results, reported in Table IV, highlight the superiority of the Itti approach [28], motivating its use in the proposed metric.

D. Robustness Evaluation of Proposed Model With Cross-Datasets
The cross-dataset test is used to validate the model across different datasets. The Win5-LID, VALID 10-bits, and VALID 8-bits datasets have been selected for the study. The achieved results are presented in Table VII, where a high correlation between VALID 10-bits and VALID 8-bits is shown. This result is obtained since VALID 10-bits and VALID 8-bits images share similar properties, i.e., spatial and angular consistency features. However, the performance of the proposed model is low when the Win5-LID dataset is used as the training set and VALID as the test set. This result could be due to the angular resolution of Win5-LID, which is [9 × 9], whereas that of VALID is [13 × 13]. In addition, the image content and the spatial and angular consistency features are different, thus changing the pixel luminance distribution.

E. Ablation Study
We study the performance of our model with 3 feature sets: $f_{Spa}$, $f_{Ang}$, and $f_{Spa} \oplus f_{Ang}$. This study evaluates the importance of spatial and angular consistency features in the proposed LFQM. It also shows the improvement in model performance when both spatial and angular consistency features are involved in the prediction process. The results are reported in Table VIII. Notably, $f_{Ang}$ outperforms $f_{Spa}$ on all datasets. The main reason is that $f_{Ang}$ considers the distortion effect on the higher-dimensional space of the LFI at once, i.e., EPIs are employed with the full length of the angular resolution along the horizontal and vertical directions. Although $f_{Spa}$ has a lower number of dimensions than $f_{Ang}$, it provides additional information, increasing the correlation between subjective and predicted scores.

VI. CONCLUSION
In this work the use of saliency information for light field quality assessment has been exploited. As a first step, we have verified the impact of a distortion in an LFI on the estimated saliency map. The achieved results on multiple LFI datasets show high correlation scores between the measure of distortion in the saliency map and the subjective quality score of the corresponding LFI. Then, we have exploited the spatial and angular consistency features of LFIs based on a machine learning approach. The model performance has been evaluated on LFI quality datasets: SMART, Win5-LID, VALID 10-bits, and VALID 8-bits. The experimental results show that the proposed model outperforms state-of-the-art no-reference quality metrics. Moreover, the results show that our model achieves good results in cross-dataset evaluation with different distortion types and saliency map models, indicating good generalization capabilities. Finally, it is possible that learning-based approaches, with saliency and cyclopean maps as input, can reach better performance than hand-crafted features. Clearly, an ad-hoc modeling study has to be carried out to account for computational complexity and performance. Hence, the use of learning-based features extracted from saliency and cyclopean maps is set as future work.

[Table VIII: Similarity measurement of the predicted quality scores and ground truth quality scores for 7 types of input features to the SVR.]