An Effective Foveated 360° Image Assessment Based on Graph Convolution Network

Virtual reality (VR) has been adopted in various fields such as entertainment, education, healthcare, and the military, thanks to its ability to provide an immersive experience to users. However, 360° images, one of the main components of VR systems, have bulky sizes and thus require effective transmission and rendering solutions. One potential solution is to use foveated technologies, which take advantage of the foveation feature of the human eye. Foveated technologies can significantly reduce the data required for transmission and the computational complexity of rendering. However, understanding of the impact of foveated 360° images on human quality perception is still limited. This paper addresses these problems by proposing an accurate machine-learning-based quality assessment model for foveated 360° images. The proposed model is shown to outperform three cutting-edge machine-learning-based models, which apply deep learning techniques, and 25 traditional-metric-based models (or analytical-function-based models), which utilize analytical functions. We also expect that our model will help evaluate and improve 360° content streaming and rendering solutions to further reduce data sizes while ensuring user experience. In addition, this model could be used as a building block to construct quality assessment methods for 360° videos, which we reserve for future work. The source code is available at https://github.com/telagment/FoVGCN.

technologies [1]. In contrast to traditional images (i.e., 2D images), VR images are typically recorded with a 360° camera, which captures the 360° space of a scene [2]. The problem is that omnidirectional contents of VR applications have huge data sizes, and thus require effective transmission and rendering solutions [3].

To cope with this problem, one of the most promising solutions is to use foveated technologies, which are based on the foveation feature of the human eyes. This feature refers to spatially varying visual acuity due to the heterogeneous distribution of photoreceptors in the retina. In foveated technologies, image areas gazed at by the retina region of higher photoreceptor density have higher quality levels than the outside areas. This allows significantly reducing not only the data required for transmission but also the computational complexity of rendering.

Quality of Experience (QoE) has long been investigated for different content types [26], [27], [28], [29], [30]. Among them, however, only a few studies focus on 360° contents [3], [4], [7], [28], [29].

In [28], the authors proposed a framework to compare the performance of four subjective quality assessment methods: Double Stimulus Quality Comparison (DSQC), Single Stimulus Absolute Category Rating (ACR), Ascending Method (AM), and Descending Method (DM). Through an analysis of Quality of Experience (QoE) scores of foveated 360° images, it was found that the DSQC method obtains the highest consistency, but requires more judgments and time to converge to a consensus. Meanwhile, ACR was found to be the most efficient method. In [29], the authors focused on the subjective comparison between 2D and 3D foveated 360° videos in terms of users' perceptual quality. The results showed that the perceptual quality of 2D videos was more affected by the quality of the image area corresponding to the peripheral region, whereas for 3D videos the perceptual quality was largely impacted by the quality of the image area associated with the fovea region. Based on these results, a performance evaluation of 12 objective quality metrics was also conducted, and the Foveated Wavelet Quality Index (FWQI) was found to be the most effective model for both 2D and 3D foveated 360° videos.

In [4], the key question the authors focused on was how to spatially reduce data size without noticeable perceptual quality degradation by taking advantage of the foveation feature. In particular, a subjective quality assessment for foveated 360° images was conducted taking into account three regions of the human retina, i.e., the central vision area with one-side eccentricity θ ∈ [0°, 9°], the near peripheral area with θ ∈ (9°, 30°], and the far peripheral area. In this experiment, the image quality corresponding to each region was reduced step by step until the participants noticed a perceptual difference. By utilizing the recorded encoding parameters (i.e., quantization parameters and resolutions), the authors proposed a rendering solution that is shown to improve rendering throughput by about 10× without perceptual loss, in comparison to the traditional solution of uniform quality. Reference [3] is the first study that could quantify the impacts of different retina regions on user quality perception. In particular, the authors performed a subjective quality assessment of foveated 360° images.

In most cases, machine-learning-based methods have been found to perform better than traditional-metric-based methods. There have been a small number of studies that applied machine learning in the domain of full-reference uniform image quality assessment, for instance, [8], [33], and [34]. In [8], the authors proposed a new framework that applies a deep neural network to study human visual sensitivity (HVS) based on distorted images, a subjective score, and an objective error map (the DeepQA model) or without an objective error map (the DeepQA-s model) on a uniform image quality dataset. In [33], the author considered the important role of multiple viewports related to the image inside the field of view (FoV). Those viewports are extracted by viewport sampling with inputs being reference images (i.e., original images on the server side) and distorted images (i.e., received images on the client side). Their proposed stereoscopic omnidirectional image quality assessment (SOIQA) model then learned those viewport features using a deep neural network and support vector regression (SVR). Machine learning techniques have been applied efficiently in [8] and [33] to learn the characteristics of uniform immersive image quality under the full-reference quality approach. However, the characteristics of uniform 360° image quality are vastly different from those of foveated 360° image quality. Therefore, the research direction of assessment methods that work effectively for foveated 360° images is still an open issue.

In the direction of NR-IQA methods, machine-learning-based approaches have been utilized quite commonly. Zhang et al. [35] proposed a deep bilinear model for blind image quality assessment (BIQA) to deal with synthetic and authentic distortions in images. Afterward, Xu et al. [6] developed a novel Viewport-oriented Graph Convolution Network (VGCN) that combines a global branch based on Zhang's work [35], which predicts the global quality score by handling the synthetic and authentic distortions, with a local branch that learns the interactions among different viewports using a graph convolution network, to obtain the overall image quality. Kim et al. [36] first extracted features of distorted images to predict quality scores, and then proposed user perception guidance using adversarial learning to enhance the prediction performance. Sun et al. [5] introduced a multi-channel convolutional neural network (MC360IQA), in which the overall quality is predicted using six parallel ResNet-34s that extract features from six created viewports. In [37], the authors introduced meta-learning-based image quality assessment.

FIGURE 1. FoVGCN model operation based on the full-reference image quality assessment (FR-IQA) approach, which leverages the information of both reference and distorted images. First, the graph-structured data is constructed from the error map and the attention weight matrix. Then, the graph convolution network interprets the graph data to predict the final quality assessment score of the viewports.

The performance of FoVGCN is compared with learning-based models such as [5], [6], and [36]. Although FoVGCN is designed to work efficiently for foveated image quality, its performance is also cross-validated with uniform datasets to evaluate its generality over heterogeneous cases in reality. The preprocessor block plays an important role in creating an error map and an attention weight matrix, which represent the spatial quality changes in different zones of an image and the priority of human attention, respectively. After being pre-processed, the error map and the attention weight matrix together construct a graph structure that is the input of the graph convolution layer. The graph structure is then fed into the graph convolution network block to predict the overall quality score of the image. In the next subsections, the two main blocks of the FoVGCN model and its parameter settings are described in detail.

In the preprocessor block, the reference and distorted images are fed into the error map creator block to create an error map. The error map is interpreted as a graph in which each vertex (node) represents a pixel and is connected via an edge to a foveation node (or foveation pixel), as Figure 2 describes. A foveation node is defined as the center point of the virtual viewport [3]. Therefore, each non-foveation pixel has only one neighbor (the foveation node), and the information stored at each node is calculated based on an error map E.

The attention weight matrix block takes the shape of a viewport as input to create a distance-based distribution matrix of the same size as the viewport. The attention weight a_{(i,j),(n,n)} measures the connective strength based on the distance relation between an arbitrary E_{i,j} and the foveation node E_{n,n}, with i, j ∈ {1, 2, . . . , 2n}. Consequently, the error map and the attention weight matrix construct the graph structure that is the input of the graph convolution layer, as shown in Figure 2.

In this section, we describe the details of how to create an error map, the first step of the preprocessor, to generate the desired input for the GCN model in the later phase. Intuitively, an error map represents the spatial quality changes in a distorted image when comparing it with a reference one. Besides, it also serves as the graph matrix input of the graph convolution network.
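To make this structure concrete, the following minimal sketch (in Python with NumPy) builds an error map and the star-shaped edge set in which every non-foveation pixel connects only to the foveation node at the viewport center. The absolute-difference error formula and the helper names are our own placeholders, not the paper's exact definitions.

```python
import numpy as np

def error_map(ref: np.ndarray, dist: np.ndarray) -> np.ndarray:
    # Per-pixel error between reference and distorted viewports.
    # The exact error measure is not spelled out in this excerpt,
    # so a plain absolute difference stands in for it here.
    return np.abs(ref.astype(np.float32) - dist.astype(np.float32))

def star_graph_edges(size: int) -> list:
    # Every pixel (i, j) has exactly one neighbor: the foveation
    # node at the viewport center (n, n).
    n = size // 2
    return [((i, j), (n, n))
            for i in range(size) for j in range(size)
            if (i, j) != (n, n)]
```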

The details of the error map creator are shown in Figure 3. As illustrated in Figure 4, distorted viewport 1 has quite good quality and distorted viewport 2 has lower quality, while the reference viewport has the highest quality.

In the error map transform block, the error map is first divided into four quadrants to adapt to the attention weight matrix transformation; each quadrant is then rotated and sorted into one of four consecutive quadrants following the rule in Figure 5 (a code sketch of this folding is given after the list below). The division of the error map helps reduce the computational complexity of the proposed model towards real-time quality assessment. As a result of this process, the attention weight matrix size is decreased by a factor of 4, thereby significantly reducing the running time of the whole model.

Linear degradation distribution. The attention weight matrix is constructed following Eq. (2), and its distribution is visualized in Figure 7, where:
• dis(i, j) is the distance between pixel E_{i,j} and the center pixel E_{n,n};

• dis_max is the maximal distance between the two nodes E_{i,j} and E_{n,n};

• The threshold δ is applied to avoid zero values, and is set equal to 0.0001.
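As a concrete illustration, the sketch below implements the quadrant folding described above and one plausible reading of the linear degradation distribution; the authoritative folding rule is Figure 5 and the authoritative formula is Eq. (2), neither of which is reproduced in this excerpt, so both functions are assumptions rather than the authors' exact definitions.

```python
import numpy as np

def fold_quadrants(err: np.ndarray) -> np.ndarray:
    # Fold a (2n, 2n) error map into four aligned (n, n) quadrants.
    # Each quadrant is mirrored so that its corner closest to the
    # foveation node (the map center) lands in the same position.
    n = err.shape[0] // 2
    q1 = err[:n, :n]
    q2 = np.fliplr(err[:n, n:])
    q3 = np.flipud(err[n:, :n])
    q4 = np.flipud(np.fliplr(err[n:, n:]))
    return np.stack([q1, q2, q3, q4])  # shape (4, n, n)

def linear_attention_weights(size: int, delta: float = 1e-4) -> np.ndarray:
    # Linear degradation distribution: the weight decays linearly with
    # dis(i, j), the distance to the center pixel, normalized by
    # dis_max, and is clamped by delta to avoid zero values.
    n = size // 2
    ii, jj = np.mgrid[0:size, 0:size]
    dis = np.hypot(ii - n, jj - n)
    w = 1.0 - dis / dis.max()
    return np.maximum(w, delta)
```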

Gaussian degradation distribution. As mentioned in [4], the density function of cones presented in [38] can be approximated by a Gaussian distribution. Inspired by this observation, the Gaussian distribution is leveraged to represent the perception process of the human eyes. This idea follows the concept of the human retina [3]: user perception tends to be affected by the quality of the fovea and parafovea zones, from the center to the outside of an image. The reason is that human eyes concentrate significantly on the fovea and parafovea zones of the human retina, a small region in the viewport [3]. Therefore, we need a distribution that better describes the human perspective, and we apply formula (3), following the Gaussian distribution, to construct the attention weight matrix. Bigger σ values correspond to larger focal regions. Therefore, in our experiments presented later in this paper, the performance of the FoVGCN model will be shown to study the impact of the sigma value σ, which corresponds to the extent of human attention.
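A minimal sketch of this Gaussian attention weight matrix follows. Since formula (3) is not reproduced in this excerpt, an isotropic Gaussian of the pixel distance to the foveation node is assumed, with σ expressed in pixels; the exact form in the paper may differ.

```python
import numpy as np

def gaussian_attention_weights(size: int, sigma: float,
                               delta: float = 1e-4) -> np.ndarray:
    # Gaussian degradation distribution: weights fall off as a
    # Gaussian of the distance to the central foveation node, so a
    # bigger sigma corresponds to a larger focal region.
    n = size // 2
    ii, jj = np.mgrid[0:size, 0:size]
    dis2 = (ii - n) ** 2.0 + (jj - n) ** 2.0
    w = np.exp(-dis2 / (2.0 * sigma ** 2))
    return np.maximum(w, delta)  # keep all weights strictly positive
```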

As the last phase of the whole FoVGCN assessment process, the graph convolution network is applied, since it can efficiently learn graph-structured data and capture the characteristics of foveated 360° images. In the GCN, we construct the graph structure in which a node represents a pixel and an edge represents the distance-based correlation between an arbitrary node and the foveation node. The graph structure is built on the error map and the attention weight matrix. Notably, the GCN is designed using the attention weight matrix instead of an adjacency matrix in order to describe human visual attention in practice. The attention weight matrix satisfies the real, symmetric, and positive semidefinite properties that meet the mathematical requirements of Eqs. (4) and (5).

In our work, the graph convolution network is applied to extract features and study the spatial quality changes relevant to the attention weight matrix in the foveated immersive image dataset used in [3].

The error map, which is designed as a graph with a size of (720, 720), is used to represent the quality changes between different pixels. This size is heuristically chosen based on the size of the attention weight matrix, since using a large weight matrix leads to computational overload in the GCN. In our design and experiments, an attention weight matrix of size (720, 720) is used.

First, the attention weight matrix is symmetrically normalized as D̃^{-1/2} Ã D̃^{-1/2} before being multiplied with the learnable model parameters W^{(l)} and the output of the prior layer H^{(l)}. Then, the output of the layer is obtained after going through the activation function, i.e., H^{(l+1)} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)}).
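This propagation rule can be written compactly in PyTorch. The sketch below is a generic implementation of the standard normalized graph convolution with the attention weight matrix A used in place of an adjacency matrix; the class name, dimensions, and activation are our assumptions, not the released FoVGCN code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    # One layer of H^(l+1) = ReLU(D~^(-1/2) A~ D~^(-1/2) H^(l) W^(l)),
    # where A~ is the attention weight matrix with self-loops added.
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_tilde = A + torch.eye(A.size(0), device=A.device)
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)          # D~^(-1/2)
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
        return torch.relu(A_hat @ self.weight(H))
```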

In addition, following Eq. (4), the degree matrix D̃ must be inverted. Since this process takes huge computation and resources, the inputs, i.e., the attention weight matrix and the error map, need to be pre-processed before being fed into the FoVGCN model in order to reduce the computational complexity and thereby the running time. Instead of feeding the entire error map and attention weight matrix A, we process them as described in Section III-A1 and Section III-A2.

To create a down-sampled image with a size of (720, 720), we use the zero-padding method to form a square matrix of (1440, 1440) in order to avoid distortion when the image is down-scaled. This square matrix is then down-sampled to a size of (720, 720) to reduce the computational cost. Next, an error map is created from those viewports and, together with the attention weight matrix, fed through the six blocks of the graph convolution network.

The details of all parameters used in the FoVGCN model are presented in Table 1.
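A sketch of this pre-processing step, assuming grayscale viewports no larger than 1440×1440 and using OpenCV for the down-sampling (the function name is ours):

```python
import numpy as np
import cv2

def preprocess_viewport(img: np.ndarray) -> np.ndarray:
    # Zero-pad the viewport into a (1440, 1440) square so that the
    # subsequent down-scaling does not distort the content, then
    # down-sample to (720, 720) to reduce the computational cost.
    h, w = img.shape[:2]
    canvas = np.zeros((1440, 1440), dtype=img.dtype)
    canvas[:h, :w] = img
    return cv2.resize(canvas, (720, 720), interpolation=cv2.INTER_AREA)
```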

To evaluate the performance of FoVGCN, we use three open datasets, one of which is a foveated image dataset, while the other two are uniform image datasets. In the following sections, we will first describe these datasets and then present the experimental settings and results.

To create foveated distortions, a Gaussian blur filter was employed with a fixed filter size of 50 and four different standard deviations. Specifically, the distortion of the images was conducted based on five regions of the human retina and two basic scenarios of spatial quality change: the quality gradually decreases or increases from the center to the periphery. For each scenario, four different quality levels were generated, corresponding to four different standard deviations σ. Because blurring in the central zones is easier to perceive than in the peripheral zones, the values of σ are 2, 4, 8, and 12 for the first scenario and 1, 2, 4, and 6 for the second scenario. To prevent boundaries between the low- and high-quality zones from irritating viewers, a linear function was used to smooth the transition belts between two adjacent zones. Please refer to [3] for more details about the process of creating the distorted images.
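The distortion procedure can be approximated as follows. This is a rough sketch, not the exact pipeline of [3]: the zone radii are free parameters, the 51-tap kernel rounds up the reported filter size of 50 (OpenCV requires an odd size), and a linear belt blends each pair of adjacent zones.

```python
import numpy as np
import cv2

def foveated_blur(img: np.ndarray, sigmas, belts) -> np.ndarray:
    # img: (H, W, 3) color viewport; sigmas: blur strength per zone,
    # from inner to outer; belts: (start, end) radii in pixels of the
    # linear transition into each zone.
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    out = img.astype(np.float32)
    for sigma, (r0, r1) in zip(sigmas, belts):
        blurred = cv2.GaussianBlur(img, (51, 51), sigma).astype(np.float32)
        alpha = np.clip((r - r0) / max(r1 - r0, 1), 0.0, 1.0)[..., None]
        out = (1.0 - alpha) * out + alpha * blurred  # linear smoothing belt
    return out.astype(img.dtype)
```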

However, this foveated dataset has a limited number of samples for achieving good training performance. Therefore, to enhance the performance and accuracy of our proposed method, we apply a data augmentation technique that flips each viewport in two ways, from left to right and from bottom to top, without destroying the characteristics of the foveated dataset. As a result, this technique triples the amount of data, thus helping to achieve better training performance.
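The augmentation amounts to two axis flips per viewport; both keep the foveation node at the center and hence preserve the foveated quality pattern. A one-line sketch:

```python
import numpy as np

def augment(viewport: np.ndarray) -> list:
    # Original plus a left-right and a bottom-top flip: triples the data.
    return [viewport, np.fliplr(viewport), np.flipud(viewport)]
```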

In our work, FoVGCN is specifically designed to cope with foveated-quality images. This means that it may not work well for uniform-quality images in comparison with other existing solutions designed for this uniform type. However, to investigate the effectiveness of FoVGCN for uniform images, we also evaluate its performance on two other uniform image datasets, CVIQ [5] and OIQA [40].

As evaluation criteria, we adopt standard performance measures [42], [43]. Specifically, they are utilized to measure the difference and the linear and non-linear correlations between subjective quality values and objective scores.
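For IQA evaluation, the difference and the linear/non-linear correlations are conventionally measured by RMSE, PLCC, and SROCC; assuming these are the measures meant here, a sketch with SciPy:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(mos: np.ndarray, pred: np.ndarray) -> dict:
    # RMSE: difference; PLCC: linear correlation; SROCC: rank
    # (non-linear) correlation between subjective and predicted scores.
    return {
        "RMSE": float(np.sqrt(np.mean((mos - pred) ** 2))),
        "PLCC": float(pearsonr(mos, pred)[0]),
        "SROCC": float(spearmanr(mos, pred)[0]),
    }
```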

Note that in our experiments, the learning model is found to work efficiently with the learning rate set to 10^{-4}, as the model converges after 200 epochs. The training and testing phases are executed on Google Colab Pro (Intel(R) Xeon(R) CPU @ 2.30GHz, Tesla P100-PCIE-16GB GPU). In the evaluation process, we first analyze the performance of our FoVGCN model on the foveated image dataset for two cases of the attention weight matrix, namely (1) the linear degradation distribution and (2) the Gaussian degradation distribution. Second, we vary the sigma coefficient of the Gaussian attention weight matrix in order to study the impact of sigma on the performance of the FoVGCN method. Third, FoVGCN is compared with 25 traditional-metric-based methods and three machine-learning-based image assessment approaches. Finally, we conduct cross-validation experiments to investigate how FoVGCN and other foveal-quality-metric solutions work with the two uniform datasets.

In conclusion, the FoVGCN model with the Gaussian weight matrix outperforms the one with the linear distribution weight matrix. Moreover, our method achieves stable and significantly good performance in all experiments.

As mentioned in Section III-A2, any change in the sigma coefficient σ of the Gaussian distribution, which is used to construct the weight matrix, will result in a different size of the focal region.

The performance comparison between FoVGCN and the other existing solutions over the foveated dataset is also summarized in Table 4.

In addition, the scatter diagrams of the ground truth and the predicted MOSs of all metrics and models are shown in Figure 14. In this figure, the horizontal axis presents the MOS score, and the vertical axis shows the predicted MOS score, i.e., the image quality score predicted by each model/approach. The trend of these diagrams is expected to follow the shape of the identity function, indicating a close relationship between the predicted MOS scores and the real MOS scores. MOS stands for Mean Opinion Score, a numerical measure of the human-judged overall quality of experience (QoE), normally for voice and video sessions, ranked on a scale from 1 (bad) to 5 (excellent). The definitions of QoE and MOS can be found in [26].

We can see that, among the analytical metrics, only FMSE, FPSNR, and WVPSNR have reasonable relationships between the actual MOSs and the predicted MOSs. This is because these metrics are specifically designed with the foveation feature.

We next examine FoVGCN on the two other uniform image datasets, CVIQ [5] and OIQA [40]. In the cross-validation experiment, we use the foveated image dataset for training, while the CVIQ dataset and the OIQA dataset are employed for testing. In this part, the other foveal metrics are used for comparison. The results are illustrated in Table 2 and Table 3. We can see that, with the CVIQ dataset, FoVGCN achieves better performance. With the OIQA dataset, FoVGCN achieves comparable accuracy to the other metrics; however, its RMSE (0.285) is much smaller, which means that FoVGCN is more stable than the other foveal metrics. In contrast, the learning-based models, such as VGCN, have lower (or very low) performances (see Fig. 12).

Note that, although these deep-learning-based models were retrained using the same foveated image dataset as the proposed FoVGCN model, their low performances imply that the architectures of these models still cannot capture the characteristics of foveated images.

Currently, the study in this paper still has some limitations.
• First, the proposed model focuses only on image contents. It was not evaluated with video contents due to the lack of foveated video datasets.

• Second, the resolution of the foveated images in this study is fixed. This is because the available dataset does not provide images of different resolutions.

In the future, we will carry out subjective tests to obtain more foveated content datasets, covering different resolutions, headsets, and content types (i.e., images and videos). The FoVGCN model will be extended and evaluated using these future datasets. Field studies using foveated quality models in the context of VR video streaming will also be conducted.

In this paper, we have proposed FoVGCN as an efficient assessment model for foveated 360° images. The model uses a graph convolutional network to represent the complex relationships among different locations of an immersive image. It is expected that the proposed FoVGCN model will be an effective and reliable method for researchers to evaluate coding and rendering solutions in the foveated image/video field. In future work, we will employ this model to improve VR video streaming adaptation techniques to ensure good perceived quality for viewers.