Object-Based Geometric Distortion Metric for Viewport Rendering of 360° Images

To visualize omnidirectional (or 360°) visual content, a sphere to plane projection is employed that maps pixels from the observed sphere region to a 2D image, called the viewport. However, this projection introduces geometric distortions in the rendered image, such as object stretching or shearing and bending of straight lines, which may affect the user's quality of experience (QoE). This paper proposes an object-based quality metric to assess the subjective impact of object shape deformation. The metric uses semantic segmentation to identify the relevant objects in the viewport, where the stretching distortion has a higher perceptual impact, and computes the stretching distortion for each object. Two distinct approaches were exploited and evaluated: the first directly computes and compares object shape measures on the sphere and on the viewport; the second is based on Tissot indicatrices, which are computed for individual objects in the viewport. The experimental results show that, while the Tissot-based method performs slightly better than direct shape measurement, both approaches outperform benchmark solutions; furthermore, they are able to classify the viewport quality, with respect to quality scores obtained in a subjective crowdsourcing study, with a correct decision percentage close to 90%. Additionally, the Tissot-based approach was used in a global quality metric that finds the Pannini projection parameters resulting in the least perceivable geometric distortion. It is shown that the automatically tuned Pannini projection results in viewports with a more pleasant visual quality than the considered benchmark projections.


I. INTRODUCTION
In recent years, the popularity of omnidirectional visual content and applications has been increasing rapidly, notably in the virtual reality (VR) and augmented reality (AR) fields. Omnidirectional visual content can already be found in a large set of applications that users can enjoy, including immersive gaming, remote education, virtual shopping, virtual sports, virtual tours, and even broadcasting of live content. However, for an application to be successful, the user's quality of experience (QoE) should be high, and thus techniques to assess and improve the quality of experience are an important research topic [1]–[4].
The associate editor coordinating the review of this manuscript and approving it for publication was Chua Chin Heng Matthew.

Omnidirectional visual content contains the information of the scene around the camera, covering the whole 360° (horizontal) × 180° (vertical) viewing range, referred to as the viewing sphere. To store or transmit omnidirectional visual content, the viewing sphere is mapped onto a 2D image using, typically, the equirectangular or cubic projections. When this type of content is played, the user can observe any part of the visual scene by changing the viewing direction (''look around''), which creates the feeling of being physically present; this allows a visual experience more immersive than what is offered by traditional 2D visual content.
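As a concrete illustration of the equirectangular mapping just mentioned, the sketch below converts sphere coordinates to ERI pixel coordinates (a minimal sketch; the function name and the sign conventions for longitude and latitude are assumptions, not taken from this paper):

```python
import math

def sphere_to_eri(lon, lat, width, height):
    """Map sphere coordinates (radians) to equirectangular pixel coordinates.

    Assumed convention: lon in [-pi, pi), lat in [-pi/2, pi/2]; the point
    (lon, lat) = (0, 0) maps to the image center, and latitude +pi/2 (the
    north pole) maps to the top row.
    """
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v

# The sphere point (0, 0) lands at the middle of a 4096x2048 ERI.
print(sphere_to_eri(0.0, 0.0, 4096, 2048))  # (2048.0, 1024.0)
```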
There are several ways to display omnidirectional visual content, including HMDs (e.g., HTC Vive and Oculus Rift), smartphones (or tablets), and standard computer monitors. Typically, the HMD provides a better immersive experience, although it is somewhat uncomfortable, expensive, and not accessible to all users. Therefore, watching omnidirectional visual content on smartphones or computer monitors is rather common. Recently, large technology companies, such as Google and Facebook, have provided several applications and services for smartphones and personal computers (thus, using 2D screens), including Google Street View, Facebook 360 photos, and YouTube VR, among others; this is the target application scenario of this paper. Regardless of the display used, users only see a fraction (called the viewport) of the entire sphere at a time. The viewport content is defined by the viewing direction (VD, in Fig. 1-a)) and by the horizontal and vertical fields of view (HFoV and VFoV, in Fig. 1-b)); a large field of view (FoV) includes more information in the viewport, making the visual experience more pleasant [5]. To render the viewport in a 2D image, a sphere to plane projection is required; the rectilinear, stereographic, and Pannini projections are often considered for this procedure [6]–[8]. However, any sphere to plane projection introduces geometric distortions, such as stretching of objects and bending of straight lines.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Viewport rendering: a) Coordinate system; b) Sphere to plane projection.
The projection type, the considered FoV, and the image content characteristics are the three main factors that influence the amount of perceived distortion [9]. Concerning the latter, although the geometric distortion at an infinitesimal scale (i.e., locally) is independent of the image content, it has a minimum value at the viewport center and increases towards the viewport borders. Also, for non-conformal projections this distortion is not isotropic. At a larger scale, the geometric distortion suffered by an object is thus dependent on its position, shape, and area. The rectilinear projection keeps the straightness of lines, but stretches objects, mainly those close to the viewport borders; this effect is clearly visible in the People viewport, depicted at the top of Fig. 2. The stereographic projection preserves shapes locally, but it severely bends straight lines (fisheye effect), as can be seen in the Buildings viewport, depicted at the bottom of Fig. 2. In the Pannini projection, the projection center distance, d, is an adjustable parameter, allowing the projection to vary its main characteristics from rectilinear to quasi-stereographic. In this projection, vertical and radial lines are projected without bending, but horizontal lines are bent, as shown in the Buildings viewport. To reduce the bending of horizontal lines, vertical compression can be applied, at the expense of increased object stretching. A good balance between stretching and bending may be achieved by adjusting the two parameters, d and vc (the vertical compression factor). Thus, the Pannini projection has the advantage of preserving object shapes and the straightness of vertical lines better than the typical rectilinear and stereographic projections, being more suitable for viewport rendering with a large FoV; this justifies why it was the selected projection for the study described in this paper.
Since the geometric distortion, and its subjective impact, depends on the image content, selecting a proper viewport projection and its parameters may have an important role in the user's QoE. This requires the availability of a content-aware objective quality metric to automatically measure the perceived geometric distortion after viewport rendering. In Earth cartography, Tissot's indicatrices [10] have been used to assess the geometric distortions introduced by map projections. However, this metric is content independent, resulting in the same distortion measure, for the same sphere to plane projection, regardless of the image content.
As an example, the same distortion measure will be obtained for the People and Buildings rectilinear viewports in Fig. 2, although the stretching distortion in People is much more annoying than in Buildings. Moreover, the line bending, visible in Fig. 2 for Buildings with the stereographic and Pannini projections, cannot be estimated with Tissot's indicatrices. In the authors' previous work [11], several line bending and stretching measures were proposed; the former directly measure the resulting line curvature, and the latter are based on Tissot's indicatrices, improved with the integration of saliency weights that give more importance to the parts of the viewport image that attract the user's attention. Both bending and stretching measures were evaluated by correlating their values with perceptual scores obtained from subjective tests. However, while the bending metric showed to be well correlated with perceptual scores, the stretching measures achieved poor performance. In fact, Tissot's indicatrices are local measures and, even if weighted by saliency scores, they are not able to accurately capture the global object distortion.
In this paper, object-based stretching distortion metrics are proposed to automatically assess the subjective impact of the viewport stretching distortion in 360° image rendering. The new metrics overcome the shortcomings of the stretching distortion measures proposed in [11] by measuring, through semantic segmentation, the distortion of individual objects having semantic meaning (such as people, cars, or furniture). A procedure that integrates one of the new metrics with a line bending measure is also proposed, to automatically tune the Pannini projection parameters, d and vc, according to the viewport content. In this context, the main contributions of this paper can be summarized as follows:
• A web-based crowdsourcing subjective evaluation of viewport images, rendered with the Pannini projection, was conducted, to assess the perceptual impact of the stretching distortion. The resulting viewport images and perceptual scores were made available in [12].
• Several object-based stretching distortion metrics are proposed, which were evaluated using the subjective test results. These metrics measure the stretching distortion of viewport objects with semantic meaning, where the stretching distortion has a higher perceptual impact.
• A procedure to automatically tune the Pannini projection parameters, d and vc, according to the image content, is proposed. It integrates one of the new object-based stretching distortion metrics, showing that it can be exploited to minimize, in a perceptual way, the geometric distortions resulting from the rendering process.
The rest of this paper is organized as follows: Section II reviews related works; Section III presents the results of the subjective test campaign to assess the stretching distortion perceptual impact, and the corresponding analysis; Section IV details the proposed object-based stretching distortion metrics. Section V presents the metrics evaluation results. Section VI describes and assesses a potential application for the proposed metric: the automatic tuning of the Pannini projection parameters for a given viewport. Finally, Section VII concludes the paper.

II. RELATED WORK
In cartography, several map projections have been used to flatten the Earth's surface into a plane. Any sphere to plane projection distorts the spherical surface in different ways, changing its geometric properties such as distance, direction, shape, or area; no projection can preserve all these properties simultaneously. The Tissot's indicatrix (or ellipse of distortion) [10] is the geometric distortion metric most used in cartography. This ellipse is obtained by projecting, onto the map, an infinitely small circle defined on the sphere; the scale, area, and angular deformations are defined using the relationship between the semi-major and semi-minor ellipse axes. However, this metric has some drawbacks, notably: i) it is content independent; ii) it is a local metric, thus global distortions, e.g., bending of straight lines, cannot be directly measured with it.
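Since the Tissot indicatrix underpins the metrics discussed later, a small numerical sketch may help: it estimates the indicatrix semi-axes for any sphere to plane projection from a finite-difference Jacobian (the function names and the equirectangular example are illustrative, not taken from this paper):

```python
import math

def tissot_axes(project, lon, lat, eps=1e-6):
    """Semi-axes (a, b) of the Tissot indicatrix of `project` at (lon, lat).

    `project(lon, lat) -> (x, y)` is any sphere-to-plane projection; the
    derivatives are taken numerically. On the unit sphere, a longitude step
    d_lon covers a distance cos(lat)*d_lon, hence the 1/cos(lat) scaling of
    the longitude column of the Jacobian.
    """
    x0, y0 = project(lon, lat)
    x1, y1 = project(lon + eps, lat)
    x2, y2 = project(lon, lat + eps)
    # Jacobian columns scaled to "per unit distance on the sphere".
    a11 = (x1 - x0) / eps / math.cos(lat); a12 = (x2 - x0) / eps
    a21 = (y1 - y0) / eps / math.cos(lat); a22 = (y2 - y0) / eps
    # Singular values of the 2x2 Jacobian = indicatrix semi-axes.
    s = a11**2 + a12**2 + a21**2 + a22**2
    det = a11 * a22 - a12 * a21
    root = math.sqrt(max(s * s - 4 * det * det, 0.0))
    return math.sqrt((s + root) / 2), math.sqrt(max((s - root) / 2, 0.0))

# Equirectangular projection: distortion-free only at the equator.
eri = lambda lon, lat: (lon, lat)
a, b = tissot_axes(eri, 0.0, math.radians(60))
# a/b is about 2 at 60° latitude: circles stretch east-west by 1/cos(60°).
```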
More recently, a few geometric distortion measures were proposed in the context of content-aware projections for wide-angle images, seeking to minimize the resulting geometric distortions. In [13], the projection was locally, and iteratively, adapted to the image content based on local conformality and line straightness measures. The conformality measure was computed based on the Cauchy-Riemann equations [14], and the line straightness measure was computed from the geometry of straight lines. However, these measures were not validated with respect to perceived geometric distortion. Moreover, the straight lines present in the scene need to be identified manually by the users. In [8], the same line straightness and conformality measures proposed in [13] were used to optimize the Pannini projection parameters. The optimized Pannini and a set of benchmark projections were subjectively evaluated in a crowdsourcing subjective test. However, the test results showed that, on average, the Pannini projection with fixed parameters achieved higher quality scores than the optimized Pannini. Also, as in [13], there was no procedure defined to validate the distortion measures with respect to perceived distortion. In [15], a content-aware projection was proposed to minimize the stretching distortion of human faces in wide-angle photos taken with a mobile device, with a FoV of up to 120°. To preserve the human faces, a face stretching measure was proposed. Moreover, a line constraint measure was defined over the regions between the faces and the background to preserve the straightness of the lines. The proposed projection was locally adapted to the stereographic projection on facial regions and evolved to the rectilinear projection over the background. As in [8] and [13], the distortion measures were not validated with respect to perceived distortion. Also, the stretching measure was designed specifically for faces, not for general objects.
However, correcting the human faces without the rest of the body may create additional artifacts. Furthermore, the distortions of other objects in the scene were not considered.
In the authors' previous work [11], several stretching and line bending measures were proposed to measure the geometric distortion in viewport rendering of 360° images. The proposed stretching distortion measures were based on Tissot's indicatrices but, to make them content dependent, saliency weights were integrated to give more importance to the parts of the viewport image that most attract the user's attention. Both bending and stretching measures were validated with respect to perceived geometric distortion, by correlating their values with the perceptual scores obtained from a subjective test campaign; while the bending metrics showed to be well correlated with those scores, the stretching measures achieved low performance.
Since the human visual system is highly sensitive to the stretching distortion of the objects present in the image, in this paper several object-based stretching distortion metrics are proposed, seeking to overcome the shortcomings of previously proposed metrics. This allows a better optimization (in a perceptual sense) of the sphere to plane projection, seeking to improve the perceived quality of the rendered viewport.

III. SUBJECTIVE ASSESSMENT OF THE STRETCHING DISTORTION EFFECT
This section describes a crowdsourcing-based subjective evaluation of viewport images, aiming to assess the perceptual impact of the stretching distortion; the viewport images were rendered using the Pannini projection which, for a particular choice of its projection parameters, results in a pure rectilinear projection, for which the stretching effect (also known as perspective distortion) is most evident. For the sake of completeness, the viewport rendering process using the Pannini projection is first described; the 360° image dataset used, the subjective evaluation methodology, and the final subjective test results and analysis are then successively presented.

A. VIEWPORT RENDERING WITH PANNINI PROJECTION
The Pannini projection (PP) [16] uses a cylindrical surface as an intermediary surface to project a point from the viewing sphere to a plane tangent to the sphere, as depicted in Fig. 3. This procedure consists of two steps: i) rectilinear projection from the spherical surface to the cylindrical surface (red lines in Fig. 3); ii) perspective projection from the cylindrical surface to the plane (blue lines in Fig. 3), with the projection center at a distance d from the cylinder center. For d = 0, the PP becomes the rectilinear projection; increasing d gradually expands the middle of the resulting image, making it closer to the viewer and reducing the perspective distortion, but horizontal lines become progressively bent. To reduce this effect, vertical compression can be applied, at the expense of bending the radial lines and/or increasing the stretching of the objects. The vertical compression strength can be adjusted manually, seeking the best compromise between the bending of horizontal and radial lines and the stretching of objects. The forward projection equations for the Pannini projection are given in [16], where (φ, θ) denote, respectively, the longitude and latitude coordinates of a point on the viewing sphere, vc is the vertical compression factor, and (x_p, y_p) are the Cartesian coordinates of the projected point; these coordinates have their origin at the center of the plane, as depicted in Fig. 3.
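The two steps above can be sketched in a few lines. The snippet follows the standard Pannini formulation from the literature, without vertical compression (vc is omitted here for simplicity); the function name and parameterization are illustrative, not copied from the paper:

```python
import math

def pannini_forward(lon, lat, d=0.5):
    """Forward Pannini projection of the sphere point (lon, lat), in radians.

    Standard form (no vertical compression): the perspective center sits at
    distance d behind the cylinder center. For d = 0 this reduces to the
    rectilinear projection; larger d expands the middle of the image.
    """
    s = (d + 1.0) / (d + math.cos(lon))   # perspective scale factor
    x = s * math.sin(lon)                 # projection of the cylinder point
    y = s * math.tan(lat)
    return x, y

# Sanity check: with d = 0 the result matches the rectilinear projection,
# for which x = tan(lon) and y = tan(lat)/cos(lon).
x0, y0 = pannini_forward(math.radians(30), math.radians(10), d=0.0)
```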
Inversely, for any point on the plane with Cartesian coordinates (x_p, y_p), the backward projection equations are also given in [16]. For the front viewport, corresponding to a viewing direction (φ, θ) = (0°, 0°), a pixel with coordinates (m, n) (with this coordinate system centered on the top-left corner of the viewport plane) is rendered by first mapping the pixel coordinates to plane coordinates and then applying the backward projection equations. Here, W_vp and H_vp are, respectively, the viewport width and height, in pixels, and V_hs and V_vs are, respectively, the horizontal and vertical viewport sizes, in length units, determined by the horizontal and vertical FoV, F_h and F_v. For a given F_h and viewport aspect ratio, AR = V_hs / V_vs, F_v can be obtained accordingly. The rendering of a viewport corresponding to a generic viewing direction, (φ_VD, θ_VD), requires an additional step, relating the corresponding sphere coordinates, (X′, Y′, Z′), with those obtained for the front viewport, (X, Y, Z): (X′, Y′, Z′) = R (X, Y, Z), where R is the viewport rotation matrix, defined according to (φ_VD, θ_VD).
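The rotation step can be sketched as follows; the yaw-then-pitch ordering and the direction convention (x, y, z) = (cos θ sin φ, sin θ, cos θ cos φ) are assumptions made for illustration, since R is only defined up to convention in the text:

```python
import math

def rotation(lon_vd, lat_vd):
    """Rotation matrix taking front-viewport directions to the viewing
    direction (lon_vd, lat_vd): a pitch by lat_vd followed by a yaw by
    lon_vd (one common convention; the paper's R may order axes differently).
    """
    cl, sl = math.cos(lon_vd), math.sin(lon_vd)
    ct, st = math.cos(lat_vd), math.sin(lat_vd)
    yaw = [[cl, 0, sl], [0, 1, 0], [-sl, 0, cl]]
    pitch = [[1, 0, 0], [0, ct, st], [0, -st, ct]]
    # Matrix product R = yaw * pitch.
    return [[sum(yaw[i][k] * pitch[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(R, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

# The front direction (0, 0, 1) is mapped onto the requested viewing direction.
R = rotation(math.radians(45), math.radians(30))
fwd = rotate(R, [0.0, 0.0, 1.0])
```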

B. DATASET
Ten omnidirectional images in equirectangular format (ERI), extracted from the datasets available in [8] and [17], were used in the subjective assessment. The images, and their spatial resolutions, are depicted in Fig. 4. The selected images contain indoor and outdoor scenes, and include objects both near to and far away from the camera. For each image, three viewports were rendered, corresponding to three different viewing directions, with 70% overlap between successive directions. This allows comparing different levels of stretching distortion of the same objects, when these objects appear in different positions in the viewports. The viewports were rendered for the regions of the omnidirectional image to which the users' attention is most often attracted, using the true saliency maps available for the images taken from [17], and the attention-related model proposed in [18] for the images taken from [8].
For each viewing direction, two viewports were rendered with the Pannini projection (PP), using different parameter values: PP1 (d = 0, vc = 0) and PP2 (d = 0.5, vc = 0). Thus, for each image, six viewports were produced, denoted VP_i, i = 1, 2, ..., 6, where VP1, VP2, VP3 correspond to PP1, and VP4, VP5, VP6 correspond to PP2. PP1, which is a rectilinear projection, was selected since it is often used for viewport rendering of omnidirectional images and results in strong object stretching. PP2 was included to obtain, for the same viewing directions, different levels of object stretching distortion. The viewports were rendered with an F_h of either 110° or 115° (presented for each ERI at the bottom of Table 1), and with a spatial resolution of 856 × 856 pixels (AR = 1); besides being recommended in [19] for subjective tests, this resolution allows the simultaneous display of two viewports, side by side, on typical monitors. The F_h of 110° was selected based on the study in [5]; moreover, it is a moderate size often used in HMD displays. As shown in Table 1, for some of the images an F_h of 115° was used to guarantee that the main objects were not cut by the image borders.

C. SUBJECTIVE EVALUATION METHOD
In this work, pairwise comparison (PC) was chosen as the subjective evaluation method. PC has become very popular for image and video quality assessment [20]–[22], particularly for the evaluation of rendering methods; in this method, the viewer is asked to observe a pair of rendered viewports, shown side by side, and to select the one that, in their opinion, has the higher quality.
As depicted in Fig. 5, six comparisons were made per 360° image: a complete set of comparisons between the viewports rendered with PP1, and three additional comparisons between viewports rendered with PP1 and PP2 and having the same viewing direction. The comparisons between viewports rendered with PP2 were not considered, to limit the test duration; furthermore, these viewports have similar levels of stretching distortion. Also, since for some of the 360° images and viewing directions the bending distortion was visible in viewports rendered with PP2, these were excluded from the subjective test. Table 1 presents the selected pairs and the horizontal field of view, F_h, for each 360° image.
In total, 45 viewport pairs were considered. A web-based crowdsourcing interface was designed to display the stimuli and to collect the PC scores; it presents two viewport images, 'A' and 'B', side by side, as shown in Fig. 6, and requires a monitor with a minimum resolution of 1920×1080 pixels and a minimum diagonal size of 13 inches. To participate in the test, an invitation email with detailed instructions was sent to several observers; the observers were asked not to perform the test if they did not have a monitor with the aforementioned characteristics. Before starting the subjective test, the observers were asked to: i) open the subjective test interface in their web browser and put it in fullscreen mode; ii) type their name and age at the top of the page; iii) select their monitor size in inches. The instructions about the subjective test procedure were shown on the same page. Subsequently, to familiarize the observer with the characteristics of the stretching distortion and with the evaluation interface, a short training video was shown.
The viewports used in the training video were not used for the actual test. During the test, the viewport pairs were shown in random order and position, and the observers were asked to judge which viewport image ('A' or 'B') had the best quality. To avoid random preference selections, the option 'A = B' was also included. A total of 32 subjects, aged between 21 and 58 years, from Instituto Superior Técnico (IST), performed the online subjective evaluation. The omnidirectional images, the rendered viewports, and the resulting PC subjective scores were made publicly available in [12].

D. SUBJECTIVE TESTS RESULTS AND ANALYSIS
Outliers were first detected by computing the transitivity satisfaction rate (TSR) [23]; transitivity is violated when a circular triad is formed among three stimuli, VP1, VP2, and VP3 (e.g., VP1 preferred over VP2, VP2 preferred over VP3, and VP3 preferred over VP1). The score reliability, TSR_o, of observer o was computed as TSR_o = 1 − d_o / h_o, where d_o is the number of detected circular triads and h_o is the total number of possible circular triads for an observer.
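A brute-force check for circular triads, and the TSR it yields, can be sketched as below (a minimal illustration with a 3-stimulus preference matrix; the function and variable names are not taken from the paper):

```python
from itertools import combinations

def transitivity_satisfaction_rate(prefs, n):
    """TSR of one observer given a preference matrix over n stimuli.

    prefs[i][j] is nonzero when stimulus i was preferred over j (ties may be
    split as 0.5 beforehand). A triad (i, j, k) is circular when i beats j,
    j beats k, and k beats i, in either rotation direction.
    """
    triads = list(combinations(range(n), 3))
    circular = 0
    for i, j, k in triads:
        if (prefs[i][j] and prefs[j][k] and prefs[k][i]) or \
           (prefs[j][i] and prefs[k][j] and prefs[i][k]):
            circular += 1
    return 1.0 - circular / len(triads)

# Transitive judgments (0 > 1 > 2): no circular triad, so TSR = 1.
consistent = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
# Circular judgments (0 > 1, 1 > 2, 2 > 0): the single triad is circular.
circular_prefs = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```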
If TSR_o < 0.8, observer o is considered an outlier. Four outliers were detected, and their subjective scores were not further considered. Next, for each compared viewport pair, (VP_i, VP_j), the winning frequency, w_ij, which represents the number of times VP_i was preferred over VP_j, was computed; to solve the tie cases, a score of 0.5 was given to each viewport whenever the observer had chosen the option 'A = B'; also, w_ii = 0. The probability of selecting VP_i against VP_j is given by P(VP_i > VP_j) = w_ij / O, where O is the number of observers. To determine whether the difference in the number of times VP_i was preferred over VP_j (and vice versa) is statistically significant, a statistical hypothesis test was performed according to the procedure suggested in [20]. After solving the tie cases, the PC scores roughly follow a binomial process B(O, p), where O is the number of subjects and p is the probability of success in a Bernoulli trial. Fig. 7 depicts the cumulative distribution function (CDF) for a binomial distribution with O = 28 (the final number of observers, after outlier removal) and p = 0.5, as suggested in [20], meaning that when comparing VP_i and VP_j both have the same chance of being selected. The CDF of the binomial distribution can be expressed as [24]: F(k; O, p) = Σ_{i=0}^{⌊k⌋} C(O, i) p^i (1 − p)^(O−i), where k is the number of times VP_i was selected over VP_j, p is the probability of selecting VP_i over VP_j in a Bernoulli trial, and ⌊·⌋ is the floor operator. The resulting CDF value is the probability that the observers select VP_i over VP_j at most k times. The critical region for the statistical test is obtained from the CDF. To find out whether the number of times VP_i was preferred over VP_j is statistically significant, thus allowing to conclude that ''VP_i is better than VP_j'', a one-tailed binomial test was performed with a significance level of 0.05, with the following hypotheses: H0 (VP_i is equal to or worse than VP_j); H1 (VP_i is better than VP_j). In Fig. 7, the CDF has values above the probability of 0.95 for k ≥ 18 (F(18; 28, 0.5) = 0.9564). Therefore, if k ≥ 18, the null hypothesis (H0) can be rejected. A similar statistical test was applied to find out if the number of times VP_j was preferred over VP_i is statistically significant, thus allowing to conclude that ''VP_i is worse than VP_j'', with the following hypotheses: H0 (VP_i is equal to or better than VP_j); H1 (VP_i is worse than VP_j). In Fig. 7, the CDF has values below the probability of 0.05 for k ≤ 9 (F(9; 28, 0.5) = 0.0436). Therefore, if k ≤ 9, the null hypothesis (H0) can be rejected. Note that the binomial distribution is defined only for integer counts, and non-integer values need to be rounded; the floor function is used for this purpose. Fig. 8 presents the probability of selecting VP_i over VP_j for each compared pair, and for all considered 360° images. The horizontal blue dashed line corresponds to the case where the vote count for VP_i is equal to, or greater than, 18, i.e., P(VP_i > VP_j) = 18/28 = 0.643. The horizontal red dashed line corresponds to the case where the vote count for VP_i is equal to, or less than, 9, i.e., P(VP_i > VP_j) = 9/28 = 0.321. Values on or above the horizontal blue dashed line, and on or below the horizontal red dashed line, correspond to the cases where the difference in the votes between VP_i and VP_j is statistically significant; in the first case, VP_i has a higher perceived quality than VP_j; in the second case, VP_i has a lower perceived quality than VP_j. The values between the two horizontal dashed lines correspond to cases where the difference in the votes between VP_i and VP_j is not statistically significant.
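The decision thresholds above can be reproduced with a few lines (a sketch; `binom_cdf` is an illustrative helper, not code from the paper):

```python
from math import comb, floor

def binom_cdf(k, n=28, p=0.5):
    """F(k; n, p): probability of at most floor(k) successes in n trials."""
    k = floor(k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# With 28 observers and p = 0.5:
#   F(18) ≈ 0.9564 > 0.95  ->  k >= 18 votes: conclude "VP_i better than VP_j"
#   F(9)  ≈ 0.0436 < 0.05  ->  k <= 9 votes:  conclude "VP_i worse than VP_j"
upper, lower = binom_cdf(18), binom_cdf(9)
```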
From the experimental results, the following conclusions can be obtained:
• As can be observed in Fig. 8, the viewports rendered with PP2 (d = 0.5, vc = 0) were selected over those rendered with PP1 (d = 0, vc = 0) for most of the considered images (values inside the shaded area in Fig. 8). This was expected, since the rectilinear projection has a strong stretching effect, and this stretching decreases as the value of d increases.
• For the viewports rendered with PP1, the ones preferred by the subjects are strongly dependent on the position of the main objects, because the stretching distortion has a higher perceptual impact when the objects are close to the image borders and/or close to the camera. As an example, Fig. 9 shows three pairs of viewports rendered with PP1, together with the subject selections; due to the difference in viewing direction, the same objects are rendered in different positions of a viewport pair, having different perspective distortion.
• Human perception is very sensitive to the stretching distortion of the human body, which may justify why, as shown in Fig. 9, most of the observers selected VP2 over VP3 for Outd2, and VP2 over VP1 for Ind2, while for Outd1 there is no clear choice between VP2 and VP3.
• The stretching distortion of the background, e.g., sky, floor, building walls, has a lower perceptual impact than the stretching of foreground objects.
Finally, since the subjects used different monitor sizes, the impact of this factor on the subjective results was assessed. For that, the observers were divided into two groups: O_g1, containing the observers with a monitor size in the range [13, 16] inches, and O_g2, containing the observers with a monitor size in the range [22, 27] inches. Then, a paired sample T-test with a significance level of 0.05 was applied, to compare the preference probability, P(VP_i > VP_j), between groups. This procedure is illustrated in Fig. 10. The T-test results indicate that the null hypothesis, i.e., that the two groups have similar means, cannot be rejected, since the resulting p-value, 0.57, is much higher than the significance level of 0.05; this confirms that the difference between the subjective results of the two groups is not statistically significant.
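The group comparison can be sketched as below; the preference-probability values are made-up placeholders (the p-value of 0.57 reported above comes from the paper's own data), and only the t statistic is computed here:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for matched scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Hypothetical per-pair P(VP_i > VP_j) values for the two monitor-size groups.
g1 = [0.64, 0.50, 0.71, 0.46, 0.57]
g2 = [0.61, 0.54, 0.68, 0.50, 0.55]
t, dof = paired_t(g1, g2)
# |t| below the two-tailed 0.05 critical value (2.776 for dof = 4):
# the null hypothesis of equal group means cannot be rejected.
```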

IV. OBJECT-BASED STRETCHING DISTORTION MEASUREMENT
This section describes two new approaches for measuring the object shape distortion in viewport rendering of 360° images. The first directly computes and compares object shape measures on the sphere and on the viewport, thus before and after rendering, while the second is based on the Tissot indicatrices [10], which are computed for individual objects in the rendered viewport. As depicted in Fig. 11, the process starts with the semantic segmentation of the 360° image, in equirectangular format (ERI), producing a segmentation map denoted ERI_seg. Afterwards, for a required viewport horizontal field of view, F_h, spatial resolution (W_vp, H_vp), and viewing direction (φ_VD, θ_VD), the viewport rendering process is applied to ERI_seg, resulting in the viewport segmentation map. For the rendering, the Pannini projection, with parameters (d, vc), is used. The object distortion is then computed using the two aforementioned approaches. The main steps are described in the following sections.

A. SEMANTIC SEGMENTATION
Semantic segmentation is the process of assigning a label (e.g., person, car, bicycle, and so on) to the objects in an image; in this process, multiple objects of the same class receive the same label. It has been used in many computer vision tasks on 2D images. Although some semantic segmentation models have been developed for 360° images (e.g., [25]-[28]), they were designed for autonomous driving, with outdoor images. In this paper, to obtain the semantic segmentation of both indoor and outdoor 360° images, the input equirectangular image (ERI) is transformed to the cubic format, which results in six 2D, rectilinearly projected images (the cube faces), with horizontal and vertical FoVs of 90°. The Auto-DeepLab semantic segmentation model, proposed in [29], is then applied to each cube face. This model is a deep learning-based approach designed for semantic segmentation of 2D images, which was trained, validated, and tested on several datasets that include indoor and outdoor scenes, and has high accuracy. In this work, Auto-DeepLab was used with multi-scale inference and the Xception-65 network backbone, pre-trained on the ImageNet [30] and MS-COCO [31] datasets. The training was performed on the PASCAL VOC 2012 dataset [32], which contains 20 foreground object classes and one background class. As described in [29], the training used a polynomial learning rate with an initial value of 0.05 and a crop size of 513 × 513 pixels; batch normalization parameters were fine-tuned during training. After obtaining the semantic segmentation of all six cube face images, the result is transformed back to the equirectangular format. As an example, Fig. 12-a) depicts the semantic segmentation of Ind2, using Auto-DeepLab. As already mentioned, multiple objects of the same class have the same label. To obtain different labels for disconnected objects, connected component analysis (CCA) [33], with 4-connectivity, is applied to the segmented ERI. Fig. 12-b) depicts the resulting ERI segmentation map after CCA, where each object is represented with a different color.
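The relabelling of disconnected same-class regions can be sketched as follows; this is a minimal pure-Python 4-connectivity labelling, not the implementation of [33], and the function name and the 0-as-background convention are assumptions:

```python
from collections import deque

def connected_components(seg):
    """4-connectivity connected component analysis: relabel a semantic
    segmentation map so that disconnected objects of the same class get
    distinct instance labels. seg is a 2D list of class labels, with 0
    assumed to be the background class."""
    H, W = len(seg), len(seg[0])
    out = [[0] * W for _ in range(H)]
    next_label = 0
    for r in range(H):
        for c in range(W):
            if seg[r][c] != 0 and out[r][c] == 0:
                next_label += 1                  # start a new instance
                out[r][c] = next_label
                q = deque([(r, c)])
                while q:                         # BFS flood fill
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and seg[ny][nx] == seg[y][x]
                                and out[ny][nx] == 0):
                            out[ny][nx] = next_label
                            q.append((ny, nx))
    return out
```

Two same-class regions separated by background thus receive different instance labels, as needed for the per-object measurements of the following subsections.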

B. OBJECT SHAPE MEASUREMENT
The object shape distortion measure can be obtained by relating the object shape on the viewing sphere and on the viewport. Several object shape measures have been proposed in the literature [34], [35]. Since sphere-to-plane projections typically alter the area of the objects, or stretch objects in the horizontal and/or vertical directions towards the viewport borders (cf. Figs. 2 and 6), three shape measures were considered: area, average width, and average height. In cartography, these measures showed good performance when used to characterize the distortion of continents and countries for different map projections [36], which justifies why they were chosen for the study described in this paper.
After semantic segmentation of the ERI image, it is possible to obtain the semantic segmentation map of any viewport by projecting ERI_seg; this allows obtaining the objects in the viewport, Obj_vp, linked to the same objects on the viewing sphere, Obj_s. Fig. 13-a) depicts an example of a viewport from Ind2 and its segmentation map, with three objects (Fig. 13-b)), obtained by projecting ERI_seg; the same objects are also identified in ERI_seg (Fig. 13-c)). The following object shape measures were considered:
• Object area: On the sphere, the object area can be computed by summing up the area covered by the parallel lines (defined as sequences of pixels) within the object. At latitude θ, the parallel line area contained in the object, PLA_s(θ), is given by:

PLA_s(θ) = PA_s(θ) · N_ERI^PL(θ), (16)

where PA_s(θ) is the area covered by a pixel at latitude θ, and N_ERI^PL(θ) is the total number of pixels within the object at latitude θ. PA_s(θ) can be approximated by:

PA_s(θ) ≈ (2π / W_ERI) · (π / H_ERI) · cos(θ) = Δφ · Δθ · cos(θ), (17)

where W_ERI and H_ERI are, respectively, the width and height of the ERI image, in pixels, Δφ · Δθ is the area covered by a pixel in the ERI image, and cos(θ) reflects the decrease of the area (on the sphere) comprised by Δφ, Δθ, as θ varies from 0 to ±90 degrees. The object area, on the sphere, is computed by:

OA_s = Σ_{k=1}^{K_ERI^PL} PLA_s(θ_k), (18)

where K_ERI^PL is the total number of parallel lines covered by the object, k = 1 ... K_ERI^PL is the index of those lines, and θ_k is the latitude of the k-th parallel line.
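As an illustration of the per-parallel accumulation described above, the object area on the sphere can be approximated directly from a binary ERI object mask; the sketch below (the function name and the pixel-centre latitude convention are assumptions) accumulates, row by row, the number of object pixels times the per-pixel solid angle:

```python
import math

def object_area_sphere(mask, W_eri, H_eri):
    """Approximate an object's area on the unit sphere from its binary
    ERI mask (2D list of 0/1, H_eri rows by W_eri columns). Each ERI
    pixel covers approximately dphi * dtheta * cos(theta), with theta
    the latitude at the pixel centre."""
    dphi = 2 * math.pi / W_eri       # angular width of an ERI pixel
    dtheta = math.pi / H_eri         # angular height of an ERI pixel
    area = 0.0
    for r in range(H_eri):
        theta = math.pi / 2 - (r + 0.5) * dtheta  # latitude of row r
        n_pix = sum(mask[r])                      # object pixels on this parallel
        area += n_pix * dphi * dtheta * math.cos(theta)
    return area
```

A convenient sanity check: a mask covering the whole ERI yields a value close to the full-sphere area of 4π.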
The area covered by a pixel on the viewport, PA_vp, is given by:

PA_vp = (V_hs · V_vs) / (W_vp · H_vp), (19)

where V_hs and V_vs are, respectively, the viewport width and height, in length units, given by (10), and W_vp and H_vp are the viewport width and height, in pixels. The object area in the viewport is computed by:

OA_vp = PA_vp · N_vp^obj, (20)

where N_vp^obj is the total number of pixels within the object.
• Object average width: On the sphere, and at latitude θ, the width of the object, OW_s(θ), is the length of the parallel line at θ covered by the object. Since, in the discrete domain, each parallel corresponds to a line of the ERI image, OW_s(θ) can be computed as:

OW_s(θ) = N_ERI^PL(θ) · (2π / W_ERI), (21)

where N_ERI^PL(θ) is the total number of pixels within the object at latitude θ, and 2π/W_ERI is the width covered by a pixel in the ERI image. The object average width, on the sphere, is computed by:

OW_s = (1 / K_ERI^PL) · Σ_{k=1}^{K_ERI^PL} OW_s(θ_k). (22)

On the viewport, the width of the object at line i can be computed as:

OW_vp(i) = N_vp^l(i) · (V_hs / W_vp), (23)

where N_vp^l(i) is the total number of pixels covered by the object at line i, and V_hs/W_vp is the width covered by a pixel in the viewport image. The object average width, on the viewport, is given by:

OW_vp = (1 / K_vp^l) · Σ_i OW_vp(i), (24)

with the summation applied to the viewport lines covered by the object, and K_vp^l being the total number of those lines.
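The average-width computation has the same shape on the sphere and on the viewport, differing only in the per-pixel width; a hypothetical helper (pass 2π/W_ERI for the sphere, or V_hs/W_vp for the viewport):

```python
def avg_object_width(mask, pixel_width):
    """Average object width: per row, (#object pixels) * pixel_width,
    averaged over the rows the object actually covers. mask is a 2D
    list of 0/1; pixel_width is the width covered by one pixel
    (2*pi/W_ERI on the sphere, V_hs/W_vp on the viewport)."""
    widths = [sum(row) * pixel_width for row in mask if any(row)]
    return sum(widths) / len(widths) if widths else 0.0
```

Note that rows not touched by the object are excluded from the average, matching the "lines covered by the object" condition above.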
• Object average height: On the sphere, at longitude φ, the object height is the length of the meridian line (ML) at φ (which corresponds to a column of the ERI image) covered by the object:

OH_s(φ) = N_ERI^ML(φ) · (π / H_ERI), (25)

where N_ERI^ML(φ) is the total number of pixels within the object at longitude φ, and π/H_ERI is the height covered by a pixel in the ERI image. The object average height, on the sphere, is given by:

OH_s = (1 / K_ERI^ML) · Σ_{k=1}^{K_ERI^ML} OH_s(φ_k), (26)

where K_ERI^ML is the total number of meridian lines covered by the object, k = 1 ... K_ERI^ML is the index of those lines, and φ_k is the longitude of the k-th meridian line.
On the viewport, the height of the object at viewport column j can be computed as:

OH_vp(j) = N_vp^c(j) · (V_vs / H_vp), (27)

where N_vp^c(j) is the total number of pixels covered by the object at column j, and V_vs/H_vp is the height covered by a pixel in the viewport image. The object average height, on the viewport, is given by:

OH_vp = (1 / K_vp^c) · Σ_j OH_vp(j), (28)

with the summation applied to the viewport columns covered by the object, and K_vp^c being the total number of those columns. It is important to note that all the shape measurements are in length units, and are obtained only for the objects (or parts of objects) that are rendered on the viewport. As an example, only the parts of objects 1 and 3 that can be seen in Fig. 13-b) were used for the shape measures. Table 2 presents the resulting OA, OW, and OH values, on the sphere and on the viewport, for the three objects of Fig. 13-b). All the measures increase after projection, especially for the objects closer to the viewport borders.
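The average height follows the same pattern over columns; again a hypothetical helper, with π/H_ERI (sphere) or V_vs/H_vp (viewport) as the per-pixel height:

```python
def avg_object_height(mask, pixel_height):
    """Average object height: per column, (#object pixels) * pixel_height,
    averaged over the columns the object covers. mask is a 2D list of
    0/1; pixel_height is pi/H_ERI on the sphere or V_vs/H_vp on the
    viewport."""
    H, W = len(mask), len(mask[0])
    heights = []
    for c in range(W):
        n = sum(mask[r][c] for r in range(H))  # object pixels in column c
        if n:
            heights.append(n * pixel_height)
    return sum(heights) / len(heights) if heights else 0.0
```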

C. SHAPE DISTORTION COMPUTATION
Based on the object shape measures previously presented, the following object shape distortion metrics are defined:
• Area distortion: For each object in the viewport, the area distortion is expressed by:

OAD = |OA_vp − OA_s|, (29)

where OA_s and OA_vp are computed by (18) and (20), respectively.
• Width distortion: The object width distortion is given by:

OWD = |OW_vp − OW_s|, (30)

where OW_s and OW_vp are given by (22) and (24), respectively. This measure characterizes the horizontal stretching of the object.
• Height distortion: The object height distortion is computed by:

OHD = |OH_vp − OH_s|, (31)

where OH_s and OH_vp are computed by (26) and (28), respectively. This measure characterizes the vertical stretching of the object.
• Total length distortion: The total length distortion of an object, OLD, combines the width and height distortions, OWD and OHD, computed by (30) and (31), respectively.
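These per-object distortions can be sketched as below; the absolute-difference form follows the text, but the exact combination used for the total length distortion is not reproduced here, so the sum of OWD and OHD is an assumption:

```python
def shape_distortions(oa_s, oa_vp, ow_s, ow_vp, oh_s, oh_vp):
    """Per-object shape distortions as absolute differences between the
    sphere and viewport measures. The OLD combination (sum) is an
    assumption; the paper only states that it combines OWD and OHD."""
    oad = abs(oa_vp - oa_s)   # area distortion
    owd = abs(ow_vp - ow_s)   # width distortion (horizontal stretching)
    ohd = abs(oh_vp - oh_s)   # height distortion (vertical stretching)
    old = owd + ohd           # total length distortion (assumed sum)
    return oad, owd, ohd, old
```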
It is important to mention that, besides the absolute differences expressed by (29) to (31), the relative difference was also considered, but did not improve the performance of the metric. To obtain a global viewport stretching distortion measure, several pooling functions were considered to aggregate the shape distortion measures of all detected objects in the viewport. The considered pooling functions are listed in Table 3, where D is a vector containing one of the distortion measures for all objects in the viewport, D_p is a vector containing the p% highest elements of D, OA_vp is a vector containing the object areas on the viewport, and ⊙ denotes the element-wise product. Poolings PF_1 and PF_2 assume that the subjective impact of the distortion increases with the number of objects, while poolings PF_3 and PF_4 consider that the impact varies with the average object distortion; pooling PF_5 presumes that the perceptual impact is mainly influenced by the most distorted object, while pooling PF_6 considers the object area in the viewport, giving more emphasis to the distortion of large objects. The reason for the percentile (p%) is to exclude objects with low distortion values (e.g., the distortion of objects at the viewport center is low and may not be visible); as p% approaches 100%, PF_2 and PF_4 get closer to PF_5; as p% approaches 0%, PF_2 gets closer to PF_1 and PF_4 gets closer to PF_3. In summary, considering four shape distortion measures with six pooling functions results in 24 potential shape-based stretching measures.
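Since Table 3 is not reproduced here, the following sketch of PF_1 to PF_6 is an assumption consistent with the description above; in particular, taking D_p as the values above the p-th percentile reproduces the stated limit behaviours for p → 0% and p → 100%:

```python
def poolings(D, areas, p=50):
    """Hypothetical sketch of pooling functions PF1-PF6 (exact Table 3
    expressions are assumptions). D: per-object distortions; areas:
    per-object viewport areas; D_p: distortions above the p-th
    percentile, so p -> 100 keeps only the maximum (PF2, PF4 -> PF5)
    and p -> 0 keeps everything (PF2 -> PF1, PF4 -> PF3)."""
    top = sorted(D, reverse=True)
    n_keep = max(1, round(len(D) * (100 - p) / 100))
    Dp = top[:n_keep]
    pf1 = sum(D)                         # accumulated distortion
    pf2 = sum(Dp)                        # accumulated top distortions
    pf3 = sum(D) / len(D)                # average distortion
    pf4 = sum(Dp) / len(Dp)              # average of top distortions
    pf5 = max(D)                         # most distorted object
    pf6 = sum(d * a for d, a in zip(D, areas)) / sum(areas)  # area weighted
    return pf1, pf2, pf3, pf4, pf5, pf6
```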

D. TISSOT-BASED OBJECT DISTORTION COMPUTATION
The Tissot indicatrix [10] has been used for years by cartographers to evaluate and compare the distortion of different Earth map projections. This indicatrix is obtained by projecting, onto the map, an infinitely small circle defined on the sphere; the relationship between the major and minor axes of the resulting ellipse enables the computation of the local scale, area, and angle distortions at the projected point. Fig. 14 depicts an infinitesimal unit circle defined on the sphere, and its corresponding Tissot indicatrix after projection on the plane; â and b̂ are, respectively, the Tissot indicatrix semi-major and semi-minor axes; h and k correspond, respectively, to the scale factors along the projected meridian and parallel. In the authors' previous work [11], three viewport Tissot distortion measures, namely, area distortion (D_area), scale distortion (D_scale), and angle distortion (D_angle), were defined to measure the stretching distortion in the viewport rendering of 360° images, under the general perspective projection; to make these measures content dependent, saliency weights were used. In this paper, object-based Tissot distortion measures are proposed, obtained according to the two following steps:
1) Compute local Tissot distortion metrics: For a given horizontal and vertical field-of-view, F_h and F_v, the viewing area on the sphere is defined by φ ∈ [−F_h/2, F_h/2] and θ ∈ [−F_v/2, F_v/2]; this region is then uniformly sampled with a fixed interval Δφ, Δθ (set to 0.05 degrees in this paper). For each sampled point, indexed by i, with spherical coordinates (φ_i, θ_i), the corresponding Tissot scale factors, h_i and k_i, and the semi-major, â_i, and semi-minor, b̂_i, axes of the Tissot ellipse are obtained. The details about the computation of these parameters are presented in the appendix. Afterwards, the local area distortion, s_i, and local shape distortion, t_i, are computed as proposed in [11], by (39) and (40). Although the local angle distortion was also initially considered, it did not improve the results and was not retained for further assessment. Fig. 15 presents histogram plots of s, t, h, k for two objects of Fig. 13-b), namely Object 2 (close to the viewport center and with low distortion) and Object 3 (close to the border and with high distortion). As can be seen, s, t, h, k have a wider range of values (and a higher variance) for Object 3 than for Object 2.
2) Compute Tissot-based object distortion metrics: For each object in the viewport, the object-based Tissot area, shape, and scale distortion metrics, OAD^TO, OSHD^TO, and OSD^TO, are obtained by (41) to (43); the superscript TO denotes an object-based Tissot measure. Here, s and t are vectors containing, respectively, the local area and shape distortions for all points within an object, and h and k are vectors containing the scale factors for all points within an object. To obtain a single measure per object, the Variance and Average functions were considered in (41) to (43); the Variance function was selected, as it showed the best performance. To obtain a global viewport stretching distortion measure, the pooling functions presented in Table 3 of Section IV-C were used. In this case, using three Tissot-based object distortion measures with six pooling functions results in 18 potential Tissot-based stretching distortion metrics. Fig. 16 presents the resulting stretching distortion values for a subset of the proposed object-based stretching measures with pooling function PF_6, and for the three stretching measures previously proposed in [11] (D_area, D_scale, and D_angle), computed for a pair of viewports, Ind2-VP_1 and Ind2-VP_2 (depicted in Fig. 9-e) and Fig. 9-f)); the blue and orange bars correspond, respectively, to Ind2-VP_1 and Ind2-VP_2. As can be seen, the proposed object-based stretching measures allow a higher discrimination between the qualities of the two viewport images than the metrics proposed in [11], since for the former the difference between the blue and orange bars is much more evident.
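A sketch of the per-object Tissot measures using the selected Variance function; the exact way (43) combines the h and k vectors is not reproduced in the text, so the concatenation below is an assumption:

```python
def variance(v):
    """Population variance of a list of floats."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def tissot_object_metrics(s, t, h, k):
    """Object-based Tissot metrics using the Variance function selected
    in the paper. s, t: local area/shape distortions for all points in
    the object; h, k: the corresponding scale factors. The form of the
    scale distortion OSD^TO (variance over h and k jointly) is an
    assumption."""
    oad_to = variance(s)       # object-based Tissot area distortion
    oshd_to = variance(t)      # object-based Tissot shape distortion
    osd_to = variance(h + k)   # object-based Tissot scale distortion (assumed)
    return oad_to, oshd_to, osd_to
```

An object near the viewport center, whose local distortions are nearly constant, thus yields values close to zero, consistent with the narrower histograms of Object 2 in Fig. 15.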

V. METRICS EVALUATION
In this section, the proposed object-based stretching distortion metrics are evaluated and compared to benchmark solutions. The usual way to evaluate an objective quality metric is the correlation (Pearson and/or Spearman) between objective scores and ground-truth opinion scores (typically, MOS or DMOS). However, since in this work pairwise comparison (PC) was used in the subjective tests, it is not possible to obtain MOS or DMOS values for each individual stimulus. Accordingly, the proposed metrics are assessed against the PC scores using the classification errors approach, as suggested in Rec. ITU-T J.149 [37] and applied in the related literature [20], [21].

A. CLASSIFICATION ERRORS
According to Rec. ITU-T J.149 [37], a classification error (CE) occurs when the objective and subjective scores lead to different conclusions about the relative quality of a pair of stimuli, VP i and VP j . Three types of errors may happen: • False Tie (FT): when the subjective score indicates that VP i and VP j are different, but the objective score indicates that they are similar.
• False Differentiation (FD): when the subjective score indicates that VP i and VP j are similar, but the objective score indicates that they are different.
• False Ranking (FR): when the subjective score indicates that VP i (VP j ) is better than VP j (VP i ), but the objective score indicates the opposite.
Let OM represent the minimum difference between the objective quality scores of two stimuli for which the two stimuli become perceptually distinguishable. As OM increases, more stimuli pairs are considered similar, increasing the occurrence of FT, while the occurrences of FD and FR decrease. On the contrary, as OM decreases, the occurrence of FT also decreases, but the occurrences of FD and FR increase. Following ITU-T J.149, the percentages of each error type and of correct decisions are obtained from the considered stimuli pairs as a function of OM, for each individual metric; this allows comparing the metrics and determining the best one for the application under analysis. The best OM value is the one that maximizes the correct decision percentage [21], [37].
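The error counting can be sketched as follows, assuming each pair carries a ternary subjective verdict and that a higher objective score means better quality (for distortion metrics, where lower is better, the scores can simply be negated first):

```python
def classification_errors(pairs, om):
    """Count False Tie, False Differentiation, False Ranking and Correct
    Decision percentages, in the spirit of Rec. ITU-T J.149. Each pair
    is (subj, obj_i, obj_j): subj is +1 if VP_i is subjectively better
    than VP_j, -1 if worse, 0 if statistically similar; obj_i, obj_j
    are the objective scores (higher assumed better). om is the
    similarity threshold on the objective score difference."""
    ft = fd = fr = cd = 0
    for subj, oi, oj in pairs:
        d = oi - oj
        obj = 0 if abs(d) < om else (1 if d > 0 else -1)
        if subj == obj:
            cd += 1
        elif obj == 0:
            ft += 1   # subjectively different, objectively tied
        elif subj == 0:
            fd += 1   # subjectively similar, objectively different
        else:
            fr += 1   # opposite ranking
    n = len(pairs)
    return {name: 100.0 * v / n
            for name, v in zip(("FT", "FD", "FR", "CD"), (ft, fd, fr, cd))}
```

Sweeping om from 0 upwards and keeping the value that maximizes the "CD" entry reproduces the best-OM selection described above.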

B. EXPERIMENTAL RESULTS AND ANALYSIS
To evaluate the proposed distortion metrics, the viewport dataset described in Section III-B and the processed PC scores described in Section III-D were used. Moreover, the performance of the metrics was compared to the following benchmark solutions: the area distortion (D_area), scale distortion (D_scale), and angle distortion (D_angle) proposed in [11], the conformality measure (CM) proposed in [13], and the content-dependent conformality (CM_sal) proposed in [8]; the latter modifies CM by integrating the viewport saliency into it. For poolings PF_2 and PF_4, several values of p% were considered, and the resulting classification errors and correct decisions were obtained; the best performance was achieved for p = 50%. Table 4 reports the classification error and correct decision values for each proposed distortion measure, using the pooling functions described in Section IV-C, and for the benchmark solutions; the OM value that maximized the correct decision percentage was used. As can be seen, there is a significant performance improvement for the object-based metrics when compared with the benchmark solutions. Among the proposed metrics, the object-based Tissot metrics achieved the highest Correct Decision and the lowest False Tie percentages. Among the benchmark metrics, CM has the worst performance; this metric is content independent, and the same metric value is obtained for any viewport image. When the conformality integrates the saliency, as in CM_sal, the performance increases, confirming that saliency brings some additional value to the metric. To find the best solution among the proposed ones, the true positive rate (TPR), defined by (44), was computed; the benchmark metrics have TPR values below the value of 0.89 obtained for OAD^TO with PF_6, and also below the TPR values obtained for the other proposed metrics. Fig. 17 depicts the plots of classification errors and correct decisions for the selected metric, OAD^TO with PF_6, where the dashed line on the right-side plot indicates the OM value that maximizes the correct decision. When OM = 0 (i.e., all stimuli pairs are considered perceptually different by the objective metric), the correct decision percentage is 82%, which agrees with the results of the subjective test, where the difference was statistically significant for 82% of the pairs (cf. Fig. 8).
To evaluate the impact, on the metric performance, of considering the background, the background distortion was computed using the selected metric (OAD^TO with PF_6) and included in the metric as an additional measure; afterwards, the classification error and correct decision values with and without the background distortion were compared. Table 6 presents the resulting values, showing that the metric performance decreases when the background distortion is included. This is consistent with the fact that stretching in the background is not as visible as stretching of the foreground objects, and shows the advantage of an object-based stretching metric. Taking into account the evaluation results of the different metrics, the object-based Tissot area distortion (OAD^TO), with pooling function PF_6, is the one proposed to assess the subjective impact of the viewport stretching distortion in 360° image rendering.

VI. CONTENT-AWARE PANNINI PROJECTION
This section describes a useful application of the proposed stretching distortion metric: obtaining the projection parameters (d, vc) that are optimal, in a perceived quality sense, for the viewport rendering of a 360° image using the general Pannini projection, resulting in a content-aware Pannini projection (CAP). As mentioned in Section I, stretching of objects and bending of straight lines are the two main artifacts that condition the perceived geometric distortion of the rendered viewports; furthermore, they evolve in opposite directions with the variation of the projection parameters, i.e., stretching decreases and bending increases when d varies from 0 to 1 and/or vc varies from 1 to 0. Thus, the procedure to find the optimal parameters, (d_opt, vc_opt), seeks the best compromise between these two types of artifacts.

A. METHODOLOGY
For a given input ERI image, viewing direction (φ_VD, θ_VD), and FoV, the resulting viewport stretching and bending metrics are iteratively computed for different combinations of d and vc values, varying d between d_min and d_max with a step size Δd, and vc between vc_min and vc_max with a step size Δvc. For the results presented in this paper, d_min = 0.1, d_max = 1, vc_min = 0, and vc_max = 1; Δd and Δvc were both set to 0.1, resulting in N = 110 possible (d, vc) pairs. The optimal parameters, (d_opt, vc_opt), to be used in the viewport rendering are obtained by minimizing, over the considered (d, vc) pairs, the cost function described by (45), where SM(d, vc) and BM(d, vc) are, respectively, the viewport stretching and bending measures for projection parameters (d, vc); α is the stretching-to-bending ratio; and SM_min, SM_max, BM_min, and BM_max are normalizing constants guaranteeing that the metric values lie in the interval [0, 1]. These constants correspond to the minimum and maximum SM and BM values found for a set of 2200 viewports, rendered from 20 omnidirectional images taken from [8], [17], using (d, vc) values in the intervals previously specified. For SM(d, vc), the object-based approach showing the best results, OAD^TO with PF_6, was used, while for BM(d, vc) the Line Measure Combination (LMC) proposed in [11] was selected, due to its good performance. In (45), the parameter α seeks the best balance between the subjective impacts of stretching and bending, and it was learned in a perceptual way using a small dataset of Pannini viewports not contained in the final evaluation dataset. The details about the procedures to obtain the normalization constants and α are provided in the supplementary material.
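The grid search over the 110 (d, vc) pairs can be sketched as below; the cost function (45) is not reproduced in the text, so the α-weighted sum of the min-max normalized measures is an assumption, and stretch_fn/bend_fn stand in for SM(d, vc) and BM(d, vc):

```python
def cap_parameters(stretch_fn, bend_fn, alpha,
                   sm_min, sm_max, bm_min, bm_max):
    """Exhaustive search for the content-aware Pannini parameters:
    d in 0.1..1.0 and vc in 0.0..1.0, both with step 0.1 (110 pairs).
    The cost combining the normalized stretching and bending measures
    is an assumed form of (45), with alpha trading stretching against
    bending."""
    best, best_cost = None, float("inf")
    for di in range(1, 11):             # d = 0.1, 0.2, ..., 1.0
        for vci in range(0, 11):        # vc = 0.0, 0.1, ..., 1.0
            d, vc = di / 10, vci / 10
            sm = (stretch_fn(d, vc) - sm_min) / (sm_max - sm_min)
            bm = (bend_fn(d, vc) - bm_min) / (bm_max - bm_min)
            cost = alpha * sm + (1 - alpha) * bm  # assumed cost form
            if cost < best_cost:
                best, best_cost = (d, vc), cost
    return best
```

With toy measures where stretching decreases with d and bending is cheapest at large vc, the search selects the corner of the grid that balances both, mirroring the (d_opt, vc_opt) selection described above.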

B. RESULTS
The proposed content-aware Pannini (CAP) projection was compared with several benchmark projections, including the rectilinear, the stereographic, two Pannini projections with fixed parameters, (d = 0.5, vc = 0) and (d = 1, vc = 0), and the optimized Pannini projection (OP) proposed in [8]. For comparison purposes, several viewports were rendered from a set of omnidirectional images available in the datasets of [8], [17]. The viewports were rendered with a horizontal FoV, F_h, of 150° and a spatial resolution of 960×540 pixels (16:9 aspect ratio). Fig. 18 depicts some viewport examples obtained with the proposed CAP and the benchmark projections. As can be seen, the CAP viewports are generally more pleasant than those resulting from the benchmark projections, providing a good compromise between bending and stretching distortions. In particular, the following qualitative comparisons can be made (additional comparison results are provided in the supplementary material):
• CAP vs. rectilinear and stereographic: The viewports resulting from CAP are clearly more pleasant than those resulting from the rectilinear and stereographic projections. While the lines are straight in the rectilinear viewports, the perspective effect is very strong and annoying, and the object shapes are overly stretched, notably for the Office1, Office2, and Buildings viewports. Although the object shapes are preserved in the stereographic viewports, the lines are severely bent (fisheye effect).
• CAP vs. Pannini with fixed parameters: The proposed CAP generates viewports with a good balance between the stretching of objects and the bending of lines. This cannot be achieved by Pannini with fixed parameters, as for vc = 0 the horizontal lines are rather bent, particularly for d = 1.
• CAP vs. OP: The viewports obtained with CAP have less geometric distortion than the viewports resulting from OP. In particular, for the Bedroom viewport, the horizontal lines on the ceiling and on the floor are straighter for CAP. In the Office1 and Office2 viewports, CAP kept the horizontal lines as straight as OP, but the object shapes (e.g., the chair and monitor on the left side of Office1 and the chair on the left side of Office2) are more conformal for CAP.
The proposed CAP projection was also compared with the content-aware generalized perspective projection (CA-GPP) proposed in [11]. In the CA-GPP, the projection parameter (d) is optimized based on Support Vector Regression (SVR); the SVR model was trained and tested with a viewport dataset rendered with square FoVs of 90° and 110°. For a fair comparison, several viewports were rendered using CAP with a square FoV of 110° and a spatial resolution of 856 × 856 pixels, as in [11]. Fig. 19 depicts some viewport examples obtained with the proposed CAP and with the CA-GPP. As can be observed, while for Buildings the results are similar, for Friends and Dinner CAP provides a better tradeoff between stretching and bending; with CA-GPP, the vertical lines of the Friends viewport are overly bent, and the same happens to the horizontal lines of Dinner. Note that while CA-GPP was optimized for the FoV considered in these results (110°), no such optimization was performed for CAP.

VII. CONCLUSION
This paper proposes an object-based quality metric to assess the perceived geometric distortion in viewport images rendered from 360° images on planar displays. The metric measures the distortion of individual objects having semantic meaning (such as people, cars, or furniture). The experimental results show that the proposed metric outperforms the considered benchmark metrics and is able to assess the viewport quality, with respect to perceptual scores, with a correct decision percentage close to 90%. Also, as a useful application, the proposed metric was integrated in a procedure that optimizes the Pannini projection parameters according to the viewport content, achieving a good compromise between the stretching of objects and the bending of straight lines.
As final remarks, it is important to note that, for the same omnidirectional image and depending on the viewing direction, the viewport content may vary considerably, and so may the best projection parameters; however, how to change these parameters during video navigation requires further investigation, with specific subjective tests where user interaction is allowed. Furthermore, since the proposed content-aware Pannini projection is globally adapted to the viewport content (i.e., d and vc have the same values for the whole viewport), stretching and/or bending may still be visible in some image regions and structures; if these parameters were allowed to vary locally, the visibility of the geometric distortions could be further reduced. These challenges will be the subject of future work.

Tissot Indicatrix for the Pannini projection
The Tissot indicatrices of a given sphere-to-plane projection are characterized by the scale factors h and k and by the semi-major, â, and semi-minor, b̂, ellipse axes. To compute these parameters for the Pannini projection (PP), and following the procedure described in the appendix of [11], the partial derivatives of (x_p, y_p) with respect to (φ, θ) need to be computed; from (1) and (2), these derivatives follow as (46)-(49). The local area distortion, s, and the local shape distortion, t, can be computed by (39) and (40), respectively. A conformal projection has t = 1, and an equal-area projection has s = 0. As an example, Fig. 20 depicts the local shape distortion, t, along the equatorial line (θ = 0), for φ ∈ [−55°, 55°], for the PP with varying parameters d and vc, one at a time. As can be seen in Fig. 20-a), the local shape distortion is maximum for the rectilinear projection (d = 0). On the other hand, the stereographic PP (d = 1, vc = 0) is locally conformal (t = 1), although horizontal lines are bent, as can be seen in Fig. 18. In the PP, the bending of horizontal lines can be corrected by applying vc; however, shape distortion is then introduced, as can be concluded from Fig. 20-b).
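The Tissot parameters used above can also be estimated numerically, for any sphere-to-plane projection, with central finite differences and the standard cartographic formulas; this generic sketch is not the paper's closed-form Pannini derivation:

```python
import math

def tissot(project, phi, theta, eps=1e-5):
    """Numerically estimate the Tissot parameters (h, k, a, b) of a
    sphere-to-plane projection at longitude phi / latitude theta on the
    unit sphere. project(phi, theta) -> (x, y). Uses central finite
    differences and the standard cartographic formulas for the ellipse
    semi-axes."""
    def deriv(i):
        lo, hi = [phi, theta], [phi, theta]
        lo[i] -= eps
        hi[i] += eps
        (x0, y0), (x1, y1) = project(*lo), project(*hi)
        return (x1 - x0) / (2 * eps), (y1 - y0) / (2 * eps)

    dx_dphi, dy_dphi = deriv(0)  # along the parallel
    dx_dth, dy_dth = deriv(1)    # along the meridian
    h = math.hypot(dx_dth, dy_dth)                       # meridian scale
    k = math.hypot(dx_dphi, dy_dphi) / math.cos(theta)   # parallel scale
    # sine of the angle between projected meridian and parallel
    sin_tp = (dy_dth * dx_dphi - dx_dth * dy_dphi) / (h * k * math.cos(theta))
    a_plus = math.sqrt(h * h + k * k + 2 * h * k * sin_tp)
    a_minus = math.sqrt(max(0.0, h * h + k * k - 2 * h * k * sin_tp))
    a, b = (a_plus + a_minus) / 2, (a_plus - a_minus) / 2  # semi-axes
    return h, k, a, b
```

As sanity checks, the rectilinear projection yields a unit-circle indicatrix (a = b = 1) at the viewport centre, and a conformal projection such as Mercator yields a ≈ b (i.e., t = 1) everywhere.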