Visual Quality of Compressed Mesh and Point Cloud Sequences

With the development of immersive video, the delivery and storage of 3D content have become important research areas. While compression methods for meshes and point clouds, the two main representations for 3D content, are actively studied, there are few studies of their perceptual compression quality and none that consider observation distance. In this paper, we study the perceptual quality of compressed 3D sequences, for both point cloud compression and mesh-based compression. We explore the impact of bit rate and observation distance on perceptual quality. Evaluation of perceptual quality is carried out both by collecting viewer opinion scores of the compressed sequences separately, and with a side-by-side comparison. A functional model for mesh and point cloud compression quality is estimated to predict Mean Opinion Score (MOS) which yields high Pearson correlation and rank correlation scores with measured MOS.


I. INTRODUCTION
Immersive video is a rapidly growing form of multimedia and an important future direction for interaction with content and the real world, because it provides a richer experience than traditional 2D media, including innovative navigation and interactive functionalities [1], [2]. In immersive video, viewers can see details without constraint on the viewpoint, since immersion is provided through 6 degrees of freedom. These advantages have led to multimedia applications in many areas, such as teleconferencing [3], [4], sports [5] and education [6]. In this paper, we refer to this kind of volumetric video as 3D video. The development of depth sensors, such as Kinect from Microsoft and RealSense from Intel, has led to an increase in 3D data processing applications. Research into autonomous vehicles has also increased the need to process 3D information to understand the surrounding world. However, despite its advantages over traditional 2D video, 3D video involves huge amounts of data, which creates storage problems and a need for compression.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqi Wang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are two main representations for 3D data, mesh and point cloud (PC). A mesh represents 3D content with faces (represented by edges and vertices) that define the shape, and texture information that defines the color across the surface of each face. Point clouds represent 3D content with a collection of points in 3D space; each point is associated with attributes such as color information. Visual quality of mesh and PC compression, including their comparative performance, is an important and relatively new area of study. Difficulties arise from the lack of promising objective metrics. There are few objective metrics that work on both mesh and PC representations, and most current objective metrics are poorly correlated with human perception [7]-[12]. Most current work on quality evaluation focuses on geometry distortion and ignores texture distortion; however, the overall visual quality is affected by both. Subjective experiments under varying conditions, as well as objective metrics correlated with perception, are needed. In this context, the contributions of this paper are:
• We compare PC sequence compression and mesh sequence compression, which allows us to determine which representation is preferred depending on sequence content, bit rate and observation distance. We use the evaluation method in [13] from MPEG, in which a virtual viewing trajectory is chosen and the 3D sequence is rendered into 2D for subjective viewing.
• A model of subjective rating estimation is proposed, consisting of two parts: a bit rate quality factor (BQF), which assesses the quality of compression based on bit rate, and an observation distance correction factor (ODCF), which corrects BQF according to observation distance. Our model fits well with subjective ratings.
This paper is structured as follows. Sec. II introduces prior work on compression and quality evaluation of meshes and point clouds. Sec. III describes the subjective experiment. Experimental results and the estimation model are in Sec. IV. Sec. V provides compression suggestions based on the subjective test results, and conclusions are in Sec. VI.

II. RELATED WORK
A. 3D CONTENT COMPRESSION
For mesh compression, triangle fan-based compression (TFAN) [14] enumerates triangle connection cases to encode the connection information of a mesh so as to improve compression efficiency. Google's open-source compression tool, Draco, is based on the corner-table method [15] and achieves real-time compression and decompression. Similar connectivity pattern information is used to improve encoding efficiency for mesh compression in [16]. The whole mesh is divided into a few sub-meshes and compressed with a graph Fourier transform in [17] which approximates the connectivity of a sub-mesh with a sparse matrix.
For point clouds, since the points are not connected as they are in a mesh, the spatial organization of points should be built so as to encode the points efficiently. Research on PC compression has included many diverse approaches. PC reconstruction based on rank minimization theory is proposed in [18] to complete holes in dynamic PC sequences. Octrees are applied in [19], [20] to progressively compress point clouds. In [21], octrees and graph-based transforms are combined to compress geometry information in static PCs. Hierarchical clustering of points can generate Level of Detail (LoD) in [22], which describes different levels of complexity for a PC, and LoD is progressively compressed. A hierarchical sub-band transform that resembles an adaptive variation of a Haar wavelet is applied in [23] for color attribute compression for PCs. A motion-compensated approach to encoding dynamic PC sequences is proposed in [24]. An encoding scheme for building a 3D model is based on a set of low-frequency spherical harmonic basis functions in [25].
MPEG hosted a call for proposals [13] and picked three methods [26] as winners for three different categories: static models, dynamic sequences and dynamic acquisition. Test Model Category 2 (TMC2) from Apple Inc. [27] achieves the best subjective and objective quality under given target bit rates for the dynamic sequence category. Its core idea is to project points, both their geometry coordinates and attributes, to 2D and convert the 3D sequence to a 2D sequence. Then any existing video codec, such as FFmpeg or HM (Test Model for HEVC) could be used to compress the 2D sequence. In their framework, after the resolution of the 2D sequence is fixed as an initial parameter, any scale-up or scale-down of the 2D sequence will create outlier points when projected back to 3D space. To add spatial scalability to TMC2, [28] proposed to add a patch-aware averaging filter to remove outliers.

B. 3D CONTENT QUALITY ASSESSMENT
Subjective quality evaluation and computable quality evaluation of 3D representations are both active research topics. Subjective quality for 3D content is much less well studied than for 2D video. Prior work on subjective evaluation and objective metrics for PCs is summarized in [29]. Most recent work focused on evaluating subjective quality or Quality of Experience (QoE) for static 3D objects, including comparison with computable metrics. In [13], MPEG adopted a subjective experiment approach to evaluate the quality of compressed PCs, in addition to the conventional point-to-point and point-to-plane metrics. The impact of different noise levels on QoE of PCs was studied in [30]. Augmented reality head-mounted displays were used in subjective evaluation of PCs in [31]. In [32], different subjective methodologies were studied, such as Absolute Category Rating (ACR) and Double Stimulus Impairment Scale (DSIS). The authors concluded that they performed similarly for PC subjective experiments involving the quality of compression-like distortions. In [9], 20 subjects scored PC compressed videos from 1 (Bad) to 5 (Excellent); however, this research only considered different encoding configurations without comparing compression bit rates. Also, subjects were allowed to observe the sequence with a free viewpoint, which might lead to a variety of results since people may focus on different local details. The impact of different reduction methods for PCs on the QoE of rendered images was investigated in [33].
There is less research on subjective quality of 3D sequences. A subjective experiment on TMC2 compression for two PC sequences was carried out in [34]; the authors found that perceptual quality was more affected by texture distortion than geometry distortion. The impact of different rendering configurations on QoE in VR-based training was studied in [35]. Better immersion and faster interaction with 3D content was found to affect subjective quality in [36]. QoE of adaptive PC streaming was investigated in [37] with different network configurations.
One prior work compared the visual quality of mesh and PC representations. In [38], a comparison between colored mesh and PC concluded that mesh compression generates better visual quality at high bit rates while PC compression is better at low bit rates. Our work differs from [38] in that we consider multiple observation distances, and our compression pipelines allow PC scaling and mesh simplification, which can change the visual tradeoffs. In addition, we include a model for estimating quality ratings based on bit rate and observation distance.
Computable quality metrics for 2D and 3D sequences can be categorized as Full-Reference (FR) metrics, which have access to the original sequence as well as the distorted one, Reduced-Reference (RR) metrics, which have access to the distorted sequence and to some key parameters extracted from the original sequence, and No-Reference (NR) metrics [39], which do not have access to the original reference sequence. NR metrics have been further subdivided into pixel-based (NR-P) metrics which use the distorted sequence to evaluate quality, and bit-stream-based (NR-B) quality predictors which do not make use of the distorted sequence directly, but rather use basic bit stream parameters such as the bit rate and packet loss rate to predict quality. Because of different properties of mesh and point cloud representations, different metrics are applied. A survey of perceptually-based computable metrics for visual impairment of 3D objects appears in [40].
Among FR metrics of mesh quality, root mean square error (RMS) and Hausdorff distance (HD) were adopted as straightforward metrics, but they were found to be poorly correlated with human perception [7], [8]. Curvature was computed on different scales of a distorted mesh and its reference in [7], based on which a correspondence was found to generate a mapping between meshes. Then, inspired by the work of [41], a structural similarity index on 3D, named Mesh Structural Distortion Measure (MSDM) was implemented on all scales. A final score was computed as a weighted combination considering all scales. Similarly, based on curvature, [42] took visual masking and saturation effects into consideration to correct scores directly from curvature. Quality for watermarked meshes was explored in [43] by measuring the difference of surface roughness between watermarked and original meshes. Measuring distance between curvature tensors of two triangle meshes under comparison was proposed in [44]. A RR method was developed in [45] that considered distributions of extracted parameters from dihedral angles of distorted and reference meshes. The Kullback-Leibler divergence between the distributions was calculated as a perceptual distance. In [46], a local roughness measure was derived from Gaussian curvature. A NR method proposed in [47] fed distributions of dihedral angles into a trained support vector regression to predict quality scores. With the development of neural networks, [48] took mean curvature as input to a regression neural network to predict a quality score without a reference mesh. All of this prior work focused on meshes without color.
For computable metrics of compressed PCs, similar to [7], [8], simple point-to-point and point-to-plane FR objective metrics, such as RMS and HD, showed no correlation with perceptual quality [9]-[12]. The conclusion that point-to-point and point-to-plane metrics are limited in predicting subjective quality ratings, especially for TMC2, was also verified in [49], [50]. In [12], point-to-point and point-to-plane metrics were studied for a PC de-noising algorithm; the authors concluded that the point-to-plane metric is more correlated with perceptual quality than a point-to-point metric for PCs with no noise. In [51], the normal vectors used for the point-to-plane metric were averaged within a small local region to avoid errors in the estimated normals caused by geometric distortions. In [52], a plane-to-plane FR metric measured angular similarity through the intersection angle of normal vectors between two corresponding points. While the aforementioned work also focused only on geometric distortion, two studies considered color attributes as well [53], [54]. In [53], the authors rendered a 3D point cloud onto a 2D plane, then applied traditional image FR quality metrics to measure the 2D image quality as input to their prediction model for perceptual quality of the 3D PC. Different objective metrics were proposed in [54] based on geometry, normal vectors, curvature and color separately, and the color-based metric was found to best match perceptual quality.
The current work differs from this prior work in several ways. We provide a comparison of PC and mesh compression across many different bit rates and across three different observation distances, in order to explore the effect that both rate and observation distance have on perceived quality. Unlike most prior work, we consider 3D content with color. We include the possibility of reducing the number of triangles or number of points. Lastly, we develop a simple NR quality predictor which can predict subjective quality scores for both mesh and PC compression as functions of bit rate and observation distance. A list of terms and symbols used in this paper is provided in Table 1.

III. SUBJECTIVE EXPERIMENT
Here we design a subjective test to compare the quality under compression of PC and mesh representations of 3D content.

A. TEST SEQUENCES
We use four 3D dynamic sequences: Basketball, Dancer, Model and Exercise. Each sequence has 300 frames. In Fig. 1, we show example frames to illustrate sequence content. Exercise contains a man wearing solid-color clothing; he does exercises slowly, with a large movement range. In Dancer, a man in solid-color clothing does an urban dance with fast movement and large movement range. The Model sequence shows a woman wearing a patterned dress with a swirly skirt; this sequence contains more complex texture than the others. Lastly, Basketball involves two objects, a basketball and a man; he plays with the basketball with a moderate motion speed. Both the ball and the man's shirt have some color variation. Fig. 2(a) shows the camera setup for data acquisition, consisting of 75 stationary cameras, of which 25 are color and 50 infrared. Fig. 2(b) shows the data acquisition pipeline. The whole pipeline is similar to [55]. The 50 infrared cameras lead to 25 depth images which are aligned with the 25 color images for each frame in the 'Generating Correspondence' module. Those 25 image pairs for each frame constitute the raw data of each sequence.
For the mesh version of each sequence, starting from the raw data, all image pairs are fed into a foreground segmentation algorithm. The foreground segmentation algorithm is adopted from [55]; it considers a confidence map for an RGB image, IR image and Shape from Silhouette jointly in order to generate a good segmentation. After foreground segmentation, a triangle mesh is reconstructed with a given rough number of triangles. Because the captured content is dynamic, a mesh tracking method, non-rigid Iterative Closest Point [56], is applied to ensure topology consistency in each group of frames. The reconstructed mesh has vertices with coordinates ranging from 0 to 2048. At the last step, a texture atlas image, containing the color information of size 2048 × 2048, is generated.
For the PC version of each sequence, we first generate a mesh with 100k triangles per frame as the source mesh. The number of triangles is high enough to ensure good quality. Then, we sample across the mesh surface with an interval of 0.5 which leads to a source PC with approximately 2.5 million points per frame.

B. SEQUENCE PREPARATION
Each test sequence is encoded at five target bit rates: 3Mbps, 6Mbps, 9Mbps, 15Mbps, and 25Mbps. Those target bit rates are chosen to cover a large range which will satisfy many potential applications. We render the 3D content as depicted in Fig. 3. Using OpenGL in a Linux OS, we obtain rendered 2D images of the 3D sequences out of which we generate video sequences that are shown to subjects. The observation distance, which is the distance between the sequence centroid and the virtual rendering camera, takes the values 1.5m (close), 3.0m (middle) and 5.0m (far) where m represents meters. The observation distances chosen represent a significant range of possible observation distances. Observing at 1.5m or closer generally does not allow the whole body to be seen, while observing at 5m or farther means the body appears rather small. By projecting each point or face to the image plane, OpenGL simulates what the 3D content will look like from a camera positioned at the specified observation distance from the center of the object. The virtual viewing trajectory makes a full 360 degree turn around the main figure, while making a small variation in height (somewhat higher and lower than the midpoint). The average distance between the virtual rendering camera and the figure is roughly the observation distance (actual distance could be a little closer or further, less than 5% difference). These virtual viewing trajectories are chosen to present a large set of angles and details to be examined by the subjects. During the rendering process, the size of rendered videos is fixed to 2048 × 2048.
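The virtual viewing trajectory described above can be sketched as follows. This is only an illustrative sketch, not the rendering code used in the experiment: the function name camera_trajectory, the y-up axis convention, the sinusoidal height profile and the 0.1 m height amplitude are all assumptions, since the text only states that the camera makes a full 360 degree turn with a small height variation and stays within about 5% of the observation distance.

```python
import math

def camera_trajectory(centroid, distance, n_frames=300, height_amp=0.1):
    """Return one virtual camera position per frame on a 360-degree orbit.

    height_amp (in metres) is a hypothetical amplitude for the small
    up-and-down variation; the paper does not give its value.
    The y axis is assumed to point up.
    """
    cx, cy, cz = centroid
    # Horizontal radius chosen so the camera-to-centroid distance never
    # exceeds `distance` and stays close to it.
    radius = math.sqrt(distance ** 2 - height_amp ** 2)
    positions = []
    for k in range(n_frames):
        angle = 2.0 * math.pi * k / n_frames   # one full turn over the sequence
        height = height_amp * math.sin(angle)  # somewhat above, then below, the midpoint
        positions.append((cx + radius * math.cos(angle),
                          cy + height,
                          cz + radius * math.sin(angle)))
    return positions
```

Each returned position would be used as the eye point of a look-at camera aimed at the sequence centroid; with the values assumed here the actual distance differs from the nominal one by well under 5%.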
The observation distance is defined as the distance between the sequence centroid and the virtual camera. In a real application, observation distance would typically be an input from the user, who chooses the distance from which to view the content. Once an observation distance (and angle) are chosen by the user, that determines the rendering parameters which are needed to achieve that view, but it does not determine the  encoding parameters. Different observation distances could correspond to different applications. For example, teleconferencing might involve a close distance, whereas a user watching a performance might choose a far distance.
Then, for each target bitrate and observation distance, we aim to pick the compression parameters with the best visual quality for evaluation. Compression parameter sets are different for mesh and point cloud representations, and they are chosen as follows:

1) FOR MESH REPRESENTATION
This consists of two parts, the atlas image and vertex information. Atlas images are formed into a sequence and compressed with any video codec. We used FFmpeg (version 3.4.1) to compress it, involving two compression factors: atlas image size L and compression quantization parameter (QP) q. The vertex information contains vertex coordinates, vertex connection information, and vertex-atlas mapping information. Vertex coordinates are quantized with step size β. Vertex connection information is compressed losslessly with TFAN [14], chosen because MPEG adopted TFAN in their mesh coder. The vertex-atlas mapping determines the correspondence between a vertex coordinate and a pixel position in the atlas image; it is a two-dimensional vector for each vertex. Vertex-atlas mapping information is quantized with step size γ. All the vertex information after quantization and compression is further compressed as a subtitle bitstream in the FFmpeg framework. The number of triangles per frame N_t can also be controlled as a pre-processing step. The overall parameter set for mesh compression is formulated as the vector (N_t, L, q, β, γ).
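As an illustration of quantization with a step size, the following minimal sketch applies uniform scalar quantization, which is one common reading of "quantized with step size β" (or γ). The text does not specify the quantizer, so the uniform rounding rule and the function names quantize and dequantize are assumptions.

```python
def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer index.

    Assumes a mid-tread quantizer; Python's round() resolves exact ties
    to the nearest even integer.
    """
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Reconstruct approximate values from the integer indices."""
    return [i * step for i in indices]
```

With this rule the reconstruction error of each coordinate is at most half the step size, so a larger β or γ lowers the bit rate at the cost of coarser geometry or texture mapping.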
Starting with the raw data for each sequence, 6 versions of the mesh sequence are generated with different numbers of triangles per frame: 3k, 5k, 8k, 10k, 12k and 15k. For each version and each target bit rate, the compression parameter set that generates the best visual quality is manually chosen. That leads to 6 different compression parameter sets in total for each target bit rate. Then, 2D video is rendered according to a certain observation distance. We manually pick one parameter set from those 6 parameter sets as the one with the best 2D visual quality under this bit rate and this distance. For each observation distance and target bit rate, we repeat this procedure. In the end, given 5 bit rates and 3 distances for each sequence, we have 15 different compression parameter sets for each of the four test sequences.

2) FOR PC REPRESENTATION
We use TMC2, proposed as an MPEG standard, for PC compression. An important TMC2 parameter is the projected image size, l_x and l_y; the core idea of TMC2 is to project a PC onto a 2D image for both geometry and texture. Then, a conventional video codec is applied with QPs for geometry and texture (QP_geometry and QP_texture). We used FFmpeg (version 3.4.1) as the video codec. Note that the size parameters l_x and l_y for PC compression, and L for mesh compression, are used only in the compression step, where they may differ for different bit rates or observation distances. However, following decompression, for purposes of rendering and display, all cases use the same interface and display size of 2048 × 2048.
Another important parameter, similar to the number of triangles per frame in mesh compression, is the down-scaling factor s which controls the number of cloud points. Down-scaling is performed over the source PC. Given a point (x, y, z) in the original PC, the scaled point is (x/s, y/s, z/s). Since the PC compression algorithm only takes integer input, the coordinates (x/s, y/s, z/s) are rounded to integers. After shrinking all point coordinates and rounding to integers, some points that were distinct in the original PC will map to the same position in the down-scaled PC. These duplicate points are removed. In this way, a down-scaling of coordinates directly leads to a down-sampling of points. Scaling-up is implemented when we render the scaled PC. We have 6 scale factors: 1.0, 1.25, 1.5, 1.75, 2.0 and 2.5. The compression parameter set for PC compression is (s, l_x, l_y, QP_geometry, QP_texture). For each target bit rate and each observation distance, we manually pick the parameter set that provides the best visual quality. In the end we have 15 parameter sets for 5 bit rates and 3 distances for each sequence.
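The down-scaling step can be sketched as below. This is a minimal illustration under stated assumptions: the text does not say which color is kept when several points collapse onto the same integer position, so the sketch simply keeps the first one, and the function names downscale_point_cloud and upscale_for_rendering are hypothetical.

```python
def downscale_point_cloud(points, s):
    """Scale coordinates by 1/s, round to integers, and drop duplicates.

    points: list of ((x, y, z), color) tuples.
    Assumption: when several points map to the same integer position,
    the first one encountered is kept.
    """
    kept = {}
    for (x, y, z), color in points:
        key = (round(x / s), round(y / s), round(z / s))
        if key not in kept:
            kept[key] = color
    return list(kept.items())

def upscale_for_rendering(points, s):
    """Scale the decoded integer coordinates back up before rendering."""
    return [((x * s, y * s, z * s), color) for (x, y, z), color in points]
```

For example, with s = 2.5 the points at x = 10 and x = 11 (same y and z) both land on integer position 4, so only one of them survives, which is how a larger s reduces the point count.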
In Table 2, we show compression parameter sets that are picked for our experiment for Dancer. Those parameter sets achieve the best visual quality for the given target bit rate and observation distance. To generate the candidates, we traversed all possible combinations of parameters that fit the bit rate constraint. After rendering all candidates to 2D videos, the investigators manually picked the parameter set with the best visual quality among the candidates for each observation distance. It is not always the case that the lowest QP was selected as the best candidate, because, for example, to meet the bit rate constraint, the lowest QP might occur with heavy down-scaling of the point cloud, a combination that might lead to poor quality. When we manually pick the parameters, the parameters are hidden from us to avoid bias, so we choose the parameters based on visual quality of the rendered videos. Table 2 shows a clear trend in different parameter sets for different bit rates, for example, for lower bit rates, a smaller number of triangles is a good choice for meshes, and heavier down-scaling is a good choice for point clouds. But Table  2 does not show a clear trend for observation distances, suggesting that some different combinations of parameters which achieve the same bit rate look visually similar.
There are 120 rendered videos in total, corresponding to 2 representations (mesh and point cloud), 4 sequences, and 15 sets of compression parameters (3 distances × 5 bit rates). Each video lasts 10 seconds. Fig. 4 and Fig. 5 present rendered images of the compressed PC and mesh from the 250th frame of Model. From the example frames, we notice that observation distance plays an important role in visual quality. At low bitrate, PC compression causes some cracks and outliers. While mesh compression appears to better preserve the geometry compared to PC compression, the texture information appears to suffer more distortion.

C. EXPERIMENT DESIGN
Thirty subjects (22 males, 8 females, age range 19-37) participated in our two-part experiment. For display, we used an Acer 24-Inch LED backlight monitor. In the first part, the subject scores each video from 0 (worst quality) to 100 (best quality). Before starting, subjects are shown a few example videos to calibrate their scoring standard. The 120 videos are shown in a random order. The user interface for the first part of the experiment is shown in Fig. 6. The size of the user interface (and of its displayed video) remains the same for all cases. Subjects input their score in the box according to the scale. After clicking 'Submit Score', the score is recorded and the next random video appears. Including training, this part takes approximately 30 minutes and includes 120 videos.
The second part aims to compare two videos (A and B) side by side, asking the subject which one is better based on visual quality. The videos are for the same sequence, bit rate and distance; one is from mesh representation and the other is from PC representation. The user interface for this part is shown in Fig. 7. The order of videos is random as is their placement as A or B. There are five choices: A is far better, A is a little better, they are similar, B is a little better and B  is far better. The subject makes a choice and clicks 'Submit' to proceed to the next pair. This part of the experiment takes about 15 minutes and includes 60 video pairs.

A. RATINGS OF MESH AND POINT CLOUD SEQUENCES
For the first part of the experiment, viewers rated sequences separately, and for this data we need to normalize scores and handle outliers. Given the rating range from 0 to 100, scores from different viewers tend to fall in quite different subranges, so we first normalize raw scores. We find the minimum and maximum scores given by each viewer for a specific sequence. For each sequence, the median of the minimum scores across viewers is denoted S_min (likewise S_max for the maximum scores), and all viewers' scores for the sequence are normalized to the range from S_min to S_max.
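One possible reading of this normalization is sketched below: each viewer's scores for a sequence are mapped linearly so that the viewer's own minimum and maximum land on S_min and S_max. The linear mapping, the handling of a viewer who gives the same score to every video, and the name normalize_scores are assumptions, since the text does not spell out the mapping.

```python
from statistics import median

def normalize_scores(scores_by_viewer):
    """Rescale each viewer's scores for one sequence to [S_min, S_max].

    scores_by_viewer: dict mapping viewer id -> list of raw scores.
    S_min / S_max are the medians of the per-viewer minima / maxima.
    """
    s_min = median(min(v) for v in scores_by_viewer.values())
    s_max = median(max(v) for v in scores_by_viewer.values())
    normalized = {}
    for viewer, scores in scores_by_viewer.items():
        lo, hi = min(scores), max(scores)
        if hi == lo:
            # Assumption: a viewer with constant scores is placed at the midpoint.
            normalized[viewer] = [(s_min + s_max) / 2.0] * len(scores)
            continue
        scale = (s_max - s_min) / (hi - lo)
        normalized[viewer] = [s_min + (x - lo) * scale for x in scores]
    return normalized
```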
We adopt a screening method from [57] to eliminate scores that are outliers or inconsistent. This method, also used in [58], makes use of the fact that our test contains videos at different bit rates.
1) First, we aim to remove outlier scores. For each video ζ, we determine the mean, standard deviation and kurtosis of its scores, denoted ū_ζ, σ_ζ and β_{2ζ}. When 2 < β_{2ζ} < 4, the distribution of scores for that video is close to normal, and a score outside the range [ū_ζ − 2σ_ζ, ū_ζ + 2σ_ζ] is regarded as an outlier. If β_{2ζ} is not between 2 and 4, the inlier range is enlarged. For each viewer, we reject all 30 scores of this viewer for a given sequence if two or more scores lie above the upper end of the inlier range, or if two or more scores lie below the lower end of the inlier range.
2) If scores from a viewer are consistent, the rating for a lower bit rate should not be larger than that for a higher bit rate. All 30 scores of a viewer for a given sequence are rejected if, more than twice, the viewer gives a score at a lower bit rate that is more than K times the score the same viewer gives at a higher bit rate for the same sequence and observation distance. Here K = 1.3 is chosen empirically so that the retained scores are consistent enough to show the scoring trend without rejecting too many scores from viewers.
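The two screening steps can be sketched as follows. Several details are assumptions rather than the procedure of [57]: the enlarged range is taken as ±√20 standard deviations (the value used in the ITU-R BT.500 observer screening), the moments are population moments, and all function names are hypothetical.

```python
import math

def inlier_range(scores, wide_factor=math.sqrt(20)):
    """Inlier range for one video from the scores of all viewers.

    Uses +/-2 sigma when the kurtosis lies in (2, 4); otherwise uses
    +/-wide_factor sigma (sqrt(20) is an assumption borrowed from BT.500).
    """
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    kurtosis = m4 / (m2 * m2) if m2 > 0 else 0.0
    k = 2.0 if 2.0 < kurtosis < 4.0 else wide_factor
    std = math.sqrt(m2)
    return mean - k * std, mean + k * std

def reject_for_outliers(viewer_scores, ranges):
    """viewer_scores: dict video -> score (one viewer, one sequence).
    ranges: dict video -> (low, high) from inlier_range()."""
    above = sum(1 for v, s in viewer_scores.items() if s > ranges[v][1])
    below = sum(1 for v, s in viewer_scores.items() if s < ranges[v][0])
    return above >= 2 or below >= 2

def count_violations(scores_by_rate, K=1.3):
    """Count (lower rate, higher rate) pairs where the lower-rate score
    exceeds K times the higher-rate score."""
    rates = sorted(scores_by_rate)
    return sum(1 for i, low in enumerate(rates) for high in rates[i + 1:]
               if scores_by_rate[low] > K * scores_by_rate[high])

def reject_for_inconsistency(groups, K=1.3, max_violations=2):
    """groups: list of dicts (bit rate -> score), one per observation
    distance and representation, for one viewer and one sequence."""
    return sum(count_violations(g, K) for g in groups) > max_violations
```

A viewer whose scores trigger either test has all 30 scores for that sequence dropped before the MOS is computed.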
After screening, there are on average 22 user ratings for each sequence. We average viewer scores for the same video to determine its mean opinion score (MOS). We divide all MOS values by the highest score for each sequence and each representation to normalize the highest score to 1. We plot curves of normalized MOS vs. bit rate in Fig. 8. The curves show that increasing bit rate produces better visual quality, and for a given bit rate, closer observation distances generally receive lower quality scores than farther observation distances. In Fig. 9, we plot curves of MOS vs. distance. Given a fixed bit rate, the relationship appears close to linear, and the slope and intercept for each line depend on the bit rate. In Section IV-C we will use the data in Figs. 8 and 9 to create a model that predicts the MOS as a function of bit rate and observation distance.
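The MOS computation and its normalization are simple enough to state as a short sketch. The data layout (a dictionary keyed by bit rate and distance for one sequence and one representation) and the name normalized_mos are assumptions made for illustration.

```python
def normalized_mos(scores_by_video):
    """Average retained viewer scores per video, then divide by the maximum.

    scores_by_video: dict mapping (bit_rate, distance) -> list of scores
    for one sequence and one representation.
    """
    mos = {video: sum(s) / len(s) for video, s in scores_by_video.items()}
    top = max(mos.values())
    return {video: value / top for video, value in mos.items()}
```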

B. PREFERENCE FOR MESH AND POINT CLOUD COMPRESSION
In the second part of the experiment, viewers compare sequences side by side. Here we use all data points, and no normalization is applied to the preference data. Bar charts of the data are shown in Fig. 10. The sequence Dancer contains fast motion and the sequence Model contains detailed texture. Sequence-specific conclusions that we draw from these plots are:
1) For Basketball, Exercise and Model, people prefer PC over mesh compression at low bit rates (e.g. 3Mbps), especially for Model, whose texture is richest. Mesh texture appears more blurry than PC texture at low bit rates.
2) For Dancer at relatively low bit rates, mesh compression is better than PC. A PC is composed of discrete points which might cause bad artifacts (such as holes) when the motion is fast, as in Dancer.
Combining all sequences, the left bar chart in Fig. 11 shows the total count of people who preferred mesh compression in the comparison. In the middle is the count of those choosing "similar", and PC preference is at the right. One difference between Fig. 11 and Fig. 10 is that Fig. 10 shows preferences for each sequence while Fig. 11 combines all sequences. Conclusions that we draw from these plots are:
1) There is a general trend that PC compression is preferred at low rates, and the two different representation types become more similar as the bit rate increases.
2) When we observe the sequence from afar, the two representations are similar except at low bit rates.
3) With decreasing observation distance, the preference for mesh compression increases.

C. OPINION SCORE MODEL
This section proposes an opinion score model that takes bit rate and observation distance as inputs. The model predicts the MOS score using: Here r is bit rate and d is observation distance. Because the video version with the highest bit rate and farthest distance always gets the highest score, we have MOS(25Mbps, 5) = 1. BQF(r) is Bitrate Quality Factor with the formulation adopted from [58]: where r max is the maximum bit rate which we take as 25Mbps and w is a parameter for BQF(r). ODCF(r, d) is Observation Distance Correction Factor which we define as the ratio between the score of distance d and the score of the far distance (5) for a certain bit rate r, so ODCF(r, 5) = 1. That leads to MOS(r, 5) = BQF(r) and we could use the MOS for far distance to estimate the parameter for BQF(r). Then, we would like to determine the function ODCF(r, d). We compute the ODCF(r, d) value using Eq. 1. From the  roughly linear curves of MOS vs. distance shown in Fig. 9, and taking the slope and intercept for each line to be dependent on the bit rate, we obtain: where slope(r) is the slope and is a function of bit rate, and b(r) is also a function of r. While the linear fit involves only three points for ODCF(r, d), the three distances represent a wide range of likely observation distances, and the linear fit is simple and performs well in quality estimation, as will be shown in the following evaluation. Since ODCF(r, 5) = 1, Eq. 3 yields b(r) = 1 − slope(r) * 5. We want to determine a relation between slope(r) and r. First, for all 5 bit rates, we fit a linear model to ODCF(r, d) and generate 5 different slope(r). Curves of slope(r) are shown in Fig. 12. The relation for mesh compression looks like an inverse proportional function slope(r) = m/r and that for PC compression is a linear function slope(r) = p * r + t. Here m, p, t are all function parameters.
The final MOS is then computed as the product of BQF(r) and ODCF(r, d), with the linear form of ODCF(r, d) and the constraint ODCF(r, 5) = 1 substituted in:

$$\mathrm{MOS}(r, d) = \mathrm{BQF}(r) \cdot \mathrm{ODCF}(r, d) = \mathrm{BQF}(r) \left[ \mathrm{slope}(r)\,(d - 5) + 1 \right]$$

The fitting is performed using MATLAB's built-in functions fittype() and fit(). Algorithm 1 shows the fitting process as pseudo-code. The fitting result obtained when all four sequences are used for parameter estimation is shown in Fig. 13. Pearson Correlation (ρ), Mean Squared Error (MSE), Spearman's Rank Correlation Coefficient (SRCC) [59], Kendall Rank Correlation Coefficient (KRCC) [59] and perceptually weighted rank correlation (PWRC) [60] are used for evaluation. The evaluation results are in Table 3.
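For readers who want to reproduce the evaluation on their own predictions, the following is a minimal pure-Python sketch of four of the listed measures. It is an illustration only: the function names are hypothetical, the Kendall coefficient is the simple tau-a variant without tie correction (the paper does not state which variant was used), and PWRC [60] is not implemented.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], meas: Sequence[float]) -> float:
    """Mean squared error between predicted and measured scores."""
    return sum((p - m) ** 2 for p, m in zip(pred, meas)) / len(pred)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """SRCC computed as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)
```

The rank-based measures depend only on the ordering of the predictions, so they complement ρ and MSE, which are sensitive to the actual predicted values.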
For comparison, we use the MOS prediction model from [53], which takes $\mathrm{MOS}_{predicted} = Ax^3 + Bx^2 + Cx + D$, where A, B, C and D are parameters to be estimated and x is the score from the FR 3D objective quality metric VIFp (Visual Information Fidelity, pixel domain version), which was shown to achieve the best correlation with human perceptual quality in MOS prediction in [53]. VIFp is carried out on the projected 2D images of 3D content, and it considers color information and geometry information jointly, giving an overall score for input 3D content. VIFp can handle both mesh and PC representations. As in [53], we render each frame of each sequence into 2D along six axis directions, then compute the averaged VIFp of all six directions as the final VIFp value of this frame with the provided tool, Video Quality Measurement Tool (VQMT) [61]. VIFp of the whole sequence is computed as the averaged VIFp across frames. Then we apply the VIFp as input to estimate the parameters of their model based on our measured MOS data. The comparison results of predicted MOS are in Table 3. Compared to the model from [53], which is also fit to our collected MOS scores, our model better predicts MOS.
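The two computational steps of this baseline, averaging VIFp over directions and frames and then mapping the result through the cubic polynomial, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the per-direction VIFp values are assumed to have been computed beforehand with an external tool such as VQMT, and the coefficients A, B, C, D must come from a separate fit that is not shown.

```python
from typing import Sequence


def sequence_vifp(per_frame_scores: Sequence[Sequence[float]]) -> float:
    """Average VIFp over the projection directions of each frame, then over frames.

    per_frame_scores[i] holds the VIFp values of frame i, one per projection
    direction (six directions in the setup described above).
    """
    frame_means = [sum(scores) / len(scores) for scores in per_frame_scores]
    return sum(frame_means) / len(frame_means)


def cubic_mos(x: float, A: float, B: float, C: float, D: float) -> float:
    """Evaluate A*x^3 + B*x^2 + C*x + D using Horner's rule."""
    return ((A * x + B) * x + C) * x + D
```

A predicted score for one compressed sequence would then be `cubic_mos(sequence_vifp(scores), A, B, C, D)` with fitted coefficients.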
Because the MOS scores for all sequences are used to fit the models, Table 3 is overly optimistic, for both our approach and that of [53], about the correlation between actual and predicted MOS scores. Therefore, we cross-validate the fit by removing the scores of one sequence and estimating the model parameters from the remaining three sequences. We then compute the correlation between the predicted and measured MOS for the held-out sequence; the results are in Table 4 for all four sequences. From Table 4, we notice that the correlation coefficients are still promising even though the parameters of the model are estimated from other sequences, and the approach generally outperforms the VIFp model [53] evaluated using the same cross-validation strategy.
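The leave-one-sequence-out procedure can be written generically, independent of which model is being fit. The sketch below assumes a hypothetical data layout (a dictionary from sequence name to a list of samples) and takes the fitting and scoring routines as arguments; the toy fit and score functions at the bottom are placeholders for illustration, not the models used in the paper.

```python
from typing import Any, Callable, Dict, List


def leave_one_sequence_out(
    data: Dict[str, List[Any]],
    fit: Callable[[List[Any]], Any],
    evaluate: Callable[[Any, List[Any]], float],
) -> Dict[str, float]:
    """For each sequence, fit on all other sequences and score on the held-out one."""
    results = {}
    for held_out in data:
        train = [
            sample
            for name, samples in data.items()
            if name != held_out
            for sample in samples
        ]
        params = fit(train)
        results[held_out] = evaluate(params, data[held_out])
    return results


# Toy placeholders: samples are (bit_rate, distance, mos) tuples, the "model" is
# just the mean training MOS, and the score is the MSE on the held-out sequence.
def fit_mean(train: List[Any]) -> float:
    return sum(s[2] for s in train) / len(train)


def score_mse(mean_mos: float, test: List[Any]) -> float:
    return sum((s[2] - mean_mos) ** 2 for s in test) / len(test)
```

Splitting by sequence rather than by individual score ensures that no rating of the held-out content influences the fitted parameters.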

V. IMPLICATIONS FOR COMPRESSION
These subjective test results have implications for the choice of compression method and parameters.

A. CHOICE BETWEEN POINT CLOUD AND MESH
The second part of the subjective test indicates which representation is better for certain bit rates and observation distances. Fig. 10 and Fig. 11 provide some guidance on choosing the compression method. First, if the required bit rate is low, such as 3 Mbps, we should choose PC regardless of observation distance. Second, if the application requires a close observation distance and the required bit rate is not low, mesh should be chosen. For other cases, there is no preference between the representations.
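These guidelines amount to a small decision rule, sketched below. The function name, the string labels for distance, and in particular the 3 Mbps cutoff for "low" are assumptions made for illustration; the text gives 3 Mbps only as an example of a low bit rate, not as a measured threshold.

```python
def choose_representation(bit_rate_mbps: float, distance: str,
                          low_rate_mbps: float = 3.0) -> str:
    """Rule-of-thumb choice between point cloud and mesh compression.

    distance is one of "close", "medium" or "far" (hypothetical labels).
    low_rate_mbps is an assumed cutoff for what counts as a low bit rate.
    """
    if bit_rate_mbps <= low_rate_mbps:
        return "point cloud"      # low rate: PC regardless of distance
    if distance == "close":
        return "mesh"             # close viewing at a rate that is not low
    return "no preference"        # all other cases
```

For example, `choose_representation(10.0, "close")` selects mesh, while the same bit rate viewed from far away gives no preference.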
We could also choose the better representation based on the opinion score model. The bit rate and distance are inputs to our proposed models. In the left plot of Fig. 14

B. CHOICE OF NUMBER OF TRIANGLES AND SCALE FACTOR
There are several compression parameters for PC and mesh compression, including the number of points per frame for a point cloud, which could be controlled by the scale factor, and the number of triangles for a mesh. From the manually chosen compression parameter sets described in Sec. III B, we plot the number of triangles $N_t$ and scale factor $s$ vs. bit rate in Fig. 15. In this plot, $N_t$ and $s$ are averaged across sequences. From the plot, we notice that the trends are similar across different observation distances, especially for PC compression. The curve for mesh compression at close distance is slightly different because at close distance, people notice more texture details, so texture should account for more of the fixed bit rate. That leads to the lower number of triangles for best visual quality.
With increasing bit rate, the number of triangles tends to increase and the scale factor tends to decrease. We fit a linear model to the number of triangles $N_t$ and a power-law model to the scale factor $s$, the latter being:

$$s = \begin{cases} a \cdot r^{c}, & a \cdot r^{c} > 1 \\ 1, & \text{otherwise} \end{cases}$$

where k = 0.6496 is the parameter of the linear model, and a = 3.118 and c = −0.3667 are the parameters of the power-law model. The curve fitting result is shown in Fig. 15 in black. These models give a useful rule of thumb for estimating the number of triangles or the scale factor given the bit rate.
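As a quick way to apply the scale-factor rule of thumb, the clamped power law can be coded directly from the reported parameters. This sketch assumes that r is expressed in Mbps, as elsewhere in the paper; the function name is hypothetical, and the linear model for the number of triangles is left out because its exact form is not reproduced here.

```python
A_FIT = 3.118    # reported power-law coefficient a
C_FIT = -0.3667  # reported power-law exponent c


def scale_factor(bit_rate_mbps: float, a: float = A_FIT, c: float = C_FIT) -> float:
    """Scale factor s = a * r^c, clamped from below at 1."""
    s = a * bit_rate_mbps ** c
    return s if s > 1.0 else 1.0
```

With these parameters the power law falls below 1 at roughly 22 Mbps, so the clamp is active at the maximum rate of 25 Mbps, while lower bit rates yield scale factors above 1.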

VI. CONCLUSION
This study provides several new results regarding subjective quality of compressed 3D content as it relates to choice of representation, observation distance, bit rate, and scaling. The main conclusions for this paper are as follows:
• We designed a subjective test to compare the compression quality for point cloud and mesh representations. Point cloud compression is better for low bit rates, whereas mesh compression is preferred when the observation distance is close and the bit rate is not low. When the bit rate is high, there is little difference between the two representations.
• For the two representations, we propose two models that estimate people's opinion scores and fit the experimental data well under cross-validation. The models can be used to choose a representation based on observation distance and target bit rate.
In addition, when we generated parameter sets for our experiment, we found the general trend that reducing the number of mesh triangles or reducing the number of cloud points improved visual quality at low bit rates. Suggestions for reducing the number of mesh triangles and choosing the point cloud scale factor are provided. Such reductions are not routinely considered part of the compression pipeline for 3D content, but our finding fits the well-known result for 2D content that spatial down-sampling is useful at low bit rates (see, e.g., [62], [63]). Such reductions can play an important role in preserving quality at low bit rates for 3D content as well, and would be worthy of a further subjective study.
There remains considerable room for further study of subjective quality of PC and mesh compression. We chose the bit rate and the observation distance as model inputs because they are two very important factors which affect quality, and which also do not require any computation on the actual distorted sequence at the decoder. With increased complexity, there are many other factors which affect quality at a given bit rate and given observation distance, such as the color variation in the texture, the inherent geometric complexity of the sequence, or the rapidity of the motion. In future work, such factors will be considered to make the prediction of quality more accurate. Prediction of head and eye movement based on past movement or based on saliency has been carried out for 360 degree video [64], [65], and such work could inform the compression approaches as well as the viewing trajectories for future subjective experiments. One limitation of this study is the limited number of sequences used as the test data set. Future work should also include a larger number of test sequences covering diverse visual content.