SECTION I

## INTRODUCTION

We are currently witnessing something that has become a once-in-a-decade event in the world of video compression: the emergence of a major new family of video compression standards. The mid-1990s saw the introduction of the Moving Picture Experts Group (MPEG)-2 video coding standard (ITU-T Rec. H.262 and ISO/IEC 13818-2 [1]), the first compression standard to be widely adopted in broadcasting and entertainment applications. Advanced Video Coding (AVC) (ITU-T Rec. H.264 and ISO/IEC 14496-10 [2]) appeared in the mid-2000s, offering the same subjective quality at approximately half the bit rate. Now, a new standard, High Efficiency Video Coding (HEVC) (ITU-T Rec. H.265 and ISO/IEC 23008-2), has been developed that promises a further factor of two improvement in compression efficiency for the mid-2010s [3].

The HEVC standard has been jointly developed by the same two standardization organizations whose previous collaboration resulted in both MPEG-2 and AVC: 1) the ISO/IEC MPEG and 2) the ITU-T Video Coding Experts Group (VCEG), through the Joint Collaborative Team on Video Coding (JCT-VC) [4]. HEVC version 1 was ratified in 2013 as H.265 by the ITU-T and as MPEG-H Part 2 by ISO/IEC [5]. This first version supports applications that use conventional (single-layer) encoding of 4:2:0-sampled video with 8- or 10-bit precision. A second edition was completed in July 2014, and an additional extension was completed in February 2015 [6]. These extend the standard to support contribution applications with tools that enable 4:2:2- and 4:4:4-sampled video formats as well as 12- and 16-bit precision [7], and multilayer coding enhancements for efficient scalability [8] and stereo/multiview and depth-enhanced 3D compression [9]. A further amendment is currently being developed to enable more efficient coding of screen-captured graphics and text content and mixed-source content [10].

A few evaluations have previously been reported comparing the compression performance of HEVC with AVC and also demonstrating the suitability of HEVC for various applications, particularly including some evaluations for high-resolution video content [11]. The subjective test results for a number of test sequences and some analysis of the bit rate savings at different quality levels were presented in [12]. In [13], a study on the suitability of HEVC for beyond-HDTV broadcast services was presented, but with no comparison with previous standards. A comparison of HEVC and AVC performance for frame rates up to 30 Hz on a small number of ultra high definition (UHD) video sequences was presented in [14] and [15], and an informal study focused on low-delay (LD) applications and real-time encoding with HD resolutions was presented in [16]. All of these, including some that were carried out at the early stages of HEVC development, have provided very consistent evidence of the substantial coding efficiency improvements enabled by HEVC.

For its formal evaluation of HEVC performance, the JCT-VC performed a subjective evaluation on a wider range of content of resolutions varying from 480p ($832\times 480$) up to UHD with frame rates of up to 60 Hz. The test sequences used for the verification testing were deliberately chosen to be different from those that had been used during the development of the standard, to avoid any possible bias that the standard could have toward those sequences. A test report was produced for the JCT-VC itself [17], and some additional analysis of HEVC performance for UHD content has been presented in [18] (using cropped $2560\times 1600$ regions) and [19] (a recent brief conference publication about the testing).

As video coding standards generally specify only the format of the data and the associated decoding process without specifying how encoding is to be performed, it is not possible in general to test the compression performance of a standard. Some particular encoding methods must be used as a proxy to represent the capability of the standard instead. The outcome of such a comparison is generally more reliable when similar encoding techniques with similar configurations are applied in the compared encoders rather than simply comparing unknown technologies as black boxes. The verification tests were therefore performed using reference software encoders that had been developed in the standardization work and used very similar encoding algorithms and configurations that were selected to represent important applications. These publicly available reference software codebases are known as the HEVC model (HM) for HEVC [20] and the joint model (JM) for AVC [21].

While this paper reports an extended set of results of HEVC verification tests, it also summarizes the details of tools that can be used in the analysis of the results, pointing to a number of factors an evaluation should consider. Compared with the initial results [17], those presented in this paper are based on the use of more viewers for subjective tests, the objective results are presented and compared with the subjective results, and additional analysis of the coding gains versus the bit rate is provided. Ultimately, the verification test showed that the key objective of HEVC had been achieved—i.e., providing a substantial improvement in compression efficiency relative to its predecessor AVC.

This paper is organized as follows. An overview of video quality evaluation and statistical analysis methodology is presented in Section II, and the test settings used in the subjective evaluations are detailed in Section III. Section IV presents the test results and detailed analysis. Finally, the conclusion is given in Section V.

SECTION II

## VIDEO QUALITY EVALUATION

This section provides an overview of the objective and subjective video quality metrics and the related analysis used in this study.

For the convenience of video coding performance assessment, the most commonly used objective metric is peak signal-to-noise ratio (PSNR). However, it is commonly acknowledged that PSNR has the disadvantages of disregarding the viewing conditions and the characteristics of human visual system perception. In addition, the PSNR for a given video sequence can be computed in different ways, depending on how the picture components (e.g., luma and chroma) or individual picture PSNR values are combined. Nevertheless, for a particular content item and small variations of coding conditions, the changes in PSNR values for an overall video sequence can typically be reliably interpreted.

Other objective video quality metrics, such as the structural similarity index (SSIM) and video quality metric (VQM), have been proposed, but are not used nearly as frequently as PSNR [22]. VQM is not often used, primarily due to its computational complexity, and for both metrics, the interpretation of the values they provide has not yet become common practice in the video coding community. Therefore, in the context of HEVC and AVC compression, this paper provides comparisons and analysis using the PSNR objective measure and subjective quality evaluation results.

Subjective quality evaluation is the process of employing human viewers for grading video quality based on individual perception. Formal methods and guidelines for subjective quality assessments are specified in various ITU recommendations. Among these, the most relevant to this context are ITU-T Rec. P.910 [23], which defines subjective video quality assessment methods for multimedia applications, and ITU-R Rec. BT.500 [24], which defines a methodology for the subjective assessment of the quality of television pictures. These specifications describe a number of test methods with distinct presentation and scoring schemes, along with the recommended viewing conditions.

Explanations of the quality metrics and data analysis methods are provided in the following sections.

### A. Objective Quality Evaluation Using PSNR

PSNR is defined as the ratio between the maximum possible power of the signal (the original image) and the power of the noise, which in the considered scenario is introduced by lossy compression. For a decoded image component $I_{d}$, the mean square error (MSE) with reference to an original image component $I$ is computed as

$$\textrm{MSE}=\frac{\sum\nolimits_{i=0}^{M-1}\sum\nolimits_{j=0}^{N-1}(I(i,j)-I_{d}(i,j))^{2}}{M\cdot N} \tag{1}$$

where $M$ and $N$ are the width and height of the image component, and the image component is, for example, an array of luma samples or $C_{B}$ or $C_{R}$ chroma samples. The PSNR value, in decibels, is then computed as

$$\textrm{PSNR}=10\cdot \log_{10} \frac{(2^{B}-1)^{2}}{\mathrm{MSE}} \tag{2}$$

where $B$ is the bit depth of the image samples. This is typically calculated for each frame separately, and then averaged over the frames of a video sequence. Due to the logarithmic transformation, averaging the per-frame PSNR values corresponds to using the geometric average of the frame MSEs, and the impact of this should be critically considered when a high fluctuation over frames is present.
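The following Python sketch illustrates the MSE and PSNR definitions above and the remark about geometric averaging. The function names and the per-frame MSE values are hypothetical and chosen only for illustration; they are not taken from the verification test.

```python
import math

def mse(ref, dec):
    """Mean square error between two equally sized 2-D sample arrays."""
    total = 0
    count = 0
    for ref_row, dec_row in zip(ref, dec):
        for a, b in zip(ref_row, dec_row):
            total += (a - b) ** 2
            count += 1
    return total / count

def psnr_from_mse(mse_value, bit_depth=8):
    """PSNR in dB for a given MSE; infinite when the images are identical."""
    if mse_value == 0:
        return math.inf
    peak = (2 ** bit_depth - 1) ** 2
    return 10 * math.log10(peak / mse_value)

def sequence_psnr(frame_mses, bit_depth=8):
    """Arithmetic average of the per-frame PSNR values."""
    values = [psnr_from_mse(m, bit_depth) for m in frame_mses]
    return sum(values) / len(values)

# Hypothetical per-frame MSEs with one strongly degraded frame.
frame_mses = [1.0, 1.0, 1.0, 100.0]
avg_psnr = sequence_psnr(frame_mses)

# PSNR of the arithmetic mean of the MSEs (penalizes the bad frame more).
psnr_of_arith_mean = psnr_from_mse(sum(frame_mses) / len(frame_mses))

# PSNR of the geometric mean of the MSEs (matches the averaged PSNR).
geo_mean = math.exp(sum(math.log(m) for m in frame_mses) / len(frame_mses))
psnr_of_geo_mean = psnr_from_mse(geo_mean)

print(avg_psnr, psnr_of_geo_mean, psnr_of_arith_mean)
```

In this example the averaged PSNR equals the PSNR of the geometric-mean MSE and is noticeably higher than the PSNR of the arithmetic-mean MSE, which is why strong frame-to-frame fluctuation deserves attention.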

For video sequences, which ordinarily consist of three color components, either the luma PSNR value (${\rm PSNR}_{Y}$), calculated using only luma component values, may be reported or a weighted PSNR value (${\rm PSNR}_{W}$) using all three components can be computed using some weighting criteria. An example of a popular weighting for content with 4:2:0 sampling is

$${\mathrm{PSNR}}_{W}=\frac{6\cdot {\mathrm{PSNR}}_{Y}+{\mathrm{PSNR}}_{C_{B}}+{\mathrm{PSNR}}_{C_{R}}}{8}. \tag{3}$$
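As a small illustration of the 6:1:1 weighting, the sketch below combines three component PSNR values. The function name and the input values are hypothetical.

```python
def weighted_psnr(psnr_y, psnr_cb, psnr_cr):
    """Weighted PSNR with the 6:1:1 luma/chroma weighting for 4:2:0 content."""
    return (6 * psnr_y + psnr_cb + psnr_cr) / 8

# Hypothetical component PSNR values in dB.
example = weighted_psnr(40.0, 42.0, 44.0)
print(example)
```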

The most accurate interpretation of the objective results is obtained by looking at the frame-by-frame results for each component. However, this may not be practical for the final presentation of the results for a large data set and a large number of test points.

### B. Subjective Quality Evaluation

For the HEVC verification test, which includes a wide range of visual quality points, a degradation category rating (DCR) [23] test method was selected. Here, it was used to evaluate the quality (and not the impairment) with a quality rating scale made of 11 levels [23], ranging from 0 (lowest quality) to 10 (highest quality), which may be interpreted as in Table I. The numerical scale helps avoid misinterpretations associated with the use of category adjectives (e.g., excellent or good), especially in cases where the tests are performed across different countries and involve nonnative English speakers.

Table I Subjective Quality Scale With 11 Points Used in an HEVC Verification Test

The basic results of the subjective test are evaluated in terms of the average rating, which is called the mean opinion score (MOS), and the associated confidence interval values that are computed for each coding point, after having verified the reliability of each viewer. For the DCR method, it is recommended to hire more than 15 naïve viewers that have been properly screened for visual acuity and color blindness, to allow for an accurate statistical analysis of the subjective scores [24].

From the raw data, i.e., the individual subjective scores, the reliability of each viewer is calculated. The individual reliability is evaluated using the correlation coefficient $r$ computed between each score $x_{i}$ provided by a viewer and the overall MOS value $y_{i}$ assigned for that test point $i$ as

$$r=\frac{\sum\nolimits_{i=1}^{T}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum\nolimits_{i=1}^{T}(x_{i}-\bar{x})^{2}\cdot \sum\nolimits_{i=1}^{T}(y_{i}-\bar{y})^{2}}} \tag{4}$$

where $T$ is the total number of test points for a viewing session, $y_{i}$ is the average of all scores for the test point $i$, and $\bar{x}$ and $\bar{y}$ are the average values of $x_{i}$ and $y_{i}$ for all test points, respectively. In this HEVC verification test, a correlation coefficient greater than or equal to 0.75 is considered as valid for the acceptance of the viewer’s scorings; otherwise, the viewer is considered as an outlier. Once the results for outliers are discarded, the MOS for each test point is computed using the arithmetic average of scores of the remaining viewers. In addition, the confidence interval is computed for each test point to estimate a range of values covered by a certain probability.
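The screening procedure can be sketched in Python as follows. The data layout (a dictionary mapping a viewer name to a list of ratings), the function names, and the scores are hypothetical; treating a viewer with constant ratings as uncorrelated is an assumption of this sketch, not something stated in the test description.

```python
import math

def pearson(x, y):
    """Correlation coefficient r between two equally long score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant ratings carry no correlation
    return cov / math.sqrt(var_x * var_y)

def screen_and_score(scores, threshold=0.75):
    """Discard viewers with r < threshold, then compute the MOS per test point.

    scores: dict mapping viewer name -> list of ratings, one per test point.
    """
    n_points = len(next(iter(scores.values())))
    # y_i: average of all scores for test point i (before screening).
    overall = [sum(v[i] for v in scores.values()) / len(scores)
               for i in range(n_points)]
    kept = {name: v for name, v in scores.items()
            if pearson(v, overall) >= threshold}
    # MOS: arithmetic average over the remaining viewers.
    mos = [sum(v[i] for v in kept.values()) / len(kept)
           for i in range(n_points)]
    return sorted(kept), mos

# Hypothetical ratings on the 0-10 scale for four test points.
ratings = {
    "v1": [2, 4, 6, 8],
    "v2": [3, 5, 6, 9],
    "v3": [1, 4, 7, 8],
    "v4": [8, 3, 7, 2],  # rates against the trend of the other viewers
}
kept_viewers, mos_values = screen_and_score(ratings)
print(kept_viewers, mos_values)
```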

Assuming a Gaussian (normal) distribution for the population of subjective scores with sample size $n$, mean (MOS) $\mu$, and sample-based standard deviation measurement $s$, the confidence interval is defined as ($\mu -c, \mu +c$), where $c$ is computed as

$$c=z\cdot \frac{s}{\sqrt{n}}. \tag{5}$$

In the analysis of the subjective test results, the 95% confidence interval, as shown in Fig. 1, is calculated for each test point. For a 95% confidence interval with a Gaussian distribution, the value of $z$ in (5) is 1.96. For the results presented in Section IV, the confidence interval is plotted alongside the MOS, as shown in Fig. 2, with an interpolated curve from MOS values.
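A minimal Python sketch of the confidence interval computation is given below. It assumes that the sample standard deviation $s$ uses the $n-1$ denominator, which the text does not state explicitly; the function name and the scores are hypothetical.

```python
import math

def mos_confidence_interval(scores, z=1.96):
    """Return (MOS, lower bound, upper bound) for one test point."""
    n = len(scores)
    mos = sum(scores) / n
    # Sample standard deviation; the n - 1 denominator is an assumption.
    s = math.sqrt(sum((x - mos) ** 2 for x in scores) / (n - 1))
    c = z * s / math.sqrt(n)
    return mos, mos - c, mos + c

# Hypothetical scores from 16 viewers for a single test point.
point_scores = [6] * 8 + [8] * 8
mos, low, high = mos_confidence_interval(point_scores)
print(mos, low, high)
```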

Fig. 1. Confidence interval for normal distribution and 95% probability.
Fig. 2. Example of a confidence interval related to a test point and an associated MOS versus bit rate curve.

### C. Interpretation of Bit Rate Savings From Subjective Quality Comparison

The objective of the verification test is to gauge the bit rate savings of HEVC over AVC when the AVC and HEVC test points have the same subjective quality.

Fig. 3 shows an example of a plot comparing the AVC and HEVC MOS versus bit rate curves. There is no overlap in the MOS confidence intervals of the HEVC test point C and AVC test point B, and hence, there is sufficient statistical significance to conclude that the HEVC test point C has a better quality than the AVC test point B. There is, however, an overlap in the MOS confidence intervals of the HEVC test point A and AVC test point B. This means that it is highly likely that the HEVC test point A and AVC test point B have subjective quality that cannot be distinguished. However, there is still a chance that the subjective qualities of HEVC test point A and AVC test point B are not the same.

Fig. 3. Overlapping confidence interval of test points.

A more rigorous analysis is to perform a two-sample unequal variance (heteroscedastic) Student’s $t$-test using the two-tailed distribution to determine whether the subjective qualities given by the sample mean values of the pair of test points are indeed not the same. The null hypothesis, $H_{0}$, in this case is that the HEVC test point has the same quality as the AVC test point, and the alternate hypothesis, $H_{a}$, is that the HEVC test point does not have the same quality as the AVC test point.

To compare the means of two populations, the $t$-statistic can be used, which is expressed as

$$t=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2}}} \tag{6}$$

where $\bar{X}_{i}$, $s_{i}^{2}$, and $n_{i}$ denote the sample mean, the sample variance, and the size of the $i$th sample, $i\in \{1, 2\}$. By computing the $t$-statistic in this way and approximating it with a Student’s $t$-distribution whose number of degrees of freedom (DF) is specified as

$$\textrm{DF}=\frac{\big(s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2}\big)^{2}}{\big(s_{1}^{2}/n_{1}\big)^{2}/(n_{1}-1)+\big(s_{2}^{2}/n_{2}\big)^{2}/(n_{2}-1)} \tag{7}$$

a probability value $p$ can be computed from the $t$-statistic that indicates the extent to which the means of the two populations are considered to be different. The smaller the $p$-value is, the more significant the difference between the means of the two populations is.

A $p$-value less than 0.05 indicates a very low probability of committing a type-I error (i.e., rejecting the null hypothesis when it is true). In such a case, the null hypothesis can thus be safely rejected, and it can be concluded that there is statistical significance that the HEVC test point does not have the same quality as the AVC test point. A $p$-value greater than or equal to 0.05 means that the null hypothesis cannot be confidently rejected. For the purpose of this paper, the HEVC test point is considered to have the same quality as the corresponding AVC test point in such a case. However, there is still a possibility of committing a type-II error (i.e., failure to reject the null hypothesis when in fact the alternate hypothesis is true). The power or sensitivity of a statistical test is the probability of correctly rejecting the null hypothesis ($H_{0}$) when it is false, i.e., the probability of correctly accepting the alternative hypothesis ($H_{a}$) when it is true [25]. A statistical power test of the data has shown that if in fact the true population mean for the difference in the HEVC MOS and AVC MOS is greater than or equal to 0.8, then the mean probability of committing a type-II error, $\beta$, is less than or equal to 0.14, and hence the mean power of the test (defined as $1-\beta$) is at least 0.86. By convention, a test with a power greater than 0.8 (or $\beta \le 0.2$) is considered statistically powerful [26].
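The unequal-variance $t$-test can be sketched in plain Python as shown below. This is an illustration under stated assumptions, not the tool used for the verification test: the function names and the scores are hypothetical, and the two-tailed $p$-value is obtained by numerically integrating the Student’s $t$ density with Simpson’s rule rather than by calling a statistics library.

```python
import math

def welch_t_and_df(x1, x2):
    """t-statistic and degrees of freedom for two samples with unequal variances."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson integration on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central)

# Hypothetical viewer scores for one HEVC/AVC pair of test points.
hevc_scores = [7, 8, 7, 9, 8, 7, 8, 8, 7, 9, 8, 7, 8, 8, 7, 8]
avc_scores = [7, 7, 8, 8, 7, 7, 8, 9, 7, 8, 8, 7, 7, 8, 7, 8]
t_stat, dof = welch_t_and_df(hevc_scores, avc_scores)
p_value = two_tailed_p(t_stat, dof)
print(t_stat, dof, p_value)
```

A $p$-value of at least 0.05 for such a pair would be read, in the sense described above, as the two test points having the same quality.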

In the design of the verification test, four bit rates per codec, $R_{\mathrm {HEVC}}$ and $R_{\mathrm {AVC}}$, were used. The bit rates were carefully selected for each of 20 test sequences so that each $R_{\mathrm {HEVC}}$ is approximately half of the corresponding $R_{\mathrm {AVC}}$. These gave 80 pairs of test points on which the $t$-test described above was applied. The results of the test determine, for each pair of test points, whether the HEVC test point has a quality better than, the same as, or less than the AVC test point, and give a rough estimate of the bit rate savings of HEVC compared with AVC. The following are the possible outcomes for each pair of test points.

The first case is when the null hypothesis is rejected and there is statistical significance that the HEVC MOS at $R_{\mathrm {HEVC}}$ is greater than the AVC MOS at $R_{\mathrm {AVC}}$. This means that one can reasonably conclude that the HEVC test point is achieving a better quality than the AVC test point at half the bit rate of AVC. Note that by the design of the test, the bit rate saving when $R_{\mathrm {HEVC}}$ is half $R_{\mathrm {AVC}}$ is 50%. Since the bit rate for an HEVC test point could be further reduced to achieve the same quality, the bit rate saving of HEVC compared with AVC is therefore greater than 50% for this case.

The second case is when the null hypothesis cannot be rejected. This means that the HEVC test point has about the same quality as the AVC test point at half the AVC bit rate, since $R_{\mathrm {HEVC}}$ is approximately half of $R_{\mathrm {AVC}}$. Therefore, the bit rate saving of HEVC compared with AVC is approximately 50% for this case.

The third case is when the null hypothesis is rejected and there is statistical significance to conclude that the HEVC MOS at $R_{\mathrm {HEVC}}$ is less than the AVC MOS at $R_{\mathrm {AVC}}$. This means that the HEVC test point is not achieving equal or better quality than the AVC test point at half the bit rate of AVC. More bits would need to be allocated to the HEVC test point before the same quality would be achieved. Therefore, the bit rate saving of HEVC compared with AVC is less than 50% for this case.
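The three outcomes can be summarized as a small decision rule. The sketch below assumes, as in the test design, that $R_{\mathrm {HEVC}}$ is approximately half of $R_{\mathrm {AVC}}$ for the pair; the function name, the significance level parameter, and the returned labels are hypothetical.

```python
def classify_pair(p_value, mos_hevc, mos_avc, alpha=0.05):
    """Map a t-test outcome for one pair of test points to a saving category.

    Assumes the HEVC bit rate is approximately half of the AVC bit rate.
    """
    if p_value >= alpha:
        # Null hypothesis not rejected: same quality at half the bit rate.
        return "approximately 50%"
    if mos_hevc > mos_avc:
        # Significantly better quality at half the bit rate.
        return "greater than 50%"
    # Significantly worse quality at half the bit rate.
    return "less than 50%"

print(classify_pair(0.01, 8.2, 7.1))
print(classify_pair(0.40, 7.3, 7.1))
print(classify_pair(0.01, 6.0, 7.1))
```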

### D. Bjøntegaard Model

The Bjøntegaard model [27], [28] has become a popular tool for evaluating the coding efficiency of a given video codec in comparison with a reference codec over a range of quality points or bit rates. Bjøntegaard delta (BD) metrics are typically computed as a difference in bit rate or a difference in quality based on interpolating curves from the tested data points. In this paper, the focus is on the difference in bit rate, expressed as a percentage of a reference bit rate, as this is easily interpreted as the bit rate saving benefit for equal measured quality.

The BD-rate represents the average bit rate savings for the same video quality (e.g., PSNR or MOS) and is calculated between two rate-distortion curves, such as AVC and HEVC MOS curves in Fig. 3. The bit rate saving difference between the two rate-distortion curves at a given level of quality is

$$\Delta R(D)=\frac{R_{B}(D)-R_{A}(D)}{R_{A}(D)} \tag{8}$$

where $R_{A}(D)$ and $R_{B}(D)$ are the bit rates of the interpolated reference and tested bit rate curves, respectively, at the given level of quality/distortion $D$. $\Delta R(D)$ is typically represented as a percentage of the reference bit rate $R_{A}(D)$ so that a negative value represents compression gain, while a positive value represents compression loss.

The Bjøntegaard model uses a logarithmic scale for the domain of the bit rate interpolation, so by defining $r=\log_{10} R$, the bit rate savings can be expressed as

$$\Delta R(D)=10^{r_{B}(D)-r_{A}(D)}-1. \tag{9}$$

Taking into account the actual measured rate-distortion points ($R(i)$, $D(i)$), the fitted rate-distortion curves $\hat{r}(D)$ are used in BD-rate computation. Over a range of quality levels, the BD-rate is approximated as

$$\Delta R_{\mathrm{Overall}} \approx 10^{\frac{1}{D_{H}-D_{L}}\int_{D_{L}}^{D_{H}}[\hat{r}_{B}(D)-\hat{r}_{A}(D)]\,dD}-1 \tag{10}$$

where the lower $D_{L}$ and higher $D_{H}$ integration bounds are computed from the range of the interpolated distortion values $D_{A}$ and $D_{B}$ for the reference and tested data sets, respectively, as

$$\begin{aligned} D_{L}&=\max \Bigl\{\min (D_{A}(0),\ldots ,D_{A}(N_{A}-1)),\ \min (D_{B}(0),\ldots ,D_{B}(N_{B}-1))\Bigr\}\\ D_{H}&=\min \Bigl\{\max (D_{A}(0),\ldots ,D_{A}(N_{A}-1)),\ \max (D_{B}(0),\ldots ,D_{B}(N_{B}-1))\Bigr\} \end{aligned} \tag{11}$$

where $D(0)$ is the lowest and $D(N-1)$ is the highest measured quality point, for either the tested or reference sets, as shown in Fig. 4.

Fig. 4. Illustration of BD-rate computation.

In the HEVC verification test, the number of test points for both the reference and evaluated sets is four (i.e., $N_{A}=N_{B}=4$) and the curve fitting uses cubic spline interpolation.
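The BD-rate computation can be sketched in Python as follows. To keep the example short and free of external dependencies, this sketch swaps the cubic spline interpolation for piecewise-linear interpolation of $\log_{10} R$ over quality, so its numbers will generally differ from a spline-based tool. The function names and the rate and quality values are hypothetical, and each quality list is assumed to be strictly increasing.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            w = (x - xs[k]) / (xs[k + 1] - xs[k])
            return ys[k] + w * (ys[k + 1] - ys[k])
    raise ValueError("x outside interpolation range")

def bd_rate(rates_a, dist_a, rates_b, dist_b, samples=1000):
    """Average bit rate difference of curve B relative to reference curve A.

    Returns a fraction: -0.5 means B needs 50% less bit rate than A.
    """
    log_a = [math.log10(r) for r in rates_a]
    log_b = [math.log10(r) for r in rates_b]
    # Integration bounds: the overlapping quality range of both curves.
    d_low = max(min(dist_a), min(dist_b))
    d_high = min(max(dist_a), max(dist_b))
    if d_high <= d_low:
        raise ValueError("quality ranges do not overlap")
    step = (d_high - d_low) / samples
    total = 0.0
    for k in range(samples + 1):
        d = min(d_low + k * step, d_high)  # clamp against rounding
        diff = _interp(d, dist_b, log_b) - _interp(d, dist_a, log_a)
        total += diff * (0.5 if k in (0, samples) else 1.0)  # trapezoid weights
    mean_log_diff = total / samples
    return 10 ** mean_log_diff - 1

# Hypothetical test points: four per codec, quality given as PSNR in dB.
ref_rates = [1000, 2000, 4000, 8000]   # reference curve A (kbit/s)
ref_quality = [32.0, 35.0, 38.0, 41.0]
test_rates = [500, 1000, 2000, 4000]   # tested curve B (kbit/s)
test_quality = [32.0, 35.0, 38.0, 41.0]
saving = bd_rate(ref_rates, ref_quality, test_rates, test_quality)
print(f"BD-rate: {100 * saving:.1f}%")
```

Because the tested curve in this example reaches every quality level at exactly half the reference bit rate, the computed value is a 50% reduction, reported as a negative BD-rate.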

As can be observed from Fig. 4, in some cases, the overall BD-rate measure may be computed over a relatively small interval of overlapping distortion regions. In such a case, the BD-rate metric does not necessarily represent average coding efficiency for all test points involved in the actual test. Therefore, it is important to design the test in a way that the distortion overlap between the two tested codecs covers the range of qualities of interest for the specific application.

As the metrics derived from the Bjøntegaard model can be applied to different evaluation criteria, it is important to understand the range on which they are computed. For example, BD-rate can be computed for MOS and PSNR, for the same test material. However, as demonstrated in Fig. 5, the actual bit rates on which the two are computed may not be the same. Therefore, in addition to providing BD-rates for both PSNR and MOS in our evaluation reported in Section IV, we also compute BD-rate on the bit rate interval common for both criteria (MOS and PSNR).

Fig. 5. BD-rate for PSNR and MOS ratings. Actual bit rate ranges and test points taken into account do not necessarily overlap.

Newer studies [29] have shown how the Bjøntegaard model can further be extended to compute BD-rate intervals considering the confidence intervals of the MOS ratings for each test point, as shown in Fig. 6. The dotted curves show the boundaries of confidence intervals for each curve, and two new BD-rate values are computed comparing $D_{B,\mathrm {min}}$ with $D_{A,\mathrm {max}}$ (labeled BD-rate$_{\min}$) and $D_{B,\mathrm {max}}$ with $D_{A,\mathrm {min}}$ (labeled BD-rate$_{\max}$), where [$D_{\mathrm{min}}$, $D_{\mathrm{max}}$] represents the 95% confidence intervals of MOS. The new BD-rates thus provide lower and upper limits for the BD-rate. However, it is noted that these three values of BD-rate are based on different reference (AVC) bit rate ranges, as shown in Fig. 6. Although these intervals are reported in the results in Section IV, they have to be carefully interpreted, as the limits of the intervals are defined for significantly different bit rate ranges. Nevertheless, for relatively small differences between rate-distortion curves, it can be useful to evaluate BD-rate confidence intervals.

Fig. 6. BD-rate with MOS confidence intervals.
Fig. 7. BTCs for subjective evaluation.
SECTION III

## TEST SETTINGS

This section provides information regarding the test material used, test settings, and logistics.

### A. Selection of Test Material and Test Points

The HEVC verification tests were carried out for four categories of spatial resolutions: UHD (3840 $\times$ 2160, except for the Traffic sequence, which is 4096 $\times$ 2048), 1080p (1920 $\times$ 1080), 720p (1280 $\times$ 720), and 480p (832 $\times$ 480). The details of the test sequences are provided in Tables II and III. Screenshots are given in Figs. 8-11. The sequences are selected from different sources and have different spatiotemporal characteristics, which leads to different behavior of compression algorithms. This is the first formal test of video compression standards where a wide range of resolutions including content with UHD resolution with high frame rate has been evaluated. The UHD format, as specified in ITU-R Rec. BT.2020 [30], has a number of extended features compared with ordinary HD video. In addition to containing more pixels per frame, it specifies support for higher frame rates, wider color gamut, and higher bit depths [31]. However, for compatibility with available playout and display systems, all tested video sequences have 8 bits per component per sample and are in the Y’$\text{C}_{B}\text{C}_{R}$ color space defined by ITU-R Rec. BT.709 [32].

Fig. 8. UHD test sequences.
Fig. 9. 1080p test sequences.
Fig. 10. 720p test sequences.
Fig. 11. 480p test sequences.
Table II Test Sequences
Table III Parameters of Used Test Sequences

The test sequences were compressed using HEVC (HM-12.1, Main profile [20]) and AVC (JM-18.5, High profile [21]) encoding. Either a random access (RA) or low delay (LD) configuration (Cfg) was used (similarly configured for both HEVC and AVC, with a refresh period of approximately 1 s and hierarchical referencing for RA and with no periodic refresh and no reordered referencing for LD). For each test sequence, four test points using different fixed quantization parameter (QP) settings were selected so that the tested HEVC bit rates are approximately half of the AVC bit rates. Also, the ranges of the QP values were selected so that the subjective quality of the encoded sequences spans a large range of MOS values. This bit rate ratio was chosen because it is already well established that the quality of HEVC is much better than the quality of AVC at the same bit rate. These test conditions can identify whether a bit rate saving of 50% or more is achieved for the majority of the tested video sequences. The full details of the QP values and bit rates can be found in [17].

### B. Subjective Test Structure

Subjective tests for different categories of video sequences were conducted in separate sessions. Each subjective test session of the DCR method consisted of a series of basic test cells (BTCs). Each BTC was made of two consecutive presentations of the video clip under test, as shown in Fig. 7. First, the original version of the video sequence was displayed, followed by the coded version of the video sequence, with a gap of 1 s. Then, a message asking the viewers to vote was displayed for 5 s.

Each test session was designed with 45 BTCs in total: the first three BTCs represented the stabilization phase and were selected to show to the viewers the whole range of quality they would see during the test. Two BTCs showing original versus original were also used as a sanity check for the range of ratings made by the viewers. The scores coming from the original BTCs and from the stabilization phase were excluded from further analysis. All the BTCs were randomly ordered to avoid the same content being seen repeatedly and to spread the quality as much as possible in a uniform way across the whole test.

### C. Subjective Test Logistics

The subjective tests were performed at two sites, in a controlled laboratory environment, adhering to the recommendations in ITU-R Rec. BT.500 and ITU-T Rec. P.910. The tests for UHD and 720p resolutions were done at the BBC R&D labs in London and for the other resolutions at the University of the West of Scotland (UWS). The equipment and session details for each test are given in Tables IV and V, respectively. Additional analysis of the influence of viewing distances (front and back rows in the seating arrangement for viewers) on the subjective rating has been provided in [19].

Table IV Test Logistics (BBC)
Table V Test Logistics (UWS)
SECTION IV

## RESULTS

This section summarizes the subjective test results. It also provides an analysis with a focus on a comparison with the objective test results, which are easier to obtain in practice.

### A. Objective and Subjective Test Results

The subjective evaluation results for each category of test sequences are shown in Figs. 12-15 in the form of MOS versus bit rate plots. The objective quality metric (PSNR) values for the same test points are also plotted on the same graph using a second vertical axis. Note that the scales for the two vertical axes in these plots are independently selected, and thus no direct connection between the subjective (i.e., MOS) and objective (i.e., PSNR) plots is demonstrated. The legend in all plots shows that circle and triangle markers represent the results for actual test points, while the curves between them were calculated using cubic spline interpolation with the bit rate in a logarithmic scale, as in typical BD-rate computation. Only the parts of the curves related to BD-rate calculation, either for MOS or PSNR BD-rate computation, are displayed. The solid and dotted lines represent MOS and PSNR curves, respectively. Confidence intervals are displayed for each MOS test point. The PSNR results presented are for the luma color component only. Because of space limitations, the chroma results are not presented. However, the authors note that the weighted PSNR results as in (3) are highly correlated to luma-only PSNR results. Typically, the weighted PSNR has a somewhat higher value than luma PSNR, but the values of BD-rate for weighted PSNR are typically close to the values of BD-rate for luma PSNR.

Fig. 12. Subjective and objective evaluation results for UHD content; subjective results, with associated 95% confidence intervals.
Fig. 13. Subjective and objective evaluation results for 1080p content; subjective results, with associated 95% confidence intervals.
Fig. 14. Subjective and objective evaluation results for 720p content; subjective results, with associated 95% confidence intervals.
Fig. 15. Subjective and objective evaluation results for 480p content; subjective results, with associated 95% confidence intervals.

By considering the positions of MOS points for AVC and HEVC, it can be observed that HEVC achieves the same subjective quality as AVC while typically requiring substantially lower bit rates. Table VI shows the results of the Student’s $t$-test on the 80 pairs of HEVC and AVC test points. These pairs of test points were classified into the three categories of bit rate savings as described in Section II-C. The first four rows show the distribution of bit rate savings achieved for each resolution. The last row summarizes the distribution of bit rate saving statistics for all resolutions. This shows that for 74 out of the 80 pairs of test points (or 92.5%), HEVC has a bit rate saving compared with AVC that is greater than or equal to 50%. The amount of bit rate saving is similar for both the RA and LD test cases. Only six pairs of test points (or 7.5%) show a bit rate saving of less than 50%. Four of the six pairs of test points were contributed by one video sequence (SVT04a), where the HEVC encoder did not perform as well as in the other cases.

Table VI Coding Gain Estimates Using $t$-Test on MOS Score

The data convincingly show that HEVC achieves a bit rate saving of at least 50%. However, the granularity of this estimate is limited by the test design, in which the HEVC bit rate in each pair of test points was selected to be approximately half the AVC bit rate. To quantify the bit rate savings more precisely, a different method is used: the MOS BD-rate values are computed from the subjective ratings for each test sequence, as shown in Table VII. The upper and lower limits of the BD-rates corresponding to the 95% confidence interval of the MOS, as discussed in Section II-D, are also indicated.

Table VII BD-Rates for Subjective and Objective Evaluation Results

In addition to the BD-rates for the available MOS range of each video sequence, an additional measurement was also calculated for the range of MOS scores greater than or equal to seven. This range (MOS $\ge$ 7) is typically expected for a number of services, such as broadcasting, where targeted quality levels are good to excellent. Negative BD-rate values in Table VII indicate the bit rate savings measured for HEVC relative to the bit rate used for AVC.
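The sign convention and the restricted MOS range can be made concrete with a simplified sketch. It deliberately swaps the cubic spline interpolation used in the paper for piecewise-linear interpolation of log bit rate against quality, so its numbers would differ from those in Table VII. The function names, the `q_min` parameter, and the rate/MOS points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            w = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    raise ValueError("value outside the interpolation range")


def _curve(points):
    """Split (bitrate, quality) points into quality and log10(bitrate) lists."""
    pts = sorted(points)
    q = [p[1] for p in pts]
    if any(b <= a for a, b in zip(q, q[1:])):
        raise ValueError("quality must increase strictly with bit rate")
    return q, [math.log10(p[0]) for p in pts]


def bd_rate(ref_points, test_points, q_min=None):
    """Average bit rate difference (%) of test vs. reference at equal quality.

    Negative values mean the test codec needs fewer bits. q_min optionally
    restricts the averaging to qualities at or above a floor (e.g. MOS >= 7).
    """
    rq, rl = _curve(ref_points)
    tq, tl = _curve(test_points)
    lo, hi = max(rq[0], tq[0]), min(rq[-1], tq[-1])
    if q_min is not None:
        lo = max(lo, q_min)
    if hi <= lo:
        raise ValueError("no overlapping quality range")
    # The trapezoid rule over all knots is exact for piecewise-linear curves.
    knots = sorted({lo, hi} | {q for q in rq + tq if lo < q < hi})
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        da = _interp(a, tq, tl) - _interp(a, rq, rl)
        db = _interp(b, tq, tl) - _interp(b, rq, rl)
        area += 0.5 * (da + db) * (b - a)
    return (10 ** (area / (hi - lo)) - 1) * 100


# Hypothetical (kbit/s, MOS) points: HEVC needs half the AVC rate throughout.
avc = [(1000, 5.0), (2000, 7.0), (4000, 9.0)]
hevc = [(500, 5.0), (1000, 7.0), (2000, 9.0)]
print(bd_rate(avc, hevc))            # close to -50
print(bd_rate(avc, hevc, q_min=7))   # same curves, restricted to MOS >= 7
```

Because the averaging is done on the log bit rate difference, a codec that needs half the bits at every quality level yields a BD-rate of -50% regardless of the quality range chosen.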

The averages in Table VII are computed only over the BD-rate values displayed in the table. Note that the MOS BD-rate results for the HomelessSleeping and Cubicle sequences are excluded. For the BD-rate interpolation to work correctly, the MOS values should follow a smooth curve, and the averaging interval (shown as curves in Figs. 12–15) should be interpolated from at least three points whose MOS increases monotonically with bit rate. This condition was not satisfied for the omitted test sequences (although a very substantial gain is evident for HEVC in both omitted cases).
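The point-count and monotonicity condition can be expressed as a simple check. This is a sketch under stated assumptions: the function name is hypothetical, "monotonically increasing" is read as strictly increasing so that the rate-quality curve can be inverted, and the smoothness requirement is not tested.

```python
def usable_for_bd_rate(points, min_points=3):
    """Return True if (bitrate, mos) points can support BD-rate interpolation.

    Requires at least min_points test points whose MOS rises strictly with
    bit rate. Smoothness of the curve is not checked here.
    """
    pts = sorted(points)  # order by bit rate
    if len(pts) < min_points:
        return False
    return all(
        r1 > r0 and q1 > q0
        for (r0, q0), (r1, q1) in zip(pts, pts[1:])
    )


# Hypothetical points: the second set dips in MOS at the middle bit rate.
print(usable_for_bd_rate([(1000, 5.0), (2000, 7.0), (4000, 8.5)]))
print(usable_for_bd_rate([(1000, 6.0), (2000, 5.5), (4000, 8.0)]))
```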

It is noted that for most video sequences, the MOS-based BD-rate benefit is substantially higher than what is measured by the PSNR-based BD-rate. MOS BD-rates indicate that HEVC could provide the same visual quality as AVC for most tested content categories at well below half the bit rate of the latter, surpassing the performance expected at the launch of the HEVC standard development process.

As discussed in Section II, the BD-rates for MOS, confidence intervals, and PSNR in Table VII were not computed over the same bit rate range for a given test sequence. This issue is addressed in the following section.

### B. Compression Efficiency Results for Specific Bit Rates

Although different BD-rate measures presented in Table VII are useful indicators of the compression performance of HEVC, comparing them with each other raises validity concerns, as they are computed for different reference bit rate ranges, as discussed in Section II. To partly address this problem, additional analysis of the results has been conducted to evaluate compression efficiency for bit rates that are common to both PSNR and MOS BD-rate computation.

The bit rate savings achieved by HEVC with reference to the associated AVC bit rate, computed over the continuous bit rate range using cubic spline interpolation as discussed in Section II-D, are shown in Fig. 16 for both subjective and objective quality assessments. In the majority of cases, the bit rate savings for equal MOS are higher than the bit rate savings for equal PSNR. The savings vary with bit rate and with video source content. Considering only the parts of the curves in Fig. 16 that are defined for both MOS and PSNR at a given AVC bit rate, we have measured the average BD-rates and the average difference between the PSNR and MOS curves, which are presented in Table VIII. The results in Table VIII exclude the test sequences that were omitted because of the MOS BD-rate computation problems described in Section IV-A (HomelessSleeping and Cubicle).
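The saving at a single reference bit rate can be sketched as follows: read the quality that AVC reaches at that bit rate, find the bit rate at which HEVC reaches the same quality, and express the difference relative to the AVC bit rate. The sketch substitutes piecewise-linear interpolation in the log bit rate domain for the cubic spline used in the paper, and the function names and data points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            w = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    raise ValueError("value outside the interpolation range")


def saving_at(avc_rate, avc_points, hevc_points):
    """Bit rate saving (%) of HEVC at the quality AVC reaches at avc_rate.

    Points are (bitrate, quality) tuples with quality strictly increasing
    in bit rate. Raises ValueError where either curve is undefined.
    """
    avc, hevc = sorted(avc_points), sorted(hevc_points)
    a_log = [math.log10(r) for r, _ in avc]
    a_q = [q for _, q in avc]
    h_log = [math.log10(r) for r, _ in hevc]
    h_q = [q for _, q in hevc]
    quality = _interp(math.log10(avc_rate), a_log, a_q)  # AVC quality at this rate
    hevc_log_rate = _interp(quality, h_q, h_log)         # HEVC rate for equal quality
    return (1 - 10 ** hevc_log_rate / avc_rate) * 100


# Hypothetical (kbit/s, quality) points, for illustration only.
avc = [(1000, 5.0), (2000, 7.0), (4000, 9.0)]
hevc = [(500, 5.0), (1000, 7.0), (2000, 9.0)]
print(saving_at(2000, avc, hevc))  # close to 50
```

Running the same function once with MOS points and once with PSNR points, and keeping only the reference bit rates at which both calls succeed, mirrors the restriction to curve segments defined for both measures.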

Fig. 16. Bit rate savings provided by HEVC at various reference bit rates. (a) UHD content. (b) 1080p content. (c) 720p content. (d) 480p content.
Table VIII Average Differences Between MOS BD-Rate and PSNR BD-Rate for the Same Reference Bit Rates

For the same bit rate range, the MOS-based BD-rate saving is 59% and the PSNR-based BD-rate saving is 44%, averaged over all test sequences. Depending on the content category, the average difference between the two measurement methods is between 11 and 18 percentage points. In other words, the bit rate savings of HEVC over AVC measured for equal PSNR are lower than those measured for equal MOS by roughly 15 percentage points. This is consistent with the difference between the average values in Table VII, except that here the MOS and PSNR rate savings are measured over the same bit rate range.

SECTION V

## CONCLUSION

This paper presents the results of a formal subjective verification test that was carried out by the JCT-VC for the new HEVC video coding standard. A more rigorous analysis of the subjective test results using Student's $t$-test shows that the HEVC test points at half or less than half the bit rate of the AVC reference achieved comparable quality in 92.5% of the test cases. In addition, the paper provides a summary of evaluation metrics and an analysis of the results obtained, with a focus on the comparison between subjective and objective evaluation results and on the performance across different bit rate ranges.

A more precise quantitative estimate of the bit rate savings was obtained by applying the MOS-based BD-rate measurement to the results of the subjective test. For the investigated test cases, the HEVC Main profile was found to achieve the same subjective quality as the AVC High profile while requiring on average approximately 59% fewer bits. The PSNR-based BD-rate saving averaged over the same sequences was 44%. This confirms that the subjective quality improvements of HEVC are typically greater than the objective quality improvements measured by the method that was primarily used during the standardization process of HEVC.

It can therefore be concluded that the HEVC standard is able to deliver the same subjective quality as AVC, while on average (and in the vast majority of typical sequences) requiring only half or even less than half of the bit rate used by AVC. This means that the initial objective of the HEVC development (substantial improvement in compression compared with the previous state of the art) has been successfully achieved.

### Acknowledgment

The authors would like to thank the contributors of the video test sequence material for permitting its use for the tests and this publication, and their collaborators in the JCT-VC for their assistance in developing the test plan and analyzing its results.

## Footnotes

This paper was recommended by Associate Editor T. Wiegand.

1. The test sequences Manege and SedofCropped are “Copyright © 2012-2013, all rights reserved to the 4EVER participants and their licensors. 4EVER consortium: Orange, Technicolor, ATEME, France Télévisions, INSA-IETR, Globecast, TeamCast, Telecom ParisTech, HighlandsTechnologies Solutions, www.4ever-project.com, contact: maryline.clare@orange.com. The 4EVER research Project is coordinated by Orange and has received funding from the French State (FUI/Oseo) and French local Authorities (Région Bretagne) associated with the European funds FEDER.” Copyright holders for other test sequences are single organizations identified in Table II.
