


We are currently witnessing something that has become a once-in-a-decade event in the world of video compression: the emergence of a major new family of video compression standards. The mid-1990s saw the introduction of the Moving Picture Experts Group (MPEG)-2 video coding standard (ITU-T Rec. H.262 and ISO/IEC 13818-2 [1]), the first compression standard to be widely adopted in broadcasting and entertainment applications. Advanced Video Coding (AVC) (ITU-T Rec. H.264 and ISO/IEC 14496-10 [2]) appeared in the mid-2000s, offering the same subjective quality at approximately half the bit rate. Now, a new standard, High Efficiency Video Coding (HEVC) (ITU-T Rec. H.265 and ISO/IEC 23008-2), has been developed that promises a further factor of two improvement in compression efficiency for the mid-2010s [3].

The HEVC standard has been jointly developed by the same two standardization organizations whose previous collaboration resulted in both MPEG-2 and AVC: 1) the ISO/IEC MPEG and 2) the ITU-T Video Coding Experts Group (VCEG), through the Joint Collaborative Team on Video Coding (JCT-VC) [4]. HEVC version 1 was ratified in 2013 as H.265 by the ITU-T and as MPEG-H Part 2 by ISO/IEC [5]. This first version supports applications that use conventional (single-layer) encoding of 4:2:0-sampled video with 8- or 10-bit precision. A second edition was completed in July 2014, and an additional extension was completed in February 2015 [6]. These extend the standard to support contribution applications with tools that enable 4:2:2- and 4:4:4-sampled video formats as well as 12- and 16-bit precision [7], and multilayer coding enhancements for efficient scalability [8] and stereo/multiview and depth-enhanced 3D compression [9]. A further amendment is currently being developed to enable more efficient coding of screen-captured graphics and text content and mixed-source content [10].

A few evaluations have previously been reported comparing the compression performance of HEVC with AVC and also demonstrating the suitability of HEVC for various applications, particularly including some evaluations for high-resolution video content [11]. The subjective test results for a number of test sequences and some analysis of the bit rate savings at different quality levels were presented in [12]. In [13], a study on the suitability of HEVC for beyond-HDTV broadcast services was presented, but with no comparison with previous standards. A comparison of HEVC and AVC performance for frame rates up to 30 Hz on a small number of ultra high definition (UHD) video sequences was presented in [14] and [15], and an informal study focused on low-delay (LD) applications and real-time encoding with HD resolutions was presented in [16]. All of these, including some that were carried out at the early stages of HEVC development, have provided very consistent evidence of the substantial coding efficiency improvements enabled by HEVC.

For its formal evaluation of HEVC performance, the JCT-VC performed a subjective evaluation on a wider range of content, with resolutions varying from 480p ($832\times 480$) up to UHD and frame rates of up to 60 Hz. The test sequences used for the verification testing were deliberately chosen to be different from those that had been used during the development of the standard, to avoid any possible bias that the standard could have toward those sequences. A test report was produced for the JCT-VC itself [17], and some additional analysis of HEVC performance for UHD content has been presented in [18] (using cropped $2560\times 1600$ regions) and [19] (a recent brief conference publication about the testing).

As video coding standards generally specify only the format of the data and the associated decoding process without specifying how encoding is to be performed, it is not possible in general to test the compression performance of a standard. Some particular encoding methods must be used as a proxy to represent the capability of the standard instead. The outcome of such a comparison is generally more reliable when similar encoding techniques with similar configurations are applied in the compared encoders rather than simply comparing unknown technologies as black boxes. The verification tests were therefore performed using reference software encoders that had been developed in the standardization work and used very similar encoding algorithms and configurations that were selected to represent important applications. These publicly available reference software codebases are known as the HEVC model (HM) for HEVC [20] and the joint model (JM) for AVC [21].

While this paper reports an extended set of results of HEVC verification tests, it also summarizes the details of tools that can be used in the analysis of the results, pointing to a number of factors an evaluation should consider. Compared with the initial results [17], those presented in this paper are based on the use of more viewers for subjective tests, the objective results are presented and compared with the subjective results, and additional analysis of the coding gains versus the bit rate is provided. Ultimately, the verification test showed that the key objective of HEVC had been achieved—i.e., providing a substantial improvement in compression efficiency relative to its predecessor AVC.

This paper is organized as follows. An overview of video quality evaluation and statistical analysis methodology is presented in Section II, and the test settings used in the subjective evaluations are detailed in Section III. Section IV presents the test results and detailed analysis. Finally, the conclusion is given in Section V.



This section provides an overview of the objective and subjective video quality metrics and the related analysis used in this study.

For the convenience of video coding performance assessment, the most commonly used objective metric is peak signal-to-noise ratio (PSNR). However, it is commonly acknowledged that PSNR has the disadvantages of disregarding the viewing conditions and the characteristics of human visual system perception. In addition, the PSNR for a given video sequence can be computed in different ways, depending on how the picture components (e.g., luma and chroma) or individual picture PSNR values are combined. Nevertheless, for a particular content item and small variations of coding conditions, the changes in PSNR values for an overall video sequence can typically be reliably interpreted.

Other objective video quality metrics, such as the structural similarity index (SSIM) and video quality metric (VQM), have been proposed, but are not used nearly as frequently as PSNR [22]. VQM is not often used, primarily due to its computational complexity, and for both metrics, the interpretation of the values they provide has not yet become common practice in the video coding community. Therefore, in the context of HEVC and AVC compression, this paper provides comparisons and analysis using the PSNR objective measure and subjective quality evaluation results.

Subjective quality evaluation is the process of employing human viewers for grading video quality based on individual perception. Formal methods and guidelines for subjective quality assessments are specified in various ITU recommendations. Among the many of these, the most relevant to this context are ITU-T Rec. P.910 [23], which defines subjective video quality assessment methods for multimedia applications, and ITU-R Rec. BT.500 [24], which defines a methodology for the subjective assessment of the quality of television pictures. These specifications describe a number of test methods with distinct presentation and scoring schemes, along with the recommended viewing conditions.

Explanations of the quality metrics and data analysis methods are provided in the following sections.

A. Objective Quality Evaluation Using PSNR

PSNR is defined as the ratio between the maximum possible power of the signal (the original image) and the power of noise, which in the considered scenario is introduced by lossy compression. For a decoded image component $I_{d}$, the mean square error (MSE) with reference to an original image component $I$ is computed as
\begin{equation} \textrm{MSE}=\frac{\sum\nolimits_{i=0}^{M-1}\sum\nolimits_{j=0}^{N-1}(I(i,j)-I_{d}(i,j))^{2}}{M\cdot N} \tag{1} \end{equation}
where $M$ and $N$ are the width and height of the image component, and the image component is, for example, an array of luma samples or $C_{B}$ or $C_{R}$ chroma samples. The PSNR value, in decibels, is then computed as
\begin{equation} \textrm{PSNR}=10\cdot\log_{10}\frac{(2^{B}-1)^{2}}{\mathrm{MSE}} \tag{2} \end{equation}
where $B$ is the bit depth of image samples. This is typically calculated for each frame separately, and then averaged for the frames of a video sequence. Due to the logarithmic transformation, averaging the per-frame PSNR values corresponds to using the geometric mean of the frame MSEs, and the impact of this should be critically considered when a high fluctuation over frames is present.
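The MSE and PSNR definitions above can be illustrated with a minimal sketch. It assumes that each image component is given as a list of rows of integer samples and that the logarithm is base 10 (PSNR in dB); the function names `mse`, `psnr`, and `sequence_psnr` are illustrative and not taken from any reference software.

```python
import math

def mse(ref, dec):
    """Mean square error between two equally sized components (lists of rows)."""
    total = 0
    count = 0
    for ref_row, dec_row in zip(ref, dec):
        for a, b in zip(ref_row, dec_row):
            total += (a - b) ** 2
            count += 1
    return total / count

def psnr(ref, dec, bit_depth=8):
    """PSNR in dB for one component; infinite when the components are identical."""
    err = mse(ref, dec)
    if err == 0:
        return math.inf
    peak_power = (2 ** bit_depth - 1) ** 2
    return 10 * math.log10(peak_power / err)

def sequence_psnr(ref_frames, dec_frames, bit_depth=8):
    """Arithmetic mean of the per-frame PSNR values of a sequence."""
    values = [psnr(r, d, bit_depth) for r, d in zip(ref_frames, dec_frames)]
    return sum(values) / len(values)
```

For two frames with MSE values of 1 and 100, `sequence_psnr` returns the PSNR that corresponds to an MSE of 10, i.e., the geometric mean of the two frame MSEs rather than their arithmetic mean of 50.5. This is the effect to keep in mind when quality fluctuates strongly over frames.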

For video sequences, which ordinarily consist of three color components, either the luma PSNR value (${\rm PSNR}_{Y}$), calculated using only luma component values, may be reported or a weighted PSNR value (${\rm PSNR}_{W}$) using all three components can be computed using some weighting criteria. An example of a popular weighting for content with 4:2:0 sampling is
\begin{equation} {\mathrm{PSNR}}_{W}=\frac{6\cdot{\mathrm{PSNR}}_{Y}+{\mathrm{PSNR}}_{C_{B}}+{\mathrm{PSNR}}_{C_{R}}}{8}. \tag{3} \end{equation}
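As a small worked example of the 6:1:1 weighting, the following sketch combines three per-component PSNR values. The function name and the input values are hypothetical.

```python
def weighted_psnr(psnr_y, psnr_cb, psnr_cr):
    """Weighted PSNR with weights 6, 1, and 1 for luma, Cb, and Cr."""
    return (6 * psnr_y + psnr_cb + psnr_cr) / 8

# Hypothetical component values in dB.
example = weighted_psnr(40.0, 42.0, 44.0)
```

With these inputs the result is (240 + 42 + 44) / 8 = 40.75 dB, which stays close to the luma value because luma carries three quarters of the total weight.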

The most accurate interpretation of the objective results is obtained by looking at the frame-by-frame results for each component. However, this may not be practical for the final presentation of the results for a large data set and a large number of test points.

B. Subjective Quality Evaluation

For the HEVC verification test that includes a wide range of visual quality points, a degradation category rating (DCR) [23] test method was selected. For this purpose, it was used to evaluate the quality (and not the impairment) with a quality rating scale made of 11 levels [23], ranging from 0 (lowest quality) to 10 (highest quality), which may be interpreted as in Table I. The numerical scale helps avoid misinterpretations associated with the use of category adjectives (e.g., excellent or good), especially in cases where the tests are performed across different countries and including nonnative English speakers.

Table I Subjective Quality Scale With 11 Points Used in an HEVC Verification Test

The basic results of the subjective test are evaluated in terms of the average rating, which is called the mean opinion score (MOS), and the associated confidence interval values that are computed for each coding point, after having verified the reliability of each viewer. For the DCR method, it is recommended to hire more than 15 naïve viewers that have been properly screened for visual acuity and color blindness, to allow for an accurate statistical analysis of the subjective scores [24].

From the raw data, i.e., the individual subjective scores, the reliability of each viewer is calculated. The individual reliability is evaluated using the correlation coefficient $r$ computed between each score $x_{i}$ provided by a viewer and the overall MOS value $y_{i}$ assigned for that test point $i$ as
\begin{equation} r=\frac{\sum\nolimits_{i=1}^{T}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum\nolimits_{i=1}^{T}(x_{i}-\bar{x})^{2}\cdot\sum\nolimits_{i=1}^{T}(y_{i}-\bar{y})^{2}}} \tag{4} \end{equation}
where $T$ is the total number of test points for a viewing session, $y_{i}$ is the average of all scores for the test point $i$, and $\bar{x}$ and $\bar{y}$ are the average values of $x_{i}$ and $y_{i}$ for all test points, respectively. In this HEVC verification test, a correlation index greater than or equal to 0.75 is considered as valid for the acceptance of the viewer’s scorings; otherwise, the viewer is considered as an outlier. Once the results for outliers are discarded, the MOS for each test point is computed using the arithmetic average of scores of the remaining viewers. In addition, the confidence interval is computed for each test point to estimate a range of values covered by a certain probability.
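The screening procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: scores are held in a dictionary mapping a viewer label to that viewer's list of scores (one per test point), a viewer with constant scores is treated as having zero correlation, and the names `pearson` and `screen_and_score` are hypothetical.

```python
import math

def pearson(x, y):
    """Correlation coefficient between two equally long score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    # Assumption: a constant score list carries no information, so return 0.
    return num / den if den else 0.0

def screen_and_score(scores, threshold=0.75):
    """Discard viewers whose correlation with the overall mean is below the
    threshold, then compute the MOS per test point from the remaining viewers."""
    n_points = len(next(iter(scores.values())))
    overall = [sum(v[i] for v in scores.values()) / len(scores)
               for i in range(n_points)]
    kept = {name: v for name, v in scores.items()
            if pearson(v, overall) >= threshold}
    if not kept:
        raise ValueError("no viewer passed the reliability screening")
    mos = [sum(v[i] for v in kept.values()) / len(kept)
           for i in range(n_points)]
    return kept, mos
```

For example, with viewers `a = [2, 4, 6, 8]`, `b = [3, 5, 7, 9]`, and `c = [8, 6, 4, 2]`, viewer `c` is negatively correlated with the overall mean and is discarded, and the MOS values are computed from `a` and `b` only.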

Assuming a Gaussian (normal) distribution for the population of subjective scores with sample size $n$, mean (MOS) $\mu$, and sample-based standard deviation measurement $s$, the confidence interval is defined as ($\mu-c, \mu+c$), where $c$ is computed as
\begin{equation} c=z\cdot\frac{s}{\sqrt{n}}. \tag{5} \end{equation}

In the analysis of the subjective test results, the 95% confidence interval, as shown in Fig. 1, is calculated for each test point. For a 95% confidence interval with a Gaussian distribution, the value of $z$ in (5) is 1.96. For the results presented in Section IV, the confidence interval is plotted alongside the MOS, as shown in Fig. 2, with an interpolated curve from MOS values.
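A minimal sketch of the confidence interval computation for one test point is given below. It assumes that the sample-based standard deviation uses the $n-1$ denominator (as `statistics.stdev` does); the function name and the example scores are hypothetical.

```python
import math
import statistics

def confidence_interval(scores, z=1.96):
    """Return the MOS and the (lower, upper) confidence bounds for one test point."""
    n = len(scores)
    mos = statistics.mean(scores)
    s = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
    c = z * s / math.sqrt(n)       # half-width of the interval
    return mos, (mos - c, mos + c)

# Hypothetical scores from eight viewers for one test point.
mos, (low, high) = confidence_interval([6, 7, 8, 7, 6, 8, 7, 7])
```

Because the half-width scales with $1/\sqrt{n}$, quadrupling the number of accepted viewers halves the interval for the same score spread.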

Fig. 1. Confidence interval for normal distribution and 95% probability.
Fig. 2. Example of a confidence interval related to a test point and an associated MOS versus bit rate curve.

C. Interpretation of Bit Rate Savings From Subjective Quality Comparison

The objective of the verification test is to gauge the bit rate savings of HEVC over AVC when the AVC and HEVC test points have the same subjective quality.

Fig. 3 shows an example of a plot comparing the AVC and HEVC MOS versus bit rate curves. There is no overlap in the MOS confidence intervals of the HEVC test point C and AVC test point B, and hence, there is sufficient statistical significance to conclude that the HEVC test point C has a better quality than the AVC test point B. There is, however, an overlap in the MOS confidence intervals of the HEVC test point A and AVC test point B. This means that it is highly likely that the HEVC test point A and AVC test point B have subjective quality that cannot be distinguished. However, there is still a chance that the subjective qualities of HEVC test point A and AVC test point B are not the same.

Fig. 3. Overlapping confidence interval of test points.

A more rigorous analysis is to perform a two-sample unequal variance (heteroscedastic) Student’s $t$-test using the two-tailed distribution to determine whether the subjective qualities given by the sample mean values of the pair of test points are indeed not the same. The null hypothesis, $H_{0}$, in this case would be that the HEVC test point has the same quality as the AVC test point, and the alternate hypothesis, $H_{a}$, is that the HEVC test point does not have the same quality as the AVC test point.

To compare the means of two populations, the $t$-statistic can be used, which is expressed as
\begin{equation} t=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2}}} \tag{6} \end{equation}
where $\bar{X}_{i}$, $s_{i}^{2}$, and $n_{i}$ denote the sample mean, the sample variance, and the size of the $i$th sample, $i\in\{1, 2\}$. By computing the $t$-statistic in this way and approximating it with a Student’s $t$-distribution whose degrees of freedom (DF) are specified as
\begin{equation} \textrm{DF}=\frac{\big(s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2}\big)^{2}}{\big(s_{1}^{2}/n_{1}\big)^{2}/(n_{1}-1)+\big(s_{2}^{2}/n_{2}\big)^{2}/(n_{2}-1)} \tag{7} \end{equation}
a probability value $p$ can be computed from the $t$-statistic that indicates the extent to which the means of the two populations are considered to be different. The smaller the $p$-value is, the more significant the difference between the distributions of the two populations is.
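The $t$-statistic and the degrees of freedom defined above can be computed directly from two lists of viewer scores, as in the following sketch. The function name and the sample data are hypothetical, and the sample variance is assumed to use the $n-1$ denominator.

```python
import math
import statistics

def welch_t_and_df(sample1, sample2):
    """Unequal-variance t-statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical scores for an HEVC and an AVC test point (eight viewers each).
t_stat, dof = welch_t_and_df([6, 7, 8, 7, 6, 8, 7, 7], [5, 6, 7, 6, 5, 7, 6, 6])
```

When both samples have the same size and the same variance, as in this example, DF reduces to $n_{1}+n_{2}-2$ (here 14); with unequal variances it becomes smaller and is generally not an integer.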

A $p$-value less than 0.05 indicates a very low probability of committing a type-I error (i.e., rejecting the null hypothesis when it is true). In such a case, the null hypothesis can thus be safely rejected, and it can be concluded that there is statistical significance that the HEVC test point does not have the same quality as the AVC test point. A $p$-value greater than or equal to 0.05 means that the null hypothesis cannot be confidently rejected. For the purpose of this paper, the HEVC test point is considered to have the same quality as the corresponding AVC test point in such a case. However, there is still a possibility of committing a type-II error (i.e., failure to reject the null hypothesis when in fact the alternate hypothesis is true). The power or sensitivity of a statistical test is the probability of correctly rejecting the null hypothesis ($H_{0}$) when it is false, i.e., the probability of correctly accepting the alternative hypothesis ($H_{a}$) when it is true [25]. A statistical power test of the data has shown that if in fact the true population mean for the difference in the HEVC MOS and AVC MOS is greater than or equal to 0.8, then the mean probability of committing a type-II error, $\beta$, is less than or equal to 0.14, and hence the mean power of the test (defined as $1-\beta$) is at least 0.86. By convention, a test with a power greater than 0.8 (or $\beta\le 0.2$) is considered statistically powerful [26].

In the design of the verification test, four bit rates per codec, $R_{\mathrm{HEVC}}$ and $R_{\mathrm{AVC}}$, were used. The bit rates were carefully selected for each of 20 test sequences so that each $R_{\mathrm{HEVC}}$ is approximately half of the corresponding $R_{\mathrm{AVC}}$. These gave 80 pairs of test points on which the $t$-test described above was applied. The results of the test determine, for each pair of test points, whether the HEVC test point has a quality better than, the same as, or less than the AVC test point, and give a rough estimate of the bit rate savings of HEVC compared with AVC. The following are the possible outcomes for each pair of test points.

The first case is when the null hypothesis is rejected and there is statistical significance that the HEVC MOS at $R_{\mathrm{HEVC}}$ is greater than the AVC MOS at $R_{\mathrm{AVC}}$. This means that one can reasonably conclude that the HEVC test point is achieving a better quality than the AVC test point at half the bit rate of AVC. Note that by the design of the test, the bit rate saving when $R_{\mathrm{HEVC}}$ is half $R_{\mathrm{AVC}}$ is 50%. Since the bit rate for an HEVC test point could be further reduced to achieve the same quality, the bit rate saving of HEVC compared with AVC is therefore greater than 50% for this case.

The second case is when the null hypothesis cannot be rejected. This means that the HEVC test point has about the same quality as the AVC test point at half the AVC bit rate, since $R_{\mathrm{HEVC}}$ is approximately half of $R_{\mathrm{AVC}}$. Therefore, the bit rate saving of HEVC compared with AVC is approximately 50% for this case.

The third case is when the null hypothesis is rejected and there is statistical significance to conclude that the HEVC MOS at $R_{\mathrm{HEVC}}$ is less than the AVC MOS at $R_{\mathrm{AVC}}$. This means that the HEVC test point is not achieving equal or better quality than the AVC test point at half the bit rate of AVC. More bits would need to be allocated to the HEVC test point before the same quality would be achieved. Therefore, the bit rate saving of HEVC compared with AVC is less than 50% for this case.
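The three outcomes can be expressed as a small decision function. The sketch below is illustrative only: it obtains the two-tailed $p$-value by numerically integrating the Student's $t$ density with Simpson's rule (a statistics library would normally be used instead), takes the $t$-statistic and degrees of freedom as inputs, and uses hypothetical function names.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3            # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)

def classify_saving(mos_hevc, mos_avc, t, df, alpha=0.05):
    """Map one HEVC/AVC pair (HEVC at about half the AVC rate) to a saving class."""
    if two_tailed_p(t, df) >= alpha:
        return "approximately 50%"     # null hypothesis not rejected
    if mos_hevc > mos_avc:
        return "greater than 50%"      # HEVC significantly better
    return "less than 50%"             # HEVC significantly worse
```

For instance, `classify_saving(7.0, 6.0, 2.6458, 14)` falls into the first case, because the two-tailed $p$-value for $t \approx 2.65$ with 14 degrees of freedom is below 0.05 and the HEVC MOS is the higher of the two.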

D. Bjøntegaard Model

The Bjøntegaard model [27], [28] has become a popular tool for evaluating the coding efficiency of a given video codec in comparison with a reference codec over a range of quality points or bit rates. Bjøntegaard delta (BD) metrics are typically computed as a difference in bit rate or a difference in quality based on interpolating curves from the tested data points. In this paper, the focus is on the difference in bit rate, expressed as a percentage of a reference bit rate, as this is easily interpreted as the bit rate saving benefit for equal measured quality.

The BD-rate represents the average bit rate savings for the same video quality (e.g., PSNR or MOS) and is calculated between two rate-distortion curves, such as AVC and HEVC MOS curves in Fig. 3. The bit rate saving difference between the two rate-distortion curves at a given level of quality is
\begin{equation} \Delta R(D)=\frac{R_{B}(D)-R_{A}(D)}{R_{A}(D)} \tag{8} \end{equation}
where $R_{A}(D)$ and $R_{B}(D)$ are the bit rate of the interpolated reference and tested bit rate curves, respectively, at the given level of quality/distortion $D$. $\Delta R(D)$ is typically represented as a percentage of the reference bit rate $R_{A}(D)$ so that a negative value represents compression gain, while a positive value represents compression loss.

The Bjøntegaard model uses a logarithmic scale for the domain of the bit rate interpolation, so by defining $r=\log_{10} R$, the bit rate savings can be expressed as
\begin{equation} \Delta R(D)=10^{r_{B}(D)-r_{A}(D)}-1. \tag{9} \end{equation}
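The step from the definition of $\Delta R(D)$ to its logarithmic form can be written out explicitly. The derivation below assumes base-10 logarithms, as implied by the base 10 in the exponential form:

```latex
\begin{align*}
\Delta R(D) &= \frac{R_{B}(D)-R_{A}(D)}{R_{A}(D)} = \frac{R_{B}(D)}{R_{A}(D)}-1 \\
            &= 10^{\log_{10} R_{B}(D)-\log_{10} R_{A}(D)}-1 = 10^{r_{B}(D)-r_{A}(D)}-1 .
\end{align*}
```

As a numerical check, if the tested codec needs half the reference bit rate at a given quality, then $r_{B}(D)-r_{A}(D)=\log_{10} 0.5\approx -0.301$ and $\Delta R(D)=10^{-0.301}-1\approx -0.5$, i.e., a 50% bit rate saving.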

Taking into account the actual measured rate-distortion points ($R(i)$, $D(i)$), the fitted rate-distortion curves $\hat{r}(D)$ are used in BD-rate computation. Over a range of quality levels, the BD-rate is approximated as
\begin{equation} \Delta R_{\mathrm{Overall}}\approx 10^{\frac{1}{D_{H}-D_{L}}\int_{D_{L}}^{D_{H}}[\hat{r}_{B}(D)-\hat{r}_{A}(D)]\,dD}-1 \tag{10} \end{equation}
where the lower $D_{L}$ and higher $D_{H}$ integration bounds are computed from the range of the interpolated distortion values $D_{A}$ and $D_{B}$ for the reference and tested data sets, respectively, as
\begin{align} D_{L}&=\max\bigl\{\min(D_{A}(0),\ldots,D_{A}(N_{A}-1)),\ \min(D_{B}(0),\ldots,D_{B}(N_{B}-1))\bigr\}\notag\\ D_{H}&=\min\bigl\{\max(D_{A}(0),\ldots,D_{A}(N_{A}-1)),\ \max(D_{B}(0),\ldots,D_{B}(N_{B}-1))\bigr\} \tag{11} \end{align}
where $D(0)$ is the lowest and $D(N-1)$ is the highest measured quality point, for either the tested or reference sets, as shown in Fig. 4.

Fig. 4. Illustration of BD-rate computation.

In the HEVC verification test, the number of test points for both the reference and evaluated sets is four (i.e., $N_{A}=N_{B}=4$) and the curve fitting uses cubic spline interpolation.
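The overall BD-rate computation can be sketched as follows. Note the deliberate simplification: instead of the cubic spline interpolation used in the verification test, this sketch fits the single cubic polynomial that passes through the four measured points of each curve (Lagrange form), which keeps the example short and dependency-free. It therefore does not reproduce the reported numbers exactly. The function names are hypothetical, and the quality values of each curve are assumed to be distinct.

```python
import math

def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def bd_rate(rates_a, quality_a, rates_b, quality_b, steps=1000):
    """BD-rate of tested curve B relative to reference curve A, as a fraction
    (negative values mean that B needs fewer bits for the same quality)."""
    log_a = [math.log10(r) for r in rates_a]
    log_b = [math.log10(r) for r in rates_b]
    # Integration bounds: the overlapping part of the two quality ranges.
    d_low = max(min(quality_a), min(quality_b))
    d_high = min(max(quality_a), max(quality_b))
    if d_high <= d_low:
        raise ValueError("quality ranges do not overlap")
    # Simpson's rule over [d_low, d_high]; steps must be even.
    h = (d_high - d_low) / steps
    total = 0.0
    for k in range(steps + 1):
        d = d_low + k * h
        diff = (lagrange_eval(quality_b, log_b, d)
                - lagrange_eval(quality_a, log_a, d))
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * diff
    mean_log_diff = total * h / 3 / (d_high - d_low)
    return 10 ** mean_log_diff - 1
```

If the tested curve reaches every quality level of the reference at exactly half the bit rate, the function returns -0.5, i.e., a BD-rate of -50%. The explicit overlap check makes visible the situation in which the two curves share little or no common quality range.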

As can be observed from Fig. 4, in some cases, the overall BD-rate measure may be computed over a relatively small interval of overlapping distortion regions. In such a case, the BD-rate metric does not necessarily represent the average coding efficiency for all test points involved in the actual test. Therefore, it is important to design the test in such a way that the distortion overlap between the two tested codecs covers the range of qualities of interest for the specific application.

As the metrics derived from the Bjøntegaard model can be applied to different evaluation criteria, it is important to understand the range on which they are computed. For example, BD-rate can be computed for MOS and PSNR, for the same test material. However, as demonstrated in Fig. 5, the actual bit rates on which the two are computed may not be the same. Therefore, in addition to providing BD-rates for both PSNR and MOS in our evaluation reported in Section IV, we also compute BD-rate on the bit rate interval common for both criteria (MOS and PSNR).

Fig. 5. BD-rate for PSNR and MOS ratings. Actual bit rate ranges and test points taken into account do not necessarily overlap.

Newer studies [29] have shown how the Bjøntegaard model can further be extended to compute BD-rate intervals considering the confidence intervals of the MOS ratings for each test point, as shown in Fig. 6. The dotted curves show the boundaries of confidence intervals for each curve, and two new BD-rate values are computed comparing $D_{B,\mathrm{min}}$ with $D_{A,\mathrm{max}}$ (labeled BD-rate$_{\mathrm{min}}$) and $D_{B,\mathrm{max}}$ with $D_{A,\mathrm{min}}$ (labeled BD-rate$_{\mathrm{max}}$), where [$D_{\mathrm{min}}$, $D_{\mathrm{max}}$] represents 95% confidence intervals of MOS. The new BD-rates thus provide lower and upper limits for the BD-rate. However, it is noted that these three values of BD-rate are based on different reference (AVC) bit rate ranges as shown in Fig. 6. Although in the results reported in Section IV, these intervals are reported, they have to be carefully interpreted, as the limits of the intervals are defined for significantly different bit rate ranges. However, for relatively small differences between rate-distortion curves, it can be useful to evaluate BD-rate confidence intervals.

Fig. 6. BD-rate with MOS confidence intervals.
Fig. 7. BTCs for subjective evaluation.


This section provides information regarding the test material used, test settings, and logistics.

A. Selection of Test Material and Test Points

The HEVC verification tests were carried out for four categories of spatial resolutions: UHD ($3840\times 2160$, except for the Traffic sequence, which is $4096\times 2048$), 1080p ($1920\times 1080$), 720p ($1280\times 720$), and 480p ($832\times 480$). The details of the test sequences are provided in Tables II and III. Screenshots are given in Figs. 8–11. The sequences are selected from different sources and have different spatiotemporal characteristics, which leads to different behavior of compression algorithms. This is the first formal test of video compression standards where a wide range of resolutions including content with UHD resolution with high frame rate has been evaluated. The UHD format, as specified in ITU-R Rec. BT.2020 [30], has a number of extended features compared with ordinary HD video. In addition to containing more pixels per frame, it specifies support for higher frame rates, wider color gamut, and higher bit depths [31]. However, for compatibility with available playout and display systems, all tested video sequences have 8 bits per component per sample and are in the $\text{Y}'\text{C}_{B}\text{C}_{R}$ color space defined by ITU-R Rec. BT.709 [32].

Fig. 8. UHD test sequences.
Fig. 9. 1080p test sequences.
Fig. 10. 720p test sequences.
Fig. 11. 480p test sequences.
Table II Test Sequences
Table III Parameters of Used Test Sequences

The test sequences were compressed using HEVC (HM-12.1, Main profile [20]) and AVC (JM-18.5, High profile [21]) encoding. Either a random access (RA) or low delay (LD) configuration (Cfg) was used (similarly configured for both HEVC and AVC, with a refresh period of approximately 1 s and hierarchical referencing for RA and with no periodic refresh and no reordered referencing for LD). For each test sequence, four test points using different fixed quantization parameter (QP) settings were selected so that the tested HEVC bit rates are approximately half of the AVC bit rates. Also, the ranges of the QP values were selected so that the subjective quality of the encoded sequences spans a large range of MOS values. This bit rate ratio was chosen because it is already well established that the quality of HEVC is much better than the quality of AVC at the same bit rate. These test conditions can identify whether a bit rate saving of 50% or more is achieved for the majority of the tested video sequences. The full details of the QP values and bit rates can be found in [17].

B. Subjective Test Structure

Subjective tests for different categories of video sequences were conducted in separate sessions. Each subjective test session of the DCR method consisted of a series of basic test cells (BTCs). Each BTC was made of two consecutive presentations of the video clip under test, as shown in Fig. 7. First, the original version of the video sequence was displayed, followed by the coded version of the video sequence, with a gap of 1 s. Then, a message asking the viewers to vote was displayed for 5 s.

Each test session was designed with 45 BTCs in total: the first three BTCs represented the stabilization phase and were selected to show to the viewers the whole range of quality they would see during the test. Two BTCs showing original versus original were also used as a sanity check for the range of ratings made by the viewers. The scores coming from the original BTCs and from the stabilization phase were excluded from further analysis. All the BTCs were randomly ordered to avoid the same content being seen repeatedly and to spread the quality as much as possible in a uniform way across the whole test.

C. Subjective Test Logistics

The subjective tests were performed at two sites, under a controlled laboratory environment, adhering to the recommendations in ITU-R Rec. BT.500 and ITU-T Rec. P.910. The tests for UHD and 720p resolutions were done at the BBC R&D labs in London and for the other resolutions at the University of the West of Scotland (UWS). The equipment and session details for each test are given in Tables IV and V, respectively. Additional analysis of the influence of viewing distances (front and back rows in the seating arrangement for viewers) on the subjective rating has been provided in [19].

Table IV Test Logistics (BBC)
Table V Test Logistics (UWS)


This section summarizes the subjective test results. It also provides an analysis with a focus on a comparison with the objective test results, which are easier to obtain in practice.

A. Objective and Subjective Test Results

The subjective evaluation results for each category of test sequences are shown in Figs. 12–15 in the form of MOS versus bit rate plots. The objective quality metric (PSNR) values for the same test points are also plotted on the same graph using a second vertical axis. Note that the scales for the two vertical axes in these plots are independently selected, and thus no direct connection between the subjective (i.e., MOS) and objective (i.e., PSNR) plots is demonstrated. The legend in all plots shows that circle and triangle markers represent the results for actual test points, while the curves between them were calculated using cubic spline interpolation with the bit rate in a logarithmic scale, as in typical BD-rate computation. Only the parts of the curves related to BD-rate calculation, either for MOS or PSNR BD-rate computation, are displayed. The solid and dotted lines represent MOS and PSNR curves, respectively. Confidence intervals are displayed for each MOS test point. The PSNR results presented are for the luma color component only. Because of space limitation, the chroma results are not presented. However, the authors note that the weighted PSNR results as in (3) are highly correlated to luma-only PSNR results. Typically, the weighted PSNR has a somewhat higher value than luma PSNR, but the values of BD-rate for weighted PSNR are typically close to the values of BD-rate for luma PSNR.

Fig. 12. Subjective and objective evaluation results for UHD content; subjective results, with associated 95% confidence intervals.
Fig. 13. Subjective and objective evaluation results for 1080p content; subjective results, with associated 95% confidence intervals.
Fig. 14. Subjective and objective evaluation results for 720p content; subjective results, with associated 95% confidence intervals.
Fig. 15. Subjective and objective evaluation results for 480p content; subjective results, with associated 95% confidence intervals.

By considering the positions of MOS points for AVC and HEVC, it can be observed that HEVC achieves the same subjective quality as AVC while typically requiring substantially lower bit rates. Table VI shows the results of Student’s $t$-test on the 80 pairs of HEVC and AVC test points. These pairs of test points were classified into the three categories of bit rate savings as described in Section II-C. The first four rows show the distribution of bit rate savings achieved for each resolution. The last row summarizes the distribution of bit rate savings for all resolutions. This shows that for 74 out of the 80 pairs of test points (or 92.5%), HEVC has a bit rate saving compared with AVC that is greater than or equal to 50%. The amount of bit rate saving is similar for both the RA and LD test cases. Only six pairs of test points (or 7.5%) show a bit rate saving of less than 50%. Four of the six pairs of test points were contributed by one video sequence (SVT04a), where the HEVC encoder did not perform as well as in the other cases.
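As an illustration of the kind of statistic behind such a classification, the following minimal sketch computes a two-sample $t$ statistic for the scores of one pair of test points. It assumes an unpaired Welch test; the actual test configuration is the one defined in Section II-C and is not reproduced here. The function name `welch_t` and the score lists are hypothetical.

```python
import math
import statistics


def welch_t(scores_a, scores_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Assumption: an unpaired Welch t-test, used here only for illustration.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(scores_a) / n_a
    se2_b = statistics.variance(scores_b) / n_b
    t = (statistics.mean(scores_a) - statistics.mean(scores_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical viewer scores for one HEVC/AVC pair of test points.
hevc_scores = [8, 7, 8, 9, 7, 8, 8, 7]
avc_scores = [7, 7, 8, 8, 6, 7, 8, 7]
t_stat, dof = welch_t(hevc_scores, avc_scores)
```

The resulting statistic would then be compared against the critical value of the $t$ distribution for the chosen significance level, a step left out of this sketch.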

Table VI Coding Gain Estimates Using $t$-Test on MOS Score

The data convincingly show that HEVC achieves a bit rate saving of at least 50%. However, the granularity of this estimate was limited by the test design, in which the HEVC bit rate in each pair of test points was selected as approximately half the AVC bit rate. To quantify the bit rate savings more precisely, a different method is used: the MOS BD-rate values are computed from the subjective ratings for each test sequence, as shown in Table VII. The upper and lower limits of the BD-rates corresponding to the 95% confidence interval of MOS, as discussed in Section II-D, are also indicated.

Table VII BD-Rates for Subjective and Objective Evaluation Results

In addition to the BD-rates for the available MOS range of each video sequence, an additional measurement was also calculated for the range of MOS scores greater than or equal to seven. This range (MOS $\ge 7$) is typically expected for a number of services, such as broadcasting, where targeted quality levels are good to excellent. Negative BD-rate values in Table VII indicate the bit rate savings measured for HEVC relative to the bit rate used for AVC.

The averages in Table VII are computed only for the BD-rate values displayed in the table. Note that the results for MOS BD-rates for the HomelessSleeping and Cubicle sequences are excluded. In order for the BD-rate interpolation to work correctly, the MOS values should exhibit a smooth curve and the interval of the averaging (shown as curves in Figs. 12–15) should be interpolated from at least three points that are monotonically increasing with bit rate. This condition was not satisfied for the omitted test sequences (although a very substantial gain is evident for HEVC in both omitted cases).
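The monotonicity condition and the averaging step can be sketched as follows. This is a minimal illustration, not the computation used for Table VII: piecewise-linear interpolation is swapped in for the cubic spline to keep the code short, and the function names and data points are hypothetical.

```python
import math


def _interp(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the interpolation range")


def _mean_over(xs, ys, lo, hi):
    """Exact mean of the piecewise-linear curve over [lo, hi]."""
    knots = [lo] + [x for x in xs if lo < x < hi] + [hi]
    vals = [_interp(xs, ys, k) for k in knots]
    area = sum((k1 - k0) * (v0 + v1) / 2
               for k0, k1, v0, v1 in zip(knots, knots[1:], vals, vals[1:]))
    return area / (hi - lo)


def bd_rate(ref_points, test_points):
    """BD-rate in percent; negative values mean the test codec saves bits.

    Each argument is a list of (bit_rate, quality) pairs sorted by bit rate,
    where quality is MOS or PSNR.
    """
    curves = []
    for points in (ref_points, test_points):
        rates = [r for r, _ in points]
        quality = [q for _, q in points]
        increasing = all(b > a for a, b in zip(quality, quality[1:]))
        if len(points) < 3 or not increasing:
            # The condition that excluded sequences failed to satisfy.
            raise ValueError("need at least three points with increasing quality")
        curves.append((quality, [math.log10(r) for r in rates]))
    (q_ref, lr_ref), (q_test, lr_test) = curves
    # Average only over the quality range covered by both curves.
    lo, hi = max(q_ref[0], q_test[0]), min(q_ref[-1], q_test[-1])
    if hi <= lo:
        raise ValueError("the two curves share no quality range")
    diff = _mean_over(q_test, lr_test, lo, hi) - _mean_over(q_ref, lr_ref, lo, hi)
    return (10 ** diff - 1) * 100


# Hypothetical (bit rate in kbit/s, MOS) points for one sequence.
avc = [(1000, 4.0), (2000, 6.0), (4000, 8.0)]
hevc = [(500, 4.0), (1000, 6.0), (2000, 8.0)]
example_bd_rate = bd_rate(avc, hevc)
```

With these hypothetical points, HEVC reaches every quality level at half the AVC bit rate, so the result is a BD-rate of about −50%. A curve whose MOS values do not increase with bit rate raises an error instead of producing a misleading average.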

It is noted that for most video sequences, the MOS-based BD-rate benefit is substantially higher than what is measured by the PSNR-based BD-rate. MOS BD-rates indicate that HEVC could provide the same visual quality as AVC for most tested content categories at well below half the bit rate of the latter, surpassing the performance expected at the launch of the HEVC standard development process.

The fact that BD-rates for MOS, confidence intervals, and PSNR in Table VII have not been computed over the same bit rate range for a given test sequence, which has been discussed in Section II, is further addressed in the following section.

B. Compression Efficiency Results for Specific Bit Rates

Although different BD-rate measures presented in Table VII are useful indicators of the compression performance of HEVC, comparing them with each other raises validity concerns, as they are computed for different reference bit rate ranges, as discussed in Section II. To partly address this problem, additional analysis of the results has been conducted to evaluate compression efficiency for bit rates that are common to both PSNR and MOS BD-rate computation.

The bit rate savings achieved by HEVC with reference to the associated AVC bit rate, computed for the continuous bit rate range using cubic spline interpolation as discussed in Section II-D, are shown in Fig. 16, considering both subjective and objective quality assessments. In the majority of cases, the bit rate savings for equal MOS are higher than the bit rate savings for equal PSNR. The savings vary across different bit rates and different video source content. Considering only the parts of the curves from Fig. 16 that are defined both for MOS and PSNR for a given AVC bit rate, we have measured the average BD-rates and the average difference between the PSNR and MOS curves, which are presented in Table VIII. The results in Table VIII do not take into account the test sequences that were discarded because of the MOS BD-rate computation problems (HomelessSleeping and Cubicle) described in Section IV-A.
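The saving at a single reference bit rate can be sketched as follows: find the quality that AVC reaches at that bit rate, then find the HEVC bit rate that reaches the same quality. This is a minimal illustration that assumes piecewise-linear interpolation in the logarithmic bit rate domain instead of the cubic spline of Section II-D. The function names and data points are hypothetical.

```python
import math


def interp(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the measured range")
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def saving_at(avc_points, hevc_points, avc_rate):
    """Percent bit rate saving of HEVC at equal quality for one AVC bit rate.

    Each points list holds (bit_rate, quality) pairs sorted by bit rate,
    with quality strictly increasing.
    """
    avc_log = [math.log10(r) for r, _ in avc_points]
    avc_q = [q for _, q in avc_points]
    hevc_log = [math.log10(r) for r, _ in hevc_points]
    hevc_q = [q for _, q in hevc_points]
    # Quality delivered by AVC at the reference bit rate.
    quality = interp(avc_log, avc_q, math.log10(avc_rate))
    # HEVC bit rate delivering that same quality; this raises ValueError
    # where the HEVC curve is not defined for that quality.
    hevc_rate = 10 ** interp(hevc_q, hevc_log, quality)
    return (1 - hevc_rate / avc_rate) * 100


# Hypothetical (bit rate in kbit/s, quality) points for one sequence.
avc = [(1000, 4.0), (2000, 6.0), (4000, 8.0)]
hevc = [(400, 4.0), (800, 6.0), (1600, 8.0)]
saving = saving_at(avc, hevc, 2000)
```

Evaluating `saving_at` over a grid of AVC bit rates, once with MOS and once with PSNR as the quality measure, gives two curves of the kind compared in this section. Only reference bit rates for which both calls succeed belong to the common range.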

Fig. 16. Bit rate savings provided by HEVC at various reference bit rates. (a) UHD content. (b) 1080p content. (c) 720p content. (d) 480p content.
Table VIII Average Differences Between MOS BD-Rate and PSNR BD-Rate for the Same Reference Bit Rates

For the same bit rate range, the MOS-based BD-rate saving is 59% and the PSNR-based BD-rate saving is 44%, averaged over all test sequences. Depending on the content category, the average differences between the two measurement methods are between 11% and 18%. In other words, the bit rate savings measured for equal PSNR are lower than the bit rate savings measured for equal MOS between AVC and HEVC by roughly 15 percentage points. This is consistent with the difference between the average values in Table VII, except that in this case the PSNR and MOS rate savings are measured over the same bit rate range.



This paper presents the results of a formal subjective verification test that was carried out by the JCT-VC for the new HEVC video coding standard. In this paper, a more rigorous analysis of the subjective test results using Student’s $t$-test shows that the HEVC test points at half or less than half the bit rate of the AVC reference were found to achieve a comparable quality in 92.5% of the test cases. In addition, it provides a summary of evaluation metrics and an analysis of the results that were obtained, with a focus on comparison between the subjective and objective evaluation results and the performance across different bit rate ranges.

A more precise quantitative estimate of the bit rate savings was obtained by applying the MOS-based BD-rate measurement on the results of the subjective test. It was found for the investigated test cases that the HEVC Main profile can achieve the same subjective quality as the AVC High profile while requiring on average approximately 59% fewer bits. The PSNR-based BD-rate average over the same sequences was calculated to be 44%. This confirms that the subjective quality improvements of HEVC are typically greater than the objective quality improvements measured by the method that was primarily used during the standardization process of HEVC.

It can therefore be concluded that the HEVC standard is able to deliver the same subjective quality as AVC, while on average (and in the vast majority of typical sequences) requiring only half or even less than half of the bit rate used by AVC. This means that the initial objective of the HEVC development (substantial improvement in compression compared with the previous state of the art) has been successfully achieved.


The authors would like to thank the contributors of the video test sequence material for permitting its use for the tests and this publication, and their collaborators in the JCT-VC for their assistance in developing the test plan and analyzing its results.


This paper was recommended by Associate Editor T. Wiegand.

1 The test sequences Manege and SedofCropped are “Copyright © 2012-2013, all rights reserved to the 4EVER participants and their licensors. 4EVER consortium: Orange, Technicolor, ATEME, France Télévisions, INSA-IETR, Globecast, TeamCast, Telecom ParisTech, HighlandsTechnologies Solutions,, contact: The 4EVER research Project is coordinated by Orange and has received funding from the French State (FUI/Oseo) and French local Authorities (Région Bretagne) associated with the European funds FEDER.” Copyright holders for other test sequences are single organizations identified in Table II.




Thiow Keng Tan

Thiow Keng Tan (S’89–M’94–SM’03) received the B.Sc. degree, the bachelor’s degree in electrical and electronics engineering, and the Ph.D. degree in electrical engineering from Monash University, Melbourne, VIC, Australia, in 1987, 1989, and 1994, respectively.

He is currently a Consultant with NTT DOCOMO, Inc., Tokyo, Japan. He is an active participant at the video subgroup of the ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG), the ITU-T SG16/Q6 Video Coding Experts Group, and the ITU-T/ISO/IEC Joint Video Team and the ITU-T/ISO/IEC Joint Collaborative Team for Video Coding standardization activities. He holds over 60 granted U.S. patents. His research interests include image and video coding, analysis, and processing.

Dr. Tan received the Douglas Lampard Electrical Engineering Medal for his Ph.D. thesis and the first prize IEEE Region 10 Student Paper Award for his final year undergraduate project. He also received three ISO certificates for outstanding contributions to the development of the MPEG-4 standard. He has served on the Editorial Board of IEEE TRANSACTIONS ON IMAGE PROCESSING.

Rajitha Weerakkody

Rajitha Weerakkody received the B.Sc. degree in electronic and telecommunication engineering from University of Moratuwa, Moratuwa, Sri Lanka, in 2000 and the Ph.D. degree in electronics engineering from University of Surrey, Guildford, U.K., in 2008.

He was a Network Engineer and Manager with the wireless telecommunication industry in Sri Lanka from 2000 to 2005. He joined British Broadcasting Corporation, London, U.K., in 2008, for research on video coding for audio-visual archiving applications, and managed the technical design and build of the U.K. distribution and public viewing network for the ultra-high definition-super hi-vision showcase at the London 2012 Olympic Games. He is currently a Research Technologist with British Broadcasting Corporation, where he is involved in video compression technology. He has authored or co-authored over 50 research publications.

Marta Mrak

Marta Mrak (SM’13) received the Dipl.-Ing. and M.Sc. degrees in electronics engineering from University of Zagreb, Zagreb, Croatia, and the Ph.D. degree from Queen Mary University of London, London, U.K.

She was a Post-Doctoral Researcher with University of Surrey, Guildford, U.K., and the Queen Mary University of London. She joined the Research and Development Department, British Broadcasting Corporation, London, in 2010, to work on video compression research and the H.265/High Efficiency Video Coding standardization. She has co-authored over 100 papers, book chapters, and standardization contributions, and also co-edited a book entitled High-Quality Visual Experience (Springer, 2010).

Dr. Mrak is a member of the IEEE Multimedia Signal Processing Technical Committee, an Area Editor of Signal Processing: Image Communication (Elsevier), and a Guest Editor of several special issues in relevant journals. She received the German DAAD Scholarship for video compression research from the Heinrich Hertz Institute, Berlin, Germany, in 2002. She has been involved in several projects funded by European and U.K. research councils in roles ranging from Researcher to Scientific Coordinator.

Naeem Ramzan

Naeem Ramzan (S’04–M’08–SM’13) received the M.Sc. degree in telecommunication from University of Brest, Brest, France, in 2004 and the Ph.D. degree in electronics engineering from Queen Mary University of London, London, U.K., in 2008.

He was a Senior Fellow Researcher and Lecturer with Queen Mary University of London. He joined University of the West of Scotland, Paisley, U.K., where he is currently a Reader in Visual Communication. He has authored or co-authored over 70 research publications, including journals, book chapters, and standardization contributions. He co-edited a book entitled Social Media Retrieval (Springer, 2013).

Dr. Ramzan is a fellow of the Higher Education Academy and a member of Institution of Engineering and Technology. He served as a Guest Editor for a number of special issues in technical journals. He has organized and co-chaired three ACM Multimedia Workshops, and served as the Session Chair/Co-Chair for a number of conferences. He is the Co-Chair of the Ultra HD Group of the Video Quality Experts Group (VQEG) and the Co-Editor-in-Chief of VQEG E-Letter. He has participated in several projects funded by European and U.K. research councils.

Vittorio Baroncini

Vittorio Baroncini received the bachelor’s degree in physics from Rome University “La Sapienza.”

He was a Hardware Designer and Group Leader with ITT R&D Laboratory, beginning in 1976, and he was Counsellor with the Italian Telecommunication Ministry in 1984. In 1986, he joined Fondazione Ugo Bordoni (FUB), Rome, Italy, in the Television Group, as a Hardware Designer and Visual Quality Expert. In 1992, he began his activity in the CCIR (ITU-R) and ISO/IEC MPEG standardization organizations. He designed a metric to measure the Quality of Service of TV services with a reduced reference approach. He is a Co-Founder of the Video Quality Experts Group (1997) that led to the standardization of the first world-wide International Standard for an objective quality metric for digital TV (ITU-R Rec. BT.1683). Among his many activities within MPEG, some of the more notable are the tests for Calls for Proposals for Digital Cinema, High Efficiency Video Coding (HEVC), 3D Video (3DV), and HEVC Screen Content Coding extensions, and the recent Call for Evidence of new technologies for high-dynamic range and wide color gamut video. He is the Technical Leader responsible for video quality assessment and the HDTV, 3DV, and Digital Cinema projects with FUB. He has authored many papers for conferences and scientific journals, and two books on MPEG technologies.

Jens-Rainer Ohm

Jens-Rainer Ohm (M’92) received the Dipl.-Ing., Dr.Ing., and Habilitation degrees from Technical University of Berlin, Berlin, Germany, in 1985, 1990, and 1997, respectively.

He has been participating in the Moving Picture Experts Group (MPEG) since 1998. He has been a Full Professor and the Chair of Institute of Communication Engineering with RWTH Aachen University, Aachen, Germany, since 2000. He has authored numerous papers and German and English textbooks in multimedia signal processing, analysis, and coding, and basics of communication engineering and signal transmission. His research interests include motion-compensated, stereoscopic and 3D image processing, multimedia signal coding, transmission and content description, audio signal analysis, and fundamental topics of signal processing and digital communication systems.

Dr. Ohm has been the Chair and Co-Chair of various standardization activities in video coding, namely, the MPEG Video Subgroup since 2002, the Joint Video Team of MPEG and ITU-T SG16 Video Coding Experts Group from 2005 to 2009, and currently the Joint Collaborative Team on Video Coding (JCT-VC) and the Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V).

Gary J. Sullivan

Gary J. Sullivan (S’83–M’91–SM’01–F’06) received the B.S. and M.Eng. degrees in electrical engineering from University of Louisville, Louisville, KY, USA, in 1982 and 1983, respectively, and the Ph.D. and Engineering degrees in electrical engineering from University of California at Los Angeles, Los Angeles, CA, USA, in 1991.

He has been a longstanding Chairman or Co-Chairman of various video and image coding standardization activities in ITU-T Video Coding Experts Group, ISO/IEC Moving Picture Experts Group, ISO/IEC JPEG, and their joint collaborative teams, since 1996. He was the originator and a Lead Designer of the DirectX Video Acceleration video decoding feature of the Microsoft Windows operating system. He is currently a Video/Image Technology Architect with the Corporate Standardization Group, Microsoft Corporation, Redmond, WA, USA. His research interests include image and video compression, rate-distortion optimization, motion estimation and compensation, scalar and vector quantization, and loss-resilient video coding.

Dr. Sullivan is a fellow of the International Society for Optics and Photonics. He has received the IEEE Masaru Ibuka Consumer Electronics Award, the IEEE Consumer Electronics Engineering Excellence Award, the INCITS Technical Excellence Award, the IMTC Leadership Award, and the University of Louisville J. B. Speed Professional Award in Engineering. The team efforts he has led have been recognized by the ATAS Primetime Emmy Engineering Award and the NATAS Technology and Engineering Emmy Award. He has been a Guest Editor of several previous special issues and special sections of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.
