SECTION I

ISO/IEC MPEG and ITU-T VCEG formed the Joint Collaborative Team on Video Coding (JCT-VC) to establish a new standardization activity on video coding, referred to as High Efficiency Video Coding (HEVC). A call for proposals was issued in January 2010 and the responses were reviewed at the first JCT-VC meeting in April 2010. The best performing proposals formed the basis of the initial HEVC test model under consideration [1]. The first HEVC test model (HM1.0) was made available in October 2010. Since then, it has undergone several refinements. This paper describes the transform coefficient coding in the draft international standard (DIS) of the HEVC specification [2].

HEVC is a successor to the H.264/AVC video coding standard [3]. One of its primary objectives is to provide approximately two times the compression efficiency of its predecessor without any detectable loss in visual quality. HEVC adheres to the hybrid video coding structure; it uses spatial and temporal prediction, transform of the prediction residual, and entropy coding of the transform and prediction information.

A fundamental difference between HEVC and previous video coding standards is that HEVC uses a quadtree structure. The quadtree structure is a flexible mechanism for subdividing a picture into different block sizes for prediction and residual coding. In HEVC, a block is defined as an array of samples. A unit encapsulates up to three blocks (e.g., one luma component block and its two corresponding chroma component blocks), and the associated syntactical information required to code these blocks. The basic processing unit is a coding tree unit (CTU) that is a generalization of the H.264/AVC concept of a macroblock. A CTU encapsulates up to three coding tree blocks (CTBs) and the related syntax. Each CTU has an associated quadtree structure that specifies how the CTU is subdivided. This subdivision yields coding units (CUs) that correspond to the leaves of the quadtree structure. A CU uses either intra or inter prediction and is subdivided into prediction units (PUs). For each PU, specific prediction parameters (i.e., intra prediction mode or motion data) are signaled. A nested quadtree, referred to as the residual quadtree (RQT), partitions a CU residual into transform units (TUs). The CUs, PUs, and TUs encapsulate coding blocks (CBs), prediction blocks (PBs), and transform blocks (TBs), respectively, as well as the associated syntax. The quadtree structure gives an encoder greater freedom to select the block sizes and coding parameters in accordance with the statistical properties of the video signal being coded. The reader is referred to [4] for additional details on the use of the quadtree structure in video coding.

After the quadtree structure appropriately determines the TBs, a coded block flag signals whether a TB has any significant (i.e., nonzero) coefficient. If a TB contains significant coefficients, the residual coding process signals the position and value of each nonzero coefficient in the TB. This paper describes the methods used to code this information, with a focus on the transform coefficient coding methods for square TBs. A TB can range in size from 4×4 to 32×32 for luma and from 4×4 to 16×16 for chroma. Non-square TBs are not explicitly discussed because they are coded in the same way as the 16×16 and 32×32 TBs and they are not a part of the HEVC main profile.

Transform coefficient coding in HEVC is comprised of five components: scanning (Section III), last significant coefficient coding (Section IV), significance map coding (Section V), coefficient level coding (Section VI), and sign data coding (Section VII). Section II details the design principles that guided the development of the transform coefficient coding algorithms, Section VIII provides experimental results, and Section IX concludes this paper.

SECTION II

HEVC has a single entropy coding mode based on the context adaptive binary arithmetic coding (CABAC) engine that was also used in H.264/AVC. Unlike H.264/AVC, the context adaptive variable length coding (CAVLC) mode is not supported in HEVC. A thorough treatment of the inner workings of the CABAC engine in H.264/AVC can be found in [5].

The CABAC engine has two operating modes, regular mode and bypass mode. Regular mode has a context modeling stage, in which a probability context model is selected. The engine uses the selected context model to code the binary symbol (bin). After each bin is coded, the context model is updated. Bypass mode assumes an equiprobable model. This mode is simpler and allows a coding speedup and easier parallelization because it does not require context derivation and adaptation.

CABAC is highly sequential and has strong data dependences, which makes it difficult to exploit parallelism and pipelining in video codec implementations. The serial nature of CABAC stems from its feedback loops at the arithmetic coding and context modeling stages, particularly on the decoder side. Context selection for regular bins has two dependences; the context selection and the value of the context (CABAC state) may depend on previous bins. Due to the large percentage of bins devoted to residual coding, it is especially important that transform coefficient coding design limits these dependencies to enable high-throughput implementations.

HEVC introduces several new features and tools for the transform coefficient coding to help improve upon H.264/AVC, such as larger TB sizes, mode dependent coefficient scanning, last significant coefficient coding, multilevel significance maps, improved significance flag context modeling, and sign data hiding. HEVC followed a development process in which it was iteratively refined to improve coding efficiency and suitability for hardware and software implementation. Core experiments (CEs) were an important part of HEVC's iterative development process. A CE is a set of focused tests defined to gain a better understanding of the proposed techniques in JCT-VC. In a CE, new techniques had the opportunity to mature and be refined before their adoption. Over the course of the HEVC standardization process, several CEs on transform coefficient coding were conducted (e.g., [6]).

The following practical issues played key roles in the general design considerations.

- *Hardware Area*: How many logic gates are required to implement the codec (particularly the decoder) in hardware? Is the size of the hardware reasonable, so as to limit the power consumption, operating temperature, and required footprint in a real device?
- *SIMD Implementation*: Can single instruction multiple data (SIMD) instruction sets be used to exploit data-level parallelism?
- *Throughput*: There is a significant complexity difference between the processing of the regular and bypass coded bins. Multiple bypass bins grouped together can be coded in a single cycle. Regular coded bins generally need to be processed sequentially, as the output of one bin may affect the following bin. In contrast, in CAVLC methods, all the data associated with a coefficient can be computed in parallel. CAVLC is traditionally used in applications that require high throughput because it is easier to parallelize its operation, but it leads to a penalty in compression efficiency compared to CABAC. How many regular mode bins are required to code each transform coefficient on average and in the worst-case? How many transform coefficients can be (de)coded per unit time?
- *Parallelism, Pipelining, and Speculative Computation*: The length of the data dependency path determines the level of parallelism and pipelining. It is possible to use speculative methods to predict the values of multiple bins in parallel. Speculative computation is a look-ahead technique that pre-calculates a tree of several future bins. The proper branch of the tree is selected when the actual bins become available. On average, these methods allow coding of more than one regular bin per cycle. However, in some scenarios, none of the pre-calculated values match the actual bins, leading to problematic corner cases. Hence, speculative computation may not guarantee improvement in throughput in the worst-case. Furthermore, additional hardware resources are required to compute multiple cases in parallel, leading to a larger and more complex design. Can the design facilitate these performance optimizations to improve the overall speed (especially at the decoder)?

Transform coefficient coding in HEVC strives to achieve a balance between coding efficiency and practicality. As such, features and tools addressing the practical issues were adopted only if they improved coding efficiency or, at worst, resulted in a slight degradation. With this in mind, the key principles in the design of transform coefficient coding in HEVC can be summarized as follows.

- Improve coding efficiency, as this is one of the primary goals of HEVC.
- Reduce the number of coded bins, both on average and in the worst case, to guarantee a minimum throughput.
- Increase the percentage of bypass bins and group bypass bins together for higher throughput.
- Reduce the dependency in context derivation of the current bin on previously coded bins.
- Avoid interleaving syntax elements; a serial dependency exists if a syntax element depends on the value of the previous syntax element.
- Reduce the number of contexts: in H.264/AVC, the number of contexts used for coefficient coding is a high percentage of the total number of contexts [5].
- Simplify scans for the hardware and SIMD implementation.
- Simplify and modularize the coding of large TBs.

SECTION III

There are two distinct concepts in scanning. A scan pattern converts a 2-D block into a 1-D array and defines a processing order for the samples or coefficients. A scan pass is an iteration over the transform coefficients in a block (as per the selected scan pattern) in order to code a particular syntax element.

In H.264/AVC, a zigzag scan is used. A zigzag scan interacts poorly with the template-based context models [7] in which the context for a coefficient depends on the previous coefficients whenever the scan moves from one diagonal to another. A diagonal scan starts in the top right corner and proceeds to the bottom left corner. The diagonal scan reduces the data dependency, which allows for a higher degree of parallelism in context derivation. In HEVC, although the context model for the significance of a coefficient has been refined to remove the data dependency problem, a diagonal scan is still used instead of a zigzag scan.

In HEVC, the scan in a 4×4 TB is diagonal. The scan in a larger TB is divided into 4×4 subblocks and the scan pattern consists of a diagonal scan of the 4×4 subblocks and a diagonal scan within each of the 4×4 subblocks [8]. This is possible because in HEVC, the dimensions of all TBs are a multiple of 4. Fig. 1 shows the diagonal scan pattern in an 8×8 TB, which is split into four subblocks. One reason for dividing larger TBs into 4×4 subblocks is to allow for modular processing, that is, for harmonized subblock based processing across all block sizes. Additionally, the implementation complexity of a scan for the entire TB is much higher than that of a scan based on 4×4 subblocks, both in software (SIMD) implementations and hardware (the estimated gate count for the subblock scan is one half [9]).

Horizontal and vertical scans may also be applied in the intra case for 4×4 and 8×8 TBs. The horizontal and vertical scans are defined by row-by-row and column-by-column scans, respectively, within the 4×4 subblocks. The scan over the 4×4 subblocks is the same as that used within the subblock.

A coefficient group (CG) is defined as a set of 16 consecutive coefficients in a scan order. Given the scan patterns in HEVC, a CG corresponds to a 4×4 subblock. This is illustrated in Fig. 2, where each color corresponds to a different CG. A 4×4 TB consists of exactly one CG. TBs of size 8×8, 16×16, 32×32 are partitioned into nonoverlapping 4×4 CGs.
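The relationship between the subblock-based diagonal scan and the CGs can be sketched in a few lines of Python. This is a minimal illustration, not the normative scan derivation: the function names, the `(x, y)` tuple layout, and the `top_right_first` switch are assumptions introduced here, and the traversal direction within each diagonal follows the description in the text (top right to bottom left) rather than the specification tables.

```python
from typing import List, Tuple

Pos = Tuple[int, int]  # (x, y) = (column, row); layout chosen for this sketch


def diagonal_scan(n: int, top_right_first: bool = True) -> List[Pos]:
    """Forward diagonal scan of an n x n block, one anti-diagonal at a time."""
    order: List[Pos] = []
    for d in range(2 * n - 1):                 # d = x + y indexes the anti-diagonal
        xs = [x for x in range(n) if 0 <= d - x < n]
        if top_right_first:
            xs.reverse()                       # largest x (top right) comes first
        order.extend((x, d - x) for x in xs)
    return order


def subblock_diagonal_scan(t: int, top_right_first: bool = True) -> List[Pos]:
    """Diagonal scan over the 4x4 subblocks of a t x t TB, diagonal inside each."""
    assert t % 4 == 0, "TB dimensions are multiples of 4"
    inner = diagonal_scan(4, top_right_first)
    order: List[Pos] = []
    for sx, sy in diagonal_scan(t // 4, top_right_first):
        order.extend((4 * sx + x, 4 * sy + y) for x, y in inner)
    return order


# An 8x8 TB: every run of 16 consecutive scan positions is one coefficient group.
scan8 = subblock_diagonal_scan(8)
coefficient_groups = [scan8[i:i + 16] for i in range(0, len(scan8), 16)]
```

Because the outer loop visits one subblock at a time, each slice of 16 consecutive positions stays inside a single 4×4 subblock, which is exactly the CG property stated above. Coding then walks this list in reverse.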

Scanning starts at the last significant coefficient in a block and proceeds to the DC coefficient in the reverse scanning order defined in Section III-A. CGs are scanned sequentially. Up to five scan passes are applied to a CG [10] and all the scan passes follow the same scan pattern [11]. Each scan pass codes a syntax element for the coefficients within a CG, as follows.

- $significant{\_}coeff{\_}flag$: significance of a coefficient (zero/nonzero).
- $coeff{\_}abs{\_}level{\_}greater1{\_}flag$: flag indicating whether the absolute value of a coefficient level is greater than 1.
- $coeff{\_}abs{\_}level{\_}greater2{\_}flag$: flag indicating whether the absolute value of a coefficient level is greater than 2.
- $coeff{\_}sign{\_}flag$: sign of a significant coefficient (0: positive, 1: negative).
- $coeff{\_}abs{\_}level{\_}remaining$: remaining value for absolute value of a coefficient level (if value is larger than that coded in previous passes).

In each scan pass, a syntax element is coded only when necessary as determined by the previous scan passes. For example, if a coefficient is not significant, the remaining scan passes are not necessary for that coefficient. The bins in the first three scan passes are coded in regular mode, while the bins in scan passes 4 and 5 are coded in bypass mode, so that all the bypass bins in a CG are grouped together [12]. In certain scenarios, scan passes 2 and 3 may be terminated early. In these cases, the remaining flags are not coded, and any information signaled by these flags is instead signaled by the syntax element $coeff{\_}abs{\_}level{\_}remaining$, thus shifting more bins to bypass mode [13].

Data processing is localized within a CG and once a CG is fully processed, its coefficient levels can be reconstructed before proceeding to the next one. With this syntax-plane coding approach, syntax elements are separated into different scan passes, thus helping speculative coding algorithms [14], since the next syntax element to be processed within a scan pass is known. In contrast, in H.264/AVC, all the syntax elements specifying the level information of a significant coefficient are coded before proceeding to the next coefficient. Therefore, the value of the current syntax element determines the type of the next syntax element to be processed. For instance, if $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ equals 1, then the next syntax element is $coeff{\_}abs{\_}level{\_}greater2{\_}flag$. Otherwise, the next element is $coeff{\_}sign{\_}flag$. This kind of dependency is considerably reduced with the new design.

A new tool in HEVC that improves coding efficiency is mode dependent coefficient scanning (MDCS) [15]. For intra coded blocks, the scanning order of a 4×4 TB and an 8×8 luma TB is determined by the intra prediction mode. Each of the 35 intra modes uses one of the three possible scanning patterns: diagonal, horizontal, or vertical. A look-up table maps the intra prediction mode to one of the scans.

This tool exploits the horizontal or vertical correlation of the residual depending on the intra prediction mode. For example, for a horizontal prediction mode, transform coefficient energy is clustered in the first few columns, so a vertical scan results in fewer bins being entropy coded. Similarly, for a vertical prediction mode, a horizontal scan is beneficial. Experiments showed that including horizontal and vertical scans for large TBs offers little gain in compression efficiency, so the application of these scans is limited to the two smaller TB sizes. A detailed performance evaluation of MDCS is provided in Section VIII.
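The mode-to-scan mapping can be pictured as a small function. The sketch below is illustrative only: the text does not reproduce the look-up table, so the mode indices (10 for pure horizontal, 26 for pure vertical) and the ±4 neighborhood are assumptions made here and should be checked against the specification. The function name `mdcs_scan` is likewise hypothetical.

```python
HORIZONTAL_MODE = 10   # assumed index of the pure horizontal intra mode
VERTICAL_MODE = 26     # assumed index of the pure vertical intra mode
NEIGHBORHOOD = 4       # assumed half-width of the "near horizontal/vertical" range


def mdcs_scan(intra_mode: int, tb_size: int, is_luma: bool) -> str:
    """Pick a scan pattern for an intra-coded TB (illustrative mapping)."""
    assert 0 <= intra_mode < 35, "HEVC defines 35 intra prediction modes"
    # MDCS is restricted to 4x4 TBs and 8x8 luma TBs; everything else is diagonal.
    mdcs_applies = tb_size == 4 or (tb_size == 8 and is_luma)
    if not mdcs_applies:
        return "diagonal"
    if abs(intra_mode - HORIZONTAL_MODE) <= NEIGHBORHOOD:
        return "vertical"      # horizontal prediction -> energy in first columns
    if abs(intra_mode - VERTICAL_MODE) <= NEIGHBORHOOD:
        return "horizontal"    # vertical prediction -> energy in first rows
    return "diagonal"
```

With these assumed ranges, a near-horizontal mode selects the vertical scan and a near-vertical mode selects the horizontal scan, matching the reasoning in the paragraph above. All other modes, and all larger TBs, fall back to the diagonal scan.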

SECTION IV

Signaling the last significant coefficient typically reduces the number of coded bins by saving the explicit coding of trailing zeros toward the end of a forward scan. In H.264/AVC, the significance map coding is carried out by interleaving a bin indicating the significance of the coefficient $(significant{\_}coeff{\_}flag)$ and a bin indicating whether the coefficient is the last significant within the block $(last{\_}significant{\_}coeff{\_}flag)$ when the $significant{\_}coeff{\_}flag$ equals 1.

The drawback of the H.264/AVC approach is that the decision regarding coding of the $last{\_}significant{\_}coeff{\_}flag$ for a particular transform coefficient is dependent on the value of the $significant{\_}coeff{\_}flag$. As discussed above, this increases the complexity of speculative coding substantially. This is especially critical for the significance map as it may account for more than 40% of the bins.

In HEVC, the coding of the significance flag is separated from the coding of the last significant coefficient flag. To achieve this, the position of the last significant coefficient in a TB following the forward scan order is coded first, and then, the $significant{\_}coeff{\_}flags$ are coded. The position of the last significant coefficient in a block is coded by explicitly signaling its $(X, Y)$-coordinates [16]. Coordinate $X$ indicates the column number and $Y$ the row number.

The coordinates are binarized in two parts, a prefix and a suffix. The first part represents an index to an interval (syntax elements $last{\_}significant{\_}coeff{\_}x{\_}prefix$ and $last{\_}significant{\_}coeff{\_}y{\_}prefix$). This prefix has a truncated unary representation and the bins are coded in regular mode. The second part ($last{\_}significant{\_}coeff{\_}x{\_}suffix$ and $last{\_}significant{\_}coeff{\_}y{\_}suffix$) has a fixed length representation and is coded in bypass mode [17]. The suffix represents the offset within the interval. For certain values of the prefix, the suffix is not present and is assumed to be zero.

Let $T$ be the transform size. The number of intervals is $N+1$, where $N=2\log_{2}(T)-1$. The truncated unary code for size $N$ is used to code the interval index, i.e., the *prefix*. The interval lengths are shared across all transform sizes and the binarization is also shared (except when the unary code is truncated). The suffix is coded only when the interval length is larger than one, that is, when $\mathit{prefix}>3$. The *suffix* is represented by a fixed length binary code using $b$ bits specifying the offset within the interval, where

$$b=\max (0,\lfloor \mathit{prefix}/2\rfloor-1)\tag{1}$$

and the suffix range is

$$\mathit{suffix}\in\{0,\ldots, 2^{b}-1\}.\tag{2}$$

The fixed length code is signaled starting with the most significant bit.

The magnitude of the last position, denoted by *last*, can be derived from the *prefix* and *suffix* as

$$\mathit{last}=\begin{cases}2^{b}\left(2+\mathrm{mod}(\mathit{prefix},2)\right)+\mathit{suffix}, & \text{if } \mathit{prefix}>3\\ \mathit{prefix}, & \text{otherwise}\end{cases}\tag{3}$$

where $\mathrm{mod}(\cdot)$ is the modulus after division operation.

The maximum length of the truncated unary code (which is also the number of regular coded bins) for one coordinate is 3, 5, 7, and 9 for transform sizes of 4, 8, 16, and 32, respectively. The maximum number of bins for coding one coordinate (regular and bypass) is 3, 6, 9 and 12, respectively. As an example, Table I shows the binarization for $T=32$.
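The prefix/suffix binarization of one coordinate can be checked with a short Python sketch built directly on (1) and (3). The helper names are hypothetical, and the bin polarity of the truncated unary code (a run of ones closed by a zero) is an assumption made for illustration.

```python
from typing import Tuple


def suffix_bits(prefix: int) -> int:
    """Equation (1): number of fixed-length suffix bits for a given prefix."""
    return max(0, prefix // 2 - 1)


def interval_start(prefix: int) -> int:
    """Equation (3) evaluated with suffix = 0: first value of the interval."""
    if prefix > 3:
        return (1 << suffix_bits(prefix)) * (2 + prefix % 2)
    return prefix


def binarize_last(value: int, t: int) -> Tuple[str, str]:
    """Return (prefix_bins, suffix_bins) for one coordinate in a t x t TB."""
    assert 0 <= value < t
    n = 2 * (t.bit_length() - 1) - 1           # N = 2*log2(T) - 1, largest prefix
    # The prefix is the index of the interval that contains the value.
    prefix = max(p for p in range(n + 1) if interval_start(p) <= value)
    # Truncated unary: the terminating zero is dropped when prefix == N.
    prefix_bins = "1" * prefix + ("0" if prefix < n else "")
    b = suffix_bits(prefix)
    suffix = value - interval_start(prefix)
    suffix_bins = format(suffix, f"0{b}b") if b > 0 else ""   # MSB first
    return prefix_bins, suffix_bins


def decode_last(prefix: int, suffix: int = 0) -> int:
    """Equation (3): rebuild the coordinate from prefix and suffix."""
    return interval_start(prefix) + suffix
```

For example, `binarize_last(5, 8)` gives the prefix bins `11110` and the suffix bin `1`: the value 5 lies in the interval starting at 4, which has length two. Running the function over all coordinates of each transform size reproduces the maximum bin counts quoted above.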

In order to group bypass bins, the prefix of the $X$ coordinate is signaled first, followed by the prefix of $Y$. After that, the suffixes for $X$ and $Y$ are signaled. Coordinates $X$ and $Y$ have separate sets of contexts. Different bins within the truncated unary part with similar statistics share contexts in order to reduce the total number of contexts. The number of contexts for the prefix of one coordinate is 18 (15 for luma and 3 for chroma), so the total number of contexts for last position coding is 36. Table II shows the context assignment for different bins for a given coordinate across all transform sizes, luma, and chroma components.

This method has no performance penalty [16] with respect to the interleaved signaling of significance map and last flags in H.264/AVC. At the same time it has the following advantages.

- On average, the total number of bins for the last position is reduced.
- In the worst case, the number of bins is significantly reduced. For an $N\times N$ TB, the maximum number of bins with the H.264/AVC method is $N\times N-1$. For example, in the case of a 32×32 TB, the worst case is 1023 bins in regular mode, whereas in HEVC, the worst case is reduced to 24 bins: 12 bins per coordinate when each value is equal to or larger than 24 (see last row in Table I).
- Some of the bins are coded in bypass mode and grouped.
- Interleaving of significance map and last flags is eliminated.
- In H.264/AVC, the scan pass for the significance map is in the forward direction due to the method of signaling the last coefficient. For the next scan pass the position of the last coefficient is known, and hence, coefficient levels are scanned in reverse order enabling the usage of efficient context models [5]. The HEVC method allows all the scan passes to use the same reverse scan order.
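The worst-case comparison in the list above is simple arithmetic and can be tabulated for every transform size. The function below is a hypothetical helper written for this illustration. It applies the H.264/AVC-style count of one last flag per position except the final one to all sizes, purely for comparison.

```python
from typing import Tuple


def worst_case_last_bins(t: int) -> Tuple[int, int]:
    """Return (interleaved last-flag bins, HEVC last-position bins) for a t x t TB."""
    log2_t = t.bit_length() - 1
    max_prefix = 2 * log2_t - 1                 # longest truncated unary prefix
    max_suffix = max(0, max_prefix // 2 - 1)    # suffix bits of the last interval
    interleaved = t * t - 1                     # one last flag per position but one
    hevc = 2 * (max_prefix + max_suffix)        # X and Y coordinates
    return interleaved, hevc


table = {t: worst_case_last_bins(t) for t in (4, 8, 16, 32)}
```

For a 32×32 TB this yields 1023 versus 24 bins, the figures quoted in the list.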

SECTION V

In HEVC, as in H.264/AVC, a coded block flag (CBF) signals the significance of the entire TB, i.e., it signals whether a TB contains nonzero transform coefficients. There are separate CBFs for luma and each of the two chroma components. A CBF is coded using CABAC in regular mode. Context selection for a CBF depends on the depth of the block in the quadtree hierarchy and whether it is comprised of luma or chroma samples. There are two contexts for luma and three for chroma. The contexts are defined in such a way for simplicity and due to the disparity between their statistics. Indeed, luma and chroma blocks have different properties as do blocks at different levels of the RQT. All of the transform coefficients in the TB are zero when a CBF equals zero. On the other hand, when a CBF equals one, the last significant coefficient and the significance map, which identifies the positions of the nonzero coefficients in the TB, are coded.

The significance map is coded in the first scan pass over the residual data. Since the last coefficient is already known to be significant, the significance map scan pass starts at the coefficient before the last coefficient in the scan order, and continues backwards until the top-left coefficient of the CG is reached. The scan pass then proceeds to the next CG in reverse scan order, continuing in this manner until the entire TB has been processed.

Effective intra and inter frame prediction methods in HEVC reduce the energy of the prediction residual. The strong energy compaction property of the discrete cosine and sine transforms concentrates this energy in a small number of coefficients. Then, quantization may adjust certain coefficients to zero. As a result, the significance map is often sparse. In order to exploit this sparsity, a new approach introduced in HEVC is to code the significance map within a TB in two levels [18]. The idea is to group together coefficients and code the significance of coefficient groups before coding the significance of the coefficients contained within them. As such, this design is highly compatible with the subblock and CG based scans described in Section III-A. Benefits come from the fact that if the significance of a CG is known to be zero, then the coefficients in that CG do not need to be coded since they can be inferred to be zero.

Let $p_{0}$ denote the probability that a given CG is comprised entirely of coefficients that are zero, and let $N$ denote the number of coefficients in that CG. Then, if

$$p_{0}+(N+1)\times (1-p_{0})<N\tag{4}$$

the coding complexity measured in terms of the number of explicitly coded bins is reduced. This in turn increases the average throughput as per the design principles. Experimental results (see Section VIII-B) indeed show that the approach in [18] reduces the average number of coded bins, in addition to improving the coding efficiency.
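Condition (4) can be solved for $p_{0}$ to see how sparse the significance map must be for the two-level scheme to pay off. The short derivation below assumes, as (4) does, that an all-zero CG costs one bin and any other CG costs $N+1$ bins, and it ignores flags that are inferred rather than coded:

$$p_{0}+(N+1)(1-p_{0})=N+1-Np_{0}<N\quad\Longleftrightarrow\quad p_{0}>\frac{1}{N}.$$

For the 4×4 CGs used in HEVC, $N=16$, so under these assumptions the expected number of explicitly coded bins drops as soon as more than one CG in sixteen is entirely zero.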

In HEVC, the significance information is coded at multiple levels. The CBF signals the significance of the entire TB, while within a TB, level $L_{1}$ corresponds to the significance of CGs, and level $L_{0}$ corresponds to the significance of individual coefficients.

At $L_{1}$, the significance of a CG is defined to be 1 if at least one coefficient in that CG is nonzero, and 0 otherwise. The significance of a CG is signaled using the syntax element $coded{\_}sub{\_}block{\_}flag$ (CSBF). The flag of the CG containing the last significant coefficient is not coded, since it is known to be 1. Similarly, the CSBFs of all subsequent CGs (in forward scan order) are not coded because they are known to be 0. To improve coding efficiency, the CSBF of the CG containing the DC coefficient is not coded but is instead implicitly set to 1 at both the encoder and decoder. This is because the probability of this CG being entirely comprised of coefficients that are 0 is itself nearly 0.

The CSBF is coded using CABAC in regular mode. The context model is based on a template of neighboring CGs, the basic premise being that the significance of neighboring CGs can be used to make a good prediction about the significance of the current CG. The context $c_{g}$ for the CSBF of a given CG $g$ can be 0 or 1 and is derived as follows:

$$c_{g}=\min (1, s_{r}+s_{l})\tag{5}$$

where $s_{r}$ and $s_{l}$ are equal to the CSBF of the neighboring right and lower CGs, respectively. If the neighboring CG falls outside the boundary of the TB, its CSBF is assumed to be 0.
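The context derivation in (5) amounts to a logical OR of two neighbor flags, as the following Python sketch shows. The function name and the row-major `csbf[row][column]` layout are assumptions made for this example.

```python
from typing import List


def csbf_context(csbf: List[List[int]], gx: int, gy: int) -> int:
    """Equation (5): context index for the CSBF of the CG at column gx, row gy."""
    n = len(csbf)                                  # CGs per TB side, e.g. 4 for 16x16
    s_r = csbf[gy][gx + 1] if gx + 1 < n else 0    # right neighbor, 0 outside the TB
    s_l = csbf[gy + 1][gx] if gy + 1 < n else 0    # lower neighbor, 0 outside the TB
    return min(1, s_r + s_l)


# CSBF map of a 16x16 TB (4x4 grid of CGs), indexed as csbf[row][column].
csbf_map = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
contexts = [[csbf_context(csbf_map, x, y) for x in range(4)] for y in range(4)]
```

Since CGs are processed in reverse scan order, the right and lower neighbors of a CG have already been handled by the time its own CSBF is coded, so both inputs to (5) are available without waiting on the current CG.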

At $L_{0}$, the significance of the individual coefficients is signaled using the $significant{\_}coeff{\_}flag$. $L_{1}$ is leveraged to avoid having to explicitly code this flag in many instances. A $significant{\_}coeff{\_}flag$ in the CG containing the DC coefficient is always coded since the CSBF of that CG is always 1. Otherwise, a $significant{\_}coeff{\_}flag$ is not coded if:

- the corresponding coefficient is in a CG with ${\rm CSBF}=0$, in which case the $significant{\_}coeff{\_}flag$ is inferred to be 0, or
- the corresponding coefficient is the last in reverse scan order in a CG with ${\rm CSBF}=1$, and all other coefficients in that CG are 0, in which case the $significant{\_}coeff{\_}flag$ is inferred to be 1.

A significance flag is coded using a context model for each coefficient between the last one in scanning order (which is excluded) and the DC coefficient. For 4×4 TBs, the context depends on the position of the coefficient within the TB, as in H.264/AVC. Coefficient positions are grouped according to their frequency [19] and the significance flags within a group are coded using the same context. Fig. 3 shows the context modeling for a 4×4 TB. High-frequency coefficients with similar statistical distributions share the same context, while a separate context is assigned to each of the lower frequency coefficients. Luma and chroma components are treated in the same way for simplicity, even though significance information in chroma blocks requires less modeling.

The position-based context modeling approach is simple to implement and allows for a high degree of parallelism in context derivation. On the other hand, a template-based context modeling approach, in which a context is determined using a causal neighborhood [7], provides a higher coding efficiency in large TBs. However, this approach does not allow for a high degree of parallelism because in certain cases, a context is dependent on the significance of coefficients immediately preceding it in the scan order.

In HEVC, context modeling for significance flags in 8×8, 16×16 and 32×32 TBs is both position and template-based. The key is that the template is designed to avoid data dependencies within a CG [20]. As shown in Fig. 4, a context is selected for a significance flag depending on a template of the neighboring right and lower CSBF, $s_{r}$ and $s_{l}$ respectively, and on the position of the coefficient within the current CG. There are 4 patterns, corresponding to the 4 combinations of $s_{r}$ and $s_{l}$, with each pattern assigning different contexts (represented by distinct numbers in Fig. 4) to the different positions in the CG [21]. For example, if $s_{r}=0$, $s_{l}=1$ and the coefficient is in the top leftmost position of the CG, the context for its significance flag is “2.” This design sacrifices some of the coding gain of a context model based on a template of neighboring coefficients, but allows the determination of up to 16 contexts in parallel. This tradeoff is an example of how transform coefficient coding in HEVC achieves a balance between coding efficiency and practicality.

TBs are split into two regions: the top leftmost subblock is region 1, and the rest of the subblocks make up region 2. Both regions use the context selection method described above, but have different sets of contexts for luma to account for the different statistics of the low and high frequencies. For chroma blocks, contexts for regions 1 and 2 are shared. The DC component has a single dedicated context that is shared across all TB sizes. TBs of size 16×16 and 32×32 share contexts to limit the total number of contexts.

SECTION VI

H.264/AVC codes the absolute level as a truncated unary code in the regular mode for bins 1 to 14. If the level is larger than 14, then a suffix is appended to the truncated unary code. The suffix is binarized with a 0th-order Exp-Golomb code (EG0) and coded in bypass mode.

Coefficient level coding in HEVC is partly inherited from H.264/AVC. Several modifications are introduced to address large TBs [7] and to enhance throughput by encoding more bins in bypass mode [13], [22]. The absolute level of a significant coefficient is coded in the second, third and fifth scanning passes in a CG. The corresponding syntax elements are $coeff{\_}abs{\_}level{\_}greater1{\_}flag$, $coeff{\_}abs{\_}level{\_}greater2{\_}flag$ and $coeff{\_}abs{\_}level{\_}remaining$.

In order to improve throughput, the second and third passes may not process all the coefficients in a CG [13]. The first eight $coeff{\_}abs{\_}level{\_}greater1{\_}flag{\rm s}$ in a CG are coded in regular mode. After that, the values are left to be coded in bypass mode in the fifth pass by the syntax element $coeff{\_}abs{\_}level{\_}remaining$. Similarly, only the $coeff{\_}abs{\_}level{\_}greater2{\_}flag$ for the first coefficient in a CG with magnitude larger than 1 is coded. The remaining coefficients in the CG with magnitude larger than 1 use $coeff{\_}abs{\_}level{\_}remaining$ to code their values. This method limits the number of regular bins for coefficient levels to a maximum of 9 per CG: 8 for the $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ and 1 for the $coeff{\_}abs{\_}level{\_}greater2{\_}flag$. Introducing this method in the HEVC design has no performance impact, as demonstrated in [13].

For coefficient level flags, a context set is selected depending on whether there is a $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ equal to 1 in the previous CG [23] and whether the DC coefficient is part of the CG [11], i.e., whether the current CG is in region 1 or 2. For chroma, the context set assignment does not depend on the CG location. Therefore, there are 4 different context sets for luma and 2 for chroma, as shown in Table III. Each set has 4 context models for $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ and 1 context for $coeff{\_}abs{\_}level{\_}greater2{\_}flag$, so the number of contexts for these syntax elements is 24 and 6, respectively. The specific context within a context set for $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ is selected depending on the number of trailing ones and the number of coefficient levels larger than 1 in the current CG (Table IV). The same logic and contexts are applied to all TB sizes.

After the level flags are coded, the fifth and last scan pass codes the syntax element $coeff{\_}abs{\_}level{\_}remaining$, which specifies the remaining absolute value of the coefficient level. Let the *baseLevel* of a coefficient be defined as

$$\begin{aligned}{baseLevel}={}&\,{significant{\_}coeff{\_}flag}\\ &+{coeff{\_}abs{\_}level{\_}greater1{\_}flag}\\ &+{coeff{\_}abs{\_}level{\_}greater2{\_}flag}\end{aligned}\eqno{\hbox{(6)}}$$

where a flag has a value of 0 or 1 and is inferred to be 0 if not present. Then, the absolute value of the coefficient is simply

$${absCoeffLevel}={baseLevel}+{coeff{\_}abs{\_}level{\_}remaining}.\eqno{\hbox{(7)}}$$

The syntax element $coeff{\_}abs{\_}level{\_}remaining$ is present in the bitstream if a coefficient level is greater than 2 or whenever the maximum number of $coeff{\_}abs{\_}level{\_}greater1{\_}flag$ or $coeff{\_}abs{\_}level{\_}greater2{\_}flag$ per CG is reached. It is binarized using Golomb–Rice codes and Exp-Golomb codes [24].
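Equations (6) and (7) translate almost directly into code. The following sketch uses hypothetical function names and represents an absent flag as `None`. It also assumes that an absent $coeff{\_}abs{\_}level{\_}remaining$ contributes 0.

```python
def base_level(significant, greater1=None, greater2=None):
    """Equation (6); flags that are not present (None) are inferred to be 0."""
    return int(significant) + int(greater1 or 0) + int(greater2 or 0)

def abs_coeff_level(significant, greater1=None, greater2=None, remaining=0):
    """Equation (7); an absent remaining value is assumed to be 0."""
    return base_level(significant, greater1, greater2) + remaining

# All three flags coded and equal to 1, remaining = 4 -> magnitude 7.
level_with_flags = abs_coeff_level(1, 1, 1, remaining=4)
# Flag budget exhausted (no greater1/greater2 flag), remaining = 4 -> magnitude 5.
level_without_flags = abs_coeff_level(1, remaining=4)
```

The second case shows why the remaining value must be present once the per-CG flag limit is reached: with both flags inferred to be 0, the base level is only 1.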

Golomb–Rice codes are a subset of Golomb codes and represent a value $n\geq 0$, given a tunable Rice parameter $m$, as a quotient $q$ and a remainder $r$

$$q=\lfloor n/m\rfloor\eqno{\hbox{(8)}}$$

$$r=n-q\times m\eqno{\hbox{(9)}}$$

where $m$ is a power of 2. The quotient $q$ is the prefix and has a unary code representation. The remainder $r$ is the suffix and has a fixed length representation. Golomb–Rice codes are attractive here for several reasons. Firstly, they are optimal for geometrically distributed sources such as the residual coefficients. Secondly, since $m$ is a power of 2, division and multiplication can be efficiently implemented using shift operations. Finally, the fixed length part is coded with exactly $\log_{2}(m)$ bins, which simplifies reading from the bitstream.
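A minimal sketch of (8) and (9) follows. The name `k` is introduced here for $\log_{2}(m)$, so that the divisor is $m=2^{k}$ and both operations reduce to shifts. The unary convention (a run of ones terminated by a zero) and the function names are assumptions made for illustration.

```python
def golomb_rice_encode(n, k):
    """Return the Golomb-Rice codeword of n >= 0 as a bit string, with m = 2**k."""
    q = n >> k            # quotient, floor(n / m), equation (8)
    r = n - (q << k)      # remainder, n - q * m, equation (9)
    prefix = "1" * q + "0"                        # unary prefix
    suffix = format(r, "b").zfill(k) if k else ""  # exactly k suffix bits
    return prefix + suffix

def golomb_rice_decode(bits, k):
    """Inverse of golomb_rice_encode for a single codeword."""
    q = bits.index("0")   # length of the run of ones
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) + r
```

For instance, with `k = 1` the value 5 has quotient 2 and remainder 1, giving the codeword `1101`.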

Exp-Golomb codes, like Golomb–Rice codes, have implementation and speed advantages. They are also very efficient for geometric distributions, but more robust to changes in the source distribution. The code structure is similarly formed by a unary prefix followed by a fixed length suffix, but the number of codewords in the suffix part doubles after each bit in the unary code. Therefore, Exp-Golomb codes have a slower growth of the codeword length. By using Exp-Golomb codes, the maximum codeword length for $coeff{\_}abs{\_}level{\_}remaining$ is kept within 32 bits.

The syntax element $coeff{\_}abs{\_}level{\_}remaining$ is coded in bypass mode in order to increase throughput. HEVC employs Golomb–Rice codes for small values and switches to an Exp-Golomb code for larger values. The transition point between the codes is when the unary code length equals 4. Table V shows the binarization for Rice parameter $m=0$ and $m=1$.
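The switch between the two codes can be sketched as follows. Table V is not reproduced in this section, so the exact codewords should be checked against it and against the specification. The sketch assumes that prefixes of fewer than four ones carry a Golomb–Rice codeword, that a prefix of four ones acts as an escape, and that the excess is then coded with an Exp-Golomb code of order one higher than the Rice parameter. The names `exp_golomb`, `binarize_remaining`, and `rice` (the Rice parameter of Table V) are hypothetical.

```python
def exp_golomb(value, k):
    """k-th order Exp-Golomb codeword of value >= 0 as a bit string."""
    bits = ""
    while value >= (1 << k):
        bits += "1"            # each prefix bit doubles the suffix range
        value -= 1 << k
        k += 1
    bits += "0"
    if k:
        bits += format(value, "b").zfill(k)
    return bits

def binarize_remaining(value, rice, max_prefix=4):
    """Golomb-Rice code for small values, Exp-Golomb escape for large ones."""
    quotient = value >> rice
    if quotient < max_prefix:
        remainder = value & ((1 << rice) - 1)
        suffix = format(remainder, "b").zfill(rice) if rice else ""
        return "1" * quotient + "0" + suffix
    # Escape: four ones, then the excess over the Golomb-Rice range.
    excess = value - (max_prefix << rice)
    return "1" * max_prefix + exp_golomb(excess, rice + 1)
```

Under these assumptions and with Rice parameter 0, the values 0 to 3 map to `0`, `10`, `110`, and `1110`, while 4 maps to `111100`. Beyond the transition, the codeword grows by two bits only each time the value range doubles.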

The Rice parameter is set to 0 at the beginning of each CG and it is conditionally updated depending on the previous value of the parameter and the current absolute level as follows:

$${\rm if}\ {absCoeffLevel}>3\times 2^{m},\quad m=\min(4, m+1).\eqno{\hbox{(10)}}$$
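The update rule (10) can be traced over one CG with a short sketch. Here $m$ is used as in (10), i.e., as the exponent in $2^{m}$. The function names are hypothetical, and the sketch assumes that the input lists only the coefficients for which $coeff{\_}abs{\_}level{\_}remaining$ is coded, in coding order.

```python
def update_rice_parameter(m, abs_coeff_level):
    """Apply equation (10) once and return the new Rice parameter."""
    if abs_coeff_level > 3 * (1 << m):
        return min(4, m + 1)
    return m

def rice_parameters_for_cg(abs_levels):
    """Rice parameter used for each level of one CG, starting from 0."""
    m = 0
    used = []
    for level in abs_levels:
        used.append(m)                      # parameter used for this level
        m = update_rice_parameter(m, level)  # adapt for the next one
    return used

trace = rice_parameters_for_cg([2, 4, 5, 7, 13, 30, 100, 100])
# trace == [0, 0, 1, 1, 2, 3, 4, 4]
```

The parameter only ever increases within a CG, grows by at most one per coefficient, and saturates at 4.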

The parameter update process allows the binarization to adapt to the coefficient statistics when large values are observed in the distribution. Fig. 5 summarizes the absolute level binarization processes of H.264/AVC and HEVC.

SECTION VII

In HEVC, the sign of each nonzero coefficient is coded in the fourth scan pass in bypass mode, assuming that these symbols are equiprobable and uncorrelated. Sign flags $coeff{\_}sign{\_}flag$ represent a substantial proportion of a compressed bitstream (around 15–20% depending on the configurations). It is difficult to directly compress this information. However, HEVC provides a mechanism to reduce the number of coded signs, called sign data hiding (SDH), introduced in [25] and [26].

For each CG, and depending on a criterion, encoding the sign of the last nonzero coefficient (in reverse scan order) is simply omitted when using SDH. Instead, the sign value is embedded in the parity of the sum of the levels of the CG using a predefined convention: even corresponds to “+” and odd to “−.” The criterion to use SDH is the distance in scan order between the first and the last nonzero coefficients of the CG. If this distance is equal to or larger than 4, SDH is used. This value of 4 was chosen because it provides the largest gain on HEVC test sequences. Also, this value is fixed in the standard, because experiments could not establish that additional compression gain can be obtained by allowing this threshold to vary for different sequences or pictures [27]. Having a fixed value simplifies hardware implementation and bitstream conformance testing.
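The criterion and the parity convention can be sketched as follows. The function names are hypothetical. The sketch assumes that the signed quantized levels of one CG are given as a list in coding (reverse scan) order, so that the sign-hidden coefficient is the last nonzero entry of the list.

```python
SDH_MIN_DISTANCE = 4  # threshold fixed by the standard (from the text)

def sdh_applies(levels):
    """True if the first and last nonzero levels are at least 4 positions apart."""
    nonzero = [i for i, v in enumerate(levels) if v != 0]
    return bool(nonzero) and nonzero[-1] - nonzero[0] >= SDH_MIN_DISTANCE

def inferred_sign(levels):
    """Sign carried by the parity of the sum of levels: even -> +1, odd -> -1.

    The parity is the same whether signed or absolute levels are summed.
    """
    return -1 if sum(abs(v) for v in levels) % 2 else +1
```

For the CG `[3, 0, 0, 0, 0, -2]`, the distance is 5, so SDH applies, and the odd sum (5) signals a negative sign for the last entry.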

On the encoder side, if the criterion is met and if the parity of the sum of the levels in the CG matches the sign to omit, there is no additional process, and one bit of information is saved by avoiding the signaling of this sign. If the parity does not match the sign to omit, the encoder has to change the value of one of the quantized coefficients in the CG so that the parity matches the sign. This is an encoder choice and it is up to the implementer to decide which coefficient to modify and how. Of course, it is preferable to make the change that least affects the rate-distortion (RD) performance. References [25] and [26] show that such a change can be found with a satisfactory tradeoff between complexity and compression.

A first approach [25] relies on rate-distortion optimized quantization (RDOQ) being used during encoding. RDOQ [28], [29] is an encoder-only method that adjusts the quantized values of the coefficients to minimize a joint RD cost function. RDOQ tests alternate quantization values of the coefficients and selects them if they provide a better RD tradeoff than the initial ones. When SDH is used, the RD costs computed during the RDOQ phase are also used to identify the parity change that least degrades RD performance. This can be performed without computing new RD costs in addition to the ones already computed by RDOQ and, therefore, very little additional complexity is needed.

A second approach [26] has been proposed when RDOQ is not used, such as in low-complexity encoders. Here, it is desirable to avoid computing RD costs, mostly because of the complexity incurred by simulating the encoding of alternate quantization values by CABAC. Therefore, for each coefficient in a CG, only the difference between the original coefficient and its dequantized value is computed. The coefficient that yields the largest difference magnitude in its CG has its quantized value increased by one (if the difference is positive) or decreased by one (if the difference is negative), thus providing the parity change. Since the coefficient with the largest difference magnitude is also the closest to its alternate quantization value, this process ensures that the impact of the parity change is small. The computation of the difference can be simply derived from the usual quantization formula. Therefore, the impact on the encoder complexity is modest.
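The following sketch illustrates this second approach under simplifying assumptions that are not taken from [26]: a uniform quantizer with a single step size, coefficients of one CG listed in coding (reverse scan) order, and hypothetical function names. As an added safeguard, candidates are tried in order of decreasing difference magnitude, and a change is accepted only if the decoder-side inference stays consistent, since a change at either end of the CG can alter which coefficient has its sign hidden.

```python
def quantize(coeff, step):
    """Uniform quantizer with rounding to nearest (an assumed model)."""
    level = int(abs(coeff) / step + 0.5)
    return level if coeff >= 0 else -level

def hidden_index(levels):
    """Index of the sign-hidden coefficient, or None if SDH is not used."""
    nonzero = [i for i, v in enumerate(levels) if v != 0]
    if nonzero and nonzero[-1] - nonzero[0] >= 4:
        return nonzero[-1]
    return None

def parity_ok(levels):
    """True if a decoder would recover every sign correctly."""
    h = hidden_index(levels)
    if h is None:
        return True  # all signs are coded explicitly
    odd = sum(abs(v) for v in levels) % 2 == 1
    return odd == (levels[h] < 0)  # odd parity <-> negative sign

def apply_sdh(coeffs, step):
    """Quantize one CG and, if needed, change one level by +/-1."""
    levels = [quantize(c, step) for c in coeffs]
    if parity_ok(levels):
        return levels
    diffs = [c - l * step for c, l in zip(coeffs, levels)]
    # Largest difference magnitude first: closest to its alternate value.
    for i in sorted(range(len(levels)), key=lambda j: -abs(diffs[j])):
        trial = list(levels)
        trial[i] += 1 if diffs[i] > 0 else -1
        if parity_ok(trial):
            return trial
    raise RuntimeError("unreachable: an interior change always fixes parity")
```

With a step of 10, the CG `[21, 0, 0, 14, 0, -32]` quantizes to `[2, 0, 0, 1, 0, -3]`, whose even sum contradicts the negative hidden sign. The coefficient 14 has the largest difference (4), so its level is raised to 2, which makes the sum odd.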

On the decoder side, if the SDH criterion is met, the sign of the last nonzero coefficient of each CG is not decoded. Instead, it is inferred from the parity of the sum of the levels in the CG. The advantage of hiding the sign of the last coefficient in scan order (instead of the first, for instance) is clear: when the last coefficient is reached, the information needed to process its sign when SDH is used (such as obtaining the parity of the sum of the quantized coefficients in order to infer the sign) is already available.

The rationale behind SDH, as shown in Fig. 6 (left diagram), resides in the fact that, in about 50% of the CGs where it is used, a full bit is saved. In the other 50%, when a change in one of the quantization levels is needed, the RD loss is moderate because there exist quantization solutions of the CG that are close to the optimal solution and have the opposite parity, as shown in Fig. 6 (right diagram). In this case, embedding the sign bit in a CG gives enough of a chance to find a quantized coefficient that causes moderate RD loss when modified. Furthermore, when there are more sign bits to hide (e.g., with small quantization steps), there is more residual data in which to embed them.

HEVC includes several tools that are aimed at facilitating lossless encoding at the CU level. Since SDH usually requires some change in the quantization levels, it is inherently a lossy algorithm. Therefore, the standard does not allow the use of SDH in CUs where lossless tools are activated [30]. Also, a specific syntax element is provided to activate SDH at picture level during encoding, so an encoder can choose simply not to use SDH if it does not match its efficiency-complexity target.

SECTION VIII

This section provides performance results for transform coefficient coding in HEVC. Individual results for the new features of MDCS, multilevel significance map, and SDH are first reported. Then, HEVC transform coefficient coding and H.264/AVC transform coefficient coding are compared as a whole with respect to coding efficiency, bin usage, and throughput. This is done by comparing a realization of HEVC, HM8.0, and a model of H.264/AVC. In this model, H.264/AVC transform coefficient coding has been extended in a straightforward manner, as in [31], to deal with transforms larger than 8×8 and integrated into HM8.0. The main profile does not include non-square transforms, so no further extension to cover them is necessary. However, RDOQ has been disabled in order to measure the performance of the coefficient coding methods directly without the influence of this encoder-only technique.

The conducted experiments follow the JCT-VC common test conditions as described in [32] using the HEVC main profile in HM8.0. The test set has twenty sequences split in five classes depending on the resolution: class A (4K), class B (1080p), class C (WVGA), class D (WQVGA), and class E (720p). Additionally, there is a class F (not included in the performance averages) composed of screen content sequences. Ten seconds of each sequence are encoded. The three coding configurations in common test conditions are all-intra (AI), low delay (LD), and random access (RA) with a hierarchical GOP structure of size 8 and refresh points every second. Common test conditions do not include results for the highest resolution sequences (class A) in the LD configuration and for the video-conference content (class E) in the RA configuration. The quantization parameter is set to $QP=\{22, 27, 32, 37\}$. Coding efficiency results are presented as the percentage of bit-rate savings (BD-rate [33]) with respect to the main profile anchor. BD-rate computes an average of the bit-rate savings over the four $QP$ points. Positive numbers indicate BD-rate losses. Results include the encoding and decoding times of the methods as a percentage of the tested anchor.

The performance of MDCS is tested by disabling the method. Table VI compares MDCS on and off cases. Positive numbers show the loss incurred by disabling MDCS. For the AI configuration, MDCS shows an average BD-rate gain of 0.8%. Since MDCS is applied only to intra predicted blocks, the gains are lower for the configurations that make use of inter prediction. For RA, the gain is 0.3%, and for LD, which uses less intra prediction than RA, the gain is 0.1%. Encoding and decoding times are essentially unchanged.

The performance of the multilevel significance map is measured by disabling level $L_{1}$ significance. Tables VII and VIII report the BD-rate and number of bins of having the multilevel significance map enabled versus disabled. In Table VIII, the columns labeled T, R, and S correspond to the total bins per pixel, regular coded bins per pixel, and significance map bins per pixel, respectively. In both tables, positive numbers show the loss incurred by disabling the $L_{1}$ significance.

Table VIII shows average savings of significance bins per pixel of 5.5% in AI, 8.2% in RA and 16.2% in LD. The main reason for differences in the three configurations is that inter predicted blocks have sparser residue than intra predicted blocks, since inter prediction is typically more accurate. As explained in Section V-A, the multilevel significance map is designed to take advantage of sparse residual. Table VII shows that the average BD-rate gain provided by the method is between 1.0% and 1.2%, without a noticeable effect on encoding and decoding times. Disabling $L_{1}$ increases the number of bins and changes the significance map statistics, which negatively affects other significance map methods tuned to work along with $L_{1}$. The total bin savings do not directly translate into corresponding BD-rate gains because the significance flag bins eliminated by a zero CSBF have relatively low entropy.

An encoder can choose whether to activate or deactivate the SDH tool. Table IX shows the loss incurred over the different test configurations when SDH is disabled. In these experiments, SDH is applied with the encoder approaches described in Section VII. The first approach is used together with RDOQ (which is a part of the main profile common test conditions), and the second approach is used for the case when RDOQ is disabled. The average gain provided by SDH is 0.8% when used on top of RDOQ. When RDOQ is disabled, transform coefficients tend to have larger magnitude, and therefore SDH is activated more often, and an average gain of 1.4% is observed. Since SDH requires additional computation on the encoder side to determine the right place to hide the sign bit, the encoding time is decreased by an average of 2% when SDH is deactivated. The decoding time is essentially unchanged because there is very little additional decoding complexity due to SDH and the number of bins is reduced slightly. The encoder strategies may be modified to further improve the gains achieved by SDH. For instance, since coefficients are heavily interdependent for entropy coding, an encoder may choose to test several combined coefficient changes in order to obtain the desired parity, instead of changing just one. This type of approach could potentially provide compression gains at the expense of additional encoder complexity.

Transform coefficient coding in HEVC and the H.264/AVC model are compared in terms of bit-rate savings. Table X shows that on average, the HEVC method reduces the BD-rate by approximately 4.5% for AI and 3.5% for RA and LD when compared to the H.264/AVC model. Coding efficiency improvement is greater when more residual data is present, i.e., at high bit-rates and for intra coding. Most of the gain comes from MDCS, multilevel significance map and SDH. Some gain comes from the careful selection of a reduced and meaningful set of context models for the syntax elements. Table XI provides a summary of the number of contexts used for the syntax elements related to coefficient coding in HEVC and H.264/AVC.

In H.264/AVC, applications requiring higher throughput than those achieved by using CABAC could rely on CAVLC entropy coding. However, HEVC supports only one entropy coder based on CABAC. Hence, it is crucial that residual coding can achieve much higher throughput when compared to the H.264/AVC CABAC method [34].

HEVC substantially reduces the average number of coded bins of the H.264/AVC model. Fig. 7 shows the average ratio of HEVC to H.264/AVC model number of bins for the three coding configurations and a wide range of $QP$ values. HEVC codes fewer bins and a higher percentage of them are coded in bypass mode. The average number of bins per pixel for HEVC is shown in Table XII. The highest bin-rate occurs for all-intra at low $QP$. For $QP=0$ (left column), 6.9 bins per pixel are coded. In that scenario, HEVC reduces the average number of regular mode bins up to 2.5 times. The percentages of regular and bypass coded bins in HEVC are shown in Table XIII. The data is split for coefficient-related syntax elements. For AI and $QP=0$, the average percentage of bypass bins is above 60%: most of them are grouped together in the sign and remaining level scan passes.

However, it is fundamental to focus on the worst case when comparing the implementation complexity of different entropy coding methods, as the worst case defines the hardware requirements. In the case of HEVC, the worst case is greatly improved compared to H.264/AVC [35]. In H.264/AVC, more than 15 bins per coefficient may be coded with adaptive context modeling, while remaining bins are coded in bypass mode. Using this approach, all the residual data could use regular bins in the worst case even with typical operational data rates. HEVC minimizes the maximum number of regular coded bins per coefficient and achieves a throughput closer to what would traditionally have been possible using CAVLC schemes. For a 4×4 TB, HEVC requires at most 30 regular coded bins: 6 for last coding, 15 for significance, 8 for the larger than one flag, and 1 for the larger than two flag. Thus, the worst case is 1.875 regular bins/coefficient. The ratio is lower for larger TBs, due to the last significant coefficient coding method. In comparison, H.264/AVC requires 15.875 regular bins/coefficient for a 4×4 TB and 15.969 regular bins/coefficient for an 8×8 TB in the worst case.
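As a quick check of the arithmetic above, the worst-case figure for a 4×4 TB follows directly from the per-element bounds quoted in the text and the 16 coefficients of the block:

$$\frac{6+15+8+1}{16}=\frac{30}{16}=1.875\ \hbox{regular bins/coefficient}.$$

By the same reasoning, the quoted H.264/AVC ratios would correspond to roughly $15.875\times 16=254$ regular bins for a 4×4 TB and $15.969\times 64\approx 1022$ for an 8×8 TB. These totals are back-computed here from the ratios and are not stated in the source.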

The studies in [36], [37], [38], [39] indicate that in practice, the parsing and decoding of the transform coefficients in HEVC, which account for over 50% of the compressed bitstream, takes at most the same amount of time as the motion compensation or deblocking filter stages.

SECTION IX

This paper described in detail the transform coefficient coding in the HEVC DIS specification. Transform coefficient coding in HEVC strives to strike a balance between high coding efficiency and practicality of implementation. It comprises five components: scanning, last significant position coding, significance map coding, coefficient level coding, and sign data coding. Since HEVC and H.264/AVC share the basic arithmetic coding engine, the HEVC transform coefficient coding design has sought to overcome the shortcomings in the H.264/AVC design with respect to its throughput capacity, which became apparent during the implementation phase of that standard. The new design is capable of delivering high compression efficiency and high throughput at the same time, leading to a single entropy coder that can address a wider range of applications.

From the point of view of coding efficiency, existing methods in significance map and level coding were improved while new schemes such as MDCS, multilevel significance map and SDH were introduced, leading to an overall average gain of 3.5% over the H.264/AVC-like transform coefficient coding. To summarize, HEVC transform coefficient coding improves coding efficiency while reducing the average and worst-case complexity.

This paper was recommended by Associate Editor A. Kaup.

J. Sole, R. Joshi, and M. Karczewicz are with Qualcomm, San Diego, CA 92121 USA (e-mail: joels@qti.qualcomm.com; rajanj@qti.qualcomm.com; martak@qti.qualcomm.com).

N. Nguyen and T. Ji are with Research In Motion Ltd., Waterloo, ON N2L 5Z5, Canada (e-mail: nnguyen@rim.com; tiji@rim.com).

G. Clare and F. Henry are with Orange Labs, Issy-les-Moulineaux 92794, France (e-mail: gordon.clare@orange.com; felix.henry@orange.com).

A. Dueñas was with Cavium, San Jose, CA 95131 USA. He is now with NGcodec, San Jose, CA 95126 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
