Overview of the Low Complexity Enhancement Video Coding (LCEVC) Standard

The Low Complexity Enhancement Video Coding (LCEVC) specification is a recent standard approved by the ISO/IEC JTC 1/SC 29/WG04 (MPEG) Video Coding. The main goal of LCEVC is to provide a standalone toolset for the enhancement of any other existing codec. It works on top of other coding schemes, resulting in a multi-layer video coding technology, but unlike existing scalable video codecs, adds enhancement layers completely independent from the base video. The LCEVC technology takes as input the decoded video at lower resolution and adds up to two enhancement sub-layers of residuals encoded with specialized low-complexity coding tools, such as simple temporal prediction, frequency transform, quantization, and entropy encoding. This paper provides an overview of the main features of the LCEVC standard: high compression efficiency, low complexity, minimized requirements of memory and processing power.


I. INTRODUCTION
The Low Complexity Enhancement Video Coding (LCEVC) is a Video Coding standard finalized in November 2021 by the ISO/IEC JTC 1 working group formerly known as SC29/WG11 (MPEG) and currently as SC29/WG04 (MPEG Video Coding). The specification is officially named ISO/IEC IS 23094-2, also identified as MPEG-5 Part 2 [1]-[3].
All these specifications represent three decades of evolution of video coding algorithms with the explicit goal of achieving ever increasing compression while maintaining the same subjective quality for the user. They share a common structure, since all of them use blocks of samples, either of fixed size or adaptive size depending on the picture content, and all of them exploit the spatial and temporal redundancy (intra and inter predictive coding), the prediction residual redundancy (transform coding), and finally the statistical redundancy (entropy coding).
In all the video coding standards listed above, from H.261 in 1988 to VVC in 2021, the bitstream is composed of a single layer, since all pictures and all blocks forming a picture are processed and encoded in a single bitstream. VVC specifies from its first edition the possibility to use the multi-layer approach, but can still be used as a single-layer video codec, as specified in the Main 10 and Main 10 4:4:4 profiles, [19] Annex A.
Starting with MPEG-2/H.262 (1996) and H.263 (1996), multi-layer extensions, also denoted as scalable extensions, to the single-layer video coding algorithms have been developed. However, the most relevant specifications for scalable video coding were developed with AVC/H.264, named Scalable Video Coding (SVC) [21], defined in AVC Annex G, and finalized in 2008, and with HEVC/H.265, named Scalable High Efficiency Video Coding (SHVC) [22], defined in HEVC Annex H, and finalized in 2015. The main feature introduced with scalable video coding, as in SVC and SHVC, is the possibility to partition the video bitstream into several subsets, separating two or more layers in the temporal dimension (with a first layer at lower frame rate, e.g. 30 fps, and a second layer at higher frame rate, e.g. 60 fps), in the spatial dimension (e.g. with a first layer at lower resolution, e.g. 1920 × 1080, and a second layer at higher resolution, e.g. 3840 × 2160), or in the quality dimension (with a first layer at lower quality, and a second layer at higher quality). In all cases, the lower layer is a complete bitstream, sufficient to decode the video sequence at a lower quality, whatever the type of scalability: temporal, spatial, or quality scalability.
LCEVC is not designed for being an alternative to other existing and emerging video coding standards, like AVC, HEVC, EVC, or VVC, but rather for being a standalone toolset for enhancement of any other existing video codec. This is achieved working on top of other coding schemes, encoding the residual differences between a lower quality encoding and the original video. The LCEVC technology typically takes as input the decoded video at lower resolution and adds up to two enhancement sub-layers of residuals encoded with specialized low-complexity coding tools. Thus, LCEVC can be classified as a multi-layer video coding technology, but the main difference with existing scalable ones, like SVC and SHVC, lies in the fact that the added enhancement layers are completely independent from the base video. The most similar approach to LCEVC is represented by the use of an "external base layer" as defined in SHVC, [17] Annex H, where the enhancement layer specified by SHVC can be applied to a base layer specified by AVC, [15]. LCEVC generalizes this approach, allowing any base layer to be enhanced using the same LCEVC technology, completely agnostic of the codec used for the base video bitstream.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
By design, LCEVC is based on a set of tools with low complexity, and it is intended to be efficiently and effectively implemented in software via existing hardware blocks in existing devices, such as Single Instruction, Multiple Data (SIMD) processors and Graphics Processing Units (GPU), for scaling and/or shading. With these design choices, LCEVC achieves a trade off in terms of optimal Rate Distortion (RD) performance, energy saving, and ease of implementation. The numerical results of such trade off, in terms of RD performance, are described in Section IX.
LCEVC has recently gained attention in the industrial and scientific community concerning its characteristic of improvement of the current codecs [23]- [25].
In [24] the authors report a comparison of LCEVC to AVC (in its implementation x264) and HEVC (in its implementation x265), when applied to High Dynamic Range (HDR) video sequences. The paper includes results for bitrate savings for LCEVC compared to the base codec, using the metrics PSNR, MS-SSIM, and VMAF versus Bitrate.
In [25] the authors report a comparison of LCEVC to AVC (in its implementation x264) and HEVC (in its implementation x265), in the context of Live Gaming Video Streaming applications. The paper provides results for bitrate savings for LCEVC with respect to the base codec, in terms of PSNR, VMAF, and MOS versus Bitrate.
In [23] the authors provide an overview of the LCEVC specification and a comparison with AVC, HEVC, and VVC, in terms of PSNR, VMAF, and MOS versus Bitrate. The paper also provides an analysis of the complexity, reporting the encoding times of LCEVC with the base codec at quarter resolution versus the encoding times of the base codec at full resolution. Finally, the paper provides an analysis of the (low) correlation of bitrate savings with temporal complexity and the (high) correlation of bitrate savings with spatial complexity of the video sequences.
This paper is organized as follows. Section II provides a high level description of the LCEVC encoder, decoder, and bitstream structure. Sections III to VII analyze in detail the single processing blocks, namely: Upscaler, Predicted Residual, Temporal Prediction, Transformation, Quantization, and Entropy Coding. Section VIII provides a comparison of the complexity, in terms of processing time, of LCEVC with the base codec at quarter resolution versus the base codec at full resolution. Section IX reports results of LCEVC in terms of Rate Distortion performance. Finally, Section X summarizes the conclusions of the paper.
The following notation is used throughout the paper: bold letters are used for vectors and matrices, | | indicates the magnitude operator, ⌊ ⌋ the floor operator, sgn( ) the sign function, * the convolution, · the matrix product, and max( , ) returns the maximum between the two arguments.

II. OVERVIEW
This section gives a high level description of the building blocks of the LCEVC video coding technology, for the encoder, the decoder, and the bitstream format.
The design of LCEVC foresees up to two sub-layers of enhancement to a base layer compressed video representation. The first layer (sub-layer 1) is optional and can be disabled by proper signaling in the LCEVC bitstream, while the second layer (sub-layer 2) is mandatory. Although the number of layers could be greater than two, the choice of using only one or two enhancement layers is based on empirical studies that show that adding further layers does not improve the overall performance of the multi-layer scheme, comparing the additional complexity introduced to the additional compression achieved.

A. Encoder
The general structure of an LCEVC encoder is depicted in Fig. 1. The encoding process can be divided into three main steps.
Firstly, the input sequence is downscaled using a non-normative downscaler. Depending on the chosen configuration, the downscaling can be applied up to two consecutive times. The video, now at a lower resolution than the input sequence, is fed into the base encoder (e.g., AVC, HEVC, EVC, VVC). This process is not further specified in LCEVC: any encoder that produces a decodable bitstream can be used. As explained in Section II-C, the base bitstream is included in the LCEVC bitstream. Using a normative upscaler, which allows the use of different and content-adaptive kernels as well as the optional encoder-signaled activation of a non-linear corrector called Predicted Residual, the upscaled base reconstruction is used as the input for the second step of the LCEVC encoding process. The enhancement sub-layer 1 (L-1) residuals are created by subtracting the base reconstruction from the downscaled input sequence. These residuals, which are typically sparse (e.g., sharp edges or fine details), are transformed, quantized and entropy encoded, resulting in coefficient groups as discussed in Section II-C.
Some processing blocks specified in LCEVC take as input a small square block of samples, with size of either 2 × 2 or 4 × 4 samples: Temporal Prediction, Transform. The choice of the 2 × 2 or 4 × 4 block size can be done at the Encoder side, and signaled in the bitstream. Other processing blocks are applied directly to the whole matrix of samples or coefficients: Downscaler and Upscaler, Quantization, Entropy Coding.
The transform used in LCEVC has a simple structure and uses a small kernel, of size either 2 × 2 or 4 × 4. This allows both efficient coding of sparse information and parallelization of the transforms, since individual blocks are not dependent on other blocks within a picture.
A linear quantizer, which may include an adaptive deadzone, is used to further process the transform coefficients.
The entropy encoder, which consists of a run-length encoder (RLE) and an optional prefix encoder (Huffman encoder), processes the quantized transform coefficients, and creates the coefficient groups for sub-layer 1.
The inverse processes of the quantization and transform are applied, with adaptive and optionally asymmetric dequantization, creating the sub-layer 1 reconstruction. Additionally, an L-1 filter can be added, which operates as a simple deblocking filter.
Finally, the sub-layer 1 reconstruction is upscaled to full resolution and subtracted from the original input sequence.
The enhancement sub-layer 2 (L-2) residuals created in this way are fed into the temporal prediction algorithm. LCEVC uses a zero-motion vector temporal scheme which operates on a block-by-block basis. The residuals from the previous picture are stored in a temporal buffer and are added to the L-2 residuals in case the temporal prediction module is activated. To reduce the signaling overhead, e.g. in a fast-moving sequence where this zero-motion vector scheme would likely not be beneficial, the temporal prediction can be disabled for a group of samples of size 32 × 32, as well as for an entire picture with a single bit. The temporal signaling, containing the information whether the temporal prediction is active for a specific transform block, is entropy encoded and included as a temporal layer in the LCEVC bitstream.
The sub-layer 2 residuals (after the temporal prediction is applied, when appropriate) are transformed, quantized, and encoded using the same tools as explained for enhancement sub-layer 1. The quantizer can use different quantization parameters for the two enhancement sub-layers L-1 and L-2, which allows the encoder to balance the impact of the two layers and to decide where to add more details.
The configuration of the Encoder and Decoder requires a set of parameters. Such parameters are set at the Encoder side, included in binary format in the Bitstream, and extracted by the Decoder. These parameters are handled by the Encoder Configuration and Decoder Configuration blocks, depicted graphically in Fig. 1 and Fig. 2. Besides the static information, like picture size, picture rate, bits per sample, the configuration includes parameters regarding the decision whether to use both sub-layer 1 and 2 or to use exclusively sub-layer 2, the block size for Temporal Prediction and Transform, the decision whether to use or not the L-1 filter, plus other configurable parameters that assume a default value if not explicitly specified in the Bitstream.

B. Decoder
The functionality of a normative LCEVC decoder is standardized in [1], and its structure is visualized in Fig. 2. As in the case of the encoder, described in the previous Section, three main steps are visible: the base decoding and the corrections in enhancement sub-layers 1 and 2.
The decoding process is highly parallelizable, and amenable to both SIMD- and GPU-accelerated processing. The base video decoder is independent from the LCEVC enhancement part. Additionally, the two enhancement sub-layers can be reconstructed in parallel as well. No inter-block dependencies are present within a picture, making the inverse transform blocks independent from other blocks. If the temporal prediction is active, the enhancement sub-layer 2 reconstruction is dependent on the temporal buffer which stores residuals from the previous picture.
Using the same (or inverse) tools that have been described in Section II-A, the reconstructions from the three main decoding steps are achieved. Those are combined using two normative upscaling filters and additions resulting in the decoded output sequence.
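The combination of the three decoding steps can be illustrated with a minimal sketch. It is not the normative process: it assumes one-dimensional signals, a base reconstruction already at the resolution of sub-layer 1 (so only one upscaling is shown), and a nearest-neighbour stand-in for the normative upscaler. The names `upscale_nn`, `clip16` and `reconstruct` are hypothetical. The final clipping to the 16-bit range follows the description given in Section VI.

```python
def upscale_nn(signal):
    """Stand-in upscaler (assumption): repeat every sample twice."""
    return [v for v in signal for _ in range(2)]


def clip16(value):
    """Clip a value to the signed 16-bit range [-32768, 32767]."""
    return max(-32768, min(32767, value))


def reconstruct(base, l1_residuals, l2_residuals, temporal=None):
    """Combine base reconstruction and the two enhancement sub-layers.

    base         -- decoded base samples (low resolution)
    l1_residuals -- decoded sub-layer 1 residuals (low resolution)
    l2_residuals -- decoded sub-layer 2 residuals (full resolution)
    temporal     -- co-located residuals from the temporal buffer, or None
    """
    # Step 1 and 2: base decoding plus sub-layer 1 correction.
    corrected = [b + r for b, r in zip(base, l1_residuals)]
    # Upscale the corrected picture to full resolution.
    upscaled = upscale_nn(corrected)
    if temporal is None:
        temporal = [0] * len(upscaled)
    # Step 3: add temporal prediction and sub-layer 2 residuals, then clip.
    return [clip16(u + t + r)
            for u, t, r in zip(upscaled, temporal, l2_residuals)]


if __name__ == "__main__":
    print(reconstruct([10, 20], [1, -1], [0, 1, 2, 3]))
```

Because the base samples and the sub-layer 1 residuals only meet in the first addition, the base decoder and the enhancement decoding can run independently until that point, which is the property the text relies on for parallel processing.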

C. Bitstream Structure
An LCEVC encoded bitstream is formed of two separate bitstreams, namely a base bitstream produced by a base encoder conformant to its associated specification (e.g. AVC, HEVC, EVC, VVC) and an enhancement bitstream (the LCEVC bitstream) produced by the enhancement encoder and conformant to the specification ISO/IEC 23094-2.
The LCEVC bitstream is structured in a specific order, depending on the chosen configuration parameters. A simplified structure is visualized in Fig. 3. The LCEVC encoder can be set to enhance all three available planes (luma and chroma) or the luma plane only. Within each encoded plane, up to two layers can be present. The first layer (enhancement sub-layer 1, L-1) is used to encode transformed and quantized residuals before applying the final upscaling. The second layer (enhancement sub-layer 2, L-2) is then added after the final upscaling, that is at the same resolution as the overall output sequence. Each of these two sub-layers is split into coefficient groups. Depending on the chosen transform type, with a kernel size of 2 × 2 or 4 × 4 samples, either 4 or 16 coefficient groups are present within a sub-layer. Each coefficient group contains the corresponding transform coefficients.
Furthermore, temporal signaling is added as an additional coefficient group to enhancement sub-layer 2 if the temporal prediction is active. This group contains information whether residuals from the previous picture, stored in a temporal buffer, are used for prediction on a block-by-block basis.

III. UPSCALER
The first processing step on the output pictures of the base decoder is the upscaling from the lower resolution at which the base decoder operates: for instance the base decoder may operate at 1920 × 1080 resolution while the LCEVC encoder operates at 3840 × 2160 resolution.
The upscaling process described in this section is based on two design choices, adopted to keep the implementation at low complexity, even in software, while at the same time providing a good quality of the upscaled picture.
The first design choice is limiting to 2 the upscaling factor in each dimension of the picture, to reduce the complexity of up-sampling filters.
The second design choice is limiting to 4 the number of taps in the poly-phase implementation of up-sampling, which was found to be a good compromise between computational complexity and picture quality.
Given the impulse response $\mathbf{h}$ of a one-dimensional up-sampling filter, the vector representation of the up-sampling operation can be written as

$$\mathbf{Y} = \left[(\uparrow 2)\,\mathbf{X}\right] * \mathbf{H} \qquad (1)$$

where $\mathbf{X}$ is the picture at the lower resolution, $\mathbf{Y}$ is the up-sampled picture, the $(\uparrow 2)$ operator is a dyadic up-sampling operator, and $\mathbf{H} = \mathbf{h} \cdot \mathbf{h}^T$ is the two-dimensional kernel. To avoid multiplications by the inserted zeros, the poly-phase implementation can be realized with the cascade of mono-dimensional filtering as follows. The first pass is the horizontal up-sampling of the picture by a factor two

$$\mathbf{X}_h = (\uparrow 2)^c_e\,(\mathbf{X} * \mathbf{h}_e^T) + (\uparrow 2)^c_o\,(\mathbf{X} * \mathbf{h}_o^T) \qquad (2)$$

where $\mathbf{h}_e$ and $\mathbf{h}_o$ are the two poly-phase components of $\mathbf{h}$ and $(\uparrow 2)^c_{e,o}$ is a dyadic upsampling operator that inserts zeros at the even or odd positions in columns.
The second pass consists of vertically up-sampling the picture by a factor two,

$$\mathbf{Y} = (\uparrow 2)^r_e\,(\mathbf{X}_h * \mathbf{h}_e) + (\uparrow 2)^r_o\,(\mathbf{X}_h * \mathbf{h}_o) \qquad (3)$$

where $(\uparrow 2)^r_{e,o}$ is a dyadic upsampling operator that inserts zeros at the even or odd positions in rows. To apply the convolution operation at the picture boundaries, the picture matrix is extended outside the boundaries by replicating the last available sample. Fig. 4 shows the values of one of the four poly-phase bidimensional kernels, resulting from the successive application of (2) and (3), $\mathbf{H}_{ee} = \mathbf{h}_e \cdot \mathbf{h}_e^T$, for each of the four normative filters. The other kernels have the same values, rotated by multiples of π/2. Note that in the final picture each kernel affects only one of the four up-scaled samples. These interpolation filters should limit the effect of the spectral replicas of the input, which enter the frequency band of the signal as a consequence of the up-sampling, so they are also known as anti-imaging filters.
For the upscaling operation, LCEVC defines a set of four normative filters, with a fixed scaling factor of 2 in the horizontal and vertical directions, named: "Nearest Neighbor", "Linear", "Cubic", "Modified Cubic". Their respective one-dimensional taps, on a scale of 16384, are shown in Table I. Fig. 5 shows the magnitude of the H filter, as defined in (1), resulting from the poly-phase implementation of the 4 × 4 filter shown in Fig. 4. All have a low-pass behaviour, as required of an interpolation filter, with greater or lesser effectiveness in filtering the spectrum images.
Besides the four predefined filters specified by the standard and described above, LCEVC provides a mechanism for the user to define a custom filter, with the only limitations that it shall be represented by four taps and implemented as a separable filter, like the predefined filters. When the custom filter is selected, the values of the four taps are encoded in the bitstream using the same scale of 16384.
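The separable, four-tap, factor-two structure described in this section can be sketched as follows. This is only an illustration under stated assumptions: the tap values below are not those of Table I (the nearest-neighbour pair follows from the filter name, the linear-like pair is invented for the example), the alignment of the four-sample window and the rounding are assumptions, and the function names are hypothetical. Only the scale of 16384, the two passes and the replication of boundary samples are taken from the text.

```python
SCALE = 16384  # taps are expressed on this scale, as stated in the text

# Illustrative tap pairs (first output phase, second output phase).
NEAREST_TAPS = ([0, SCALE, 0, 0], [0, SCALE, 0, 0])
LINEAR_LIKE_TAPS = ([4096, 12288, 0, 0], [0, 12288, 4096, 0])  # assumption


def upsample_1d(row, taps_first, taps_second):
    """Up-sample a 1-D list by two using two 4-tap poly-phase filters."""
    n = len(row)
    out = []
    for i in range(n):
        # 4-sample window around position i; samples outside the picture
        # are obtained by replicating the last available sample.
        window = [row[min(max(i + d, 0), n - 1)] for d in (-1, 0, 1, 2)]
        for taps in (taps_first, taps_second):
            acc = sum(t * s for t, s in zip(taps, window))
            out.append((acc + SCALE // 2) // SCALE)  # rounding is assumed
    return out


def upsample_2d(picture, taps_first, taps_second):
    """Horizontal pass on every row, then vertical pass on every column."""
    horizontal = [upsample_1d(r, taps_first, taps_second) for r in picture]
    columns = [upsample_1d(list(c), taps_first, taps_second)
               for c in zip(*horizontal)]
    return [list(r) for r in zip(*columns)]


if __name__ == "__main__":
    for line in upsample_2d([[1, 2], [3, 4]], *NEAREST_TAPS):
        print(line)
```

Each output sample costs at most four multiplications per pass, which is the motivation given above for limiting both the scaling factor and the number of taps.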

A. Predicted Residual
The processing block called Predicted Residual consists of an adjustment of the sample values after up-scaling from the lower resolution to the higher resolution, typically by a factor 2 in the horizontal and vertical direction.
The adjustment is performed by comparing the values of the base resolution picture samples B with the corresponding full resolution picture samples F. The process consists in computing the difference between each base resolution sample and the average of the four corresponding full resolution samples, and adding this modifier to the four full resolution samples.
In the matrix formulation of this process, $\mathbf{J}_4$ is a 4 × 4 matrix of ones and $\mathbf{H}_{NN}$ is the Nearest Neighbor 8 × 8 filter.
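The sample-wise description of the Predicted Residual adjustment can be sketched as below. This is a minimal illustration, not the normative matrix formulation: the function name `predicted_residual` is hypothetical, pictures are plain lists of rows, and the rounded integer average is an assumption.

```python
def predicted_residual(base, full):
    """Adjust each 2x2 block of `full` so that its average matches `base`.

    base -- low-resolution picture, H x W list of rows
    full -- up-scaled picture, 2H x 2W list of rows
    Returns a new full-resolution picture; the inputs are not modified.
    """
    adjusted = [list(row) for row in full]
    for y, base_row in enumerate(base):
        for x, b in enumerate(base_row):
            block = [(2 * y + dy, 2 * x + dx)
                     for dy in (0, 1) for dx in (0, 1)]
            total = sum(full[r][c] for r, c in block)
            average = (total + 2) >> 2      # rounded mean (assumption)
            modifier = b - average          # difference to the base sample
            for r, c in block:
                adjusted[r][c] += modifier  # same modifier for all four
    return adjusted


if __name__ == "__main__":
    print(predicted_residual([[12]], [[8, 9], [10, 13]]))
```

After the adjustment, the average of each group of four full resolution samples equals the co-located base resolution sample, up to the rounding of the mean.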

IV. TEMPORAL PREDICTION
The Temporal Prediction algorithm is extremely simple. The reconstructed residuals from a picture are stored in a temporal buffer. Based on a cost criterion implemented by the encoder, for each block of size 2×2 or 4×4, the block can be predicted from the temporal buffer or encoded without prediction.
The prediction is done for each block from its corresponding spatial position in the temporal buffer, so no Motion Estimation is performed and no Motion Vectors are encoded in the bitstream. Only one reconstructed picture is stored in the temporal buffer, so the prediction is always performed as a forward prediction from the previous picture in display order. However, since the base reconstructed picture is needed to decode the corresponding LCEVC enhancement, the bitstream order of the LCEVC pictures is the same as the bitstream order of the associated Base pictures.
Figs. 6 and 7 show the display order and the decoding order for the Base bitstream (e.g. AVC) and the LCEVC Enhancement bitstream, respectively. The upper part shows the display order with arrows representing the temporal prediction references, while the lower part shows the decoding order.
The Temporal Prediction information is then encoded to and decoded from the bitstream in a similar way to the transform coefficients, as described in Section VII, with the only exception that the information for each block is binary, to indicate whether Temporal Prediction is applied or not for that block.
The choice to use a very simple algorithm for Temporal Prediction, without any Motion Estimation and Motion Vectors, is based on the consideration that the input to the LCEVC encoding process consists of residuals, i.e. the coding errors between the upscaled base encoding and the full resolution original. For this reason, the input residuals to LCEVC benefit implicitly from the Motion Estimation performed by the Base codec, that removes the temporal correlation between the input pictures. Thus, the Temporal Prediction in LCEVC is only used when the residuals show a strong similarity to the co-located residuals of the preceding picture in display order. This is the case, for example, of a static part of the picture with high frequency details that benefits from the enhancement performed by LCEVC, like graphical overlay on natural video.
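The block-wise decision can be sketched as follows. The cost criterion is left to the encoder by the standard, so the sum of absolute values used here is only an assumed example, and the names `block_cost`, `encode_block` and `decode_block` are hypothetical. Blocks are given as flat lists of 4 or 16 residuals, and the buffered block is the co-located one, since no motion vectors are used.

```python
def block_cost(block):
    """Assumed cost criterion: sum of absolute residual values."""
    return sum(abs(v) for v in block)


def encode_block(residual, buffered):
    """Choose between temporal prediction and no prediction for one block.

    Returns (use_temporal, data), where `data` is what would be passed on
    to the transform: the prediction difference or the residual itself.
    """
    delta = [r - b for r, b in zip(residual, buffered)]
    if block_cost(delta) < block_cost(residual):
        return True, delta
    return False, list(residual)


def decode_block(use_temporal, data, buffered):
    """Invert encode_block using the binary per-block flag."""
    if use_temporal:
        return [d + b for d, b in zip(data, buffered)]
    return list(data)


if __name__ == "__main__":
    flag, data = encode_block([5, -3, 4, 0], [5, -3, 4, 1])
    print(flag, data, decode_block(flag, data, [5, -3, 4, 1]))
```

In a static, detailed region the difference block is almost empty and the flag is set, while in a changing region the residual is cheaper on its own and the flag is cleared.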

V. TRANSFORMATION

A. Direct Transformation
The Direct Transformation computes a 2 × 2 or 4 × 4 matrix of frequency coefficients from a 2 × 2 or 4 × 4 matrix of original residuals, i.e. samples in the picture domain. To define it as a matrix product, we need to transform the residual matrix R into a column vector and the coefficient matrix C in the same way:

$$\mathbf{r} = [r(0), r(1), \ldots, r(N^2-1)]^T, \qquad \mathbf{c} = [c(0), c(1), \ldots, c(N^2-1)]^T$$

where N is the dimension of the square matrix of residuals. Now we can write

$$\mathbf{c} = \mathbf{DT} \cdot \mathbf{r}$$

where $\mathbf{DT}$ is a $2^N \times 2^N$ matrix of ±1 (that is, 4 × 4 for N = 2 and 16 × 16 for N = 4), represented in Fig. 8.

B. Inverse Transformation
In the same way, the Inverse Transformation, computing a 2 × 2 or 4 × 4 matrix of reconstructed residuals from a 2 × 2 or 4 × 4 matrix of coefficients, is defined as a matrix product:

$$\hat{\mathbf{r}} = \mathbf{IT} \cdot \hat{\mathbf{c}}$$

where $\hat{\mathbf{r}}$ and $\hat{\mathbf{c}}$ are the column vectors of reconstructed residuals and of coefficients. The relationship between the Direct Transformation matrix ($\mathbf{DT}$) and the Inverse Transformation matrix ($\mathbf{IT}$) is that the latter is the inverse matrix of the former, except for a normalization factor:

$$\mathbf{IT} \propto \mathbf{DT}^{-1}$$

One notable property of the Direct Transform and Inverse Transform matrices is that, apart from the normalization factor, they are orthogonal, so

$$\mathbf{DT}^{-1} \propto \mathbf{DT}^{T}$$

The $\mathbf{IT}$ matrix is represented in Fig. 9. The physical interpretation of the Inverse Transformation matrix is that each column vector represents a 4 × 4 component of the residuals triggered by a single $c_q(k)$ coefficient.
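The orthogonality property can be checked numerically on a matrix of ±1 with the stated dimensions. The sketch below builds such a matrix as a Kronecker product of the 2 × 2 matrix [[1, 1], [1, −1]]; this construction is an assumption made for illustration, and the ordering and signs of the rows of the matrices in Fig. 8 and Fig. 9 may differ. All function names are hypothetical.

```python
H2 = [[1, 1], [1, -1]]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Assumed stand-ins for the 2x2 and 4x4 block transforms.
DT_2X2 = kron(H2, H2)           # 4 x 4 matrix of +/-1
DT_4X4 = kron(DT_2X2, DT_2X2)   # 16 x 16 matrix of +/-1


def direct_transform(dt, residuals):
    """c = DT . r on the vectorized block of residuals."""
    return matvec(dt, residuals)


def inverse_transform(dt, coefficients):
    """r = DT^T . c / M, using DT . DT^T = M . I for an M x M matrix."""
    size = len(dt)
    return [v // size for v in matvec(transpose(dt), coefficients)]


if __name__ == "__main__":
    coeffs = direct_transform(DT_2X2, [1, 2, 3, 4])
    print(coeffs, inverse_transform(DT_2X2, coeffs))
```

Because every entry is ±1, both directions need only additions and subtractions, plus a final division by a power of two that can be implemented as a shift.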

VI. QUANTIZATION
In the same way as other block based video coding standards, the main processing block of LCEVC that allows a precise control of the bitrate produced by the encoder is the Quantization block, with the specification of the quantizer stepwidth (SW), in a similar way to the "quantization parameter" (QP) defined in other MPEG standards. The modulation of the quantizer stepwidth allows a coarser quantization with higher values of SW resulting in a lower bitrate production, or a finer quantization with lower values of SW resulting in a higher bitrate. The following subsections describe in detail the algorithms for Direct Quantization and Inverse Quantization, and the formulae for deriving the different parameters used in the algorithms, with derivation from a single value of SW signaled in the bitstream, which is denoted as Original Stepwidth.
Concerning the number of bits used for the representation of samples, residuals, and transform coefficients, LCEVC operates as follows. It should be noted that, starting from the upscaled base picture, and adding Temporal Prediction and the reconstructed residuals after Inverse Transformation and Inverse Quantization, it is possible that the final result exceeds the 16-bit range [−32768, 32767]. This is prevented by clipping the sum of the three components to the 16-bit range.

A. Direct Quantization
The algorithm for Direct Quantization can be described by the following formulae, in which $c(k)$ is the transform coefficient at position $k$, $c_q(k)$ is the quantized coefficient, DSW is the stepwidth used as denominator in the direct quantization, DZ is the dead zone parameter, and OSW is the Original Stepwidth signaled in the bitstream. A quantizer of the form $c_q(k) = \mathrm{sgn}(c(k)) \lfloor |c(k)| / DSW \rfloor$ already has a dead zone around zero, since the quantization is obtained by rounding toward zero.
To increase the size of the dead zone around zero, an additional parameter DZ, which takes negative values, is added to the absolute value of the input, checking that the resulting value is non-negative:

$$c_q(k) = \mathrm{sgn}(c(k)) \left\lfloor \frac{\max(|c(k)| + DZ,\ 0)}{DSW} \right\rfloor \qquad (13)$$

with output values clipped in the range [−8192, 8191], that is in 14 bits. The DSW value is derived from the OSW value signaled in the bitstream in such a way that the actual denominator in the direct quantization formula has a quadratic relationship with OSW. The DZ value is computed as in (15), with A = 39, B = 126484. Thus the DZ used in the direct quantization formula, which reduces the number of small transform coefficients to be encoded in the bitstream, also has a quadratic relationship with the OSW value signaled in the bitstream, with negative values.
As an example, with the stepwidth signaled in the bitstream, i.e. the Original Stepwidth, assuming values of 1024, 2048, 3072, 4096, and using a fixed value for the quantization matrix coefficient (QMC = 32), the direct quantization formula parameters are reported in Table II.

B. Inverse Quantization
The inverse quantization formula is symmetric to the direct quantization one, taking into account that a Dead Zone value is subtracted from the original coefficient, and an Inverse Quantization Offset value is added to the dequantized coefficient. The offset is derived depending on the value of the DequantOffsetModeFlag (DOMF) signaled in the bitstream. The actual value of the offset used in the inverse quantization formula depends on the offset value signaled in the bitstream, on DSW and on the logarithm of OSW and DSW.
As an example, with signaled stepwidth values of 1024, 2048, 3072, 4096, using a fixed value for the signaled offset (OO = 16), the inverse quantization formula coefficients are reported in Table III.
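The direct quantization formula (13) can be exercised with a short sketch. The derivations of DSW and DZ from the Original Stepwidth are not reproduced here, so both are simply passed in as parameters with arbitrary illustrative values; the function names are hypothetical.

```python
def sgn(x):
    """Sign function: -1, 0 or +1."""
    return (x > 0) - (x < 0)


def direct_quantize(c, dsw, dz):
    """Quantize one transform coefficient following formula (13).

    c   -- transform coefficient
    dsw -- stepwidth used as denominator (illustrative value, assumption)
    dz  -- dead zone parameter, zero or negative (illustrative value)
    """
    # Widen the dead zone, never letting the magnitude become negative.
    magnitude = max(abs(c) + dz, 0) // dsw   # floor of a non-negative ratio
    # Restore the sign and clip to the 14-bit range stated in the text.
    return max(-8192, min(8191, sgn(c) * magnitude))


if __name__ == "__main__":
    for coefficient in (100, -100, 20):
        print(coefficient,
              direct_quantize(coefficient, 32, 0),
              direct_quantize(coefficient, 32, -30))
```

With a zero DZ the coefficient 100 is quantized to 3 for a stepwidth of 32, while a DZ of −30 lowers it to 2 and maps the small coefficient 20 to zero, which is the intended reduction of small coefficients.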

VII. ENTROPY CODING
Prefix Coding is used to associate a variable length codeword (VLC) to each symbol of the transformed and quantized coefficients. The length of the codewords associated to all symbols in the codebook is assigned using a Huffman tree, built with the canonical Huffman algorithm. Then the binary codewords are computed with a deterministic algorithm, replicable at the encoder and decoder.
The following steps are iterated, until a single parent (root) node for the tree is created:
1) Order the nodes in a list, in order of increasing frequency.
2) Take the two nodes with lowest frequency and create a parent node that has the two nodes as children.
3) Assign the sum of frequencies of the children to the parent, add the parent to the nodes list and remove the children from the list.
Once the Huffman tree is completed, the number of branches needed to reach each leaf node, with each leaf corresponding to a symbol, determines the number of bits for the respective codeword, that is the code length for that symbol.
Once the code lengths are assigned, the symbols are reordered in descending code length order and, when the length is the same, in ascending lexicographic order. Finally, binary codewords are assigned to the symbols, starting from the highest code length and from the lowest symbol, and starting from the all-zero codeword, in ascending order. In the specific case with the six symbols depicted in Fig. 11, the code lengths and values assigned to each symbol are reported in Table IV.
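The two stages, code length derivation and deterministic codeword assignment, can be sketched as follows. This is one possible reading of the procedure and not the normative text: the tie-breaking between nodes of equal frequency, the rule used to step from one code length to the next shorter one, and the example frequencies (which are not those of Fig. 11) are all assumptions, and at least two symbols are assumed. The function names are hypothetical.

```python
import heapq
from itertools import count


def huffman_code_lengths(frequencies):
    """Return {symbol: code length} by repeatedly merging the two
    least frequent nodes, as in the three steps listed in the text."""
    order = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(f, next(order), [s]) for s, f in sorted(frequencies.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in frequencies}
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:
            lengths[s] += 1  # every leaf below the new parent gets deeper
        heapq.heappush(heap, (f1 + f2, next(order), group1 + group2))
    return lengths


def canonical_codewords(lengths):
    """Assign codewords from the lengths only: longest codes first,
    ties in ascending symbol order, starting from the all-zero codeword."""
    ordered = sorted(lengths, key=lambda s: (-lengths[s], s))
    codes = {}
    code = 0
    previous = lengths[ordered[0]]
    for index, symbol in enumerate(ordered):
        current = lengths[symbol]
        if index > 0:
            # Next codeword in ascending order, shortened when needed.
            code = (code + 1) >> (previous - current)
        codes[symbol] = format(code, "0{}b".format(current))
        previous = current
    return codes


if __name__ == "__main__":
    freqs = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
    print(canonical_codewords(huffman_code_lengths(freqs)))
```

Since `canonical_codewords` depends only on the symbols and their lengths, an encoder and a decoder that share those lengths rebuild identical codebooks without exchanging the codewords themselves.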
The main innovative aspect of the entropy coding algorithm specified in LCEVC is the computation of an adaptive and optimized codebook for each set of symbols to be encoded. Both the encoder and the decoder implement the same deterministic algorithm to generate the codewords associated to the symbols from the frequency of the same symbols. Thus it is sufficient to transmit the symbol value and the corresponding codeword length for each set of symbols to be encoded and decoded.
Comparing LCEVC to the recent MPEG standards, LCEVC uses an adaptive VLC solution for entropy coding, as opposed to the Binary Arithmetic Coding (BAC) of the other specifications. This design choice is motivated by the lower computational complexity of VLC with respect to BAC, even though the compression performance is sub-optimal.

VIII. PROCESSING PERFORMANCE
The performance of the software implementation of LCEVC was compared to the performance of four different video codecs, namely HEVC, EVC, and VVC (specified by MPEG/VCEG), and AV1 (specified by AOM) [26]. For each video codec, the relevant Reference Software has been used to perform the encoding sessions, selecting the following software versions:
• HEVC: HM version 16.08 [27]
• EVC: TM version 01.00
• VVC: TM version 04.02 [28]
• AV1: release 01.00 [29]
The set of test sequences used to run the encoding sessions was limited to four video sequences from the MPEG test set. The encoding sessions have all been run with a fixed QP throughout the complete sequence of either 500 frames for the 50 fps sequences or 600 frames for the 60 fps sequences. To get comparable results, for each sequence the eight values of QP giving the best approximation of four target bitrates have been retained. The selected target bitrates are: sequences
Collecting the test data from the encoding sessions on the four test sequences with the four base codecs, the average execution time for each set of sequences was estimated, at the same resolution and the same frame rate. All the encoding sessions were performed on the same machine, a Windows Server 2012 with the following characteristics:
• CPU: Intel Xeon E5-2690 at 2.90 GHz
• RAM: 32.0 GB
• OS: Windows Server 2012 R2 64 bit
From the average ratios of processing times for the three most recent codecs (EVC, VVC, AV1) with respect to HEVC, it is possible to estimate the complexity of the software implementation of the codecs:
• EVC has a complexity about 5 times higher than HEVC;
• VVC has a complexity about 10 times higher than HEVC;
• AV1 has a complexity about 17 times higher than HEVC (in the single pass configuration).
Concerning the processing time of LCEVC, the average times for the sequences of 500 frames and 600 frames are:
• for a full resolution of 1920 × 1080 around 70 s
• for a full resolution of 3840 × 2160 around 140 s
In all the above cases, including the Test Model for HEVC, the weight of the LCEVC processing is below 1% of the processing time of the base codec, so essentially negligible when computing the overall complexity. In summary, the savings in processing time with the configurations used for Base resolution plus LCEVC encoding compared to Full resolution encoding are:
• for HEVC Base with LCEVC compared to HEVC Full resolution, a factor 3.6;
• for EVC Base with LCEVC compared to EVC Full resolution, a factor 2.6;
• for VVC Base with LCEVC compared to VVC Full resolution, a factor 3.6;
• for AV1 Base with LCEVC compared to AV1 Full resolution, a factor 3.1.

IX. RATE DISTORTION PERFORMANCE
The goal of a video compression algorithm is the minimization of the bitrate required to achieve a given video quality, or, stated in a symmetrical way, the maximization of the video quality achieved for a given bitrate. To verify quantitatively the requirements on bitrate minimization or quality maximization, the approach adopted in the standardization process of ISO/MPEG and ITU-T/VCEG is the execution of a test campaign, aimed at determining the objective and subjective video quality, using well defined metrics.
For objective quality, the metrics adopted in the LCEVC verification tests are PSNR (Peak Signal to Noise Ratio) [30] and VMAF (Video Multi-method Assessment Fusion) [31]. For subjective quality, the metric adopted is MOS (Mean Opinion Score) [32].
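As a reminder of how PSNR is commonly computed, the following minimal sketch evaluates it from the mean squared error between reference and decoded samples. It assumes 8-bit samples held in flat lists; the function name `psnr` and the sample values are illustrative and are not taken from the verification test software.

```python
import math


def psnr(reference, distorted, bit_depth=8):
    """PSNR in dB between two equally sized sequences of sample values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    peak = (1 << bit_depth) - 1  # 255 for 8-bit samples
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


# Illustrative luma samples (hypothetical values).
reference_samples = [100, 120, 140, 160]
decoded_samples = [101, 119, 142, 158]
quality_db = psnr(reference_samples, decoded_samples)
print(f"PSNR: {quality_db:.2f} dB")
```

In practice the metric is computed per frame and per colour plane and then averaged over the sequence; the averaging convention used in the tests is not described here.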
The formal objective and subjective assessment of the LCEVC standard was completed and approved in April 2021 at the 134th MPEG meeting. The official report on the LCEVC verification tests is published on the MPEG website as WG04 document N0076 [33].
LCEVC has been compared to four single-layer video coding technologies developed by MPEG and VCEG, specifically AVC/H.264, HEVC/H.265, EVC and VVC/H.266. While the in-depth analysis of the objective and subjective results of the MPEG Verification Tests is beyond the scope of the paper, a summary of the operational points used for the tests and the conclusions are reported here.
For each of the four video codecs, 6 sequences were encoded at full resolution with the single layer codec and at quarter resolution with the base codec adding the LCEVC enhancement: 4 sequences at 3840 × 2160 (UHD) resolution and 2 sequences at 1920 × 1080 (HD) resolution.
For each video sequence, the test points were selected to span, as far as possible, the same range of objective quality, with particular care for the range of VMAF values, which give a better estimate of the range of subjective MOS quality.
Tables V and VI report, for the test with AVC and VVC respectively, the minimum and maximum values for the essential parameters: bit rate, PSNR, VMAF, MOS. Additionally, they report the minimum and maximum percentage of bit rate allocated to the LCEVC Enhancement bitstream with respect to the Base bitstream.
Comparing the full-resolution LCEVC-enhanced encoded sequences with the full-resolution single-layer encoded sequences, in the same range of subjective quality, the reported average bit rate savings are:
• 46% for UHD and 28% for HD for LCEVC enhancing AVC;
• 31% for UHD and 24% for HD for LCEVC enhancing HEVC;
• an overall benefit for LCEVC enhancing EVC and VVC.
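A bit rate saving of this kind is a relative difference between the single-layer bit rate and the LCEVC-enhanced bit rate at matched quality. The sketch below shows that arithmetic only; the pairing of test points by quality, the function names, and all bit rate values are assumptions for illustration, and the procedure actually used in [33] may differ.

```python
from statistics import mean


def bitrate_saving_percent(anchor_kbps, enhanced_kbps):
    """Relative saving of the enhanced bit rate against the anchor, in percent."""
    if anchor_kbps <= 0 or enhanced_kbps < 0:
        raise ValueError("bit rates must be positive")
    return 100.0 * (anchor_kbps - enhanced_kbps) / anchor_kbps


def average_saving_percent(matched_points):
    """Mean saving over (anchor_kbps, enhanced_kbps) pairs of matched quality."""
    return mean(bitrate_saving_percent(a, e) for a, e in matched_points)


# Hypothetical pairs of bit rates judged to have the same subjective quality.
matched_points = [(10000.0, 5400.0), (20000.0, 11000.0), (8000.0, 4400.0)]
avg_saving = average_saving_percent(matched_points)
print(f"average bit rate saving: {avg_saving:.1f}%")
```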

X. CONCLUSION
The LCEVC standard was developed in the context of ISO/IEC MPEG Video Coding, with the explicit goal of providing an enhancement scheme for any existing, or even future, single-layer video coding algorithm. It is a multi-layer scheme, but differs substantially from other scalable schemes defined by MPEG, such as SVC and SHVC, since it is designed to work in the samples and residuals domain, without any dependency on the base codec algorithm.
The two main goals of the LCEVC development were the high efficiency in terms of Rate Distortion performance, and the low complexity in terms of processing power and memory requirements. These goals have been verified through the test campaign performed following the established MPEG/VCEG approach.
The verification tests performed at the end of the standardization process confirm that bitrate savings can be achieved when using a single-layer base codec at quarter resolution in conjunction with LCEVC at full resolution, when comparing to the same base coding technology at full resolution. The subjective test results reported in [33] indicate bitrate savings of approximately 40% over AVC, 30% over HEVC, 15% over EVC and VVC, at the operational points used for these tests.
Stefano Battista received the Laurea degree in electronic engineering from the Università Politecnica delle Marche, Italy, in 1990. From 1991 to 1997, he was a Researcher with the Telecom Italia Laboratory, Video and Multimedia Group, Turin, Italy. His main activities have been on video coding, video analysis, 3D modeling, and multimedia systems, with a focus on standardization for multimedia communications. From 1997 to 1999, he was a System Engineer with the Advanced Systems Technology Group, STMicroelectronics, Agrate, Milan, Italy. His main activities were in the development of new architectures for multimedia consumer products, specifically on DVB/DVD platforms and their evolution toward interactive TV. From 1999 to 2018, he was a Researcher and a System Engineer with bSoft, Macerata, Italy, where he has been active in research and development. He has published several research papers on video coding and multimedia systems and contributed to books related to the fields of communications and multimedia. He actively participated in the standardization effort of MPEG, in particular to MPEG-4 reference software development, co-authoring part five of the standard. For this contribution to the standardization activity, he received an ISO Certificate of Appreciation.
degree. He is currently a Co-Founder and the CEO of V-Nova, a leading company in data compression and artificial intelligence.
He is also a keen innovator, an entrepreneur, and an investor, with relevant business building experience and half a dozen exits. Having retained in-depth scientific and engineering expertise, he contributed to the foundational development work for the V-Nova Intellectual Property portfolio and is an inventor or a joint inventor of several essential aspects of the coding standards MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC) and SMPTE VC-6 ST-2117, with over 200 patents filed. Former Senior Partner at McKinsey, where he was the Head of the Organization and Operations Practices of the Mediterranean Complex, he also has a breadth of business experience and access to senior executives in a variety of industries and geographies, with well-established experience in telecoms, technology, healthcare, insurance, banking, automotive, aerospace, and defence. He led transformational projects on all continents, and he was instrumental in setting up some of McKinsey's own innovation-related business building activities.
Simone Ferrara received the Graduate degree in telecommunications engineering from the Politecnico di Milano and the M.Sc. degree in electrical engineering from Washington University in St. Louis, USA. He is currently a Senior Vice President at V-Nova, where he is responsible for the technology, IP and standardization strategy, including driving its execution across the company. He has been leading the development of MPEG-5 Part 2 Low Complexity Enhancement Video Coding and has contributed to many of the coding tools included in the standard. Prior to V-Nova, he worked in the telecommunications industry first as a Researcher and then as an IP Expert.