A Frame-Level Constant Bit-Rate Control Using Recursive Bayesian Estimation for Versatile Video Coding

In this paper, we present a frame-level constant bit-rate (CBR) control method using recursive Bayesian estimation (RBE) for Versatile Video Coding (VVC). An R-λ model for rate control (RC) handles the total texture and non-texture bits at once and has worked reasonably well in High Efficiency Video Coding (HEVC). Nevertheless, if the rate estimation is inaccurate, that is, if the R and λ values for a current frame cannot be linearly modeled from their respective values in the previous frames, the resulting RC performance is degraded. In our work, we adopt the RBE, which alternates prediction and update steps, not only to precisely estimate the rates but also to allocate target bits based on the changes in the distortions of the previously coded frames, thus considering rates and distortions simultaneously. Therefore, an elaborate RC can be performed even under fluctuating frame complexities. Experimental results show that our RC method outperforms the RC of the VVC Test Model (VTM-5.0) in terms of normalized root mean square error (NRMSE) and maintains higher visual quality consistency in terms of the standard deviation of PSNR: maximum (average) NRMSE improvements of 34.95% (12.35%) and σ_PSNR improvements of 33.07% (22.34%) for All Intra (AI), 44.82% (27.29%) and 22.54% (9.50%) for Low Delay (LD), and 47.35% (39.94%) and 30.35% (18.54%) for Random Access (RA), respectively, compared to the default RC method of the original VTM-5.0.


I. INTRODUCTION
Recently, the Joint Video Experts Team (JVET) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has been developing the Versatile Video Coding (VVC) standard [5] beyond its predecessor, the High Efficiency Video Coding (HEVC) standard [6]. Various novel video coding technologies such as the Coding Tree Unit (CTU) structure, intra/inter prediction, transforms, in-loop filtering, entropy coding, etc. [7] are devised and tested on the VVC Test Model (VTM) platform [1].
An R-Q (rate-quantization) model such as the Laplacian mixture model (LMM) [8]-[11] has increased the rate estimation performance in HEVC [6]. However, our previous work [12] demonstrated that the LMM is less effective for R-D estimation when applied to the VVC Test Model (VTM-5.0) [1] compared to the HM [13]. This is because the residues in VVC are obtained from various-sized CUs at deeper depths (maximum 9-level depth), which makes modeling them more complicated than in HEVC (maximum 4-level depth). Although the R-λ model [2]-[4] adopted in the HM worked reasonably well and is less complex than the LMM, its R-D estimation performance is degraded if the R and λ values for a current frame cannot be linearly modeled from their respective values in the previous frames. Also, since VVC has more flexible coding structures, some predefined model parameters of the R-λ model used in HEVC might no longer be effective for VVC [14]. In our previous work [12], we showed that recursive nonlinear estimation of the probability density function (pdf) of particles (rates) via the Bayesian theorem and a sequential importance resampling (SIR) algorithm was effective in enhancing the R-D estimation performance. In this work, we present an R-λ model-based RC that relies on our previous stochastic framework for rate estimation. As a result, a more precise RC is obtained, which yields robust rate estimation and less-fluctuating visual quality over frames. The contributions of our work are summarized as follows: (i) We utilize a reliable and robust rate estimation method based on a recursive Bayesian estimation (RBE) scheme [12] to stochastically estimate the rates of the next frames to be encoded.
The RBE-based rate estimation not only utilizes the real rates of previously encoded frames but also considers their distortions, so that a more elaborate RC can be performed even with fluctuating frame complexities. (ii) Our RC method is comprehensively applied to the All Intra (AI), Low Delay (LD), and Random Access (RA) configurations, and shows the effectiveness of rate control by reducing PSNR fluctuations and by utilizing the RBE-estimated rates for bit allocation, which can effectively replace the default rate estimation of the R-λ model in the RC of the original VTM-5.0 [1]. The remainder of the paper is organized as follows: In Section II, we address related work on R-D models for RC and bit allocation (BA) used in previous video coding standards; in Section III, our proposed method is described in detail; in Section IV, experimental results are presented; and Section V concludes our work.
II. RELATED WORKS
A. R-D MODELS FOR RATE CONTROL
For H.264/MPEG-4 AVC [17], Jing et al. considered the average gradient per pixel of a frame to enhance the prediction accuracy of the quantization parameter (QP) to be applied for encoding [18]. Yan et al. utilized distortions by taking image complexity into account for better intra-frame rate estimation [19]. Chang et al. proposed joint RC for a hybrid coder using gradient-based R-Q and D-Q models [20]. For HEVC [6], Karczewicz et al. took the sum of absolute transformed differences (SATD) as a complexity measure for the R-λ model [21]. Wang et al. applied gradient terms for scene complexity to determine a new R-λ model that increases the rate estimation performance of RC [22]. Gao et al. improved the R-D performance through optimized CTU-level BA using a structural similarity (SSIM)-based game theory approach [23]. Although effective, it entails high computational complexity. In order to reduce both bit-rate and visual quality fluctuations, many methods have been studied that aim at maintaining visual quality consistency over frames [10], [24]-[26]. Recently, for VVC [5], a new quality dependency factor was derived according to the temporal layer for rate control [53]-[55]. In addition, a quadratic R-D model was proposed especially for intra-frame rate control [56].

B. BIT ALLOCATION (BA) OPTIMIZATION
In order to improve the R-D performance of RC, various bit allocation optimization schemes have been studied [4], [23], [27]-[30]. In particular, Li et al. formulated an optimization problem to minimize the average distortion (MINAVE) [4]. By solving the optimization problem with a quality dependency constraint, they theoretically showed that different λ values of the R-D cost function should be considered for every frame. Also, for the temporal levels of the LD and RA configurations, the BA process was conducted with weighted λ values to reduce the computational complexity [31]. The aforementioned BA process was implemented in the HM [13].
Chen et al. utilized a bisection algorithm to explore an optimal λ value for CTU-level RC and BA [28]. Li et al. proposed an algorithm to obtain a λ value for CTU-level RC and BA with a closed-form equation via Taylor expansion [27]. Also, Guo et al. extended the algorithm to frame-level RC and BA [29]. Fiengo et al. utilized a forward-backward primal-dual (FBPD) algorithm to solve the optimization problem of their recursive R-D model [32]. In addition, to boost the efficiency of CTU-level bit allocation, a game theory approach has been taken [23]. Also, a machine learning-based technique has been considered for improving the prediction accuracy of the R-D model [30].

1) R-Q MODELS
From information theory, a closed-form solution for the rate-distortion function can be derived [40]. In VM 8 of MPEG-4 [16], the rate-distortion function for residues with a Laplacian pdf is expanded by the Taylor series such that a quadratic rate model is formulated as [33]:

    R = a/Q + b/Q^2    (1)

where a and b refer to model parameters that depend on video content characteristics, Q indicates a QP, and R is a target bit amount for a certain coding level. Previously, in order to increase the rate estimation accuracy of former video coding standards such as H.264/MPEG-4 AVC [17] and HEVC [6], several studies exploited the gradients of pixel intensities [18]-[22]. Also, various pdf models for transform coefficient (TC) values such as Laplacian, Cauchy, and Gaussian were investigated to increase the R-Q models' accuracy [37], [41]. As a special case of the R-Q models, ρ, the percentage of quantized transform coefficients (QTC) equal to zero derived from the pdf model of TC values, is utilized in a linear function. This is called an R-ρ model [36]:

    R = θ · (1 − ρ) · N    (2)

where 1 − ρ indicates the percentage of non-zero QTC in a frame, θ is a model parameter, and N indicates the total number of pixels in the frame.
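Since the quadratic model is linear in its parameters a and b, they can be fitted by ordinary least squares from previously coded (Q, R) pairs. A minimal sketch (the sample data below are made up for illustration):

```python
import numpy as np

# Hypothetical (QP, texture bits) samples from previously coded frames.
Q = np.array([22.0, 27.0, 32.0, 37.0])
R = np.array([9000.0, 5200.0, 3300.0, 2300.0])

# The quadratic rate model R(Q) = a/Q + b/Q^2 is linear in (a, b),
# so ordinary least squares recovers the model parameters.
A = np.column_stack([1.0 / Q, 1.0 / Q**2])
(a, b), *_ = np.linalg.lstsq(A, R, rcond=None)

def rate_estimate(q):
    """Estimated texture bits for a candidate QP q."""
    return a / q + b / q**2

print(rate_estimate(30.0))
```

Re-fitting (a, b) as frames are coded is what lets the model track changing content.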

2) R-Q MIXTURE MODELS
Several schemes for an R-Q mixture model were developed by exploiting the different characteristics of pdfs at various CU depth levels [8]-[11]. The R-Q mixture model is expressed as a mixture of Laplacian functions:

    f(x) = Σ_{i=1}^{N_CU} p_i · (φ_i / 2) · exp(−φ_i · |x|),  x ∈ ℝ    (3)

where x represents TC values, p_i is the portion of pixels in the i-th CU depth per frame, N_CU is the total number of CU depth levels (= 4 for HEVC), φ_i is the Laplacian model parameter, φ_i = √2 / σ_i, where σ_i is the standard deviation of TC values in the i-th CU depth per frame, and ℝ is the set of real numbers. It has been demonstrated that multiple R-Q models can be better fitted to the pdfs of actual TC values [8]-[11]. Gao et al. proposed a synthesized pdf model by minimizing the Kullback-Leibler divergence; the synthesized pdf model is then combined with the R-ρ model to increase the R-D estimation accuracy [35].
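The mixture pdf can be evaluated numerically; a small sketch with illustrative per-depth portions and standard deviations, assuming the φ_i = √2/σ_i Laplacian parameterization:

```python
import numpy as np

def laplacian_mixture_pdf(x, p, sigma):
    """Mixture of zero-mean Laplacian pdfs over N_CU depth levels.
    p[i]    : portion of pixels in the i-th CU depth (sums to 1)
    sigma[i]: std of transform coefficients in the i-th CU depth
    Laplacian parameter phi_i = sqrt(2)/sigma_i (assumed parameterization)."""
    x = np.asarray(x, dtype=float)
    pdf = np.zeros_like(x)
    for p_i, s_i in zip(p, sigma):
        phi = np.sqrt(2.0) / s_i
        pdf += p_i * 0.5 * phi * np.exp(-phi * np.abs(x))
    return pdf

# Illustrative 4-depth mixture (HEVC-like N_CU = 4); values are made up.
p = [0.4, 0.3, 0.2, 0.1]
sigma = [2.0, 4.0, 8.0, 16.0]
x = np.linspace(-50, 50, 2001)
dx = x[1] - x[0]
mass = np.sum(laplacian_mixture_pdf(x, p, sigma)) * dx  # should be close to 1
print(mass)
```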
Since the conventional R-Q models rarely deal with the non-texture bits, their rate estimation performance is decreased. In order to overcome this problem, texture and non-texture bits are separately estimated by their own models [9], [11]. This worked reasonably in HEVC; however, it is computationally burdensome since the number of coding depth levels is greatly increased in VVC. Moreover, the computational complexity increases significantly when the LMM with a radial basis function (RBF) network is used for R-D performance improvement [11].

3) R-MODELS
The conventional R-Q models developed for MPEG-2 [15], MPEG-4 [16], and H.264/MPEG-4 AVC [17] suffer from imprecise rate estimation as video coding technologies advance. In HEVC [6], since the encoded bits are influenced by various coding parameters of intra and inter modes, the QP is no longer the only critical factor determining the amount of resulting bits, unlike in previous standards such as H.264/MPEG-4 AVC [17]. Instead of the R-Q models, several ideas regarding the relation between the QP (or rate) and a Lagrangian multiplier λ, which represents the slope of the R-D curve, were proposed [2], [3], [42].
From the viewpoint of rate-distortion optimization (RDO) [43], the distortion (D) should be minimized such that the rate (R) is less than a given bit budget (R_b):

    min D   subject to   R ≤ R_b    (4)

Via the Lagrange multiplier method [44], (4) can be expressed as an unconstrained problem:

    min J = D + λ · R    (5)

where J is the R-D cost function and λ is the Lagrangian multiplier. Moreover, Mallat et al. verified that the R-D curve can be expressed as a rectangular hyperbolic function [45]:

    D(R) = ϕ · R^(−γ)    (6)

where ϕ and γ are model parameters, and γ remains on the order of 1. It has also been demonstrated that the rectangular hyperbolic function is more suitable than an exponential function [3]. Since λ is the negative slope of the R-D curve, differentiating (6) gives:

    λ = −∂D/∂R = ϕ · γ · R^(−γ−1)    (7)

From (7), we have the R-λ model:

    λ = α · R^β    (8)

where α and β indicate the parameters of the R-λ model. It should be noted that R in (8) contains bits for both texture and non-texture, while R in (1) only contains the texture bits. Owing to its precise R-D modeling performance, the R-λ model-based RC algorithm has been adopted into the HM [2], [3], [6], [13]. However, the model parameters in (8) are estimated from previously coded data, so the R-D modeling performance is likely to be degraded when the characteristics of the previously coded data are nonlinear.
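The step from the hyperbolic D(R) model to the R-λ model (8) can be checked numerically: the negative slope of D(R) = ϕ·R^(−γ) equals α·R^β with α = ϕγ and β = −γ−1. A quick sketch with illustrative parameter values:

```python
# Hyperbolic R-D model D(R) = phi * R**(-gamma); parameter values are illustrative.
phi, gamma = 5000.0, 1.2

def D(R):
    return phi * R ** (-gamma)

# lambda = -dD/dR = phi*gamma*R**(-gamma-1), i.e. lambda = alpha * R**beta
alpha, beta = phi * gamma, -gamma - 1.0

R = 800.0
h = 1e-3
numeric_slope = -(D(R + h) - D(R - h)) / (2 * h)  # -dD/dR by central difference
model_lambda = alpha * R ** beta
print(numeric_slope, model_lambda)
```

The two values agree to high precision, confirming that (8) is just the derivative of (6).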

III. PROPOSED FRAME-LEVEL CONSTANT BIT-RATE CONTROL USING RECURSIVE BAYESIAN ESTIMATION
In order to achieve elaborate rate control for VVC while overcoming the various shortcomings of previous R-D models, we propose a frame-level constant bit-rate control using the RBE.
To be self-contained, we briefly review the basic concept of the RBE used for our frame-level constant bit-rate control in the following.

A. RECURSIVE BAYESIAN ESTIMATION (RBE)
A recursive Bayesian estimation (RBE) can be used in various applications in signal processing, control and dynamical systems, computer vision, and robotics to estimate system information such as states, model parameters, and so on [46]. The Bayesian theorem [47] is utilized in Bayesian estimation to construct a posterior probability density of the state from all measurements, given an initial prior probability density. The RBE basically needs two steps (prediction and update) to perform the estimation. In the prediction step, a state evolution probability is used to predict a prior probability density, while in the update step, both the prior probability density and measurement data are used to obtain the posterior probability density. Through these two alternating steps, an optimal estimate can theoretically be found according to several criteria such as means, modes, medians, and so on [46]. In addition, the estimation accuracy can be measured in terms of covariance.

1) BAYESIAN ESTIMATION
Estimation procedures collect information about the parameters of a random vector x, defined as the state, from a random vector y, which is often obtained from imprecise (or noisy) measurement equipment or random modeling. Usually, x is assumed to have a known prior probability density p(x).
According to the Bayesian rule, as y is measured, the knowledge of the parameters of x is updated as [47]:

    p(x|y) = p(y|x) · p(x) / p(y)    (9)

where the posterior probability density p(x|y) after receiving y represents everything known about the parameters of x, and the denominator p(y) is a positive scalar constant that can be found by marginalization as [47]:

    p(y) = ∫ p(y|x) · p(x) dx    (10)

Thus, we only need the numerator p(y|x)p(x) in (9) to solve for p(x|y) up to normalization. Point estimates for the Bayesian estimation are found via the conditional mean estimate (ME) and the maximum a posteriori (MAP) estimate as [46]:

    x̂_ME = ∫ x · p(x|y) dx    (11)
    x̂_MAP = arg max_x p(x|y)    (12)

where x̂_ME and x̂_MAP are the ME and MAP estimates, respectively.
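The Bayesian update in (9), together with the ME (11) and MAP (12) estimates, can be illustrated on a discretized grid; a toy sketch with a Gaussian prior and Gaussian likelihood (all numeric values are illustrative):

```python
import numpy as np

# Grid-based Bayesian update for a scalar state.
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

prior = np.exp(-0.5 * (x / 2.0) ** 2)   # p(x): N(0, 2^2), unnormalized
prior /= prior.sum() * dx

y, noise_std = 1.5, 1.0
likelihood = np.exp(-0.5 * ((y - x) / noise_std) ** 2)  # p(y|x)

# Bayes rule (9): posterior ∝ likelihood * prior; p(y) is the normalizer (10).
posterior = likelihood * prior
posterior /= posterior.sum() * dx

x_me = np.sum(x * posterior) * dx   # conditional mean estimate (11)
x_map = x[np.argmax(posterior)]     # MAP estimate (12)
print(x_me, x_map)
```

For this Gaussian case both estimates coincide at the analytic posterior mean y·σ_prior²/(σ_prior² + σ_noise²) = 1.2, which the grid recovers.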

2) RECURSIVE ESTIMATION
State evolutions occur at each time step via a Markov process with an initial state x_0 ∼ p(x_0) in a recursive estimation process. The state transition (or prior) probability density can be expressed as [46]:

    p(x_k | x_{k−1})    (13)

where x_k is the state vector at time instant k. Since it is often assumed that the measurement vector y_k is conditionally independent of the previous measurement vectors (y_1, y_2, ..., y_{k−1}) given the current state x_k, the likelihood probability density is described as [46]:

    p(y_k | x_k)    (14)

Both the transition and likelihood probability density models in (13) and (14) depend on the time instant k. In addition, the relationship between (13) and (14) can be described as a hidden Markov model (HMM) [48]. In an HMM, the states are hidden (to be estimated), but the measurements dependent on the states are visible. Fig. 1 shows a flowchart of an HMM: the state transition and likelihood probability density models are described via the first-order Markov process and the state-dependent measurements, respectively.

[Fig. 1. x_k and y_k refer to the state random vector in (13) and the measurement random vector in (14), respectively. p(x_k|x_{k−1}) and p(y_k|x_k) indicate the state transition probability density in (13) and the likelihood probability density in (14), respectively.]

By the HMM and the Bayesian theorem in (9), the posterior probability density can be inferred. More specifically, by adopting the Bayesian and recursive estimations alternately, the conceptual solution for the RBE can be obtained. Assume that the state evolution x_k is a Markov process and that x_{k+1} is independent of y_k when x_k is given. Thus, we have:

    p(x_{k+1}, x_k | y_k) = p(x_{k+1} | x_k) · p(x_k | y_k)    (15)

By integrating both sides of (15) with respect to x_k, we have the following Chapman-Kolmogorov identity [49]:

    p(x_{k+1} | y_k) = ∫ p(x_{k+1} | x_k) · p(x_k | y_k) dx_k    (16)

Eq. (16) is the prediction step of the Bayesian recursion, where the prior probability density is estimated. In order to find the posterior probability density p(x_k | y_k), we apply the Bayesian theorem in (9) to the measurement vector y_k based on the conditional independence assumption on y_k in (14), which results in:

    p(x_k | y_k) = p(y_k | x_k) · p(x_k | y_{k−1}) / p(y_k | y_{k−1})    (17)

where p(y_k | x_k) and p(x_k | y_{k−1}) indicate the likelihood and prior probability densities in (14) and (16) at time k, respectively, and p(y_k | y_{k−1}) = ∫ p(y_k | x_k) · p(x_k | y_{k−1}) dx_k is a normalizing constant [50]. Eq. (17) is referred to as the update step of the Bayesian recursion. According to (16) and (17), the prior and posterior probability densities can be alternately updated to enhance the prediction accuracy. Moreover, point estimates, such as ME and MAP, and the estimation error covariance C based on p(x_k | y_k) are expressed as [46]:

    x̂_{k,ME} = ∫ x_k · p(x_k | y_k) dx_k    (18)
    x̂_{k,MAP} = arg max_{x_k} p(x_k | y_k)    (19)
    C = ∫ (x_k − x̂_k)(x_k − x̂_k)^T · p(x_k | y_k) dx_k    (20)

Despite being the theoretically optimal solution of the RBE for computing p(x_k | y_k), (17) is not practical due to the intractable integrals over the infinite representations of the prior and posterior pdfs.
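One full prediction/update cycle of (16) and (17) can be sketched on a grid, where the Chapman-Kolmogorov integral becomes a discrete convolution for an additive-noise transition (a toy model, not the paper's rate filter):

```python
import numpy as np

# One prediction/update cycle of the Bayesian recursion on a grid.
x = np.linspace(-10, 10, 401)
dx = x[1] - x[0]

def gauss(z, std):
    return np.exp(-0.5 * (z / std) ** 2)

# Posterior p(x_k|y_k) from the previous cycle (assumed Gaussian N(1, 0.5^2)).
post_k = gauss(x - 1.0, 0.5)
post_k /= post_k.sum() * dx

# Prediction (16): p(x_{k+1}|y_k) = ∫ p(x_{k+1}|x_k) p(x_k|y_k) dx_k,
# a convolution when the transition is additive noise x_{k+1} = x_k + w.
trans = gauss(x, 1.0)
prior_next = np.convolve(post_k, trans, mode="same")
prior_next /= prior_next.sum() * dx

# Update (17): p(x_{k+1}|y_{k+1}) ∝ p(y_{k+1}|x_{k+1}) p(x_{k+1}|y_k).
y_next = 2.0
post_next = gauss(y_next - x, 1.0) * prior_next
post_next /= post_next.sum() * dx
mean_next = np.sum(x * post_next) * dx
print(mean_next)
```

The analytic answer for these Gaussians is (1/1.25 + 2/1)/(1/1.25 + 1) ≈ 1.556, which the grid recursion reproduces.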

3) PARTICLE FILTERING
Particle filtering (PF) obtains an estimate (e.g., an updated mean or model parameters) based on point (particle) mass representations of probability densities by applying the Bayesian theorem [46], [50], [51]. A key advantage of particle filtering is that randomly sampled particles from any distribution can be fed to its robust SIR algorithm to obtain the estimate, whereas other conventional methods require predefined distribution functions [46]. Thus, particle filtering is widely used in various applications such as terrain-aided navigation, economic forecasting, and statistical signal processing [46], [50], [51]. Our previous work treated the rates and distortions as random variables whose pdf forms are unknown, and applied the PF to R-D estimation in VVC for the first time [12]. Fig. 2 illustrates the particle filtering concept. As shown in Fig. 2, the RBE is performed via the SIR algorithm to obtain the posterior probability density p(x_k|y_k) in (17). The prior probability density p(x_k|y_{k−1}) at time k in (16) and the likelihood probability density p(y_k|x_k) in (14) are plugged into (17); this is the update step, which yields p(x_k|y_k). Then, p(x_k|y_k) is plugged back into (16); this is the prediction step, which yields the prior probability density p(x_{k+1}|y_k) at time k+1. The alternation between the update step and the prediction step increases the prediction accuracy of particle filtering. Detailed mathematical definitions and descriptions of the SIR algorithm for particle filtering can be found in our previous work [12].

[Fig. 2. p(x_k|y_{k−1}) and p(x_{k+1}|y_k) refer to the prior pdf in (16) at time k and time k+1, respectively. p(y_k|x_k) and p(x_k|y_k) represent the likelihood pdf in (14) and the posterior pdf in (17), respectively.]
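A generic SIR particle filter, of the kind this section describes, can be sketched in a few lines (a toy scalar random-walk model with Gaussian noises; not the exact rate/distortion formulation of [12]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: state x_{k+1} = x_k + process noise; measurement y_k = x_k + noise.
N = 1000
particles = rng.normal(0.0, 1.0, N)   # draw from the initial prior
true_x = 0.0

for _ in range(20):
    true_x += rng.normal(0.0, 0.3)
    y = true_x + rng.normal(0.0, 0.5)

    # Prediction step: propagate particles through the state model.
    particles = particles + rng.normal(0.0, 0.3, N)

    # Update step: weight each particle by the likelihood p(y|x), normalize.
    w = np.exp(-0.5 * ((y - particles) / 0.5) ** 2)
    w /= w.sum()

    # Resample with replacement according to the weights (SIR).
    particles = rng.choice(particles, size=N, p=w)

estimate = particles.mean()
print(estimate, true_x)
```

After a few cycles the particle cloud tracks the hidden state, illustrating why no closed-form pdf is needed.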

B. PROBLEM FORMULATION
As mentioned in Section I, rate control algorithms using the LMM-based R-Q model are problematic when R-D estimation is not precisely performed in VVC [5], which has a deeper coding structure than HEVC [6]. Moreover, the rate estimation (RE) performance of the R-λ model is degraded if the linear relationships between the λ and bpp values are not maintained. In order to cope with these problems, we formulate the RE problem as:

    λ_opt = arg min_λ D(R(λ))   subject to   R(λ) = R_T,  R_min ≤ R_T ≤ R_max    (21)

where R_T indicates the proposed target bit amount per frame, and R(λ) and D(R(λ)) indicate the rate and the distortion of the R-λ model, respectively. Note that D(·) indicates the distortion function in (6). R_min and R_max are the minimum and maximum rate allowances, respectively, to prevent the buffer from overflowing and underflowing. λ_opt represents the optimal value to be found in the R-λ model, which can be solved by certain optimization techniques. Our previous work [12] showed that the rate and distortion values behave quite randomly due to the various spatial and temporal complexities of the input video sequences. Thus, the rate and distortion values can be regarded as random variables and modeled by a certain pdf. Therefore, we propose an RBE-based stochastic framework, built on Section III-A, that predicts R_T to solve the formulated problem in (21). Fig. 3 describes an overview of our proposed frame-level RE, BA, and RC using RBE. Initially, a target bit-rate for RC is set as input to our algorithm. The proposed frame-level RE using RBE estimates an intermediate rate R̂_{k+1}, which not only improves the rate estimation accuracy but is also effectively used for our BA process. Then, the proposed target bit amount per frame R_T is calculated by our RBE-based BA process, so that R_T can be applied to the R-λ model [2]-[4] to determine λ_opt. Then, λ_opt is used for selecting an appropriate QP. Finally, the selected QP is utilized for our video encoding process.
The details of our proposed method are described in the following sub-sections.

1) PROPOSED FRAME-LEVEL RATE ESTIMATION USING RBE
Our RE method utilizes the RBE by considering a distortion variation. The distortion variation between two encoded frames is defined as:

    ΔMSE_k = γ · (MSE_k − MSE_{k−n}) / MSE_{k−n}    (22)

where MSE_k and MSE_{k−n} are the mean square errors of frames k and k−n, respectively, and γ is a control parameter for the rate adjustment, which is empirically set to 0.3 in our experiments. Also, n is empirically set to 2, which yields an appropriate distortion variation for properly responding to the dynamics of distortions during the RE. These empirically found values, γ = 0.3 and n = 2, are used for all test sequences and QPs (= 22, 27, 32, and 37) and work reasonably well.

[Fig. 3. R̂ is the intermediate rate by the proposed RE, R_T is the target bit amount per frame by the proposed BA, T_Bits is the target bit amount per frame by the frame-level BA in [1], λ_opt is the optimal value of the R-λ model [2]-[4], and QP is the quantization parameter.]
For the practical implementation of the RBE, the rates are regarded as random variables with unknown pdfs, and the SIR algorithm of our previous work [12] is exploited. Initially, we randomly generate N (= 150) rate particles, whose vector form is r_k at frame k, from a normal distribution with mean 0 and standard deviation 0.1. In addition, the initial particle weights for r_k are set to 1/N. Then, the prediction and update steps in (16) and (17) are performed alternately. Furthermore, the weight adjustment and normalization processes for the particles are performed in the update steps using measurement data (actual encoded bits or distortions) [12]. In this way, the RBE is able to stably estimate the rates even though it relies on a stochastic framework. The rate particles are propagated to the next frame k+1 as:

    r_{k+1} = R_k · (1 + ΔMSE_k) + r_k    (23)

where R_k is the actual scalar rate of frame k after video coding and ΔMSE_k is the distortion variation in (22). After propagating the rate particles r_k to frame k+1, the rate particles r_{k+1} are resampled N times according to r_{i,k+1} = g^i_{k+1}(r_{k+1}), where g^i_{k+1}(·) is the sampling function that randomly samples the i-th particle r_{i,k+1} with replacement at frame k+1. Then, the weights (or probabilities) of r_{k+1} are normalized as:

    s_{k+1} = p(r_{k+1}) / Σ_i p(r_{i,k+1})    (24)

where p(r_{k+1}) is the pdf of r_{k+1}. By taking the inner product of r_{k+1} and s_{k+1}, the intermediate rate estimate for frame k+1 via the RBE considering ΔMSE_k can be obtained as:

    R̂_{k+1} = ⟨r_{k+1}, s_{k+1}⟩    (25)

where r_{k+1} and s_{k+1} are the rate particles in (23) and their weights in (24) at frame k+1, respectively.
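Putting the pieces together, a rough illustration of the RBE-based rate estimation loop (the particle propagation and weighting forms below are assumptions for illustration only; the exact SIR formulation, including the use of (22), is given in [12], and all numeric inputs here are made up):

```python
import numpy as np

rng = np.random.default_rng(7)

N = 150        # number of rate particles, as in the text
GAMMA = 0.3    # control parameter for the rate adjustment
LAG = 2        # distortion-variation lag n

def intermediate_rate(actual_rates, mses):
    """Sketch of an RBE-based intermediate rate estimate for the next frame."""
    # Initialize particles around the first actual rate (10% spread, assumed).
    particles = actual_rates[0] * (1.0 + rng.normal(0.0, 0.1, N))
    for k in range(LAG, len(actual_rates)):
        # Distortion variation in the spirit of Eq. (22) (relative form assumed).
        d_mse = GAMMA * (mses[k] - mses[k - LAG]) / mses[k - LAG]
        # Prediction: propagate rate particles, scaled by the distortion change.
        particles = particles * (1.0 + d_mse) \
            + rng.normal(0.0, 0.02 * abs(particles.mean()), N)
        # Update: weight by closeness to the actual encoded bits, normalize.
        w = np.exp(-0.5 * ((actual_rates[k] - particles) / (0.05 * actual_rates[k])) ** 2)
        w /= w.sum()
        # Resample with replacement (SIR).
        particles = rng.choice(particles, size=N, p=w)
    # Weighted mean (weights are uniform after resampling): the estimate.
    return float(particles.mean())

rates = [1000.0, 1050.0, 980.0, 1020.0, 1010.0]   # actual encoded bits per frame
mses = [30.0, 31.0, 29.5, 30.5, 30.2]             # per-frame MSE
print(intermediate_rate(rates, mses))
```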

2) PROPOSED FRAME-LEVEL BIT ALLOCATION AND RATE CONTROL
In the R-λ model-based RC, the target bit amount per frame for the frame-level BA is defined as [1]:

    T_Bits = R_TBF + (R_TBL − R_TBF · N_Left) / SW    (26)

where R_TBL is the total bit budget remaining, and R_TBF = T_BR / FR indicates the average target bit amount per frame. T_BR and FR indicate the target bit-rate (bits/sec) and the frame-rate (frames/sec). N_Left is the number of frames left, and SW is the size of a sliding window for bit-rate fluctuation smoothing, which is set to 40 [1], [3]. Usually, the per-frame bit allocation is deeply related to the performance of RC, since the RC calculates a QP based on the allocated target bits. However, in the R-λ model [2]-[4], T_Bits is allocated simply in a mechanical manner according to a certain frame-complexity measure. Thus, it not only degrades the R-D performance but also causes the visual quality to fluctuate over frames. Note that an elaborate optimization technique for the BA process is not considered in this work, since we rather focus on simplified RC and BA processes. However, if an RBE-based BA is exclusively applied, it causes buffer underflow and results in a lack of bit resources toward the end of a video sequence. In order to prevent this problem, we restrict the proposed target bit amount per frame R_T by averaging the target bit amount per frame T_Bits in (26) and the intermediate rate R̂_{k+1} for frame k+1 obtained by our frame-level RE using RBE in (25). Thus, R_T is defined as:

    R_T = (T_Bits + R̂_{k+1}) / 2    (27)

As described in (21), λ_opt can be solved by certain optimization techniques such as the gradient descent method, the bisection method, and so on [28], [44] to calculate a QP, which may incur high computational complexity. In order to relieve this, R_T in (27) is assumed to be the true rate value, thanks to the high RE accuracy of our RBE-based stochastic framework. Thus, (21) is rewritten as:

    R(λ_opt) = R_T    (28)

where R(λ_opt) is the bit amount estimated by (8). Thus, λ_opt is obtained by solving (28).
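The BA step can be sketched as follows. The arrangement of the sliding-window allocation (26), the averaging in (27), and the conversion to bits-per-pixel for the R-λ model are assumptions for illustration, and α, β, and all inputs are illustrative values:

```python
def frame_target_bits(total_bits_left, target_bitrate, frame_rate, n_left, sw=40):
    """Frame-level target bits with a smoothing window (SW = 40 as in the text):
    T_Bits = R_TBF + (R_TBL - R_TBF * N_Left) / SW (assumed arrangement)."""
    r_tbf = target_bitrate / frame_rate   # average target bits per frame
    return r_tbf + (total_bits_left - r_tbf * n_left) / sw

def proposed_target_bits(t_bits, rbe_rate):
    """Average of the default allocation and the RBE intermediate rate,
    which guards against buffer underflow from an RBE-only allocation."""
    return 0.5 * (t_bits + rbe_rate)

def lambda_opt(r_t_bits, width, height, alpha, beta):
    """R-lambda model lambda = alpha * bpp^beta; the R-lambda model in the
    HM/VTM operates on bits per pixel (bpp)."""
    bpp = r_t_bits / (width * height)
    return alpha * bpp ** beta

t_bits = frame_target_bits(total_bits_left=4.0e6, target_bitrate=1.0e6,
                           frame_rate=30.0, n_left=100)
r_t = proposed_target_bits(t_bits, rbe_rate=32000.0)
lam = lambda_opt(r_t, 1920, 1080, alpha=3.2, beta=-1.367)  # illustrative alpha, beta
print(t_bits, r_t, lam)
```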
In addition, via the relation between R-λ and the QP [2], [3], [42], the QP is determined as:

    QP = round(c · ln λ_opt + d)    (29)

where c (= 4.2005) and d (= 13.7122) are empirical constants [2], [3], [42] and round(·) rounds a value to its nearest integer. Then, by substituting λ_opt into (29), the QP can be calculated. In order to prevent abrupt changes in both λ_opt and QP, their allowable ranges are constrained to [λ_avg · 2^(−2/3), λ_avg · 2^(2/3)] and [QP_avg − 2, QP_avg + 2], respectively [2], [3]. Moreover, the R-λ model parameters in (8) are updated by the linear update model [3]. Note that the same initial model parameters of the R-λ model as in HEVC are applied in our experiments. Fig. 4 summarizes a flowchart of the frame-level RE, RC, and BA schemes using RBE. As shown in Fig. 4, in order to reflect the R-D characteristics in the RC, the distortion variations of previously encoded frames in (22) are considered. Then, the proposed RE through the SIR algorithm [12] is applied to obtain the intermediate rate estimate R̂_{k+1} for frame k+1. The proposed BA process computes the proposed target bit amount per frame R_T as the average of R̂_{k+1} and T_Bits according to (27). Using R_T, λ_opt is calculated by (28). Finally, the QP is determined by (29) and used for our rate-distortion optimized video coding.
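A sketch of the QP derivation and clipping (constants as given in the text; λ_avg and QP_avg values are illustrative):

```python
import math

C, D = 4.2005, 13.7122   # empirical constants from the text

def qp_from_lambda(lam):
    """QP = round(c * ln(lambda) + d), per the text."""
    return round(C * math.log(lam) + D)

def clip_lambda(lam, lam_avg):
    """Constrain lambda to [lam_avg * 2^(-2/3), lam_avg * 2^(2/3)]."""
    return min(max(lam, lam_avg * 2 ** (-2 / 3)), lam_avg * 2 ** (2 / 3))

def clip_qp(qp, qp_avg):
    """Constrain QP to [qp_avg - 2, qp_avg + 2]."""
    return min(max(qp, qp_avg - 2), qp_avg + 2)

lam = clip_lambda(300.0, lam_avg=100.0)   # clipped down to 100 * 2^(2/3)
qp = clip_qp(qp_from_lambda(lam), qp_avg=34)
print(lam, qp)
```

The clipping keeps λ within about a factor of 1.6 of its recent average, which is what suppresses abrupt QP jumps between consecutive frames.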

IV. EXPERIMENTAL RESULTS
A. EXPERIMENTAL SETTINGS
To demonstrate the fidelity of our proposed method for frame-level RE, BA, and RC using RBE, the proposed RC method is implemented in the VVC Test Model reference software (VTM-5.0) [1]. All experiments are performed under the All Intra (AI), Low Delay (LD), and Random Access (RA) configurations using GOP sizes of 1, 4, and 8, respectively, with four QP values (22, 27, 32, and 37) under the JVET common test conditions [52]. For the intra frame period, only the first frame is an intra picture for the AI and LD configurations; the intra frame period is 8 for the RA configuration. Rate-distortion optimized quantization (RDOQ), rate-distortion optimized quantization for transform skip (RDOQTS), context-adaptive binary arithmetic coding (CABAC), and sample adaptive offset (SAO) are activated during encoding. The intra coding tools of VTM-5.0 such as multiple transform selection (MTS), low-frequency non-separable secondary transform (LFNST), intra sub-partitions (ISP), and matrix-weighted intra prediction (MIP) are activated. Also, fast implementation tools such as FastLFNST, FastMIP, and ISPFast are activated. The maximum width, height, and partition depth of a CU are 64, 64, and 4, respectively. The CTU size is 128.
For the experiments, we use nineteen test sequences from seven classes with different texture characteristics and resolutions (Class A1 (3840×2160), Class A2 (3840×2160), Class A (2560×1600), Class B (1920×1080), Class C (832×480), Class D (416×240), and Class E (1280×720)), which have been used as the test sequences in VVC development. Note that Class A, Class B, Class C, Class D, and Class E are 8-bit sequences, while Class A1 and Class A2 are 10-bit sequences. More information on the test sequences is listed in Table I. To evaluate the proposed method for frame-level RE, BA, and RC using RBE, its RE and target bit-rate allocation performances are compared to those of VTM-5.0 with the R-λ model [2]-[4]. The RE and target bit-rate allocation accuracies are measured by the normalized root mean square error (NRMSE) and the bit-rate accuracy (BRA), respectively, which are defined in Section IV-B. The target bit-rate for each test sequence is set to the actual bit-rate obtained at each of the four QP values (22, 27, 32, and 37) without RC activation in VTM-5.0 [1], and is then compared with the encoded bit amounts obtained by the proposed RC method and by the default RC method of VTM-5.0 [1]. For the evaluation of visual quality consistency over frames, the standard deviation of the PSNR values of all encoded frames, σ_PSNR, is used for each test sequence.

B. EVALUATION ON RATE CONTROL AND BIT ALLOCATION
The accuracy of the RE is measured in terms of NRMSE. The NRMSE metric [12] is defined as:

    NRMSE = (1 / avg(Act)) · sqrt( (1/N) · Σ_{k=1}^{N} (Est(k) − Act(k))^2 )

where N (= 100) is the number of coded frames, Est(k) and Act(k) refer to the estimated and actual (true) encoded bits of frame k, respectively, and avg(Act) is the average of the actual coded bits over all frames. Lower NRMSE values indicate higher RE accuracy. For the accuracy measure of RC, the BRA (%) is used:

    BRA = (1 − |T_BR_GT − T_BR_act| / T_BR_GT) × 100

where T_BR_GT indicates the target bit-rate obtained by VTM-5.0 without RC, and T_BR_act is the actual encoded bit-rate of VTM-5.0 with our proposed RC method (the R-λ model with our RBE for BA) or with the default RC method (the conventional R-λ model [2]-[4]). Greater BRA values indicate higher RC accuracy. Table II, Table III, and Table IV show the average BRA (%), NRMSE, and σ_PSNR performances of the proposed RC method implemented in VTM-5.0 and the default RC method of the original VTM-5.0 for the AI, LD, and RA configurations, respectively. Note that lower NRMSE and σ_PSNR values indicate more precise estimates of the actual (true) rates and more consistent visual quality. As shown in Table II, Table III, and Table IV, our proposed RC method outperforms the default RC method of the original VTM-5.0 for AI, LD, and RA by 34.95% (12.35%), 44.82% (27.29%), and 47.35% (39.94%) improvements in terms of maximum (average) NRMSE, respectively, and shows better visual quality consistency for AI, LD, and RA by 33.07% (22.34%), 22.54% (9.50%), and 30.35% (18.54%) improvements, respectively, in terms of maximum (average) σ_PSNR, compared to the default RC method. The average BRAs of the proposed and default RC methods are 99.89% and 99.91% for AI, 97.50% and 98.03% for LD, and 85.52% and 83.58% for RA, respectively. Finally, it is worth noting that BRA is measured after encoding all frames of each sequence, so its values may not reflect the frame-by-frame rate estimation accuracy.
Therefore, the BRA is more worthwhile to be analyzed in conjunction with σ PSNR .
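As a quick illustration, the two metrics above can be computed as follows. This is a minimal sketch of the definitions with hypothetical function names, not the evaluation script used in our experiments.

```python
import math

def nrmse(est, act):
    """Normalized RMSE between estimated and actual per-frame bits.

    est, act: sequences of per-frame encoded bit counts over the N frames.
    """
    n = len(act)
    rmse = math.sqrt(sum((e - a) ** 2 for e, a in zip(est, act)) / n)
    return rmse / (sum(act) / n)  # normalize by avg(Act)

def bra(target_br, actual_br):
    """Bit-rate accuracy in percent: 100% means the target was hit exactly."""
    return (1.0 - abs(target_br - actual_br) / target_br) * 100.0
```

Lower `nrmse` and higher `bra` values correspond to better RE and RC accuracy, respectively, as stated above.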

These noticeable improvements in rate estimation accuracy and visual quality consistency stem mainly from the capability of the proposed RC method to precisely predict the intermediate rates by exploiting our stochastic RBE framework. In addition, our RC method is capable of properly allocating the per-frame target bit amount R_T by considering the R-D characteristics in collaboration with the R-λ model [1]. Fig. 5, Fig. 6, and Fig. 7 show the plots of NRMSE for the rates estimated by the proposed and default RC methods for five test sequences with the AI, LD, and RA configurations. As shown in Fig. 5, Fig. 6, and Fig. 7, the trends of the NRMSE curves are almost identical for both RC methods, but the proposed RC method shows smaller NRMSE values over almost all QP value ranges in our experiment. Fig. 8 shows the R-D curves obtained by VTM-5.0 [1] without RC activation, and with the proposed and default RC methods turned on. As shown in Fig. 8-(a) to Fig. 8-(c), some R-D gains by our RC method are found in particular ranges from QP = 22 to QP = 37, compared to the default RC method. Similar R-D curves are also achieved for the other test sequences. Fig. 9, Fig. 10, and Fig. 11 show the measured per-frame PSNR performances of the two RC methods using the AI, LD, and RA configurations, respectively, for QP = 37. As shown in Fig. 9-(a) and Fig. 9-(c), the proposed RC method tends to yield smoother PSNR curves with smaller peak-to-valley variations, compared to the default RC method. In particular, the proposed RC method exhibits much less visual quality fluctuation between the 60-th and 100-th frames in Fig. 9-(b). As shown in Fig. 10-(a) to Fig. 10-(c), it is also noticeable that the proposed RC method yields smoother changes of the PSNR curves. In addition, the proposed RC method tends to yield smoother PSNR curves with smaller peak-to-valley variations between the 70-th and 100-th frames in Fig. 11-(a), compared to the default RC method.
In particular, the proposed RC method maintains the visual quality consistency up to the end of the sequence, whereas the default RC method fails to do so, especially from the 80-th to the 100-th frames in both Fig. 11-(b) and Fig. 11-(c). Fig. 12, Fig. 13, and Fig. 14 show some decoded frames for subjective visual quality comparisons using the AI, LD, and RA configurations, respectively. Fig. 12 shows the 72-nd reconstructed frame (cropped) of Tango2 (3840×2160) with QP = 37. As shown in Fig. 12-(b) to Fig. 12-(d), the vivid blemish on the wrist in the original frame shown in Fig. 12-(b) appears blurred in Fig. 12-(d) with the default RC method. Fig. 13 shows the 49-th reconstructed frame (cropped) of BlowingBubbles (416×240) with QP = 32. As shown in Fig. 13-(b) to Fig. 13-(d), the braided hair and face of the girl on the left in the original frame shown in Fig. 13-(b) appear blurred in Fig. 13-(d) with the default RC method. Fig. 14 shows the 82-nd reconstructed frame (cropped) of ParkScene (1920×1080) with QP = 37. As shown in Fig. 14-(b) to Fig. 14-(d), a pattern carved on a pillar in the original frame shown in Fig. 14-(b) appears blurred in Fig. 14-(d) with the default RC method. In contrast, our RC method presents better visual quality, as shown in Fig. 12-(c), Fig. 13-(c), and Fig. 14-(c).
From the observations of the extensive experimental results throughout Table II, Table III, Table IV, and Fig. 5 to Fig. 14, the proposed RC method demonstrates superior RC performance in terms of NRMSE, σ_PSNR, and PSNR compared to the default RC method for almost all the test sequences, enhancing both the rate estimation accuracy and the visual quality consistency.
C. COMPLEXITY
Table V shows the complexity of the proposed RC method in terms of run times, measured on a PC platform with an Intel Core™ i7-8700K CPU @ 3.70 GHz, 32.0 GB of RAM, and a 64-bit Windows™ 10 operating system. The average run times were measured over three runs of encoding 100 frames with QP = 22 for each test sequence using the VTM-5.0 reference SW encoder [1]. As shown in Table V, the time increments required for the proposed RC method range from −11.03% to 5.29% for the AI configuration, from −17.48% to 18.62% for the LD configuration, and from −23.16% to 12.93% for the RA configuration, compared to the original VTM-5.0 reference SW encoder with the default RC method. Our RC method, which is implemented in VTM-5.0 by replacing the default RC method, reduces the encoding time by about 0.22% for the AI configuration, 0.68% for the LD configuration, and 0.93% for the RA configuration on average, thus not increasing the overall complexity of the VTM-5.0 reference SW encoder. It is also worth mentioning that the encoding time is affected by the QP values selected for rate control.

D. BUFFER FULLNESS
In order to seamlessly stream the encoded bit-sequences within a certain bandwidth under a CBR constraint, a proper buffer size needs to be defined such that an RC algorithm controls the bit generation to prevent buffer overflow and underflow. This buffer is called the coded picture buffer (CPB), whose size is set equal to the target bit-rate [1]. In our experiments, both the proposed RC method and the default method control the CPB state reasonably well without buffer overflow or underflow. Fig. 15 shows the buffer fullness (CPB state) of Tango2 under the two RC methods for the AI, LD, and RA configurations. As shown in Fig. 15, the CPBs are stably controlled for seamless streaming. However, as shown in Fig. 15-(a), the default RC method generates a large amount of bits to meet the target bit-rate at the very end of the sequence, where the CPB state drops abruptly, which may cause a buffer underflow.
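The per-frame buffer bookkeeping described above can be sketched as follows. This is a hypothetical illustration assuming a leaky-bucket CPB drained at the constant target bit-rate with a half-full initial state; the function name and initialization are our own assumptions, not the VTM implementation.

```python
def simulate_cpb(frame_bits, target_bitrate, frame_rate, cpb_size=None):
    """Track coded picture buffer (CPB) fullness frame by frame.

    Each frame, the encoder adds frame_bits[k] to the buffer while the
    channel drains target_bitrate / frame_rate bits. The CPB size is set
    equal to the target bit-rate, as in the experiments above.
    """
    if cpb_size is None:
        cpb_size = target_bitrate  # CPB size == target bit-rate
    drain = target_bitrate / frame_rate
    fullness = cpb_size / 2.0      # assumed half-full initial buffer
    trace = []
    for bits in frame_bits:
        fullness += bits - drain
        if fullness > cpb_size:
            raise OverflowError("CPB overflow: too many bits produced")
        if fullness < 0:
            raise RuntimeError("CPB underflow: the channel is starved")
        trace.append(fullness)
    return trace
```

An RC algorithm keeps the trace between the two failure bounds; the abrupt end-of-sequence drop discussed above corresponds to the fullness approaching the underflow bound.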

E. DISCUSSION ON HARDWARE ISSUES
A video encoder for real-time high-fidelity and high-resolution applications may require a hardware implementation, where bit-rate estimation is an essential element [57]. Based on our complexity analysis of the random sampling function of the RBE-based rate estimation using the nineteen test sequences in Table I, the run time of our RC method is less than approximately 2 msec per frame, while the method in [8] requires more than 20 msec per frame. Note that since our RBE-based rate estimation scheme only utilizes previously encoded distortions and rates for bit estimation, the complexity of the rate estimation does not depend on the image sizes of the test sequences. Nevertheless, in order to reduce the processing time for bit-rate estimation from the perspective of hardware optimization, a parallel hardware architecture for the random sampling function of the RBE-based rate estimation can be considered. In particular, the probability summation and indexing for our RBE-based rate estimation can be implemented in a parallel processing architecture.
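For illustration only, the probability summation and indexing step of such a sampler can be expressed as a prefix sum followed by a binary-search lookup (inverse-CDF sampling); both operations have well-known parallel formulations (parallel scan and parallel search). The code below is a hypothetical CPU sketch of this generic technique, not the hardware design or our encoder code.

```python
from bisect import bisect_right
from itertools import accumulate
import random

def sample_indices(probs, n_samples, rng=None):
    """Draw indices from a discrete distribution by inverse-CDF sampling.

    The prefix sum (probability summation) and the binary-search lookup
    (indexing) are the two steps that map naturally onto a parallel
    hardware architecture.
    """
    rng = rng or random.Random(0)
    cdf = list(accumulate(probs))  # probability summation (prefix sum)
    total = cdf[-1]                # supports unnormalized weights
    # bisect_right returns the first index whose cumulative mass exceeds u
    return [bisect_right(cdf, rng.random() * total) for _ in range(n_samples)]
```

Since every uniform draw is looked up independently against the same CDF, the n_samples lookups can run fully in parallel once the prefix sum is available.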

V. CONCLUSION
In this paper, we propose a frame-level constant bit-rate (CBR) control method using recursive Bayesian estimation (RBE) for Versatile Video Coding (VVC). The proposed RC method is based on a stochastic framework and considers the R-D characteristics of the previously encoded frames in estimating the rate for the current frame with less visual quality fluctuation. Extensive experimental results have shown that our RC method can effectively reduce the NRMSE for rate estimation and σ_PSNR (the standard deviation of all resulting PSNR values) for visual quality consistency compared to the default RC method of the original VTM-5.0. This performance gain comes from the fact that our proposed RC method uses an effective RBE for rate estimation and regulates the bit allocation (BA) process with the estimated bits for VVC. As future work, a deep learning-based long short-term memory (LSTM) model for rate estimation will be studied to further improve RC performance in VVC.