JND-Aware Two-Pass Per-Title Encoding Scheme for Adaptive Live Streaming

Adaptive live video streaming applications utilize a predefined collection of bitrate-resolution pairs, known as a bitrate ladder, for simplicity and efficiency, eliminating the need for additional run-time to determine the optimal pairs during the live streaming session. These applications do not incorporate two-pass encoding methods due to increased latency. However, an optimized bitrate ladder could result in lower storage and delivery costs and improved Quality of Experience (QoE). This paper presents a Just Noticeable Difference (JND)-aware constrained Variable Bitrate (cVBR) Two-pass Per-title encoding Scheme (JTPS) designed specifically for live video streaming. JTPS predicts a content- and JND-aware bitrate ladder using low-complexity features based on Discrete Cosine Transform (DCT) energy and optimizes the constant rate factor (CRF) for each representation using random forest-based models. The effectiveness of JTPS is demonstrated using the open source video encoder x265, with an average bitrate reduction of 18.80% and 32.59% for the same PSNR and VMAF, respectively, compared to the standard HTTP Live Streaming (HLS) bitrate ladder using Constant Bitrate (CBR) encoding. The implementation of JTPS also resulted in a 68.96% reduction in storage space and an 18.58% reduction in encoding time for a JND of six VMAF points.


I. INTRODUCTION
With the rapid growth of online video consumption, the need for a streaming method that can adapt to varying network conditions and device capabilities has become crucial. HTTP Adaptive Streaming (HAS) has emerged as the solution, allowing viewers to enjoy seamless playback of high-quality videos regardless of their internet connection speed or device capabilities [1]. HAS dynamically adjusts the video quality in real time based on the viewer's context conditions (e.g., network and/or device characteristics). It breaks down video content into small segments and serves them through plain HTTP [2]. Optimizing video encoding in streaming enhances the Quality of Experience (QoE) for end users and minimizes the costs for service providers, predominantly in Video on Demand (VoD) scenarios.
In live streaming situations where latency is crucial, video content is typically encoded with a standardized set of encoding parameters without any content-specific optimization. Traditionally, a fixed bitrate ladder, such as the HTTP Live Streaming (HLS) bitrate ladder, is employed for the live streaming session. However, due to the wide range of video content characteristics and network conditions, a content-adaptive approach known as per-title encoding was introduced, which can improve QoE or reduce bitrate, especially for Video-on-Demand (VoD) services [3]. Although per-title encoding schemes [3], [4], [5] improve the quality of video delivery, they have been appropriate only for VoD streaming applications because determining the convex-hull is computationally expensive. According to the Bitmovin Video Developer Report 2022, the biggest problem in video technology today is live (low-latency) streaming. Low-latency video coding optimization strategies are therefore required for live-streaming applications.
Just Noticeable Difference (JND)-aware bitrate ladder prediction improves streaming by optimizing the allocation of bits based on the perceptual thresholds of human vision [6], [7]. It ensures that the available bandwidth is utilized efficiently, focusing on perceptually important areas and reducing bitrate allocation for imperceptible details [8]. This results in higher perceptual video quality within the given bitrate, enhancing the viewing experience and reducing buffering or playback interruptions [9]. Furthermore, cVBR (Constrained Variable Bitrate) encoding schemes are better than the state-of-the-art CBR (Constant Bitrate) schemes used in live streaming, owing to their ability to adapt the bitrate to the complexity of the video content. cVBR maintains a consistent perceptual quality throughout the stream, resulting in a visually pleasing experience for viewers [10].
Fig. 1. The ideal JTPS bitrate ladder targeted in this paper. The red line represents the envisioned RD curve, while the green dotted line indicates the maximum quality level $q_{\max}$. When the quality level is higher than $q_{\max}$, the encoded video stream is considered perceptually lossless. $q_J$ represents the target JND function.
In this light, this paper targets a cVBR two-pass encoding scheme with a content-adaptive, JND-aware, online bitrate ladder prediction optimized for adaptive live streaming applications. The minimum and maximum encoding bitrates ($b_{\min}$ and $b_{\max}$), the maximum quality level ($q_{\max}$), and the target average JND function are considered as inputs to the scheme. Moreover, a priori information, such as the encoder/codec used and the encoding preset, is input to the scheme to ensure that the bitrate ladder is generated for the encoding configuration required by the streaming service provider. Based on the video complexity features and the input parameters, bitrate-resolution-CRF triples are predicted. As shown in Fig. 1, adjacent points of the bitrate ladder are envisioned to have a perceptual quality difference of one JND. Please note that, in this paper, JND is considered a function of VMAF (see footnote 3). In addition to reducing the overall storage needed for the representations, JTPS is expected to improve the overall compression efficiency of the bitrate ladder encoding.
Footnote 3: Other functions can be envisioned and are subject to future work.
In this paper, the main contributions are as follows:
• A low-latency two-pass encoding scheme termed JTPS (JND-aware Two-pass Per-Title Encoding Scheme) is proposed, which includes a content-adaptive, JND-aware, online bitrate ladder prediction for live video streaming applications. An optimized CRF is predicted for each representation for cVBR encoding to achieve the target bitrate with maximum compression efficiency.
• Random forest-based models are designed to predict optimized bitrate-resolution-CRF triples for each video segment using Discrete Cosine Transform (DCT)-energy-based low-complexity spatial and temporal features of every video segment.
• This paper also extends our previous work, OPTE [11] and PPTE [7] CBR encoding, to use random forest models with spatial and temporal features to predict perceptual quality and bitrate, instead of linear regression models. An OPTE cVBR encoding scheme is introduced, which predicts the resolution-CRF pairs that yield the highest perceptual quality for each target bitrate of the bitrate ladder.
• A comprehensive evaluation of JTPS, comparing it with state-of-the-art encoding methods, is presented.
Paper outline: Section II introduces background and related work on per-title encoding, just noticeable difference, and two-pass encoding. In Section III, the proposed scheme (JTPS) is described in detail. In Section IV, the scheme's performance is validated, and the corresponding experimental results are presented. Finally, Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Per-title encoding
Most state-of-the-art per-title encoding methods are based on choosing, for each bitrate range of a title, the resolution that provides the best visual quality [11]. How rate-distortion (RD) characteristics vary across content is illustrated in Fig. 2 for x265 High Efficiency Video Coding (HEVC) [16] encoding. For example, the cross-over bitrate between the 1080p and 2160p resolutions for the RushHour_s000 video segment occurs at around 3.4 Mbps, meaning that, for bitrates lower than 3.4 Mbps, 1080p yields a higher Video Multimethod Assessment Fusion (VMAF) score than 2160p, whereas for bitrates higher than 3.4 Mbps, 2160p outperforms 1080p. By contrast, for the YachtRide_s000 video segment, 1080p provides the best performance throughout the entire bitrate range, indicating that 1080p should be the resolution of choice for the entire bitrate ladder. However, selecting bitrate-resolution pairs from the convex-hull is a challenging task. To determine the optimal per-title bitrate ladder for r resolutions and b bitrates, r × b test encodings are necessary. The literature describes several per-title encoding methods that reduce the number of encodings required to determine the convex-hull. One such approach, developed by Katsenou et al. [14], uses machine learning to identify the most effective bitrate range for each resolution. The method extracts spatiotemporal features and statistics from sequences at their original resolution and then employs machine learning methods to predict the quantization parameters (QPs) at which the rate-distortion curves of the different resolutions intersect. A relatively low number of encodes then needs to be performed to determine the bitrates at which the resolution should be switched. This content-gnostic approach has been claimed to reduce the number of encodings required by 81% to 94% compared to the bruteforce encoding approach. Another method, proposed by Bhat et al. [4], uses machine learning to predict the resolution without requiring multiple encodings. Features from the low-resolution encoding of the first few frames are used to predict the better-performing resolution for a decision period. Zabrovskiy et al. [15] used an artificial neural network to predict an optimized bitrate ladder for each scene, optimized based on the YPSNR quality metric.
Video encoding enhancement solutions have also been proposed in the literature to improve the quality of the video representations (each bitrate-resolution pair) [17] in the bitrate ladder. Amirpour et al. [13] proposed a content-aware per-title encoding approach, DeepStream, to support both CPU-only and GPU-available end users. However, it has the following limitations: (i) improvements are observed only for clients with a GPU, (ii) deep neural networks need to be trained for each representation, which requires significant processing time, and (iii) bruteforce encoding at all resolutions and CRFs is needed to estimate the bitrate ladder, making it unsuitable for real-time live streaming solutions.
Table I shows the target scenario, the bitrate estimation method, the number of pre-encodings needed to determine the convex-hull, and the encoding type of the state-of-the-art methods. The bruteforce method [3] and DeepStream [13] need to encode the video content r × c times, where r and c denote the number of resolutions supported by the streaming service provider and the number of CRFs supported by the encoder, respectively. The pre-analysis method proposed in [14] needs to encode the video (r − 1) × 2 times. Moreover, it uses constant quantization parameter (CQP) encodes, which are not used in real-time streaming applications. FAUST [15] and the method proposed in [4] need a low-resolution encoding to extract features that are input to artificial neural network and random forest models, respectively, to predict the convex-hull for CBR encoding. As a result, these methods produce latency significantly higher than the accepted latency in live streaming. Our previous works, OPTE [11] and PPTE [7], predict an optimized bitrate ladder for CBR encoding without any pre-encoding step and hence add no latency to streaming. They use simple linear regression models to predict bitrate-resolution pairs. Per-title encoding methods have also been developed in the industry by Bitmovin, Brightcove [18], MUX, and CAMBRIA. However, they are proprietary; hence, information about them is limited.
Fig. 3. RD curve of HLS CBR encoding of the Characters_s000 video sequence (segment) of the VCD dataset [12] using the x265 HEVC encoder at the ultrafast preset. Consequently, there is significant storage waste when these representations are stored.
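As a rough illustration of the pre-encoding counts quoted above, take the seven resolutions used later in the evaluation ($r = 7$) and assume integer CRF steps over the x265 range of 0 to 51 ($c = 52$). The integer-step assumption is made here only for illustration and is not stated in the text.

```latex
\begin{align}
  \text{bruteforce [3], DeepStream [13]:}\quad & r \times c = 7 \times 52 = 364 \ \text{encodings per segment},\\
  \text{pre-analysis [14]:}\quad & (r - 1) \times 2 = 6 \times 2 = 12 \ \text{encodings per segment}.
\end{align}
```

Methods without a pre-encoding step, such as OPTE and PPTE, correspond to zero additional encodings in this accounting.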

B. Just Noticeable Difference (JND)
Weber's law [19] introduced the notion of Just Noticeable Difference (JND) as the change in a threshold value required to detect a difference [20]. In visual perception, JND refers to the slightest distinguishable difference between two levels of sensory stimulus [21]. Additionally, JND represents the maximum tolerable level of distortion for the Human Visual System (HVS) when perceiving videos. Extensive research has been conducted on JND, and several surveys have been published [8], [22], [23], [24], [25], [26]. By utilizing JND in video coding, referred to as perceptual coding, the encoding bitrate can be reduced while still guaranteeing a certain level of video quality, or distortion can be minimized within a specific bitrate constraint. Furthermore, removing perceptually redundant information based on JND levels can lead to additional compression gains over traditional video coding methods [6]. For instance, Fig. 3 shows the selected bitrate-resolution pairs and their VMAF scores for the Characters_s000 sequence using the HLS bitrate ladder. Multiple representations have the same video quality, and some have similar quality at mid-range bitrates. Choosing representations with similar quality does not enhance QoE, but it increases storage and bandwidth expenses [9].
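The redundancy described above can be made concrete with a simple greedy filter that keeps a representation only if its quality is at least one JND above the last kept one. This is a minimal sketch for illustration, not the prediction method proposed in this paper; the function name `prune_by_jnd` and the sample (bitrate, VMAF) values in the usage note are hypothetical.

```python
def prune_by_jnd(representations, jnd):
    """Keep only representations that are at least one JND apart in VMAF.

    representations: list of (bitrate_mbps, vmaf) tuples sorted by bitrate.
    jnd: constant JND threshold in VMAF points (an assumption; the paper
         treats the JND as a function of VMAF).
    """
    kept = []
    for bitrate, vmaf in representations:
        # The first point is always kept; later points must add >= one JND.
        if not kept or vmaf - kept[-1][1] >= jnd:
            kept.append((bitrate, vmaf))
    return kept
```

For example, calling `prune_by_jnd` on a hypothetical ladder whose neighbouring points differ by only one or two VMAF points, with `jnd=6`, drops those near-duplicate points and leaves a shorter ladder to store and deliver.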

C. Two-pass encoding
Most streaming service providers employ the CBR rate-control mode to encode live videos. CBR's consistency in achieved bitrate makes it more dependable for time-sensitive data delivery, especially in live streaming applications. Bitrate overshoot, or the encoded bitrate exceeding internet speeds, is not a concern because CBR-encoded videos are streamed consistently. However, this method's dependability sometimes necessitates sacrificing compression efficiency. In contrast, VoD applications utilize Variable Bitrate (VBR), where video segments are encoded according to their content complexity to optimize the transmission at the expense of adding a preprocessing stage to evaluate the content complexity of the video segments (two-pass encoding). As shown in Fig. 4, the input data from the video is analyzed (and stored in a log file) in the first-pass of two-pass encoding. The collected data from the first-pass is used to achieve the best encoding compression efficiency in the second-pass. During the second-pass encoding, bitrate is allocated among segments based on content complexity such that the average bitrate remains constant. This fluctuating characteristic makes VBR best suited for VoD applications [10].
Two-pass encoding has been the de-facto solution for distributing bits effectively and improving compression efficiency in VoD applications. Other than the previously discussed pre-analysis methods, some schemes involve encoding the same content twice to adapt the encoding parameters per title. Que et al. [27] proposed a two-pass VBR method for Advanced Video Coding (AVC) [28]. The first-pass uses CBR encoding to gather encoding statistics, while offline processing is used in the second-pass to detect scene-cuts, precisely allocate target bits, and determine the quantization parameter for each frame. Zupancic et al. [29] utilized a fast encoder with a condensed set of coding tools in the first-pass to collect data for rate allocation and model parameter initialization during the second-pass. Wang et al. [30] proposed a two-pass VBR control for HEVC, motivated by structural similarity (SSIM), that allocates available bits at the group of pictures (GOP), frame, and coding unit (CU) levels to create a perceptually uniform space. Since the two-pass encoding method generally involves processing all segments twice, the overall encoding time increases two-fold, which adds streaming latency. Hence, these schemes are not used for live video streaming.
Constrained Variable Bitrate (cVBR) is the most widely used type of two-pass Variable Bitrate encoding [10]. This encoding scheme involves setting a maximum bitrate and buffer window, requiring two encoding passes to complete the process. The target bitrate cannot be specified in Constant Rate Factor (CRF) rate control mode, so the information from the first-pass is used to determine the optimized CRF that achieves the target bitrate. During the second-pass, the video segment is encoded with the selected optimized CRF while maintaining the maximum bitrate and buffer window constraints. This results in reaching the desired target bitrate with maximum compression efficiency. In terms of computational costs, CBR encoding generally incurs lower computational costs due to its fixed and predictable bitrate allocation. cVBR encoding, with its complexity analysis and bitrate adjustments, requires more computational resources. However, the specific computational cost can vary depending on factors such as video resolution, content complexity, and hardware capabilities.
Another popular method of two-pass encoding is to extract video complexity features in the first-pass and use them to predict encoding parameters for the second-pass. Low-complexity features must be chosen in live streaming applications to guarantee uninterrupted low-latency video streaming. An intuitive method for feature extraction would be to utilize Convolutional Neural Networks (CNNs). However, CNN-based feature extraction would not be effective as it lacks temporal motion information, which is crucial for video complexity detection and subsequent bitrate-ladder prediction. Architectures such as 3D-CNN [31] or Conv-LSTM [32], [33] could be alternatives to accommodate the temporal motion information present in the video stream. However, such models have several inherent disadvantages, such as higher training time, inference time, and storage requirements (to deploy the prediction models in real time), which make them impractical in live streaming applications. Although CNN-based approaches could result in rich features, simpler models that yield a significant prediction performance are more suitable for live video streaming. The popular state-of-the-art video complexity features are Spatial Information (SI) and Temporal Information (TI) [34]. The rate of SI and TI feature extraction from 2160p resolution videos is observed to be around five frames per second, which is insufficient for low-latency streaming applications [35].
To summarize, most related works on per-title and two-pass encoding yield latency unsuitable for live-streaming applications. Machine learning-based methods in the literature are too complex and storage-heavy; hence, they are not suitable for real-time deployments. To overcome these problems, this paper proposes a low-latency pre-processing step as the first-pass, which analyzes the video segment's complexity to predict an optimized encoding bitrate ladder.

III. JND-AWARE TWO-PASS PER-TITLE ENCODING SCHEME (JTPS)
The architecture of the proposed JTPS scheme for live video streaming applications is shown in Fig. 5. For each segment of the input video sequence, the JND-aware bitrate ladder is determined so that adjacent RD points of the bitrate ladder have a perceptual quality difference of one JND. Per-segment prediction is motivated by the fairly uniform spatiotemporal content of the frames within a segment [1]. The bitrate ladder is predicted using video complexity features (i.e., the E, h, and L features explained in Section III-A) extracted for every segment, the set of pre-defined resolutions (R), the minimum and maximum target bitrates (i.e., $b_{\min}$ and $b_{\max}$), the average JND function ($v_J$), and the maximum VMAF ($v_{\max}$) of the bitrate ladder. This paper assumes that VMAF is the optimal measure of perceptual quality. To ensure that the predicted bitrates and VMAF values align with the preferences of the streaming service provider, JTPS takes $b_{\min}$, $b_{\max}$, and $v_{\max}$ as inputs. By considering $b_{\max}$ and $v_{\max}$, JTPS can be adjusted to optimize the number of representations in the bitrate ladder. Additionally, the input R ensures that only the resolutions supported by the streaming service provider are selected for the encoding set. The process starts by predicting the VMAF corresponding to $b_{\min}$. The VMAF scores for the remaining representations are calculated by incrementing the previous VMAF in the bitrate ladder by one JND until either $b_{\max}$ or $v_{\max}$ is reached. These VMAF values are then used to predict the corresponding bitrate-resolution pairs. Additionally, an optimized CRF is predicted to achieve maximum compression efficiency for the cVBR encoding of the selected bitrate-resolution pairs. For each segment, the encoding process is performed exclusively for the predicted perceptually aware bitrate-resolution-CRF triples. In this manner, compression efficiency is improved over traditional fixed bitrate ladder and CBR encoding schemes while decreasing storage and, consequently, content delivery network (CDN) costs.
JTPS is divided into four phases (cf. Fig. 5; the first-pass comprises the first three phases and the second-pass comprises the fourth phase): (i) video complexity feature extraction (Section III-A), (ii) perceptually-optimized bitrate ladder prediction (Section III-B), (iii) optimized CRF prediction for the selected bitrate-resolution pairs (Section III-C), and (iv) cVBR encoding of the segments using the predicted bitrate-resolution-CRF triples.
Optimized bitrate prediction and CRF prediction are separated into two distinct prediction modules for better interpretability and control over the prediction process. The first module derives the target resolution ($\hat{r}_t$) and the upper limit for the instantaneous bitrate ($\hat{b}_t$), while the second module derives the CRF parameter ($\hat{c}_t$) based on $(\hat{r}_t, \hat{b}_t)$. The two-module approach is advantageous because it allows different aspects of the problem to be modeled and optimized explicitly.

A. Video Complexity Feature Extraction
In this paper, three DCT-energy-based features, (i) the average texture energy E, (ii) the average gradient of the texture energy h, and (iii) the average luminescence L, are used as the spatial and temporal complexity measures [11], [35]. The feature extraction method was proposed in our previous work [35] and is included here to keep the paper self-contained.
The texture of every non-overlapping block k in each frame f is determined using a DCT-based energy function (Eq. 1), where w × w pixels is the size of the block and $DCT(i, j)$ is the $(i, j)$-th DCT component when $i + j > 0$, and 0 otherwise [36]. To determine the spatial energy feature per segment, denoted as E, the texture is averaged over all blocks and frames (Eq. 2), where K represents the number of blocks per frame and F denotes the number of frames per segment. Furthermore, the block-wise sum of absolute differences (SAD) of the texture energy of each frame compared to its previous frame is computed and then averaged per segment to obtain the average temporal energy h (Eq. 3). The luminescence of each non-overlapping block k of each frame f is defined in terms of $DCT(0, 0)$, the DC component of the DCT (Eq. 4). The block-wise luminescence is averaged per segment to obtain L (Eq. 5).
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Please note that E and L represent the spatial characteristics of the video segment, while h represents the temporal characteristic, which are used in the following steps to predict the encoding bitrate ladder.
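The feature definitions can be prototyped in a few lines. The sketch below is a minimal, unoptimized illustration in pure Python. It assumes the formulation commonly given in the cited prior work [35], namely a block texture $\sum_{i,j} e^{|(ij/w^2)^2 - 1|}\,|DCT(i,j)|$ over the AC coefficients, a block luminescence $\sqrt{DCT(0,0)}$, and a normalization of each average by $w^2$. The exponential weighting, the square root, the $w^2$ normalization, the orthonormal DCT scaling, the block size `w=4`, and all function names are assumptions made for illustration and are not taken from the text above.

```python
import math

def dct2(block):
    """Naive orthonormal 2-D DCT-II of a square block (list of lists)."""
    w = len(block)
    scale = [math.sqrt(1.0 / w)] + [math.sqrt(2.0 / w)] * (w - 1)
    out = [[0.0] * w for _ in range(w)]
    for i in range(w):
        for j in range(w):
            s = 0.0
            for x in range(w):
                for y in range(w):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * i * math.pi / (2 * w))
                          * math.cos((2 * y + 1) * j * math.pi / (2 * w)))
            out[i][j] = scale[i] * scale[j] * s
    return out

def block_texture_and_luma(block):
    """Return (texture, luminescence) of one block under the assumed forms."""
    w = len(block)
    coeffs = dct2(block)
    texture = 0.0
    for i in range(w):
        for j in range(w):
            if i + j > 0:  # the DC term is excluded, as stated in the text
                weight = math.exp(abs((i * j / w ** 2) ** 2 - 1))  # assumed
                texture += weight * abs(coeffs[i][j])
    luma = math.sqrt(abs(coeffs[0][0]))  # assumed; abs() only guards sqrt
    return texture, luma

def split_blocks(frame, w):
    """Split a frame (list of pixel rows) into non-overlapping w x w blocks."""
    rows, cols = len(frame), len(frame[0])
    return [[row[c:c + w] for row in frame[r:r + w]]
            for r in range(0, rows - w + 1, w)
            for c in range(0, cols - w + 1, w)]

def segment_features(frames, w=4):
    """Compute (E, h, L) for a segment given as a list of luma frames."""
    per_frame = [[block_texture_and_luma(b) for b in split_blocks(f, w)]
                 for f in frames]
    n_frames, n_blocks = len(per_frame), len(per_frame[0])
    norm = n_blocks * w ** 2
    E = sum(t for fr in per_frame for t, _ in fr) / (n_frames * norm)
    L = sum(l for fr in per_frame for _, l in fr) / (n_frames * norm)
    h = 0.0
    if n_frames > 1:
        # Block-wise absolute texture difference between consecutive frames.
        sad = sum(abs(cur[k][0] - prev[k][0])
                  for prev, cur in zip(per_frame, per_frame[1:])
                  for k in range(n_blocks))
        h = sad / ((n_frames - 1) * norm)
    return E, h, L
```

A segment of identical frames yields `h == 0`, and a completely flat frame yields a texture energy `E` of essentially zero, which matches the intended reading of E and L as spatial measures and h as a temporal measure. A production implementation would use a fast DCT and larger blocks.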

B. Perceptually-Optimized Bitrate Ladder Prediction
The JND-aware bitrate ladder prediction method is presented in Algorithm 1 and comprises two steps.
Step 1: The perceptual quality, measured by VMAF, is modeled as a function of the features E, h, and L, the resolution r, and the target bitrate b, which can be expressed as $v_{r,b} = f(E, h, L, r, b)$ [37]. The first point in the bitrate ladder is determined by predicting VMAF for all resolutions $r \in R$ at $\hat{b}_1 = b_{\min}$ (as shown in Fig. 6) using the VMAF prediction models. From the predicted VMAF values (i.e., the $\hat{v}_{r,\hat{b}_1}$ values) for the different resolutions, the resolution with the highest VMAF value, $\hat{r}_1$, is selected to correspond to the bitrate $\hat{b}_1$. This results in the first point of the bitrate ladder being $(\hat{r}_1, \hat{b}_1)$.
Fig. 6 (caption fragment): [...] values output from the prediction models trained for resolutions $r_1, \ldots, r_M$. The resolution corresponding to the VMAF $\hat{v}_1$ is chosen as $\hat{r}_1$.
Step 2: For every subsequent point in the bitrate ladder (t > 1), the target VMAF is set to $\hat{v}_t = \hat{v}_{t-1} + v_J(\hat{v}_{t-1})$, i.e., one JND above the previous point. Bitrate is modeled as a function of the E, h, and L features, the resolution r, and the target VMAF v, i.e., $b_{r,v} = f(E, h, L, r, v)$. The target bitrate $\hat{b}_{r,\hat{v}_t}$ required to achieve the VMAF $\hat{v}_t$ is determined for each resolution in R (refer to Fig. 7). The minimum value of $\hat{b}_{r,\hat{v}_t}$ across all resolutions is taken as $\hat{b}_t$ for the bitrate ladder, and the resolution corresponding to this minimum is chosen as $\hat{r}_t$. This process is repeated until $\hat{b}_t$ is greater than or equal to $b_{\max}$ or $\hat{v}_t$ is greater than or equal to $v_{\max}$.
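The two steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: `predict_vmaf` and `predict_bitrate` are hypothetical callables standing in for the trained per-resolution models, and clamping the final point to $b_{\max}$ and $v_{\max}$ is an assumption, since the text only states the stopping condition.

```python
def predict_ladder(features, resolutions, b_min, b_max, v_max, jnd,
                   predict_vmaf, predict_bitrate):
    """Return a list of (resolution, bitrate, target_vmaf) ladder points.

    predict_vmaf(features, r, b)    -> predicted VMAF (hypothetical model)
    predict_bitrate(features, r, v) -> predicted bitrate (hypothetical model)
    jnd(v)                          -> JND step at quality level v
    """
    # Step 1: at b_min, choose the resolution with the highest predicted VMAF.
    r = max(resolutions, key=lambda res: predict_vmaf(features, res, b_min))
    v = predict_vmaf(features, r, b_min)
    b = b_min
    ladder = [(r, b, v)]

    # Step 2: add points one JND apart until b_max or v_max is reached.
    while b < b_max and v < v_max:
        step = jnd(v)
        if step <= 0:
            raise ValueError("JND step must be positive")
        v = min(v + step, v_max)  # clamping is an assumption of this sketch
        # Choose the resolution that needs the lowest bitrate for this VMAF.
        r = min(resolutions, key=lambda res: predict_bitrate(features, res, v))
        b = min(predict_bitrate(features, r, v), b_max)
        ladder.append((r, b, v))
    return ladder
```

Each iteration costs one bitrate inference per resolution, which is why the number of ladder points and the number of resolutions both appear in the processing-time discussion that follows.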
Implementation of Prediction Models: The prediction models are trained for each resolution supported by the streaming service provider, which ensures the scalability of the design: a new resolution can be added to the framework without retraining all models. In this paper, the following prediction models are used and compared for their prediction accuracy in terms of R² score and Mean Absolute Error (MAE): (i) linear regression [38], (ii) XGBoost [39], and (iii) random forest regression [40]. A random forest regressor is an ensemble regression model that uses a randomly selected subset of training samples and variables to train multiple decision trees in parallel, commonly known as bagging. The predictions of all decision trees in the ensemble are combined to obtain the final prediction.
Table II shows the results of the VMAF and log(b) prediction using the models mentioned above for 2160p resolution with the default hyper-parameters of the models (see footnote 14). The R² score is the highest and the MAE is the lowest for the random forest models. Moreover, random forest models exhibit lower overfitting than XGBoost and are faster, as the decision trees run in parallel (courtesy of distributed computing-based approaches). Hence, this paper uses random forest models for VMAF and log(b) prediction for each resolution. Please note that training models for each resolution ensures scalability, as more resolutions can be added to the JTPS architecture in the future with minimal retraining. Hyper-parameter tuning is performed on the prediction models of 2160p to obtain a balance between model size and performance. The selected hyperparameters (see footnote 14) for the VMAF and log(b) prediction models are min_samples_leaf=1, min_samples_split=2, n_estimators= 00, and max_depth=14.
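For reference, the two accuracy measures used in the model comparison have the following standard definitions, where $y_i$ is the measured value (VMAF or log(b)), $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the measured values, and $n$ the number of test samples. The notation is introduced here for illustration.

```latex
\begin{align}
  \mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \\
  R^2 &= 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
                  {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}.
\end{align}
```

A model that always predicts $\bar{y}$ scores $R^2 = 0$, and a perfect model scores $R^2 = 1$ with $\mathrm{MAE} = 0$, so a higher $R^2$ together with a lower MAE indicates the better predictor.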
Footnote 14: https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees, last access: May 30, 2023.
The total processing time of the bitrate ladder prediction algorithm, $\tau_B$, is given by Eq. 6 in terms of r and N, which denote the number of resolutions in R and the number of points in the bitrate ladder, respectively, $\tau_{vp}$, which denotes the inference time of the VMAF prediction models, and $\tau_{bp}$, which represents the inference time of the bitrate prediction models. The amount of memory required to store the models for bitrate ladder prediction, $s_B$, is given by Eq. 7 in terms of $s_{vp}^{r}$, the size of the VMAF prediction model trained for the resolution r, and $s_{bp}^{r}$, the size of the bitrate prediction model trained for the resolution r.
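A plausible form of these two expressions can be derived from the two steps of the ladder prediction, under the assumption that the per-resolution models are evaluated sequentially: Step 1 runs one VMAF inference per resolution, and each of the remaining $N - 1$ points runs one bitrate inference per resolution. The memory requirement is the sum of the per-resolution model sizes; with the sizes reported in Section IV-A (58 MB and 53 MB per resolution, seven resolutions), it reproduces the stated total. This is a sketch consistent with the surrounding definitions and not necessarily the exact published form.

```latex
\begin{align}
  \tau_B &= r \cdot \tau_{vp} + (N - 1) \cdot r \cdot \tau_{bp}, \\
  s_B &= \sum_{r \in R} \left( s_{vp}^{r} + s_{bp}^{r} \right)
       = 7 \times (58 + 53)\ \text{MB} = 777\ \text{MB}.
\end{align}
```

If the per-resolution models are instead evaluated in parallel, the factor $r$ in the time expression would shrink accordingly.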

C. Optimized CRF Prediction
For HAS, it is essential to avoid exceeding the maximum bitrates specified in the HLS/DASH manifests [2] during the encoding process. Failure to adhere to these limits can lead to buffer overflows or underflows in video players. Therefore, accurately predicting the CRF is of utmost importance. In this paper, CRF is predicted instead of the quantization parameter (QP) since it simplifies the encoding workflow by eliminating the need to manually set and adjust the QP for each frame. Once the bitrate ladder is determined, the optimized CRF $\hat{c}_t$ is estimated for every $(\hat{r}_t, \hat{b}_t)$. CRF c is modeled as a function of the features E, h, and L, the resolution r, and the target bitrate b, i.e., $c_{r,b} = f(E, h, L, r, b)$. A prediction model is trained for each resolution r, which determines $\hat{c}_t$ based on E, h, L, and $\log(\hat{b}_t)$ for each video segment, as shown in Fig. 8. The minimum and maximum CRF ($c_{\min}$ and $c_{\max}$, respectively) are chosen based on the target video encoder. For example, the x264 AVC [28] encoder and the x265 HEVC [16] encoder support a CRF range between 0 and 51. SVT-AV1, an AV1 [41] encoder, supports a CRF range between 1 and 63.
Implementation of Prediction Models: As for the bitrate prediction, a linear regression model [38], XGBoost [39], and a random forest regression model [40] are tested for their prediction accuracy in terms of R² score and MAE. As shown in Table III, the R² score is the highest and the MAE is the lowest for the random forest models. Furthermore, random forest models for CRF prediction for every resolution exhibit a lower tendency to overfit and can utilize distributed computing for faster training and prediction. Hyper-parameter tuning is performed on the prediction model of 2160p to obtain a balance between model size and performance. The selected hyperparameters (see footnote 14) are min_samples_leaf=1, min_samples_split=2, n_estimators=100, and max_depth=14. Since the output of the prediction model is a floating-point value, the fractional part is truncated so that the result is an integer.
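The post-processing of the model output can be sketched as follows. Truncation follows the description above; clamping to the encoder's CRF range and the use of the natural logarithm for log(b) are assumptions of this sketch, and `model` is any hypothetical object exposing a scikit-learn style `predict()` method.

```python
import math

def crf_from_prediction(raw_crf, c_min=0, c_max=51):
    """Truncate a floating-point CRF prediction and clamp it to [c_min, c_max].

    The defaults match the 0 to 51 range quoted for x264 and x265.
    """
    truncated = math.trunc(raw_crf)  # drop the fractional part
    return max(c_min, min(c_max, truncated))

def predict_crf(model, E, h, L, bitrate_mbps, c_min=0, c_max=51):
    """Predict an integer CRF for one representation of one segment.

    model: hypothetical per-resolution regressor with predict(rows) -> list.
    The feature order [E, h, L, log(b)] is an assumption of this sketch.
    """
    row = [E, h, L, math.log(bitrate_mbps)]
    raw = model.predict([row])[0]
    return crf_from_prediction(raw, c_min, c_max)
```

For SVT-AV1, the same helper would be called with `c_min=1` and `c_max=63`, following the range quoted in the text.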
The total processing time of the CRF prediction, $\tau_C$, is given by Eq. 8 in terms of $\tau_{cp}$, which denotes the inference time of the CRF prediction models. The amount of memory required to store the models for CRF prediction, $s_C$, is given by Eq. 9 in terms of $s_{cp}^{r}$, which denotes the size of the CRF prediction model trained for resolution r.
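Under the assumption that one CRF inference is run for each of the $N$ points of the bitrate ladder, a plausible form of these two expressions is given below. The memory total is checked against the per-resolution model size reported in Section IV-A (57 MB, seven resolutions). This is a sketch consistent with the surrounding definitions and not necessarily the exact published form.

```latex
\begin{align}
  \tau_C &= N \cdot \tau_{cp}, \\
  s_C &= \sum_{r \in R} s_{cp}^{r}
       = 7 \times 57\ \text{MB} = 399\ \text{MB} \approx 400\ \text{MB}.
\end{align}
```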

IV. EVALUATION

A. Test Methodology
The Video Complexity Dataset [12] is used to validate the performance of the encoding schemes considered in this paper. The dataset needed to train and test the prediction models is generated as shown in Algorithm 2. The E, h, and L features are extracted using the VCA v2.0 open-source video complexity analyzer [35]. The sequences are encoded at 30 fps using x265 v3.5 with the ultrafast preset on a dual-processor encoding server with Intel Xeon Gold 5218R (80 cores, frequency at 2.10 GHz). VCA and x265 are run using a single thread with only x86 SIMD optimization [42] to compare the time complexity of the considered schemes. The resolutions specified in the Apple HLS authoring specifications are considered in the evaluation, i.e., R = {360p, 432p, 540p, 720p, 1080p, 1440p, 2160p}. The total memory used to store the VMAF and bitrate prediction models for bitrate ladder prediction, $s_B$ (cf. Eq. 7), is 777 MB, i.e., 58 MB for the VMAF prediction model and 53 MB for the bitrate prediction model of each resolution. The total memory used to store the CRF prediction models, $s_C$ (cf. Eq. 9), is 400 MB, i.e., 57 MB for the CRF prediction model of each resolution. To check the generalization of the models, 5-fold cross-validation is performed, and the values are averaged over all folds. Since these values are similar, we assume that the models generalize well. It was ensured that the training set does not include any segments from the scenes in the test set. For a target bitrate of $b_t$ (in Mbps), CBR encoding is achieved by setting the bitrate and vbv-maxrate options accordingly and enabling the strict-cbr flag. Similarly, for a target bitrate of $b_t$ (in Mbps) and CRF $c_t$, cVBR encoding is achieved by setting the crf option of x265 to $c_t$ and the vbv-maxrate option to $b_t$. This paper considers the following encoding schemes for comparison with JTPS:
• Bruteforce bitrate ladder encoding [3], where the bitrate-resolution-CRF triples are determined by encoding videos using all CRFs supported by x265 for all resolutions. The representations are chosen such that adjacent representations differ in VMAF by one target JND.
• HLS CBR encoding, which is the CBR encoding of the HLS bitrate ladder.
• OPTE [11] CBR encoding, where the optimized resolution is predicted for the set of bitrates in the HLS bitrate ladder, as shown in Fig. 6. In [11], linear regression models were used to predict VMAF based on the E and h features. For the evaluation in this paper, the method is extended by using random forest models trained to predict VMAF (for all resolutions in $R$) based on the E, h, and L features using the CBR encoding dataset (cf. Algorithm 2).
• PPTE [7], where optimized bitrate-resolution pairs are predicted for JND-aware CBR encoding, as shown in Algorithm 1. In [7], linear regression models were used to predict VMAF and bitrate based on the E and h features. For the evaluation in this paper, the method is extended by using random forest models trained to predict VMAF and bitrate (for all resolutions in $R$) based on the E, h, and L features using the CBR encoding dataset (cf. Algorithm 2).
• HLS cVBR encoding, where the optimized CRF is predicted for the bitrate-resolution pairs of the HLS bitrate ladder, as shown in Fig. 8. Random forest models are trained to predict CRF (for all resolutions in $R$) using the cVBR encoding dataset (cf. Algorithm 2).
• OPTE cVBR encoding, where the optimized CRF is predicted along with the optimized resolution for the set of bitrates in the HLS bitrate ladder for cVBR encoding. This scheme predicts VMAF for all resolutions in $R$ for a given set of target bitrates. The resolution that yields the maximum VMAF is chosen as the optimized resolution for the given target bitrate, as shown in Fig. 6. Random forest models are trained to predict VMAF and CRF (for all resolutions in $R$) using the cVBR encoding dataset (cf. Algorithm 2).

For PPTE and JTPS encoding, the parameters $b_{min}$ and $b_{max}$ are set to 0.145 Mbps and 16.8 Mbps, respectively, to compare with the HLS bitrate ladder. The average target JND function ($v_J$) is set to two [26], four, and six VMAF points, based on current industry practices. Accordingly, $v_{max}$ is set to 98, 96, and 94, respectively, to comply with the target JND value.
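The CBR and cVBR rate-control settings described above can be sketched as x265 command-line arguments. This is a minimal illustration, not the authors' encoding script: the helper names are hypothetical, and the `--vbv-bufsize` value (one second of video at the maximum rate) is an assumption, since the text does not state the buffer size that was used.

```python
def x265_cbr_args(bitrate_mbps, preset="ultrafast"):
    """Rate-control arguments for strict CBR at the target bitrate b_t."""
    kbps = round(bitrate_mbps * 1000)  # x265 takes bitrates in kbps
    return [
        "--preset", preset,
        "--bitrate", str(kbps),
        "--vbv-maxrate", str(kbps),
        "--vbv-bufsize", str(kbps),  # assumption: buffer size is not given in the text
        "--strict-cbr",
    ]


def x265_cvbr_args(crf, max_bitrate_mbps, preset="ultrafast"):
    """Rate-control arguments for constrained VBR: CRF c_t capped at b_t."""
    kbps = round(max_bitrate_mbps * 1000)
    return [
        "--preset", preset,
        "--crf", str(crf),
        "--vbv-maxrate", str(kbps),
        "--vbv-bufsize", str(kbps),  # assumption: buffer size is not given in the text
    ]
```

The returned lists can be appended to an `x265` invocation together with the usual input and output arguments.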
First, the pre-processing time ($\tau_p$), i.e., the encoding latency introduced by video complexity feature extraction and by the inference of the models that predict the optimized bitrate-resolution-CRF triples, is determined to evaluate the first-pass encoding time. For the state-of-the-art methods [3], [13], [14], [15], [43], $\tau_p$ is the pre-encoding time. The additional computational time overhead to determine the convex-hull, $\Delta T_C$, is reported as a ratio to the sum of encoding times of all representations in the reference bitrate ladder encoding:

$$\Delta T_C = \frac{\tau_p}{t_{ref}}$$

where $t_{ref}$ is the sum of encoding times of all representations in the reference bitrate ladder encoding. Second, the VMAF, log(b), and CRF prediction models are assessed in terms of prediction accuracy using the coefficient of determination ($R^2$) score and the Mean Absolute Error (MAE) with respect to the ground truth values. The achieved VMAF, bitrate, and CRF recorded in the cVBR encoding dataset are the ground truth values. Third, the relative importance of the features used is evaluated using the SHapley Additive exPlanations (SHAP) values [44].
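The two accuracy metrics named above follow their standard definitions and can be written in a few lines. The sketch below is a plain-Python illustration and does not reproduce the evaluation code of the paper; the sample values are made up for demonstration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between ground truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical ground-truth and predicted VMAF values (illustration only).
vmaf_true = [90.0, 80.0, 70.0, 60.0]
vmaf_pred = [88.0, 82.0, 71.0, 59.0]
mae = mean_absolute_error(vmaf_true, vmaf_pred)
r2 = r2_score(vmaf_true, vmaf_pred)
```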
The encoding schemes' rate-distortion (RD) curves are analyzed for selected video sequences (segments) of various video content complexities. Bjøntegaard delta rates [45] $BDR_P$ and $BDR_V$ refer to the average increase in bitrate of the representations compared with that of the reference bitrate ladder encoding scheme to maintain the same PSNR and VMAF, respectively. A negative $BDR$ suggests a boost in the coding efficiency of the considered encoding scheme compared to the reference bitrate ladder encoding scheme. BD-PSNR and BD-VMAF refer to the average increase in PSNR and VMAF, respectively, at the same bitrate compared with the reference bitrate ladder encoding scheme. A positive BD-PSNR and BD-VMAF denote an increase in the coding efficiency of the considered encoding scheme compared to the reference bitrate ladder encoding scheme.
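To make the meaning of a Bjøntegaard delta rate concrete, the sketch below averages the log-bitrate gap between two RD curves over their common quality range. It is a simplified variant: it interpolates the curves piecewise-linearly, whereas the Bjøntegaard method [45] is normally implemented with a polynomial fit, so the numbers it produces can differ from those of a standard BD-rate tool. The function names are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Linear interpolation on a curve whose xs are strictly increasing."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Mean log10(bitrate) over the quality interval [lo, hi] (trapezoidal rule)."""
    pts = sorted(points, key=lambda p: p[1])
    quality = [q for _, q in pts]
    log_rate = [math.log10(b) for b, _ in pts]
    xs = [min(hi, lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    ys = [_interp(x, quality, log_rate) for x in xs]
    area = sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(steps))
    return area / (hi - lo)


def bd_rate(ref_points, test_points):
    """Average bitrate change (%) of test vs. reference at equal quality.

    Each point is (bitrate, quality), with quality in PSNR or VMAF units.
    A negative result means the test scheme needs fewer bits.
    """
    lo = max(min(q for _, q in ref_points), min(q for _, q in test_points))
    hi = min(max(q for _, q in ref_points), max(q for _, q in test_points))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")
    diff = _mean_log_rate(test_points, lo, hi) - _mean_log_rate(ref_points, lo, hi)
    return (10 ** diff - 1) * 100
```

As a sanity check, a test curve that reaches every quality level at exactly half the reference bitrate gives a delta rate of about -50%.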
The relative difference in the storage space needed to store all bitrate ladder representations of the considered encoding scheme ($\Delta S$) is also evaluated as:

$$\Delta S = \frac{b_{opt}}{b_{ref}} - 1$$

where $b_{ref}$ and $b_{opt}$ represent the sum of bitrates of all representations in the reference bitrate ladder encoding and the bitrate ladder encoding using the considered encoding scheme, respectively. Similarly, the relative difference in the encoding time of the considered encoding scheme ($\Delta T$) is also evaluated as:

$$\Delta T = \frac{t_{opt}}{t_{ref}} - 1$$

where $t_{ref}$ and $t_{opt}$ represent the sum of encoding times of all representations in the reference bitrate ladder encoding and the bitrate ladder encoding using the considered encoding scheme, respectively.
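As a small worked illustration of these two measures, the sketch below assumes that the relative difference is the cumulative value of the considered scheme divided by that of the reference, minus one, so that negative values denote savings. The ladders and their numbers are hypothetical.

```python
def relative_difference(opt_values, ref_values):
    """Relative change of the cumulative value (bitrate or encoding time).

    Assumed form: sum(opt) / sum(ref) - 1; a negative result is a saving.
    """
    return sum(opt_values) / sum(ref_values) - 1.0


# Hypothetical per-representation bitrates in Mbps (illustration only).
ref_bitrates = [1.0, 2.0, 5.0]
opt_bitrates = [0.8, 1.2, 2.0]
delta_s = relative_difference(opt_bitrates, ref_bitrates)

# Hypothetical per-representation encoding times in seconds.
ref_times = [2.0, 3.0, 5.0]
opt_times = [3.0, 4.0, 5.0]
delta_t = relative_difference(opt_times, ref_times)
```

In this made-up case the optimized ladder halves the storage requirement while its encoding time grows by a fifth.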

B. Experimental Results
This section presents the results of JTPS. The pre-processing time ($\tau_p$), i.e., the sum of the feature extraction time and the inference time of the prediction models, is evaluated. The E, h, and L features are extracted at an average speed of 44 frames per second over the entire dataset, i.e., for a segment of four-second duration, features are extracted in 2.71 s. The average inference time of the random forest models for the bitrate ladder and CRF prediction ($\tau_{vp}$, $\tau_{bp}$, and $\tau_{cp}$) is 5 ms. Hence, $\tau_p$ is 2.72 s. As observed in Fig. 9, $\tau_p$ decreases as the video resolution ($r_{max}$) decreases. The inference time of the prediction models does not change; however, the feature extraction time reduces considerably as the resolution decreases. In real-time applications, video complexity feature extraction and the encoding bitrate ladder prediction can be executed as concurrent processes, using multi-threading optimizations. As an example, $\tau_p$ is reduced to 0.35 s when eight CPU threads are used for feature extraction. As shown in Table I, the state-of-the-art methods have pre-encoding steps to determine the convex-hull, making them unsuitable for live streaming applications. However, OPTE [11], PPTE [7], and JTPS do not need pre-encoding.

The performance of the VMAF, bitrate, and CRF prediction models is investigated using the $R^2$ score and MAE, as shown in Table V. The average $R^2$ scores of the VMAF, bitrate, and CRF prediction models are 0.886, 0.910, and 0.968, respectively. Hence, there is a strong positive correlation between the predicted and ground truth values. The average MAE of the prediction models is 4.762, 0.483, and 1.848, respectively, which is acceptable in live streaming applications.

Furthermore, this paper also examines the relative feature importance in the prediction models. Fig. 10 shows the SHAP values [44] corresponding to the features used in the prediction models. The target bitrate in the logarithmic scale ($\log(b_t)$) is the most influential feature for VMAF prediction, followed by the h, L, and E features. Similarly, the target VMAF ($v_t$) is the most important feature for bitrate prediction, followed by the h, L, and E features. Furthermore, $\log(b_t)$ is the most vital feature for CRF prediction, followed by the h, L, and E features. Intuitively, a lower CRF yields higher bitrate and VMAF, and vice versa. Additionally, in inter-coding, temporal activity is expected to influence the encoding decisions more than spatial content. Hence, h is expected to be more critical in the predictions than L and E.

Fig. 11 shows the RD curves of selected video sequences (segments) of various video complexities with bruteforce encoding [3], HLS CBR encoding, OPTE CBR encoding [11], PPTE encoding [7], HLS cVBR encoding, OPTE cVBR encoding, and JTPS. It is observed that JTPS determines the RD points so that the average VMAF difference between consecutive RD points is the target JND value (in the figure, the JND is assumed to be 6 VMAF points). Furthermore, the VMAF achieved by JTPS is higher than that of HLS CBR encoding at the same target bitrates. In most cases, however, OPTE cVBR yields higher VMAF than the other encoding schemes at the same target bitrates for videos in all complexity classes. This is because OPTE cVBR encoding is optimized for maximizing VMAF, while JTPS is a joint optimization for maximizing VMAF and maintaining a perceptual gap between representations. Hence, the number of representations in JTPS for every video segment is lower than for HLS ladders and OPTE encoding. On average, JTPS (6 VMAF JND) yields eight representations for each video segment, while HLS ladders and OPTE encoding always have twelve representations. On average, PPTE (6 VMAF JND) yields ten representations for each video segment.
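The effect of JND spacing on the number of representations can be illustrated with a small greedy filter over candidate RD points. This is only a simplified sketch and not the prediction algorithm of JTPS: it assumes the candidates are already known, keeps a point only when its VMAF exceeds the previously kept point by at least one JND, and stops once the maximum VMAF is reached. The function name and the candidate values are hypothetical.

```python
def select_jnd_spaced(points, jnd, v_max):
    """Keep RD points whose VMAF gaps are at least one JND.

    points: iterable of (bitrate_mbps, vmaf) candidates.
    jnd:    target JND in VMAF points.
    v_max:  VMAF at or above which no further point is added.
    """
    ladder = []
    for bitrate, vmaf in sorted(points):  # ascending bitrate
        if ladder and vmaf - ladder[-1][1] < jnd:
            continue  # perceptually too close to the last kept point
        ladder.append((bitrate, vmaf))
        if vmaf >= v_max:
            break
    return ladder


# Hypothetical candidate points for one segment (illustration only).
candidates = [
    (0.145, 40), (0.3, 45), (0.6, 52), (1.0, 58), (2.0, 66), (3.0, 70),
    (4.5, 78), (6.0, 84), (8.0, 90), (11.0, 94), (16.8, 97),
]
ladder = select_jnd_spaced(candidates, jnd=6, v_max=94)
```

Raising the JND makes the filter discard more candidates, which mirrors the smaller ladders reported for larger JND values.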
Considering temporal activity in live-streaming applications is crucial for achieving optimal video quality and storage efficiency. Since h represents the temporal activity and is shown to have the strongest influence on the bitrate and VMAF prediction models compared to the other video complexity features (cf. Fig. 10), the correlation of h with the cumulative bitrate of all representations encoded using JTPS for different VMAF JND values (i.e., 2, 4, and 6) is analyzed, as shown in Figs. 12a, 12b, and 12c, respectively. The average $R^2$ score of h with the cumulative bitrate is 0.65. This is because video segments with high temporal activity and fast-paced motion tend to have more temporal changes between frames, resulting in more information to be stored or transmitted. As a result, higher bitrates and larger file sizes are needed to maintain video quality. Similarly, the correlation of h with $|BDR_V|$ of the videos encoded using JTPS is analyzed for different VMAF JND values (i.e., 2, 4, and 6), as shown in Figs. 12d, 12e, and 12f, respectively. The average $R^2$ score of h with $|BDR_V|$ is 0.51. In scenes with low temporal activity, where there is slower motion or minimal change between frames, fewer bits are needed to represent the frames accurately. Hence, $|BDR_V|$ is high at low h values. However, $|BDR_V|$ is observed to be independent of the considered JND value. This is because the area under the RD curve using JTPS does not change based on JND values. To summarize, as h increases, i.e., when there is an increase in temporal activity, the storage requirement increases. Furthermore, as h increases, the bitrate savings while maintaining the same VMAF decrease.
Finally, Table VI summarizes the bitrate saving results of the schemes in terms of $BDR_P$, $BDR_V$, and $\Delta S$, the qualitative analysis results in terms of BD-PSNR and BD-VMAF, and the encoding time saving ($\Delta T$) compared to the HLS CBR encoding. Bruteforce encoding with a JND of 2, 4, and 6 VMAF points represents the best possible result, i.e., the result obtained when the predictions are 100% accurate. Hence, the corresponding results are the upper bound of the compression efficiency improvement (considering VMAF as the quality metric) compared to the HLS CBR encoding. The encoding time using the bruteforce method is 47 times higher than that of the HLS CBR encoding. OPTE CBR encoding yields bitrate savings of 17.28% and 22.79% to maintain the same PSNR and VMAF, respectively, compared to the HLS CBR encoding, along with a 0.07% cumulative increase in storage space required and a 9.74% cumulative increase in encoding time for various representations. This scheme yields the highest bitrate saving to maintain the same VMAF compared to the other CBR encoding schemes. The PPTE scheme is analyzed for the JND values of 2, 4, and 6 VMAF points. With a target JND of 2 VMAF points, PPTE yields bitrate savings of 11.06% and 16.65% to maintain the same PSNR and VMAF, respectively, compared to the HLS CBR encoding, along with a 10.18% cumulative increase in storage space required and a 105.73% cumulative increase in encoding time for various representations. The increase in storage space and encoding time is due to the increase in the number of representations in the bitrate ladder when the JND value decreases. With a target JND of 4 and 6 VMAF points, the decrease in storage space requirement is observed as 27.03% and 42.48%, respectively. The overall encoding time increases by 10.19% for a target JND of 4 VMAF points, while it decreases by 25.35% for a target JND of 6 VMAF points.

TABLE VI AVERAGE RESULTS OF THE ENCODING SCHEMES COMPARED TO THE HLS CBR ENCODING
It is observed that HLS cVBR encoding yields bitrate savings of 35.25% and 32.33% to maintain the same PSNR and VMAF, respectively, compared to the HLS CBR encoding, along with a 9.39% cumulative decrease in storage space and a 1.64% cumulative increase in encoding time required for various representations. This result demonstrates that the compression efficiency of cVBR encoding is better than that of CBR encoding. Using OPTE cVBR encoding, bitrate savings of 34.42% and 42.67% to maintain the same PSNR and VMAF, respectively, are observed compared to the HLS CBR encoding, along with a 1.34% cumulative decrease in storage space requirement and a 62.73% cumulative increase in encoding time requirement. This scheme yields the highest bitrate saving to maintain the same VMAF compared to the other considered schemes. However, as observed in the RD figures, many representations are perceptually redundant, which wastes storage space. JTPS is observed to overcome this problem. With a target JND of 2 VMAF points, JTPS yields bitrate savings of 14.25% and 29.14% to maintain the same PSNR and VMAF, respectively, compared to the HLS CBR encoding, along with a 23.57% cumulative increase in storage space and a 184.62% cumulative increase in encoding time required for various representations. Similar to the observation for PPTE, when the JND value decreases, the number of representations in the bitrate ladder increases, causing an increase in the storage space required. However, with a target JND of 4 and 6 VMAF points, the decrease in storage space requirement is observed as 56.38% and 68.96%, respectively. The overall encoding time increases by 26.14% for a target JND of 4 VMAF points, while it decreases by 18.58% for a target JND of 6 VMAF points.

V. CONCLUSION
This paper proposes a JND-aware two-pass cVBR per-title encoding scheme (JTPS) for adaptive live streaming applications. JTPS includes an optimized encoding bitrate ladder prediction algorithm, which estimates bitrate-resolution-CRF triples for a given video segment based on its spatial and temporal characteristics, using RF-based models. The bitrate ladder is predicted such that there is a perceptual difference of at least one JND between the representations in order to minimize the perceptual redundancy of the representations. Optimized CRF prediction for every representation in the bitrate ladder enables cVBR encoding. The experimental results show that, on average, JTPS yields bitrate savings of 18.80% and 32.59% to maintain the same PSNR and VMAF, respectively, compared to the CBR encoding of the reference HLS bitrate ladder, with a negligible additional latency in streaming. This is accompanied by a cumulative decrease of 68.96% in the storage space needed for various representations, and a cumulative decrease of 18.58% in encoding time, considering a JND of 6 VMAF points.
In case the streaming service provider does not support per-title encoding schemes, the HLS cVBR encoding scheme can be used, where the bitrate-resolution pairs are fixed. Hence, the network architecture used for fixed bitrate ladder encoding remains unaltered. If the streaming service provider supports dynamic resolution changes while maintaining a selected set of bitrates, the OPTE cVBR encoding scheme is the best choice. Finally, if dynamic bitrate-resolution pairs are supported, JTPS offers the best storage reduction and improved compression efficiency.
In the future, JTPS can be extended to support Common Media Client Data (CMCD) [46], so that the encoding can be optimized based on the user profile, geolocation, subscription model, ratings, etc. In this way, context-awareness can be incorporated in JTPS.

Fig. 2. Rate-Distortion (RD) curves of the Constant Bitrate (CBR) encoding of the RushHour_s000 and YachtRide_s000 video sequences (segments) of the VCD dataset [12], encoded at 1080p and 2160p resolutions using the x265 HEVC encoder at the ultrafast preset. Here, VMAF is used as the quality metric.

Fig. 6. Estimation of the first point of the bitrate ladder. $\hat{v}_1$ is the maximum value among the $\hat{v}_{r,b_1}$ values output from the prediction models trained for resolutions $r_1, \ldots, r_M$. The resolution corresponding to the VMAF $\hat{v}_1$ is chosen as $\hat{r}_1$.

Fig. 7. Estimation of the $t$-th point ($t > 1$) of the bitrate ladder. $\log(\hat{b}_t)$ is the minimum value among the $\log(\hat{b}_{r,v_t})$ values output from the prediction models trained for resolutions $r_1, \ldots, r_M$. The resolution corresponding to $\log(\hat{b}_t)$ is chosen as $\hat{r}_t$.
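The selection rules in the captions of Fig. 6 and Fig. 7 amount to an argmax over predicted VMAF and an argmin over predicted log-bitrate across the per-resolution models. The sketch below shows only that selection step. The per-resolution models are replaced by hypothetical toy callables, whereas the actual models are random forests that also take the E, h, and L features as input.

```python
def pick_first_point(vmaf_models, b1):
    """Fig. 6 rule: resolution whose model predicts the highest VMAF at bitrate b1."""
    best_r = max(vmaf_models, key=lambda r: vmaf_models[r](b1))
    return best_r, vmaf_models[best_r](b1)


def pick_next_point(log_bitrate_models, v_t):
    """Fig. 7 rule: resolution whose model predicts the lowest log-bitrate for VMAF v_t."""
    best_r = min(log_bitrate_models, key=lambda r: log_bitrate_models[r](v_t))
    return best_r, log_bitrate_models[best_r](v_t)


# Toy stand-ins for trained per-resolution models (hypothetical numbers).
toy_vmaf_models = {
    "360p": lambda b: 30.0 + 10.0 * b,
    "1080p": lambda b: 20.0 + 25.0 * b,
}
toy_log_bitrate_models = {
    "360p": lambda v: v / 40.0,
    "1080p": lambda v: v / 50.0 - 0.2,
}

first_resolution, first_vmaf = pick_first_point(toy_vmaf_models, b1=0.145)
next_resolution, next_log_bitrate = pick_next_point(toy_log_bitrate_models, v_t=60.0)
```

With these toy models, the low-resolution model wins at the lowest bitrate and the high-resolution model wins at the higher VMAF target.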

Fig. 8. Estimation of the optimized CRF to achieve the target bitrate $\hat{b}_t$ using a prediction model trained for resolution $\hat{r}_t$.
Algorithm 2 Dataset Generation

cVBR encoding dataset
Inputs:
    $R$: set of resolutions
    $c_{min}$: minimum supported CRF
    $c_{max}$: maximum supported CRF
for each video segment do
    Determine E, h, and L
    for each $r \in R$ do
        for each $c \in [c_{min}, c_{max}]$ do
            Encode segment with CRF $c$;
            Record E, h, L, $r$, $c$, achieved bitrate $b'$, VMAF $v$, and PSNR $p$;

CBR encoding dataset
Inputs:
    $R$: set of resolutions
    $B$: set of target bitrates
for each video segment do
    Determine E, h, and L
    for each $r \in R$ do
        for each target bitrate $b \in B$ do
            Encode segment with CBR $b$;
            Record E, h, L, $r$, $b$, achieved bitrate $b'$, VMAF $v$, and PSNR $p$;
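The cVBR branch of Algorithm 2 can be sketched as a nested loop that records one row per segment, resolution, and CRF. The sketch is a minimal illustration under stated assumptions: `extract_features` and `encode` are hypothetical callables supplied by the caller (standing in for the complexity analyzer and the encoder plus quality measurement), and the row layout is an arbitrary choice.

```python
def generate_cvbr_dataset(segments, resolutions, crf_values, extract_features, encode):
    """Collect one training row per (segment, resolution, CRF) combination.

    extract_features(segment) -> (E, h, L)                 # hypothetical callable
    encode(segment, r, c)     -> (bitrate, vmaf, psnr)     # hypothetical callable
    """
    rows = []
    for segment in segments:
        E, h, L = extract_features(segment)  # features are computed once per segment
        for r in resolutions:
            for c in crf_values:
                bitrate, vmaf, psnr = encode(segment, r, c)
                rows.append({
                    "segment": segment, "E": E, "h": h, "L": L,
                    "resolution": r, "crf": c,
                    "bitrate": bitrate, "vmaf": vmaf, "psnr": psnr,
                })
    return rows
```

The CBR branch has the same shape, with the CRF loop replaced by a loop over the set of target bitrates.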

TABLE I COMPARISON OF THE STATE-OF-THE-ART PER-TITLE ENCODING METHODS WITH JTPS

TABLE II PREDICTION ACCURACY OF VMAF AND log(b) PREDICTION MODELS FOR 2160P RESOLUTION ENCODING OF VCD DATASET [12] USING X265 HEVC ENCODER AT ultrafast PRESET

TABLE III PREDICTION ACCURACY OF CRF PREDICTION MODELS FOR 2160P RESOLUTION ENCODING OF VCD DATASET [12] USING X265 HEVC ENCODER AT ultrafast PRESET

TABLE IV COMPARISON OF THE ADDITIONAL COMPUTATIONAL TIME OVERHEAD TO DETERMINE THE CONVEX-HULL

Table IV shows the additional computational time overhead needed to determine the convex-hull (first-pass encoding time) compared to the HLS CBR encoding time. It is observed that our previous works, OPTE and PPTE, as well as JTPS, need significantly lower processing time to predict the bitrate ladder compared to the state-of-the-art methods; hence, they are suitable for live streaming applications.

TABLE V $R^2$ SCORE AND MAE OF THE PREDICTION MODELS FOR VARIOUS RESOLUTIONS