Learning-Based JND-Directed HDR Video Preprocessing for Perceptually Lossless Compression With HEVC

Videos are ultimately consumed mostly by humans. Therefore, if videos are compressed by fully exploiting the perception characteristics of the human visual system (HVS), the bitrates of the compressed videos can be significantly reduced with as little subjective visual quality degradation as possible. Based on this, we propose a new learning-based Just Noticeable Distortion (JND)-directed preprocessing scheme for perceptual video compression, especially for 10-bit High Dynamic Range (HDR) videos, called the HDR-JNDNet. Our HDR-JNDNet effectively suppresses the perceptual redundancy of 10-bit HDR video signals so that compression efficiency can be significantly enhanced for HEVC main10 profile encoders. To the best of our knowledge, our work is the first approach to training a CNN-based model to directly generate JND-directed suppressed frames of 10-bit HDR video, with a negligible perceptual quality difference between the decoded frames obtained with and without the preprocessing by our HDR-JNDNet. Intensive experiments show that, when the HDR-JNDNet is applied as preprocessing to the HDR video input before compression, it saves up to 40.66% (18.37% on average) of the required bitrates for 4K-UHD/HDR test videos, with little subjective video quality degradation and without increasing computational complexity.


I. INTRODUCTION
Today, we live in the age of video. It has become part of everyday life to shoot videos on smartphones and share them via various internet video platforms. For example, it is known that more than 400 hours of new video were uploaded to YouTube(TM) every minute in 2019 [5]. Especially, owing to the development of commercial cameras and displays, many high-quality video contents in High Dynamic Range (HDR) formats are easily generated and rapidly distributed. Compared to conventional Standard Dynamic Range (SDR) video contents, HDR video can represent a wide range of luminance levels more similar to real-world scenes [2]. However, the transmission bandwidths and storage spaces required for such HDR video contents are much larger than for conventional SDR video contents because HDR videos adopt a floating-point representation [22]. Thus, to find an efficient HDR video transmission and storage solution, ISO/IEC JTC 1 (International Organization for Standardization/International Electrotechnical Commission Joint Technical Committee 1) launched a Call for Evidence (CfE) and recommended an HDR video coding chain suitable for High Efficiency Video Coding (HEVC) [26], which processes HDR video in 10-bit integers quantized by the Perceptual Quantizer (PQ) [8], [14]. Fig. 1 shows the HDR video coding chain in conjunction with HEVC main10 profile codecs. Note that the main10 profile of HEVC supports the format of encoded HDR video, and most HDR videos to be compressed are stored in the form of 10-bit quantized HDR. Most commercial HDR TVs are manufactured to support decoded 10-bit quantized HDR video. However, quantized 10-bit HDR video contents still require much larger transmission bandwidths and storage spaces than 8-bit SDR video contents. Also, 4K (3,840-pixel width) Ultra High Definition (UHD) TV/broadcast video is specified in 10-bit quantized HDR formats, thus requiring significantly increased transmission bandwidths and storage spaces. Therefore, there is a strong demand for effective compression of 4K-UHD/HDR videos, which can be realized by effectively removing the perceptual redundancy of HDR video. One of the approaches for reducing bitrates with little perceptual video quality degradation is Just Noticeable Distortion (JND)-directed preprocessing [7], [10], [11], [27], [28].

(VOLUME 8, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Nilanjan Dey. This work is licensed under a Creative Commons Attribution 4.0 License: https://creativecommons.org/licenses/by/4.0/)
The JND-directed preprocessing can enhance compression efficiency by effectively removing perceptual redundancy; it smooths the original input signals to the point where the resulting distortions are still not perceptually recognizable. These preprocessing-based approaches are advantageous since they can be applied to any video codec without modifying its internal functions. Furthermore, the JND-directed preprocessing methods are effective for compression efficiency improvement even when high quantization parameter (QP) values are used in encoding. This is because the entire original input signal is suppressed by preprocessing before encoding, rather than trying to suppress the residual signals during encoding. However, most of the state-of-the-art JND-directed preprocessing methods were developed for SDR signals. As a result, these methods do not provide sufficient bitrate savings for HDR signals.
In this paper, our contributions can be summarized as follows: • We newly propose a learning-based JND-directed suppression method that reduces the perceptual redundancy in 10-bit HDR videos as preprocessing for HEVC main10 profile encoders. Fig. 2 illustrates the pipeline of the preprocessing-based HDR video coding scheme using our proposed HDR-JNDNet with an HEVC main10 profile codec. To the best of our knowledge, our work is the first approach to preprocessing 10-bit HDR videos for perceptual redundancy reduction using a CNN-based JND-learning network (HDR-JNDNet). The video preprocessed by our HDR-JNDNet is fed as input to the HEVC main10 profile encoder, leading to significant bitrate savings of up to 40.66% (18.37% on average) over the standard HM reference codec of HEVC for the 4K-UHD HDR test sequences.
• We propose a novel process of generating the ground truths used to train learning-based JND-directed suppression networks such as our HDR-JNDNet. The ground-truth JND-directed suppressed frames for the original HDR frames are constructed, for given target QP values, using the maximum energy-reduction strength whose perceptual-quality-difference detection probability (PQDP) value is less than 0.5. Since this process is not codec-dependent, it can be applied to any video coding scheme.
• To obtain the ground truths, our CNN-based PQDP predictor is proposed to precisely predict the perceptual-quality-difference detection probability between the decoded frames of an original input video with and without our preprocessing applied. For this, based on a JND threshold level, we present a new methodology for training such a CNN model (the CNN-based PQDP predictor) that predicts a PQDP value between two images using a new loss function with reliability weights. With the pretrained CNN-based PQDP predictor, the HDR-JNDNet can be trained to generate the preprocessed frames by taking into account the expected compression distortion for any given target QP value.

The rest of this paper is organized as follows: Section II briefly reviews related works; our proposed preprocessing method is described in Section III; Section IV presents intensive experimental results with analysis to show the effectiveness of our method; finally, Section V concludes our work.

II. RELATED WORKS

A. PREPROCESSING-BASED VIDEO CODING SCHEMES
In order to further improve the compression efficiency of video, many perceptual video coding (PVC) schemes have been proposed [3], [7], [10]-[12], [17], [18], [21], [27], [28], [31], [33]. PVC is a method of improving coding efficiency without degrading subjective visual quality, unlike conventional video coding schemes that minimize the absolute error between an original video and its compressed video. PVC schemes can be categorized into three groups: (i) Perceptual quantization schemes [3], [12], [17], [18] are implemented within the quantization processes of video encoders by adjusting the quantization levels of the transform coefficients of the residual; (ii) Perceptual rate control schemes [33] find the optimal solution to the minimization problem for PVC using a JND-based Rate-Distortion (R-D) model; (iii) Preprocessing-based video coding schemes [7], [10], [11], [21], [27], [28], [31] reduce the energy of input signals to increase the entropy-theoretic compression efficiency before encoding. Compared to the other two PVC approaches, the preprocessing-based approaches are advantageous since they can be applied to any video codec without modifying its internal functions. Furthermore, the preprocessing-based approaches are effective for compression efficiency improvement even when high quantization parameter (QP) values are used in encoding. This is because the entire original input signal is suppressed by preprocessing before encoding, rather than trying to suppress the residual signals during encoding. Thus, in this paper, we focus on preprocessing-based PVC schemes.
Preprocessing-based video coding schemes reduce the energy of input signals to increase the entropy-theoretic compression efficiency before encoding [25]. The energy of an SDR input frame can be reduced by lowering the magnitudes of the DCT coefficients of the input frame. The energy-reduction strength (K_n) of the n-th m×m DCT block is determined as the maximum value at which the perceptual quality difference between an SDR frame and its energy-reduced frame is not recognizable. Finding the optimal K_n values can be done in the pixel domain or the DCT domain. In the pixel domain, Ding et al. [7] proposed JND-based Gaussian filtering of super-pixels using k-means clustering; their method finds the sigma value of a Gaussian filter corresponding to the optimal K_n in the pixel domain. To further improve on this, Vidal et al. [27] applied an adaptive bilateral filter to each input frame in the pixel domain for perceptually lossless video coding. In the DCT domain, Xiang et al. [28] directly determined the optimal K_n using their proposed DCT-based just noticeable distortion filter (JNDF). Recently, Ki et al. [10] proposed a new preprocessing-based video coding scheme using the CNN-JNQD model, which is trained based on their Energy-Reduced JND (ERJND) model, a JND model optimized for energy-reduced distortion. Ki's scheme can adjust the magnitude of the energy-reduction strength according to given quantization parameters using the CNN-JNQD model. To date, Ki's scheme has been the only preprocessing-based video coding scheme using a CNN and has shown the best coding efficiency compared to conventional state-of-the-art PVC schemes.
However, the state-of-the-art preprocessing-based video coding schemes cannot generate optimal preprocessed frames for HDR video contents because they are modeled for SDR signal characteristics. Most HDR video coding research focuses only on developing a new PQ for effectively quantizing real HDR videos to 10-bit HDR videos while preserving their perceptual quality [13], [29], [30]. Thus, there has been no preprocessing research that improves the compression efficiency of the HEVC main10 profile for quantized 10-bit HDR videos. Based on the results of the state-of-the-art preprocessing-based video coding schemes for SDR signals, we can expect that a preprocessing-based video coding scheme extended to 10-bit HDR signals would effectively improve coding efficiency in an HEVC main10 profile encoder. Therefore, in this paper, we propose the first CNN-based preprocessing PVC scheme optimized for 10-bit HDR videos.

B. EXTENDING JND MODELS FOR SDR SIGNALS TO THOSE FOR HDR SIGNALS
In order to apply preprocessing-based video coding schemes to HDR videos, it is necessary to know the JND threshold values for HDR video signals. It is too time-consuming and cumbersome to repeat the subjective experiments on HDR videos. Thus, Zhang et al. [31] proposed a method of extending the JND model for SDR signals to a JND model for HDR signals using a tone mapping operator (TMO) function between SDR image pixel values (I_sdr) and HDR image pixel values (I_hdr). A JND threshold value (J_hdr) in the HDR domain is calculated by dividing the JND threshold value (J_sdr) in the SDR domain by the derivative of the TMO function. The TMO curve offers finer granularity where the histogram of I_hdr is denser, while coarse mapping is applied to regions where I_hdr values are sparser. Fine (coarse) tone mapping means high (low) sensitivity to distortions. Where the tone mapping is finer (i.e., has high derivative values), J_hdr is reduced due to high sensitivity; where the tone mapping is coarse, J_hdr is raised due to low sensitivity. By applying this method, existing JND models for the SDR domain can be extended to the HDR domain without additional subjective distortion experiments on HDR videos. Based on the extended JND model for HDR videos, Zhang proposed a method to increase the compression efficiency of HDR video coding with little perceptual quality degradation by adjusting the quantization parameter values in an HEVC encoder. However, Zhang's extended JND model is not suitable for preprocessing-based video coding of HDR videos because it is not modeled for energy-reduced distortions [10]. Therefore, in this paper, we newly propose an ERJND_hdr model for 10-bit HDR signals by extending Ki's ERJND model for 8-bit SDR signals based on Zhang's method.
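The SDR-to-HDR extension rule described above (divide the SDR threshold by the TMO derivative) can be sketched in a few lines. This is a minimal illustration under assumptions: the TMO is represented as a lookup table over 10-bit HDR code values, its derivative is taken by a finite difference, and the function names are hypothetical, not from the paper.

```python
def tmo_slope(tmo_curve, i_hdr):
    # Finite-difference derivative f'(I_hdr) of a piecewise-linear TMO,
    # where tmo_curve[v] is the SDR value mapped from HDR code value v.
    v = min(max(int(i_hdr), 0), len(tmo_curve) - 2)
    return tmo_curve[v + 1] - tmo_curve[v]  # step between adjacent code values

def extend_jnd_to_hdr(j_sdr, i_hdr, tmo_curve, eps=1e-6):
    # J_hdr = J_sdr / f'(I_hdr): a steep (fine) mapping means high sensitivity,
    # so the HDR-domain threshold shrinks; a flat (coarse) mapping enlarges it.
    return j_sdr / max(tmo_slope(tmo_curve, i_hdr), eps)

# Toy identity-slope TMO over the 10-bit range, for illustration only.
identity_tmo = list(range(1024))
```

With the identity TMO (slope 1 everywhere) the threshold passes through unchanged; a flat TMO segment (slope 0) is guarded by `eps` so the threshold becomes very large, matching the low-sensitivity intuition above.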
It should, however, be noted that such a simple extension to obtain the ERJND_hdr model cannot be directly used to effectively remove perceptual redundancy in preprocessing, because the perceptual-quality-difference detection probability (PQDP) values between two decoded frames with and without ERJND_hdr-directed energy reduction cannot be obtained by the ERJND_hdr model alone. Therefore, we need to develop a model that predicts PQDP values between the two decoded frames. It is used to elaborately control the energy-reduction strength according to compression distortion levels so that the resulting compression efficiency can be greatly improved with little perceptual quality degradation for HDR input videos. For this, we newly propose a CNN-based PQDP predictor for HDR input videos.

III. METHOD
In the pipeline of our preprocessing-based HDR video coding scheme shown in Fig. 2, the CNN-based HDR-JNDNet generates energy-reduced frames by reducing perceptual redundancy, in the sense that imperceptible signal components are removed. Importantly, our HDR-JNDNet takes into account the expected compression distortion for a target QP value when compressing 10-bit HDR input frames. In the following subsections, we explain how to train the HDR-JNDNet. Firstly, the ground truth (GT) of energy-reduced frames must be prepared a priori to train the HDR-JNDNet in a supervised manner. So, we first explain how to generate the ground truths for HDR-JNDNet training. Fig. 3 shows the process of generating the ground truths for training the HDR-JNDNet. In Fig. 3, the optimal energy-reduced frame is found as the one having the maximum energy-reduction strength for which the same perceptual quality is maintained between the decoded frame of the original input and the decoded frame of the energy-reduced frame. However, it is important to note that an energy-reduced frame suppressed by the JND threshold values of its original frame does not necessarily guarantee the same perceptual quality between the decoded energy-reduced frame and the decoded original frame after compression. This is because quantization is involved during encoding, which introduces additional distortions into both the decoded original and energy-reduced frames. Thus, the energy-reduction strength for the original input frames should be adjustable in preprocessing before compression by considering the resulting compression distortion. However, an optimal energy-reduction strength can be determined only after compression. This is why multiple energy-reduced frames with various energy-reduction strengths are prepared for the original HDR input frame, and they are encoded for a given target QP value.
Then, the optimal energy-reduced frame for the original HDR input frame can be constructed for the target QP value by placing the 64 × 64-sized blocks, each of which has the maximum energy-reduction strength with a PQDP value of less than 0.5, among the co-located 64 × 64-sized blocks of the various energy-reduced frames. Through this process, a set of GTs for preprocessing the original HDR images can be obtained to train the HDR-JNDNet. The trained HDR-JNDNet can then directly generate adjusted energy-reduced frames for original HDR input frames by considering given QP values before the encoding process starts. Fig. 4 shows the training phase of the HDR-JNDNet using the optimal energy-reduced frames as GT, as well as the testing phase. In the following subsections, we explain the steps of generating the GT for 10-bit HDR frames for given target QP values in more detail.

A. ERJND_hdr MODEL FOR 10-BIT HDR SIGNALS
In order to determine whether a perceptual quality difference is recognizable between the decoded frame of an original HDR input frame and the decoded frame of its energy-reduced frame produced by preprocessing, it is necessary to know a JND threshold value with which to suppress the original HDR input frame for energy reduction. The JND threshold value for energy-reduction distortions (ERJND) of SDR signals, K^erjnd_{n,sdr}, is modeled using the structural contrast index (SCI) [3] as a function of C^sdr_n, the n-th DCT block of the input SDR frame (Eq. (1); the detailed form follows the SCI-based model of [3]). Here, it is noted that the energy-reduced SDR image, in which the magnitudes of the 8×8 block DCT coefficients of its original SDR image are suppressed by the amounts of K^erjnd_{n,sdr}, has no perceptual quality degradation compared with the original image. However, to apply JND-directed preprocessing to an HDR video coding scheme, we need to know the JND threshold values for the HDR input a priori. According to [31], the ERJND model for the SDR domain can be extended to the HDR domain using a tone-mapping operator (TMO) as

$K^{erjnd}_{n,hdr} = \frac{K^{erjnd}_{n,sdr}}{f'(\cdot) + \epsilon}$,   (2)

where K^erjnd_{n,hdr} is the energy-reduction strength extended to the HDR domain, f(·) and f'(·) are the TMO and its derivative, respectively, and ε = 10^-6 guards against zero-derivative cases. The TMO used in this paper is a histogram-based piecewise-linear TMO, which is optimized for backward-compatible HDR image and video compression [15]. In our work, the TMO maps quantized 10-bit HDR signals to quantized 8-bit SDR signals. Thus, the luminance range of the HDR signals is 0 to 1023, the maximum luminance value of the SDR signals is 255, and the number of pieces of the TMO is 1023.
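A histogram-based piecewise-linear TMO of the kind used here can be approximated with a cumulative-histogram mapping. The actual operator in [15] derives statistically optimized segment slopes; the sketch below simply allocates the SDR output range in proportion to histogram mass, which captures the fine-where-dense behavior described above, so treat it as an illustration only.

```python
def build_histogram_tmo(hist, sdr_max=255):
    # Allocate the SDR output range in proportion to cumulative histogram mass:
    # dense HDR code-value regions get steeper (finer) segments, sparse regions
    # get flatter (coarser) ones. hist[v] counts pixels at 10-bit code value v.
    total = sum(hist) or 1
    curve, acc = [], 0.0
    for h in hist:
        curve.append(acc * sdr_max / total)
        acc += h
    return curve  # curve[v] is the (real-valued) SDR value for HDR code v
```

The resulting curve is monotonically non-decreasing by construction, so its finite-difference slope (the f'(·) in Eq. (2)) is always non-negative and is larger where the histogram is denser.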
In this paper, we further take the luminance masking effect into account to model the JND thresholds for 10-bit HDR frames (images). Since the ERJND model in the SDR domain is based only on SCI as in Eq. (1), which represents the complexity of an image, it cannot account for the luminance sensitivity of SDR images. Especially for HDR videos, their perceptual redundancy can be reduced more effectively if the luminance masking effect [31] is incorporated. This is because the dynamic range of the luminance signals of 10-bit HDR images is four times that of 8-bit SDR images. Since the ERJND model in the SDR domain is developed at the most sensitive luminance level, which is a pixel value of 128 in 8-bit depth [31], an ERJND model with the luminance masking effect can easily be obtained by multiplying the intensity-dependent quantization (IDQ) profile [31] into the SDR-domain ERJND model as follows:

$\check{K}^{erjnd}_{n,sdr} = K^{erjnd}_{n,sdr} \cdot IDQ_{sdr}(\mu_n)$,   (3)

where Ǩ^erjnd_{n,sdr} is the energy-reduction strength of the ERJND model with luminance masking, and IDQ_sdr(µ_n) is the IDQ profile for the local luminance average value µ_n of the n-th 8 × 8 block in a (tone-mapped) SDR image. Therefore, based on Eq. (2), the energy-reduction strength of the proposed ERJND_hdr model can be derived from the SDR ERJND model with luminance masking by dividing it by the derivative of the TMO:

$K^{erjnd}_{n,hdr} = \frac{\check{K}^{erjnd}_{n,sdr}}{f'(\cdot) + \epsilon}$,   (4)

where K^erjnd_{n,hdr} is the energy-reduction strength of our proposed ERJND_hdr model.
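Combining the two factors, the luminance-masked HDR strength can be computed per block as below. The toy IDQ profile is purely illustrative (the real IDQ profile in [31] is given as a tabulated/analytic profile); we only assume its qualitative shape, with least masking at mid-gray 128, and the function names are ours.

```python
def toy_idq(mu):
    # Illustrative IDQ profile only (not the actual profile of [31]):
    # least masking (factor 1.0) at mid-gray 128, stronger masking elsewhere.
    return 1.0 + abs(mu - 128) / 128.0

def erjnd_hdr_strength(k_erjnd_sdr, idq, mu_n, tmo_slope_val, eps=1e-6):
    # Eq. (3): scale the base ERJND strength by the IDQ profile at the block's
    # mean (tone-mapped) luminance; Eq. (4): divide by the TMO derivative to
    # map the luminance-masked strength into the HDR domain.
    k_masked = k_erjnd_sdr * idq(mu_n)
    return k_masked / max(tmo_slope_val, eps)
```

At µ_n = 128 and unit TMO slope the strength is unchanged; darker or brighter blocks, or coarser tone-mapping segments, allow more suppression.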
The ERJND-directed energy-reduced frames of an SDR input frame [10] can be obtained by lowering the magnitudes of the DCT coefficients of the SDR input frame by as much as K^erjnd_{n,sdr}:

$\tilde{C}^{sdr}_n(u,v) = \mathrm{sign}(C^{sdr}_n(u,v)) \cdot \max(|C^{sdr}_n(u,v)| - K^{erjnd}_{n,sdr},\, 0)$,   (5)

where C^sdr_n(u,v) and C̃^sdr_n(u,v) are the (u,v)-th DCT coefficients of the n-th m×m DCT block of the input SDR frame and the energy-reduced frame, respectively. By substituting K^erjnd_{n,sdr} with K^erjnd_{n,hdr} in Eq. (5), we can obtain the ERJND_hdr-directed energy-reduced frames:

$C^{erjnd}_{n,hdr}(u,v) = \mathrm{sign}(C^{hdr}_n(u,v)) \cdot \max(|C^{hdr}_n(u,v)| - K^{erjnd}_{n,hdr},\, 0)$,   (6)

where C^hdr_n(u,v) and C^erjnd_{n,hdr}(u,v) are the (u,v)-th DCT coefficients of the n-th 8 × 8 DCT block of the input HDR frame and the ERJND_hdr-directed energy-reduced frame, respectively.
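A minimal sketch of this coefficient-wise suppression, assuming the reduction subtracts the strength from each coefficient's magnitude and clips at zero with the sign preserved, as the text describes. Whether the DC term is spared is our assumption, exposed via `keep_dc`; the DCT/IDCT steps themselves are omitted.

```python
def suppress_block(coeffs, k, keep_dc=True):
    # Shrink every DCT coefficient's magnitude by the energy-reduction strength
    # k, clipping at zero and preserving the sign. Sparing the (0,0) DC term is
    # our assumption (suppression targets AC detail, not mean brightness).
    out = []
    for u, row in enumerate(coeffs):
        new_row = []
        for v, c in enumerate(row):
            if keep_dc and u == 0 and v == 0:
                new_row.append(c)
            else:
                mag = max(abs(c) - k, 0.0)
                new_row.append(mag if c >= 0 else -mag)
        out.append(new_row)
    return out
```

Small AC coefficients below the threshold are zeroed out entirely, which is exactly the entropy-reducing effect the preprocessing relies on.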

B. GENERATING THE VARIOUS VERSIONS OF THE ERJND_hdr-DIRECTED ENERGY-REDUCED FRAMES
The ERJND_hdr-directed energy-reduced frames can be obtained by Eq. (6). However, it should be remembered that the ERJND_hdr-directed energy-reduced input is only assumed to be perceptually the same as the original HDR input before encoding. Actually, in our problem, the ERJND_hdr-directed energy-reduced input goes through the encoding process, yielding encoded ERJND_hdr-directed energy-reduced frames that are not necessarily perceptually the same as the encoded original HDR input frames. This is because the lossy compression with quantization may introduce different amounts of compression distortion into the ERJND_hdr-directed energy-reduced frames and the original HDR input frames. This must be remedied, since our goal is to achieve perceptually the same visual quality after compression between the ERJND_hdr-directed energy-reduced frames and the original HDR input frames. Thus, the energy-reduction strength for the original input frames should be adjusted before compression by considering the resulting compression distortions.
In order to solve the aforementioned problem, we repeatedly compress energy-reduced frames of various energy-reduction strengths for an HDR input frame with respect to a given target QP value. We can obtain the various versions of the ERJND_hdr-directed energy-reduced frames by introducing a scaling factor α:

$C^{erjnd}_{n,hdr,\alpha}(u,v) = \mathrm{sign}(C^{hdr}_n(u,v)) \cdot \max(|C^{hdr}_n(u,v)| - \alpha K^{erjnd}_{n,hdr},\, 0)$,   (7)

$X^{er,\alpha}_{n,hdr} = \mathrm{IDCT}(C^{erjnd}_{n,hdr,\alpha})$,   (8)

where α = 0.1k for k = 0, 1, ..., 20, C^erjnd_{n,hdr,α}(u,v) is the (u,v)-th DCT coefficient of the n-th 8 × 8 DCT block of the ERJND_hdr-directed energy reduction with scaling factor α, and X^er,α_{n,hdr} is the resulting ERJND_hdr-directed energy-reduced pixel block obtained by the inverse DCT of C^erjnd_{n,hdr,α} as in Eq. (8). In this paper, we generated a total of 21 ERJND_hdr-directed energy-reduced frames per input frame. Next, among the encoded frame of the original HDR input and the encoded frames of its energy-reduced frames with various suppression strengths, the perceptual quality differences are checked by a pretrained CNN-based PQDP predictor for each QP value. Then, we construct an optimal energy-reduced frame for the original HDR input by placing the 64 × 64-sized blocks, each of which has the maximum energy-reduction strength with a PQDP value of less than 0.5, among the co-located 64 × 64-sized blocks of the corresponding energy-reduced frames before compression. In the following subsection, we explain the CNN-based PQDP predictor in more detail.

C. CNN-BASED PQDP PREDICTOR

The precondition of preprocessing-based video coding is that there is no perceptual difference between the decoded frames of an original input and of an input preprocessed by the perceptually lossless preprocessing. To determine whether or not a perceptual difference exists between the two decoded frames, we propose a novel CNN-based perceptual-quality-difference detection probability (PQDP) predictor.
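The per-α generation of Eqs. (7)-(8) can be sketched as follows for a single DCT block, with the inverse DCT back to pixels omitted; the function name is illustrative, not from the paper.

```python
def generate_energy_reduced_versions(coeffs, k_hdr):
    # One suppressed copy of a DCT block per scaling factor alpha = 0.0..2.0
    # in steps of 0.1 (21 versions); the inverse DCT to pixels is omitted.
    versions = {}
    for step in range(21):
        a = round(0.1 * step, 1)
        versions[a] = [[(max(abs(c) - a * k_hdr, 0.0) if c >= 0
                         else -max(abs(c) - a * k_hdr, 0.0)) for c in row]
                       for row in coeffs]
    return versions
```

α = 0 reproduces the original block, α = 1 applies the full ERJND_hdr strength, and α = 2 doubles it, spanning the range over which the PQDP predictor is later evaluated.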
The CNN-based PQDP predictor is trained to predict a probability value of 0 when two identical original image patches of 64 × 64 size are inputted, and to predict a probability value of 0.5 when an original patch and its corresponding ERJND_hdr-directed energy-reduced patch are inputted, because JND is often defined as the distortion level at which 50% of subjects perceive the distortion [3]. In order to stably train the CNN-based PQDP predictor, it is also necessary to use ERJND_hdr-directed energy-reduced frames having various PQDP values other than 0 and 0.5. However, finding the ground truths of energy-reduction strengths for the various other detection probability values through subjective tests is very cumbersome and challenging. Thus, in order to circumvent this difficulty, we intuitively set the PQDP values for the ERJND_hdr-directed energy-reduced patches to all 0's for 0 ≤ α < 1 and all 1's for 1 < α ≤ 2. Fig. 5 shows the GT P^α_gt of PQDP and its reliability weight curve λ_α over α. In Fig. 5-(a), for α = 0 and α = 1, P^α_gt = 0 and 0.5, respectively, based on the definition of JND. However, for the other α values, we intuitively determined pseudo-GTs of PQDP to be 0 for 0 ≤ α < 1 and 1 for 1 < α ≤ 2, as indicated by the blue lines in Fig. 5-(a). It should be noted that the JND thresholds are often set at 50% of PQDP [23], whose true values are unknown but monotonically increase as α becomes larger, as shown in Fig. 5-(a). As α gets closer to 1, the distance between the pseudo-GTs and the unknown ideal PQDPs increases. In order to alleviate this discrepancy, a reliability weight is introduced, which assigns more reliability to α values farther from 1, as shown in Fig. 5-(b). The reliability weight λ_α in Fig. 5-(b) is incorporated into the loss function when training the CNN-based PQDP predictor.
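The labeling rule and the reliability weighting can be written down directly. The exact functional form of λ_α is given only graphically in Fig. 5-(b), so the |α − 1| form below is our assumption; the function names are ours.

```python
def pqdp_pseudo_gt(alpha):
    # Exact GTs: probability 0 at alpha = 0 and 0.5 at alpha = 1 (the JND
    # point); intuitive pseudo-GTs of 0 / 1 for the remaining alpha values.
    if alpha == 1.0:
        return 0.5
    return 0.0 if alpha < 1.0 else 1.0

def reliability_weight(alpha):
    # Assumed form: full weight at the true GTs, and weight growing with the
    # distance |alpha - 1| for pseudo-GTs (the paper shows the curve only
    # graphically in Fig. 5-(b)).
    if alpha in (0.0, 1.0):
        return 1.0
    return abs(alpha - 1.0)
```

Pseudo-GT labels near α = 1, where they deviate most from the unknown true PQDP, thus contribute the least to the training signal.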
The reliability weight λ_α is modeled as a function of α that increases with the distance from α = 1, as plotted in Fig. 5-(b). Thus, the novel loss function L^α_pqdp with the reliability weight to stably train the CNN-based PQDP predictor is defined as

$L^{\alpha}_{pqdp} = \lambda_{\alpha} \cdot L_2(P^{\alpha}_{gt/pgt}, P^{\alpha})$,

where L_2(P^α_{gt/pgt}, P^α) is an L2 loss between the GT P^α_gt (or pseudo-GT P^α_pgt) of a PQDP value and its predicted value P^α for the ERJND_hdr-directed energy-reduced frame with scaling factor α. Fig. 6 shows the network architecture and training method of our CNN-based PQDP predictor. The feature extractor of the CNN-based PQDP predictor is similar to the VGG-13 structure [25], and the feature vectors F_x and F_er for the original HDR image patches x_hdr and the energy-reduced image patches x^er,α_hdr are extracted by a Siamese feature extractor. The concatenated features ([F_x − F_er, F_x, F_er]) are fed into two consecutive fully connected layers, and the output is passed through a sigmoid function at the end. To stably train the CNN-based PQDP predictor, the total loss is designed to have three terms: (i) a 0%-PQDP loss for α = 0, (ii) a 50%-PQDP loss for α = 1, and (iii) a random pseudo-GT loss for a randomly selected α_rnd = 0.1k value for k = 1, 2, ..., 9, 11, ..., 20. The total loss is defined as

$L_{total} = L^{0}_{pqdp} + L^{1}_{pqdp} + L^{\alpha_{rnd}}_{pqdp}$.

The training samples for GTs (α = 0, 1) are always used in the first and second losses, while the training samples for pseudo-GTs (α ≠ 0, 1) in the third loss are randomly selected, one at a time for each training sample, thus avoiding a heavy influence of the third loss with the pseudo-GTs on the total loss.
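The behavior of this three-term total loss can be emulated with scalars. Here `pred(alpha)` is a hypothetical stand-in for the Siamese CNN's sigmoid output, and the label and weight forms follow our assumptions (exact λ_α form is not given analytically in the text).

```python
def weighted_l2(p_gt, p_pred, weight):
    # One reliability-weighted L2 term of the per-alpha PQDP loss.
    return weight * (p_gt - p_pred) ** 2

def total_pqdp_loss(pred, rnd_alpha):
    # Three-term total: a 0%-PQDP term (alpha=0), a 50%-PQDP term (alpha=1),
    # and one randomly drawn pseudo-GT term at rnd_alpha. Label and weight
    # forms are our assumptions mirroring Fig. 5.
    gt = lambda a: 0.5 if a == 1.0 else (0.0 if a < 1.0 else 1.0)
    w = lambda a: 1.0 if a in (0.0, 1.0) else abs(a - 1.0)
    return sum(weighted_l2(gt(a), pred(a), w(a)) for a in (0.0, 1.0, rnd_alpha))
```

In training, `rnd_alpha` would be drawn uniformly from {0.1, ..., 0.9, 1.1, ..., 2.0}, once per training sample, so the pseudo-GT term never dominates the two exact-GT terms.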

D. CONSTRUCTING THE GROUND TRUTH FOR HDR-JNDNet TRAINING
In Section III.B and Section III.C, we explained how to generate various versions of ERJND_hdr-directed energy-reduced frames with different suppression strengths for each original training frame, and how to train the CNN-based PQDP predictor, respectively. The next step is to construct a ground-truth ERJND_hdr-directed energy-reduced frame for each original training HDR frame for a given QP value, which is then encoded. Fig. 3 shows the process of generating the ground-truth ERJND_hdr-directed energy-reduced frame for each original training HDR frame for a given QP value by using the pretrained CNN-based PQDP predictor. Then, the pairs of original HDR frames and their corresponding ground-truth ERJND_hdr-directed energy-reduced frames are used to train the HDR-JNDNet. The generation process of the ground truth in Fig. 3 consists of a sequence of steps: (i) for an HDR input frame X_hdr, 21 versions (X^er,α_hdr) of ERJND_hdr-directed energy-reduced frames are generated with α = 0.1k, k = 0, ..., 20 according to Eqs. (7) and (8); (ii) the 21 versions (X^er,α_hdr) and the original HDR input frame X_hdr are encoded and decoded by the HEVC reference software codec of the main10 profile (HM 16.17) at a given target QP value, QP = qp; (iii) the PQDP values are predicted in units of 64 × 64 blocks by using the pretrained CNN-based PQDP predictor for the decoded original HDR frame D_hdr,qp and the 21 decoded ERJND_hdr-directed energy-reduced frames D^er,α_hdr,qp; (iv) the ground-truth ERJND_hdr-directed energy-reduced frame X^{er,α_qp}_hdr is constructed for the whole original HDR input frame X_hdr by placing the 64 × 64-sized blocks, each of which has the maximum energy-reduction strength (α_qp) with a predicted PQDP value of less than 0.5, among the co-located 64 × 64-sized blocks of the 21 ERJND_hdr-directed energy-reduced frames.
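Step (iv) above reduces to a per-block search for the largest α satisfying a probability constraint. The sketch below illustrates that selection; `pqdp(b, alpha)` is a hypothetical stand-in for the pretrained CNN-based PQDP predictor evaluated on the decoded co-located 64 × 64 blocks.

```python
def build_gt_frame(pqdp, n_blocks):
    # For each 64x64 block, keep the largest alpha whose predicted detection
    # probability stays below 0.5; pqdp(b, alpha) stands in for the pretrained
    # CNN-based PQDP predictor run on the decoded co-located blocks.
    alphas = [round(0.1 * k, 1) for k in range(21)]
    chosen = []
    for b in range(n_blocks):
        best = 0.0  # alpha = 0 (no suppression) is always perceptually safe
        for a in alphas:
            if pqdp(b, a) < 0.5:
                best = max(best, a)
        chosen.append(best)
    return chosen  # per-block alpha_qp used to assemble the GT frame
```

The GT frame is then assembled by taking, for each block index, the co-located block from the energy-reduced frame with the chosen α.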

E. TRAINING THE HDR-JNDNet AND OUR PVC SCHEME USING THE PROPOSED HDR-JNDNet
The ground-truth ERJND_hdr-directed energy-reduced frames X^{er,α_qp}_hdr are used to train the HDR-JNDNet separately for different QP values. Fig. 7 shows the network architecture of the HDR-JNDNet, which is designed as a simplified Residual Dense Network (RDN) [32] by reducing the number of Residual Dense Blocks (RDB) to 3, the number of convolutional layers per RDB to 2, the number of feature maps to 32, and the channel growth rate to 8. The input patch size is set to 64 × 64, and an L1 loss is used to train the HDR-JNDNet. In testing, the trained HDR-JNDNet directly yields an adjusted ERJND_hdr-directed energy-reduced frame for a given QP value and a 10-bit HDR input frame before the encoding process starts. Then, the adjusted ERJND_hdr-directed energy-reduced frame undergoes compression at the given QP value. This is our proposed preprocessing-based perceptual video coding (PVC) scheme. Our preprocessing-based PVC scheme can effectively remove perceptual redundancy with the proposed HDR-JNDNet, achieving high coding efficiency with little perceptual quality degradation and without modifying the internal structures of any video codec.

IV. EXPERIMENTAL RESULTS

A. TRAINING DATA AND TESTING DATA
To train our CNN-based PQDP predictor and HDR-JNDNet models, we use 376 images extracted from 15 training 4K-UHD (3840 × 2160) 10-bit HDR videos. For testing, we use 10 4K-UHD (3840 × 2160) 10-bit HDR video clips, which are shown in Fig. 8. All training and test HDR videos are in quantized 10-bit YCbCr 4:2:0 format within the BT.2020 color container [1] after applying the perceptual quantization (PQ) transfer function [8]. It should be noted that our preprocessing-based PVC scheme can be applied to pixel-domain HDR signals obtained after applying any HDR transfer function, such as the PQ [8] and Hybrid Log-Gamma (HLG) [9] functions. In this paper, we used the PQ transfer function, which is commonly used in standards and commercial products, for the experiments. The training images for the HDR-JNDNet are encoded using the HEVC reference software (HM16.17) [26] under the main10 HDR All-Intra profile for 4 QP values of 22, 27, 32, and 37. Since the HDR-JNDNet generates the adjusted ERJND_hdr-directed energy-reduced frames only for Y-channel frames, the Cb- and Cr-channel frames of input HDR videos are encoded without preprocessing in our preprocessing-based HDR video coding scheme. In this work, we focused on the energy reduction of the Y-channel because the data size of each of the Cb- and Cr-channel frames is one-fourth that of the Y-channel frames in YCbCr 4:2:0 format.

B. EXPERIMENT SETTINGS FOR SUBJECTIVE TEST
The performance of our preprocessing-based HDR video coding scheme is measured by the amount of bitrate reduction achieved while maintaining the same perceptual video quality between a reference frame (decoded frame for a non-preprocessed input) and a degraded frame (decoded frame for an input preprocessed using the HDR-JNDNet). The decoded HDR video quality can be measured objectively and subjectively. According to [31], some known objective HDR metrics [16], [19] are not appropriate for measuring the perceptual quality differences between reference frames and degraded frames because they do not coincide well with human perceptual quality. In general, perceptual quality differences are assessed through user studies such as subjective quality assessment following standard evaluation protocols [24], and the quality scores (e.g., MOS) obtained from the subjects are statistically analyzed to judge whether perceptual differences exist between the reference sequences and the degraded sequences under test.
Thus, we performed subjective tests to measure the perceptual quality of the decoded HDR videos. The subjective tests were performed under the common test conditions [4]. The DSCQS (double stimulus continuous quality scale) method was used for subjective evaluation [24]. Both the reference video and the degraded video were given subjective voting scores as Mean Opinion Scores (MOS), ranging from 0 (worst) to 100 (best) perceptual quality. The two obtained MOS values were then converted into a DMOS (Differential Mean Opinion Score) value, defined as the MOS value of the reference video minus that of the degraded video, to compare their perceptual quality [24]. The display used for the subjective tests was an LG 65EF950 4K-UHD/HDR TV. The viewing distance for the 4K-UHD/HDR videos was 1.6H [6], where H is the height of the display. A total of 20 subjects (male: 15, female: 5, average age: 23.25, max age: 26, min age: 22) with normal vision who were non-experts in image processing participated in the subjective quality assessment experiments.
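The DSCQS-to-DMOS conversion described above is simple arithmetic; as an illustrative sketch (the function names are ours, not from the paper):

```python
def dmos(mos_reference, mos_degraded):
    """DMOS as defined in the text: reference MOS minus degraded MOS.

    Both inputs are on the 0 (worst) to 100 (best) DSCQS scale, so DMOS
    lies in [-100, 100]; values near zero mean no perceived difference,
    and negative values mean the degraded video was rated higher.
    """
    return mos_reference - mos_degraded

def mean_dmos(pairs):
    """Average DMOS over a list of (reference, degraded) MOS pairs."""
    return sum(dmos(r, d) for r, d in pairs) / len(pairs)
```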

1) THE ANALYSIS OF THE SUBJECTIVE VISUAL QUALITY ASSESSMENT
Most preprocessing-based HDR video coding research focuses only on developing a new perceptual quantization (PQ) for effectively quantizing real HDR videos into 10-bit HDR videos while preserving their perceptual quality [13], [29], [30]. Thus, since there is no prior preprocessing research that improves the compression efficiency of the HEVC main10 profile for quantized 10-bit HDR videos, our preprocessing-based HDR video coding scheme with the proposed HDR-JNDNet is compared with direct encoding, i.e., the standard HDR encoding scheme using the HM16.17 main10 encoder under the Random-Access profile [26] (hereinafter referred to as HM16.17). The compression efficiency of PVC schemes is measured under the condition that the decoded HDR videos with preprocessing are perceptually the same as those without it. For this, subjective visual quality assessment (VQA) tests were performed. Table 1 shows the DMOS results of the subjective tests. For all test HDR videos at all QP values, the average DMOS values of our proposed PVC scheme are very close to zero with small standard deviations, within the DMOS range between −100 and +100. A DMOS value of zero implies that there is no perceptual difference between the two decoded videos. For the statistical analysis of the perceptual differences (DMOS values) between our PVC scheme and HM16.17, we performed the Wilcoxon Signed-Rank test [20] on the MOS values obtained for our proposed PVC scheme against those of HM16.17 for all 40 test cases (40 combinations of 10 test sequences and 4 QP values). The Wilcoxon Signed-Rank test is a nonparametric test that statistically compares two dependent samples and assesses whether their differences are significant. From the results of the Wilcoxon Signed-Rank test in Table 1, the p-values of all outputs of our proposed PVC scheme were at least 0.27, except for the 'Kayak' video at QP = 27.
Since all p-values except for that one case are larger than 0.05, the null hypothesis that the two MOS distributions of the two methods under comparison are statistically identical cannot be rejected.
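In practice one would run this test with a library routine such as `scipy.stats.wilcoxon`. Purely as an illustration of the test's mechanics, a self-contained sketch using the large-sample normal approximation (without continuity or tie-variance corrections, so its p-values only approximate exact small-sample ones) could look like this:

```python
import math

def wilcoxon_signed_rank_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value for paired samples,
    via the normal approximation (adequate for roughly n >= 10)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks over tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1                 # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
```

A p-value above 0.05 means the hypothesis of identical MOS distributions cannot be rejected, which is the criterion used in the analysis above.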
In the case of the 'Kayak' video at QP = 27, surprisingly, the decoded video of our PVC scheme was evaluated as having better perceptual quality than that of the original HM16.17. The quickly splashing water droplets are smoothed by the preprocessing of our HDR-JNDNet, so the perceptual quality of the decoded video of our PVC scheme is rather improved despite the reduction in the energy of the video. In the subjective test for the 'Kayak' video at QP = 27, 55% of the subjects rated the perceptual qualities of the two decoded videos as the same, and 40% of the subjects rated the decoded video of our PVC scheme as having better quality. Therefore, according to the results of the Wilcoxon Signed-Rank test, the hypothesis that the perceptual qualities of the two decoded videos are the same is rejected; rather, the perceptual quality of the decoded video of our PVC scheme can be judged to be statistically better than that of the original HM16.17. Thus, it turns out that the MOS values for our PVC scheme are statistically identical or superior to those of HM16.17. Fig. 9 shows the subjective evaluation results on the decoded HDR video for the original HDR input (called the reference decoded video) and the preprocessed HDR input by our HDR-JNDNet (called the test decoded video). As shown in Fig. 9, about 58.5% of the subjects could not recognize any perceptual visual quality difference between the reference and test decoded videos. That is, considering the definition of JND, under which 50% of the subjects cannot recognize the difference between the two, it can be confirmed that our CNN-based PQDP predictor successfully predicts the perceptual distortion boundary of JND considering the compression distortions.

FIGURE 9. Subjective evaluation results on the decoded HDR video for the original HDR input (called reference decoded video) and the preprocessed HDR input by our HDR-JNDNet (called test decoded video). The original HDR input and the preprocessed HDR input by the HDR-JNDNet are encoded and decoded using the HEVC reference software codec of the Main 10 Profile for QP = 22, 27, 32 and 37. 'LQ' ('HQ') indicates that the subjects rated the test decoded video with lower (higher) perceptual quality scores than the reference decoded video. 'SQ' implies that the subjects perceived the reference and test decoded videos as having the same visual quality.

2) THE ANALYSIS OF THE BITRATE REDUCTION PERFORMANCE
Table 1 shows the bitrate reduction (BR) performance of our PVC scheme compared to HM16.17 under the Random-Access main10 profile for 4 QP values. Through the subjective quality comparison analysis, the decoded frames of our PVC scheme have statistically the same perceptual quality as those of HM16.17. Thus, our proposed PVC scheme saves the required bitrates by up to a maximum (average) of 40.66% (18.37%) for all 4K-UHD/10-bit HDR test videos at all QP values, with little subjective video quality degradation compared to the original HM16.17. As can be seen from Eqs. (1)-(4), most of the perceptual redundancy exists in highly textured regions and extremely bright or dark regions. Thus, the proposed HDR-JNDNet improves the BR performance of our PVC scheme by removing the perceptual redundancy in those regions of an input frame. Fig. 10 shows the ERJND_hdr maps and the difference images between the first original HDR images and their HDR images processed by the HDR-JNDNet for the 10 test 4K-UHD/10-bit HDR videos. The ERJND_hdr maps in Fig. 10-(a) show the JND threshold values in 8 × 8 DCT block units. The brighter regions are interpreted as having perceptually more redundant signal components, which are mostly observed in complex texture regions due to contrast masking effects as well as in very dark or bright areas due to luminance masking effects.

FIGURE 10. ERJND_hdr maps and the difference images between the original HDR images and the HDR images processed by the HDR-JNDNet: (a) the ERJND_hdr maps of the first frames of the 10 test 4K-UHD/10-bit HDR videos; (b) the difference maps between the first original frame and its energy-reduced frame by the HDR-JNDNet for the 10 test 4K-UHD/10-bit HDR videos.

The image differences between the original frames and their energy-reduced frames by the HDR-JNDNet in Fig. 10-(b) show somewhat similar trends in brightness to the ERJND_hdr maps. This indicates that our HDR-JNDNet works properly as preprocessing by effectively removing perceptual redundancy, mostly from the highly textured regions of the mountain in the 'Rainbow' video and the very dark region of the truck in the 'Truck' video. Also, it should be noted in Fig. 10-(b) that homogeneous regions such as the sky, the lake, and unfocused blurred areas (mostly regions with low-frequency signal components) appear very dark, which means little suppression.
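As a side note on how block-unit maps like those in Fig. 10-(a) are organized, the sketch below computes a per-block statistic over non-overlapping 8 × 8 tiles of a luma plane. This is NOT the ERJND_hdr model (which is defined earlier in the paper); the per-block standard deviation is only a toy texture-complexity proxy, used here to illustrate the block-map layout.

```python
import numpy as np

def blockwise_map(luma, block=8, stat=np.std):
    """Per-block statistic over non-overlapping block x block tiles.

    Returns an (H//block, W//block) map; with stat=np.std, textured
    blocks score high and flat blocks score near zero, a crude stand-in
    for where contrast masking would allow more suppression.
    """
    h, w = luma.shape
    h, w = h - h % block, w - w % block   # crop to a multiple of the block size
    tiles = luma[:h, :w].reshape(h // block, block, w // block, block)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(h // block, w // block, -1)
    return stat(tiles, axis=-1)
```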
We also compared our method with the state-of-the-art preprocessing-based PVC scheme for SDR videos using the CNN-JNQD model [9] (hereinafter referred to as the PVC scheme using the CNN-JNQD) in Table 1. Since there is no prior preprocessing research that improves the compression efficiency of the HEVC main10 profile for quantized 10-bit HDR videos, we have no choice but to compare against a preprocessing-based PVC scheme developed for SDR videos. In this comparison experiment, the CNN-JNQD model is applied to the HDR input before encoding. Note that the CNN-JNQD model was developed for SDR images, so this comparison can be used to inspect the effectiveness of our HDR-JNDNet for HDR PVC. The suppressed HDR input frames produced by the CNN-JNQD model are encoded using the HM16.17 main10 encoder under the Random-Access profile. Table 1 shows the BR results of the PVC scheme using the CNN-JNQD under the main10 Random-Access profile for the 10 test 4K-UHD/10-bit HDR videos. As shown in Table 1, the BR performance of our proposed PVC scheme outperforms that of the PVC scheme using the CNN-JNQD by an average of 9.62 percentage points over the 10 test 4K-UHD/10-bit HDR videos at all QP values. Thus, we can confirm that the state-of-the-art preprocessing method developed for SDR videos does not provide as much bitrate saving for HDR videos as our preprocessing-based PVC scheme using the HDR-JNDNet.
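The BR figures quoted throughout this subsection are percentage bitrate savings against the HM16.17 anchor at the same perceptual quality. Assuming the usual definition of bitrate reduction (the paper does not spell out the formula), the computation is:

```python
def bitrate_reduction_pct(anchor_bitrate, test_bitrate):
    """Bitrate reduction (BR) of a test encoding relative to an anchor,
    in percent; positive means the test stream is smaller than the anchor.
    """
    return 100.0 * (anchor_bitrate - test_bitrate) / anchor_bitrate
```

For example, a preprocessed sequence that encodes to 800 kbps against a 1000 kbps anchor at the same perceptual quality corresponds to a BR of 20%.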

3) EFFECTIVENESS OF THE CNN-BASED PQDP PREDICTOR
This superiority in BR comes from the fact that the ground-truth energy-reduced frames used for training the HDR-JNDNet are constructed using the CNN-based PQDP predictor based on the ERJND_hdr model for HDR videos. To prove the effectiveness of the CNN-based PQDP predictor, we compared it with the compression distortion visibility metric (CDVM), which is used in the state-of-the-art preprocessing PVC scheme for SDR videos [10]. CDVM is a simple model-based equation built on the ERJND model for SDR videos. Fig. 11 compares the ratios of the energy-reduction strengths (α_qp) selected by CDVM and by the proposed CNN-based PQDP predictor over all frames of 4 test 4K-UHD/10-bit HDR videos ('Mask', 'Rainbow', 'Truck', and 'Waterfall') at target QP = 22, 27, 32, and 37. For a fair comparison between CDVM and our CNN-based PQDP predictor, the ERJND model in CDVM is replaced with our ERJND_hdr model. When CDVM is used, all selected α values are smaller than 1 in the test videos at the 4 QP values. The reason is that CDVM judges the detection probability using the difference between the decoded frame of the preprocessed input and its original frame, not the decoded frame of the original frame. Also, it can be seen that the ratios of the selected α hardly change with the target QP value. In other words, determining the α value based on CDVM does not sufficiently consider the quantization distortions of video compression. In contrast, when the CNN-based PQDP predictor is used, relatively larger α values are selected than with CDVM at the same target QP value. Also, as the target QP value increases, the proportion of large α selections increases. This reflects that, at a high QP value, the stimulus of the quantization distortion in video compression is already sufficiently large, so a human cannot perceive the additional distortion even if relatively more energy is reduced.
Therefore, the HDR-JNDNet based on the CNN-based PQDP predictor can reduce more energy of an input frame than one based on CDVM. Also, through the subjective quality comparison analysis, we have shown that the decoded frames of our PVC scheme based on the CNN-based PQDP predictor have statistically the same perceptual quality as those of HM16.17. From this, it is clear that the CNN-based PQDP predictor can more accurately predict the human distortion detection probability than the CDVM used in the state-of-the-art preprocessing PVC scheme.
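The α-selection logic discussed above amounts to picking the strongest energy reduction whose predicted detection probability stays at or below the 50% JND boundary. A minimal sketch of that selection rule follows; the candidate α grid and the `pqdp_predictor(frame, alpha, qp)` callable (a stand-in for the paper's CNN-based PQDP predictor) are our assumptions, not the authors' interface.

```python
def select_alpha(pqdp_predictor, frame, target_qp,
                 alphas=(0.25, 0.5, 0.75, 1.0, 1.25, 1.5)):
    """Pick the largest energy-reduction strength alpha whose predicted
    detection probability stays at or below the 50% JND boundary.

    `pqdp_predictor(frame, alpha, qp)` is a hypothetical stand-in for the
    CNN-based PQDP predictor; it should return a probability in [0, 1].
    """
    feasible = [a for a in alphas
                if pqdp_predictor(frame, a, target_qp) <= 0.5]
    return max(feasible) if feasible else alphas[0]
```

With a predictor whose detection probability falls as QP grows, this rule selects larger α at higher QP, matching the trend reported for the CNN-based PQDP predictor in Fig. 11.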

4) ANALYSIS ON COMPUTATIONAL COMPLEXITY
Since the proposed PVC scheme is preprocessing-based, there is no additional operation time at decoding. Only the preprocessing step is required before encoding, and it can even be performed offline. Moreover, it should be noted that preprocessing with the proposed HDR-JNDNet can accelerate the encoding because high-frequency components of perceptual redundancy are removed, which may lead to more frequent selection of SKIP modes during encoding. Also, the entropy coding times can be reduced due to the zero-residual effect under the Random Access configuration. The preprocessing runtime of our HDR-JNDNet is 0.4 seconds for one 4K-UHD (3840 × 2160)/10-bit HDR frame on an NVIDIA TITAN RTX GPU. Table 2 tabulates the processing runtimes of our PVC scheme using the proposed HDR-JNDNet, compared to the original HM16.17, for four QP values, QP = 22, 27, 32, and 37. As shown in Table 2, the preprocessing time taken by our HDR-JNDNet only increases the total encoding time of the original HM16.17 by an average of 0.49% over the four QP values. Also, it can be noted in Table 2 that the total encoding times of HM16.17 for the input preprocessed with the HDR-JNDNet are reduced by an average of 1% compared to those of HM16.17 without preprocessing. In conclusion, the overall processing time of our PVC scheme is even reduced, by an average of 0.51%, compared to that of the original HM16.17 at the four QP values. In summary, our preprocessing-based PVC scheme using the HDR-JNDNet achieves a large coding efficiency improvement with little perceptual quality degradation for HDR videos, without modifying any video codec structure or increasing the computational complexity.

V. CONCLUSION
In this paper, we propose, for the first time, a learning-based preprocessing method based on JND-directed energy reduction for 10-bit HDR video signals, called HDR-JNDNet, which can significantly reduce the perceptual redundancy of 4K-UHD/10-bit HDR video for the HEVC main10 encoder. To effectively train the HDR-JNDNet, a CNN-based perceptual-quality-difference detection probability (PQDP) predictor is presented that can predict the PQDP values between the decoded frames of an original input video with and without our preprocessing applied. The CNN-based PQDP predictor is trained to judge whether or not the compressed HDR video after preprocessing is distinguishable from the direct encoding without preprocessing. With the ground truth produced by the pretrained CNN-based PQDP predictor, the HDR-JNDNet is trained to generate the preprocessed frames by taking into account the expected compression distortion for a target QP value. Through extensive experiments, the HDR-JNDNet is verified to be very effective as preprocessing for 4K-UHD/10-bit HDR video input before compression. It yields remarkable improvements in coding efficiency, with bit savings of up to a maximum (average) of 40.66% (18.37%) for the 4K-UHD/10-bit HDR test videos, with little subjective visual quality degradation and without increasing the computational complexity.