Perceptual Adaptive Quantization Parameter Selection Using Deep Convolutional Features for HEVC Encoder

In this paper, we propose a perceptual adaptive quantization method based on a deep neural network for high efficiency video coding (HEVC) that reduces bitrate while maintaining subjective visual quality. The proposed algorithm adaptively determines frame-level quantization parameter (QP) values for different picture types of the hierarchical coding structure in HEVC by taking into account high-level features extracted from the original and previously reconstructed pictures. A predefined model based on the visual geometry group (VGG-16) network is used to extract high-level features that capture subjective visual characteristics. Furthermore, the Lagrange multiplier for each frame is adaptively determined from the proposed features, which set the parameter of the Lagrange multiplier used for rate-distortion optimization during the encoding process. Experimental results reveal that the proposed perceptual adaptive QP selection achieves bitrate savings of up to 65.73% and 47.68% and improves the BD-rate based on SSIM by approximately 20.68% and 14.27% under low-delay-P and random-access coding structures, respectively, with minimal visual quality degradation compared to HM-16.20 without adaptive QP selection.


I. INTRODUCTION
The High Efficiency Video Coding (HEVC) standard has been widely adopted because it achieves better compression performance than H.264/Advanced Video Coding (AVC) while maintaining similar visual quality [1]. It is used in various video media services and applies not only to full high definition (FHD) but also to 4K/8K ultra-HD (UHD) [2]-[4]. Since the standard was released, many studies have addressed visual quality improvement [5]-[7], computational complexity reduction [8]-[16], bitrate reduction [17], [18], and its prospects as a future video coding standard [19]-[26]. Among many coding tools, rate-distortion optimization (RDO) in the HEVC software model (HM) [26]-[28] is used to improve coding efficiency [30], [31]. It is based on optimization using the global Lagrange multiplier and determines the quantization parameter (QP) value using a QP-λ model. The Lagrange multiplier λ can be expressed as a function of the quantization step size, which is closely related to the QP value. It governs the coding efficiency of each basic unit by selecting the best coding mode under a given QP value, where the basic unit can be a frame, slice, or coding unit (CU). The common test condition (CTC) designed by the Joint Video Experts Team (JVET) employs static quantization parameters for fair comparison in standardization [32]. However, adaptive QP selection is known to be effective in improving subjective visual quality for practical applications. The adaptive QP should be designed to work in harmony with the RDO process. It can adjust the QP value for a particular frame or slice according to different spatial, temporal, or visual aspects. Some studies have explored approaches to improve the compression rate [33]-[37] or visual quality [38]-[44] with various adaptive QP techniques.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqi Wang.
Typically, these studies prioritize determining optimal QPs for the RDO process, producing better encoding parameters by analyzing the QP-λ relationship or by exploiting spatial-temporal dependencies among the basic units. In general, they take into account the essential role of λ in the RDO process. It is therefore worthwhile to consider a deep neural network (DNN) for more varied QPs in HEVC. Studies have demonstrated the benefits of DNNs for video coding [45]-[49]. However, there is no existing effective DNN-based algorithm for perceptual adaptive QP selection.
This study presents a DNN-based QP selection method that adaptively determines a frame-level perceptual QP for HEVC to achieve bitrate reduction without inducing visual quality degradation. The proposed algorithm is embedded in HM-16.20 and generates QP values adaptively for different picture types and coding structures in HEVC. It first determines a QP for the first frame in a sequence by averaging the standard deviation (StD) values of the original blocks. Then, it obtains high-level features from the original and reconstructed frames using a pretrained visual geometry group (VGG-16) network model [50]. Based on the extracted high-level features, a more visually friendly QP is then assigned to the subsequent frames in encoding order. The algorithm also determines the Lagrange multiplier adaptively for each frame based on the proposed model, which can be used for RDO in the encoding process. As a result, the proposed algorithm demonstrates significant coding gain with minimal visual degradation against HM-16.20 and other existing adaptive QP algorithms.
The rest of this paper is organized as follows. Section II briefly presents an overview of the QP decision in HM and related works. Section III discusses the proposed perceptual adaptive QP for HM. Section IV reports several performance evaluations of the proposed algorithm, and Section V draws the conclusions and suggests further research directions.

II. CURRENT STATE OF QP SELECTION AND RELATED STUDIES OF PERCEPTUAL ADAPTIVE QP IN HEVC
The current QP selection within the RDO process in HEVC is not optimal. Many studies have revealed several weaknesses of the QP selection technique in the HEVC encoder. In this section, several adaptive QP techniques for HEVC are discussed as follows.

A. GENERAL QP SELECTION CONCEPT IN HM
QP selection in video coding can be mathematically described as an RDO problem [35], [36] that minimizes the total coding distortion $D$ at a given bitrate $R_T$ as:

$$\mathbf{QP}^{*} = \arg\min_{\mathbf{QP}} \sum_{i=1}^{N} D_i \quad \text{s.t.} \quad \sum_{i=1}^{N} R_i \le R_T \tag{1}$$

where $N$ denotes the number of basic units, $D_i$ is the coding distortion, and $R_i$ is the coding bitrate of the $i$-th basic unit. Note that the basic unit in HEVC may be a frame, slice, or CU. $D_i$ and $R_i$ in (1) depend on $\mathbf{QP} = (QP_1, \cdots, QP_N)$. $QP_i$ refers to the QP value for the $i$-th basic unit, and $\mathbf{QP}^{*} = (QP^{*}_1, \cdots, QP^{*}_N)$ represents the optimal QP set for the $N$ basic units. Applying the λ method [29], equation (1) can be rewritten in the following unconstrained form:

$$\mathbf{QP}^{*} = \arg\min_{\mathbf{QP}} J, \qquad J = \sum_{i=1}^{N} D_i + \lambda \sum_{i=1}^{N} R_i \tag{2}$$

where $J$ stands for the total rate-distortion (RD) cost function, and λ represents the trade-off parameter between $D_i$ and $R_i$. Along with the RDO process, λ in HEVC can be obtained as

$$\lambda = QP_{factor} \cdot 2^{(QP-12)/3} \tag{3}$$

where $QP$ denotes the quantization parameter, and $QP_{factor}$ is a constant parameter related to coding configurations. The QP value in (3) is an integer that represents an actual quantization step size through an exponential mapping function. However, the quantization step size in HEVC tends to be static for complexity reduction in the RDO process. Applying a fixed or predefined QP scheme may cause the compression rate to drop significantly, given that HEVC has different coding configurations. Hence, this becomes a major challenge for any QP method design in HEVC.

Many QP adjustment methods have been studied for better coding gain. For example, a QP-λ relationship is used to determine the λ value according to an initial QP, and subsequently, the new QP value is recalculated [30], [31]. This algorithm is widely known as a straightforward algorithm for the RDO scheme in HEVC. Wang et al. [33] introduced an improved block-level adaptive QP value that considers previously coded block information.
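As a minimal sketch of the QP-λ mapping referred to in (3), the snippet below assumes the form commonly used in the HM reference software, λ = QP_factor · 2^((QP − 12)/3). The function name `hm_lambda` is hypothetical, the default factor of 0.57 is the static value quoted later in this paper, and the additional per-configuration adjustments that HM applies are omitted.

```python
def hm_lambda(qp: float, qp_factor: float = 0.57) -> float:
    """Illustrative QP-to-lambda mapping (assumed form of equation (3)).

    qp        : quantization parameter of the basic unit
    qp_factor : coding-structure constant (0.57 is the static value
                quoted in this paper; treat it as an assumption)
    """
    # Lambda doubles every time QP increases by 3.
    return qp_factor * 2.0 ** ((qp - 12.0) / 3.0)


if __name__ == "__main__":
    for qp in (22, 27, 32, 37):
        print(qp, hm_lambda(qp))
```

Under this assumed form, changing $QP_{factor}$ rescales λ for a fixed QP, which is why an adaptive $QP_{factor}$ changes the rate-distortion trade-off without changing the quantization step size.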
Zhao et al. [34] proposed a QP cascading scheme that assigns QP values to different hierarchical temporal picture layers. Similar algorithms were also introduced by Li et al. [35] and He et al. [36], which presented only an inter-frame dependency technique. As far as we know, these last two algorithms can provide better coding gain for an HEVC encoder. Extensive use of spatial-temporal predictions in HEVC is important for adaptive QP selection in RDO. Although the integration of such propagation effects is desirable, there are not many such studies.

B. EXISTING METHODS OF PERCEPTUAL ADAPTIVE QP SELECTION FOR HM
Determining the QP value for video encoders also affects the overall visual quality of a video sequence. To improve the subjective quality of adaptive QP, spatial features, temporal features, or a combination of both may be designed empirically. The open-source x265 encoder [38] is one of several implementations that provide a perceptual adaptive QP method with spatial and temporal features. However, it still fails to give promising outcomes if a reference frame has characteristics different from the current coding frame. Test Model 5 (TM5) of the MPEG-2 software [39] also uses a method that scales the quantization step according to the spatial activity of one CU relative to a frame-level average of the spatial activity. This method fails when the size of a large CU block needs to be estimated, thus limiting its performance [37]. Similarly, Yeo et al. [40] introduced a block-level adaptive QP selection algorithm that observes the spatial and temporal pixel characteristics of CU blocks. However, it requires a longer encoding time. Prangnell et al. [41] used transform coefficients based on a soft thresholding method. However, this soft thresholding method may still cause fluctuations in visible quality, resulting in severe visual distortion. An alternative algorithm determines a QP offset based on a derived QP-λ relationship. Yeo et al. [40] have also studied related topics. However, their method utilized only the spatial variance of a block, which is limited for videos with large homogeneous areas [42]. Xiang et al. [43] proposed a perceptual motion estimation method using a spatial-temporal just-noticeable-distortion (JND) model for a QP offset design. Rouis et al. [44] generated temporal perceptual features as well as CTU visual sensitivity as a spatial feature. However, both features considered in this algorithm are used only for an adaptive λ in RDO.
In conclusion, spatial and temporal perceptual features for an adaptive QP decision can provide a better trade-off [43], [44].
VOLUME 8, 2020

C. DNN APPROACH TO PERCEPTUAL ADAPTIVE QP SELECTION FOR HM
The use of DNNs has now become practical for the video coding community. Liu et al. [45] and Ma et al. [46] have presented case studies on deep learning-based video coding. Choi and Bajic [47] studied deep learning-based frame prediction that uses decoded frames to predict the textures of a block; it performs both uni- and bi-directional prediction at various distances from a target frame. Ki et al. [48] developed a JND model based on deep learning for the assessment of perceptual distortion in HEVC. Li et al. [49] proposed a DNN-based rate control for intra-coded pictures in HEVC that predicts the parameters of the R-λ rate control model. These and other studies have revealed the benefits of deep learning for video encoding. However, it is still difficult to find a deep learning method designed specifically for perceptual adaptive QP. In this paper, we present a perceptual adaptive QP based on a predefined VGG network for HEVC.

III. PROPOSED ALGORITHM FOR PERCEPTUAL ADAPTIVE QP SELECTION FOR HEVC ENCODER
The main objective of the proposed algorithm is to achieve significant bitrate savings without inducing noticeable visual distortions in reconstructed video frames. We first examined the current QP-λ relationship in HEVC, as shown in (3). The two main factors involved are $QP_{factor}$ and the QP value. In HM-16.20, the frame-level QP is determined with the same QP offset for all frames in the same temporal ID layer, while $QP_{factor}$, the coding structure parameter, is always fixed at 0.57, regardless of frame or slice types and coding structures. In HEVC, different frames form a hierarchical structure within a group of pictures (GOP). For example, frames at a higher temporal layer in the same GOP can be predicted from one or more frames at the lower temporal layers. Therefore, applying only the default QP offset and $QP_{factor}$ across different frames and coding structures is not perceptually sound for HEVC encoders. Considering both spatial and temporal features could resolve these issues. However, most existing adaptive QP methods concentrate on only one of the two. In this paper, the proposed algorithm extracts visual features from a particular convolutional layer of a DNN model for a frame-level adaptive QP. We consider both spatial and temporal features to generate the adaptive QP and $QP_{factor}$ decisions. Fig. 1 depicts the whole process of the proposed algorithm. As shown in Fig. 1, the proposed algorithm is embedded in the HEVC encoder and runs during slice initialization. Depending on the slice or frame type, the QP value and $QP_{factor}$ are determined adaptively. Fig. 2 shows the detailed process of the proposed algorithm.
For the first frame in a sequence, the proposed algorithm uses a straightforward design: it considers the standard deviation values of the original frame to decide a QP value and sets $QP_{factor}$ to its default value. Then, a pretrained VGG-16 model is employed to extract visual features from the original and reconstructed frames to predict the QP and $QP_{factor}$ for subsequent frames. The designed visual features yield a perceptual loss value based on the Euclidean distance measure, $VGG_{feature}$. The QP and Lagrange multiplier values based on $VGG_{feature}$ are then adaptively estimated by considering the picture types and coding configurations in HEVC. The detailed discussion is divided into the following subsections. The symbols used in the proposed algorithm of the adaptive frame-level perceptual QP for HEVC are described in Table 1.

A. GENERATION OF VISUAL FEATURES FOR THE PROPOSED PERCEPTUAL ADAPTIVE QP ALGORITHM
We propose to adaptively adjust a perceptual QP value per frame by employing a deep learning network, namely, the VGG-16 network [50]. The proposed algorithm employs a pretrained VGG-16 model to construct high-level feature descriptors using a specific convolutional layer. We select VGG-16 for this study because of several desirable characteristics. VGG-16 is widely recognized for its remarkable performance on image classification into 1000 categories on ImageNet, a database of over 14 million images. It has better image classification accuracy than the AlexNet model [51]. It has a straightforward architecture that is constructed simply by stacking convolution, pooling, and fully connected layers, without branches or shortcut connections to reinforce gradient flow. Such a design is versatile and adaptable for different practical purposes. In addition, VGG-16 has a very deep stack of convolutional layers trained on a large and diverse image dataset, which results in convolution filters that are well suited to finding universal patterns and generalizing them. It is also widely applied as a feature extractor in many computer vision solutions [52], [53]. For the same reason, the proposed algorithm uses only the VGG-16 convolution layers for visual feature extraction. In this paper, a simplified VGG-16 network is employed by removing the last pooling layer and the fully connected layers, as depicted in Fig. 3. In the figure, h and w represent the height and width of the input 64 × 64 CTU block, respectively. The VGG network can handle any input block size as long as h and w are multiples of 32. Hence, the CTU block can be used directly without prior resizing. By examining visualizations of the convolution filters and through trial-and-error experiments, we selected 'block5conv1', the first convolution layer of the fifth block, to build general features for the proposed algorithm.
The 'pool5' layer is part of the original network. However, it is neither used by the algorithm nor included in the figure. The 'pool5' layer is commonly affected by specific classification objects, which is not favorable for the detection of general features. We mainly rely on the generalizability of the VGG network so that the proposed feature descriptors can capture common and universal patterns.
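To make the input-size constraint concrete, the following sketch computes the shape of the feature map that the first convolution layer of the fifth VGG-16 block would produce for a given block size. It relies only on the standard VGG-16 layout (3 × 3 convolutions that preserve spatial size, four 2 × 2 max-pooling layers before the fifth block, 512 channels in that block). The function name is hypothetical, and the multiple-of-32 check simply mirrors the condition stated in the text.

```python
from typing import Tuple


def block5_conv1_output_shape(h: int, w: int) -> Tuple[int, int, int]:
    """Shape (height, width, channels) of the first conv layer in block 5.

    Assumes the standard VGG-16 layout: convolutions keep the spatial
    size, and four 2x2 max-pooling layers precede the fifth block.
    """
    if h <= 0 or w <= 0 or h % 32 != 0 or w % 32 != 0:
        # Mirrors the constraint stated in the text (multiples of 32).
        raise ValueError("h and w must be positive multiples of 32")
    poolings_before_block5 = 4
    scale = 2 ** poolings_before_block5  # spatial downsampling factor: 16
    channels = 512                       # filters in VGG-16 block 5
    return (h // scale, w // scale, channels)


if __name__ == "__main__":
    # A 64x64 CTU block yields a 4x4 grid of 512-dimensional features.
    print(block5_conv1_output_shape(64, 64))
```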
To obtain features that better reflect the human visual system (HVS), we introduce a perceptual loss function, a full-reference visual quality measure that uses the Euclidean distance. It compares the feature maps extracted from the original and reconstructed blocks, as depicted in Fig. 4. The reconstructed block fed to the network is taken after the in-loop filter process. The figure shows that the same VGG-16 model is used to extract both sets of high-level features. The Euclidean distance is preferred owing to its simplicity in expressing $VGG_{feature}$ as a perceptual loss value. To do this, we first convert both the original and the reconstructed CTU blocks to the RGB color format, as required by the VGG-16 architecture. The network can then extract visual features from both input blocks. Once $VGG_{feature}$ is generated, we use it to determine the QP value and $QP_{factor}$ adaptively for the Lagrange multiplier decision.
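The perceptual loss described above can be sketched as a plain Euclidean distance between two feature maps. In this illustration the feature maps are ordinary NumPy arrays standing in for the VGG-16 outputs of the original and reconstructed blocks. The function name is hypothetical, and any normalization by feature-map size or aggregation over the CTUs of a frame is not specified in the text and is therefore left out.

```python
import numpy as np


def vgg_feature_distance(feat_orig, feat_recon) -> float:
    """Euclidean distance between two feature maps of identical shape.

    feat_orig, feat_recon: arrays such as (4, 4, 512) feature maps taken
    from the same network layer for the original and reconstructed block.
    """
    a = np.asarray(feat_orig, dtype=np.float64)
    b = np.asarray(feat_recon, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("feature maps must have the same shape")
    # L2 norm of the element-wise difference.
    return float(np.sqrt(np.sum((a - b) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((4, 4, 512))          # placeholder features
    reconstructed = original + 0.01 * rng.standard_normal((4, 4, 512))
    print(vgg_feature_distance(original, reconstructed))
```

A value of zero means the two blocks are indistinguishable in feature space, and larger values indicate a larger perceptual difference under this measure.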

B. PERCEPTUAL ADAPTIVE QP DETERMINATION WITH QP-λ RELATIONSHIP
From the formula in (3), the QP value per frame can be derived. However, in HM-16.20 the Lagrange multiplier λ is decided only after the QP has been determined, while the QP value per frame is decided empirically based on the HM configuration. Therefore, finding a proper parameter for predicting a frame-level perceptual adaptive QP is a challenging issue.
Generally, coding errors may propagate from the previous frame to subsequent frames because of the predictive coding scheme in video coding standards. In this study, the proposed algorithm determines the frame-level QP for different picture types by obtaining a perceptual loss value based on high-level features from the original and previously reconstructed pictures. For the first frame in a sequence, the determination of a proper QP value is crucial, as it will determine the overall coding performance. However, having only an original picture is not enough to provide a perceptual loss value before the encoding. Hence, we examine whether the standard deviation (StD) values of the original blocks can represent the characteristics of a complete picture for the frame-level QP decision. We activated rate control to observe the different QP values of every CTU within the intra frame using the 'BasketballPass' test sequence with QP 22, 27, 32, and 37. The resulting relationship between QP and StD is presented in Fig. 5. A lower StD, which reflects a flat region, tends to have a higher QP, and vice versa. Therefore, we can expect some coding gain with little visual quality degradation in such regions. However, applying the StD value directly to vary λ through $QP_{factor}$ may lead to a high coding loss. Therefore, the QP decision in this algorithm first normalizes the pixel values of every CTU block in a frame before calculating StD and disregards λ and $QP_{factor}$. Then, the QP of the first frame can be made more visually friendly and can be expressed as: where $QP_0$ denotes the QP value of the first frame in a sequence, and $QP_{init}$ represents the initial QP value set by the encoder. Since the proposed algorithm is designed CTU-wise, the final picture characteristic of the first frame is decided based on the $StD_{intra}$ value, which is the average StD of the total number $N$ of the original CTU blocks in an intra frame.
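The frame-level statistic described above, an average of per-CTU standard deviations over normalized pixel values, can be sketched as follows. Several details are assumptions made only for illustration: pixels are normalized by dividing by the maximum code value for the bit depth, the population standard deviation is used, and partial CTUs at the right and bottom borders are included as they are. The function name `std_intra` is hypothetical, and the mapping from this statistic to the first-frame QP is not reproduced here.

```python
import numpy as np


def std_intra(luma, ctu_size: int = 64, bit_depth: int = 8) -> float:
    """Average per-CTU standard deviation of a normalized luma frame."""
    frame = np.asarray(luma, dtype=np.float64)
    if frame.ndim != 2:
        raise ValueError("expected a 2-D luma plane")
    # Assumed normalization: map code values to the range [0, 1].
    frame = frame / float((1 << bit_depth) - 1)
    height, width = frame.shape
    block_stds = []
    for y in range(0, height, ctu_size):
        for x in range(0, width, ctu_size):
            block = frame[y:y + ctu_size, x:x + ctu_size]
            # Population standard deviation of the pixels in one CTU.
            block_stds.append(block.std())
    # Average over all CTU blocks in the frame.
    return float(np.mean(block_stds))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy_frame = rng.integers(0, 256, size=(128, 192))  # placeholder frame
    print(std_intra(toy_frame))
```

A flat frame gives a value of zero, which under the trend reported for Fig. 5 corresponds to content that tolerates a higher QP.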
Thus, the symbols $\sigma_i$ and $\mu_i$ denote the StD and mean values of the original $i$-th CTU block, respectively, and $M$ denotes the total number of pixel values $x_j$. For the remaining frames, the quality of a reconstructed frame is generally influenced by a previously coded frame with a certain QP value. In this study, instead of analyzing the distortion of two consecutive frames, we investigate the distortion of VGG features to determine a proper QP value perceptually. Note that the proposed VGG features are extracted from the original and reconstructed frames based on the VGG-16 model. Therefore, the distortion of VGG features of two consecutive frames can be expressed as

$$D_{VGGpre} = f\left(D_{VGGref}\right) \tag{7}$$

where $D_{VGGpre}$ is the VGG feature distortion of a predicted frame, $D_{VGGref}$ denotes the VGG feature distortion of a reference frame, and $f(\cdot)$ is the relationship between $D_{VGGref}$ and $D_{VGGpre}$. Fig. 6(a) shows the VGG feature distortion relationship between two consecutive frames of the 'BasketballPass' test sequence. The sequence is encoded under the LDP configuration with the coding structure I-P-P-P-P. Each P frame uses only its previously coded frame as a reference. We set the predicted frame with a fixed QP value of 32 and encoded the first 15 frames. It can be seen that $D_{VGGref}$ influences $D_{VGGpre}$.
I. Marzuki, D. Sim: Perceptual Adaptive QP Selection Using Deep Convolutional Features for HEVC Encoder
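One simple way to examine the relationship $f(\cdot)$ between the reference-frame and predicted-frame distortions is an ordinary least-squares line fit. The sketch below assumes that the relationship is close to linear, as the paper later approximates it. The function name is hypothetical and the sample values are synthetic placeholders, not measurements from the paper.

```python
import numpy as np


def fit_distortion_line(d_ref, d_pre):
    """Least-squares fit d_pre ~ slope * d_ref + intercept."""
    x = np.asarray(d_ref, dtype=np.float64)
    y = np.asarray(d_pre, dtype=np.float64)
    if x.shape != y.shape or x.ndim != 1 or x.size < 2:
        raise ValueError("need two 1-D sequences of equal length >= 2")
    # Degree-1 polynomial fit returns (slope, intercept).
    slope, intercept = np.polyfit(x, y, 1)
    return float(slope), float(intercept)


if __name__ == "__main__":
    # Synthetic placeholder distortions for consecutive frame pairs.
    d_ref = [10.0, 12.0, 15.0, 18.0, 20.0]
    d_pre = [5.6, 6.3, 7.9, 9.0, 10.1]
    slope, intercept = fit_distortion_line(d_ref, d_pre)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The fitted slope plays the role of a linear coefficient relating the two distortions. Whether a line through the origin or a line with an intercept is the better model would need to be checked against the actual data.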
A further experiment was also conducted with rate control enabled to support the observations. Fig. 6(b) shows a high correlation between the VGG feature and QP selection per frame. Accordingly, the QP decision for the remaining frames can be determined by considering the picture types as in (8).
The QP decision for a future intra picture can be determined by using the $VGG_{feature}$ from the previously intra-coded picture. For P- and B-frames, we control $QP_{init}$ with $pQP_{Fid_i}$ and $bQP_{Fid_i}$ depending on the hierarchical frame index $Fid_i$, as shown in Table 2. The values of $pQP_{Fid_i}$ and $bQP_{Fid_i}$ are derived empirically and correspond to the coding structures under the LDP and RA configurations, respectively. To avoid large fluctuations in quality between neighboring frames, both $pQP_{Fid_i}$ and $bQP_{Fid_i}$ values for different temporal levels should satisfy the conditions described in (9)-(11), where $QP_{OffsetModelScale_i}$ and $QP_{OffsetModelOffset_i}$ take the default settings of the HEVC encoder configuration for frame index $i$. The values of both parameters can be found in Table 3.

C. PERCEPTUAL ADAPTIVE LAGRANGE MULTIPLIER DETERMINATION WITH QP-λ RELATIONSHIP
To increase bitrate savings while maintaining the visual quality of the proposed adaptive QP decision algorithm, we also determine the Lagrange multiplier using the proposed $VGG_{feature}$. Note that the Lagrange multiplier in HM-16.20 is assigned a static $QP_{factor}$ value. Hence, it is essential to provide an adaptive $QP_{factor}$ designed for different picture types and coding structures in HEVC.

1) QP factor DECISION FOR I-FRAMES
First, we searched for the best $QP_{factor}$ for intra-coded frames by substituting several constant values into equation (3) in experiments using HM-16.20 under the All Intra configuration. 'BasketballPass', 'BQSquare', 'BlowingBubbles', and 'RaceHorses' were used with all the QP settings for the experiment. Fig. 7 depicts the SSIM-based BD-rate performance for the corresponding $QP_{factor}$ values. It shows that the optimum $QP_{factor}$ for intra frames lies approximately in the range of 0.60 to 0.80, with a minimum BD-BR-SSIM gain of approximately −0.2% and the highest coding gain of approximately −0.5% at $QP_{factor}$ = 0.65. Accordingly, the $QP_{factor}$ for intra pictures can be determined as where $I_{QP_{factor}}$ must satisfy $0.57 \le I_{QP_{factor}} \le 0.80$, POC denotes the picture order count, and $VGG_{feature}$ is a perceptual loss value from the original and previously intra-coded pictures based on the VGG-16 model.
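The range constraint on the intra-frame factor can be enforced with a simple clamp. The sketch below shows only that bounding step: how a candidate value is derived from $VGG_{feature}$ and POC is not reproduced here, so the input is treated as an arbitrary candidate. The function and constant names are hypothetical.

```python
I_QP_FACTOR_MIN = 0.57  # lower bound stated in the text (HM default)
I_QP_FACTOR_MAX = 0.80  # upper bound stated in the text


def clamp_i_qp_factor(candidate: float) -> float:
    """Limit a candidate intra-frame QP factor to [0.57, 0.80]."""
    return max(I_QP_FACTOR_MIN, min(I_QP_FACTOR_MAX, candidate))


if __name__ == "__main__":
    for value in (0.40, 0.65, 0.95):
        print(value, "->", clamp_i_qp_factor(value))
```

Keeping the lower bound at the default of 0.57 means the adaptive factor never makes the Lagrange multiplier smaller than in the unmodified encoder for the same QP.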

2) QP factor DECISION FOR P-FRAMES
In the inter-picture coding framework under the LDP configuration, the quality of the reconstructed frames is generally influenced by the coding structure factor (the $QP_{factor}$ mentioned previously). As a result, the distortion of one frame with a certain QP value may affect both the visual quality and RD performance of future frames in encoding order according to the given $QP_{factor}$. Based on the observation illustrated in Fig. 6(a), the VGG feature distortion of a predicted frame, $D_{VGGpre}$, increases linearly with that of a reference frame, $D_{VGGref}$. Note that the λ values among different frames in the same GOP should be set differently, even when they are coded with the same QP value. Hence, deciding the $QP_{factor}$ for different frames in different temporal layers is desirable, and the relationship in (7) can be approximated as where $P_{QP_{factor}}$ stands for the $QP_{factor}$ of a P-frame, and $c$ is the linear coefficient, i.e., the slope of the approximated linear distortion relationship between $D_{VGGpre}$ and $D_{VGGref}$. D where $GOP_{size}$ and $Fid_i$ denote the GOP size for LDP, which is set to 4, and the frame index within the same GOP, respectively. An illustration of how $P_{QP_{factor}}$ is provided for P-frames under the LDP coding structure can be seen in Fig. 8. Then, the combination of (13) and (14) can be expressed as Since $D_{VGGref}$ is the same as $VGG_{feature}$ for perceptual retention purposes in $P_{QP_{factor}}$, (15) can be further adjusted as in (16), where the parameter $c$ is empirically set to 0.45 in this study.

3) QP factor DECISION FOR B-FRAMES
For the RA configuration, the $QP_{factor}$ decision uses a concept similar to that of the LDP case, with further adjustments. We first analyzed the hierarchical B coding structure under the RA configuration in HEVC, depicted in Fig. 9. Both the coding distortion and visual quality of the higher temporal layers are affected by those of the lower temporal layers. For the first frame in a GOP coded as an I-frame, the coding distortion and visual quality depend only on spatial operations. However, pictures coded as B-frames, including frames with temporal ID = 0 that are not I-frames, need to be treated in inter-frame fashion with their corresponding reference frames. Table 4 shows the POC difference between the current picture and its reference pictures for each temporal ID. This algorithm is designed to enable proper feature extraction for the coding frames. However, we used only the reference frame nearest to the current coded picture in the RA coding structure. Following a concept similar to that of the LDP configuration, the formula in (17) for the RA case can be expressed as where $B_{QP_{factor}}$ represents the $QP_{factor}$ for the B-frame, and $VGG_{feature}$ denotes the VGG feature extracted from the reference frames. $StD_{intra}$ is given from the I-frame depending on the intra period of each sequence configuration. $GOP_{size}$ is the GOP size of the RA case, which is set to 16, and $Tid_i$ is the temporal ID of frames in the same GOP. Parameter $c_i$ is a constant value for the $i$-th temporal ID that determines the $B_{QP_{factor}}$ of each frame in different temporal IDs. We first searched for the best $c$ per $Tid_i$ empirically with the default QP setting of HM-16.20. Fig. 10 expressed as:

IV. EXPERIMENTAL RESULTS
The test configuration used for evaluating the proposed algorithm is listed in Table 5. Coding efficiency evaluation was performed under a common test condition for HEVC [32] with the SSIM term [54]. In addition, subjective evaluation was done using the difference mean opinion scores (DMOS). The assessments were conducted by comparing the proposed algorithm against HM-16.20 as an anchor software and also against other existing works [40], [42].

A. CODING PERFORMANCE EVALUATIONS
We conducted several evaluations of the coding performance to assess the objective quality of the proposed algorithm.
All the objective quality measures are tabulated in Table 6. First, we checked the SSIM difference, $\Delta SSIM$, between the proposed algorithm and the anchor. It is defined by

$$\Delta SSIM = SSIM_{PRO} - SSIM_{HM} \tag{19}$$

where $SSIM_{PRO}$ and $SSIM_{HM}$ denote the luma SSIM quality of the proposed algorithm and the anchor, respectively. For (19), a negative value means that the SSIM quality of the proposed algorithm is worse than that of HM-16.20. We also evaluate the bitrate reduction, $\Delta Bitrate$, relative to the anchor software, which can be denoted by where $R_{PRO}$ and $R_{HM}$ represent the output bitrates of the proposed and anchor algorithms, respectively. The proposed algorithm is also evaluated against the anchor in BD-BR with the SSIM metric (BD-BR-SSIM) [54], [55]. According to Table 6, the proposed algorithm achieves better objective performance under the LDP configuration than under RA. In terms of visual quality, the number of intra-coded pictures in the LDP case indicates that the proposed algorithm has an essential role in maintaining the quality of the reconstructed frames. Better quality of the reconstructed frames can provide better prediction modes for the future inter-coded frames, as well as better visual features for the proposed QP and Lagrange multiplier selections. Considering both spatial and temporal visual features in the proposed algorithm results in significant bitrate reduction while retaining the visual quality of the test videos. For test sequences that have many homogeneous regions, slow motion, and background areas larger than the moving objects in a frame, the proposed algorithm obtains higher objective measures. Such visual characteristics can be seen in the 'BQTerrace', 'Johnny', 'FourPeople', 'Cactus', and 'KristenAndSara' videos, among others, for which the most significant coding gains are obtained in perceptual terms.
On the other hand, the proposed algorithm contributes only moderate coding improvements for 'Kimono' and 'RaceHorses', which have more texture and faster or more complex motion.
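The two per-sequence measures discussed above can be sketched as follows. The SSIM difference follows the sign convention stated in the text (negative means the proposed encoder is worse). The bitrate measure is written here as a percentage saving relative to the anchor, which is an assumption about the sign and normalization, since the paper's exact expression is not reproduced in this copy. Function names and sample numbers are hypothetical.

```python
def delta_ssim(ssim_pro: float, ssim_hm: float) -> float:
    """SSIM difference; negative means the proposed output is worse."""
    return ssim_pro - ssim_hm


def bitrate_saving_percent(rate_pro: float, rate_hm: float) -> float:
    """Percentage bitrate saving relative to the anchor.

    Assumed convention: positive values mean the proposed encoder
    uses fewer bits than the anchor.
    """
    if rate_hm <= 0:
        raise ValueError("anchor bitrate must be positive")
    return 100.0 * (rate_hm - rate_pro) / rate_hm


if __name__ == "__main__":
    # Placeholder numbers, not results from the paper.
    print(delta_ssim(0.952, 0.955))
    print(bitrate_saving_percent(rate_pro=820.0, rate_hm=1000.0))
```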

B. SUBJECTIVE PERFORMANCE EVALUATIONS
Subjective quality assessment was performed to compare the proposed algorithm and HM-16.20 for all the test sequences by following the double stimulus continuous quality scale (DSCQS) method [55]. Eighteen observers participated, of whom 11 work in a related field and the rest are naïve to image processing. Before the test, we conducted simple demonstrations for the observers to introduce the evaluation process. For each participant, the reconstructed frames from the proposed algorithm and HM-16.20 were randomly shown twice with all the QP values. Then, the observers were asked to provide MOS values on a continuous scale ranging from 1 to 5. Finally, we processed the MOS values to produce the DMOS scores between $MOS_{PRO}$ and $MOS_{HM}$, which denote the MOS quality of the proposed algorithm and the anchor, respectively. DMOS scores are defined by

$$DMOS = MOS_{PRO} - MOS_{HM}$$

where a negative value means that the visual quality of the proposed algorithm is subjectively worse than that of the anchor. As presented, the DMOS values for all the test sequences are quite close to 0. This means that the proposed algorithm produces output that is nearly visually identical to that of HM-16.20. For several video sequences, as shown in Table 7, the visual quality of the proposed algorithm is even slightly better than that of the anchor, such as in 'PeopleOnStreet', 'BQTerrace', 'BQMall', and 'BQSquare', primarily when they are generated under the RA coding structure. This similarity in video quality between the proposed algorithm and HM-16.20 can be seen for all the video sequence classes. Based on the DMOS test, the proposed algorithm degrades visual quality only very slightly compared to its anchor, by about −0.05 and −0.04 for the LDP and RA configurations, respectively, as shown in Table 8.
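As a small illustration of the subjective score processing, the sketch below averages the per-observer opinion scores for each encoder and takes their difference, with the sign chosen so that a negative result means the proposed output was rated lower than the anchor. The function name and the example ratings are hypothetical, and any outlier screening of observers is not covered.

```python
import numpy as np


def dmos(scores_pro, scores_hm) -> float:
    """Difference of mean opinion scores: mean(proposed) - mean(anchor)."""
    pro = np.asarray(scores_pro, dtype=np.float64)
    hm = np.asarray(scores_hm, dtype=np.float64)
    if pro.size == 0 or hm.size == 0:
        raise ValueError("both score lists must be non-empty")
    # Scores are expected on the continuous 1-to-5 scale used in the test.
    return float(pro.mean() - hm.mean())


if __name__ == "__main__":
    # Placeholder ratings from five observers, not data from the paper.
    proposed = [4.0, 3.5, 4.5, 4.0, 3.9]
    anchor = [4.1, 3.6, 4.4, 4.0, 4.1]
    print(dmos(proposed, anchor))
```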

C. COMPARISONS WITH EXISTING ALGORITHMS
Having presented both objective and subjective comparisons between the proposed algorithm and HM-16.20, we can conclude that the frame-level perceptual adaptive QP maintains visual quality while achieving better perceptual coding efficiency. In this subsection, we present the same objective and subjective comparisons of the proposed algorithm against other existing algorithms. Table 9 shows the SSIM-based BD-rate comparisons of Yeo et al. [40], Xiang et al. [42], and the proposed algorithm. As both existing algorithms were integrated into HM-16.0, we also implemented the proposed algorithm in the same software version for a fair comparison. As shown in Table 9, the proposed algorithm in the older software version still outperforms the two existing algorithms in perceptual coding efficiency. Overall, we achieve a coding gain of approximately −14.44%, while Xiang's and Yeo's algorithms achieve −4.51% and −3.56%, respectively. Note that all the results in Table 9 were generated under the random-access configuration with all the quantization parameter values.
Furthermore, we also performed the MOS test to evaluate the subjective visual quality of all the algorithms. Fig. 11 presents the average DMOS results of Xiang's, Yeo's, and the proposed algorithms in the RA structure. The performance of the baseline, which refers to the HM software, is set to zero for the visual similarity evaluation of the three algorithms. DMOS scores that are close to the zero baseline indicate visual similarity to the anchor. In the experimental results, the proposed algorithm yields the DMOS scores closest to zero for most of the test sequences, followed by Xiang's and Yeo's algorithms. This means that the proposed algorithm preserves subjective quality better than the two existing algorithms.

V. CONCLUSION
In this work, we propose a frame-level perceptual adaptive QP algorithm to obtain better subjective coding performance for HEVC. The proposed algorithm utilizes a predefined model of the VGG-16 network to extract features from the original and previously reconstructed pictures. We designed the proposed algorithm by developing a perceptual loss function based on the extracted features. The proposed algorithm adaptively determines perceptual QP values for different picture types of the hierarchical coding structure in HEVC. It achieves SSIM-based coding gains of approximately −21% and −14% compared with HM-16.20 for LDP and RA, respectively. The subjective quality evaluation shows that the proposed algorithm produces visual quality comparable to that of the anchor with significant bitrate savings.