Deep Predictive Video Compression Using Mode-Selective Uni- and Bi-Directional Predictions Based on Multi-Frame Hypothesis

Recently, deep learning-based image compression has shown significant performance improvements in terms of coding efficiency and subjective quality. However, relatively little effort has been devoted to video compression based on deep neural networks. In this paper, we propose an end-to-end deep predictive video compression network, called DeepPVCnet, using mode-selective uni- and bi-directional predictions based on a multi-frame hypothesis with a multi-scale structure and a temporal-context-adaptive entropy model. Our DeepPVCnet jointly compresses motion information and residual data that are generated from the multi-scale structure via the feature transformation layers. Recent deep learning-based video compression methods have been proposed for limited coding configurations that use only P-frames or only B-frames. Learning from the lessons of the conventional video codecs, we are the first to incorporate a mode-selective framework into our DeepPVCnet with uni- and bi-directional predictive modes in a rate-distortion minimization sense. Also, we propose a temporal-context-adaptive entropy model that utilizes the temporal context information of the reference frames for the current frame coding. The autoregressive entropy models for CNN-based image and video compression are difficult to compute with parallel processing. On the other hand, our temporal-context-adaptive entropy model utilizes temporally coherent context from the reference frames, so that the context information can be computed in parallel, which is computationally and architecturally advantageous. Extensive experiments show that our DeepPVCnet outperforms AVC/H.264, HEVC/H.265 and state-of-the-art methods in terms of MS-SSIM.


I. INTRODUCTION
Conventional video codecs such as AVC/H.264 [45], HEVC/H.265 [38] and VP9 [29] have shown significantly improved coding efficiencies, especially by enhancing their temporal prediction accuracies for the current frame to be encoded using its adjacent frames. In particular, there are three coding modes of frames used in video compression: I-frame (intra-coded frame) mode, in which a frame is compressed independently of its adjacent frames; P-frame mode, in which a frame is compressed through forward prediction using motion information; and B-frame mode, in which the current frame is compressed with bi-directional prediction. P-frame coding is suitable for low-latency video compression. From the perspective of coding efficiency, B-frame coding provides the highest coding efficiency compared to I-frame and P-frame coding. Therefore, the standard codecs [38], [45] use both P-frame and B-frame coding methods for video coding.
Deep learning-based approaches have recently shown significant performance improvements in image processing. Especially in the field of low-level computer vision, intensive research has been conducted on deep learning-based image super-resolution [12], [18], [20], [24] and frame interpolation [15], [28], [30]-[32]. In addition, there are many recent studies on image compression using deep learning [5], [6], [16], [21], [23], [27], [35], [40]-[42], which often adopt auto-encoder-based end-to-end image compression architectures to improve compression performance. These works have shown higher coding efficiency than traditional image compression methods such as JPEG [43], JPEG2000 [37], and BPG [7]. While image compression can only reduce the spatial redundancy among neighboring pixels, which limits its coding efficiency, traditional video compression achieves significantly higher compression performance because it also exploits the temporal redundancy among neighboring frames. By exploiting this temporal redundancy, deep learning-based video compression has been studied in two main directions. First, some components (or coding tools) of the conventional video codecs are replaced with deep neural networks. For example, Park and Kim [33] first tried to improve compression performance by replacing the in-loop filters of HEVC with a CNN-based in-loop filter. In [10], Cui et al. proposed a CNN-based intra-prediction method for HEVC to improve compression performance. In [51], Zhao et al. replaced the bi-prediction strategy in HEVC with a CNN to improve coding efficiency. Second, there are studies that improve compression performance by using auto-encoder-based end-to-end neural network architectures [4], [9], [11], [13], [25], [26], [36], [46], [47]. Although deep learning-based image compression has been intensively studied, video compression has drawn less attention. In this paper, we propose an end-to-end deep predictive video compression network, called DeepPVCnet, using mode-selective uni- and bi-directional predictions based on a multi-frame hypothesis with a multi-scale structure and a temporal-context-adaptive entropy model. The contributions of our proposed DeepPVCnet are as follows:
• We are the first to present a mode-selective framework with both uni- and bi-directional predictive coding structures for deep learning-based predictive video compression in the rate-distortion minimization sense, thus achieving improved coding efficiency. The selected mode information for frame prediction is transmitted to the decoder side with a negligible amount of bits;
• We propose a temporal-context-adaptive entropy model that utilizes temporally coherent context information from the multiple reference frames to estimate the parameters of Gaussian entropy models for the quantized latent representation of the current frame. While the autoregressive entropy models for CNN-based image compression suffer from serialized processing, our temporal-context-adaptive entropy model allows the context to be computed in parallel;
• Our DeepPVCnet jointly compresses motion and residual information based on a multi-scale structure for the current frame and its reference frames via learned feature transformation on the encoder side. This structure can effectively reduce the coupled redundancy between motion and residual information;
• Contrary to the deep neural network-based state-of-the-art (SOTA) methods [26], [46] that rely on a single reference frame for each prediction direction, our method improves prediction accuracy for the current frame by utilizing multiple reference frames for both uni- and bi-directional prediction modes.
This paper is organized as follows: Section II introduces the related work on deep neural network-based image/video compression, optical flow estimation and frame interpolation; in Section III, we introduce the details of our proposed deep video compression network, called DeepPVCnet; Section IV presents experimental results that show the effectiveness of our DeepPVCnet compared to the conventional video codecs and SOTA methods [4], [13], [25], [26], [46], [47]; finally, we conclude our work in Section V.

II. RELATED WORK
Both conventional image compression (such as JPEG, JPEG2000, and BPG) and video compression (AVC/H.264, HEVC, and VP9) methods have shown high compression performance. Recently, deep learning-based image and video compression methods have been actively studied. The key element that brings high coding efficiency in video coding is temporal prediction, which reduces temporal redundancy. Therefore, we also review deep learning-based optical flow estimation and frame interpolation networks, which are essential elements for predictive coding.
Deep Learning-Based Image Compression: Unlike conventional image compression based on transform coding, recent deep learning-based image compression methods often adopt auto-encoder structures that perform nonlinear transforms. First, there are several works on image compression using Long Short-Term Memory (LSTM)-based auto-encoders [16], [41], [42], where a progressive coding concept is used to encode the difference between the original image and the reconstructed image. In addition, there are studies on image compression using convolutional neural network (CNN)-based auto-encoder structures that model the feature maps of the bottleneck layers for entropy coding [5], [6], [21], [23], [27], [35], [40]. In [6], Ballé et al. introduced an input-adaptive entropy model that estimates the scales of the latent representations depending on the input. In [21], Lee et al. proposed a context-adaptive entropy model for image compression which uses two types of contexts: bit-consuming context and bit-free context. The models in [6], [21] outperformed conventional image codecs such as BPG. Our DeepPVCnet also adopts the auto-encoder structure used in [6] as the baseline structure, combined with our temporal-context-adaptive entropy model.
Deep Learning-Based Video Compression: There are two main directions of deep learning-based video compression research. The first is to replace existing components of the conventional video codecs with deep neural networks (DNN). For example, some works replace in-loop filters with deep neural networks [14], [17], [33], [49] or apply post-processing networks to enhance the resulting frames of the conventional video codecs [22], [48]. The intra/inter predictive coding modules have also been substituted with DNN modules for video coding [10], [33]. The second direction includes CNN-based auto-encoder structures without the coding tools of conventional video codecs involved. In [26], Lu et al. proposed the first end-to-end deep video compression network that jointly optimizes all the components for low-latency scenarios of video compression. Then, Lin et al. in [25] extended it by utilizing multiple reference frames for low-latency scenarios of video compression. In [36], Rippel et al. proposed a novel video compression framework with propagation of the learned state and ML-based spatial rate control. In [4], Agustsson et al. proposed a low-latency video compression model based on the scale-space flow for better handling of disocclusions and fast motion. However, the methods in [4], [25], [26], [36] were proposed for P-frame predictive coding, which only uses previously encoded frames to predict the current frame. In [46], Wu et al. proposed a ConvLSTM-based video compression method to improve coding efficiency. However, this method used conventional block motion estimation and compression methods, which degrade coding efficiency. In [9], Cheng et al. proposed a deep video compression method based on a frame interpolation network. Since this work utilizes a pre-trained frame interpolation network [32] between two frames that are far apart from each other, its prediction performance degrades significantly. The method in [11] uses both an interpolation- and residual-based auto-encoder for B-frame coding. It shows good performance in high bitrate ranges, but not in low or mid bitrate ranges. In [47], Yang et al. proposed a hierarchical learned video compression method with hierarchical quality layers and a recurrent enhancement network. The method in [13] incorporates a 3D auto-encoder that does not use the P-frame and B-frame coding concept. However, it requires a large amount of memory and high computational complexity. In contrast to the SOTA methods that use a single frame as reference for prediction, our method utilizes multiple frames as reference to predict the current frame. While these SOTA methods take either a P-frame or a B-frame coding structure for video compression, in this paper, we propose a mode-selective framework with uni- and bi-directional prediction modes where the best mode is selected in a rate-distortion optimization sense and is signaled to the decoder side.

Deep Learning-Based Optical Flow Estimation and Frame Interpolation: Optical flow estimation and frame interpolation can be used for predictive video coding. There have been many studies related to optical flow estimation using deep neural networks. In [34], Ranjan et al. introduced SpyNet, which uses a spatial pyramid network and warps the second image to the first image with the initial optical flow. Also, the PWC-Net [39] was introduced with a learnable feature pyramid structure that uses the estimated current optical flow to warp the CNN feature maps of the second image. This model outperformed all previous optical flow methods. Since these methods use feature pyramid structures, their optical flow estimation is more robust to large motion than that of other deep neural network-based optical flow methods. Therefore, we incorporate the pre-trained PWC-Net as the initial parameters of the optical flow estimation network in our DeepPVCnet. Recent CNN-based frame interpolation methods include convolution filtering-based [31], [32], phase-based [28], and optical flow-based [15], [30] approaches. Convolution filtering-based frame interpolation predicts frames between adjacent frames through convolution filtering operations without using optical flow. Phase-based frame interpolation uses a CNN to reduce the reconstruction loss in the phase domain rather than in the image domain. Finally, optical flow-based interpolation generates the frames between two frames through a CNN after warping with the optical flow between the two frames. In this paper, we adopt an optical flow-based prediction scheme [15] for our DeepPVCnet.

III. PROPOSED METHOD
Our proposed DeepPVCnet for both uni- and bi-directional predictive coding is illustrated in Fig. 1. As depicted in Fig. 1, our DeepPVCnet uses the neighboring frames as reference frames to compress the current frame.

FIGURE 1. Overall architecture of our proposed DeepPVCnet. The parameters of the convolutional layers are denoted as: number of filters × filter height × filter width / up- or down-scale factor, where ↑ and ↓ denote up- and down-scaling, respectively. s denotes the bilinear down-sampling factor for the reference frames X_R and the current frame x_0. The 'Motion compensation' process is performed by Eq. 2. AE and AD represent an arithmetic encoder and an arithmetic decoder, respectively. Also, Q represents a quantizer for the latent representation y_0.

The reference frames, denoted as X_R, for the current frame x_0 are composed of {x_{-2}, x_{-1}} and {x_{-2}, x_{-1}, x_{1}, x_{2}} for uni- and bi-directional coding, respectively. For X_R and x_0, bilinear down-sampling with a scale index s is performed for multi-scale motion estimation and compensation as follows:

X_{R,s} = Down(X_R, s),  x_0^s = Down(x_0, s),   (1)

where X_{R,s} and x_0^s denote the down-scaled reference frames and the down-scaled current frame with the scale index s, respectively, and Down(·, s) denotes a bilinear down-sampling process with a scale factor of 2^s (s = 0, 1, 2 in our experiments).
Each reference frame in X_{R,s} is concatenated with x_0^s to estimate the optical flow between the two frames using the fine-tuned PWC-Net [39]. The resulting optical flows F_{R,s} are composed of {F_{0→-2}^s, F_{0→-1}^s} and {F_{0→-2}^s, F_{0→-1}^s, F_{0→1}^s, F_{0→2}^s} for uni- and bi-directional coding, respectively. Then, the prediction frames P_{R,s} are calculated by a backward warping function w(·, ·) [15] applied to X_{R,s} with F_{R,s}:

p_{0←k}^s = w(x_k^s, F_{0→k}^s),   (2)

so that the resulting prediction frames P_{R,s} are composed of {p_{0←-2}^s, p_{0←-1}^s} and {p_{0←-2}^s, p_{0←-1}^s, p_{0←1}^s, p_{0←2}^s} for uni- and bi-directional coding, respectively. The residual frames R_{R,s} can then be expressed as

r_{0←k}^s = x_0^s - p_{0←k}^s,

where k denotes the relative index of the reference frame with respect to the current frame. The joint information of F_{R,0} and R_{R,0} for scale 0 is mapped to a latent representation y_0 through the encoder network g_a with five feature transformation (FT) layers. Similarly, the joint information of F_{R,s} and R_{R,s} for scales s = 1, 2 is concatenated to the feature maps of the same sizes in the encoder network g_a, as depicted in Fig. 1.
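To make the multi-scale prediction pipeline above concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of how the down-sampling of Eq. 1, the backward warping of Eq. 2 and the residual computation could be wired together; flow_net is a hypothetical stand-in for the fine-tuned PWC-Net, and the concatenation layout is only illustrative.

```python
import torch
import torch.nn.functional as F

def downsample(x, s):
    # Bilinear down-sampling by a factor of 2**s (Eq. 1).
    return x if s == 0 else F.interpolate(
        x, scale_factor=1 / 2 ** s, mode='bilinear', align_corners=False)

def backward_warp(ref, flow):
    # Warp a reference frame toward the current frame with a dense flow field
    # (backward warping w(., .)); ref: (B,3,H,W), flow: (B,2,H,W).
    b, _, h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=ref.device),
                            torch.arange(w, device=ref.device), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float()            # (2,H,W) pixel grid
    coords = grid.unsqueeze(0) + flow                       # absolute sampling positions
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0     # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)      # (B,H,W,2)
    return F.grid_sample(ref, grid_n, mode='bilinear',
                         padding_mode='border', align_corners=True)

def multiscale_joint_info(x0, refs, flow_net, scales=(0, 1, 2)):
    # For each scale s: estimate a flow to every reference frame, warp the
    # reference (Eq. 2) and compute the residual x0_s - p_s; scale 0 feeds the
    # encoder input and coarser scales are injected into intermediate layers.
    joint = {}
    for s in scales:
        x0_s = downsample(x0, s)
        flows, residuals = [], []
        for ref in refs:
            ref_s = downsample(ref, s)
            f = flow_net(x0_s, ref_s)       # stand-in for the fine-tuned PWC-Net
            p = backward_warp(ref_s, f)
            flows.append(f)
            residuals.append(x0_s - p)
        joint[s] = torch.cat(flows + residuals, dim=1)
    return joint
```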
After the quantization step, we obtain the quantized latent representation ŷ_0. Then, the reconstructed optical flows F̂_R, the reconstructed residual frame r̂_0 and the synthesis coefficients α̂_i are estimated by the decoder network g_s with the entropy model of ŷ_0. The reconstructed frame x̂_0 is given by

x̂_0 = Σ_{i ∈ N_R} α̂_i · w(x_i, F̂_{0→i}) + r̂_0,   (3)

where the set N_R of reference frame indices is composed of {-2, -1} and {-2, -1, 1, 2} for uni- and bi-directional predictive coding, respectively. Finally, the Enhancement Net outputs the enhanced frame from x̂_0. The details of the proposed network are described in Sections III.A-III.E.
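As a rough illustration of the synthesis step of Eq. 3, and under the assumption that the decoded synthesis coefficients α̂_i are per-pixel weight maps, the blending could be sketched as follows:

```python
import torch

def synthesize_frame(warped_refs, alphas, residual):
    # Blend the motion-compensated reference frames with the decoded synthesis
    # coefficients and add the decoded residual (Eq. 3).
    # warped_refs: list of (B,3,H,W) tensors, alphas: (B,len(refs),H,W)
    # per-pixel weights (assumed), residual: (B,3,H,W).
    blended = sum(a.unsqueeze(1) * p
                  for a, p in zip(alphas.unbind(dim=1), warped_refs))
    return blended + residual
```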

A. MULTIPLE REFERENCE FRAMES
In general, conventional video codecs like AVC/H.264 and HEVC/H.265 compress the current frame using multiple reference frames for each prediction direction. The usage of multiple reference frames makes it possible to effectively deal with occlusion problems, resulting in accurate prediction for the current frame. In video compression, quantization errors are propagated as subsequent frames are compressed. By using multiple reference frames, such quantization error propagation can be alleviated in the prediction of the current frame, thus increasing the prediction accuracy and coding efficiency. As a compromise with the complexity of incorporating multiple reference frames, our DeepPVCnet utilizes two reference frames for uni-directional predictive coding (P2 in Fig. 2) and four reference frames for bi-directional predictive coding (B4 in Fig. 2), two for forward and two for backward prediction, in contrast to the state-of-the-art deep learning-based video compression methods [9], [11], [26], [46]. The effectiveness of using multiple reference frames is shown in Fig. 2.

FIGURE 2. Compression performance comparison between a single reference frame and multiple reference frames. The bitrates of the P-frame coding models are about 0.37 bpp and those of the B-frame coding models are about 0.59 bpp for the HEVC Class B dataset [38].

B. COMPRESSING JOINT INFORMATION WITH FEATURE TRANSFORMATION (FT) LAYERS
We incorporate five feature transformation (FT) layers into the encoder side of the CNN-based auto-encoder structure g_a, which jointly compresses the multi-scale motion information and residuals. To cope with various amounts of motion in different video sequences, multi-scale motion estimation and compensation are performed at the encoder side. Then, the generated multi-scale motions and residuals are concatenated to the output feature maps of each convolution layer, which are then fed as input into the following FT layer of the encoder network. By doing so, the multi-scale joint information of motions and residuals can be effectively fused for better compression. The recent methods [9], [11], [26], [46] in deep video compression are designed to compress single-scale optical flow and residual separately, whereas our proposed DeepPVCnet jointly compresses multi-scale motion information and residual in a compact form to improve coding efficiency, under the assumption that redundancy between motion information and residual exists. Also, the FT layers alter the intermediate feature outputs through a learned transformation guided by the multiple reference frames, which helps reduce the coupled redundancy between the multi-scale motion information and residual. The details of the FT layers are depicted in Fig. 3-(a). As shown in Fig. 3-(a), the FT layers perform an affine transform on each element of the input feature map. The parameters of the affine transform are learned with respect to the reference frames via two convolutional layers.
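Conceptually, an FT layer behaves like a feature-wise affine (FiLM-style) modulation whose parameters are predicted from the reference frames. A minimal sketch is given below; the kernel sizes and the way the reference-frame guidance is resized to the feature resolution are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class FTLayer(nn.Module):
    # Feature transformation layer: an element-wise affine transform of the
    # incoming feature map whose scale and shift are predicted from the
    # reference frames through two convolutional layers (cf. Fig. 3-(a)).
    def __init__(self, guide_channels, feat_channels):
        super().__init__()
        self.affine = nn.Sequential(
            nn.Conv2d(guide_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 2 * feat_channels, 3, padding=1))

    def forward(self, feat, guide):
        # 'guide' is assumed to be the reference frames (or features derived
        # from them) resized to the spatial resolution of 'feat'.
        scale, shift = self.affine(guide).chunk(2, dim=1)
        return feat * scale + shift
```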

C. TEMPORAL-CONTEXT-ADAPTIVE ENTROPY MODEL
We propose a temporal-context-adaptive entropy model for the quantized latent representation ŷ_0. Our proposed entropy model adopts the basic structure of [6] with the hyperprior z_0 and the hyper encoder-decoder network pair (h_a, h_s), as shown in Fig. 1. The output feature map of the hyper encoder-decoder network is the context information c_c of the current frame x_0. Since there exists a contextual similarity between x_0 and X_R, we propose a Context-Net to estimate the mean µ and standard deviation σ of a Gaussian model for ŷ_0 as follows:

p_{ŷ_0|(ẑ_0, X_R)}(ŷ_0 | (ẑ_0, X_R)) = Π_i ( N(µ_i, σ_i^2) * U(-1/2, 1/2) )(ŷ_{0,i}),

where c_c is the context information of x_0 obtained from the hyper network and c_t is the temporal context information extracted from X_R by the Context-Net; the Context-Net then concatenates c_c and c_t to obtain µ and σ with the same spatial size as ŷ_0. The Context-Net is illustrated in Fig. 3-(b). As shown in Fig. 3-(b), c_t is generated from the reference frames via four convolutional layers. Then, µ and σ are estimated from c_t and c_c via three convolutional layers. More details of the entropy coding model are presented in Appendix B.
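A hedged sketch of how the Context-Net could be realized is shown below; the channel widths, kernel sizes and strides are assumptions, and only the overall structure (four convolutional layers for c_t, three for estimating µ and σ from the concatenated contexts) follows the description above. Because c_t depends only on already-decoded reference frames, µ and σ for all latent positions can be computed in a single parallel pass.

```python
import torch
import torch.nn as nn

class ContextNet(nn.Module):
    # Temporal-context-adaptive entropy parameter estimation (cf. Fig. 3-(b)):
    # a temporal context c_t is extracted from the stacked reference frames,
    # concatenated with the hyperprior context c_c, and mapped to the mean and
    # scale of the Gaussian model for the quantized latents.
    def __init__(self, ref_channels, ctx_channels, latent_channels):
        super().__init__()
        # c_t from X_R via four convolutions; strides chosen (assumption) so
        # that c_t matches the spatial size of the latent representation.
        self.temporal = nn.Sequential(
            nn.Conv2d(ref_channels, ctx_channels, 5, stride=4, padding=2), nn.ReLU(True),
            nn.Conv2d(ctx_channels, ctx_channels, 5, stride=2, padding=2), nn.ReLU(True),
            nn.Conv2d(ctx_channels, ctx_channels, 5, stride=2, padding=2), nn.ReLU(True),
            nn.Conv2d(ctx_channels, ctx_channels, 3, stride=1, padding=1))
        # (mu, sigma) from [c_c, c_t] via three convolutions.
        self.params = nn.Sequential(
            nn.Conv2d(2 * ctx_channels, ctx_channels, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ctx_channels, ctx_channels, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ctx_channels, 2 * latent_channels, 3, padding=1))

    def forward(self, refs, c_c):
        c_t = self.temporal(refs)
        mu, log_sigma = self.params(torch.cat([c_c, c_t], dim=1)).chunk(2, dim=1)
        return mu, torch.exp(log_sigma)   # keep sigma positive
```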

D. MODE-SELECTIVE FRAMEWORK
The SOTA deep learning-based video compression methods [4], [9], [11], [25], [26], [36], [46] share the limitation that all frames are compressed in either a P-frame-only or a B-frame-only coding structure. Fig. 4 depicts a GOP (Group of Pictures) structure for our mode-selective framework with uni- or bi-directional predictions, in a similar way to the traditional video codecs. In our mode-selective framework, each frame can be encoded with an intra-mode, a uni-directional prediction mode or a bi-directional prediction mode. The uni-directional prediction mode has two sub-modes: M_uni^f for forward prediction and M_uni^b for backward prediction, and the bi-directional prediction mode is denoted as M_bi. For the GOP structure in Fig. 4, I_1 and I_13 are encoded in the intra-mode using a pre-trained image compression network [21], while all other frames between I_1 and I_13 are encoded in either M_uni^f, M_uni^b or M_bi. It should be noted in Fig. 4 that the frame I_4 is encoded by referencing I_1, which is encoded a priori. Next, I_7 is compressed, followed by I_10. Depending on the availability of neighboring encoded frames, one or two encoded frames are referenced to encode each frame between I_1 and I_13, as shown in Fig. 4. I_3 and I_6 may use the same reference frames as I_2 and I_5, respectively. I_8 and I_9 have the same referencing structures as I_6 and I_5, respectively. Also, I_11 and I_12 have the same referencing structures as I_3 and I_2, respectively. The details of the coding order and the selection rule of the reference frames for our proposed method in the test phase are presented in Table 1. Note that, for frames that do not have multiple reference frames available, duplicated reference frames are used as the inputs of DeepPVCnet in our experiments.
The selected mode information is sent to the decoder side as two-bit data, which is a negligible bit amount. Based on this mode-selective framework, we train each DeepPVCnet for the uni- and bi-directional prediction modes, and the best mode for each frame I_n is selected in the rate-distortion sense as

m* = argmin_m ( R_{n,m} + λ · d(I_n, Î_{n,m}) ),   (5)

where R_{n,m} and Î_{n,m} denote the bitrate and the reconstructed frame of I_n with mode m, respectively.
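The mode decision of Eq. 5 can be summarized by the sketch below, which assumes each candidate mode exposes a routine returning the spent bits and the reconstruction; the distortion measure and λ are the same as in the training loss, and the chosen two-bit mode index is what would be signaled to the decoder.

```python
import math

MODES = ('uni_forward', 'uni_backward', 'bi')   # hypothetical mode names

def select_mode(frame, candidates, distortion_fn, lam):
    # candidates: dict mapping a mode name to a callable returning (bits, recon).
    best_mode, best_cost, best_result = None, math.inf, None
    for mode in MODES:
        if mode not in candidates:               # e.g. no future reference available
            continue
        bits, recon = candidates[mode](frame)
        cost = bits + lam * distortion_fn(frame, recon)   # R + lambda * D (Eq. 5)
        if cost < best_cost:
            best_mode, best_cost, best_result = mode, cost, (bits, recon)
    return best_mode, best_result
```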

E. ENHANCEMENT NET
To further improve the quality of the reconstructed frames, the Enhancement Net shown in Fig. 1 is incorporated into the decoder side of our DeepPVCnet to play the role of an in-loop filter as in the traditional video codecs. We utilize the residual dense network (RDN) [50], which consists of five residual dense blocks (RDBs) with three convolutional layers per block, for our Enhancement Net; it is described in detail in Appendix C.

IV. EXPERIMENTS
A. EXPERIMENTAL CONDITIONS
To show the effectiveness of our DeepPVCnet, extensive experiments are carried out to measure coding efficiency, and our method is compared with other video coding methods. For intra coding, we use the pre-trained CNN-based image compression model in [21]. For uni- and bi-directional predictive coding, we train our DeepPVCnet models for different bitrate ranges and test the trained models for each bitrate range. Note that we set the GOP size G to 12 for all experiments.

Datasets: We train the DeepPVCnet with the UGC dataset [1]. For pre-processing, we excluded HDR videos, vertical videos, interlaced videos and videos smaller than 720p from the UGC dataset. The number of frames used for training is about 466K. For evaluation, we test the DeepPVCnet on raw video datasets, namely the Ultra Video Group (UVG) dataset [2] and the HEVC Standard Test Sequences (Class B, C, D and E) [38]. The UVG dataset contains seven videos of size 1920 × 1080. The videos in the HEVC dataset have different sizes depending on their class types.
Implementation: The proposed DeepPVCnet is trained in an end-to-end manner, based on the rate-distortion loss L:

L = E[-log_2 p_{ŷ_0|(ẑ_0, X_R)}(ŷ_0 | (ẑ_0, X_R))] + E[-log_2 p_{ẑ_0}(ẑ_0)] + λ · d(x_0, x̂_0),   (6)

where λ controls the trade-off between the rate and distortion terms, and d is the distortion measure, e.g., (1 − MS-SSIM). In Eq. 6, the first term indicates the conditional entropy of ŷ_0 given ẑ_0 and X_R, and the second term is the entropy of ẑ_0. For several bitrate ranges, we train the DeepPVCnet separately for different values of λ, where the number of channels of the convolutional layers is N, except for the convolutional layer that outputs the latent representation, which has M filters. We set N = 128 and M = 256 for the three lower bitrates and N = 192 and M = 384 for the two higher bitrates. Our DeepPVCnet is trained from scratch with the fixed PWC-Net [39] for 1M iterations using ADAM [19] with an initial learning rate of 0.0001. Then, we fine-tune the PWC-Net together with the other components of our DeepPVCnet for an additional 0.5M iterations. In addition, we used a batch size of 8 and patches of size 256 × 256 randomly cropped from the 466K training frames extracted from the UGC video dataset.
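A minimal sketch of the rate-distortion loss of Eq. 6, assuming the entropy models return per-element likelihoods as in Appendix B, could look as follows:

```python
import torch

def rd_loss(y_likelihoods, z_likelihoods, x0, x0_hat, distortion_fn, lam, num_pixels):
    # Rate-distortion training loss (Eq. 6): estimated bits for the latent y_0
    # (conditioned on the hyperprior z_0 and the reference frames) plus bits
    # for z_0, traded off against the distortion d(x_0, x_0_hat) with weight lambda.
    rate_y = -torch.log2(y_likelihoods).sum() / num_pixels   # bpp of y_0
    rate_z = -torch.log2(z_likelihoods).sum() / num_pixels   # bpp of z_0
    distortion = distortion_fn(x0, x0_hat)                   # e.g. 1 - MS-SSIM
    return rate_y + rate_z + lam * distortion
```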
Evaluation: We measure both distortion and bitrate. The multi-scale structural similarity index (MS-SSIM) [44] in the RGB color space, which is known to be a better metric for subjective image quality than PSNR, is used to measure distortion in our experiments. We use bits per pixel (bpp) to measure the bitrate.
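For reference, a small sketch of the per-frame measurement, assuming the third-party pytorch_msssim package and frames normalized to [0, 1]:

```python
import torch
from pytorch_msssim import ms_ssim  # assumed third-party package

def evaluate_frame(x0, x0_rec, total_bits):
    # Per-frame quality/rate measurement: MS-SSIM on RGB frames in [0, 1]
    # and bits per pixel (bpp) from the number of bits spent on the frame.
    _, _, h, w = x0.shape
    quality = ms_ssim(x0, x0_rec, data_range=1.0).item()
    bpp = total_bits / float(h * w)
    return quality, bpp
```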

B. EXPERIMENTAL RESULTS
The DeepPVCnet is compared with the conventional video codecs AVC/H.264 and HEVC/H.265, as well as several deep learning-based video compression methods [4], [13], [25], [26], [46], [47]. For fair comparison, the GOP size of the conventional video codecs is fixed to 12. We used the ffmpeg coding tool [38] and x265 [3] for H.264 and H.265, respectively. We use several settings of the conventional video codecs, the details of which are described in Appendix A. Fig. 5 shows the rate-distortion (R-D) curves produced by our DeepPVCnet, H.264, H.265, Wu's method [46], DVC [26], Habibian's method [13], M-LVC [25], Yang's method [47] and Agustsson's method [4] for the UVG and HEVC datasets (Class B and E). It can be seen in Fig. 5 that our DeepPVCnet outperforms all the methods over most of the bitrate ranges, while the other SOTA methods in [13], [25], [26], [46], [47] report results only over limited, low bitrate ranges. In particular, our method shows significantly better compression performance than the other methods in the medium and high bitrate ranges. More experimental results for the HEVC datasets (Class C and D) and analysis are provided in Appendix D.

FIGURE 5. MS-SSIM performance comparison of our DeepPVCnet, H.264 [45], H.265 [38], and CNN-based SOTA methods [4], [13], [25], [26], [46], [47] for the UVG dataset and the HEVC Class B and E datasets.

C. ABLATION STUDY
For our DeepPVCnet, an ablation study is performed on the key components: the multi-scale motion estimation and compensation, the fine-tuned PWC-Net, the multiple reference frames, the temporal-context-adaptive entropy model based on the multi-frame hypothesis, the mode-selective framework, the FT layers and the Enhancement Net. In order to demonstrate the contribution of each component, we performed the experiments by excluding the key components one by one from the entire structure of the DeepPVCnet. Fig. 7 shows the resulting MS-SSIM performances of the ablation study.

Multi-Scale Motion Estimation and Compensation: In order to effectively cope with the various motions of different video sequences, we perform motion estimation and compensation based on a multi-scale structure. As can be seen in Fig. 7, the multi-scale motion estimation and compensation improves the coding gain compared to the single-scale case.

Fine-Tuned PWC-Net: In [39], the pre-trained PWC-Net was trained only to obtain highly accurate optical flows between frames. However, for the video compression problem, the motion estimation network must be trained not only to increase the accuracy of motion estimation, but also to compress the generated motion with high coding efficiency. Therefore, we fine-tuned the PWC-Net to be optimized for video compression in the rate-distortion optimization sense. As shown in Fig. 7, our DeepPVCnet with the fine-tuned PWC-Net outperforms that with the pre-trained PWC-Net over the whole bitrate range.
Multiple Reference Frames: As shown in Figs. 2 and 7, the multiple reference frames contribute to high coding efficiency. This gain is achieved by effectively dealing with object occlusions, thus reducing the propagation error. In particular, the multi-frame hypothesis shows better performance in the high bitrate range because our DeepPVCnet can fully utilize neighboring information from multiple reference frames in removing temporal redundancy.

FIGURE 7. Ablation study on the effectiveness of (a) the multi-scale motion estimation and compensation, (b) the fine-tuned PWC-Net, (c) multiple reference frames, (d) the mode-selective framework, (e) the feature transformation layers, (f) the temporal-context-adaptive entropy model, and (g) the Enhancement Net for the HEVC Class B dataset. These components are excluded one by one from the DeepPVCnet.

Mode-Selective Framework: It improves the coding gain especially in the low and mid bitrate ranges, as depicted in Fig. 7. Poor prediction in a low bitrate range can be compensated by selectively performing the prediction with the best prediction mode chosen by Eq. 5 in our proposed mode-selective framework. As shown in Fig. 7, the proposed mode-selective framework is a key component for achieving high coding efficiency, along with the multiple reference frames.
Temporal-Context-Adaptive Entropy Model: As shown in Fig. 7, our DeepPVCnet with the temporal-context-adaptive entropy model achieves improved coding efficiency by reducing the redundancy of the latent representation using the temporal context information of the reference frames.
In addition, the proposed entropy model has a structural advantage that it can be computed in parallel in contrast to the autoregressive-based video compression methods [11], [13].
Other Components: The FT layers and the Enhancement Net have few parameters compared to the entire network. Nevertheless, a slightly improved coding gain is achieved. In particular, the FT layers allow the encoder to compress the joint information effectively.

As shown in Table 3, the total numbers of parameters of our DeepPVCnet are about 25.5M and 39.6M for the low and high bitrate models, respectively. For testing, the runtime of our DeepPVCnet was measured on a platform with an Intel i9-9900X CPU, 128GB RAM and a single Titan RTX GPU. For sequences of sizes 416 × 240, 832 × 480, 1280 × 720 and 1920 × 1080, the encoding and decoding speeds of our DeepPVCnet are (5.9 fps, 44.2 fps), (3.9 fps, 15.0 fps), (2.2 fps, 6.7 fps) and (1.1 fps, 3.2 fps), respectively. In particular, the decoding speed is considerably faster than that of methods based on autoregressive entropy coding models [11], [13], because parallel processing is not possible on the decoder side for those methods [11], [13].

E. VISUAL COMPARISONS
In this section, we visualize the intermediate results produced by our DeepPVCnet. Then, we visualize the optical flows from the pre-trained and fine-tuned PWC-Net. Also, some reconstructed frames by H.264, H.265 and our DeepPVCnet are presented for subjective comparison.
Visualization of Feature Maps and Reconstructed Frames: Fig. 6 visualizes the optical flow maps, the output residual frame, a reconstructed frame and an enhanced frame for an input frame of the Beauty sequence, obtained via the pipeline of our DeepPVCnet. The optical flow F_{0→-2} in Fig. 6 is an input to the encoder network of the DeepPVCnet, which is obtained from the PWC-Net. F̂_{0→-2} is the output optical flow of the decoder network, which is used to synthesize the current input frame for reconstruction. The output residual frame r̂_0 is the difference between the output x̂_0 and the blended output (x̂_0 − r̂_0 in Eq. 3) of the warped frames, as shown in Fig. 1. It can be seen in Fig. 6 that the optical flows F_{0→-2} and F̂_{0→-2} look significantly different because F̂_{0→-2} is generated to improve compression efficiency; F̂_{0→-2} thus includes more texture parts than F_{0→-2}. Also, the output residual frame r̂_0 contains texture parts, which makes it possible to reconstruct the areas that are difficult to recover by optical flows alone. Finally, the enhanced frame is generated from the reconstructed frame x̂_0 by the Enhancement Net and is visually much closer to the input frame x_0.

Visualization of Pre-Trained and Fine-Tuned Optical Flows: Fig. 8 presents a visual comparison of the motion information from the pre-trained PWC-Net and the fine-tuned PWC-Net. As shown in Fig. 8, the motion information from the pre-trained PWC-Net contains large fields of smooth motion since the pre-trained PWC-Net does not consider compression efficiency and only focuses on the motion between frames. Also, the pre-trained PWC-Net is trained with a smoothness constraint on the motion. However, the motion information from the fine-tuned PWC-Net contains both smooth and textured motion fields since it extracts motion information in a rate-distortion sense. Therefore, the optical flows with texture parts generated by the fine-tuned PWC-Net are more suitable for video compression than the smooth optical flows generated by the pre-trained PWC-Net.

FIGURE 8. Visual comparison of optical flows from the pre-trained PWC-Net and the fine-tuned PWC-Net. The pre-trained PWC-Net generates largely smooth optical flow areas because it is trained only to capture motion between frames accurately, not to make the frame easy to compress. In contrast, since the fine-tuned PWC-Net is trained so that the frame is well compressed, it also generates optical flows with texture areas.
Subjective Visual Comparisons: Fig. 9 shows some cropped regions of decoded frames of the HoneyBee and YachtRide sequences by H.264, H.265 and our method for visual comparison. Our method yields decoded frames with higher contrast and fewer artifacts than H.264 and H.265. The decoded results by H.264 and H.265 show that the wing and leg of the honey bee are poorly reconstructed, but our DeepPVCnet reconstructs them with higher contrast and fewer artifacts. Similar results in a low bitrate range are observed for the BQTerrace and Cactus sequences, as shown in Fig. 10. Similarly, Fig. 11 shows some cropped regions of decoded frames of the Bosphorus and Kimono sequences by H.264, H.265 and our method. As shown in Fig. 11, while H.264 and H.265 produce decoded regions with blocking artifacts in a low bitrate range, our method yields decoded regions of higher fidelity without such artifacts.

V. CONCLUSION
We have proposed an end-to-end deep predictive video compression network, called DeepPVCnet, based on a multi-frame hypothesis with a multi-scale structure and a temporal-context-adaptive entropy model. Our DeepPVCnet incorporates a mode-selective framework with uni- and bi-directional predictive coding in a rate-distortion optimization sense by jointly compressing optical flows and residual data that are generated from the multi-scale structure via the FT layers on the encoder side. In addition, our DeepPVCnet with the temporal-context-adaptive entropy model has a much faster decoding speed because its entropy modeling can be performed in parallel, unlike the recent video compression methods [11], [13] that use autoregressive entropy coding models. Based on the combination of these components, the DeepPVCnet shows better compression performance than the existing standard video compression codecs (AVC/H.264 and HEVC/H.265) and recent SOTA methods in terms of MS-SSIM. In future work, our DeepPVCnet will be extended to learn a fully automatic selection of the best prediction modes during training.

APPENDIX B THE IMPLEMENTATION OF OUR PROPOSED ENTROPY CODING MODEL
For the details of the implementation of our proposed entropy coding model, we follow the same concept and notation as in the CNN-based image compression methods [6], [21]. In the main paper, we provided the training loss L as the rate-distortion optimization problem for video compression.
Since the quantization of the latent representation is discrete, we substitute additive uniform noise for the quantization process during training. Then, the noisy latent representations ỹ_0 and z̃_0 are used instead of the quantized latent representations ŷ_0 and ẑ_0, respectively, in the training loss L as follows:

L ≈ E_{x_0 ∼ p_{x_0}} E_{ỹ_0, z̃_0 ∼ q} [ −log_2 p_{ỹ_0|(z̃_0, X_R)}(ỹ_0 | (z̃_0, X_R)) − log_2 p_{z̃_0}(z̃_0) + λ · d(x_0, x̂_0) ],   (7)

where x_0, x̂_0 and X_R denote the current frame to be encoded, the reconstructed frame and the reference frames for x_0, respectively. The joint factorized posterior with the additive uniform noise for the quantization process, as in [6], [21], can be expressed as follows:

q(ỹ_0, z̃_0 | x_0, φ_g, φ_h) = Π_i U(ỹ_{0,i} | y_{0,i} − 1/2, y_{0,i} + 1/2) · Π_i U(z̃_{0,i} | z_{0,i} − 1/2, z_{0,i} + 1/2), with y_0 = g_a(x_0; φ_g), z_0 = h_a(y_0; φ_h),

where U, φ_g and φ_h denote a uniform distribution and the parameters of g_a and h_a, respectively. Our proposed entropy coding model approximates the required bits for ŷ_0 and ẑ_0 as in Eq. 7. The entropy coding model for ŷ_0 is based on a Gaussian model with mean µ_i and standard deviation σ_i. Our proposed Context-Net C and the hyper encoder-decoder network pair (h_a, h_s) with the multiple reference frames X_R estimate the values of µ_i and σ_i: the Context-Net C generates the temporal context information c_t from X_R, the hyper encoder-decoder network generates the context information c_c from y_0, and the Context-Net then concatenates c_t and c_c to estimate the values of µ_i and σ_i. This process can be expressed as

p_{ỹ_0|(z̃_0, X_R)}(ỹ_0 | (z̃_0, X_R)) = Π_i ( N(µ_i, σ_i^2) * U(−1/2, 1/2) )(ỹ_{0,i}), with (µ, σ) = C(c_c, c_t; θ_c),

where θ_c and θ_h denote the parameters of the Context-Net C and the hyper decoder network h_s, respectively. Note that our proposed entropy coding model can estimate µ and σ in parallel during the decoding process since the entropy coding model with the multiple reference frames is not autoregressive. We utilize the same form of entropy coding model for ẑ_0, which follows a zero-mean Gaussian model with standard deviation σ as in [6]. Since ẑ_0 has little effect on the total bitrate of the current frame coding, we use a simpler entropy coding model for ẑ_0 than that of ŷ_0:

p_{z̃_0}(z̃_0) = Π_i ( N(0, σ_i^2) * U(−1/2, 1/2) )(z̃_{0,i}).
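The following sketch illustrates how the additive-noise quantization proxy and the Gaussian-convolved-with-uniform likelihood described above could be implemented; the clamping constants are arbitrary assumptions, and the estimated rate is the negative log2-likelihood summed over the latents, as in Eq. 7.

```python
import torch
from torch.distributions import Normal

def noisy_quantize(y, training=True):
    # Additive-uniform-noise proxy for quantization during training,
    # hard rounding at test time.
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)

def gaussian_likelihood(y_tilde, mu, sigma):
    # Likelihood of each latent under N(mu, sigma^2) convolved with U(-1/2, 1/2):
    # the probability mass of a unit-width bin centered at y_tilde.
    sigma = sigma.clamp(min=1e-6)
    dist = Normal(mu, sigma)
    return (dist.cdf(y_tilde + 0.5) - dist.cdf(y_tilde - 0.5)).clamp(min=1e-9)

def rate_bits(likelihoods):
    # Estimated rate in bits: -sum(log2 likelihood), as used in the loss (Eq. 7).
    return -torch.log2(likelihoods).sum()
```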

APPENDIX C THE ARCHITECTURE OF ENHANCEMENT NET
In the main paper, we described the overall structure of our DeepPVCnet, which consists of an encoder-decoder network pair (g_a, g_s) with the feature transformation layers, a hyper encoder-decoder network pair (h_a, h_s), the pre-trained PWC-Net [39], a Context-Net and an Enhancement Net. The Enhancement Net is incorporated into the decoder side of our DeepPVCnet to enhance the image quality of the reconstructed frame x̂_0. Fig. 12 shows the details of the Enhancement Net, which is based on the residual dense network (RDN) [50]. As depicted in Fig. 12, the Enhancement Net consists of five residual dense blocks (RDBs) with three convolutional layers per block.
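A compact sketch of such an Enhancement Net is given below; the channel and growth widths are assumptions, while the five-RDB, three-convolution structure follows the description above.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    # Residual dense block: three densely connected convolutions,
    # 1x1 fusion and a local residual connection.
    def __init__(self, channels, growth):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(3)])
        self.fuse = nn.Conv2d(channels + 3 * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))

class EnhancementNet(nn.Module):
    # RDN-style Enhancement Net sketch with five RDBs that refines the
    # reconstructed frame via a global residual connection.
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[RDB(channels, growth) for _ in range(5)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x_hat):
        # The network predicts a correction that is added back to x_hat.
        return x_hat + self.tail(self.blocks(self.head(x_hat)))
```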

APPENDIX D THE EXPERIMENTAL RESULTS FOR HEVC CLASS C AND D IN MS-SSIM
In the main paper, we showed the rate-distortion (R-D) curves of our DeepPVCnet, H.264, H.265, Wu's method [46], DVC [26] and Habibian's method [13] for the UVG [2] and HEVC [38] datasets (Class B and E), which consist of high-resolution sequences. Additionally, Fig. 13 shows the R-D curves in terms of MS-SSIM for the HEVC datasets (Class C and D), which consist of low-resolution sequences. Our DeepPVCnet outperforms H.264, H.265 and DVC over most bitrate ranges in terms of MS-SSIM. Note that, among the recent deep video compression methods, only DVC [26] provides experimental results for these HEVC datasets.