A Video Frame Extrapolation Scheme Using Deep Learning-Based Uni-Directional Flow Estimation and Pixel Warping

This paper investigates video frame extrapolation, which predicts future frames from current and past frames. Although there have been many studies on video frame extrapolation in recent years, most of them suffer from unsatisfactory image quality in the predicted frames, such as severe blurring, because it is difficult to predict the movement of pixels in future frames for multi-modal video, especially with fast-changing frames. An additional process such as frame alignment or recurrent prediction can improve the quality of the predicted frames, but it hinders real-time extrapolation. Motivated by the significant progress in video frame interpolation using deep learning-based flow estimation, a simplified video frame extrapolation scheme using deep learning-based uni-directional flow estimation is proposed to reduce the processing time compared to conventional video frame extrapolation schemes without compromising the image quality of the predicted frames. In the proposed scheme, the uni-directional flow is first estimated from the current and past frames through a flow network consisting of four flow blocks, and the current frame is then forward-warped through the estimated flow to predict a future frame. The proposed flow network is trained and evaluated using the Vimeo-90K triplet dataset. The performance of the proposed scheme is analyzed using the trained flow network in terms of prediction time as well as the similarity between predicted and ground truth frames, measured by the structural similarity index measure and the mean absolute error of pixels, and is compared to that of state-of-the-art schemes such as the Iterative and cycleGAN schemes. Extensive experiments show that the proposed scheme improves prediction quality by 2.1% and reduces prediction time by 99.7% compared to the state-of-the-art scheme.


I. INTRODUCTION
Recently, image prediction such as video frame interpolation (VFI) and video frame extrapolation (VFE) has attracted much attention due to its promising potential for useful applications in various image-related fields, and this trend has accelerated rapidly due to advances in computer hardware and deep learning [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Various VFI schemes estimating intermediate frames between two consecutive frames have been proposed [1], [2], [3], [4]. VFI can enhance the quality of video content by increasing the number of frames per second without consuming extra communication network resources, or save the network resources needed for Internet-based real-time streaming without losing streaming quality.
The problem of synthesizing new video frames within an existing video was investigated [1]. A context-aware synthesis scheme that warps not only the input frames but also their pixel-wise contextual information to interpolate a high-quality intermediate frame was proposed [2]. These pioneering studies were enhanced by real-time intermediate flow estimation (RIFE) [3], [4]. RIFE can estimate multiple intermediate frames by estimating optical flows between two different frames with a neural network, significantly reducing the prediction time compared to previous studies. In addition, a privileged distillation scheme was introduced to improve the stability of training and the overall performance. The computational speed of RIFE is fast enough for real-time video streaming while enhancing the quality of the estimated frames.
To enhance the unsatisfactory quality of predicted frames, a reference frame alignment scheme using deep neural networks was proposed [9]. Several studies have used generative adversarial networks (GANs) for VFE [10], [11], [12]. Their essential idea is to train a generator that can predict future frames from past frames. These studies have demonstrated the potential of GANs as a new approach to VFE.
Inspired by the impressive results obtained with flow estimation-based VFI, novel VFE schemes have been proposed to predict future frames based on future motion estimation [13] or on optimization that reuses a pre-trained differentiable VFI model without training [14]. These schemes outperform existing schemes regarding the image quality of predicted frames. However, estimating future motions through bi-directional optical flows requires additional procedures [13] or does not support real-time prediction due to heavy computational complexity [14].
This paper proposes a simplified VFE algorithm using deep learning-based uni-directional flow estimation and pixel warping (EUFPW). The EUFPW scheme can directly predict future frames by forward-warping the last single frame with a uni-directional flow estimated from the two past frames. Thus, real-time prediction is supported with a prediction quality comparable to VFI. Extensive experiments show that the proposed scheme enhances prediction quality by 2.1% and reduces the prediction time by 99.7% over the state-of-the-art scheme [14].
The remainder of this paper is organized as follows. Section II discusses previous works related to VFE, and a new methodology using a deep learning model to predict future video frames is proposed in Section III. Numerical results of the proposed scheme are shown and compared with the state-of-the-art schemes in Section IV. Finally, Section V concludes this paper.

II. RELATED WORKS
ConvLSTM has been widely used as a pioneering approach to predict future video frames [5], [6], [7]. Villegas et al. proposed a motion and content decomposition method to predict future frames based on an encoder-decoder convolutional neural network and ConvLSTM [5]. Many ConvLSTM variants have been presented for VFE.
Specifically, Lotter et al. proposed a top-down context-guided LSTM model, PredNet, to utilize contextual information [6]. VFE using a ConvLSTM-based autoencoder was also proposed [7]. However, its performance was only verified on the simple gray-scale Modified National Institute of Standards and Technology (MNIST) handwritten digit dataset. Despite remarkable advances in ConvLSTM, the accurate estimation of pixel movements for future frames is still challenging for multi-modal frames. In addition, ConvLSTM is difficult to train and suffers from prediction quality degradation compared to the state-of-the-art schemes.
Liu et al. proposed ConvTransformer, a multi-head convolutional self-attention architecture that can learn the sequential dependence of a video sequence. ConvTransformer consists of an encoder and a decoder, which encode the sequential dependence between the input frames using the self-attention layer and decode the long-term dependence between the target synthesized frames and the input frames, respectively [8]. It was shown that ConvTransformer can improve the prediction quality compared to ConvLSTM-based schemes, but it incurs heavy computational complexity [16].
The concept of frame alignment was introduced as an essential technique to overcome the difficulties in VFE caused by the complex and diverse motion patterns in natural video frames [9]. The reference frames are first aligned by block-based motion estimation and motion compensation, and future frames are then extrapolated from the aligned frames by a trained deep network. This scheme does not predict future frames directly from past frames but instead predicts them through motion estimation. In addition, it requires four past frames to predict future frames, contrary to the state-of-the-art schemes using two past frames.
GAN has proven its excellence in various image-related fields and is another alternative deep neural network for VFE [10], [11], [12]. Lin et al. proposed a frame extrapolation method for video coding with the Laplacian pyramid of GANs [10]. They used a simplified network to focus on frame compression efficiency and computational complexity instead of the quality of predicted frames.
A cycle generative adversarial network (cycleGAN)-based VFE was also proposed [11] and its feasibility has been verified through experiments [12]. A single generator is trained to predict both future and past frames, concurrently upholding bi-directional prediction consistency through the application of retrospective cycle constraints.
The primary benefit of this approach lies in its ability to facilitate real-time prediction. At the same time, it suffers a drawback in prediction quality, as it directly generates the future frame without the use of flow estimation. The prediction quality is also affected by the number of input frames. Furthermore, the training process is considerably more computationally intensive than that of single-network, flow estimation-based schemes, as it involves simultaneous training of the generator and two discriminators. It is worth noting that the discriminators become redundant once the generator's training is finished.
Lately, flow estimation-based VFE methods, known for their effectiveness in VFI, have attracted significant attention [13], [14], [15]. Woo et al. investigated predicting future frames by using motion trajectories between past frames and proposed a novel VFE algorithm based on future motion estimation [13]. They first estimated bi-directional optical flows between a pair of input frames and used them to approximate future motions by warping frames. The warped frames are then aggregated through a synthesis network. Although experimental results demonstrated that the proposed algorithm outperforms several conventional algorithms, estimating future motions using bi-directional optical flows requires extra processing.
Inspired by the impressive results of VFI, Wu et al. proposed a new optimization framework for VFE via VFI [14], [15]. They solved the extrapolation problem based on optimization by reusing a pre-trained differentiable VFI model without training. The resulting prediction quality is superior to that of existing schemes. However, this scheme cannot support real-time processing because it requires iterative optimization for every prediction.

III. PROPOSED METHODOLOGY FOR PREDICTING FUTURE VIDEO FRAMES
In this section, a methodology for predicting future video frames is proposed. The overview of the proposed methodology is explained first, followed by the detailed architecture of the deep learning-based flow network. Training losses are then defined, and the dataset used to train and evaluate the proposed flow network is also described.

A. OVERVIEW OF THE PROPOSED METHODOLOGY
Fig. 1 illustrates the conceptual framework underlying the proposed methodology for predicting future video frames. In contrast to prevalent conventional approaches reliant on bi-directional flows, our method estimates a uni-directional flow through a dedicated flow network using information from two past frames. Subsequently, the future frame is predicted by forward-warping the most recent frame through the estimated uni-directional flow. This process is the essential idea of our methodology for VFE.
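To make the overview concrete, the following minimal sketch (assuming PyTorch and unbatched (C, H, W) tensors) traces the same two steps; `fnet` and `forward_warp` are hypothetical names standing in for the flow network and warping function detailed in the remainder of this section.

```python
import torch

@torch.no_grad()
def predict_next_frame(fnet, frame_n, frame_n1):
    # Estimate the uni-directional flow F_{n+1->n+2} from the two past
    # frames, then forward-warp the most recent frame with it.
    flow = fnet(frame_n, frame_n1)           # (2, H, W) flow field
    return forward_warp(frame_n1, flow)      # predicted frame I_{n+2}
```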

B. ARCHITECTURE OF FLOW NETWORK
Fig. 2 shows the architecture of the proposed deep learning-based flow network (FNet) for predicting future video frames, which is based on the intermediate flow network (IFNet) used in RIFE [3], [4]. For three consecutive frames $I_n$, $I_{n+1}$, and $I_{n+2}$, IFNet estimates two intermediate flows $F_{n\to n+1}$ and $F_{n+2\to n+1}$ and a fusion mask. Two intermediate frames are estimated from $I_n$ and $I_{n+2}$ through $F_{n\to n+1}$ and $F_{n+2\to n+1}$, respectively. Then, the fusion mask combines the two estimates into the final prediction.
In contrast, the FNet proposed in this paper estimates only a single uni-directional flow for every pixel, $F_{n+1\to n+2}$, between two consecutive frames $I_n$ and $I_{n+1}$ on a real-time basis. Then, the last frame $I_{n+1}$ is forward-warped by the estimated flow $F_{n+1\to n+2}$ to generate $\hat{I}_{n+2}$. Thus, no fusion mask is needed, contrary to RIFE. As shown in Fig. 2, FNet consists of four flow blocks. Each flow block is denoted by FBlock$_i$, $i \in \{0, 1, 2, P\}$, and the flow predicted by FBlock$_i$ is denoted by $F^{i}_{n+1\to n+2}$. FBlock$_0$ estimates $F^{0}_{n+1\to n+2}$ by taking $(I_n, I_{n+1})$ as input. FBlock$_1$ and FBlock$_2$ estimate $F^{1}_{n+1\to n+2}$ and $F^{2}_{n+1\to n+2}$ by taking $(I_n, I_{n+1}, F^{0}_{n+1\to n+2})$ and $(I_n, I_{n+1}, F^{1}_{n+1\to n+2})$ as input, respectively. Finally, FBlock$_P$ has the privilege of using the ground truth frame $I_{n+2}$ along with $(I_n, I_{n+1}, F^{2}_{n+1\to n+2})$ to teach the first blocks to accurately estimate $\hat{I}_{n+2}$; thus, the final block FBlock$_P$ is engaged only in the training phase. The flow estimated by each flow block can be written as

$$F^{i}_{n+1\to n+2} = \begin{cases} \mathrm{FBlock}_0(I_n, I_{n+1}), & i = 0,\\ \mathrm{FBlock}_i(I_n, I_{n+1}, F^{i-1}_{n+1\to n+2}), & i \in \{1, 2\},\\ \mathrm{FBlock}_P(I_n, I_{n+1}, F^{2}_{n+1\to n+2}, I_{n+2}), & i = P. \end{cases} \quad (1)$$
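A sketch of how the inference-time cascade in (1) could be wired together is shown below, assuming PyTorch and unbatched (C, H, W) tensors; `make_fblock` is a hypothetical factory for the per-block CNN, whose internal layers this section does not specify, and FBlock$_P$ is omitted because it is used only during training.

```python
import torch
import torch.nn as nn

class FNetSketch(nn.Module):
    """Illustrative coarse-to-fine cascade of the three inference-time
    flow blocks; FBlock_P is left out since it only guides training."""
    def __init__(self, make_fblock):
        super().__init__()
        # FBlock_0 sees two RGB frames (6 channels); FBlock_1 and
        # FBlock_2 also see the previous 2-channel flow (8 channels).
        self.blocks = nn.ModuleList(
            [make_fblock(in_ch) for in_ch in (6, 8, 8)])

    def forward(self, frame_n, frame_n1):
        x = torch.cat([frame_n, frame_n1], dim=0)        # (6, H, W)
        flow = self.blocks[0](x)                         # F^0_{n+1->n+2}
        for block in self.blocks[1:]:
            flow = block(torch.cat([x, flow], dim=0))    # F^1, then F^2
        return flow
```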
The estimated flow consists of the normalized positional shift between a pair of corresponding pixels from $I_{n+1}$ to $\hat{I}_{n+2}$. Thus, each pixel in $I_{n+1}$ is forward-warped by the flow to predict $\hat{I}_{n+2}$. The image warped by $F^{i}_{n+1\to n+2}$ is denoted by $\overrightarrow{I}^{\,i}_{n+2}$ and can be written as

$$\overrightarrow{I}^{\,i}_{n+2} = \overrightarrow{w}\left(I_{n+1}, F^{i}_{n+1\to n+2}\right), \quad (2)$$

where $\overrightarrow{w}(I, F)$ is the warping function for a given frame $I$ and flow $F$ [17].
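For illustration, a minimal nearest-neighbor forward-warping (splatting) routine is sketched below. It assumes the flow has already been scaled from normalized shifts to pixel units, and it resolves collisions by averaging, which is one simple design choice rather than the exact warping function of [17].

```python
import torch

def forward_warp(frame, flow):
    """Minimal forward-warping sketch. frame: (C, H, W); flow: (2, H, W)
    holding per-pixel (x, y) displacements in pixels. Each source pixel
    is splatted to its nearest target cell; collisions are averaged."""
    C, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Target coordinates after applying the flow, rounded to the grid.
    tx = (xs + flow[0]).round().long().clamp(0, W - 1)
    ty = (ys + flow[1]).round().long().clamp(0, H - 1)
    idx = (ty * W + tx).reshape(-1)                   # flat target indices
    out = torch.zeros(C, H * W)
    cnt = torch.zeros(1, H * W)
    out.index_add_(1, idx, frame.reshape(C, -1))      # accumulate colors
    cnt.index_add_(1, idx, torch.ones(1, H * W))      # count contributors
    return (out / cnt.clamp(min=1)).reshape(C, H, W)
```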
An encoder-decoder called RefineNet is also used to improve the quality of predicted frames, as in previous studies [3], [18], [19]. The encoder consists of four convolutional blocks, each with two convolution layers, while the decoder consists of four transposed convolutional layers. The detailed structure of RefineNet and its implementation can be found in the Appendix of [3] and in [4], respectively.
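A hypothetical sketch of such an encoder-decoder is given below; the channel widths, strides, and activations are illustrative assumptions, since the actual RefineNet structure is specified in [3] and [4].

```python
import torch.nn as nn

def conv_block(cin, cout):
    # Two convolutions per encoder block, as described above.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.PReLU(cout),
        nn.Conv2d(cout, cout, 3, padding=1), nn.PReLU(cout),
    )

class RefineNetSketch(nn.Module):
    """Four two-conv encoder blocks and four transposed-conv decoder
    layers, mirroring the shape of RefineNet; widths are assumptions."""
    def __init__(self, cin=3, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.enc = nn.ModuleList([conv_block(c1, c2) for c1, c2 in
                                  zip([cin] + chs[:-1], chs)])
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(c1, c2, 4, stride=2, padding=1)
            for c1, c2 in zip(chs[::-1], chs[::-1][1:] + [cin])])

    def forward(self, x):
        for e in self.enc:   # downsample by 2 at each block
            x = e(x)
        for d in self.dec:   # upsample back to the input resolution
            x = d(x)
        return x
```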
C. TRAINING LOSSES AND DATASET
All flows and frames are three-dimensional arrays of size $C \times H \times W$, where $C$, $H$, and $W$ represent the number of channels, the height of frames, and the width of frames, respectively. $C = 3$ for (colored) video frames, while $C = 2$ for flows since they contain every pixel's positional shift on the $x$-$y$ coordinate system. Three loss functions are defined to train the FNet. The selective loss of flow $L_{s.flow}$ is defined by

$$L_{s.flow} = \sum_{i \in \{0,1,2\}} \mathrm{MAE}\left(M_i \odot F^{i}_{n+1\to n+2},\; M_i \odot F^{P}_{n+1\to n+2}\right), \quad (3)$$

where $\odot$ denotes the element-wise multiplication of two arrays and $M_i$ is the Boolean mask array defined by

$$M_i = \left[\, \left|\overrightarrow{I}^{\,i}_{n+2} - I_{n+2}\right| > \left|\overrightarrow{I}^{\,P}_{n+2} - I_{n+2}\right| \,\right]. \quad (4)$$

$M_i$ indicates whether each pixel of $\overrightarrow{I}^{\,i}_{n+2}$ is closer to the corresponding pixel of $I_{n+2}$ or not, compared to $\overrightarrow{I}^{\,P}_{n+2}$. The L1 loss of the predicted frame is calculated by

$$L_1 = \mathrm{MAE}\left(\mathrm{LP}(\hat{I}_{n+2}),\; \mathrm{LP}(I_{n+2})\right), \quad (5)$$

where $\mathrm{LP}(\cdot)$ denotes the Laplacian pyramid representation of a frame [3], [19] and MAE indicates the mean absolute error between each pair of elements in two frames. The pyramid level is set to 5. Similarly, the L1 loss of the last FBlock$_P$ is calculated by

$$L_1^{P} = \mathrm{MAE}\left(\mathrm{LP}(\overrightarrow{I}^{\,P}_{n+2}),\; \mathrm{LP}(I_{n+2})\right). \quad (6)$$

The total training loss is defined by linearly combining the three losses:

$$L_{total} = L_1 + L_1^{P} + \lambda L_{s.flow}, \quad (7)$$

where $\lambda$ is the weight of $L_{s.flow}$ to balance the scale of the losses and is set to 0.001 in the experiments.

The FNet is trained on the Vimeo-90K triplet dataset, primarily used for temporal frame interpolation. This dataset comprises 73,171 three-frame sequences with a fixed resolution of 448 × 256. All the frames are extracted from 15,000 video clips selected from Vimeo-90K [20], [21]. For each sequence, the three frames are denoted by $I_n$, $I_{n+1}$, and $I_{n+2}$, respectively, where $I_{n+2}$ is the ground truth to be predicted using $I_n$ and $I_{n+1}$. 5% of the sequences are reserved for validation of the deep learning network during training and the remaining 95% are used for training.
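A minimal PyTorch sketch of these losses is given below, assuming batched (N, C, H, W) tensors. The average-pooling pyramid is a simple stand-in for the pyramid construction of [3], [19], and `total_loss` shows the distillation term for a single student block, whereas (3) sums it over FBlock$_0$ to FBlock$_2$.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=5):
    # img: (N, C, H, W). Band-pass details plus the coarsest residual.
    pyr, cur = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyr.append(cur - up)   # detail lost by downsampling at this level
        cur = down
    pyr.append(cur)
    return pyr

def lap_l1(pred, gt, levels=5):
    # MAE between corresponding pyramid levels, summed over levels.
    return sum((p - g).abs().mean()
               for p, g in zip(laplacian_pyramid(pred, levels),
                               laplacian_pyramid(gt, levels)))

def total_loss(pred, pred_p, gt, flow_i, flow_p, mask_i, lam=1e-3):
    # L_total = L_1 + L_1^P + lambda * L_s.flow, as in (5)-(7).
    l_sflow = ((flow_i - flow_p).abs() * mask_i).mean()   # masked MAE
    return lap_l1(pred, gt) + lap_l1(pred_p, gt) + lam * l_sflow
```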

IV. RESULTS AND DISCUSSIONS
The performance of the proposed EUFPW is analyzed in terms of the similarity between predicted and ground truth frames and the average prediction time required to predict each future frame. A comparative analysis is conducted against two contemporary schemes: the Iterative scheme [14] and the cycleGAN scheme [11]. Two metrics, the structural similarity index measure (SSIM) and MAE, measure the similarity between predicted and ground truth frames. The SSIM is a well-established full-reference metric employed to quantify the structural similarity between two images [22]. One thousand sequences randomly chosen from the Vimeo-90K triplet test sequences are used to evaluate the performance of the three schemes.
The learning rate and total epochs are set to $10^{-4}$ and 100 for the proposed and cycleGAN schemes. The batch size for the proposed scheme is 128, while the batch size for the cycleGAN scheme is 16 because the cycleGAN scheme consumes more memory to train the generator and discriminators. The Iterative scheme undergoes 100 rounds of optimization. Fig. 3 shows the three schemes' SSIM values between predicted frames and ground truths. The side-length of the sliding window used in the comparison is 7 [23]. SSIM equals 1 for two identical frames. The average SSIM values are 0.932, 0.913, and 0.614 for the proposed, Iterative, and cycleGAN schemes, respectively. The proposed scheme has a 2.1% higher SSIM than the Iterative scheme. Fig. 5 shows the L1 loss of the three schemes. The average L1 losses are 0.015, 0.017, and 0.060 for the proposed, Iterative, and cycleGAN schemes, respectively. The EUFPW outperforms the other two schemes by 11.7% and 75%, respectively. The total similarity score is defined as SSIM/L1 loss to integrate SSIM and L1 loss into one metric. Fig. 4 shows the total similarity scores of the three schemes; the average scores follow the same trend, with the proposed scheme achieving the highest score. The average prediction time required to predict one future frame is shown in Fig. 6 to compare the proposed scheme with the conventional schemes in terms of computational complexity. The average prediction times are 0.015, 5.441, and 0.016 seconds for the proposed, Iterative, and cycleGAN schemes, respectively. The Iterative scheme takes a much longer prediction time than the other two schemes due to its iterative optimization for each future frame. In contrast, the proposed scheme reduces the prediction time by 99.7% and 6.3% compared to the Iterative and cycleGAN schemes, respectively.
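As an illustration of the evaluation protocol, the sketch below computes both metrics with scikit-image and NumPy, assuming (H, W, 3) frames scaled to [0, 1]; `win_size=7` matches the 7-pixel sliding window cited above [23], and the returned ratio corresponds to the total similarity score.

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(pred, gt):
    # pred, gt: (H, W, 3) float arrays in [0, 1].
    ssim = structural_similarity(pred, gt, win_size=7,
                                 channel_axis=2, data_range=1.0)
    mae = np.abs(pred - gt).mean()       # per-pixel L1 loss
    return ssim, mae, ssim / mae         # total similarity score
```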
Fig. 7 presents three exemplary prediction results of the three schemes. Each sequence of the Vimeo-90K triplet dataset is identified by xxxxx-yyyy, where the first five-digit number xxxxx is the video clip ID and the second four-digit number yyyy is the sequence ID within the video clip. $I_n$ and $I_{n+1}$ are the two input frames, and $I_{n+2}$ is the target frame to predict. The three frames in the second row are the predicted frames $\hat{I}_{n+2}$ estimated by the three schemes. The ambiguity due to blurring is shown in the red-circled area for the cycleGAN scheme. In particular, Fig. 7-(b) shows that the cycleGAN scheme suffers from severe performance degradation. The Iterative scheme shows prediction quality comparable to the proposed scheme at the expense of prediction time.

V. CONCLUSION
In this paper, EUFPW, a simplified video frame extrapolation scheme using deep learning-based uni-directional flow estimation, is proposed to reduce the processing time without compromising the image quality of the predicted frames. In the proposed scheme, FNet consists of four flow blocks, and each flow block estimates the uni-directional flow using the last two frames. The estimated single uni-directional flow directly forward-warps the last frame to predict the future frame. In addition, the last flow block can exclusively use the ground truth frame to teach the first three flow blocks in the training phase, and is not engaged in the inference phase.
The EUFPW model is trained on the Vimeo-90K triplet dataset, where each sequence consists of three consecutive frames. The performance of the EUFPW scheme is evaluated through extensive experiments in terms of the similarity between predicted and ground truth frames and the average prediction time required to predict a future frame, and is compared to state-of-the-art schemes such as the Iterative and cycleGAN schemes. The experimental results show that the proposed scheme is superior to the state-of-the-art schemes in both similarity and prediction time. In particular, the proposed scheme exhibits improvements of 2.1% and 51.8% in SSIM, along with reductions of 99.7% and 6.3% in prediction time, compared to the Iterative and cycleGAN schemes, respectively.
In future work, the proposed scheme will be applied to various applications such as video streaming.

FIGURE 2. The architecture of the flow network.


FIGURE 4. Total similarity scores of the three schemes.

FIGURE 6. Average prediction time for the three schemes.

FIGURE 7. Three exemplary results of the three schemes.

TABLE 1. Pros and cons of conventional schemes.