Multi-Scale Attention Generative Adversarial Networks for Video Frame Interpolation

Video frame interpolation is a fundamental task in computer vision. Recent methods usually apply convolutional neural networks to generate the intermediate frame from two consecutive input frames, but existing methods sometimes fail to handle complex motion and long-range dependencies. In this paper, a multi-scale dense attention generative adversarial network is proposed. First, a multi-scale generative adversarial framework is established for video frame interpolation; generators from coarse to fine can better combine global and local information. Second, an attention module introduced into the generator makes the network focus accurately on moving objects. Third, a sequence discriminator is designed to improve the ability to capture spatial and temporal consistency in the frame sequence. Ablation experiments confirm the effectiveness of these three contributions, and results on several datasets demonstrate that our approach attains higher performance and produces more photo-realistic in-between frames compared with previous works.


I. INTRODUCTION
Video frame interpolation is a fundamental problem in computer vision that aims to synthesize plausible intermediate frames between any two adjacent frames. It is a challenging task requiring a generative framework that can model motion information and generate both spatially and temporally consistent frames. Video frame interpolation has numerous applications, such as video frame rate conversion [1], slow-motion video generation [2] and video compression [3]. It helps surmount the temporal constraints of camera sensors and has drawn increasing attention in the image and video processing community.
Video frame interpolation is complicated for three reasons: (1) the variability of real-world scenes makes modeling video sequences considerably difficult; (2) large motion commonly appears in videos, making temporal and spatial consistency harder to capture; (3) a video interpolation framework may fail to generate realistic frames with accurate content when local occlusion occurs in the video.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiping Wen.

Most existing video interpolation methods use convolutional neural networks (CNNs). These methods can be divided into two categories: those based on estimating interpolation convolution kernels [4], [5] and those based on optical flow estimation [2], [6], [7]. The first technique combines motion estimation and pixel synthesis into a single process that produces a convolution kernel for each pixel; a CNN captures both the motion between the original consecutive frames and the coefficients for pixel synthesis. The second technique first estimates optical flow between the input frames and then interpolates intermediate frames from the estimated dense correspondences, so the accuracy of optical flow estimation directly affects the quality of the generated frame. However, both kinds of CNN-based techniques are essentially indirect ways to synthesize intermediate frames. On the one hand, they produce unrealistic and blurry results if the intermediate step fails: the estimated optical flow or kernel is commonly incorrect when motion regions undergo occlusion, blur, or sudden illumination change. On the other hand, computing optical flow or a kernel for each pixel requires considerable time and memory, which significantly increases the difficulty of video frame interpolation. Recently, generative adversarial networks (GANs) have achieved great success in image and video generation [8], [12], [48]. A GAN contains a generator network and a discriminator network that are trained against each other; through adversarial learning, the generator can achieve satisfactory results.
Building an effective frame interpolation model for natural images is hard because video sequences are complex and high dimensional. Blurred results and artifacts arise easily when processing high-resolution images, since objects of very different sizes appear in the same image. Although stacking multiple convolution operations can expand the receptive field, the problem of long-range dependence then becomes more prominent [15], [49]. The proposed method uses a multi-scale structure that models global information and local features more accurately by taking images at different size levels as input. The effective image patch size becomes smaller going up the multi-scale pyramid, so the network can process more elaborate local image features.
In this paper, a novel and explicit video frame interpolation method is proposed that can avoid the errors caused by a mistaken intermediate process. Inspired by recent advances in generative adversarial networks [8]-[11] for image synthesis [12]-[14] and video prediction [15]-[17], we introduce a frame interpolation framework using multi-scale dense attention generative adversarial networks, i.e., FI-MSAGAN. FI-MSAGAN is an effective end-to-end trainable fully convolutional network that takes two consecutive frames at arbitrary resolution and directly produces a high-quality intermediate frame. FI-MSAGAN progressively reconstructs intermediate frames in a coarse-to-fine manner. At each scale, the frame generator uses residual blocks [18], [19] and skip connections to form a dense network structure, and a frame discriminator judges whether input frame patches are real or fake, helping the model capture fine details and texture structure. To make the network focus on large moving objects and handle complex motion, we introduce an attention module in the generator. In addition, a sequence discriminator is designed to provide a feedback signal for the generator to capture the temporal consistency between video frames. The training of FI-MSAGAN is end-to-end and guided by a comprehensive loss function containing a multi-scale frame adversarial loss, a sequence adversarial loss, a perceptual loss and a reconstruction loss. This paper makes three main contributions:

1. We propose an efficient multi-scale dense attention generative adversarial network (FI-MSAGAN) for video frame interpolation. Our method exploits global and local information to produce a more realistic intermediate frame directly, without optical flow estimation.

2. We design a dense attention generator that better reconstructs the temporal and spatial consistency of video sequences. The generator consists of a synthesis module and an attention module; the attention map obtained from the attention module makes the generator adaptively focus on dynamic areas.

3. We construct a sequence adversarial loss through a sequence discriminator, while frame discriminators recover high-frequency structure. The total loss of the network includes four terms: multi-scale frame adversarial loss, sequence adversarial loss, feature perceptual loss, and reconstruction loss, which together further improve the quality of the interpolated frame.

II. RELATED WORK
Many advanced approaches for video frame interpolation explicitly or implicitly assume consistent motion across video sequences. Conventional approaches usually consist of two stages: motion estimation and pixel synthesis. Phase-based methods, optical flow and local convolution kernels can be used to capture motion consistency. Mahajan et al. [21] hold that a given pixel in the interpolated frame traces out a path in the original video sequence; based on this idea, they copy and move pixel gradients from the inputs to the interpolated frames along that path. Meyer et al. [1] present an efficient method for computing in-between frames by simple per-pixel phase propagation across the levels of a multi-scale pyramid, but it degrades in the case of large appearance changes. Optical flow prediction is the key technology for motion estimation and determines the accuracy of the interpolation results, but flow methods are often challenged by large motion and brightness changes [22]-[24]. Deep learning has achieved enormous success and dramatic development in image and video recognition, and most state-of-the-art optical flow models use deep networks [25], which suggests that CNNs can understand motion information between frames. To obtain better results, many researchers merge optical flow estimation and frame interpolation into a single model [2], [6], [7], [26]-[28]. Liu et al. [6] design a deep network with a voxel flow layer (DVF) to synthesize video frames by flowing pixel values from the input video volume; DVF is trained in an unsupervised manner and can easily be extended to extrapolating video. Jiang et al. [2] first use a flow-computation U-Net to estimate bi-directional flow, which is linearly fused to get a rough intermediate flow, and then use a second flow-interpolation U-Net to refine the approximate flow and visibility maps.
Finally, they synthesize the intermediate frame by applying the visibility maps to the warped images. Niklaus and Liu [27] apply pixel-wise contextual information extracted by a pre-trained network to the estimated bidirectional flow, and use a frame synthesis network to produce the interpolated frame in a context-aware fashion. Zhang et al. [7] use a 3D U-Net feature extractor to mine spatio-temporal context and rebuild texture, with a coarse-to-fine architecture to improve optical flow estimation. Li et al. [28] propose a lightweight network that estimates optical flow at the feature level and introduce a new Sobolev loss to achieve better results.

VOLUME 8, 2020

Different from optical-flow-based models, Niklaus et al. [4] develop a network (AdaConv) that estimates a spatially adaptive convolution kernel for each pixel. However, the memory demand grows quickly when the model needs large kernels to deal with large motion. Niklaus et al. [5] improve AdaConv by using pairs of 1D kernels to approximate regular 2D ones; the use of local separable convolution kernels significantly reduces the model's parameters. Several papers [29], [30] use CNNs to directly produce intermediate frames, but those models are simple and struggle to generate reasonable results.
After the invention of generative adversarial networks [8], many works applied this framework to generate images in the context of image-to-image translation [12], [31], super-resolution [32], and video prediction [15]-[17], [48]. Mathieu et al. [15] first applied adversarial learning to video prediction, employing an image gradient loss in a multi-scale architecture and noticeably reducing blurring artifacts. In parallel, many works have improved the performance of GANs [13], [33], [34], [36]: WGAN [33] and LSGAN [36] introduce new discriminator loss functions to relieve training instability, while BigGAN [13] and ProGAN [34] allow the generator to map noise to high-resolution, realistic images. Inspired by these advances, our network uses a multi-scale generative adversarial strategy to produce the middle frame in an explicit manner.

III. PROPOSED APPROACH
In this section, we describe our video frame interpolation method in detail. First, we define the problem and give some notation. Then, we introduce the proposed multi-scale frame interpolation network. Finally, we present the loss functions used to train the network.

A. PROBLEM DESCRIPTION
Our goal is to synthesize the intermediate frame between two successive frames I_1, I_2 ∈ R^(H×W×C), where H, W and C are the height, width and number of channels of a frame. We denote the generated in-between frame as I_s; it has the same size as the ground-truth intermediate frame I_gt, which is available during training but not at test time.
In this work, we take advantage of generative adversarial networks for more effective video frame interpolation. The generator receives the consecutive frames I_1 and I_2 as inputs and learns a mapping that synthesizes the intermediate frame I_s:

I_s = G(I_1, I_2).   (1)

The discriminator distinguishes real frames from fake frames produced by the generator and provides the update signal for G.

B. FI-MSAGAN STRUCTURE FOR VIDEO FRAME INTERPOLATION
Our proposed method adopts a multi-scale structure consisting of several generators and discriminators that take images of different sizes as inputs. The multi-scale structure can better combine global information and local details for photo-realistic frame generation. To preserve the spatial and temporal consistency of the generated video, we introduce a dense attention generator. Moreover, we use both a frame discriminator and a sequence discriminator to distinguish fake data from real. The generators and discriminators are described as follows.

1) MULTI-SCALE DENSE ATTENTION GENERATOR
We construct the proposed video frame interpolation method using a multi-scale pyramidal structure as shown in Figure 1.
There are four scale levels in the network, and the generator at each level is a subnetwork with several residual blocks, denoted G_i, i = 1, ..., 4. The generators take the original frames I_1 and I_2, or downsampled versions of them, as inputs and synthesize the interpolated frame gradually. The generated frame at each level is obtained as

I_s^1 = G_1(I_1^1, I_2^1),
I_s^i = G_i(I_1^i, U(I_s^(i-1)), I_2^i), i = 2, 3, 4,   (2)

where U denotes upsampling. At the coarsest level, we feed the downsampled frames I_1^1 and I_2^1 of size H/2^3 × W/2^3 into G_1. At each finer level, the inputs of G_i also contain the upsampled output of G_(i-1); in practice, we insert the output of G_(i-1) between the frames I_1^i and I_2^i so that the generator at the finer level can capture temporal coherence.
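The coarse-to-fine generation described above can be sketched as follows. This is a minimal NumPy illustration of the pyramid wiring only: `generator_stub` is a hypothetical stand-in for the residual sub-networks G_i, and the nearest-neighbour `downsample`/`upsample` replace the resizing a real implementation would use.

```python
import numpy as np

def downsample(img, factor):
    # Nearest-neighbour downsampling by an integer factor (placeholder
    # for the bilinear resizing a real implementation would use).
    return img[::factor, ::factor]

def upsample(img, factor):
    # Nearest-neighbour upsampling U(.) by an integer factor.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def generator_stub(*inputs):
    # Hypothetical stand-in for the residual sub-network G_i: here just
    # the mean of its inputs, so the wiring can be exercised end to end.
    return np.mean(np.stack(inputs), axis=0)

def coarse_to_fine_interpolate(i1, i2, levels=4):
    """Run the 4-level pyramid: G_1 sees only downsampled inputs;
    each finer G_i additionally receives U(I_s^{i-1})."""
    out = None
    for i in range(1, levels + 1):
        f = 2 ** (levels - i)             # downsampling factor at level i
        i1_i, i2_i = downsample(i1, f), downsample(i2, f)
        if out is None:                   # coarsest level
            out = generator_stub(i1_i, i2_i)
        else:                             # finer levels reuse U(previous)
            out = generator_stub(i1_i, upsample(out, 2), i2_i)
    return out
```

With 8 × 8 inputs, level 1 operates at 1 × 1 and the final level returns a frame at the original 8 × 8 resolution.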
The entire generator network is a cascade of sub-networks with a common structure at each scale. Generator G_i consists of a synthesis module S_i and an attention module A_i; the structure of G_i is shown in Figure 2. In the synthesis module (the blue rectangle in Figure 2), the input layer uses a 5 × 5 convolution kernel to obtain a larger receptive field. The number of residual blocks in S_i increases with the scale of the input images. We adopt pre-activation residual blocks with 3 × 3 kernels and stride 1. All convolution operations output 64 feature maps, except the final layer, which outputs a frame with a tanh activation. We use skip connections between the input-layer/shallow-layer feature maps and the last residual block, maintaining information extracted from shallow layers in the form of a dense connection.
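The pre-activation ordering and the long skip from shallow features can be sketched schematically. In this dependency-free illustration, `conv_stub` is a hypothetical pointwise scaling standing in for a real 3 × 3 convolution, and the fixed weights 0.5 are arbitrary; only the block wiring (activation before convolution, identity skip, shallow-feature re-injection) reflects the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_stub(x, w):
    # Placeholder for a 3x3, stride-1 convolution that preserves the
    # spatial size; a pointwise scaling keeps the sketch dependency-free.
    return w * x

def preact_residual_block(x, w1=0.5, w2=0.5):
    # Pre-activation ordering: the activation comes before each
    # convolution, and the input is added back at the end (identity skip).
    h = conv_stub(relu(x), w1)
    h = conv_stub(relu(h), w2)
    return x + h

def synthesis_module(x, num_blocks=4):
    # Dense-style wiring: the shallow (input-layer) feature map is
    # re-injected via a long skip after the last residual block.
    shallow = x
    h = x
    for _ in range(num_blocks):
        h = preact_residual_block(h)
    return h + shallow
```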
Recent works have achieved great success in image-to-image translation [12], [31], [37], [38]. These works adopt an architecture similar to [39] for the generator, containing stride-2 convolutions for down-sampling, several residual blocks for translating between domain feature spaces, and fractionally-strided convolutions with stride 1/2 for up-sampling. Inspired by [15], however, we remove the down-sample and up-sample convolution layers and use only a few residual blocks in our multi-scale generator architecture, in both the synthesis module S_i and the attention module A_i. At each scale, the size of the feature maps remains unchanged through the information flow. Our generator structure avoids two disadvantages of previous works: (1) the partial loss of shallow-layer feature information caused by down-sampling convolutions, and (2) artifacts introduced by up-sampling convolutions. The generation process is coarse to fine, so the network can better model global information and local features.
Video frame synthesis is a complicated and challenging task due to large, complex motion and occlusions. An effective video interpolation algorithm should be able to focus accurately on moving objects, but a network built only from a series of convolution operations cannot, because a traditional convolution accounts only for short-range dependencies, limited by its kernel size. Recent research shows that, motivated by the human perception procedure, attention mechanisms are advantageous in computer vision, e.g., image classification [40], image-to-image translation [41], [42], and video classification [20]. Rather than processing a single image or a sequence using only local information, attention allows the network to focus on the most relevant parts of the features as needed.
We produce an attention map by inserting an attention module to address these issues. As shown in the purple rectangle in Figure 2, the attention module A_i helps the generator capture long-range dependencies so that G_i can adaptively focus on dynamic areas. The attention module differs slightly from the synthesis module: we remove the skip connections, and the final activation function is a sigmoid, so the attention map takes continuous values in [0, 1].
At scale i, we use the attention map am_i and the input frame I_1^i to create the final intermediate frame:

I_s^i = am_i ⊗ Ŝ_i + (1 − am_i) ⊗ I_1^i,   (3)

where ⊗ denotes the element-wise product and Ŝ_i is the output of the synthesis module. The first term produces the intermediate frame, while the second term corresponds to pixels taken from the previous frame. Our attention module can effectively recognize the key regions of moving objects in the frame sequence while keeping the static background intact.
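The attention-weighted combination described above is a simple convex blend, which can be sketched directly; the function and argument names below are illustrative, not from the paper.

```python
import numpy as np

def attention_blend(synth_frame, prev_frame, attention_map):
    """Blend the synthesis-module output with pixels from I_1:
    high attention -> trust the synthesized (moving) content,
    low attention  -> keep the static background from the previous frame."""
    am = attention_map                      # sigmoid output, values in [0, 1]
    return am * synth_frame + (1.0 - am) * prev_frame
```

Where the attention map is near 1 the synthesized pixel dominates; where it is near 0 the background pixel from I_1 is kept unchanged.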

2) MULTI-SCALE FRAME DISCRIMINATOR AND SEQUENCE DISCRIMINATOR
Our network has a multi-scale structure, so we pair each generator with a corresponding frame discriminator; the whole network incorporates several generator-discriminator pairs that process input images of different sizes. These GANs generate the final intermediate frame from coarse to fine. Frame discriminators at different scales have different receptive fields relative to the original image size. The coarsest-scale frame discriminator has the largest receptive field and the best global view, guiding the generator to produce globally consistent frames, while the frame discriminator at the finest level excels at guiding the generator to produce more exquisite detail. At the same time, this makes training the generator easier and more stable.
The architecture of frame discriminator D_i is similar to PatchGAN [12], as shown in Figure 3. It is well known that an L1 loss captures only low-frequency information in image and video generation tasks. In our method, we apply PatchGAN to distinguish local image patches so that the frame discriminator can model high frequencies.
To make the frame discriminators at different levels focus on image regions of different sizes, D_i contains 2 to 5 convolution layers from coarse to fine. All convolutional layers use 3 × 3 kernels with stride 2, except the first layer, which uses stride 1. Each layer outputs 64 feature maps, and the last layer averages all local responses to produce the final output of D_i.
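As a sanity check on this design, the receptive field of such a stack (3 × 3 kernels, stride 1 for the first layer and stride 2 afterwards) can be computed with the standard receptive-field recurrence; these derived numbers are our own illustration, not figures reported in the paper.

```python
def receptive_field(num_layers):
    """Receptive field of a discriminator built from 3x3 convolutions,
    stride 1 for the first layer and stride 2 for all later layers."""
    rf, jump = 1, 1
    for layer in range(num_layers):
        kernel = 3
        stride = 1 if layer == 0 else 2
        rf += (kernel - 1) * jump   # growth contributed by this layer
        jump *= stride              # effective stride so far
    return rf
```

For 2 to 5 layers this gives receptive fields of 5, 9, 17 and 33 pixels, so each deeper discriminator judges patches covering a larger region of its input.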
Temporal consistency of the generated intermediate frames is crucial for video frame interpolation. We therefore design a sequence discriminator D_S to guide the generator to capture spatial and temporal consistency in the video. The structure of the sequence discriminator is similar to a traditional discriminator [43], as shown in Figure 4. We adopt the sequence discriminator only at the finest scale. The two consecutive real frames and the generated intermediate frame compose a fake sequence, Fseq = concat(I_1, I_s, I_2), and D_S distinguishes real sequences from fake ones.
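One plausible reading of the concatenation is stacking the three frames along a new temporal axis before feeding D_S; the sketch below assumes that interpretation, and the function name is ours.

```python
import numpy as np

def make_sequences(i1, i_gen, i_real, i2):
    # Fseq pairs the generated middle frame with its real neighbours;
    # Rseq uses the ground-truth middle frame instead.
    fseq = np.stack([i1, i_gen, i2])    # "fake" input for D_S
    rseq = np.stack([i1, i_real, i2])   # "real" input for D_S
    return fseq, rseq
```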

C. LOSS FUNCTIONS
Our multi-scale video frame interpolation approach is based on GANs [8], which optimize the following objective:

min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))].   (4)

The proposed method uses two kinds of discriminator, corresponding to two adversarial losses: 1) a multi-scale frame adversarial loss and 2) a sequence adversarial loss. We also adopt two reconstruction-type losses: 3) a frame reconstruction loss and 4) a feature perceptual loss, so the objective contains four parts in total.

The adversarial loss for the frame discriminator at level i is given in formula (5); we maximize it to optimize D_i and minimize it to optimize G_i:

L_adv^i = E[log D_i(I_r^i)] + E[log(1 − D_i(I_s^i))],   (5)

where I_r^i is the real intermediate frame and I_s^i is synthesized by G_i as in formula (2). The multi-scale frame adversarial loss L_F is the weighted sum of the losses at all scales:

L_F = Σ_i λ_i L_adv^i.   (6)

The sequence discriminator D_S outputs the probability that the input is a real frame sequence. Each sequence contains three consecutive frames at the original size, i.e., Fseq = concat(I_1, I_s, I_2) and Rseq = concat(I_1, I_r, I_2). The sequence adversarial loss is employed only at the top level:

L_S = E[log D_S(Rseq)] + E[log(1 − D_S(Fseq))].   (7)

We use a pixel-wise reconstruction loss in the L1 norm, which produces sharper results than MSE:

L_rec = ||I_s − I_r||_1.   (8)

Inspired by the perceptual loss used in single-image super-resolution [39], we employ the 5_4 features of the VGG network [44] as a feature perceptual loss:

L_vgg = ||φ(I_s) − φ(I_r)||^2,   (9)

where φ(·) denotes features extracted from a VGG network pretrained on ImageNet; the feature perceptual loss correlates with human perception. Ultimately, the total loss used to train our multi-scale dense attention frame interpolation network is

L = L_rec + λ_GAN (L_F + L_S) + λ_vgg L_vgg.   (10)

In practice, we set λ_GAN = 0.0001 and λ_vgg = 0.001.

IV. EXPERIMENTS
In this section, we perform extensive experiments to demonstrate the effectiveness of our approach. First, we describe the experimental details. Then, we provide an ablation study to evidence the contributions of the proposed method, i.e., 1) the multi-scale generator, 2) the attention module and 3) the sequence adversarial loss. Finally, we compare our method with state-of-the-art video frame interpolation methods and evaluate them both qualitatively and quantitatively.

A. EXPERIMENT DETAILS
Our network can be trained on any available video without labels. We use the Adobe240-fps videos collected by [45] for their real and diverse scenes. We extract three consecutive frames from these videos to form frame triplets and, to train the model effectively, discard triplets whose first and third frames are almost identical. Finally, 50k triplets are selected; 783 randomly selected triplets are used for testing and the rest for training. For comparison with other methods, we also use UCF101 [46] to verify network performance. The triplets are cropped to 128 × 128 patches in a way that avoids patches containing no useful information, while original-size frames are used for testing. Because it is fully convolutional, our frame interpolation method can be applied to input images of arbitrary size. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are widely used objective metrics in video frame interpolation, and we use both for evaluation; higher values indicate better results. We train our model with Adam [47] (β_1 = 0.9, β_2 = 0.999) for both generator and discriminator. The initial learning rate is 0.0001 and is decayed every 20 epochs, and the batch size is 8. The numbers of residual blocks in the generators are 4, 6, 6 and 8 from coarse to fine. We implement the network in TensorFlow on an NVIDIA GeForce GTX 1080 Ti.
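For reference, the PSNR metric used throughout the evaluation can be computed as below; this is the standard definition for images in the [0, 255] range, not code from the paper (SSIM is more involved and is typically taken from a library such as scikit-image).

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB, assuming pixel values in [0, peak]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Identical frames give infinite PSNR; a uniform error of 16 grey levels on 8-bit images gives roughly 24 dB.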

B. ABLATION STUDY
We perform an ablation study to prove the effectiveness of our contributions: the multi-scale generator structure, the attention module and the sequence adversarial loss. The baseline model is a plain GAN framework without the multi-scale setting or the attention module in the generator; its training loss includes the original-scale frame adversarial loss, the frame reconstruction loss and the feature perceptual loss. The Adobe240-fps test set is used in these experiments. The results of the ablation study are shown in TABLE 1, where the full model uses all three proposed contributions. As reported in TABLE 1, the baseline model achieves acceptable PSNR and SSIM values, showing that the GAN framework can be applied to the video frame interpolation task. But the predicted intermediate frame is visually unsatisfying, as shown in Figure 5(b): the lettering on the moving car is blurry. PSNR and SSIM are significantly improved by inserting our proposed components into the baseline model, and by both quantitative metrics the full model performs best among the ablation variants. Comparing rows 1 and 2 of TABLE 1, the multi-scale structure is effective because the model can integrate global and local information. The generator with only the attention module is slightly worse than with only the multi-scale structure, possibly because the multi-scale structure also enables the multi-scale loss; in addition, multiple generators and discriminators improve GAN training stability. Using the multi-scale structure and the attention module together yields better performance. Comparing the last two rows of TABLE 1, the full model captures spatial and temporal consistency owing to the sequence discriminator.
The results of models with different combinations are shown in Figure 5. Although the baseline model is feasible, it still produces blurry results. The baseline with the multi-scale structure generates more realistic frames than the baseline with attention. The middle frame produced by our full model is the sharpest; the latter two models are very similar in visual perception, but the full model has better quantitative performance.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
To verify the competitiveness of the proposed method, we compare our network with several state-of-the-art video frame interpolation methods, both qualitatively and quantitatively. The compared methods are DVF [6], SepConv [5] and Super SloMo [2], which are representative of the main video frame interpolation techniques: DVF and SepConv are recent direct CNN-based methods, while Super SloMo uses CNN-estimated optical flow to assist in generating the intermediate frame. Because pretrained models are not available, we train each method on our data with its official implementation. We make quantitative and qualitative comparisons on UCF101 and Adobe240-fps, so the comparison results are persuasive. The PSNR and SSIM of the different methods are shown in TABLE 2 and TABLE 3.
As reported in TABLE 2 and TABLE 3, our method outperforms the others overall. The SSIM of our model is slightly lower than SepConv and Super SloMo on the UCF101 test set, but its PSNR is the best among these approaches, which shows that our method is effective. Although our SSIM is not the highest on UCF101, the intermediate frames generated by our model look more realistic, as shown in Figure 7; numerical indicators are not always consistent with human perception. We list the runtimes of the different methods in TABLE 3: when testing at 640 × 360 resolution on a 1080 Ti GPU, our model is the most efficient at 0.324 s per frame, while the other methods take longer to generate an intermediate frame. Moreover, our generator has about 3 million parameters, whereas the other methods have more than 10 million. Thanks to the multi-scale generative adversarial structure and the attention modules, our model achieves better performance with fewer parameters.
Some interpolation results of the other methods and our model are shown in Figure 6 and Figure 7. On the Adobe240-fps test set, our model produces more realistic intermediate frames, as shown in Figure 6 (e) and (j): in Figure 6 (e) the front tyre of the red car shows less blurring, and in Figure 6 (j) the feet of the person riding the bicycle are sharper, while the chain and transmission look more realistic. Figure 7 shows results of these methods on the UCF101 dataset, with column (e) giving our results: the edges of the hula hoop and paddle are better reconstructed, and the texture of the sole is more vivid. The intermediate frames generated by our model have better visual quality and accurate spatial and temporal consistency. Thanks to the multi-scale structure and attention mechanism, our model can focus on moving objects and combine global and local features to synthesize a superior intermediate frame.

V. CONCLUSION
In this paper, we propose a novel multi-scale dense attention generative adversarial network (FI-MSAGAN) for video frame interpolation. First, we establish a generative adversarial framework with a multi-scale loss to produce the middle frame in a coarse-to-fine manner; generators can better combine global and local information by receiving rough results from lower levels. Second, an attention module is embedded in the generator so that the network can accurately focus on moving objects and handle large motion precisely. Third, we adopt a sequence discriminator to judge whether a sequence is real or fake, so the generators can capture spatial and temporal consistency in the frame sequence via its feedback. The ablation study proves the effectiveness of our contributions, and comparative experiments with several state-of-the-art video frame interpolation methods, evaluated both qualitatively and quantitatively, show that our method outperforms them.
Recent works have shown that traditional numerical metrics (PSNR, SSIM) are not always consistent with human perception, so designing new metrics better aligned with human cognition would help evaluate video frame interpolation models. Furthermore, combining GANs with optical flow estimation is also worthy of future research.