Local-Global Fusion Network for Video Super-Resolution

Video super-resolution aims to effectively restore high-resolution (HR) videos from low-resolution (LR) ones. Previous methods commonly used optical flow to perform frame alignment and designed frameworks from the perspective of space and time. However, optical flow estimation is error-prone, and inaccurate estimates lead to inferior restoration results. In addition, how to effectively fuse the features of multiple video frames remains a challenging problem. In this paper, we propose a Local-Global Fusion Network (LGFN) to address these issues from a novel viewpoint. As an alternative to optical flow, deformable convolutions (DCs) with decreased multi-dilation convolution units (DMDCUs) are applied for efficient implicit alignment. Moreover, a two-branch structure, consisting of a Local Fusion Module (LFM) and a Global Fusion Module (GFM), is proposed to combine information from two different aspects. Specifically, the LFM focuses on the relationship between adjacent frames and maintains temporal consistency, while the GFM exploits all related features globally through a video shuffle strategy. Experimental results on several datasets demonstrate that our LGFN not only achieves competitive performance with state-of-the-art methods but also restores a wide variety of video frames reliably. The results of our LGFN on benchmark datasets are presented at https://github.com/BIOINSu/LGFN and the source code will be released as soon as the paper is accepted.


I. INTRODUCTION
As one of the fundamental sub-tasks of video enhancement, video super-resolution (VSR) aims at mapping low-resolution (LR) videos into corresponding high-resolution (HR) ones. VSR can either be regarded as a separate mission or be combined with other tasks, such as video interpolation [1]. In addition, the high availability of various videos makes it practical in many scenarios, including satellite [2], high dynamic range (HDR) [3] and surveillance videos [4], [5]. Therefore, research on VSR has both scientific and practical significance.
It is widely known that the super-resolution (SR) problem is ill-posed, and VSR is no exception. This characteristic means that one given input may correspond to various results, and how to restore a sophisticated output remains difficult. On the other hand, while single image super-resolution (SISR) has seen breakthroughs in recent years, directly applying SISR to multiple frames with temporal dependency neglects the temporal relationship. By taking advantage of correlated video frames, features can complement each other to obtain better reconstruction outcomes and maintain temporal consistency at the same time. Similar studies which use multiple images for reconstruction also appear in stereo image SR [6]-[9], light field SR [10]-[12] and multi-image SR (MISR) [13]-[16]. The connections between images in these SR tasks are mostly related to different viewpoints or small spatial displacements instead of time.
Since existing approaches [17]-[31] commonly take several LR frames as input and reconstruct only the central reference frame, one key issue that needs to be tackled is how to effectively use information from neighboring frames. Many works [24]-[30] have proved that aligning neighboring frames with the reference frame is a feasible choice, rather than directly concatenating them together. Benefiting from the development of optical flow estimation methods, making use of optical flow for warping and motion compensation is a useful strategy [24]-[27]. However, it is hard to obtain accurate predictions with optical flow due to motion blur, and it may also introduce unnecessary noise into the final SR results. Besides, such tactics usually incur high computational cost. Another inevitable question is how to reasonably fuse features together to achieve better results. One simple and straightforward way is to splice all the features together and leave the rest of the framework to learn the fusion strategy automatically [17], [18]. However, stacking numerous features puts great pressure on the following components, making the model unable to fully utilize effective features. In order to overcome such drawbacks, several methods [21]-[23] attempted to integrate the current features with one neighbor in front and one behind, while others [29]-[31] were devoted to combining arbitrary neighboring frames with the reference frame directly. The former imposes strong coherence between adjacent features, and the latter involves abundant temporal information to complement the central features. Within our fusion module, we cover both aspects to combine their advantages and acquire adequately fused features.
In this paper, we propose a Local-Global Fusion Network (LGFN) for restoring degraded videos to a high-quality state. Overall, our method can be divided into the following stages. Firstly, a shallow feature extractor is applied to the input frames. Then, we align the neighboring frames with the reference frame at the feature level through stacked and improved deformable convolutions (DCs). In contrast to the normal operation, in each DC we utilize a decreased multi-dilation convolution unit (DMDCU) to predict more accurate sampling offsets and modulation scalars. Specifically, the number of dilation convolutions and the dilation rates in each DMDCU taper off according to the position of the DC: the deeper, the fewer. This idea is based on the observation that shallow neighboring features, which have larger displacements with respect to the reference features, need larger receptive fields to capture the motions between pixels, whereas deep features that have already been roughly aligned can achieve accurate alignment more easily. Such practice not only improves performance but also keeps the increase in computational cost small. After the alignment, the features of consecutive frames are sent into two different branches for feature fusion. One branch, called the Local Fusion Module (LFM), is designed to combine information between adjacent frames so that the generated features maintain close correlation and temporal consistency. In order to exploit abundant features from all input frames, the other branch, called the Global Fusion Module (GFM), fuses the features by making use of a video shuffle [32] strategy across all frames. Information can thus be exchanged globally, which brings more available and effective features for each frame. Next, the features from both branches are integrated correspondingly by a 1 × 1 convolution and then merged to represent the features of the central reference frame.
In the end, the SR reconstruction module is responsible for restoring high-resolution video frames, where we apply the advanced residual channel attention blocks (RCABs) to fulfil the task.
Overall, the major contributions of this work can be summarized as follows: • We present an end-to-end trainable deep learning model, LGFN, for high-quality video super-resolution. Experimental results on several benchmark datasets demonstrate that our concise and effective network achieves competitive performance with state-of-the-art methods.
• In order to perform motion compensation implicitly, we introduce an optimized deformable convolution (DC) with a decreased multi-dilation convolution unit (DMDCU). Unlike the ordinary version of DC, our modification predicts more accurate parameters and thus performs feature alignment more precisely.
• We carefully design a two-branch structure for superior feature fusion. The Local Fusion Module (LFM) imposes strong correlation on adjacent features and maintains temporal consistency. The Global Fusion Module (GFM) exchanges information among all features and utilizes more abundant details through a video shuffle strategy, which mutually reinforces the fusion capability. Both branches are dedicated to performing feature fusion effectively.
The rest of this paper is structured as follows. Section 2 presents the related work, including SISR and VSR. Section 3 explains the network architecture and introduces each component in detail. Experimental results with both quantitative and visual comparisons are shown in Section 4. Ablation experiments in Section 5 study the specific impact of each module. Finally, conclusions and future work are discussed in Section 6.

II. RELATED WORK

A. SINGLE IMAGE SUPER-RESOLUTION
Single image super-resolution (SISR) has achieved impressive performance in the past few years. Dong et al. [33] first utilized a simple network, SRCNN, for super-resolution, which consists of only three convolution layers. They upscaled the LR images at the front of the network, and some subsequent approaches inherited this strategy. As the network becomes deeper, such practice results in high computational cost for the following components. To overcome this limitation, FSRCNN [34] applied a deconvolution layer at the end of the network for upsampling, so that the middle mapping layers can be shared across different resolutions. Another efficient design, ESPCN [35], used a sub-pixel convolution layer which transforms the channel dimension into the space dimension, so that the upsampling process can be learned implicitly.
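The channel-to-space rearrangement behind the sub-pixel convolution layer can be illustrated with a small NumPy sketch. This is an illustration of the general pixel-shuffle operation, not ESPCN's actual implementation:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) feature map into (C, H*r, W*r),
    as in a sub-pixel convolution (pixel shuffle) layer."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0
    c = c_r2 // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2).astype(float)
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 4, 4)
```

Each 2 × 2 block of the output is filled from four different input channels at the same spatial location, which is how the LR channel dimension encodes the missing HR pixels.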
More complex and advanced models have been introduced in recent studies. Tong et al. [36] put forward SRDenseNet with dense blocks, which utilizes features from previous layers by concatenation. RDN [37] combines the advantages of both residual learning and skip connections to achieve better results. Dai et al. [38] came up with an attention mechanism called second-order attention, which focuses on the feature correlation inside layers. Zhang et al. [39] proposed the residual channel attention network (RCAN) with attention in the channel dimension to obtain richer high-frequency information. We utilize the basic structure of RCAN in our reconstruction module.

B. VIDEO SUPER-RESOLUTION
Since a video is a temporal sequence, it is feasible to use a recurrent structure to model the temporal information.
Sajjadi et al. [20] introduced the frame-recurrent video super-resolution (FRVSR) approach to restore an HR frame from the current and the former LR frames, where the former one is warped using the current predicted optical flow. Yan et al. [21] proposed the frame and feature-context video super-resolution (FFCVSR) approach, where the former information is replaced with the current one after several iterations to reduce the accumulated deviation. Zhu et al. [22] put forward the residual invertible spatio-temporal network (RISTN) for video super-resolution, which consists of a residual dense connected LSTM to capture the dependencies among frames. Fuoli et al. [23] further utilized recurrent latent space propagation (RLSP) with high-dimensional latent states to efficiently restore video frames. However, when hidden states are used to save previous information, such a strategy makes the restoration of the first few frames relatively poor, since there is no corresponding prior information.
The usage of optical flow estimation for explicit motion compensation is a prevalent approach for VSR. Tao et al. [24] proposed a sub-pixel motion compensation (SPMC) layer and performed alignment with optical flow. Wang et al. [25] came up with a multi-memory convolutional neural network (MMCNN), which jointly trains an optical flow estimation network and a multi-memory reconstruction network. Xue et al. [26] introduced TOFlow, which produces task-oriented optical flows through a Gaussian pyramid architecture. The flows are gradually refined and later warped only once through a spatial transformer network. Li et al. [27] proposed a deep dual attention network (DDAN), which applies a pyramid multi-level structure to predict optical flows and performs warping at every resolution. Yi et al. [40] utilized optical flow for simple motion compensation and proposed an ultra dense memory block (UDMB), which combines hierarchical features to improve performance. Wang et al. [41] considered that there is a gap between the LR optical flows and the HR outputs, which could result in a loss of details. Thus, they proposed super-resolved optical flows for video SR (SOF-VSR), which predicts the optical flows at high-resolution scales to reduce the displacement. Although optical flow is an effective means of motion compensation, it is challenging to obtain accurate optical flow estimation results. In addition, the computational cost and the introduction of extra noise are noteworthy problems, which may result in undesired artifacts.
To overcome the limitations of optical flow estimation, several methods provided valuable directions. Jo et al. [31] proposed to use dynamic upsampling filters (DUFs) on input center frames for reconstruction. Liu et al. [28] put forward a locally connected video super-resolution (LCVSR) approach, which also utilizes a similar tactic to perform motion compensation and later refines the features through a refinement network. Furthermore, Haris et al. [29] proposed a recurrent back-projection network (RBPN) for video super-resolution, which adopts an encoder-decoder based module to continuously fuse information between frames.
Recently, Tian et al. [30] proposed a temporally-deformable alignment network (TDAN), which takes advantage of deformable convolution [42] for implicit frame alignment and achieves splendid performance. Since the sampling point of the kernel can be shifted on the input feature maps, deformable convolution can automatically learn the effective features of the neighboring frames and offer valid information for reconstruction of the reference frame. However, using ordinary convolution to predict the required parameters may limit the relevant performance. Thus, in our work, we will utilize the deformable convolution v2 [43] for the alignment module, and further improve the performance by our proposed decreased multi-dilation convolution unit (DMDCU), which obtains more accurate results and keeps the extra computational cost relatively low.

III. LOCAL-GLOBAL FUSION NETWORKS

A. NETWORK ARCHITECTURE
In this section, we first introduce the overall architecture of our method and then describe each component in detail.
The input of our network is a sequence of low-resolution video frames {I^{LR}_{t−N}, …, I^{LR}_t, …, I^{LR}_{t+N}}, where I^{LR}_t is the reference frame while the others are the neighboring frames. For VSR, our purpose is to reconstruct a high-resolution reference frame I^{SR}_t by fully exploiting the correlations among the video frames. The framework of our approach is depicted in Fig.1 and the pseudo-code of the training phase is shown in Algorithm 1. The whole process is described as follows.
The input frames, as RGB images, are first fed into a feature extractor, which consists of a simple convolution layer and a number of residual blocks, to extract shallow features. The process can be written as:

F_i = Net_fea(I^{LR}_i), i ∈ {t−N, …, t+N}

where F_i denotes the extracted features and Net_fea is the feature extractor. The extracted neighboring features F_i are then aligned to the reference feature F_t through the alignment module, which consists of stacked deformable convolutions with decreased multi-dilation convolution units. The implicit motion compensation in the alignment module can be represented as:

F^a_i = Net_align(F_i, F_t)

where F^a_i is the aligned neighboring features. Following the alignment, the features go through two branches to fuse two different types of information. One branch is the local fusion module (LFM), which focuses on the correlation between adjacent frames so as to maintain temporal consistency. The LFM can be formulated as:

{L_{t−N}, …, L_t, …, L_{t+N}} = Net_LFM({F^a_{t−N}, …, F_t, …, F^a_{t+N}})

where {L_{t−N}, …, L_t, …, L_{t+N}} represents the locally fused features and {F^a_{t−N}, …, F_t, …, F^a_{t+N}} are the aligned features. In the meantime, the other branch, called the global fusion module (GFM), aiming at sufficiently utilizing the information of all the input frames, produces the globally fused features {G_{t−N}, …, G_t, …, G_{t+N}}, which are calculated by:

{G_{t−N}, …, G_t, …, G_{t+N}} = Net_GFM({F^a_{t−N}, …, F_t, …, F^a_{t+N}})

Afterwards, the features from both branches are concatenated and merged together through a simple 1 × 1 convolution:

F^M_i = Conv_{1×1}([L_i, G_i])

where F^M_i is the integrated output. Next, the merged features pass through a 3 × 3 convolution layer for feature fusion and channel reduction. Then, the advanced residual channel attention blocks (RCABs) are applied for high-performance SR reconstruction. A skip connection from the shallow features of the reference frame is used for residual learning.
The whole process can be defined as:

F_R = Net_rec(Conv_{3×3}([F^M_{t−N}, …, F^M_{t+N}])) + F_t

In the end, the output is sent to a pixel shuffle layer for upsampling and then added to the bilinearly upsampled LR reference frame from a skip connection to construct the final super-resolved reference frame:

I^{SR}_t = PS(F_R) + Bilinear(I^{LR}_t)

where PS denotes the pixel shuffle operation and I^{SR}_t is the restored high-resolution reference frame.

B. ALIGNMENT MODULE
We now introduce the alignment module in detail. For the purpose of fully utilizing the neighboring frames, methods based on optical flow warping can suffer from undesired artifacts and expensive computation. To overcome these shortcomings, we stack the superior modulated deformable convolution [43] to perform frame alignment and implicit motion compensation. The common convolution operation can be described as:

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k)

Here, y(p) refers to the output feature y at each position p, x is the input feature, w_k is the weight, and p_k is the location offset on the kernel sampling grid. For instance, if the convolution kernel is 3 × 3, K equals 9, and p_k ∈ {(−1, −1), (−1, 0), …, (1, 0), (1, 1)}. Essentially, modulated deformable convolution adds predictable offsets and modulation scalars to the original fixed sampling positions. As depicted in Fig.2, such a tactic can adaptively learn dynamic sampling positions on the neighboring features that need to be aligned with the reference features.
Let F^{b−1}_i denote the input neighboring features of the b-th modulated deformable convolution. The aligned output F^b_i can be calculated as:

F^b_i(p) = Σ_{k=1}^{K} w_k · F^{b−1}_i(p + p_k + Δp_{i,k}) · Δm_{i,k}

where Δp_{i,k} and Δm_{i,k} are the offset and the modulation scalar of the k-th sampling location. Bilinear interpolation is applied to compute F^{b−1}_i(p + p_k + Δp_{i,k}) in case Δp_{i,k} is fractional. Note that both parameters are learned automatically and depend on the relationship between the neighboring features F_i and the reference features F_t. The process of obtaining the deformable parameters can be expressed as:

[ΔP_i, ΔM_i] = f([F_i, F_t])

where f denotes the prediction function and [·,·] represents concatenation.
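Since the learned offsets are generally fractional, each sampling step reads the feature map at a non-integer position. The following NumPy sketch shows plain bilinear interpolation on a single-channel map; it illustrates only the interpolation step, not the authors' alignment module:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a (H, W) feature map at a fractional
    location (y, x), clamping neighbors to the map border."""
    h, w = feat.shape
    y0 = int(np.floor(y)); x0 = int(np.floor(x))
    # the four surrounding integer positions, clamped to valid indices
    ys = [min(max(y0, 0), h - 1), min(max(y0 + 1, 0), h - 1)]
    xs = [min(max(x0, 0), w - 1), min(max(x0 + 1, 0), w - 1)]
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[ys[0], xs[0]] + wx * feat[ys[0], xs[1]]
    bot = (1 - wx) * feat[ys[1], xs[0]] + wx * feat[ys[1], xs[1]]
    return (1 - wy) * top + wy * bot

f = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_sample(f, 0.5, 0.5))  # 1.5
```

A deformable convolution evaluates one such sample per kernel location k, at p + p_k plus the predicted offset, before applying the weight and modulation scalar.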
In the original version of modulated deformable convolution [43], the prediction function f simply consists of several convolution layers. However, more precise ΔP_i and ΔM_i can produce more accurate alignment results, so it is worthwhile to explore a delicate structure to replace the common convolutions. Here, we propose a decreased multi-dilation convolution unit (DMDCU) to enhance the ability of learning the deformable parameters. As shown in Fig.3, the input features first pass through a 3 × 3 convolution. Then, convolutions with different dilation rates are applied to extract multi-level features and halve the number of channels simultaneously. After that, the multi-level features are concatenated together and pass through a 1 × 1 convolution layer for feature fusion and for restoring the number of channels to that of the input.
As we stack several deformable convolutions, features in shallower layers need larger receptive fields to obtain more accurate parameters, while the deep features only require relatively small ones to achieve precise alignment. Thus, we adjust the number of multi-dilation convolutions and the dilation rates according to the position of the deformable convolution: the deeper, the fewer.
Starting from the first deformable convolution, the number of dilation convolutions in the DMDCUs of the subsequent deformable convolutions decreases by two as the network progresses. For example, in our experiments, we set the number of deformable convolutions n_dc to 4 and the number of dilation convolutions n_dil in the DMDCU of the first deformable convolution to 8. Then, n_dil in the next three deformable convolutions is 6, 4, and 2, respectively. Meanwhile, the dilation rates range from 1 to 8, 1 to 6, 1 to 4 and 1 to 2, correspondingly.
In practice, if n_dc exceeds 4, we keep n_dil in the first several deformable convolutions at 8 until the third from the last deformable convolution, and then sequentially reduce n_dil by two. In other words, the number of dilation convolutions is at most 8 to prevent excessive computation. An ablation study on the case of n_dc = 5 is shown in Section 5. Using the novel DMDCUs, we promote the ability to predict more accurate deformable sampling parameters. Moreover, the decreasing design avoids adding too many parameters and reduces redundant calculations.
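The decreasing schedule above can be written down as a small helper. The function below is our reading of the rule, not code from the authors:

```python
def dmdcu_dilation_schedule(n_dc, n_max=8):
    """Number of dilation convolutions in the DMDCU of each of the n_dc
    stacked deformable convolutions: 2 for the deepest one, growing by
    two toward the shallower ones, capped at n_max (8 in the paper)."""
    return [min(n_max, 2 * (n_dc - pos)) for pos in range(n_dc)]

print(dmdcu_dilation_schedule(4))  # [8, 6, 4, 2]
print(dmdcu_dilation_schedule(5))  # [8, 8, 6, 4, 2]
```

The n_dc = 5 case matches the description: the cap keeps the first two units at 8 before the count starts dropping by two.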

C. LOCAL FUSION MODULE
After the alignment module, details from different time periods need to be effectively fused so that the features of neighboring frames can benefit the reconstruction of the reference frame. In order to maintain temporal consistency and impose tight coherence between adjacent frames, we design a local fusion module (LFM) which consists of several local fusion blocks (LFBs). Each LFB takes the features of the current frame together with the fused output features of the former frame as input. The output has two copies: one serves as the result for the current frame, and the other is sent to the next frame as part of its input features. In this way, the current features contain information from both the previously fused features and themselves. After concatenation, they go through a 1 × 1 convolution for integration and channel reduction. Two residual blocks are used to extract features, and an element-wise addition is applied for residual learning. At last, the output L^b_t comprises the detail information of previous frames as well as temporal coherence. The LFB which handles the features of one frame can be defined as:

L^b_t = Net_LFB([L^{b−1}_t, L^b_{t−1}])

where L^{b−1}_t denotes the input features of frame t for the b-th LFB and L^b_{t−1} is the fused output of the same block for the former frame. The previous output features contain most of the information of the former frame due to the concatenation and residual connection in the LFB. Besides, the characteristics used to maintain temporal consistency are also included implicitly. Therefore, the reuse of previous output features fuses the knowledge of two adjacent frames locally and makes the information flow continuously from the former frame to the latter. The stacked LFBs constantly strengthen the connection between adjacent frames and ensure the continuity of video sequences.
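The frame-by-frame information flow in the LFM can be sketched in a few lines; here `fuse` is a stand-in for the block's concatenation, 1 × 1 convolution and residual blocks (a toy illustration of the recurrence only, not the trained module):

```python
import numpy as np

def local_fusion_pass(feats, fuse):
    """One pass of local fusion blocks over a frame sequence: each frame
    is fused with the previous frame's fused output, so information
    flows from the former frame to the latter."""
    outputs, prev = [], feats[0]
    for f in feats:
        fused = fuse(prev, f)   # stands in for concat + 1x1 conv + residual blocks
        outputs.append(fused)
        prev = fused            # second copy of the output feeds the next frame
    return outputs

# toy "fusion": average the previous fused output with the current features
feats = [np.full(3, v) for v in (0.0, 4.0, 8.0)]
out = local_fusion_pass(feats, lambda a, b: 0.5 * (a + b))
print([o[0] for o in out])  # [0.0, 2.0, 5.0]
```

The third output (5.0 rather than 8.0) shows how earlier frames keep influencing later ones through the reused fused features.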

D. GLOBAL FUSION MODULE
Although the LFM solves the problem of feature combination from a local perspective, the latter features cannot be fully combined with the former features, which wastes potential spatial characteristics. Therefore, we present a global fusion module (GFM), which consists of stacked global fusion blocks (GFBs). The GFB exquisitely exchanges features among different video frames and fuses them together by taking advantage of a video shuffle strategy. This not only fills the gap left by the LFM, but also allows information in frames with long temporal intervals to be fully covered. Video shuffle (VS), first introduced by Ma et al. [32] for exploring video representations, is essentially a channel switching tactic. As depicted in Fig.6, VS is a bijective transformation. The features of each frame are divided equally into the same number of groups as that of the input frames. For example, if there are T input frames, from each of which a feature map of C channels is extracted, then each feature map will be divided into T groups, each consisting of C/T channels. Note that we use a 3 × 3 convolution layer in the GFB to make sure the output channel count is divisible by T before we perform the grouping. After that, the groups with the same group id are rearranged to form new feature maps, each of which consists of a certain amount of information belonging to various periods. The inverse process moves the groups that belong to the same frame back to the original feature maps.
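The video shuffle transform amounts to a reshape and transpose over the (frame, group) axes. The NumPy sketch below follows the grouping described above and is self-inverse, matching the bijectivity of VS (an illustration, not the authors' code):

```python
import numpy as np

def video_shuffle(feats):
    """Video shuffle over features of shape (T, C, H, W), C divisible
    by T: split each frame's channels into T groups and regather groups
    with the same id across frames into new feature maps."""
    t, c, h, w = feats.shape
    assert c % t == 0
    g = c // t
    x = feats.reshape(t, t, g, h, w)   # (frame, group, sub-channel, H, W)
    x = x.transpose(1, 0, 2, 3, 4)     # swap the frame and group axes
    return x.reshape(t, c, h, w)

x = np.random.rand(4, 8, 2, 2)        # T = 4 frames, C = 8 channels
y = video_shuffle(x)
print(np.allclose(video_shuffle(y), x))  # True: the transform is self-inverse
```

After the shuffle, the feature map at index 0 holds group 0 of every input frame, so a convolution applied to it sees information from all time steps at once.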
In order to take advantage of features from multiple frames and fuse the abundant features comprehensively, we enhance the capability of VS by using multi-kernel convolution. Our proposed GFB is shown in Fig.7. After the forward shuffle, each feature map passes through a 1×1 convolution layer to reduce the channels. Then, convolutions with various kernel sizes are responsible for exploring multi-level features. Note that to reduce parameters and promote efficiency, we utilize two 3 × 3 convolutions to implement an equivalent 5 × 5 convolution. Afterwards, the features are concatenated together and sent into a 3 × 3 convolution for feature fusion. The outputs of the inverted video shuffle are sent into another 3 × 3 convolution followed by a residual skip connection. Compared to the common VS strategy, the application of multi-kernel convolution can discover more abundant details and integrate them effectively from diverse periods. Within each GFB, the exchange of features from various frames fuses plentiful information together. Moreover, correlations among frames are learned implicitly. The whole GFM ensures that the spatial characteristics in every input frame can be fully utilized by the others, preventing the loss of important details.
After the two feature fusion branches, we obtain two kinds of features for the consecutive frames. The features from the LFM maintain high temporal consistency, while those from the GFM combine overall characteristics. As shown in Fig.1, for the two kinds of features which represent each frame, we merge them together through a simple 1 × 1 convolution, which focuses on the relationship between the diverse patterns.

E. RECONSTRUCTION MODULE
In this module, the fused features are aggregated together and pass through advanced reconstruction blocks to restore a high-quality reference frame. Here, we utilize the sophisticated residual channel attention blocks (RCABs). Note that they can be replaced by more effective blocks if necessary.
The channel attention mechanism essentially obtains a set of weight coefficients for the various channels and multiplies them with the feature maps, so that the network can place different emphases on the features and enhance its representation learning ability. Details about the channel attention mechanism for SR reconstruction can be found in [39]. In the end, the restored features pass through a pixel shuffle layer to produce a clear high-resolution reference frame in RGB format.
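The gating performed by channel attention can be sketched as follows, with the two bottleneck layers written as plain matrices w1 and w2 for clarity. This is a simplified NumPy illustration of the RCAN-style mechanism, not the actual implementation (which uses 1 × 1 convolutions on batched tensors):

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """RCAN-style channel attention on a (C, H, W) feature map:
    squeeze by global average pooling, a two-layer bottleneck with
    ReLU, then per-channel sigmoid gating of the feature maps."""
    s = feat.mean(axis=(1, 2))               # squeeze: one statistic per channel
    z = np.maximum(w1 @ s, 0.0)              # bottleneck with ReLU, (C/r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))      # per-channel weights in (0, 1)
    return feat * a[:, None, None]

c, r = 16, 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((c, 6, 6))
w1 = rng.standard_normal((c // r, c)) * 0.1  # toy bottleneck weights
w2 = rng.standard_normal((c, c // r)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)  # (16, 6, 6)
```

Every spatial position within a channel is scaled by the same learned weight, which is what lets the network emphasize informative channels as a whole.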

IV. EXPERIMENTS

A. DATASET AND TRAINING DETAILS

1) DATASET
Since there is no unified training dataset for VSR, we adopt the widely used Vimeo-90K dataset collected by Xue et al. [26] for network training. It includes 64612 training samples with a fixed resolution of 448 × 256, which contain manifold and intricate motions and scenes. To generate the LR input, the MATLAB imresize function, which blurs the input with cubic filters and then downsamples it through bicubic interpolation, is applied for 4× downscaling.

2) TRAINING DETAILS
Throughout the whole network, the filter number is set to 64 and the kernel size is set to 3 × 3 for the convolutions if not specified. The shallow feature extractor applies five residual blocks. Four deformable convolutions are utilized for frame alignment. The numbers of dilation convolutions in the four DMDCUs are 8, 6, 4 and 2, respectively. Both the LFM and the GFM have 10 internal blocks for feature fusion. At the end, 20 RCABs are used for SR reconstruction.
Each training sample in Vimeo-90K contains seven frames. Thus, our full model takes seven continuous frames as inputs, which are cropped randomly into fixed-size patches during the end-to-end training process. The learning rate is set to 10^−4 for the first 50 epochs, and declines by half every 20 epochs. Besides, the Adam optimizer [44] is chosen with β_1 = 0.9, β_2 = 0.999, and ε = 10^−8. We adopt the simple L_1 loss to minimize the distance between the restored reference frame and the ground-truth HR frame. After convergence, we fine-tune the model slightly with the L_2 loss for better performance. All experiments were conducted on two NVIDIA RTX 2080Ti GPUs using PyTorch 1.0.1 [45].
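The learning-rate schedule described above can be written as a small helper; the exact behavior at the epoch-50 boundary is our reading of the text, not verified against the authors' code:

```python
def learning_rate(epoch, base_lr=1e-4):
    """Keep base_lr for the first 50 epochs, then halve every 20 epochs."""
    if epoch < 50:
        return base_lr
    return base_lr * 0.5 ** ((epoch - 50) // 20 + 1)
```

For example, epochs 0-49 use 1e-4, epochs 50-69 use 5e-5, and epochs 70-89 use 2.5e-5.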

B. COMPARISON TO STATE-OF-THE-ART METHODS
Several state-of-the-art methods, including RCAN [39], TOFlow [26], FRVSR [20], DUF [31], RBPN [29] and TDAN [30], are compared with our proposed LGFN. We either directly use the results they provided or download their pretrained models to predict the SR frames. All quantitative results are evaluated on the Y channel (i.e., luminance) of the transformed YCbCr space with both the PSNR and the SSIM [46] metrics. Note that the first and last two frames are discarded for a stable comparison. In order to prevent severe boundary effects, we crop eight pixels near the image boundary only for the DUF results, as suggested in [29].

TABLE 1. Quantitative results of state-of-the-art SR algorithms on Vid4 for 4×. Bold type indicates the best and underline indicates the second best performance (PSNR/SSIM). During inference, the first and last frames do not participate. Only DUF needs to crop borders, and eight pixels near the image boundary are eliminated.
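The Y-channel PSNR protocol can be sketched as follows, using the standard ITU-R BT.601 RGB-to-Y conversion commonly adopted in SR evaluation. This is a sketch of the common protocol; we have not verified the exact constants used in every compared method's evaluation code:

```python
import numpy as np

def rgb_to_y(img):
    """Luminance channel (ITU-R BT.601) for RGB inputs in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(ref, test, peak=255.0):
    """PSNR in dB between two same-sized arrays."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8, 3))              # toy "ground truth"
b = np.full((8, 8, 3), 10.0)         # toy "restoration", off by 10 per channel
print(round(psnr(rgb_to_y(a), rgb_to_y(b)), 2))  # 29.45
```

Evaluating on Y rather than RGB focuses the metric on luminance, where the human visual system is most sensitive and where SR differences are most visible.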
Experiments are conducted on three public testing datasets: Vid4 [47], SPMCS [24] and Vimeo-90K-T [26]. The average flow magnitude (pixel/frame) is presented following [29]. The commonly reported Vid4 consists of four video sequences: calendar, city, foliage and walk. Note that its inter-frame motion is restricted and some artifacts exist in a few regions of the ground truth. On the other hand, SPMCS contains more varied and clean scenes with abundant texture information, which places higher requirements on the models. In addition, Vimeo-90K-T, with a large number of video frames, comprises manifold circumstances and objects. Various light transformations and multiple motion types pose great challenges to restoration ability, which allows us to distinguish the strengths and weaknesses of different methods. Table 1 presents the results on Vid4. Our model attains competitive performance and achieves the best PSNR on the calendar and walk clips. More visual comparisons are depicted in Fig.8. It can be seen that our method restores the dates in the calendar without deformation, while the others exhibit blur or undesired artifacts. Moreover, LGFN portrays the correct white boundary of the building in the city clip, while the others reduce its clarity.
Since SPMCS has 32 testing sequences, we show the quantitative outcomes in Table 2 and Table 3, respectively. The former gives the average results over all clips, while the latter corresponds to several individual clips. Apparently, our method achieves the best performance and surpasses both the SISR and the VSR methods. The abundant texture information in the dataset requires the network to learn precise high-frequency features, which confirms that our approach possesses such a capability. Qualitative comparisons are shown in Fig.9. In the first row, the areas with dense textures are effectively restored by our LGFN. Meanwhile, the distant and small street lamp in the fifth row clearly separates the methods with stronger restoration abilities. Our approach is the only one that produces intact details and the correct pattern, while the others suffer from varying degrees of indistinctness.
For a more comprehensive comparison, we divided Vimeo-90K-T into three subsets according to the motion velocities, following [29]. Quantitative results are shown in Table 4. As the motion velocity increases, LGFN surpasses RBPN by 0.16 dB, 0.25 dB and 0.34 dB respectively, which indicates that our model can handle video frames under different situations. The large improvement at fast motion velocity especially reflects the superiority of our model, which takes full advantage of temporal information. On the contrary, the others, with limited representation ability, exhibit poorer outcomes. Visual evaluations are presented in Fig.10. When the color of the texture is similar to the surrounding environment, as in row one, our model is the only one which can restore a pattern similar to the ground truth. The other examples give evidence that LGFN can recover detail information accurately in various scenarios.

C. SUPER-RESOLVING REAL-WORLD VIDEO SEQUENCES
Since most existing approaches are trained on simulated datasets, it is challenging to produce satisfactory restorations of real-world video sequences, which contain camera shake and complex image formation processes. To make the results of LGFN more convincing, we conduct experiments on the real-world dataset provided by Tao et al. [24] for evaluation. The dataset includes 30 video clips obtained with both an SLR camera and a mobile phone. Quantitative comparisons are shown in Table 5. Here, we utilize two reference-free image quality assessment indices, NIQE [48] and SSEQ [49], to measure the performance of the methods; for both indices, lower values indicate better perceptual quality. It can be noticed that our LGFN achieves the best performance under NIQE, which further demonstrates the effectiveness of our model. Moreover, visual results are presented in Fig.11. Our model restores not only the correct text information but also the detailed textures, while other methods suffer from blur or distortion.

D. MODEL SIZE AND RUNNING TIME ANALYSES
We compare the model size and running time of several methods. All pretrained models are downloaded from the official public repositories. We conduct all testing experiments on a machine with a 3.6 GHz Intel i7 CPU (16 GB RAM) and a single NVIDIA RTX 2080 Ti GPU. In addition, the corresponding dependencies are strictly followed. The input frames for testing are of size 112 × 64. As shown in Table 6, our model achieves the best result while keeping the model size at a moderate level. In the meantime, LGFN runs faster than the optical-flow based approach TOFlow and the dynamic kernel prediction method DUF, which verifies the efficiency of our network. The application of dilation convolutions in DMDCU is somewhat time consuming, which explains why RBPN (without dilation convolutions) runs slightly faster even though it has more parameters. However, it is worthwhile to sacrifice a small amount of time to improve performance significantly.
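Running-time comparisons of this kind are usually averaged over repeated forward passes after a few warm-up calls. A generic timing harness sketch (framework-agnostic; the workload below is a stand-in, not the actual model):

```python
import time

def average_runtime(fn, n_warmup=3, n_runs=10):
    """Average wall-clock time of fn() over n_runs, after n_warmup warm-up calls.

    For GPU models, a device synchronization (e.g. torch.cuda.synchronize())
    must precede each timestamp so that queued kernels are fully counted.
    """
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

# usage with a stand-in CPU workload
elapsed = average_runtime(lambda: sum(range(10000)))
print(elapsed >= 0.0)  # → True
```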

V. ABLATION STUDY
In order to explore the exact contribution of each module in our framework, we perform more experiments to demonstrate their effectiveness. We first remove each module individually to observe how much influence these components exert on the network. The results are presented in Table 7. It can be observed that eliminating the alignment module, including the deformable convolutions and the DMDCUs, damages the restoration ability significantly. Excessive differences between frames make it difficult for the other parts of the network to learn effective information; the model can only utilize very limited features and may even introduce inaccurate artifacts from other frames. Moreover, deleting the fusion module also reduces performance. This proves that the two fusion strategies can take advantage of abundant features locally and globally, which indicates the advantage of our proposed method. As for the reconstruction module, ordinary residual blocks replace RCABs to recover video frames. The evaluation metrics decrease only slightly compared to removing the other components, which means the improvement in performance is mainly contributed by the other two modules rather than the channel attention mechanism.
Since the alignment module plays an indispensable role, we further investigate the functionality from the following two aspects: one is the decreased multi-dilation convolution unit (DMDCU), and the other is the deformable convolution (DC).
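The core idea of a DC is that each kernel tap reads the feature map at a learned fractional offset rather than at its regular grid position, with bilinear interpolation for sub-pixel locations. A toy NumPy sketch of that sampling step (a single tap on a single 2-D map, assuming in-bounds coordinates; not the full batched operator):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional (y, x).

    Assumes (y, x) lies strictly inside the map so all four neighbors exist.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x0 + 1]
            + wy * (1 - wx) * feat[y0 + 1, x0]
            + wy * wx * feat[y0 + 1, x0 + 1])

def deformable_tap(feat, py, px, dy, dx):
    """One deformable-convolution tap: sample at the regular grid position
    (py, px) shifted by the predicted offset (dy, dx)."""
    return bilinear_sample(feat, py + dy, px + dx)

feat = np.arange(16, dtype=float).reshape(4, 4)
# halfway between feat[1, 2] = 6 and feat[2, 2] = 10
print(deformable_tap(feat, 1, 2, 0.5, 0.0))  # → 8.0
```

In the alignment module, the role of the DMDCU is precisely to predict these per-position offsets (dy, dx) from the concatenated reference and neighboring features.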
We first explore the impact of DMDCU by altering the number of dilation convolutions in each unit. The results are shown in Table 8. Following [30], we set the number of DCs to four. The first column represents the number of dilation convolutions in each DMDCU. For example, [8,6,4,2] indicates that the alignment module has four DCs and the number of dilation convolutions in each DMDCU is 8, 6, 4 and 2 respectively, from the shallow to the deep DC. Accordingly, the dilation rates range from 1 to 8, 1 to 6, 1 to 4 and 1 to 2. A value of 0 means that in that DC we apply an ordinary convolution rather than the DMDCU to predict the parameters. Comparing the [8,6,4,2] and [8,8,8,8] groups, using more dilation convolutions improves performance only slightly. The decreased strategy avoids many redundant calculations and thus reduces the cost of space and time. Meanwhile, applying the DMDCU in the first two DCs further improves performance, whereas removing DMDCUs lowers the restoration quality.
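The intuition behind the decreased schedule can be checked with simple receptive-field arithmetic: a stride-1 3 × 3 convolution with dilation d enlarges the receptive field by 2d, so a stack with rates 1 through 8 already covers a 73-pixel extent, and later units need far less reach once earlier DCs have compensated most of the motion. A back-of-the-envelope sketch:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Each kernel x kernel layer with dilation d adds (kernel - 1) * d pixels.
    """
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field(range(1, 9)))  # rates 1..8 (shallowest DC) → 73
print(receptive_field(range(1, 3)))  # rates 1..2 (deepest DC)    → 7
```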
Moreover, we adjust the number of DCs to see whether the performance fluctuates. As shown in Table 9, we gradually increase the number of DCs, along with the corresponding DMDCUs. It can be observed that the restoration ability improves more significantly with the increase of DCs, compared to the impact of the number of dilation convolutions in each DMDCU shown in Table 8. This demonstrates the pivotal role of the alignment module, which enables the features of the neighboring frames to be effectively utilized by the following modules. Besides, the performance improves more prominently when the first two DCs are added, which indicates that the shallow features need to be aligned more accurately. After we increase the number of DCs to 5, the PSNR improves by only 0.03 dB, which means four DCs are a reasonable trade-off between performance and computational cost.
Next, more experiments are conducted on the two-branch structure to reveal the effect of combining the two fusion strategies. We keep only one branch at a time to study the impact of a single branch on the entire network. For a fairer comparison, we additionally compensate the parameter number to be close to that of the complete model. The results are presented in Table 10. Notice that applying a single fusion strategy cannot bring more benefits to the model. In the meantime, simply adding more blocks improves performance only slightly but introduces an unjustified increase in computational cost. Therefore, our framework takes advantage of both feature fusion tactics and unites them for a better reconstruction output. Compared with a single fusion branch, the integrated method improves PSNR by at least 0.11 dB.
In order to explain the different concerns of the two branches more intuitively, we visualize the feature maps of the reference frames after the different fusion strategies, as well as the integration results of the two branches. As depicted in Fig.12, the local fusion features supplement detailed textures while the global fusion features describe the overall outline structure. The integrated features of the two branches, obtained with a 1 × 1 convolution, present clear and accurate high-frequency information. The reason for this phenomenon is as follows. The local fusion module combines features of adjacent frames and maintains temporal consistency, which makes the branch pay more attention to the differences between high-frequency regions of adjacent frames. In the meantime, since the global fusion branch exchanges information globally through video shuffle, abundant features from all input frames are effectively utilized. In the end, the two kinds of fusion features are integrated, and the outputs show excellent results containing both clear textures and outlines.
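Integrating the two branches with a 1 × 1 convolution amounts to a learned per-pixel linear mix over the concatenated channels. A minimal NumPy sketch, with illustrative (not the paper's actual) channel counts:

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution: feat has shape (C_in, H, W), weight (C_out, C_in)."""
    c_in, h, w = feat.shape
    return (weight @ feat.reshape(c_in, -1)).reshape(-1, h, w)

local_feat = np.ones((8, 4, 4))     # stand-in local-fusion features
global_feat = np.zeros((8, 4, 4))   # stand-in global-fusion features
weight = np.random.randn(8, 16)     # learned jointly with the network in practice
fused = conv1x1(np.concatenate([local_feat, global_feat], axis=0), weight)
print(fused.shape)  # → (8, 4, 4)
```

Because the mix is learned, the network itself decides, per output channel, how much texture detail from the local branch and how much outline structure from the global branch to retain.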
Finally, we study the impact of different numbers of input frames on restoration. This factor determines how much initial information is provided. Since the VSR task focuses on the effective usage of spatial and temporal information, fewer input frames will undoubtedly weaken the capability of the network. As shown in Table 11, three, five and seven LR frames are used as inputs to train three models respectively. We can see that the evaluation metrics improve with an increasing number of input frames, as more abundant features can be explored.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose a novel local-global fusion network (LGFN) for high-quality video super-resolution. To handle implicit motion compensation for consecutive input frames, LGFN adopts improved deformable convolutions with decreased multi-dilation convolution units for effective frame alignment. A structure with two branches is introduced for local and global feature fusion. The local fusion module focuses on combining adjacent features and maintaining temporal consistency, while the global fusion module takes full advantage of all input frames. Extensive experimental results on benchmark datasets demonstrate that our model achieves competitive performance with state-of-the-art methods and restores a variety of video frames efficiently.
Although the improved deformable convolutions with decreased multi-dilation convolution units perform implicit alignment well, aliasing effects may still exist in the aligned neighboring features, which means the alignment may be incorrect in some cases. One possible reason is that the alignment process is performed at the low-resolution scale, so less information can be utilized, which adversely impacts alignment. Besides, there is no reasonable supervision approach to guarantee alignment accuracy. To overcome these drawbacks, a valuable direction is to apply an attention mechanism to discover the areas that have not been aligned precisely; the misaligned positions could then be corrected appropriately through a suppression operation. Spatial attention and its variants are potential choices. Moreover, other prior knowledge can be introduced to enhance the alignment ability. For example, segmentation maps or depth information can be used to guide the alignment process to obtain more accurate results. Exploring these possible solutions would be of great interest to us.