Grouped Spatio-Temporal Alignment Network for Video Super-Resolution

Video super-resolution based on CNNs has recently achieved significant progress. Existing popular super-resolution methods usually rely on optical flow between neighboring frames for temporal alignment. However, accurate optical flow is hard to estimate and computationally expensive. To address this problem, we propose a novel grouped spatio-temporal alignment network (GSTAN) that effectively incorporates spatio-temporal information in a hierarchical way. The input sequence is divided into several groups corresponding to different frame rates. These groups provide complementary information, which helps restore missing textures for the reference frame. Specifically, each group employs deformable 3D convolution to incorporate spatio-temporal information, which avoids artifacts from explicit motion estimation. In addition, a Gated-Dconv information filter is proposed to control information flow and focus on fine details. Finally, the complementary information from these groups is integrated by an inter-group fusion module. Extensive experiments demonstrate that our method achieves state-of-the-art SR performance on several benchmark datasets.


I. INTRODUCTION
The goal of video super-resolution (VSR) is to recover high-resolution (HR) video frames from low-resolution (LR) ones. This technique has great value in many applications such as satellite imagery [1] and video surveillance [2]. Compared to single-image super-resolution [3], [4], [5], [6], video super-resolution poses the additional challenge of fully exploiting spatio-temporal dependencies, since a video sequence provides extra temporal information.
Previous VSR methods usually use the optical flow between a reference frame and its neighboring frames to align consecutive frames. The quality of optical flow estimation therefore directly affects the restoration quality. For example, Wang et al. [7] proposed an optical flow reconstruction network to learn temporal details between consecutive frames through optical flow estimation. Chan et al. [8] proposed a bidirectional propagation network based on optical flow alignment to aggregate information in the spatio-temporal dimension. These methods rely on optical flow estimation for alignment, which can introduce artifacts and greatly affect restoration quality.
To avoid such problems, [9], [10], [11] implicitly integrated motion information among low-resolution frames without computing optical flow. However, these methods ignore the complementary information between video frames and cannot fully exploit the spatio-temporal information. In this work, we divide the input sequence of 2N + 1 frames into N groups based on decoupled motion. Instead of directly aligning the 2N neighboring frames to the reference frame with optical flow, we exploit the complementary information between these groups to restore missing textures for the reference frame. With the powerful modeling capability of 3D deformable convolution [12], the spatio-temporal information of each group is jointly integrated. To further refine the information flow for efficient restoration, we propose a Gated-Dconv information filter that focuses on fine details at the end of the spatio-temporal alignment module. Finally, the inter-group features, which provide different complementary information, are integrated with temporal attention and deeply fused with a bottleneck layer of 2D convolution units to produce a high-resolution reference frame. Overall, our method follows a hierarchical design that adaptively integrates information from groups at different frame rates.
In summary, our contributions in this letter are as follows: 1) We propose a novel grouped spatio-temporal alignment network (GSTAN), which makes use of the complementary information of frame-rate-aware groups and fully exploits spatio-temporal information using deformable 3D convolution. 2) We propose a Gated-Dconv information filter to control feature transformation. It suppresses less informative features and transmits useful information for effective super-resolution. 3) Our network achieves excellent performance on commonly used benchmark datasets with high computational efficiency.

II. PROPOSED METHOD

A. Frame Feature Extraction
Given 2N + 1 consecutive low-resolution video frames consisting of one reference frame I^L_t and 2N neighboring frames, the overall framework is shown in Fig. 1. Specifically, the input sequence {I^L_1, I^L_2, . . . , I^L_7} is first fed to the network. We use a 3 × 3 convolution layer to extract the shallow features {F^s_t}_{t=1}^{7} of the input sequence. These shallow features lack long-range spatial information due to the locality of the naive convolutional layer, which may lead to poor intra-group spatio-temporal alignment. The residual Swin Transformer block (RSTB) proposed by Liang et al. [13] is powerful at modeling long-range dependencies for image restoration, as shown in Fig. 2. We therefore use two RSTBs to further refine the shallow features into {F^D_t}_{t=1}^{7}.
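The per-frame extraction step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the channel width (64) is an assumption, and plain residual conv blocks stand in for the RSTBs of [13], whose full Swin-attention structure is beyond this sketch.

```python
import torch
import torch.nn as nn

class ShallowExtractor(nn.Module):
    """Sketch of frame feature extraction: a 3x3 conv produces shallow
    features F_s per frame; two refinement blocks produce F_D. Residual
    conv blocks are illustrative stand-ins for the RSTBs of [13]."""
    def __init__(self, channels=64):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)
        self.refine = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(2)])

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        x = self.shallow(frames.view(b * t, c, h, w))
        for block in self.refine:
            x = x + block(x)  # residual refinement (RSTB stand-in)
        return x.view(b, t, -1, h, w)

feats = ShallowExtractor()(torch.randn(1, 7, 3, 32, 32))
print(feats.shape)  # torch.Size([1, 7, 64, 32, 32])
```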

B. Spatio-Temporal Alignment
In contrast to previous work [9], [11], [14], the feature maps are used directly for spatio-temporal alignment. Considering that the contributions of spatio-temporal alignment at different temporal distances are not equal, we divide the feature maps into groups at different frame rates. For each group, we integrate spatio-temporal information with residual deformable 3D convolution (ResD3D) blocks. It has long been demonstrated that deformable 3D convolution has a large spatial receptive field [15] and therefore models spatio-temporal information efficiently. In our work, we use five ResD3D blocks to model the group-wise features F^g_n, as shown in Fig. 3(a).
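The frame-rate-aware grouping can be made concrete with a small helper. This is an illustrative sketch (the function name and 1-based indexing are ours): each group is centered on the reference frame and samples neighbors at an increasing temporal stride, which for a 7-frame window reproduces the {345, 246, 147} grouping used later in the ablation study.

```python
def group_by_frame_rate(num_frames=7):
    """Split a (2N+1)-frame window into N groups of stride 1..N,
    each centered on the reference frame (1-based indices)."""
    assert num_frames % 2 == 1, "window length must be odd (2N + 1)"
    n = num_frames // 2          # N neighbors on each side
    ref = n + 1                  # index of the reference frame
    groups = []
    for stride in range(1, n + 1):
        # reference frame plus its two neighbors at the given stride
        groups.append([ref - stride, ref, ref + stride])
    return groups

# For a 7-frame input this yields the paper's {345, 246, 147} grouping.
print(group_by_frame_rate(7))  # [[3, 4, 5], [2, 4, 6], [1, 4, 7]]
```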
To improve representation learning, we propose a Gated-Dconv information filter at the end of the spatio-temporal alignment module, as shown in Fig. 3(b). It consists of two parts: 1) a gating mechanism and 2) depth-wise convolutions. The gating mechanism can be regarded as the element-wise product of two parallel information branches, one of which is activated with the GELU non-linearity [16]. To learn local video structure for effective restoration, we use depth-wise convolution to encode information from spatially neighboring pixel positions. Given a tensor X ∈ R^{Ĥ×Ŵ×Ĉ}, the process is formulated as

X̂ = φ(W_pd(LN(X))) ⊙ W_pd(LN(X)),

where W_pd represents a 1 × 1 point-wise convolution followed by a 3 × 3 depth-wise convolution, ⊙ denotes element-wise multiplication, LN is layer normalization [17], and φ represents the GELU non-linearity. Overall, the module controls information flow to focus on the fine details complementary to other groups. A discussion of this Gated-Dconv information filter is given in the ablation study.
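A minimal sketch of this gating structure is given below. The channel width, the residual connection, and the use of GroupNorm(1, C) as a per-sample layer-normalization stand-in are our assumptions, not the paper's specification; the two parallel point-wise + depth-wise branches with a GELU-activated gate follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDconvFilter(nn.Module):
    """Sketch of the Gated-Dconv information filter: two parallel
    point-wise (1x1) + depth-wise (3x3) conv branches; the GELU-activated
    branch gates the linear branch by element-wise multiplication."""
    def __init__(self, channels=64):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm-like, per sample
        self.pw1 = nn.Conv2d(channels, channels, 1)
        self.dw1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw2 = nn.Conv2d(channels, channels, 1)
        self.dw2 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):
        y = self.norm(x)
        gate = F.gelu(self.dw1(self.pw1(y)))   # activated branch
        value = self.dw2(self.pw2(y))          # linear branch
        return x + gate * value                # gated information flow

x = torch.randn(1, 64, 32, 32)
out = GatedDconvFilter(64)(x)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Setting `groups=channels` in the 3 × 3 convolutions makes them depth-wise, so spatial context is encoded per channel at low cost.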

C. Inter-group Fusion
To integrate features from these groups, we propose an inter-group fusion module with temporal attention [18]. As the previous section illustrates, the input sequence is divided into groups that usually contain complementary information. This design rests on two facts: 1) neighboring frames at slow frame rates are more similar to the reference frame and thus carry richer information for SR recovery, and 2) neighboring frames at fast frame rates contain particular information that the reference frame is missing. Hence, this module serves as guidance to integrate features from these groups.
At the end of the spatio-temporal alignment module, the group-wise feature F^g_n has a channel dimension of 64. The corresponding single-channel feature map F^c_n is computed by applying a 3 × 3 convolution layer and a ReLU activation. These maps are then concatenated, and a softmax function is applied across groups at each spatial position (x, y) to obtain the attention maps M_n(x, y). The attention-weighted feature F̃^g_n is calculated by element-wise multiplication of M_n(x, y) and F^g_n.
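The temporal-attention step described above can be sketched as follows. The projection layer (and the conv-then-ReLU ordering) is an illustrative assumption; the softmax across groups at each spatial position follows the text.

```python
import torch
import torch.nn as nn

def temporal_attention(group_feats):
    """Sketch of inter-group temporal attention: each group feature
    (B, C, H, W) is projected to a single-channel map, softmax is taken
    across groups at every spatial position, and the resulting maps
    reweight the group features."""
    proj = nn.Sequential(
        nn.Conv2d(group_feats[0].shape[1], 1, 3, padding=1),
        nn.ReLU())
    maps = torch.cat([proj(f) for f in group_feats], dim=1)  # (B, N, H, W)
    attn = torch.softmax(maps, dim=1)                        # across groups
    return [f * attn[:, i:i + 1] for i, f in enumerate(group_feats)]

feats = [torch.randn(1, 64, 16, 16) for _ in range(3)]
weighted = temporal_attention(feats)
print(len(weighted), weighted[0].shape)  # 3 torch.Size([1, 64, 16, 16])
```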
The inter-group fusion module aims to integrate the complementary information of these groups and produce a high-resolution version of the residual map. We first concatenate the features F̃^g_n to aggregate them along the temporal axis. A bottleneck layer consisting of 2D convolution units is employed to reduce channels and compress feature dimensions. Finally, as in most super-resolution methods, the fused feature is upsampled with a sub-pixel layer for SR reconstruction.
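The fusion head can be sketched as below. Channel widths and bottleneck depth are illustrative assumptions; the concatenate-compress-upsample flow and the sub-pixel (PixelShuffle) layer follow the description above.

```python
import torch
import torch.nn as nn

class InterGroupFusion(nn.Module):
    """Sketch of the inter-group fusion head: concatenate the
    attention-weighted group features, compress them with a 2D-conv
    bottleneck, and upsample with a sub-pixel (PixelShuffle) layer."""
    def __init__(self, num_groups=3, channels=64, scale=4):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(num_groups * channels, channels, 1),  # reduce channels
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))
        # sub-pixel upsampling: 3 * scale^2 channels -> PixelShuffle(scale)
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, group_feats):
        fused = self.bottleneck(torch.cat(group_feats, dim=1))
        return self.upsample(fused)  # (B, 3, scale*H, scale*W) residual map

feats = [torch.randn(1, 64, 16, 16) for _ in range(3)]
sr = InterGroupFusion()(feats)
print(sr.shape)  # torch.Size([1, 3, 64, 64])
```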

III. EXPERIMENTS

A. Implementation Details
Similar to [11], [19], [20], we employ Vimeo-90K [21], containing 64612 training samples with spatial resolution 448 × 256, as the training set. The corresponding LR sequences with spatial resolution 112 × 64 are obtained by bicubic downsampling with a factor of 4. We augment the training data by random rotation and flipping [22]. To evaluate our model, we employ the Vimeo-90K-T [21], Vid4 [23], and SPMC [23] sets for performance evaluation. We use PSNR and SSIM as quantitative metrics, which are widely used for video super-resolution.
We implement all experiments in PyTorch with an Nvidia RTX 3090 GPU. The networks are optimized by the Adam optimizer [24] (β1 = 0.9, β2 = 0.999) with a batch size of 32 and supervised by the Charbonnier loss (ε is set to 10^-3). The initial learning rate is set to 4 × 10^-4 and halved every 20 epochs. We train the network for 200 epochs.

1) Input Sequence Length:
To demonstrate the importance of input sequence length, we fix the batch size to 32. The results of our network fed with input sequences of different lengths (3, 5, and 7) are shown in Fig. 4. The comparison shows that training with longer sequences is preferable (e.g., L = 7 yields a clear and smooth pattern), because longer sequences allow the network to employ long-term information more effectively. When the input sequence length is 7, the sequence is divided into three groups, and the reference frame can be recovered using the complementary information from the other two groups. When the input sequence length is 3 or 5, less complementary information is available. However, with a fixed computational overhead, increasing the input sequence length during training inevitably decreases the batch size, and training with a small batch size causes more unstable gradients. Under the constraint B × L = 224, we adjust L (input sequence length) to 9 and B (batch size) to 24. As shown in Fig. 4, restoration becomes worse, which indicates that further lengthening the sequence is not beneficial once a computational constraint is imposed. Therefore, we train the network with a batch size of 32 and an input sequence length of 7.
2) Intra-Group Spatio-Temporal Alignment: To verify the performance improvement introduced by the Gated-Dconv information filter, we compare the performance before and after adding the module. In addition, the results with different numbers of residual D3D blocks are shown in Fig. 5. The PSNR and SSIM values of the network with this module are increased by 0.09 dB and 0.003 on average, and the best performance is achieved with five residual D3D blocks. This illustrates that the module controls information flow to focus on fine details, which improves representation learning.
3) Inter-Group Feature Fusion: We experiment with different ways of organizing the input sequence. First, we feed the input sequence directly into the network without grouping. In addition, we experiment with other grouping schemes: {142, 345, 647} and {123, 345, 567}. As shown in Table I, the no-grouping method has the worst performance, which illustrates that integrating temporal information hierarchically is more effective. In the {123, 345, 567} scheme, not every group contains the reference frame, and its performance is inferior to the other two grouping schemes. Compared to {142, 345, 647}, which includes arbitrary neighboring frames, our grouping method {345, 246, 147} improves PSNR by 0.04 dB, which is attributed to the effectiveness of frame-rate-based grouping in employing temporal information.
The 4× quantitative reconstruction results are listed in Table II. The results on various datasets show that our method achieves the highest evaluation scores among all compared methods. Compared with the second-best approach, our method improves the average PSNR/SSIM by 0.33 dB/0.011. Fig. 6 shows qualitative results achieved by different methods. It can be observed that our method recovers finer details (e.g., the sharp edges of building window lines), showing that it fully incorporates spatio-temporal information in a hierarchical way. We also provide a comparison of computational efficiency in Table III. Compared with RCAN, our GSTAN achieves improvements in both performance and computational efficiency. The FLOPs of GSTAN are 74.3% of those of the best explicit alignment-based method BasicVSR [8] and 22.7% of those of VRT [27], which is based on the prevalent Transformer architecture. Furthermore, the inference speed of our method is slightly faster than that of BasicVSR and about four times faster than that of VRT.

IV. CONCLUSION
In this work, we proposed a grouped spatio-temporal alignment network (GSTAN) for video SR. Our network exploits spatio-temporal information in a novel hierarchical way. The proposed method maintains temporal consistency and reconstructs high-quality HR frames. Extensive experiments on several benchmark datasets demonstrate state-of-the-art SR performance. In the future, we will further explore handling general real-world degraded videos with our approach for practical applications.