SRFormer: Efficient Yet Powerful Transformer Network for Single Image Super Resolution

Recent breakthroughs in single image super resolution have investigated the potential of deep Convolutional Neural Networks (CNNs) to improve performance. However, CNN-based models suffer from limited receptive fields and an inability to adapt to the input content. Recently, Transformer-based models were presented, which demonstrated major performance gains in Natural Language Processing and Vision tasks while mitigating the drawbacks of CNNs. Nevertheless, the computational complexity of Transformers can increase quadratically for high-resolution images, and converting images to a 1D structure ignores their original structure, which makes it difficult to capture local context information and to adapt these models for real-time applications. In this paper, we present SRFormer, an efficient yet powerful Transformer-based architecture. Several key design choices in the Transformer blocks and Transformer layers allow us to consider the original structure of the image (i.e., the 2D structure) while capturing both local and global dependencies without raising computational demands or memory consumption. We also present a Gated Multi-Layer Perceptron (MLP) Feature Fusion module that aggregates the features of different stages of Transformer blocks by focusing on inter-spatial relationships while adding only minor computational cost to the network. We have conducted extensive experiments on several super-resolution benchmark datasets to evaluate our approach. SRFormer demonstrates superior performance compared to state-of-the-art methods from both Transformer and Convolutional networks, with an improvement margin of $0.1 \sim 0.53$ dB. Furthermore, with almost the same model size, SRFormer outperforms SwinIR by 0.47% while requiring half the inference time. The code will be available on GitHub.


I. INTRODUCTION
Super resolution has been studied since 1974, when Gerchberg [1] introduced the notion of Super Resolution (SR) to improve the resolution of optical systems beyond the diffraction limit. Since then, super resolution has been defined as a way to obtain a high resolution (HR) image from its degraded low resolution (LR) counterpart with high visual quality. This core technology can be applied to a wide range of Computer Vision tasks, leading to improvements in tasks such as object detection [2], [3], medical imaging [4], [5], security and surveillance imaging [6], [7], and face recognition [8], [9].
There are several reasons why image super resolution remains challenging: i) super resolution is fundamentally an ill-posed inverse problem, in that there are multiple solutions for the same low-quality image rather than a single unique solution; ii) the complexity of the problem increases with the upscale factor, since the retrieval of missing scene details becomes more complicated at greater factors, which often leads to the reproduction of incorrect information; and iii) there are fundamental uncertainties between the LR and HR data, since the down-sampling of different HR images may lead to a similar LR image [10].
Formerly, different methods were used to tackle super resolution problems, such as statistical methods, prediction-based methods, patch-based methods, edge-based methods, and sparse representation methods. However, researchers have lately turned to Deep Learning (DL) approaches to solve the problems of image super resolution, owing to advances in computational power. Over the last decade, ConvNet-based deep learning approaches have consistently and significantly improved on the classical methods. Numerous deep convolutional neural networks have been introduced [11], [12], [13], [14], [15], [16], as well as many lightweight networks and techniques for reducing the computational complexity of the networks, such as filter pruning [17] and knowledge distillation [18], which reduce computing time by narrowing the network. However, these techniques often lead to poor performance for several reasons, such as lower network capacity, long inference time, and a large number of operations due to several iterations through the forward process.
In addition, ConvNet-based approaches suffer from two main issues that come from the fundamentals of the convolution layer. First, there is no content dependency in the interactions between images and convolution kernels. The same convolution kernel is used to restore various image regions, which is not the ideal solution. Second, convolution is effective for capturing local context information but ineffective for capturing long-range dependency [19].
The Transformer [20], which uses a self-attention mechanism to capture global interactions between contexts, addresses the aforementioned problems of the convolution layer and has shown promising performance in several Vision and NLP tasks [21], [22], [23]. However, the computational cost of the self-attention mechanism increases quadratically with spatial resolution, and the mechanism ignores the local 2D structure information of the image by processing images as a 1D structure [24]. Furthermore, these methods usually occupy a large amount of GPU memory, which greatly limits their flexibility and application scenarios on low-capacity devices.
In this paper, we propose a novel lightweight approach for the single image super resolution task, namely SRFormer, by bringing the strengths of both the convolution
The extension of this work on cross-spectrum applications can be found at [25].
The main contributions of our work can be summarized as follows:
• We present SRFormer, an efficient yet powerful Transformer-based network for the single image super resolution task, which is faster in training and inference while generating more accurate SR images.
• We present a lightweight Dual Attention layer, which significantly improves the reconstruction quality by generating a global attention map from two local attention weights that are obtained individually by two parallel branches, without being memory hungry.
• We present a low-cost Gated MLP Feature Fusion module that yields a powerful representation by aggregating the multi-stage feature representations from the Transformer blocks with minor computational complexity.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Image Super Resolution (SISR) tasks compared to CNN/Transformer based networks.
The rest of the paper is organized as follows: Section II discusses the related work, including CNN- and Transformer-based super resolution methods. Section III describes the proposed SRFormer and its core components in detail. Experimental comparisons against several state-of-the-art methods are presented in Section IV. The model investigation is presented in Section V. Section VI concludes the paper.

II. RELATED WORK
In this section, the most recent state-of-the-art SR deep learning CNN and Transformer based approaches are detailed.

A. DEEP LEARNING BASED IMAGE SUPER-RESOLUTION
Single Image Super Resolution aims to restore a well-detailed image from its low-quality version. Dong et al. [10] introduced the Super-Resolution Convolutional Neural Network (SRCNN), the first work to use a CNN for the SR task. SRCNN is a shallow neural network that receives an upsampled image as input, which costs extra computation. To address this drawback, the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [26] and the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [27] take the LR image as input to reduce the large computational and runtime cost, and upsample the features near the output of the network with a single transposed convolution layer (FSRCNN) or a sub-pixel convolution layer (ESPCN). Even though the strength of deep learning comes from deep layers, the above-mentioned methods are shallow networks. Therefore, Kim et al. [28] use residual learning to ease the training challenges and increase the depth of their network to 20 convolutional layers. Then, MemNet [29] introduced a memory block for deeper networks and addressed the problem of long-term dependency with 84 layers. Lim et al. [30] introduce the Enhanced Deep Super-Resolution network (EDSR) by expanding the network size and enhancing the residual block by omitting batch normalization from it. Zhang et al. [31] propose the Residual Dense Network (RDN), with residual and dense skip connections, to fully use hierarchical features.
Furthermore, interest in building lightweight and efficient super resolution models has increased in recent years, with the aim of reducing the high computational cost of this task. Ahn et al. [32] design an efficient network that is suitable for the mobile scenario. Later, [33] introduces the Multi-Attentive Feature Fusion Super-Resolution Network (MAFFSRN), which uses multi-attention blocks to improve performance. LatticeNet [34] introduces an economical structure to adaptively combine Residual Blocks. Recently, OverNet [16] was introduced, which combines an efficient network structure with a multi-loss function to boost network performance. Neural architecture search (NAS)-based strategies have also been proposed in SISR to construct efficient networks: Multi-Objective Reinforced Evolution in Mobile Neural Architecture Search (MoreMNAS) [35] and Fast, Accurate and Lightweight Super-Resolution (FALSR) [36] are two examples. However, due to the limitations of the NAS strategy, the performance of these models is limited.
Since the introduction of the first work, many Transformer-based architectures have been proposed for Vision tasks such as image recognition [37], object detection [21], [23], segmentation [38], [39], and action recognition [40], [41]. In addition, Transformer-based models have been studied for low-level vision problems such as super resolution [19], [42], [43], image colorization [44], denoising [45], and image restoration [46]. For instance, DEtection TRansformer (DETR) [23] is a Transformer network designed for object detection, which can predict a set of objects and model their relationships. SwinIR [19] was introduced by Liang et al. for low-level vision tasks; it builds on the Swin Transformer [21] and applies self-attention within local image regions.
Although Transformer-based networks achieve excellent performance in low-level Vision tasks, these methods still depend on heavy GPU resources to train the model, which are not feasible or available to most researchers. Also, the computational complexity of self-attention in Transformers can increase quadratically with the number of tokens to mix (i.e., image patches), thereby prohibiting its application to high-resolution images. Therefore, in contrast to recent work in the super resolution domain, we present a Transformer-based network that can learn long-range dependency and local context information while remaining computationally efficient without the need for heavy GPU resources.

III. PROPOSED METHOD
In this section, the overall network architecture of the proposed SRFormer is described. Later, detailed information on the Dual Attention layer is provided.

A. OVERALL PIPELINE
The primary goal is to design an efficient Transformer architecture that can generate well-detailed, high-quality images while remaining computationally efficient. Thus, we build on the basic Transformer structure but redesign it for efficiency, with significant performance gains compared to existing CNN and Transformer networks. The overall architecture of SRFormer is shown in Fig. 2. In particular, the proposed SRFormer consists of four modules: Shallow Feature Extraction (SFE), Dense Feature Extraction (DFE), Gated MLP Feature Fusion (GMFF), and Multi-Scale Up-Sampling (MS-UP). We define $I_{LR}$ and $I_{SR}$ as the low-quality input and the high-quality output of our network, respectively.

B. SHALLOW FEATURE EXTRACTION
Convolution layers have been shown to perform well at early visual processing, which leads to improved network performance [47]. Therefore, a single 3 × 3 convolutional layer is applied to the given low-quality input image $I_{LR}$ to extract the initial features and map the input image space to a higher-dimensional feature space in order to generate a better SR image. We thus extract the shallow features $F_0$ as:
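A form consistent with the surrounding definitions, writing $H_{SFE}(\cdot)$ for the single 3 × 3 convolution (this symbol is an assumption introduced here, by analogy with the operator names used for the later modules), would be:

```latex
F_0 = H_{SFE}(I_{LR})
```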

C. DENSE FEATURE EXTRACTION
Next, the extracted shallow feature is passed as input to the Dense Feature Extraction (DFE) module, which produces $F_{DFE}$. DFE is built from a set of Transformer blocks. The input is first processed by an input embedding, such as the patch embedding used in Vision Transformers (ViTs):
where $I_{EMB}$ denotes the embedding tokens, with sequence length $N$ and embedding dimension $C$. Our Dense Feature Extraction module takes the embedding tokens as input to its Transformer blocks. Specifically, Dense Feature Extraction contains several Transformer blocks, each of which includes several Transformer layers and a 1 × 1 Conv layer at the end of the block, together with a waterfall residual connection that transfers information from the previous stage to the current stage. As the shallow features from the SFE pass through the successive Transformer stages, more abstract features are extracted and high-level information is emphasized (further details are provided in Section III-G). Thus, we extract the feature as follows:
where $H_{DFE}(\cdot)$ is the Dense Feature Extraction module with several Transformer blocks, which can be seen as
where $H_{DATB_i}(\cdot)$ denotes the $i$-th Transformer block and $C$ denotes the concatenation operation between the input feature of each DATB block and its output. Including a convolutional layer within each stage of the Transformer block helps to transfer the inductive bias of the convolution operation into the Transformer-based network and provides a more solid foundation for the later aggregation of shallow and deep features.
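One way to write the relations described in this subsection, under stated assumptions, is given below. The number of blocks $K$, the name $H_{1 \times 1}$ for the 1 × 1 convolution, and the exact placement of the concatenation are assumptions made for illustration; the concatenation operator is written $\mathcal{C}$ to keep it apart from the embedding dimension $C$.

```latex
\begin{aligned}
I_{EMB} &\in \mathbb{R}^{N \times C}, \\
F_{DFE} &= H_{DFE}(F_0), \\
F_i &= H_{1 \times 1}\big(\mathcal{C}\,[\,F_{i-1},\; H_{DATB_i}(F_{i-1})\,]\big),
\quad i = 1, \dots, K, \qquad F_{DFE} = F_K .
\end{aligned}
```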

D. GATED MLP FEATURE FUSION
The aim of the Gated MLP Feature Fusion (GMFF) design is to highlight the location information in the stacked feature map of the different stages of Transformer blocks. GMFF consists of N stacked residual DATB, as shown in Fig. 2. GMFF first accumulates the multi-stage features from the different Transformer stages to create multi-stage representations of the input image. It then passes the features through a lightweight MLP network. In contrast to a standard MLP network, however, we propose a novel MLP module with a 3 × 3 depthwise Conv layer inside it to leak spatial information, since highlighting such features is important for achieving high performance in the super resolution task. In addition, a gating mechanism is used, formulated as the element-wise product of two parallel routes of linear transformation layers, one of which is activated with GELU [48]. Thus, Gated MLP Feature Fusion can be formulated as follows:
where $F_{GMFF}$ denotes the output of our feature aggregation of the multi-stage Transformer blocks with the initial features, which is later used by the Multi-Scale Up-Sampling module. In the ablation study, we show the effectiveness of our proposed Gated MLP Feature Fusion compared to the standard MLP network.
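The gating step described above (an element-wise product of two parallel linear routes, one of them passed through GELU) can be illustrated on a single feature vector. This is a minimal pure-Python sketch, not the network's implementation: the function names and weight layout are hypothetical, and the 3 × 3 depthwise convolution is deliberately left out so that only the gate is shown.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b for one feature vector x, using plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def gated_unit(x, w_gate, b_gate, w_value, b_value):
    """Element-wise product of a GELU-activated route and a linear route.

    Illustrative only: the depthwise convolution described in the text
    is omitted, and all parameter names are assumptions.
    """
    gate = [gelu(g) for g in linear(x, w_gate, b_gate)]
    value = linear(x, w_value, b_value)
    return [g * v for g, v in zip(gate, value)]
```

Because the gate multiplies the second route element by element, a feature is passed on only to the extent that the activated route lets it through.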

E. MULTI SCALE UP-SAMPLING
Given the features from the previous modules, which aggregate low- and high-level information, our model generates a high-quality image $I_{SR}$. The Multi-Scale Up-Sampling (MS-UP) module takes the features directly from the GMFF module to reconstruct the high-quality output. MS-UP consists of several convolutional and pixel-shuffle layers that upsample the features to the corresponding sizes in one training phase, instead of training separately for each scale factor of interest. Furthermore, we incorporate a global connection path $H_{UP}$ with only a bicubic interpolation to grant access to the original LR information and facilitate the back-propagation of the gradients. The Multi-Scale Up-Sampling module can be formulated as:
where $H_{Rec}(\cdot)$ and $I_{SR}$ denote the up-sampling module and the high-quality reconstructed image, respectively.
To keep consistency with previous works, we use the $L_1$ loss as the cost function during training to optimize the parameters of the proposed SRFormer.
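Under the assumption that the reconstruction path and the bicubic global path are summed, and writing $M$ for the number of training pairs and $\Theta$ for the network parameters (both symbols are introduced here for illustration only), the reconstruction and the $L_1$ objective may be written as:

```latex
\begin{aligned}
I_{SR} &= H_{Rec}(F_{GMFF}) + H_{UP}(I_{LR}), \\
\mathcal{L}_1(\Theta) &= \frac{1}{M} \sum_{i=1}^{M}
  \big\lVert I_{SR}^{(i)} - I_{HR}^{(i)} \big\rVert_1
\end{aligned}
```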
where $I_{SR}$ is obtained by taking a low-quality image as the input of our model and $I_{HR}$ is the corresponding ground truth.
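The pixel-shuffle rearrangement used by the up-sampling module can be illustrated with a short pure-Python sketch. The channel ordering follows the common sub-pixel convolution convention and is an assumption here, as is the function name `pixel_shuffle`; the convolution layers and the bicubic path are omitted.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels supplies the r x r sub-pixel
    positions of one output channel.
    """
    channels_in = len(feat)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(feat[0]), len(feat[0][0])
    c_out = channels_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # sub-pixel row offset
            for j in range(r):      # sub-pixel column offset
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out
```

For a scale factor r, a feature map with C·r² channels of size H × W therefore becomes C channels of size rH × rW, with no interpolation involved.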
In the next subsections, more details about our Transformer layer are given.

G. DUAL ATTENTION LAYER
This section presents the proposed Dual Attention layer, which completely revises the token mixer (i.e., self-attention). As is well known, self-attention plays an important role in achieving high performance in Natural Language Processing (NLP) and Computer Vision Transformer-based networks. However, self-attention can be problematic for several reasons, especially when working with high-resolution images. The computational complexity of self-attention increases quadratically with the number of tokens to mix. In addition, self-attention treats images as flattened sequences, which neglects the original structure of images and therefore ignores the adaptability in channel dimensions, which has proven important for visual tasks. Self-attention also does not, by its nature, take local contextual information into account. Thus, we introduce the Dual Attention layer to overcome the aforementioned shortcomings by generating a global attention map at a lower computational cost than the existing token mixer. Dual Attention generates a global attention map by aggregating two local attention maps, which are obtained separately by two parallel branches: a CNN-based attention module and Transformer self-attention. In this way, unlike the previous token mixer, Dual Attention can consider both long-range dependency and local contextual information with less computational complexity.
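The quadratic growth mentioned above can be made concrete with a rough count of multiply-accumulate operations. The sketch below counts only the two matrix products of an attention layer (scores and weighted sum), assumes a single head with global attention over all pixels, and ignores projections; the function names are hypothetical. It contrasts attention over spatial tokens with attention computed across channels, the variant this section adopts for the SeAM branch.

```python
def spatial_attention_macs(h, w, c):
    """MACs for an (N x N) attention map over N = h*w tokens of width c."""
    n = h * w
    return 2 * n * n * c  # Q K^T, then attention @ V


def channel_attention_macs(h, w, c):
    """MACs for a (c x c) attention map computed across channels."""
    n = h * w
    return 2 * c * c * n  # Q K^T over pixels, then attention @ V


if __name__ == "__main__":
    for side in (64, 128):
        s = spatial_attention_macs(side, side, 64)
        ch = channel_attention_macs(side, side, 64)
        print(f"{side}x{side}: spatial={s:,} channel={ch:,}")
```

Doubling the height and width multiplies the spatial count by 16 but the channel count by only 4, which is the motivation for attending across channels at high resolution.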
As shown in Fig. 2, we design our Dual Attention so that it splits the channel features equally between the two attention module branches (SpAM and SeAM). From the normalized tensor $X$, each branch receives half of the input tensor to create its local attention map individually. SeAM is a self-attention Transformer, which first generates the query ($Q$), key ($K$), and value ($V$) projections enriched with the local context. We apply SeAM only across channels rather than across spatial dimensions. Our SeAM uses only depth-wise convolutions to emphasize the channel-wise spatial context before computing the feature covariance that produces the attention map. Thus, $Q$, $K$, and $V$ are computed as:
where $d$ is the 3 × 3 bias-free depth-wise convolution. Next, the query and key projections are reshaped so that their dot-product interaction generates a transposed attention map. Thus, the attention map is generated as follows:
where $X$ is the input feature map and $\alpha$ is a learnable scaling parameter that regulates the magnitude of the dot product of $K$ and $Q$ before the Softmax function is applied. Similar to previous works [19], [20], [49], we perform the attention function $h$ times to learn separate attention maps in parallel in our SeAM module.
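The transposed (channel-wise) attention described above can be sketched in pure Python on small matrices. The sketch assumes that `q`, `k`, and `v` already hold the depth-wise-convolved projections flattened to C × N (channels by pixels), that a single head is used, and that $\alpha$ multiplies the logits; these conventions and the function names are assumptions, since the text states only that $\alpha$ regulates the magnitude of the dot product.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(q, k, v, alpha=1.0):
    """Attention across channels for C x N inputs; returns a C x N output.

    The attention map is C x C, so its size does not depend on the
    number of pixels N.
    """
    c = len(q)
    n = len(v[0])
    # C x C logits: scaled similarity between every pair of channels.
    logits = [[alpha * sum(a * b for a, b in zip(q[i], k[j]))
               for j in range(c)] for i in range(c)]
    attn = [softmax(row) for row in logits]
    # Each output channel is a weighted mix of the value channels.
    return [[sum(attn[i][j] * v[j][p] for j in range(c)) for p in range(n)]
            for i in range(c)]
```

With all-zero queries the logits are equal, the softmax is uniform, and every output channel becomes the plain average of the value channels, which is a convenient sanity check.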
The second branch of the Dual Attention layer is the Spatial Attention Module (SpAM), which is an almost parameter-free attention mechanism. SpAM receives the other half of the input tensor to generate its local attention map. The goal of the SpAM module is to encode the spatial information, which represents the importance of each pixel in the input feature, at a negligible cost. Given half of the input tensor, the channels are reduced by mean and max operations, each producing a map of shape 1 × H × W. The obtained features are concatenated and then passed through a convolution layer with a kernel size of 7 × 7. A sigmoid activation is then applied to the output feature to generate attention weights of shape 1 × H × W, which are multiplied with the input tensor to produce refined tensors of shape C × H × W. Thus, SpAM can be formulated as follows:
where $F_{Mean}(\cdot)$ and $F_{Max}(\cdot)$ denote the mean and max operations. The local attention maps generated by SpAM and SeAM are then concatenated to obtain a unified global attention map at a lower computational cost. The generated attention map thus contains both long-range dependency and local context information, enriched with spatial features. Following that, a multi-layer perceptron (MLP) with two fully connected layers and a GELU non-linearity between them is employed for further feature modification. A norm layer is also added before the MLP, and both modules are wrapped in residual connections. Thus, the entire procedure inside our Dual Attention is as follows:
where Norm(•) stands for the normalization layer and $Y$ for the output feature map.
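The SpAM steps listed above (channel-wise mean and max, a convolution over the two resulting maps, a sigmoid, and a multiplication with the input) can be sketched as follows. This is a pure-Python illustration under assumptions: zero padding is used at the borders, the kernel is passed in explicitly as a 2 × k × k list (k = 7 in the text), and the function name and argument layout are hypothetical.

```python
import math


def spatial_attention(x, kernel, bias=0.0):
    """Apply mean/max spatial attention to a C x H x W nested list.

    kernel has shape 2 x k x k: one k x k filter for the mean map and
    one for the max map. Zero padding keeps the H x W size.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Reduce over channels to two 1 x H x W descriptors.
    mean_map = [[sum(x[ch][i][j] for ch in range(c)) / c
                 for j in range(w)] for i in range(h)]
    max_map = [[max(x[ch][i][j] for ch in range(c))
                for j in range(w)] for i in range(h)]
    maps = (mean_map, max_map)
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[m][di][dj] * maps[m][ii][jj]
            weight = 1.0 / (1.0 + math.exp(-acc))  # sigmoid, in (0, 1)
            for ch in range(c):
                out[ch][i][j] = x[ch][i][j] * weight
    return out
```

The only learned parameters in this sketch are the 2 × k × k kernel weights and the bias, which is why the branch can be described as almost parameter-free.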

2) EVALUATION PROTOCOL
Two widely used quantitative metrics are used to measure the performance of our SRFormer in order to maintain consistency with previous works: the Peak Signal-to-Noise Ratio (PSNR), measured in decibels (dB), and the Structural Similarity index (SSIM), both computed between the generated SR images and the corresponding ground truths. In keeping with the SR community, the RGB reconstruction results are first transformed to YCbCr space, and only the luminance channel is used to compute the PSNR and SSIM in our experiments.
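The luminance-only PSNR described above can be sketched as follows. The conversion uses the ITU-R BT.601 luma coefficients for 8-bit images, which is the convention commonly used in super resolution evaluation; the text does not state which conversion was used, so this choice, the function names, and the absence of any border cropping are assumptions.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values in [0, 255]; result in [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(ref, test, peak=255.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    if len(ref) != len(test) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def psnr_y(ref_rgb, test_rgb):
    """PSNR on the Y channel for two lists of (r, g, b) tuples."""
    ref_y = [rgb_to_y(*p) for p in ref_rgb]
    test_y = [rgb_to_y(*p) for p in test_rgb]
    return psnr(ref_y, test_y)
```

Evaluating on luminance alone makes the score insensitive to small chroma differences, which is why results computed this way are not directly comparable with PSNR computed on all three RGB channels.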

3) DEGRADATION MODELS
In order to demonstrate the efficiency of the proposed model, and following the work of [31], three different degradation models were created to simulate LR images and to make fair comparisons with available methods. The degradation data were obtained as follows. First, a bicubic (BI) down-sampling dataset with scaling factors [×2, ×3, ×4] was created. Second, a Blur-Downsampled (BD) dataset was created by applying a 7 × 7 Gaussian kernel with σ = 1.6 to the HR images and then down-sampling them with scaling factor ×3.
Aside from BD, a more challenging degradation model was created, referred to as Downsample-Noisy (DN). The DN degradation model down-samples the HR images with bicubic interpolation and then adds 30% Gaussian noise.
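The blur kernel used for the BD degradation can be generated as in the sketch below. It assumes an isotropic Gaussian sampled on integer taps and normalized to sum to one; the function name is hypothetical, and the subsequent filtering and ×3 down-sampling steps are not shown.

```python
import math


def gaussian_kernel(size=7, sigma=1.6):
    """Normalized size x size Gaussian kernel centered on the middle tap."""
    half = (size - 1) / 2.0
    weights = [[math.exp(-((i - half) ** 2 + (j - half) ** 2)
                         / (2.0 * sigma ** 2))
                for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in weights)
    return [[v / total for v in row] for row in weights]


if __name__ == "__main__":
    kernel = gaussian_kernel()
    print(f"center weight: {kernel[3][3]:.4f}")
    print(f"corner weight: {kernel[0][0]:.6f}")
```

Normalizing the weights keeps the mean brightness of the blurred image unchanged.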

4) IMPLEMENTATION DETAILS
In the training phase, RGB patches of size 64 × 64 from each of 32 randomly selected low-quality training images are provided as inputs. Data augmentation is applied to the patches by means of random horizontal flips and 90-degree rotations. The AdamP [71] optimizer is employed with an initial learning rate of $10^{-3}$, which is halved every $4 \times 10^5$ steps.
The $L_1$ loss is used to optimize the model. The configuration of our Transformer encoder is as follows: we use 4 Transformer blocks with 6 Transformer layers in each block, an embedding dimension of 64, and an MLP ratio of 2 for all Transformer blocks. A 1 × 1 Conv layer is also used inside each Transformer block. SRFormer was developed with the PyTorch framework and trained on a single NVIDIA RTX 3090 GPU.
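The step-wise learning-rate schedule described above can be written as a one-line function. This is a sketch under the assumption that the rate is halved exactly at every multiple of $4 \times 10^5$ steps, counting from step 0; the function name and default arguments are illustrative.

```python
def learning_rate(step, base_lr=1e-3, halve_every=400_000):
    """Learning rate at a given training step with periodic halving."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return base_lr * 0.5 ** (step // halve_every)


if __name__ == "__main__":
    for s in (0, 399_999, 400_000, 800_000, 1_200_000):
        print(s, learning_rate(s))
```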

B. COMPARISON WITH STATE-OF-THE-ART METHODS
In this section, SRFormer and SRFormer+ are compared to other lightweight state-of-the-art SR methods. The self-ensemble method [72] is also used to further boost the performance of the proposed SRFormer (denoted as SRFormer+).
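A common form of self-ensemble, sketched below, runs the model on the eight flip and rotation variants of the input, maps each output back to the original orientation, and averages the results. Whether [72] is applied in exactly this way for SRFormer+ is an assumption; the function names are hypothetical, `model` is any callable that takes and returns a 2D list, and a single channel is used for brevity.

```python
def rot90(img):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def flip(img):
    """Flip a 2D list horizontally."""
    return [row[::-1] for row in img]


def self_ensemble(model, img):
    """Average model outputs over the 8 flip/rotation variants of img."""
    outputs = []
    for flipped in (False, True):
        for k in range(4):
            x = flip(img) if flipped else img
            for _ in range(k):
                x = rot90(x)
            y = model(x)
            # Undo the rotation first, then the flip.
            for _ in range((4 - k) % 4):
                y = rot90(y)
            if flipped:
                y = flip(y)
            outputs.append(y)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs)
             for j in range(cols)] for i in range(rows)]
```

The averaging costs eight forward passes per image, so the gain of a "+" variant comes at roughly eight times the inference time of the base model.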

1) RESULTS ON BICUBIC DEGRADATION
We present comparisons between the proposed method (SRFormer and SRFormer+) and several of the most recent lightweight SOTA CNN- and Transformer-based models: VDSR [28], DRCN [55], CARN [32], CBPN [56], FALSR [57], LAPAR-A [59], LatticeNet [61], MADNet [62], HDRN [63], DPN [64], A$^2$F [65], ESRT [42], and SwinIR [19], on the Bicubic (BI) degradation model for scale factors [×2, ×3, ×4]. The number of network parameters and Multi-Adds operations are also presented in Table 1 to show the complexity of each model and to allow a fair comparison with the existing methods. As can be seen, SRFormer produces superior outcomes in practically all circumstances when compared to the other methods mentioned above. This shows that SRFormer is capable of continuously accumulating these hierarchical characteristics to build more robust representative features that are well-focused on spatial context information. This trait can be confirmed by the obtained SSIM scores, which are based on the visible structures in the image and are therefore more accurate. Furthermore, it can be observed that, using self-ensemble [72], the proposed SRFormer+ gains even more performance benefits. Several visual outcomes are presented in Fig. 3.
121462 VOLUME 11, 2023
As can be seen, the texture direction of the reconstructed images from all of the compared approaches is utterly incorrect, while the text is blurred to varying degrees in all cases. However, the results obtained by SRFormer are similar to the ground-truth texture.

C. RESULTS ON BD AND DN DEGRADATION MODELS
We also report the performance of SRFormer and SRFormer+ on the BD (Blurry) and DN (Noisy) benchmark datasets in Table 2 and Table 3 to illustrate the strengths of the proposed model relative to SOTA models in challenging situations. Due to the degradation mismatch, SRCNN and VDSR are re-trained for both BD and DN. As can be seen, SRFormer outperforms all other lightweight SOTA models on these challenging benchmark datasets. A high-capability model, RDN [31], is also listed to demonstrate the performance of our SRFormer in comparison to a deep and costly model on these challenging datasets. SRFormer performs better on both datasets, even though RDN is a significantly more expensive network than the low-cost SRFormer: RDN is nearly ×20 more expensive in terms of computational complexity. Furthermore, visual results on the challenging BD and DN benchmark datasets are shown in Fig. 4 and Fig. 5, respectively. As can be seen, our proposed method performs better than other SOTA methods at removing noise and fuzzy regions from the input image, which results in sharper SR images with fine details.

V. ABLATION STUDY
The performance of the proposed model is further investigated through an extensive ablation study that includes in-depth examinations of the impact of each module.The ablation study is designed to provide additional insight into the performance of the proposed model.

C. IMPACT OF DUAL ATTENTION
We further study the impact of both the proposed SpAM and SeAM to illustrate the effectiveness of the proposed Dual Attention. We investigate the performance of SRFormer with the standard self-attention layer [20] and with each sub-branch of our Dual Attention layer. As can be seen in Table 4, SRFormer with Dual Attention boosts the performance of the network while using less computation than when the standard self-attention layer is used in its place.
In contrast to other self-attention layers, Dual Attention is built from two parallel branches, which encode the spatial information more efficiently and enable Dual Attention to preserve a rich representation while shrinking its depth to make further computation lightweight. This also helps the network train faster compared to other Transformer-based networks.

D. INFLUENCE OF GATED MLP FEATURE FUSION
Table 5 shows the impact of our proposed lightweight Gated MLP Feature Fusion on the performance of the proposed network, compared to the network without any MLP module and with the baseline MLP.
In addition, we investigate the impact of using a depthwise conv layer, a pointwise conv layer, and the gated mechanism in our Gated MLP Feature Fusion. As can be seen, SRFormer obtains a performance gain, at a lower computational cost, both over the network without any MLP module and over the network with the baseline MLP. The intuition is that GMFF uses a gated mechanism to allow gradients to backpropagate more easily through depth, and a Dw-Conv layer between the MLP layers to leak location information, which leads the network to pay attention to positional information. The baseline MLP, in contrast, uses positional encoding [22] to introduce the location information, which is not suitable when the test resolution differs from the training resolution. Furthermore, we illustrate the performance gain of our Gated MLP Feature Fusion with pointwise and depthwise convolution layers, with the gated mechanism, and without GMFF. As shown in Table 6, the performance of our SRFormer improves when a depthwise convolution layer with a gated mechanism is used, compared to the other settings.

E. PERCEPTUAL INDEX METRIC
To assess the quality of the generated super resolution images, the Perceptual Index (PI) is used, which is more accurate in reflecting human perceptions of image quality compared to other metrics (PSNR and SSIM).Table 7 illustrates the PI metric between SRFormer and SOTA methods with the same order of magnitude in terms of network model size.
It can be seen that the proposed model achieves lower PI values (lower is better) than the other models. This demonstrates the ability of the proposed SRFormer to generate more realistic images. This comparison illustrates that our model successfully strikes a balance between performance and running time requirements.

VI. CONCLUSION AND FUTURE WORK
In this paper, we present a novel and efficient Transformer-based network called SRFormer. The proposed model is designed to use the strengths of both Convolutional and Transformer layers to extract and preserve the fine details of the features while remaining memory efficient.
To do so, we introduce the Dual Attention layer, a Transformer layer that generates the global attention map from two different branches (SpAM and SeAM) in order to capture both local context information and global dependency between sequences. We also introduce a lightweight Gated MLP Feature Fusion to aggregate the multi-stage feature representation by focusing on inner spatial information before the up-sampling module. We demonstrate the efficiency of the proposed method through a series of ablation investigations. We have empirically demonstrated that our approach outperforms previous lightweight state-of-the-art methods on all benchmark datasets, despite having similar or fewer network parameters. In the future, we will extend our proposal to blind super resolution, where there is no ground truth during training and inference. To do so, we will attempt to adapt the methodology of our proposed architecture to use a Generative Adversarial Network.

FIGURE 1. PSNR vs. model size trade-off on Urban100 (×4). SRFormer achieves superior performance among all the CNN and Transformer networks.
layer and Transformer layer together to address the aforementioned problems. By combining Convolution and Transformer layers, SRFormer is able to capture both local context information and global interactions between contexts while staying computationally efficient. The combination of CNN and Transformer, together with the precise design of our SRFormer architecture, allows our model to perform exceptionally well on benchmark datasets with faster training and inference times compared to other Transformer-based networks. It is worth mentioning that SRFormer was trained on only a single GPU for 3 days, while SwinIR was trained on 8 GPUs for almost 2 days to achieve its results. SRFormer also has the advantage of multi-scale training, which can generate SR images with different scale factors [×2, ×3, ×4] in one training phase, while other methods need to be trained separately for each scale factor. As illustrated in Fig. 1, the proposed SRFormer yields a 21% improvement on average across all benchmark datasets for scale factor 4 when compared to SwinIR [19], the SOTA Transformer-based model, which shows the efficiency of the proposed model.

B. VISION TRANSFORMER
Transformer networks show breakthrough performance in Natural Language Processing (NLP). In contrast to ConvNets, Transformer networks have the advantage of capturing long-range dependency in the input with global self-attention. The core idea of the Transformer is the self-attention module, which is capable of capturing long-term information between sequence elements. The impressive performance of Transformer-based networks in the NLP domain inspired the Computer Vision community to adopt the Transformer for Vision tasks. The first work in this direction was done by Dosovitskiy et al., who propose ViT [22] as a Vision Transformer, which replaces the standard CNN with a Transformer and trains directly on medium-size flattened patches with large-scale data pre-training.

FIGURE 2. The overall network architecture of the proposed SRFormer.

TABLE 1. Average PSNR/SSIM comparison with state-of-the-art CNN- and Transformer-based methods with the same range of network parameters on the Bicubic (BI) degradation for scale factors [×2, ×3, ×4] (Transformer-based methods are separated by a horizontal line). Red is the best and blue is the second best performance. We assume that the generated SR image is 720P to calculate Multi-Adds (MAC). SRFormer with self-ensemble results are highlighted.

FIGURE 3. Visual results of BI degradation model for ×4 scale factor.

FIGURE 6. Performance investigation on different settings of SRFormer on Urban100 for scale factor ×4.

F. MODEL COMPLEXITY AND INFERENCE TIME ANALYSIS
Table 8 illustrates the advantages of the proposed SRFormer architecture in terms of Network Parameters (M), Inference Time (s), and Memory Consumption (MB) compared to existing light- and heavy-weight SOTA CNN- and Transformer-based architectures on Urban100. In order to make a fair comparison, all the models are measured with the same configuration, using their published source code and default hyper-parameters, on a single NVIDIA RTX 3090 GPU. As shown, our model has the shortest inference time and the lowest memory consumption per image among the Transformer models.

TABLE 2. Quantitative results with BD degradation model. Performance is shown for scale factor ×3. The best and second best results are highlighted in red and blue respectively. SRFormer with self-ensemble results are highlighted.

TABLE 3. Quantitative results with DN degradation model. Performance is shown for scale factor ×3. The best and second best results are highlighted in red and blue respectively. SRFormer with self-ensemble results are highlighted.

TABLE 4. Influence of different settings of the dual attention layer on Urban100 for scale factor ×4.

TABLE 5. Gated MLP feature fusion performance investigation on Urban100 for ×4.

TABLE 6. Impact of different gated MLP feature fusion settings on Urban100 for scale factor ×4.

TABLE 7. Perceptual index comparison between the proposed method and recent lightweight state-of-the-art methods on benchmark datasets for scale factor ×4. Lower is better.

TABLE 8. Average running time (s) and memory consumption (MB) comparison on Urban100 for scale factor ×4.