Frequency-based Enhancement Network for Efficient Super-Resolution

Recently, deep convolutional neural networks (CNNs) have provided outstanding performance in single image super-resolution (SISR). Despite their remarkable performance, the lack of high-frequency information in the recovered images remains a core problem. Moreover, as the networks increase in depth and width, deep CNN-based SR methods are faced with the challenge of computational complexity in practice. A promising and under-explored solution is to adapt the amount of compute to the different frequency bands of the input. To this end, we present a novel Frequency-based Enhancement Block (FEB) which explicitly enhances the information of high frequencies while forwarding low frequencies to the output. In particular, this block efficiently decomposes features into low- and high-frequency components and assigns more computation to the high-frequency ones. Thus, it can help the network generate more discriminative representations by explicitly recovering finer details. Our FEB design is simple and generic and can be used as a direct replacement of commonly used SR blocks with no need to change network architectures. We experimentally show that, when replacing SR blocks with FEB, we consistently improve the reconstruction error while reducing the number of parameters in the model. Moreover, we propose a lightweight SR model, the Frequency-based Enhancement Network (FENet), based on FEB, that matches the performance of larger models. Extensive experiments demonstrate that our proposal performs favorably against state-of-the-art SR algorithms in terms of visual quality, memory footprint, and inference time. The code is available at https://github.com/pbehjatii/FENet.


I. INTRODUCTION
Single image super-resolution (SISR) has recently received a considerable amount of attention from both academia and industry. The purpose of SISR is to reconstruct a high-resolution (HR) image from its low-resolution (LR) observation. This offers an opportunity for overcoming resolution limitations in various computer vision applications such as medical imaging [49] and security and surveillance [47]. In general, SISR is an ill-posed inverse problem, since multiple HR images can map to the same LR input. To tackle such an inverse problem, numerous image SR methods have been proposed [16] based on deep neural architectures [9,34,36,61] and have shown prominent performance.
Convolutional Neural Networks (CNNs) have recently achieved unprecedented success in various problems [18,57]. The powerful feature representation and end-to-end training paradigm of CNNs make them a promising approach to SISR. Recently, most CNN-based SR methods have focused on elaborate architecture designs such as residual learning [2,5,31,32] and dense connections [25,67]. Although significant progress has been made, as discussed in [21,46], texture details of the LR images often tend to be smoothed in the super-resolved results, since most existing CNN-based SR methods do not pay enough attention to the limited high-frequency information in the LR images. In natural images, information is conveyed at different frequencies. The output feature maps of a convolutional layer can likewise be seen as a mixture of information at lower and higher frequencies. The lower-frequency information is composed of global structures and textures that can be forwarded directly to the final HR output without substantial computation. The higher-frequency information consists of fine details for which more complex restoring functions are expected. Leading CNN-based methods such as EDSR [37] and RDN [67] overlook the fact that most of the low-frequency information is already contained in the input. As a result, these models spend the same amount of computation treating low- and high-frequency information and lack the flexibility to modulate between them, which limits the representational ability of the network. Please note that, in this paper, the term frequency refers to low- and high-frequency features, and is not related to the frequency domain. Previous works address this problem by incorporating attention mechanisms [9,61,66] into the networks to model interdependencies among spatial locations, channels, or both. The common idea behind attention-based SR methods is to adjust network architectures so that they produce rich feature representations.
However, as SR networks are so diverse, an attention module is usually designed solely for a specific network structure [55]. Recently, various SR methods such as multi-branch networks [33,60] and progressive reconstruction methods [35,69] have mainly focused on refining high-frequency texture details. Although these methods have delivered impressive results, they demand substantial memory and computational resources. Therefore, the efficient reconstruction of high-frequency details in SISR remains a challenge today.
In this paper, we address the aforementioned problems from a different perspective. Instead of designing deep and complex networks or adding various shortcut connections to strengthen feature representations, we introduce a novel Frequency-based Enhancement Block (FEB) which is able to separate features into low and high frequencies while also enabling efficient communication among them. Since low frequencies are preserved by downsampling operations and thus can be recovered directly from the input, FEB assigns more computational capacity to high frequencies.
The proposed FEB gradually and iteratively enhances high-frequency feature maps during training while preserving low-frequency information, resulting in more accurate features that improve reconstruction quality.
The proposed FEB offers the following advantages. First, it is generic and can be easily applied to existing SR models without modifying network architectures or tuning hyper-parameters. Second, FEB reduces the number of parameters in the baseline SR models while simultaneously obtaining better SR performance. In Figure 1, we provide an example of the visual quality of EDSR [37], which uses residual blocks [18] as its building module. It can be observed that, when we replace its residual blocks with our blocks (EDSR-FEB), the network obtains better visual quality while reducing the number of parameters.
Based on FEB, we build a lightweight SR network named Frequency-based Enhancement Network (FENet), illustrated in Fig 2. Our network leads to significant improvements for single image SR, surpassing SR networks with complicated skip connections and concatenations. In summary, the main contributions of this paper are:
• We propose a novel Frequency-based Enhancement Block (FEB) to perform frequency-based computation. Such a mechanism allocates more computation to high-frequency bands, allowing the network to focus on more informative features and improve its discriminative capabilities.
• The proposed block halves the number of parameters in the baseline SR models while achieving better SR performance.
• We propose a lightweight Frequency-based Enhancement Network (FENet) for fast and accurate image super-resolution. Extensive experiments on a variety of public datasets demonstrate the superiority of the proposed architecture over state-of-the-art models, in terms of both quantitative and visual quality.

II. RELATED WORK
In recent years, the field of image SR has been dominated by CNNs, which achieve state-of-the-art performance [5,36,42,43,54,61]. Here, we focus our discussion on the approaches that are most related to our work.

A. EVOLUTION OF ARCHITECTURES FOR SR
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3176441, IEEE Access. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Ledig et al. [31] employed residual blocks proposed in [18] to construct a deeper network (SRResNet) for image SR, which was further improved by EDSR [37] and MDSR [37], which removed unnecessary modules (e.g., batch normalization) from the residual blocks. By using effective building modules, image SR networks became deeper and yielded better performance. Later, in order to employ hierarchical features from all the convolutional layers in deep networks, dense blocks started being employed in SR architectures [3,22,25,53,63]. More recently, Zhang et al. [67] and Liu et al. [39] also used dense and residual connections in RDN and RFANet to utilize information from the whole feature hierarchy. In addition to residual and dense blocks, Li et al. [32] and Lan et al. [30] proposed multi-scale blocks to explore the multi-scale information of LR images. Although these existing CNN-based SR approaches have provided outstanding performance, they are devoted to designing deeper and wider networks to enhance their representational learning capacity. This increase in depth and width has also raised computational demands and memory consumption, which makes modern architectures less applicable in practice. Numerous lightweight models have been proposed to alleviate this computational burden. For example, DRCN [27] was the first to apply a recursive algorithm to SISR, reducing the number of parameters by reusing them multiple times. Tai et al. [51] and Ahn et al. [2] improved DRCN by combining recursive and residual network schemes in order to achieve better performance with even fewer parameters. Likewise, Behjati et al.
[5] and Jiang et al. [25] also joined residual connections and recursive layers to reduce the computational cost. On the other hand, LapSRN [28] employed a pyramidal framework to increase the image size gradually; by doing so, LapSRN effectively performs SISR in extremely low-resolution cases. Chu et al. [11] and Ahn and Cho [1] introduced neural architecture search strategies to automatically build an SR model given certain constraints. Meanwhile, Hui et al. [24] proposed an information multi-distillation block (IMDB) that extracts features at a granular level with a channel splitting strategy. More recently, Luo et al. [40] proposed lattice blocks that apply so-called butterfly structures to combine residual blocks. Later, Wang et al. [58] proposed an attentive feature block that utilizes auxiliary features of previous layers to facilitate feature learning in the current layer. Li et al. [34] proposed a linearly-assembled pixel-adaptive regression network, which casts the direct LR-to-HR mapping into a linear coefficient regression task. Recently, to simplify the challenge of directly super-resolving details, some authors adopted a progressive structure to reconstruct HR images in a stage-by-stage upscaling manner [36,38,69].
Considering that different types of information within and across feature maps contribute differently to image SR, a shortcoming of the aforementioned SR approaches is that they cannot capture low- and high-frequency feature representations separately in the process of feature embedding, which hinders their representational ability [21,46,66].

B. FREQUENCY BASED SR METHODS
It is well known that high-frequency information (e.g. texture, edges) is significant for SISR. Li et al. [35] proposed a super-resolution feedback network (SRFBN) based on a recurrent architecture design. The network is based on a feedback block that consists of several projection groups. Each projection group first finds high-resolution features (via deconvolution) and then generates low-resolution features (via convolution). As a result, this network is able to gradually recover high-frequency components. Later, Haris et al. [17] proposed a method to refine high-frequency texture details with a series of up- and downsampling layers that are densely connected with each other to combine HR images from multiple depths in the network. More recently, Qiu et al. [46] and Yang and Lu [60] proposed multi-branch architectures in which one branch is responsible for capturing high-frequency features such as texture and edges, while the other learns low-frequency features such as the image outline and contours. Similarly, Li et al. [33] introduced octave convolution to image SR, which uses two branches to perform information update and frequency communication between low- and high-frequency features.
Although these existing SR approaches have made good efforts to improve SR performance, they tend to increase the amount of compute on high-frequency information by increasing the overall number of operations of the model, without paying attention to model complexity. The increase in complexity due to the independent treatment of multiple frequencies is a key issue that limits the performance of these deep CNN-based methods.

C. ATTENTION BASED SR METHODS
Attention mechanisms have demonstrated great effectiveness in improving the performance of CNNs on various computer vision tasks [20,57]. Hu et al. [20] introduced the squeeze-and-excitation (SE) block, which models channel-wise relationships in a computationally efficient manner and enhances the representational ability of the network, showing its effectiveness on image classification. CBAM [57] modified the SE block to exploit both spatial and channel-wise attention. Zhang et al. [66] first incorporated SE [20] into SR and pushed the state-of-the-art performance of SISR. More recent works, such as [9,21,29,43,44,58,59,61], extend this idea by adopting different spatial attention mechanisms or designing advanced attention blocks.
All the above-mentioned approaches improve CNNs for image SR by either refining architectural designs or adding complexity to hand-designed blocks. Conversely, our proposal efficiently restores textures at different frequencies. Such a mechanism helps the network explicitly allocate computation to high-frequency features, thus improving its discriminative capabilities.

III. FREQUENCY-BASED ENHANCEMENT NETWORK
In this section, we first describe the overall network architecture. Next, we detail the proposed Frequency-based Enhancement Block (FEB). Finally, we discuss the differences between the proposed method and similar related works.

A. NETWORK OVERVIEW
As shown in Fig 2, the overall network architecture of the Frequency-based Enhancement Network (FENet) consists of a non-linear mapping module and a reconstruction module. Let I_LR and I_SR denote the input and output of FENet, respectively. Following [2,5,40,58,68], we apply only one 3 × 3 convolutional layer H to extract the initial features H_0 from the LR input image:

H_0 = H(I_LR),        (1)

where it is worth noting that only one convolutional layer is used here for lightweight design. Then, we use the non-linear mapping module, which consists of several stacked FEBs, to generate new powerful representations:

H_k = B_k(H_{k-1}),  k = 1, ..., M,        (2)

where B_k denotes the mapping function of the k-th FEB, H_{k-1} represents the features from the previous adjacent FEB, and M is the total number of FEBs. Inspired by [24,32,58,67], we apply a feature fusion strategy to integrate the features from all the FEBs. This strategy helps to extract more hierarchical contextual information. The fusion operation is formulated as

H_F = F([H_1, H_2, ..., H_M]),        (3)

where [H_1, H_2, ..., H_M] refers to the concatenation of the feature maps produced by the FEBs and F is a 1 × 1 convolutional operation. Finally, we utilize the reconstruction module, which contains convolutional layers and pixel-shuffle layers [50], to upsample the features to the HR size. In addition, we incorporate a global connection path (green line in Fig 2) to grant access to the original LR information and facilitate the backpropagation of gradients; on this path, only a bicubic interpolation U is applied to the input I_LR. Therefore, we obtain:

I_SR = R(H_F) + U(I_LR),        (4)

where R is the reconstruction module and I_SR is the final output of the network.
To optimize the network parameters, we adopt the L1 loss as the cost function for training. Given a training set with N pairs of LR images and HR counterparts, denoted by {I^i_LR, I^i_HR}, i = 1, ..., N, the network is optimized to minimize the L1 loss function:

L(θ) = (1/N) Σ_{i=1}^{N} || F_FENet(I^i_LR; θ) − I^i_HR ||_1,        (5)

where θ denotes the parameter set and F_FENet(·) is the function learned by FENet.
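The pipeline described above, initial feature extraction, M stacked blocks, feature fusion, pixel-shuffle reconstruction, and a global bicubic skip, can be sketched in PyTorch as follows. This is a schematic, not the released implementation (see the official repository): the block internals are replaced by a plain convolutional placeholder, and the layer configurations are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FENetSketch(nn.Module):
    """Schematic FENet: head conv (H), M stacked blocks (B_k), 1x1 fusion (F),
    pixel-shuffle reconstruction (R), and a global bicubic skip connection."""
    def __init__(self, channels=64, num_blocks=12, scale=4, block=None):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)           # initial features H_0
        make_block = block or (lambda: nn.Sequential(              # placeholder for FEB
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)))
        self.blocks = nn.ModuleList(make_block() for _ in range(num_blocks))
        self.fuse = nn.Conv2d(channels * num_blocks, channels, 1)  # 1x1 feature fusion
        self.tail = nn.Sequential(                                 # reconstruction module
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
        self.scale = scale

    def forward(self, x):
        h = self.head(x)
        feats = []
        for blk in self.blocks:            # hierarchical features from every block
            h = blk(h)
            feats.append(h)
        fused = self.fuse(torch.cat(feats, dim=1))
        up = F.interpolate(x, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)    # global skip
        return self.tail(fused) + up
```

Training would then simply minimize `F.l1_loss(model(lr), hr)` with ADAM, following the settings in Section IV-A.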

B. Frequency-based Enhancement Block (FEB)
A natural image can be decomposed into a low-frequency component that describes smoothly changing structures and a high-frequency component that describes the rapidly changing fine details [8,48]. Similarly, we argue that the output feature maps of a convolutional layer can also be decomposed into features of different frequencies, and propose an efficient Frequency-based Enhancement Block (FEB) which naturally decomposes low and high frequencies at the feature level. The high-frequency part is processed by higher-complexity operations (in number of parameters and non-linearities), whereas the lower-frequency part is processed by lower-complexity operations to compensate for the increase of computation. As a result, the proposed approach learns discriminative representations in order to efficiently achieve more accurate reconstructions. As shown in Fig 3, the proposed FEB contains two pathways, each of which is responsible for a different functionality. Each pathway has a 1 × 1 convolutional layer at the beginning. Given the input X ∈ R^{C×H×W}, where C denotes the number of channels and H × W the spatial dimensions, we have

X_1 = F'_split(X),        (6)
X_2 = F''_split(X),        (7)

where X_1 and X_2 each have half the number of channels of X, and F'_split and F''_split are two 1 × 1 convolutional operations.
Fig 3: Schematic illustration of the proposed Frequency-based Enhancement Block (FEB). As can be seen, the original filters are separated into two processing lines, each of which is in charge of a different functionality. More details in Section III-B.
Then, the two feature sets are sent into dedicated pathways for collecting different types of information (i.e., low- and high-frequency information). The first pathway aims at retaining the original information (low frequency). To save computation, we perform only a simple 3 × 3 convolutional operation to capture the global layout and coarse details:

Y_1 = F_1(X_1),        (8)

where Y_1 is the output of the 3 × 3 convolutional layer F_1.
In the second pathway, we first apply an average pooling layer to X_2, yielding T_1:

T_1 = AvgPool_k(X_2),        (9)

where k denotes the kernel size of the pooling layer, so that the intermediate feature map has size T_1 ∈ R^{(C/2)×(H/k)×(W/k)}. Each value in T_1 can be viewed as the average intensity of a specified small area of X_2. After that, T_1 is upsampled via a bicubic interpolation operator to produce a new tensor T_2 of the same size as X_2:

T_2 = Upsample_k(T_1),        (10)

where T_2 contains averaged information and can be regarded as a smoother version of the original X_2. Then, in order to obtain the high-frequency information, T_2 is subtracted element-wise from X_2:

T_3 = X_2 − T_2.        (11)

The visual activation maps of X_2, T_2, and the high-frequency information T_3 are shown in Fig 4. It can be observed that T_2 is smoother than X_2, as it is the averaged information of X_2, while T_3 retains the details and edges. Now, the high-frequency enhancement operation can be formulated as

Y'_2 = F_3(T_3) ⊗ σ(F_2(T_3) + X_2),        (12)

where σ is the sigmoid function, ⊗ denotes element-wise multiplication, and F_2 and F_3 are two 3 × 3 convolutional layers. As shown in (12), we use X_2 as residuals to form the weights, which we found beneficial. The output of the second pathway can then be written as

Y_2 = F_4(Y'_2),        (13)

where F_4 is a 3 × 3 convolutional operation. Finally, the intermediate outputs of the first and second pathways {Y_1, Y_2} are concatenated together as the output Y ∈ R^{C×H×W} to obtain a rich feature representation. Compared to other works such as [33,60], which require a considerably large amount of computation to decompose features of different frequencies, FEB separates the low- and high-frequency feature representations in an efficient way and focuses on reconstructing the high-frequency ones.
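The two pathways described above can be sketched in PyTorch as follows: two 1 × 1 splits, a low-frequency path with a single 3 × 3 convolution, and a high-frequency path that subtracts a pooled-and-upsampled (smoothed) copy of the features before enhancing the residual. The exact form of the enhancement weighting is an assumption here; consult the official code for the authors' definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEBSketch(nn.Module):
    """Sketch of the Frequency-based Enhancement Block. The sigmoid weighting
    of the high-frequency path is an assumed reading of the enhancement step."""
    def __init__(self, channels=64, pool_k=4):
        super().__init__()
        half = channels // 2
        self.split1 = nn.Conv2d(channels, half, 1)     # F'_split
        self.split2 = nn.Conv2d(channels, half, 1)     # F''_split
        self.f1 = nn.Conv2d(half, half, 3, padding=1)  # low-frequency path
        self.f2 = nn.Conv2d(half, half, 3, padding=1)
        self.f3 = nn.Conv2d(half, half, 3, padding=1)
        self.f4 = nn.Conv2d(half, half, 3, padding=1)
        self.pool_k = pool_k

    def forward(self, x):
        x1, x2 = self.split1(x), self.split2(x)
        y1 = self.f1(x1)                               # coarse layout (low frequency)
        t1 = F.avg_pool2d(x2, self.pool_k)             # local averages
        t2 = F.interpolate(t1, size=x2.shape[-2:],     # smoothed version of x2
                           mode='bicubic', align_corners=False)
        t3 = x2 - t2                                   # high-frequency residual
        w = torch.sigmoid(self.f2(t3) + x2)            # assumed weighting, x2 as residual
        y2 = self.f4(self.f3(t3) * w)                  # enhanced high frequencies
        return torch.cat([y1, y2], dim=1)              # Y with C channels again
```

Note that the block preserves the channel count, so it can drop into existing architectures as a one-for-one replacement of residual or dense blocks.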

1) Difference to Prominent SR Blocks
Prominent SR blocks such as residual blocks [37] or dense blocks [53] process low- and high-frequency information simultaneously with the same convolution operations and do not discriminate the computation of features by their frequency components. Therefore, some local details of LR images cannot be effectively utilized for HR reconstruction, leading to blurry super-resolved results [33]. In contrast, our proposal treats different frequencies in a heterogeneous way and also models inter-channel dependencies, which consequently enriches the output features. Moreover, FEB benefits SR approaches by reducing the number of parameters while achieving superior SR performance.

2) Difference to Attention-Based Methods
Our work is quite different from existing methods such as [12,21,43,66], which rely on supplementary attention blocks and require additional learnable parameters. In contrast, our approach internally changes the way the filters of convolutional layers are exploited, and hence requires no additional learnable parameters. In the following experimental section, we demonstrate that, without any extra learnable parameters, FEB yields significant improvements over baselines and other attention-based SR approaches. Moreover, it is complementary to attention mechanisms and also benefits from their inclusion into the pipeline.

IV. EXPERIMENTAL RESULTS
In this section, we first conduct an ablation study to validate the effectiveness of the proposed FEB. Then, we systematically compare FENet with state-of-the-art SISR algorithms on five commonly used benchmark datasets.

A. SETTINGS
Datasets and Metrics. Following [67], we use 800 high-quality images from the DIV2K dataset [52] for training. We evaluate our model on several benchmark datasets: Set5 [6], Set14 [62], B100 [4], Urban100 [23], and Manga109 [41], each with diverse characteristics. To keep consistency with previous works [11,25,29,33,34,36,58,61], we use Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [56] to assess image reconstruction accuracy. PSNR statistically measures the distortion between the reconstructed image and the ground-truth image; the higher the PSNR, the better the quality of the reconstructed image. SSIM measures the structural similarity between two images based on luminance, contrast, and structure. SSIM values range from 0 to 1, where 1 means the reconstructed image perfectly matches the original. All results are evaluated on the luminance channel (Y). In addition to PSNR and SSIM, we adopt the Perceptual Index (PI) [7] to evaluate the perceptual quality of reconstructed images. PI has a high correlation with human-opinion scores and can avoid the situation where over-smoothed images present a higher PSNR and SSIM when the performances of two methods are similar; a lower PI value denotes better perceptual quality.

Degradation models. To fairly compare against existing works, we adopt bicubic downsampling (denoted as BI) as our standard degradation model for generating LR images from ground-truth HR images, at ×2, ×3, and ×4 scales. Moreover, to comprehensively illustrate the efficacy of the proposed FEB, we further adopt two other degradation models as in [67]. We define BD as a degradation model that performs bicubic downsampling on HR images at ×3 scale, and then blurs them with a Gaussian kernel of size 7×7 and standard deviation 1.6.
Additionally, we produce LR images in a more challenging way: we first bicubic-downsample HR images with scaling factor ×3 and then add Gaussian noise with noise level 30 (denoted as DN).

Implementation details. During training, data augmentation is carried out by means of random horizontal flips and 90° rotations. At each training mini-batch, 64 LR RGB patches of size 64 × 64 are provided as inputs. We train FENet using the ADAM optimizer with learning rate 10^-3, halved every 2 × 10^5 iterations. We set the number of FEBs to 12 in our FENet. Our network is implemented in PyTorch and trained on an NVIDIA RTX 3090 GPU.
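The BD and DN degradation models described above can be sketched as follows, assuming image tensors normalized to [0, 1]. The kernel size, sigma, noise level, and order of operations (downsample, then blur or noise) follow the text; the padding mode and clamping are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=1.6):
    """Normalized 7x7 Gaussian blur kernel with sigma = 1.6, as specified for BD."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def bicubic_down(hr, scale):
    h, w = hr.shape[-2:]
    return F.interpolate(hr, size=(h // scale, w // scale),
                         mode='bicubic', align_corners=False)

def degrade_bd(hr, scale=3):
    """BD: bicubic x3 downsampling followed by a 7x7 Gaussian blur (sigma 1.6)."""
    lr = bicubic_down(hr, scale)
    c = lr.shape[1]
    k = gaussian_kernel().view(1, 1, 7, 7).repeat(c, 1, 1, 1)
    lr = F.pad(lr, (3, 3, 3, 3), mode='reflect')  # keep spatial size after the blur
    return F.conv2d(lr, k, groups=c)              # depthwise blur, one kernel per channel

def degrade_dn(hr, scale=3, noise_level=30):
    """DN: bicubic x3 downsampling followed by Gaussian noise of level 30
    (noise levels are on the [0, 255] scale; images assumed in [0, 1])."""
    lr = bicubic_down(hr, scale)
    return (lr + torch.randn_like(lr) * (noise_level / 255.0)).clamp(0, 1)
```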

1) The importance of FEB
In this section, we conduct ablation experiments to explore the influence of each pathway (low- and high-frequency paths) inside the proposed FEB on the reconstruction performance. We use FENet as the basic network and run the following experiments: (1) deactivating the low-frequency path (Y_1) in FEB; (2) deactivating the high-frequency path (Y_2) in FEB; and (3) activating both low- and high-frequency paths. To keep the number of parameters similar, we use 8 and 6 FEBs in the first two experiments, respectively, without channel reduction. As reported in Table 1, we observe a significant performance drop when either the low- or high-frequency path is deactivated in FEB. This is mainly because: 1) when the low-frequency path (Y_1) is deactivated, the high-frequency path (Y_2) focuses too strongly on high-frequency details, smoothing other important aspects of the input that should be preserved by the low-frequency path; 2) the network without the high-frequency path (Y_2) processes low- and high-frequency information simultaneously with the same convolution operations and does not explicitly extract the high frequencies from image features, so some local details of the LR image cannot be effectively utilized for HR reconstruction.
In Fig 5, we additionally visualize the average feature maps of the low-frequency (Y_1 in (8)) and high-frequency (Y_2 in (13)) paths within the first four FEBs. It can be observed that the low-frequency feature maps describe the overall outline of the butterfly, while the high-frequency ones represent its edges and textures. This visualization shows how FEB is able to efficiently restore textures at different frequencies and can potentially improve performance.

2) The effectiveness of FEB
To demonstrate the effectiveness of the proposed FEB scheme, we use FENet as the basic network. To keep the number of parameters similar, we replace the 12 FEBs with 8 residual blocks (RB) [37], 5 dense blocks (DB) [53], 6 information multi-distillation blocks (IMDB) [24], or 4 multi-scale residual blocks (MSRB) [32]. In Table 2, we compare the number of parameters and the performance in PSNR of all methods for scale factor ×4. As reported in Table 2, the method with FEB outperforms all the methods with different SR blocks while using fewer parameters. The reason is that the proposed block treats different frequencies in a heterogeneous way and thus improves super-resolution performance. These experiments justify that the proposed FEB is more helpful for image SR. We additionally provide visual comparisons (Fig 6) of FENet using different SR blocks for scale factor ×4. It can be observed that the network using FEB obtains better visual quality and represents more diverse structure patterns.

3) Comparison with attention-based methods
To compare FEB with attention mechanisms, we integrate channel attention (CA) [66] (ResNet-CA) and channel-wise and spatial attention residual [21] (ResNet-CSAR) into the residual blocks of a baseline network with 8 residual blocks (ResNet), as done in [66]. Furthermore, we replace the 8 residual blocks with 12 FEBs (FENet) and integrate the two mentioned attention mechanisms into the FEBs, naming the resulting networks FENet-CA and FENet-CSAR.
As reported in Table 3, ResNet-CSAR and ResNet-CA obtain better performance than ResNet, but they require additional learnable parameters. Quite differently, FENet does not rely on any extra learnable parameters, since it heterogeneously exploits the convolutional filters, and thus achieves better performance than ResNet-CSAR and ResNet-CA. It should also be mentioned that the proposed FEB is compatible with the above-mentioned attention mechanisms. For example, when adding CA blocks to each FEB of FENet (FENet-CA), we gain a further 0.07dB on average. This also indicates that our approach is orthogonal to such supplementary attention modules.
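For reference, an RCAN-style channel attention (CA) module [66] of the kind integrated above can be sketched as follows; the reduction ratio is an assumption, and in the FENet-CA variant such a module would simply be appended to each FEB's output.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """RCAN-style channel attention: squeeze with global average pooling,
    excite with a two-layer 1x1 bottleneck, then rescale the input channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze to 1x1 per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.body(x)                             # channel-wise rescaling
```

Unlike FEB, this module adds learnable parameters (the two 1 × 1 convolutions), which is exactly the cost that Table 3 contrasts against our parameter-free frequency decomposition.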
To dig deeper into the difference between the proposed block and attention-based approaches, we visualize in Fig 7 the average feature map of the output of Y'_2 in (12) and of the outputs of the CA and CSAR attentions in the residual blocks. Our network should focus on high-frequency components (i.e., edges and contours) and suppress the smooth areas of the original input image. Compared with CA and CSAR, the feature maps acquired from Y'_2 in (12) contain more negative values, showing a stronger effect of suppressing the smooth areas of the input image as well as directing computation towards edges and details. This visualization indicates that the network with FEB can generate richer and more discriminative feature representations than the different attention mechanisms.

4) Generalization ability
To demonstrate the generalization ability of the proposed structure, we select two state-of-the-art SR networks with different model sizes, EDSR [37] and RCAN [66]. EDSR contains 32 stacked residual blocks with 256 filters. RCAN consists of 200 residual channel attention blocks with 64 filters. We replace their building blocks with FEBs; the corresponding networks are named EDSR-FEB and RCAN-FEB, respectively. For a fair comparison, all networks are trained with their default settings.
As shown in Table 4, EDSR-FEB improves by 0.08dB on average with almost half the number of parameters (28M) compared to the original EDSR (43M). Moreover, RCAN-FEB also improves over RCAN with approximately half the parameters. From these comparisons, we find that (1) the proposed FEB performs much better than channel attention, (2) a similar phenomenon is observed for deeper networks, and (3) FEB halves the number of parameters while achieving better performance. Figs 1 and 8 additionally show visual comparisons for scale factor ×4. It can be observed that EDSR-FEB and RCAN-FEB reconstruct sharper and more natural-looking images, mainly because FEB can extract high-frequency features and use them for reconstruction.

5) Comparing pooling methods
In this section, we investigate the influence of different pooling types on performance. The proposed block adopts average pooling for downsampling and bicubic interpolation for upsampling. In our experiments, we use FENet as the basic network and replace the average pooling operators in all FEBs with maximum pooling operators. As shown in Table 5, using average pooling instead of maximum pooling while keeping the rest of the configuration unchanged yields a performance increase of about 0.08dB on average. We argue that this may be due to the fact that, unlike maximum pooling, average pooling builds connections among all locations within the pooling window, which better captures local contextual information.
In addition, we investigate the behavior of the proposed block (FEB) when the average pooling used for downsampling in (9) and the bicubic interpolation used for upsampling in (10) are replaced with a convolutional layer and a deconvolutional layer, respectively. As reported in Table 5, both the performance and the number of parameters of the network increase when we replace average pooling and bicubic interpolation with learnable operations. Although the performance increases by 0.11dB on average, this leads to a more complex network with more parameters. Weighing network performance against network complexity, we use average pooling and bicubic interpolation for the rest of the experiments: the results are close to those of the conv-deconv variant, but the number of model parameters is only one fourth of it.
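The learnable conv-deconv variant discussed above can be sketched as follows; the kernel size and stride are set equal to the downsampling rate, which is our assumption about the substitute layers' configuration.

```python
import torch
import torch.nn as nn

channels, rate = 32, 4  # rate follows the downsampling rate k = 4 used in the paper

# Hypothetical learnable substitutes for the average-pooling / bicubic pair:
down = nn.Conv2d(channels, channels, kernel_size=rate, stride=rate)          # replaces average pooling
up = nn.ConvTranspose2d(channels, channels, kernel_size=rate, stride=rate)   # replaces bicubic upsampling

x = torch.randn(1, channels, 16, 16)
t1 = down(x)    # downsampled features
t2 = up(t1)     # restored to the input size
high = x - t2   # high-frequency residual, analogous to the pooling-based pathway
```

Each such pair adds two full convolutional weight tensors per FEB, which is where the extra parameters reported in Table 5 come from.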

6) The effect of Downsampling rate
We also investigate how the downsampling rate in FEB influences the image SR performance. In Table 6, we show the performance with different downsampling rates used in FEB.
It can be observed that, as the downsampling rate increases, slightly better performance is achieved. However, we do not use larger downsampling rates for two reasons: (1) the resolution of the input features is already very small; (2) higher downsampling rates improve performance at the expense of more computation due to the bicubic operation. Therefore, for the rest of the experiments, we set the downsampling rate to 4 for all scale factors, as it still provides significant improvements at a lower computational cost than a rate of 5.

7) The effect of increasing the number of FEB
As discussed in [37], increasing the depth of the network can effectively improve performance, and increasing the number of FEBs is the simplest way to achieve this in our design. To balance model size and performance, we compare the proposed model with different numbers of FEBs, i.e., 6, 8, 10, and 12. As shown in Table 7, FENet's performance improves rapidly as the number of FEBs grows. Although the performance would further improve with more FEBs, we found that this leads to diminishing returns with respect to the number of parameters. Therefore, we use 12 FEBs in our experiments. Furthermore, we find that adding a global connection path (green line in Fig 2), which grants the output access to the original LR input, is beneficial for reconstruction performance: discarding this connection decreases performance by 0.04dB on average.
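The stacking of FEBs and the global connection path can be sketched as below. This is a simplified skeleton, assuming the FEB internals are abstracted into a residual placeholder (`FEBStub`) and omitting the final upsampling stage; the class names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FEBStub(nn.Module):
    """Placeholder standing in for a Frequency-based Enhancement Block."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv(x)

class FENetSketch(nn.Module):
    def __init__(self, channels=64, n_blocks=12):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # Stack of n_blocks FEBs; the ablation compares 6, 8, 10, and 12.
        self.body = nn.Sequential(*[FEBStub(channels) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, lr):
        feat = self.head(lr)
        out = self.tail(self.body(feat))
        # Global connection path: grant the output access to the LR input.
        return out + lr

x = torch.randn(1, 3, 24, 24)
y = FENetSketch(n_blocks=12)(x)
```

The final `out + lr` addition corresponds to the global path whose removal costs about 0.04dB in the ablation.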
In Fig 9, we present qualitative visual comparisons for the ×4 scale factor. It can be observed that the SR images reconstructed by FENet contain more refined details, especially along edges and lines. This further validates the effectiveness of the proposed FEB.

2) Results with BD and DN degradation models
Following [67], we also show the SR results with the BD degradation model and further introduce the DN degradation model. The proposed FENet is compared with state-of-the-art methods including SPMSR [45], SRCNN [14], FSRCNN [15], VDSR [26], IRCNN_G [64], IRCNN_C [64], and SRMDNF [65]. As shown in Table 9, FENet performs the best on all datasets under both the BD and DN degradation models. These significantly better results indicate that FENet adapts well to scenarios with multiple degradation models.
In Fig 10, we show two sets of visual results with the BD and DN degradation models from the standard benchmark datasets. For the BD degradation model, the proposed FENet suppresses blurring artifacts and recovers sharper edges. For the DN degradation model, FENet not only handles the noise efficiently, but also recovers details more accurately. These comparisons further showcase the robustness and effectiveness of our method under the BD and DN degradation models.

5) Perceptual Metrics
Perceptual metrics better reflect human judgment of image quality. In this paper, the Perceptual Index (PI) [7], for which lower values indicate better perceptual quality, is chosen as the perceptual metric. Table 11 reports the PI for the methods with publicly available source code and a comparable number of parameters. We observe that our proposed model obtains better (lower) PI than all the compared baselines. This demonstrates the ability of the proposed FENet to generate realistic images.

V. LIMITATIONS AND FUTURE WORK
Although our method is the fastest among the compared SR approaches, we have identified the bicubic interpolation operation in (10) as one of the main computational bottlenecks. We thus hypothesize that substituting it with a more efficient operation or implementation would effectively speed up our model. Furthermore, the loss function adopted by our method is distortion-oriented rather than perception-oriented, which limits the perceptual quality of the reconstructed HR images.
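The substitution hypothesized above can be illustrated as a drop-in change of interpolation mode. The sketch below only shows that the two modes are interchangeable in shape; which cheaper mode (e.g., nearest-neighbour) best preserves accuracy is an open question, and the tensor sizes here are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)

# Current choice in (10): bicubic upsampling back to the feature resolution.
up_bicubic = F.interpolate(x, scale_factor=4, mode='bicubic', align_corners=False)

# A cheaper drop-in candidate: nearest-neighbour upsampling.
up_nearest = F.interpolate(x, scale_factor=4, mode='nearest')

# Both produce features of identical shape, so the surrounding block is unchanged.
assert up_bicubic.shape == up_nearest.shape
```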
In future work, we will explore extensions of the proposed framework to other image restoration applications, such as deblocking, inpainting, and low-light image enhancement. We also wish to further develop this work by applying our technique to video data. Many streaming services require large storage to provide high-quality videos. In conjunction with our approach, one may devise a service that stores low-quality videos and runs them through our SR system to produce high-quality videos on the fly.

VI. CONCLUSION
This paper presents a novel Frequency-based Enhancement Block (FEB). This block naturally decomposes features into low and high frequencies and explicitly allocates more computational capacity to the high-frequency ones, thus improving the discriminative capabilities of the network. The proposed FEB can be used as a direct replacement for commonly used SR blocks. We showed that replacing SR blocks with FEB consistently improves the reconstruction error (PSNR: +0.08dB on average) while halving the number of parameters in the model. Furthermore, we showed that the proposed block is orthogonal and complementary to attention-based SR methods. Based on FEB, we proposed a lightweight Frequency-based Enhancement Network (FENet) for accurate image SR. Experimental results on several benchmark datasets demonstrate that our method achieves superior performance at a moderate model size. We hope that the idea of decomposing low- and high-frequency information at the feature level for adaptive computation provides the computer vision community with a different perspective on network architecture design.

A. ABBREVIATIONS
The abbreviations and acronyms used in this paper are introduced at first mention in the text; for convenience, they are also summarized in Table 12.

B. DATA DESCRIPTION
In Table 13, we list a number of image datasets commonly used by the SR community and in this work. We specifically indicate the number of HR images, average resolution, image formats, and category keywords.