Multi-Scale Fast Fourier Transform Based Attention Network for Remote-Sensing Image Super-Resolution

Recently, with the rise and progress of convolutional neural networks (CNNs), CNN-based remote-sensing image super-resolution (RSSR) methods have advanced considerably and shown great power in image reconstruction tasks. However, most of these methods cannot handle well the enormous number of objects of different scales contained in remote-sensing images, which limits super-resolution performance. To address this issue, we propose a multiscale fast Fourier transform (FFT) based attention network (MSFFTAN), which employs a multiinput U-shape structure as its backbone for accurate RSSR. Specifically, we carefully design an FFT-based residual block consisting of an image domain branch and a Fourier domain branch to extract local details and global structures simultaneously. In addition, a local-global channel attention block is developed to further enhance the reconstruction of small targets. Finally, we present a branch gated selective block to adaptively explore and aggregate features from multiple scales and depths. Extensive experiments on two public datasets demonstrate the superiority of MSFFTAN over state-of-the-art (SOTA) approaches in terms of both quantitative metrics and visual quality. The peak signal-to-noise ratio of our network is 1.5 dB higher than that of the SOTA method on the UCMerced LandUse dataset with downscaling factor 2.

remote-sensing imaging [1], [2], [3], [4], medical imaging [5], [6], [7], and face recognition [8]. SISR is a classic ill-posed problem since numerous distinct HR images can be mapped to the same LR image, which poses a significant challenge to the restoration task. In recent years, with the rapid development and popularization of aerospace technology, remote-sensing vision has attracted the attention of an increasing number of researchers. In the field of remote sensing, the long distance between the imaging device and the target objects leads to a low resolution of those objects, which affects the performance of subsequent high-level tasks (object detection [9], [10], classification [1], [11], and change detection [12], [13]). The most straightforward solution to this problem is to upgrade the physical equipment to obtain clearer HR images, but this is often unrealistic and costly. Therefore, hardware-agnostic single-image super-resolution (SISR) techniques have become the preferred approach for enhancing the resolution of remote-sensing images.
To improve image resolution, researchers have proposed a variety of approaches, ranging from interpolation-based and reconstruction-based methods to example-based methods. Interpolation-based methods use the pixels around an unknown pixel to predict its value, which tends to produce blurred images with artifacts. To solve these problems, reconstruction-based methods often introduce various kinds of prior knowledge (sparse prior [14], low-rank prior [15], nonlocal prior [16], and edge prior [12]) to constrain the solution space in pursuit of a better reconstruction. Nevertheless, once the introduced prior knowledge conflicts with the actual image, reconstruction performance drops dramatically. In addition, reconstruction-based methods often require long optimization times. Example-based methods establish a direct mapping from LR to HR using hand-designed features, but the poor generalization of hand-designed features limits their practical application.
Recently, SISR methods based on convolutional neural networks (CNNs) have substantially outperformed traditional methods owing to the powerful feature extraction capability of deep neural networks. Dong et al. [17] pioneered the introduction of a CNN into the SISR task with unprecedented success. Since then, various CNN-based super-resolution networks have emerged. Kim et al. [18] constructed a deep network with 20 layers by introducing residual learning. Deeper and larger networks have become increasingly common in the search for better reconstruction results. Lim et al. [19] constructed a deep network with 50 convolution layers by discarding batch normalization [20] and won the NTIRE 2017 challenge. Thanks to the booming development of natural image super-resolution, deep learning-based algorithms for remote-sensing image super-resolution (RSSR) have made great progress. Despite the impressive results obtained by these approaches, the majority of them recover features at a single scale, making it difficult for the networks to extract multiscale information efficiently. Therefore, it is important to investigate multiscale feature extraction.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Some recent work has initiated efforts in this direction. The residual aggregation and split attention fusion network [2] uses a UNet-based encoder and decoder structure to extract both shallow semantic information and high-level features. Although this approach is capable of extracting multiscale features, its frequent up- and downsampling causes irreversible information loss, which eventually affects the reconstruction results. To minimize this information loss, we introduce a dense feature fusion approach. Specifically, the fusion takes into account not only the output of the current layer's encoder but also the outputs of the previous and next layers' encoders. In addition, it is not enough to extract multiscale features only at the feature level. We therefore introduce an auxiliary branch to extract features at different scales directly in the image domain. In this way, we are able to exploit multiscale features in both the image and feature domains. For a super-resolution task, both low-frequency and high-frequency information are critical. Since the standard residual block [21] lacks the ability to integrate high-frequency features, a fast Fourier transform (FFT) is applied on top of the residual block. It is worth noting that each value in the frequency domain depends on all the values of the original image features, allowing us to easily obtain global dependencies. Therefore, an FFT-based residual block (FFT-RB) can utilize both global and local information. To further strengthen the discriminative power of the network, a novel attention mechanism called local-global channel attention is introduced.
The main contributions of this article can be summarized as follows:
1) For the accurate remote-sensing image super-resolution (RSSR) task, we propose a novel SR approach named multiscale FFT-based attention network (MSFFTAN). MSFFTAN incorporates a multiinput encoder-decoder structure that can capture objects of different sizes in remote-sensing images.
2) To enable efficient extraction of high-frequency features, the FFT is incorporated into the ResBlock. In this way, high- and low-frequency information can be aggregated in a comprehensive manner. This operation ensures that our model can obtain rich features to recover texture and edge information efficiently.
3) An effective local-global channel attention block (LGCAB), consisting of a global branch and a local branch, is developed in MSFFTAN to enable the network to focus on more useful information, which is beneficial to feature learning and model training.
The rest of this article is organized as follows. Section II discusses relevant RSSR research. The MSFFTAN network design is described in full in Section III. In Section IV, the experimental design and results, including the ablation analysis, are presented. Finally, Section V concludes this article.

II. RELATED WORKS
In this section, we review the approaches most relevant to our method, which include CNN-based natural image super-resolution and RSSR. Since CNN-based approaches have shown outstanding performance in recent years, we mainly introduce CNN-based methods.

A. CNN-Based Natural Image Super-Resolution
CNN-based techniques have dominated SISR in recent years, thanks to the fast growth of deep convolutional neural networks. Dong et al. [17] introduced the first CNN-based SISR approach (SRCNN). Although SRCNN has only three convolutional layers, it outperformed earlier conventional approaches. Kim et al. [18] used residual learning [21] to build VDSR, a deep model with 20 convolutional layers that outperforms SRCNN significantly. This suggested that deeper networks achieve better performance. Since then, researchers have sought higher performance by creating deeper, wider, and more complicated networks. Following that, EDSR [19] built a network of around 50 layers by eliminating unnecessary components. Nevertheless, this method treats all LR features equally and overlooks their long-range dependencies, resulting in inefficient detail recovery. Thus, several techniques have recently been developed that incorporate an attention mechanism into a CNN-based super-resolution (SR) model to rebalance the relevance of various features. Zhang et al. [22] used a residual-in-residual structure to build a network with over 400 layers to improve reconstruction performance. The context reasoning attention network was developed by Zhang et al. [23] to adaptively adjust the convolution kernel according to the global context. Mei et al. [16] combined the nonlocal operation and sparse representation in an SISR task and proposed nonlocal sparse attention to alleviate the large computational resources required by the nonlocal operation. In addition, using a coarse-to-fine approach, a two-stage attentive network [24] was presented for accurate SISR.

B. Remote-Sensing Image Super-Resolution
Remote-sensing image SR has recently gained significant attention, and deep learning-based algorithms [25] have considerably surpassed the early classical methods.
LGCNet [26] is the first CNN-based model for super-resolution in remote-sensing images, using both local and global representations to learn the reconstructed SR image. Dong et al. proposed SMSR [27], which aggregates diverse multiscale features utilizing first-order and second-order learning mechanisms. Meanwhile, the attention mechanism has made significant advances in a variety of computer vision tasks, such as image classification [11] and object detection [9]. Thus, the attention mechanism was introduced to the field of remote-sensing image SR. HSENet [28] exploits single-scale and cross-scale self-similarity information using multiscale nonlocal attention. A split attention fusion block was established by Chen et al. [2], allowing the method to adapt to varied multiscale land surface reconstructions. Rather than exploring first-order attention (channel or spatial statistics), Zhang et al. [29] advocated a high-order attention block to restore the missing details. Salvetti et al. [25] proposed the residual attention multiimage super-resolution network, which leverages feature extraction from multiple LR images of the same scene, resulting in reconstructed images with fine texture details. Hu et al. [30] proposed a network that utilizes an HR, spatially lossless multispectral image to guide the super-resolution reconstruction of an LR hyperspectral image. The experimental results demonstrate that this strategy can effectively preserve spatial detail information in the recovered image. Xu et al. [31] utilized an iterative regularization technique based on tensor subspace representation to fuse paired multispectral and hyperspectral images, thereby reconstructing HR hyperspectral images with distinct texture and sharp edges. Hong et al. [32] proposed a decoupled and coupled hyperspectral image super-resolution algorithm that progressively aggregates hyperspectral and multispectral information. Their experiments demonstrated that this fusion method can enhance the quality of reconstruction. In addition, CUCaNet [33] proposed a cross attention module to efficiently explore the spatial-spectral information. Furthermore, many researchers have introduced generative adversarial networks (GANs) [34] into remote-sensing SR tasks to generate perceptually pleasing remote-sensing images. Pan et al. [35] introduced the concept of back-projection into a generator to further enhance the visual quality. In addition, an attention-based GAN (SRAGAN) was proposed by Li et al. [36], which combined both local and global attention mechanisms to distinguish features at various sizes on different objects. Building on the success of the transformer in the fields of natural language processing and computer vision, Lei et al. [37] used a transformer to fuse high- and low-frequency information to reconstruct detail-rich images.

III. PROPOSED METHODS
In this section, we introduce MSFFTAN for remote-sensing super-resolution. First, the overall framework of MSFFTAN is presented in Section III-A. Then, the branch gated selective block (BGSB), FFT-RB, and LGCAB are described in the following three subsections.

A. Network Architecture
As shown in Fig. 1, MSFFTAN mainly consists of the following four parts: 1) an auxiliary path (AP); 2) a shallow feature extraction block (SFEB); 3) a multiscale deep feature extraction module; and 4) a reconstruction block. MSFFTAN is a multiscale feature extraction approach that fully leverages the multiscale features retrieved from an input image. Its architecture is based on a three-stage U-shape structure [38], substantially extended for efficient multiscale feature representation. Specifically, MSFFTAN is composed of three encoder blocks (EBs) and decoder blocks (DBs). Each EB or DB is composed of multiple cascaded FFT-RBs. We define $I_{LR} \in \mathbb{R}^{H \times W \times 3}$, $I_{SR} \in \mathbb{R}^{sH \times sW \times 3}$, and $I_{HR} \in \mathbb{R}^{sH \times sW \times 3}$ as the input LR image, the reconstructed SR image, and the corresponding HR image, respectively. Here, $H$ and $W$ denote the height and width of the LR image, respectively, and $s$ represents the upsampling factor.
For $I_{LR_1}$, the SFEB transforms the original LR image into the feature domain as $F_0 = H_{SFEB}(I_{LR_1})$, where $F_0$ represents the shallow feature extracted from the LR image and $H_{SFEB}(\cdot)$ denotes the shallow feature extraction block. In detail, the SFEB consists of two $3 \times 3$ Conv layers with an activation unit, where $\mathrm{Conv}_{3 \times 3}(\cdot)$ and $\delta(\cdot)$ denote a $3 \times 3$ Conv layer and the rectified linear unit (ReLU) activation function, respectively. Then, the extracted shallow feature $F_0$ is fed to the subsequent EBs. Here, $H_{EB_i}$ stands for the $i$th EB, which consists of multiple FFT-RBs, and $F_{EB_i}$ represents the deep encoder feature extracted by the $i$th EB; $H_{SFEB}$ and $H_{BGSB}$ denote the SFEB and BGSB, respectively. Every EB except the first receives not only the downsampled features of the previous encoder but also information from the correspondingly downsampled image. In this way, each EB is expected to handle multiscale features successfully by utilizing the complementary information from the downsampled feature space and the features available from the image domain. To alleviate the inconsistency between the image and feature domains, we use the BGSB for feature selection and feature fusion. In this work, we use a total of three encoder layers. The DBs are defined analogously: $H_{DB_i}$ represents the $i$th DB, which consists of multiple FFT-RBs, and $F_{DB_i}$ stands for the deep decoder feature extracted by the $i$th decoder layer. $\mathrm{Up}_{factor}(\cdot)$ and $\mathrm{Down}_{factor}(\cdot)$ indicate the upsampling and downsampling operations, respectively, where factor is the scaling factor. Notably, to further enhance the network's ability to extract multiscale features, we aggregate features of different sizes and dimensions using the BGSB module; the result is denoted as $F_{fuse_i}$. Thus, $F_{fuse_i}$ contains rich structural information. Finally, the deep decoder feature is fed into the reconstruction block, which consists of a Conv layer, a subpixel layer, and a Conv layer, where $H_{\uparrow}(\cdot)$ is the upscale operation and $I_{SR}$ represents the recovered HR image.
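The multiinput scheme can be illustrated with a small sketch. The code below builds the image pyramid that an auxiliary path could feed to the three encoder stages. It assumes a downsampling factor of 2 between stages and uses 2 × 2 average pooling as a stand-in for the downsampling operator, which is not specified here; `downsample2` and `build_inputs` are hypothetical names used only for illustration.

```python
def downsample2(img):
    """Halve an H x W grid (nested lists) by averaging each 2 x 2 block.
    Average pooling is an assumption made for illustration."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)]
            for i in range(h)]


def build_inputs(img, stages=3):
    """Return one input per encoder stage (full, 1/2, and 1/4 resolution for stages=3)."""
    inputs = [img]
    for _ in range(stages - 1):
        inputs.append(downsample2(inputs[-1]))
    return inputs


# A 4 x 4 single-channel "image" with values 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = build_inputs(image)
sizes = [(len(p), len(p[0])) for p in pyramid]
```

In this reading, the first encoder stage sees the full-resolution input, while the second and third stages see progressively coarser copies that are fused with the downsampled encoder features.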
We further adopt a residual connection between the shallow feature and the deep feature to ease training. In this way, the network is forced to focus on the lost high-frequency information, which accelerates convergence.
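The subpixel layer in the reconstruction block rearranges channels into spatial positions. The following is a minimal plain-Python sketch of that rearrangement, assuming the common pixel-shuffle channel ordering; the function name `pixel_shuffle` and the nested-list tensor layout are illustrative choices, and a practical implementation would use a tensor library instead.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r) x H x W feature (nested lists) into C x (H*r) x (W*r).

    Channel c*r*r + i*r + j of the input supplies the output pixel at
    row offset i and column offset j inside each r x r block of channel c.
    """
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1 x 2 channels become one 2 x 4 channel for upscale factor r = 2.
lr_feat = [[[1.0, 5.0]], [[2.0, 6.0]], [[3.0, 7.0]], [[4.0, 8.0]]]
sr = pixel_shuffle(lr_feat, 2)
```

For an upscale factor s, the Conv layer before the subpixel layer therefore has to produce s² times as many channels as the layer after it consumes.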

B. Branch Gated Selective Block
Simple concatenation and summation are the most frequent strategies for feature aggregation. However, these choices limit the representation capability of the network. Based on the fact that visual cortical neurons can adaptively change their receptive fields depending on the intensity of the stimulus [39], we propose a novel multiscale multiresolution feature fusion block named BGSB (see Fig. 2), which is composed of branch aggregation (BA) and gate selective fusion (GSF).
1) Branch Aggregation: The BA generates global feature descriptors by combining the information from multiresolution branches. Specifically, the downsampled feature $F_1 \in \mathbb{R}^{H \times W \times C}$ and the feature obtained from the downsampled image $F_2 \in \mathbb{R}^{H \times W \times C}$ are summed to form the input $F$, and global average pooling (GAP) is utilized to squeeze the global spatial information into a channel descriptor $Z$, which can be expressed as $Z_c = H_{GAP}(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j)$, where $H_{GAP}(\cdot)$ denotes the global average pooling operation, $F_c$ and $Z_c$ denote the $c$th channel of the input feature and of the output of $H_{GAP}$, respectively, and $F_c(i, j)$ is the value at position $(i, j)$ of the $c$th channel of the input feature $F$.
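As a concrete illustration of branch aggregation, the sketch below sums two features and applies global average pooling per channel. The nested-list layout (channels × rows × columns) and the function name `branch_aggregate` are assumptions made for the example.

```python
def branch_aggregate(f1, f2):
    """Sum two C x H x W features element-wise, then apply global average pooling.

    Returns the summed feature F and the channel descriptor Z (one value per channel).
    """
    fused = [[[a + b for a, b in zip(row1, row2)]
              for row1, row2 in zip(ch1, ch2)]
             for ch1, ch2 in zip(f1, f2)]
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fused]
    return fused, z


# Two branches with C = 2 channels of size 2 x 2.
f1 = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
f2 = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]]]
fused, z = branch_aggregate(f1, f2)
```

Each entry of `z` summarizes one whole channel, which is what lets the later gating step reason about the two branches globally.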

2) Gate Selective Fusion: The channel statistic $Z$ may be thought of as a collection of local descriptors whose statistics can be utilized to represent the entire image. To make full use of the multiresolution feature interdependences, we employ a simple gating mechanism in which $Z$ is passed through $1 \times 1$ convolutions, where $\delta(\cdot)$ denotes the activation function and $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolution. Then, we use the softmax function to obtain the attention weights of each branch, where $Z_1$ and $Z_2$ represent the attention weights of the two resolution branches. These descriptors are used by the GSF operator to recalibrate the feature maps after aggregation. In this way, branches of different resolutions, which carry information at different scales, can be aggregated adaptively.
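The gating step can be sketched as follows. Because the exact equations are not reproduced in this text, the layout below is an assumption modeled on selective-kernel style fusion: one squeezing 1 × 1 convolution with ReLU, one 1 × 1 convolution per branch to produce logits, a softmax across the two branches, and a weighted sum of the branch features. All function and weight names are hypothetical.

```python
import math


def matvec(weight, vec):
    """A 1x1 convolution applied to a C x 1 x 1 descriptor is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def gate_selective_fusion(f1, f2, z, w_squeeze, w_branch1, w_branch2):
    """Fuse two C x H x W branches with per-channel softmax weights derived from z."""
    s = [max(0.0, v) for v in matvec(w_squeeze, z)]        # Conv1x1 followed by ReLU
    a, b = matvec(w_branch1, s), matvec(w_branch2, s)      # logits for each branch
    z1, z2 = [], []
    for a_c, b_c in zip(a, b):                             # softmax across the two branches
        m = max(a_c, b_c)
        e1, e2 = math.exp(a_c - m), math.exp(b_c - m)
        z1.append(e1 / (e1 + e2))
        z2.append(e2 / (e1 + e2))
    fused = [[[w1 * p + w2 * q for p, q in zip(r1, r2)]
              for r1, r2 in zip(ch1, ch2)]
             for ch1, ch2, w1, w2 in zip(f1, f2, z1, z2)]
    return fused, z1, z2


# C = 2 channels, squeezed to a single hidden unit.
f1 = [[[2.0]], [[1.0]]]
f2 = [[[4.0]], [[0.0]]]
fused, z1, z2 = gate_selective_fusion(
    f1, f2, z=[1.0, 3.0],
    w_squeeze=[[0.5, 0.5]],
    w_branch1=[[0.0], [1.0]],
    w_branch2=[[0.0], [0.0]],
)
```

Since the weights of the two branches sum to one for every channel, the fusion interpolates between the branches rather than simply adding them.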

C. FFT-Based Residual Block
Image recovery requires both low-frequency and high-frequency information; however, the standard ResBlock lacks the capacity to integrate high-frequency features. Inspired by Mao et al. [40], we propose an FFT-RB, shown in Fig. 3, which consists of a conventional spatial domain Conv branch and a frequency domain branch. Specifically, to convert information to the frequency domain and extract features complementary to those of the spatial domain, we employ the discrete Fourier transform. Let $X \in \mathbb{R}^{H \times W \times C}$ be the input volume, where $H$, $W$, and $C$ indicate the height, width, and channel number of the feature, respectively. The bottom branch first applies $H_{rFFT2D}(\cdot)$, the 2-D real FFT, to $X$. Then, the real part and imaginary part of the result are concatenated along the channel dimension, where $\mathcal{R}(\cdot)$ and $\mathcal{I}(\cdot)$ extract the real and imaginary parts, respectively, and $[\cdot]$ denotes the concatenation operation. We use two $1 \times 1$ Conv layers to extract high-frequency features, where $\mathrm{Conv}_{1 \times 1}(\cdot)$ and $\delta(\cdot)$ denote the $1 \times 1$ Conv and the ReLU activation function, respectively. Finally, an inverse 2-D real FFT transforms the frequency features back to the spatial domain. It is worth noting that, owing to the intrinsic characteristics of the Fourier transform, the FFT branch obtains a global receptive field without additional burden. Influenced by ConvNeXt [41], we add a large convolution kernel to expand the receptive field in the spatial branch, where $\mathrm{Conv}_{3 \times 3}(\cdot)$ and $\delta(\cdot)$ denote the $3 \times 3$ Conv and the ReLU activation function, respectively, and $\mathrm{Conv}_{7 \times 7}^{DW}(\cdot)$ denotes a depthwise separable convolution [42] with kernel size 7. Then, the final output $Y = X + X_{space} + X_{high}$ of the FFT-RB is passed through the LGCAB to further refine the features.
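The frequency branch can be sketched in plain Python for a single-channel input. This is an illustration under stated assumptions, not the authors' implementation: `rfft2` and `irfft2` are naive O(N²) transforms written here for self-containment (they are not library calls), the two 1 × 1 convolutions are 2 × 2 matrices acting on the stacked real and imaginary channels, and the spatial branch and the residual sum are omitted.

```python
import cmath


def rfft2(x):
    """Naive 2-D real DFT of an H x W grid; keeps only columns 0..W//2."""
    h, w = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * cmath.pi * (u * m / h + v * n / w))
                 for m in range(h) for n in range(w))
             for v in range(w // 2 + 1)]
            for u in range(h)]


def irfft2(xf, w):
    """Invert rfft2 for output width w: rebuild the full spectrum by
    Hermitian symmetry, apply the inverse DFT, and keep the real part."""
    h = len(xf)
    full = [[xf[u][v] if v <= w // 2 else xf[-u % h][w - v].conjugate()
             for v in range(w)] for u in range(h)]
    return [[sum(full[u][v] * cmath.exp(2j * cmath.pi * (u * m / h + v * n / w))
                 for u in range(h) for v in range(w)).real / (h * w)
             for n in range(w)] for m in range(h)]


def conv1x1(feat, weight):
    """1x1 convolution: mix channels independently at every position."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk * feat[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]


def relu(feat):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in feat]


def frequency_branch(x, w1, w2):
    """rFFT -> stack [Re, Im] as channels -> Conv1x1 -> ReLU -> Conv1x1 -> inverse rFFT."""
    h, w = len(x), len(x[0])
    xf = rfft2(x)
    stacked = [[[c.real for c in row] for row in xf],
               [[c.imag for c in row] for row in xf]]
    y = conv1x1(relu(conv1x1(stacked, w1)), w2)
    yf = [[complex(y[0][u][v], y[1][u][v]) for v in range(w // 2 + 1)]
          for u in range(h)]
    return irfft2(yf, w)


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
round_trip = irfft2(rfft2(x), 4)
flat = frequency_branch([[1.0] * 4, [1.0] * 4], identity, identity)
```

Because every frequency coefficient is a sum over all spatial positions, a pointwise operation in this branch affects the whole feature map, which is the global receptive field referred to above.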

D. Local-Global Channel Attention Block
Existing channel attention mechanisms [43] typically build channel descriptors via a global average pooling operation, which overlooks many small objects that play a vital role in RSSR. Hence, to attend to both large and small informative target objects, an LGCAB is proposed, as shown in Fig. 4. It allows the network to concentrate on significant features while still paying attention to the details of small targets. Consider an input feature $F \in \mathbb{R}^{H \times W \times C}$, where $C$, $W$, and $H$ indicate the channel number, width, and height, respectively. The top branch of the LGCAB is in charge of characterizing small objects, whereas the bottom branch is responsible for capturing the essential global features. In the top branch, $A_{local}$ denotes the local channel attention map, $\delta(\cdot)$ denotes the activation function, and $W_D(\cdot)$ and $W_U(\cdot)$ denote the weights of two $1 \times 1$ Conv layers that decrease and increase the number of channels by the reduction factor $r$, respectively. This branch does not use global average pooling, preserving the original resolution of the features and enabling the capture of fine-grained information.
The bottom branch first applies GAP, which makes it possible to concentrate on the attributes of the whole feature map. The GAP operation can be expressed as $H_{GAP}(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j)$, where $F_c(i, j)$ is the value at position $(i, j)$ of the $c$th channel of the input feature $F$. The bottom branch then produces the global channel attention map $A_{global}$, with $\delta(\cdot)$, $W_D(\cdot)$, and $W_U(\cdot)$ defined as in the top branch. As this branch uses global average pooling, it allows the network to focus on large objects that occupy a significant portion of the image. Next, the global and local attention maps are used to rescale the input feature $F$, where $\hat{F}$ indicates the refined output features, and $\sigma(\cdot)$ and $\otimes$ represent the sigmoid function and element-wise multiplication between feature maps, respectively. Through the above steps, we emphasize important information and suppress irrelevant features in both a global and a local manner, thus enhancing the discriminative capacity of the network.
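A minimal sketch of the two-branch attention follows. Since the equations are not reproduced in this text, two details are assumptions: each branch has its own pair of 1 × 1 convolutions (channel reduction, ReLU, channel expansion), and the local and global maps are summed before the sigmoid. The function names and the nested-list layout are illustrative.

```python
import math


def conv1x1(feat, weight):
    """1x1 convolution: mix channels independently at every position."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk * feat[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]


def relu(feat):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in feat]


def gap(feat):
    """Global average pooling: C x H x W -> C x 1 x 1."""
    return [[[sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))]] for ch in feat]


def lgcab(feat, local_w, global_w):
    """Rescale feat by sigmoid(A_local + A_global); each *_w is a (w_down, w_up) pair."""
    a_local = conv1x1(relu(conv1x1(feat, local_w[0])), local_w[1])          # C x H x W
    a_global = conv1x1(relu(conv1x1(gap(feat), global_w[0])), global_w[1])  # C x 1 x 1
    out = []
    for ch, loc, glo in zip(feat, a_local, a_global):
        g = glo[0][0]  # one global value per channel, broadcast over all positions
        out.append([[v / (1.0 + math.exp(-(l + g))) for v, l in zip(row, lrow)]
                    for row, lrow in zip(ch, loc)])
    return out


# C = 2 channels, reduction factor r = 2, spatial size 1 x 2.
feat = [[[1.0, 3.0]], [[2.0, 2.0]]]
zero = ([[0.0, 0.0]], [[0.0], [0.0]])
neutral = lgcab(feat, zero, zero)
gated = lgcab(feat, zero, ([[1.0, 1.0]], [[1.0], [0.0]]))
```

The local map keeps one attention value per position, so a small bright object can raise its own weight even when the channel average is dominated by background.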

E. Loss Function
To optimize RSSR networks, various loss functions have been investigated, such as the L1 loss [19], L2 loss [44], perceptual loss [45], and adversarial loss [34]. As stated by Lim et al. [19], the L2 loss can maximize the peak signal-to-noise ratio (PSNR), but it is prone to producing blurry images. Therefore, the L1 loss is chosen as our optimization function for training MSFFTAN. MSFFTAN is then optimized by minimizing the pixel-wise dissimilarity between the estimated super-resolved image $I_{SR}$ and the corresponding ground truth $I_{HR}$. In the optimization function $L_{L1}$, $\Theta$ denotes the trainable parameters of the MSFFTAN network, and MSFFTAN is trained using a training set that contains $N$ LR image patches and their HR counterparts. Auxiliary loss terms, in addition to the L1 loss, have been suggested in recent research for performance enhancement. Auxiliary loss terms that reduce the distance between the input and output in the feature space have been frequently employed in image restoration tasks and have shown promising results. Since the primary objective of super-resolution is to recover the lost high-frequency features, it is critical to minimize the discrepancy in frequency space. To this end, we introduce an FFT-based frequency reconstruction loss $L_{FFT}$, which measures the Euclidean distance between the HR images and SR images in the Fourier domain, where $\mathcal{F}$ denotes the FFT that transfers the image domain to the frequency domain. The final loss function for training our network is the weighted sum $L_{L1} + \tau L_{FFT}$, where we experimentally set $\tau = 0.01$.
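The combined objective can be sketched for a single grayscale image pair as follows. The normalization choices are assumptions: the L1 term is averaged over pixels, and the frequency term is the unnormalized Euclidean norm of the difference between the two spectra; the exact normalization is not stated in this text. `dft2` is a naive transform written for self-containment, and all function names are illustrative.

```python
import cmath


def dft2(img):
    """Naive 2-D DFT of an H x W grid of real values."""
    h, w = len(img), len(img[0])
    return [[sum(img[m][n] * cmath.exp(-2j * cmath.pi * (u * m / h + v * n / w))
                 for m in range(h) for n in range(w))
             for v in range(w)] for u in range(h)]


def l1_loss(sr, hr):
    """Mean absolute pixel difference."""
    diffs = [abs(a - b) for rs, rh in zip(sr, hr) for a, b in zip(rs, rh)]
    return sum(diffs) / len(diffs)


def fft_loss(sr, hr):
    """Euclidean distance between the spectra of the two images."""
    fs, fh = dft2(sr), dft2(hr)
    return sum(abs(a - b) ** 2
               for rs, rh in zip(fs, fh) for a, b in zip(rs, rh)) ** 0.5


def total_loss(sr, hr, tau=0.01):
    """L1 loss plus tau times the frequency reconstruction loss."""
    return l1_loss(sr, hr) + tau * fft_loss(sr, hr)


sr = [[1.0, 2.0], [3.0, 4.0]]
hr = [[1.0, 2.0], [3.0, 5.0]]
loss = total_loss(sr, hr)
```

In this toy case a single wrong pixel spreads over all four frequency coefficients, so the frequency term penalizes the error at every frequency at once.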

IV. EXPERIMENT
In this section, experiments are conducted to evaluate our proposed model. The datasets and metrics employed in our experiments are described in Section IV-A. Then, the implementation details are presented in Section IV-B. Section IV-C compares our model with state-of-the-art (SOTA) methods on several datasets to show that our proposed approach is superior. Finally, an ablation study is performed in Section IV-D to analyze the contribution of each component of our MSFFTAN network.

A. Datasets and Metrics
To test the effectiveness of the proposed approach, we utilize the following two publicly available datasets (some examples are shown in Fig. 5): 1) UCMerced LandUse [46]; and 2) AID [47]. These datasets have been widely used in remote-sensing super-resolution [28], [36], [37]. The LR images were produced by downsampling the HR images by the scale factor using bicubic interpolation in MATLAB.
UCMerced LandUse dataset: This collection includes agricultural, runway, sparseresidential, storagetanks, and other remote-sensing scene types. Each class has 100 images, each of which is 256 × 256 pixels in size and has a spatial resolution of 0.3 m/pixel. This dataset was divided into two halves, training and test, with 20% of the training set used for validation.
AID dataset: This dataset contains 10 000 images from 30 different types of remote-sensing scenes, such as airports, bareland, churches, and dense-residential areas. All of the images are 600 × 600 pixels, with a spatial resolution of up to 0.5 m/pixel. Following TransENet [37], 80% of the whole dataset is randomly selected as the training set, and the remaining images are used as the test set. Moreover, we randomly select five images per class, for a total of 150 images, to construct the corresponding validation set.
Metrics: Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [48] are chosen as the common image super-resolution evaluation metrics, and all super-resolution results are evaluated in the RGB space. In addition, we introduce the learned perceptual image patch similarity (LPIPS) [49] to evaluate the reconstruction quality of the competing methods. A lower LPIPS value indicates a higher perceptual quality. We also analyze the floating point operations (FLOPs) and runtime of the models. Note that FLOPs are calculated for a 48 × 48 image.
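For reference, PSNR follows directly from the mean squared error between the reconstruction and the ground truth. The sketch below computes it for a single-channel image given as nested lists; the peak value of 255 (8-bit images) and the function name `psnr` are assumptions of the example, and the RGB evaluation used here would average the squared error over all three channels.

```python
import math


def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    diffs = [(a - b) ** 2 for rs, rh in zip(sr, hr) for a, b in zip(rs, rh)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


hr = [[100.0, 110.0], [120.0, 130.0]]
sr = [[101.0, 109.0], [121.0, 129.0]]  # every pixel is off by one gray level
value = psnr(sr, hr)
```

Because PSNR is logarithmic in the error, a gain of 1.5 dB corresponds to reducing the mean squared error by a factor of about 10^0.15 ≈ 1.41.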

B. Implementation Details
To obtain better generalization performance, we use data augmentation, which includes random rotation by 90°, random horizontal flipping, and random vertical flipping. We use the PyTorch framework to implement and train the proposed MSFFTAN, and the model is trained on one NVIDIA GeForce RTX 3090. We train separate models, with random initialization, to super-resolve the remote-sensing images for scale factors 2, 3, and 4. The ADAM [50] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ is used. The learning rate is initialized as $2 \times 10^{-4}$ and halved every 400 epochs. For training, we randomly crop 16 LR patches of size 48 × 48 as a training batch, with the HR patch size scaled according to the scaling factor. In our MSFFTAN, all convolution layers contain 64 filters except the 1 × 1 convolution layers. Specifically, the numbers of FFT-RBs at the three backbone depths are 3, 2, and 1.
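The schedule and augmentation described above can be sketched in a framework-independent way. The step-decay formula and the HR patch size follow from the stated settings; applying each augmentation with probability 0.5 and the helper names are assumptions of this example.

```python
import random

BASE_LR = 2e-4   # initial learning rate
LR_PATCH = 48    # LR patch side length in pixels


def learning_rate(epoch):
    """Step decay: halve the learning rate every 400 epochs."""
    return BASE_LR * 0.5 ** (epoch // 400)


def hr_patch_size(scale):
    """Side length of the HR patch that matches a 48 x 48 LR patch."""
    return LR_PATCH * scale


def augment(lr_patch, hr_patch, rng):
    """Apply the same random flips and 90-degree rotation to both patches."""
    if rng.random() < 0.5:   # horizontal flip
        lr_patch = [row[::-1] for row in lr_patch]
        hr_patch = [row[::-1] for row in hr_patch]
    if rng.random() < 0.5:   # vertical flip
        lr_patch, hr_patch = lr_patch[::-1], hr_patch[::-1]
    if rng.random() < 0.5:   # rotate 90 degrees counterclockwise
        lr_patch = [list(r) for r in zip(*lr_patch)][::-1]
        hr_patch = [list(r) for r in zip(*hr_patch)][::-1]
    return lr_patch, hr_patch


rng = random.Random(0)
lr_in = [[1, 2], [3, 4]]
hr_in = [[10 * r + c for c in range(4)] for r in range(4)]
lr_out, hr_out = augment(lr_in, hr_in, rng)
```

The LR and HR patches must receive identical transforms; otherwise the pixel-wise loss would compare misaligned content.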

1) Quantitative Results: To demonstrate the superiority of MSFFTAN, eight SOTA super-resolution methods, including Bicubic, SRCNN [17], FSRCNN [51], VDSR [18], LGCNet [26], DCM [52], HSENet [28], and CTN [53], are compared in terms of quantitative and visual quality on the UCMerced LandUse dataset. Among them, SRCNN, FSRCNN, and VDSR are approaches proposed for the natural image SR task, while LGCNet, DCM, HSENet, and CTN are currently leading SR methods developed exclusively for remote-sensing images. It should be noted that we evaluate the comparison methods using their open-source code, and all of these methods are trained and evaluated in the same environment. Specifically, quantitative evaluations are made on two datasets for three scales (×2, ×3, and ×4). Table I displays the average results of the different approaches on the UCMerced LandUse test dataset, which clearly reveal that MSFFTAN outperforms other advanced methods by a wide margin, offering the best restoration results for all three upscale factors. Specifically, our model achieves improvements of 1.66, 1.17, and 1.24 dB over the second-best method (HSENet) for the three upscale factors. Furthermore, for the SSIM metric, our model outperforms HSENet by 0.0394, 0.0395, and 0.0386, respectively. Moreover, the complexity of MSFFTAN is half that of HSENet, which is attributed to the ability of our network to fully explore and exploit multiscale information. The AID dataset is utilized to further evaluate the generalization performance, since the images in this dataset contain more categories and a higher disparity than those in the UCMerced LandUse dataset. On this dataset, we evaluate MSFFTAN against several SR algorithms, including Bicubic, SRCNN, LGCNet, VDSR, DCM, and TransENet [37].
TABLE I: PSNR/SSIM RESULTS ON UCMERCED LANDUSE DATASET OF SCALE ×2, ×3, AND ×4
TABLE II: PSNR/SSIM RESULTS ON AID DATASET OF SCALE ×2, ×3, AND ×4
According to Table II, MSFFTAN has the highest average PSNR and SSIM scores for all three upscale factors. More specifically, compared with the currently leading method TransENet, we improve the PSNR (SSIM) from 35.28 (0.9374) to 37.04 (0.9626) for upscale factor 2 and from 29.38 (0.7909) to 30.78 (0.8185) for upscale factor 4. The results reveal that in most circumstances MSFFTAN exceeds the existing leading approaches, confirming its stronger generalization ability. In addition, Table III provides a detailed comparison of several approaches for all 30 scene classes of the AID dataset at an upscale factor of 4. MSFFTAN yields the highest PSNR scores in 14 scene classes, while TransENet scores better in the remaining scene categories. It is worth mentioning, however, that MSFFTAN obtains an average PSNR that is 1.4 dB higher than that of TransENet. To further demonstrate the superiority of our proposed method, we employ the LPIPS metric, for which a higher value indicates lower image quality. As seen in Table IV, MSFFTAN outperforms the other approaches. Specifically, the LPIPS of MSFFTAN is 0.0009 lower than that of the current SOTA method HSENet for scale factor 2. This indicates that the images reconstructed by our method are perceptually closer to the ground truth for the human visual system. Finally, as demonstrated in Fig. 6, MSFFTAN converges faster, further highlighting the effectiveness of the proposed method. These positive results support the efficacy of our method.
2) Visual Comparison: To further validate the efficacy of MSFFTAN, we compare its visual quality with that of current leading approaches. Figs. 7-11 display multiple example super-resolution results from the test set obtained with various approaches, together with the HR images. A close-up of the region marked by the red rectangle is displayed below each image for convenient comparison. According to Fig. 7, MSFFTAN reconstructs the images closest to the HR images. It is noteworthy that the self-attention-based HSENet and TransENet produce significant checkerboard effects and artifacts. We conjecture that this is because the self-attention mechanism is influenced by noise and degradation and thus aggregates incorrect information. As displayed in Fig. 8, MSFFTAN produces the clearest parking spaces at a large magnification, whereas other approaches yield varying degrees of blurring, distortion, and warping, which further demonstrates the superiority of our method. As seen in Fig. 9, the zebra crossing recovered by the second-best network loses many lines, whereas our MSFFTAN provides the image closest to the HR image. Furthermore, other approaches cause artifacts in the most challenging situation (magnification scale of 4×), whereas our method produces good visual results, as shown in Figs. 10 and 11. As depicted in Fig. 11, MSFFTAN best preserves the appearance of the yacht, while other methods exhibit varying degrees of distortion and degradation. To further verify the generalization performance of the proposed method, we tested it on real remote-sensing images. As shown in Fig. 12, MSFFTAN has better reconstruction performance than the leading RSSR methods. Specifically, MSFFTAN is able to recover better lines (as shown by the red arrow in the figure), while the lines in the images recovered by HSENet and TransENet are distorted.
From the above analysis, we can conclude that MSFFTAN can produce visually satisfying HR images, which have rich and real textures, sharp edges, and clear boundaries.

D. Ablation Study
1) Study of the Number of FFT-RBs in Encoder and Decoder:
In this subsection, we investigate the effect of the number of basic blocks on network performance, as network depth has a substantial impact on reconstruction quality. Table V compares the reconstruction results on the UCMerced LandUse dataset under different basic block settings at an upscale factor of 2. Specifically, MSFFTAN_abc denotes a backbone whose different depths contain a, b, and c basic blocks, respectively. We can see that when the numbers of FFT-RBs in the encoder and decoder are set to 3, 2, and 1 (MSFFTAN_321), MSFFTAN obtains the highest PSNR and SSIM. It is worth noting that when we increase the number of blocks to reach MSFFTAN_333, the performance of the network drops, which we ascribe to overfitting caused by the larger number of parameters. This also demonstrates that an appropriate block setting can further enhance the reconstruction quality.
2) Effectiveness of Multiinput Mechanism: The multiinput strategy is an essential part of multiscale information exploration and aggregation. To achieve improved performance, the multiinput technique is designed to preserve as much of the original multiscale information in remote-sensing images as feasible. Here, we investigate the effect of this design with different inputs. According to Table VI, when we add the Input-2 AP, our model achieves a 0.007-dB improvement. In addition, by adding the Input-3 AP, we get a 0.06-dB improvement. We find that the benefit of providing extra auxiliary pathways grows as the network deepens, because more shallow information is lost as MSFFTAN becomes deeper. As a result, we may conclude that using a multiinput strategy leads to improved performance.

3) Study of BGSB:
BGSB is specifically designed for multiscale and hierarchical feature exploration and aggregation. In this part, we perform a series of experiments to illustrate the efficacy of BGSB by comparing it with the SUM operation and the CONCAT operation (as shown in Table VII). In comparison with the SUM operation, our BGSB improves PSNR and SSIM by 0.019 dB and 0.000016, respectively, with almost no increase in parameters. In addition, the FLOPs of these two operations are almost identical, but BGSB achieves better performance. More importantly, when compared with the CONCAT operation, BGSB has a significant advantage in both performance and complexity. Specifically, BGSB obtains a boost of 0.116 dB and 0.0013 in PSNR and SSIM, respectively, while requiring only 96% of the parameters and 97% of the FLOPs. These positive results support the efficacy of our BGSB and also demonstrate that an appropriate multiscale feature fusion approach can considerably improve reconstruction quality.
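The three fusion strategies compared above can be contrasted with a small NumPy sketch. SUM and CONCAT follow their usual definitions; `fuse_gated` is only a generic gated selection written for illustration and is not the BGSB architecture, whose internals are not restated in this section. All function names, shapes, and weights are assumptions.

```python
import numpy as np


def fuse_sum(a, b):
    """SUM fusion: element-wise addition, no learnable parameters."""
    return a + b


def fuse_concat(a, b, weight):
    """CONCAT fusion: stack channels, then mix them with a 1x1 convolution.

    a, b have shape (C, H, W); weight has shape (C, 2C).
    """
    stacked = np.concatenate([a, b], axis=0)          # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)  # (C, H, W)


def fuse_gated(a, b, gate_weight):
    """Generic gated selection (illustrative only, not the paper's BGSB).

    A per-channel gate in (0, 1) is predicted from the globally pooled
    statistics of both branches and used to blend them.
    """
    pooled = np.concatenate([a.mean(axis=(1, 2)), b.mean(axis=(1, 2))])  # (2C,)
    gate = 1.0 / (1.0 + np.exp(-(gate_weight @ pooled)))                  # (C,)
    gate = gate[:, None, None]                                            # (C, 1, 1)
    return gate * a + (1.0 - gate) * b


rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
a = rng.standard_normal((channels, height, width))
b = rng.standard_normal((channels, height, width))
w_concat = rng.standard_normal((channels, 2 * channels))
w_gate = rng.standard_normal((channels, 2 * channels))

for name, out in [("sum", fuse_sum(a, b)),
                  ("concat", fuse_concat(a, b, w_concat)),
                  ("gated", fuse_gated(a, b, w_gate))]:
    print(name, out.shape)
```

In this sketch the gated output is always a per-channel convex combination of the two branches, so the fusion can favor one scale over the other depending on the input, which fixed SUM weights cannot do.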

4) Study of LGCAB:
In LGCAB, we use a dual-branch structure to extract small- and large-size information simultaneously. To verify the effectiveness of LGCAB, we perform ablation experiments in which we remove LGCAB or replace it with other commonly used channel attention blocks (e.g., SE or CBAM). The results of these modified networks are shown in Table VIII. Without any channel attention mechanism, the super-resolution performance drops markedly, and the use of LGCAB raises the PSNR and SSIM scores by 0.067 dB and 0.001, respectively. It is worth noting that the widely employed SE and CBAM modules lead to a marked decline in network performance. We hypothesize that this is because the SE and CBAM modules employ global average pooling and global max pooling to compress spatial information, resulting in the loss of a large number of small-scale features that are also crucial for the final reconstruction. Additionally, global max pooling only captures global peak signals, which do not accurately reflect texture and structural information; this is another reason for the decrease in network performance. Furthermore, compared with a one-branch channel attention block (in which all spatial information is discarded), the dual-branch structure improves the average PSNR and SSIM values by 0.159 dB and 0.0016, respectively. Therefore, we can conclude that better performance is obtained by applying LGCAB, which captures both large- and small-scale characteristics.
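The hypothesis about global pooling can be illustrated with a toy example: two feature maps with very different spatial structure can yield exactly the same global average and global max descriptors. The sketch below is a NumPy illustration under that assumption; `local_avg_pool` is a hypothetical block-wise pooling used only to show that retaining some spatial layout separates the two maps, and it is not the LGCAB design.

```python
import numpy as np


def global_avg_pool(x):
    """Global average pooling over the spatial axes of a (C, H, W) map."""
    return x.mean(axis=(-2, -1))


def global_max_pool(x):
    """Global max pooling over the spatial axes of a (C, H, W) map."""
    return x.max(axis=(-2, -1))


def local_avg_pool(x, block):
    """Average over non-overlapping block x block windows (hypothetical helper)."""
    c, h, w = x.shape
    return x.reshape(c, h // block, block, w // block, block).mean(axis=(2, 4))


# One large object: a single 4x4 blob of ones (16 active pixels).
large_object = np.zeros((1, 8, 8))
large_object[0, :4, :4] = 1.0

# Many small objects: 16 isolated active pixels on a regular grid.
small_objects = np.zeros((1, 8, 8))
small_objects[0, ::2, ::2] = 1.0

print(global_avg_pool(large_object), global_avg_pool(small_objects))
print(global_max_pool(large_object), global_max_pool(small_objects))
print(local_avg_pool(large_object, 4)[0])
print(local_avg_pool(small_objects, 4)[0])
```

Both global descriptors coincide for the two maps, whereas the block-wise averages differ, which is consistent with the intuition that purely global channel statistics cannot distinguish one large target from many small ones.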

5) Study of FFT Loss:
In this section, we investigate the impact of the loss function on the final performance of the model, as shown in Table IX. To balance the contributions of the FFT loss and the L1 pixel loss, we assign a relatively small weight to the FFT loss term, which helps to optimize the network. In addition, our experiments show that adding the FFT loss improves the quality of the reconstructed images, resulting in PSNR increases of 0.018 and 0.035 dB on the UCMerced LandUse and AID datasets, respectively.
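A combined objective of this kind can be sketched as follows. This is a minimal NumPy illustration, not the training code of the paper: the frequency term is assumed to be an L1 distance between the real and imaginary parts of the 2-D spectra, and `fft_weight=0.05` is a placeholder, since the actual weight is not given in this section.

```python
import numpy as np


def l1_pixel_loss(sr, hr):
    """Mean absolute error in the image domain."""
    return float(np.mean(np.abs(sr - hr)))


def fft_loss(sr, hr):
    """Mean absolute error between 2-D spectra (real and imaginary parts).

    Uses NumPy's default unnormalized FFT over the last two axes.
    """
    diff = np.fft.fft2(sr, axes=(-2, -1)) - np.fft.fft2(hr, axes=(-2, -1))
    return float(np.mean(np.abs(diff.real) + np.abs(diff.imag)))


def total_loss(sr, hr, fft_weight=0.05):
    """L1 pixel loss plus a down-weighted FFT loss (fft_weight is assumed)."""
    return l1_pixel_loss(sr, hr) + fft_weight * fft_loss(sr, hr)


rng = np.random.default_rng(0)
hr = rng.random((3, 32, 32))                        # toy "ground truth"
sr = hr + 0.01 * rng.standard_normal(hr.shape)      # toy reconstruction

print("pixel L1:", l1_pixel_loss(sr, hr))
print("FFT L1  :", fft_loss(sr, hr))
print("total   :", total_loss(sr, hr))
```

With an unnormalized FFT, the spectral differences of noise-like errors are typically much larger than the pixel differences, which is one plausible reason for choosing a small weight on the frequency term.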

6) Effectiveness of Our Proposed Components:
In this subsection, we investigate the individual contributions of the components of our proposed model through ablation experiments. We use a baseline model consisting of the main path without AP, FFT-RB, or LGCAB. All comparative models are trained for 500 epochs on the UCMerced LandUse dataset under consistent experimental conditions. From Table X, we can conclude that using AP (Model 1) yields an improvement of 0.086 dB over the baseline model (Model 0). The AP module automatically supplements a variety of original shallow multiscale information, which plays an important role in the reconstruction of degraded remote-sensing images. Compared with the baseline model (Model 0), the FFT-RB model (Model 2) achieves improvements of 0.121 dB and 0.001 in terms of PSNR and SSIM, respectively. The FFT-RB module, which utilizes the FFT, enables efficient capture of global information, which is crucial for the reconstruction of high-spatial-resolution remote-sensing imagery. Furthermore, the proposed LGCAB module yields gains of 0.24 dB and 0.002 in PSNR and SSIM, respectively. The LGCAB module allocates more resources to regions of higher importance while suppressing irrelevant information from both local and global perspectives. In summary, the overall performance of the network is notably better when our proposed components are incorporated, demonstrating the effectiveness of the proposed modules.

E. Model Complexity Analysis
In this section, we examine the tradeoff between PSNR and FLOPs, as shown in Fig. 13. FLOPs denotes the number of floating-point operations and can be used to measure the complexity of a model. MSFFTAN achieves competitive results with fewer FLOPs. Although CTN has fewer FLOPs than MSFFTAN, its performance is 2.29 dB worse. On the other hand, MSFFTAN achieves a 1.66-dB improvement while requiring only half of HSENet's FLOPs. In conclusion, MSFFTAN requires relatively few FLOPs while producing better super-resolution results than previous approaches, demonstrating that our method achieves a satisfactory balance between network complexity and image super-resolution quality.
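For reference, the FLOPs of a single convolutional layer can be estimated with the short sketch below. It assumes the common convention of counting one multiply-accumulate as two operations and ignores bias terms; conventions differ between tools, and the counting protocol used for Fig. 13 is not restated here. The function name and the layer sizes are illustrative.

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out, count_mac_as_two=True):
    """Estimate the FLOPs of a dense 2-D convolution (bias ignored).

    Each output element needs c_in * kernel * kernel multiply-accumulates,
    and there are c_out * h_out * w_out output elements.
    """
    macs = c_in * c_out * kernel * kernel * h_out * w_out
    return 2 * macs if count_mac_as_two else macs


# Hypothetical 3x3 layer with 64 input and 64 output channels,
# evaluated at full resolution and at half resolution.
full_res = conv2d_flops(64, 64, 3, 128, 128)
half_res = conv2d_flops(64, 64, 3, 64, 64)

print(f"full resolution: {full_res / 1e9:.3f} GFLOPs")
print(f"half resolution: {half_res / 1e9:.3f} GFLOPs")
print(f"ratio: {full_res / half_res:.0f}x")
```

The cost scales linearly with the number of output pixels, so the same layer applied to a feature map downsampled by 2 in each dimension needs a quarter of the operations.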

V. CONCLUSION
In this work, a novel FFT-based multiscale attention network, referred to as MSFFTAN, is proposed for the task of RSSR. MSFFTAN utilizes a multiinput encoder-decoder structure to extract multiscale information and enhance features, resulting in superior reconstruction capabilities. In particular, an FFT-RB, containing a convolution operation and an FFT operation, is elaborately designed to extract and aggregate both local details and global structures. To enhance the ability of MSFFTAN to utilize both large and small target information, an LGCAB is constructed. More importantly, a BGSB is presented to make full use of intermediate features from multiple scales and depths in order to increase the quality of the reconstructed results. Extensive experiments on two public datasets indicate that MSFFTAN outperforms other currently leading approaches in both quantitative and qualitative evaluations.