
CFFormer: Channel Fourier Transformer for Remote Sensing Super Resolution



Abstract:

The objective of super-resolution in remote sensing imagery is to enhance low-resolution images to recover high-quality details. With the rapid progress of deep learning technology, deep learning-based super-resolution for remote sensing images has also made remarkable achievements. However, these methods encounter several challenges. They often struggle with processing long-range spatial information that encompasses complex scene changes, adversely affecting the image's coherence and accuracy. Furthermore, the lack of connectivity in feature extraction blocks hinders effective feature utilization in deeper network layers, leading to issues such as gradient vanishing and exploding. Additionally, constraints in the spatial domain of previous methods frequently result in severe shape distortion and blurring. To address these issues, this study proposes the CFFormer, a new super-resolution framework that employs the Swin Transformer as its core architecture and incorporates the Channel Fourier Block (CFB) to refine features in the frequency domain. The Global Attention Block (GAB) is also integrated to enhance global information capture, thereby improving the extraction of spatial features. To increase model stability and feature utilization efficiency, a Jump-Joint Fusion Mechanism is designed, culminating in a Residual Fusion Swin Transformer Block (RFSTB) that alleviates the gradient vanishing issue and optimizes feature reuse. Experimental results confirm the CFFormer's superior performance in remote sensing image reconstruction, demonstrating outstanding perceptual quality and reliability. Notably, the CFFormer achieves a Peak Signal-to-Noise Ratio (PSNR) of 29.83 dB on the UC Merced ×4 dataset, surpassing the SwinIR method by approximately 0.5 dB, a substantial enhancement.
Topic: Recent Advances in Remote Sensing Image Super-Resolution for Earth Observation
Page(s): 569 - 583
Date of Publication: 05 November 2024

SECTION I.

Introduction

Remote sensing images have a broad range of applications in the military [1], environmental monitoring [2], agriculture [3], urban planning, and disaster prevention [4]. Given these extensive application requirements, higher-resolution remote sensing images are often needed for research and analysis. However, owing to limitations imposed by current imaging equipment, cost, weather, and lighting conditions, it is extremely difficult to obtain high-resolution images directly. Therefore, finding methods to obtain higher resolution images has become a major research topic in the field of remote sensing. Among these methods, single remote sensing image super resolution (RSSR) technology, which aims to restore high-resolution (HR) images from low-resolution (LR) images, has garnered significant attention in remote sensing applications [5], [6], [7].

To date, a substantial body of work has achieved remarkable results in deep learning-based single natural image super-resolution (SR) [8], [9]. Learning-based SR methods leverage the information available in LR images to infer and predict the corresponding high-resolution details (e.g., edges). Thanks to their sophisticated nonlinear learning capabilities and extensive training data, deep learning models are favored over interpolation-based SR methods [10], [11], [12] and reconstruction-based SR methods [13], [14] for their ability to generate clearer and more natural image details and their easier integration with other deep learning applications and services.

Although deep learning has achieved outstanding results in natural image SR, remote sensing image SR differs significantly from natural image SR. Natural image SR primarily focuses on enhancing image quality, recovering image details and textures, and maintaining natural aesthetics, while remote sensing image SR faces challenges such as dealing with large homogeneous areas (like oceans and deserts) and highly complex features (like urban areas). Moreover, the spatial resolution and global information of remote sensing images play a crucial role in remote sensing image SR, aspects that general natural image SR tends to overlook. This leads to the necessity for researchers to design specialized methods for remote sensing images to improve the effectiveness of remote sensing SR [15], [16], [17].

In 2014, Dong et al. [18] proposed super-resolution convolutional neural network (SRCNN), marking the first application of deep learning technology to image SR reconstruction, achieving an end-to-end mapping from low to high resolution. Subsequently, Kim et al. [19] proposed VDSR, which significantly improved the performance of deep learning in SR by increasing the depth of the network. Tuna et al. [20] applied the SRCNN and VDSR models combined with intensity-hue-saturation (IHS) transformation to satellite images obtained from VHR SPOT6 and 7 as well as Pleiades 1A and 1B satellites. The experimental results showed that the VDSR method outperformed SRCNN on both panchromatic image (PAN) and multispectral (MS) remote sensing images. Later, Xu et al. [21] proposed a deeply modulated convolutional neural network, which combines image details with contextual information using local and global memory connections to generate high-quality images. Recently, Ren et al. [22] proposed an enhanced residual convolutional neural network based on a dual-brightness scheme, which enhances the feature flow module of the residual convolutional neural network and the ability to learn distinctively across feature maps. Qian et al. [23] adopted the ResNet architecture and utilized 3-D separable convolution to better capture spatial–spectral features, providing a new self-supervised learning method for the SR problem of Sentinel-2 multispectral images.

Following this, with the development of attention mechanisms in computer vision, Gu et al. [24] proposed a deep residual squeeze-and-excitation network for RSSR, introducing a Residual Squeeze-and-Excitation Block to model the interdependencies between channels and enhance the network's representational power. Recently, Wang et al. [25], focusing on VHR satellite images, applied channel attention and spatial attention to a deep dense residual network to improve the performance of single image super resolution (SISR) solutions. Wang et al. [26] proposed a distance attention block (DAB) as a bridge between the main branch and the distance attention residual connection block (DARCB) branch, with the DAB effectively alleviating the loss of detail features during extraction by deep convolutional neural networks (CNNs).

However, CNN-based methods face an inevitable obstacle in remote sensing image super-resolution (RSISR). Owing to the design of convolutional layers, the interaction between the convolutional kernel and the image is content-agnostic, and it is illogical to use the same convolutional kernel to reconstruct different regions of an image. To address this issue, the Vision Transformer (ViT) [27] and the ViT-based Swin Transformer [28] have emerged, with subsequent researchers building on and extending them. For instance, ConvFormerSR [29] adopted a dual-branch structure, with one branch using the Swin Transformer mechanism combined with global attention and the other using a series of residual groups to further extract image texture information, ultimately achieving outstanding results on the HLSSR-GJ remote sensing dataset. Ren et al. [30] proposed a cross dual-branch U-Net architecture combining convolutional neural networks and Transformers; by designing a spectral-spatial parallel Transformer and a spectral-spatial feature interaction module, it effectively improves the spatial resolution of hyperspectral images.

Despite the significant achievements of current Transformer-based remote sensing image SR methods, they still face challenges in capturing global information and utilizing richer spatial resolution. These limitations not only affect the quality of detail recovery, but also restrict the stability of the model when dealing with deep networks, often leading to vanishing or exploding gradients. To address these issues, we propose a new RSISR framework: CFFormer. This framework consists of three key parts: 1) shallow feature extraction; 2) deep feature extraction; and 3) upsampling. In the deep feature extraction phase, our CFFormer utilizes a carefully designed Global Attention Block (GAB) to capture richer global information and global contextual information. Subsequently, the Channel Fourier Block (CFB) processes image features by mapping through Fourier transformation into the frequency domain to more effectively handle high-frequency information. In addition, we optimize the information flow and alleviate the vanishing and exploding gradients caused by the increase in network depth through the residual fusion Swin Transformer layer. In summary, the main contributions of this article are as follows.

  1. To enhance the recovery of image details, we have carefully designed the CFB, which combines channel attention with the Fast Fourier Transform (FFT) and employs depthwise convolutions and pointwise convolutions to extract more comprehensive, detailed, and stable features.

  2. We propose the use of a GAB and have made improvements to it, placing it at the beginning and end of the deep feature extraction section to further optimize global information and enhance the model's expressive power.

  3. To effectively combine features from early and later layers, capturing a broader range of contextual and detailed features without adding more computational burden, we have combined jump-joint and Swin Transformer to carefully design the Residual Fusion Swin Transformer Block (RFSTB).

SECTION II.

Related Work

In this section, we first explain the spatial resolution and global information of remote sensing images, then elaborate on CNN-based SR networks and Transformer-based SR networks, and finally explain the FFT used extensively in this work.

A. Spatial Resolution and Global Information of Remote Sensing Images

Remote sensing images significantly differ from natural images, particularly in terms of spatial resolution and global information, both of which greatly impact SR. Spatial resolution is defined as the ground area represented by a single pixel in the image. Higher spatial resolution means that each pixel covers a smaller ground area, revealing more detailed and fine geographic features. However, this also places higher demands on models to utilize more spatial resolution information.

In addition, global information plays a crucial role in the SR of remote sensing images. Global information refers to information involving large-scale or entire scenes in the image. Remote sensing images usually cover vast geographic areas and contain complex natural and man-made landscapes. Global information helps to understand the contextual relationships in the image, such as terrain continuity and the distribution of ecological regions. In SR technology, dealing with global information often involves considering the relationships and interactions between different regions within the image. This requires SR models to have a high degree of spatial perception ability to recognize and utilize the global patterns and structures in the image.

B. SR Based on CNNs

Previously, traditional CNN-based SR networks were widely applied and achieved tremendous success. Since SRCNN [18] and VDSR [31], various advanced networks have emerged continuously. Furthermore, with the vigorous development of ResNet [32] and GAN [33], researchers in the field of SR gradually began exploring their applications. In 2017, the advent of SRGAN [34] effectively linked GANs with SR and introduced numerous residual blocks into the generator. Ren et al. [30] proposed a context-aware edge-enhanced generative adversarial network (CEEGAN) SR framework for reconstructing visually pleasing images that can be practically applied in real scenarios. In addition, increasing the "width" of networks has been considered a way to enhance SR performance; accordingly, the enhanced deep residual network for single image super-resolution (EDSR) [34] significantly improved performance by increasing the number of network and filter channels. Subsequently, with the rise of attention mechanisms, the residual channel attention network (RCAN) [8], SAN [35], and others raised the peak signal-to-noise ratio to new heights, demonstrating the importance of focusing more on high-frequency than on low-frequency information. MSAN [36] later applied this mechanism to remote sensing images, achieving success in multilevel feature extraction for the complex structure of remote sensing images. Huang et al. [37] developed residual dual attention blocks, including local multilevel fusion modules and dual attention mechanisms, to focus the network more on high-frequency information areas.

However, despite achieving good results, the above CNN-based SR models primarily focus on local information due to their limited feature extraction capabilities, thus failing to fully utilize contextual and global information. Specifically, contextual and spatial information may be lost during the decoding stage, limiting the recovery of high-resolution information.

C. SR Based on Transformers

To solve problems such as the inability to fully utilize contextual information, Transformers were applied to SR. In 2021, the pretrained image processing transformer (IPT) [38] pioneered the application of Transformers to low-level vision tasks; by constructing a Transformer-based pretrained model and exploiting its powerful modeling capability, the corresponding low-level vision tasks were effectively accomplished. Compared to various attention mechanisms, IPT proves more effective in SR tasks. However, IPT necessitates extensive pretraining and incurs high computational costs, limiting its efficiency. Subsequently, SwinIR [9] and NLSA [39], built upon the Swin Transformer [28], emerged between 2021 and 2022. The Swin Transformer innovatively introduced localized self-attention, which computes self-attention within local windows, significantly enhancing performance over IPT and thereby improving Transformer utility. Recent advancements such as the hybrid attention transformer for remote sensing super-resolution (HAT) [40] and SwinFIR [41] further extend SwinIR [9] by integrating overlapping cross-attention and the FFT, enabling broader pixel activation across the network and further boosting performance.

D. FFT Applied to Images on SR

FFT is an algorithm of paramount importance, facilitating the swift computation of the Discrete Fourier Transform and its inverse. As one of the foundational algorithms in contemporary digital signal processing, FFT is extensively utilized across a myriad of scientific and engineering disciplines, encompassing digital image processing, audio signal analysis, telecommunications systems, and beyond. Characteristics that prove challenging to manipulate in the spatial domain may often be more readily addressed in the frequency domain. Concretely, FFT enables the transition of an image from its native spatial representation to the frequency domain, where targeted modifications of specific spectral components can be effected to achieve the desired image processing outcomes, followed by an inverse transformation back to the spatial domain via the Fast Inverse Fourier Transform. In the realm of image SR, the application of FFT has been gaining increasing traction. Wang et al. [42] leveraged Fourier transformation to capture global facial structural information and augmented model capability by harmonizing spatial and spectral information through dual pathways that segregate local and global dependencies. Sinha et al. [43] introduced nonlocal attention-assisted fast Fourier convolution to broaden the receptive field and engendered learning of long-range dependencies. Liu et al. [28] expanded the receptive field via the spatial frequency block, thereby enhancing SR by capturing long-range interdependencies. Despite the remarkable efficacy of current FFT-driven SR methodologies, there remains a dearth in the extraction of more profound feature representations, leading to imperfections in the recovery of more delicate textures. To further harness the potential of FFT in the domain of image SR, we have conducted an exploration and crafted an innovative CFB. A detailed exposition of this development will be presented in Section III-D.
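To make the FFT workflow described above concrete, the sketch below moves a feature map to the frequency domain with torch.fft.rfft2, reweights the half-spectrum, and returns to the spatial domain with torch.fft.irfft2. The function name, the all-ones mask, and the "ortho" normalization are illustrative assumptions, not part of any cited method.

```python
import torch

def frequency_filter(feat: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Toy frequency-domain filtering: FFT -> reweight spectrum -> inverse FFT.

    feat:   (B, C, H, W) real-valued feature map.
    weight: multiplier broadcastable over the half-spectrum, e.g. (1, 1, H, W // 2 + 1).
    """
    spec = torch.fft.rfft2(feat, norm="ortho")        # (B, C, H, W//2 + 1), complex
    spec = spec * weight                              # modify selected frequency components
    return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")  # back to the spatial domain

# Usage: identity mask as a placeholder for a hand-crafted spectral weighting.
x = torch.randn(1, 3, 64, 64)
mask = torch.ones(1, 1, 64, 33)
print(frequency_filter(x, mask).shape)  # torch.Size([1, 3, 64, 64])
```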

SECTION III.

Method

In this section, we first describe the motivation behind the design of our CFFormer, then introduce the overall structure of the proposed CFFormer, and finally describe its three important innovations: 1) the RFSTB; 2) the CFB; and 3) the GAB.

A. Motivation

Despite the significant advancements in the field of image SR through deep learning techniques, challenges still remain when dealing with remote sensing images that have high spatial resolution and complex global information. Remote sensing images typically cover vast geographical areas with large-scale and diverse scene content, necessitating SR technology that can effectively capture extensive global information, utilize higher spatial resolution, and accurately recover details ranging from urban structures to natural landforms. However, traditional SR methods, such as CNNs and Transformer models, while achieving certain successes in spatial domain processing, often overlook the critical frequency domain features in remote sensing images, such as the low-frequency large-scale structures and high-frequency detailed textures, which are essential factors in determining image quality. To address this issue, we have designed a novel SR model that combines frequency domain and spatial domain processing, specifically tailored to the needs of remote sensing images. By processing in the frequency domain, the model can more effectively utilize frequency information to restore the global structure of the image and further refine the texture details, which is crucial for enhancing the spatial resolution of remote sensing images. Spatial domain processing ensures the precise recovery of local image features, especially in urban and agricultural areas, where detail recovery is vital for the practical application of remote sensing images. We believe that a model combining these two strategies can not only enhance the overall visual quality of remote sensing images but also more comprehensively analyze and utilize the spatial and frequency information in the images, thereby significantly improving the descriptive ability of complex remote sensing scenes.

B. Overall Structure of CFFormer

Our CFFormer is designed to harness and integrate global information more effectively, to address fine-grained feature mappings that are difficult to handle in the spatial domain, and to enhance feature reusability while mitigating gradient vanishing and explosion, thereby improving RSSR performance. The overall structure of the CFFormer is depicted in Fig. 3. It consists of a shallow feature extraction component, a deep feature extraction component, and an image upsampling section. Specifically, the input LR image $I_{LR}$ first passes through a shallow feature extraction part composed of a convolutional layer, then through a deep feature extraction part comprising a series of RFSTBs (each ending with a CFB) with GABs positioned at the beginning and end, and finally through an image reconstruction part consisting of convolutional layers and PixelShuffle blocks \begin{equation*} I_{SR} = F_{\text{PixelShuffle}}(F_{\text{DFE}}(\text{Conv}(I_{LR}))) \tag{1} \end{equation*}

where $F_{\text{DFE}}$ denotes the deep feature extraction function, $F_{\text{PixelShuffle}}$ denotes the image upsampling function, $\text{Conv}$ denotes convolution, and $I_{SR}$ denotes the final reconstructed image.
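The three-stage pipeline of (1) can be summarized in a few lines of PyTorch. The sketch below is illustrative only: the deep feature extractor is a placeholder standing in for the RFSTB/GAB/CFB stack, and the channel width, kernel sizes, and global residual are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CFFormerSketch(nn.Module):
    """Skeleton of Eq. (1): I_SR = PixelShuffle(F_DFE(Conv(I_LR)))."""
    def __init__(self, in_ch=3, dim=64, scale=4, deep_blocks=None):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, dim, 3, padding=1)           # shallow feature extraction
        self.deep = deep_blocks or nn.Sequential(                    # placeholder deep extractor
            *[nn.Conv2d(dim, dim, 3, padding=1) for _ in range(4)])
        self.upsample = nn.Sequential(                               # reconstruction head
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        feat = self.shallow(lr)
        feat = self.deep(feat) + feat   # global residual, common in SwinIR-style models (assumption)
        return self.upsample(feat)

sr = CFFormerSketch()(torch.randn(1, 3, 64, 64))
print(sr.shape)  # torch.Size([1, 3, 256, 256])
```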

Fig. 1. Comparison results of our CFFormer with the state-of-the-art methods SwinIR, HAT, and SwinFIR. On UC Merced ×4, CLRS ×2, and CLRS ×4, our CFFormer achieves the best performance over existing works.

Fig. 2. Comparison of our CFFormer with and without CFB on harbor67 in the UC Merced dataset. In the absence of CFB, larger distortion is produced.

Fig. 3. Overall structure of the proposed CFFormer; GAB denotes Global Attention Block and CFB denotes Channel Fourier Block.

In the deep feature extraction part, each RFSTB internally contains a series of Swin Transformer layers (STLs), shift Swin Transformer layers (SSTLs), and CFBs to realize the deep extraction of feature mapping.

C. Residual Fusion Swin Transformer Block (RFSTB)

In existing Swin Transformer-based literature, STLs or SSTLs are typically connected in a direct series, a structure we believe may lead to sparse relationships between layers. This sparse connectivity can hinder the flow of features and result in the loss of critical information such as local-global features in remote sensing images. To address this issue, we propose an RFSTB to enhance interlayer connections, as depicted in Fig. 3(a). Specifically, each STL is concatenated with the output of its second neighboring SSTL, and channel adjustments are made using 1 × 1 convolutions. Similarly, each SSTL undergoes the same feature fusion operation with its second neighboring STL. This approach strengthens connections between features captured at lower layers and high-level abstract features at higher layers, optimizing information flow and enhancing the model's feature expressive power. Consequently, this improves the overall performance of the model. It is important to note that the number of (S)STLs can vary, as illustrated in Fig. 4, and different configurations will yield varying performance and parameter counts, which will be detailed in Section IV-E.

Fig. 4. Structure diagram of RFSTB under different numbers of (S)STLs.

As shown in Fig. 3, taking six (S)STLs as an example, we perform feature fusion between (1)STL and (4)SSTL, between (2)SSTL and (5)STL, and between (3)STL and (6)SSTL. Taking the first RFSTB as an example, the feature $I_{\text{GAB}}$ output by the GAB is processed by (1)STL, (2)SSTL, (3)STL, and (4)SSTL to obtain $I_{\text{SR4}}$ \begin{equation*} I_{\text{SR4}} = F_{\text{SSTL}}\left(F_{\text{STL}}\left(F_{\text{SSTL}}\left(F_{\text{STL}}\left(I_{\text{GAB}}\right)\right)\right)\right) \tag{2} \end{equation*}

where $F_{\text{STL}}$ and $F_{\text{SSTL}}$ denote the STL function and SSTL function, respectively, and $I_{\text{SR}x}$ denotes the feature mapping after the $x$th (S)STL layer. The resulting $I_{\text{SR1}}$ and $I_{\text{SR4}}$ are fused as inputs to the fifth layer (5)STL, $I_{\text{SR2}}$ and $I_{\text{SR5}}$ are fused as inputs to the sixth layer (6)SSTL, and then $I_{\text{SR3}}$ and $I_{\text{SR6}}$ are fused as inputs to the CFB. Finally, the output of the CFB is summed with $I_{\text{GAB}}$ to obtain the final output of the RFSTB \begin{align*} \left\lbrace \begin{array}{l} I_{\text{SR1,4}} = \text{Conv}(\text{Cat}(I_{\text{SR1}}, I_{\text{SR4}})) \\ I_{\text{SR5}} = F_{\text{STL}}(I_{\text{SR1,4}}) \\ I_{\text{SR2,5}} = \text{Conv}(\text{Cat}(I_{\text{SR2}}, I_{\text{SR5}})) \\ I_{\text{SR6}} = F_{\text{SSTL}}(I_{\text{SR2,5}}) \\ I_{\text{SR3,6}} = \text{Conv}(\text{Cat}(I_{\text{SR3}}, I_{\text{SR6}})) \\ I_{\text{CFB}} = F_{\text{CFB}}(I_{\text{SR3,6}}) \\ I_{\text{RFSTB}} = I_{\text{GAB}} + I_{\text{CFB}} \end{array} \right. \tag{3} \end{align*}
where $I_{\text{SR}x,y}$ denotes the feature mapping obtained by fusing the (S)STL results of layers $x$ and $y$, $\text{Conv}(\cdot)$ denotes the convolution function, $\text{Cat}(\cdot)$ denotes the concatenation function, and $F_{\text{CFB}}(\cdot)$ denotes the CFB function, which we introduce in Section III-D.

By introducing a fusion layer between different processing stages, our architecture effectively integrates information across layers. This optimization not only enhances the flow of features, but also improves the network's capacity to recognize complex patterns. Consequently, it mitigates issues such as vanishing and exploding gradients, significantly enhancing RSSR performance.
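As a rough illustration of the fusion pattern in (3), the sketch below pairs layer k with layer k+3 through concatenation and a 1 × 1 convolution. The make_layer factory and the nn.Identity stand-in for the CFB are placeholders, not the paper's actual STL/SSTL or CFB implementations.

```python
import torch
import torch.nn as nn

class RFSTBSketch(nn.Module):
    """Jump-joint fusion of Eq. (3): pair layer k with layer k+3 via concat + 1x1 conv."""
    def __init__(self, dim, make_layer):
        super().__init__()
        # six (S)STLs; make_layer builds an STL/SSTL-like block (placeholder here)
        self.layers = nn.ModuleList([make_layer(dim) for _ in range(6)])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * dim, dim, 1) for _ in range(3)])
        self.cfb = nn.Identity()  # stand-in for the Channel Fourier Block

    def forward(self, x_gab):
        s1 = self.layers[0](x_gab)
        s2 = self.layers[1](s1)
        s3 = self.layers[2](s2)
        s4 = self.layers[3](s3)
        s5 = self.layers[4](self.fuse[0](torch.cat([s1, s4], dim=1)))   # fuse (1)STL with (4)SSTL
        s6 = self.layers[5](self.fuse[1](torch.cat([s2, s5], dim=1)))   # fuse (2)SSTL with (5)STL
        out = self.cfb(self.fuse[2](torch.cat([s3, s6], dim=1)))        # fuse (3)STL with (6)SSTL -> CFB
        return x_gab + out                                              # residual to the block input

# Usage with a simple convolutional stand-in for the (S)STLs.
block = RFSTBSketch(64, lambda d: nn.Sequential(nn.Conv2d(d, d, 3, padding=1), nn.GELU()))
print(block(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
```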

D. Channel Fourier Block (CFB)

In remote sensing images, there are often important feature boundaries (e.g., farm roads or rivers) that are blurred due to the lack of high-frequency information. By transforming the feature maps from the spatial domain to the frequency domain through FFT, these high-frequency details can be effectively restored, making the boundaries clearer and significantly enhancing the image's resolution and practicality. Therefore, to further enhance the recovery of image details and overall quality, we have carefully designed the CFB. As shown in Fig. 5, the CFB consists of a Channel Attention part (CA) and a Fourier Block part (FB). The input feature maps will be processed through both the CA and the FB, and the results of the two branches will be fused as the output of the CFB.

Fig. 5. Channel Fourier block (CFB), where (a) is the channel attention (CA) and (b) is the Fourier block (FB).

Channel Attention: To enhance the performance of CNNs, we have applied the traditional channel attention mechanism from [44] to our network. Unlike the original approach, as shown in Fig. 5(a), we first perform a Conv-act-Conv operation at the beginning of the channel attention to further extract features, then multiply element-wise with the feature maps processed by the channel attention, and finally perform feature fusion to obtain the output of CA. Our CA optimizes the model's feature expression by assigning different importance to features of different channels, enabling the network to focus more on features that are more useful for the current task.
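A minimal sketch of this channel attention variant is given below: a Conv-Act-Conv branch refines the features, a squeeze-and-excitation style gate (in the spirit of [44]) reweights the channels, and a final 1 × 1 convolution fuses the result. The activation choice, reduction ratio, and the exact tensor on which the gate is computed are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class CASketch(nn.Module):
    """Channel attention variant: Conv-Act-Conv features gated by SE-style channel weights."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.body = nn.Sequential(                       # Conv-Act-Conv feature refinement
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        self.gate = nn.Sequential(                       # squeeze-and-excitation channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        feat = self.body(x)
        return self.fuse(feat * self.gate(feat))         # element-wise gating, then fusion
```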

Fourier Block: As mentioned at the beginning of Section III-D, to better recover image details and structures, we combine the FFT and carefully design the FB, as shown in Fig. 5(b). We assume the input feature map $I_{\text{in}}$ has dimensions $(b, c, h, w)$, where $b$ is the batch size, $c$ the number of channels, $h$ the image height, and $w$ the image width. First, since the FFT of a real signal is conjugate-symmetric, only half of the frequency components need to be stored to reconstruct the complete spectrum. Therefore, to improve computational efficiency, we first halve the channels of the input feature map with a 1 × 1 convolution. Next, we use the FFT to convert the feature map from the spatial domain to the frequency domain. For a 2-D feature map $I_{\text{in}}(x, y)$, the Fourier transform is expressed as \begin{equation*} F_{\text{FFT}}(u,v) = \sum _{x=0}^{M-1}\sum _{y=0}^{N-1}I_{\text{in}}(x,y) \cdot e^{-j2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)} \tag{4} \end{equation*}

where $M$ and $N$ represent the width and height of the input feature map, respectively, and $u$ and $v$ represent the coordinates in the frequency domain.

During the FFT process, due to the symmetry of the FFT, we only need to retain the positive frequency part and the DC component; the width $w$ therefore becomes $w/2 + 1$. In addition, an extra dimension of size 2 is added to store the real and imaginary parts of each frequency component. Thus, after the FFT operation, the tensor dimensions change from $(b, c/2, h, w)$ to $(b, c/2, h, w/2 + 1, 2)$. Next, the frequency-domain feature map passes through a Conv-Act layer and then, in parallel, through a 1 × 1 pointwise convolution and a 3 × 3 depthwise convolution to further extract features. The pointwise convolution mixes information across channels; although its kernel is small, it can learn more complex feature representations by integrating information from different channels. The depthwise convolution applies a kernel separately within each input channel; unlike a traditional convolutional layer, it does not mix information between channels but processes the spatial features of each channel independently. Passing the feature map through both depthwise and pointwise convolutions and then fusing the results allows the model to optimize within-channel and cross-channel features simultaneously, enhancing the expressiveness of features and the learning efficiency of the network, while achieving complex feature extraction and fusion without adding excessive computational overhead.

After fusing the feature maps processed by the depthwise and pointwise convolutions, we apply max pooling and average pooling to the result. Max pooling extracts the strongest activation signals of regional features, emphasizing the most salient parts and helping maintain the sharpness and prominence of image features. Average pooling computes the mean of each feature region, extracting smoother features and accounting for the statistical information of the whole region. Next, to enable the network to learn how to extract and integrate information from the two pooling strategies for more accurate prediction, we concatenate the max pooling and average pooling results. After concatenation, we apply a depthwise convolution, followed by Act-Conv for channel dimension adjustment. The result passes through a sigmoid activation and is multiplied element-wise with the pre-pooling feature $I_{\text{FB2}}$; the modified feature map then undergoes the inverse Fourier transform (Inv FFT) to return from the frequency domain to the spatial domain. Finally, we add this to the feature before the Fourier transform and adjust the channels with a 1 × 1 convolution \begin{align*} \left\lbrace \begin{array}{l} I_{\text{FFT}} = F_{\text{FFT}}(\text{Act}(\text{Conv}(I_{\text{SR3,6}}))) \\ I_{\text{PConv}} = \text{PConv}(\text{Act}(\text{Conv}(I_{\text{FFT}}))) \\ I_{\text{DConv}} = \text{DConv}(\text{Act}(\text{Conv}(I_{\text{FFT}}))) \\ I_{\text{FB2}} = \text{Conv}(\text{Cat}(I_{\text{PConv}}, I_{\text{DConv}})) \\ I_{\text{FB3}} = \text{Cat}(F_{\text{AVG}}(I_{\text{FB2}}), F_{\text{MAX}}(I_{\text{FB2}})) \\ I_{\text{FB4}} = \text{Conv}(\text{Act}(\text{DConv}(I_{\text{FB3}}))) \\ I_{\text{FB}} = \text{Conv}(F_{\text{invFFT}}(I_{\text{FB2}} \times \text{Sig}(I_{\text{FB4}}))) + \text{Act}(\text{Conv}(I_{\text{SR3,6}})) \end{array}\right. \tag{5} \end{align*}


In the description, $\text{Conv}(\cdot)$ represents a standard 1 × 1 convolution, $\text{Act}(\cdot)$ denotes the activation function, $\text{DConv}(\cdot)$ and $\text{PConv}(\cdot)$ indicate depthwise and pointwise convolution, respectively, $F_{\text{FFT}}(\cdot)$ represents the Fast Fourier Transform function, $F_{\text{MAX}}(\cdot)$ denotes the max pooling function, $F_{\text{AVG}}(\cdot)$ the average pooling function, $\text{Sig}(\cdot)$ the sigmoid function, and $F_{\text{invFFT}}(\cdot)$ the inverse FFT function. The expression for the inverse Fourier transform is \begin{equation*} F_{\text{invFFT}}(x,y) = \frac{1}{MN} \sum _{u=0}^{M-1} \sum _{v=0}^{N-1} I_{\text{freq}}(u,v) \cdot e^{j2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)} \tag{6} \end{equation*}

where $M$ and $N$ again denote the width and height of the feature map, $u$ and $v$ the frequency-domain coordinates, and $I_{\text{freq}}(u,v)$ the feature in the frequency domain. Finally, we fuse the output of CA with the output of FB to obtain the final output of the CFB.
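The following sketch mirrors the structure of (5) under simplifying assumptions: torch.fft.rfft2 yields the half-spectrum directly (instead of an explicit real/imaginary dimension of size 2), real and imaginary parts are folded into the channel axis so ordinary 2-D convolutions apply, and the pooling uses 3 × 3 kernels of stride 1. Channel widths and activations are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierBlockSketch(nn.Module):
    """Rough sketch of Eq. (5): FFT -> pointwise/depthwise branches -> pooled gate -> inverse FFT."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.reduce = nn.Conv2d(dim, half, 1)                  # halve channels before the FFT
        self.pre = nn.Sequential(nn.Conv2d(2 * half, 2 * half, 1), nn.GELU())
        self.pconv = nn.Conv2d(2 * half, 2 * half, 1)          # pointwise: mixes channels
        self.dconv = nn.Conv2d(2 * half, 2 * half, 3, padding=1, groups=2 * half)  # depthwise
        self.merge = nn.Conv2d(4 * half, 2 * half, 1)
        self.gate = nn.Sequential(                             # DConv -> Act -> Conv -> Sigmoid gate
            nn.Conv2d(4 * half, 4 * half, 3, padding=1, groups=4 * half), nn.GELU(),
            nn.Conv2d(4 * half, 2 * half, 1), nn.Sigmoid())
        self.out = nn.Conv2d(half, dim, 1)                     # restore the channel count

    def forward(self, x):
        s = self.reduce(x)
        spec = torch.fft.rfft2(s, norm="ortho")                # (B, C/2, H, W/2+1), complex
        f = self.pre(torch.cat([spec.real, spec.imag], dim=1)) # fold real/imag into channels
        f2 = self.merge(torch.cat([self.pconv(f), self.dconv(f)], dim=1))
        pooled = torch.cat([F.avg_pool2d(f2, 3, 1, 1), F.max_pool2d(f2, 3, 1, 1)], dim=1)
        f2 = f2 * self.gate(pooled)                            # sigmoid gate from avg+max pooling
        re, im = f2.chunk(2, dim=1)                            # assume first/second halves are real/imag
        back = torch.fft.irfft2(torch.complex(re, im), s=s.shape[-2:], norm="ortho")
        return self.out(back + s)                              # residual, then channel adjustment
```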

E. Global Attention Block (GAB)

For remote sensing images, processing global information is crucial for correctly interpreting and reconstructing large-scale geographical environments. Global attention mechanisms can help models identify and leverage contextual information across vast spatial extents within images, which aids in maintaining the continuity and overall consistency of geospatial features during the SR process, thereby enhancing the practicality of spatial resolution. Therefore, to enhance the model's ability to capture information on a global scale, we propose a GAB, placing it at the beginning and end of deep feature extraction (as shown in Fig. 3), where the initial GAB is used for richer feature extraction of LR, and the final GAB, together with the initial GAB, captures longer-range and global information more effectively.

The proposed GAB is shown in Fig. 6. To enable the model to better focus on key features, we apply a "maximize-expand-subtract" strategy within the GAB, and we also employ a jump-joint between the feature maps before and after the GAB to promote information flow in deep networks and prevent gradient vanishing during training. The input feature map first undergoes Conv-Act-Conv to further extract features and is then reshaped three times to obtain the feature maps $I_{\text{GA1}}$, $I_{\text{GA2}}$, and $I_{\text{GA3}}$. $I_{\text{GA1}}$ is transposed and multiplied with $I_{\text{GA2}}$ to obtain $I^{\prime}_{\text{GA12}}$, which undergoes max pooling and expansion; $I^{\prime}_{\text{GA12}}$ is then subtracted from the expanded maximum, and the difference is passed through a sigmoid to normalize and highlight important features. The normalized result is multiplied with $I_{\text{GA3}}$, added to the initial convolved feature, and finally fused to obtain the GAB output $I_{\text{GAB}}$. Taking the GAB at the head of the deep feature extraction as an example \begin{align*} \left\lbrace \begin{array}{l} I_{\text{GA2}} = I_{\text{GA3}} = F_\text{Reshape}(\text{Conv}(I_{\text{LR}})) \\ I_{\text{GA1}} = F_\text{Transpose}(F_\text{Reshape}(\text{Conv}(I_{\text{LR}}))) \\ I^{\prime }_{\text{GA12}} = I_{\text{GA1}} \times I_{\text{GA2}} \\ I_{\text{GA12}} = \text{Sig}(\text{MAX}(I^{\prime }_{\text{GA12}}) - I^{\prime }_{\text{GA12}}) \\ I^{\prime }_{\text{GA}} = I_{\text{GA12}} \times I_{\text{GA3}} + \text{Conv}(I_{\text{LR}}) \\ I_{\text{GA}} = \text{Conv}(\text{Cat}(I^{\prime }_{\text{GA}}, \text{Conv}(I_{\text{LR}}))) \end{array} \right. \tag{7} \end{align*}

where $F_{\text{Reshape}}$ and $F_{\text{Transpose}}$ denote the reshape and transpose functions, respectively, and $\text{MAX}(\cdot)$ denotes the maximize-and-expand operation.
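A lightweight sketch of Eq. (7) follows. The reshape order in the paper leaves the attention-map shape ambiguous, so this example uses a C × C channel-similarity formulation to keep the cost independent of image size; that choice, the Conv-Act-Conv widths, and the GELU activation are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class GABSketch(nn.Module):
    """'Maximize-expand-subtract' global attention in the spirit of Eq. (7)."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                   nn.Conv2d(dim, dim, 3, padding=1))
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        e = self.embed(x)
        q = e.flatten(2)                                   # I_GA2 = I_GA3: (B, C, HW)
        k = q.transpose(1, 2)                              # I_GA1: (B, HW, C)
        sim = torch.bmm(q, k)                              # (B, C, C) similarity map
        attn = torch.sigmoid(sim.amax(dim=-1, keepdim=True).expand_as(sim) - sim)  # maximize-expand-subtract
        out = torch.bmm(attn, q).view(b, c, h, w) + e      # reweight features, add skip
        return self.fuse(torch.cat([out, e], dim=1))       # jump-joint fusion with the pre-attention map
```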

Fig. 6. Global attention block.

SECTION IV.

Experiment

A. Datasets and Implementation Details

We have conducted extensive experiments on UC Merced, CLRS, and RSSCN7 datasets to demonstrate the effectiveness of CFFormer.

1) UC Merced Dataset and Implementation Details

UC Merced contains 21 categories of remote sensing scenes, including agricultural, airplane, baseball diamond, and so on. Each category has 100 images of size 256 × 256 pixels, with a spatial resolution of 0.3 m/pixel. We randomly partition these images in a 6:2:2 ratio into training, validation, and test sets. We use bicubic interpolation to generate 64 × 64 inputs for SR at the ×4 scale and 128 × 128 inputs for the ×2 scale, and augment the training data with horizontal and random flip strategies. We optimize the network using Adam with a batch size of 2. The initial learning rate is set to 1e-4 for training at the ×4 scale and 2e-4 for the ×2 scale, in both cases halving the learning rate at 125 000 and 180 000 iterations. We train the network for 220 000 iterations, choosing the Charbonnier loss as the network optimization function.
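A minimal sketch of this training setup is shown below for the ×4 case. The model and data are placeholders (a single convolution and random tensors stand in for CFFormer and the UC Merced loader); the Adam settings, milestone schedule, batch size, and Charbonnier loss follow the description above, while the epsilon in the loss is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharbonnierLoss(nn.Module):
    """Charbonnier loss: mean of sqrt((sr - hr)^2 + eps^2)."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, sr, hr):
        return torch.sqrt((sr - hr) ** 2 + self.eps ** 2).mean()

model = nn.Conv2d(3, 3, 3, padding=1)                 # placeholder for the CFFormer network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(     # halve the lr at 125k and 180k iterations
    optimizer, milestones=[125_000, 180_000], gamma=0.5)
criterion = CharbonnierLoss()

# One illustrative iteration out of the 220 000 used for UC Merced x4.
lr_img = torch.rand(2, 3, 64, 64)                     # batch size 2, bicubic-downsampled inputs
hr_img = torch.rand(2, 3, 256, 256)
sr_img = F.interpolate(model(lr_img), scale_factor=4, mode="bicubic")  # stand-in x4 output
loss = criterion(sr_img, hr_img)
optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```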

2) CLRS Dataset and Implementation Details

To further confirm the validity of our method in the case of richer categories, we continue our evaluation on the CLRS dataset. CLRS contains a total of 15 000 images in 25 categories, each of size 256 × 256 with spatial resolutions ranging from 0.26 to 8.85 m/pixel; we adopt the same partitioning strategy as for UC Merced. However, CLRS offers higher spatial resolution and more categories than UC Merced, and higher spatial resolution means more detailed information, which tests whether our method can successfully exploit this richer detail. As with UC Merced, we use bicubic interpolation to generate images at the ×4 and ×2 scales; the difference is that, owing to the higher spatial resolution and information content of CLRS, we train for 300 000 iterations with an initial learning rate of 2e-4, halving the learning rate at 125 000, 200 000, 250 000, and 280 000 iterations.

3) RSSCN7 Dataset and Implementation Details

RSSCN7 contains 2800 images in seven categories and poses a significant challenge because the captured scenes vary with season, weather conditions, and viewing angle. We first crop the images to 256 × 256 pixels and generate the corresponding 64 × 64 inputs, then randomly split them in the same 6:2:2 ratio for training, validation, and testing. We again optimize the network using Adam with a batch size of 2, halving the learning rate at 30 000 and 50 000 iterations for a total of 100 000 iterations. The learning rate is set to 1e-4 and the Charbonnier loss is used as the network optimization function.

We set the window size of our model to 12 and the mlp_ratio to 2. All experiments were conducted on a server equipped with dual NVIDIA RTX 3090 GPUs, using the PyTorch framework. Fig. 7 shows the variation curves of L_pix, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) during the training of UC Merced ×4, CLRS ×4, and RSSCN7 ×4, where early stopping is applied once both PSNR and SSIM saturate. Fig. 7 clearly shows the convergence of the model during training.

Fig. 7. Variation curves of L_pix, PSNR, and SSIM as the iterations increase during training: (a) L_pix for UC Merced ×4, (b) PSNR and SSIM for UC Merced ×4, (c) L_pix for CLRS ×4, (d) PSNR and SSIM for CLRS ×4, (e) L_pix for RSSCN7 ×4, (f) PSNR and SSIM for RSSCN7 ×4.

B. Evaluation Indicators

We evaluated our model comprehensively using five indicators: PSNR, SSIM, SAM, UQI, and SCC. All results were obtained on the Y channel.

1) Peak Signal-to-Noise Ratio and Structural Similarity Index

PSNR and SSIM are the most commonly used metrics for measuring the quality of image reconstruction. PSNR evaluates image quality by comparing the pixel differences between the original image and the processed image and is expressed in decibels (dB); the higher the value, the lower the image distortion and the better the image quality. SSIM, on the other hand, measures the similarity between two images in terms of brightness, contrast, and structure. Unlike PSNR, SSIM takes into account the characteristics of the human visual system and pays more attention to the structural information of the image content. The calculation formulas of PSNR and SSIM are as follows: \begin{align*} \text{PSNR}(x, y) &= 10 \log _{10} \left(\frac{1}{\text{MSE}(x, y)}\right)\tag{8}\\ \text{SSIM}(x, y) &= \frac{(2 \mu _{x} \mu _{y} + c_{1})(2 \sigma _{xy} + c_{2})}{(\mu _{x}^{2} + \mu _{y}^{2} + c_{1})(\sigma _{x}^{2} + \sigma _{y}^{2} + c_{2})} \tag{9} \end{align*}

where the ground-truth HR image is $x$ and the reconstructed SR image is $y$; $\mu$, $\sigma$, and $\sigma_{xy}$ denote means, standard deviations, and covariance, and $c_1$, $c_2$ are small stabilizing constants.
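The form of (8), with no explicit peak term, implies pixel values normalized to [0, 1]. A minimal sketch under that assumption:

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """PSNR of Eq. (8) for images scaled to [0, 1] (peak value 1, so it drops out)."""
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(1.0 / mse)

hr = torch.rand(1, 1, 64, 64)                        # Y channel, values in [0, 1]
sr = (hr + 0.01 * torch.randn_like(hr)).clamp(0, 1)  # synthetic reconstruction
print(psnr(hr, sr).item())                           # roughly 40 dB for this noise level
```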

2) Spectral Angle Mapper (SAM)

The SAM metric measures the angular similarity between two pixels irrespective of their absolute intensities; the lower the value, the smaller the angle between the spectral vectors of the two pixels. It is widely used as an effective measure of spectral similarity between pixels. The formula for SAM is \begin{equation*} \text{SAM}(a, b) = \cos ^{-1} \left(\frac{\sum _{i=1}^{n} a_{i} b_{i}}{\sqrt{\sum _{i=1}^{n} a_{i}^{2}} \sqrt{\sum _{i=1}^{n} b_{i}^{2}}} \right). \tag{10} \end{equation*}

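A per-pixel sketch of (10), with bands along the channel axis and the angles averaged over pixels; both conventions are assumptions about how the metric is aggregated here.

```python
import torch

def sam(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean spectral angle (radians) of Eq. (10); a, b are (C, H, W) spectral images."""
    dot = (a * b).sum(dim=0)
    denom = a.norm(dim=0) * b.norm(dim=0) + eps
    return torch.acos((dot / denom).clamp(-1, 1)).mean()
```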

3) Universal Quality Index (UQI)

UQI assesses the similarity of two images by taking into account their means, variances, and covariance. It ranges over [-1, 1]; the closer the value is to 1, the more similar the two images. In this article, we traverse the image with a moving window, compute the UQI value for each window, and take the average.
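A moving-window sketch consistent with this description: the classical Q index is computed per window from local means, variances, and covariance, then averaged. The window size of 8 and the epsilon guard are assumptions.

```python
import torch
import torch.nn.functional as F

def uqi(x: torch.Tensor, y: torch.Tensor, win: int = 8, eps: float = 1e-12) -> torch.Tensor:
    """Universal Quality Index averaged over sliding windows; x, y are (1, 1, H, W)."""
    mu_x = F.avg_pool2d(x, win, stride=1)
    mu_y = F.avg_pool2d(y, win, stride=1)
    var_x = F.avg_pool2d(x * x, win, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, stride=1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, stride=1) - mu_x * mu_y
    q = 4 * cov * mu_x * mu_y / ((var_x + var_y) * (mu_x ** 2 + mu_y ** 2) + eps)
    return q.mean()
```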

4) Spatial Correlation Coefficient (SCC)

SCC is a measure of spatial correlation between two images to assess their spatial structural similarity. A high spatial correlation coefficient indicates a high degree of similarity in spatial distribution and structural features between the images. The formula for SCC is given below \begin{equation*} \text{SCC}(X, Y) = \frac{\sum _{i=1}^{N} (X_{i} - \overline{X})(Y_{i} - \overline{Y})}{\sqrt{\sum _{i=1}^{N} (X_{i} - \overline{X})^{2}} \sqrt{\sum _{i=1}^{N} (Y_{i} - \overline{Y})^{2}}} \tag{11} \end{equation*}

where $X$ and $Y$ represent the pixel values of the two images, $N$ is the total number of pixels, and $\overline{X}$ and $\overline{Y}$ denote the average pixel values of $X$ and $Y$, respectively.
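Equation (11) is a Pearson correlation over pixels; the sketch below follows it as written (SCC is often computed on high-pass filtered versions of the images, but no such preprocessing is specified here).

```python
import torch

def scc(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Spatial correlation coefficient of Eq. (11) over all pixels."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-12)
```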

C. Comparison to State-of-the-Art Methods

In this section, we conduct comprehensive performance comparison experiments by evaluating our proposed model against several state-of-the-art models. Specifically, we compare it with VDSR [31], EDSR [34], RDN [45], RCAN [8], SAN [35], SwinIR [9], HAT [40], CFAT [46], HSPAN [47], HAUNet_RSSR [48], TSFNet_RSSR [49], and SwinFIR [41]. The comparative results for the UC Merced datasets at 4 × and 2 × magnifications are presented in Tables I and II, respectively. In addition, results for the CLRS datasets at 4 × and 2 × magnifications are detailed in Table III, and the outcomes for the RSSCN7 dataset at 4 × magnification are shown in Table IV. Utilizing multiple metrics for a comprehensive evaluation, our model achieves optimal values across all comparisons.

TABLE I. Quantitative Comparison With State-of-the-Art Methods on the Y Channel (YCbCr Space) of the UC Merced Dataset for Remote Sensing Image SR at ×4 Scale
TABLE II. Quantitative Comparison With State-of-the-Art Methods on the Y Channel (YCbCr Space) of the UC Merced Dataset for Remote Sensing Image SR at ×2 Scale
TABLE III. Quantitative Comparison With State-of-the-Art Methods on the Y Channel (YCbCr Space) of the CLRS Dataset for Remote Sensing Image SR at ×4 and ×2 Scales
TABLE IV. Quantitative Comparison With State-of-the-Art Methods on the Y Channel (YCbCr Space) of the RSSCN7 Dataset for Remote Sensing Image SR at ×4 Scale

In particular, we note that our method improves by about 0.18 dB on CLRS ×2 compared to the recent SwinFIR, which suggests that it can extract richer texture features from images with high spatial resolution and abundant embedded information.

In addition, although the number of parameters in our method is not the smallest, it reaches a relatively balanced level compared to other methods. We also compared the PSNR and SSIM values of different classes on the UC Merced ×4 dataset, as shown in Table V, where classes 1–21 correspond to the 21 scene categories of the UC Merced dataset (e.g., buildings, overpass), with each class containing 100 images. In the vast majority of categories, our method achieves the best results, and it is particularly prominent in the first category, the agricultural class.

TABLE V. PSNR and SSIM Under ×4 SR for Each Class of the UC Merced Dataset, Where the First Row of Each Class Is the PSNR and the Second Row Is the SSIM, With the Best Results in Red and the Second Best in Blue

To illustrate the advantages of our method in image processing more intuitively, we provide qualitative analyses of results from UC Merced × 4 and CLRS × 4 datasets, shown in Figs. 8 and 10, respectively. For the image “agricultural90,” methods such as RDN, RCAN, and SAN fail to achieve satisfactory SR, resulting in significant loss of image details. In addition, SwinIR, HAT, and SwinFIR methods exhibit noticeable distortions in processing field paths. In contrast, our CFFormer method successfully achieves desired SR with clear image details and natural textures.

Fig. 8. Visualization results of different SR methods on examples from UC Merced ×4; HR represents high resolution, best results in bold.

Fig. 9. Visual effect of removing only CA, FB, CFB, GAB, or the jump-joint on agricultural90.

Fig. 10. Visualization results of different SR methods on examples from CLRS ×4; HR represents high resolution, best results in bold.

In the case of “Parking_228” from CLRS × 4, VDSR fails to reconstruct the image correctly, while CNN-based methods like RDN, EDSR, and SAN produce serious blurring and artifacts. SwinIR and HAT also struggle to recover clear parking lines and vehicles. Only our CFFormer demonstrates the most favorable result in achieving accurate SR.

D. Ablation Study

1) Perform Ablation Experiments on Each Module

To verify the effectiveness of CA and FB within the CFB, of the GAB, and of the connections added to the RSTB to form the RFSTB, we conducted ablation studies on the UC Merced dataset at the ×4 scale, as shown in Table VI. Note that the "×" for RFSTB indicates retaining the original RSTB state, i.e., no fusion processing is performed between the (S)STLs, rather than removing them entirely.

TABLE VI. Ablation Experiments Performed on CA, FB, GAB, and RFSTB, Respectively; ✓ and × Denote With and Without the Specific Module, and the Best Results Are in Bold

We can observe that when all modules are present, the SR performance reaches its peak; when fusion processing between (S)STLs is not performed or GAB is not added, the performance decreases. Even when GAB and the RFSTB fusion are not included, performance improves whether only CA is added, only FB is added, or both are combined to form the CFB, which is sufficient to demonstrate the effectiveness of the proposed modules. These comparative results fully demonstrate the independent and complementary roles of CFB, GAB, and RFSTB in enhancing model performance.

To further demonstrate the effectiveness of the modules we proposed, we conducted a visual analysis taking "agricultural90" as an example, as shown in Fig. 9. It can be observed that when CA is removed, the resulting image exhibits more artifacts than the original, specifically deeper shadows between the terraces than in the original image. When FB and CFB are removed, the terraces undergo varying degrees of deformation; this is because FB effectively processes image information in the frequency domain, aiding in capturing fine details and reducing noise amplification, so removing it harms the restoration of image edges and textural details. After removing GAB, we found that although the terraces did not undergo severe deformation (bending), a significant amount of blurriness was introduced. When the jump-joint mechanism is removed, although there are no severe deformations or artifacts overall compared to the complete CFFormer result and the HR image, the edges of the white area in the lower right corner are not clearly recovered.

2) Ablation Experiments on “Maximize-Expand-Subtract” in GAB

To further demonstrate the effectiveness of the proposed GAB, as well as the "maximize-expand-subtract" and fusion strategy within it, we conducted a visual analysis taking "buildings69" and "overpass64" as examples, comparing the feature maps obtained by removing the GAB, removing only the expansion part of the GAB, and using the complete GAB, as shown in Fig. 11.

Fig. 11. Completeness of the Global Attention Block (GAB) investigated on buildings64 and overpass69: (a) feature map without the first GAB, (b) feature map after the first GAB with the "maximize, expand, subtract" and fusion processes removed, (c) feature map with the complete first GAB, (d) feature map without the second GAB, (e) feature map after the second GAB with the "maximize, expand, subtract" and fusion processes removed, (f) feature map with the complete second GAB.

It can be observed that Fig. 11(a) and 11(d) display relatively raw feature maps with distinct local details, but overall they lack a certain structure and integration of global information, making the feature maps appear scattered and lacking directionality. In Fig. 11(b) and 11(e), after removing some global processing steps, the feature maps show a different balance between local and global aspects. The features in the figures are more concentrated, showing the preliminary impact of global attention, but due to the absence of maximize, expand, subtract and fusion operations, this impact is not as significant as that of the complete GAB. Finally, in Fig. 11(c) and 11(f), it can be seen that the feature maps are more focused and structured, with global information well integrated and highlighted, which helps the model better understand image content and context in subsequent steps.

E. Discussion

To further validate the feature fusion designed in the RFSTB and the effect of different numbers of STLs and SSTLs on the CFFormer, we explored the influence of the number of (S)STLs on model performance and parameters, as shown in Table VII. We found that the performance of the CFFormer continues to improve as the number of layers increases; however, this also increases the number of parameters, and once the number of layers reaches 16, further increases bring only marginal improvement.

TABLE VII. Performance and Parameters for Different Numbers of (S)STL Layers, With the Best Results in Red and the Second Best in Blue

In addition, we explored the impact of window size on model performance. The window size refers to the dimensions of the nonoverlapping patches or windows into which the input is divided during self-attention computation. Selecting the window size involves a tradeoff between computational efficiency and receptive field: the larger the window, the larger the receptive field of each attention head and the more extensive the information it can capture, but the higher the computational cost, as shown in Table VIII. We found that performance and cost reached a relatively balanced point at a window size of 12, so we adopt a window size of 12 in our method.

TABLE VIII. Performance for Different Window Sizes, With the Best Results in Red and the Second Best in Blue

We also investigated the impact of different mlp_ratio values on performance. The mlp_ratio is the ratio of the hidden-layer dimension to the embedding dimension in the multilayer perceptron (MLP). This ratio determines the size of the hidden layer of the feedforward network in each Transformer layer, affecting the model's capacity and the number of parameters per layer, and thereby indirectly its learning ability and complexity. As shown in Table IX, after several experiments on the mlp_ratio, and considering the balance between parameters and performance, we ultimately chose an mlp_ratio of 2 for our method.

TABLE IX. Performance and Parameters for Different mlp_ratio Values, With the Best Results in Red and the Second Best in Blue

In addition to the effectiveness of the modules we proposed, we also assessed whether our CFFormer would result in significant memory consumption and increased execution time. We compared our model with the current state-of-the-art Transformer-based SR models SwinIR, HAT, and SwinFIR in terms of FPS, parameter count, Memory Allocated (MA), Max Memory Allocated (MMA), and execution time. FPS refers to the frame rate calculated based on execution time, indicating the number of input samples the model can process per second. MA refers to the total amount of GPU memory allocated by PyTorch at a specific point in time, including the GPU memory currently occupied by all tensors and caches allocated by PyTorch. MMA refers to the highest GPU memory usage among all completed operations before the program runs to the current position. This includes not only the currently active memory but also the peak memory usage during previous operations. Execution time refers to the time required for a single forward propagation of the model. The comparison results are shown in Table X. It can be seen that our CFFormer, due to its more complex structure, has more execution time and lower FPS compared to SwinIR and SwinFIR, but these increases and decreases are limited and still within a tolerable range. Moreover, compared to HAT, our method has less MA and higher performance.
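As a sketch of how such numbers can be collected, the helper below reads MA and MMA from PyTorch's CUDA allocator and times a single forward pass. It requires a CUDA device; the lack of warm-up iterations and the unit conversion are simplifications, and the function itself is illustrative rather than the measurement script used for Table X.

```python
import time
import torch

def profile_forward(model, inp, device="cuda"):
    """Report execution time, memory allocated (MA), and max memory allocated (MMA)."""
    model = model.to(device).eval()
    inp = inp.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        torch.cuda.synchronize(device)
        t0 = time.time()
        model(inp)                                    # single forward propagation
        torch.cuda.synchronize(device)
    return {
        "time_s": time.time() - t0,
        "MA_MB": torch.cuda.memory_allocated(device) / 2 ** 20,
        "MMA_MB": torch.cuda.max_memory_allocated(device) / 2 ** 20,
    }
```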

TABLE X. Comparison of FPS, Parameters, Memory Allocated, Max Memory Allocated, and Execution Time Among the SwinIR, HAT, SwinFIR, and CFFormer Models
SECTION V.

Conclusion

In this article, we introduce CFFormer, a novel method built on the Swin Transformer architecture that substantially enhances SR performance for remote sensing images. This improvement is achieved through the design of RFSTBs and the implementation of innovative components such as CFBs and GABs. The architecture comprises multiple RFSTBs, each containing several (S)STLs and advanced feature fusion techniques. CFBs are strategically placed at the end of each RFSTB, where they transform the feature mapping from the spatial to the frequency domain, effectively addressing challenges such as shape distortion and blurring. Compared to existing methods, CFFormer excels in producing more realistic and detailed textures, achieving state-of-the-art (SOTA) results on metrics including PSNR, SSIM, SAM, UQI, and SCC. This demonstrates its superior capability in enhancing the quality and clarity of high-resolution images reconstructed from remote sensing data.

However, our research has primarily focused on LR images generated by bicubic interpolation. While yielding relatively good results, our method encounters challenges with LR images under different degradation models. Therefore, our future work aims to explore the applicability of our method to more complex real-world degradation scenarios. In addition, we plan to investigate the effectiveness of our approach in other domains such as rain removal and denoising, aiming to enhance the model's generalization capabilities.
