Introduction
Remote sensing images have a broad range of applications in the military [1], environmental monitoring [2], agriculture [3], urban planning, and disaster prevention [4]. Under such extensive application requirements, humans often need higher resolution remote sensing images for research and discussion. However, due to limitations imposed by current imaging equipment, cost, weather factors, and lighting conditions, it is extremely difficult to directly obtain high-resolution images. Therefore, finding methods to obtain higher resolution images has become a major research topic in the field of remote sensing. Among these methods, single remote sensing image super resolution (RSSR) technology, which aims to restore high-resolution (HR) images from low-resolution (LR) images, has garnered significant attention in remote sensing applications [5], [6], [7].
To date, a large body of work has achieved remarkable results in deep learning-based single natural image super-resolution (SR) [8], [9]. Learning-based SR methods leverage the information available in LR images to infer and predict the corresponding high-resolution details (e.g., edges). Thanks to their sophisticated nonlinear learning capabilities and extensive training data, deep learning models are favored over interpolation-based SR methods [10], [11], [12] and reconstruction-based SR methods [13], [14], as they generate clearer and more natural image details and integrate more easily with other deep learning applications and services.
Although deep learning has achieved outstanding results in natural image SR, remote sensing image SR differs significantly from natural image SR. Natural image SR primarily focuses on enhancing image quality, recovering image details and textures, and maintaining natural aesthetics, while remote sensing image SR faces challenges such as dealing with large homogeneous areas (like oceans and deserts) and highly complex features (like urban areas). Moreover, the spatial resolution and global information of remote sensing images play a crucial role in remote sensing image SR, aspects that general natural image SR tends to overlook. This leads to the necessity for researchers to design specialized methods for remote sensing images to improve the effectiveness of remote sensing SR [15], [16], [17].
In 2014, Dong et al. [18] proposed super-resolution convolutional neural network (SRCNN), marking the first application of deep learning technology to image SR reconstruction, achieving an end-to-end mapping from low to high resolution. Subsequently, Kim et al. [19] proposed VDSR, which significantly improved the performance of deep learning in SR by increasing the depth of the network. Tuna et al. [20] applied the SRCNN and VDSR models combined with intensity-hue-saturation (IHS) transformation to satellite images obtained from VHR SPOT6 and 7 as well as Pleiades 1A and 1B satellites. The experimental results showed that the VDSR method outperformed SRCNN on both panchromatic image (PAN) and multispectral (MS) remote sensing images. Later, Xu et al. [21] proposed a deeply modulated convolutional neural network, which combines image details with contextual information using local and global memory connections to generate high-quality images. Recently, Ren et al. [22] proposed an enhanced residual convolutional neural network based on a dual-brightness scheme, which enhances the feature flow module of the residual convolutional neural network and the ability to learn distinctively across feature maps. Qian et al. [23] adopted the ResNet architecture and utilized 3-D separable convolution to better capture spatial–spectral features, providing a new self-supervised learning method for the SR problem of Sentinel-2 multispectral images.
Following this, with the development of attention mechanisms in computer vision applications, Gu et al. [24] proposed a deep residual squeeze-and-excitation network for RSSR, introducing a residual squeeze-and-excitation block to model the interdependencies between channels and enhance the network's representational power. Recently, Wang et al. [25], focusing on VHR satellite images, applied channel attention and spatial attention to a deep dense residual network to improve the performance of single image super resolution (SISR) solutions. Wang et al. [26] proposed a distance attention block (DAB) as a bridge between the main branch and the distance attention residual connection block (DARCB) branch, with the DAB effectively alleviating the loss of detail features during extraction by deep convolutional neural networks (CNNs).
However, CNN-based methods face an inevitable obstacle in remote sensing image super-resolution (RSISR). Owing to the design of convolutional layers, the interaction between a convolutional kernel and the image is content-agnostic, and it is unreasonable to use the same kernel to reconstruct different regions of an image. To address this issue, the Vision Transformer (ViT) [27] and the ViT-based Swin Transformer [28] have emerged, with subsequent researchers further improving and building on this foundation. For instance, ConvFormerSR [29] adopted a dual-branch structure, with one branch using the Swin Transformer mechanism combined with global attention and the other using a series of residual groups to further extract image texture information, ultimately achieving outstanding results on the HLSSR-GJ remote sensing dataset. Ren et al. [30] proposed a cross dual-branch U-Net architecture that combines convolutional neural networks and Transformers; by designing a spectral-spatial parallel Transformer and a spectral-spatial feature interaction module, it effectively improves the spatial resolution of hyperspectral images.
Despite the significant achievements of current Transformer-based remote sensing image SR methods, they still face challenges in capturing global information and utilizing richer spatial resolution. These limitations not only affect the quality of detail recovery but also restrict the stability of the model in deep networks, often leading to vanishing or exploding gradients. To address these issues, we propose a new RSISR framework: CFFormer. This framework consists of three key parts: 1) shallow feature extraction; 2) deep feature extraction; and 3) upsampling. In the deep feature extraction phase, our CFFormer utilizes a carefully designed Global Attention Block (GAB) to capture richer global and contextual information. Subsequently, the Channel Fourier Block (CFB) maps image features into the frequency domain via the Fourier transform to handle high-frequency information more effectively. In addition, we optimize the information flow and alleviate the vanishing and exploding gradients caused by increased network depth through the residual fusion Swin Transformer blocks. In summary, the main contributions of this article are as follows.
To enhance the recovery of image details, we have carefully designed the CFB, which combines channel attention with the Fast Fourier Transform (FFT) and employs depthwise convolutions and pointwise convolutions to extract more comprehensive, detailed, and stable features.
We propose the use of a GAB and have made improvements to it, placing it at the beginning and end of the deep feature extraction section to further optimize global information and enhance the model's expressive power.
To effectively combine features from early and later layers, capturing a broader range of contextual and detailed features without adding computational burden, we combine jump-joints (skip connections) with the Swin Transformer to carefully design the Residual Fusion Swin Transformer Block (RFSTB).
Related Work
In this section, we first explain the spatial resolution and global information of remote sensing images, then elaborate on CNN-based SR networks and Transformer-based SR networks, and finally explain the FFT used extensively in this work.
A. Spatial Resolution and Global Information of Remote Sensing Images
Remote sensing images significantly differ from natural images, particularly in terms of spatial resolution and global information, both of which greatly impact SR. Spatial resolution is defined as the ground area represented by a single pixel in the image. Higher spatial resolution means that each pixel covers a smaller ground area, revealing more detailed and fine geographic features. However, this also places higher demands on models to utilize more spatial resolution information.
In addition, global information plays a crucial role in the SR of remote sensing images. Global information refers to information involving large-scale or entire scenes in the image. Remote sensing images usually cover vast geographic areas and contain complex natural and man-made landscapes. Global information helps to understand the contextual relationships in the image, such as terrain continuity and the distribution of ecological regions. In SR technology, dealing with global information often involves considering the relationships and interactions between different regions within the image. This requires SR models to have a high degree of spatial perception ability to recognize and utilize the global patterns and structures in the image.
B. SR Based on CNNs
Previously, traditional CNN-based SR networks were widely applied and achieved tremendous success. Since SRCNN [18] and VDSR [31], various advanced networks have emerged continuously. Furthermore, with the vigorous development of ResNet [32] and GAN [33], researchers in the SR field gradually began exploring their applications. In 2017, the advent of SRGAN [34] effectively linked GANs with SR and introduced numerous residual blocks into the generator. Ren et al. [30] proposed a context-aware edge-enhanced generative adversarial network (CEEGAN) SR framework for reconstructing visually pleasing images that can be practically applied in real scenarios. In addition, increasing the "width" of networks has been considered a way to enhance SR performance; accordingly, the enhanced deep residual network for single image super-resolution (EDSR) [34] significantly improved performance by increasing the number of network and filter channels. Subsequently, with the rise of attention mechanisms, the residual channel attention network (RCAN) [8], SAN [35], and others raised the peak signal-to-noise ratio to new heights, demonstrating the importance of focusing more on high-frequency information than on low-frequency information. MSAN [36] later applied this mechanism to remote sensing images, achieving success in multilevel feature extraction for the complex structure of remote sensing images. Huang et al. [37] developed residual dual attention blocks, including local multilevel fusion modules and dual attention mechanisms, to make the network focus more on high-frequency information areas.
However, despite achieving good results, the above CNN-based SR models primarily focus on local information due to their limited feature extraction capabilities, thus failing to fully utilize contextual and global information. Specifically, contextual and spatial information may be lost during the decoding stage, limiting the recovery of high-resolution information.
C. SR Based on Transformers
In order to solve problems such as the inability to fully utilize contextual information, the Transformer was applied to SR. In 2021, the pretrained image processing transformer (IPT) [38] pioneered the application of Transformers to low-level vision tasks. By constructing a Transformer-based pretrained model and utilizing its powerful modeling capability, the corresponding low-level vision tasks were effectively accomplished. Compared to various attention mechanisms, IPT proves more effective in SR tasks. However, IPT necessitates extensive pretraining and incurs high computational costs, thereby limiting its efficiency. Subsequently, SwinIR [9] and NLSA [39], built upon the Swin Transformer [28], emerged between 2021 and 2022. The Swin Transformer innovatively introduced window-based local self-attention, which computes self-attention within localized windows, significantly enhancing performance over IPT and thereby improving the utility of Transformers. Recent advancements such as the hybrid attention transformer for remote sensing super-resolution (HAT) [40] and SwinFIR [41] further extend SwinIR [9] by integrating overlapping cross-attention and the FFT, enabling broader pixel activation across the network and further boosting performance.
D. FFT Applied to Images on SR
FFT is an algorithm of paramount importance, enabling fast computation of the Discrete Fourier Transform and its inverse. As one of the foundational algorithms of digital signal processing, the FFT is used extensively across scientific and engineering disciplines, including digital image processing, audio signal analysis, and telecommunications. Characteristics that are difficult to manipulate in the spatial domain can often be handled more readily in the frequency domain. Concretely, the FFT transforms an image from its native spatial representation to the frequency domain, where targeted modifications of specific spectral components can be applied to achieve the desired image processing outcomes, followed by an inverse transformation back to the spatial domain via the inverse FFT. In the realm of image SR, the application of FFT has been gaining increasing traction. Wang et al. [42] leveraged the Fourier transform to capture global facial structural information and augmented model capability by harmonizing spatial and spectral information through dual pathways that separate local and global dependencies. Sinha et al. [43] introduced nonlocal attention-assisted fast Fourier convolution to broaden the receptive field and enable learning of long-range dependencies. Liu et al. [28] expanded the receptive field via a spatial frequency block, thereby enhancing SR by capturing long-range interdependencies. Despite the remarkable efficacy of current FFT-driven SR methods, they still fall short in extracting deeper feature representations, leaving fine textures imperfectly recovered. To further harness the potential of FFT in image SR, we have conducted an exploration and crafted an innovative CFB. A detailed exposition of this design is presented in Section III-D.
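To make the workflow just described concrete, the following minimal PyTorch sketch (our illustration, not code from any of the cited works) transforms an image to the frequency domain, applies a crude low-pass modification, and transforms it back; the 64 × 64 input and the mask size are arbitrary choices.

```python
import torch

img = torch.rand(1, 3, 64, 64)                      # toy image batch (B, C, H, W)

freq = torch.fft.fft2(img, norm="ortho")            # spatial -> frequency domain (complex tensor)
freq = torch.fft.fftshift(freq, dim=(-2, -1))       # move the DC component to the center

# Example modification: keep only a central low-frequency square (a crude low-pass filter).
h, w = img.shape[-2:]
mask = torch.zeros(1, 1, h, w)
mask[..., h // 2 - 8:h // 2 + 8, w // 2 - 8:w // 2 + 8] = 1.0

filtered = torch.fft.ifftshift(freq * mask, dim=(-2, -1))
out = torch.fft.ifft2(filtered, norm="ortho").real  # frequency -> spatial domain
```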
Method
In this section, we first describe the motivation behind the design of our CFFormer, then introduce the overall structure of the proposed CFFormer, and finally describe its three important innovations: 1) the RFSTB; 2) the CFB; and 3) the GAB.
A. Motivation
Despite the significant advancements in the field of image SR through deep learning techniques, challenges still remain when dealing with remote sensing images that have high spatial resolution and complex global information. Remote sensing images typically cover vast geographical areas with large-scale and diverse scene content, necessitating SR technology that can effectively capture extensive global information, utilize higher spatial resolution, and accurately recover details ranging from urban structures to natural landforms. However, traditional SR methods, such as CNNs and Transformer models, while achieving certain successes in spatial domain processing, often overlook the critical frequency domain features in remote sensing images, such as the low-frequency large-scale structures and high-frequency detailed textures, which are essential factors in determining image quality. To address this issue, we have designed a novel SR model that combines frequency domain and spatial domain processing, specifically tailored to the needs of remote sensing images. By processing in the frequency domain, the model can more effectively utilize frequency information to restore the global structure of the image and further refine the texture details, which is crucial for enhancing the spatial resolution of remote sensing images. Spatial domain processing ensures the precise recovery of local image features, especially in urban and agricultural areas, where detail recovery is vital for the practical application of remote sensing images. We believe that a model combining these two strategies can not only enhance the overall visual quality of remote sensing images but also more comprehensively analyze and utilize the spatial and frequency information in the images, thereby significantly improving the descriptive ability of complex remote sensing scenes.
B. Overall Structure of CFFormer
Our CFFormer is designed to more effectively harness and integrate global information and to address fine-grained feature mappings that are difficult to handle in the spatial domain, thereby enhancing feature reusability, mitigating gradient vanishing and explosion, and improving RSSR performance. The overall structure of the CFFormer is depicted in Fig. 3. Our CFFormer consists of a shallow feature extraction component, a deep feature extraction component, and an image upsampling section. Specifically, the input LR image $I_{\text{LR}}$ is first passed through a convolutional layer to extract shallow features, which are then refined by the deep feature extraction component $F_{\text{DFE}}$ and upsampled by PixelShuffle to produce the SR output $I_{\text{SR}}$
\begin{equation*}
I_{SR} = F_{\text{PixelShuffle}}(F_{DFE}(\text{Conv}(I_{LR}))) \tag{1}
\end{equation*}
Comparison of our CFFormer with the state-of-the-art methods SwinIR, HAT, and SwinFIR. On UC Merced × 4, CLRS × 2, and CLRS × 4, our CFFormer achieves the best performance among existing works.
Comparison of our CFFormer with and without the CFB on harbor67 from the UC Merced dataset. In the absence of the CFB, larger distortions are produced.
Overall structure of the proposed CFFormer. GAB denotes the Global Attention Block and CFB denotes the Channel Fourier Block.
In the deep feature extraction part, each RFSTB internally contains a series of Swin Transformer layers (STLs), shift Swin Transformer layers (SSTLs), and CFBs to realize the deep extraction of feature mapping.
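For orientation, the following schematic PyTorch skeleton mirrors the three-stage pipeline in (1); the GAB, RFSTBs, and CFBs of the full model are replaced with plain convolutional placeholders, and the embedding dimension and block count are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn as nn

class CFFormerSkeleton(nn.Module):
    def __init__(self, in_ch=3, dim=60, num_blocks=4, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, dim, 3, padding=1)        # shallow feature extraction
        # deep feature extraction F_DFE: GAB -> RFSTBs (with CFBs) -> GAB in the full model;
        # plain convolutions stand in for those blocks here
        self.deep = nn.Sequential(*[nn.Conv2d(dim, dim, 3, padding=1) for _ in range(num_blocks)])
        self.upsample = nn.Sequential(                            # PixelShuffle upsampling
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        return self.upsample(self.deep(self.shallow(lr)))         # I_SR as in (1)

sr = CFFormerSkeleton()(torch.rand(1, 3, 64, 64))                 # -> torch.Size([1, 3, 256, 256])
```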
C. Residual Fusion Swin Transformer Block (RFSTB)
In existing Swin Transformer-based literature, STLs or SSTLs are typically connected in a direct series, a structure we believe may lead to sparse relationships between layers. This sparse connectivity can hinder the flow of features and result in the loss of critical information such as local-global features in remote sensing images. To address this issue, we propose an RFSTB to enhance interlayer connections, as depicted in Fig. 3(a). Specifically, each STL is concatenated with the output of its second neighboring SSTL, and channel adjustments are made using 1 × 1 convolutions. Similarly, each SSTL undergoes the same feature fusion operation with its second neighboring STL. This approach strengthens connections between features captured at lower layers and high-level abstract features at higher layers, optimizing information flow and enhancing the model's feature expressive power. Consequently, this improves the overall performance of the model. It is important to note that the number of (S)STLs can vary, as illustrated in Fig. 4, and different configurations will yield varying performance and parameter counts, which will be detailed in Section IV-E.
As shown in Fig. 3, taking six (S)STLs as an example, we perform feature fusion between (1).STL and (4).SSTL, between (2).SSTL and (5).STL, and between (3).STL and (6).SSTL. Taking the first RFSTB as an example, the feature $I_{\text{GAB}}$ output by the GAB is processed by (1).STL, (2).SSTL, (3).STL, and (4).SSTL to obtain $I_{\text{SR4}}$
\begin{equation*}
I_{\text{SR4}} = F_{\text{SSTL}}\left(F_{\text{STL}}\left(F_{\text{SSTL}}\left(F_{\text{STL}}\left(I_{\text{GAB}}\right)\right)\right)\right) \tag{2}
\end{equation*}
\begin{align*}
\left\lbrace \begin{array}{l}
I_{\text{SR1,4}} = \text{Conv}(\text{Cat}(I_{\text{SR1}}, I_{\text{SR4}})) \\
I_{\text{SR5}} = F_{\text{STL}}(I_{\text{SR1,4}}) \\
I_{\text{SR2,5}} = \text{Conv}(\text{Cat}(I_{\text{SR2}}, I_{\text{SR5}})) \\
I_{\text{SR6}} = F_{\text{SSTL}}(I_{\text{SR2,5}}) \\
I_{\text{SR3,6}} = \text{Conv}(\text{Cat}(I_{\text{SR3}}, I_{\text{SR6}})) \\
I_{\text{CFB}} = F_{\text{CFB}}(I_{\text{SR3,6}}) \\
I_{\text{RFSTB}} = I_{\text{GAB}} + I_{\text{CFB}} \end{array} \right. \tag{3}
\end{align*}
By introducing a fusion layer between different processing stages, our architecture effectively integrates information across layers. This optimization not only enhances the flow of features, but also improves the network's capacity to recognize complex patterns. Consequently, it mitigates issues such as vanishing and exploding gradients, significantly enhancing RSSR performance.
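The cross-layer fusion of (2) and (3) can be sketched as follows; the (S)STLs and the CFB are shown as placeholder modules (plain convolutions and an identity), since only the fusion wiring is of interest here, and the channel width is an arbitrary choice.

```python
import torch
import torch.nn as nn

class RFSTBSketch(nn.Module):
    def __init__(self, dim=60):
        super().__init__()
        # placeholders for the six (S)STLs: (1).STL, (2).SSTL, (3).STL, (4).SSTL, (5).STL, (6).SSTL
        self.layers = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(6)])
        self.fuse14 = nn.Conv2d(2 * dim, dim, 1)   # Conv(Cat(I_SR1, I_SR4))
        self.fuse25 = nn.Conv2d(2 * dim, dim, 1)   # Conv(Cat(I_SR2, I_SR5))
        self.fuse36 = nn.Conv2d(2 * dim, dim, 1)   # Conv(Cat(I_SR3, I_SR6))
        self.cfb = nn.Identity()                   # CFB placeholder

    def forward(self, x_gab):
        s1 = self.layers[0](x_gab)                                    # (1).STL
        s2 = self.layers[1](s1)                                       # (2).SSTL
        s3 = self.layers[2](s2)                                       # (3).STL
        s4 = self.layers[3](s3)                                       # (4).SSTL
        s5 = self.layers[4](self.fuse14(torch.cat([s1, s4], dim=1)))  # (5).STL on fused (1),(4)
        s6 = self.layers[5](self.fuse25(torch.cat([s2, s5], dim=1)))  # (6).SSTL on fused (2),(5)
        out = self.cfb(self.fuse36(torch.cat([s3, s6], dim=1)))       # CFB on fused (3),(6)
        return x_gab + out                                            # residual with the block input
```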
D. Channel Fourier Block (CFB)
In remote sensing images, there are often important feature boundaries (e.g., farm roads or rivers) that are blurred due to the lack of high-frequency information. By transforming the feature maps from the spatial domain to the frequency domain through FFT, these high-frequency details can be effectively restored, making the boundaries clearer and significantly enhancing the image's resolution and practicality. Therefore, to further enhance the recovery of image details and overall quality, we have carefully designed the CFB. As shown in Fig. 5, the CFB consists of a Channel Attention part (CA) and a Fourier Block part (FB). The input feature maps will be processed through both the CA and the FB, and the results of the two branches will be fused as the output of the CFB.
Channel Fourier block CFB, where (a) is channel attention (CA) and (b) is Fourier block (FB).
Channel Attention: To enhance the performance of CNNs, we have applied the traditional channel attention mechanism from [44] to our network. Unlike the original approach, as shown in Fig. 5(a), we first perform a Conv-act-Conv operation at the beginning of the channel attention to further extract features, then multiply element-wise with the feature maps processed by the channel attention, and finally perform feature fusion to obtain the output of CA. Our CA optimizes the model's feature expression by assigning different importance to features of different channels, enabling the network to focus more on features that are more useful for the current task.
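A minimal sketch of the CA branch as described above, assuming a squeeze-and-excitation style attention for the channel-weighting step; the kernel sizes, activation, and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CASketch(nn.Module):
    def __init__(self, dim=60, reduction=16):
        super().__init__()
        self.stem = nn.Sequential(                      # Conv-Act-Conv feature extraction
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(), nn.Conv2d(dim, dim, 3, padding=1)
        )
        self.attn = nn.Sequential(                      # per-channel attention weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(dim, dim, 1)              # final feature fusion

    def forward(self, x):
        feat = self.stem(x)
        return self.fuse(feat * self.attn(feat))        # reweight channels, then fuse
```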
Fourier Block: As mentioned at the beginning of Section III-D, in order to better recover image details and structures, we combine the FFT and carefully design the FB, as shown in Fig. 5(b). Before processing begins, we assume that the input feature map $I_{\text{in}}$ has dimensions (b, c, h, w), where b is the batch size, c the number of channels, h the image height, and w the image width. First, since the FFT of a real-valued signal is conjugate symmetric, only half of the frequency components need to be stored to reconstruct the complete spectrum. Therefore, to improve computational efficiency, we first reduce the channels of the input feature map by half with a 1 × 1 convolution. Next, we use the FFT to convert the feature map from the spatial domain to the frequency domain. For a 2-D feature map $I_{\text{in}}(x, y)$, the Fourier transform is expressed as
\begin{equation*}
F_{\text{FFT}}(u,v) = \sum _{x=0}^{M-1}\sum _{y=0}^{N-1}I_{\text{in}}(x,y) \cdot e^{-j2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)} \tag{4}
\end{equation*}
During the FFT process, due to the symmetry of the FFT, we only need to retain the positive-frequency part and the dc component, so the width w becomes w/2 + 1. In addition, an extra dimension of size 2 is added to store the real and imaginary parts of each frequency component. Thus, after a 2-D feature map undergoes the FFT operation, the tensor dimensions change from (b, c/2, h, w) to (b, c/2, h, w/2 + 1, 2). Next, features are further extracted from the frequency-domain feature map by a conv-act, followed separately by a 1 × 1 pointwise convolution and a 3 × 3 depthwise convolution. The pointwise convolution performs information mixing and feature transformation between channels; although its kernel is small, it can learn more complex feature representations by integrating information from different channels. The depthwise convolution applies a convolution kernel separately within each input channel; unlike a traditional convolutional layer, it does not mix information between channels but processes the spatial features of each channel independently. Passing the feature map through both depthwise and pointwise convolutions and then fusing them allows the model to optimize intra-channel and cross-channel features simultaneously, enhancing the expressiveness of features and the learning efficiency of the network. Moreover, it enables the network to achieve complex feature extraction and fusion without adding much computational overhead.
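As a quick check of the shape change described above, the following snippet (assuming a PyTorch implementation with torch.fft) applies the real FFT to a halved-channel feature map:

```python
import torch

b, c, h, w = 2, 64, 48, 48
x = torch.rand(b, c // 2, h, w)           # feature map after the 1x1 channel reduction

freq = torch.fft.rfft2(x, norm="ortho")   # only positive frequencies + DC are kept
print(freq.shape)                         # torch.Size([2, 32, 48, 25])  -> width becomes w/2 + 1

freq_ri = torch.view_as_real(freq)        # split complex values into real/imaginary parts
print(freq_ri.shape)                      # torch.Size([2, 32, 48, 25, 2])
```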
After fusing the feature maps processed by the depthwise and pointwise convolutions, we apply max pooling and average pooling to the result. Max pooling extracts the strongest activation signals of regional features, emphasizing the most salient parts and helping to maintain the sharpness and prominence of image features. Average pooling computes the mean value of feature regions, extracting smoother features and taking more account of the statistical information of the entire region. Next, to enable the network to learn how to extract and integrate information from different pooling strategies for more accurate prediction, we concatenate the max pooling and average pooling results. After concatenation, we apply a depthwise convolution, followed by Act-Conv for channel dimension adjustment. After a sigmoid activation, we perform element-wise multiplication with $I_{\text{FB2}}$, the feature before the pooling operation, and then apply the inverse Fourier transform (Inv FFT) to convert the modified feature map from the frequency domain back to the spatial domain. Finally, we add this to the feature before the Fourier transform and adjust the channels with a 1 × 1 convolution
\begin{align*}
\left\lbrace\! \begin{array}{l}
I_{\text{FFT}} = F_{\text{FFT}}(\text{Act}(\text{Conv}(I_{3,6}))) \\
I_{\text{PConv}} = \text{PConv}(\text{Act}(\text{Conv}(I_{\text{FFT}}))) \\
I_{\text{DConv}} = \text{DConv}(\text{Act}(\text{Conv}(I_{\text{FFT}}))) \\
I_{\text{FB2}} = \text{Conv}(\text{Cat}(I_{\text{PConv}}, I_{\text{DConv}})) \\
I_{\text{FB3}} =\text{Cat} (F_{\text{AVG}}(I_{\text{FB2}}), F_{\text{MAX}}(I_{\text{FB2}})) \\
I_{\text{FB4}} = \text{Conv}(\text{Act}(\text{DConv}(I_{\text{FB3}}))) \\
I_{\text{FB}}\! = \!\text{Conv}(\text{F}_{\text{invFFT}}(I_{\text{FB2}} \times \text{Sig}(I_{\text{FB4}}))) \!\!\!+\! \text{Act}(\text{Conv}(I_{3,6})). \end{array}\right. \tag{5}
\end{align*}
In the description, Conv(·) denotes the convolution operation, Act(·) the activation function, PConv(·) and DConv(·) the pointwise and depthwise convolutions, Cat(·) channel-wise concatenation, $F_{\text{AVG}}$ and $F_{\text{MAX}}$ average and max pooling, Sig(·) the sigmoid function, $F_{\text{FFT}}$ the Fourier transform in (4), and $F_{\text{invFFT}}$ the inverse Fourier transform, which is expressed as
\begin{equation*}
F_{\text{invFFT}}(x,y) = \frac{1}{MN} \sum _{u=0}^{M-1} \sum _{v=0}^{N-1} I_{\text{freq}}(u,v) \cdot e^{j2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)} \tag{6}
\end{equation*}
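The following sketch follows the data flow of (5). It is an illustration under several assumptions: the real and imaginary parts of the frequency tensor are folded into the channel dimension so that ordinary convolutions can be applied, the pooling windows are fixed at 3 × 3 with stride 1, GELU stands in for the unspecified activation, and the restoration of the original channel count (and the fusion with the CA branch) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBSketch(nn.Module):
    def __init__(self, dim=60):
        super().__init__()
        half = dim // 2
        self.reduce = nn.Conv2d(dim, half, 1)                     # halve channels before the FFT
        c2 = 2 * half                                             # real + imaginary parts as channels
        self.pre = nn.Sequential(nn.Conv2d(c2, c2, 1), nn.GELU()) # conv-act in the frequency domain
        self.pconv = nn.Conv2d(c2, c2, 1)                         # pointwise conv: cross-channel mixing
        self.dconv = nn.Conv2d(c2, c2, 3, padding=1, groups=c2)   # depthwise conv: per-channel filtering
        self.fuse = nn.Conv2d(2 * c2, c2, 1)                      # Conv(Cat(I_PConv, I_DConv))
        self.gate = nn.Sequential(                                # DConv-Act-Conv on the pooled features
            nn.Conv2d(2 * c2, 2 * c2, 3, padding=1, groups=2 * c2),
            nn.GELU(),
            nn.Conv2d(2 * c2, c2, 1),
        )
        self.post = nn.Conv2d(half, half, 1)                      # 1x1 conv after the inverse FFT

    def forward(self, x):
        skip = F.gelu(self.reduce(x))                             # Act(Conv(I_{3,6})), channels halved
        freq = torch.fft.rfft2(skip, norm="ortho")                # spatial -> frequency, width -> w/2 + 1
        f = self.pre(torch.cat([freq.real, freq.imag], dim=1))    # treat real/imag parts as channels
        fb2 = self.fuse(torch.cat([self.pconv(f), self.dconv(f)], dim=1))
        pooled = torch.cat([F.avg_pool2d(fb2, 3, 1, 1),           # pooling configuration is assumed
                            F.max_pool2d(fb2, 3, 1, 1)], dim=1)
        fb4 = self.gate(pooled)
        mod = fb2 * torch.sigmoid(fb4)                            # element-wise frequency gating
        real, imag = mod.chunk(2, dim=1)
        spatial = torch.fft.irfft2(torch.complex(real, imag),     # frequency -> spatial domain
                                   s=skip.shape[-2:], norm="ortho")
        return self.post(spatial) + skip                          # residual with the pre-FFT feature
```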
E. Global Attention Block (GAB)
For remote sensing images, processing global information is crucial for correctly interpreting and reconstructing large-scale geographical environments. Global attention mechanisms can help models identify and leverage contextual information across vast spatial extents within images, which aids in maintaining the continuity and overall consistency of geospatial features during the SR process, thereby enhancing the practicality of spatial resolution. Therefore, to enhance the model's ability to capture information on a global scale, we propose a GAB, placing it at the beginning and end of deep feature extraction (as shown in Fig. 3), where the initial GAB is used for richer feature extraction of LR, and the final GAB, together with the initial GAB, captures longer-range and global information more effectively.
The proposed GAB is shown in Fig. 6. It should be noted that, to enable the model to better focus on key features, we apply a "maximize-expand-subtract" strategy within the GAB, and we also employ a jump-joint (skip connection) between the feature maps before and after the GAB to promote information flow in deep networks and prevent gradient vanishing during training. Initially, the input feature map undergoes Conv-Act-Conv to further extract features and is then reshaped three times to obtain three feature maps $I_{\text{GA1}}$, $I_{\text{GA2}}$, and $I_{\text{GA3}}$. $I_{\text{GA1}}$ is transposed and multiplied with $I_{\text{GA2}}$ to obtain $I^{\prime}_{\text{GA12}}$, which then undergoes max pooling and expansion; $I^{\prime}_{\text{GA12}}$ is subtracted from this expanded maximum, and the result is passed through a sigmoid to normalize and highlight important features. The normalized result is multiplied with $I_{\text{GA3}}$, added to the initial convolved feature, and finally fused through feature fusion to obtain the final output of the GAB, $I_{\text{GAB}}$. We take the GAB at the head of the deep feature extraction as an example
\begin{align*}
\left\lbrace \begin{array}{l}
I_{\text{GA2}} = I_{\text{GA3}} = F_\text{Reshape}(\text{Conv}(I_{\text{LR}})) \\
I_{\text{GA1}} = F_\text{Transpose}(F_\text{Reshape}(\text{Conv}(I_{\text{LR}}))) \\
I^{\prime }_{\text{GA12}} = I_{\text{GA1}} \times I_{\text{GA2}} \\
I_{\text{GA12}} = \text{Sig}(\text{MAX}(I^{\prime }_{\text{GA12}}) - I^{\prime }_{\text{GA12}}) \\
I^{\prime }_{\text{GA}} = I_{\text{GA12}} \times I_{\text{GA3}} + \text{Conv}(I_{\text{LR}}) \\
I_{\text{GA}} = \text{Conv}(\text{Cat}(I^{\prime }_{\text{GA}}, \text{Conv}(I_{\text{LR}}))) \end{array} \right. \tag{7}
\end{align*}
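A hedged sketch of the GAB following (7) is given below; the reshape/transpose pattern is interpreted here as a global attention map over spatial positions, and the multiplication order is chosen for shape compatibility, so it should be read as an illustration of the "maximize-expand-subtract" gating rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class GABSketch(nn.Module):
    def __init__(self, dim=60):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, 3, padding=1)        # Conv(I_LR)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)               # Conv(Cat(I'_GA, Conv(I_LR)))

    def forward(self, x):
        feat = self.proj(x)                                   # (b, c, h, w)
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)                         # I_GA2 = I_GA3 (reshaped feature)
        energy = torch.bmm(flat.transpose(1, 2), flat)        # I'_GA12 = I_GA1 x I_GA2, (b, hw, hw)
        # "maximize-expand-subtract": expand the row-wise maximum, subtract the map,
        # then squash with a sigmoid to highlight the most salient responses
        gate = torch.sigmoid(energy.max(dim=-1, keepdim=True).values - energy)
        attended = torch.bmm(flat, gate)                      # I_GA12 x I_GA3 (order chosen for shapes)
        out = attended.view(b, c, h, w) + feat                # residual with the projected feature
        return self.fuse(torch.cat([out, feat], dim=1))       # jump-joint fusion with Conv(I_LR)

out = GABSketch(dim=16)(torch.rand(1, 16, 32, 32))            # -> (1, 16, 32, 32)
```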
Experiment
A. Datasets and Implementation Details
We conducted extensive experiments on the UC Merced, CLRS, and RSSCN7 datasets to demonstrate the effectiveness of CFFormer.
1) UC Merced Dataset and Implementation Details
UC Merced contains 21 categories of remote sensing scenes, including agricultural, airplane, baseball diamond, and so on; each category has 100 images of 256 × 256 pixels with a spatial resolution of 0.3 m/pixel. We randomly partition these images in a 6:2:2 ratio into training, validation, and test sets. We use bicubic interpolation to generate 64 × 64 LR images for × 4 SR and 128 × 128 LR images for × 2 SR, and augment the training data with horizontal and random flip strategies. We optimized our network using Adam with a batch size of 2. The initial learning rate was set to 1e-4 for training at the × 4 scale and 2e-4 for the × 2 scale, halving the learning rate at 125 000 and 180 000 iterations in both cases. We trained our network for 220 000 iterations, using the Charbonnier loss as the optimization objective.
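The training recipe above can be summarized in the following sketch; the model, the data loader, and the tensor shapes are placeholders, and only the optimizer, schedule, and Charbonnier loss reflect the settings just described.

```python
import itertools
import torch
import torch.nn as nn

def charbonnier_loss(sr, hr, eps=1e-3):
    # Charbonnier loss: a smooth, robust L1 variant used as the optimization objective
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()

model = nn.Conv2d(3, 3, 3, padding=1)           # placeholder standing in for CFFormer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[125_000, 180_000], gamma=0.5)   # halve the lr at these iterations

def toy_batches():                               # stand-in for the UC Merced x4 LR/HR loader
    while True:                                  # real HR patches are 256x256; toy shapes here
        yield torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)

total_iters = 220_000                            # as in the paper; reduce for a quick smoke test
for lr_img, hr_img in itertools.islice(toy_batches(), total_iters):
    loss = charbonnier_loss(model(lr_img), hr_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```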
2) CLRS Dataset and Implementation Details
In order to further confirm the validity of our method in the case of richer categories, we continue our analysis using the CLRS dataset. CLRS contains a total of 15 000 images in 25 categories, each of size 256 × 256 with a spatial resolution ranging from 0.26 to 8.85 m/pixel; we adopt the same partitioning strategy as for UC Merced. However, CLRS has higher spatial resolution and more categories than UC Merced, and higher spatial resolution means more detailed information, which tests whether our method can successfully exploit this richer detail. As with UC Merced, we use bicubic interpolation to generate images at the × 4 and × 2 scales; the difference is that, owing to the higher spatial resolution and richer information in CLRS, we train for 300 000 iterations with the initial learning rate set to 2e-4, halving the learning rate at 125 000, 200 000, 250 000, and 280 000 iterations.
3) RSSCN7 Dataset and Implementation Details
RSSCN7 contains 2800 images in seven different categories, which poses a significant challenge because the captured scenes vary with season, weather conditions, and viewing angle. We first cropped the images to 256 × 256 pixels and generated the corresponding 64 × 64 LR images, which were then randomly split in the same 6:2:2 ratio for training, validation, and testing. We again optimized our network using Adam with a batch size of 2, halving the learning rate at 30 000 and 50 000 iterations, for a total of 100 000 iterations. The learning rate was set to 1e-4 and the Charbonnier loss was used as the optimization objective.
We set the window size of our model to 12 and mlp
During the training process, the variation curves of L
B. Evaluation Indicators
We evaluated our model comprehensively using five indicators: PSNR, SSIM, SAM, UQI, and SCC; all results were computed on the Y channel.
1) Peak Signal-to-Noise Ratio and Structural Similarity Index
PSNR and SSIM are the most commonly used metrics for measuring image reconstruction quality. PSNR evaluates image quality by comparing the pixel-wise differences between the original and processed images and is expressed in decibels (dB); the higher the value, the lower the distortion and the better the image quality. SSIM measures the similarity between two images in terms of luminance, contrast, and structure. Unlike PSNR, SSIM accounts for the characteristics of the human visual system and pays more attention to the structural information of the image content. PSNR and SSIM are calculated as follows:
\begin{align*}
\text{PSNR}(x, y) &= 10 \log _{10} \left(\frac{1}{\text{MSE}(x, y)}\right)\tag{8}\\
\text{SSIM}(x, y) &= \frac{(2 \mu _{x} \mu _{y} + c_{1})(2 \sigma _{xy} + c_{2})}{(\mu _{x}^{2} + \mu _{y}^{2} + c_{1})(\sigma _{x}^{2} + \sigma _{y}^{2} + c_{2})} \tag{9}
\end{align*}
2) Spectral Angle Mapper (SAM)
The SAM metric measures the angular similarity between two pixels irrespective of their absolute intensities; the lower the value, the smaller the angular difference between the spectral vectors of the two pixels. It is widely used as an effective measure of spectral similarity between pixels. The formula for SAM is given below
\begin{equation*}
\text{SAM}(a, b) = \cos ^{-1} \left(\frac{\sum _{i=1}^{n} a_{i} b_{i}}{\sqrt{\sum _{i=1}^{n} a_{i}^{2}} \sqrt{\sum _{i=1}^{n} b_{i}^{2}}} \right). \tag{10}
\end{equation*}
3) Universal Quality Index (UQI)
UQI is used to assess the similarity of two images. It takes into account the mean, variance, and covariance between them and ranges from −1 to 1, with values closer to 1 indicating higher similarity.
4) Spatial Correlation Coefficient (SCC)
SCC is a measure of spatial correlation between two images to assess their spatial structural similarity. A high spatial correlation coefficient indicates a high degree of similarity in spatial distribution and structural features between the images. The formula for SCC is given below
\begin{equation*}
\text{SCC}(X, Y) = \frac{\sum _{i=1}^{N} (X_{i} - \overline{X})(Y_{i} - \overline{Y})}{\sqrt{\sum _{i=1}^{N} (X_{i} - \overline{X})^{2}} \sqrt{\sum _{i=1}^{N} (Y_{i} - \overline{Y})^{2}}} \tag{11}
\end{equation*}
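For reference, straightforward implementations of (8), (10), and (11) are given below, assuming images scaled to [0, 1]; SSIM is omitted here since it is typically taken from an existing library.

```python
import torch

def psnr(x, y):
    # eq. (8): maximum pixel value assumed to be 1
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(1.0 / mse)

def sam(a, b, eps=1e-8):
    # eq. (10): spectral angle between two pixel vectors, in radians
    cos = (a * b).sum() / (a.norm() * b.norm() + eps)
    return torch.acos(cos.clamp(-1.0, 1.0))

def scc(x, y):
    # eq. (11): spatial correlation coefficient between two images
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (torch.sqrt((xc ** 2).sum()) * torch.sqrt((yc ** 2).sum()))
```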
C. Comparison With State-of-the-Art Methods
In this section, we conduct comprehensive performance comparison experiments by evaluating our proposed model against several state-of-the-art models. Specifically, we compare it with VDSR [31], EDSR [34], RDN [45], RCAN [8], SAN [35], SwinIR [9], HAT [40], CFAT [46], HSPAN [47], HAUNet_RSSR [48], TSFNet_RSSR [49], and SwinFIR [41]. The comparative results for the UC Merced datasets at 4 × and 2 × magnifications are presented in Tables I and II, respectively. In addition, results for the CLRS datasets at 4 × and 2 × magnifications are detailed in Table III, and the outcomes for the RSSCN7 dataset at 4 × magnification are shown in Table IV. Utilizing multiple metrics for a comprehensive evaluation, our model achieves optimal values across all comparisons.
In particular, we note that our method improves PSNR by about 0.18 dB on CLRS × 2 compared with the recent SwinFIR, which suggests that our method can extract richer texture features from images with high spatial resolution and abundant embedded information.
In addition, although the number of parameters in our method is not the smallest, it reaches a relatively balanced level compared with other methods. We also compared the PSNR and SSIM values of different classes on the UC Merced × 4 dataset, as shown in Table V, where classes 1–21 correspond to the 21 classes in the UC Merced dataset (e.g., buildings, overpass), each containing 100 images. In the vast majority of categories, our method achieves the best results, with particularly prominent gains in the first category, the agricultural class.
To illustrate the advantages of our method in image processing more intuitively, we provide qualitative analyses of results from UC Merced × 4 and CLRS × 4 datasets, shown in Figs. 8 and 10, respectively. For the image “agricultural90,” methods such as RDN, RCAN, and SAN fail to achieve satisfactory SR, resulting in significant loss of image details. In addition, SwinIR, HAT, and SwinFIR methods exhibit noticeable distortions in processing field paths. In contrast, our CFFormer method successfully achieves desired SR with clear image details and natural textures.
Visualization results of different SR methods on some examples of UC Merced × 4, HR represents high-resolution, best results in bold.
Visual results on agricultural90 when only CA, FB, CFB, GAB, or the jump-joint is removed, respectively.
Visualization results of different SR methods on some examples of CLRS × 4, HR represents high-resolution, best results in bold.
In the case of “Parking_228” from CLRS × 4, VDSR fails to reconstruct the image correctly, while CNN-based methods like RDN, EDSR, and SAN produce serious blurring and artifacts. SwinIR and HAT also struggle to recover clear parking lines and vehicles. Only our CFFormer demonstrates the most favorable result in achieving accurate SR.
D. Ablation Study
1) Perform Ablation Experiments on Each Module
To verify the effectiveness of CA, FB in CFB, and the addition of connections to RSTB to form RFSTB and GAB, we conducted ablation studies on the UC Merced dataset at a scale of × 4, as shown in Table VI. It should be noted that the “ × ” in RFSTB indicates retaining the original RSTB state, meaning that no fusion processing is performed between (S)STLs, rather than removing them entirely.
We can observe that when all modules are present, the SR performance reaches its peak; when fusion between the (S)STLs is not performed or the GAB is not added, performance decreases. Even without GAB and the RFSTB fusion, performance improves whether only CA is added, only FB is added, or both are combined to form the CFB, which is sufficient to demonstrate the effectiveness of the proposed modules. These comparative results fully demonstrate the independent and complementary roles of CFB, GAB, and RFSTB in enhancing model performance.
To further demonstrate the effectiveness of the modules we proposed, we conducted a visual effect analysis taking “agricultural90” as an example, as shown in Fig. 8. It can be observed that when CA is removed, the resulting image exhibits more artifacts compared to the original, specifically manifested by deeper shadows between each terrace than in the original image. When FB and CFB are removed, the terraces undergo varying degrees of deformation, which is because FB effectively processes image information in the frequency domain, aiding in capturing fine details and reducing noise amplification; removing it would be detrimental to better restoration of image edges and textural details. After removing GAB, we found that although the terraces did not undergo severe deformation (bending), a significant amount of blurriness was introduced. When the jump-joint mechanism is removed, it can be seen that although there are no severe deformations or artifacts overall compared to the complete CFFormer result and HR, the white area in the lower right corner did not clearly recover the edges.
2) Ablation Experiments on “Maximize-Expand-Subtract” in GAB
To further demonstrate the effectiveness of the proposed GAB, as well as the "maximize-expand-subtract" and fusion strategy within it, we conducted a visual analysis taking "buildings69" and "overpass64" as examples, comparing the feature maps obtained by removing the GAB, removing only the expansion part of the GAB, and using the complete GAB, as shown in Fig. 11.
Completeness of the Global Attention Block (GAB) was investigated on both buildings64 and overpass69. (a) shows the feature map without the first GAB, (b) shows the feature map after the first GAB with the “maximize, expand, subtract” and fusion processes removed, (c) shows the feature map with the complete first GAB, (d) shows the feature map without the second GAB, (e) shows the feature map after the second GAB with the “maximize, expand, subtract” and fusion processes removed, (f) shows the feature map with the complete second GAB.
It can be observed that Fig. 11(a) and 11(d) display relatively raw feature maps with distinct local details, but overall they lack a certain structure and integration of global information, making the feature maps appear scattered and lacking directionality. In Fig. 11(b) and 11(e), after removing some global processing steps, the feature maps show a different balance between local and global aspects. The features in the figures are more concentrated, showing the preliminary impact of global attention, but due to the absence of maximize, expand, subtract and fusion operations, this impact is not as significant as that of the complete GAB. Finally, in Fig. 11(c) and 11(f), it can be seen that the feature maps are more focused and structured, with global information well integrated and highlighted, which helps the model better understand image content and context in subsequent steps.
E. Discussion
In order to further validate the feature fusion designed in the RFSTB and the effect of different numbers of STLs and SSTLs on the CFFormer, we explored how the number of (S)STLs affects model performance and parameter count, as shown in Table VII. We found that the performance of the CFFormer continues to improve as the number of layers increases; however, this comes at the cost of more parameters, and once the number of layers reaches 16, further increases yield only very limited performance gains.
In addition, we also explored the impact of window size on model performance. The window size refers to the dimensions of the nonoverlapping patches or windows into which the input is divided during self-attention calculations. Selecting the window size involves a tradeoff between computational efficiency and receptive field size: the larger the window, the larger the receptive field of each attention head and the more extensive the information it can capture, but the higher the computational cost, as shown in Table VIII. We found that a window size of 12 offered the best balance, so we adopted this medium window size in our method.
We also investigated the impact of different mlp
In addition to the effectiveness of the proposed modules, we also assessed whether our CFFormer results in significant memory consumption and increased execution time. We compared our model with the current state-of-the-art Transformer-based SR models SwinIR, HAT, and SwinFIR in terms of FPS, parameter count, Memory Allocated (MA), Max Memory Allocated (MMA), and execution time. FPS refers to the frame rate derived from the execution time, indicating the number of input samples the model can process per second. MA refers to the total amount of GPU memory allocated by PyTorch at a specific point in time, including the GPU memory currently occupied by all tensors and the caches allocated by PyTorch. MMA refers to the highest GPU memory usage among all operations completed before the program reaches the current position, covering not only the currently active memory but also the peak memory usage of previous operations. Execution time refers to the time required for a single forward pass of the model. The comparison results are shown in Table X. Owing to its more complex structure, our CFFormer has a longer execution time and lower FPS than SwinIR and SwinFIR, but these differences are limited and remain within a tolerable range. Moreover, compared to HAT, our method has lower MA and higher performance.
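The memory and timing figures can be collected with PyTorch's built-in counters, roughly as sketched below; the model and input size are placeholders, and a CUDA device is assumed.

```python
import time
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()     # placeholder for the SR model
x = torch.rand(1, 3, 64, 64, device="cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    model(x)                                            # single forward pass
torch.cuda.synchronize()

exec_time = time.time() - start                         # execution time of one forward pass
fps = 1.0 / exec_time                                   # samples processed per second
ma = torch.cuda.memory_allocated() / 2 ** 20            # Memory Allocated (MiB)
mma = torch.cuda.max_memory_allocated() / 2 ** 20       # Max Memory Allocated (MiB)
print(f"time {exec_time * 1e3:.1f} ms, FPS {fps:.1f}, MA {ma:.1f} MiB, MMA {mma:.1f} MiB")
```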
Conclusion
In this article, we introduce CFFormer, a novel method built on the Swin Transformer architecture that substantially enhances SR performance for remote sensing images. This improvement is achieved through the design of RFSTBs and the implementation of innovative components such as CFBs and GABs. The architecture comprises multiple RFSTBs, each containing several (S)STLs and advanced feature fusion. CFBs are placed at the end of each RFSTB, where they transform the feature maps from the spatial to the frequency domain, effectively addressing challenges such as shape distortion and blurring. Compared to existing methods, CFFormer excels in producing more realistic and detailed textures, achieving state-of-the-art (SOTA) results on metrics including PSNR, SSIM, SAM, UQI, and SCC. This demonstrates its superior capability in enhancing the quality and clarity of high-resolution images reconstructed from remote sensing data.
However, our research has primarily focused on LR images generated by bicubic interpolation. While yielding relatively good results, our method encounters challenges with LR images under different degradation models. Therefore, our future work aims to explore the applicability of our method to more complex real-world degradation scenarios. In addition, we plan to investigate the effectiveness of our approach in other domains such as rain removal and denoising, aiming to enhance the model's generalization capabilities.