Remote Sensing Image Super-Resolution With Residual Split Attention Mechanism

Recently, deep-learning-based methods have become the mainstream of remote sensing image super-resolution (SR) due to their powerful fitting ability. However, they remain unsatisfactory in large-scale factor SR scenarios, and the more complicated information distribution of such images further increases the difficulty of reconstruction. In this article, we propose a novel residual split attention group (RSAG) to maintain the overall structure and the local details simultaneously. Specifically, the RSAG comprises an upscale module that makes the network jointly consider hierarchical priors, which assists in the prediction of high-frequency information, and a residual split attention module that adaptively explores and exploits the global structure information in the low-level feature space. In addition, an artifact removal strategy is proposed to reduce excessive artifacts and further boost performance. By progressively connecting the above modules and incrementally fusing the multilevel intermediate feature maps, the fidelity of high-frequency detail information is improved. Finally, we propose a residual split attention network that stacks several RSAGs to reconstruct high-resolution remote sensing images. Extensive experimental results demonstrate that the proposed approach achieves better quantitative metrics and visual quality than state-of-the-art approaches.

of the acquired images. Therefore, the captured satellite images may not be accurate enough for advanced remote sensing applications such as object detection [1], image segmentation [2], etc. Image super-resolution (SR) is an algorithmic technique for producing a potentially high-resolution (HR) image from a given low-resolution (LR) image.
With the rapid development of satellite photogrammetry, there is an urgent need for efficient and high-precision satellite image SR methods. Tsai [3] pioneered the fusion of complementary information for satellite image SR tasks and utilized the complementary information between different frameworks to reconstruct HR remote sensing images. Recently, SR algorithms have been effectively utilized to improve image resolution and quality, and they are widely used as a preprocessing technique for remote sensing image analysis [4]. The current SR methods can be categorized as interpolation-, reconstruction-, and learning-based methods. The interpolation-based methods [5], [6] are noniterative frameworks whose core idea is to align the LR image with the HR remote sensing image and apply nonuniform interpolation to obtain the value of each pixel on the HR remote sensing image grid. Reconstruction-based methods [7], [8] typically convert HR images to LR images by downsampling, establish a correspondence by studying how HR detail information behaves under LR conditions, and ultimately express this relationship through modeling. One of the classical remote sensing SR reconstruction algorithms is the hidden Markov chain model proposed by Li et al. [9]. Because this model relies on accurate subpixel estimates, the reconstructed remote sensing images may severely lack high-frequency detail information, and the model can only handle small magnification factors [10].
Influenced by the rapid development of machine learning, deep-learning-based satellite image SR approaches have gradually become a mainstream research direction. Deep-learning-based algorithms show strong feature representation ability: convolutional neural networks (CNNs) can be used to learn the nonlinear mapping function and achieve satisfactory results. As a result, more and more CNN-based remote sensing SR methods are being proposed. However, most studies [11], [12], [13], [14], [15] on the SR of remote sensing images have focused on small magnification factors and increased the resolution by adding an upsampling layer. Few studies have attempted to solve the reconstruction problem with a large magnification factor. For example, Pan et al. [16] utilized the backprojection strategy to handle the dependency between LR and HR images more completely. Dong et al. [17] proposed a dense-sampling framework that reused an upscaler to upsample low-dimensional features via a dense-sampling mechanism for investigating large-scale factor SR reconstruction. Although the existing methods RDBPN [16] and DSSR [17] have attempted to densely connect features through upsampling modules to address the reconstruction challenge at high magnification levels (e.g., ×8), they use the same weights to learn remote sensing feature maps of different regions and lack discriminative ability across feature channels. Moreover, the connection between the hierarchical features is lost during processing, resulting in the loss of intermediate information.
In addition, because the prior information available in LR space is limited under severe upscaling, reliable prediction of local features becomes extremely challenging. Introducing additional prior information, however, means more computational overhead and increases the training difficulty, thus negatively affecting the subsequent reconstruction process.
To meet the aforementioned challenges, we propose an efficient residual split attention network (RSAN). First, we propose a multipath residual split attention (RSA) mechanism to promote the internal correlation of features by splitting and fusing different channel dimensions, which ensures that the method pays more attention to detail-rich regions and less to less informative parts. Then, we design an upscale module to learn the hierarchical prior information in an HR potential subspace, which helps the prediction of high-frequency information, and a residual split attention module (RSAM), built from the proposed RSA and a downscale operation, to explore and exploit global information in the LR potential subspace. In addition, an artifact removal strategy is proposed in the upscale module to reduce excessive artifacts for better large-scale factor SR. The upscale module and RSAM are combined to form the residual split attention group (RSAG), which simultaneously enhances the consistency of the global structure and the fidelity of local detail restoration by fusing multilevel and multipath features. In each RSAG module, we adopt a dense connection to conduct residual feature fusion (RFF) across different levels, reducing feature redundancy and promoting information exchange within and between modules. Finally, we propose an RSAN that cascades several RSAGs to reconstruct HR remote sensing images. The main contributions are as follows.
1) We propose a residual split attention network for the single remote sensing image SR, which can better balance the model size and achieve superior results on two publicly available datasets even under the large-scale factor.
2) We present the RSAM to assist the network in focusing training on detail-rich regions while paying less attention to parts that are not well-informed by splitting and fusing the intermediate residual feature maps from different channel dimensions and strengthening the representation capacity of the network.
3) To ensure both the global structural consistency and the local detail restoration fidelity are fully maintained, the RSAG is proposed to use an upscale module to jointly consider the hierarchical prior information and connect with an RSAM for adaptive weighted fusion of multipath information, which enables the network to be more accurate for reconstruction by exploiting different dimensional information.
The rest of this article is organized as follows. Section II describes the related work. Section III presents a detailed description of the proposed method. Section IV provides experimental results. Finally, Section V concludes this article.

II. RELATED WORK

A. CNN-Based Image Super-Resolution
Recently, CNN-based methods have demonstrated excellent performance in various computer vision tasks [18], [19], [20] because of their robust feature representation capabilities. In 2015, Dong et al. [21] first proposed a three-layer CNN framework (SRCNN). Kim et al. [22] introduced the concept of residual networks [23], which makes it possible to build deeper networks that converge faster. Lim et al. [24] proposed an enhanced residual SR network (EDSR) based on ResNet [25] blocks, which saved space by eliminating unnecessary modules from the traditional residual network and further expanded the size of the model to enhance the network's expressive ability. Lai et al. [26] introduced the Laplacian pyramid framework, which predicts residuals from coarse to fine. Haris et al. [27] introduced the deep back-projection network (DBPN), which fully exploits the interdependence of HR and LR pairs by cascading several up- and down-sampling blocks to better learn high-resolution features and achieves good performance, particularly at large scale factors. Considering the correlation between channels, Zhang et al. [28] presented a very deep residual channel attention network (RCAN), which extracts the high-frequency components in a targeted manner by rescaling the channelwise features to focus on image edges and textures. To improve feature expression ability, Dai et al. [29] proposed a second-order attention network (SAN) to generate discriminative features and information. Lu et al. [30] designed a multiscale information polymerization network, which addressed the limited representation ability of reconstruction networks caused by existing CNN-based single-image SR (SISR) methods neglecting the potential relationship between multiscale features. To accelerate network inference by exploiting image sparsity, Wang et al. [31] designed a sparse mask framework that uses spatial and channel mask learning to identify and mark unimportant regions, reducing redundant computation while maintaining good performance.
The SR methods described above are aimed at general images. Because satellite images cover a wide area, the spatial distribution of remote sensing images is complicated. Thus, the targets to be recovered often cover only a few pixels in the image, and the pixel differences between different types of targets are small. Therefore, deep-learning-based methods designed for general images cannot effectively process satellite images, because they are unable to retrieve the potential high-frequency information contained in satellite images, especially at large scale factors.

B. Satellite Images Super-Resolution
In remote sensing image applications, recovering HR images with clear texture details is indispensable for many tasks, because satisfactory application results cannot be obtained with only the small amount of feature information provided by LR images. Liebel and Körner [11] were the first to apply SRCNN [21] to satellite image SR. Because a satellite image SR method cannot be trained directly on natural images, the authors produced a remote sensing dataset from SENTINEL-2 images to relearn the mapping relationship. Lei et al. [32] designed a multifork structured framework to learn multiscale representations, which combined shallow and deep feature mappings so that information interacts across the network and better guides the reconstruction. Qin et al. [33] introduced a multiscale network based on GoogLeNet [34] that extracted image features with multiscale kernels and obtained more comprehensive deep features after concatenating each channel feature to improve the SR effect.
Inspired by the successful application of knowledge distillation [35], [36], [37] in computer vision tasks, Jiang et al. [38] constructed a distillation framework to distill and compensate feature maps at various stages for high-frequency information enhancement. Ma et al. [39] devised an approach that simplifies the training stage by using the wavelet transform and combines global with local residual learning to alleviate the vanishing gradient problem. Gu et al. [40] developed a deep residual attention strategy, which used a residual attention block to adjust the weight of feature maps and improve the representation ability. Huan et al. [41] proposed a pyramidal multiscale residual framework to enhance the ability to detect contextual information. Lu et al. [42] proposed a novel structure-texture parallel embedding (SPE) method, which utilized both global structural information and local texture information in the upscaling process to guide the reconstruction results. Wang et al. [43] designed a novel satellite SR framework to transform HR images into LR, artifact, and high-frequency information and introduced a self-adaption difference convolution module to better recover remote sensing images.

C. Neural Attention Mechanism
The neural attention mechanism, which can focus on important regions with limited resources, has become a popular research topic. It originated from the exploration of the human visual mechanism. Human vision tends to focus on salient areas while ignoring information-poor parts; likewise, a neural attention mechanism can help neural networks focus on important feature information while suppressing useless feature representations, thereby improving information processing efficiency. Haut et al. [44] introduced the attention mechanism into SR tasks to learn the mapping function between texture components, enhance the high-frequency information of the image, and suppress the low-frequency information. Dong et al. [45] designed a multiperception learning framework to perform adaptive weighted fusion of multilevel information for reconstruction. Further, Zhang et al. [46] proposed the mixed high-order attention mechanism (MHAN), which applied weights to different levels of convolution in the feature extraction stage to retain more important information and added frequency-aware connections in the feature refinement stage to fuse and refine the features of different depths through the high-order attention module. Li et al. [47] introduced an adaptive weighted attention network that integrates an adaptive weighted channel attention module and a patch-level second-order nonlocal module to capture interdependencies among intermediate features and enhance feature representations. To address the challenge of satellite images with large differences in scene and image size, Zhang et al. [48] proposed a multiscale attention network for feature extraction that used the channel attention mechanism to fuse multiscale features and assigned models for satellite image reconstruction. This method obtains good results, but the number of models and parameters increases significantly.
Although the above attention mechanisms can enhance the network's learning of important features, they lack the ability to discriminatively learn different spatial regions of the same feature. Lei and Liu [49] utilized the inception module [34] to extract scale-invariant features and combined the channel and spatial attention mechanisms to distinguish important features, which allocated attention to different regions of each feature map and made the network perform more comprehensive discriminative learning of remote sensing features. To overcome the bottleneck of low accuracy in the existing unsupervised SR methods, Li et al. [50] proposed an unsupervised super-resolution architecture that included the masked transformer to extract latent hyperspectral characteristics for realistic restoration of hyperspectral images, with strong constraints incorporated into the framework. They also introduced a dual spectralwise multihead self-attention mechanism to address the limitations of traditional CNN-based models and enhance the robustness of the model.

III. PROPOSED METHOD
The structure of RSAN is described first in this section. Then, we elaborate on the proposed residual split attention group, which is composed of the upscale module and the residual split attention module. Finally, we introduce the loss function.

A. Network Architecture
In RSAN, we let $I_{LR} \in \mathbb{R}^{h \times w \times c}$ and $I_{HR} \in \mathbb{R}^{rh \times rw \times c}$ be the LR and ground-truth images, respectively, where $h$, $w$, and $c$ denote the height, the width, and the channel number of the LR image, and $r$ represents the scale factor. In addition, let Conv($n_k$, $n_f$, $n_c$), PwConv($n_k$, $n_f$, $n_c$), DwConv($n_k$, $n_f$, $n_c$), and DeConv($n_k$, $n_f$, $n_c$) indicate the standard convolutional, pointwise convolutional, depthwise convolutional, and deconvolutional layers, where $n_k$, $n_f$, and $n_c$ denote the filter size, the number of input channels, and the number of output channels, respectively. Fig. 1 shows the structure of RSAN. The proposed method consists of four components: the coarse feature extraction part, the residual split attention groups, the multilevel feature fusion module, and the reconstruction module. In the initial part of RSAN, the coarse feature $F_C$ is extracted from the input LR remote sensing image $I_{LR}$ by the coarse feature extraction operation $H_C(\cdot)$, which consists of one inverted residual block and one Conv(3, 64, 64) layer. In the middle part, $F_C$ is used for deep feature extraction by the residual split attention groups: the $m$th group, RSAG-$m$, extracts the deep feature map $F_m^{+}$ through the operation $H_{RSAG,m}(\cdot)$, whose specific structure is detailed later; the number of groups is discussed in the ablation study section. Then, global residual learning is applied to the last RSAG output through the elementwise sum operation $H_{Sum}$, which yields the input $\tilde{F}_M^{+}$ of the last deconvolutional layer. After the last RSAG, we feed $\tilde{F}_M^{+}$ into the deconvolutional [51] layer to obtain the high-resolution feature map $\tilde{U}_M^{+}$ and aggregate it with the previous multilevel HR feature maps $(U_1, U_2^{+}, \ldots, U_M^{+})$ in the multilevel feature fusion module to estimate the reconstructed HR image. The SR image is produced by the reconstruction operation $H_{Rec}(\cdot)$, which uses Conv(3, 64 * (m + 1), 3) and is applied to the concatenation, concat($\cdot$), of these HR feature maps.
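Read from the definitions above, the data flow of RSAN can plausibly be summarized as follows. This is a sketch under assumptions rather than the exact published formulation: the simple chaining of groups with $F_0^{+} = F_C$ (which ignores the dense connections detailed in Section III-B), the pairing of $F_C$ with $F_M^{+}$ in the global residual, and the symbol $H_{De}$ for the last deconvolutional layer are not stated explicitly in the text.

```latex
\[
\begin{aligned}
F_C &= H_C(I_{LR}), \\
F_m^{+} &= H_{RSAG,m}\!\left(F_{m-1}^{+}\right), \qquad F_0^{+} = F_C,\quad m = 1, \ldots, M, \\
\tilde{F}_M^{+} &= H_{Sum}\!\left(F_M^{+}, F_C\right), \\
\tilde{U}_M^{+} &= H_{De}\!\left(\tilde{F}_M^{+}\right), \\
I_{SR} &= H_{Rec}\!\left(\mathrm{concat}\!\left(U_1, U_2^{+}, \ldots, U_M^{+}, \tilde{U}_M^{+}\right)\right).
\end{aligned}
\]
```

Under this reading, the concatenation holds $M + 1$ HR feature maps; if each carries 64 channels, the input width of the reconstruction layer is $64(M + 1)$, which agrees with the Conv(3, 64 * (m + 1), 3) setting quoted in the text.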

B. Residual Split Attention Groups
Previous attention-based methods [45], [46], [48] only used multilevel residual blocks for refinement to generate richer hierarchical features. However, residual blocks with a single-sized receptive field lack information interaction between different spatial locations and channels. Inspired by ResNeSt [52], we propose the RSAG to excavate the internal relevance of features by multichannel splitting and classification of feature channel dimensions, so that the network focuses on detail-rich regions and pays less attention to less informative parts. Compared to ResNeSt, our proposed RSAG first learns the hierarchical features in a high-resolution potential subspace to improve the network's prediction of high-frequency information. This procedure has the additional advantage of an artifact removal operation that effectively reduces excessive artifacts, thereby achieving better large-scale factor SR. Then, we use a depthwise convolutional operation along with the residual split attention mechanism to explore and exploit global information in the LR potential subspace. Differing from ResNeSt, RSAN continuously projects the feature space across different dimensions to simulate the degradation level of remote sensing images at different stages and better learn high-resolution components. More importantly, RSAN is specifically designed to enhance the representation of hierarchical features and is an efficient remote sensing image SR network that progressively restores details from coarse to fine.
As shown in Fig. 2, the RSAG is made up of the upscale module and the residual split attention module. The upscale module, which consists of DeConv($n_k$, $n_f$, $n_c$) layers and feature extraction processes, upsamples the coarse feature $F_C$. First, the upscale module maps the coarse feature $F_C$ to an intermediate HR map $U_{1,0}$ via one deconvolutional layer DeConv(8, 64, 64) with an upsampling factor of $r = 4$; when $r = 8$, the filter size $n_k$ is set to 12. Then, $U_{1,0}$ is mapped back to obtain the LR feature map $L_1$ through one pointwise convolutional layer PwConv(1, 64, 64), one depthwise convolutional layer DwConv(3, 64, 64), and one pointwise convolutional layer PwConv(1, 64, 64). Subsequently, we introduce an artifact removal operation to utilize the structure prior in the LR potential subspace and estimate the artifact residual feature map $a$ between the input LR feature $F_C$ and the learned $L_1$. The map $a$ is passed through a deconvolutional layer DeConv(8, 64, 64) to get the HR map $U_{1,1}$. The upscale module output $U_1$ is computed by summing the intermediate HR maps $U_{1,0}$ and $U_{1,1}$.

Then, $U_1$ is fed into the proposed RSAM. The local residual learning feature $U_1$ is passed through a depthwise convolutional layer DwConv(3, 64 * m, 64) to obtain Split. Subsequently, Split is divided along the channel dimension into $n$ separate splits; with $c$ denoting the number of channels of Split, each split has $c/n$ channels. Then, an elementwise sum operation is performed over the splits to obtain $F_{Splits}$, where $\mathrm{Split}_n$ represents the $n$th split produced by this division. $F_{Splits}$ is passed through an adaptive average pooling layer and two pointwise convolutional layers and is then input to an $n$-way softmax function, which weights each previous split according to its feature detail richness and yields the latest split tensor $\widehat{\mathrm{Split}}$. Therefore, $\widehat{\mathrm{Split}}$ can be formulated as follows:

$\widehat{\mathrm{Split}} = H_S(\mathrm{PwConv}(\mathrm{PwConv}(H_{Avg}(F_{Splits}))))$ (7)

where $H_{Avg}$ denotes the adaptive average pooling operation and $H_S$ represents the softmax function. We split $\widehat{\mathrm{Split}}$ into $n$ splits $(\widehat{\mathrm{Split}}_1, \widehat{\mathrm{Split}}_2, \ldots, \widehat{\mathrm{Split}}_n)$ again. Then, each latest split $\widehat{\mathrm{Split}}_i$ is multiplied by the corresponding previous split $\mathrm{Split}_i$ using the elementwise product. Thus, the internal correlation of features is improved by using multipath channel information, and the network can focus on the restoration of global structure with RSA. After the second feature split, we use the elementwise sum operation to merge the path splits as the output of the RSAM, which is denoted as $F_{RSA}$. The output of the first RSAM, $F_{RSA,1}$, is produced by the operation $H_{RSA,1}$ of the first RSAM, in which $H_{Ep}$ represents the elementwise product operation. The upscale module and RSAM are connected with each other to constitute the RSAG, and Fig. 1 depicts the entire RSAG structure. In the lower right part of Fig. 1, purple and green cubes represent the upscale module and RSAM, respectively.
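The split, pool, softmax, and re-weight steps of the RSA described above can be illustrated with a minimal sketch. Plain Python lists stand in for feature tensors; the hidden width, the ReLU between the two pointwise layers, and the weight matrices `w1` and `w2` are assumptions for illustration and are not specified in the text.

```python
import math


def global_avg_pool(feat):
    """Average each channel map (a list of rows) down to one scalar."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def pointwise(vec, weight):
    """1x1 convolution on a pooled descriptor: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def split_attention(feat, n, w1, w2):
    """Fuse n channel splits of `feat` with softmax attention over the splits.

    feat: list of c channel maps, each a list of rows; c divisible by n.
    w1:   (hidden x c/n) weights of the first pointwise layer (hypothetical).
    w2:   (c x hidden) weights of the second pointwise layer (hypothetical).
    Returns c/n fused channel maps.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    assert c % n == 0, "channel count must be divisible by the number of splits"
    k = c // n
    splits = [feat[i * k:(i + 1) * k] for i in range(n)]

    # F_Splits: elementwise sum of the n splits.
    fused = [[[sum(s[j][y][x] for s in splits) for x in range(w)]
              for y in range(h)] for j in range(k)]

    # Pooled descriptor -> two pointwise layers (ReLU in between is assumed).
    hidden = [max(0.0, v) for v in pointwise(global_avg_pool(fused), w1)]
    logits = pointwise(hidden, w2)  # length c = n * k

    out = []
    for j in range(k):
        # n-way softmax across the splits for channel j.
        z = [logits[i * k + j] for i in range(n)]
        zmax = max(z)
        e = [math.exp(v - zmax) for v in z]
        att = [v / sum(e) for v in e]
        # Weight channel j of every split and sum over the splits.
        out.append([[sum(att[i] * splits[i][j][y][x] for i in range(n))
                     for x in range(w)] for y in range(h)])
    return out
```

Because the attention weights of each channel sum to one across the splits, the output stays on the same scale as the individual splits; with all-zero logits the module reduces to a plain average of the splits.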
The yellow cube indicates the operation of concatenating feature maps along the channel dimension. The overall construction of the $m$th RSAG is described in detail below. The coarse feature $F_C$ is first processed by the first upscale module, whose operation is denoted $H_{Up,1}$. Then, an RSAM and an upscale module generate the initial-level features $F_1$ and $U_2$. The densely connected structure [53] is used to fully utilize the different hierarchical features, in which each upscale module output aggregates feature maps from all previous upscale modules. When the group number $m \geq 2$, a concatenation module is placed after every upscale module and RSAM, so that the inputs to the $m$th RSAM and to the $m$th upscale module are the outputs of the preceding concatenation modules.
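As a quick consistency check on the deconvolution settings quoted for the upscale module (filter size 8 for $r = 4$ and 12 for $r = 8$), the sketch below evaluates the standard transposed-convolution output-size formula. The stride (taken equal to $r$) and the padding (2) are assumptions borrowed from common back-projection configurations; the text states only the filter sizes.

```python
def deconv_output_size(in_size, kernel, stride, padding):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


# Assumed (kernel, stride, padding) per scale factor. Only the kernel sizes
# 8 and 12 come from the text; stride and padding are hypothetical.
ASSUMED_DECONV = {4: (8, 4, 2), 8: (12, 8, 2)}


def upscaled_size(lr_size, scale):
    """Spatial size after one DeConv layer under the assumed settings."""
    kernel, stride, padding = ASSUMED_DECONV[scale]
    return deconv_output_size(lr_size, kernel, stride, padding)
```

With these assumed settings, each deconvolutional layer enlarges the spatial size by exactly the scale factor, so a 50 × 50 LR feature map at $r = 4$ and a 25 × 25 map at $r = 8$ both become 200 × 200.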

C. Loss Function
This section introduces the hybrid loss function (HLF), which includes the pixel loss, the perceptual loss, and the binary crossentropy (BEC) loss. Our network is trained through supervised learning with the goal of minimizing the loss function over the training samples, where $\theta$ denotes the parameters of the network, $N$ represents the number of input samples, and $L_H$ is the loss function of the RSAN. Given that realistic degradation processes are difficult to simulate, the network should be subjected to more effective constraints during the training process. The hybrid loss is a weighted combination of the three losses, where the weight coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ are used to balance them. To calculate the pixelwise difference between the ground-truth and the generated SR image, we use the $L_1$ function. Since the VGG [54] network focuses on deep semantic information, it contributes to the enhancement of the clarity of the network output image, resulting in better visualization. Therefore, in order to fully utilize the feature-level information, we extract features by using a pretrained VGG network, which is used to measure the perceptual loss. In this study, we use the Conv5-4 layer in VGG for extracting features, and $f_{VGG54}$ denotes the VGG feature extraction function. We calculate the BEC loss in the framework of the binary crossentropy loss function, as shown in (19):

$L_{bec} = -I_{HR} \log \hat{H}_{RSAN}(I_{LR})$ (19)

where $L_{bec}$ represents the BEC loss function acting as the discriminator, and $\hat{H}_{RSAN}(I_{LR})$ denotes the model output after the sigmoid activation function, which indicates the probability that the prediction belongs to a ground-truth sample. The network is continuously adjusted during training by iteratively computing the above hybrid loss function.
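The way the three terms combine can be sketched as below. Several points are assumptions: the pairing of $\lambda_1$, $\lambda_2$, $\lambda_3$ with the pixel, perceptual, and BEC terms (in that order), the use of a mean absolute difference for the perceptual distance, and the `feature_fn` argument, which is a hypothetical stand-in for the pretrained VGG Conv5-4 extractor. The BEC term follows (19) as written.

```python
import math


def l1_loss(sr, hr):
    """Mean absolute error between two equally sized flat pixel lists."""
    return sum(abs(s - h) for s, h in zip(sr, hr)) / len(hr)


def bec_loss(prob, hr, eps=1e-12):
    """Mean of -I_HR * log(p), following (19) as written in the text.

    prob: sigmoid-activated model outputs in (0, 1]; hr: targets in [0, 1].
    eps guards against log(0) and is an implementation detail, not from the text.
    """
    return -sum(h * math.log(max(p, eps)) for p, h in zip(prob, hr)) / len(hr)


def hybrid_loss(sr, hr, prob, feature_fn, lambdas=(1.0, 1.0, 1.0)):
    """lambda1 * pixel + lambda2 * perceptual + lambda3 * BEC (assumed pairing).

    feature_fn is a hypothetical stand-in for the pretrained VGG Conv5-4
    feature extractor; it maps a flat pixel list to a flat feature list.
    """
    lam1, lam2, lam3 = lambdas
    pixel = l1_loss(sr, hr)
    perceptual = l1_loss(feature_fn(sr), feature_fn(hr))  # distance choice assumed
    return lam1 * pixel + lam2 * perceptual + lam3 * bec_loss(prob, hr)
```

The default weights of 1.0 are placeholders; the text does not report the values of the weight coefficients used for training.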

A. Experiments Details
We compare the RSAN with eight recent deep-learning-based SR methods: EDSR [24], DBPN [27], the lightweight residual dense network (RDN) [55], RCAN [28], SAN [29], MHAN [46], SPE [42], and the deep unfolding method (LDUM) [56]. We use two publicly available remote sensing datasets for our SR experiments: the remote sensing scene classification dataset (RSSCN7) [57] and the UCAS high-resolution aerial object detection dataset (UCAS-AOD). Four prevalent image quality evaluation metrics (i.e., PSNR, SSIM, VIF [58], and ERGAS [59]) are chosen to objectively assess the performance of the model. Furthermore, we conduct experiments on real-scene remote sensing image SR using the Jilin-1 video satellite dataset and introduce the no-reference image quality evaluation metrics image entropy [60] and average gradient [61] to evaluate the performance of the proposed model.
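For reference, two of the full-reference metrics named above can be sketched in a few lines. The formulas used here are the commonly cited definitions of PSNR and ERGAS and are an assumption about the evaluation protocol; the text does not state the peak value, the color space, or any border handling.

```python
import math


def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat pixel lists."""
    mse = sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(hr)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def ergas(sr_bands, hr_bands, scale):
    """ERGAS for lists of bands (each a flat pixel list); lower is better.

    Common form: 100 / scale * sqrt(mean over bands of RMSE_b^2 / mean_b^2),
    where mean_b is the mean of the reference band b.
    """
    terms = []
    for sr, hr in zip(sr_bands, hr_bands):
        mse = sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(hr)
        mean = sum(hr) / len(hr)
        terms.append(mse / mean ** 2)
    return 100.0 / scale * math.sqrt(sum(terms) / len(terms))
```

Note the opposite directions of the two scores: a better reconstruction gives a higher PSNR but a lower ERGAS.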

B. Experiments on RSSCN7 Dataset
Wuhan University released the RSSCN7 dataset, which contains a total of 2800 remote sensing images from seven typical scene categories, with each category containing 400 images. Each original image in the RSSCN7 dataset has a size of 400 × 400 pixels. To generate LR images, we downsample the original HR images by the scale factor using bicubic interpolation with no blur kernel in the MATLAB environment. For the experiment, we select 2100 and 700 images as the training set and test set of RSAN, respectively.

improved compared to the conventional interpolation method (Bicubic). Although the subjective visual results of EDSR are fine, the edge information of the generated remote sensing images is significantly insufficient. RCAN [28], SAN [29], and SPE [42] introduce attention mechanisms into the network and obtain better restoration results. As the network grows deeper, the number of extracted deep residual features increases, and RDN recovers more texture detail on remote sensing images than other single-image SR methods.
MHAN also incorporates the attention mechanism, mixing high-order attention mechanisms with the ability to fully exploit hierarchical features. LDUM utilizes a combination of LR and high-frequency residual images to model HR images, achieving a balance between computational cost and performance. In contrast, our RSAN obtains clearer and better results in salient regions through a multichannel attention mechanism with multilevel residual feature fusion, and its results are more faithful to the ground truth. Under large scale factors, the RSAN recovers more salient and informative components from LR images and produces more competitive results than other algorithms.

C. Experiments on UCAS-AOD Dataset
The UCAS-AOD dataset is a public satellite image dataset that includes two kinds of targets, automobiles and aircraft, and negative background samples. We randomly select 900 of these images with a resolution of 1280 × 689 as the training set. We randomly select 100 HR images and crop a 200 × 200 pixel portion of each as the test images. To generate LR images, we downsample the original HR images by the scale factor using bicubic interpolation with no blur kernel in the MATLAB environment.

in bold. According to the experimental results, the PSNR of RSAN is 0.09 dB higher than that of the most competitive general SR method, RCAN [28], at a scale factor of ×4. Compared with the latest remote sensing image SR methods MHAN [46], SPE [42], and LDUM [56], RSAN achieves PSNR gains of 0.22/0.28, 0.36/0.25, and 0.34/0.23 dB at the upscaling factors of ×4/×8, respectively.
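The extraction of 200 × 200 test patches described for this dataset can be sketched as follows. The helper names are hypothetical, and uniform sampling of the crop position is an assumption, since the text does not say how the cropped portion is chosen.

```python
import random


def random_crop_box(width, height, crop=200, rng=random):
    """Return (left, top, right, bottom) of a random crop that fits in the image."""
    if crop > width or crop > height:
        raise ValueError("crop size exceeds image size")
    left = rng.randint(0, width - crop)  # randint is inclusive on both ends
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def lr_size(crop, scale):
    """Side length of the LR patch after downsampling by an integer scale."""
    if crop % scale:
        raise ValueError("crop size should be divisible by the scale factor")
    return crop // scale
```

A crop side of 200 is divisible by both scale factors used in the experiments, giving 50 × 50 LR inputs at ×4 and 25 × 25 LR inputs at ×8.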
Figs. 5 and 6 show the subjective results on the UCAS-AOD dataset with scale factors of ×4 and ×8, respectively. The last row of images shows the error maps between the estimated SR images and the ground truth. RCAN [28] and SAN [29] can achieve satisfactory outcomes. Nevertheless, their VIF/ERGAS results are inferior to those of the proposed RSAN, particularly at the scale factor of ×8. Since the UCAS-AOD dataset contains only two simple scene categories, in contrast to the RSSCN7 dataset, the proposed RSAN performs significantly better on the UCAS-AOD dataset, especially at large scale factors, than the remote sensing image SR methods MHAN, SPE, and LDUM. From the magnified details of the reconstructed images, we observe that the RSAN is capable of obtaining visually pleasing results, which is also reflected in the MSE error maps.

D. Experiments on Jilin-1 Video Satellite Dataset
In real-world scenarios, the captured satellite images may not meet the precision requirements of many applications due to limitations caused by the undersampling and imaging blur of imaging sensors. Under such circumstances, it is essential to utilize SR methods to improve the quality of the LR remote sensing images. To demonstrate the robustness of the proposed RSAN in real-world scenarios, we randomly crop seven remote sensing images of different scenes with a size of 256 × 256 from the Jilin-1 satellite video imagery and compare RSAN with other remote sensing SR algorithms through subjective evaluation. As shown in Fig. 7, RSAN has the best reconstruction performance among the compared SR methods. Specifically, RSAN is able to recover building edges more accurately, while MHAN [46] and SPE [42] exhibit distorted image lines in their results. In the local zoom area, the compared methods produce visible ringing artifacts and blurred outlines, while the proposed RSAN generates sharper edges with fewer jagged lines and artifacts. Based on the above observations, the RSAN can produce visually satisfying high-resolution images with sharper edges and clearer boundaries than the other algorithms.
To further evaluate the SR performance of various methods in practical remote sensing applications, we adopt two no-reference image quality assessment metrics, image entropy (IE) [60] and average gradient (AG) [61]. In SR tasks, IE can be used to measure the complexity of information and texture diversity in an image. Generally, a higher image entropy indicates a larger amount of information and richer texture in the image. The average gradient refers to the rate of change in pixel values within an image. Edge and texture details in an image are often accompanied by abrupt changes in pixel values. Therefore, the AG can reflect the level of detail in the edges and textures of the image. The larger the AG and IE, the clearer the image. As shown in Table III, the proposed RSAN achieves the highest scores. In summary, the subjective visual performance and the no-reference image evaluation metrics demonstrate the effectiveness and practicality of the RSAN algorithm for remote sensing image SR.
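The two no-reference metrics can be made concrete with a short sketch. The definitions below (Shannon entropy of the gray-level histogram, and the mean of $\sqrt{(\Delta x^2 + \Delta y^2)/2}$ over forward differences) are the commonly used forms and are an assumption here, since the text describes the metrics only qualitatively.

```python
import math


def image_entropy(img, levels=256):
    """Shannon entropy (bits) of the gray-level histogram of an integer image."""
    counts = [0] * levels
    total = 0
    for row in img:
        for v in row:
            counts[v] += 1
            total += 1
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def average_gradient(img):
    """Mean of sqrt((dx^2 + dy^2) / 2) over horizontal and vertical forward differences."""
    h, w = len(img), len(img[0])
    acc = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            dx = img[y][x + 1] - img[y][x]
            dy = img[y + 1][x] - img[y][x]
            acc += math.sqrt((dx * dx + dy * dy) / 2.0)
    return acc / ((h - 1) * (w - 1))
```

A constant image scores zero on both metrics, while an image with a sharp vertical edge has a nonzero entropy and a large average gradient, matching the interpretation that larger values indicate richer texture and sharper edges.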

E. Ablation Analysis
In this section, we first investigate the impact of different numbers of RSAGs on the overall network performance through a series of comparative experiments on the RSSCN7 dataset with a scale factor of 4, as shown in Fig. 8. Subsequently, we conduct ablation studies to verify the effectiveness of the proposed RSAM, RFF, and HLF on the RSSCN7 dataset, as shown in Table IV. Furthermore, we visualize the feature maps of the RSAM to demonstrate that it helps the network focus on regions with rich details. Finally, we compare our proposed method with ResNeSt [52].
As shown in Fig. 8, the PSNR of RSAN reaches its maximum value when m = 10. The PSNR increases as m increases up to m = 10. As m rises further to 14, the PSNR gradually decreases by 0.021 dB from its peak value, while the total number of network parameters increases sharply. After weighing the trade-off between network parameters and reconstruction performance, we choose RSAN with 10 RSAGs as the final network model. In each ablation experiment, we further verify the effectiveness of RSAM, RFF, and HLF in the final network model. Specifically, the base network is constructed by removing the RFF module from RSAN, replacing RSAM with a depthwise convolutional layer, and using L1 loss.
Validation on RSAM: When we replace the RSAM with conventional convolutional layers, the SR result decreases by about 0.1 dB (see Table IV). The model based on the residual split attention mechanism tends to focus more on detail-rich regions with prominent scene content. In contrast, the conventional convolutional layer treats all feature information in the same way, and direct prediction of high-frequency information tends to lose detail, thus degrading the estimated SR results.
To better assess whether the RSAM helps the network focus on detail-rich regions, we select sample images from two categories, buildings and airplanes, in the UCAS-AOD dataset. For each sample image, we visualize and compare the output feature maps of the fifth RSAG with and without the RSAM. First, we transfer the channel attention feature maps generated by the fifth RSAM to the CPU for visualization. Then, we compute the mean of the channel attention feature maps over the channel dimension to obtain a single-channel map. Finally, we use the matshow function of the matplotlib library to display the resulting map as a heatmap. As shown in Fig. 9, the brighter a region (e.g., building edges and airplane contours) in the visualized feature map, the higher the corresponding value, indicating that the network is more sensitive to this detail information and can better capture the key edge and texture features of the input data. Therefore, the RSAM can accurately focus on regions with rich details, thereby improving the network performance.
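The visualization steps described above can be sketched as follows. This is a minimal illustration, assuming the attention feature maps are already available as a NumPy array of shape (C, H, W); with a PyTorch tensor, one would typically call `.detach().cpu().numpy()` first. The function names, the colormap, and the output path are hypothetical choices, not details taken from the paper.

```python
import numpy as np


def channel_mean_map(features: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature array into a single (H, W) map
    by averaging over the channel dimension."""
    if features.ndim != 3:
        raise ValueError("expected an array of shape (C, H, W)")
    return features.mean(axis=0)


def save_heatmap(attn_map: np.ndarray, path: str = "attention_heatmap.png") -> None:
    """Render a 2-D map with matplotlib's matshow and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")            # headless backend, no display required
    import matplotlib.pyplot as plt

    plt.matshow(attn_map, cmap="viridis")  # colormap is an illustrative choice
    plt.colorbar()
    plt.savefig(path, bbox_inches="tight")
    plt.close()
```

A typical call would be `save_heatmap(channel_mean_map(features))`, where `features` holds the output of the attention module for one image. Brighter cells in the saved heatmap correspond to larger mean activations.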
Validation on RFF: In this ablation study, we keep the RSAM while removing both the global and local residual feature fusion to verify the proposed residual split attention strategy. RSAN performance decreases by more than 0.12 dB when global and local residual feature fusion are removed. The optimization of the residual split attention network is guided by aggregating the multilevel global and local residual feature maps of the satellite images, which makes the reconstructed images more accurate. Therefore, without RFF, the reconstructed image becomes overly smooth because detail information is lacking. This shows that global and local feature fusion can jointly and adaptively learn hierarchical features in an aggregative way. The artifact removal operation can enhance the edges of the reconstruction result.
Validation on HLF: To verify the validity of the hybrid loss function, we remove the HLF and set the loss function to L1 loss. The SSIM of RSAN decreases significantly when the HLF is removed, indicating that the HLF helps to reconstruct the texture and edges of an image in both the pixel and perceptual domains, bringing the estimated SR image closer to the ground-truth image.
Moreover, to compare RSAN with ResNeSt, we replace the RSAGs with ResNeSt modules and perform experiments on the UCAS-AOD dataset with a scale factor of 4. Table V compares the model size and running time of these methods. Compared with ResNeSt, RSAN achieves the best PSNR and SSIM. The difference in parameter count between the RSAN and ResNeSt models is attributed to the deconvolutional layers that RSAN uses to learn hierarchical features in a high-resolution potential subspace, which increase the number of parameters. Overall, while RSAN may not outperform other models in running time and number of parameters, it achieves superior quantitative results.

F. Model Analysis
In real remote sensing image SR applications, especially on embedded or mobile devices with low computing power, model size and operational efficiency are key issues. Therefore, we compare RSAN with other SR networks in terms of testing time at a scale factor of ×4 in Fig. 10.
As shown in Fig. 8, when we set the number of RSAGs to 3 for the simple network, the number of parameters is close to that of the lightweight RDN [55] network, while the performance is better than that of the noncompact DBPN [27] and EDSR [24] networks. The number of parameters of the complex RSAN (with 10 RSAGs) is less than that of RCAN [28] (15.59 M) and MHAN [46] (11.35 M), which are also based on attention mechanisms, and its image quality assessment results are better. Compared with SPE [42] and LDUM [56], although RSAN has no advantage in the number of parameters, it achieves better quantitative results with good inference efficiency, which makes it a suitable network for applications in different scenarios.

G. Performance in Downstream Task
To further validate the effectiveness of the estimated SR images for subsequent image segmentation tasks, we adopt unsupervised spatial-spectral kernels [62] as the satellite image semantic segmentation method, and all SR methods use the same parameter settings for image segmentation on the RSSCN7 dataset.
As shown in Fig. 11, the regions where the proposed RSAN outperforms the other SR methods are highlighted in red and green boxes. In the segmentation results obtained by SAN [29], SPE [42], and our proposed RSAN, the buildings along the riverbank (see red box) are accurately delineated, while the other algorithms show varying degrees of misclassification. For the main road, only the method proposed in this article reconstructs it correctly, which indicates that it outperforms the other compared algorithms, including the other CNN-based methods. In addition, we use the average MSE value over the three channels of the reconstructed RGB image to measure the direct difference between the SR and ground-truth HR images, quantitatively evaluating the segmentation results. RSAN achieves the best quantitative results.
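The channel-averaged MSE mentioned above can be sketched as follows. This is an illustrative reading rather than the authors' exact evaluation code: it assumes both images are H × W × 3 arrays on the same intensity scale, and the function name `mean_channel_mse` is hypothetical.

```python
import numpy as np


def mean_channel_mse(sr: np.ndarray, hr: np.ndarray) -> float:
    """Average of the per-channel MSE between two H x W x 3 images."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    if sr.shape != hr.shape or sr.ndim != 3 or sr.shape[2] != 3:
        raise ValueError("expected two arrays of identical shape (H, W, 3)")
    per_channel = ((sr - hr) ** 2).mean(axis=(0, 1))  # one MSE per R, G, B
    return float(per_channel.mean())
```

A lower value indicates that the compared image is closer to the ground truth.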

V. CONCLUSION
In this article, we propose a novel remote sensing SR method that learns the hierarchical features independently by exploiting multipath channel feature extraction through the fused multilevel residual features. The proposed method includes four components, i.e., a coarse feature extraction part, the residual split attention groups, a multilevel feature fusion module, and a reconstruction module. We employ the residual split attention group to extract very deep abstract features with long and short skip connections. Meanwhile, the upscale module can remove some of the low-frequency information by performing multiple artifact removal operations, allowing the main network to focus on learning texture and edge information. In addition, to improve the reconstruction capability of RSAN, we propose the residual split attention mechanism, which promotes the flow of information in information-rich regions and allows adaptive adjustment of feature weights while maintaining global structural information. Numerous experiments and ablation studies demonstrate the effectiveness of our proposed method, which outperforms state-of-the-art methods.