Gradient Prior Dilated Convolution Network for Remote Sensing Image Super-Resolution

Super-resolution (SR) aims to recover a high-resolution image from a single or multiple low-resolution images, compensating for the limitations of satellite sensor imaging. Deep convolutional neural networks have made great achievement in remote sensing image SR. In this article, we propose a novel gradient prior dilated convolutional network (GPDCN) for remote sensing image SR, which obtains contextual spatial connections and alleviates structural distortions. The GPDCN comprises a multiscale feature extraction network and a feature reconstruction network. The former employs a double-path dilated residual block with dilation convolution to increase a receptive field, a global self-attention module to detect long-range reliance among image patches, and a gradient propagation network to extract high-level gradient information. The latter uses the mixed high-order attention module to reconstruct the feature by collecting the high-order characteristics of multiple frequency bands. Experiments with the Massachusetts_Roads and 3K VEHICLE_SR datasets demonstrate that the GPDCN outperforms recent techniques concerning both quantitative and qualitative measures.


I. INTRODUCTION
I MAGE super-resolution (SR) is a popular topic in remote sensing as it is intended to regenerate high-resolution (HR) images from corresponding low-resolution (LR) equivalents. Obtaining HR remote sensing images has become increasingly meaningful in a variety of real-world applications, including resource management, environmental monitoring [1], construction planning [2], [3], and military investigation [4]. Indeed, HR remote sensing images can contain copious amounts of vital information critical to such applications. However, owing to the limitation of imaging technology, the spatial resolution of a remote sensing image usually fails to meet the accuracy requirements [5], [6]. In addition, a variety of factors can decrease the quality of remote sensing imagery, such as transmission noise and motion blur [7]. To reduce the impact from the imaging process, the most direct method is to equip more precise remote sensors; however, this increases the hardware cost. As a result, there is demand for a practical and effective strategy that tackles the limitations of remote sensors across different practical remote sensing applications.
Unlike multiframe images that achieve a better resolution output by establishing a relationship between a targeted HR image and numerous LR images of the same scene under various circumstances. Single-image super-resolution (SISR) must rely entirely on only one input image without any additional available information; this typically results in an ill-posed issue that contains multiple image reconstruction solutions due to a loss of information. Despite these challenges, numerous SISR approaches have been proposed to date. They can be divided into three broad categories [8]: interpolation, reconstruction, and learning-based algorithms.
Interpolation-based algorithms are the most fundamental approaches to image reconstruction. Classical interpolation approaches, such as nearest-neighbor and bicubic interpolation [9], are widely used today. Nearest-neighbor interpolation chooses the closest pixel value for each location being interpolated, and while the method has a quick execution time, it struggles to provide high-quality outputs. Bicubic interpolation conducts cubic interpolation on two axes, and while it produces smoother results with fewer artifacts than other interpolation-based methods, it is slow. Thus, the interpolation approaches have yielded good results by directly using prior knowledge of natural images, but real-world satellite images with intricate details cause difficulties during the reconstruction process. Simple interpolationbased methods can result in overly smoothed edges when an image size increases.
The vast majority of satellite SR approaches use reconstruction-based methods to rebuild matching HR images by extracting the valuable information in the LR image, and combining some prior knowledge, the reconstruction process is constrained. However, these methods primarily depend on the prior knowledge of the HR images, such as gradient-profile [10], edge [11], and smoothness priors [12]. Therefore, the reconstruction-based methods are usually constrained to hand-crafted features that need manual parameter adjustments. Therefore, it is difficult to use them to handle complicated and changing scenes.
Learning-based algorithms have been proposed as a way to avoid the above problems by establishing end-to-end training between LR and HR image pairs [13], [14], [15]. Deep learning has some remarkable progress in image SR, for which deep convolution neural networks (CNNs) can extract powerful feature representation capabilities. Dong et al. [16] have proposed a deep learning method for directly learning an end-to-end mapping between LR and HR images. Furthermore, Kim et al. [17] have presented SR methods that uses a deeply recursive convolutional network to improve imaging performance without using additional parameters. Nevertheless, using CNN-based SR models for remote sensing images results in several challenges. First, most CNN-based methods boost performance by setting a very deep model; however, such deep models usually cause high computation and memory costs, limiting their practical applications. Second, the use of dilated convolution (DC) is not used effectively in enlarging perceptive fields and multiscale features. Third, most CNN-based SR models insufficiently and inefficiently use the prior knowledge of images.
Thus, this article proposes using a gradient prior dilated convolutional network (GPDCN) for remote sensing image SR to solve the above issues. The model comprises an end-to-end pixelwise network that combines low-level detail information, highlevel semantic information, gradient information, and global contextual information. Furthermore, the framework consists of two phases: feature extraction and feature restoration. The feature extraction section is a deep convolutional network architecture that captures more powerful edge characteristics than other methods using gradient prior information. A double-path dilated residual block (DPDRB) extracts multiscale feature maps from LR and gradient images during the process. In addition, a global self-attention (GSA) module appended to each of the first five DPDRBs considers global context relevance. Then, the frequency-based combination of feature maps with varying frequencies is used to reconstruct the final high-frequency details via the mixed high-order attention module (MHOA) in the restoration section, acted as the feature restoration section. In summary, this article makes the following contributions.
1) A novel network, GPDCN, is proposed to address the structural distortions and enhance the precision of remote sensing imagery SR. 2) A unique DPDRB module is present for extracting multiscale feature maps without increasing the parameters via varied dilation rates. By this module, the entire pixel data can be covered and displayed. 3) A gradient propagation network (GPN) is designed to recover gradient maps from LR images to HR images, while additional supporting information is also provided for SR. 4) A GSA block is also developed to capture global contextual information and, simultaneously, to fully use selfsimilarity among nonadjacent pixels and improve the completeness of boundaries. The rest of this article is organized as follows. Section II provides a synopsis of the relevant works. Section III will detail the proposed approach. Section IV discusses the experimental procedures and outcomes to validate the proposed method. Finally, Section V concludes this article.

A. Edge Prior
The use of edge prior has been validated in previous work [11], which attempted to construct the gradient transfer mapping from LR to HR images using mathematical equations. Since an image's sharp edges correlate to well-defined gradients along the border, Fattal [11] tried using edge statistics to infer the prior reliance of various resolutions. Sun et al. [18] used a gradient field transformation to constrain the gradient field of the HR image and the reconstructed image. Furthermore, Tai et al. [19] extended edge-directed SR by using user-supplied example texture and restoring the fine details. Kondo and Fujiwara [20] combined reconstruction-and example-based SR to maintain natural edge structures. Yan et al. [21] enhanced the gradient profile sharpness, an edge sharpness metric, with a triangle model and a Gaussian mixture model.
Undoubtedly, it is unreliable to model the mapping relationships with a few parameters, especially with complicated land covers and intricate and fragmented image details. Therefore, the reconstructed images usually exhibit spurious artifacts or jagged edges. Since deep learning networks are good at end-to-end pixel transformations, several deep learning strategies have leveraged the advantage of prior's powerful qualities in SR assignments. Yang et al. [22] applied an off-the-shelf edge detector in a recurrent residual network to reconstruct fine features guided by edges. Ma et al. [23] introduced image gradient into the GANbased SR network to provide additional structure information, improving the edge details' reconstruction ability.

B. Dilated Convolution
DC was first proposed to solve resolution reduction and information loss problems, which often appear in pixel-level tasks, such as image semantic segmentation. Early segmentation methods primarily conducted the pooling operation after the convolution layers to reduce the model's computation and increase the receptive field of the convolution layer. However, as the output of image segmentation is the pixel level, the output and input sizes should be the same. Thus, the deconvolution operation is commonly conducted in the latter part of the network, resulting in more missing information. Then, DC is introduced into semantic segmentation to solve this problem [24]. Subsequently, a few SR technologies have paid attention to the DC, as it is the same pixel-level task as semantic segmentation. Lin et al. [25] proposed a seven-layer dilated convolutional neural network with skip connections for reconstructing the HR image from an LR image. Mirchandani and Chordiya [26] presented a dilation patch super-resolution generative adversarial network by applying dilated operation in the generator architecture to obtain high-quality features. However, these methods have not focused on the massive potential of the dilation convolution to extract multiscale features. Different from these methods, this article exploits the different dilation rates to obtain multiscale features with a double-path extraction architecture.

C. Self-Attention Mechanisms
As a means of reallocating available resources to the most informative segments of inputs and modeling long-range dependencies, the attention mechanism was first applied in [27] to obtain global dependencies in the machine learning task. In addition, attention modules have been frequently used in CNNs for a variety of tasks, such as visual question answering [28] and image and video classification tasks [29], [30]. Attention modules are also becoming more widespread in the image SR field. Zhang et al. [31] applied SENet in CNNs to enhance SR performance. Dai et al. [32] constructed a second-order attention network (SAN) by considering the second-order characteristics of features. Fu et al. [33] presented dual attention as a technique for adaptively integrating local features with their global interdependence in both spatial and channel dimensions. Even though the attention mechanism has performed admirably in computer vision, when it comes to extracting explicit material from small and sparse elements, a lack of awareness of the relationship and correlation of each location is a big problem for image SR. Consequently, we have designed a GSA module to build the long-range correlation by integrating a query-specific global context for each query position.

D. SISR Algorithms of Sensing Images
SISR algorithms have gained popularity for remote sensing images in recent years [34]. Besides applying cutting-edge imaging technology, the SR technique is a low-cost and effective way of enhancing image quality. With the popularity of neural networks, numerous attempts have been made to design various architectures to obtain high-quality HR remote sensing images through learning a mapping function between LR and HR matches. Lei et al. [35] presented a local-global combination network (LGCNet) for image SR that was inspired by the success of CNN in natural image SR. To learn the connections between the characteristics from each recursion, Chang and Luo [36] devised a bidirectional convolutional long short-term memory layer. Jiang et al. [37] developed an edge-enhanced GAN model EEGAN for promoting satellite image SR reconstruction using an adversarial learning technique that can restore the sharp edge effectively. Dong et al. [38] proposed RRSGAN, which aligns the Ref features to the LR features, and the texture information in the Ref features can be transferred to the reconstructed HR images. To fuse multiscale high-/low-dimensional features, TransENet [39] introduced the transformer structure into the conventional SR framework and achieved superior performance in the remote sensing field. Zhang et al. [40] proposed a mixed high-order attention network (MHAN) to reconstruct the details by extracting high-order statistics. For hyperspectral imagery, Hang et al. [41] combined a decomposition subnetwork and a self-supervised subnetwork to construct an end-to-end SR network. Zhou et al. [42] proposed a pyramid fully convolutional network consisting of an encoder subnetwork and a pyramid fusion subnetwork to enhance the spatial resolution of low-spatial-resolution hyperspectral image.

III. METHODOLOGY
This section provides an overview of the framework. Unlike previous methods using sophisticated structures to form the deep architecture, the proposed model (see Fig. 1) is divided into two parts: a multiscale feature extraction network (the left part of Fig. 1) and a feature reconstruction network (the right part of Fig. 1). The feature extraction network is, in turn, divided into two parts: a structure-maintaining network (SN) and a GPN. The SN includes nine DPDRBs and five GSA modules. The GPN comprises three DPDRBs to obtain multiscale hierarchical features using different dilation rates. The feature reconstruction network is the MHOA, consisting of several different R-order (R = 1, 2, 3, 4) attention modules that reconstruct complicated details. The feature fusion and propagation from the extraction to the reconstruction are achieved through the frequency-based feature combination method.

A. Multiscale Feature Extraction Network
The multiscale feature extraction network includes an SN and a GPN and uses the three-band images as input. The SN begins with a convolution layer with a kernel size of 3 × 3 on the input image. Then, nine repeated DPDRBs follow in the later part of the network. After each of the first five layers, the feature maps are fed into the GSA block, gathering and distributing long-range image features. The DPDRB is inspired by multiscale residual block [43] and DC [24]. We incorporate the feature maps from the first, fifth, and ninth DPDRB to the GPN.
1) Double-Path Dilated Residual Block: Like image semantic segmentation, a pixel-level task, SR aims to predict an image's unknown pixels. Conducting multiple pooling operations would lose some vital feature information, resulting in unsatisfied reconstruction results. In addition, the standard 3 × 3 kernel focuses on a small region, ignoring nonadjacent pixels' relevance. Based on these considerations, we incorporate DC into the DPDRB (see Fig. 2), and the DC can be considered "convolution with a dilated filter," which is comparable to adding zero elements between two neighboring elements of the convolutional kernel. The dilation rate d means adding d − 1 zero elements between the nearby elements of the kernel, and the receptive field by different dilation rates is presented in Fig. 3. By using variable dilation rates, the different receptive fields of each element can be achieved with the same convolutional kernel. In the DPDRB (see Fig. 2), we adopt a continually increasing dilation rate (d = 1, d = 2, and d = 3) to accommodate exponentially expanding receptive fields.
The DPDRB can be illustrated as (1) and Fig. 2, which includes two parts: multiscale feature extraction and weighted residual connection. Given a feature F n−1 as the input of the DPDRB, S 1 and P 1 are the features through the first convolution layer, S 2 and P 2 are the features through the second convolution layer, and the final output feature F n can be obtained as where w and b denote the weights and bias, respectively, and the superscripts of w and the subscripts of b represent the layer index. The subscript 3 × 3 represents the size of the convolutional kernel. If the DC operation is used in the convolutional kernel, the subscript d k×k − r represents the k × k convolutional kernel with a dilation rate of r. [S 1 , P 1 ], [P 1 , S 1 ], and [S 2 , P 2 ] represent the concatenation operations, and λ 1 and λ 2 are the learnable parameters for weighting the input and output of the nonlinear mapping mode, respectively. In the multiscale feature extraction part, a double-path DC network is adopted to extract local multiscale features by conducting different dilation rates. Based on this flexible DC, not only does the residual structure allow for bypassing abundant low-frequency information via several extraction layers, but also λ 1 and λ 2 act as learnable parameters during training to reallocate available resources toward the most informative components of the two parts, which encourage networks to focus on learning high-frequency information. Then, the embedding information can be shared in the block and, thus, transmit and fuse the image features of different scales.
2) GSA Module: The present convolutional network concentrates on the local region, suffering the limitation of capturing the global relationship among the whole spatial area. Considering the self-similarity in the long-range area in remote sensing images, the attention mechanism is widely used in recognition and detection tasks to aggregate the self-similar patches. Inspired by [30], [44], and [45], the GSA block is proposed for aggregating long-range feature information. The GSA block can be illustrated as Fig. 4, and the input feature maps can be defined as X ∈ R B×C×W ×H , where B and C denote the batch size and channel number, respectively, and W and H represent the width and height of the input feature, respectively.
The input vectors are first fed into three 1 × 1 convolutions, and selective rewriting operations are performed to obtain three embedding feature maps K, Q, and V in different feature space, where K ∈ R B×(W ×H)×(C/2) , V ∈ R B×(W ×H)×(C/2) , and Q ∈ R B×(C/2)×(W ×H) . To decrease the computation cost, we set the output channel as C/2 for dimension reduction. The attention weight matrix A ∈ R B×(W ×H)×(W ×H) can be obtained by calculating the self-similarity of K and Q as follows: where ⊗ denotes matrix multiplications. Then, we rescale the value vector by conducting matrix multiplications on A and V , which can be illustrated as follows: Then, the convolution and elementwise addition operations yield the final output Z where C is the 1 × 1 convolution operation to adjust the channel number to the same as X.
In addition, (5) shows the GSA's matrix operations on a 3-D input array, ignoring the batch size B. In the first step, for the matrix K, K i represents the values of all channels at the ith pixel position in space K, and Q j denotes the values of all channels at the jth pixel location in space Q. ⊗ is the matrix multiplication operation. The element A ij in attention matrix A can be viewed as the influence of the jth element of the input feature map on the ith element, thus realizing the dependence between any two elements of the global context. In the second step, we conduct the matrix multiplications on attention map A and value matrix V , where V n represents the nth channel values for all positions. After conducting a softmax operation on matrix V , the matrix V S can be achieved. The multiplication processes can be described as aggregating the nonlocal self-similarity to achieve modeling of global long-range dependencies GSA blocks enhance the description of pixel features by modeling long-range relationships. Optimized for the first attention mechanism Nonlocal block [30], the GSA performs two softmax operations to smooth the image and neutralizes the effect of the sharp gradient map.
3) Gradient Propagation Network: The gradient of the image denotes the change rate of the image pixels' gray value along the x-axis and y-axis. The contours of HR images are sharper than those of LR images, as shown in Fig. 5. In other words, a more excellent gradient value indicates a clearer margin. As an image can be viewed as a 2-D discrete function, the difference of the neighboring pixels can be approximately calculated as the gradient where I(x, y) represents the pixel value at the (x, y) position; G x (·) and G y (·) represent the gradient value along the x and y axes, respectively, and a convolutional kernel can achieve it; I x (·) stands for the pixel value of the according coordinates; and Grad(·) is the gradient map at coordinate (x, y). As shown in Fig. 5, the gray value changes sharply on edge, so the gradient map can illustrate the sharpness feature of the image. In addition, this valuable feature can be utilized to facilitate the sharpness of the SR images. Considering this, the GPN is proposed to propagate the gradient in the network. The GPN takes the gradient map Grad(G) obtained from the gradient calculation as the input and starts with an initial convolution layer with a kernel size of 3 × 3. In the latter of the GPN, there are three repeated gradient blocks (GBs), each of which is the same as the DPDRB, concatenating the intermediate-level SR feature and gradient feature as the input where F sr and F gradient represent the intermediate-level SR feature and gradient features, respectively. Since a well-designed SR branch can convey abundant edge structural information, the proposed strategy performs outstandingly for recovering gradient maps. Simultaneously, the feature maps are also vital and are employed prior to enhancing the GPN's performance. With three propagating GBs, the integrated feature maps are extracted and fed into the last convolution layer with a 3 × 3 kernel. The final output of the GPN is conducted as part of the input for the reconstruction network.

B. Feature Reconstruction Network 1) Mixed High-Order Attention Module:
Attention adjusts the weight of the convolution parameters to highlight the essential parts and reduce the influence of useless information. The traditional channel-based [31] and spatial-based [46] attention commonly used in the CNN concentrates on the first-order attention, which only obtains the coarse feature statistics because they only focus on a specific region, failing to accurately predict the impact of different regions operating in concert on the final result. To model the complicated and high-order feature statistics for complex details in remote sensing images and model the attention of different parts of the image or feature map acting in concert mechanism, we introduce the HOA module [40], [47] into the reconstruction network, in which R is defined as the order of the HOA; more specifically, when R > 1, the R-order attention concentrates on R regions and models the interaction of R attentions, which can have a synergistic effect on the final output.
For an HOA module with R order, we use In addition, we apply the nonlinearity variation to improve the representation capacity of this high-order feature map, and the variation is formulated by where the ReLU function denotes the nonlinear activation function, and the sigmoid function is applied to restrict the element of the A(X) in [0,1]. Finally, similar to the general attention mechanism, A(X) is used to reweight the input X, that is, More specifically, as shown in Fig. 6, when R = 1, we get a descriptor Z, and A(X) = sigmoid(Z), then Y = A(X) X, as shown Fig. 6(a); when R = 2, we first use two 1 × 1 convolutions to get two embedding features, {Z 2 1 , Z 2 2 }, at level 2. We combine them to achieve the second-order component, that is, Then, at level 1, one convolution is used to obtain an embedding feature, Z 1 . Subsequently, the weighted matrix is calculated as A(X) = sigmoid(ReLU(Z 1 ) + ReLU(Z 2 )), and finally, the output Y can be obtained by the operation Y = A(X) X, as shown in Fig. 6(b).
2) Frequency-Based Feature Combination: In the present SR methods, many residual blocks are usually accumulated to model a very deep network, and the feature map from the last residual block is then commonly used to reconstruct the HR images. Though these "deep" networks have achieved some satisfying performance, there is a limitation in considering the feature distribution of hierarchical features, failing to utilize the feature map of different frequencies. Qiu et al. [48] pointed out that the feature maps from different frequency bands often vary greatly. For shallow layers, the parameters concentrate on low-frequency information, such as the basic textures. For deeper layers, the parameters emphasize high-frequency components of the whole image, including the region filled with edges, complicated corners, and other characteristics. As a result, low-frequency feature maps from shallow layers should be routed to a higher order HOA module, which is more elaborate and has a greater capacity for detail restoration. In addition, high-frequency components are necessary for reconstruction, and the information collected at higher frequencies from deep layers should be augmented further using a higher order HOA module. Thus, it is not the best way to concatenate all the feature maps obtained directly from different layers. Therefore, the feature maps from shallow layers and deep layers are combined based on various frequencies in our reconstruction network, and the process can be formulated as Assume that the reconstruction network contains N HOA modules and M = 2N + 1. F E and F R denote the feature maps from the extraction network and the reconstruction network, respectively, [·] is the concatenation conduction, HOA r denotes the HOA module of order r, and F R r and F R r−1 , respectively, represent the output feature from the r-order HOA module and the (r − 1)-order HOA module in the reconstruction part.

C. Loss Function
The proposed method, GPDCN, aims to minimize the difference between the reconstructed and ground-truth images. Considering that the majority of image evaluation indicators are significantly connected to pixel-by-pixel differences, pixel loss is still highly desirable. Based on the pixel loss between the HR image I HR and the SR image I SR , we add an additional gradient loss to consider the gradient information. The loss function in our experiment consists of two parts: image pixel loss and gradient loss where h, w, and c represent the height, width, and channel number of the picture, respectively, and λ is the weighted parameter on the gradient loss. In this article, we set λ = 0.001.

IV. EXPERIMENTS AND ANALYSES
A. Datasets and Implementation Details 1) Datasets: The proposed approach is evaluated using a new SR dataset named 3K VEHICLE_SR, which is obtained from the 3K VEHICLE dataset, a standard vehicle identification dataset [49] consisting of 20 pictures with a spatial resolution of 0.13 m of 5616 × 3744 pixels. The 3K VEHICLE dataset includes vehicles in a variety of real-world environments, including ports, hills, lakes, metropolitan zones, and rivers. The 20 photographs are cropped into 1170 subimages of 512 × 512 pixels and are used in the experiment as 3K VEHICLE_SR (the sizes of the LR images are 128×128 for ×4 scale, 171×171 for ×3 scale, and 256×256 for ×2 scale). Eighty percent of the subimages are used for training, 10% for validation, and the remaining 10% for testing. In addition, the Massachusetts_Roads dataset is used to assess the generalizability and robustness of the technique. We also use the WHU-RS19 dataset to test the performances on different scenes. In the experiments, all the LR images are degraded from HR images by bicubic interpolation, and the corresponding HR ones are reviewed as ground truth.
2) Implementation Details: In the experiments, we focus on scale factors of ×2, ×3, and ×4 and evaluate SR results using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Y channel of the converted YCbCr space. The model inputs and outputs three-channel RGB pictures. Since L1 loss accelerates convergence and stabilizes the convergence process, the L1 loss is adopted as the training loss function. ADAM [50] is used to optimize the training process with β 1 = 0.9 and β 2 = 0.999 to update the model parameters. The initial learning rate is set at 10 −4 and then drops by a factor of 10 every 100 epochs. All the experiments are conducted with PyTorch on NVIDIA TITAN RTX GPU.

B. Evaluation Metrics
The PSNR and SSIM, which are frequently used as assessment metrics for SR tasks, are employed in this experiment to quantify the suggested method. Given an HR image H and an SR image S, N denotes the image's total pixel number. The where t represents the tth location, H(t) and S(t) are the corresponding pixel value of the HR and SR images, respectively, and MAXI equals 255 for the 8-bit image where u is the mean pixel value, σ represents the pixel variance value, and σ HS represents the covariance between the HR image and the SR image.

C. Ablation Study
Since our GPDCN contains a trick DC and three major components-GSA module, GPN, and MHOA network-to demonstrate the validity of different parts, five tests are designed and conducted. The first column represents baseline, and the second to fifth columns represent the results of removing a module by the model, in order to verify the effectiveness of the modules. The results in Table I demonstrate that the GSA module contributes more improvement than the GPN and MHOA do. The GSA module, GPN, and MHOA can increase the network's performance for the results of ×4 SR by 0.0982, 0.0844, and 0.0893 dB, respectively. Obviously, when all the components are integrated, further promotion is achieved, demonstrating the importance of each design in obtaining the highest SR results in the proposed network.

D. Comparisons With State-of-the-Art Methods
To demonstrate the feasibility of our method, we compare it with other representative interpolation-, CNN-, and ResNetbased SR approaches: ESPCN [51], Bicubic [9], VDSR [52], DRN [53], IGNN [54], RCAN [31], SAN [32], and RDN [55]. We analyze various comparison approaches using open-source code, and all the methods are trained and tested in the same environment to guarantee the results are credible.
1) Quantitative Evaluations: Table II details the PSNR and SSIM values. Table II shows that the GPDCN outperforms the current competitive approaches on two metrics (PSNR and SSIM) for almost all factors except for the ×2, and the reason may be that the GSA's advantages of catching self-similar properties are not particularly obvious for the images with relatively  that gradient priors and global context-aware information are powerful tools for a more accurate restoration.

2) Qualitative Evaluations:
We also examine the visual effects of various procedures, as depicted in Fig. 7. We highlight the regions of interest that clearly illustrate the distinctions among the different methods. The results reveal that the fundamental bicubic interpolation technique cannot increase the amount of information. Deep-learning-based methods, such as VDSR, are capable of inferring some texture information but produce blurry image contours as a result of their global optimization method and wasteful feature usage. The results generated by the proposed approach are very competitive and much more realistic for images containing repeating high-frequency characteristics, such as corners, lines, and squares. For instance, in the figure of a car [see Fig. 7(e)], only the proposed method can reestablish an accurate and evident pattern with fewer artifacts, whereas others suffer from different degrees of blurring. Fig. 8 illustrates the gradient maps. As we can see, other methods' gradient maps have low values or include structure degradation, whereas ours are bold and realistic. The qualitative comparison demonstrates that our proposed GPDCN method is capable of extracting additional structure information from the gradient space to generate clear and realistic SR images by maintaining geometric structures.
3) Model Size Analysis: Since the model size is an essential consideration in practical applications, particularly on limited computer devices, we provide the model size and performance for currently competitive SR approaches in Table III. According to Table III, compared with DRN [53] and SAN [32], although the proposed method processes more parameters, the PSNR value is still much enhanced. Furthermore, compared with IGNN [54] and RDN [55], our algorithm produces more competitive results with fewer parameters. At the same time, the FLOPs and GPU costs of our method are comparable. As illustrated in Table III     remote sensing images SR are discussed in detail: EEGAN [37], LGCNet [35], RDBPN [56], TransEnet [39], and MHAN [40]. Except for PSNR and SSIM, Natural Image Quality Evaluator (NIQE) [57], Perceptual Index (PI) [58], and learned perceptual image patch similarity (LPIPS) [59] are introduced into the evaluation as PIs. Note that the lower the NIQE, PI, and LPIPS, the higher the quality of the images. The conclusions are shown in Table IV, and it is clear that our approach performs the best on the 3K VEHICLE_SR dataset with a scale factor of ×4. In addition, we present a visual comparison in Fig. 9. By zooming the regions of interest, it is evident that the EEGAN and LGCNet cannot produce rich details. RDBPN and TransENet can generate sharp edges but with some blurred contents. As for the MHAN, it can slightly improve the reconstruction performance by using the high-order attention mechanism while still having some noises and artifacts. And our GPDCN reconstructs the most realistic image features with the slightest ambiguity, resulting in more visually attractive results. It unequivocally establishes that the proposed GPDCN is a viable solution to the remote sensing imagery SR problem.

E. Comparison on Image Scene Classification Dataset
We compare our GPDCN with several SISR methods on WHU-RS19, a scene classification dataset for remote sensing image.  in ten scene categories and the best SSIM results in 17 scene categories. Compared with other algorithms, our method is more effective in some scenes, which include abundant edges and contours, such as "Airport," "Bridge," "Commercial," "Park," etc. Meanwhile, the proposed method achieve best PSNR and SSIM for the overall evaluation. It also shows that the PSNR results are very different among different scenes, in which the PSNR for "Beach" images is 40.198 dB, but the PSNR for "Residential" images is only 22.418 dB. The reason for this is that the image includes a wide range of remote sensing scenes, i.e., the image of "Residential" owns more high-frequency information than the "Beach." The very smoothing scenes, where minimal high-frequency information should be super-resolved, tend to have higher PSNR results.

F. Visualization of Image Gradient
To demonstrate the efficacy of the GPN, the gradient maps as outputs are shown in Fig. 10. Given sharp-edged HR photos, the generated corresponding gradient maps often contain finegrained and unambiguous contours for the items in the images [see Fig. 10(c)]. As Fig. 10 shows, after the bicubic operation, the gradient maps generated from the LR equivalents frequently exhibit coarse-grained shapes [see Fig. 10(b)]. Our GPN inputs LR gradient maps and outputs SR gradient maps to supply the SR branch with precise structural information [see Fig. 10(d)].
The result shows that our GPN successfully recovers sharp and structure-pleasing gradient maps.

V. CONCLUSION
This article presented a GPDCN for SR in remote sensing images. To be more precise, the DPDRB was proposed to enlarge the receptive field and extract multiscale feature maps. The GSA structure enabled the GPDCN to acquire long-distance interactions and structural information to integrate nonlocal operations into the network. Meanwhile, we constructed a GPN using double-path DC blocks to recover HR gradient maps from the LR ones and provide explicit structural guidance to the SR branch via gradient information. In addition, an MHOA module was adopted to reconstruct the image using hierarchical characteristics with various frequency bands. Extensive experimental results demonstrated that the proposed GPDCN can surpass existing SR algorithms and balance performance and efficacy.