G2Grad-CAMRL: An Object Detection and Interpretation Model Based on Gradient-Weighted Class Activation Mapping and Reinforcement Learning in Remote Sensing Images

Remote sensing images (RSIs) contain important information, such as airports, ports, and ships. By extracting RSI features and learning the mapping relationship between image features and text semantic features, the interpretation and description of RSI content can be realized, which has a wide range of application value in military and civil fields, such as national defense security, land monitoring, urban planning, and disaster mitigation. Aiming at the complex background of RSIs and the lack of interpretability of existing target detection models, and the problems in feature extraction between different network structures, different layers, and the accuracy of target classification, we propose an object detection and interpretation model based on gradient-weighted class activation mapping and reinforcement learning. First, ResNet is used as the main backbone network to extract the features of RSIs and generate feature graphs. Then, we add the global average pooling layer to obtain the corresponding feature weight vector of the feature graph. The weighted vectors are superimposed to output class activation maps. The reinforcement learning method is used to optimize the generated region generation network. At the same time, we improve the reward function of reinforcement learning to improve the effectiveness of the region generation network. Finally, network dissecting analysis is used to obtain the interpretable semantic concept in the model. Through experiments, the average accuracy is more than 85%. Experimental results in the public RSI description dataset show that the proposed method has high detection accuracy and good description performance for RSIs in complex environments.


I. INTRODUCTION
Remote sensing image (RSI) interpretation is the core link of RSI application. Efficient and accurate interpretation technology helps to improve the application level and expand the application fields of remote sensing [1], [2]. Currently, remote sensing surveys and updates for surveying and mapping, land, forestry, and other industries in China still rely mainly on manual visual interpretation, which is time-consuming, laborious, and costly, and has a long cycle. It cannot meet the urgent need for rapid extraction and updating of natural resource information under current rapid economic and social development [3], [4], [5].
In recent years, the rapid development of remote sensing technology has made remote sensing data easy to obtain. Even with sufficient data, object detection methods designed for natural images cannot be applied directly to RSIs because of problems such as a single prediction scale, poor fitting of horizontal boxes to targets, and a lack of target feature enhancement [6]. The problems in the field of remote sensing object detection can be summarized as follows.
1) Scale change problem: An RSI covers a large scene, and the image reaches a resolution of millions of pixels. Therefore, the target scale is small relative to the image, which makes it difficult to obtain the fine features of the target. In addition, the target scale in RSIs varies over a wide range, which is not conducive to single-scale multiclass target detection. Chen et al. [7] proposed a multiscale object detection framework based on a context feature pyramid, which improved the performance of multiscale object detection by enhancing the connection between scene and object. Zhao et al. [8] proposed a rotation-invariant CNN model, which introduced and learned a new rotation-invariant layer to improve detection. Obeso et al. [9] proposed a multiscale object detection algorithm based on an attention mechanism, which redistributed the weights of feature maps in different channels. Although some methods solve the scale problem, time efficiency cannot be guaranteed.
2) Goal orientation problem: The objects in RSIs are oriented and densely arranged, and their directions are irregular. Therefore, the detection model needs rotation invariance and better-quality detection boxes. To address the difficulty that object detection approaches have in distinguishing mixed pixels and in selecting thresholds, the adversarial growth algorithm was proposed in [10]. Feng et al. [11] proposed a single-stage target detection algorithm with a dynamic receptive field. A bottom-up short connection pathway and a global context up-sampling module were added to the RetinaNet structure to enhance the structural and semantic features of the detection layer. The cascade R-CNN algorithm in [12] continued the two-stage idea based on candidate regions and adopted a cascaded detection head structure, which improved the detection performance step by step and had a good effect on small targets. However, the problem of incomplete feature extraction remains.
3) Complex and chaotic background: The RSI background is complex and diverse and includes a large amount of redundant background information, such as mountains and rivers. This blurs the boundary between background and target, which is not conducive to the extraction of target features by the model. Avola et al. [13] proposed to enhance features by capturing the correlation between global scenes and local features. Sun et al. [14] proposed nonmaximum suppression constrained by aspect ratio to improve the quality of candidate regions, and used a deformable convolutional neural network to model geometric changes of objects, which effectively improved object detection. Cheng et al. [15] proposed a pixel attention mechanism to suppress image noise and highlight target features, and introduced an Intersection over Union (IoU) constant factor into the SmoothL1 loss to solve the rotation box boundary problem and make rotation box prediction more accurate. Chen et al. [16] replaced the traditional bounding box with a rotatable box embedded in SSD, so that the algorithm could predict the direction angle of the target and had rotation invariance. These algorithms improve on traditional CNNs for RSIs and raise the performance of RSI target detection to a certain extent. However, problems such as detection angle offset, frequent missed detections, and a low recall rate remain.
4) Lack of interpretability: Deep learning is a "black box" model that lacks explanatory information about its predicted behavior. As remote sensing technology is related to national security issues, it is essential to perform interpretable analysis on the model to enhance confidence in the prediction results. For example, Li et al. [17] added a semantic graph module to a pretrained CNN to obtain semantic information of classification and enhance interpretability. Yan et al. [18] proposed the method of gradient attribution, which used the gradient with respect to each input pixel to understand the association between the input and the prediction results. In addition, there are visualization methods [19], [20], such as visualizing the regions with large activation values of a convolution kernel and analyzing the information that the model obtains from the image. These interpretable methods generally rely on human subjective judgment and lack in-depth analysis.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Object detection aims to detect objects of different scales and categories in images and to give the predicted positions of objects of each category. Traditional object detection methods usually use handcrafted features in the feature extraction stage, such as the scale-invariant feature transform and the histogram of oriented gradients [21], [22]. The performance of such feature extraction methods largely depends on feature design, which requires a lot of prior knowledge. Therefore, this kind of method has a high design cost, poor feature robustness, and weak generalization ability. Compared with manually designed features, object detection based on deep learning uses CNNs to extract image features [23], [24], which provides automatic and powerful feature extraction, better robustness, and higher detection accuracy. Therefore, traditional object detection methods have gradually been replaced by deep learning-based methods.
Deep learning-based object detection algorithms can be roughly divided into anchor-based algorithms and anchor-free algorithms. The difference is whether anchors are used to extract candidate boxes.
Anchor-based algorithms include two-stage detection models, such as the region-based CNN (R-CNN) series, and one-stage detection models, such as YOLOv2 (You Only Look Once, version 2) and SSD (Single Shot MultiBox Detector) [25]. R-CNN first generates candidate boxes for feature extraction and then applies classifiers to these regions to refine and extract targets. Faster R-CNN uses a region proposal network (RPN) to generate the candidate regions. For a given image, SSD outputs the bounding boxes and categories of the targets using regression.
Anchor-free algorithms discard anchors and obtain box descriptions through other methods, such as YOLOv1 (You Only Look Once, version 1), CornerNet [26], ExtremeNet [27], and the fully convolutional one-stage (FCOS) detector. YOLOv1 regresses the target position and category for each pixel of the feature map. CornerNet and ExtremeNet obtain the detection box from key points. CornerNet transforms the box regression problem into a detection and matching problem for the top-left and bottom-right corner points. ExtremeNet defines key points as extreme points and groups them according to the geometric structure. FCOS uses dense prediction to predict detection boxes, and the detector directly takes pixels as training samples, so it does not need anchors to restrict the selection of features.
Most existing studies recognize and detect targets in RSIs based on deep learning and achieve high detection accuracy. However, object detection methods cannot generate text descriptions related to RSI content, and there is a semantic gap between low- and high-level semantic features. Such methods cannot realize the sensing and understanding of RSIs and therefore have certain limitations [28]. Wu et al. [29] proposed a novel global context-weaving network (GCWNet) for object detection in RSIs to solve dense instance stacking, large-scale variations, and complex background issues. Wang et al. [30] proposed an end-to-end feature-reflowing pyramid network (FRPNet), which had two advantages that contributed to improving object detection accuracy. Wu et al. [31] proposed a context-driven detection network (CDD-Net) to improve the accuracy of multiclass object detection in RSIs. To capture local neighboring objects and features, a local context feature network was proposed to learn the local context of the region of interest. Unlike object detection, image description methods combine computer vision and natural language processing. Image description can extract the target areas in RSIs, with extracted features that include spatial features, environmental features, and scenarios, and it studies the mapping relationship between image features and text semantic features.
Currently, most of the research on image description focuses on natural scenes, and there are few studies on image description for remote sensing scenes. Zhou et al. [32] proposed a description generation model based on multiscale and attentional feature enhancement, which realized the description of RSIs. Sun et al. [33] proposed an RSI description model based on deep learning and CNN. Xue and Tong [34] proposed a deep multimodal neural network model, which could be used for the text description of high-resolution RSIs. Lu et al. [35] constructed a public RSI description dataset and used a multimodal method and attention method to generate description of the content of RSIs.
Although the above-mentioned researchers have realized the description of RSIs, it is easy to be affected by the complex background of RSIs, more noise information, and a small proportion of targets, resulting in low accuracy of the generated RSI description results, which cannot meet the requirements of RSI description in complex environments. For example, if the background color is similar to the remote sensing target color, it will be difficult to distinguish the remote sensing target, and clouds, atmospheric particles, and fog will bring great difficulties to the extraction of RSI features.
Our main contributions are as follows. This article presents an object detection and interpretation model based on gradient-weighted class activation mapping and reinforcement learning. A backbone network based on ResNet is used to extract features from RSIs. Then, a global average pooling (GAP) layer is added to obtain the feature weight vector corresponding to each feature map. The weighted feature maps are superimposed to output class activation maps. The reinforcement learning method is used to optimize the region generation network. Meanwhile, we improve the reward function of reinforcement learning to improve the effectiveness of the region generation network.
This article is organized as follows. Related works are reviewed in Section II, including deep learning interpretability approaches and Grad-CAM. Section III proposes image object detection and interpretation. Several experiments are conducted in Section IV to show the superiority of the presented method. Finally, Section V concludes this article.

II. RELATED WORKS
With the rapid development of remote sensing technology, high-resolution RSIs contain increasingly rich information, which greatly promotes the applied research in the field of remote sensing. RSIs contain important information such as airports, ports, and ships. By extracting RSI features and learning the mapping between image features and text semantic features, RSI content can be interpreted and described. It has a wide range of application value in military and civil fields such as national defense security, land monitoring, urban planning, and disaster mitigation [36], [37]. For example, in national defense security, by extracting and capturing important information such as airports and ships in RSIs, text descriptions related to the content of RSIs with smooth semantics can be generated, which can provide military information for military security managers, assist them to make decisions quickly and deploy tasks. In the civil field, the generated RSI text description can accurately provide important information about disaster assessment, farmland utilization, vegetation cover, and urban change, and provide decision support for relevant managers. Therefore, it is of great significance to describe RSIs.

A. Deep Learning Interpretability Approaches
Currently, the interpretability of deep learning is divided into several branches, among which the visualization method is one of the important research directions. Zhang et al. [38] proposed sensitivity analysis to quantify the sensitivity of the model to input variables and visualize regions with high sensitivity, indicating that this region mainly affected model decision-making. Other visualization methods sample the image blocks with the largest convolution kernel activation value [39], and then visualize these activated image blocks to analyze how the networks obtain information. Ke et al. [40] used two visualization techniques (occlusion and guided backpropagation) to find relatively important areas in the image.
As shown in Fig. 1, the interpretable visualization algorithm described above visualizes network feature maps or activation maps without further analysis of these visual features. These methods use human visual observation analysis to obtain the interpretability of network models, which is prone to human subjective judgment errors.

B. Gradient-Weighted Class Activation Mapping (Grad-CAM)
Grad-CAM belongs to the method based on class activation mapping in local interpretation [41]. According to the prediction results of a single image, the heat map highlighting the important region is obtained by combining its feature maps as the interpretation result image. Grad-CAM can also be used for weakly supervised localization problems, that is, only the label information of the image is given, and then the object referred to by the label in the image is located.
The idea of Grad-CAM is to calculate the gradient of the feature map of the last convolutional layer, which is used as the weight to obtain the thermal map for a specific category. Since the thermal map is coarse-grained, the method can also be combined with the visual interpretation method based on backpropagation to get the interpretation map with clear semantics, that is, the high-resolution, pixel-level saliency map. This method is simple and intuitive and can be flexibly applied to models of different tasks, such as image classification, image understanding, and visual question answering, as shown in Fig. 2.
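The idea described above can be illustrated with a minimal sketch in plain Python. It assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been computed by some framework; the function name `grad_cam` and the toy values are hypothetical, and upsampling to the input resolution is omitted.

```python
def grad_cam(feature_maps, gradients):
    """Coarse Grad-CAM heat map for one class.

    feature_maps[k][i][j] holds the k-th feature map; gradients[k][i][j]
    holds the gradient of the class score with respect to that entry.
    Both are assumed to be given (hypothetical inputs for illustration).
    """
    num_maps = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Importance score of each map: global average pooling of its gradients.
    alphas = [sum(sum(row) for row in gradients[k]) / (height * width)
              for k in range(num_maps)]
    # Weighted sum of the feature maps, then ReLU keeps positive evidence only.
    cam = [[max(0.0, sum(alphas[k] * feature_maps[k][i][j]
                         for k in range(num_maps)))
            for j in range(width)] for i in range(height)]
    # Scale to [0, 1] by the peak value (the map is non-negative after ReLU).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return alphas, cam


# Toy example: map 0 supports the class, map 1 counts against it.
A = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 3.0], [1.0, 0.0]]]
G = [[[0.5, 0.5], [0.5, 0.5]],
     [[-1.0, -1.0], [-1.0, -1.0]]]
alphas, cam = grad_cam(A, G)
```

In this toy case the positions where only the negatively weighted map responds are zeroed by the ReLU, which mirrors the screening of positive regions discussed in the text.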
The shallow feature maps of deep neural networks usually encode basic concepts such as color and texture, whereas deep feature maps encode more advanced semantic and spatial information. The fully connected layer discards most of the spatial information. Therefore, Grad-CAM selects the feature map output by the last convolutional layer as the original information to provide interpretation. Taking a model that performs a classification task as an example, to obtain the thermal map $L^c_{\text{Grad-CAM}}$ for class $c$, the gradient of the output $y^c$ of the fully connected layer with respect to the $k$th feature map $A^k$ of the convolutional layer, namely $\partial y^c / \partial A^k$, is first calculated. Then, GAP is performed to obtain the importance score $\alpha^c_k$ of the feature map for category $c$, namely
$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}} \tag{1}$$
where $A^k_{ij}$ is the value of $A^k$ at position $(i, j)$ and $Z$ is the number of positions in the feature map. Finally, all the feature maps of this convolutional layer are summed with the weights $\alpha^c_k$, and ReLU activation is performed to obtain the saliency map $L^c_{\text{Grad-CAM}}$ for category $c$, namely
$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Bigl(\sum_k \alpha^c_k A^k\Bigr). \tag{2}$$
ReLU activation is performed to screen out the regions that have a positive impact on category $c$, that is, the regions that can increase the output $y^c$ of the fully connected layer for category $c$. Regions with negative influence may be related to other categories, and displaying positive and negative regions at the same time may lead to relatively chaotic positioning results. The final interpretation result image is obtained by up-sampling and normalizing the thermal map $L^c_{\text{Grad-CAM}}$.
Guided feature inversion [42] is a visual interpretation method based on class activation mapping in local interpretation. In other words, based on the prediction results of a single image, the thermal map highlighting the important areas is obtained by combining its feature maps as the interpretation result image.
First, the original image $I_a$ is fed into the model to obtain the feature map output by each layer. Because deep feature maps encode high-level semantic and spatial information, the feature map output by the last convolutional layer is selected as the original information to provide interpretation. Then, a weight vector $\omega$ is initialized with a constant. An intermediate thermal map $m$ is obtained as the weighted sum of the feature maps, in which $\omega_i$ weights $f^{l_1}_i(I_a)$, the $i$th feature map of the $l_1$th layer of the model. $m$ is upsampled and normalized, and a perturbed image $\Phi$ is generated under the guidance of $m$ by combining the original image with a noisy background image $p$. The background $p$ can be a gray image, a white Gaussian noise image, or a Gaussian-blurred version of the original image. The last option is used to minimize artifacts from sharp edges. With unnatural images that contain such artifacts, it is impossible to judge how much of the change in the model's prediction is caused by the artificial traces. The perturbed image $\Phi$ retains the region highlighted by the intermediate thermal map $m$.
The weight vector $\omega$ is then optimized with an objective that has two terms. The first term makes the distance between the feature maps that the original image and the perturbed image produce at the last convolutional layer of the model as small as possible. The second term is an L1 constraint, which keeps the number of importance scores greater than 0 in the weight vector $\omega$ as small as possible. The reason is that the model does not need all the feature maps to identify an object and may need only the feature maps that correspond to part of the object to make a correct prediction, that is, the set of feature maps used for a prediction is sparse. The significance of using the original image is, on the one hand, to keep the noise of the optimized intermediate thermal map as small as possible and, on the other hand, to reduce the number of parameters to be optimized.
After this first optimization step, the intermediate heat map is not yet class-discriminative. It is just a linear superposition of the feature maps, so it highlights all the foreground objects. To make the interpretation result image class-discriminative, a second objective function is added to fine-tune the weight vector $\omega$. The aim is to make the probability that the model assigns the perturbed image to the specified category as high as possible and the corresponding probability for its complementary image as low as possible. The complementary image retains the region that the intermediate thermal map suppresses. The objective function for the second stage therefore has two terms, both based on the prediction probability output by the model: the first term increases the prediction probability of the specified category for the prominent region of the intermediate thermal map, whereas the second term reduces the prediction probability of the specified category for the complementary region.
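The guided feature inversion steps described above can be written compactly. The following is one plausible formalization rather than a transcription of the original equations: the squared norm, the trade-off coefficients $\gamma$ and $\gamma'$, the symbol $\Phi_{bg}$ for the complementary image, and the symbol $f_c$ for the predicted probability of category $c$ are all assumptions introduced here for illustration.

```latex
\begin{align*}
m &= \sum_i \omega_i \, f^{l_1}_i(I_a)
  && \text{intermediate thermal map} \\
\Phi(I_a, m) &= I_a \odot m + p \odot (1 - m)
  && \text{perturbed image} \\
\min_{\omega}\; &\bigl\| f^{l_1}(\Phi(I_a, m)) - f^{l_1}(I_a) \bigr\|^2
  + \gamma \, \|\omega\|_1
  && \text{first stage: feature distance plus L1 sparsity} \\
\Phi_{bg}(I_a, m) &= I_a \odot (1 - m) + p \odot m
  && \text{complementary image} \\
\min_{\omega}\; &-f_c(\Phi(I_a, m)) + \gamma' \, f_c(\Phi_{bg}(I_a, m))
  && \text{second stage: class discrimination}
\end{align*}
```

Under these assumptions, $\Phi$ keeps the original pixels where $m$ is close to 1 and shows the background $p$ elsewhere, while $\Phi_{bg}$ does the opposite, which matches the roles the two images play in the text.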
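The thermal map can also be obtained by the superposition of random masks. The importance of the region $\lambda$ retained by a mask $Q$ is defined through the prediction probability of the perturbed image obtained by element-wise multiplication of the mask with the image $I$. The final importance of region $\lambda$ in the interpretation result image is then the expectation over all masks that retain $\lambda$, namely
$$S_{I,f}(\lambda) = \mathbb{E}_Q\bigl[f(I \odot Q) \mid Q(\lambda) = 1\bigr]. \tag{8}$$
After the element-wise multiplication of the mask and the image, the greater the prediction probability of model $f$, the more important the area retained by this mask.
Expanding (8) according to the definition of expectation and rewriting it using conditional probability gives
$$S_{I,f}(\lambda) = \sum_q f(I \odot q)\, P[Q = q \mid Q(\lambda) = 1] = \frac{1}{P[Q(\lambda) = 1]} \sum_q f(I \odot q)\, P[Q = q, Q(\lambda) = 1]. \tag{9}$$
The joint probability in the second factor is
$$P[Q = q, Q(\lambda) = 1] = \begin{cases} 0, & q(\lambda) = 0 \\ P[Q = q], & q(\lambda) = 1 \end{cases} \tag{10}$$
so
$$P[Q = q, Q(\lambda) = 1] = q(\lambda)\, P[Q = q]. \tag{11}$$
On substituting it into (9), we get
$$S_{I,f}(\lambda) = \frac{1}{P[Q(\lambda) = 1]} \sum_q f(I \odot q)\, q(\lambda)\, P[Q = q]. \tag{12}$$
Since the mask takes values in $\{0, 1\}$, $P[Q(\lambda) = 1] = \mathbb{E}[Q(\lambda)]$, and the result can be written for all locations at once as
$$S_{I,f} = \frac{1}{\mathbb{E}[Q]} \sum_q f(I \odot q)\, q\, P[Q = q]. \tag{13}$$
According to (12), the thermal map can be obtained by weighting the masks obtained by random sampling, where the weight of each mask is the prediction probability of the corresponding perturbed image. When $N$ masks $Q_1, \ldots, Q_N$ are uniformly sampled, $P[Q = q] = 1/N$, i.e.,
$$S_{I,f}(\lambda) \approx \frac{1}{\mathbb{E}[Q]\, N} \sum_{i=1}^{N} f(I \odot Q_i)\, Q_i(\lambda). \tag{14}$$
A pixel-level mask may have a great impact on the model, because occluding even a small number of pixels may cause a great change in the prediction. In addition, the space of pixelwise masks grows exponentially with the image size, which makes sampling from it expensive. Therefore, to ensure smoothness, smaller masks are generated first and then upsampled to the image size.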
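The random-mask estimate described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function `rise_saliency`, the parameters `cell` and `keep_prob`, and the toy model `toy_predict` are hypothetical, masks are sampled independently per coarse cell, and nearest-neighbor upsampling replaces any smoother interpolation.

```python
import random


def rise_saliency(image, predict, n_masks=200, cell=2, keep_prob=0.5, seed=0):
    """Monte Carlo saliency: masks weighted by the model score they produce."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    grid_h, grid_w = -(-height // cell), -(-width // cell)  # ceiling division
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(n_masks):
        # Sample a small binary mask, then upsample it to the image size.
        coarse = [[1.0 if rng.random() < keep_prob else 0.0
                   for _ in range(grid_w)] for _ in range(grid_h)]
        mask = [[coarse[i // cell][j // cell] for j in range(width)]
                for i in range(height)]
        masked = [[image[i][j] * mask[i][j] for j in range(width)]
                  for i in range(height)]
        score = predict(masked)
        # Accumulate the mask, weighted by the score of the masked image.
        for i in range(height):
            for j in range(width):
                saliency[i][j] += score * mask[i][j]
    norm = n_masks * keep_prob  # N times the expected mask value
    return [[v / norm for v in row] for row in saliency]


def toy_predict(img):
    """Hypothetical model: the score depends only on the top-left 2x2 block."""
    return (img[0][0] + img[0][1] + img[1][0] + img[1][1]) / 4.0


image = [[1.0] * 4 for _ in range(4)]
sal = rise_saliency(image, toy_predict)
```

Because the toy model looks only at the top-left block, pixels in that block should receive a higher saliency than pixels elsewhere, which is the behavior the weighted sum of masks is meant to produce.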

III. PROPOSED G2GRAD-CAMRL
The proposed method is shown in Fig. 3. It includes three main learning stages: mask proposal network (MPN), reinforcement learning, and network dissecting analysis (NDA).

A. Mask Proposal Network Based on Grad-CAM
We propose an object MPN combined with Grad-CAM to achieve the purpose of adjusting the proportional relationship between target and background information as shown in Fig. 4.
In this article, GAP is chosen instead of global max pooling (GMP), because the algorithm requires the MPN to obtain the maximum possible feature region to distinguish target categories. GMP can only output the area with the highest identification of the target and completely abandon the feature area with low identification.
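The difference between the two pooling choices can be seen in a small numeric sketch. The helper names and the two toy feature maps below are hypothetical and serve only to illustrate the argument.

```python
def global_average_pool(feature_map):
    """Mean over all positions: every activated position contributes."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def global_max_pool(feature_map):
    """Maximum over all positions: only the single strongest position counts."""
    return max(v for row in feature_map for v in row)


# One map responds at a single position, the other over the whole region.
peaked = [[0.0, 0.0], [0.0, 4.0]]
broad = [[4.0, 4.0], [4.0, 4.0]]

gmp_peaked, gmp_broad = global_max_pool(peaked), global_max_pool(broad)
gap_peaked, gap_broad = global_average_pool(peaked), global_average_pool(broad)
```

GMP returns the same value for both maps, so it cannot reward a wider activated region, whereas GAP distinguishes them. This is consistent with the motivation for GAP given above.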
We use ResNet to build the MPN combined with Grad-CAM. First, the image is resized to 224 × 224 pixels and input into ResNet [43]. The image is passed through the convolutional layers of the network, and the output size of the last convolutional layer is 7 × 7 × 512. This output is also known as the feature map vector. Let $f_k(w, h)$ represent the activation response of kernel unit $k$ at any position $(w, h)$ in the feature map vector, where $k$ indexes the $k$th 7 × 7 feature map in the vector. Then $f_k(w, h)$ is input into the GAP layer, whose output is
$$F_k = \frac{1}{w_l h_l} \sum_{p = (w_0, h_0)}^{(w_0 + w_l,\, h_0 + h_l)} f_k(p) \tag{15}$$
where $p = (w, h)$, $(w_0, h_0)$ represents the upper-left coordinate of the image, $(w_0 + w_l, h_0 + h_l)$ represents the coordinate of the lower-right corner of the image, $w_l$ is the width of the image, and $h_l$ is the height of the image. For an image with the category $c$ label, Grad-CAM can be calculated by the following formula:
$$S_c = \sum_k \alpha^c_k F_k \tag{16}$$
where $\alpha^c_k$ is the importance score of the $k$th feature map for category $c$. Substituting (15) into (16), the following equation can be obtained:
$$S_c = \frac{1}{w_l h_l} \sum_p \sum_k \alpha^c_k f_k(p). \tag{17}$$
When the image is predicted to be class $c$, the Grad-CAM value at any coordinate position in the image can be calculated by the following formula:
$$P_c(p) = \sum_k \alpha^c_k f_k(p). \tag{18}$$
Combining (17) and (18), it can be seen that Grad-CAM calculates the value of $P_c$ at all pixel positions in the image, which is the basis on which ResNet determines the target category.
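The substitution step described above relies on the fact that pooling and the weighted sum over feature maps can be exchanged. The following sketch checks this numerically on toy data; the function names, the weights, and the feature map values are hypothetical and only illustrate the identity.

```python
def activation_map(feature_maps, weights):
    """Per-position map: weighted sum of the feature maps at each position."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps))
             for j in range(width)] for i in range(height)]


def class_score(feature_maps, weights):
    """Class score: weighted sum of the globally average-pooled feature maps."""
    pooled = [sum(v for row in fmap for v in row) / (len(fmap) * len(fmap[0]))
              for fmap in feature_maps]
    return sum(w * f for w, f in zip(weights, pooled))


feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 3.0], [1.0, 0.0]]]
weights = [0.5, -1.0]  # assumed per-map importance scores for one class

p_c = activation_map(feature_maps, weights)
s_c = class_score(feature_maps, weights)
mean_p_c = sum(v for row in p_c for v in row) / 4
```

The class score equals the spatial average of the per-position map, so positions with large map values are exactly those that push the class score up.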
An output value of the MPN is then calculated. Taking $S_c$ as the input of formula (21), the quality score $S_{mask}$ of the MPN is obtained. Finally, the original image, the target initial positioning mask, and the mask score are used to generate the target initial positioning map $I_{out}$.

B. Attention Region Deformation in Grad-CAM
To further fully and comprehensively learn the subtle features of the key regions, the deformation sampling method is introduced to generate extended data. Traditional deformation-based data enhancement methods usually distort images randomly [44]. But its effects are not guaranteed. The deformed image in this article can highlight the attention part and suppress the remaining part, so as to help the model continue to learn the differences of subtle features.
First, the deformed image $D$ is sampled from the input image $I$, and the two images have the same size. This can be formalized as $D(x, y) = I(f(x, y), g(x, y))$, where $x$ and $y$ represent the position coordinates in the deformed image, that is, the pixel value of the deformed image $D$ at $(x, y)$ is equal to the pixel value of the original image $I$ at a certain position. The horizontal and vertical coordinates of this position are determined by the mapping relations $f$ and $g$, respectively. The goal of $f$ and $g$ is to sample the original image adaptively according to the pixel values of the Grad-CAM image, that is, pixel positions in the attention area of the Grad-CAM image are oversampled, and pixel positions in other noncritical areas are sampled less or not at all. According to [14], $f$ and $g$ should satisfy
$$\int_0^{f(x, y)} \int_0^{g(x, y)} A(x', y')\, dy'\, dx' = xy \tag{23}$$
where $A$ denotes the Grad-CAM map, $x$ and $y$ represent the horizontal and vertical coordinates in the deformed image, and $f(x, y)$ and $g(x, y)$ represent the horizontal and vertical coordinates of the point of the original image to be sampled. $x$, $y$, $f(x, y)$, and $g(x, y)$ are normalized coordinate values.
Assume first that the Grad-CAM graph does not reflect any key attention area, that is, it follows a uniform distribution with every pixel value equal to 1. Then (23) is satisfied simply by setting $f(x, y) = x$ and $g(x, y) = y$, which reproduces the original image (no attention area needs to be deformed). If, however, the Grad-CAM graph reflects a key attention area, that is, its pixel values follow a nonuniform distribution, then finding $f$ and $g$ is equivalent to solving for a transformation that maps the Grad-CAM graph from a nonuniform distribution to a uniform one. In this case, the left side of (23) requires the integral of a discrete function, which cannot be solved directly for high-dimensional data [45], [46], [47]. Therefore, it is very difficult and costly to calculate these two mapping functions accurately, and an approximate solution is needed. When the Grad-CAM graph is not uniform, the goal of (23) can be understood intuitively: during sampling, each pixel of the original image pulls on the other pixels with a force given by its Grad-CAM value. Therefore, $f$ and $g$ can be approximated as
$$\begin{cases} f(x, y) = \dfrac{\sum_{x', y'} A(x', y')\, k((x, y), (x', y'))\, x'}{\sum_{x', y'} A(x', y')\, k((x, y), (x', y'))} \\[2ex] g(x, y) = \dfrac{\sum_{x', y'} A(x', y')\, k((x, y), (x', y'))\, y'}{\sum_{x', y'} A(x', y')\, k((x, y), (x', y'))} \end{cases} \tag{24}$$
where $k((x, y), (x', y'))$ represents the distance measurement between two points. The sampling results are thus related to two factors: 1) the pixel value of each point in the Grad-CAM graph; and 2) the distance between the point to be sampled and each point in the Grad-CAM graph. The larger the value of a pixel in the Grad-CAM graph and the closer this pixel is to the point to be sampled, the more likely it is that the corresponding position in the original image is selected for sampling.
Therefore, this method produces a deformation effect similar to an expansion of the attention region, and the distance measurement $k$ also prevents every sample from selecting the point of the original image that corresponds to the Grad-CAM maximum. Finally, both the numerator and the denominator in (24) can be realized by a convolution operation. In this case, $k$ corresponds to the kernel of one convolution operation (with 1 input channel and 1 output channel). If the input image is $I \in \mathbb{R}^{C \times H \times W}$, then the mapping functions $f$ and $g$ correspond to a flow field grid $G \in \mathbb{R}^{H \times W \times 2}$.
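The approximate mapping in (24) can be sketched directly with nested loops. The sketch below assumes a Gaussian kernel with bandwidth `sigma` for the distance measurement $k$, which the text does not specify; the function name `attention_sampling_grid` and the toy attention maps are hypothetical, and a practical version would use a convolution instead of explicit loops.

```python
import math


def attention_sampling_grid(attention, sigma=0.3):
    """Return grid[i][j] = (f, g): where output pixel (row i, col j) samples.

    attention is a 2-D map with positive entries (at least 2 x 2).
    Coordinates are normalized to [0, 1]; f is horizontal, g is vertical.
    """
    height, width = len(attention), len(attention[0])
    ys = [i / (height - 1) for i in range(height)]
    xs = [j / (width - 1) for j in range(width)]
    grid = []
    for y in ys:
        row = []
        for x in xs:
            num_x = num_y = den = 0.0
            for i2, y2 in enumerate(ys):
                for j2, x2 in enumerate(xs):
                    # Assumed Gaussian kernel: nearby points weigh more.
                    k = math.exp(-((x - x2) ** 2 + (y - y2) ** 2)
                                 / (2 * sigma ** 2))
                    weight = attention[i2][j2] * k
                    num_x += weight * x2
                    num_y += weight * y2
                    den += weight
            row.append((num_x / den, num_y / den))
        grid.append(row)
    return grid


uniform = [[1.0] * 5 for _ in range(5)]
peaked = [[1.0] * 5 for _ in range(5)]
peaked[2][2] = 20.0  # strong attention at the center

grid_uniform = attention_sampling_grid(uniform)
grid_peaked = attention_sampling_grid(peaked)
```

With the peaked map, the sampling position of the corner pixel moves toward the image center compared with the uniform map, so more output pixels draw from the attended region, which is the expansion effect described above.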

C. Reinforcement Learning Strategy
In the training stage, the traditional image description method adopts the backpropagation algorithm to maximize the probability of the next real pixel given the previous real pixel. In the test phase, the probability of the next pixel is predicted based on the pixels previously generated by the model. This causes a mismatch between the training phase and the test phase and leads to exposure bias: errors arise easily in the test phase, accumulate continuously, and reduce the quality of the generated description image. In addition, the model is optimized with the cross-entropy loss function in the training stage, whereas in the test phase the quality of the generated images is assessed with discrete and nondifferentiable indicators. This inconsistency in optimization direction means that the network cannot directly use BLEU and other evaluation indicators for optimization training, and a minimal cross-entropy loss does not necessarily produce the best evaluation result.
To eliminate the defects of exposure bias and inconsistent optimization direction, this method introduces a reinforcement learning strategy [48]. The policy gradient algorithm in the reinforcement learning strategy can train nondifferentiable discrete variables end-to-end and directly optimize the model according to BLEU and other indicators to improve the training effect. The reinforcement learning strategy treats ResNet as an agent that interacts with the image and the external environment and defines the learning strategy $p$ to guide the model in predicting the next pixel. After generating the image description, the reinforcement learning strategy uses BLEU and other indicators to measure the fit and similarity between the image description and manually annotated reference statements, assigns ResNet an expected reward, and takes minimizing the negative expected reward as the goal to optimize the model, which can be expressed as
$$L(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\bigl[r(w^s)\bigr] \tag{25}$$
where $\theta$ is the model parameter, $w^s$ is the sampled sequence of pixels, $r(\cdot)$ is the reward function, and $\mathbb{E}(\cdot)$ is the expectation function. In practical applications, $L(\theta)$ is generally obtained by single sampling with strategy $p_\theta$ and can be expressed as
$$L(\theta) \approx -r(w^s), \quad w^s \sim p_\theta. \tag{26}$$
Reinforcement learning adopts the policy gradient algorithm to calculate the gradient of $L(\theta)$, which can be expressed as
$$\nabla_\theta L(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\bigl[r(w^s)\, \nabla_\theta \log p_\theta(w^s)\bigr]. \tag{27}$$
In practice, to facilitate the solution, Monte Carlo single sampling is used for approximate estimation, which can be expressed as
$$\nabla_\theta L(\theta) \approx -r(w^s)\, \nabla_\theta \log p_\theta(w^s). \tag{28}$$
Because of the randomness of sampling and the lack of context normalization, the gradient calculated in this way has a large variance, which makes the training process unstable. To reduce the variance, a benchmark factor $b$ is introduced to constrain and correct the expected reward function, which can be expressed as
$$\nabla_\theta L(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\bigl[(r(w^s) - b)\, \nabla_\theta \log p_\theta(w^s)\bigr]. \tag{29}$$
To maintain an unbiased estimate of the gradient, the benchmark factor $b$ can be any function that does not depend on $w^s$.
When Monte Carlo single sampling is used for approximate estimation, the gradient $\nabla_\theta L(\theta)$ can be expressed as
$$\nabla_\theta L(\theta) \approx -\left(r(w^s) - b\right)\nabla_\theta \log p_\theta(w^s).$$
Using the chain rule, the final gradient expression is obtained as
$$\nabla_\theta L(\theta) = \sum_{t} \frac{\partial L(\theta)}{\partial s_t}\,\frac{\partial s_t}{\partial \theta} \qquad (31)$$
where $s_t$ is the input of the Softmax function. When Monte Carlo single sampling is used for approximate estimation, $\partial L(\theta)/\partial s_t$ in (31) can be expressed as
$$\frac{\partial L(\theta)}{\partial s_t} \approx \left(r(w^s) - b\right)\left(p_\theta(w_t \mid h_t) - l_{w_t^s}\right)$$
where $l_{w_t^s}$ is the one-hot vector representation of the sampled pixel, and $w_t$ and $h_t$ are the pixel and the internal vector representation at time t, respectively.
The reward function is improved according to the features of RSIs to obtain more accurate region proposals. Here, the agent is the reinforcement learning model designed in this article. At each time step, the agent of MPN of reinforcement learning decides whether to terminate the search according to its policy, which is determined by the probabilities of the fixate action and the done action. The fixate action screens the large number of regions of interest extracted from the features: if a region is selected for calculating the reward, the agent focuses on that region. As long as the search is not over, a fixate action is issued to visit a new location. Region of interest (ROI) observations are updated in the neighborhood centered on this new location, and all entries in this neighborhood are set to 1 to indicate that the region has been selected. All ROIs are sent to the pooling layer for class-specific bounding box offset prediction. Nonmaximum suppression [49] is applied to the classified ROIs to obtain the most significant information. Because the remaining ROIs carry the final bounding box predictions, they are mapped to spatial locations of the observation history for their particular class. A class-specific probability vector is inserted into the history quantity, which is merged with the base state quantity $S_t$. With the new state, the agent takes a new action at t + 1 and repeats the process until the done action is issued.
Then it collects all the selected predictions in the entire trajectory. The RPN pseudocode of reinforcement learning is shown in Algorithm 1.
The reinforcement learning agent must balance two ROI selection criteria: 1) the selected ROIs should have high overlap with object instances; and 2) the number of ROIs should be as small as possible to reduce the number of false positives and keep the processing time manageable. On this basis, two action rewards are set to evaluate the actions issued by the agent: the fixate action reward and the done action reward.
RSIs are characterized by large image sizes and small target instances, whereas the original reinforcement learning reward function is simple and handles little data, so it does not perform well on some datasets. Three datasets are explored in the MPN of reinforcement learning. The fixation and done rewards show that the fixation reward obtained by searching an image in the NWPU VHR-10 dataset is relatively dense, and the done reward is generally between −20 and −1. However, the fixation reward obtained by searching an image in the DOTA and VisDrone2018 datasets is very sparse, and the done reward ranges from −50 to −20. In the DOTA and VisDrone2018 datasets, the output detection boxes of the instances are few and the targets are small, so they are easily discarded during training. As a result, few fixation rewards and done rewards are obtained per image, and training is difficult to converge.
For each object instance, the fixation reward first gives a small negative reward for each fixate action, but the agent also gains a positive reward for increasing the IoU with any ground-truth instance of the current image. At each time step t, the difference between the IoU of the instance with the ground truth and the maximum IoU value ($\mathrm{IoU}^i_t$) of that instance over all time steps is computed. Trajectory data are collected for all the regions for which the IoU is calculated within this time step. In addition, when the IoU threshold is appropriately reduced, the positive reward of the fixate action increases, which encourages the agent to continue searching and to obtain prediction boxes that might otherwise be missed because the target instance is small. This yields the adjusted fixate reward at time t, in which i indexes the ith object instance. The done action reward is calculated from the IoU between each instance and the ground truth: the larger the covered area, the closer the reward is to zero; otherwise, it becomes increasingly negative. Upon termination, the agent receives a done action reward that reflects the quality of the search trajectory. The pseudocode of the reward function is shown in Algorithm 2.
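The behavior of the two rewards described above can be sketched as follows. This is only an illustrative assumption, not the reward of Algorithm 2: the functional form, the penalty `beta`, the threshold `tau`, and all argument names are hypothetical and chosen to match the qualitative description (a small negative cost per fixate action, a positive term for IoU gains, and a done reward that approaches zero as coverage improves).

```python
def fixate_reward(iou_t, iou_best, iou_max, beta=0.075, tau=0.5):
    """Hypothetical fixate reward for one time step.

    iou_t    : IoU with each ground-truth instance achieved at this step
    iou_best : best IoU with each instance over the earlier steps
    iou_max  : largest IoU any candidate region can reach for each instance
    beta     : small cost paid for every fixate action (assumed value)
    tau      : IoU threshold; lowering it admits gains on small objects
    """
    reward = -beta
    for cur, best, top in zip(iou_t, iou_best, iou_max):
        # Only an improvement that reaches the threshold earns a bonus.
        if cur > best and cur >= tau and top > 0:
            reward += (cur - best) / top
    return reward


def done_reward(iou_best, iou_max, tau=0.5):
    """Hypothetical done reward: zero at full coverage, negative otherwise."""
    reward = 0.0
    for best, top in zip(iou_best, iou_max):
        if top >= tau:
            reward += (best - top) / top
    return reward


# One instance whose IoU improves from 0.4 to 0.6 (attainable maximum 0.8).
strict = fixate_reward([0.6], [0.4], [0.8], tau=0.7)   # gain below threshold
relaxed = fixate_reward([0.6], [0.4], [0.8], tau=0.5)  # gain is rewarded
```

Under these assumptions, relaxing `tau` turns the same fixate action from a pure cost into a net positive reward, which is the effect the text attributes to reducing the IoU threshold for small targets.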

D. Loss Function
The weakly supervised network mainly uses weak semantic segmentation to generate the attention weights, with the weak semantic mask guiding their learning. The loss function of the weak semantic attention network is the cross-entropy loss computed over all points of the mask, where H and W represent the length and width of the weak semantic mask, and $u_{ij}$ and $u'_{ij}$ represent the weight value output by the attention network at point (i, j) and the pixel value at point (i, j) on the weak semantic mask, respectively.
The regression classification network contains two branches, so the loss of the classification network and the loss of the regression network must be calculated separately. Focal loss [50] is used as the classification loss, where N indicates the total number of prediction boxes, $p_n$ represents the probability distribution over the categories, and $t_n$ represents the category label of the target. In focal loss, α and γ are hyperparameters, which are set to 0.2 and 1, respectively. In addition to the classification loss, the smooth L1 loss is used as the loss function for the regression task in the classification regression network, as shown in (37), where N indicates the total number of prediction boxes, $t_n$ indicates confidence ($t_n = 1$ indicates the foreground, and $t_n = 0$ indicates the background), $v_{nj}$ represents the predicted coordinate vector, and $v'_{nj}$ represents the true label coordinate vector.
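For reference, the classification and regression losses named above can be sketched in NumPy as follows. This is a minimal sketch based on the standard definitions of focal loss and smooth L1 loss, not the authors' code: the normalization by N, the function names, and the array layouts are assumptions, and only α = 0.2 and γ = 1 are taken from the text.

```python
import numpy as np

def focal_loss(probs, labels, alpha=0.2, gamma=1.0):
    """Mean focal loss over N prediction boxes (standard form, assumed).

    probs  : (N, C) predicted class probabilities p_n
    labels : (N,) integer category labels t_n
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Probability assigned to the true class of each box.
    p_t = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))


def smooth_l1_loss(pred, target, foreground):
    """Smooth L1 regression loss, counted only for foreground boxes.

    pred, target : (N, J) predicted and true coordinate vectors
    foreground   : (N,) indicator, 1 for foreground and 0 for background
    """
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    # Quadratic near zero, linear for large errors.
    per_coord = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    per_box = per_coord.sum(axis=1) * np.asarray(foreground, dtype=float)
    return float(per_box.sum() / len(per_box))


# Hypothetical predictions for two boxes.
cls = focal_loss([[0.9, 0.1], [0.3, 0.7]], [0, 1])
reg = smooth_l1_loss([[0.5, 0, 0, 0], [3.0, 0, 0, 0]], np.zeros((2, 4)), [1, 1])
```

With γ = 0 and α = 1 the focal loss reduces to the ordinary cross-entropy, which gives a quick sanity check of the implementation.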
Therefore, the multitask loss used for model training in this article is given by the following equation:
$$L = \sigma_1 L_1 + \sigma_2 L_2 + \sigma_3 L_3$$
where $\sigma_1$, $\sigma_2$, and $\sigma_3$ are the balance parameters of the multitask loss, $L_1$ is the regression loss, $L_2$ is the attention loss, and $L_3$ is the classification loss.

E. Network Dissecting Analysis
Neural networks achieve superior performance at the cost of the low interpretability of their black-box representations. However, in fields related to human or social security, such as medical treatment, driving, and remote sensing, deep learning models not only need to perform well but also need to provide a basis for their decisions. In recent years, some interpretable visualization algorithms have visualized network feature maps or activation maps and then performed interpretability analysis on model decisions. These methods rely on subjective analysis by human vision and are prone to errors in judgment. In this article, the method of NDA [51] is improved. NDA requires a set of predefined human-interpretable semantics and a dataset annotated with these semantics. Its basic principle is to examine the distribution of activation values in the feature maps produced when an image is forward propagated through the convolutional network. The interpretable semantic information of each convolution kernel is then obtained by calculating the similarity between this distribution and the interpretable semantic annotations in the dataset.
First, the human-interpretable semantic concepts defined in the traditional method (scene, object, component, material, texture, and color) are divided in a way that conforms to human understanding: scene, object, and component are considered high-level semantic concepts, whereas material, texture, and color are considered low-level semantic concepts. Second, the score of each semantic concept is calculated by NDA, as shown in Fig. 5. Finally, the interpretability is quantified and used to encode the convolution kernel.
Taking the second-layer convolution C5_conv2 of the fifth stage in the backbone network ResNet as an example, assume that the input image is I(x). The feature map F(x) output by the 512 convolution kernels in Conv_2 (second-layer convolution) is saved after one forward propagation and used for the subsequent interpretability calculation. F(x) contains 512 feature maps, each of which corresponds to the semantic distribution of one convolution kernel. For $F_k(x)$ (k is the convolution kernel index), the segmentation threshold g is used to filter out the weak semantic information, and the strong semantic information is retained as the semantic feature of the convolution kernel. In this article, the method of calculating the threshold from a probability distribution in traditional network analysis is improved, as shown in the following equation:
$$g = \frac{1}{H \times W}\sum_{i=1}^{H \times W} p_i$$
where H and W represent the height and width of $F_k(x)$, and $p_i$ represents the value of the ith pixel. The average activation value is used as the threshold g because values higher than the average better represent the semantics of the convolution kernel. The strong semantic feature map after filtering is $T_k(x)$, and the filtering method is shown in the following equation:
$$T_k^i(x) = \begin{cases} p_i, & p_i > g \\ 0, & \text{otherwise} \end{cases}$$
where $T_k^i(x)$ represents the pixel value of the ith point on the feature map $T_k(x)$. $T_k(x)$ of each convolution kernel is then compared with the annotated semantic mask.
First, $T_k(x)$ is binarized: the retained feature activation values are distinguished from the filtered weak semantic information to obtain a binary semantic map $M_k(x)$. The map is then upsampled so that it can be compared with the semantic mask. Using the IoU calculation method in [52], the IoU between $M_k(x)$ and the mask $L_c(x)$ of each semantic concept c is taken as the interpretability score of convolution kernel k for semantic concept c. In this way, the interpretability scores of all convolution kernels in this layer for the different semantic concepts are obtained. In this article, the overall average level is used as the scoring threshold f, and the specific calculation method is as follows:
$$f = \frac{1}{K}\sum_{k=1}^{K} \mathrm{IoU}_{k,c}$$
where K represents the total number of convolution kernels in this layer and $\mathrm{IoU}_{k,c}$ is the interpretability score of kernel k for concept c. The threshold f is used to select, for each convolution kernel, the semantic concepts whose scores are greater than the threshold.
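The scoring procedure of this subsection (mean-value threshold, binarization, upsampling, IoU with a concept mask, and mean-score threshold) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: nearest-neighbor upsampling by an integer factor is used for simplicity, the threshold f is taken as the per-concept mean over the K kernels, and all function names and the toy arrays are hypothetical.

```python
import numpy as np

def binary_semantic_map(feature_map, scale):
    """Threshold one kernel's feature map at its mean and upsample it.

    feature_map : (H, W) activations of one convolution kernel
    scale       : integer upsampling factor to reach the mask resolution
    """
    g = feature_map.mean()               # mean activation used as threshold
    binary = feature_map > g             # keep only strong semantic responses
    # Nearest-neighbor upsampling (an assumption made for simplicity).
    return np.repeat(np.repeat(binary, scale, axis=0), scale, axis=1)


def iou_score(binary_map, concept_mask):
    """IoU between a binary semantic map and a binary concept mask."""
    inter = np.logical_and(binary_map, concept_mask).sum()
    union = np.logical_or(binary_map, concept_mask).sum()
    return float(inter / union) if union > 0 else 0.0


def concepts_above_threshold(scores):
    """Return the per-concept mean score f and a mask of scores above it.

    scores : (K, C) IoU scores of K kernels for C semantic concepts
    """
    f = scores.mean(axis=0)              # average over the K kernels
    return f, scores > f


# Toy example: a 2x2 feature map compared with a 4x4 concept mask.
fmap = np.array([[0.0, 1.0], [2.0, 5.0]])
mask = np.zeros((4, 4), dtype=bool)
mask[2:, 1:] = True
score = iou_score(binary_semantic_map(fmap, scale=2), mask)

toy_scores = np.array([[0.6, 0.1], [0.2, 0.1], [0.1, 0.4]])
f, selected = concepts_above_threshold(toy_scores)
```

In the toy score matrix, only the first kernel exceeds the mean score for the first concept and only the third kernel does so for the second concept, so each of those kernels is assigned a single semantic concept.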

IV. EXPERIMENTS AND ANALYSIS
To realize RSI object detection and interpretation, the public RSI description dataset DOTA [53] is used to train and evaluate the method. The method is also compared with other current methods that have good image description performance to verify its effectiveness. The experimental process is shown in Fig. 6.

A. Datasets
To verify the effectiveness of the proposed method, a comparative experiment is conducted on the DOTA V1.0 dataset. DOTA is a large public dataset annotated with rotated bounding boxes and is mainly used for RSI object detection tasks. The dataset consists of 2806 RSIs from different sensors and platforms, ranging in size from 800 × 800 to 4000 × 4000 pixels, and contains 188,282 target instances of different scales, orientations, and shapes. It mainly includes 15 common categories: Plane

B. Evaluative Criteria
Average precision (AP) and mean average precision (mAP) are used to evaluate the detection accuracy of the model. Frames per second (FPS) is used to evaluate the detection speed of the model.
The AP can be calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_0^1 P(R)\,dR$$
where P and R are the precision and recall, TP represents the number of true positive samples, FP represents the number of false positive samples, and FN represents the number of false negative samples.
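The mAP can be calculated as follows:
$$mAP = \frac{1}{C}\sum_{c=1}^{C} AP_c$$
where C is the number of categories and $AP_c$ is the AP of category c. The FPS can be calculated as
$$FPS = \frac{S_{test}}{T}$$
where $S_{test}$ is the number of samples in the test set, and T is the time consumed for the test set.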
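The evaluation criteria above can be computed as in the following sketch. It is an illustration rather than the evaluation code used in the experiments: the AP is computed as the area under the precision-recall curve with all-point interpolation, which is one common protocol and an assumption here, and the function names and toy inputs are hypothetical.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) from detection counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


def average_precision(scores, is_tp, num_gt):
    """Area under the PR curve for one class (all-point interpolation).

    scores : confidence of each detection
    is_tp  : 1 if the detection matches a ground-truth box, else 0
    num_gt : number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = np.concatenate(([0.0], tp / num_gt, [1.0]))
    precision = np.concatenate(([0.0], tp / (tp + fp), [0.0]))
    # Make precision monotonically non-increasing in recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    steps = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[steps + 1] - recall[steps]) * precision[steps + 1]))


def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))


def frames_per_second(num_samples, seconds):
    """FPS as the number of test samples divided by the total test time."""
    return num_samples / seconds


# Toy example: three detections, two ground-truth boxes.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

Other AP protocols (for example, 11-point interpolation) give slightly different values, so the protocol should match the one used for the compared methods.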

C. Experiment Process
The experiment is carried out on the PyTorch framework, and the specific parameters are shown in Table I.

D. Ablation Experiments
To verify the effectiveness of the G2Grad-CAMRL algorithm, three ablation experiments are designed. The ResNet network is used as the benchmark method, and the three modules proposed in this article are added for the comparison experiment. The ablation experiment is performed on the DOTA dataset, and the experimental results are shown in Table II, where ✓ indicates that the model contains the module and RLS denotes the reinforcement learning strategy. Table II shows the effectiveness of each proposed module on the object detection task in the RSI dataset. Because the background occupies a large part of RSIs, Grad-CAM addresses this problem and has a notable effect on accuracy improvement. Table II also shows that G2Grad-CAMRL improves the IoU threshold owing to the introduction of reinforcement learning, which makes the object detection effect better.
To verify the universality of the proposed algorithm on different backbone networks, a set of comparison experiments is designed. G2Grad-CAMRL is evaluated with different backbone networks, and the experimental results are shown in Table III. Fig. 7 presents the results of Table III as a bar chart to give the reader a clearer understanding. In the comparison experiment, ResNet50, ResNet101, and ResNet152 are used as the backbone network. Table III shows that simply increasing the network depth has a limited effect on improving RSI object detection. Small objects account for a large proportion of the objects in RSIs, and the features of small targets mostly exist in shallow semantic information, so blindly deepening the network is of limited help for RSI object detection tasks.
The results for the loss functions are shown in Table IV, and the corresponding data graph is shown in Fig. 8. The combined loss function achieves better results, obtaining 93.58% mAP. Even though the results of these combinations do not differ greatly, the proposed method still has the least FPS. Therefore, the loss function in this article is suitable for the object detection task.

E. Comparison Experiments With Other Methods
The G2Grad-CAMRL in this article is compared with four classical object detection algorithms: Faster R-CNN, R-FCN, YOLOv2, and SSD. Faster R-CNN is the benchmark model of the original DOTA dataset. The backbone network used in YOLOv2 is DarkNet19, and the backbone network of the other comparison algorithms is ResNet50, as in this article. We then select three other state-of-the-art algorithms for comparison: CWDL [54], GAN [55], and SDGH-Net [56]. The experimental results are shown in Tables V and VI, respectively.
Table V shows that the two-stage algorithm Faster R-CNN has the worst effect. For the objects selected in this article, the highest recognition rate with Faster R-CNN is only 82.05% (PL), and the recognition rate of SV is 52.96%. The recognition rate of PL with R-FCN is 89.63%, which is 7.58 percentage points higher than that with Faster R-CNN, and the recognition rate of BR is 53.37%. The highest recognition rates of the one-stage algorithms YOLOv2 and SSD do not exceed 90% because the interference of background features is ignored. With G2Grad-CAMRL, the three objects with the highest recognition rates, SH, TC, and PL, achieve 95.67%, 95.78%, and 94.58%, respectively, which are 6.5%, 3.33%, and 5.67% higher than those of the SSD method.
Table VI shows that all methods have good identification results. For example, the recognition rate of SBF is 70.28% with CWDL, the recognition rate of SP is 83.01% with GAN, and the recognition rate of SBF is 74.15% with SDGH-Net. With G2Grad-CAMRL, the recognition rates of SBF and SP are 77.39% and 85.88%, respectively, which is an improvement over the other three methods.
As can be seen from the comparison in the above tables, the G2Grad-CAMRL remote sensing object detection method is superior to the other methods. Good detection results are achieved on aircraft, small vehicles, large vehicles, ships, etc., which indicates that the proposed method has advantages for the detection of such scenes. Figs. 9 and 10 show the PR and ROC curves, respectively, for the three most effective methods: GAN, SDGH-Net, and G2Grad-CAMRL. The area under the curve of G2Grad-CAMRL is 88.64%, which is 1.47 and 0.49 percentage points higher than those of SDGH-Net (87.17%) and GAN (88.15%), respectively. In terms of the ROC curve, G2Grad-CAMRL also shows a certain improvement over the other two methods. Figs. 11 and 12 are partial enlargements of Figs. 9 and 10, respectively, so that the reader can see the curve trends more clearly. Fig. 13 shows some detection results.

F. Visual Interpretation Effect
To verify the effectiveness of the attention mechanism, Fig. 14 shows the visual interpretation effect of Grad-CAM in the process of generating RSI description text. By screening image features, Grad-CAM focuses on the highly salient features of the target region and rejects other redundant features and noise information. This enhances the model's perception and understanding of RSI content and improves the accuracy of the description results.
We perform histogram processing on the first and third rows of Fig. 14, and the results are shown in Figs. 15-18. The histograms show that the pixel distribution after processing by the proposed method is denser than that of the original image, which indicates that the method focuses on the sensitive areas of the image.

V. CONCLUSION
To realize the description of RSIs, an RSI description method is proposed that uses ResNet to construct the basic network architecture, introduces Grad-CAM, and adopts a reinforcement learning strategy. To verify the effectiveness of the proposed method, a publicly available RSI description dataset is used for training and verification. The experimental results show that the proposed method achieves high accuracy, has good image description performance for RSIs with complex environmental backgrounds, and can realize the interpretation and description of RSIs. In future work, the model will be improved and optimized to further raise the description performance for RSIs, and it will be applied to aerospace engineering practice.
Conflicts of interest: The authors declare that they have no conflict of interest with respect to the research, authorship and/or publication of this article.
Data availability: The data used to support the findings of this study are available from the corresponding author upon request.
Author contribution: All the authors made contributions to the article in different areas. Shoulin Yin, Liguo Wang, and Muhammad Shafiq conceptualized the study; Liguo Wang and Lin Teng were responsible for investigation; Asif Ali and Lin Teng were responsible for simulation; Shoulin Yin and Muhammad Shafiq wrote the original draft; Liguo Wang and Muhammad Shafiq reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.