Collaborative Learning-Based Network for Weakly Supervised Remote Sensing Object Detection

Existing object detection algorithms rely heavily on instance-level labels, which are time-consuming and expensive to obtain. In particular, for remote sensing images (RSIs) with small and dense objects, the labeling cost is much higher than for general images. Moreover, propagating image-level labels to instance-level training data introduces ambiguous and noisy information. To avoid the need for RSI instance-level labels, we propose a collaborative learning-based network for weakly supervised remote sensing object detection (CLN-RSOD). Compared with the state-of-the-art, the proposed model combines the advantages of two object detection subnetworks through joint training, which improves the model's capability and thereby enhances multiobject detection in RSI. Moreover, we employ a mask-based proposal refinement algorithm for remote sensing images (MPR-RS) to optimize the candidate boxes. In addition, according to the data distribution characteristics of RSI, we introduce a new joint pooling module in CLN-RSOD to enhance the backbone network's characterization of RSI. Finally, experimental results on two public remote sensing datasets show that the proposed weakly supervised learning method outperforms other weakly supervised methods and demonstrate the effectiveness of the proposed CLN-RSOD.


I. INTRODUCTION
With the improvement of computing power [1] and the development of deep neural network [2], [3] theory, deep learning [4] based methods have made great progress in object detection tasks, and their powerful image feature extraction capability has greatly improved the accuracy of object detection algorithms. Most general object detection algorithms rely on instance-level annotations of images to train object detectors. However, instance-level labels are costly to obtain and even more difficult to label manually in some fields. In particular, remote sensing images (RSIs), with their large resolution, small and dense objects, and high complexity, are much more difficult to label than natural images and are even more likely to introduce noise. Fig. 1(a) shows an example from the remote sensing dataset DIOR [5]; the objects in the images are small and dense, making scalable instance-level labeling by hand difficult.
To avoid this dependence on instance-level labels, more and more researchers try to detect objects in images under weakly supervised [6] learning scenarios [7], [8]; that is, only image-level labels are used to train weakly supervised object detection (WSOD) models, a setting known as weakly supervised learning (WSL). RSI-based WSOD has become an important task within WSL. Fig. 1(b) and (c) shows the difference between the two annotation methods on the DIOR dataset.
In recent years, many WSOD works have started from class activation mapping (CAM) [9] based methods. A CAM-based method first performs feature extraction through a backbone network [10], [11], [12], [13], [14] to generate feature maps, whose weights are inferred from the correspondence between the feature maps and a certain object class. The weights are then combined with the feature maps to form a class activation map that highlights the object's location, and the class activation map is processed with a threshold segmentation technique to obtain the prediction box of the object. To avoid CAM's modification of the network structure through the global average pooling [15] layer, researchers have also proposed Grad-CAM [16] and Grad-CAM++ [17], which use gradients to calculate the weight of each feature map.
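The CAM-style localization step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `class_activation_map` and `threshold_box` are hypothetical names, and real pipelines upsample the map to image resolution before thresholding.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the last conv feature maps, as in CAM.

    feature_maps: (K, H, W) array from the last conv layer.
    class_weights: (K,) weights connecting GAP features to one class logit.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)           # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()            # normalize to [0, 1]
    return cam

def threshold_box(cam, thresh=0.5):
    """Tightest box around responses above `thresh`: (r0, c0, r1, c1)."""
    ys, xs = np.where(cam >= thresh)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())
```

Because the box is grown only around the highest responses, it tends to cover just the most discriminative region, which is exactly the limitation for small, dense RSI objects discussed next.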
However, since only image-level labels are available, bounding box regression cannot be performed on the initially generated candidate boxes against the real labels, so CAM-based methods focus only on the most discriminative region of the object. Objects in RSI are small and dense, and a single image often contains multiple objects; CAM-based methods generate large prediction boxes that encompass a large amount of background, making multiobject detection in RSI difficult. In addition, RSI differs significantly from natural images: the aerial viewing angle of RSI means that it contains rich roof information but sparse profile information, which limits the characterization ability of general backbone networks on RSI. At the same time, in a WSL scenario, how to propagate image-level labels to instance-level training data during learning is a critical issue. Since each training image can produce many candidate boxes of varying accuracy, this label propagation process introduces much ambiguous and noisy information.
To solve the above problems, we propose a collaborative learning-based network for weakly supervised remote sensing object detection (CLN-RSOD), whose starting point is improving the quality of candidate boxes to achieve accurate object detection in RSI under weak supervision. The model is trained with a weakly supervised subnetwork (WS-SubNet) and a strongly supervised subnetwork (SS-SubNet) working collaboratively. First, we propose a new joint pooling module (JPM) to enhance the backbone network's characterization of RSI in the feature extraction phase. Furthermore, considering that the candidate boxes generated by CAM-based methods contain a large amount of background information, we design a two-stage candidate box optimization algorithm in WS-SubNet. In the first stage, we incorporate the class activation map into the initial candidate box generation process to avoid producing invalid candidate boxes. In the second stage, we design a mask-based proposal refinement algorithm for remote sensing images (MPR-RS), which quantifies the similarity between each candidate box pair by introducing masking information into the model and then uses hierarchical clustering to merge highly similar candidate box pairs in the candidate box pool into the final pseudo-labels. Next, SS-SubNet receives these pseudo-labels and uses them for training. In addition, we design a multitask joint loss function to train the two subnetworks together, completing the parallel optimization of WS-SubNet and SS-SubNet. Overall, through the collaborative learning of the two subnetworks, the whole network completes object detection in RSI using only image-level labels.
In summary, our main contributions are as follows.
1) We propose CLN-RSOD, which consists of two subnetworks working collaboratively to effectively accomplish multiobject detection in RSI with only image-level labels.
2) We propose the new JPM, which enhances the characterization of RSI through six different pooling units.
3) We design the MPR-RS to further improve the quality of candidate boxes. The algorithm improves the performance of WS-SubNet through the designed similarity indexes and candidate box merging strategy, which enables SS-SubNet to obtain higher quality pseudo-labels and thus promotes the performance of the overall model.
4) The effectiveness of the proposed CLN-RSOD is fully validated by experimental evaluation on two RSI datasets, TGRS-HRRSD [57] and DIOR [5].
The rest of this article is organized as follows. Section II discusses the related work, and Section III presents the methodology. Section IV describes the remote sensing datasets used for the experiments, the detailed experimental settings, and the relevant experimental results. Finally, we present conclusions, future work, and discussions in Section V.

II. RELATED WORK

A. Strongly Supervised Object Detection
Over the years, strongly supervised object detection algorithms have been widely studied, and several classical algorithms have emerged. Among them, the R-CNN series [18], [19], [20] are two-stage object detection algorithms based on candidate boxes; YOLOv1-v4 [21], [22], [23], [24] are single-stage object detection algorithms based on regression; and Swin Transformer [25], [26] proposes a novel backbone network for feature extraction in place of the convolutional neural networks (CNNs) used by general object detection algorithms.
After AlexNet [10] won the ImageNet [27] classification challenge, Girshick et al. applied CNNs to the object detection task and proposed R-CNN. Owing to the powerful characterization ability of CNNs, the detection performance of R-CNN on the PascalVOC [28] dataset improved by more than 30% over traditional object detection algorithms. Building on R-CNN, Girshick et al. further proposed Fast R-CNN and Faster R-CNN. Fast R-CNN introduces the region-of-interest pooling layer to reduce repeated computation for each candidate region and achieve end-to-end training of the network, and replaces R-CNN's SVM classifier with a Softmax layer, improving detection accuracy while reducing detection time to one-twentieth of the original. Faster R-CNN proposes the region proposal network (RPN), which obtains higher quality candidate boxes at a faster rate.
In response to the slow detection speed of two-stage object detection algorithms, the YOLO series proposed by Redmon et al. requires only one forward pass to detect objects, reaching detection speeds of hundreds of frames per second. Specifically, the input image is divided into several grids, and the objects in each grid are detected directly, abandoning the previous coarse-then-fine detection pipeline. Although the detection speed is greatly improved, detection accuracy is reduced. With the great success of Transformer [29], [30], [31] in natural language processing, Liu et al. successfully applied it to computer vision, where its powerful characterization capability has yielded a considerable improvement in detection accuracy.

B. Weakly Supervised Object Detection
WSOD algorithms need to address not only the challenges caused by the varying scales of objects in images [32], [33], mutual occlusions [34], [35], and inconsistencies in artificial labeling with real labels [36] encountered in strongly supervised learning scenarios, but more importantly, to reduce the dependence of deep learning-based object detection training on high-quality annotated data.
At present, in addition to CAM-based methods, multiple instance learning (MIL) methods represented by WSDDN [37] are another mainstream class of methods. MIL-based methods usually consist of three parts: a candidate box generator, a backbone network, and a detector. Many candidate boxes are generated by algorithms such as selective search (SS) [38] or edge boxes [39]; the backbone network then extracts features, which are fed into the detector to find the candidate boxes that contribute most to the classification.
MIL-based methods suffer from a nonconvex optimization function and thus tend to converge to local minima. In response, ICMWSD [40] introduces instance-associated spatial diversity constraints to alleviate the nonconvexity problem; OICR [41] fine-tunes the initial predictions; and SDCN [42] performs alternating optimization by introducing a segmentation-detection synergy mechanism. In addition, many methods combine WSOD with strongly supervised object detection, using pseudo-labels output by weakly supervised networks to train strongly supervised networks [43], [44], [45], [46], [47].
Since the annotation cost of RSI is much higher than that of natural images, how to train RSI object detectors with only image-level labels has been a hot research topic within WSOD [48], [49], [50], [51], [52]. Han et al. [53] introduced Bayesian inference into the model and combined it with high-level features to perform object detection on RSI under the WSL scenario. Zhou et al. [54] pretrained the model on a large-scale labeled dataset and then fine-tuned it on a remote sensing dataset, while introducing a negative bootstrapping algorithm into the detector's training process to find the most easily identifiable samples, making the detector converge faster and more stably.
In summary, how to avoid reliance on instance-level annotation in RSI and how to effectively improve multiobject detection in RSI under the WSL scenario remain to be investigated. We discuss these issues further below.

III. METHODOLOGY

A. Network Overview
The overall architecture of the proposed CLN-RSOD is shown in Fig. 2. It consists of three main parts: a feature extractor, WS-SubNet, and SS-SubNet. The two subnetworks share the feature extractor, and we introduce a new JPM at the back end of the extractor; this module improves the characterization of RSI by joint pooling with different pooling units. For an input image I, the feature extractor extracts its features, which are pooled by the JPM and then fed to WS-SubNet and SS-SubNet, respectively. In WS-SubNet, we design a two-stage candidate box optimization algorithm: the first stage combines the class activation map to generate the initial candidate boxes, and the second stage relies on MPR-RS to merge highly similar initial candidate boxes into high-quality pseudo-labels. SS-SubNet receives the pseudo-labels from WS-SubNet and trains on them; its candidate box extraction is based on the RPN in Faster R-CNN [19]. The two subnetworks share the weights of the same backbone network and part of the fully connected layers to speed up convergence. The designed multitask joint loss function trains the two subnetworks jointly, completing the parallel optimization of WS-SubNet and SS-SubNet.
We will detail our JPM, WS-SubNet, and SS-SubNet in order in Sections III-B, III-C, and III-D.

B. Joint Pooling Module
RSIs are usually collected by sensors on spacecraft and satellites. This special shooting angle means that RSI contains rich roof information but scarce profile information, which makes RSIs difficult to characterize effectively and thus yields poor extracted features. Therefore, we introduce a new JPM at the back end of the feature extractor, which adaptively fuses multiscale feature maps and reduces the model detector's reliance on object profile features. The JPM structure is shown in Fig. 3.
In the JPM, joint pooling reduces the dimensionality of the feature map F_stage5 ∈ R^(H1×W1×C1) output by the feature extractor through six pooling units of different sizes, where H, W, and C denote the height, width, and number of channels of a feature map, respectively. Specifically, we first obtain contextual information using six different pooling units. Since objects account for only a small part of an RSI and the image contains much useless background, only P_1 uses average pooling, to retain the global semantic information of the image; the other five use max pooling, to retain the profile information of the objects. The special units P_5 and P_6 enhance the characterization of roof information in the vertical and horizontal directions, respectively. The channel dimension of each P_i is then compressed to 1/8 by 1 × 1 convolution, which facilitates the aggregation in the next stage, yielding the transition features C_i, i ∈ {1, 2, ..., 6}. The new aggregated features are obtained by 3 × 3 up-sampling of each C_i and stitching the results one by one in the channel dimension. Finally, the stage-2 feature map F_stage2 ∈ R^(H2×W2×C2) output by the feature extractor is downsampled to match the size of F_7 from the previous stage, and the final output of the backbone network, F_out ∈ R^(H3×W3×C3), is obtained after convolutions with kernel sizes 3 × 3, 3 × 3, and 1 × 1, respectively. The strategy of each phase of the JPM can be expressed as

P_1 = p_avg(F_stage5), P_i = p_max(F_stage5), i ∈ {2, ..., 6}
C_i = f_conv(P_i)
F_7 = f_dconv(C_1) ⊕ f_dconv(C_2) ⊕ ... ⊕ f_dconv(C_6)
F_out = f_conv(F_7 ⊕ f_conv(F_stage2))

where p_avg(*) and p_max(*) denote average pooling and max pooling, respectively; f_conv(*) and f_dconv(*) denote convolution and deconvolution operations, respectively; and ⊕ denotes stitching of feature maps in the channel dimension.
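The six pooling units can be sketched with NumPy. This is an illustrative sketch only: the bin sizes used here for P2-P4 (2 × 2, 3 × 3, 6 × 6) are assumptions, since the exact unit sizes are fixed in Fig. 3, while the strip-shaped P5 and P6 follow the vertical/horizontal description above; `adaptive_pool` and `joint_pooling_units` are hypothetical names.

```python
import numpy as np

def adaptive_pool(x, out_h, out_w, mode="max"):
    """Pool a (H, W) map down to (out_h, out_w) over rectangular bins."""
    H, W = x.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            hs, he = (i * H) // out_h, ((i + 1) * H) // out_h
            ws, we = (j * W) // out_w, ((j + 1) * W) // out_w
            patch = x[hs:he, ws:we]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

def joint_pooling_units(fmap):
    """Six pooling units applied to one channel of F_stage5."""
    H, W = fmap.shape
    return [
        adaptive_pool(fmap, 1, 1, "avg"),   # P1: global average (semantics)
        adaptive_pool(fmap, 2, 2, "max"),   # P2: coarse max pooling
        adaptive_pool(fmap, 3, 3, "max"),   # P3
        adaptive_pool(fmap, 6, 6, "max"),   # P4
        adaptive_pool(fmap, H, 1, "max"),   # P5: vertical strip pooling
        adaptive_pool(fmap, 1, W, "max"),   # P6: horizontal strip pooling
    ]
```

In the full module, each unit's output would be channel-compressed by 1 × 1 convolution, upsampled, and concatenated, as in the equations above.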
The subsequent experimental results show that by capturing the correlation between features, JPM enables the feature extractor to obtain rich fusion features based on one-sided roof information, which enhances the ability of the model to characterize RSI.

C. Weakly Supervised Subnetwork
In this subnetwork, the candidate box optimization algorithm consists of two stages. In the first stage, the trained classification network is visualized and analyzed with Grad-CAM++ [17] to generate the class activation map of a specific class in the image. On this basis, the SS [38] algorithm generates several candidate boxes for each class activation map, centered on the high-response region of the image, avoiding candidate boxes that contain only background regions. The candidate boxes generated in this way preliminarily establish the correspondence between image-level annotations and instance-level candidate boxes. Observation of the generated candidate boxes shows that many of the boxes produced by the SS algorithm do not tightly enclose the object in the image, and some contain only the key areas of the object together with a large amount of background. To produce more accurate and efficient candidate boxes, in the second stage we propose the MPR-RS for screening, refining, and merging the initially generated candidate boxes. The core idea of MPR-RS is to mask different regions of an image using the retained candidate boxes: when the masked region belongs to the object to be located, the classification network's ability to recognize that image drops significantly. The candidate boxes are refined and merged according to the relative decrease in classification score, the similarity of features and size, and the spatial proximity of the masked candidate boxes. For the calculation of the classification score, an SPP [58] layer replaces the pooling layer after the last convolutional layer, which removes the fully connected layer's limitation on the input image size and also improves the efficiency of the subsequent computation of the classification score decrease for masked images.
In the first stage, regarding the Grad-CAM++ algorithm, Grad-CAM++ adds ReLU [55] and the weight gradient α^kc_ij to the mapping weight of the specific class, which improves localization accuracy for that class. The salient map for a specific class c is calculated as

L^c_ij = ReLU( Σ_k w^c_k A^k_ij )

where w^c_k is the weight of the kth feature map for class c, and A^k_ij is the kth channel of the output feature map, with i and j denoting the spatial location coordinates in the feature map. To prevent smaller feature responses from disappearing in the final salient map, Grad-CAM++ adds a ReLU activation function to the computation of w^c_k and a pixel gradient weighting factor α^kc_ij for A^k, which assigns weight to every object that appears in the feature map. α^kc_ij and w^c_k are calculated as

α^kc_ij = (∂²Y^c / (∂A^k_ij)²) / ( 2 ∂²Y^c / (∂A^k_ij)² + Σ_ab A^k_ab ∂³Y^c / (∂A^k_ij)³ )

w^c_k = Σ_ij α^kc_ij · ReLU( ∂Y^c / ∂A^k_ij )

where a and b, like i and j, are spatial position coordinates on the same feature map A^k, written differently to prevent confusion; Y^c is the score of class c, which is transformed by an exponential function before backpropagation so that it is guaranteed to be infinitely differentiable.
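The Grad-CAM++ weight computation can be sketched numerically once the gradients of Y^c with respect to the feature map are available (in practice they come from the autodiff framework; here they are passed in as arrays, and `gradcam_pp_weight` is a hypothetical name):

```python
import numpy as np

def gradcam_pp_weight(A_k, g1, g2, g3, eps=1e-8):
    """Grad-CAM++ weight w_k^c for one feature map.

    A_k: (H, W) feature map; g1, g2, g3: first-, second-, and third-order
    gradients of the class score Y^c w.r.t. A_k (precomputed elsewhere).
    """
    # Per-pixel denominator: 2 * g2_ij + (sum_ab A_ab) * g3_ij
    denom = 2.0 * g2 + A_k.sum() * g3
    alpha = np.where(np.abs(denom) > eps, g2 / (denom + eps), 0.0)
    # w_k^c = sum_ij alpha_ij * ReLU(dY^c / dA_ij)
    return float((alpha * np.maximum(g1, 0.0)).sum())
```

With g3 = 0 this reduces to averaging the positive first-order gradients with weight 1/2, matching the Grad-CAM++ limiting behavior.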
In the second stage, regarding the MPR-RS algorithm, we define the change in the classification score of the input image I after being masked by a candidate box with the mask vector as σ_drop, calculated as

σ_drop(I, b) = f(I) − f(h_g(I, b))

where f(·) denotes the classification score of the network and h_g(I, b) denotes the input image I after the rectangular region b is masked with the mask vector g. This function directly measures the effect that masking a partial region of the input image has on the classification score. If the classification score for a certain class of objects decreases significantly after the original input image is masked, the masked region contributes strongly to the model's recognition of that class, so we can approximate that the rectangular region b contains an object of that class. For a specific class, the class score decrease function is

d_c(I, b) = I_c^T σ_drop(I, b)

where I_c ∈ N^C is a one-hot vector whose only nonzero position corresponds to the class whose score decreases. We use this function to evaluate the candidate boxes retained in the first stage, where each candidate box corresponds to a rectangular region b. We use three indexes to define the similarity between each pair of candidate boxes: the classification influence similarity s_drop(I, b_i, b_j), the color similarity s_color(I, b_i, b_j), and the internal overlap similarity s_inter(I, b_i, b_j). The values of the three indexes are normalized to the interval [0, 1]. In their definitions, b_i,j denotes the minimum bounding box containing b_i and b_j; ζ(·) denotes the RGB histogram of the feature map of the region within a candidate box; and size(·) denotes the area of a region.
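The masking-based score drop can be sketched as follows. The `classify` callable is a stand-in for the SPP-based classification network, and normalizing the drop by the original score is an assumption of this sketch; `mask_box` and `score_drop` are hypothetical names.

```python
import numpy as np

def mask_box(image, box, fill=0.0):
    """Return a copy of `image` with the rectangle `box` masked out.

    box: (r0, c0, r1, c1), inclusive pixel coordinates.
    """
    masked = image.copy()
    r0, c0, r1, c1 = box
    masked[r0:r1 + 1, c0:c1 + 1] = fill
    return masked

def score_drop(classify, image, box, cls):
    """Relative decrease of class `cls` score after masking `box`."""
    before = classify(image)[cls]
    after = classify(mask_box(image, box))[cls]
    return (before - after) / max(before, 1e-8)
```

A large drop indicates that the masked region carried evidence for the class, so the box likely covers (part of) an object of that class.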
s drop (I, b i , b j ) is used to measure the influence of two candidate boxes on the classification score, and its similarity is 1 when the regions in the two candidate boxes have the same influence on the classification score; s color (I, b i , b j ) is used to count the similarity after normalizing the histograms of the corresponding feature maps; s inter (I, b i , b j ) is used to measure the overlap of the two candidate boxes.
Combining the above three indexes yields the similarity of the two candidate boxes. The corresponding candidate box merging strategy is as follows: among all candidate boxes, merge the candidate box pairs with similarity greater than the similarity threshold τ_1, discard the original candidate boxes, and add the newly generated candidate boxes for recalculation. The algorithm iterates until no candidate box pair in the pool has similarity greater than τ_1, as shown in Algorithm 1.
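The merging loop of Algorithm 1 can be sketched as a greedy pairwise procedure. This is a simplified sketch: `merge_boxes` is a hypothetical name, `similarity` stands in for the combined three-index score, and merging into the minimum bounding box b_i,j follows the definition above.

```python
def merge_boxes(boxes, similarity, tau1):
    """Greedy pairwise merging of candidate boxes.

    boxes: list of (r0, c0, r1, c1); similarity(bi, bj) -> float.
    Repeatedly merges the most similar pair above tau1 into their
    minimum bounding box until no pair exceeds the threshold.
    """
    boxes = list(boxes)
    while len(boxes) > 1:
        best, pair = tau1, None
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                s = similarity(boxes[i], boxes[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:           # no pair above the threshold: done
            break
        i, j = pair
        bi, bj = boxes[i], boxes[j]
        merged = (min(bi[0], bj[0]), min(bi[1], bj[1]),   # min bounding box
                  max(bi[2], bj[2]), max(bi[3], bj[3]))
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)]
        boxes.append(merged)       # re-enter the pool for recalculation
    return boxes
```

Because the merged box re-enters the pool, chains of overlapping fragments collapse into a single proposal over several iterations.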
Different priorities are then assigned according to influence on the classification score and area, and a smaller threshold, the area share threshold τ_2, is set so that only candidate boxes with smaller areas are retained. The MPR-RS algorithm does not encourage merging all candidate boxes; it merges only highly similar ones under a stringent similarity evaluation index, a choice determined by the data distribution characteristics of RSI.

D. Strongly Supervised Subnetwork
WS-SubNet is trained first. When its loss falls below a threshold, the output pseudo-labels are considered reliable enough, and these pseudo-labels are fed into SS-SubNet to train that network. To reduce the number of parameters and speed up convergence, SS-SubNet and WS-SubNet share the same backbone network and some of the fully connected layer weights.
Faster R-CNN [20] includes two parts, the RPN and Fast R-CNN, of which the RPN is the core. As in WS-SubNet, SS-SubNet uses an SPP layer to replace the pooling layer after the last convolutional layer in the RPN, which removes the fully connected layer's limitation on the input image size.
However, directly using the RPN for RSI object detection has the following drawback: the RPN suffers from slow convergence and inaccurate regression during bounding box regression, which becomes especially prominent for the multiscale objects in RSI. We introduce DIoU [56] into the loss function to alleviate this problem and design a distance regression-based multitask joint loss function to replace the original RPN loss:

L = (1 / N_cls) Σ_i L_cls(p_ic, p*_ic) + λ (1 / N_reg) Σ_i p*_ic L_DIoU(t_ic, t*_ic)

where p_ic denotes the probability that the ith anchor box is predicted to be class c; p*_ic denotes the real label of the ith anchor box, which is 1 when the candidate box is a positive sample and 0 when it is a negative sample; t_ic and t*_ic denote the predicted and true bounding box regression parameters of the ith anchor box, respectively; and λ is the balance parameter between the classification loss and the bounding box regression loss. The classification loss L_cls is the cross-entropy loss, and the bounding box regression loss L_DIoU is calculated as

L_DIoU = 1 − IoU + d_c² / d_ρ²

where d_ρ denotes the diagonal length of the smallest rectangular box covering the predicted anchor box and the real bounding box, and d_c denotes the distance between their center points. The bounding boxes retained after nonmaximum suppression are then used for the training of SS-SubNet.
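The DIoU term penalizes center distance even when the boxes do not overlap, which speeds up regression. A self-contained sketch for corner-format boxes (`diou_loss` is a hypothetical name):

```python
def diou_loss(box_p, box_g):
    """DIoU loss = 1 - IoU + d_c^2 / d_rho^2 for (x0, y0, x1, y1) boxes.

    d_c: distance between box centers; d_rho: diagonal of the smallest
    box enclosing both.
    """
    # Intersection and IoU
    ix0, iy0 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix1, iy1 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + 1e-8)
    # Squared center distance d_c^2
    cx_p, cy_p = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cx_g, cy_g = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    d_c2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    # Squared diagonal of the enclosing box d_rho^2
    ex0, ey0 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex1, ey1 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    d_rho2 = (ex1 - ex0) ** 2 + (ey1 - ey0) ** 2
    return 1.0 - iou + d_c2 / (d_rho2 + 1e-8)
```

For identical boxes the loss is 0; for disjoint boxes it still provides a nonzero gradient through the center-distance term, unlike a plain IoU loss.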
Since the two subnetworks produce the same kind of output, a consistency prediction loss function is used to train them together. The consistency prediction loss includes the consistency loss between SS-SubNet and WS-SubNet and the loss within SS-SubNet, which is calculated as follows:
L_con = Σ_ij I_ij · [ β L_cls_inter(p_ic, p_jc) + (1 − β) L_reg(t_i, t_j) ]

where L_cls_inter denotes the consistency of the category predictions of the two subnetworks, using the multiclass cross-entropy loss; L_reg denotes the Smooth L1 loss function; β ∈ [0, 1] is a hyperparameter that balances the prediction consistency of the two subnetworks: the larger β is, the more SS-SubNet tends to use the pseudo-labels output by WS-SubNet as the true labels, and the smaller β is, the more it prefers its own prediction results; and I_ij denotes the intersection over union (IoU) of the candidate boxes generated by SS-SubNet and WS-SubNet:

I_ij = area(b_i ∩ b_j) / area(b_i ∪ b_j).
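One possible form of the IoU-weighted classification consistency term can be sketched as follows. This is a heavily hedged sketch, not the paper's exact formula: `consistency_loss` is a hypothetical name, the regression term is omitted, and cross-entropy toward the WS-SubNet distribution stands in for L_cls_inter.

```python
import numpy as np

def consistency_loss(p_ss, p_ws, iou_ij, beta=0.8, eps=1e-8):
    """IoU-weighted consistency between the two subnetworks' predictions.

    p_ss, p_ws: class probability vectors from SS-SubNet and WS-SubNet
    for one matched box pair; iou_ij: their IoU, used as the weight I_ij.
    """
    # Cross-entropy of the SS prediction toward the WS (pseudo-label) target
    ce = -np.sum(p_ws * np.log(p_ss + eps))
    return iou_ij * beta * ce
```

Poorly overlapping box pairs (small IoU) contribute little, so the consistency pressure concentrates on box pairs the two subnetworks agree on spatially.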

E. Overall Loss Function
Based on the multitask joint loss of SS-SubNet, the pseudo-label loss function of WS-SubNet is introduced to obtain the overall loss function of CLN-RSOD. The pseudo-label loss in WS-SubNet is the multilabel classification loss

L_pseudo = − Σ_{i=1}^{C} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where C denotes the total number of label classes; y = [y_1, y_2, ..., y_C] ∈ {0, 1}^C is the image-level label of the input image I; and ŷ_i denotes the probability of label i predicted by WS-SubNet. The overall loss function of CLN-RSOD can then be expressed as the sum of three components:

L_total = L_multi + L_con + L_pseudo

where L_multi is the multitask joint loss of SS-SubNet and L_con is the consistency prediction loss. The overall loss function guides the weight updates of the network. The two subnetworks are trained jointly, and the weakly supervised outputs serve as pseudo-labels for the strongly supervised network, fully exploiting the advantages of the strongly supervised network in the object detection task and achieving the joint optimization of region classification and bounding box regression.
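The image-level pseudo-label loss above is a standard multilabel binary cross-entropy and can be sketched directly (`image_level_loss` is a hypothetical name; clipping the predictions is a numerical-stability assumption of this sketch):

```python
import numpy as np

def image_level_loss(y_true, y_pred, eps=1e-8):
    """Multilabel binary cross-entropy over image-level labels.

    y_true: {0,1}^C image-level labels; y_pred: per-class probabilities
    predicted by the weakly supervised branch, each in (0, 1).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)
                         + (1.0 - y_true) * np.log(1.0 - y_pred)))
```

The loss vanishes for perfect predictions and grows without bound as a present class is assigned probability near zero, which is what drives the image-level supervision.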

IV. EXPERIMENTS

A. Datasets
We evaluate CLN-RSOD experimentally on two large publicly available RSI datasets, DIOR [5] and TGRS-HRRSD [57]. The TGRS-HRRSD dataset contains 21 761 images from Google Earth and Baidu Maps. The spatial resolution of these images ranges from 0.15 to 1.2 m, varying widely and densely over that range. The entire dataset contains 55 740 target object instances in 13 categories and is divided into a training set, a validation set, and a test set containing 5401, 5417, and 10 943 images, respectively. All categories and the number of objects in each category in the two datasets are shown in Fig. 4.
The DIOR dataset contains 23 463 RSIs in 20 categories with 192 472 target object instances. It is the largest such dataset in terms of the number of images and object classes, and has the following characteristics: image scales vary greatly, spatial resolutions differ, and the number of objects per image varies; the images are richly diverse, fully covering different weather, seasons, imaging conditions, and image qualities; and it has small inter-class variability (i.e., some classes, such as the highway service area, differ only slightly from others). All these features impose high requirements on the performance of the object detector. The dataset is divided into three subsets: the training set contains 5862 images and the validation set 5863 images, with the same total number of target object instances per category in each; the test set contains 11 738 images, so the sum of the training and validation images is close to the number of test images.

B. Implementation Details
The proposed CLN-RSOD model is built on the TensorFlow [94] deep learning framework, and the network is trained on a single NVIDIA GeForce 3080 GPU using an SGD optimizer with momentum. The initial learning rate is set to 0.001 and decreased to 0.0005 after 12 epochs; the momentum and weight decay are 0.9 and 0.0005, respectively; training stops after 20 epochs; the number of candidate boxes generated by the SS algorithm is set to 1000; and the batch size is set to 4.
For the backbone network, we use ResNet-50 [13]. In the MPR-RS algorithm, the candidate box similarity threshold τ_1 is set to 1.2 and the area share threshold τ_2 to 0.6. The hyperparameter λ in (8) in SS-SubNet is set to 10, consistent with Faster R-CNN, and the hyperparameter β in (10) is set to 0.8.

C. Ablation Studies
To fully validate the effectiveness of CLN-RSOD, we conducted ablation experiments on the TGRS-HRRSD and DIOR datasets. The experiments used ResNet-50 [13] as the backbone network and SS [38] with Grad-CAM++ [17] as the initial candidate box generation algorithms, and each designed module and algorithm was integrated step by step on this basis to compare four weakly supervised strategies; the results are shown in Table I.
In the ablation experiments, we gradually add the MPR-RS algorithm and the JPM to the network model to evaluate the effectiveness of the candidate box optimization algorithm and the pooling strategy, and we compare the model without SS-SubNet against the whole model to evaluate the effectiveness of the collaborative learning design. In Table I, the first three columns cover the cases of using only a weakly supervised single model while adjusting the candidate box algorithm and adding the JPM. Integrating all three yields the WS-SubNet of this article, and integrating SS-SubNet on top of WS-SubNet yields the CLN-RSOD of this article.
Specifically, starting from NO.2, the second stage of the candidate box optimization algorithm, MPR-RS, is added. The results on both datasets show that MPR-RS improves the quality of the labels and thus drives the overall performance of the model. Compared with NO.2, NO.4 shows that introducing the JPM brings mAP gains of 14.8% and 8.1% on the two datasets, respectively. The difference in gain comes from the difference in complexity between the datasets: the inter-class similarity and intra-class diversity of the DIOR dataset are higher than those of TGRS-HRRSD, which reduces the contribution of the JPM on the DIOR dataset.
Comparing NO.2 with NO.3 and NO.4 with NO.5, collaborative learning introduced on different baselines improves the mAP metric on both datasets, which demonstrates the effectiveness and necessity of collaborative learning. From NO.3 to NO.4, we remove collaborative learning and instead fine-tune the backbone network by adding the JPM. On the TGRS-HRRSD dataset, the mAP metric increases from 34.7 to 36.1, indicating that the JPM contributes more to the model than collaborative learning; on the DIOR dataset, however, the mAP metric decreases from 15.1 to 14.8, indicating that collaborative learning contributes more than the JPM. This suggests that collaborative learning is more effective than fine-tuning the baseline on datasets with higher complexity.

D. Comparisons With State-of-the-Arts
To further test the effectiveness of the proposed CLN-RSOD, in addition to comparing it with existing mainstream strongly supervised models and several weakly supervised models on the TGRS-HRRSD dataset, we also conducted experiments on the DIOR dataset, which has more data samples and higher complexity. The experimental results on the two datasets are shown in Tables II and III. Under the strongly supervised setting, Faster R-CNN outperforms RICNN and YOLOv3 in both tables; in particular, the two-stage Faster R-CNN is significantly better than the one-stage YOLOv3 on both datasets, which supports the view that candidate-box optimization can improve model performance. From the TGRS-HRRSD dataset to the DIOR dataset, the mAP metric of YOLOv3 decreases from 67.0 to 57.1 and that of Faster R-CNN decreases from 81.5 to 63.1, because the complexity of the DIOR dataset is much higher than that of TGRS-HRRSD. The last four columns of the two tables show detection results under the WSL scenario; the third of these is our proposed WS-SubNet single model, and in comparison, our proposed CLN-RSOD achieves the best detection results.
In Table II, CLN-RSOD achieves the best detection results in eight categories and improves the mAP metric by 37.0%, 31.3%, and 15.0% over the three preceding WSOD models. In Table III, CLN-RSOD achieves the best detection results in ten categories and improves the mAP metric by 37.6%, 10.9%, and 23.6% over the three preceding WSOD models. From the TGRS-HRRSD dataset to the DIOR dataset, the gain of CLN-RSOD over WS-SubNet in the mAP metric increases from 15.0% to 23.6%, which shows that collaborative learning brings more obvious gains as dataset complexity increases. OICR proposes an object instance mining framework with WSDDN as the baseline model; by refining the detection results of WSDDN, it effectively alleviates the problem that WSDDN focuses on only part of the object region. In both datasets, the gain of OICR over WSDDN is lower than the gain of CLN-RSOD over WS-SubNet, which verifies that the collaborative learning network proposed in this article is more effective than fine-tuning the results of a baseline model.
Examining the per-class detection of WS-SubNet and CLN-RSOD, the results in Table II show that only on the Basketball court class of the TGRS-HRRSD dataset does the single model WS-SubNet achieve the best results. A likely reason is that basketball courts are distributed too densely and, unlike baseball courts, lack obvious boundaries between instances, so the candidate boxes generated by the RPN in SS-SubNet are too dense, resulting in poor detection for this class. The results in Table III show that on the Dam and Train station classes of the DIOR dataset, the AP metric of CLN-RSOD is lower than that of WS-SubNet. Our analysis is that the detection accuracy of WS-SubNet on these classes is too low, so the pseudo-labels fed to SS-SubNet are of poor quality and negatively impact the model. Partial detection results of CLN-RSOD on the TGRS-HRRSD dataset are shown in Fig. 5. Detection is good for objects that are well separated and clearly distinct from their surroundings, but multiple similar objects in an image are easily confused and very small objects remain difficult; e.g., two tennis courts are identified as one tennis court in the bottom-left image. Partial detection results of CLN-RSOD on the DIOR dataset are shown in Fig. 6, where obvious objects are detected correctly. For example, in the top-right image, CLN-RSOD detects most of the white airplanes but misses small airplanes whose paint colors appear infrequently in the dataset; in the middle image of the second row, the two small ships behind the big ship are identified as part of the big ship; occluded objects are also difficult to identify, and the shadows of some objects are easily mistaken for part of the object.
This is partly because aerial images are captured from an overhead perspective, and this unusual camera angle limits the information that the backbone network can extract.

E. Visualization Analysis
To give an intuitive view of the contribution of the proposed MPR-RS algorithm in the pseudo-label generation process, we performed a visualization analysis using images from the DIOR dataset and selected the twenty highest-priority candidate boxes from the SS algorithm to demonstrate how MPR-RS adjusts them. Fig. 7(b) shows that the Baseball field class appears in four regions of the image, and the corresponding high-response regions of the class activation map show the same distribution. The candidate boxes produced by the first-stage SS algorithm contain a large amount of noisy information: as observed in Fig. 7(c), many candidate boxes contain background such as grass and streets, and some still fail to enclose the object entirely, indicating that poor candidate boxes directly harm the model's performance in WSOD. The results are unsatisfactory if the originally generated candidate boxes are used directly for prediction or as pseudo-labels for training. In the second stage, the candidate boxes are filtered, refined, and merged by the MPR-RS algorithm: candidate boxes are merged according to the similarity threshold τ1, and only the candidate boxes with smaller area are retained at the end according to the area-share threshold τ2. Fig. 7(d) illustrates one iteration of this process; the newly generated candidate boxes surround the whole object more completely, and the procedure repeats until the candidate-box pool no longer contains any pair whose similarity exceeds τ1.
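The iterate-and-merge loop described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the similarity measure between box crops is not specified here, so `box_similarity` is a pluggable placeholder, and the merge rule is assumed to produce the enclosing box of a pair; the area-share filter is our reading of "only the candidate boxes with smaller area are retained":

```python
def merge(box_a, box_b):
    """Replace a pair of boxes with their enclosing box (illustrative merge rule)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def mpr_rs(boxes, box_similarity, image_area, tau1=1.2, tau2=0.6):
    """Sketch of the MPR-RS second stage.

    Repeatedly merges the most similar pair of candidate boxes while any
    pair's similarity exceeds tau1, then keeps only boxes whose share of
    the image area stays below tau2 (an assumed form of the filter).
    """
    boxes = list(boxes)
    while len(boxes) > 1:
        # Find the most similar pair in the current candidate-box pool.
        pairs = [(box_similarity(boxes[i], boxes[j]), i, j)
                 for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
        best_sim, i, j = max(pairs)
        if best_sim <= tau1:
            break  # no pair exceeds the similarity threshold
        merged = merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    # Area-share filter: discard boxes that cover too much of the image.
    return [b for b in boxes if area(b) / image_area <= tau2]
```

With a toy similarity that scores overlapping boxes above τ1, two overlapping boxes collapse into one enclosing box while a distant box is left untouched.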

V. CONCLUSION
Strongly supervised object detection algorithms have made great progress and are widely used in security and surveillance, medical and health care, and industrial production. However, in complex real-world scenarios, completing instance-level labeling for massive data is costly in time and manpower. It is therefore of great practical importance to study object detection algorithms for weakly supervised scenarios. Based on this, this article explores how to use image-level labels for RSI object detection, analyzes existing WSOD algorithms, and improves upon them.
To address the difficulty of obtaining instance-level labels for RSIs, this article proposes CLN-RSOD, which effectively improves multiobject detection in RSIs. By using the output of the WS-SubNet as pseudo-labels for training the SS-SubNet, the strong detection performance of the SS-SubNet is fully exploited. In addition, a new JPM is proposed to enhance the backbone network's characterization of RSIs, and the MPR-RS algorithm is designed for candidate-box optimization to improve the quality of the pseudo-labels output by the WS-SubNet. The results show that each component of the proposed method effectively improves the detection accuracy of objects in RSIs under the WSL scenario.
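At a high level, the collaborative scheme summarized above alternates between the two sub-networks roughly as follows. This is a schematic sketch under assumed interfaces: `train_step` and `detect` are placeholder method names, not the actual implementation:

```python
def collaborative_training(ws_subnet, ss_subnet, dataset, epochs):
    """Schematic collaborative learning loop (interfaces are illustrative).

    The WS-SubNet is trained with image-level labels only; its detections
    then serve as instance-level pseudo-labels to train the SS-SubNet.
    """
    for _ in range(epochs):
        for images, image_labels in dataset:
            # 1. Train the weakly supervised branch on image-level labels.
            ws_subnet.train_step(images, image_labels)
            # 2. Its detections serve as instance-level pseudo-labels.
            pseudo_boxes = ws_subnet.detect(images)
            # 3. Train the strongly supervised branch on the pseudo-labels.
            ss_subnet.train_step(images, pseudo_boxes)
    return ss_subnet
```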
In summary, the proposed method effectively alleviates the reliance on instance-level annotation for RSI object detection, and the proposed collaborative detection can be applied to other similar WSOD tasks.
Hangjiang Wang received the B.S. degree in Internet of Things engineering from Wuxi University, Nanjing, China, in 2020. He is currently working toward the M.E. degree in electronic information at Nanjing University of Information Science and Technology.
His research interests include remote sensing image processing, computer vision, deep learning, and generative models.

He is a university-appointed Professor with the School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China. He is a regular Reviewer/Lead Guest Editor for many prestigious journals and conferences and serves as a TPC member/chair for various conferences. His research interests center around the Internet of Things, mobile edge computing, the Tactile Internet, and ultra-reliable low-latency communications.

Mithun
Xin Xu received the B.S. degree in electronic information engineering from Hefei University, Hefei, China, in 2019, and the M.E. degree in electronics and communication engineering from Nanjing University of Information Science and Technology, Nanjing, China, in 2022.
He is currently with Mu Xi Technology Corporation, where he works on IC algorithms. His research interests include deep learning, radar algorithms, and weather forecasting.