Multiscale Image Splitting Based Feature Enhancement and Instance Difficulty Aware Training for Weakly Supervised Object Detection in Remote Sensing Images

Weakly supervised object detection (WSOD) has a great practical value in remote sensing image (RSI) interpretation because the instance-level annotations are not required. The multiple instance learning based methods are mainstream, and two problems should be addressed. First of all, the majority of methods usually detect discriminative parts rather than the whole object. Secondly, the quantity of easy instances is much greater than that of hard instances, which restricts the improvement of WSOD methods. To address the first problem, a multiscale image splitting based feature enhancement (MSFE) module is proposed. The MSFE module splits the input RSI in multiple scales, afterwards, the spatial attention maps (SAMs) are generated from the feature maps of each proposal corresponding to different splitting scales, and are used to calculate the maximum spatial attention map (MSAM). Each SAM is required to approach MSAM, which enforces the MSFE module to learn the feature maps which can highlight the whole object for each positive proposal. To address the second problem, an instance difficulty aware training (IDAT) strategy is proposed. The difficulty of each instance can be quantitatively measured, and is used as the weight of each instance in the training loss. Consequently, the hard instances will be focused in the training process. The ablation study demonstrates the validity of MSFE module and IDAT strategy. The comparisons with nine advanced methods on two RSI benchmarks further validate the overall effectiveness of our method.


I. INTRODUCTION
O BJECT detection in remote sensing images (RSIs) has many applications, e.g., military strategy, landscape Manuscript  analysis [1], [2], geographic information system construction [3], urban planning [4], [5], etc. As the rapid progress of deep learning [6], [7], [8], [9], [10], [11], [12], [13], [14], fully supervised object detection (FSOD) methods [15] in RSIs have obtained promising capability to some extent, however, these methods need to spend considerable human and time costs for instance-level annotations. In order to reduce the annotation cost, more researchers focus on weakly supervised object detection (WSOD) methods [6], [16] because they only require imagelevel annotations. The majority of methods implement WSOD through multiple instance learning (MIL) [7], [17], [18], [19], and a milestone work is weakly supervised deep detection networks (WS-DDN) [7]. First, lots of proposals are generated through selective search (SS) algorithm [20], and are imported to the backbone to obtain their feature vectors. Next, two feature matrices are attained by feeding the feature vectors into two fully-connected (FC) branches, and are normalized in both the category and proposal dimensions through softmax operation. Finally, the category prediction score (CPS) matrix of all proposals is generated through the element-wise multiplication of two normalization matrices, and is accumulated along the proposal dimension to attain the image-level CPSs.
The WSOD methods based on MIL have developed rapidly in recent years, but there are still two problems. First of all, the existing WSOD methods usually pay less attention to the nondiscriminative region of the object, resulting in the detection results being more tending to locate discrimination regions of the object instead of the whole object, particularly in RSIs with sophisticated backgrounds. The second problem is that the quantity of easy instances is much greater than that of hard instances. The accumulation loss of a large number of easy instances can force the WSOD model focus on the easy instances, which limits the improvement of WSOD methods.
To address the first problem, a novel multiscale image splitting based feature enhancement (MSFE) module is proposed. First, the image patches with different size are obtained by splitting the input image in different scales, and are fed into shared convolutional neural network to obtain their feature maps, and then the feature maps of each image patch are concatenated according This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ to the their spatial relationship. During the training process, a series of spatial attention maps (SAMs) of each positive instance (see Section III-A) are inferred from feature maps corresponding to different splitting scales. The SAMs corresponding to large (small) splitting scales tend to highlight the global (local) discriminative regions of object. Then, the maximum spatial attention map (MSAM) of each positive instance is obtained by performing the element-wise maximum operation on multiple SAMs of each positive instance. The feature maps which can highlight the whole object region, can be learned through all SAMs approaching the MSAM in the training process.
To address the second problem, an instance difficulty aware training (IDAT) strategy is proposed. The difficulty degree of each instance denotes the degree that the instance can be detected correctly, and the difficulty degree of hard negative instances and other instances is measured through different calculation scheme. The difficulty degree is used as the weight of each instance in the training loss, and the hard instances will be given more attention in this way.
The contributions of our article are given below as follows. 1) A MSFE module is proposed for addressing the problem that existing methods tend to highlight the local discriminative regions instead of the whole object. 2) This article proposes a IDAT strategy to handle the imbalance between the quantity of hard and easy instances.

A. Online Instance Classifier Refinement
The WSDDN [7] is adopted as the basic WSOD network in this article, the related description can be seen in the second paragraph of introduction section. Online instance classifier refinement (OICR) [17] is another milestone WSOD work, which adds multiple instance classifier refinement (ICR) streams on WSDDN to further improve its performance. Each ICR branch consists of a FC layer and a softmax classifier, and the CPS of each proposal is obtained when the feature vector of each proposal passes through the ICR stream. The pseudolabel of each ICR stream is generated from the CPS outputted from previous ICR stream. In particular, the pseudolabel of first ICR stream is generated from the CPS of WSDDN. Specifically, the instances with the highest CPS are considered as seed positive samples, and their neighbor instances which have a high overlap with them are also considered as positive samples, and the remaining instances are considered as negative samples. Finally, the cross entropy loss is used to train each ICR stream.

B. Related Works of Handling the Imbalance of Easy and Hard Instances
The imbalance between hard and easy instances is a common problem in object detection, and some related works were proposed to handle this problem, which could be divided into two types. The first kind of methods alleviated the above problem by giving larger weight to hard instances. For example, focal loss [21] encouraged the detection model to focus on hard instances by giving relatively small weights to easy instances in the classification training loss. In another example, the number of iterations and the difference between CPSs and pseudolabels were jointly to define the weights of each instance [22], consequently, the hard instances would be given more attention as the training going on. The second kind of methods only gave larger weights to hard negative instances. For example, the difference between pseudolabel and the CPS on background class is used to define the weights of hard negative instances in the training loss of each ICR streams [23]. In our approach, a special instances difficulty evaluation strategy is designed for hard negative instances, furthermore, the instance difficulty score (IDS) of other hard instances are also calculated. On the whole, our instances difficulty evaluation strategy is more comprehensive.

C. Other Important WSOD Methods
Many WSOD methods are applied in natural scene images and RSIs, some important methods are introduced as follows. On the basis of OICR, Tang et al. [18] mined more seed positive samples to train the ICR streams through a proposal cluster learning strategy. Ren et al. [24] designed a drop block module to alleviate the part domination problem, and a robust pseudolabel mining strategy was proposed, in which the instances which have high CPS and low intersection over union (IoU) with other instances were selected as seed positive samples. Compared with natural scene images, the WSOD in RSIs brings more challenges because of bird's eye view imaging and cluttered background, consequently, the more powerful approaches were proposed. For example, the instances which covered the whole object were assigned high CPSs, consequently, the high quality pseudopositive samples would be mined according to the CPSs [25]. Feng et al. [26] used the semantic discrepancy between internal and external contexts to mine pseudopositive samples which could cover the whole object as much as possible.

III. PROPOSED METHOD
In this article, our method uses OICR [17] as the baseline, and the MSFE module and IDAT strategy are added to the baseline. The MSFE module is used to obtain the comprehensive object features of the positive instances, and the IDAT strategy can give higher weights to the hard instances during the training process.

A. Multiscale Image Splitting Based Feature Enhancement
As shown in Fig. 1, the MSFE module first splits the input image into z × z image patches, z ∈ [1, Z], where Z denotes the number of splitting scales, the input image indicates 1st scale. Note that N -1 scales are randomly selected from the 2nd to the Zth scale. The image patches are imported into the shared CNN to get their feature maps, then the feature maps of nth scale are concatenated according to their spatial relationship to obtain F n , n ∈ [1, N], where F n denotes the feature map of input image corresponding to the nth splitting scale.
Subsequently, M proposals are generated by using SS algorithm [20], and the rth proposal P r is projected onto F n , then a fixed-size feature G n r ∈ R H×W ×K , r ∈ [1, M] is obtained through the region of interest pooling operation, where K, H, and W denote the quantity of channels, the height and width of G n r , respectively. The SAM of P r at the nth scale, denoted as A n r ∈ R H×W , is obtained by the following equation: where G n r,k denotes the kth channel of G n r . The MSAM of P r , denoted as MA r , is obtained as follows: where max(·) denotes the element-wise maximum operation. The loss function of MSFE module, denoted as L M , is obtained as follows: where T represents the assemble of all positive instances in all of ICR branches, |T | represents the number of instances contained in T , A n q denotes the SAM of qth positive instance at the nth splitting scale, and MA q denotes the MSAM of qth positive instance. The identification of positive instances is referred to Section III-C.

B. Basic Weakly Supervised Object Detection Model
As illustrated in Fig. 1, G n r is convolved by two FC layers to get the feature vector of P r , denoted as f n r , and the feature vectors of M proposals {f n r } M r=1 are imported into two parallel FC layers to obtain two matrices x n c , x n d ∈ R C×M , where C denotes the quantity of object categories. So far, the CPS matrix of all proposals at the nth splitting scale, denoted as X n , is obtained as follows: where σ c (·) and σ d (·) denote the softmax operation along the category and proposal dimensions, respectively, and denotes the Hadamard product. The image-level CPS corresponding to the cth class, denoted as φ n c , is obtained as follows: where x n c,r ∈ X n represents the CPS that P r belongs to category c at the nth splitting scale. So far, the loss function of basic WSOD model at the nth splitting scale, denoted as L n B , is obtained as follows: where y c = 1 or 0 denotes whether or not at least one object of class c is contained in input image. The overall training loss of basic WSOD model, denoted as L B , is obtained as follows:

C. Instance Difficulty Aware Training of Instance Classifier Refinement Branches
As illustrated in Fig. 1, the f n r is fed into the lth ICR stream to acquire the CPS of P r , denoted as S n,l r ∈ R C+1 , l ∈ [1, L], where the (C+1)th dimension of S n,l r indicates background category, and L denotes the quantity of ICR streams. The supervisory information of the lth ICR stream is derived from the CPS outputted from (l-1)th ICR stream. The details as follows: In the (l-1)th ICR stream, the mean CPS of P r , denoted as S n,l−1 r , is obtained as follows: According to the pseudolabel mining strategy proposed by MIST [24], the pseudolabel y l c,r ∈ {1, 0} can be obtained based on {S n,l−1 r } M r=1 , where y l c,r indicates whether or not the P r belongs to cth class in the lth ICR stream. If y l c,r = 1, then P r ∈ T . The IDS of P r corresponding to nth splitting scale in the lth ICR branch, denoted as IDS l,n r , is obtained as follows: where R hn denotes the assemble of hard negative instances, the definition of hard negative instance is that one instance has the highest CPS (greater than 0.1) on a category which does not exist in the image, R o denotes the assemble of instances except hard negative instances, |R hn | and |R o | denote the quantity of instances contained in R hn and R o , respectively, j denotes the category corresponding to the highest CPS if P r ∈ R hn , otherwise, j denotes the actual category of P r , and s l,n j,r denotes the jth element of S n,l r . So far, the IDAT loss of L ICR streams, denoted as L I , is obtained as follows: where w l r represents the loss weight of P r , referring to OICR [17] for details. The total loss of our model is as follows:

IV. EXPERIMENTS
A. Experimental Setup 1) Dataset and Metrics: Two widely used datasets, DIOR [27] and NWPU VHR-10.v2 [28], [29], are utilized to evaluate our method. DIOR dataset consists of 23463 images with 800×800 pixels, which contains 192 472 instances in 20 categories. NWPU VHR-10.v2 dataset consists of 1172 images with 400×400 pixels, which contains 2775 instances in 10 categories. The training and validation subsets are employed for training, and the rest is for testing. The mean average precision (mAP) and correct localization (CorLoc) [30] are employed to measure the accuracy of object detection and object localization, respectively.
2) Implementation Details: a) Algorithm details: OICR [17] is used as the baseline of our method, where the backbone uses the VGG16 [31] network. About 2000 proposals are produced by SS algorithm [20]. The quantity of ICR streams is 3, i.e., L = 3. The threshold of the IoU of NMS [32] is fixed to 0.3. The quantity of splitting scales is set to 6, i.e., Z = 6. b) Training details: Our model is trained by stochastic gradient descent algorithm. The weight decay, momentum and batchsize are 0.005, 0.9, and 2, respectively. The quantity of iterations is set to 30 K and 60 K for the NWPU VHR-10.v2 and DIOR datasets, respectively. The initial learning rate is 0.0025, and is reduced to 10% of last stage at 20Kth (50Kth) and 26Kth

B. Parameter Analysis
The value of N which is the key parameter of MSFE module is quantitatively analyzed on the DIOR dataset. As illustrated in Fig. 2, out model (without IDAT strategy) achieves the highest mAP when N = 3.

C. Ablation Study
To validate the effectiveness of MSFE module and IDAT strategy, the mAP and CorLoc of baseline (OICR), baseline + MSFE, baseline + IDAT and baseline + MSFE + IDAT (our method) are compared with each other on the DIOR dataset.
2) Subjective Ablation Study: The subjective ablation study of MSFE module and IDAT strategy is conducted to intuitively demonstrate their effectiveness. As shown in Fig. 3(a), some detection results of baseline are subjectively compared with that of baseline + MSFE. Obviously, the baseline + MSFE can cover the objects more completely, which addresses the first problem proposed in introduction section. As shown in Fig. 3(b), some detection results of baseline are subjectively compared with that of baseline + IDAT. Apparently, the baseline + IDAT gives higher identification accuracy, which is induced by assigning larger weights to hard instances, i.e., the second problem proposed in introduction section is also addressed.
As shown in Tables II and III,    Figs. 4 and 5 intuitively present our detection results on both datasets, where the colors of rectangular boxes are different for each object category on the NWPU VHR-10.V2 dataset, and the color of rectangular boxes is green for every object category on the DIOR dataset. On the whole, most of objects can be identified correctly and completely enclosed by rectangular boxes, which intuitively demonstrates the validity of our method.
In addition, the capability of our method is closer to that of FSOD methods, especially in some categories, e.g., baseball, airplane, ground track field, etc.

E. Analysis of Failure Detection Results
As shown in Tables II-V, our method gives poor results in certain classes, therefore, the worst six classes are analyzed in this section, which include Dam, Bridge, Vehicle, Storage tank, Ship, and Windmill. As shown in Fig. 6(a) and (b), the reservoirs and rivers are falsely identified as foreground objects. As a matter of fact, the reservoirs and rivers can be considered as the coexisting contexts of Dam and Bridge, and the foreground objects and their coexisting contexts jointly determine the classes of RSIs, therefore, the coexisting contexts are easily identified as foreground objects. As shown in Fig. 6(c), multiple vehicles are falsely identified as single vehicle, and the similar errors can also be seen in Fig. 6(d) and (e). Actually, the characteristics of multiple objects is more salient than single object when the foreground objects are densely distributed, consequently, multiple objects are usually identified as single object. As shown in Fig. 6(f), the shadows of windmills are falsely identified as  Windmill. As a matter of fact, the shadows of foreground objects are sometimes clearer than the objects themselves because of bird's eye view imaging, therefore, the shadows are usually identified as objects.

V. CONCLUSION
This article proposes a novel MSFE module to address the problem that most methods always tend to locate the discriminative regions instead of the entire object. The MSFE module splits the input image into many patches in multiple scales, and the feature maps of input image corresponding to different splitting scales can be obtained by concatenating the feature maps of image patches. Each positive proposal's SAMs corresponding to different splitting scales are generated from their feature maps, and are used to infer their MSAM. The feature maps which can highlight the entire object can be learned by enforcing all SAMs to approach the MSAM. Moreover, we propose a IDAT strategy to address the problem that the quantity of easy instances is much greater than hard instance. The IDAT strategy assigns a weight to each instance according to the degree that the instance can be detected correctly, and the hard instances can be focused during the training process. Ablation studies verify the validity of MSFE module, IDAT strategy, and their fusion. The comparison with nine other WSOD methods on two RSI benchmarks further demonstrates the excellent capability of our method.