OBBStacking: An Ensemble Method for Remote Sensing Object Detection

Ensemble methods are a reliable way to combine several models to achieve superior performance. However, research on the application of ensemble methods in the remote sensing object detection scenario is mostly overlooked. Two problems arise. First, one unique characteristic of remote sensing object detection is the oriented bounding boxes (OBB) of the objects and the fusion of multiple OBBs requires further research attention. Second, the widely used deep learning object detectors provide a score for each detected object as an indicator of confidence, but how to use these indicators effectively in an ensemble method remains a problem. Trying to address these problems, this article proposes OBBStacking, an ensemble method that is compatible with OBBs and combines the detection results in a learned fashion. This ensemble method helps take first place in the Challenge Track Fine-Grained Object Recognition in High-Resolution Optical Images, which was featured in 2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation. The experiments on DOTA dataset and FAIR1M dataset demonstrate the improved performance of OBBStacking and the features of OBBStacking are analyzed.

Weighted Boxes Fusion (WBF) [2] aims to alleviate the weakness of NMS by taking into account all the confidence scores of the to-be-fused bounding boxes and assigning an average confidence score to the resulting bounding boxes. This method, however, leaves two problems unaddressed. First, WBF treats the confidence scores from different models equally and takes the unweighted mean as the fused confidence score, disregarding three facts: (1) Some models may perform better than others, and their scores should carry more weight. (2) Some models may share a similar neural network structure and produce similar results, so the ensembled result may be biased towards a group of similarly structured models. (3) Deep learning models are poorly calibrated, and different models are overconfident to different extents, so a simple ensemble method may favor the more overconfident models.
Second, WBF is only compatible with horizontal bounding boxes.
When deep learning was first introduced into the remote sensing object detection problem, the position of a detected object was initially encoded in the same format as in other scenarios, i.e., a non-oriented rectangular bounding box whose sides are always parallel to the axes of the image coordinate grid. This format soon posed a problem. Due to the high-altitude viewpoint and the steep viewing angle of remote sensing images, the presented objects can have arbitrary orientations. Some types of objects, such as large ships, buses, buildings, and airport runways, have a large length-to-width ratio and are poorly represented by horizontal bounding boxes, especially when the objects are at a roughly ±45° angle to the image axes.
Oriented Bounding Box (OBB) was proposed to address this problem. OBB keeps the rectangular form but gains orientation as a new degree of freedom (DoF), the other existing DoFs being the position of its center, its length, and its width. OBB introduces finer labels for the objects in remote sensing images and a better data format for the detection accuracy criteria. However, the existing ensemble methods are not compatible with OBB.
In this paper, to address the first problem, a stacking ensemble method is proposed. The stacking model is trained to best combine the member models while simultaneously considering three factors: model calibration, model redundancy, and the performance gap between the models. For the second problem, a new bounding box fusion method is proposed for oriented bounding boxes. The bounding boxes are parameterized with orientation, position, width, and height, and each parameter is fused separately. The combined method, OBBStacking, helps take 1st place in the Challenge Track Fine-grained Object Recognition in High-Resolution Optical Images, which was featured in 2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation.
This paper is structured as follows. Related work is discussed in Section II. The proposed ensemble method is introduced in Section III. The experiment setup and the quantitative results are described in Section IV. We also provide some analysis of OBBStacking in Section V. The conclusion is given at the end.

A. Remote sensing object detection
Quite a few deep neural network detectors have been proposed in recent years. Notably, Liu et al. [3] were among the earliest to utilize oriented bounding boxes (OBB) for object detection in remote sensing images. Their method is built upon Faster RCNN [4] and proposes a rotated region of interest (RROI) pooling layer for accurate feature extraction and an OBB regression model for precise object positioning. Later methods [5]-[7] adopt oriented anchors for a formulation of the bounding box that is easier for the neural networks to learn, but at the cost of relying on a redundant number of rotated anchors. Ding et al. [8] propose ROI Transformer to alleviate the problem by formulating RROI as offset parameters relative to only non-oriented ROIs. Han et al. [9] build upon general rotation equivariant CNNs [10] and ROI Transformer to create an oriented object detection model (ReDet) with rotation equivariant features. Xie et al. [11] further simplify the OBB inference process of ROI Transformer, using 1/3000 of the parameters, and propose a new model, Oriented R-CNN, which is currently state-of-the-art on the DOTA [12] dataset.
ReDet and Oriented R-CNN are two of the models we select to generate the detection results for our ensemble method. This is due to their recognized performance on similar problems and the large difference between their backbone networks: ReDet uses a rotation equivariant CNN, whereas Oriented R-CNN uses the more traditional ResNet [13] architecture. The intrinsic difference in their backbones helps increase the model diversity and, in turn, the effectiveness of the ensemble process.

B. Transformer
Transformer is another neural network structure we take an interest in, due to its structural difference from CNN. It was first introduced by Vaswani et al. [14] for natural language processing (NLP) problems. It is designed for sequential data and is effective at modeling long-distance dependencies, which are typical in language data. Its success motivated its adaptation to the computer vision domain, with the major hurdles being the difference in the structure of the data (one dimension vs. two/three dimensions) and the increased data length along each dimension.
ViT [15] by Dosovitskiy et al. is one of the notable Transformer models for computer vision problems. ViT divides a full image into several small patches to be treated as tokens, like the words in NLP, and proposes large-scale pre-training to compensate for Transformer's lack of intrinsic properties for image data, such as translation equivariance and feature locality.
Swin Transformer [16] is one of the latest vision Transformer models. Swin Transformer boosts efficiency by exploiting the locality of images and by increasing the scale of features step by step through a hierarchical design. Swin Transformer is also one of the backbones for our member neural network detectors.

C. Calibration of the neural networks
A well-calibrated model can produce the probability of correctness for each prediction. Guo et al. [17] show that while modern neural networks excel at making correct predictions, their level of calibration degrades. This hinders the attempt to effectively combine different neural networks and their application in critical scenarios. Guo et al. propose to calibrate the models in a post-processing manner and train a simple parametric model (Temperature Scaling) [18] to map the confidence scores of the models to the probabilities of correctness. Wenger et al. [19] propose a latent Gaussian process to correct the model output. Zhang et al. [20] propose an ensemble of post-processing methods that is data efficient and has high generalizability.
The above methods are post-processing calibration methods that are most related to our work. There are also calibration methods such as Bayesian neural network methods [21] and neural network regulation methods [22] that change the design philosophy or the objective functions to achieve more calibrated neural networks.

D. Bounding box post-processing methods
Object detection methods, along with other vision-related algorithms, may produce redundant activations in a close spatial neighborhood. Non-Maximum Suppression (NMS) has been used in such scenarios for over half a century [23] and, to this day, is still used in deep neural network pipelines. Specifically, modern neural network detectors generate redundant results for a single object, and NMS post-processes the results by checking their spatial overlaps and keeping the ones with the highest confidence scores.
NMS eliminates the redundant bounding boxes completely, which may lead to false negatives when there are overlaps between the ground truth bounding boxes. Soft-NMS [24] alleviates the problem by keeping all the bounding boxes and only mapping the confidence scores of the to-be-suppressed bounding boxes to a lower value.
Weighted Boxes Fusion (WBF) [2] specifically targets the post-processing of bounding boxes from different models. Instead of selecting one best bounding box (NMS) or keeping all of the bounding boxes (Soft-NMS), WBF produces a weighted average of the bounding boxes in terms of position and size, so all of the to-be-fused bounding boxes can contribute to the final bounding box and no redundant bounding boxes are introduced.
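As a rough illustration of the fusion idea behind WBF for horizontal boxes, the following Python sketch fuses the boxes of a single, already matched group. Coordinates are averaged with the confidence scores as weights, and the fused score is the plain mean of the scores, as described in the introduction. The function name `fuse_horizontal_boxes` and the sample values are assumptions made for illustration, and the box matching step of WBF is omitted.

```python
import numpy as np

def fuse_horizontal_boxes(boxes, scores):
    """Fuse one group of matched horizontal boxes (illustrative sketch).

    boxes:  (K, 4) array-like of (x1, y1, x2, y2)
    scores: (K,) positive confidence scores
    Returns the fused box and the fused score.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Coordinates: confidence-weighted average, so every box contributes.
    fused_box = scores @ boxes / scores.sum()
    # Score: unweighted mean of the contributing scores.
    fused_score = scores.mean()
    return fused_box, fused_score

# Hypothetical group of two boxes predicted for the same object.
fused_box, fused_score = fuse_horizontal_boxes(
    boxes=[[0, 0, 10, 10], [2, 2, 12, 12]],
    scores=[0.75, 0.25],
)
# fused_box is [0.5, 0.5, 10.5, 10.5] and fused_score is 0.5
```

The fused score ignores which model produced each box, which is the limitation the stacking approach in this paper is meant to address.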

III. METHODS
OBBStacking is a stacking ensemble method that is compatible with OBBs. In a stacking method, a new model, called a meta-learner, is trained to best combine the results of multiple existing models. OBBStacking has two stages (Fig. 2): training the meta-learner, and applying the meta-learner to the member models. First, we introduce the meta-learner proposed in our method. Then, we discuss the key processes that constitute the two stages, namely bounding box clustering, meta-learner parameter optimization, and bounding box fusion.

A. The Meta-Learner
In a stacking method, every member model makes an independent prediction based on a data sample, and the meta-learner combines the predictions to form a more accurate one. In OBBStacking, we choose a simple model, logistic regression, as the meta-learner. The model takes the form of Eq. 1, where $\sigma$ is the logistic function, and $w \in \mathbb{R}^M$ and $b \in \mathbb{R}$ are the weight and the intercept parameter of the meta-learner, respectively, with M being the number of member models.
Note that the logit $z \in \mathbb{R}^2$ is the non-probabilistic output of the member models, and the 2 dimensions correspond to the tendency of refusing a target and accepting a target, respectively. In the context of deep learning, logits are often converted to probabilistic output through the logistic function, but here the logits are used because they are more amenable to Eq. 1.
Later in Section V, we will show how this simple form of the meta-learner can simultaneously consider model calibration, model redundancy and the performance gap between the models.
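To make the role of the meta-learner concrete, here is a minimal Python sketch of a logistic-regression combiner. It assumes, for simplicity, that each member model contributes one scalar logit per cluster, so that the input is a vector of M logits. The function names and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np

def logistic(x):
    """Logistic function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def meta_learner(z, w, b):
    """Fused confidence for one cluster (illustrative sketch).

    z: (M,) one scalar logit per member model (an assumption of this sketch)
    w: (M,) weights of the meta-learner
    b: scalar intercept of the meta-learner
    """
    return logistic(np.dot(w, z) + b)

# Hypothetical parameters and logits for M = 3 member models.
w = np.array([0.6, 0.3, 0.1])
b = -0.2
z = np.array([2.0, 1.0, -1.0])
p = meta_learner(z, w, b)  # logistic(1.2), roughly 0.77
```

A larger weight lets the corresponding member model move the fused confidence more, which is how a single linear layer can encode differences between models.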

B. Bounding box clustering
Under the OBB detection setting, each member model produces a set of OBBs, but the correspondence of OBBs between different sets is unknown. Therefore, the first goal is to collect the outputs z on the same object from the different member models. We assume the OBBs are relatively accurate in terms of position and shape, such that OBBs generated from the same object but by different models have a significant spatial overlap. Therefore, an OBB spatial clustering method is used to assign OBBs from the same object to the same cluster.
The clustering method has the following steps:
1) Aggregate all the OBBs from the member models into a list S, sorted by their bounding box scores s in descending order.
2) Create an empty list C for the resulting clusters.
3) Pop the first OBB from S as a new cluster center, and push the cluster into C.
4) Iterate through S, find the OBBs that come from other member models and have an overlap greater than iou_thresh with the cluster center, and move them from S to the new cluster.
5) Go back to Step 3 and repeat until S is empty.
Note that although both stages of OBBStacking include bounding box clustering, the method is applied to different sets of data. The whole scheme requires three sets of data: the training set, the validation set, and the test set. The training set is used to train the member models. The validation set is used to train the meta-learner (Stage 1 of OBBStacking). The test set is used for measuring the final performance of OBBStacking (Stage 2). Member models and the meta-learner are trained on separate data sets to prevent the meta-learner from favoring the member models that overfit the training set.
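The five clustering steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: detections are plain dictionaries with hypothetical keys `box`, `score`, and `model`; "other member models" is read as "a model different from the cluster center's model"; and the overlap function is passed in as `iou_fn`. A real pipeline would supply a rotated IoU, while the demo uses an axis-aligned IoU as a stand-in to stay self-contained.

```python
def cluster_obbs(detections, iou_fn, iou_thresh=0.5):
    """Greedy spatial clustering of detections from several models (sketch)."""
    # Step 1: aggregate and sort by score, highest first.
    S = sorted(detections, key=lambda d: d["score"], reverse=True)
    # Step 2: empty list of clusters.
    C = []
    # Step 5: repeat until S is empty.
    while S:
        # Step 3: the best remaining detection becomes a new cluster center.
        center = S.pop(0)
        cluster = [center]
        C.append(cluster)
        # Step 4: move overlapping detections from other models into the cluster.
        remaining = []
        for d in S:
            same_object = iou_fn(center["box"], d["box"]) > iou_thresh
            if d["model"] != center["model"] and same_object:
                cluster.append(d)
            else:
                remaining.append(d)
        S = remaining
    return C

def axis_aligned_iou(a, b):
    """Stand-in IoU for boxes (x1, y1, x2, y2); not a rotated IoU."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical detections from two models covering two objects.
detections = [
    {"box": (0, 0, 10, 10), "score": 0.9, "model": 0},
    {"box": (1, 1, 11, 11), "score": 0.8, "model": 1},
    {"box": (50, 50, 60, 60), "score": 0.7, "model": 1},
    {"box": (51, 50, 61, 60), "score": 0.6, "model": 0},
]
clusters = cluster_obbs(detections, axis_aligned_iou, iou_thresh=0.5)
# Two clusters, each holding one detection per model.
```

Because the list is sorted by score, the first element of every cluster is its center, which is the box later compared against the ground truth.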

C. Meta-learner Parameter Optimization
After the member models are trained on the training set and produce M sets of detection OBBs on the validation set, the bounding box clustering method is applied to acquire the clustered OBBs $C_{val} = \{c_i \mid i = 1, 2, \ldots, n\}$. Each OBB in a cluster $c_i$ represents the prediction of a member model from one data sample $x_i$.
Here, the major role of the meta-learner is to fuse the bounding box scores s in the same clusters. Note that we use the logit output z in Eq. 1. In most detectors, s and z can be acquired by keeping the outputs both before and after the last logistic function. Additionally, in most clusters, one or more member models will be absent because their predicted probability is lower than a threshold. We set z for these cases to a fixed negative value to keep the optimization simple.
We use Negative Log Likelihood (NLL) as the objective function (Eq. 3), where $y_i$ is the ground truth label of each cluster. To determine $y_i$, we calculate the IoU (Intersection over Union) between the cluster center OBB and all the ground-truth OBBs in the validation set. A cluster is marked as a true positive (y = 1) if it has an overlapping ground-truth OBB, and a false positive (y = 0) otherwise. Eq. 3 is a convex function with respect to w and b, and can be easily optimized.
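The optimization described above amounts to fitting a logistic regression on cluster-level logits. The sketch below uses plain gradient descent on the mean NLL with synthetic data. The fill value `ABSENT_LOGIT`, the learning rate, the number of steps, and the data itself are assumptions chosen for illustration; the paper does not state them, and any convex solver could be used instead.

```python
import numpy as np

ABSENT_LOGIT = -10.0  # hypothetical fixed negative value for a missing model

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def nll(w, b, Z, y):
    """Mean negative log likelihood of labels y under the meta-learner."""
    p = np.clip(logistic(Z @ w + b), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fit_meta_learner(Z, y, lr=0.1, steps=500):
    """Gradient descent on the NLL. Z: (n, M) logits, y: (n,) labels in {0, 1}."""
    n, M = Z.shape
    w, b = np.zeros(M), 0.0
    for _ in range(steps):
        err = logistic(Z @ w + b) - y  # derivative of the NLL w.r.t. w.z + b
        w -= lr * (Z.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Synthetic validation clusters: 200 clusters, M = 3 member models.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
Z = (2.0 * y[:, None] - 1.0) + rng.standard_normal((200, 3))
Z[0, 2] = ABSENT_LOGIT  # the third model produced no box for cluster 0

loss_before = nll(np.zeros(3), 0.0, Z, y)  # equals log(2) at w = 0, b = 0
w, b = fit_meta_learner(Z, y)
loss_after = nll(w, b, Z, y)
```

Since the objective is convex, the result does not depend on the starting point, only on running the solver long enough.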

D. Oriented bounding box fusion
Before this step, the trained member models produce M sets of OBBs from the test set, which are then clustered into C test with the bounding box clustering method.
This step aims to fuse the OBBs $O = \{o_1, \ldots, o_K\}$ that belong to the same cluster into one OBB. We represent an OBB with a 7-tuple $o = (x, y, w, h, \theta, z, l)$, where x, y, w, h, z represent the center coordinates on the x-y axes, width, height, and logit score, respectively, and $l \in \{1, 2, \ldots, M\}$ is the index of its source model. Orientation $\theta \in [0, \pi)$ represents the angle between the longest axis of the bounding box and the x-axis. The fusion process needs to derive the first 5 elements in o to acquire the final OBB, and these elements are fused separately. With regard to the first 4 elements, the fusion is formulated per element, where j is the index of the element in o, p is the index of the OBB in the cluster, and $o_f$ is the fused OBB. $s^*$ is the calibrated score derived from the OBB's logit score and the weight parameters in Eq. 1; it acts like an improved weight for each OBB that addresses the output calibration and the redundancy in the member models.
Orientation parameter $\theta$ receives special treatment due to its cyclic property. First, the orientation of the bounding box with the largest score $s^*$ is designated as the major orientation $\theta_{MJ}$ of the cluster. Then, the fused orientation is determined by averaging the relative orientations to $\theta_{MJ}$, where r is a bivariate function that calculates the relative difference of two angles while considering their cyclic property. Note that here we assume $\theta \in [0, \pi)$ since we do not discriminate between the head and the tail of an OBB. Lastly, the score of the fused bounding box is determined with Eq. 1 with the learned meta-learner.
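One way the fusion step could look in code is sketched below. Several choices here are assumptions rather than the paper's exact formulas: the first four elements are fused as an average weighted by the calibrated scores, the relative orientations are averaged with the same weights, and `relative_angle` wraps angle differences into [-π/2, π/2) to respect the period of π. The calibrated scores are passed in directly as `s_star`, and all values are hypothetical.

```python
import numpy as np

def relative_angle(theta, ref):
    """Signed difference theta - ref, wrapped into [-pi/2, pi/2).

    Orientations have period pi because head and tail are not distinguished.
    """
    return (theta - ref + np.pi / 2) % np.pi - np.pi / 2

def fuse_obbs(boxes, s_star):
    """Fuse the OBBs of one cluster (illustrative sketch).

    boxes:  (K, 5) rows of (x, y, w, h, theta) with theta in [0, pi)
    s_star: (K,) positive calibrated scores used as weights
    """
    boxes = np.asarray(boxes, dtype=float)
    s_star = np.asarray(s_star, dtype=float)
    weights = s_star / s_star.sum()
    # Center, width, and height: weighted average, element by element.
    fused_xywh = weights @ boxes[:, :4]
    # Orientation: average the offsets relative to the major orientation.
    theta_mj = boxes[np.argmax(s_star), 4]
    offsets = relative_angle(boxes[:, 4], theta_mj)
    theta_f = (theta_mj + weights @ offsets) % np.pi
    return np.append(fused_xywh, theta_f)

# Two nearly parallel boxes whose angles sit on opposite sides of the 0/pi wrap.
boxes = [
    [10.0, 20.0, 40.0, 8.0, 0.10],
    [12.0, 22.0, 44.0, 10.0, np.pi - 0.05],
]
fused = fuse_obbs(boxes, s_star=[0.75, 0.25])
# fused is approximately [10.5, 20.5, 41.0, 8.5, 0.0625]
```

A naive weighted mean of the two raw angles would give about 0.85 rad, far from either input, which is why the orientation is handled through relative offsets.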

A. Datasets
Two datasets are used to validate our method: the FAIR1M dataset [25] and the DOTA dataset [12]. Both datasets have an evaluation server that evaluates the detection results on a test set whose ground truth labels are not shared publicly. Both evaluation servers adopt mean average precision (mAP) as the evaluation criterion, consistent with PASCAL VOC 2007 [26] and VOC 2012.
FAIR1M dataset: This dataset was introduced alongside 2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation. It contains 32912 images with widths ranging from 600 pixels to 10000 pixels and spatial resolutions between 0.3 and 0.8 meters. The images are collected from Gaofen satellites and Google Earth, covering over 100 civil airports, harbors, and cities. The dataset contains 1.02 million objects annotated with OBBs and assigned to 5 major categories and 37 fine-grained sub-categories. The major categories include vehicles, ships, airplanes, sports fields, and road structures. The training, validation, and testing sets contain 16488, 8287, and 8137 images, respectively.
DOTA dataset: This dataset was released in 2018. It contains 2806 images from satellites (GF-2 and JL-1), Google Earth, and aerial images with spatial resolutions between 0.1 and 4.5 meters. It covers similar types of objects as FAIR1M does but with fewer sub-categories. It contains 15 categories and 0.2 million instances. The proportions of the training set, validation set, and testing set are 1/2, 1/6, and 1/3, respectively.

B. Member Models
As previously mentioned in Sec. II, we select 3 types of neural network detectors as the member models in the ensemble process: Oriented R-CNN, ReDet, and a Swin detector. These 3 types of detectors have different design preferences, so the diversity between the member models is assured.
The Swin detector in our experiment is a simple modification of the original one [16] for compatibility with OBB detection. Both the Swin backbone and the recent CNN backbones produce a feature pyramid [27], consisting of layers of image features with different spatial resolutions and semantic depths, so their outputs have a similar structure and they can share the same types of detectors. We keep the original backbone and replace the original detector head with the one from Oriented R-CNN, since its OBB detector structure is elegant and concise.
For Oriented R-CNN and ReDet, we follow the experiment setups in the original papers, except for those settings that are limited by the GPU specifications. We use a similar setting for the Swin detector to the one for Oriented R-CNN, since they share the same type of detector. We use 2 RTX 3080 Ti GPUs for training and inference. The images are cropped into 1024 × 1024 patches, and the batch size is set to 2, 2, and 1 per GPU for Oriented R-CNN, ReDet, and the Swin detector, respectively, due to the limit of GPU memory. Multi-scale training and testing are also used because they are often used in combination with ensemble methods to achieve the highest performance possible.

C. Quantitative Comparison
First, for a fair comparison, we augment the original NMS and WBF with OBB compatibility, and evaluate the performance of the member models and the selected ensemble methods on the DOTA dataset. Since most of the experiments in the literature [9], [11] combine the training and validation sets to train their models for maximum performance, whereas our ensemble model needs a separate validation set to learn the parameters of the meta-learner, we run two separate experiments to verify the effectiveness of our method.
(1) We follow the original scheme of our method, and train all the member models on the training set only, leaving the validation set for the parameter training of the meta-learner.
(2) We follow the training scheme of other methods and train the member models with data from both the training set and the validation set, and use the trained meta-learner from Experiment (1).
In the following tables on the DOTA dataset, the names of the categories are abbreviated to conserve space. The categories, in order, are plane, baseball-diamond, bridge, ground-track-field, small-vehicle, large-vehicle, ship, tennis-court, basketball-court, storage-tank, soccer-ball-field, roundabout, harbor, swimming-pool, and helicopter. The quantitative results of Experiment (1) are listed in Table I. Oriented R-CNN achieves the best performance among the member models and obtains 79.86% mAP. The ensemble methods all obtain a 1-2% mAP increase over the best member model, and our method achieves the top score with 81.50% mAP, 0.61% over WBF.
For Experiment (2), we assume the performance gap, the calibration, and the redundancy of the member models do not drift too much from Experiment (1), so that we can reuse the meta-learner for the ensemble. The results are shown in Table II. The results are generally similar to the previous ones, with a slight overall performance increase of 1% mAP among the member models and a 0.1-0.4% mAP increase among the ensemble methods. Our method, with the meta-learner from Experiment (1), still outperforms WBF by 0.24% mAP. This shows that our assumption holds when the training data expands, and even though our method requires a separate validation set, it still outperforms the existing ensemble methods.
Next, we evaluate the member models and the ensemble models on the FAIR1M dataset using the Experiment (1) setup and show the results in Table III. Among the member models, Oriented R-CNN still achieves the best performance with 47.77% mAP. Compared to the individual methods, the ensemble models obtain a large performance increase of around 4% mAP, and our method achieves the best score with 52.42% mAP, a 4.65% mAP increase over Oriented R-CNN and a 0.57% mAP increase over WBF.

V. DISCUSSION
In this section, we demonstrate how OBBStacking addresses the three problems that arise during an ensemble process on deep learning models, namely model calibration, the performance gap between the models, and model redundancy.

A. Model Calibration
Deep learning models tend to overfit the training data and are overconfident about their predictions. When the member models are overconfident to different degrees, their predictions are on different scales and do not indicate true probability values. Therefore, the ensemble methods may not work on these models as intended, and a model calibration process is needed.
In this section, we show that one of the calibration methods, Temperature Scaling (TS) [17], can be regarded as a special form of our meta-learner, indicating that OBBStacking includes the feature of model calibration.
TS attempts to map the inaccurate predictions to the real probability of correctness by 'softening' the final logistic layer in the neural networks with a temperature parameter T > 1. When T → ∞, all outputs of the 'softened' logistic layer $\sigma_{TS}$ approach 1/2 and indicate maximum uncertainty.
The inference of parameter T also uses NLL as the objective function, since NLL is a standard measure of a probabilistic model's quality [28]. As can be seen, our meta-learner, Eq. 1, becomes Eq. 10 when the number of member models is 1, and thus can calibrate models in the same fashion.
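The relation between temperature scaling and a single-model meta-learner can be checked numerically. The sketch below assumes the common form of temperature scaling, in which a scalar logit is divided by T before the logistic function, and compares it with a one-input logistic regression whose weight is 1/T and whose intercept is 0. The function names and logit values are hypothetical.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def temperature_scale(z, T):
    """Softened logistic output: the logit is divided by the temperature T."""
    return logistic(z / T)

def single_model_meta_learner(z, w, b):
    """Logistic-regression meta-learner with one member model."""
    return logistic(w * z + b)

z = np.array([-4.0, 0.5, 6.0])  # hypothetical logits of one detector

# With w = 1/T and b = 0 the two mappings coincide.
T = 2.0
ts_out = temperature_scale(z, T)
ml_out = single_model_meta_learner(z, w=1.0 / T, b=0.0)

# A very large temperature pushes every output towards 1/2.
flat_out = temperature_scale(z, T=1e6)
```

In this reading, the meta-learner is slightly more general than temperature scaling because it also learns an intercept.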

B. Performance Gap
In this section, we run an experiment to demonstrate how OBBStacking adjusts the weights when there is a performance gap between the models.
Our model tackles three problems simultaneously: model calibration, redundancy, and performance gap. We assume these three problems can be disentangled and thus that a factorization of the parameter exists, $w = p \odot r \odot g$, where $\odot$ is elementwise multiplication and p, r, g are the weight vectors for model calibration, model redundancy, and the performance gap, respectively.
We want to minimize the effect of the first two factors and see how OBBStacking handles the performance gap between the models. Along with the Swin detector used in our previous experiment, 3 additional Swin detectors are added to form the Swin detector family. The only difference between these Swin detectors is the total number of training epochs, which are 12, 9, 16, and 18, respectively. At different epochs during training with stochastic gradient descent, the neural networks may randomly lean towards more accuracy on some categories instead of others and rely upon different features, thus creating a sequence of different models with relatively high redundancy.
We first run OBBStacking on the Swin family and acquire w for later comparison. Then, to show the factor of redundancy among the Swin family, we apply the bounding box clustering method to the detection results and calculate Pearson's correlation between the confidence scores of the models. As can be seen in Fig. 4, compared to the other models, the correlation coefficients between the Swin models are very close to each other, so we assume r is approximately a vector of 1s.
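The correlation analysis can be reproduced in a few lines once the clustered confidence scores are arranged as a matrix with one row per cluster and one column per model. The following sketch uses synthetic scores rather than real detector outputs: two columns are near-copies of each other, standing in for redundant models, and the third is independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 500

# Synthetic confidence scores: models 0 and 1 are nearly redundant,
# model 2 is unrelated to them.
base = rng.random(n_clusters)
scores = np.column_stack([
    base + 0.01 * rng.standard_normal(n_clusters),
    base + 0.01 * rng.standard_normal(n_clusters),
    rng.random(n_clusters),
])

# Pearson correlation between columns (models); the result is a 3 x 3 matrix.
corr = np.corrcoef(scores, rowvar=False)
```

A block of uniformly high off-diagonal entries in such a matrix is the kind of pattern that suggests treating a family of models as equally redundant.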
As for the weight vector p from model calibration, it can be easily derived by applying TS on the Swin detectors individually, and we get p = [   value in g. Swin 16 has the worst mAP and also the smallest value in g. This is in accord with the basic ensemble idea of putting more weight on the better predictors. Swin 12 and Swin 18 have similar values in g and similar performance in mAP, which is a reasonable range considering the small performance gap between the two models and the error from the assumed r value.
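Under the assumed elementwise factorization of w into p, r, and g, the performance-gap factor g follows by elementwise division once w, p, and r are known. The sketch below shows only the arithmetic. All numbers are made up for illustration and are not the values reported in Table IV.

```python
import numpy as np

# Hypothetical values for a family of four models (not the paper's numbers).
w = np.array([0.30, 0.20, 0.10, 0.25])  # learned meta-learner weights
p = np.array([0.50, 0.40, 0.50, 0.50])  # calibration factors, e.g. from TS
r = np.ones_like(w)                     # redundancy factors, assumed to be 1

# Elementwise inversion of w = p * r * g.
g = w / (p * r)
# g is [0.6, 0.5, 0.2, 0.5]
```

Any error in the assumed r carries over directly into g, so small differences between entries of g should be read with caution.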

C. Model Redundancy
In this part, we build upon the previous experiments to show how OBBStacking handles model redundancy. Two collections of models are included. Collection 1 consists of Oriented R-CNN, ReDet, and Swin 12. Collection 2 includes all the models in Collection 1 and the additional Swin 9, Swin 16, and Swin 18, adding up to 6 models in total.
The correlation coefficients between all the models are shown in Fig. 4, and the weight parameters w of the meta-learner are shown in Table V. We notice that in Collection 2, because of the redundancy among the Swin family, their weights decrease drastically, with a sum of 0.36, in between the weights of Oriented R-CNN and ReDet. The weights of Oriented R-CNN and ReDet decrease slightly because the Swin family improves its performance with the increase of its members.

VI. CONCLUSION
We propose an ensemble method, OBBStacking, that is compatible with the oriented bounding box (OBB) which is widely used in object detection in the remote sensing field.OBBStacking consists of a meta-learner that can address the problems in the ensemble process of the deep neural network detectors, namely the model calibration, the redundancy between the models and the performance gap between the models.OBBStacking outperforms other ensemble methods in the DOTA dataset and the FAIR1M dataset and helps us win 1st place in the Challenge Track Fine-grained Object Recognition in High-Resolution Optical Images featured in 2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation.

Fig. 1. Illustration of the bounding box fusion results of different ensemble methods. The blue rectangles are the bounding boxes fed into the methods, and the red rectangles are the fused bounding boxes.

Fig. 3. Showcase of the ensemble results of OBBStacking on the DOTA dataset and the FAIR1M dataset. Only the objects with a confidence score larger than 0.2 are shown.

TABLE IV
WEIGHT VECTORS OF THE SWIN FAMILY

TABLE V
WEIGHT VECTORS OF COLLECTION 1 AND COLLECTION 2