Efficient Weakly-Supervised Object Detection With Pseudo Annotations

Weakly-supervised object detection (WSOD) has attracted considerable attention in recent years. However, there is still a large gap between WSOD and generic object detection. The main barriers to the efficiency of WSOD are ineffective data augmentations and inaccurate bounding box predictions. Given only image-level annotations, it is hard for WSOD to effectively utilize diverse data augmentations and accurately regress the bounding boxes. Although a fully-supervised object detector can be trained using annotations generated from a weakly-supervised object detector, the performance is still severely limited by the low quality of the mined pseudo annotations. This paper proposes an efficient WSOD method with pseudo annotations (EWPA) to make better use of imperfect annotations. With the assistance of pseudo annotations, EWPA can regress more accurate bounding boxes, whereas traditional WSOD can only locate the salient parts of an object. Furthermore, pseudo annotations help design more complex data augmentations, driving the network to learn more discriminative feature representations. Extensive experiments conducted on the PASCAL VOC 2007 and 2012 datasets validate the effectiveness of EWPA.


I. INTRODUCTION
In recent years, the development of Convolutional Neural Networks (CNN) has significantly boosted the performance of many computer vision tasks such as image classification [1]-[3], object detection [4]-[6] and semantic segmentation [7]-[9]. Research on fully supervised object detection is relatively mature, since large-scale datasets with accurate bounding box annotations have been widely used by researchers. The training of a fully supervised detector relies heavily on such datasets with precise instance-level annotations, which require a large amount of human labor. In this paper, to address this problem, we are devoted to the Weakly Supervised Object Detection (WSOD) problem, which only needs image-level annotations for training and saves the massive cost of training data labeling. However, the performance of WSOD remains limited because the commonly used data augmentations and bounding box regression become unavailable in WSOD.
In most approaches, WSOD is formulated as a Multiple Instance Learning (MIL) problem. By combining CNN with MIL, Bilen and Vedaldi [10] propose the end-to-end Weakly Supervised Deep Detection Network (WSDDN), which is often used as the basic multiple instance detection network (MIDN) in other works due to its conciseness and effectiveness. However, WSDDN inevitably tends to choose the most discriminative part of an object rather than the entire object. Based on WSDDN, methods like OICR [11] utilize online classifiers to refine the classification results so that tighter result boxes can be obtained. Although these approaches improve the performance of WSOD by a considerable margin, the main bottleneck of unavailable complex data augmentations and box regression still heavily limits the performance of WSOD.
A common procedure adopted in WSOD methods is to train a fully-supervised object detector using annotations generated from the WSOD detection results. Inspired by this procedure, some works [12]-[14] introduce a regression branch into the MIDN directly, where annotations are mined in various ways. In the approach proposed by Zeng et al. [12], pseudo annotations are mined based on low-level image features. Differently, in [13], instance-level annotations are extracted online from the results of the MIDN. These works integrate the regression branch to alleviate the inaccuracy of bounding boxes. Although the regression branch brings performance improvement, online instance mining limits the utilization of this valuable position information, so many strategies related to pseudo annotations remain untapped.
In this paper, we focus on making better use of pseudo annotations. Diverse data augmentations and the corresponding box regression are fully explored. We generate pseudo annotations offline by weakly-supervised instance segmentation approaches, so that they can be used in the overall framework without limitations. In particular, the training process is enhanced in three aspects: annotation-based data augmentations, mixed supervision in the classification branch and the introduction of a bounding-box regressor, as shown in Fig.1. To observe whether other augmentations can work, we train two networks: a basic MIDN (PCL [15]) and a MIDN with some complex augmentations including random mask, random rotation and some pixel transformations, whose results are shown in Fig.2. It can be concluded that complex augmentations, which are commonly used to train a fully supervised detector, deteriorate the performance of a traditional MIDN due to the lack of a priori position information. In our case, where rough annotations are available, these complex spatial and pixel transformations, such as random rotation and random brightness, can be added to training, boosting the final performance.

In addition, a mixed-supervision strategy is adopted in the classification branch. To train the classification branch effectively, an alternative option is to utilize offline-generated pseudo ground truths directly. However, it is not guaranteed that all objects are covered by ground truths, so foreground boxes may be wrongly labeled as background, hurting the recall. Therefore, in our network, we keep the MIDN architecture for the classification branch and some instances are still weakly supervised, following the original training strategy. However, when some instances have high IoU with ground truths, they are assigned the labels of the corresponding ground truths, i.e., they become fully supervised.
Finally, these pseudo ground truths are used to supervise the newly-added bounding-box regressor, which is quite natural.
Extensive experiments are conducted on the challenging PASCAL VOC datasets to demonstrate the effectiveness of the proposed method. Our method achieves 55.6% and 53.6% mAP on VOC 2007 and VOC 2012, respectively.
In summary, our main contributions are listed as follows.
• We propose a weakly-supervised object detector with a bounding-box regressor, designed for the circumstance where rough pseudo annotations have already been mined.
• We further introduce more complex data augmentations and propose a mixed-supervision training strategy to make better use of imperfect position information.
• The proposed method achieves comparable performance on both PASCAL VOC 2007 and 2012 datasets.

II. RELATED WORK

A. TRADITIONAL MULTIPLE INSTANCE LEARNING
Due to the absence of instance-level annotations, most previous methods formulate weakly-supervised object detection as a Multiple Instance Learning (MIL) problem [16]. These approaches consider each image as a bag of candidate proposals, and the bag is labeled as a positive or negative sample of a specific class. An image is treated as positive for a class only when there exists at least one proposal belonging to this class. To obtain a detector, a proposal classifier is trained using image-level labels, which makes it learn to distinguish the most discriminative representations of object proposals. Multiple instance learning leads to a non-convex optimization problem, i.e., the network may get stuck at local optima during optimization. To solve this problem, some methods are devoted to better parameter initialization [17]-[20] and some focus on improving the learning procedure [21], [22]. Deselaers et al. [18] utilize the objectness measure as a localization prior to initialize their model parameters. Jie et al. [19] propose a deep self-taught learning approach, which makes the detector learn to acquire tight positive samples continuously. Cinbis et al. [22] propose a multi-fold multiple instance learning procedure to prevent training from being trapped in local optima. C-MIL [21] alleviates the non-convexity problem by optimizing a series of smoothed loss functions.
In our method, we use MIL to mine instance-level category information and enhance it with pseudo annotations.

B. WEAKLY-SUPERVISED OBJECT DETECTION
Most existing approaches deal with the WSOD problem by combining MIL and CNN into a unified network. A typical work is the weakly-supervised deep detection network (WSDDN) proposed by Bilen and Vedaldi [10], which consists of a classification stream and a detection stream and mines positive samples by aggregating the scores of these two streams. Kantorov et al. [23] propose to learn context-aware guidance models to leverage the surrounding context regions. Tang et al. [11] introduce several online instance classifiers to WSDDN, whose classification results are further refined to select more accurate result boxes. PCL [15] optimizes the process of supervision generation for the instance classifiers in [11] and achieves better performance. Zhang et al. [24] design two algorithms, PGA and PGE, to generate refined pseudo ground truths from the detection results of a basic weakly-supervised detector so that a fully-supervised detector can be trained. Tang et al. [25] propose a weakly-supervised region proposal network to generate high-quality proposals. Wang et al. [26] combine a WSDDN and a Faster-RCNN-like network and train them jointly through feature sharing and prediction consistency to improve the performance of WSDDN. Lin et al. [27] propose an object instance mining algorithm that helps detect more possible objects. [13], [14], and [28] propose to combine the MIL branch with single or multiple online regression branches to achieve re-localization of proposals. These methods are all based on a multiple instance detection network, so it is hard to avoid the non-convex optimization problem brought by MIL. As a result, their performance is severely limited. Some approaches [29]-[32] leverage the class activation map (CAM) [33] or weakly-supervised segmentation methods to provide position information. WCCN [29] proposes an end-to-end three-stage cascaded CNN where CAM and segmentation results are obtained to generate better proposals for detection. Wei et al. [30] use two segmentation-based properties, purity and completeness, to discover tighter boxes. WS-JDS [31] joins weakly-supervised object detection and segmentation tasks with multi-task learning and lets them complement each other's learning. Yan et al. [32] couple two MIDNs which work in a complementary manner with a segmentation-guided proposal removal algorithm. In this paper, instead of obtaining segmentation maps online, we use offline-generated segmentation results so that more utilization of a priori position information can be introduced into the training process.

C. WEAKLY-SUPERVISED INSTANCE SEGMENTATION
Compared with weakly-supervised object detection (WSOD) and weakly-supervised semantic segmentation (WSSS), a more precise form of annotation, bounding boxes, has been widely used for the problem of weakly-supervised instance segmentation (WSIS). Provided with the location of an object, some approaches [34], [35] are devoted to estimating object segments. Khoreva et al. [34] propose a modified version of GrabCut [36] to estimate an object segment and incorporate object shape priors by using segment proposals. Remez et al. [35] train a mask generator and a discriminator with an adversarial learning scheme, whose purpose is to make an image generated by cutting and pasting the mask area onto the background of a random image look realistic. In addition, some methods [37], [38] have recently started to use the same annotations as WSOD and WSSS, i.e., image-level class annotations, as weak labels for WSIS, making the task far more challenging than before. Zhou et al. [37] leverage stimulated peaks in a class response map to extract fine-detailed instance-level representations and obtain instance masks by combining them with off-the-shelf methods [39]-[41]. Ahn et al. [38] propose IRNet to provide a displacement vector field and a class boundary map, which are used to generate instance masks from CAMs with label synthesis. Due to the excellent performance of [38], we utilize it to obtain pseudo annotations offline.
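As a concrete illustration, converting offline instance segmentation results into pseudo boxes can be sketched as follows. This is a minimal sketch, not the authors' released code; the `instances` format of (mask, class_id, score) triples and the function names are our own illustrative assumptions, while the 0.3 confidence threshold follows the paper.

```python
import numpy as np

def mask_to_box(mask):
    """Upright bounding rectangle (x1, y1, x2, y2) of a binary instance mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

def pseudo_boxes(instances, score_thresh=0.3):
    """Keep instances whose confidence exceeds the threshold and take the
    upright bounding rectangles of their masks as pseudo boxes.
    `instances` is assumed to be a list of (mask, class_id, score) triples."""
    return [(mask_to_box(m), c) for m, c, s in instances if s > score_thresh]
```

Because the pseudo boxes are extracted once, offline, they can be reused by every later stage (augmentation, mixed supervision, regression) without re-running the segmentation network.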

III. METHOD
In this section, we introduce the structure of EWPA, which mainly consists of two branches: a classification branch and a regression branch. We also present the training strategy of these branches in detail. The overall framework is shown in Fig.3. Given an input image, it is first augmented with both pixel and spatial transformations performed based on pseudo boxes; then a feature map shared by the two branches is extracted through the CNN. Region features for proposals are generated by RoI pooling and then sent to the classification branch and the regression branch respectively. To train these two branches effectively, supervision with pseudo annotation guidance is adopted in our method.

A. DATA AUGMENTATION WITH PSEUDO ANNOTATIONS
In most weakly-supervised object detection methods, multi-scale training and random horizontal flip are the usual data augmentations. Since pseudo ground truths must be generated to train the regression branch anyway, they can be utilized throughout the training process. Pseudo boxes allow us to add more complex data augmentations, which proves effective in our experiments. We use IRNet [38] to obtain pseudo ground truths, which are taken as the upright bounding rectangles of instance segmentation results with high confidence (larger than 0.3). The following pixel transformations are adopted: random brightness, random contrast, RGB value shift, random hue, random saturation and random value. To further exploit the pseudo boxes, masking is also added to our augmentation strategy. Since the rough position of an object is available, annotation-based masking is performed in EWPA rather than random masking. Specifically, for a pseudo box, we randomly sample a rectangular area inside the box and fill the corresponding area with mean pixel values. Besides random horizontal flip, random rotation is also chosen as an extra spatial transformation. The rotation angle is limited to between -30 and 30 degrees. It should be noted that proposals and ground truths also need to be rotated along with the image, and their upright bounding rectangles are taken as the final boxes after rotation.
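The two annotation-based spatial augmentations above can be sketched as follows. This is a minimal sketch under our own assumptions: the function names, the (x1, y1, x2, y2) box format and the HxWxC NumPy image layout are illustrative, not the paper's released code.

```python
import numpy as np

def pseudo_box_mask(image, boxes, rng=None):
    """Annotation-based masking: for each pseudo box, sample a random
    rectangle inside the box and fill it with the mean pixel value."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    fill = out.mean(axis=(0, 1))  # per-channel mean pixel value
    for (x1, y1, x2, y2) in boxes.astype(int):
        # sample a sub-rectangle inside the pseudo box
        mx1, mx2 = sorted(rng.integers(x1, x2 + 1, size=2))
        my1, my2 = sorted(rng.integers(y1, y2 + 1, size=2))
        out[my1:my2, mx1:mx2] = fill
    return out

def rotate_boxes(boxes, angle_deg, center):
    """Rotate box corners about `center` and return the upright bounding
    rectangle of each rotated box, as the paper does after rotation."""
    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cx, cy = center
    new = []
    for (x1, y1, x2, y2) in boxes:
        corners = np.array([[x1, y1], [x2, y1], [x1, y2], [x2, y2]], float)
        rot = (corners - [cx, cy]) @ R.T + [cx, cy]
        new.append([rot[:, 0].min(), rot[:, 1].min(),
                    rot[:, 0].max(), rot[:, 1].max()])
    return np.array(new)
```

The same `rotate_boxes` transform would be applied to both proposals and pseudo ground truths so that they stay consistent with the rotated image.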

B. CLASSIFICATION BRANCH

1) MULTIPLE INSTANCE DETECTION NETWORK
Since only image-level labels, which indicate whether objects of the classes of interest exist in an image, are available during training, it is necessary to introduce a multiple instance detection network (MIDN) to obtain classification results for proposals. Although the segmentation results obtained by the weakly-supervised instance segmentation method provide both position and class information, they can hardly cover all objects in an image. As a result, if we simply train a general classification branch with the category information of pseudo annotations rather than using an MIDN, the final recall will deteriorate. There are several choices for the MIDN, such as [10], [11], [15]. We adopt PCL [15], an upgraded version of OICR [11], as our MIDN, for it is trained end-to-end and easy to integrate into our network. The architecture is shown in Fig.4. PCL is based on WSDDN [10], which includes two data streams: a classification stream and a detection stream. Classification scores of proposals are obtained by aggregating the results of these two streams.
In particular, for an image $I$, its corresponding label is denoted as $Y = \{y_1, y_2, \ldots, y_C\}$, where $C$ is the number of classes and $y_c = 1$ or $0$ indicates the presence or absence of class $c$ in $I$. Selective Search [39] is used to obtain object proposals $R = \{r_1, r_2, \ldots, r_{|R|}\}$, where $|R|$ denotes the number of proposals. The image $I$ and the proposals $R$ are fed into a CNN to generate feature maps, and region features of proposals are extracted with RoI pooling. After these features are passed through two fully connected layers, they are branched into the classification and detection data streams. Both streams consist of a fully connected layer and a softmax layer. Matrices $x^{cls}, x^{det} \in \mathbb{R}^{C \times |R|}$ are produced by the fully connected layers of the classification and detection streams, then passed through two different softmax layers which compute scores along different dimensions. In the classification stream, the score matrix $\sigma_{cls}(x^{cls}) \in \mathbb{R}^{C \times |R|}$ is calculated by a softmax over classes, $[\sigma_{cls}(x^{cls})]_{cr} = e^{x^{cls}_{cr}} / \sum_{c'=1}^{C} e^{x^{cls}_{c'r}}$, while a softmax over proposals gives the detection stream score matrix, $[\sigma_{det}(x^{det})]_{cr} = e^{x^{det}_{cr}} / \sum_{r'=1}^{|R|} e^{x^{det}_{cr'}}$. The final scores of all object proposals are generated by the element-wise product $x^{R} = \sigma_{cls}(x^{cls}) \odot \sigma_{det}(x^{det})$. Finally, the image score of class $c$ is obtained by summation over all proposals, $\phi_c = \sum_{r=1}^{|R|} x^{R}_{cr}$. The network can be trained end-to-end by the multi-class cross entropy loss

$L = -\sum_{c=1}^{C} \left[ y_c \log \phi_c + (1 - y_c) \log(1 - \phi_c) \right]$. (1)
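The WSDDN scoring scheme above can be sketched in a few lines of NumPy. This is an illustrative sketch of the two-stream aggregation, not the training code; the function names are our own.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_scores(x_cls, x_det):
    """x_cls, x_det: (C, |R|) logits from the two streams.
    The classification softmax runs over classes (axis 0), the
    detection softmax over proposals (axis 1)."""
    s_cls = softmax(x_cls, axis=0)   # per-proposal class distribution
    s_det = softmax(x_det, axis=1)   # per-class proposal distribution
    x_r = s_cls * s_det              # element-wise product of the streams
    phi = x_r.sum(axis=1)            # image-level score per class, in (0, 1)
    return x_r, phi

def image_loss(phi, y):
    """Multi-class cross entropy over image labels y in {0, 1}^C."""
    eps = 1e-8
    return -(y * np.log(phi + eps) + (1 - y) * np.log(1 - phi + eps)).sum()
```

Note that each image score $\phi_c$ lies in (0, 1) by construction, since the detection-stream scores of a class sum to one over proposals, which is what allows the binary cross entropy to be applied directly.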
However, the performance of WSDDN is unsatisfactory, so PCL [15] is finally adopted in our network to refine the classification results of WSDDN online. Several new classifiers are added to WSDDN and also trained with the multi-class cross entropy loss. Differently, these classifiers are supervised by instance-level class labels, which are generated online from the previous classifier or from the result of WSDDN during training. More details can be found in [15]. In our method, we also add strong supervision to the new classifiers, which will be introduced in III-B2 in detail.
After further refinement of the classification results, the classifier tends to choose tighter boxes. Even so, the result boxes are inevitably still not accurate enough, especially for categories with extremely discriminative parts. To handle this problem, result boxes are further regressed by the regression branch introduced in III-C.

2) MIXED-SUPERVISION STRATEGY
Our MIDN originally follows a purely weak supervision strategy. Although pseudo annotations are available (as mentioned in III-A), they cannot be used directly to assign class labels to proposals, because it is not guaranteed that all objects are covered by pseudo boxes; foreground boxes may thus be wrongly labeled as background, introducing more noise into the training of the classifier.
To solve this problem, we adopt a mixed-supervision strategy. In the original MIDN, all proposals are weakly supervised. For a newly-added online instance classifier, the instance-level class labels used for training are generated based on pseudo ground truths (called proposal clusters in [15]) obtained through the classification scores of the previous classifier. We combine the pseudo annotations with these proposal clusters in a complementary way. Specifically, after generating several proposal clusters with the method in [15], boxes mined from pseudo annotations are added directly to the current proposal cluster set one by one. In particular, the confidence of the pseudo annotations is slightly reduced in the classification branch during training in order to avoid overfitting. After this addition, some proposal clusters may overlap heavily, so for each new box we compute its IoU with the other clusters and remove it if the maximum IoU exceeds 0.5.
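The overlap-aware merging step can be sketched as below. This is a simplified sketch under our own assumptions (boxes as (x1, y1, x2, y2) tuples, clusters represented by a single box each); the paper's proposal clusters carry scores and members that are omitted here.

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def merge_pseudo_boxes(clusters, pseudo_boxes, iou_thresh=0.5):
    """Add each pseudo box to the cluster set unless its maximum IoU with
    the existing clusters exceeds the threshold (0.5 in the paper)."""
    merged = list(clusters)
    for pb in pseudo_boxes:
        if all(box_iou(pb, c) <= iou_thresh for c in merged):
            merged.append(pb)
    return merged
```

In effect, pseudo boxes only contribute new clusters where the online classifier has not already discovered a similar region, which is the complementary behavior the strategy aims for.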

C. REGRESSION BRANCH
After obtaining the classification results, proposals are sent to the regression branch for further refinement of their coordinates. The regression branch consists of several fully connected layers and outputs bounding box regression offsets $t = (t_x, t_y, t_w, t_h)$. Different from Fast R-CNN [5], we use the same regression offsets for all object classes.
Since we obtain pseudo boxes as introduced in III-A, it is possible to find regression targets for proposals during training. When assigning pseudo boxes to proposals, besides following the principle in [42], we adopt a new principle in order to balance the training of the classification branch and the regression branch: a ground truth is assigned to a proposal if and only if they have the same class label. The class label of a proposal is taken from its highest-scoring foreground class. It is worth noting that if the score is too low, the proposal is classified as background and is not assigned any pseudo box. Suppose that a proposal $(x, y, w, h)$ has a target pseudo box $(x^*, y^*, w^*, h^*)$. The target regression offsets are

$t^*_x = (x^* - x) / w$, (2)
$t^*_y = (y^* - y) / h$, (3)
$t^*_w = \log(w^* / w)$, (4)
$t^*_h = \log(h^* / h)$. (5)
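The target offsets in (2)-(5) can be computed as a direct translation of the standard box parameterization of [42]; this is a sketch with our own function name, assuming center-size (x, y, w, h) boxes.

```python
import math

def regression_targets(proposal, gt):
    """Target offsets for a proposal (x, y, w, h) matched to a pseudo
    box (x*, y*, w*, h*), following the parameterization of [42]."""
    x, y, w, h = proposal
    xg, yg, wg, hg = gt
    tx = (xg - x) / w          # center shift, normalized by proposal size
    ty = (yg - y) / h
    tw = math.log(wg / w)      # log-space scale change
    th = math.log(hg / h)
    return tx, ty, tw, th
```

A proposal that already coincides with its pseudo box yields all-zero targets, so the regressor learns the identity mapping for well-localized proposals.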
The regression loss of this proposal is computed with the smooth $L_1$ loss [5]:

$L_{reg} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i - t^*_i)$, (6)

where $\mathrm{smooth}_{L_1}(d) = 0.5 d^2$ if $|d| < 1$ and $|d| - 0.5$ otherwise.
The overall network is trained end-to-end by optimizing the loss function

$L = L_{cls} + \alpha L_{reg}$, (7)

where $L_{cls}$ denotes the loss function of the classification branch and $L_{reg}$ is that of the regression branch. The weight $\alpha$ in (7) balances the two losses and is set to 0.7 in our implementation.
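The smooth L1 regression loss and the overall objective in (6) and (7) can be written compactly as below; a plain-Python sketch with our own function names, shown for a single proposal.

```python
def smooth_l1(t, t_star, beta=1.0):
    """Smooth L1 loss between predicted offsets t and targets t*:
    quadratic for |d| < beta, linear beyond (beta = 1 in Fast R-CNN)."""
    loss = 0.0
    for p, q in zip(t, t_star):
        d = abs(p - q)
        loss += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return loss

def total_loss(l_cls, l_reg, alpha=0.7):
    """Overall objective L = L_cls + alpha * L_reg, with alpha = 0.7."""
    return l_cls + alpha * l_reg
```

The linear tail of the smooth L1 loss keeps gradients bounded for badly-matched proposals, which matters here because pseudo boxes are noisier than human annotations.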

IV. EXPERIMENTS
In this section we introduce the evaluation datasets and implementation details of our proposed method. Then we conduct detailed ablation experiments to validate the contribution of the different strategies used in our method. Finally, we discuss the performance of our method compared with other state-of-the-art methods.

A. DATASETS AND EVALUATION METRICS
The proposed method is evaluated on the PASCAL VOC 2007 and 2012 datasets. For evaluation, we use two kinds of metrics: mAP and CorLoc. Average Precision (AP) and the mean of AP (mAP) are the evaluation metrics applied on the test set. Correct Localization (CorLoc) is used on the trainval set to measure localization accuracy [47]. Both metrics are based on the PASCAL criterion, i.e., IoU > 0.5 between ground-truth boxes and predicted boxes.

B. IMPLEMENTATION DETAILS
We use the VGG16 [2] network pre-trained on ImageNet [49] as the backbone of our method. As suggested in [11], the penultimate max-pooling layer and its subsequent convolution layers are replaced by dilated convolution layers. The number of refinement classifiers in the MIL branch is set to 3. In the regression branch, the IoU threshold for assigning boxes is set to 0.5. During the training stage, a mini-batch size of 4 images is adopted. The learning rate is 5 × 10^-4 for the first 35K iterations, then decreases to 5 × 10^-5 for the following 10K iterations. The momentum and weight decay are set to 0.9 and 5 × 10^-4, respectively. The weight of the regression loss α is set to 0 for the first 20K iterations to avoid insufficient training of the MIDN head and set to 0.7 for the following 25K iterations.
We use Selective Search [39] to generate object proposals, and instance segmentation results are generated offline by the network proposed in [38], which is trained on the same training images. Upright bounding rectangles of instance segmentation results whose classification scores are greater than 0.3 are taken as pseudo boxes. For data augmentation, we resize the shortest side of images to one of the scales {480, 576, 688, 864, 1200} and cap the longest side to 2000. During training, the scale of an image is randomly selected and a random horizontal flip is applied. Besides, the augmentations mentioned in III-A are also used. In the test stage, an image is augmented with those five scales.

C. ABLATION EXPERIMENTS

1) DATA AUGMENTATION WITH PSEUDO ANNOTATIONS
To verify the effectiveness of data augmentations with pseudo annotations, we train two different models: a basic MIDN with these augmentations (denoted as MIDN+AUG) and our proposed model with only the data augmentations added (denoted as MIDN+REG+AUG). Since ground truth boxes are not available in the basic MIDN, the mask strategy in MIDN+AUG is applied differently, by filling randomly sampled areas with pixel means. The test results are shown in Table.1. MIDN+REG in Table.1 denotes the proposed model trained with only the regression branch added, and MIDN is our baseline, a single MIDN trained with the classification loss.
We can observe that the performance decreases when applying some complex data augmentations to traditional multiple instance detection networks. However, comparing MIDN+REG with MIDN+REG+AUG, the performance gets better after adding annotation-based augmentations. It shows that these complex data augmentations work well with pseudo boxes but deteriorate the final performance when no extra position information is available during training.

2) WEAK SUPERVISION, MIXED SUPERVISION AND FULL SUPERVISION
To compare the strategies of weak supervision, mixed supervision and full supervision for the classification branch, we train three different networks: MIDN+REG, MIDN+REG+MS and FS+REG. The settings of MIDN+REG and MIDN+REG+MS have been described in IV-C1; they correspond to weak supervision and mixed supervision, respectively. For FS+REG, we train a Fast R-CNN [5] with pseudo annotations, representing full supervision.
As shown in Table.1, the mAP improves from 50.5 to 53.5 when adopting weak supervision (MIDN+REG) instead of full supervision (FS+REG), indicating that weak supervision is a better choice given imperfect pseudo annotations. After further using mixed supervision (MIDN+REG+MS), the performance increases by 0.5 mAP, which proves the effectiveness of our mixed-supervision strategy.

3) THE INFLUENCE OF NMS THRESHOLD
We conduct experiments to analyze the influence of the IoU threshold in NMS, which is shown in Fig.6. We can see that after adding the regression branch (MIDN+REG, MIDN+REG+MS, MIDN+REG+AUG and MIDN+REG+AUG+MS), the performance is insensitive to the setting of the IoU threshold, while the performance of the methods without regression (MIDN and MIDN+HREG) fluctuates drastically as the IoU threshold changes. The reason behind this trend is that most proposals are regressed to their corresponding objects, so the IoUs between regressed proposals belonging to the same object become larger, and it is more likely to obtain the same result when using a small IoU threshold in NMS. From another point of view, this also verifies the contribution of the regression branch. To obtain better performance, we set the IoU threshold in NMS to 0.4 in the other experiments.
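For reference, the greedy NMS procedure whose IoU threshold is analyzed above can be sketched as follows; a plain-Python sketch, assuming (x1, y1, x2, y2) boxes, rather than an optimized implementation.

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy non-maximum suppression; returns indices of kept boxes.
    iou_thresh = 0.4 is the value chosen in the paper."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # suppress remaining boxes that overlap the kept box too much
        order = [j for j in order if box_iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

When the regression branch pulls proposals of the same object onto nearly the same box, their pairwise IoUs approach 1, so the kept set barely changes as `iou_thresh` varies, which matches the insensitivity observed in Fig.6.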

4) REGRESSION BRANCH VS. NO REGRESSION
To prove that the regression branch indeed learns how to refine bounding boxes, we need to remove the regression branch from the network. Since the MIDN branch alone can generate valid detection results, we use the output of the MIDN in model MIDN+REG to generate test results, which is denoted as MIDN+HREG. The results are shown in Table.1. We can observe that the setting MIDN+REG outperforms MIDN+HREG by 4.1 mAP, which is a significant improvement. It proves that the regression branch can be effective under the supervision of a few pseudo boxes, although they cannot cover all objects of interest in images and are not accurate enough. We also show some qualitative results of these two evaluation methods in Fig.7, which validate the contribution of the regression branch in an explicit way. It is worth noting that the performance of MIDN+HREG is higher than MIDN. We argue that multi-task learning encodes more position information in the features, which helps the classifier choose tighter boxes.

5) THE ANALYSIS OF SPECIFIC CLASSES
As shown in Table.1, compared with our baseline, the detection performance of some classes, such as bike and tv, becomes worse in EWPA. We can observe that the regression branch is actually effective based on the result of MIDN+REG, while the extra augmentations and mixed supervision lead to a decrease (referring to MIDN+REG+AUG and MIDN+REG+MS). The reason is that inaccurate pseudo boxes can bring noise to classification. In our baseline, the classification ability for these two classes is already good enough, so this noise actually confuses the classifier in EWPA. Extra data augmentations further cause more uncertainty, so that parts of an image with shape or material similar to bike and tv may appear in our detection results as false positives. Some failure cases are shown in Fig.8, which also explains the decrease to some extent.
In addition, the detection performance of some classes, such as cat, horse, person and sheep, improves significantly. In our baseline, as in most traditional WSOD methods, detection results for objects of these classes are easily trapped in discriminative parts, such as a person's face. In EWPA, the regression branch can regress the defective boxes to the entire object. Meanwhile, the annotation-based augmentations enrich the training of the regressor, forcing it to learn more patterns about how to refine boxes. As a result, better detection results are obtained on these classes.

D. COMPARISON WITH STATE-OF-THE-ART
In this subsection, we compare the performance of our method with other state-of-the-art methods. The results on PASCAL VOC 2007 and PASCAL VOC 2012 are shown in Table.2, Table.3, Table.4 and Table.5. On VOC 2007, our model obtains 55.6 mAP on the test set and 72.3 CorLoc on the trainval set. Compared with the state-of-the-art, the CorLoc is improved by 0.3. As shown in Table.3 and Table.5, our model achieves a new state-of-the-art on VOC 2012, matching the best reported mAP (53.6) and improving CorLoc (73.9 versus 73.3). To generate better detection results, many methods train a supervised detector using the results of the weakly-supervised object detector as pseudo ground-truth boxes, or utilize the ensemble detection results of multiple CNN networks. Unlike them, we use only a single model and achieve competitive object detection performance. Some detection results of our method are illustrated in Fig.5. We can observe that our method obtains tighter bounding boxes and localizes objects more correctly instead of only focusing on discriminative parts as our baseline method, i.e., PCL [15], does. However, there are also unsatisfactory cases, some of which are shown in Fig.9. The reason for these failures is that proposals belonging to object parts or multiple objects are neither eliminated nor regressed to the whole object, so they appear in the final result. Considering that the improvement of our method mainly lies in CorLoc, EWPA can generate higher-quality annotations using top-scoring results. So our future work is to mine these high-quality annotations online and use them dynamically during training, which may cover our current failure cases.

V. CONCLUSION
In this paper, we propose a novel detector, EWPA, for weakly-supervised object detection. With the help of pseudo annotations mined from the results of a weakly-supervised instance segmentation method, a multiple instance detection network and a regression branch are trained jointly. Meanwhile, these pseudo boxes are also used to introduce annotation-based data augmentations and to train the classification branch in a mixed-supervision way. Experiments verify the effectiveness of the proposed method, showing that it achieves comparable performance with only a single model.