A Robust Context-Aware Proposal Refinement Method for Weakly Supervised Object Detection

Supervised object detection models require fully annotated data for training. However, labeling large datasets is very time-consuming, so weakly supervised object detection (WSOD), which needs only image-level labels, is a substitute for fully supervised learning of object detectors. Although many WSOD methods have been proposed to date, their performance is still lower than that of supervised approaches because WSOD is a very challenging task. The major problems with existing WSOD methods are partial object detection and false detections within clusters of objects of the same category. Most WSOD methods follow the multiple instance learning paradigm, which does not guarantee the completeness of detected objects. To address these issues, we propose a three-fold proposal refinement strategy to learn complete instances. We generate class-specific localization maps by fusing class activation maps obtained from fused complementary classification networks. These localization maps are used to amend the proposals detected by the instance classification branch (detection network). Deep reinforcement learning networks, a decisive-agent and a rectifying-agent trained with a policy gradient algorithm, further refine the proposals. The refined bounding boxes are then fed back to the instance classification network. These refinement operations lead to learning complete objects and greatly improve detection performance. Experimental results on the PASCAL VOC2007 and VOC2012 benchmarks show that the proposed WSOD method outperforms state-of-the-art methods.


I. INTRODUCTION
Weakly supervised object detection (WSOD) has attracted enormous attention in the literature because it demands only image-level annotated data for training an object detector. This has been made possible by the development of convolutional neural networks (CNNs) [1] and large-scale datasets [2] with at least image-level annotations. In this paper, we aim to effectively learn and infer whole-object detections trained with coarse image-level labels that indicate only the categories of objects present in an image.
Earlier methods [3]-[5] follow the conventional multiple instance learning (MIL) paradigm for the WSOD task. In the MIL framework, images are treated as bags of positive and negative instances, and the classifier uses these bags, rather than individual instances, as labeled training data. High-confidence proposals can be extracted by applying MIL, which makes it a suitable solution for localizing objects with image-level labels. However, MIL tends to learn the most discriminative part of the target object instead of the complete object [6]. Moreover, MIL imposes certain constraints: a positive bag contains at least one positive instance, while a negative bag contains only negative instances. Another major drawback of MIL is that the most likely positives are predicted by the existing classifier, which can lead to faulty learning in the case of false-positive predictions, since the classifier cannot explicitly identify the true positives in a given image [7].
Notable progress has been made in WSOD with the advancement of CNNs, and many methods now combine MIL and CNNs [8], [9]. Recent studies have revealed that even better WSOD results can be achieved by training the MIL network in a standard end-to-end manner [4], [10], or with a variant of end-to-end training [11], [12]. Motivated by [11], [13], we adopt the latter approach, a variant of end-to-end training. The approaches that integrate CNNs with MIL for WSOD have shown performance improvements by using the CNN as a feature extractor instead of conventional handcrafted features [13].
Although considerable advances have been accomplished in the WSOD literature and existing methods have delivered significant results, the approaches in [6], [13], [14] are less effective at detecting tight boxes that cover the entire object. In this paper, we propose a WSOD framework to tackle the main problems of existing methods, specifically, 1) partial object detection and 2) occluded object detection. This paper is an extended version of our previous work [7]. In particular, a three-fold refinement strategy is presented to amend the detected proposals by leveraging localization maps. The proposed deep network for WSOD consists of three main branches: a classification branch for extracting localization information, a multiple instance classifier (MIC) (detection branch), and a refinement branch. We present a robust proposal refinement module (PRM) that rectifies the proposals to be learned by the object detector network, which is retrained with the instance-level supervision generated by the PRM. This study aims to detect complete and tight bounding boxes around each object instance. We utilize class activation maps (CAMs) to obtain localization maps from the fused complementary network (FuCN) [7], which is a classification network. A CAM highlights the distinctive image areas that a CNN uses to classify an object as an instance of a particular class. Therefore, to cope with the problems of partial and loose detections, we leverage the localization maps from FuCN to refine the object detection outputs, as shown in Figure 1. For each image, localization maps are generated for each object category present, and the generated maps have the same spatial resolution as the input image.
In the PRM, proposals are first refined based on the activated regions within the bounding box with respect to the corresponding localization map: the coordinates of the bounding box are adjusted to fit the activated region (pixels). This step compacts loose proposals. The bounding box is then further fine-tuned using the surrounding region to expand too-tight proposals that do not cover the complete object, thus attaining a complete and tight bounding box for each instance using information from both inside the box and its surroundings. Afterward, the activations of each refined proposal are iteratively erased from the localization map. In the second step, missing detections are inferred if any closed region(s) of activations remain in the localization map after all proposals have been refined. These refinement operations result in learning complete objects and greatly improve detection performance. Concisely, the proposed PRM first refines the detections (phase I) and then investigates detection completeness for all instances in the entire image (phase II).
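The inward part of this refinement can be sketched as a minimal, illustrative function. Here the localization map is assumed to be a normalized H×W array and boxes are pixel-coordinate tuples; the function name `inward_refine` and the activation threshold default are our assumptions, not the paper's API:

```python
import numpy as np

def inward_refine(box, loc_map, delta_a=0.5):
    """Shrink a loose box to the smallest rectangle enclosing the
    activated pixels (values >= delta_a) of the class-specific
    localization map that fall inside the box.

    box: (x1, y1, x2, y2) in pixel coordinates; loc_map: H x W array
    normalized to [0, 1]. Returns the box unchanged if no pixel
    inside it is activated.
    """
    x1, y1, x2, y2 = box
    inside = loc_map[y1:y2 + 1, x1:x2 + 1] >= delta_a
    ys, xs = np.nonzero(inside)
    if ys.size == 0:
        return box
    return (x1 + xs.min(), y1 + ys.min(), x1 + xs.max(), y1 + ys.max())
```

A loose full-image box around a small activated blob would be tightened to the blob's bounding rectangle; the subsequent outward step (not shown) would then expand it by the surrounding context.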
In summary, our contributions are as follows:
1) We present a three-fold refinement procedure after the MIC branch to learn complete instances in the PRM branch. We leverage the localization maps as contextual information for proposal refinement.
2) In the proposed PRM, we first perform refinement within bounding boxes using the activated pixels of the localization region; additional fine-tuning is then achieved by investigating a fixed neighborhood around each bounding box. Lastly, we leverage the connected localization regions to infer missing detections.
3) In phase I of the PRM, we determine the precise association between bounding boxes and localization regions using position-aware, class-specific localization maps, and resolve the conflicting regions encompassed by multiple boxes.
4) We propose a cascade of efficient, light-weight reinforcement learning (RL) networks based on the policy gradient method to further learn complete proposals by training a decisive-agent and a rectifying-agent concurrently. The object detector is then retrained with these refined proposals as instance-level supervision.
5) We evaluate our method on the PASCAL VOC2007 and VOC2012 datasets in terms of average precision (AP) and correct localization (CorLoc) for accuracy, and inference time in frames per second (FPS).
The rest of the paper is organized as follows. Section II is devoted to related work. In Section III we present our proposed method. Section IV presents the experimental results and discussion. Finally, in Section V we conclude the paper.

II. RELATED WORK
The WSOD problem has been investigated for over a decade, yet the performance of WSOD methods is still far from that of fully supervised object detection. The majority of existing WSOD works adopt MIL as the baseline framework [3], [11]. However, MIL is a non-convex optimization problem [11] and also suffers from an erroneous learning process due to predicted false positives. Several studies have proposed to regularize the optimization by reforming the MIL initialization strategy. Cinbis et al. [8] studied a multi-fold MIL strategy that was effective in averting performance collapse in object localization.
Recent approaches combine MIL with CNNs [4], [10], [11], [13]. Bilen et al. [11] proposed an end-to-end framework, the weakly supervised deep detection network (WSDDN), for WSOD. Their method is also an adaptation of the MIL approach. The authors used a set of pre-computed proposals [15] to obtain candidate boxes that may contain objects, extracted features for these proposals, and classified each proposal; a spatial regularizer was employed to improve performance. Sangineto et al. [16] proposed a self-paced learning approach trained with Fast R-CNN [17]. During training, the same network at different progression stages is used to predict the object localization of positive samples, and at each stage a subset of images with the most reliable pseudo-ground-truth is selected. Online instance classifier refinement (OICR) [4] refines the instances in a fully supervised manner on the results of WSDDN [11].
Some methods [18], [19] aim at a proposal-free framework by utilizing deep features. Tang et al. [18] proposed a two-stage region proposal network for WSOD. The authors focused on proposal generation within an end-to-end framework by exploiting deep feature maps: the hidden object location information produced by early CNN layers is used to generate proposals, which are then refined by a region-based CNN classifier. Shen et al. [19] proposed an end-to-end WSOD network with a generative adversarial learning approach, using the single shot multibox detector (SSD) [20] as the baseline detector. Cheng et al. [21] merged selective search [15] with a gradient-weighted CAM-based technique to produce proposals that better enclose whole objects. An adversarial erasing (AE) approach was proposed by Wei et al. [22] to localize integral object regions; the authors trained accompanying classification networks on input images whose discriminative object regions were partially erased. Zhang et al. [23] employed an adversarial learning approach motivated by [22], training two classifiers to learn distinct features, which boosts object localization performance. We adapt [23] to learn integral object localization cues intended for proposal refinement. Jie et al. [24] proposed a self-taught learning approach that learns object location evidence to train a detector, which is progressively taught to localize positive samples.
Tang et al. [4] presented a multiple instance classifier approach to learn the features of the whole object iteratively, performing instance refinement over multiple supervised stages. However, in their refinement procedure, the neighbors (by intersection-over-union (IoU)) of the highest-scoring proposal are considered, or graphs of top-ranking proposals are used to form clusters, which also contain partial object proposals. This approach can result in falsely learning objects from their discriminative parts and is ineffective at learning whole-object detection; hence, the method does not guarantee prediction of the whole object. Wei et al. [6] proposed a tight box mining strategy with segmentation context for WSOD, adapting OICR [4] as their detection branch. This method exploits the per-pixel scores of segmentation maps to evaluate proposals, discard boxes covering only object parts, and obtain high-quality boxes for learning the instance classifier; their network has an additional segmentation branch. Different from [6], our WSOD framework directly refines the proposals from localization maps to reduce computation. In [6], the segmentation branch is trained with pseudo-supervision of object localization acquired from the classification network. However, a single classification network learns only the discriminative features of an object category and cannot provide good localization cues for supervising the segmentation network; it can be trapped in local minima, and consequently the partial object detection problem persists in this method as well. Shen et al. [25] studied multi-task learning for WSOD, treating object detection and semantic segmentation as a joint learning problem to overcome the failure patterns of segmentation and detection that are typically encountered in other MIL-based self-enforcement methods [4], [8], [26] trained with single-task learning.
Li et al. [27] studied WSOD jointly with a segmentation task in a collaborative loop, training each task under the supervision of the other. Lately, a spatial likelihood voting method was proposed by Chen et al. [28] to converge proposal localization for object detection with image-level supervision using multi-task learning. Recently, Zhang et al. [14] proposed a region-searching paradigm for WSOD with a reinforcement learning approach under weak supervision, extracting region correspondence maps to use as pseudo-target regions for training the agent. A teacher-student learning approach through multiple instance self-training was used by Ren et al. [29]: the student network is trained with pseudo-labels from the teacher network, and a DropBlock is used to zero out the discriminative parts of objects, an idea similar to [7].
These methods [27]-[29] achieve suitable performance; however, they still suffer from missing detections in the case of occluded objects and wrong detections due to object clusters. Since a training image is decomposed into thousands of proposals, each approximately correct training instance is flooded with many incorrect ones. Such weak supervision results in inaccurate predictions, as it inevitably involves a great deal of uncertainty due to noisy training instances. Therefore, to address these limitations and improve the robustness of WSOD, we rectify the proposals predicted by the MIC through class-specific localization maps and retrain the object detection network by optimizing an instance-level objective function to learn the refined instances.

III. PROPOSED METHOD
We add a proposal refinement module to the WSOD network of [7] to efficiently tackle the problems of partial object detection, loose detections, and object-cluster detection. In particular, our method not only generates a complete and tight box by revising each predicted bounding box but also infers missing instances (if any) from the corresponding class-specific localization maps. The same procedure is applied to all boxes predicted by the MIC branch. The overall architecture of the proposed approach is shown in Figure 2.
VGG [30] is used as the backbone, which branches into three parts: FuCN, MIC (object detection), and PRM. We utilize the high-level features of classification networks learned in a complementary manner to make the WSOD network learn whole-object representations. We fuse the CAMs from two complementary classifier networks, computed with Score-CAM [31], to generate a localization map that covers the whole activated object region. These localization cues from the complementary classification network are used in the PRM. The spatial resolution of the localization maps is made identical to the input image by upsampling (interpolation). In the PRM, we determine the pairs of bounding boxes and the corresponding activated regions in the class-specific localization map from the coordinates of the bounding boxes. First, we perform refinement of the inner region of each bounding box with respect to the activated pixels in the associated localization region. Then we again exploit the localization maps for contextual information within a fixed region surrounding the box, obtained by scaling it with a factor as in [6]. For further refinement, we propose policy networks that classify detections as complete or incomplete and discover undetected regions in the latter case. These doubly refined proposals are learned by the detector. Figure 3 demonstrates the generated CAMs overlaid on the detection output from FuCN, together with the bounding boxes refined by the PRM using class activation values greater than δ_a (a pre-defined normalized threshold). In the subsequent sections, the three main branches of the proposed WSOD method are described in detail. Table 1 lists the notation used in this paper.

A. FUSED COMPLEMENTARY NETWORK (FuCN)
FuCN aims to learn whole-object features as well as localization through a discriminative but complementary feature learning approach inspired by [22] and [23]. FuCN includes two complementary classifiers trained with distinct inputs. In particular, image features are extracted by a pretrained VGG [30]. Then, we add four additional convolutional layers (convs) followed by a fully connected (FC) layer in both classifiers. Feature erasing is performed by thresholding the discriminative features of network A on its heatmaps; these regions are then removed from the feature maps of the input image by replacing them with zeros. Network B is thereby encouraged to learn complementary features of the target object from the erased feature maps. The CAMs from both classifiers are extracted and fused to obtain complete localization maps of objects for each class separately. We upsample these localization maps to the image resolution by interpolation. The network is optimized during training with a binary cross-entropy (BCE) loss.
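The feature-erasing step can be illustrated with a small sketch. The function name `erase_features` and the erasing threshold are our assumptions for illustration; the paper thresholds network A's heatmap and zeros the corresponding spatial locations so network B must rely on the remaining, complementary evidence:

```python
import numpy as np

def erase_features(feat_maps, heatmap, thresh=0.7):
    """Zero out the spatial locations of `feat_maps` (C x H x W) where
    network A's normalized heatmap (H x W) exceeds `thresh`, so a
    second classifier is forced to learn complementary,
    non-discriminative features. `thresh` is an assumed value."""
    mask = (heatmap <= thresh).astype(feat_maps.dtype)  # keep low-activation areas
    return feat_maps * mask[None, :, :]                 # broadcast over channels
```

The same mask is applied to every channel, so the discriminative region is invisible to network B at all feature depths.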

B. DETECTION NETWORK (MIC)
Instead of focusing only on discriminative features, which results in partial object detection, the object detection network concatenates its features with complementary features from FuCN [7] to learn whole-object features. The object detection network has two further branches for classification and detection, as in [11]. A set of about 2000 region proposals from selective search [15] is employed in both the training and test phases. Single-level spatial pyramid pooling (SPP) [32] is then applied to produce fixed-size feature maps for these differently sized proposals. The discriminative feature learning approach used in [7] drives the network to learn whole objects and results in improved detections. For image-level classification, we use class-specific max-pooling of region scores from the classification branch only.

C. PROPOSAL REFINEMENT MODULE (PRM)
The PRM is proposed as a rectification scheme aimed at learning complete and tight object detections. Although whole objects can be learned by the object detector as in [7] through concatenated features from the FuCN branch, in some cases this still results in loose bounding boxes; further refinement is therefore required to obtain tight boxes. We use spatial information within the input image to achieve refined, tight, and complete proposals with only image-level supervision. The PRM consists of two refinement phases: 1) in phase I, pixel activations from the localization maps are leveraged to rectify the detected bounding boxes; 2) phase II comprises two policy networks trained with reinforcement learning for proposal refinement. The overall flow of the PRM with phase I and phase II rectification is demonstrated in Figure 4. Each module is described in detail in the following subsections.

1) PHASE I: PROPOSAL INWARD AND OUTWARD CAM BASED REFINEMENT
The spatial resolution of the localization maps obtained from FuCN is made identical to the input image by upsampling (interpolation). We select the pixels of the class-specific localization maps with values greater than or equal to δ_a as foreground. Inferring the precise association between bounding boxes and localization regions is critical, especially when an image contains multiple instances of the same object. The association is based on the positions of the bounding box coordinates and the corresponding positions in the localization map. Bounding box coordinates are transformed according to the localization map activations within the box and within a fixed proportion of its surrounding context (s_c). Algorithm 1 and Figure 5 show the proposal rectification procedure.
As a first step, we check whether each activated pixel (a_c) in the localization map (L_c) belongs to a single bounding box or to multiple bounding boxes. Pixels that lie in multiple bounding boxes are called 'conflicting pixels'. We separate the conflicting pixels from the non-conflicting ones and calculate the area of each bounding box containing conflicting pixels. Then, we assign the conflicting pixels to the smallest such bounding box to ensure compactness. When a pixel is assigned to a particular bounding box, its activation is set to zero in the localization map for the other proposals during inward refinement, so that no incongruity remains when associating the activated pixels of the map with bounding boxes. Inward refinement adjusts a region proposal according to the closed set of activated pixels in the corresponding localization map, using image coordinates. Lastly, outward refinement is performed over the fixed surrounding context of each box to expand too-tight proposals that do not cover the complete object.
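The conflicting-pixel assignment can be sketched as follows. This is an illustrative version under the assumption that pixels and boxes are given as coordinate tuples; the function name `assign_conflicting_pixels` is ours:

```python
def assign_conflicting_pixels(pixels, boxes):
    """Assign each activated pixel shared by several boxes to the
    smallest containing box, favoring compact proposals.

    pixels: list of (x, y); boxes: list of (x1, y1, x2, y2).
    Returns {pixel: box_index}; pixels inside no box are skipped.
    """
    def area(b):
        return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)

    owner = {}
    for (x, y) in pixels:
        containing = [i for i, (x1, y1, x2, y2) in enumerate(boxes)
                      if x1 <= x <= x2 and y1 <= y <= y2]
        if containing:
            owner[(x, y)] = min(containing, key=lambda i: area(boxes[i]))
    return owner
```

Once a pixel is owned by one box, its activation would be zeroed for all other proposals, as described above.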

2) PHASE II: REINFORCEMENT LEARNING NETWORKS TOWARDS COMPLETE REGIONS
We design a cascade of policy networks to learn whole objects under weakly supervised detection settings. Since there are no ground-truth bounding boxes in WSOD, bounding box regression cannot be applied. We therefore train RL agents to further rectify the proposals in the weakly supervised setting. To the best of our knowledge, this is the first effort to apply RL-based policy algorithms to learn actions for complete or incomplete proposals and then further rectify proposals with contextual connections for WSOD. The overall procedure of phase II rectification is summarized in Algorithm 2. The decisive and rectifying agents for WSOD are described in the following subsections.

a: DECISIVE NETWORK
The objective of the decisive-agent is to label input images as complete or incomplete object detections. We mask out all detected bounding box regions with zeros in the corresponding input image feature maps. The decisive-agent is designed to discover undetected object regions in the masked-out image to further improve the detections of phase I. If the decision of the decisive-agent for an image is incomplete, the rectifying network decides whether the discovered region is a new bounding box (i.e., a detection missed by the preceding procedure) or part of an existing proposal.
A policy gradient reinforcement strategy is adopted to learn a policy function that maximizes the expected reward. As input to the decisive network, we mask out all bounding box regions obtained from phase I with zeros in the input image feature maps. The policy network predicts the likelihoods of the complete and incomplete actions for the masked-out image (I_erase), corresponding to whether the detections are complete or some objects or parts remain undetected.
The state is the input image features with the position-aware masked-out feature maps of the regions detected in phase I. The action for the current state is the binary decision {d ∈ (incomplete, complete)}. The decisive-agent inspects whether the masked-out image still contains any undetected object(s) (incomplete detections) or whether no activated regions remain (complete detections). The policy gradient method directly models and optimizes the policy π_θ(a|s) (a is the action and s is the state) with respect to the parameters θ to learn actions for the states.
The proposed decisive network consists of 3 FC layers followed by a softmax layer, as shown in Figure 2. We use vectors of size 256+1 (the +1 is an additional input for the action history) and 128 for the first FC and hidden FC layers, respectively. To avoid catastrophic forgetting, we input the agent's experience replay [(s_1, a_1, r_1), ..., (s_t, a_t, r_t)] to the last FC layer.
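A minimal numpy sketch of this architecture's forward pass follows. The layer sizes come from the text (256 features + 1 action-history scalar, hidden size 128, 2 actions); the weight initialization, class name `DecisiveNet`, and the omission of the experience-replay input are our simplifications:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

class DecisiveNet:
    """Three FC layers followed by a softmax over the two actions
    {incomplete, complete}. Illustrative sketch, not the trained model."""
    def __init__(self, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(0, 0.05, (128, 257))  # 256 features + action history
        self.W2 = rng.normal(0, 0.05, (128, 128))
        self.W3 = rng.normal(0, 0.05, (2, 128))    # 2 actions

    def forward(self, state_feat, action_hist):
        x = np.concatenate([state_feat, [action_hist]])
        h1 = np.maximum(0.0, self.W1 @ x)          # ReLU
        h2 = np.maximum(0.0, self.W2 @ h1)
        return softmax(self.W3 @ h2)               # action probabilities
```

The softmax output gives the two action likelihoods from which the agent samples during training and takes the argmax at inference.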
REWARD FUNCTION: In RL, designing an appropriate reward function for the specific decision problem is of great importance. The reward function {r ∈ (−1, +1)} for the decisive-agent is defined based on the confidence score (s_erase) of the masked-out image, taken as the maximum over both classifiers (A and B). The agent learns the correct decisions from the rewards defined in (1) and (2).

b: RECTIFYING NETWORK
We propose a context-aware approach by training an RL agent to amend the region proposals (B_c from phase I) based on the contextual connections of bounding boxes with the CAMs (L_c). The proposals neighboring the activated regions (a_c1, a_c2, ..., a_cn) in the masked-out image are investigated, and each bounding box (b_c) is modified according to the decision of the rectifying-agent. The architectural design of the rectifying network is the same as that of the decisive network (Figure 2). Closed activated regions (a_c) in the localization map (L_c) are computed as activation values above the threshold (δ_a) with 8-connected neighbors. We define criteria to decide about the activated regions (a_c) discovered by the decisive network. This involves two considerations: 1) is a_c a missing detection or part of an already detected proposal (from phase I), and 2) exploring the neighboring context of each a_c (a_c ∈ L_c) to determine an action for it.
We construct a temporary bounding box (b_t) around each a_c as a closed foreground region based on the thresholded activated locations. Then we perform a distance measure between each b_t and b_c. As a first step, we calculate the IoU between b_c and b_ct (b_ct is the temporary, hypothetical bounding box generated by extending b_c to b_t). If the IoU between b_c and b_ct is greater than or equal to δ_IoU, a further pixel-based distance comparison is performed.
Additional spatial information is fed to the network by computing spatial relations within the image through the neighboring context of the activated areas in L_c. We calculate the connectivity of each discovered a_c with the activated regions of the detected proposals (B_c) as a distance measure between the two activated regions. The pixel distance between the activated regions within bounding boxes b_t and b_c is computed from the most upper-left, upper-right, bottom-left, and bottom-right pixel locations of the activated regions: the distance from each outermost pixel of one region to all four outermost pixels of the neighboring region is computed to estimate the nearest region.
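This corner-based distance can be sketched as follows, assuming the activated regions are given as boolean masks; the function names `corner_pixels` and `region_distance` are our assumptions for illustration:

```python
import numpy as np

def corner_pixels(region_mask):
    """Upper-left, upper-right, bottom-left, bottom-right pixels of an
    activated region given as a boolean H x W mask."""
    ys, xs = np.nonzero(region_mask)
    top, bottom = ys.min(), ys.max()
    # outermost activated pixel on each extreme row
    return [(xs[ys == top].min(), top), (xs[ys == top].max(), top),
            (xs[ys == bottom].min(), bottom), (xs[ys == bottom].max(), bottom)]

def region_distance(mask_a, mask_b):
    """Smallest Euclidean distance between the four outermost pixels of
    two activated regions, used to pick the nearest neighboring box."""
    ca, cb = corner_pixels(mask_a), corner_pixels(mask_b)
    return min(np.hypot(ax - bx, ay - by)
               for (ax, ay) in ca for (bx, by) in cb)
```

The box whose activated region yields the smallest such distance to b_t is treated as its nearest neighbor.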
The neighboring bounding box with the least distance to b_t is selected for modification with b_c. However, if the IoU between b_c and all b_ct is less than δ_IoU, the temporary box is considered a separate new bounding box, i.e., a missed detection. For training the rectifying-agent, these two levels of contextual relationship between the bounding boxes b_t and b_c are fed to the network.
Color smoothness information between each b_t and its neighboring proposals is also fed to the rectifying network; the degree of color similarity between two regions further guides the agent. To compute smoothness over color image regions, the Laplacian operator (second derivative along pixels) is applied to each channel, and the results are averaged over the three channels. The smoothness information of each region proposal is fed to the network.
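The per-region smoothness measure can be sketched with the standard 4-neighbor discrete Laplacian; how exactly the per-pixel responses are aggregated is not specified in the text, so the mean-absolute aggregation and the function name `region_smoothness` below are our assumptions:

```python
import numpy as np

def region_smoothness(patch):
    """Mean absolute Laplacian response of an H x W x 3 color patch,
    averaged over the three channels; lower values mean smoother
    regions."""
    scores = []
    for ch in range(3):
        p = patch[:, :, ch].astype(float)
        lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
               np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
        scores.append(np.abs(lap[1:-1, 1:-1]).mean())  # ignore wrapped border
    return float(np.mean(scores))
```

A flat color patch scores 0, while textured or edge-heavy patches score higher, so similar smoothness values hint that two regions belong to the same surface.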
We feed the rectifying network with the following state information: 1) convolutional features, 2) fused CAMs (L_c) from both classifiers, 3) rectified proposals (B_c) with their scores, and 4) contextual details. The action history, along with the α (tolerance coefficient) history for the entire image including all bounding box rectification actions, is also fed to the network. Transformations of the bounding box(es) are applied according to the action of the rectifying-agent with respect to L_c. Figure 6 presents the procedure of the rectification network.
REWARD FUNCTION: We define the reward function for the rectifying-agent by computing an overall score (h) for the amended boxes, with a higher weight on the entropy score than on the category confidence score. The score h(b'_c) of the amended bounding box b'_c (phase II) is compared to the score h(b_c) of the same bounding box before amendment (phase I). If a large gain in the score is measured, the decision of the rectifying-agent is considered wrong and a negative reward is given, and vice versa; a small gain in h(b'_c) is considered tolerable. The rewards for the rectifying-agent are defined in (3) and (4). Here, α is the tolerance coefficient learned by the rectifying-agent to optimize network performance. The value of α is updated according to the learning rate, based on the reward for the current action in the given state, the current α value, and the α history together with the action history. The rectifying-agent is therefore trained with reinforcement learning and a self-learning approach at the same time.

D. TRAINING PROPOSED WSOD FRAMEWORK
The overall training of the proposed WSOD method is performed epoch-wise. Once the network has been trained for all epochs, including all episodes of the policy networks, the whole network is trained again with continuous rewards for each image regardless of object category groups. After completing the training for each epoch, the MIC network is retrained to learn whole instances.

1) TRAINING DETECTION NETWORK (MIC)
After feature extraction from pretrained VGG [30], the network is branched into three further networks, A, B, and C. Network A and network B extract discriminative features which are then concatenated for input to network C (MIC network).
For N multi-label images in the trainval set, the label vector for the i-th image is y_i = [y_i1, y_i2, ..., y_iC], where y_ij = 1 (j = 1, ..., C) if a j-th class object is present in the image and y_ij = 0 otherwise, and C is the total number of categories. For the detection network, image-level classification scores are computed by class-specific max-pooling over all regions R = (r_1, r_2, ..., r_T), where T is the total number of proposals. For the i-th image, the j-th class score is the maximum of the j-th class softmax probabilities over all regions, p_ij = max(p_j(r_1), p_j(r_2), ..., p_j(r_T)). Thus, the prediction vector for the i-th image is p_i = [p_i1, p_i2, ..., p_iC].
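The class-specific max-pooling above amounts to one reduction over the proposal axis; a minimal sketch, with the function name `image_level_scores` as our assumption:

```python
import numpy as np

def image_level_scores(region_probs):
    """Class-specific max-pooling over proposals: region_probs is a
    T x C array of per-proposal softmax class probabilities; returns
    the C-dim image-level prediction p_i with p_ij = max_t p_j(r_t)."""
    return region_probs.max(axis=0)
```

The resulting vector p_i is what the BCE loss compares against the image-level label vector y_i.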
We use the BCE loss function with stochastic gradient descent (momentum 0.9, weight decay 5×10^-4) for optimization. The overall loss function of the network is defined in (5) as L = L_A + L_B + L_C, where L is the loss function of the WSOD network [7], L_A and L_B are the loss functions of networks A and B, respectively, and L_C is the loss function of the MIC network. In (7), s is an individual state in the probability distribution (s ∈ S), where S is the set of discrete states.

2) TRAINING POLICY NETWORKS
The training settings for the decisive and rectifying networks are the same except for the reward functions. We set the episode length to the number of images in the entire subset of class-specific images (e = |I_c|) for training on the PASCAL VOC2007 dataset, whereas for PASCAL VOC2012 the episode length is half of each class-specific image subset (e = |I_c|/2). For the rectifying-agent, the episode length is dynamic, depending on the number of instances discovered in each image. For training the policy networks, random images of the same category are used within an episode. A softmax over action preferences is used for policy parameterization, so that the actions with the highest preference in each state are given the highest probability of being selected [35]. Furthermore, parameterizing the policy with a softmax over action preferences in a discrete action space enables the approximate policy to approach a deterministic policy, which is important for the WSOD scenario.
We use gradient ascent with the RMSprop [36] optimizer for policy optimization, with the same learning rate as in the detection network. The objective function $J(\theta)$ for each episode, with parameters $\theta$, current state $s_t$ and action $a_t$ at time $t$, and discount factor $\gamma = 1.0$, is defined in (8) using the Monte-Carlo policy gradient (REINFORCE) method. We use a dropout rate of 0.4 on the last hidden layer during training as regularization to avoid overfitting. For each epoch, 20 separate episodes are executed, one corresponding to each PASCAL VOC category.
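A minimal Monte-Carlo (REINFORCE) update for a tabular softmax policy is sketched below; the tabular preference matrix and plain gradient ascent stand in for the paper's policy networks and RMSprop, so this is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def reinforce_update(theta, episode, lr=1e-3, gamma=1.0):
    """One REINFORCE update for a softmax policy.
    theta: (n_states, n_actions) action-preference table.
    episode: list of (state, action, reward) tuples.
    Performs gradient ascent on J(theta) and returns theta."""
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        # grad of log pi(a|s) w.r.t. preferences: one_hot(a) - probs
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += lr * (gamma ** t) * G * grad_log_pi
    return theta

theta = np.zeros((2, 3))
theta = reinforce_update(theta, [(0, 1, 1.0), (1, 2, 1.0)])
```

Rewarded actions have their preferences pushed up in the states where they were taken, which increases their selection probability under the softmax policy.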

A. BENCHMARK DATA
The proposed method is evaluated on the PASCAL VOC2007 and VOC2012 datasets with 20 object categories, which are widely used as benchmarks for object detection. We use the trainval sets, with 5011 images for VOC2007 and 11540 images for VOC2012, for training; only image-level labels are used. The test sets, with 4952 images for VOC2007 and 10991 images for VOC2012, are used to evaluate the proposed WSOD framework.

B. PERFORMANCE EVALUATION METRICS
We use two performance measures for detection. The first metric is average precision (AP) at 0.5 IoU between detected boxes and ground truths, together with the mean AP (mAP). Furthermore, we use CorLoc to evaluate localization accuracy. Both AP and CorLoc measure the quantitative performance of the object detector based on the PASCAL criterion with IoU ≥ 0.5.
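The PASCAL matching criterion underlying both metrics can be sketched as follows (function names and box coordinates are illustrative assumptions; boxes are `(x1, y1, x2, y2)`):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def is_correct(detection, ground_truth, threshold=0.5):
    """PASCAL criterion: a detection counts as correct when its IoU
    with a ground-truth box is at least the threshold (0.5 here)."""
    return iou(detection, ground_truth) >= threshold
```

A detection that covers only part of an object, or that spans a whole cluster of objects, typically fails this IoU ≥ 0.5 test, which is why partial and cluster detections hurt both AP and CorLoc.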

C. IMPLEMENTATION DETAILS
VGG16 [30] trained on ImageNet [2] is used as the backbone network. In our experiments, FuCN is trained with two complementary networks using the same implementation settings as in [7]. We employ the pretrained FuCN to train the proposed network. Score-CAM [31] is used for computing class activations. The learning rate is set to $10^{-3}$ for the first 30 epochs and then to $10^{-4}$ for the remaining 40 epochs. The number of episodes for training the policy networks is the same as the number of epochs. We initialize $\alpha$ to 1.3 for optimizing the rectifying-agent.
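The step learning-rate schedule described above can be written as a small helper (the function name is an assumption used for illustration):

```python
def learning_rate(epoch):
    """Step schedule from the experiments: 1e-3 for the first 30 epochs,
    then 1e-4 for the remaining 40 epochs (70 epochs in total)."""
    return 1e-3 if epoch < 30 else 1e-4
```

Dropping the rate by a factor of ten after the initial epochs is a common way to let the detector first learn coarse features quickly and then fine-tune stably.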
Selecting an optimal threshold is always crucial. Depending on the objective of a specific task at a particular stage, we use the threshold values that yield the best performance for that task in our experiments. In our experiments, $\delta_a$ is 0.5 for phase I refinement and 0.7 for phase II refinement. We use a more restrictive activation threshold for phase II since we aim to extract highly confident but mistakenly unexploited regions; low activation values in phase II refinement can cause ambiguity and may result in incorrect detections. The surrounding context ($s_c$) is set to 1.2 [6] during the refinement of proposals in phase I. The threshold $\delta_{IoU}$ is set to 0.75 in the rectifying network: a loose $\delta_{IoU}$ can harm already correctly detected regions instead of improving the detections, while a too tight $\delta_{IoU}$ may fail to rectify a region even if its true part lies in an activated neighboring region. The overall score $h$ is computed with a weight of 0.7 on the entropy score and 0.3 on the category confidence score. All experiments are conducted on four NVIDIA GeForce TITAN Xp GPUs in parallel.
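Two of the hyperparameters above can be made concrete with a short sketch: enlarging a proposal by the surrounding-context factor $s_c = 1.2$, and forming the weighted overall score $h$. Function names and the example box are illustrative assumptions.

```python
def enlarge_box(box, s_c=1.2):
    """Scale an (x1, y1, x2, y2) box about its centre by the
    surrounding-context factor s_c (1.2 in phase I refinement)."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    hw = (box[2] - box[0]) * s_c / 2.0
    hh = (box[3] - box[1]) * s_c / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def overall_score(entropy_score, confidence_score):
    """Overall score h: weight 0.7 on the entropy score and
    0.3 on the category confidence score."""
    return 0.7 * entropy_score + 0.3 * confidence_score

box = enlarge_box((10, 10, 20, 20))  # grows the box symmetrically
h = overall_score(0.8, 0.6)
```

Keeping $s_c$ fixed bounds how far the context region can reach, which limits the risk of absorbing a neighboring same-category instance.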

D. COMPARISON WITH STATE-OF-THE-ART METHODS
We compare the proposed WSOD method with state-of-the-art methods on the PASCAL VOC2007 and VOC2012 datasets. We report experimental results for both PRM phases (PI and PII) separately, together with the retrained MIC. Table 2 and Table 3 compare the results on the PASCAL VOC2007 dataset in terms of AP and mean AP on the test set and CorLoc on the trainval set, respectively.
It can be observed that our method with each module outperforms all compared methods on both the test and trainval sets. In Table 2, among all compared methods, [29] has the highest mAP (54.9%); our method (MIC+PI+PII) improves substantially on [29], with a 6.7% mAP gain. In Table 3, Chen et al. [28] have the highest mean CorLoc (71.0%) among the compared methods; the proposed method surpasses [28] with a 6.0% gain in mean CorLoc. Table 4 and Table 5 present the results of the proposed and compared methods on the PASCAL VOC2012 test and trainval sets in terms of mAP and CorLoc, respectively. A performance pattern similar to PASCAL VOC2007 is observed for PASCAL VOC2012 on both the test and trainval sets. All modules of the proposed method surpass all compared methods with a substantial performance boost.
Our method has several advantages over previous state-of-the-art methods. Unlike [28], we investigate the outside neighboring region of the detected box at a fixed scale, which prevents our method from erroneously including activated regions of the same category that belong to other instances. The SLV approach [28] wrongly detects a whole cluster of same-category objects as a single detection, leaving individual instances undetected, and the partial object detection problem still occurs in [29] and [28]. Furthermore, most WSOD methods show degraded performance on classes with relatively small objects, particularly the ''bottle'' class; it is also observed that the ''chair'' class has lower AP across all compared methods. The results in Table 2 and Table 3 show that our method achieves significantly improved AP for these categories compared to state-of-the-art methods. Qualitative results (Section IV: F) further verify that small object detection, which is mostly missed by many other WSOD methods [21], [28], [29], is significantly improved by phase II refinement.
Overall, the proposed PRM and MIC retraining result in enhanced WSOD performance. CAM-based dual-step rectification and the investigation of complete versus incomplete detections supervise the WSOD network to learn high-quality, refined, tight, and complete regions. Furthermore, the lightweight RL policy networks make an important contribution to detection performance by discovering additional missed instances.

E. INFERENCE TIME
We report the inference time of our method incrementally for each module with the VGG16 [30] backbone network in Table 6. The inference time of our MIC branch is the same as [7], with much improved detections. Since PRM involves a series of refinement stages, on PASCAL VOC2007 the inference speed for phase I, including the MIC branch, is 0.70 FPS, and for phase II, including MIC and phase I, 0.67 FPS. On PASCAL VOC2012, phase I with MIC runs at 0.82 FPS, and phase II with MIC and phase I at 0.79 FPS. Moreover, it is practical to use MIC alone during inference, as it is trained with supervision from PRM and attains high accuracy with reasonable inference speed (0.72 FPS on PASCAL VOC2007 and 0.84 FPS on VOC2012). Figure 7 shows the final qualitative detection and localization results of the proposed method. Our method effectively overcomes the problems of 1) partial object detection, 2) detection of same-category object clusters as a single object, 3) loose detections, and 4) missing detections. However, there are certain failure cases, as shown in Figure 8. These failures mainly arise from images with less effective initial proposals, due to high similarity in texture or color between neighboring instances, or due to occlusions.

F. QUALITATIVE RESULTS
Duplicate detections of the same object are effectively handled by our method, since the phase I refinement procedure avoids double detections by assigning a particular activated region to only a single bounding box. Very few false detections are observed, in the form of object-cluster or partial object detections. Such failure cases could be overcome by employing a more sophisticated proposal generator in the first place. Moreover, further qualitative results are presented in Figure 4 and Figure 6.

V. CONCLUSION
This paper proposed an effective framework for WSOD that achieves exhaustive refinement after instance-level classification. Two main rectification stages are proposed: phase I rectifies the detected regions to obtain close-fitting and whole object bounding boxes, while phase II aims to detect missed instances, which are mostly small objects. Since the majority of WSOD methods suffer from partial object detection, object-cluster detection, and loose detections, this study has effectively accomplished the aim of tight and complete object detections in weakly supervised settings. Qualitative results demonstrate high-quality detection output by the proposed method on the PASCAL VOC2007 and VOC2012 benchmarks, and the quantitative results correspondingly show enhanced performance compared to state-of-the-art WSOD methods.