Abstract:
Weakly supervised object localization (WSOL) aims to locate objects with only image-level labels. Previous works mainly follow the framework of class activation map (CAM), which discovers objects by estimating the contribution of each pixel position to the category prediction. However, most of them overlook the pixel-level spatial and semantic contextual correlation, resulting in: 1) limited activation ranges that only highlight the most discriminative parts rather than the entire object and 2) low activation values for some foreground parts, especially regions near the boundary between foreground and background. To alleviate these issues, we propose an activation diffusion network (ADNet) to progressively refine both the range and value of activations on the localization map. Specifically, a context propagation module is first developed to learn the top-down spatial dependency between adjacent feature maps, which helps back-propagate the activation from the discriminative part to its surroundings for more complete objects. Then, a diffusion probability distillation module (DPDM) is proposed, which transfers the pixel-level semantic correlation emerging in the image generation process to the localization map generation in a teacher-student learning manner. This helps boost the value of the activated foreground region and stimulates the value of neighboring inactivated foreground positions to sharpen the object boundary. Experiments on various datasets and backbones demonstrate the superiority of our ADNet over state-of-the-art (SOTA) methods in object localization and segmentation, yielding 82.2% and 62.2% Top-1 Loc on the Caltech-UCSD Birds-200-2011 (CUB) and ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) datasets, respectively, and 76.6% pixel average precision (PxAP) on the OpenImages dataset. Qualitative results also show that our method achieves a more complete and consistent activation covering the whole object.
Published in: IEEE Transactions on Neural Networks and Learning Systems (Early Access)
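
For context, a minimal sketch of the standard CAM baseline that the abstract builds on: the localization map for a class is a weighted sum of the final convolutional feature maps, using the classifier weights for that class. This illustrates only the generic CAM framework, not the proposed context propagation or DPDM modules (whose details are not given in the abstract); the backbone, layer, and function names below are illustrative assumptions.

    # Sketch of a standard class activation map (CAM), assuming a ResNet-50
    # backbone where model.fc holds the linear classifier over pooled features.
    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet50(weights="IMAGENET1K_V1").eval()

    features = {}
    def hook(_module, _inp, out):
        features["conv"] = out  # (1, K, H, W) feature maps before global pooling

    model.layer4.register_forward_hook(hook)

    @torch.no_grad()
    def class_activation_map(image, class_idx=None):
        """Return a CAM normalized to [0, 1] at the input resolution."""
        logits = model(image)                       # (1, num_classes)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        fmap = features["conv"]                     # (1, K, H, W)
        weights = model.fc.weight[class_idx]        # (K,) classifier weights for the class
        cam = torch.einsum("k,bkhw->bhw", weights, fmap)  # weighted sum over channels
        cam = F.relu(cam)                           # keep positive evidence only
        cam = F.interpolate(cam[None], size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam, class_idx

    # Example usage (random input as a placeholder):
    # cam, cls = class_activation_map(torch.randn(1, 3, 224, 224))

Thresholding such a map typically highlights only the most discriminative parts, which is the limitation the proposed ADNet addresses by diffusing activation toward the full object extent.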