We propose a novel method for object localization in a weakly supervised framework. Unlike prior work that relies on exhaustive search methods such as sliding windows, we advocate the use of visual attention maps constructed from class-specific visual words. Based on dense SIFT descriptors, these visual words are selected by support vector machines and feature-ranking techniques, so that discriminative information is learned and embedded in them. We further refine the constructed map with Gaussian smoothing and cross bilateral filtering to preserve the local spatial structure of the objects. Very promising localization results are reported on a subset of the Caltech-256 dataset, and our method is also shown to improve state-of-the-art recognition performance under the bag-of-features (BoF) model.
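To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of how a class-specific attention map might be formed: per-keypoint discriminative scores, such as SVM-based feature-ranking weights of the visual words, are accumulated at their image locations and then Gaussian-smoothed. The function name, image size, and keypoint scores below are illustrative assumptions; the cross bilateral filtering refinement is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(shape, keypoints, scores, sigma=4.0):
    """Accumulate per-keypoint discriminative scores (hypothetical
    visual-word weights) into a dense map, then Gaussian-smooth it,
    standing in for the smoothing step described in the abstract."""
    m = np.zeros(shape, dtype=float)
    for (y, x), s in zip(keypoints, scores):
        m[y, x] += s          # deposit the word's score at its location
    return gaussian_filter(m, sigma=sigma)

# Toy example: two discriminative visual words near an object center.
amap = attention_map((64, 64), [(30, 30), (34, 36)], [1.0, 0.8])
peak = np.unravel_index(np.argmax(amap), amap.shape)
```

In a full pipeline, `amap` would be thresholded or its peak region taken as the predicted object location; the smoothing spreads each word's evidence over its neighborhood so isolated responses do not dominate.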