This paper presents a weakly supervised method that simultaneously addresses the object localization and recognition problems. Unlike prior work based on exhaustive search strategies such as sliding windows, we propose to learn category- and image-specific visual words from image collections by extracting discriminative feature information via two types of support vector machines: the standard L2-regularized L1-loss SVM, and its counterpart with L1 regularization and L2 loss. The selected visual words are used to construct visual attention maps, which provide descriptive information for each object category. To preserve local spatial information, we further refine these maps by Gaussian smoothing and cross bilateral filtering, so that both appearance and spatial information can be exploited for visual categorization. Our method is not limited to any specific type of image descriptor, nor to any particular codebook learning or feature encoding technique. In this paper, we conduct preliminary experiments on a subset of the Caltech-256 dataset using bag-of-features (BOF) models with SIFT descriptors. We show that the use of our visual attention maps improves recognition performance, and that the maps selected by the L1-regularized L2-loss SVM yield the best recognition and localization results.
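The visual-word selection step described above can be illustrated with a minimal sketch. The snippet below uses scikit-learn's `LinearSVC` as a stand-in for the L1-regularized L2-loss (squared-hinge) SVM; the codebook size, number of images, and synthetic BOF histograms are all hypothetical and only serve to show how L1 regularization zeroes out non-discriminative visual words, leaving a sparse set of category-specific ones.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_words = 60, 100  # hypothetical: 60 images, 100-word codebook

# Synthetic BOF histograms (one row per image, one column per visual word).
X = rng.random((n_images, n_words))
y = np.repeat([0, 1], n_images // 2)  # two object categories

# Make the first five visual words genuinely discriminative for class 1.
X[y == 1, :5] += 2.0

# L1-regularized L2-loss linear SVM: the L1 penalty drives most weights
# to exactly zero, so the surviving nonzero weights act as a selector.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
clf.fit(X, y)

# Visual words with nonzero weight are the selected, category-specific words.
selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-8)
print(f"{len(selected)} of {n_words} visual words selected:", selected)
```

In a full pipeline, the histogram bins of the selected words would then be back-projected onto the locations of their SIFT keypoints to form the attention map, which is subsequently smoothed (e.g., Gaussian or cross bilateral filtering) to retain spatial coherence.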