Recently, many object localization models have shown that incorporating contextual cues can greatly improve accuracy over using appearance features alone. These models have explored different types of contextual sources, but each considers only one level of contextual interaction at a time. What context can contribute to object localization when cues from all levels are integrated simultaneously therefore remains an open question. Moreover, the relative importance of the different contextual levels and appearance features across object classes remains to be explored. Here we introduce a novel framework for multi-class object localization that incorporates different levels of contextual interactions. We study contextual interactions at the pixel, region, and object levels based on three sources of context: semantic, boundary support, and contextual neighborhoods. Our framework learns a single similarity metric from multiple kernels, combining pixel and region interactions with appearance features, and then applies a conditional random field to incorporate object-level interactions. To effectively integrate different types of feature descriptions, we extend large margin nearest neighbor to a novel algorithm that supports multiple kernels. We perform experiments on three challenging image databases: Graz-02, MSRC, and PASCAL VOC 2007. Experimental results show that our model outperforms current state-of-the-art contextual frameworks and reveal the individual contributions of each contextual interaction level and of appearance features, indicating their relative importance for object localization.
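As a rough illustration of what combining several kernels into one similarity measure for nearest-neighbor classification can look like, the sketch below sums kernel-induced squared distances over two feature types and classifies by majority vote among the nearest training points. It is not the algorithm of this work: the kernel weights are fixed by hand rather than learned with a large-margin objective, no conditional random field is involved, and every name in it (rbf_kernel, combined_sq_distance, knn_predict, the toy "appearance" and "context" arrays, the gamma and weight values) is a hypothetical choice made only for this example.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def combined_sq_distance(feats_a, feats_b, gammas, weights):
    """Weighted sum of kernel-induced squared distances, one kernel per feature type.

    ASSUMPTION: the weights are fixed by hand here; they are not learned.
    """
    total = 0.0
    for Xa, Xb, gamma, w in zip(feats_a, feats_b, gammas, weights):
        K_ab = rbf_kernel(Xa, Xb, gamma)
        # For an RBF kernel k(x, x) = 1, so
        # d^2(x, y) = k(x, x) + k(y, y) - 2 k(x, y) = 2 - 2 k(x, y).
        total = total + w * (2.0 - 2.0 * K_ab)
    return total


def knn_predict(train_feats, train_labels, test_feats, gammas, weights, k=3):
    """Majority vote among the k training points nearest in the combined distance."""
    D = combined_sq_distance(test_feats, train_feats, gammas, weights)
    nearest = np.argsort(D, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


# Toy data (hypothetical): two feature types per sample, two classes.
appearance_train = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
context_train = np.array([[1.0], [1.1], [0.9], [-1.0], [-1.1], [-0.9]])
labels_train = np.array([0, 0, 0, 1, 1, 1])

appearance_test = np.array([[0.05, 0.05], [5.05, 5.05]])
context_test = np.array([[1.0], [-1.0]])

pred = knn_predict(
    [appearance_train, context_train],
    labels_train,
    [appearance_test, context_test],
    gammas=[0.5, 0.5],
    weights=[0.5, 0.5],
    k=3,
)
print(pred)
```

Raising the weight of one kernel makes that feature type dominate the neighbor ranking. A metric-learning approach would instead fit such a combination to labeled data rather than set it by hand.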