Category-level object recognition has improved significantly in the last few years, but machine performance remains unsatisfactory for most real-world applications. We believe this gap may be bridged using additional depth information obtained from range imaging, which was recently used to overcome similar problems in body shape interpretation. This paper presents a system that successfully fuses visual and range imaging for object category classification. We explore fusion at multiple levels: using depth as an attention mechanism, high-level fusion at the classifier level, and low-level fusion of local descriptors, and show that each mechanism makes a unique contribution to performance. For low-level fusion we present a new algorithm for training local descriptors, the Generalized Image Feature Transform (GIFT), which generalizes current representations such as SIFT and spatial pyramids and allows for the creation of new representations based on multiple channels of information. We show that our system improves on state-of-the-art visual-only and depth-only methods on a diverse dataset of everyday objects.
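
To make the three fusion levels named above concrete, the following is a minimal sketch and not the system described in this paper. The function names, the depth threshold, the descriptor concatenation, and the weighted score average are all illustrative assumptions. In particular, the sketch does not implement GIFT, whose details are not given here; plain concatenation stands in for low-level fusion.

```python
from typing import List, Sequence


def depth_attention_mask(depth: List[List[float]], max_depth: float) -> List[List[bool]]:
    """Depth as attention (assumed rule): keep pixels with a valid
    reading (> 0) that lie no farther than max_depth."""
    return [[0.0 < d <= max_depth for d in row] for row in depth]


def low_level_fusion(visual_desc: Sequence[float], depth_desc: Sequence[float]) -> List[float]:
    """Low-level fusion (assumed form): concatenate the local descriptors
    computed at the same location in the visual and depth channels."""
    return list(visual_desc) + list(depth_desc)


def high_level_fusion(
    visual_scores: Sequence[float],
    depth_scores: Sequence[float],
    visual_weight: float = 0.5,
) -> List[float]:
    """High-level fusion (assumed form): weighted average of the per-class
    scores produced by separate visual-only and depth-only classifiers."""
    if len(visual_scores) != len(depth_scores):
        raise ValueError("both classifiers must score the same set of classes")
    return [
        visual_weight * v + (1.0 - visual_weight) * d
        for v, d in zip(visual_scores, depth_scores)
    ]


if __name__ == "__main__":
    # Hypothetical 2x2 depth map in metres; 0.0 marks a missing reading.
    mask = depth_attention_mask([[0.0, 1.0], [2.5, 0.8]], max_depth=1.5)
    fused_descriptor = low_level_fusion([1.0, 2.0], [3.0])
    fused_scores = high_level_fusion([0.25, 0.75], [0.75, 0.25])
    print(mask, fused_descriptor, fused_scores)
```

In this sketch the attention mask restricts where descriptors are extracted, the concatenated descriptor feeds a single classifier, and the score average combines two independently trained classifiers, so the three mechanisms act at different stages and can be used together.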