I. Introduction
For robots to perform intelligent, autonomous behaviors, it is crucial that they understand what they see. The applications requiring visual abilities are countless: self-driving cars, service robots detecting and handling objects in homes, kitting in industrial workshops, robots filling shelves and shopping baskets in supermarkets, and so on. All of these involve interacting with a wide variety of objects, which in turn requires a deep understanding of what those objects look like, their visual properties, and their associated functionalities. Still, the best vision systems available today are not yet up to the needs of artificial autonomous systems in the wild. There are examples of robots performing complex tasks such as loading a dishwasher [1] or flipping pancakes [2]; however, the visual knowledge about the objects involved in these tasks is manually encoded in the robots' control programs or knowledge bases, limiting them to operating on the objects they have been programmed to understand. More generally, the current mainstream approach to visual recognition, based on convolutional neural networks [3], [4], makes the so-called closed-world assumption, i.e., it assumes that the number and type of objects a robot will encounter in its activities are fixed and known a priori. Under this assumption, the big challenge is to make these visual algorithms robust to illumination, scale, and categorical variations, as well as to clutter and occlusions.