The task of understanding video content has attracted great interest from the computer vision community as camera-based surveillance has increased at grocery stores, airports, train stations, etc. What makes up a scene (objects) and what happens in the scene (actions) are two important dimensions of video understanding. In this work, we aim to identify both the actions and the objects in a video; however, we focus only on the objects with which a human interacts. We use videos in which multiple actions may take place during possibly overlapping intervals. Our system can recognize actions that have high intra-class variance and are performed in complex environments using objects of different types, sizes and shapes. As output, we produce structured descriptions of the videos. Each description identifies the subject, the object, the verb and the interval of a recognized activity.
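
To make the output format concrete, the following is a minimal sketch of how one such structured description might be represented. It is an illustration only and not the system's actual data format: the class name `ActivityDescription`, its field names, the use of seconds for the interval, the `render` helper and the sample activities are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityDescription:
    """One recognized activity (hypothetical layout, not the paper's format)."""
    subject: str   # who performs the action
    verb: str      # the action
    obj: str       # the object the human interacts with
    start: float   # interval start, assumed to be in seconds
    end: float     # interval end, assumed to be in seconds

    def __post_init__(self):
        # Reject intervals that end before they begin.
        if self.end < self.start:
            raise ValueError("interval end must not precede its start")

    def overlaps(self, other: "ActivityDescription") -> bool:
        # Two intervals overlap when each one starts before the other ends.
        return self.start < other.end and other.start < self.end


def render(d: ActivityDescription) -> str:
    # Format a description as a single readable line.
    return f"{d.subject} {d.verb} {d.obj} [{d.start:.1f}s, {d.end:.1f}s]"


# Two made-up activities whose intervals overlap.
pour = ActivityDescription("person", "pours", "milk", 2.0, 6.5)
stir = ActivityDescription("person", "stirs", "cup", 5.0, 9.0)

print(render(pour))
print(render(stir))
print(pour.overlaps(stir))
```

Keeping the interval as part of each description, rather than segmenting the video into disjoint clips, is what allows several activities to be reported for the same stretch of video.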