I. Introduction
There is an ever-growing need to perform robotic tasks in unknown environments. Reasoning about observed dynamics using ubiquitous sensors such as video is therefore highly desirable for practical robotics. Traditionally, the complexity of this reasoning has been avoided by investing in fast actuators [1], [2] and highly accurate sensing [3]. In emerging field applications of robotics, the reliance on such infrastructure may need to decrease [4], while the complexity of tasks and the uncertainty of environments have increased [5]. There is thus a need for better physical scene understanding from low-cost sensors, together with the ability to make forward predictions of the scene, so as to enable planning and control.