1. Introduction
Progress in computer vision has been fueled by the emergence of datasets of ever-increasing scale; a prime example is the surge of Deep Learning thanks to ImageNet [26], [44]. The scaling up of datasets for Multiple Object Tracking (MOT), however, has been limited by the difficulty and cost of annotating complex video scenes with many objects. As a consequence, MOT datasets consist of only a couple of dozen sequences [18], [29], [35] or are restricted to the surveillance scenario [53]. This has hindered the development of fully learned MOT systems that generalize to any scenario. In this paper, we tackle these issues by introducing a fast and intuitive way to annotate trajectories in videos and use it to create a large-scale MOT dataset.
Consider a sequence heavily crowded with similar-looking people. Annotating such a sequence is typically time-consuming and tedious. With our path supervision, in contrast, the user effortlessly follows the object while watching the video, collecting path annotations. Our approach then produces dense box-trajectory annotations from these path annotations.
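To make the idea concrete, the step from path annotations to box trajectories can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual method: it assumes a path annotation is one (x, y) point per frame and that candidate detection boxes are available, and it simply links each path point to the detection whose center lies closest. All function and variable names are our own.

```python
# Hypothetical sketch: turn a sparse path annotation (one point per frame,
# collected while the user follows the object) into a box trajectory by
# matching each path point to the nearest candidate detection box.
# The per-frame nearest-center rule is an illustrative assumption, not the
# paper's actual trajectory-generation procedure.

def box_center(box):
    """Center of an axis-aligned box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def path_to_boxes(path, detections_per_frame):
    """path: {frame: (x, y)} path points;
    detections_per_frame: {frame: [boxes]} candidate detections.
    Returns {frame: box}, choosing per frame the detection whose
    center is closest to the annotated path point."""
    trajectory = {}
    for frame, (px, py) in path.items():
        candidates = detections_per_frame.get(frame, [])
        if not candidates:
            continue  # no detection in this frame; leave a gap
        trajectory[frame] = min(
            candidates,
            key=lambda b: (box_center(b)[0] - px) ** 2
                        + (box_center(b)[1] - py) ** 2,
        )
    return trajectory
```

A real system would additionally enforce temporal smoothness and resolve conflicts between nearby objects; the sketch only conveys why a rough path, which is cheap to collect, carries enough signal to recover dense boxes.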