I. Introduction
Tracking and trajectory forecasting are critical components of modern 3D perception systems [1]. Historically, 3D multi-object tracking (MOT) [2]–[4] and trajectory forecasting [5]–[12] have been studied separately. As a result, perception systems often run the two in a cascade: tracking is performed first to obtain past trajectories, and trajectory forecasting then predicts future trajectories from them. However, this cascaded pipeline with separately trained modules can lead to sub-optimal performance, because information is not shared between the two modules during training. Since the tracking and forecasting modules are mutually dependent, it would be beneficial to optimize them jointly. For example, a better MOT module can improve the performance of its downstream forecasting module, while a more accurate motion model learned through trajectory forecasting can improve data association in MOT. Our goal is to jointly optimize the MOT and forecasting modules and to learn a better shared feature representation for both.