Abstract:
Temporal localization of actions in videos has been of increasing interest in recent years. However, most existing approaches rely on complex architectures that are either expensive to train, inefficient at inference time, or require thorough and careful architecture engineering. Classical action recognition on pre-segmented clips, on the other hand, benefits from sophisticated deep architectures that have paved the way for highly reliable video clip classifiers. In this paper, we propose to use transfer learning to leverage the strong results from action recognition for temporal localization. We apply a network inspired by the classical bag-of-words model for transfer learning and show that the resulting framewise class posteriors already provide good results without explicit temporal modeling. Further, we show that combining these features with a deep but simple convolutional network achieves state-of-the-art results on two challenging action localization datasets.
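To make the second stage concrete, the sketch below shows one plausible instantiation of the idea described in the abstract: framewise class posteriors from a pretrained clip classifier are stacked over time and refined by a simple 1D convolutional network. This is not the authors' published code; the layer sizes, kernel widths, and the class name `TemporalConvHead` are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact architecture):
# refine framewise class posteriors with a small temporal ConvNet.
import torch
import torch.nn as nn

class TemporalConvHead(nn.Module):
    """Stack of temporal (1D) convolutions over framewise class posteriors."""
    def __init__(self, num_classes: int, hidden: int = 256, layers: int = 4):
        super().__init__()
        blocks, in_ch = [], num_classes
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                       nn.ReLU()]
            in_ch = hidden
        blocks.append(nn.Conv1d(hidden, num_classes, kernel_size=1))
        self.net = nn.Sequential(*blocks)

    def forward(self, posteriors: torch.Tensor) -> torch.Tensor:
        # posteriors: (batch, T, num_classes) framewise class scores
        x = posteriors.transpose(1, 2)      # -> (batch, num_classes, T)
        return self.net(x).transpose(1, 2)  # refined framewise scores

# Usage with dummy posteriors, e.g. a 300-frame video with 20 action classes.
posteriors = torch.softmax(torch.randn(1, 300, 20), dim=-1)
refined = TemporalConvHead(num_classes=20)(posteriors)  # (1, 300, 20)
```

Per-frame predictions from such a head can then be grouped into temporal segments (e.g. by taking contiguous runs of the same argmax label), which is the sense in which framewise posteriors translate into action localizations.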
Date of Conference: 27-28 October 2019