Abstract:
The ability to recognize human actions in a video is challenging due to the complex nature of video data and the subtlety of human actions. Human activities are often associated with surrounding objects and occur in specific scene contexts. Existing action recognition systems are incapable of separating human actions from representation biases, such as co-occurring objects and the underlying scene, which often dominate subtle human actions. In this paper, we address the factorization of human actions into the activity performed by the actor, the co-occurring objects, and the underlying context, to mitigate the influence of representation biases when they are irrelevant to the action under consideration. We propose a deep neural network architecture, denoted FactorNet, for efficient action recognition in videos of long temporal duration. We design an attention mechanism that separates the actor from the associated objects and co-occurring scene, followed by a module that captures long-range temporal context. We perform a comprehensive set of experiments on six benchmark datasets to show the efficacy of our architecture. Training a model on recent video-based action datasets inevitably captures and leverages such biases, and the resulting supervised representations may not generalize to new action classes. We therefore design a new dataset, known as FactNet, which consists of activity-object-scene related actions that occur in day-to-day settings. Dataset Link: FactNet.
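The sketch below illustrates the factorization idea described above: per-frame features are split into actor, object, and scene streams by spatial attention heads, and temporal context is then modeled over the pooled streams. This is not the authors' FactorNet implementation; the module layout, dimensions, and the use of a GRU for temporal context are assumptions made for illustration.

# A minimal sketch of attention-based actor/object/scene factorization,
# NOT the published FactorNet architecture: all names, dimensions, and
# the GRU temporal model are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedAttention(nn.Module):
    """Splits per-frame feature maps into actor / object / scene streams
    via three spatial-attention heads, then models long-range temporal
    context over the pooled streams with a GRU (assumed here)."""
    def __init__(self, in_channels=512, hidden=256, num_classes=50):
        super().__init__()
        # One 1x1 conv channel per factor yields a spatial attention logit map.
        self.attn_heads = nn.Conv2d(in_channels, 3, kernel_size=1)
        self.temporal = nn.GRU(3 * in_channels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):
        # feats: (B, T, C, H, W) frame-level CNN features.
        B, T, C, H, W = feats.shape
        x = feats.reshape(B * T, C, H, W)
        # Softmax over spatial positions -> one attention map per factor.
        logits = self.attn_heads(x)                      # (B*T, 3, H, W)
        maps = F.softmax(logits.flatten(2), dim=-1)      # (B*T, 3, H*W)
        # Attention-weighted pooling: one C-dim vector per factor.
        pooled = torch.einsum('bkn,bcn->bkc', maps, x.flatten(2))
        pooled = pooled.reshape(B, T, 3 * C)             # concat the factors
        # Long-range temporal context over the factored per-frame vectors.
        ctx, _ = self.temporal(pooled)
        return self.classifier(ctx[:, -1])               # (B, num_classes)

# Usage: 8 frames of 512-channel 7x7 feature maps for a batch of 2.
model = FactorizedAttention()
scores = model(torch.randn(2, 8, 512, 7, 7))
print(scores.shape)  # torch.Size([2, 50])

Keeping the three factor streams separate before temporal modeling is what allows the network to downweight object or scene evidence when it is irrelevant to the action; the concatenation step here is one simple way to recombine them.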
Published in: IEEE Transactions on Circuits and Systems for Video Technology ( Volume: 32, Issue: 3, March 2022)