Abstract:
The ability to anticipate the future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions that involve the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels, matching the ongoing action, for every relevant object in the scene. Instead, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across the human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that this cross-correlation-based frame representation is better suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of the frame-wise HO representations yields better action anticipation than other temporal networks. We therefore propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We evaluate MM-Transformer on the procedural datasets 50 Salads and Breakfast, and on the unscripted EPIC-KITCHENS-55 dataset. Finally, we demonstrate that the combination of human-object representation and MM-Transformer is effective even for long-term anticipation.
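The cross-correlation step described above can be pictured with a short sketch. This is a minimal illustration, not the authors' released code (see the repository linked in the abstract for that): it assumes each of the N human-object pairs detected in a frame is encoded as a d-dimensional visual feature vector, and shows why the resulting d x d second-order statistic gives a fixed-size frame representation no matter how many pairs a frame contains.

```python
import torch

def ho_cross_correlation(pair_feats: torch.Tensor) -> torch.Tensor:
    """Frame representation from pairwise human-object (HO) features.

    pair_feats: (N, d) tensor, one d-dimensional visual feature per
    HO pair in the frame. N may vary from frame to frame; the output
    is always a flattened d x d cross-correlation matrix, so the
    downstream temporal model sees a fixed-size vector per frame.
    """
    n, d = pair_feats.shape
    # Second-order statistics across the HO pairs: a (d, d) matrix.
    corr = pair_feats.T @ pair_feats / n
    return corr.flatten()  # (d*d,) holistic frame representation

# Two frames with different numbers of HO pairs map to the same size.
frame_a = ho_cross_correlation(torch.randn(3, 64))  # 3 pairs
frame_b = ho_cross_correlation(torch.randn(7, 64))  # 7 pairs
assert frame_a.shape == frame_b.shape
```

The temporal-aggregation idea can be sketched the same way: a transformer encoder runs over the per-frame vectors of the observation period, and a classifier head predicts the upcoming action. Layer sizes and the action-vocabulary size below are placeholder assumptions, not values from the paper, and the sketch omits the motion and spatio-temporal modalities that the full MM-Transformer fuses.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Aggregate frame-wise HO representations over the observation
    period with a transformer encoder, then predict the next action."""

    def __init__(self, in_dim: int, model_dim: int = 256,
                 num_heads: int = 4, num_layers: int = 2,
                 num_actions: int = 19):  # placeholder vocabulary size
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(model_dim, num_actions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, in_dim) frame-wise HO representations
        h = self.encoder(self.proj(frames))  # (batch, T, model_dim)
        return self.head(h.mean(dim=1))      # (batch, num_actions)

model = TemporalTransformer(in_dim=64 * 64)
logits = model(torch.randn(2, 30, 64 * 64))  # 30 observed frames
```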
Published in: IEEE Transactions on Image Processing (Volume 30)
Index Terms:
- Activity Prediction
- Pairwise Interactions
- Human-object Interaction
- Observation Period
- Visual Features
- Future Actions
- Ongoing Activity
- Transformer Model
- Smart Home
- Objects In The Scene
- Second-order Statistics
- Temporal Aggregation
- Holistic Representation
- Spatiotemporal Representation
- Object Affordances
- Sequence Of Actions
- Recurrent Network
- Recurrent Neural Network
- Object Features
- Action Recognition
- Object In Frame
- Cross-correlation Matrix
- Multimodal Network
- Action Labels
- Conditional Random Field
- Gated Recurrent Unit
- Motion Features
- Future Frames
- Mask R-CNN
- Spatiotemporal Characteristics