
In My Perspective, in My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition


Abstract:

Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research into 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation and EffHandEgoNet, tailored to the egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture based on 2D hand and object poses. This method incorporates EffHandEgoNet and a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust input for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects the overall performance. The code is available at https://github.com/wiktormucha/effhandegonet.
Date of Conference: 27-31 May 2024
Date Added to IEEE Xplore: 11 July 2024
Conference Location: Istanbul, Turkiye

I. Introduction

The growing interest in egocentric vision research is evident from the release of large dedicated datasets like EPIC-KITCHENS [8], Ego4D [14], and H2O [17]. One of the challenges in this domain is action recognition, which focuses on determining the action performed by the user in a video [27]. Research in egocentric action recognition is crucial due to its broad range of potential applications, including augmented and virtual reality, nutritional behaviour analysis, and Active Assisted Living (AAL) technologies for lifestyle analysis [27] or assistance [22]. The ADLs targeted by AAL technologies (e.g., drinking, eating and food preparation) are all based on manual operations and the manipulation of objects, which motivates research focused on hand-based action recognition.

Overview of our method. From the sequence of input frames representing an action, 2D hand poses and the bounding boxes of manipulated objects, together with their labels, are extracted. In this study, four distinct state-of-the-art hand pose methods are implemented and tested. Object information is retrieved using YOLOv7 [34]. The pose information is embedded into a vector describing each frame, and the sequence of vectors is processed by a transformer-based deep neural network to predict the final action class.
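As a rough illustration of the second stage of this pipeline, the sketch below shows how per-frame 2D hand keypoints and object information could be fused into frame vectors and classified over time with a transformer encoder in PyTorch. It is a minimal sketch under assumed settings (class name PoseActionTransformer, a 16-dimensional object-label embedding, 21 keypoints per hand, 36 action and 8 object classes, and the layer sizes are all illustrative) and does not reproduce the paper's EffHandEgoNet-based implementation.

```python
# Minimal sketch (not the authors' implementation): per-frame 2D hand keypoints
# and an object bounding box with its class label are flattened into a frame
# vector; the sequence of frame vectors is classified by a transformer encoder.
import torch
import torch.nn as nn


class PoseActionTransformer(nn.Module):
    def __init__(self, num_actions: int, num_object_classes: int,
                 seq_len: int = 20, d_model: int = 128):
        super().__init__()
        # 2 hands x 21 keypoints x (x, y) = 84 values, plus a 4-value object box.
        pose_dim = 2 * 21 * 2 + 4
        self.obj_embed = nn.Embedding(num_object_classes, 16)
        self.frame_proj = nn.Linear(pose_dim + 16, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_actions)

    def forward(self, keypoints, obj_boxes, obj_labels):
        # keypoints: (B, T, 84), obj_boxes: (B, T, 4), obj_labels: (B, T) int64
        obj = self.obj_embed(obj_labels)                      # (B, T, 16)
        frames = torch.cat([keypoints, obj_boxes, obj], -1)   # (B, T, 104)
        x = self.frame_proj(frames) + self.pos_embed[:, :frames.size(1)]
        x = self.encoder(x)                                   # (B, T, d_model)
        return self.classifier(x.mean(dim=1))                 # (B, num_actions)


# Example usage with random inputs: 8 clips of 20 frames each.
model = PoseActionTransformer(num_actions=36, num_object_classes=8)
logits = model(torch.rand(8, 20, 84), torch.rand(8, 20, 4),
               torch.randint(0, 8, (8, 20)))
print(logits.shape)  # torch.Size([8, 36])
```

Mean pooling over the frame dimension is used here only for brevity; a learned classification token or another aggregation strategy could equally be substituted.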
