I. Introduction
The growing interest in egocentric vision research is evident from the release of large dedicated datasets such as EPIC-KITCHENS [8], Ego4D [14], and H2O [17]. One of the challenges in this domain is action recognition, i.e., determining the action performed by the user in the video [27]. Research in egocentric action recognition is crucial due to its broad range of potential applications, including augmented and virtual reality, nutritional behaviour analysis, and Active Assisted Living (AAL) technologies for lifestyle analysis [27] or assistance [22]. The Activities of Daily Living (ADLs) targeted by AAL technologies (e.g., drinking, eating, and food preparation) are all based on manual operations and the manipulation of objects, which motivates research focused on hand-based action recognition.
Fig. 1. Overview of our method. From the sequence of input frames representing the action, 2D hand poses and the bounding boxes of manipulated objects with their labels are extracted. In this study, four distinct state-of-the-art hand pose estimation methods are implemented and tested. Object information is retrieved using YOLOv7 [34]. The pose information is embedded into a vector describing each frame. The sequence of vectors is processed by a transformer-based deep neural network to predict the final action class.
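To make the pipeline in the caption above concrete, the sketch below shows one way the per-frame vectors (2D hand keypoints plus detected-object information) could be fed to a transformer encoder for action classification in PyTorch. This is a minimal illustration rather than the authors' implementation: the class name ActionTransformer, all dimensions, layer counts, and the number of action classes are assumptions, and positional encodings are omitted for brevity.

# Minimal sketch of the Fig. 1 pipeline tail: per-frame pose/object vectors
# -> transformer encoder -> action class logits. All sizes are illustrative.
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    def __init__(self, frame_dim=138, d_model=256, n_heads=8,
                 n_layers=4, n_classes=37):
        super().__init__()
        # Project each per-frame pose/object vector into the model dimension.
        self.embed = nn.Linear(frame_dim, d_model)
        # Learnable classification token aggregating the whole sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, frames):            # frames: (batch, seq_len, frame_dim)
        x = self.embed(frames)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)    # prepend the classification token
        x = self.encoder(x)
        return self.head(x[:, 0])         # logits from the classification token

# Example: 2 clips of 32 frames, each described by an (assumed) 138-D vector,
# e.g., 2 hands x 21 keypoints x 2 coordinates plus object box and label features.
logits = ActionTransformer()(torch.randn(2, 32, 138))
print(logits.shape)  # torch.Size([2, 37])

The frame vector here simply concatenates hand keypoint coordinates with object cues; how the object label is encoded (one-hot, embedding, etc.) is left open, matching the high-level description in the caption.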