Un-Gaze: A Unified Transformer for Joint Gaze-Location and Gaze-Object Detection | IEEE Journals & Magazine | IEEE Xplore


Abstract:

This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework in which human head crops must first be detected and then fed into a GL-D sub-network, which is in turn followed by an additional object detector for GO-D. In contrast, we reframe the gaze following detection task as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze location and gaze object in a unified, single-stage pipeline. To this end, we propose GTR, short for Gaze following detection TRansformer, which streamlines the gaze following detection pipeline by eliminating all additional components, leading to the first unified paradigm that unites GL-D and GO-D in a fully end-to-end manner. GTR enables iterative interaction between holistic semantics and human head features through a hierarchical structure, inferring the relations between salient objects and human gaze from the global image context and achieving high accuracy. Concretely, GTR achieves a 12.1 mAP gain (25.1%) on GazeFollowing and an 18.2 mAP gain (43.3%) on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement (45.2%) on GOO-Real for GO-D. Meanwhile, unlike existing systems, which detect gaze following sequentially because they require a human head as input, GTR can infer the gaze followings of any number of people simultaneously, resulting in high efficiency. Specifically, GTR delivers more than a 9× improvement in FPS, and the relative gap becomes more pronounced as the number of people grows.
Page(s): 3271 - 3285
Date of Publication: 25 September 2023
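
The efficiency argument in the abstract (sequential per-head processing versus a single pass for all people) can be illustrated with a toy latency model. This is only a sketch: the function names and every timing value below are hypothetical placeholders, not measurements from the paper, and the model assumes that the multi-stage gaze sub-network runs once per detected head while the single-stage cost does not depend on the number of people.

```python
def multi_stage_latency(num_people, t_head_det, t_gaze_per_head, t_obj_det):
    """Latency of a hypothetical multi-stage pipeline: one head-detection pass,
    one gaze-location pass per detected head, and one object-detection pass."""
    return t_head_det + num_people * t_gaze_per_head + t_obj_det


def single_stage_latency(t_forward):
    """Latency of a hypothetical single-pass model.
    Assumed here to be independent of the number of people."""
    return t_forward


def speedup(num_people, t_head_det, t_gaze_per_head, t_obj_det, t_forward):
    """Ratio of multi-stage latency to single-stage latency."""
    multi = multi_stage_latency(num_people, t_head_det, t_gaze_per_head, t_obj_det)
    return multi / single_stage_latency(t_forward)


# Placeholder timings in milliseconds, chosen only for illustration.
TIMINGS = dict(t_head_det=20.0, t_gaze_per_head=30.0, t_obj_det=40.0, t_forward=50.0)

for n in (1, 2, 5, 10):
    print(f"people={n:2d}  speedup={speedup(n, **TIMINGS):.2f}x")
```

Under these assumptions the multi-stage cost grows linearly with the number of people while the single-pass cost stays flat, so the speedup ratio increases with the number of people, which matches the qualitative trend the abstract describes.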

I. Introduction

Gaze following is an extraordinary human ability to precisely track others’ gaze directions and identify their gaze targets. In real-world scenarios, the human visual system operates with speed and precision, allowing us to perform complex vision tasks with little conscious thought, such as effortlessly understanding the behavioural intentions of others. Similarly, fast and accurate gaze following detection algorithms will enable machines to better interpret human behaviors, and therefore hold significant potential for various human-centric vision tasks, such as saliency prediction [23], human action detection [32], and human-object interaction detection [43], among others [45].
