
Sharingan: A Transformer Architecture for Multi-Person Gaze Following


Abstract:

Gaze is a powerful form of non-verbal communication that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures, but they have been constrained by the need to process one person at a time, which proves to be highly inefficient. In this paper, we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction [38], [39], they use a fixed set of learnable embeddings to decode both the person and its gaze target, which requires a matching step afterward to link the predictions with the annotations. Thus, it is difficult to quantitatively evaluate these methods reliably with the available benchmarks, or integrate them into a larger human behavior understanding system. Instead, we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget, and ChildPlay datasets and outperforms comparable multi-person architectures with a notable margin. Our code, checkpoints, and data extractions will be made publicly available soon.
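The abstract's core idea can be illustrated with a short sketch: the person of interest is encoded into a single token (here, assumed to come from a head crop and its bounding box), processed by a transformer together with image patch tokens, and the person token's output is decoded into a gaze heatmap. This is not the authors' code; all module names (SharinganSketch, head_stem, bbox_embed), layer sizes, and the simple single-scale dot-product decoder are assumptions for illustration, whereas the paper describes a multiscale decoding mechanism.

```python
# Minimal sketch of a "one person token + image tokens" gaze-following model.
# Assumed, simplified stand-in for the architecture described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharinganSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=6, heads=8, out_res=64):
        super().__init__()
        self.grid = img_size // patch                        # patch grid side length
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)   # ViT-style tokenizer
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        # Person-specific token built from a head crop plus its bounding box,
        # so the caller controls exactly which person is queried (no matching step).
        self.head_stem = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bbox_embed = nn.Linear(4, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.out_res = out_res

    def forward(self, image, head_crop, head_bbox):
        B = image.size(0)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2) + self.pos_embed
        person = (self.head_stem(head_crop) + self.bbox_embed(head_bbox)).unsqueeze(1)
        tokens = self.encoder(torch.cat([person, patches], dim=1))
        person_out, patch_out = tokens[:, :1], tokens[:, 1:]
        # Simplified single-scale decoding: similarity between the person token
        # and every patch token, reshaped to a spatial map and upsampled.
        sim = (patch_out @ person_out.transpose(1, 2)).view(B, 1, self.grid, self.grid)
        return F.interpolate(sim, self.out_res, mode="bilinear", align_corners=False)


# Usage: one forward pass per (image, person) pair; several people in the same
# scene can be batched by repeating the image with different head crops/boxes.
model = SharinganSketch()
img = torch.randn(2, 3, 224, 224)
crop = torch.randn(2, 3, 64, 64)
bbox = torch.rand(2, 4)  # normalized (x1, y1, x2, y2) of the head
print(model(img, crop, bbox).shape)  # torch.Size([2, 1, 64, 64])
```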
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

Gaze is an important form of communication and has been extensively studied across different domains and applications, such as consumer behavior understanding [4], [19], [36], sociology through the analysis of different gaze behaviors (e.g. joint attention, eye contact) [10], [24]-[27], robotics through human-robot interactions [1], [17], [32], and clinical research for the study of neurodevelopmental disorders [7], [20], [34], etc.

Figure: Predictions of Sharingan on naturalistic images from the internet with different people, activities, interactions, postures, and environments (indoors and outdoors). We provide more qualitative samples in the supplementary material.

Figure: Illustration of aspects relevant to gaze following.
