Vision is the primary means by which humans acquire information. With the development of computer technology, computers can efficiently help humans identify and analyze visual content to extract the underlying contextual information in an image. Although many computer vision tasks, including object detection, instance segmentation, human keypoint detection, and human pose estimation, have achieved superior performance, they still fall within the scope of System 1 tasks, i.e., fast and shallow perception. Developing more intelligent services for System 2 tasks, however, requires not only recognizing individual objects in an image but also identifying the relationships between them to figure out what is happening in the scene. Given that human–object interactions constitute a significant portion of human activities, detecting and identifying how humans interact with their surrounding objects is crucial for comprehensive visual understanding. This motivates the task of human–object interaction (HOI) detection, which localizes humans and objects while discerning their interaction relationships. As shown in Figure 1, HOI detection predicts the specific locations of the human and the object and infers the action between them (e.g., “human blow cake,” “human read book”). In short, it detects the HOI triplet (human, verb, object) in an image. The HOI detection task goes beyond spotting individual objects, aiming for a deeper understanding of visual scenes: it focuses on grasping how humans interact with objects and uncovering the relationships between them. This technology has promising research prospects and is useful in various fields such as intelligent transportation, human–computer interaction, and the recognition of unusual human actions. For example, in visual surveillance, HOI detection can identify abnormal behaviors and deliver timely warnings.
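To make the output format concrete, the (human, verb, object) triplet described above can be sketched as a simple data structure. The field names, the (x1, y1, x2, y2) box convention, and the confidence score here are illustrative assumptions, not the interface of any particular HOI detection model.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned bounding box as (x1, y1, x2, y2); an assumed convention.
Box = Tuple[float, float, float, float]

@dataclass
class HOITriplet:
    """One detected human-object interaction: <human, verb, object>."""
    human_box: Box      # location of the detected person
    object_box: Box     # location of the interacted object
    object_label: str   # e.g., "cake", "book"
    verb: str           # e.g., "blow", "read"
    score: float        # confidence of the whole triplet

def describe(t: HOITriplet) -> str:
    """Render a triplet as a 'human verb object' phrase."""
    return f"human {t.verb} {t.object_label}"

# Example corresponding to the Figure 1 captions:
det = HOITriplet((10, 20, 120, 300), (90, 220, 160, 280), "cake", "blow", 0.87)
print(describe(det))  # → human blow cake
```

An image may yield many such triplets, since one person can interact with several objects and one object with several people; a detector therefore outputs a scored list of these structures rather than a single prediction.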
However, HOI detection remains an open and challenging research problem: an image may contain multiple individuals performing the same interaction, the same person may interact with multiple objects simultaneously, and the same object may be interacted with by multiple humans at once. Furthermore, the complex and diverse interaction scenarios and the long-tail data distribution of the benchmark datasets further increase the difficulty of designing effective HOI detection models.
Abstract:
This article systematically summarizes and discusses recent research on image-based human–object interaction (HOI) detection, which aims to detect human–object pairs and recognize the interactive behaviors between humans and objects in an image. It has many applications and can serve as a basis for higher-level visual understanding tasks. We introduce existing methods by categorizing them into two main groups based on model structure: one-stage and two-stage approaches. We further divide one-stage methods into point-based, region-based, and query-based methods. Similarly, two-stage methods are divided into HOI detection with multistream modeling, HOI detection with human parts and pose, HOI detection with compositional learning, HOI detection with graph-based modeling, and HOI detection with query-based modeling. Following this taxonomy, we summarize and analyze the core ideas behind each strategy. We then present the experimental protocols, evaluation metrics, datasets, and evaluation results of the most recent representative methods. Finally, we discuss the main open challenges and future trends in HOI detection.
Published in: IEEE Consumer Electronics Magazine (Volume: 13, Issue: 6, November 2024)