Exploring Triple Knowledge Cues for Zero-Shot Human-Object Interaction Detection


Abstract:

Current zero-shot human-object interaction (HOI) detection methods often follow a two-phase pipeline: a pre-trained detector first detects instances, and CLIP then performs interaction prediction. In the second phase, they either obtain pairwise representations by directly performing RoI-Align on CLIP features or design additional queries and decoders to fuse CLIP features. However, CLIP visual features often lack fine-grained information, which hinders capturing complex human-object interactions, and extra decoders increase computation costs. We therefore propose a triple knowledge cues exploration model that improves CLIP representations with various forms of knowledge guidance, without extra decoders. First, we incorporate position-distribution and semantic priors: we delineate a layout from the predicted boxes and inject semantics using the CLIP text embeddings. Next, we explore object priors by leveraging predefined class names and the text encoder to obtain saliency maps for humans and objects. Then, we design three types of holistic tokens to capture diverse attribute cues for the human, the object, and the interaction, respectively. These cues are finally integrated into a vanilla two-stage CLIP-based baseline. Experimental results on HICO-DET demonstrate the effectiveness of the proposed model.
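The object-prior step described above scores each visual location against a class-name text embedding to obtain a saliency map. A minimal sketch of that idea, using cosine similarity between patch features and a text embedding (random tensors stand in for actual CLIP features; all shapes and names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def saliency_map(patch_feats, text_emb):
    """Cosine similarity between each visual patch feature and a
    class-name text embedding, reshaped into a 2-D saliency map."""
    h, w, d = patch_feats.shape
    p = patch_feats.reshape(-1, d)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)   # L2-normalize patches
    t = text_emb / np.linalg.norm(text_emb)            # L2-normalize text embedding
    sal = p @ t                                        # (h*w,) cosine scores
    return sal.reshape(h, w)

rng = np.random.default_rng(0)
feats = rng.standard_normal((7, 7, 512))   # mock CLIP patch-feature grid
emb = rng.standard_normal(512)             # mock text embedding, e.g. for "person"
m = saliency_map(feats, emb)
print(m.shape)  # (7, 7)
```

In practice one map would be computed per class name (human and each object category) and used to reweight the visual features; the sketch shows only the similarity-scoring step.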
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

