Abstract:
Automatic image captioning has been extensively studied; however, existing methods primarily focus on a single image. In practice, demand for captioning multiple images together with their corresponding contextual information has been growing in diverse scenarios, e.g., composing news articles, headlines, and electronic medical reports. In this paper, we propose a novel COntext-driven captioning approach for Multi-Image News, called COMIN. It employs a two-step attention mechanism, called adaptive dual attention, comprising global attention for grasping the overall context and local attention for capturing finer image details. The mechanism is inspired by human observation and cognitive processes, in which global attention is responsible for understanding high-level features and local attention for detailing low-level features. Experimental results on our newly contributed Star-News dataset show that the proposed model outperforms state-of-the-art image captioning methods in multi-image captioning scenarios.
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024