This framework integrates the Segment Anything Model (SAM) for segmentation, a multi-level encoder-decoder architecture for feature extraction and decoding, and attention...
Abstract:
Although current models for detailed image description have made significant progress on fine-grained captioning of portrait images, they still struggle with attention allocation and with capturing the detailed characteristics of their subjects. As a result, they have difficulty generating accurate, refined captions for character images. To address this issue, we propose a model named Attention-guided Hierarchical Parsing (AHP). The model leverages the segmentation performance of the Segment Anything Model (SAM) to prioritize key information in character images, maintaining focus on the subject even in complex scenes. In addition, the model uses a multi-level image feature encoding-decoding framework that analyzes multi-scale features within an image, significantly enhancing its capacity to capture intricate image details. Extensive experimental results demonstrate the superior performance of the proposed model in generating fine-grained, high-quality captions, and the approach introduces new perspectives and methods to the field of fine-grained image caption generation.
Published in: IEEE Access (Volume: 12)