Abstract:
The Composed Image Retrieval (CIR) task aims to retrieve a target image that satisfies a given multimodal query (comprising a reference image and modification text). Most existing works align multimodal semantics at both local and global granularity. However, they fail to mine semantic correspondences at an intermediate granularity, which results in sub-optimal model performance. In this paper, we propose an adaptive interMEDiate-graIned Aggregation Network (MEDIAN). Unlike conventional CIR models, MEDIAN generates supervision signals for intermediate-grained feature aggregation and constructs graph attention networks to extract intermediate-grained features. Concurrently, MEDIAN devises cross-modal semantic correspondence alignment guided by the target image, which in turn enables accurate multi-grained feature composition. The superiority of MEDIAN is demonstrated by extensive experiments on three benchmark datasets. Our code is available at https://windlikeo.github.io/MEDIAN.github.io/.
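The abstract mentions graph attention networks as the mechanism for aggregating intermediate-grained features. The paper's exact architecture is not given here, so the following is only a minimal sketch of a generic single-head graph-attention aggregation step (in the style of standard GAT layers) over a set of region features; the function names, shapes, and the fully connected adjacency are illustrative assumptions, not MEDIAN's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(h, adj, W, a):
    """One single-head graph-attention aggregation pass.

    h:   (N, F)  node (e.g. image-region) features
    adj: (N, N)  binary adjacency mask (1 = edge)
    W:   (F, D)  shared linear projection
    a:   (2*D,)  attention vector for concatenated node pairs
    Returns (N, D) aggregated features.
    """
    z = h @ W                                   # project node features
    N = z.shape[0]
    # pairwise logits e_ij = LeakyReLU(a^T [z_i || z_j])
    logits = np.array([[np.concatenate([z[i], z[j]]) @ a
                        for j in range(N)] for i in range(N)])
    logits = np.where(logits > 0, logits, 0.2 * logits)  # LeakyReLU
    logits = np.where(adj > 0, logits, -1e9)             # mask non-edges
    alpha = softmax(logits, axis=1)                      # normalize over neighbors
    return alpha @ z                                     # attention-weighted aggregation
```

With a zero attention vector and a fully connected graph, the layer reduces to mean pooling over the projected features, which makes the masking and normalization easy to sanity-check.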
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025