Abstract:
As an important field in information retrieval, fine-grained cross-modal retrieval has received considerable attention from researchers. Existing fine-grained cross-modal retrieval methods have improved the capture of the fine-grained interplay between vision and language, but they fail to consider the fine-grained correspondences between the features in the image latent space and in the text latent space, respectively, which may lead to inaccurate inference of intra-modal relations or false alignment of cross-modal information. Considering that object detection can provide fine-grained correspondences between image region features and the corresponding semantic features, this paper proposes a novel latent space semantic supervision model based on knowledge distillation (L3S-KD). The model trains classifiers that are supervised, for fine-grained alignment in the image latent space, by the correspondences obtained from an object detection model through knowledge distillation, and, for fine-grained alignment in the text latent space, by the labels of objects and attributes. Compared with existing fine-grained correspondence matching methods, L3S-KD can learn more accurate semantic similarities for local fragments in image-text pairs. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that the L3S-KD model consistently outperforms state-of-the-art methods for image-text matching.
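The image-side supervision described in the abstract can be illustrated with a generic knowledge-distillation loss. This is only a minimal sketch, not the paper's formulation: it assumes that the object detection model (the teacher) provides a probability distribution over object classes for each image region, and that a classifier on the image latent features (the student) is trained with a cross-entropy loss against those soft labels. The function names, the `temperature` parameter, and the plain-list representation are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions
    for a single image region (assumed loss form, not taken from the paper)."""
    student_probs = softmax(student_logits, temperature)
    return -sum(
        t * math.log(s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0  # terms with zero teacher probability contribute nothing
    )


def region_supervision_loss(batch_student_logits, batch_teacher_probs,
                            temperature=1.0):
    """Mean distillation loss over all regions of an image."""
    losses = [
        distillation_loss(s, t, temperature)
        for s, t in zip(batch_student_logits, batch_teacher_probs)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two regions, three hypothetical object classes.
    teacher = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    student = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
    print(region_supervision_loss(student, teacher))
```

The loss shrinks as the student classifier's predicted class distribution for a region approaches the detector's distribution, which is the sense in which the detector's region-to-semantics correspondences supervise the image latent space here.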
Published in: IEEE Transactions on Image Processing (Volume: 31)
Index Terms:
- Latent Space
- Cross-modal Retrieval
- Semantic Supervision
- Extensive Experiments
- Object Detection
- Image Regions
- Information Retrieval
- Object Detection Model
- Alignment Information
- MS COCO Dataset
- Loss Function
- Image Features
- Object Classification
- Number Of Objects
- Fully-connected Layer
- Text Words
- Word Embedding
- Words In Sentences
- Semantic Knowledge
- Cross-entropy Loss Function
- Bottom-up Attention
- Salient Regions
- Word Features
- Query Image
- Multi-task Learning
- Objects In The Scene
- Textual Features
- Semantic Consistency
- Text Query
- Bad Example