Abstract:
Multimodal named entity recognition (MNER) is an emerging field that aims to automatically detect named entities and classify their categories, utilizing input text and auxiliary resources such as images. While previous studies have leveraged object detectors to preprocess images and fuse textual semantics with corresponding image features, these methods often overlook the potential finer-grained information within each modality and may exacerbate error propagation due to pre-detection. To address these issues, we propose a finer-grained rank-based contrastive learning (FRCL) framework for MNER. This framework employs global-level contrastive learning to align multimodal semantic features and a Top-K rank-based mask strategy to construct positive–negative pairs, thereby learning a finer-grained multimodal interaction representation. Experimental results on three well-known social media datasets reveal that our approach surpasses existing strong baselines and achieves up to a 1.54% improvement on the Twitter2015 dataset. Extensive discussions further confirm the effectiveness of our approach. We will release the source code at https://github.com/augusyan/FRCL.
Published in: IEEE Transactions on Neural Networks and Learning Systems (Early Access)
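
To make the Top-K rank-based masking idea from the abstract concrete, the following is a minimal sketch, not the authors' FRCL implementation: it ranks cross-modal similarities, keeps only the Top-K hardest negatives per anchor, and uses the aligned text–image pair as the positive in an InfoNCE-style loss. The function name, tensor shapes, temperature, and K are illustrative assumptions; the paper's global-level and finer-grained objectives may differ in detail.

```python
# Sketch of a Top-K rank-based mask for cross-modal contrastive learning.
# Shapes, hyperparameters, and the loss form are assumptions for illustration.
import torch
import torch.nn.functional as F

def topk_rank_contrastive_loss(text_feats, image_feats, k=5, temperature=0.07):
    """text_feats, image_feats: (batch, dim) projected features from each modality."""
    # L2-normalize so dot products are cosine similarities.
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    sim = t @ v.t() / temperature                      # (batch, batch) similarity matrix

    batch = sim.size(0)
    diag = torch.eye(batch, dtype=torch.bool, device=sim.device)

    # Rank off-diagonal similarities and keep only the Top-K hardest negatives
    # per anchor; everything else is masked out of the contrast set.
    neg_sim = sim.masked_fill(diag, float('-inf'))
    topk_idx = neg_sim.topk(k=min(k, batch - 1), dim=-1).indices
    neg_mask = torch.zeros_like(sim, dtype=torch.bool).scatter_(1, topk_idx, True)

    # InfoNCE-style objective: the aligned pair (diagonal) is the positive,
    # the Top-K ranked negatives form the denominator.
    logits = sim.masked_fill(~(diag | neg_mask), float('-inf'))
    targets = torch.arange(batch, device=sim.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, restricting the contrast set to the Top-K ranked negatives is one common way to focus the loss on hard, informative pairs rather than all in-batch negatives; how FRCL constructs its positive–negative pairs at the finer-grained level is detailed in the paper itself.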