Abstract:
To accurately perform crowd counting, exploiting the complementary relationship between RGB and thermal images has become a focus of current research. Due to their different imaging principles, multi-modal images often contain different content, i.e., their modality-specific information. For example, RGB images contain more texture and color details, while thermal images contain thermal radiation information. Meanwhile, they also describe the same targets, e.g., crowds, which constitute modality-invariant information. However, existing methods simply design modules to directly fuse RGB and thermal image features, which does not fully account for these properties. In this paper, by analyzing the similarities and differences between multi-modal images, we propose a Modality-Invariant and -Specific Fusion Network (MISF-Net) for RGB-T crowd counting. Specifically, we design a modality decomposition and fusion module (MDFM), which decomposes RGB and thermal image features into modality-invariant and modality-specific features under similarity and difference supervision between the multi-modal features. In addition, reconstruction supervision is applied to prevent the network from learning biased decompositions. Different fusion strategies are then applied to the invariant and specific features, respectively. Furthermore, to adapt to variations in pedestrian size, we design a modality-invariant fusion module (MIFM). Finally, after the fusion decoder, MISF-Net produces a more accurate crowd density map. Comprehensive experiments on the RGB-T crowd counting dataset show that MISF-Net achieves competitive performance. Our code will be released at https://github.com/QSBAOYANGMU/MISF-Net.
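The decomposition described above can be illustrated with a minimal sketch of the three supervision terms. All function names and the cosine-similarity formulation here are assumptions for illustration; the paper may use different distance measures or loss weightings.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mdfm_losses(inv_rgb, inv_t, spec_rgb, spec_t,
                rec_rgb, rec_t, feat_rgb, feat_t):
    """Hypothetical loss terms for a MDFM-style decomposition.

    inv_*:  modality-invariant features extracted from each modality
    spec_*: modality-specific features extracted from each modality
    rec_*:  features reconstructed from the invariant + specific parts
    feat_*: original encoder features of each modality
    """
    # Similarity supervision: invariant features of both modalities should agree.
    l_sim = 1.0 - cosine_sim(inv_rgb, inv_t)
    # Difference supervision: specific features should diverge,
    # so penalize their absolute cosine similarity.
    l_diff = abs(cosine_sim(spec_rgb, spec_t))
    # Reconstruction supervision: the decomposed parts must recover the
    # original features, preventing a degenerate (biased) decomposition.
    l_rec = np.mean((rec_rgb - feat_rgb) ** 2) + np.mean((rec_t - feat_t) ** 2)
    return l_sim, l_diff, l_rec
```

Under this reading, the total decomposition loss would be a weighted sum of the three terms; the specific weighting scheme is not stated in the abstract.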
Published in: IEEE Transactions on Multimedia (Early Access)