I. Introduction
Image-Text Retrieval (ITR) aims to search for semantically relevant images in an image database given a text query, and vice versa. ITR has attracted extensive attention [1], [2], [3], [4], [5], [6], [7], [10] because it benefits a wide range of applications, such as search engines and multimedia data management systems. The critical challenges of ITR lie in accurately obtaining visual and textual semantic representations and in learning an optimal joint embedding space from adequate sample pairs. Although tremendous efforts have been devoted to tackling these challenges, including global-level matching methods [2], [6], [10], [11], attention-based local-level matching methods [1], [4], [12], [13], [14], and methods based on external knowledge [3], [15], [16], [17], ITR remains challenging due to inaccurate semantic representations and the limited number of negative image-text pairs in a mini-batch.