Boosting Cross-Modal Retrieval With MVSE++ and Reciprocal Neighbors

In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space in which candidates in the other modality are retrieved. It provides a more comprehensive representation of the visual content, covering both the objects and the scene context in images, so we are more likely to find an accurate and detailed caption for an image. However, captioning condenses image content into a semantic description, so the cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve better cross-modal neighbors, we propose to re-rank the initially retrieved candidates according to their k-nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark MSCOCO and Flickr30K datasets with standard metrics, and achieves higher accuracy in both caption retrieval and image retrieval at R@1 and R@10.


I. INTRODUCTION
The task of image-caption retrieval aims at finding corresponding sentences given an image query, or retrieving images given a sentence query. Although image-caption retrieval has wide applications in semantic search, it remains challenging. Early work addresses single-direction retrieval in the visual space: images similar to the query are first retrieved from the training set based on their similarities to the query in visual space, and the captions of the retrieved images are then re-ranked and transferred to the query image [1]-[3]. In the IM2TEXT model [1], the similarity of candidate images to the query is measured by Gist and Tiny Image descriptors, and the candidates are re-ranked by evaluating image content such as objects, stuff, and people. In contrast, Mason and Charniak [2] only take textual information into account during re-ranking; the final retrieved description is determined by extractive summarization techniques [4]. Kuznetsova et al. [3] report a phrase-based approach to synthesize the output. Given a query image, they identify its content elements using classifiers/detectors and then retrieve phrases referring to those content elements from the database. Finally, the retrieved phrases are combined into a description, taking word order, redundancy, etc. into account.
In bi-directional retrieval, most approaches share the core idea of building a common multimodal embedding space for the visual and textual data based on a training set of image-description pairs; the query is then retrieved in the multimodal space [5]. Therefore, the most critical issue is to build a joint embedding space that aligns images and text well. Recently, many methods for joint visual-semantic embedding have been proposed, which can be divided into two main categories. Methods in the first category are based on canonical correlation analysis (CCA) [6], including normalized CCA [7] and kernel CCA [8]. Kernel CCA uses the kernel trick to produce nonlinear, non-parametric projections from the visual and semantic features into the cross-modal space. Recently, CCA has been introduced into deep learning frameworks [9], [10]. However, a drawback of CCA is that it requires loading all data into memory to compute the covariance matrix.
Thus, its memory consumption is heavy. The second category of methods is based on ranking losses. For example, some deep learning methods, such as restricted Boltzmann machines and autoencoders [11]-[15], have been applied to jointly embed images and text. VSE [16] embeds deep visual features and deep semantic features into a cross-modal space based on a bi-directional ranking loss. Building on VSE, VSE++ [17] improves the loss function by considering the influence of hard negative samples, and has achieved state-of-the-art performance. However, the visual features of VSE++ are extracted by an ImageNet-based CNN, which ignores the scene context in images. Therefore, we concatenate object and scene features to fully represent the objects and scene context in images. As a result, we propose to retrieve the candidates in the other modality in a multi-feature based VSE++ (MVSE++) space. It provides a more comprehensive representation of the visual content, covering both the objects and the scene context in images, so we achieve a better visual-semantic alignment in terms of objects and scene context. However, captioning condenses image content into a semantic description, so the cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve better cross-modal neighbors, we propose to re-rank the initially retrieved candidates according to the k-nearest reciprocal neighbors in the MVSE++ space.
To summarize, the contributions of this paper are: • We propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. It reduces the retrieval noise induced by misalignment, high-dimensional vectors and the asymmetric cross-modal neighboring relationship.
• We propose a multiple-visual-feature based embedding method, which provides a more comprehensive representation of the visual content, covering both the objects and the scene context in images. Therefore, it provides a better initial retrieval.
• We define the cross-modal k-reciprocal nearest neighbors to re-rank the initially retrieved results, which achieves more accurate results through symmetrizing nearest neighborhood relationships in the cross-modal space.
• Experimental results show that the proposed method achieves advanced image-caption retrieval performance on MSCOCO and Flickr30K datasets. To the best of our knowledge, we achieve the highest accuracy in caption retrieval and image retrieval at both R@1 and R@10.

II. RELATED WORK
Our framework involves the ideas of joint deep embedding and re-ranking. Below, we give a concise overview of relevant prior work on both topics.

A. DEEP VISUAL-SEMANTIC EMBEDDING
CCA has been introduced into deep learning frameworks for visual-semantic embedding to learn nonlinear projections and improve adaptability to large training sets [9], [10]. Considering that the stochastic gradient descent (SGD) used in [9], [10] cannot guarantee a good solution to the generalized eigenvalue problem of CCA, Wang et al. [18] propose a network sharing a similar two-branch architecture with deep CCA models. Restricted Boltzmann machines and autoencoders [11], [12] have also been incorporated into the learning of a multimodal space. Recently, recurrent models have achieved great progress in image-text embedding [16], [19]-[22]. Among the ranking-loss based joint embeddings, WSABIE [23] and DeVISE [24] learn linear transformations of visual and textual features into a shared space using a single-directional ranking loss, which applies a margin-based penalty to hard incorrect annotations. However, the single-directional ranking loss may lead to incorrect matches in the opposite direction. To tackle this problem, some works [16], [17], [25]-[28] adopt a bi-directional ranking loss. Attention mechanisms have also been applied to model the joint embedding in image-text tasks [29]-[31], selectively focusing on a subset of words and image regions to compute the similarity. Ba et al. [32] propose a recurrent attention model, which attends to label-relevant image regions for multiple-object recognition. In [33], a multimodal context-modulated attention mechanism is proposed to compute the similarities between images and captions. Nam et al. [34] use an attention mechanism on both images and captions to capture the fine-grained interplay between vision and language. Adversarial learning provides a novel perspective for cross-modal retrieval: GXN [35] incorporates generative processes into the cross-modal feature embedding, based on Generative Adversarial Networks (GANs) and reinforcement learning.
However, all of these works use object-related visual features while ignoring the scene-context information. Thus, we propose to use multiple visual features, extracted through two CNNs, that contain scene information in addition to object information. We use a gated recurrent unit (GRU) [36] to represent text. Finally, a bi-directional ranking loss is used to optimize the embedding.

B. RE-RANKING METHODS
Due to the high dimensionality of feature vectors, it is hard for a similarity measure to completely distinguish the relevant vectors from the irrelevant ones, which means that some noise remains in the retrieved results. Thus, for retrieval tasks, even after an initial ranking list is obtained, re-ranking is still necessary to guarantee that relevant images receive higher ranks.
Besides, re-ranking needs no additional training samples and is applicable to any initial ranking result. Recently, re-ranking has become popular in instance retrieval [37]-[40]. Many works use k-nn [41], [42] to measure the similarity to the probe in order to address the re-ranking problem. Arandjelovic and Zisserman [43] propose discriminative query expansion (DQE), using a linear SVM to obtain a weight vector that emphasizes the negative samples far away from the query image. Chum et al. [41] propose a new query vector, obtained by averaging the feature vectors of the returned top-k results, and use it to re-query the database. Shen et al. [39] take the k-nn of the initial ranking list as new queries to produce new ranking lists; the updated score of each image is calculated depending on its positions in the new ranking lists. However, in our cross-modal task, the noise arises not only from the high dimensionality of the features but also from the asymmetric cross-modal neighboring relationship in the embedding space. We believe that the k-reciprocal nearest neighbor is superior to k-nn for re-ranking, because it considers both issues to eliminate the influence of noise. The k-reciprocal nearest neighbor was first defined by [38] for object retrieval and has been extended to person re-identification by [44]. In this manuscript, we first analyze and demonstrate the asymmetric cross-modal neighboring relationship between the visual and semantic sides of the embedding space, and then extend k-reciprocal nearest neighbors to the multimodal space. As a result, the retrieved results are more related to the probe in both the semantic and visual directions, which effectively improves retrieval accuracy.

III. CROSS-MODAL RETRIEVAL BOOSTED BY MVSE++ AND RECIPROCAL NEIGHBORS
In this section, we first propose a multiple-feature based visual-semantic embedding method, based on which the initial cross-modal retrieval is conducted. Then, we re-rank the initial results in the MVSE++ space with the k-reciprocal nearest neighbor method. The whole framework is illustrated in Fig.1.

A. MVSE++
In the original VSE++, the visual feature is extracted by a CNN-based object classifier trained on the ImageNet dataset [45]. Thus, the features mainly focus on the objects in images. However, images and their corresponding captions usually involve the scene context in addition to the objects, so the embedding space of VSE++ is not ideal for visual-semantic alignment between images and captions. Fortunately, Zhou et al. [46] selected 365 scene categories, each with more than 4,000 images, to create the Places365-Standard dataset, which offers an ecosystem of visual context to guide progress on scene understanding. Therefore, we incorporate scene features extracted by a CNN-based scene classifier trained on the Places365 dataset. We use class activation mapping (CAM) [47] to visualize the extracted object-related and scene-related regions in Fig.2, which illustrates the complementarity of these features. It is obvious that captions include information on both the objects and the scene context. Thus, we concatenate the two to construct the multiple visual feature φ_multi for image I as follows:

φ_multi(I; θ_φ) = Norm([φ_obj(I; θ_φ); φ_pla(I; θ_φ)]),  (1)

where φ_obj is the object-related feature, φ_pla is the scene-related feature, [·;·] denotes concatenation, θ_φ represents the parameters of the CNNs that extract the visual features, and Norm(·) is a normalization operation that guarantees the features lie on a unit hypersphere. To achieve a better alignment between the visual and semantic features, we map the concatenated visual feature φ_multi into the joint embedding space (i.e., the MVSE++ space) and obtain the embedded multiple visual feature f_multi as follows:

f_multi = W_f^T φ_multi(I; θ_φ),  (2)

where W_f ∈ R^{D_φ×D} is the embedding matrix for the visual feature, D_φ is the dimensionality of the visual feature, and D is the dimensionality of the cross-modal embedding space. Similarly, we use a GRU [36] to encode caption C as a vector ψ(C; θ_ψ) ∈ R^{D_ψ}, where D_ψ is the dimensionality of the caption feature.
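The multi-feature construction of Eq. (1) and its projection into the joint space can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: we assume Norm(·) is L2 normalization applied to the concatenation, and all dimensions and the random features are toy stand-ins for the real CNN outputs.

```python
import numpy as np

def l2norm(x):
    """Project a vector onto the unit hypersphere (assumed form of Norm())."""
    return x / np.linalg.norm(x)

def multi_visual_feature(phi_obj, phi_pla):
    """Eq. (1): concatenate the object and scene features and normalize."""
    return l2norm(np.concatenate([phi_obj, phi_pla]))

def embed_visual(phi_multi, W_f):
    """Map phi_multi into the joint D-dimensional space via W_f."""
    return W_f.T @ phi_multi

# Toy example: 2048-d object + 2048-d scene features -> 1024-d joint space.
rng = np.random.default_rng(0)
phi_obj = rng.normal(size=2048)   # stand-in for an object (ImageNet) feature
phi_pla = rng.normal(size=2048)   # stand-in for a scene (Places365) feature
W_f = rng.normal(size=(4096, 1024))
f_multi = embed_visual(multi_visual_feature(phi_obj, phi_pla), W_f)
```

In practice W_f is learned jointly with the caption encoder, and the concatenated feature dimension D_φ depends on the chosen CNN backbones.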
The GRU is a variant of the recurrent neural network (RNN), obtained by replacing the RNN nodes with GRU nodes that consist of a reset gate and an update gate. Thus, the model learns what to remember and what to forget. The mapping of the caption feature into the joint embedding space is defined in Eq. (3).
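The gate mechanism described above can be sketched with a minimal numpy GRU cell. This follows the standard GRU formulation (update gate z, reset gate r); the parameter names and toy dimensions are hypothetical, and a real encoder would use a deep-learning framework's GRU with learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, p):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.
    p is a dict of (hypothetical) weight matrices and biases."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_tilde   # interpolate old and candidate state

def encode_caption(word_vecs, p, d_hidden):
    """Run the GRU over a caption's word vectors; the final hidden state
    plays the role of psi(C; theta_psi)."""
    h = np.zeros(d_hidden)
    for x in word_vecs:
        h = gru_step(h, x, p)
    return h

# Toy caption of 3 words with 4-d word embeddings and a 5-d hidden state.
rng = np.random.default_rng(1)
d_in, d_h = 4, 5
p = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(d_h) for k in ("bz", "br", "bh")})
psi = encode_caption([rng.normal(size=d_in) for _ in range(3)], p, d_h)
```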
g = W_g^T ψ(C; θ_ψ),  (3)

where W_g ∈ R^{D_ψ×D} is the semantic embedding matrix and g is the embedded semantic feature. The model parameters W_f, W_g, θ_ψ are collectively denoted by θ and are obtained through the training process of MVSE++ (see Fig.1); if we fine-tune the visual-feature CNNs, θ also includes θ_φ. In the training process, given image-caption pairs (I, C), we define a similarity measure s_multi(I, C) that aims to score the positive pairs higher than the negative pairs (we take s to be the cosine similarity).
We denote by I′ and C′ the hardest negative image and caption samples, i.e., the negative samples most similar to the corresponding probe:

I′ = argmax_{J≠I} s_multi(J, C),   C′ = argmax_{D≠C} s_multi(I, D).  (4)

The sum-of-violations loss used in [16] effectively minimizes the mean of the non-negative violation terms, while, in fact, the hardest negative sample matters most to R@K. Thus, our final loss for a single example pair is

ℓ(I, C; θ) = [α + s_multi(I, C′) − s_multi(I, C)]_+ + [α + s_multi(I′, C) − s_multi(I, C)]_+,  (5)

where α is the margin and [x]_+ = max(x, 0). We use SGD to learn the parameters θ, taking the maximum violation over each mini-batch. Thus, the empirical loss over all N training pairs, parametrized by θ, is defined in Eq. (6):

e(θ, s) = (1/N) Σ_{n=1}^{N} ℓ(I_n, C_n; θ).  (6)
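The hardest-negative (max-of-hinges) loss described above can be sketched over a mini-batch similarity matrix, in the style of VSE++. This is an illustrative numpy sketch, not the training code; the margin value is arbitrary, and matched pairs are assumed to sit on the diagonal of the similarity matrix.

```python
import numpy as np

def max_hinge_loss(S, margin=0.2):
    """Hardest-negative triplet loss over a mini-batch.
    S[i, j] = s(I_i, C_j), with matching pairs on the diagonal."""
    n = S.shape[0]
    pos = np.diag(S)                      # s(I, C) for the matched pairs
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, S)      # exclude the positive pair
    # hardest negative caption per image, hardest negative image per caption
    cap_viol = np.maximum(0.0, margin + neg.max(axis=1) - pos)
    img_viol = np.maximum(0.0, margin + neg.max(axis=0) - pos)
    return float((cap_viol + img_viol).mean())
```

When every positive pair beats its hardest negative by more than the margin, the loss is exactly zero, so gradients come only from violating examples.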
In the above process, we complement the object feature with the scene-context feature in the visual representation of I. This provides a more comprehensive representation of the visual content, covering both the objects and the scene context in images, and thus achieves a better alignment with C.

B. THE CROSS-MODAL K-RECIPROCAL NEAREST NEIGHBOR BASED RE-RANKING
Captioning describes the visual information of an image in highly condensed natural language, usually covering the salient objects and their related context. Thus, the information carried by the visual feature and the semantic feature is asymmetric. We build the MVSE++ embedding space to align them better than the space of VSE++ does.
In the embedding space, the similarity s_multi(I, C) is the dot product between I and C. Thus, it is symmetric, viz. s_multi(I, C) = s_multi(C, I). However, the neighboring relationships are asymmetric. To explain this clearly, we define the cross-modal k-nearest neighbors (i.e., the top-k list) of an image vector I as top(k, I), which contains the k most similar caption vectors {C_1, C_2, ..., C_k} in the MVSE++ embedding space. Similarly, I ∈ top(k, C) means that I is a top-k cross-modal neighbor of C in the embedding space. The asymmetric neighboring relationship means that C ∈ top(k, I) does not imply I ∈ top(k, C). Although cross-modal retrieval is bi-directional, the nearest-neighbor relationship searched from the visual or the semantic side of the embedding space toward the other modality is asymmetric, as illustrated in Fig.3: the top-2 nearest neighbors of I_1 are C_1 and C_2; I_1 belongs to the top-2 nearest neighbors of C_1, but not to those of C_2. Traditional k-nn retrieval only searches from one side and may therefore suffer from this asymmetry. Intuitively, we address this issue with reciprocal retrieval. We are not the first to use reciprocal nearest neighbors to handle asymmetric nearest-neighbor relationships: Zhong et al. used such re-ranking to improve the accuracy of person ReID [44], and Qin et al. used k-reciprocal nearest neighbors for object retrieval [38]. Yet in the cross-modal retrieval community, limited effort has been devoted to the asymmetric nearest-neighbor relationship in the embedded latent space. The k-reciprocal nearest neighbors can be extended to most cross-modal retrieval tasks. An underlying assumption of this k-reciprocal-nearest-neighbor based re-ranking is that if a returned image/caption ranks within the k-nearest neighbors of the probe, it is likely to be a true match, which can be used for the subsequent re-ranking.

FIGURE 4. The process of re-ranking based on cross-modal reciprocal neighbors for caption retrieval. At query time, we first retrieve the related samples in the MVSE++ space, then re-rank the initial list based on cross-modal reciprocal neighbors; here k_1 = 7 and k_2 = 5. The initial list shows the top-7 nearest neighbors obtained by the initial ranking in the MVSE++ space for a probe image; the re-ranked list gives the top-7 captions obtained via the cross-modal reciprocal nearest neighbors. Blue and green boxes correspond to the positive captions and the probe image, respectively. We use the same strategy for re-ranking in the other modality.
We re-rank the initial ranking list based on the k-reciprocal nearest neighbors in the MVSE++ space to reduce the influence of the asymmetry. We define the cross-modal reciprocal nearest neighbors R(I, k_1, k_2) of the probe image I as follows:

R(I, k_1, k_2) = {C_i | (C_i ∈ top(k_1, I)) ∧ (I ∈ top(k_2, C_i))},  (7)

where C_i, i ∈ {1, ..., k_1}, are the k_1-nearest captions of I in the embedding space. To qualify as a cross-modal reciprocal neighbor, we additionally require that I belong to the k_2-nearest images of caption C_i. The cross-modal reciprocal nearest neighbors R(C, k_1, k_2) of a caption C are defined similarly. Fig.4 gives an example: for the probe image I, we search for its top-k_1 neighboring captions {C_1, C_2, ..., C_{k_1}} in the MVSE++ embedding space; then, for each caption C_j, we retrieve its top-k_2 neighboring images in the same space. If the retrieved images include I, we say C_j is a reciprocal neighbor of I.
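Eq. (7) can be sketched on a toy similarity matrix as follows; the matrix values are illustrative, with rows standing for image vectors and columns for caption vectors in the embedding space.

```python
import numpy as np

def topk(sims, k):
    """Indices of the k most similar items, best first."""
    return np.argsort(-sims)[:k]

def reciprocal_neighbors(S, i, k1, k2):
    """Eq. (7): R(I_i, k1, k2), where S[i, j] = s_multi(I_i, C_j).
    Caption j is kept iff j is among the top-k1 captions for image i
    AND image i is among the top-k2 images for caption j."""
    candidates = topk(S[i, :], k1)        # k1-nearest captions of image i
    return [j for j in candidates if i in topk(S[:, j], k2)]

# Toy setup: 2 images x 3 captions.
S = np.array([[0.9, 0.8, 0.1],
              [0.2, 0.85, 0.3]])
```

For image 0 with k1 = 2, caption 1 is a top-2 neighbor of image 0, but image 1 (not image 0) is the single nearest image of caption 1, so with k2 = 1 caption 1 is filtered out; enlarging k2 to 2 readmits it. A production implementation would use `np.argpartition` instead of a full sort for large candidate sets.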
The cross-modal k-reciprocal nearest-neighbor relationship R(I, k_1, k_2) is a much stronger indicator of similarity than the unidirectional nearest-neighbor relationship top(k_1, I), because it takes into account the local densities of vectors around I and C. Consequently, the asymmetry of nearest-neighbor relationships in cross-modal retrieval is reduced, and re-ranking based on cross-modal reciprocal neighbors can effectively improve retrieval performance. We denote our retrieval method based on the reciprocal re-ranking as R-MVSE++.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATASET AND EVALUATION METRIC
To verify the performance of our method, we conduct experiments on the Flickr30K [48] and Microsoft COCO [49] datasets. Flickr30K includes 30,000 images. Following Karpathy and Fei-Fei [50], we use 1,000 images for validation, 1,000 images for testing, and the remaining images for training. MSCOCO contains 164,062 images. We also use 1,000 images for validation and 1,000 images for testing. For a fair comparison, we follow Faghri et al. [17] in splitting the training, validation and test sets. To further improve accuracy, we use the 30,504 images left in the MSCOCO validation set together with the original training images (113,287 training images in total); we refer to this split as rV. Each image in these two datasets is annotated with 5 sentences via Amazon Mechanical Turk (AMT). We evaluate retrieval performance using Recall@K (viz. R@K), the fraction of queries for which a correct match appears within the top-k retrieved results. We also report the median rank of the closest ground-truth result in the ranked list (i.e., Med r).
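The R@K metric can be sketched as follows. This sketch assumes a single ground-truth caption per image, stored at the matching index of the similarity matrix; the actual evaluation with five captions per image needs a small extension (a hit if any of the five appears in the top-k).

```python
import numpy as np

def recall_at_k(S, k):
    """Recall@K (in %) for caption retrieval: fraction of image queries
    whose ground-truth caption (assumed at the same index) appears in
    the top-k ranked captions. S[i, j] = s_multi(I_i, C_j)."""
    order = np.argsort(-S, axis=1)        # captions ranked per image query
    hits = [i in order[i, :k] for i in range(S.shape[0])]
    return 100.0 * float(np.mean(hits))

# Toy 3x3 similarity matrix; only query 0 ranks its match first.
S = np.array([[0.9, 0.1, 0.2],
              [0.3, 0.2, 0.8],
              [0.1, 0.9, 0.5]])
```

Image retrieval is scored the same way on the transposed matrix.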

B. TRAINING
To compare with existing methods, we conduct experiments on three popular visual feature encoders: VGG19 [51], ResNet152 [52] and DenseNet161 [53]. The image features are extracted by first resizing the image to 256 × 256 and then taking a random 224 × 224 crop. We pre-compute the FC7 features (the penultimate fully connected layer) of VGG19 and the Pool5 features (the last pooling layer) of ResNet152 and DenseNet161. The dimensionality of the image feature is 4,096 for VGG19, 2,048 for ResNet152 and 2,208 for DenseNet161. We use a GRU as the caption encoder and set the dimensionality of the word vectors to 300. We set the dimensionality of the GRU and of the joint embedding space to 1,024.

C. ABLATION STUDY
We conduct two groups of experiments on ablation study. One is to compare the performance of embedding spaces of VSE++ and MVSE++. The other is to evaluate the influence of k value on R@1 and R@10.

1) VERIFICATION OF THE DIFFERENT FEATURES ON EMBEDDING
Tab.1 verifies the effectiveness of our multiple features for retrieval on the MSCOCO validation set. As shown in Tab.1, both VSE++ and MVSE++ perform better on ResNet152 and DenseNet161 than on VGG19, which demonstrates the power of ResNet and DenseNet for visual feature extraction. The results on ResNet152 and DenseNet161 are comparable for both VSE++ and MVSE++. MVSE++ outperforms VSE++ on all three networks; for example, MVSE++ on ResNet152 achieves a 3.4% improvement in R@1 for caption retrieval and a 3.4% improvement in R@1 for image retrieval compared with VSE++ on ResNet152.
To validate the robustness of our method on a different dataset, we conduct experiments on Flickr30K. Tab.2 verifies the superiority of the multiple features of MVSE++ over the single feature of VSE++. In Tab.2, we can observe that MVSE++ on VGG19 obtains a 0.6% improvement in R@1 for caption retrieval and a 2.2% improvement in R@1 for image retrieval. On ResNet152, MVSE++ obtains a 6.7% improvement in R@1 for caption retrieval and a 4.4% improvement in R@1 for image retrieval. On DenseNet161, MVSE++ obtains a 2.8% improvement in R@1 for caption retrieval and a 2.7% improvement in R@1 for image retrieval. These results imply that our multiple features effectively improve the accuracy of cross-modal retrieval. The performance of MVSE++ on ResNet152 and DenseNet161 is comparable, and both are better than our method on VGG19, which is consistent with the results on MSCOCO. The results demonstrate that the scene-related feature achieves a better alignment between the visual and semantic modalities in the embedding space. The CAM [47] visualization in Fig.2 of the focus of the object-related and scene-related visual features also illustrates their complementarity. Thus, the concatenated feature provides a comprehensive representation of the image.

2) VERIFICATION OF THE NUMBER OF NEIGHBORS
We evaluate the influence of k_2 on the R@1 and R@10 accuracies on VGG19, DenseNet161 and ResNet152, as illustrated in Fig.5. In our experiments, we find that the retrieval achieves the best results when k_1 = 30, so we fix k_1 at 30 in the following experiments. We can see that for caption retrieval, R@1 achieves the best performance when k_2 = 1 and remains stable as k_2 increases. R@10 achieves the best performance when k_2 equals 2 or 3, which means the probe caption usually lies within the top-3 reciprocal neighbors of the cross-modally retrieved images. For image retrieval, R@1 achieves the best results when k_2 = 2. However, the R@10 accuracy continues to rise until k_2 reaches 8 in Fig.5. Since captions are highly condensed semantic information of images, images contain more information than captions. Once a caption is given, the corresponding image is easy to determine; an image, however, can be described by many captions, so its description is not unique. That is why dense captioning [63] and caption variation [64] have become popular recently.

According to the above results, we fix k_1 at 30 and k_2 at 2, and then compare MVSE++ and R-MVSE++ to demonstrate the effectiveness of the reciprocal re-ranking. In Tab.3, we can see that our re-ranking method achieves an evident improvement in retrieval. Especially for image retrieval, our method achieves R@1 improvements of 13.8% on VGG19, 17.1% on ResNet152, 16.6% on DenseNet161 and 14.4% on fine-tuned DenseNet161. Tab.4 presents the results of re-ranking on Flickr30K with k_1 = 30 and k_2 = 2. Our method achieves R@1 improvements of 4.8% on VGG19, 5.8% on ResNet152, and 5.1% on DenseNet161 for caption retrieval. The improvement on image retrieval is higher than that on caption retrieval: the scene-related feature of the image helps us find better-aligned captions, which in turn help us find images similar to the probe in terms of both object and scene.
However, once the scene semantics are missing from the probe caption, they are not easy to recover through re-ranking.

D. VALIDATION ON RETRIEVAL
In the previous experiments, we only update the caption parameters θ_ψ and the mapping matrices W_f, W_g. Considering that the visual feature encoders are pre-trained for a classification task, we fine-tune part of their parameters to make the encoders better suit our visual-semantic embedding task. We report the fine-tuned results in Tab.5 to compare our embedding method with Order [54], Embedding Network [18], sm-LSTM [33], 2WayNet [55], GMM-HGLMM [56] and SPE [14]. Our method achieves the best results among these methods. Our method on fine-tuned DenseNet161 achieves a 5.0% improvement in R@1 for caption retrieval and a 14.9% improvement for image retrieval over VSE++ on fine-tuned ResNet152, which means DenseNet161 can be better fine-tuned to obtain more suitable parameters for visual feature encoding toward cross-modal embedding. Fig.5 illustrates the superiority of our method over VSE++. We also compare our method with these methods in Tab.6 on the Flickr30K dataset. Our embedding method on DenseNet161 with fine-tuning achieves 59.6% accuracy in R@1 for image retrieval and 56.1% accuracy in R@1 for caption retrieval, outperforming these methods.

FIGURE 5. The impact of the parameter k_2 on re-ranking performance for the initial retrieval with features extracted by VGG19 (a), DenseNet161 (b) and ResNet152 (c), respectively. We fix k_1 at 30.
We also compare with more recent state-of-the-art approaches: GXN [35], SCAN [58], SCO [57], CMPM (ResNet-152) [59], JGCAR [60], VSRN [61] and MTFN [62]. The results in Tab.5 and Tab.6 indicate that our method achieves performance comparable to these state-of-the-art approaches. Moreover, we obtain clear advantages over these methods in image retrieval in terms of the R@1 and R@10 measures. Since captions are highly condensed semantic information of images, images contain more information than captions. In the visual information encoding, our multiple features provide more information for the semantic description to retrieve; once a caption is given, the corresponding image is easy to determine. Thus, our method is superior to the other methods in image retrieval on the MSCOCO dataset and on most measures of the Flickr30K dataset.

V. CONCLUSION
In this paper, we investigate image-caption retrieval by combining object-scene feature based visual-semantic embedding with reciprocal neighbors. We show that the multi-feature representation, which avoids visual-semantic misalignment in terms of both objects and scene context, is more representative than the previously used single feature for image-caption retrieval. We also incorporate a re-ranking method based on k-reciprocal nearest neighbors to re-rank the initial ranking list, which symmetrizes the nearest-neighbor relationships in cross-modal retrieval. We evaluate our method on the MSCOCO and Flickr30K datasets, and the results consistently demonstrate the superiority of our method over the state-of-the-art methods. Besides retrieval, our visual-semantic embedding model can be applied to image captioning, VQA, and other tasks. Considering that the failure cases of the proposed method are usually caused by interactions between different objects, in the future we will focus on how to better capture the local alignment between visual and semantic information.