Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces

A new approach that drastically improves cross-modal retrieval performance between vision and language (hereinafter referred to as "vision and language retrieval") is proposed in this paper. Vision and language retrieval takes data of one modality as a query to retrieve relevant data of another modality, enabling flexible retrieval across modalities. Most existing methods learn optimal embeddings of visual and lingual information into a single common representation space. However, we argue that this forced embedding optimization results in the loss of key information from sentences and images. In this paper, we propose a simple but robust vision and language retrieval method that effectively utilizes multiple representation spaces. The proposed method makes use of multiple individual representation spaces through text-to-image and image-to-text models. Experimental results showed that the proposed approach enhances the performance of existing methods that embed visual and lingual information into a single common representation space.


I. INTRODUCTION
Single-modal retrieval, such as document retrieval from keyword queries [1] and image retrieval from an image query [2], has traditionally been conducted. However, these approaches only perform retrieval within the same modality. Web services now contain not only lingual descriptions but also images and videos [3]. Therefore, a vision and language retrieval method that retrieves relevant data of one modality by utilizing a query of another modality [4]-[10] is needed for many practical applications such as hot topic detection, personalized recommendation and multimodal retrieval [11], [12]. We can easily associate vision with language and vice versa; however, such an association is still difficult in computer vision because of the large gap between vision and language [13]-[15]. In fact, different modalities have diverse representations and distributions, and these heterogeneous characteristics make it difficult to directly measure the similarity between vision and language [16].
The associate editor coordinating the review of this manuscript and approving it for publication was Chin-Feng Lai.

Popular vision and language retrieval methods are embedding approaches [4]-[15]. They aim to find a joint mapping of instances from the visual and lingual representation spaces to a common representation space so that related instances from the source representation spaces are mapped to nearby places in the target space. Specifically, the goal of these approaches is to jointly learn two mapping functions f : L → E and g : V → E, where L and V are the lingual and visual representation spaces, respectively, and E is a common representation space. For example, in a text-to-image retrieval scenario, a lingual feature from a query in space L and visual features from candidate images in space V are projected into a learned common representation space E in which the two different modalities can be compared. This embedding approach is currently one of the most popular approaches.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Illustration of the concept of the proposed approach. We make use of two representation spaces, space L and space V, in addition to space E.

Although progress has been made in embedding approaches, there is still room for improvement, especially in the representation spaces. Most of the embedding approaches try to embed sentence and image information into a single representation space, E; however, sentence and image information might not be expressed as effectively in space E as in
spaces L and V due to the forced embedding optimization. For example, it can be considered that word relationship information in a sentence (e.g., verbs and prepositions) and detailed texture information in an image (e.g., object surface and scenery) can be effectively expressed in spaces L and V, respectively [17]- [19]. If we can make use of spaces L and V in addition to the learned common representation space E, more robust vision and language retrieval would become feasible.
In this paper, we evaluate the effectiveness of the representation spaces L and V and show that the retrieval performance of conventional embedding approaches can be enhanced by effectively utilizing spaces L and V. As shown in Fig. 1, we make use of spaces L and V to enhance the retrieval performance of embedding approaches that utilize only space E. How do we utilize the visual and lingual representation spaces? We simply adopt two types of generative models: an image-to-text model M_{V→L} [20]-[24] and a text-to-image model M_{L→V} [25]-[28]. Generative models can generate one type of multimedia content from a different type of multimedia content by simulating the observed distribution. An image-to-text model learns the mapping from space V to space L, and a text-to-image model learns the mapping from space L to space V. In the proposed approach, the candidate images and a query sentence are translated into synthesized sentences and a synthesized image through M_{V→L} and M_{L→V}, respectively. Then we calculate the similarities between the query sentence and the synthesized sentences in space L and the similarities between the synthesized image and the candidate images in space V. By effectively using spaces L and V in addition to space E, the retrieval performance of conventional embedding methods that only utilize space E can be enhanced.
The contributions of this paper are as follows. First, we verify the effectiveness of spaces L and V based on image-to-text and text-to-image models. Then we enhance the retrieval performance of conventional embedding methods that only utilize space E by utilizing the text-to-image and image-to-text models.
We review related works in Sec. II. Then we describe the proposed vision and language retrieval method in Sec. III. Finally, we evaluate the effectiveness of each representation space in Sec. IV and the proposed approach in Sec. V.

II. RELATED WORKS
We briefly review some of the most relevant works on vision and language retrieval and explain the differences between these methods and our approach. Also, we describe studies on generative models that are utilized in our approach.

A. VISION AND LANGUAGE RETRIEVAL
The retrieval performance of vision and language retrieval methods is greatly dependent on the lingual and visual feature representation. Traditional methods based on Canonical Correlation Analysis (CCA) [29] or Topic-regression Multi-modal LDA (Tr-mm LDA) [30] utilized handcrafted feature extraction techniques such as Scale Invariant Feature Transform (SIFT). They can retrieve desired images by considering various factors such as rotation and scaling. However, such handcrafted features cannot represent the high-level semantics of multimedia data [31]. Feature extraction techniques that can represent high-level semantics have therefore been studied [32]. Recently, it has been shown that deep neural networks can learn the joint lingual and visual semantic feature representation and achieve better performance [18], [33], [34].
To embed such lingual and visual features into a common representation space, most vision and language methods aim to learn a common embedding representation space E in which the similarity between samples from different modalities can be measured directly. These methods learn a proper embedding function from the different modalities by utilizing similar or dissimilar cross-modal data pairs. Kiros et al. [13] utilized a Gated Recurrent Unit (GRU) [35] and a Convolutional Neural Network (CNN). In their method, the distances between matched text-image pairs are smaller than the distances between mismatched pairs. Faghri et al. [36] extended the method of [13] with the idea of hard negative mining. Their method focuses on the hardest negative pairs that violate the ranking margin, and they reported improved convergence rates. Also, as an extension of the method of [13], Zhang and Lu [9] proposed an embedding method that calculates a loss function with KL divergence. By utilizing KL divergence, they realized discriminative text-image embeddings. The above-described methods improve vision and language retrieval performance by focusing on the effective utilization of the common representation space E. However, space E might not express the same sentence and image information as that expressed in spaces L and V. The proposed approach further makes use of multiple representation spaces (L and V) to enhance their retrieval performance.
On the other hand, there are methods that utilize multiple representation spaces for cross-modal retrieval. To utilize multiple representation spaces, Eisenschtat and Wolf [37] adopted bi-directional neural architectures, and Wei et al. [38] jointly optimized a correlation space (between text and images) and a linear regression space (from one modal space, i.e., text or image, to the semantic space). They improve cross-modal retrieval performance by utilizing multiple representation spaces. Different from their approach, we propose an approach that additionally introduces multiple representation spaces (L and V) from generative models to several conventional embedding methods that only utilize space E and enhances their retrieval performance.

B. TEXT-TO-IMAGE AND IMAGE-TO-TEXT MODELS
Image-to-text models [20]-[24] can generate a description or sentence representing the information of an input image. Image-to-text generation tasks have been widely studied in the field of computer vision. Vinyals et al. [20] used a deep recurrent architecture to generate a sentence representing an input image. Xu et al. [21] extended this method with an attention mechanism that can focus on relevant parts of the image and improves the performance of image-to-text generation.
Text-to-image generation has been one of the most attractive research topics in recent years. Text-to-image Generative Adversarial Networks (GANs) [25]-[28], [39]-[43] are the most basic and popular approaches. They are deep-learning methods that can generate an image representing the information of an input sentence by alternately training their generators and discriminators. By conditioning on an input sentence, the generators generate images, and the discriminators discriminate whether input images are from a real data distribution or a generated data distribution. Reed et al. proposed the first method in which a GAN was applied to a text-to-image task [26]. Although the generated images are not visually pleasing, they showed a new way of using a GAN. In recent years, hierarchical structures have been used in various text-to-image GANs [28], [44]. Qiao et al. [44] proposed MirrorGAN, which adopts an image-to-text model in the text-to-image GAN to guarantee semantic consistency between the text descriptions and visual contents. On the other hand, Zhu et al. [28] proposed DM-GAN, which introduces a dynamic memory module to refine fuzzy image contents.
To calculate similarity in space L and space V, we generate a sentence representing information of an image based on an image-to-text model and generate an image representing information of an input sentence by utilizing text-to-image GANs.

III. OUR VISION AND LANGUAGE RETRIEVAL APPROACH
We present a simple but effective vision and language retrieval approach that utilizes spaces L and V in addition to space E, with a focus on a text-to-image retrieval scenario. The goal of the text-to-image retrieval scenario is to retrieve a relevant image from only an input sentence as a query; in this cross-modal retrieval setting, the candidate images have no descriptions. Our retrieval approach consists of two phases: similarity calculation and retrieval. In the first phase, we calculate three types of similarities between an input sentence and the candidate images. Our approach calculates the similarity s_n^L (n = 1, 2, ..., N; N being the number of candidate images) in the lingual space L through an image-to-text model M_{V→L} and s_n^V in the visual space V through a text-to-image model M_{L→V}, in addition to the similarity s_n^E in space E. In the second phase, we simply integrate the three types of similarities as follows:

s_n = α s_n^L + β s_n^E + γ s_n^V, (1)

where α, β and γ are parameters that balance the contributions of the three similarities. Finally, we rank the candidate images in descending order of s_n. By using spaces L and V together with space E, the retrieval performance of vision and language retrieval can be enhanced. The details of each similarity calculation are given below.
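The weighted integration of Eq. (1) and the subsequent ranking can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names (`fuse_similarities`, `rank_candidates`), not the authors' implementation:

```python
import numpy as np

def fuse_similarities(s_L, s_E, s_V, alpha=1/3, beta=1/3, gamma=1/3):
    """Integrate the three per-candidate similarities as in Eq. (1):
    s_n = alpha * s_n^L + beta * s_n^E + gamma * s_n^V."""
    s_L, s_E, s_V = map(np.asarray, (s_L, s_E, s_V))
    return alpha * s_L + beta * s_E + gamma * s_V

def rank_candidates(s_n):
    """Return candidate indices sorted in descending order of s_n."""
    return np.argsort(-np.asarray(s_n)).tolist()
```

With α = β = γ = 1/3 (the default used in the experiments of Sec. V), a candidate that scores well in all three spaces is ranked first.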

A. SIMILARITY CALCULATION IN SPACE L
Theoretically, since any type of image-to-text model can be applied as M_{V→L} in our approach, an explanation based on a basic image-to-text model [20] is presented. A basic image-to-text model consists of a single Recurrent Neural Network (RNN) [45]. In the training phase, to generate a sentence that expresses an input image, we train the RNN to maximize the following objective function:

Σ_x log p(T_x^train | I^train, T_0^train, ..., T_{x-1}^train),

where T^train and I^train are a training sample pair (T^train and I^train being a sentence and an image, respectively) in a training dataset and T_x^train (x = 0, 1, ..., X; X being the number of words in the sentence T^train) is the xth word of the sentence T^train. By continuously predicting the xth word from the (x−1), (x−2), ..., 0th words and the input image I^train in the training phase, the model learns to generate a sentence that contains information on word relationships.
Based on the trained image-to-text model M_{V→L}, we calculate the image-to-text-based similarity in space L. We generate sentences T_n^DB that represent the information of the candidate images I_n^DB. From an input sentence T^Q and the generated sentences T_n^DB, we calculate lingual features f_Q^L ∈ R^{D_L} and f_{DB_n}^L ∈ R^{D_L}, where D_L is the dimension of the extracted lingual features. Then we calculate the cosine similarities s_n^L between f_Q^L and f_{DB_n}^L, where s_n^L indicates the similarity in the lingual representation space between the input sentence T^Q and the candidate image I_n^DB.
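The cosine-similarity step above can be sketched as follows; the feature vectors are stand-ins for the sentence embeddings used in the experiments, and the function names are hypothetical:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarities_in_space_L(f_L_Q, f_L_DB):
    """s_n^L: the query feature against each generated-sentence feature."""
    return [cosine_similarity(f_L_Q, f_n) for f_n in f_L_DB]
```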

B. SIMILARITY CALCULATION IN SPACE V
We also explain the similarity calculation in space V based on the most basic text-to-image model [26]. In the training phase, to generate an image that expresses an input sentence, we train image generation networks (generators) and image identification networks (discriminators) based on the following loss functions:

L_G = log(1 − D(I^G)) + log(1 − D(I^G, T^IN)), (6)
L_D = −{log D(I^R) + log(1 − D(I^G)) + log D(I^R, T^IN) + log(1 − D(I^G, T^IN))}, (7)

where I^G is a generated image from the generated data distribution p_G, I^R is an image from the real data distribution p_R, and T^IN is an input sentence. In Eq. (6), the first term determines whether I^G is from p_R or p_G, while the second term determines whether I^G expresses T^IN or not. Also, in Eq. (7), the first and second terms respectively determine whether I^R and I^G are from p_R or p_G, while the third and fourth terms respectively determine whether I^R and I^G express T^IN or not. Namely, by conditioning on T^IN, the generator learns its parameters to generate I^G that the discriminator cannot distinguish from I^R, while the discriminator learns its parameters to distinguish I^R from I^G. This adversarial training enables the model M_{L→V} to generate a realistic image representing the visual information (such as color and texture) of an input sentence. Based on the trained text-to-image model M_{L→V}, we generate an image I^Q that represents the information of an input sentence T^Q. Next, we calculate visual features f_Q^V ∈ R^{D_V} and f_{DB_n}^V ∈ R^{D_V} from I^Q and I_n^DB, where D_V is the dimension of the extracted visual features. Then we simply calculate the cosine similarities s_n^V between f_Q^V and f_{DB_n}^V. These values indicate the similarity in the visual representation space between the input sentence T^Q and the candidate images I_n^DB.
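As a toy numerical illustration of the adversarial objectives described above: this sketch assumes a GAN-CLS-style discriminator that emits both an unconditional realism score and a text-matching score as probabilities in (0, 1); the function names and score values are made up for illustration:

```python
import math

def generator_loss(d_fake, d_fake_cond):
    """Generator objective: fool both the unconditional score d(I_G)
    and the text-conditioned score d(I_G, T_IN). Lower is better for
    the generator as the discriminator scores approach 1."""
    return math.log(1 - d_fake) + math.log(1 - d_fake_cond)

def discriminator_loss(d_real, d_fake, d_real_cond, d_fake_cond):
    """Discriminator objective: classify real vs. generated images,
    both unconditionally and conditioned on the input sentence."""
    return -(math.log(d_real) + math.log(1 - d_fake)
             + math.log(d_real_cond) + math.log(1 - d_fake_cond))
```

When the generator successfully fools the discriminator (fake scores near 1), its loss decreases; when the discriminator classifies confidently and correctly, its loss decreases.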

C. SIMILARITY CALCULATION IN SPACE E
As stated before, the proposed retrieval method corresponds to an enhancing method that improves the performance of arbitrary conventional embedding methods. Therefore, the selection of these embedding methods is not the main contribution of this paper. Since we can use any type of embedding method, we only explain the calculation of the similarity in the embedding space E here. We will verify the effectiveness of the proposed approach by using several embedding methods in Sec. V.
To calculate the similarities s_n^E, we first calculate a lingual feature f_Q^L from a query sentence T^Q and visual features f_{DB_n}^V from the candidate images I_n^DB. Then we embed f_Q^L and f_{DB_n}^V into the common representation space based on the trained embedding models. We define the embedded lingual features as f̂_Q^L ∈ R^{D_E} and the embedded visual features as f̂_{DB_n}^V ∈ R^{D_E}, where D_E is the dimension of the embedded features. Then we simply calculate the cosine similarities s_n^E between f̂_Q^L and f̂_{DB_n}^V, where s_n^E indicates the similarity in the common representation space between the input sentence T^Q and the candidate image I_n^DB.
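The embed-then-compare step can be sketched as follows. The random matrices here only illustrate the shapes of the mappings f : L → E and g : V → E; real embedding methods learn these mappings with ranking losses, so everything below is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned embeddings into the common space E.
D_L, D_V, D_E = 8, 6, 4
W_f = rng.standard_normal((D_E, D_L))  # hypothetical f: L -> E
W_g = rng.standard_normal((D_E, D_V))  # hypothetical g: V -> E

def embed_and_compare(f_L_Q, f_V_DB):
    """Project a lingual query feature and a visual candidate feature
    into space E, then return their cosine similarity s_n^E."""
    e_q = W_f @ np.asarray(f_L_Q, dtype=float)
    e_c = W_g @ np.asarray(f_V_DB, dtype=float)
    return float(e_q @ e_c / (np.linalg.norm(e_q) * np.linalg.norm(e_c)))
```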

IV. VERIFYING THE EFFECTIVENESS OF LINGUAL AND VISUAL SPACES FOR RETRIEVAL
In this section, we present experimental results to verify the effectiveness of using each of the spaces L and V. Verifying this effectiveness motivates a more robust vision and language retrieval method that further focuses on sentence and image information.

A. EXPERIMENTAL SETUP
In the experiments, we used the Microsoft Common Objects in Context (MSCOCO) dataset [50], which consists of daily-scene images and their descriptions, following the conventional vision and language methods shown in Table 1.
The MSCOCO dataset contains 82,783 training images and 40,504 validation images, each of which is associated with five descriptions. Based on the data split provided by [13], we utilized 5,000 captions as queries, and for each caption, we retrieved a relevant image from the 5,000 test images.
As the text-to-image and image-to-text models, we utilized the state-of-the-art Dynamic Memory GAN [28] and the Show, Attend and Tell method [21], respectively. Also, the output of the BERT model provided by Reimers and Gurevych [18] was used as the features of the query and generated sentences (f_Q^L and f_{DB_n}^L), and the output of the DenseNet-121 model [34] was used as the features of the generated and candidate images (f_Q^V and f_{DB_n}^V). As embedding methods, we adopted the state-of-the-art methods shown in Table 1. It should be noted that all of the methods were implemented on the basis of the open-source code provided by each author.

B. VERIFYING THE EFFECTIVENESS OF LINGUAL SPACE
The relationships between words provide vital information for discriminating similar images. We can describe these relationships well with language, and lingual feature extractors are trained to consider sentence structures [18]. We assume that space L focuses on representing the word relationships in a sentence and thus can express this information better than conventional embedding methods using space E can. If this assumption holds, cooperatively using space L with the conventional embedding methods that use space E should enhance their retrieval performance.
TABLE 1. Methods used in the preliminary experiments described in Sec. IV and in the retrieval performance evaluation described in Sec. V. ''Description'' shows an overview of each method.

At first, we removed the verbs and prepositions of the input sentences and randomly shuffled the remaining words. We defined these sentences as changed sentences. Then we inputted these

FIGURE 2. Mean and median ranks obtained by methods based on normal sentences (red bars) and methods based on sentences in which verbs and prepositions had been removed and words had been shuffled (blue bars). In mean and median ranks, lower values indicate better performance.
changed sentences into the various methods. Finally, we compared the retrieval performance of the methods utilizing normal sentences with that of the methods utilizing changed sentences. Figure 2 shows the mean and median ranks of the retrieval results for the MSCOCO dataset. Red bars show the results based on normal sentences, and blue bars show the results based on sentences in which verbs and prepositions had been removed and words had been shuffled. While the values of the blue bars are similar to those of the red bars for BL_V and the other conventional methods using space E, the value of the blue bar is considerably larger than that of the red bar for BL_L. These results imply that BL_V and the other conventional methods using space E mainly focus on the objects in sentences to retrieve desired images. On the other hand, the results indicate that verbs, prepositions and word order are important for BL_L, which uses space L. The results suggest that the performance of conventional embedding methods using space E can be enhanced by cooperatively using space L, since it makes them further focus on word relationship information.
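The construction of the changed sentences can be sketched as follows. The toy part-of-speech lookup here stands in for a real POS tagger; in practice, the paper does not specify the tagger used, so this is only an illustration of the perturbation itself:

```python
import random

# Toy POS lookup standing in for a real part-of-speech tagger.
TOY_POS = {
    "a": "DET", "man": "NOUN", "rides": "VERB", "horse": "NOUN",
    "on": "PREP", "the": "DET", "beach": "NOUN",
}

def change_sentence(tokens, seed=0):
    """Remove verbs and prepositions, then shuffle the remaining
    words, mimicking the 'changed sentences' of this experiment."""
    kept = [t for t in tokens if TOY_POS.get(t) not in ("VERB", "PREP")]
    rng = random.Random(seed)   # seeded for reproducibility
    rng.shuffle(kept)
    return kept
```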

C. VERIFYING THE EFFECTIVENESS OF VISUAL SPACE
Texture information is vital for retrieving a target image. Recent work [19] has shown that CNN-based visual feature extractors trained on ImageNet focus on texture information. We assume that space V focuses on representing texture information and thus can express it better than conventional embedding methods using space E can. We therefore compared the effect of removing the texture information of the images on the conventional embedding methods, following the experiments in [19]. If space V can express texture information better than the embedding methods using space E can, the performance of those methods could be enhanced by cooperatively using space V.
At first, we converted the candidate images I_n^DB to an edge-based representation using a Canny edge extractor following the conventional method [19] and defined them as changed candidate images. Then we retrieved the changed candidate images using the methods shown in Table 1. Finally, we compared the retrieval performance of the methods using normal candidate images with that of the methods using changed candidate images. Figure 3 shows the mean and median ranks of the retrieval results for the MSCOCO dataset. The yellow bars show the results based on normal candidate images, and the green bars show the results based on candidate images converted to an edge-based representation. A comparison of the differences between the values of the yellow bars and green bars shows that the difference for BL_V using space V (1,696 in mean rank and 1,498 in median rank) is considerably larger than the differences for the other methods using space E (on average, 828 in mean rank and 538 in median rank). Images converted to an edge-based representation have little or no texture information [19]. We therefore verified that texture information is more important for space V than for the other representation spaces. This implies that the performance of conventional embedding methods can be enhanced by cooperatively using space V, since it makes them further focus on texture information.
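The edge-based conversion can be illustrated with a much simpler gradient-magnitude detector than the Canny extractor actually used; this stand-in only conveys the idea that edge maps preserve contours while discarding texture:

```python
import numpy as np

def edge_map(img, threshold=0.2):
    """Binary edge map from finite-difference gradient magnitude.
    A simplified stand-in for the Canny extractor used in [19]:
    edge pixels keep shape cues while texture detail is lost."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)            # finite differences per axis
    mag = np.sqrt(gx ** 2 + gy ** 2)     # gradient magnitude
    return (mag > threshold).astype(np.uint8)
```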

V. RETRIEVAL PERFORMANCE EVALUATION
In this section, we evaluate the effectiveness of the proposed approach by comparing it with some state-of-the-art retrieval methods.

A. QUANTITATIVE EVALUATION ON MSCOCO DATASET
Following the evaluations of the conventional embedding methods, we report results using the mean rank, median rank and Recall@k, defined as

Recall@k = r_k / N,

where r_k is the number of correctly retrieved images in the top-k retrieval results, and N (= 5,000) is the number of input sentences. Furthermore, in our framework, we simply set α, β and γ in Eq. (1) as α = β = γ = 1/3. Note that we utilized the same experimental setup as that described in Subsec. IV-A and used the comparative methods shown in Table 1. Note that the proposed approach can be applied to all of the existing embedding methods, since its core idea is the combination of the embedding space with the original visual and lingual spaces, as shown in Table 2. In Fig. 4, we show the results obtained by the method of Ji '19, since this method shows the best results in Table 2.
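The evaluation metrics above can be sketched as follows. For simplicity this assumes one ground-truth image per query, placed on the diagonal of the similarity matrix (the MSCOCO protocol pairs each caption with a single relevant image, but the indexing here is a hypothetical simplification):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@k, mean rank and median rank from an N x N similarity
    matrix, where sim[i, j] scores query i against candidate j and
    the ground-truth match of query i is candidate i."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                 # best candidate first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(n)])            # 1-indexed rank of match
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(ranks.mean()), float(np.median(ranks))
```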
As shown in Table 2, the proposed approach using spaces L and V in addition to space E outperforms the conventional embedding methods. Specifically, for each method, we can see that the proposed approach drastically improves the mean and median ranks compared to the state-of-the-art methods. This means that the proposed approach is effective for improving the retrieval performance of various conventional embedding methods for vision and language retrieval. Furthermore, for each of the conventional methods, we can see that the retrieval performance is better when both space L and space V are used than when only space L or space V is used. It can therefore be assumed that both spaces contribute to the performance enhancement. Also, from Fig. 4, we qualitatively confirmed that the proposed approach with the method of Ji '19 can realize retrieval that considers the objects in images and their relationships, compared with the method of Ji '19 alone.

B. QUANTITATIVE EVALUATION ON THE FLICKR30K DATASET
As an additional experiment, we calculated Recall@k and the mean and median ranks on the Flickr30k dataset [51], which contains 31,783 images collected from the Flickr website. Following [52], we split the dataset into 29,783 training images, 1,000 validation images and 1,000 test images. We utilized 1,000 captions as queries, and for each caption, we retrieved a relevant image from the 1,000 test images.
Table 3 shows Recall@k and the mean and median ranks on the Flickr30k dataset. As shown in Table 3, the proposed approach using spaces L and V in addition to space E outperforms the conventional embedding methods. From the quantitative evaluations on the MSCOCO and Flickr30k datasets, we confirm that the proposed approach can enhance the retrieval performance of conventional embedding methods that only utilize space E across various datasets.

C. PARAMETER SEARCH AND ANALYSIS OF RANKS IN EACH SPACE
In our proposed approach, we set the parameters α, β and γ that balance the contributions of the similarities. We simply set each parameter as α = β = γ = 1/3 in the above experiments; however, there may be better parameter settings, and we can further boost the performance of the proposed approach by adjusting each parameter. Moreover, by searching for the best parameters, we can observe the relative importance of the similarities s_n^L, s_n^E and s_n^V. Thus, we show the results of a parameter search.
In this experiment, we simply conducted a grid search on the MSCOCO dataset. We changed each parameter in steps of 0.1 and observed the mean and median ranks at each parameter setting. Here, we utilized the method of Ji '19 to calculate s_n^E since it gives the best results, as shown in Subsec. V-A. Figures 5 and 6 show the mean and median ranks, respectively. As shown in Fig. 5, the best mean rank is obtained around the setting α = 0.3, β = 0.4 and γ = 0.3. Also, as shown in Fig. 6, the best median rank is obtained around the setting α = 0.3, β = 0.5 and γ = 0.2. These results mean that the similarity s_n^E carries the most important information; however, the similarities s_n^L and s_n^V are also important, and we can enhance vision and language retrieval performance by using spaces L, E and V simultaneously. Also, in Fig. 7, we observed the rank of each sample on the MSCOCO dataset when utilizing only space L, E or V. Each axis shows the rank when using only the corresponding space. From Fig. 7, we can see that the samples that can be retrieved at a low rank differ in each space. This result implies that the proposed approach improves the retrieval performance of vision and language retrieval by complementing the poorly retrieved samples in each space. We could further improve the retrieval performance if we could identify the poorly retrieved samples in each space, and we will examine this in our future work.
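The grid search over α, β and γ can be sketched as follows. This assumes the weights are constrained to sum to one (consistent with the reported optima, e.g., 0.3 + 0.4 + 0.3 = 1); the `evaluate` callback, which would return the mean rank of the fused retrieval in practice, is a hypothetical stand-in:

```python
import itertools
import numpy as np

def grid_search(evaluate, step=0.1):
    """Search (alpha, beta, gamma) on a grid of the given step with
    alpha + beta + gamma = 1, returning the setting that minimizes
    the score returned by `evaluate` (e.g., mean rank)."""
    best, best_score = None, float("inf")
    for a, b in itertools.product(np.arange(0, 1.0001, step), repeat=2):
        c = 1.0 - a - b
        if c < -1e-9:          # outside the simplex
            continue
        score = evaluate(a, b, max(c, 0.0))
        if score < best_score:  # keep the first best on ties
            best = (round(a, 1), round(b, 1), round(max(c, 0.0), 1))
            best_score = score
    return best, best_score
```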

VI. CONCLUSION
In this paper, we have proposed an approach that utilizes multiple representation spaces obtained from generative models for vision and language retrieval. Experimental results verified that the proposed approach enhances the performance of conventional embedding methods that only utilize space E. Furthermore, by analyzing each component, the advantage of each representation space was shown. In future work, we will attempt to consider adaptive fusion of the similarities.

Professor with the Faculty of Information Science and Technology, Hokkaido University. His research interests include AI, the IoT, and big data analysis for multimedia signal processing and its applications. He is a member of ACM, IEICE, and ITE. He was a Special Session Chair of the IEEE ISCE2009, a Doctoral Symposium Chair of ACM ICMR2018, an Organized Session Chair of the IEEE GCCE2017-2019, a TPC Vice Chair of the IEEE GCCE2018, a Conference Chair of the IEEE GCCE2019, and so on. He has also been an Associate Editor of ITE Transactions on Media Technology and Applications.
MIKI HASEYAMA (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronics from Hokkaido University, Japan, in 1986, 1988, and 1993, respectively. She joined the Graduate School of Information Science and Technology, Hokkaido University, as an Associate Professor, in 1994. She was a Visiting Associate Professor with Washington University, USA, from 1995 to 1996. She is currently a Professor with the Faculty of Information Science and Technology, Division of Media and Network Technologies, Hokkaido University. Her research interests include image and video processing and its development into semantic analysis. She is a member of the IEICE, the Institute of Image Information and Television Engineers (ITE), and the Acoustical Society of Japan (ASJ). She has been a Vice-President of the Institute of Image Information and Television Engineers, Japan (ITE), the Editor-in-Chief of ITE Transactions on Media Technology and Applications, a Director, International Coordination, and Publicity of The Institute of Electronics, Information and Communication Engineers (IEICE).