Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching

Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and attention mechanisms to learn effective and robust multi-modal embeddings for better matching between image and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between image and text for multi-modal embedding learning and to confuse the other process. Then a discriminator tries to distinguish the two types of multi-modal embeddings learned by the two attention models, whereby the two attention models are reinforced mutually. Through adversarial learning, it is expected that both embeddings from the attention models can well exploit the two types of correlation and can thus deceive the discriminator into believing that each was generated by the other attention-based model. By integrating the attention mechanism and adversarial learning, the learned multi-modal embeddings are more effective for image-text matching. Extensive experiments have been conducted on the benchmark datasets Flickr30K and MSCOCO to demonstrate the superiority of the proposed approach over state-of-the-art methods on image-text retrieval.


I. INTRODUCTION
With the massive explosion of social multimedia on the Internet, multi-modal data containing various media types (e.g., image, text, and video) has become increasingly popular in people's daily life. The joint modeling of multi-modal data gives rise to a wide variety of applications. Image captioning [1]–[6] aims to find or generate a natural sentence to describe an image. Image search [7]–[11] retrieves related images given a language sentence or several keywords. A solution to these applications is to learn an effective matching model such that semantically related image-text pairs are assigned higher matching scores than unrelated ones.
The task of image-text matching [12]–[18] has attracted broad interest in both industry and academia. However, it is still challenging because the image and text modalities exhibit complicated correlation with each other. Specifically, two types of fine-grained correlation can be observed between image and text, i.e., the regions-text correlation and the words-image correlation, which can also be regarded as bi-directional correlation. Figure 1 shows an example image-text pair from the Flickr30K dataset. On one side, the text has close correlation with some specific regions in the image, which denotes the regions-text correlation. For example, in Figure 1, the visual regions of the little boy and the football are the most important areas for reflecting the semantic information of the text description. On the other side, the image is also closely related to some specific words in the text, which denotes the words-image correlation. In Figure 1, ''boy'', ''blue'', ''shirt'', and ''ball'' are the most meaningful words for covering the content of the picture. If these two types of correlation are parsed accurately, the image and text can be modeled jointly to learn salient multi-modal features for more effective matching.

(The associate editor coordinating the review of this manuscript and approving it for publication was Jiankang Zhang.)

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. An example of an image-text pair in Flickr30K. Two types of fine-grained correlation exist between image and text, i.e., the correlation between regions and the text and the correlation between words and the image.
There are two main lines of research for image-text matching. The first is Canonical Correlation Analysis (CCA) based methods [19]. CCA computes a linear mapping to maximize the correlation between the two projected views. DCCA [14], [20] extends CCA by employing high-level visual and textual features extracted with deep neural networks. Another strategy for image-text matching is to use a ranking loss to conduct metric learning [13], [15], [16], [21]. In [21], an image-text representation learning method is proposed that builds a two-branch network with both a ranking loss and a structure-preserving loss. DAN [22] builds a dual attention reasoning model for image-text matching. However, these methods cannot effectively capture the bi-directional correlation between image and text.
Attention mechanisms aim to dynamically extract salient features from sequential or spatial data. They have proven effective in many multi-modal tasks, such as visual question answering [23]–[25], multi-modal sentiment analysis [26]–[28], and multi-modal representation learning [29]–[33]. As shown in Figure 1, there exist two types of fine-grained correlation between image and text. Inspired by recent work on attention mechanisms, a visual attention model can be employed to exploit the regions-text correlation to learn a joint embedding named RT for the image-text pair. Meanwhile, a semantic attention model is employed to exploit the words-image correlation to learn another joint embedding named WI. Since the two types of joint embeddings usually have inconsistent distributions, combining them directly is not effective for bridging the gap in image-text matching. Existing attention-based image-text matching approaches typically focus on one type of correlation or treat each type of correlation independently. However, as shown in Figure 1, both types of correlation reflect the matching information of the image-text pair, and they are complementary to each other. Therefore, these approaches fail to fully preserve the underlying relation between the two types of correlation. Preserving this relation and the structure between items requires that the gap between the RT and WI embeddings of the same image-text pair is minimized, while the distances between mismatched items of the same type of joint embedding are maximized.
To address the problems in previous attention-based image-text matching methods, we propose a new approach, Adversarial Attentive Multi-modal Embedding Learning (AAMEL), which is built around the concept of adversarial learning [34]. The core of this approach is the interplay between attentive embedding learning and embedding discriminating. Two attention models are used to learn the multi-modal embeddings RT and WI by exploiting the two types of correlation; their objective is to confuse the discriminator that acts as an adversary. On the other side, a discriminator is built to distinguish the two types of joint embeddings RT and WI learned by the two attention models, which also steers the learning of the joint embeddings. Through the adversarial process, it is expected that both the RT and WI embeddings can well exploit the regions-text and words-image correlation, and thus deceive the discriminator into believing that each was generated by the other attention-based model. In this way, adversarial learning reinforces the two attention models mutually to make the learned multi-modal embeddings more effective for image-text matching. Then, a pairwise margin ranking loss is proposed to capture the triplet constraints for matching score learning. The main contributions are summarized as follows:
• We investigate exploiting the bi-directional correlation between image and text for image-text matching. This is more effective for learning the joint embeddings than considering only one type of correlation or treating each type of correlation independently.
• We propose a novel approach, i.e., Adversarial Attentive Multi-modal Embedding Learning (AAMEL), for image-text matching. To the best of our knowledge, this is the first work to learn the attentive multi-modal embeddings with the adversarial principle for image-text matching.
• The proposed approach AAMEL is evaluated on two benchmark datasets. The experimental results demonstrate the superiority of AAMEL.
The rest of this paper is organized as follows. Section II summarizes the related works. Section III then describes AAMEL in detail. Experimental results are presented in Section IV. Finally, we conclude this work in Section V.

II. RELATED WORK
Existing works of image-text matching can be roughly classified into CCA-based methods and ranking-based methods.
Canonical Correlation Analysis (CCA) [19] aims to compute a linear mapping that maximizes the correlation between two projected views. KCCA [19] is an extension of CCA that uses reproducing kernel Hilbert spaces with corresponding kernels to capture the nonlinear relation between multi-view data. Klein et al. [35] show that properly normalized CCA can achieve satisfactory performance by utilizing state-of-the-art visual and textual features. Due to the limitations of these shallow models, it is difficult for them to capture the highly non-linear cross-modal relation. With the development of deep learning techniques, deep CCA [14], [20], [36] applies CCA on top of the high-level visual and textual features extracted by deep neural networks. Deep CCA allows end-to-end learning that propagates the gradient down through a deep learning framework. However, as described in [37], the generalized eigenvalue problem of CCA cannot be well solved by stochastic gradient descent due to the unstable covariance estimation in mini-batches.
Another strategy for image-text matching is to use a ranking loss to perform metric learning on image-text pairs [13], [21]. DeVISE [38] learns a linear transformation to project images and labels into a joint visual-semantic embedding space. In [21], an image-text representation learning method is proposed that builds a two-branch network with both a ranking loss and a structure-preserving loss. Ma et al. [13] propose multi-modal convolutional neural networks (m-CNNs) to match image and sentence at different semantic levels. Karpathy et al. [39] explicitly compute all pairwise distances to automatically calculate the alignments between visual regions and textual fragments. CMDN [40] builds multiple deep networks for cross-media shared representation learning; it integrates the intra-modal and inter-modal representations to learn the cross-modal correlation using hierarchical neural networks. CHAIN-VSE [16] builds a bidirectional retrieval framework that relies on a character-level inception module for visual-semantic embeddings. Mithun et al. [41] build a two-stage approach for image-text retrieval that combines a supervised ranking loss with weakly-supervised web images to learn multi-modal representations. However, these methods cannot well capture the fine-grained correlation between image and text.
There are also multi-modal matching methods utilizing attention mechanisms [15], [22] or adversarial networks [42]–[47]. sm-LSTM [15] learns an attention-based selective LSTM for matching. The Dual Attention Network for image-text matching (m-DAN) [22] learns two branches of representations for image and text separately. The attention networks in m-DAN are built in two branches for the two modalities, which neglects the cross-modal correlation between image and text. CM-GANs [42] seeks a common subspace with adversarial learning to classify different uni-modalities for representation learning. SCAN [43] builds two kinds of stacked cross attention for correlation learning and then averages their predicted similarity scores for matching. However, this shallow combination is not effective for finding the internal connections between the two models. Compared with SCAN, our method exploits the correlation from a more macroscopic perspective. Furthermore, our two attention models are not treated separately but are combined via adversarial learning, which reinforces them mutually to learn multi-modal embeddings for matching. Gu et al. [44] integrate image and text generative adversarial models into image-text search. DAML [45] uses a deep metric learning framework with adversarial regularization between the two modalities for cross-modal retrieval. ACMR [46] performs image-text retrieval by seeking an effective common subspace with adversarial learning to discriminate between different modalities. Different from these works, our adversarial learning framework is built to guide the two attention models to learn from each other for more effective multi-modal embedding learning.

III. ADVERSARIAL ATTENTIVE EMBEDDING LEARNING

A. PROBLEM STATEMENT
In this section, we give an overview of the proposed model.
Suppose V = {V_1, V_2, . . . , V_n} denotes the image set and T = {T_1, T_2, . . . , T_n} denotes the text description set, where n is the number of samples. Our aim is to train a deep neural network to learn a matching score function such that, for each image V_i, the matched image-text pair (V_i, T_i) has a higher matching score than any mismatched pair (V_i, T_i^-). Figure 2 illustrates the framework of the proposed model AAMEL. Specifically, a visual attention model is employed to exploit the correlation between visual regions and text to learn the regions-text (RT) embedding. Meanwhile, a semantic attention model is employed to exploit the correlation between textual words and image to learn the words-image (WI) embedding. Then, the two attention models are reinforced mutually through adversarial learning for more effective multi-modal embedding learning. That is, a discriminator is used to discriminate between the RT and WI embeddings learned by the two attention models, while the two attention models aim to confuse the discriminator. Finally, a pairwise margin ranking loss is proposed to learn the matching score. In this way, the learned multi-modal embeddings RT and WI capture both types of correlation more effectively and are also discriminative among different image-text pairs.

B. LEARNING REGIONS-TEXT EMBEDDING
Usually, the text has close correlation with some specific regions in the corresponding image. Mining this type of correlation is useful for learning a more effective multi-modal representation for image-text matching. In this section, a visual attention model is proposed to exploit the correlation between visual regions and text to learn the multi-modal representation named the RT embedding.

FIGURE 2. The framework of the approach AAMEL for image-text matching. The proposed visual attention model and textual attention model learn the multi-modal embeddings RT and WI by exploiting the bi-directional fine-grained correlation. The adversarial learning conducts an interplay between attentive embedding learning and embedding discriminating, which steers the two attention models to learn mutually for better embeddings. Through adversarial learning, it can be ensured that the learned multi-modal embeddings RT and WI capture both directions of correlation and are more effective for matching.
Given an image V i , we first extract the features of region maps with pre-trained deep CNNs as R i ∈ R d×m×m , where m × m is the number of regions and d is the dimensionality of the feature vector of each region. For the corresponding text T i , each word is embedded with pre-trained word embeddings and then fed into an LSTM, and the output of the last cell is used as the text representation H i ∈ R h .
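For illustration, the text representation H_i can be obtained as the last hidden state of an LSTM over pre-trained word embeddings; the following is a minimal PyTorch sketch, where the dimensions (300-d word embeddings, 512-d hidden state, 12 words) are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Minimal sketch of the text encoder; dimensions are illustrative assumptions.
emb_dim, hidden_dim, seq_len = 300, 512, 12

# Stand-in for one sentence of pre-trained word embeddings.
words = torch.randn(1, seq_len, emb_dim)

lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
outputs, (h_n, c_n) = lstm(words)
H = h_n[-1]  # (1, 512): the output of the last cell is used as the text feature H_i
```

The region features R_i would come analogously from the last convolutional map of a pre-trained CNN, giving a d × m × m tensor per image.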
To fuse the region features R_i and text features H_i for correlation learning, we first project them into a c-dimensional common space as follows:

    R_i^(c) = W_r R_i + b_r,   (1)
    H_i^(c) = W_h H_i + b_h,   (2)

where W_r ∈ R^{c×d} and W_h ∈ R^{c×h} are the trainable parameters, and b_r and b_h are the bias parameters. Then R_i^(c) and H_i^(c) are fused by element-wise multiplication:

    F_i^(c) = R_i^(c) ⊙ (H_i^(c) ⊗ 1),   (3)

where 1 ∈ R^{c×m×m} is used to broadcast the dimensionality of H_i^(c) to c×m×m, which matches the spatial size of the image, and the symbol ⊙ denotes the element-wise multiplication. The attention map is then calculated by convolving the fused representation F_i^(c) with a 1×1 kernel, followed by a softmax function over the m×m regions:

    α_i = softmax(W_α ∗ F_i^(c) + b_α),   (4)

where the symbol ∗ denotes the convolutional operation, and W_α ∈ R^{c×1×1} and b_α are the convolutional parameters. Then, the attentive features of the image are calculated as the weighted average of the visual features over all regions:

    R̂_i = Σ_{j=1}^{m×m} α_{i,j} R_{i,j}.   (5)

In contrast to the original region features R_i, the attentive region features R̂_i more effectively reflect the relevance to the corresponding text representation H_i. Then, R̂_i and H_i are further fed into a multi-layer perceptron (MLP) to learn an f-dimensional RT embedding F_i^(rt). We denote the whole process of learning the RT embedding from the image-text pair (V_i, T_i) as

    F_i^(rt) = rt(V_i, T_i; θ_rt),   (6)

where the function rt(·) denotes the whole network that produces the RT embedding and θ_rt denotes its parameters.
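A minimal PyTorch sketch of this visual attention step follows; the layer types (a 1×1 convolution for the per-region projection and the attention kernel) match the description above, while the concrete dimensions and layer names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Sketch of the regions-text attention: fuse region and text features,
    score each region, and average the regions by their attention weights."""
    def __init__(self, d, h, c):
        super().__init__()
        self.proj_r = nn.Conv2d(d, c, kernel_size=1)  # W_r, b_r as a 1x1 conv (per-region linear map)
        self.proj_h = nn.Linear(h, c)                 # W_h, b_h
        self.att = nn.Conv2d(c, 1, kernel_size=1)     # W_alpha, b_alpha

    def forward(self, R, H):
        # R: (B, d, m, m) region features; H: (B, h) global text feature
        fused = self.proj_r(R) * self.proj_h(H)[:, :, None, None]  # broadcast text over regions
        B = R.size(0)
        alpha = F.softmax(self.att(fused).view(B, -1), dim=1)      # (B, m*m) weights over regions
        R_flat = R.view(B, R.size(1), -1)                          # (B, d, m*m)
        R_att = torch.bmm(R_flat, alpha.unsqueeze(2)).squeeze(2)   # (B, d) attentive region feature
        return R_att, alpha
```

Here `R_att` plays the role of the attentive region features, which would then be combined with the text feature in the MLP that produces the RT embedding.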

C. LEARNING WORDS-IMAGE EMBEDDING
On the other side, the image also has close correlation with some words in the corresponding text. Therefore, a textual attention model is proposed to exploit the correlation between textual words and image to learn the joint representation named the WI embedding. Given an image V_i, we first extract its high-level visual features from deep CNNs as Q_i ∈ R^q, where q is the dimensionality of the visual feature vector. For the corresponding text T_i, each word is embedded and fed into a bidirectional LSTM (Bi-LSTM) to learn semantic word representations E_i ∈ R^{e×l}, where l denotes the length of the text and e denotes the dimensionality of the word representations.
To fuse the word features E_i and image features Q_i for correlation learning, they are first projected into an s-dimensional common space as follows:

    E_i^(s) = W_e E_i + b_e,   (7)
    Q_i^(s) = W_q Q_i + b_q,   (8)

where W_e ∈ R^{s×e} and W_q ∈ R^{s×q} are the parameter matrices, and b_e and b_q are the bias terms. Then, E_i^(s) and Q_i^(s) are fused by element-wise multiplication:

    F_i^(s) = E_i^(s) ⊙ (Q_i^(s) ⊗ 1),   (9)

where 1 ∈ R^{s×l} is used to broadcast Q_i^(s) to the dimensionality s×l. The attention map is then calculated by convolving the fused representation F_i^(s) with a kernel of length 1, followed by a softmax function over the l words:

    β_i = softmax(W_β ∗ F_i^(s) + b_β),   (10)

where W_β and b_β are the convolutional parameters. Then, the attentive textual features are calculated as the weighted average of the word features over all words:

    Ê_i = Σ_{j=1}^{l} β_{i,j} E_{i,j}.   (11)

In contrast to the original text features E_i, the attentive textual features Ê_i more effectively reflect the relevance to the corresponding image. Then, Ê_i and Q_i are further fed into a multi-layer perceptron (MLP) to learn an f-dimensional WI embedding F_i^(wi). We denote the whole process of learning the WI embedding from the image-text pair (V_i, T_i) as

    F_i^(wi) = wi(V_i, T_i; θ_wi),   (12)

where the function wi(·) denotes the whole network that produces the WI embedding and θ_wi denotes its parameters.
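The textual attention step mirrors the visual one, attending over words instead of regions; a minimal PyTorch sketch (dimensions and layer names are illustrative assumptions) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualAttention(nn.Module):
    """Sketch of the words-image attention: fuse word and image features,
    score each word, and average the words by their attention weights."""
    def __init__(self, e, q, s):
        super().__init__()
        self.proj_e = nn.Linear(e, s)              # W_e, b_e
        self.proj_q = nn.Linear(q, s)              # W_q, b_q
        self.att = nn.Conv1d(s, 1, kernel_size=1)  # W_beta, b_beta (length-1 kernel)

    def forward(self, E, Q):
        # E: (B, l, e) word features; Q: (B, q) global image feature
        fused = self.proj_e(E) * self.proj_q(Q).unsqueeze(1)  # broadcast image over words
        beta = F.softmax(self.att(fused.transpose(1, 2)).squeeze(1), dim=1)  # (B, l)
        E_att = torch.bmm(beta.unsqueeze(1), E).squeeze(1)    # (B, e) attentive word feature
        return E_att, beta
```

Here `E_att` plays the role of the attentive textual features, which would then be combined with the image feature in the MLP that produces the WI embedding.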

D. ADVERSARIAL LEARNING OF MULTI-MODAL EMBEDDINGS
The visual and textual attention models have been used to learn the RT and WI embeddings, respectively. The two attention models learn multi-modal representations based on different types of correlation: the visual attention model focuses on learning discriminative visual features, while the textual attention model focuses on learning image-related word features. Motivated by recent advances in generative adversarial models [34], [48]–[50], an adversarial learning framework is proposed to reinforce the two attention models mutually so as to learn more effective multi-modal embeddings.
Generative adversarial networks (GANs) usually contain a generator and a discriminator. In our model, the generator consists of two components for multi-modal embedding generation, i.e., the visual attention model and the textual attention model. To make the two attention models learn from each other, a classifier d(·; θ_d) is built as the discriminator to distinguish the RT embeddings from the WI embeddings.
In the learning process, the discriminator is trained to classify the RT and WI embeddings based on the input vectors, where the input RT and WI embeddings are assigned the labels of 0 and 1, respectively. Based on Eq. (6) and Eq. (12), the loss function for the discriminator is formulated as:

    L_D = −(1/n) Σ_{i=1}^{n} [ log(1 − d(F_i^(rt); θ_d)) + log d(F_i^(wi); θ_d) ],   (13)

where θ_rt and θ_wi are kept unchanged during the training of the discriminator. On the other side, the two attention models (the generator) aim to confuse the discriminator, so the input RT and WI embeddings are assigned the flipped labels of 1 and 0, respectively. The generator is trained to reduce the following loss:

    L_G = −(1/n) Σ_{i=1}^{n} [ log d(F_i^(rt); θ_d) + log(1 − d(F_i^(wi); θ_d)) ],   (14)

where θ_d is kept unchanged when training the generator. However, as stated in [49], [51], the Jensen-Shannon divergence optimized by the original GAN leads to instability problems. To ease the training difficulty of the original GAN, WGAN [49] is proposed with an efficient approximation of the Wasserstein distance (also known as the Earth Mover distance). While the original GAN is driven by a classifier separating real from fake samples, WGAN is driven by a distance measure that quantifies the similarity of two distributions. Therefore, in our experiments, WGAN is adopted to perform the adversarial learning of the multi-modal embeddings. Based on Eq. (14) and Eq. (13), the losses of the generator and discriminator can be rewritten respectively as:

    L_G = (1/n) Σ_{i=1}^{n} [ d(F_i^(wi); θ_d) − d(F_i^(rt); θ_d) ],   (15)
    L_D = (1/n) Σ_{i=1}^{n} [ d(F_i^(rt); θ_d) − d(F_i^(wi); θ_d) ].   (16)

The loss functions (15) and (16) are very similar to those of the original GAN but have better convergence properties.
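As a sketch, the WGAN-style objectives described above reduce to a difference of critic scores on the two embedding batches. The sign convention below is one consistent choice, and a full WGAN setup would also constrain the critic to be Lipschitz (e.g., by weight clipping):

```python
import torch
import torch.nn as nn

def adversarial_losses(critic, f_rt, f_wi):
    """Sketch of the WGAN-style losses: the critic tries to separate RT from
    WI embeddings, while the attention models (generator) try to close the gap.

    critic: network mapping an f-dimensional embedding to a scalar score
    f_rt, f_wi: (B, f) batches of RT and WI embeddings
    """
    loss_d = (critic(f_rt) - critic(f_wi)).mean()  # discriminator objective
    loss_g = (critic(f_wi) - critic(f_rt)).mean()  # generator objective (= -loss_d)
    return loss_g, loss_d
```

Minimizing `loss_d` pushes the critic's scores for the two embedding types apart, while minimizing `loss_g` drives the two distributions toward indistinguishability.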

E. LEARNING MULTI-MODAL EMBEDDINGS FOR MATCHING
Next, the RT and WI embeddings are combined to learn the image-text matching score. In particular, we build a multi-layer perceptron (MLP) with a sigmoid activation in the last layer that exploits the two types of correlation to learn the matching score:

    s(V_i, T_j) = m(rt(V_i, T_j), wi(V_i, T_j); θ_m),

where the function m(·; θ_m) denotes the MLP that calculates the matching score. Similar to [52], a pairwise margin ranking loss is used to rank the matching score of the matched image-text pair (V_i, T_i) above that of the hardest negative pair (V_i, T_i^-):

    L_rank = Σ_{i=1}^{n} max(0, M − s(V_i, T_i) + s(V_i, T_i^-)),   (17)

where M is the margin parameter and T_i^- denotes the hardest negative text for image V_i. We have also tested the symmetric ranking loss, but the hardest-negative scheme performs much better.
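A minimal sketch of the hardest-negative margin ranking loss, computed from an all-pairs score matrix whose diagonal holds the matched pairs (the batch-level formulation and the margin value are assumptions for illustration):

```python
import torch

def hardest_negative_ranking_loss(scores, margin=0.2):
    """Sketch of the pairwise margin ranking loss with hardest negatives.

    scores: (n, n) tensor with scores[i, j] = matching score of image i and
    text j; entry (i, i) is the matched pair.
    """
    n = scores.size(0)
    pos = scores.diag()                                  # scores of matched pairs
    neg = scores.masked_fill(torch.eye(n, dtype=torch.bool), float('-inf'))
    hardest = neg.max(dim=1).values                      # hardest negative text per image
    return torch.clamp(margin - pos + hardest, min=0).mean()
```

For example, with `scores = [[0.5, 0.6], [0.1, 0.5]]` and `margin = 0.2`, the first image is violated by its hardest negative (0.2 − 0.5 + 0.6 = 0.3) while the second is not, giving a mean loss of 0.15.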
To combine the adversarial learning method with multi-modal representation learning for image-text matching, the loss functions Eq. (15) and Eq. (17) are simultaneously minimized as an overall loss:

    L = L_rank + λ L_G,   (18)

where λ is a hyper-parameter that regulates the importance of the adversarial learning. Note that here we use the MLP for matching learning (Eq. (17)) and the attention models for embedding learning (Eq. (15)). In contrast, most other matching methods such as DAN [22] and SCAN [43] employ cosine similarity to calculate matching scores. If we used cosine similarity here, the attention models would need to undertake both tasks of matching learning and embedding learning. Since the adversarial networks are also built on the embeddings from the attention models, it is more appropriate to separate these two learning tasks. Similarly, the adversarial discriminating loss Eq. (16) used for training is written as:

    L_dis = λ L_D.   (19)

The training of the proposed model AAMEL is implemented by alternating matching learning and discriminating learning with back-propagation. For matching learning, the loss Eq. (18) is minimized to update the parameters θ_m, θ_rt, and θ_wi. For discriminating learning, the parameter θ_d is updated according to the loss Eq. (19). In summary, the detailed training procedure is illustrated in Algorithm 1.
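The alternating update can be sketched with toy stand-ins for the networks; all modules, dimensions, and the simplified ranking surrogate below are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

f_dim = 8
rt = nn.Sequential(nn.Linear(16, f_dim), nn.Tanh())       # stand-in for the visual attention model
wi = nn.Sequential(nn.Linear(16, f_dim), nn.Tanh())       # stand-in for the textual attention model
m = nn.Sequential(nn.Linear(2 * f_dim, 1), nn.Sigmoid())  # matching MLP
d = nn.Linear(f_dim, 1)                                   # discriminator (critic)

opt_match = torch.optim.Adam(
    list(rt.parameters()) + list(wi.parameters()) + list(m.parameters()), lr=2e-4)
opt_disc = torch.optim.Adam(d.parameters(), lr=2e-4)
lam, margin = 0.2, 0.2

x = torch.randn(4, 16)  # toy batch of fused image-text inputs

for _ in range(2):
    # Matching learning: ranking loss + lam * generator loss (cf. Eq. (18));
    # theta_d is left unchanged because only opt_match steps here.
    f_rt, f_wi = rt(x), wi(x)
    scores = m(torch.cat([f_rt, f_wi], dim=1))
    loss_rank = torch.relu(margin - scores + scores.roll(1, 0)).mean()  # toy ranking surrogate
    loss_g = (d(f_wi) - d(f_rt)).mean()
    opt_match.zero_grad()
    (loss_rank + lam * loss_g).backward()
    opt_match.step()

    # Discriminating learning: update only the critic parameters (cf. Eq. (19)).
    f_rt, f_wi = rt(x).detach(), wi(x).detach()
    loss_d = lam * (d(f_rt) - d(f_wi)).mean()
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
```

Detaching the embeddings in the discriminating step keeps θ_rt and θ_wi fixed, mirroring the alternating scheme described above.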
After training, the matching score of a given image x and text y is inferred as:

    score(x, y) = m(rt(x, y), wi(x, y); θ_m).   (20)

Algorithm 1: Training procedure of AAMEL
1: repeat
2:   for m-steps do
3:     update the parameters θ_m, θ_rt, and θ_wi for matching learning by minimizing Eq. (18)
4:   end for
5:   for d-steps do
6:     update the parameter θ_d for discriminating learning by minimizing Eq. (19)
7:   end for
8: until AAMEL converges

The inference of the proposed model is faster than that of most representative previous works based on attention, such as DAN [22] and SCAN [43]. For example, the complexity of SCAN reaches O(RT²D), where R and T denote the numbers of regions and words and D is the feature dimensionality, while that of DAN is O(KRD + KTD), where K is the number of pre-defined reasoning steps. As for methods not based on the attention mechanism, such as DCCA [14], their inference is faster, but their performance is much worse than that of attention-based methods.

IV. EXPERIMENTS
In this section, extensive experiments are conducted to analyze the effectiveness of the proposed AAMEL.

A. DATASETS
The experiments are conducted on two benchmark datasets, i.e., Flickr30K and MSCOCO, whose images are collected from Flickr. The details are described below:
- Flickr30K [53] is an image-description dataset with 31,783 images containing people and animals. Each image is associated with five sentences written by native annotators. Following [54], 29,783, 1,000, and 1,000 images are used for training, validation, and testing, respectively.
- MSCOCO [55] contains 82,783 images for training and 40,504 images for validation. Each image is annotated with five sentences. Following [54], the dataset is randomly re-split into 82,783, 4,000, and 1,000 (or 5,000) images for training, validation, and testing, respectively. We then follow [52] and move the remaining images that originally belong to the validation set into the training set. We present experimental results on both the 1,000 and 5,000 testing images.

B. EXPERIMENTAL SETTINGS
For the images, we employ two types of deep CNNs, i.e., VGG19 [61] and ResNet152 [62], pre-trained on the ImageNet dataset [63], to extract visual features. The implementation is based on the PyTorch deep learning framework and runs on four NVIDIA GTX 1080Ti GPUs. We use the Adam optimizer [65] with an initial learning rate of 0.0002 to train for 15 epochs, and then fine-tune the model with a lowered learning rate of 0.00002 for another 15 epochs. Dropout [66] with a probability of 0.5 is employed in the MLPs to reduce over-fitting during training. Batch normalization [67] is also employed to reduce the occurrence of covariate shift within the networks. To evaluate our method on the testing set, we choose the snapshot of the trained model that performs best on the validation set.

D. EXPERIMENTAL RESULTS
In this section, the experimental results are presented and analyzed to evaluate the proposed approach for cross-modal retrieval on two datasets. R@K (with K = 1, 5, 10) is adopted as the evaluation metric. It represents the percentage of queries in which at least one correct match is ranked among the top K matches. The metric of median rank (Med r) of the top-ranked correct result is also employed. The performance of AAMEL and the compared methods on Flickr30K and MSCOCO are presented in Table 1 and Table 2 respectively.
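Given a query-by-item score matrix, R@K and the median rank can be computed as follows; this is a sketch that assumes a single ground-truth item per query (with five captions per image, the best-ranked ground truth would be used instead):

```python
import numpy as np

def recall_at_k(scores, ks=(1, 5, 10)):
    """Compute R@K (as percentages) and the median rank from an
    (n_queries, n_items) score matrix, assuming item i is the ground
    truth for query i.
    """
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)  # best-scoring item first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(n)])
    recalls = {k: 100.0 * np.mean(ranks < k) for k in ks}  # fraction ranked in top K
    med_r = float(np.median(ranks)) + 1  # 1-based median rank of the correct result
    return recalls, med_r
```

For instance, a diagonal-dominant score matrix (every query ranks its ground truth first) yields R@1 = 100 and Med r = 1.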
From Table 1, it can be seen that both AAMEL (VGG) and AAMEL (ResNet) consistently outperform the compared methods with the corresponding pre-trained CNNs. CCA-based methods such as DCCA [14] and CCA [35] show relatively low performance because the generalized eigenvalue problem cannot be well solved by stochastic gradient descent [37]. DAN [22], SCO [59], and SCAN [43] obtain state-of-the-art results for cross-modal search. Compared to DAN and SCAN, our method AAMEL (ResNet) improves the performance consistently on all metrics. Since both DAN and SCAN employ the attention mechanism for matching learning, the improvement validates the effectiveness of adversarial learning for image-text matching. Figure 3 illustrates several qualitative results of descriptions retrieved for given images on Flickr30K. We present the top-5 ranked descriptions based on the matching scores calculated by our approach for each sample; ground-truth sentences are marked with checks and false ones with crosses. One can see that the retrieved sentences describe the given images well.
The results on MSCOCO are presented in Table 2. The upper rows report results on the 1K test images and their associated sentences. It can be seen that our model AAMEL achieves improvements on various metrics over the compared baselines in most cases. SCAN [43] is the state-of-the-art method that builds stacked cross attention for correlation learning. Our method outperforms SCAN except on R@10 in image-to-text retrieval, which confirms that the proposed model is effective in learning multi-modal embeddings for image-text matching. We also test on all 5K images and the corresponding descriptions and present the comparison in the lower rows. It can be observed that the proposed method still achieves the best performance, which again demonstrates its effectiveness. Figure 4 shows qualitative results of images retrieved for description queries on MSCOCO. Each description corresponds to one ground-truth image. For each description, we present the top-5 ranked images from left to right, with correct matches outlined in green boxes and false matches in red boxes. Note that most incorrect matches predicted by our model are still reasonable results.

E. ABLATION STUDY
To analyze the contributions of the different components of AAMEL, we ablate the proposed model and demonstrate the effectiveness of each component: 1) the visual attention model that learns the RT embedding (VA); 2) the textual attention model that learns the WI embedding (TA); 3) the adversarial networks that learn the multi-modal embeddings (AN). Table 3 shows the results of the ablative experiments on Flickr30K. V and T are simplified versions of VA and TA, respectively, obtained by replacing the two attention mechanisms with average pooling layers. From the results, one can see that VA and TA show consistently better performance than V and T. This confirms the effectiveness of the two attention models for learning the RT and WI embeddings for image-text matching. VA+TA performs better than VA and TA alone, with increases of over 3% and 4% on all metrics, respectively, because the two attention models complement each other for matching. With the adversarial networks, VA+TA+AN (the full model) further improves the performance by about 4% on the different metrics. This validates that adversarial learning reinforces the two attention models to learn more effective embeddings.

F. VISUALIZATION OF ATTENTIONS
To further analyze the effectiveness of the visual and textual attention models in AAMEL, we visualize the attentions learned by the model. Upsampling and Gaussian filtering are used to visualize the attention weights on the images, and red stroking is used to highlight the attended words. Figure 5 presents several examples of attended image-text pairs from Flickr30K. From the results, one can see that our model attends to the correct image regions and textual words. For example, the first image-text pair shows that AAMEL draws the visual attention to the regions of the little girl with the basket, which are closely related to the corresponding text description. On the other side, the important semantic words ''child'', ''sweatshirt'', and ''eggs'' receive large attention weights. The results indicate that AAMEL effectively learns attentive features for the multi-modal data, which results in improved image-text matching performance.

G. PARAMETER SENSITIVITY
To evaluate how the balance parameter λ affects the performance of AAMEL, sensitivity experiments are conducted for different values of λ. Figure 6 shows the performance (R@1) of AAMEL on Flickr30K. When λ = 0, the training is governed solely by the attention models, so the attention models cannot learn from each other without the adversarial networks. As λ becomes larger, the model concentrates more on the adversarial learning; however, overly large values of λ impair the learning of the attention models. It can be seen that AAMEL obtains the best performance at about λ = 0.2.

V. CONCLUSION
In this paper, we propose a novel deep method named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching, which engages two processes in an interplay game. To capture the correlation effectively, two attention models are proposed to learn the two types of correlation between image and text. Then an adversarial learning method is proposed to reinforce the attention models mutually so as to make the learned multi-modal representations more effective for image-text matching. We conduct experiments on two benchmark datasets for the tasks of image-to-text and text-to-image retrieval. The experimental results demonstrate that the proposed method outperforms state-of-the-art baselines, and the ablation study verifies the effectiveness of the main components of AAMEL.
The main limitation of the proposed method is that its inference is slower than that of traditional methods based on CCA or hashing. In the future, we will explore how to integrate different levels of text and image features to learn more effective multi-modal embeddings for image-text matching. Furthermore, we plan to generalize our method to other multi-modal data, such as video and music, for cross-modal retrieval.