Query is GAN: Scene Retrieval With Attentional Text-to-Image Generative Adversarial Network

Scene retrieval from input descriptions has been one of the most important applications with the increasing number of videos on the Web. However, this is still a challenging task since semantic gaps between features of texts and videos exist. In this paper, we try to solve this problem by utilizing a text-to-image Generative Adversarial Network (GAN), which has become one of the most attractive research topics in recent years. The text-to-image GAN is a deep learning model that can generate images from their corresponding descriptions. We propose a new retrieval framework, “Query is GAN”, based on the text-to-image GAN that drastically improves scene retrieval performance by simple procedures. Our novel idea makes use of images generated by the text-to-image GAN as queries for the scene retrieval task. In addition, unlike many studies on text-to-image GANs that mainly focused on the generation of high-quality images, we reveal that the generated images have reasonable visual features suitable for the queries even though they are not visually pleasant. We show the effectiveness of the proposed framework through experimental evaluation in which scene retrieval is performed from real video datasets.


I. INTRODUCTION
With the increasing number of videos on the Web, methods of retrieval that provide users with scenes¹ corresponding to their descriptions have become important topics of study [1]-[5]. The scene retrieval task has been studied by many researchers, and there have been many reports that propose text-based retrieval methods [6] and content-based retrieval methods [7]. With the rapid growth of deep learning technologies, studies on scene retrieval have moved to the next stage.
Realization of scene retrieval is difficult because several important challenges must be tackled simultaneously. First, videos and their corresponding descriptions are modalities that have different semantic spaces. Thus, it is necessary to match these two different modalities to retrieve scenes that are relevant to input text descriptions. The retrieval performance heavily depends on the matching accuracy of the two different modalities [8]-[10]. Moreover, descriptions of desired scenes differ from user to user, and it is difficult to handle all of these descriptions. Therefore, it is essential for successful retrieval to learn high-level and robust feature representations of scenes and their corresponding descriptions [11], [12]. The above-described challenges were tackled in pioneering studies, and text-based and content-based retrieval methods have been widely adopted.
¹ In this paper, we define a unit including continuous shots of the same time, the same place and the same action as a "scene" and denote images included in videos as "frames".
The associate editor coordinating the review of this manuscript and approving it for publication was Amjad Ali.
Text-based methods [13]-[19], which perform annotations for target contents, retrieve contents by comparing input descriptions provided by users with the annotations given to the candidate contents. However, their retrieval performance depends on the quality of the annotations. Also, to retrieve new contents, it is necessary to prepare an enormous amount of training annotation data. Content-based methods [8]-[10], [20]-[23], which retrieve contents by computing similarities in content spaces, have recently attracted attention with the development of deep learning techniques [24], [25]. Since content-based methods do not rely on annotated information but directly use content information, they tend to overcome the above-mentioned problems [26], [27]. However, input query contents must be provided to perform the retrieval in the content-based methods. Therefore, these methods cannot retrieve desired contents when users cannot prepare the query contents, and this restriction is not user friendly.
In this paper, we propose a new scene retrieval framework, Query is GAN, based on a text-to-image Generative Adversarial Network (GAN) [28]-[35]. As shown in Fig. 1, the proposed framework enables scene retrieval from input descriptions, i.e., sentences, provided by users. The input descriptions are projected to the visual space through a multimodal neural network model, i.e., the text-to-image GAN. Specifically, query images are generated from the input descriptions based on AttnGAN [35]. Then, by retrieving similar scenes based on their visual features, the proposed framework overcomes the remaining problems of existing methods.

FIGURE 1. Illustration of the scenario that we try to achieve. From the input description, our framework retrieves the corresponding desired scene.
Our retrieval framework makes use of the hierarchical structure of AttnGAN. Specifically, in AttnGAN, higher-resolution images are recursively generated from their low-resolution versions. As the resolution becomes higher, more attention is paid to individual words in the descriptions. By focusing on the relationship between this hierarchical structure and the characteristic of the attention, retrieval that gradually narrows down the retrieval candidates along the coarse-to-fine abstraction level direction can be realized.
In addition to the proposal of the above novel retrieval framework, new interesting results are also shown in this paper. It has been reported that the visual quality of images generated by GANs is still insufficient when the generation tasks become complicated [35]. For example, Fig. 2(a) shows an image generated by AttnGAN trained on a bird dataset [36], and Fig. 2(b) shows an image generated by AttnGAN trained on the Common Objects in Context (COCO) dataset [37]. Note that the COCO dataset is a large-scale captioning dataset that contains images of common scenes. Although AttnGAN trained on the COCO dataset can generate various images not limited to specific objects, its generation task becomes more complicated and difficult than that of the bird dataset. We can see that the generated image based on the COCO dataset is not visually pleasant. However, in this study, we reveal that such visually unpleasant images generated by the text-to-image GAN have reasonable visual features suitable for the queries, evidence of which was obtained in the experiments. Therefore, successful retrieval becomes feasible regardless of the visual quality of the generated images.
The contributions of this paper are summarized below.
Contribution 1: We propose a novel scene retrieval framework and achieve higher retrieval performance than state-of-the-art methods. The proposed framework utilizes multiple-resolution generated images that can pay attention to sentence-to-word characteristics based on the hierarchical structure of the text-to-image GAN.
Contribution 2: We demonstrate the usefulness and versatility of the generated images even if the images are not visually pleasant. By showing the effectiveness of the proposed retrieval framework through experimental evaluation using several input descriptions and their corresponding generated query images, the above characteristics are confirmed.

II. RELATED WORKS

A. MULTIMODAL RETRIEVAL USING DESCRIPTION QUERIES
A multimodal retrieval task from input descriptions can be considered as a matching task between contents included in a target database and the provided descriptions. This task can be broadly classified into two streams.
In the first stream, methods that prepare text labels for images included in a target database are widely utilized [15], [16], [38], [39], and they compare input query descriptions and the annotated text labels in the target database. In recent years, deep learning-based annotation has been used to automatically add labels to images in a database. Karpathy and Li [38] utilized a Convolutional Neural Network (CNN) and a bidirectional Recurrent Neural Network (bRNN) [40] to add labels to corresponding regions of the images. This enables retrieval of images even if they contain many and various objects. Vinyals et al. [39] utilized Long Short-Term Memory (LSTM) to estimate long descriptions by predicting the next words from their visual features and the preceding words. This model can generate natural text descriptions that express how objects included in the target images are related to each other. Then they realized retrieval that can consider the relationship between these objects. As a similar approach, Johnson et al. [15] generated graph-type text labels and realized retrieval that takes into account the relationship between objects.
In the second stream, many methods have been proposed that retrieve images from input descriptions by embedding their text features into common semantic spaces with visual features so that they can be compared [8], [10], [41]-[47]. These methods enable retrieval that does not require text label annotation, which is necessary in the first stream. Kiros et al. [8] embedded visual and text features into a common semantic space on the basis of CNN and LSTM and enabled retrieval of relevant images. Vendrov et al. [10] proposed an embedding method that takes the order relationship between words into consideration. This method can retrieve images that are strongly related to structures of input descriptions based on CNN and Gated Recurrent Unit (GRU) even though it does not use text labels. Since the above-mentioned methods embed input descriptions and images into common semantic spaces, they are robust to input descriptions, and our framework also adopts an approach similar to these methods.

B. APPLICATIONS OF GENERATIVE ADVERSARIAL NETWORKS
The realism of images generated by GANs has been drastically improved in recent years. Accordingly, practical studies using the architecture of GANs have become popular. GAN models have been applied to image translation tasks [48], [49] in many studies, and many related works such as works on image super-resolution [50] and image inpainting [51] have been carried out. Studies not only on the generation of images but also other kinds of contents such as paintings [52], music pieces [53], computer graphics [54] and graphs [55] have also been increasing.
Text-to-image synthesis is one of the most attractive fields for application of GANs [29]-[35]. GAN-INT-CLS [29] is the first model in which the concept of GAN was applied to a text-to-image translation task. Although this model can generate images from input texts, its resolution is limited to 64 × 64 pixels, and the generated images are not visually pleasant. AttnGAN [35] and HDGAN [32] have recently been proposed for improving the quality of generated images. By utilizing description information and its word information, AttnGAN can generate high-resolution images that can focus on details of the input description information. On the other hand, owing to its hierarchically-nested structure, HDGAN can generate images with higher resolution than that of images generated by AttnGAN. Even though these methods can generate visually pleasant images in simple tasks such as birds shown in Fig. 2(a), it is still difficult to generate visually pleasant images in more complicated tasks as shown in Fig. 2(b).
In the proposed framework, we utilize AttnGAN for the text-to-image synthesis task since its structure, which focuses on not only description information but also word information, strongly matches the aim of our study.

III. OUR SCENE RETRIEVAL FRAMEWORK
The details of our proposed framework are presented in this section. Our goal is to retrieve scenes that contain particular semantic contents corresponding to input sentences and their words. An overview of our framework is shown in Fig. 3. Our framework consists of two phases, query image generation and estimation of relevant scenes. In the first phase, a hierarchical image generation network based on AttnGAN [35] is constructed, and three query images of different resolutions that contain different abstraction levels are generated. In the second phase, a hierarchical retrieval architecture is constructed to find the most suitable scenes from the target video scene database by narrowing down the retrieval candidates along the coarse-to-fine abstraction level direction.

A. FIRST PHASE: QUERY IMAGE GENERATION
In the first phase, three query images of different resolutions are generated. By utilizing these generated images as queries, retrieval that takes into consideration the sentence structure is realized. To generate query images, a hierarchical network based on AttnGAN is constructed. AttnGAN consists of three generators $G_r$ ($r \in \{l, m, h\}$, where $l$, $m$ and $h$ denote the low, middle and high resolutions, respectively), which take as inputs three hidden states $s_r$ ($r \in \{l, m, h\}$) calculated by three neural networks $F_r$ ($r \in \{l, m, h\}$) and generate three query images $Q_r$ ($r \in \{l, m, h\}$) of different resolutions.
First, we define a sentence feature vector and a word feature matrix extracted from an input sentence as $e_{sen} \in \mathbb{R}^{D_{sen}}$ and $E_{word} \in \mathbb{R}^{D_{word} \times T}$ [35]. $D_{sen}$ and $D_{word}$ denote the dimension of the extracted sentence features and that of the extracted word features, respectively, and $T$ denotes the number of words included in the input sentence. The features $e_{sen}$ and $E_{word}$ are calculated by a "sentence feature extractor that strongly focuses on the word relationship" and a "word attribute and feature extractor that strongly focuses on the detailed words", respectively. We obtain the hidden state $s_l$ from the feature vector $e_{sen}$ and the Gaussian noise $z$ as follows:
$$s_l = F_l\left(z, F^{ca}(e_{sen})\right), \tag{1}$$
where $F^{ca}$ is a function that stabilizes the training [34]. Specifically, it translates the discontinuous features $e_{sen}$ to continuous features by sampling from a normal distribution conditioned on $e_{sen}$. Next, we calculate the hidden state $s_m$ from the feature matrix $E_{word}$ and the hidden state $s_l$ previously obtained in Eq. (1). Then $s_m$ becomes a feature vector that contains information on the sentence features and weakly contains information on the word features $E_{word}$. Similarly, we obtain the hidden state $s_h$ from the feature matrix $E_{word}$ and the previous hidden state $s_m$. Thus, $s_h$ becomes a feature vector that contains information on both the sentence and word features. The relationship of $s_l$, $s_m$ and $s_h$ can be calculated as follows:
$$s_m = F_m\left(s_l, F^{attn}_m(E_{word}, s_l)\right), \tag{2}$$
$$s_h = F_h\left(s_m, F^{attn}_h(E_{word}, s_m)\right), \tag{3}$$
where $F^{attn}_r$ ($r \in \{m, h\}$) is a function that adds the word features $E_{word}$ to the previous hidden states $s_l$ and $s_m$, respectively. AttnGAN can generate an image that focuses on each word of the input sentence through this function.
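As a rough illustration of the two functions named above, the following NumPy sketch shows one common way to realize conditioning augmentation ($F^{ca}$) and word-level attention ($F^{attn}_r$), following the general descriptions in [34] and [35]. The projection matrices `w_mu` and `w_logvar`, the assumption that the word features have already been projected to the hidden dimension, and all array shapes are illustrative assumptions; this is not the trained network used in this paper.

```python
import numpy as np


def conditioning_augmentation(e_sen, w_mu, w_logvar, rng):
    """Sample a smoothed sentence condition from N(mu(e_sen), diag(sigma(e_sen)^2)).

    w_mu and w_logvar are hypothetical linear projections standing in for
    learned layers.
    """
    mu = w_mu @ e_sen
    logvar = w_logvar @ e_sen
    eps = rng.standard_normal(mu.shape)
    # Reparameterized sample: mean plus scaled Gaussian noise.
    return mu + np.exp(0.5 * logvar) * eps


def word_attention(hidden, e_word):
    """Compute a word-context vector for every image sub-region.

    hidden: (D, N) hidden features of N sub-regions (previous hidden state).
    e_word: (D, T) word features, assumed already projected to dimension D.
    Returns the (D, N) word-context matrix and the (N, T) attention weights.
    """
    scores = hidden.T @ e_word                    # (N, T) region-word scores
    scores = scores - scores.max(axis=1, keepdims=True)
    beta = np.exp(scores)
    beta = beta / beta.sum(axis=1, keepdims=True)  # softmax over the T words
    context = e_word @ beta.T                      # (D, N)
    return context, beta
```

In this sketch, the word-context matrix returned by `word_attention` would be combined with the previous hidden state by the next-stage network, which is how the higher-resolution stages come to reflect individual words.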
Finally, we generate the multiple-resolution query images $Q_r$ ($r \in \{l, m, h\}$) from each hidden state $s_r$ as follows:
$$Q_r = G_r(s_r). \tag{4}$$
In the proposed framework, we utilize these three generated query images $Q_l$, $Q_m$ and $Q_h$ to retrieve relevant scenes in the following phase.
Here, we explain how to train the above hierarchical image generation network. To generate images that contain the content of the input sentence, the final objective function is defined as follows:
$$L = L_G + \lambda L_{DAMSM}, \tag{5}$$
where $\lambda$ is a hyperparameter that balances $L_G$ and $L_{DAMSM}$.
In this final objective function, $L_G$ is a loss function that approximates conditional and unconditional distributions, and $L_{DAMSM}$ is a fine-grained image-text matching loss at the word level. In more detail, $L_G$ in Eq. (5) is defined as follows:
$$L_G = \sum_{r \in \{l, m, h\}} L_{G_r}. \tag{6}$$
Each $L_{G_r}$ ($r \in \{l, m, h\}$) in Eq. (6) is defined as follows:
$$L_{G_r} = -\frac{1}{2}\mathbb{E}_{Q_r \sim p_{G_r}}\left[\log D_r(Q_r)\right] - \frac{1}{2}\mathbb{E}_{Q_r \sim p_{G_r}}\left[\log D_r(Q_r, e_{sen})\right], \tag{7}$$
where $Q_r$ is from the generation model distribution $p_{G_r}$ at scale $r$, and $D_r$ is the discriminator at scale $r$. In Eq. (7), the first term determines whether the image is real or fake, while the second term determines whether the image and the sentence match or not. Next, the second term of Eq. (5), $L_{DAMSM}$, is calculated by a Deep Attentional Multimodal Similarity Model (DAMSM) described in [35].
For a batch of $B$ image-sentence pairs, $L_{DAMSM}$ is defined as follows:
$$L_{DAMSM} = -\sum_{i=1}^{B}\left(\log P^{i}_{word_1} + \log P^{i}_{word_2} + \log P^{i}_{sen_1} + \log P^{i}_{sen_2}\right),$$
where $P^{i}_{word_1}$ is a posterior probability that measures how the generated images are matched with their corresponding text descriptions at the word level, and $P^{i}_{word_2}$ is a posterior probability that measures how the sentences are matched with their corresponding generated images at the word level. $P^{i}_{sen_1}$ and $P^{i}_{sen_2}$ are the counterparts of $P^{i}_{word_1}$ and $P^{i}_{word_2}$ at the sentence level.
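To make the role of these posterior probabilities concrete, the following is a minimal sketch of a batch-wise matching loss at the sentence level only. It assumes that the matching score is the cosine similarity between global image and sentence features and that each posterior is a softmax over the batch scaled by a smoothing factor `gamma`. The word-level terms of DAMSM [35] rely on an attention-driven score that is not reproduced here, and the function name and the value of `gamma` are placeholders.

```python
import numpy as np


def matching_loss(img_feats, txt_feats, gamma=10.0):
    """Sentence-level image-text matching loss for a batch of B pairs.

    img_feats, txt_feats: (B, D) arrays; row i of each forms a matched pair.
    Returns -sum_i [log P(text_i | image_i) + log P(image_i | text_i)].
    """
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = gamma * (img @ txt.T)  # (B, B) scaled cosine similarities

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Diagonal entries correspond to the matched pairs.
    log_p_txt_given_img = np.diag(log_softmax(logits, axis=1))
    log_p_img_given_txt = np.diag(log_softmax(logits, axis=0))
    return -(log_p_txt_given_img.sum() + log_p_img_given_txt.sum())
```

The loss is small when each image is most similar to its own sentence within the batch and grows when mismatched pairs score higher.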
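At scale $r$, the generator $G_r$ has a corresponding discriminator $D_r$. Alternately with the training of $G_r$, each discriminator $D_r$ is trained to classify whether the input image is real or fake by minimizing the following loss:
$$L_{D_r} = -\frac{1}{2}\mathbb{E}_{\hat{Q}_r \sim p_{data_r}}\left[\log D_r(\hat{Q}_r)\right] - \frac{1}{2}\mathbb{E}_{Q_r \sim p_{G_r}}\left[\log\left(1 - D_r(Q_r)\right)\right] - \frac{1}{2}\mathbb{E}_{\hat{Q}_r \sim p_{data_r}}\left[\log D_r(\hat{Q}_r, e_{sen})\right] - \frac{1}{2}\mathbb{E}_{Q_r \sim p_{G_r}}\left[\log\left(1 - D_r(Q_r, e_{sen})\right)\right],$$
where $\hat{Q}_r$ is from the true image distribution $p_{data_r}$ at scale $r$.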
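The adversarial losses described above can be sketched numerically as follows. This is a minimal illustration that assumes the discriminator outputs are already probabilities in (0, 1), with one unconditional output and one sentence-conditioned output per image; the function names and the small constant `eps` are illustrative choices and not part of the original formulation.

```python
import numpy as np


def generator_loss(d_fake_uncond, d_fake_cond, eps=1e-12):
    """Generator loss at one scale from discriminator outputs on generated images."""
    uncond = -0.5 * np.mean(np.log(d_fake_uncond + eps))  # real or fake
    cond = -0.5 * np.mean(np.log(d_fake_cond + eps))      # image matches sentence
    return uncond + cond


def discriminator_loss(d_real_uncond, d_fake_uncond,
                       d_real_cond, d_fake_cond, eps=1e-12):
    """Discriminator loss at one scale from outputs on real and generated images."""
    uncond = (-0.5 * np.mean(np.log(d_real_uncond + eps))
              - 0.5 * np.mean(np.log(1.0 - d_fake_uncond + eps)))
    cond = (-0.5 * np.mean(np.log(d_real_cond + eps))
            - 0.5 * np.mean(np.log(1.0 - d_fake_cond + eps)))
    return uncond + cond
```

Summing `generator_loss` over the three scales gives the overall generator term, while each discriminator is updated with its own `discriminator_loss`.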

B. SECOND PHASE: ESTIMATION OF RELEVANT SCENES
In the second phase, scene retrieval is performed by using the three generated query images $Q_r$ ($r \in \{l, m, h\}$). First, we define candidate frames as $f_{n_l}$ ($n_l = 1, 2, \ldots, N$; $N$ being the number of frames included in all candidate scenes, i.e., retrieval targets) and calculate the visual features $v_l$ and $v_{n_l}$ from $Q_l$ and $f_{n_l}$. In the proposed framework, we utilize outputs of the third pooling layer of Inception-v3 [56] pre-trained on ImageNet [57] as the visual features. We utilize Inception-v3 since the loss function $L_{DAMSM}$ in the generation network utilizes Inception-v3 as the image feature extractor for calculating the image-text matching loss. Then we simply calculate the following cosine similarities $w_{n_l}$ between $v_l$ and $v_{n_l}$:
$$w_{n_l} = \frac{v_l^{\top} v_{n_l}}{\lVert v_l \rVert \, \lVert v_{n_l} \rVert}. \tag{11}$$
This value indicates the similarity between the query image $Q_l$ and the retrieval candidate frames $f_{n_l}$ ($n_l = 1, 2, \ldots, N$).
From the obtained similarities, we can calculate the rankings of the candidate frames. As described above, the low-resolution query image $Q_l$ focuses on the whole information of the input sentence. Therefore, $Q_l$ has the role of screening large-scale retrieval candidates. Next, we select the frames that are included in the top $100P_m$ percent of the retrieval candidates. In the same manner as Eq. (11), we respectively calculate the visual features $v_m$ and $v_{n_m}$ from $Q_m$ and $f_{n_m}$ ($n_m = 1, 2, \ldots, P_m N$) and calculate their cosine similarities $w_{n_m}$ to extract the top $100P_h$ percent candidates, where $f_{n_m}$ are the top $100P_m$ percent frames selected according to the similarities $w_{n_l}$. These procedures are also performed for the highest-resolution query image $Q_h$ and the further screened $P_m P_h N$ candidates. Finally, we can obtain the scenes for which frames have higher similarities than those of the other candidate frames. It should be noted that since the query images $Q_m$ and $Q_h$ focus on the information of the input sentence and its words, they have roles in narrowing down the retrieval candidates with consideration of the object relationship. By introducing the mechanism that hierarchically selects candidate scenes that are similar to the generated images $Q_l$ and $Q_m$, which mainly reflect the contents of the object relationship, we can screen only the scenes that are similar to objects of the input sentence. Although our scene retrieval framework is quite simple, it can successfully retrieve relevant scenes based on the hierarchical structure of AttnGAN.
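The coarse-to-fine screening described above can be sketched as follows. This minimal NumPy example assumes that visual features have already been extracted and that a single feature matrix is shared by all frames across the three stages; the function names, the shared frame features and the rounding of the kept candidate counts are assumptions made for illustration.

```python
import numpy as np


def cosine_similarities(query_feat, frame_feats):
    """Cosine similarity between one query vector (D,) and N frame vectors (N, D)."""
    q = query_feat / np.linalg.norm(query_feat)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return f @ q


def coarse_to_fine_retrieval(v_l, v_m, v_h, frame_feats, p_m=0.5, p_h=0.5):
    """Rank frames by screening with low-, middle- and high-resolution query features.

    Returns the indices of the surviving frames sorted by similarity to v_h,
    together with those similarities.
    """
    candidates = np.arange(frame_feats.shape[0])

    # Stage 1: screen all N frames with the low-resolution query.
    w_l = cosine_similarities(v_l, frame_feats)
    keep = max(1, int(round(p_m * len(candidates))))
    candidates = candidates[np.argsort(-w_l)[:keep]]

    # Stage 2: screen the remaining frames with the middle-resolution query.
    w_m = cosine_similarities(v_m, frame_feats[candidates])
    keep = max(1, int(round(p_h * len(candidates))))
    candidates = candidates[np.argsort(-w_m)[:keep]]

    # Stage 3: rank the final candidates with the high-resolution query.
    w_h = cosine_similarities(v_h, frame_feats[candidates])
    order = np.argsort(-w_h)
    return candidates[order], w_h[order]
```

With `p_m = p_h = 0.5`, only a quarter of the original frames reach the final ranking, so the most expensive comparison at the highest resolution is restricted to candidates that already match the overall content of the sentence.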

IV. EXPERIMENTAL RESULTS
In this section, we quantitatively and qualitatively evaluate our framework by comparing it with some state-of-the-art retrieval methods. We first describe the details of the datasets in IV-A. Results of quantitative and qualitative evaluations are presented in IV-B and IV-C, respectively.

A. DATASETS
We used the following two datasets in the experiment.
COCO dataset [37]: The COCO dataset consists of daily scene images and their description annotations. The dataset contains 82,783 training images, each of which is associated with 5 descriptions. In the proposed framework, we trained AttnGAN for text-to-image translation by using the COCO dataset. We used the COCO dataset since it contains various words and daily scene images, and it has been widely used for text-to-image translation tasks. By evaluating the retrieval performance with this common dataset, we confirmed the capability of the proposed framework without fine-tuning on the target retrieval dataset.
MP-II MD dataset [58]: The MP-II MD dataset contains 68,000 scenes of 94 HD movies. This dataset contains a large number of scenes extracted from one movie, and each scene is associated with one description. In the experiment, we defined scenes corresponding to their descriptions, which were utilized as input descriptions for generating the query images, as the ground truth. We used this dataset for considering actual applications such as retrieval from one video.

B. QUANTITATIVE EVALUATION
From the MP-II MD dataset, we selected three movies, "Bad Santa", "As Good As it Gets" and "Harry Potter and the Prisoner of Azkaban", which consist of 430, 538 and 592 scenes, respectively, with each scene having an average of 100 frames and with a total of 153,320 frames. By inputting the description of one scene to our framework, we performed retrieval and iterated these procedures for all scenes included in each movie. We defined frames included in the target scene as our ground truth and utilized Recall@k as the quantitative evaluation criterion (Eq. (12)), where $r_k$ is the number of correctly retrieved scenes in the top-$k$ retrieval results. In this experiment, we sorted all $N$ frames in $M$ candidate scenes according to their similarity ranks. Furthermore, when the frames of the target scene were included in the top-$k$ retrieval results, we regarded the target scene to have been correctly retrieved. We utilized Recall@k at the frame level because it can evaluate the performance in more detail compared with utilization of Recall@k at the scene level. In our framework, we simply set $P_m$ and $P_h$ to 0.5, i.e., the top 50 percent of the candidates were kept at each screening step, and the sizes of the low-, middle- and high-resolution images ($Q_l$, $Q_m$ and $Q_h$) were 64 × 64 pixels, 128 × 128 pixels and 256 × 256 pixels, respectively.
FIGURE 5. Recall@k obtained for the integrated dataset including five movies, "Bad Santa", "As Good As it Gets", "Halloween", "Rendezvous mit Joe Black" and "Harry Potter and the Prisoner of Azkaban".
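The frame-level Recall@k criterion described above can be sketched as follows. The sketch assumes that each scene description is used once as a query and that Recall@k is the fraction of queries whose target scene has at least one frame among the top-k ranked frames; this normalization, the function name and the data layout are assumptions made for illustration.

```python
import numpy as np


def recall_at_k(rankings, frame_scene_ids, target_scene_ids, k):
    """Fraction of queries whose target scene appears among the top-k frames.

    rankings: one sequence of frame indices per query, sorted by descending similarity.
    frame_scene_ids: (N,) array giving the scene that each frame belongs to.
    target_scene_ids: the ground-truth scene of each query.
    """
    frame_scene_ids = np.asarray(frame_scene_ids)
    hits = 0
    for ranked, target in zip(rankings, target_scene_ids):
        top_k_scenes = frame_scene_ids[np.asarray(ranked)[:k]]
        # A query counts as correct if any top-k frame comes from its target scene.
        if np.any(top_k_scenes == target):
            hits += 1
    return hits / len(target_scene_ids)
```

For example, with two queries of which only one has a target-scene frame at rank 1, `recall_at_k(..., k=1)` evaluates to 0.5 under this definition.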
We compared the performance of the proposed framework (PF) with the performances of some state-of-the-art methods. We selected the following comparative methods.

• Baseline method (BL)
This is our baseline method, which utilizes only high-resolution images generated by AttnGAN. By comparing with this method, we evaluated whether the use of the hierarchical structure in our framework is effective.
• Comparative method 1 (CM1) [8]
This is a simple embedding method that utilizes deep learning-based techniques. It utilizes LSTM and CNN to compare visual and sentence features by embedding them into a common visual semantic space. We used this method as the baseline method using deep learning-based techniques.
• Comparative method 2 (CM2) [10]
This is a method that takes the order relationship between words into consideration in addition to the mechanism of CM1. By comparing with this method, we evaluated whether the proposed framework can effectively use the sentence structure.
• Comparative method 3 (CM3) [9]
This is a method that only utilizes the visual feature space. It embeds sentence features into the image feature space based on deep learning-based techniques. By comparing with this method, we evaluated whether the use of query images generated by the text-to-image GAN is effective.
• Comparative method 4 (CM4) [59]
This is a method that adds a loss function that reduces the number of negative samples between a query and an objective sample in addition to the mechanism of CM1. We used this method since it was one of the most recent state-of-the-art methods.
It should be noted that all of the comparative methods were constructed on the basis of the open-source code provided by the respective authors.
Figure 4 shows the results of Recall@k obtained for each movie. As shown in Fig. 4, the proposed framework outperforms the comparative methods (CM1, CM2, CM3 and CM4). In addition, since the proposed framework outperforms BL, which only utilizes high-resolution images, it can be seen that we can obtain better results by utilizing query images of different resolutions for reflecting the whole input description and its detailed words. Specifically, we can improve the retrieval performance by narrowing down the candidate scenes hierarchically utilizing the low-resolution image $Q_l$ and the middle-resolution image $Q_m$. Since each generated image contains different semantic information along the coarse-to-fine abstraction levels, the above screening is effective.
Furthermore, in order to verify the robustness of the proposed framework for a larger-scale dataset that contains various scenes, we evaluated the retrieval performance for five movies selected from the MP-II MD dataset. We constructed an integrated dataset including five movies, "Bad Santa", "As Good As it Gets", "Halloween", "Rendezvous mit Joe Black" and "Harry Potter and the Prisoner of Azkaban", including 430 + 538 + 676 + 296 + 592 (= 2,532) scenes with 247,320 frames. Figure 5 shows the results of Recall@k obtained from this integrated dataset. As shown in this figure, the proposed framework also outperforms the comparative methods. It can be seen that the proposed framework can retrieve desired scenes more successfully.

C. QUALITATIVE EVALUATION
In this experiment, 25 subjects (5 females and 20 males, 20-27 years old) viewed input descriptions and the corresponding top-ranked results retrieved by our framework and the comparative methods. The subjects evaluated the relevance of the retrieved results in 5 grades ("1 Not Relevant", "2 Not So Relevant", "3 Neither Agree Nor Disagree", "4 A Little Relevant" and "5 Relevant"). We randomly selected 20 scenes from the MP-II MD dataset and gave their corresponding descriptions in this experiment. Examples of the retrieval results obtained by the proposed framework are shown in Fig. 6, and the results of qualitative evaluation are shown in Table 1. The examples in Fig. 6 correspond to the results in Table 1. In Table 1, the values of "PF" represent the average scores of all subjects obtained by the proposed framework. The values of "BL, CM1, CM2, CM3 and CM4" represent the average scores obtained by the comparative methods that are shown in the quantitative evaluation.
It can be seen that the scores of our framework exceed "A Little Relevant" on average. Therefore, the proposed framework can retrieve scenes related to the input descriptions. Also, the scores of our framework are better than those of "BL, CM1, CM2, CM3 and CM4". Furthermore, the differences are statistically significant in Welch's t-test with p < 0.01 given a significance level α = 0.01.
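For reference, the statistic behind Welch's t-test mentioned above can be sketched with the Python standard library as follows. The scores in the usage note are made-up placeholders and not the ratings collected in this experiment; the function name is hypothetical. Obtaining the p-value additionally requires the t-distribution, for which a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) could be used.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va = variance(a) / na  # squared standard error of the mean of sample a
    vb = variance(b) / nb  # squared standard error of the mean of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

As a usage example with placeholder ratings, `welch_t([4, 5, 4, 5, 4], [2, 3, 2, 3, 2])` yields a large positive t statistic, which indicates that the first group was rated higher than the second.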
In Fig. 6, we can see that the proposed framework can retrieve relevant scenes even if the generated query images are not visually pleasant. From this fact, we can verify that the deep learning-based features obtained from the generated images have semantic information even if the images are not visually pleasant. As additional evidence, we also show results of image-to-text translation from the images generated by our framework in Fig. 7. In this experiment, we utilized AttnGAN for the text-to-image translation and Show and Tell [39] for the image-to-text translation, and these two methods were completely independent. From this figure, we can see that the generated images seem to be translated to reasonable descriptions.
Although the effectiveness of the proposed framework was confirmed by the results of evaluations described in this paper, there are scenes with low scores (in particular, Scenes 13 and 17 in the qualitative evaluation). The retrieval results of Scenes 13 and 17 are shown in Fig. 8. In Fig. 8 and Table 1, we can see that neither the proposed framework nor the comparative methods obtained high scores for these scenes, even though the results obtained by the proposed framework were better than those obtained by the other methods. There is therefore room for improvement of the proposed framework.

V. CONCLUSION
In this paper, we have proposed Query is GAN, a novel scene retrieval framework that utilizes images generated by AttnGAN as query images. Experimental results have shown that the proposed framework can accurately retrieve scenes and enables users to find their desired scenes. Furthermore, by showing the effectiveness of the proposed framework, the usefulness of the generated images, which are not visually pleasant, can also be confirmed. In future work, we will introduce temporal processing to the proposed framework for realizing ideal scene retrieval.

FIGURE 2. Examples of images generated by AttnGAN trained on different datasets: (a) image generated by AttnGAN trained on the bird dataset [36], (b) image generated by AttnGAN trained on the COCO dataset [37].

FIGURE 3. Overview of our scene retrieval framework. The proposed framework consists of two phases, the details of which are explained in III-A and III-B, respectively.

FIGURE 4. Recall@k obtained in the quantitative evaluation for each movie. These results represent the proportion of the scenes relevant to the input sentence at rank k. The horizontal axis represents the rank of frames, and the vertical axis represents Recall@k defined in Eq. (12). A higher value indicates a better result. (a), (b) and (c) respectively represent the results of "Bad Santa", "As Good As it Gets" and "Harry Potter and the Prisoner of Azkaban".

TABLE 1. Results of subjective evaluation. These results show the average scores obtained from 25 subjects. The score of 1 represents "Not Relevant", and the score of 5 represents "Relevant".

FIGURE 6. Examples of the first retrieved frames by the proposed framework.

FIGURE 7. Examples of the image-to-text results obtained from images generated by the text-to-image GAN: (a)-(d) results respectively corresponding to Figs. 6 (a), (d), (j) and (k).

FIGURE 8. The first retrieved frames of Scenes 13 and 17 that had low scores in the subjective evaluation.