An Interactive Image Editing System Using an Uncertainty-Based Confirmation Strategy

We propose an interactive image editing system that has a confirmation dialogue strategy using an entropy-based uncertainty calculation on its generated images with Deep Convolutional Generative Adversarial Networks (DCGAN). DCGAN is an image generative model that learns an image manifold of a given dataset and enables continuous change of an image. Our proposed image editing system combines DCGAN with a natural language interface that accepts image editing requests in natural language. Although such a system is helpful for human users, it often faces uncertain requests to generate acceptable images. A promising approach to solve this problem is introducing a dialogue process that shows multiple candidates and confirms the user’s intention. However, confirming every editing request creates redundant dialogues. To achieve more efficient dialogues, we propose an entropy-based dialogue strategy that decides when the system should confirm, and enables effective image editing through a dialogue that reduces redundant confirmations. We conducted image editing dialogue experiments using an avatar face illustration dataset for editing by natural language requests. Through quantitative and qualitative analysis, our results show that our entropy-based confirmation strategy achieved an effective dialogue by generating images desired by users.


I. INTRODUCTION
Timely and appropriately assisting human users is critical in intelligent systems. Image generation or editing systems help users create desired images through interaction [1], [2]. The capability of natural language interaction on such systems would be useful because a natural language interface does not require any special skills; it only requires the ability for natural language communication. For example, image editing systems that accept natural language requests have a natural language interface. It allows users to input requests via voice or chat. The system provides a new image according to the user request.
Such image editing systems often face ambiguities caused by natural language. Unlike general image-to-image translation tasks [3], such editing systems must be able to handle vague, under-specified, and ambiguous natural language requests. For example, the natural language request "make this avatar's hair short" lacks a specific objective image or criterion for creating the image desired by the user. A less ambiguous version would be "make this avatar's hair short by her ears." However, such lack of specificity often occurs in real situations. This is one challenging obstacle that must be overcome to generate images based on some given text. Asking the user about the ambiguity is one way to solve the problem. This solution is one of our motivations for introducing an interactive process in image editing. A trade-off also exists between the generated image quality and the constraints on the image generation system. For example, a masking mechanism is an efficient way to improve the quality of generated images in image-to-image translation tasks [4]-[6]. Masking denotes an element-wise multiplication of a mask, which consists of binary values, with the input image. Even in image editing with natural language, generation systems based on masking constraints generate more accurate images than a system without them because they can identify the parts of the image mentioned in the user's request and perform image editing on those parts only [7]. However, such a strong constraint limits large changes to the image.
For example, in interactive image editing, it is difficult for systems with a strong constraint to work on a request such as "make the current portrait's hair longer" because the request will greatly change the image. In such cases, a generation system without any constraints can create images more relevant to the user's intention. When the system cannot decide which generated image is the better editing result, one possible solution is direct confirmation with the user. However, asking users to choose a single image for every request is impractical. Thus, the system is expected to ask only when it is unsure which is the best image to present.
The associate editor coordinating the review of this manuscript and approving it for publication was Gianluigi Ciocca.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we assume two different types of interactive image editing systems: one with a strong constraint on its generative process and one without a constraint. We address the problem of finding a better dialogue strategy using these two systems and introduce an uncertainty score, based on the entropy of the generated masks, to decide which system is better suited to a given image editing request. We call the system with the strong constraint based on the masking mechanism "w/ mask" and the system without a constraint "w/o mask." Using the uncertainty scores, the system asks the user for confirmation when it is unsure which image better matches the user's editing intent.
Section II describes the image editing task. Section III shows the interactive image editing system and its dialogue strategy that we use in our experiments. Section IV presents the experimental setting, and Section V shows our results. Related works are mentioned in Section VI, and we conclude in Section VII.

II. INTERACTIVE IMAGE EDITING DIALOGUE
In this section, we describe the interactive image editing dialogue task. Its overview is shown in Figure 1. The task involves a human user and a system. The dialogue's purpose is to generate goal image $X^g$, which is the user's desired image, through a dialogue. The user makes requests in natural language to change the current image closer to the goal. The system generates a new image based on the previous image when the user makes a request for a change.
Step 1 First source image $X^s_0$ and goal image $X^g$ are given to the user.
Step 2 At the i-th turn, the user makes a natural language request $I_i$ to edit the previous image $X^s_{i-1}$.
Step 3 The system generates a new image $X_i$ based on the request $I_i$ and the previous image $X^s_{i-1}$.
Step 4 The system sets $X_i$ as the new source image $X^s_i$, and the user chooses whether to continue the dialogue. If the user decides to continue, they go to the next turn (go to Step 2 with i += 1). If the user decides to stop, the dialogue is finished, and image $X^s_i$ is compared with goal image $X^g$.
Note that since the goal image is invisible to the system, it cannot be optimized directly to generate the goal image.
If we have several image generators in Step 3, the system must choose one image as the new image $X_i$. When the system cannot choose between images, one solution is to seek confirmation from the user about which image is better. We assume that the system has multiple image candidates $X_{i,1}, X_{i,2}, \ldots, X_{i,n}$ in Step 3 and two choices: {confirm, not confirm}. If it selects confirm, the following sub-steps of the confirmation procedure are inserted before Step 4:
3-c1) The system shows image candidates to the user to confirm which image is relevant to the request.
3-c2) The user selects the most relevant image. The system sets the selected image as its generated image $X_i$.
Figure 1 summarizes the steps of a single turn to decide on the next source image $X^s_i$ from Step 2 to Step 4. Since the confirmation steps lengthen the interaction, the system has to reduce the number of confirmations. The criteria by which the system selects confirm or not confirm are described in Section III-D.
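The single-turn procedure (Steps 3, 3-c1, 3-c2, and 4) can be sketched as follows. This is only an illustrative sketch: the names `run_turn`, `should_confirm`, and `user_select` are hypothetical, images are stood in for by nested lists, and falling back to the first generator when the system does not confirm is an assumption made here for illustration.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # hypothetical stand-in for an image


def run_turn(
    source: Image,
    request: str,
    generators: Sequence[Callable[[Image, str], Image]],
    should_confirm: Callable[[Image, str], bool],
    user_select: Callable[[List[Image]], int],
) -> Image:
    """Run Steps 3 and 4 of one turn and return the new source image."""
    # Step 3: every generator proposes one candidate image.
    candidates = [g(source, request) for g in generators]
    if len(candidates) > 1 and should_confirm(source, request):
        # Steps 3-c1 and 3-c2: show the candidates and let the user pick one.
        chosen = user_select(candidates)
    else:
        # Not confirm: use the first generator's output (assumption).
        chosen = 0
    # Step 4: the chosen image becomes the next source image.
    return candidates[chosen]
```

Placing the default generator first keeps the "not confirm" branch trivial; any other fallback policy could be substituted without changing the loop.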

III. DCGAN-BASED IMAGE EDITING MODELS AND DIALOGUE STRATEGY
In this section, we describe the internal architecture of our interactive image editing system, shown in Figure 1 (right), composed of image editing models based on Deep Convolutional Generative Adversarial Networks (DCGAN) [8]. We use two image editing models: a model without a generation constraint and a model with a generation constraint. We first describe DCGAN's general idea and then describe its extension to image editing tasks. We also describe dialogue strategies to use these models in an interactive process.

A. DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS (DCGAN)
A Deep Convolutional Generative Adversarial Network (DCGAN) [8] is a commonly used generative model for image generation. DCGAN is composed of generator G and discriminator D for adversarial learning [9]. The generator is defined as
$$\hat{X} = G(z). \tag{1}$$
It generates image $\hat{X}$ from given noise z (e.g., Gaussian: $z \sim \mathcal{N}(0, I)$). The discriminator is defined as
$$D(X) \in [0, 1]. \tag{2}$$
It classifies a given image into two classes: original target image $X$ from the training data (real) or generated target image $\hat{X}$ by generator G (fake). The discrimination result will be used to train the generator. DCGAN is optimized by the following objective:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{X \sim p_{data}}[\log D(X)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. \tag{3}$$
$\theta_G$ and $\theta_D$ are the trainable parameters of the generator and the discriminator. $p_{data}$ and $p_z$ denote the data and noise distributions. Adversarial learning resembles a mini-max game between the generator and the discriminator. The discriminator is optimized to correctly classify generated images from the generator (fake) and training examples (real). On the other hand, the generator is optimized to trick the discriminator into predicting the generated images as training examples. This competitive training improves the image modeling performance [8].
FIGURE 1. The left figure represents an overview of the interactive image editing dialogue and the right figure represents the internal architecture of the system. In the left figure, the user's utterance is blue and the system's is green. The system decides to confirm or not confirm based on the user's editing request $I$ and the current source image $X^s_{i-1}$. The right figure shows the whole system, which consists of DCGAN-based image editing models and an entropy-based confirmation mechanism. The w/o mask model described in Section III-B generates an image, and the w/ mask model described in Section III-C generates a mask and an image. Our proposed confirmation method (blue box), the action selection module described in Section III-D, can select confirm or not confirm based on the entropy calculation of the mask.
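To stabilize the training, we rewrite (3) and get the following training objectives as shown in [9]:
$$\min_{\theta_D} \; -\mathbb{E}_{X \sim p_{data}}[\log D(X)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{4}$$
$$\min_{\theta_G} \; -\mathbb{E}_{z \sim p_z}[\log D(G(z))]. \tag{5}$$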
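To make the two training objectives concrete, the following minimal sketch evaluates the commonly used discriminator loss and non-saturating generator loss on a batch of discriminator outputs. It assumes the discriminator outputs probabilities strictly between 0 and 1; the function names are hypothetical and the code is not taken from the authors' implementation.

```python
import math
from typing import Sequence


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Batch average of -log D(X) plus batch average of -log(1 - D(G(z)))."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake: Sequence[float]) -> float:
    """Batch average of -log D(G(z)), the non-saturating generator loss."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

A discriminator that scores real images near 1 and generated images near 0 drives its loss toward zero, while the generator loss falls as the discriminator scores generated images closer to 1.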

B. DCGAN FOR IMAGE EDITING WITHOUT CONSTRAINT
The original DCGAN was an unconditional image generation model; however, image editing tasks require conditional generation because the system has to control generated images based on the given pair of the original image (source image) and the editing request as a generation condition. To achieve this conditional generation, we introduce an extension of the DCGAN model that has an encoder part for extracting conditional information from the given pair of the source image and the editing request [7]. The encoder part learns function $\phi = f(X_s, I)$ by estimating target image feature $\phi$ from the unified representation of source image $X_s$ and its editing request $I$. The encoder part consists of source image encoder $E_{im}$, instruction encoder $E_i$, and a 1-layer fully-connected layer FC.
We use 4-layer convolutional neural networks [10] for $E_{im}$ and 1-layer long short-term memory neural networks [11] for $E_i$. Assuming $I$ consists of word tokens $w_1, w_2, \ldots, w_T$, the encoder part computes
$$\phi_{im} = E_{im}(X_s), \tag{6}$$
$$\phi_i = E_i(w_1, w_2, \ldots, w_T), \tag{7}$$
$$\phi = FC([\phi_{im}; \phi_i]), \tag{8}$$
and we rewrite (1) and (2):
$$\hat{X} = G(z, \phi), \tag{9}$$
$$D(X, \phi) \in [0, 1]. \tag{10}$$
Condition $\phi$ is fed into both the generator and the discriminator. This formulation is necessary for training a conditional DCGAN by a matching-aware method [12]. It enables the discriminator to classify whether the input image corresponds to the input condition, and the generator to learn the mapping between the generated image and the condition. The objective function of the discriminator (defined in (4)) is extended by the following three functions:
$$L_{X_r} = -\mathbb{E}[\log D(X, c_r)], \tag{11}$$
$$L_{X_w} = -\mathbb{E}[\log(1 - D(X, c_w))], \tag{12}$$
$$L_{\hat{X}_r} = -\mathbb{E}[\log(1 - D(G(z, c_r), c_r))]. \tag{13}$$
The objective function of the generator (defined in (5)) is also rewritten:
$$L_{g\hat{X}_r} = -\mathbb{E}[\log D(G(z, c_r), c_r)]. \tag{14}$$
The notations $c_r$ and $c_w$ are used for a condition that corresponds to a training example $X$ and for a condition that does not correspond to a training example $X$, respectively. Objective (11) encourages the discriminator to classify the matched pair of the training example and the condition as real. Objective (12) encourages the discriminator to classify the mismatched pair of the training example and the condition as fake. Objective (13) encourages the discriminator to classify the matched pair of the generated image and the condition as fake. Objective (14) encourages the generator to trick the discriminator into classifying the matched pair of the generated image and the condition as real. In summary, the discriminator learns not only to classify the input image itself as real or fake but also to distinguish input images that correspond to the conditions from those that do not. The model requires triplet $(c_r, c_w, X)$ in training. We have to select $c_r$ to be the target image feature and $c_w$ to be far away from the target image feature.
Suppose that the training examples are composed of triplets $(X_s, X_t, I)$, where $X_s$ indicates the source image and $X_t$ represents the target image that corresponds to the given input pair of $X_s$ and editing request $I$. One choice of triplet could be $(c_r, c_w, X) = (f(X_s, I), f(X_s, 0), X_t)$. We suppose that $f(X_s, 0)$ is editing with a meaningless editing request. To ensure that $f(X_s, 0)$ results in a value far from target image feature $c_r$, we use additional patterns of triplets such as $(c_r, c_w, X) = (f(X_s, 0), f(X_s, I), X_s)$. This step encourages the model to learn an identity mapping between the source and target images if the given editing request is meaningless ($I = 0$). Therefore, we define the overall objectives:
$$L_D = \lambda_{X_r} L_{X_r} + \lambda_{X_w} L_{X_w} + \lambda_{\hat{X}_r} L_{\hat{X}_r}, \tag{15}$$
$$L_G = \lambda_{g\hat{X}_r} L_{g\hat{X}_r} + \lambda_f L_{fmatch}. \tag{16}$$
$\theta_D$, $\theta_G$, and $\theta_{Enc}$ are the trainable parameters of D, G, and the encoder part, respectively. $L_D$ and $L_G$ are the objectives for training D and G. In each iteration, the model uses (15) if $L_D > L_G$, and otherwise it uses (16). Note that $L_{fmatch}$ represents the objective of the feature matching [13] to stabilize the training of G and D. It is computed as the sum of the layer-wise mean squared errors between the latent features in D extracted from real image $X$ and those from generated image $\hat{X}$. $\lambda_{X_r}$, $\lambda_{X_w}$, $\lambda_{\hat{X}_r}$, $\lambda_{g\hat{X}_r}$, and $\lambda_f$ are the coefficients of each objective. We use 1.0 for each coefficient.
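The rule for alternating between the two objectives can be illustrated with a small sketch. It assumes each overall objective is a coefficient-weighted sum of its component losses; the function names and the dictionary keys used to label the components are hypothetical, and the numbers in the usage note are made up.

```python
from typing import Dict


def weighted_objective(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the sum of weight * loss over the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


def choose_update(loss_d: float, loss_g: float) -> str:
    """Update the discriminator when its loss is larger, else the generator."""
    return "discriminator" if loss_d > loss_g else "generator"
```

For instance, with all coefficients at 1.0, discriminator terms of 0.2, 0.3, and 0.1 and generator terms of 0.4 and 0.1 would select a discriminator update in that iteration.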

C. DCGAN FOR IMAGE EDITING WITH A CONSTRAINT
The image editing model based on DCGAN sometimes makes drastic changes to the source image, which are inappropriate for a cooperative process with users. To prevent this problem, we introduce an additional module called Source Image Masking (SIM) [7], which functions as a constraint on DCGAN for image editing. The idea of SIM is to explicitly indicate the editing points on the source image with masking. SIM is composed of two parts: mask generator $G_m$ and image encoder with mask $E_{imm}$. We next define the procedure for generating and forwarding a mask. $G_m$ generates mono-channel mask $m_{mono}$, and the masked source image is encoded as
$$\phi_{imm} = E_{imm}(X_s \odot m_{color}).$$
$m_{color}$ is a channel-wise copied mask from mono-channel mask $m_{mono}$. We utilized $m_{mono}$ for the entropy calculation, which decides on the system's dialogue strategy in Section III-D. $\odot$ indicates the Hadamard product. $\phi_{imm}$ is fed into G as additional input. Rewriting (9), we get
$$\hat{X} = G(z, \phi, \phi_{imm}).$$
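The masking step itself (copying a mono-channel mask across channels and taking the Hadamard product with the source image) can be sketched in plain Python. The function names and the `[channel][row][column]` list layout are assumptions for illustration only.

```python
from typing import List

Channel = List[List[float]]  # one image channel as [row][column]


def copy_mask_to_channels(m_mono: Channel, n_channels: int = 3) -> List[Channel]:
    """Build the channel-wise copied mask from a mono-channel mask."""
    return [[row[:] for row in m_mono] for _ in range(n_channels)]


def hadamard(image: List[Channel], mask: List[Channel]) -> List[Channel]:
    """Element-wise product of an image and a mask of the same shape."""
    return [
        [[p * m for p, m in zip(img_row, mask_row)]
         for img_row, mask_row in zip(img_ch, mask_ch)]
        for img_ch, mask_ch in zip(image, mask)
    ]
```

With a mask value of 1 a pixel passes through unchanged, and with 0 it is blanked out in every channel, so the encoder only sees the regions the mask keeps.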

D. SYSTEM'S CONFIRMATION OF ACTION DECISIONS
Confirmation, which shows multiple editing results to users from multiple models, is a safe action, as described in Section I. However, the user must pay an additional cost for responding to the confirmation. Deciding when to confirm based on an uncertainty score of the image generation makes the dialogue smoother. We use the entropy scores of the generated mask as the uncertainty scores and calculate the entropy:
$$entropy = -\frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \left[ m_{ij} \log m_{ij} + (1 - m_{ij}) \log(1 - m_{ij}) \right].$$
We define $m_{ij}$ as the value of the predicted mask at the (i,j)-th position with width W and height H. $-\alpha \log 0.5$ ($0 \le \alpha \le 1$) is our confirmation threshold. The mixed model selects confirm if $entropy \ge -\alpha \log 0.5$. We tried several values of $\alpha$ in our experiment.
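The threshold rule can be illustrated with a short sketch. It assumes the uncertainty score is the mean binary entropy of the predicted mask values, so that the maximum possible score equals −log 0.5; this choice of entropy, the clamping constant `eps`, and the function names are assumptions rather than the authors' code.

```python
import math
from typing import List


def mask_entropy(mask: List[List[float]], eps: float = 1e-12) -> float:
    """Mean binary entropy of a mask whose values lie in [0, 1]."""
    total = 0.0
    count = 0
    for row in mask:
        for m in row:
            m = min(max(m, eps), 1.0 - eps)  # avoid log(0)
            total -= m * math.log(m) + (1.0 - m) * math.log(1.0 - m)
            count += 1
    return total / count


def select_action(mask: List[List[float]], alpha: float) -> str:
    """Return "confirm" when the entropy reaches the threshold -alpha*log(0.5)."""
    threshold = -alpha * math.log(0.5)
    return "confirm" if mask_entropy(mask) >= threshold else "not confirm"
```

Under this assumption, a mask with values near 0.5 everywhere triggers confirmation for moderate α, while a nearly binary mask does not; α = 0.0 gives a zero threshold, so the system confirms on every request.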

IV. EXPERIMENTAL SETTINGS
We conducted experimental dialogues to investigate the effectiveness of our proposed dialogue strategy. In this section, we describe the dataset for the image editing dialogues, the training details of each model, and the user evaluation settings.

A. DATASET
For training the w/ and w/o mask models and for evaluation, we utilized the Avatar Image Manipulation with an Instruction dataset [7]. The task is portrait image editing based on instructions, which involve natural language editing requests.

B. TRAINING DETAILS
For training mask generator $G_m$, we used the ground truth mask, whose pixels were set to zero where the pixels in the same position of the source and target images are different, and to one otherwise. We also provided a mask loss function, the mean squared error between the generated mask and the ground truth one, to improve the SIM model. We trained the models using Adam [14] ($\alpha = 2.0 \times 10^{-4}$, $\beta = 0.5$) until 5,000 phases. The images were resized to 64 × 64. The following are the hidden sizes: 128 for $\phi_i$ and $\phi$, 1024 for $\phi_{im}$, and 512 × 4 × 4 for $\phi_{imm}$. The batch size is 64, and the vocabulary size is 1892.
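The ground truth mask and the mask loss can be sketched as follows. The sketch assumes that a pixel counts as "different" when any channel differs between the source and target images, and it uses plain nested lists in `[channel][row][column]` layout; both choices and the function names are illustrative assumptions.

```python
from typing import List

Image = List[List[List[float]]]  # [channel][row][column]
Mask = List[List[float]]         # [row][column]


def ground_truth_mask(source: Image, target: Image) -> Mask:
    """0.0 where source and target differ in any channel, 1.0 elsewhere."""
    height, width = len(source[0]), len(source[0][0])
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            differs = any(s[r][c] != t[r][c] for s, t in zip(source, target))
            row.append(0.0 if differs else 1.0)
        mask.append(row)
    return mask


def mask_mse(predicted: Mask, truth: Mask) -> float:
    """Mean squared error between a generated mask and the ground truth."""
    n = sum(len(row) for row in truth)
    return sum(
        (p - t) ** 2
        for pred_row, true_row in zip(predicted, truth)
        for p, t in zip(pred_row, true_row)
    ) / n
```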

C. EVALUATION METRIC FOR IMAGE QUALITY
We utilized Structural Similarity (SSIM) [15] to evaluate the improvement of the image quality, measured as the similarity between generated image $X$ and goal image $Y$. We calculated SSIM between images $X$ and $Y$ as follows:
$$SSIM(X, Y) = \frac{1}{3N} \sum_{ch} \sum_{i=1}^{N} \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}.$$
$x_{i,ch}$ and $y_{i,ch}$ are the i-th local patches of each RGB channel $ch$ of images $X$ and $Y$, and $N$ is the number of patches per channel. All patches are derived by vertically and horizontally sliding a square window with width $L$ one pixel at a time. $\mu_x$, $\mu_y$ are the patch means, and $\sigma_x^2$, $\sigma_y^2$, and $\sigma_{xy}$ are their variances and covariance. $C_1$, $C_2$ are constant values. For the whole experiment, we adopted commonly used parameters: $L = 7$, $C_1 = (255 \cdot 0.01)^2$, $C_2 = (255 \cdot 0.03)^2$.
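A plain-Python sketch of this patch-based SSIM is given below for illustration. It assumes a uniform (unweighted) square window, population variances and covariance (division by the number of pixels in the patch), and a simple average over all patches and channels; library implementations may differ in these details. The function names are hypothetical.

```python
from typing import List

Channel = List[List[float]]  # one image channel as [row][column], values in 0..255


def ssim_channel(x: Channel, y: Channel, win: int = 7,
                 c1: float = (255 * 0.01) ** 2,
                 c2: float = (255 * 0.03) ** 2) -> float:
    """Mean SSIM over all win x win patches of one channel (stride 1)."""
    height, width = len(x), len(x[0])
    n = win * win
    scores = []
    for top in range(height - win + 1):
        for left in range(width - win + 1):
            px = [x[r][c] for r in range(top, top + win)
                  for c in range(left, left + win)]
            py = [y[r][c] for r in range(top, top + win)
                  for c in range(left, left + win)]
            mu_x, mu_y = sum(px) / n, sum(py) / n
            var_x = sum((a - mu_x) ** 2 for a in px) / n
            var_y = sum((b - mu_y) ** 2 for b in py) / n
            cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(px, py)) / n
            scores.append(
                ((2 * mu_x * mu_y + c1) * (2 * cov + c2))
                / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
            )
    return sum(scores) / len(scores)


def ssim(image_x: List[Channel], image_y: List[Channel]) -> float:
    """Average the per-channel SSIM over all channels."""
    return sum(ssim_channel(x, y) for x, y in zip(image_x, image_y)) / len(image_x)
```

Comparing an image with itself yields a score of 1 (up to floating-point rounding), and the score drops as local means, contrast, or structure diverge.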

D. USER EVALUATION OF IMAGE EDITING DIALOGUE
In a pilot study, we found that the w/ mask model tends to successfully edit a small region in a single turn, such as changing eye color or adding a mustache or glasses. However, the w/ mask model often fails to edit a large region of the source image, such as changing hairstyle. Therefore, we focused on hair editing to evaluate the image editing dialogue. We evaluated our proposed confirmation strategy on two aspects. First, we evaluated the necessity of confirmation by comparing the strategy without confirmation using the w/ mask model and the strategy with confirmation using both the w/o and w/ mask models. Second, we evaluated the effectiveness of the confirmation strategy by comparing the strategy without confirmation or a random strategy with the others. We used 21 patterns (9 for male portraits and 12 for female portraits) as pairs of source and goal images, and conducted image editing dialogue experiments with human evaluators. The evaluators were 18 people whose TOEIC scores exceeded 730 and who could use English in daily life. At the task's beginning, the evaluators looked at the source and goal images and talked with our interactive image editing system, which has different dialogue strategies. Each pattern was evaluated by three evaluators over the following six strategies: the system selected confirm with thresholds α = 0.0, 0.25, 0.50, 0.75, 1.0 (as described in Section III-D) and randomly selected confirm. We compared these different strategies to identify the effectiveness of our proposed method on the problem of interactive image editing. Note that α represents proactiveness for confirmation: when α = 0.0, the system selects confirm every time; and when α = 1.0, it selects not confirm every time. In other words, α = 1.0 corresponds to the case where the system uses the w/ mask model every time.
FIGURE 2. Experimental results of image editing dialogue between 18 evaluators (users) and the system. #user turn denotes the total number of user actions (making an editing request and selecting an image); smaller is better. ΔSSIM denotes the current source-goal SSIM minus the first source-goal SSIM (higher is better). Each plot in the figures represents one dialogue sample. α indicates the threshold for the system to select confirmation: (a) α = 0.0, (b) α = 0.25, (c) α = 0.50, (d) α = 0.75, (e) α = 1.0, and (f) random: the system randomly selects confirmation. If α becomes smaller, the system tends to select confirmation with a lower uncertainty score. Note that every SSIM is calculated after the user's action. Therefore, when the system selects confirmation after the user makes an editing request, SSIM keeps the same value. Degradation as dialogues progress is caused by the image editing models.

1) NECESSITY OF CONFIRMATION (LIMITATION OF A SINGLE MODEL)
Confirmation is useful when the system needs to deal with multiple editing results from multiple models. It is difficult for a single editing model to accept every editing request because a trade-off exists between editing flexibility and the model constraints. We first investigated how the single w/ mask model works on an interactive image editing task. We compared models with different confirmation strategy settings for the improvement of image quality through dialogues (higher is better).

2) EFFECTIVENESS OF CONFIRMATION STRATEGY
Second, we investigated the effectiveness of our proposed confirmation strategy. If our confirmation method works with appropriate timing, it will improve performance (higher image quality with shorter dialogue length).

V. RESULTS
Next we describe and discuss our experimental results in two parts, corresponding to the evaluations introduced in Sections IV-D1 and IV-D2. Figure 2 shows the relative change of the SSIM between the current image and the goal image and plots every dialogue in each setting as the dialogue progressed. #user turn denotes the total number of user actions, i.e., making an editing request and selecting an image (smaller is better).

3) NECESSITY OF CONFIRMATION (LIMITATION OF A SINGLE MODEL)
ΔSSIM denotes relative SSIM, obtained by subtracting the first source-goal SSIM from the current source-goal SSIM. The i-th turn's ΔSSIM is defined as $\Delta SSIM = SSIM(X^s_i, X^g) - SSIM(X^s_0, X^g)$ (higher is better). The results with a higher α, such as α = 0.75, show almost the same behavior as α = 1.0, which corresponds to just using the w/ mask model. ΔSSIM worsened as the dialogue progressed because the image editing models were trained with single-turn editing triplets of {source image, target image, editing request}. In other words, the models were inadequately generalized to the degraded source images. Thus, degradation that occurred in one turn tended to be gradually amplified in the following turns. On the other hand, the results with lower α, such as α = 0.0, α = 0.25, and α = 0.50, show that some dialogues achieved a better SSIM than before the dialogue. This indicates that the w/o mask model is necessary to get better SSIM scores when changing a larger region, such as a woman's hair.

4) EFFECTIVENESS OF CONFIRMATION STRATEGY
An effective dialogue strategy satisfies not only the improvement of the image quality but also the efficiency of the image editing dialogue; a shorter dialogue is better. To evaluate the whole dialogue performance in these two aspects, we visualized the histogram of ΔSSIM/#user turn collected from the end of the dialogues (Figure 3). We applied the Mann-Whitney U test [16] (Figure 4) to compare their effectiveness. We found a significant difference (p-value < 0.001) between (c) α = 0.50 and (a) α = 0.0, indicating that (c) α = 0.50 produced more efficient dialogues.
Although the difference between (c) α = 0.50 and (f) random was not significant, we found some interesting cases where the system used confirm and not confirm more properly than random. Figure 5 shows a dialogue example where the user discovered a good strategy. First, they tried to change the hair to a ponytail. The system successfully generated a ponytail image, but unintentionally changed the eyes to green. The user asked the system to change the eyes back to blue, and it successfully did so without any redundant confirmation on this turn. On the other hand, with the random confirmation strategy, the system occasionally confirmed with inappropriate timing. For example, Figure 6 shows an inefficient case. The system should have used confirm for the editing request on i = 2, which asks for a change to a smaller part. The user cannot fundamentally avoid such cases with the random confirmation strategy.
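As an illustration of this comparison, the sketch below computes a per-dialogue efficiency score and a Mann-Whitney U statistic with a two-sided p-value from the normal approximation. The approximation omits tie and continuity corrections, which is a simplification assumed here; statistical packages typically apply them or use exact methods for small samples. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def efficiency(delta_ssim: float, n_user_turns: int) -> float:
    """SSIM improvement per user action at the end of a dialogue."""
    return delta_ssim / n_user_turns


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of sample a, two-sided p-value via normal approximation)."""
    # U counts the pairs where a's value beats b's; ties count one half.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    n1, n2 = len(a), len(b)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value
```

Each strategy contributes one efficiency value per dialogue, and two strategies are compared by passing their lists of values as `a` and `b`.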

VI. RELATED WORKS

A. VISION AND DIALOGUE
Vision and dialogue is an emerging topic at the intersection of computer vision and natural language processing. Conversational image editing system research [17], [18] attempts to understand the user utterance and identify the user's intention in an interactive image editing task using existing image editing software such as Adobe Photoshop and OpenCV. Our proposed method has the same motivation to identify the user's intention; however, our editing system is based on image generative models. Image generative models potentially enable the system to edit images more flexibly, but their generated images are difficult to control. Our proposed confirmation method provides a means to handle the generated images.

B. CONFIRMATION STRATEGY IN DIALOGUE
Confirmation strategy has mainly been investigated in the spoken dialogue system research field [19], [20]. Such spoken dialogue systems need to consider the mistakes of speech recognition or natural language understanding. In this situation, confirmation effectively manages the dialogue process. The confirmation method based on confidence measures [19] calculates the confidence score of each content word in the speech recognition candidates. The system asks the user for confirmation when it is uncertain whether the content word exists in the user utterance. Similarly, our proposed confirmation method provides a confidence score for confirmation. However, the calculation of confidence scores is based on the entropy of the image editing model. The confirmation method for a document retrieval dialogue task is based on minimizing the Bayes risk [20]. It requires a classification model to calculate the Bayes risk. In contrast, our entropy-based method does not require any additional model or dialogue data for training the model.
i indicates the turn index defined in Section II. #user turn denotes the number of user actions, which represents the total number of making an editing request and selecting an image. We put the source-goal SSIM next to each source image when the system decides on a generated image for each turn.

C. UNCERTAINTY DETECTION FOR GAN-BASED IMAGE GENERATION
Controlling generated images is an essential problem in GAN-based image generation because of the instability of the generated image quality. To stabilize the image quality, the truncation trick, which restricts the acceptable sample on latent space z, performs well in conditional image generation [21]. However, it does not provide any information about the uncertainty. Our entropy-based method provides uncertainty scores for the generated images.
Uncertainty detection for GAN-based models has been scrutinized in anomaly detection [22], [23]. However, it measures the distances between the generated images and the samples in a training dataset without indicating their suitability for the given condition. Our entropy-based method is based on a mask, which is made from the given condition, that can provide a confidence score that represents the suitability of the generated image for the given condition.

VII. CONCLUSION
We proposed an entropy-based confirmation method using a masking mechanism for interactive image editing. The mask mechanism is useful for dealing with such complicated conditions as natural language, but such a strong constraint limits the acceptable language requests. In an avatar image editing task with natural language editing requests, changing such large regions as hair is restricted in the w/ mask model. The system's capability to confirm an action provides a chance to select a relevant image generated from both the w/o and w/ mask models. We demonstrated that our proposed strategy led to more similar images with fewer dialogue turns during human evaluations. We also showed an interesting case where our confirmation method achieved an efficient dialogue strategy: it first changed a large part and then fine-tuned a small part. In future work for more effective dialogues, we will collect dialogue data and enable our system to learn adaptive strategies using, for example, reinforcement learning. Another future direction is applying our method to more natural/photo-realistic image datasets. Masking mechanisms are effective in image-to-image translation tasks with these datasets [4]-[6]; thus, we expect our method to also work well with them.
SEITARO SHINAGAWA received the B.E. and M.S. degrees in information science from Tohoku University, in 2013 and 2015, respectively, and the Ph.D. degree from the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), in 2020. Since 2020, he has been a Researcher with NAIST. His working area is visually-grounded dialogue systems, especially interactive image generation systems with dialogue. He is a member of JSAI. His research interests include spoken dialogue systems, multimodal systems, response generation and retrieval, deep learning, and natural language processing.
KALLIRROI GEORGILA is currently a Research Associate Professor with the Department of Computer Science, University of Southern California (USC) and the USC Institute for Creative Technologies. Before joining USC, she was a Research Scientist with the Educational Testing Service in Princeton, USA, and before that a Research Fellow with the School of Informatics, University of Edinburgh, U.K. Her research interests include all aspects of natural language dialogue processing with a focus on machine learning, particularly reinforcement learning of dialogue policies, speech recognition, and expressive conversational speech synthesis. She has served on the organizing, senior, and program committees of many conferences and workshops. She has also served as Vice President of the Special Interest Group on Discourse and Dialogue (SIGdial). She is an Associate Editor of the Dialogue and Discourse journal and on the Editorial Board of the Computational Linguistics journal.
DAVID TRAUM (Member, IEEE) received the Ph.D. degree in computer science from the University of Rochester, in 1994. He is currently the Director for Natural Language Research, Institute for Creative Technologies (ICT) and Research Professor with the Department of Computer Science, University of Southern California (USC). He leads the Natural Language Dialogue Group at ICT. His research focuses on dialogue communication between human and artificial agents. He has engaged in theoretical, implementation, and empirical approaches to the problem, studying human-human natural language and multi-modal dialogue, as well as building a number of dialogue systems to communicate with human users. He has authored over 250 refereed technical articles. He is the Founding Editor of the Dialogue and Discourse journal, has chaired and served on many conference program committees, and is a past President of SIGDIAL, the international special interest group in discourse and dialogue.