An Entropy Clustering Approach for Assessing Visual Question Difficulty

We propose a novel approach to identifying the difficulty of visual questions for Visual Question Answering (VQA) without direct supervision or annotation of difficulty. Prior work has considered the diversity of ground-truth answers given by human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained from three different models: a baseline method that takes both images and questions as input, and two variants that take only images or only questions. We use simple k-means to cluster the visual questions of the VQA v2 validation set, and then use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of difficulty is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that are not answered correctly by state-of-the-art methods. Detailed analysis of the VQA v2 dataset reveals that 1) all methods show poor performance on the most difficult cluster (about 10% accuracy), 2) as the cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) the values of cluster entropy are highly correlated with cluster accuracy. We show that our approach has the advantage of being able to assess the difficulty of visual questions without ground truth (i.e., on the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate the development of novel research directions and new algorithms.


I. INTRODUCTION
Visual Question Answering (VQA) is one of the most challenging tasks in computer vision [1], [2]: given a pair of question text and image (a visual question), a system is asked to answer the question. It has been attracting a lot of attention in recent years because it has a large potential to impact many applications, such as smart support for the visually impaired [3], providing instructions to autonomous robots [4], and intelligent interaction between humans and machines [5]. Towards these goals, many methods and datasets have been proposed. However, while VQA models typically try to predict answers to visual questions, we take a different approach in this paper: we analyze the difficulty of visual questions.
The VQA task is particularly challenging due to the diversity of annotations. Unlike common tasks such as classification, where precise ground truth labels are provided by the annotators, a visual question may have multiple different answers annotated by different crowd workers, as shown in Figure 1. In VQA v2 [6] and VizWiz [7], which are commonly used in this task, each visual question was annotated by 10 crowd workers, and almost half of the visual questions in these datasets have multiple answers [8], [9], as shown in Table 1 for VQA v2. Each of the visual questions in Figure 1 is followed by ground truth answers and corresponding entropy values. Entropy values are large when the ground truth answers annotated by crowd workers are diverse, and entropy is zero when the crowd workers agree on a single answer.
FIGURE 1. Examples of visual questions from the VQA v2 dataset with their 10 answers and corresponding entropy values. "Q" shows the question text, and "A" shows the ground truth answers, where the mark "x" indicates the number of crowd workers who annotated that answer (e.g., "'red'x9" signifies that nine people answered 'red', "'orange'x1" shows that one person answered 'orange', and so on).

The disagreement of crowd workers in ground truth annotations has been an annoying issue for researchers dealing with tasks that involve crowdsourced annotations [10]-[12]. Recently, some works on VQA have tackled this issue. Gurari et al. [8] analyzed the number of unique answers annotated by crowd workers and proposed a model that predicts when crowdsourced answers (dis)agree by using binary classifiers. Bhattacharya et al. [9] categorized the reasons why the answers of crowd workers differ, and found which co-occurring reasons arise frequently.
These works have revealed why multiple answers may arise and when they disagree; however, this is not enough to find out how multiple answers make a visual question difficult for VQA models. Malinowski et al. [13] reported that disagreement harms the performance of VQA models, therefore the diversity of answers should be an important clue. However, formulating the (dis)agreement as binary (single or multiple answers) drops the information of how diverse the multiple answers are. For example, suppose two different answers are given to a visual question. This may mean that "five people gave one answer and the other five gave the other answer," or that "one person gave one answer and the remaining nine gave the other." In the latter case, the answer given by the single annotator may be noisy, hence not suitable to take into account. To remove such noisy answers, prior work [8], [9] employed a minimum number of agreeing answers. If the agreement threshold is set to m = 2 (at least two annotators are needed for each answer to be valid), then the answer given by the single annotator is ignored. However, setting a threshold is ad hoc, and a different threshold may lead to different results when other datasets annotated by more (or fewer) than 10 workers become available.

TABLE 1. Numbers of visual questions in the VQA v2 validation set, broken down by the number of unique answers (rows) and question type (columns).

# unique answers | Yes/No | Number | Other  | All
1                | 41561  | 9775   | 18892  | 70228
2                | 33164  | 6701   | 18505  | 58370
3                | 5069   | 3754   | 15238  | 24061
4                | 621    | 2110   | 12509  | 15240
5                | 103    | 1528   | 10661  | 12292
6                | 23     | 1239   | 9186   | 10448
7                | 0      | 1062   | 7666   | 8728
8                | 0      | 952    | 6169   | 7121
9                | 0      | 726    | 4528   | 5254
10               | 0      | 287    | 2325   | 2612
total            | 80541  | 28134  | 105679 | 214354
In this paper, we propose to use the entropy values of answer predictions produced by different VQA models to evaluate the difficulty of visual questions for the models, in contrast to prior work [14] that uses the entropy of ground truth answers as a metric of diversity or (dis)agreement of annotations. In general, entropy is large when the distribution is broad, and small when it has a narrow peak. To the best of our knowledge, this is the first work to use entropy for analysing the difficulty of visual questions.
The use of the entropy of the answer distribution enables us to analyse visual questions from a novel aspect. Prior works have reported overall performance as well as performance on three subsets of VQA v2 [6]: Yes/No (answers are yes or no, for questions such as "Is it ..." and "Does she ..."), Number (answers are counts or numeric values, "How many ..."), and Other (all other answers, "What is ..."). These three types have different difficulties (i.e., the Yes/No type is easier and the Other type is harder), and per-type performance is useful to highlight how models behave on different types of visual questions. In fact, usually the first two words carry the information of the entire question [8], and previous work [15] uses this fact to switch between internal components suited to each type. This categorization of question types is useful, however it is not enough to find which visual questions are difficult. If we can evaluate the difficulty of visual questions, this could push forward the development of better VQA models.
Our goal is to present a novel way of analysing visual questions by clustering the entropy values obtained from different models. Images and questions convey different information [16], [17], hence models that take images only or questions only are often used as baselines [2], [6], [9]. Datasets often have language bias [6], [15], [18], [19], so questions alone may be enough to answer reasonably well; however, the use of image information should help to answer questions correctly. Our key idea is that the entropy values of three models (that use the image only (I), the question only (Q), and both (Q+I)) are useful to characterize each visual question.
The contributions of this work can be summarized as follows.
FIGURE 2. Overview of our entropy clustering approach. First, three models (I, Q, and Q+I) are used to predict answer distributions. The entropy values of the predicted answer distributions are computed to construct a 3D entropy vector. Then, clustering is performed on these 3D entropy vectors to analyse the accuracy for each cluster.
• Instead of using the entropy of ground truth annotations, we use the entropy of the predicted answer distribution for the first time to analyse how diverse predicted answers are. We show that entropy values of different models are useful to characterize visual questions.
• We propose an entropy clustering approach to categorize the difficulty levels of visual questions (see Figure 2). After training three different models (I, Q, and Q+I), predicting answer distributions, and computing entropy values, the visual questions are clustered. This is simple yet useful, and enables us to find which visual questions are most difficult to answer.
• We discuss the performances of several state-of-the-art methods. Our key insight is that the difficulty of visual question clusters is common to all methods, and tackling the difficult clusters may lead to the development of a next generation of VQA methods.

II. RELATED WORK
The task of VQA has attracted a lot of attention in recent years. Challenges have been conducted since 2016, and many datasets have been proposed. In addition to the standard VQA task, related tasks have emerged, such as EmbodiedQA [4], TextVQA [20], and VQA requiring external knowledge [21]-[24]. Still, the basic framework of VQA remains active and challenging, and some tasks include VQA as an important component, such as visual question generation [25], [26], visual dialog [5], [27], and image captioning [28]. VQA datasets have two types of answers. For multiple-choice questions [6], [29], [30], several candidate answers are shown to annotators for each question. For open-ended questions [2], [6], [7], [31], [32], annotators are asked to answer in free text, hence answers tend to differ for many reasons [9]. Currently, the two major datasets, VQA [2], [6] and VizWiz [7], suffer from this issue because visual questions in these datasets were answered by 10 crowd workers, while other datasets [21], [29]-[35] have one answer per visual question.
This disagreement between annotators has recently been investigated in several works. Bhattacharya et al. [9] proposed 9 reasons why and when answers differ: low-quality image (LQI), answer not present (IVE), invalid (INV), difficult (DFF), ambiguous (AMB), subjective (SBJ), synonyms (SYN), granular (GRN), and spam (SMP). The first six reasons stem from the question and/or the image, and the last three are due to issues inherent to the answers. They found that ambiguity occurs most often, and co-occurs with synonyms (same answer, different wordings) and granular (same answer, different concept levels). This work gives us quite an important insight about visual questions, however only for those that have multiple different answers annotated. Gurari et al. [8] investigated the number of unique answers annotated by crowd workers, but did not consider how the answers differ when they disagree. Instead, they used a threshold of agreement to show how many annotators gave the same answer. Yang et al. [14] investigated the diversity of ground truth annotations. They trained a model to estimate the number of unique answers in the ground truth. Their motivation is to collect almost all diverse ground truth answers within a limited crowdsourcing budget: if more answers are expected, they continue to ask crowd workers to provide more answers; if enough answers have been collected, they stop collecting answers for that question.
Our approach is to use the entropy of the answer distributions predicted by VQA models. This is a novel aspect, complementary to the prior works. Entropy captures in a single number both the number of distinct answers and the shape of the answer distribution. It therefore provides another modality for analysing visual questions at a fine-grained level. Figure 3 shows how entropy values change for the same number of unique answers. The leftmost bar's value is zero because there is only a single answer (i.e., all answers agree), and the rightmost bar represents the case where all 10 answers are different. In between, entropy values are sorted within each number of unique answers. In the experiments we will see that the entropy of the answer distributions of VQA model predictions is consistent with the entropy of the ground truth answers, and also with the number of unique answers.
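To illustrate why entropy is finer-grained than the number of unique answers, consider two questions that each receive two unique answers from 10 annotators. The following stdlib sketch (ours, not the paper's code) computes the entropy of the empirical answer distribution:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Entropy of the empirical distribution of a list of crowd answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

even = ["red"] * 5 + ["orange"] * 5   # 5 vs 5 split: genuine ambiguity
skew = ["red"] * 9 + ["orange"] * 1   # 9 vs 1 split: likely one noisy answer
```

Both lists have two unique answers, but the even split yields entropy ln 2 ≈ 0.69, while the 9-to-1 split yields only about 0.33, so entropy separates the two cases without any ad hoc agreement threshold.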
For computing the entropy, we use three different VQA models (image only, question only, and both) with the expectation that images and question texts convey different information. This has been studied in some recent works, such as [19], which utilizes the difference between a normal VQA (Q+I) model and a question-only (Q) model. Many recent works capture the difference between visual and textual information by using an attention mechanism between the two modalities. Some co-attention models [36]-[38] use visual and textual attention in each modality or in a one-way manner (e.g., from question to image). Other works (such as DCN [39], BAN [40], and MCAN [41]) investigate "dense" co-attention that uses bidirectional attention between images and questions. More recent works try to capture more complex visual-textual interactions [42]-[45]. Our work instead keeps the approach as simple as possible by using three independently trained models to obtain the entropy.
We should note that this approach is different from the uncertainty of a prediction. Teney et al. [46] proposed a model that uses soft scores, because the scores may indicate uncertainty in the ground truth annotations, and minimizes the loss between the ground truth and the predicted answer distribution. This approach is useful, yet it does not reveal the nature of visual questions.
Our approach is closely related to hard example mining [47], [48] and hardness / failure prediction [49]. Hard example mining approaches determine which examples are difficult during training, while hardness prediction jointly trains the task classifier and an auxiliary hardness prediction network. Compared to these works, our approach differs in the following two aspects. First, the VQA task is multi-modal, and assessing the difficulty of visual questions has not been considered before. Second, our approach is offline and can determine the difficulty without ground truth, i.e., before actually trying to answer the visual questions in the test set.

III. CLUSTERING VISUAL QUESTIONS WITH ENTROPY
Here we formally define the entropy. Let A be the set of possible answers that a VQA model can predict, and P(a) the probability that the model predicts answer a, which satisfies ∑_{a∈A} P(a) = 1 and P(a) ≥ 0 for all a ∈ A. The entropy H of the VQA model prediction is defined by

H = −∑_{a∈A} P(a) log P(a).   (1)
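For concreteness, Eq. (1) can be computed from a predicted answer distribution in a few lines. This is a minimal NumPy sketch (ours, not the paper's implementation), using the natural logarithm so that a uniform distribution over 3129 answers gives ln(3129) ≈ 8.048, matching the range reported later:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy H = -sum_a P(a) * log P(a) of an answer distribution (Eq. (1))."""
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()              # renormalize in case of numerical drift
    p = np.clip(p, eps, 1.0)     # avoid log(0); clipped terms contribute ~0
    return float(-(p * np.log(p)).sum())
```

A one-hot distribution gives an entropy of (essentially) zero, and a uniform distribution over 10 answers gives ln 10 ≈ 2.303, the maximum for 10 ground truth answers.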

A. CLUSTERING METHOD
To perform clustering, we hypothesize that easy visual questions lead to low entropy while difficult visual questions lead to high entropy. A similar concept has been reported in terms of human consensus with multiple ground truth annotations [13], but in this paper we address the relation between difficulty and the entropy of the answer distributions produced by VQA models. This is reasonable because for easy visual questions VQA systems can predict answer distributions in which the correct answer category has a large probability while the other categories have low probability. In contrast, difficult visual questions make VQA systems generate broad answer distributions, because many answer candidates may be equally plausible. Entropy can capture the diversity of predicted answer distributions, and also that of ground truth annotations, in the same manner. We prepare three different models that take as input the image only (I), the question only (Q), and both the question and image (Q+I). In this case, we expect the following three levels of difficulty of visual questions: level 1, questions that the Q model can already answer with low entropy (e.g., due to language bias); level 2, questions for which the image is needed and the Q+I model attains low entropy; and level 3, questions for which even the Q+I model produces a broad, high-entropy answer distribution.

B. DATASETS AND SETTING
We use VQA v2 [6], which consists of training, validation, and test sets. To train the models, we use the training set (82,783 images, 443,757 questions, and 4,437,570 answers). We use the validation set (40,504 images, 214,354 questions, and 2,143,540 answers) for clustering and analysis. We choose Pythia v0.1 [50], [51] as the base model, and modify it so that it takes questions only (Q model) or images only (I model). To do so, we simply set either the image features or the question features to zero vectors. With no modification, it is the Q+I model (i.e., Pythia v0.1). As in prior works [41], [46], [52], [53], the 3129 answers that occur at least 8 times in the training set are chosen as candidates, which results in a multi-class problem predicting 3129-dimensional answer distributions. Note that other common choices for the number of answers are 3000 [36], [54] and 1000 [55]. Even when different numbers are used, our entropy clustering approach works and we expect our findings to hold.
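The modification that produces the Q and I variants can be sketched as follows. Here `model` stands for any fusion network taking (image features, question features); the names are illustrative and not Pythia's actual API:

```python
import numpy as np

def predict_variant(model, image_feat, question_feat, variant="Q+I"):
    """Run one of the three model variants by zeroing out the unused modality."""
    if variant == "Q":                        # question-only model
        image_feat = np.zeros_like(image_feat)
    elif variant == "I":                      # image-only model
        question_feat = np.zeros_like(question_feat)
    return model(image_feat, question_feat)   # "Q+I" leaves both inputs intact
```

The same trained weights could in principle be probed this way, but in the paper the three variants are trained independently; this sketch only shows the zeroing of a modality's features.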
The metric for performance evaluation is the following, which is commonly used for this dataset [2]:

accuracy(a) = min(#annotators who answered a / 3, 1) × 100;   (2)

in other words, an answer is 100% correct if at least three annotated answers match it. First we show the performance of each model in Table 2. As expected, the I model performs worst, because the image contains no clue about the question. In contrast, the Q model performs reasonably well, particularly for the Yes/No type. The average performance of the different models (excluding I and Q) is about 84%, 47%, and 58% for the Yes/No, Number, and Other types, respectively. Note that we show averages and standard deviations (std) in Table 2, and the std values look relatively large. This is natural, and it is due to the definition of VQA accuracy (Eq. (2)). For each prediction, the accuracy is discrete: 0, 33.3, 66.6, or 100, depending on how many people provided that ground truth answer. Averaging these discrete values results in a large std. (In other words, large discretization errors lead to a large std. For example, 10 predictions with accuracy 100 and 35 with accuracy 0 result in 22.22 ± 41.57.) It is quite common for VQA papers to report average accuracy only, without std, probably because the std is large for any model and not useful for comparison. In this paper we report the std of the accuracy as well as that of the entropy and the reasons to differ.
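A simplified form of the VQA accuracy metric (ignoring the official averaging over all 9-annotator subsets) can be written as a short sketch:

```python
def vqa_accuracy(prediction, gt_answers):
    """Simplified VQA accuracy: min(#matching human answers / 3, 1) * 100."""
    matches = sum(1 for a in gt_answers if a == prediction)
    return min(matches / 3.0, 1.0) * 100.0
```

With 10 ground truth answers per question, this function only ever returns 0, 33.3, 66.6, or 100, which is the discretization that inflates the standard deviations discussed above.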
Next, in Table 3 we show the entropy values of the answer distributions predicted by the different models for each of the three types, as well as those of the ground truth annotations. The average entropy values of the models (excluding I and Q) for each type are 0.25, 1.62, and 1.72, respectively. The Yes/No type has smaller entropy than the others because its answer distributions tend to concentrate on only two candidates ("Yes" and "No"). Note that the range of entropy values differs between model predictions and ground truth answers. Entropy ranges from 0 (a single answer) to 2.303 (10 different answers) for ground truth answers, and from 0 (all probability mass on a single entry) to 8.048 (a uniform distribution with values 1/3129) for model predictions.

C. CLUSTERING RESULTS
Now we show the clustering results in Table 4. We used k-means to cluster the 3-d vectors of the 214,354 visual questions into k = 10 clusters. Note that many factors (e.g., initialization, the number of clusters, and the chosen algorithm) affect the clustering result, but we will show in the experiments that similar clustering results are obtained with different parameter settings. Here we use the simplest algorithm and a reasonable number of clusters.
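The clustering step itself is standard. The following self-contained sketch implements plain Lloyd's k-means over 3-d entropy vectors (in practice one would typically use an off-the-shelf implementation such as scikit-learn's `KMeans`; this minimal version is ours, for illustration):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means on the rows of X; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

In our setting, each row of `X` would be the (I, Q, Q+I) entropy vector of one visual question, and `k = 10`.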
Each column of Table 4 shows the statistics for each cluster. Clusters are numbered in ascending order of the entropy for the Q+I model. The top rows with 'base model entropy' show the entropy values for the three base models.
To find the three levels of visual questions, we divide the clusters by the following simple rule. For each cluster,
• if 'Q entropy' < 1, then it is level 1,
• else if 'Q+I entropy' > 2, then it is level 3,
• otherwise it is level 2.
Column colors of Table 4 indicate the levels: level 1 (clusters 0 and 1) in gray, level 2 (clusters 2 to 6) in yellow, and level 3 (clusters 7, 8, and 9) in red.
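The level-assignment rule is trivially expressed in code; the inputs are a cluster's average Q and Q+I entropy values:

```python
def difficulty_level(q_entropy, qi_entropy):
    """Three difficulty levels from a cluster's Q and Q+I entropy values."""
    if q_entropy < 1.0:       # the question alone is nearly decisive
        return 1
    if qi_entropy > 2.0:      # even the Q+I model remains uncertain
        return 3
    return 2                  # everything in between
```

Note that the Q-entropy condition is checked first, so a cluster with low Q entropy is level 1 regardless of its Q+I entropy.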
Below we describe other rows of Table 4.
• base model acc. Accuracy values of the three base models. The accuracy of the Q+I model tends to decrease as the Q+I entropy increases, which we will discuss later.
• state-of-the-art entropy and accuracy. Entropy and accuracy values of 9 state-of-the-art methods.
• test set entropy. Entropy values of the test set of VQA v2. We assign test visual questions to one of these clusters (we will discuss this later).
• GT statistics. Statistics of ground truth annotations.
Row 'entropy' shows the entropy values of the ground truth annotations. Row 'ave # ans' shows the average number of unique answers per visual question. These two rows show how ground truth answers differ in each cluster. Row 'total' shows the total number of visual questions. Rows 'yes/no', 'number', and 'other' show the numbers of each type in that cluster. Rows '# agree' and '# disagree' show the numbers of visual questions for which the 10 answers agree (all are the same) and disagree (not all are the same), as in [9].
• reasons to differ. Average values obtained by the reason classifiers [9], which output a value from 0 (not that reason) to 1 (it is this reason) for each reason independently. We train the classifiers on the subset of the VQA v2 training set provided by [9], then apply them to the VQA v2 validation set.

1) Entropy suggests accuracy.
We performed the clustering by using the entropy values of the three models based on Pythia v0.1 [50], [51]. Using a different base model may lead to different clustering results, however the values of entropy and accuracy of different state-of-the-art models exhibit similar trends; entropy values increase while accuracy decreases from cluster 0 to 9, as shown in Figure 4. This suggests that clusters with large (or small) entropy values have low (high) accuracy, as shown in Figure 5, and this tells us that entropy values are an important cue for predicting accuracy.
2) Entropy is different from reasons to differ and question types.
The most frequent reasons to differ shown in [9] are AMB, SYN, and GRN, but Figure 4 shows that the predicted values of those reasons do not correlate well with the order of the clusters. As for question types, the Number and Other types do not appear to be related to these clusters. Therefore, our approach using entropy captures different aspects of visual questions.
Level 1 (clusters 0 and 1) is dominant, covering 44% of the entire validation set, including 99% of the Yes/No type. The low entropy values and the small number of unique answers (row 'ave # ans') of these clusters can be explained by the fact that the typical answers are either 'Yes' or 'No'. The accuracy of the Yes/No type is expected to be about 85% (Table 2), and it is close to the accuracy for these clusters. In contrast, level 3 (clusters 7, 8, and 9) looks much more difficult to answer. In particular, the accuracy values of cluster 9 are about 10%, compared to over 80% for level 1. This is due to the fact that visual questions with disagreed answers gather in this level; the GT entropy is about 1.3, with more than five unique answers. However, the values of DFF, AMB, SYN, and GRN of level 3 are not so different from those of level 2, which may suggest that the quality of the visual questions is not the main reason for the difficulty.

4) Difficulty of the test set can be predicted.
This finding enables us to evaluate the difficulty of visual questions in the test set. To see this, we applied the same base models (already trained and used for clustering) to the visual questions in the test set, and computed entropy values to assign each visual question to one of the 10 clusters. The rows with 'test set entropy' in Table 4 show the average entropy values of those test set visual questions. Assuming that the validation and test sets are similar in nature, we are now able to evaluate and predict the difficulty of test set visual questions without computing accuracy. This is the most interesting result, and we have released a list [57] that shows which visual questions in the train / val / test sets belong to which cluster. This would be extremely useful when developing a new model incorporating the difficulty of visual questions, and also when evaluating performances for different difficulty levels (not only for different question types). Figure 6 shows some examples of visual questions in each level (from clusters 0, 4, 8, and 9). The entropy values of the different methods tend to be larger in cluster 9, and visual questions in cluster 9 seem to be more difficult than those in cluster 0. To answer easy questions like "Is the catcher wearing safety gear?" or "What is the player's position behind the batter?" in cluster 0, images are not necessary, and the Q model can answer correctly with low entropy. The question in cluster 9 at the bottom looks quite difficult for the models to answer because of the ambiguity of the question ("What is this item?") and of the image (containing photos of vehicles on the page of a book), even though the human annotators agree on a single ground-truth annotation.
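Concretely, assigning an unseen (test-set) visual question to a cluster only requires its 3-d entropy vector and the cluster centroids learned on the validation set. A minimal sketch (our illustration, with made-up centroid values):

```python
import numpy as np

def assign_cluster(entropy_vec, centers):
    """Index of the nearest k-means centroid; no ground-truth answers needed."""
    d = ((np.asarray(centers) - np.asarray(entropy_vec)) ** 2).sum(axis=1)
    return int(d.argmin())
```

This is exactly the assignment step of k-means, applied without re-fitting, so the difficulty level of a test question can be read off from the cluster it lands in.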

E. DISAGREEMENT OF PREDICTIONS OF DIFFERENT MODELS
For difficult visual questions, the number of unique answers is large, i.e., annotators highly disagree, while for easy questions the numbers are small and annotators agree (5.39 for cluster 9, 1.72 for cluster 0). Now the following question arises: how much do different models (dis)agree, i.e., do they produce the same answer or different answers?
To see this, we define the overlap of model predictions. We have 9 models (BUTD, MFB, MFH, BAN-4/8, MCAN-small/large, and Pythia v0.3 and v0.1 (Q+I)), and we define the "overlap" of an answer as the number of models that predict it, so that the overlap is 9 when all models predict the same answer. For example, if we have two different answers to a certain question, each answer produced (supported) by four and five models respectively, then the answer overlaps are four and five, and we call the larger one the max overlap. Therefore, a larger max overlap indicates a higher degree of agreement among the models. Figure 7 shows histograms of visual questions with different numbers of unique answers. The legend shows the details of the max overlap. Figure 8 shows similar histograms, but with the max overlap counted over correct model answers only.
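The max overlap reduces to the largest multiplicity among the 9 model predictions; a stdlib sketch makes the definition precise:

```python
from collections import Counter

def max_overlap(predictions):
    """Largest number of models agreeing on a single answer (9 = full agreement)."""
    return max(Counter(predictions).values())
```

In the four-vs-five example above, the function returns 5; when all 9 models agree it returns 9, and when every model predicts a different answer it returns 1.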
For clusters 0 and 1, almost all visual questions have one or two unique answers, and the models highly agree (a max overlap of 9 is dominant). This is expected because most visual questions in these clusters are of the Yes/No type, and the models tend to agree by predicting either of the two answers. Apparently, clusters 2, 3, and 4 look similar: the dominant max overlap is 9. This means that all 9 models predict the same answer to almost half of the visual questions, even when the annotators disagree with up to five different answers. In contrast, the models predict different answers to visual questions of clusters 6-9, even when the annotators agree and there is a single ground truth answer (this is the case in the middle of the cluster 8 column in Figure 6). Filling this gap may be a promising research direction for the next generation of VQA models.

Footnote 1: Clustering results are available online at https://github.com/tttamaki/vqd, in which we show lists of pairs of question IDs and clusters for both the validation and test sets of the VQA v2 dataset.

IV. CONCLUSIONS
We have presented a novel way of evaluating the difficulty of visual questions of the VQA v2 dataset. Our approach is surprisingly simple, using three base models (I, Q, Q+I), predicting answer distributions, and computing entropy values to perform clustering with a simple k-means. Experimental results have shown that these clusters are strongly correlated with entropy and accuracy values of many models including state-of-the-art methods.
Our work can be used in many different ways. One example is to use our work to classify the difficulty of visual questions in order to switch between different network branches, as in [15], which has a question-type classifier to select different branches. Another example is to apply a curriculum learning framework, training a model with easy visual questions first and then gradually using more difficult ones. Another possible direction is judging whether questions generated by VQA and visual dialog models are appropriate to ask. However, using our work as a component is not the only possibility, because our work provides additional insights into visual questions. For example, cluster 9 contains many visual questions that require reading text in the image, which has recently been explored as a new task by TextVQA [20]. This cluster also has visual questions with different types of difficulty, therefore the results of our work have the potential to inspire interesting new tasks that have never been explored. By providing the correspondence between clusters and visual questions in the test set as an indicator of difficulty [57], our approach explores a novel aspect of evaluating the performance of VQA models, suggesting a promising direction for the future development of a next generation of VQA models.

A. FURTHER ANALYSIS
Here we give a more detailed analysis of the clustering results.

1) Robustness to clustering initialization
In Section III-C, we argued that similar results are obtained even though many factors, including initialization, affect the clustering result. Figures 9 and 10 show results corresponding to Figure 4, but repeated 5 more times with different initializations of the k-means clustering algorithm: the k-means++ initialization scheme [58] for Figure 9, and random initialization for Figure 10.
These figures show that we obtain similar results even with different initializations of the k-means algorithm. This demonstrates the robustness of our approach to the clustering initialization.

2) Robustness to the number of clusters
Figure 11 shows results corresponding to Figure 4, but with different numbers of clusters for the k-means clustering. As stated before, the number of clusters affects the clustering result. However, we can see in these figures that similar results are obtained with both fewer clusters (k = 5) and more clusters (k = 15, 20, 50, 100). This also demonstrates the robustness of our approach to the number of clusters.

3) Effect of clustering with different features
Figure 12 shows results corresponding to Figure 4, but when using different features in the k-means clustering algorithm. We use the entropy values of answer predictions obtained from three different models: I, Q, and Q+I. Therefore, we can use different subsets of the three models for clustering.

The first three rows of Figure 12 show the clustering results obtained when using only one of the three models. The row (I) is obtained from the clustering result with the I model only, and so on. As expected, the results for the I model alone show little correlation between accuracy and the order of the clusters, while the Q+I and Q models show better correlations. The next three rows of Figure 12 show the clustering results obtained when using two of the three models together. The row (I and Q+I) is obtained from the clustering result with the I and Q+I models, and so on. Again, the combinations that include the Q model seem to affect the correlation.
The row (Q+I) exhibits the largest correlation; however, this is due to the sample imbalance between clusters, as shown in Figure 13. In contrast, the combination (I, Q, and Q+I) has a better sample balance, i.e., the samples are well distributed across the clusters. Future work includes a more qualitative investigation of this issue.

4) Normalization of accuracy
The clustering result (Figure 4) shows that large entropy leads to low accuracy. However, this might be due to the disagreement of answers among annotators. The accuracy is defined by Eq. (2), and it is bounded above by the number of matching ground truth answers. If annotators disagree completely on a certain visual question, each unique answer is provided by one annotator, and the accuracy is at most 33%. This might cause an apparent reduction of accuracy for clusters with large entropy.
To remove this effect, we define a normalized version of the accuracy as follows:

normalized accuracy(a) = accuracy(a) / max_{a′∈A} accuracy(a′),

where A is the set of all possible answers. This becomes 100% in the case of complete disagreement if the predicted answer is one of the 10 different ground truth answers. Figure 14 shows the results corresponding to Figure 4, together with the results for the normalized accuracy. As can be seen, the result for the normalized accuracy is very similar to that obtained with the original accuracy. Therefore, we can conclude that the result is due to the nature of our approach, and not due to the effect of the apparent reduction.
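The normalization can be sketched as follows, reusing the simplified per-answer accuracy of Eq. (2) (a sketch of ours, not the paper's code). For a fully disagreed question (10 unique answers), any matching prediction now scores 100%:

```python
def normalized_accuracy(prediction, gt_answers):
    """Accuracy divided by the best accuracy achievable on this question."""
    def acc(a):  # simplified Eq. (2): min(#matches / 3, 1) * 100
        return min(sum(x == a for x in gt_answers) / 3.0, 1.0) * 100.0
    best = max(acc(a) for a in set(gt_answers))
    return acc(prediction) / best * 100.0
```

When at least one answer is given by three or more annotators, the denominator is 100 and the normalized accuracy equals the original one, which is why the two curves in Figure 14 can only differ on highly disagreed questions.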