Multimodal Encoder-Decoder Attention Networks for Visual Question Answering

Visual Question Answering (VQA) is a multimodal task that spans Computer Vision (CV) and Natural Language Processing (NLP); the goal is to build a highly efficient VQA model. A fine-grained, simultaneous understanding of both the visual content of images and the textual content of questions is at the heart of VQA. In this paper, we propose novel Multimodal Encoder-Decoder Attention Networks (MEDAN). MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth, and captures rich and reasonable question and image features by associating keywords in the question with important object regions in the image. Each MEDA layer contains an Encoder module that models the self-attention of questions and a Decoder module that models the question-guided attention and self-attention of images. Experimental results on the benchmark VQA-v2 dataset demonstrate that MEDAN achieves state-of-the-art VQA performance. With the Adam solver, our best single model delivers 71.01% overall accuracy on the test-std set, and with the AdamW solver, it achieves an overall accuracy of 70.76% on the test-dev set. Additionally, extensive ablation studies are conducted to explore the reasons for MEDAN's effectiveness.


I. INTRODUCTION
Visual Question Answering (VQA) is a multimodal learning task that aims to automatically answer a natural language question about the contents of a given picture. Compared to other multimodal learning tasks (e.g., image captioning [1], [6], [7], image-text matching [8], [12]), VQA is more challenging because it requires capturing the high-level interactions between the question and the image in order to predict an accurate answer. The performance of VQA has been substantially improved in recent years by designing better feature extractors, better attention mechanisms, or better multimodal fusion approaches.
Despite extensive study, most existing VQA methods still focus on learning fine-grained question features and image features to obtain richer multimodal feature representations. With the development of attention mechanisms in deep learning, attention has been successfully applied to the VQA task. Taking the question in Fig. 1 as an example, to correctly answer the question "How many people are playing", the attention model should focus on specific regions of the image (i.e., the people around the Frisbee). The visual attention method was first proposed by Shih et al. [26], and it has become an integral part of VQA models that require fine-grained visual understanding [20], [29]. Besides visual attention methods, researchers have successfully introduced co-attention approaches that focus on both the important regions of the image and the keywords of the question, learning visual attention and textual attention together [27], [30]. However, the early co-attention methods mostly learn coarse interactions and ignore the dense interactions between each question word and each image region, so they cannot infer the correlation between an arbitrary question word and an arbitrary image region. This is a severe limitation of the early co-attention methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
To overcome this limitation, two dense co-attention models, DCN [28] and BAN [12], have been proposed to model dense interactions between each question word and each image region, which greatly improved VQA performance by achieving complete interaction. Both models can be stacked in depth. However, the experimental results show that their deep models improve only slightly over their shallow models. Inspired by the Transformer model [25], two deep intra-inter attention models (i.e., DFAF [3], MLIN [5]) and two modular co-attention models (i.e., MCAN [2], MUAN [4]) have been introduced to the VQA task, and all of them achieve top-level accuracy.
Although the current multimodal co-attention models have achieved good performance, we found that, when learning the fine-grained features of image regions, the order of the attention units matters: learning question-guided-attention features first, rather than self-attention features first, yields better image region representations. We conjecture that the former helps to understand the image, while the latter acts more like a reasoning module that builds on that understanding. Therefore, we propose a new model, MEDAN, that consists of MEDA layers cascaded in depth. As shown in Fig. 1, each MEDA layer includes an Encoder module and a Decoder module. The core of the Encoder is a text self-attention ($SA_E$) unit, which is used to model fine-grained question features; the Decoder mainly contains a question-guided-attention ($GA_{EV}$) unit and an image self-attention ($SA_V$) unit, which aim to extract fine-grained image region features. The Encoder and Decoder layers can be stacked in depth to obtain more fine-grained feature representations. Evaluation results on the benchmark VQA-v2 [40] dataset show that our proposed MEDAN model achieves state-of-the-art VQA performance. With the Adam solver, our best single model delivers 71.01% overall accuracy on the test-std set, making it the second model, after MUAN [4], to reach 71% on the test-std set. With the AdamW solver, we achieve an overall accuracy of 70.76% on the test-dev set. Finally, a series of ablation studies is conducted to explore the effectiveness of the proposed MEDAN.
The main contributions of this paper are as follows:
• Evaluation results on the benchmark VQA-v2 dataset demonstrate that MEDAN achieves state-of-the-art VQA performance. With the Adam solver, our best single model delivers 71.01% overall accuracy on the test-std set, and with the AdamW solver, we achieve an overall accuracy of 70.76% on the test-dev set.
• Extensive ablation studies explore the effectiveness of the proposed MEDAN.
The remainder of this paper is organized as follows: Section II reviews the research work related to VQA; Section III presents the overall framework and design of MEDAN; Section IV verifies the effectiveness of MEDAN through experiments; finally, Section V concludes the paper.

II. RELATED WORK
With the development and application of deep learning, research on VQA has become more extensive and the performance of VQA has improved considerably. Several lines of VQA research are introduced here.

A. MULTIMODAL FEATURE REPRESENTATION FOR VQA
The basis of the VQA task is to extract the visual features of the image and the textual features of the question accurately and effectively. In early VQA methods, the VGG [13] network was commonly used to extract visual features. After the ResNet network was proposed by He et al. [14], researchers gradually shifted to ResNet, which performs better than VGG in visual feature extraction. More recently, the bottom-up attention network [1], derived from Faster R-CNN [31], has outperformed ResNet and won the VQA Challenge 2017. For the extraction of textual features, the LSTM network [15] and the GRU network [16] are commonly used. Additionally, pre-trained representations such as GloVe [17] or BERT [18] are also applied to obtain better features. Multimodal feature representation plays an important role in improving VQA performance.

B. MULTIMODAL FEATURE FUSION FOR VQA
The VQA task requires an understanding of the image and question contents as well as the relation between them. For this reason, early VQA methods used simple joint representations [33] of the two modalities for multimodal feature fusion. To capture the high-level interactions between image and question, Fukui et al. [20] applied MCB, a feature fusion method based on bilinear pooling, to VQA. Subsequently, Kim et al. [34] proposed the MLB method, which matches the performance of MCB with fewer parameters. Yu et al. [21] then proposed the MFB method, which achieves better performance than MCB with fewer parameters. Later, Yu et al. [30] proposed the MFH method, which outperforms the aforementioned methods. Additionally, Ben-Younes et al. [22] proposed the MUTAN method and showed that both MCB and MLB are special cases of it. Multimodal feature fusion is key to improving VQA performance.

C. ATTENTION MECHANISM FOR VQA
The attention mechanism enables a VQA model to automatically ignore the image regions irrelevant to the given question and to focus on the important image regions, thus predicting a correct answer. In the early stage, the attention mechanism was mainly used to focus on regions of the image, which is called visual attention. Yang et al. [32] proposed Stacked Attention Networks, which learn the attention on image regions through multiple iterations. By using object regions instead of spatial grids to represent the image region features, Anderson et al. [1] proposed the BUTD network to learn the attention on object regions of the image. Later, the co-attention method, which learns not only visual attention on the image but also textual attention on the question, was introduced to VQA. Lu et al. [27] proposed a co-attention learning method that alternately learns image attention and question attention. Yu et al. [30] simplified co-attention into two steps: self-attention learning of the question, followed by question-guided-attention learning of the image. These co-attention models cannot infer the correlation between an arbitrary question word and an arbitrary image region because they ignore the dense interactions between them. To overcome this problem, two dense co-attention models, BAN and DCN, were proposed by Kim et al. [12] and Nguyen et al. [28], respectively, to model complete interactions between each question word and each image region. However, the disadvantage of DCN and BAN is that their deep models are not much better than their shallow models. To address this disadvantage, several new models based on deep co-attention were proposed by Gao et al. [3], [5] and Yu et al. [2], [4], which achieve better performance on the VQA task.

1) IMAGE REPRESENTATION
Following [1], the input image is represented as a set of image region features. These features are extracted by a Faster R-CNN [31] model pre-trained on the Visual Genome dataset [36]. Given an image $I$, its image features are represented as $V \in \mathbb{R}^{k \times 2048}$, where $k \in [10, 100]$ is the number of object regions. The $i$-th object is represented by a feature $v_i \in \mathbb{R}^{2048}$.

2) QUESTION REPRESENTATION
First, a given question $Q$ is tokenized into words. Then, each word of the question is transformed into a vector using 300-dimensional GloVe word embeddings [17] pre-trained on a large-scale corpus, which gives a word-embedding matrix of size $u \times 300$, where $u \in [1, 14]$ is the number of words in the question. Finally, the word embeddings are passed through a one-layer, 512-dimensional LSTM network to obtain the question features $E \in \mathbb{R}^{u \times 512}$. The $j$-th word is represented by a feature $e_j \in \mathbb{R}^{512}$.
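$V$ and $E$ are given as follows:

$$V = \mathrm{FRCNN}(I) \tag{1}$$

$$E = \mathrm{LSTM}(\mathrm{GloVe}(Q)) \tag{2}$$

where $\mathrm{FRCNN}(\cdot)$ represents the extraction of image features by a Faster R-CNN model, $\mathrm{GloVe}(\cdot)$ represents the lookup of pre-trained GloVe word embeddings, and $\mathrm{LSTM}(\cdot)$ represents the extraction of question features by the LSTM network.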
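Because $k$ and $u$ vary from sample to sample, batching needs fixed-size inputs. The following is a minimal sketch of zero-padding to the maxima of the stated ranges ($k = 100$, $u = 14$). The helper name `zero_pad` and the toy feature sizes are illustrative assumptions, not part of the original implementation.

```python
def zero_pad(features, max_len):
    """Pad (or truncate) a list of feature vectors to exactly max_len rows.

    Hypothetical helper: rows beyond the real data are filled with zeros.
    """
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    padded = [list(row) for row in features[:max_len]]
    padded += [[0.0] * dim for _ in range(max_len - len(padded))]
    return padded


# Toy sizes: 3 detected regions with 4-dim features instead of 2048-dim.
regions = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
V = zero_pad(regions, max_len=100)  # k = 100

# Toy sizes: 2 words with 2-dim features instead of 512-dim.
words = [[1.0, 2.0], [3.0, 4.0]]
E = zero_pad(words, max_len=14)  # u = 14

print(len(V), len(V[0]), len(E), len(E[0]))
```

In a real pipeline the padded positions would usually also be masked out inside the attention softmax; that detail is not described in this section and is left out of the sketch.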
In practice, to handle the varying number of object regions $k$ and the varying number of words $u$, following [2], both $V$ and $E$ are zero-padded to their maximum sizes (i.e., $k = 100$, $u = 14$). In addition, we apply a linear transformation to $V$ to make its dimension consistent with that of the question features, as shown in (3):

$$V = \mathrm{Linear}(V) \in \mathbb{R}^{k \times 512} \tag{3}$$

In practice, to compute the attention function for $m$ queries simultaneously, we pack them into a matrix $Q \in \mathbb{R}^{m \times d_{model}}$. The $t$ key-value pairs are packed into a matrix $K \in \mathbb{R}^{t \times d_{model}}$ and a matrix $V \in \mathbb{R}^{t \times d_{model}}$. The attended features $F$ are obtained as follows:

$$W = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{model}}}\right), \qquad F = A(Q, K, V) = W V$$

where $\mathrm{softmax}(\cdot)$ represents normalization by a softmax function, $W \in \mathbb{R}^{m \times t}$ is the weight matrix, and $F \in \mathbb{R}^{m \times d_{model}}$ represents the attended features.

Fig. 4 (b) shows the Multi-Head Attention. To improve the representation capacity of the attended features, it allows the model to jointly attend to information from diverse representation subspaces at different positions. The Multi-Head Attention includes $h$ parallel heads, and each head is equivalent to an independent Scaled Dot-Product Attention operation. We compute the matrix of output features as follows:

$$F = MA(Q, K, V) = [head_1, head_2, \ldots, head_h] \, W^{O}, \qquad head_j = A(Q W_j^{Q}, K W_j^{K}, V W_j^{V})$$

where $W_j^{Q}$, $W_j^{K}$, and $W_j^{V}$ are the projection matrices of the $j$-th head and $W^{O}$ is the output projection matrix.

a: ENCODER
As shown in Fig. 3 (a), each Encoder consists of an $SA_E$ unit, which is equivalent to a Multi-Head Attention function, and a Feed Forward (FF) network. We regard the Encoder as a model that extracts fine-grained question features by self-attention learning. The input question features $E = [e_1, e_2, \ldots, e_u] \in \mathbb{R}^{u \times 512}$ are further transformed into three matrices: the query matrix $Q_E \in \mathbb{R}^{u \times 512}$, the key matrix $K_E \in \mathbb{R}^{u \times 512}$, and the value matrix $V_E \in \mathbb{R}^{u \times 512}$. They are calculated as follows:

$$Q_E = \mathrm{Linear}(E), \qquad K_E = \mathrm{Linear}(E), \qquad V_E = \mathrm{Linear}(E)$$

The $SA_E$ unit is used to learn the correlation between each word pair $\langle e_i, e_j \rangle$ in the question, and it outputs the question features through self-attention.
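The scaled dot-product attention used by the $SA_E$ unit above can be sketched in plain Python as follows. This is a minimal single-head illustration on toy matrices, assuming scaling by the square root of the feature dimension; the function names and the toy inputs are assumptions made for the example, and padding masks, dropout, and the per-head projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is m x d, K is t x d, V is t x d (lists of lists).

    Returns (F, W): attended features (m x d) and weights (m x t).
    """
    scale = math.sqrt(len(Q[0]))
    W = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        W.append(softmax(scores))
    width = len(V[0])
    F = [[sum(w * v[c] for w, v in zip(w_row, V)) for c in range(width)]
         for w_row in W]
    return F, W


# Toy example: m = 2 queries, t = 3 key-value pairs, d = 2.
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Vals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
F, W = scaled_dot_product_attention(Q, K, Vals)
print(W)
print(F)
```

For self-attention, the same function would be called with queries, keys, and values all derived from the same features; for guided attention, the queries come from one modality and the keys and values from the other.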
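In particular, a residual connection [14] followed by layer normalization [37] is applied to the output as follows:

$$F^{E}_{E \leftarrow E} = LN\left(E + SA_E(Q_E, K_E, V_E)\right)$$

where $F^{E}_{E \leftarrow E} \in \mathbb{R}^{u \times 512}$ is the output question features and $LN(\cdot)$ represents layer normalization. The FF network is composed of two fully connected layers with a ReLU [37] function, and Dropout [38] is applied to prevent overfitting. Taking $F^{E}_{E \leftarrow E}$ obtained by $SA_E$ as the input of the FF network, we get the output $FF(F^{E}_{E \leftarrow E}) \in \mathbb{R}^{u \times 512}$, to which the residual connection and layer normalization are applied again. The attended question features $E' = \mathrm{Encoder}(E) \in \mathbb{R}^{u \times 512}$ are given by:

$$FF(F^{E}_{E \leftarrow E}) = \max\left(0, F^{E}_{E \leftarrow E} W_1 + b_1\right) W_2 + b_2$$

$$E' = LN\left(F^{E}_{E \leftarrow E} + FF(F^{E}_{E \leftarrow E})\right)$$

where $\max(\cdot)$ takes the larger of its two arguments. The input, hidden, and output dimensions of the FF network are 512, 2048, and 512, and $W_1 \in \mathbb{R}^{512 \times 2048}$, $W_2 \in \mathbb{R}^{2048 \times 512}$, $b_1 \in \mathbb{R}^{u \times 2048}$, and $b_2 \in \mathbb{R}^{u \times 512}$ are four parameter matrices.
VOLUME 8, 2020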
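The Encoder's FF network and the residual-plus-normalization step described above can be sketched for a single feature row as follows. This is a simplified illustration under stated assumptions: layer normalization is shown without learnable gain and bias, the biases are per-row vectors, Dropout is omitted, and the toy dimensions (4 and 8 instead of 512 and 2048) and weight values are invented for the example.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward(x, W1, b1, W2, b2):
    """FF(x) = max(0, x W1 + b1) W2 + b2 for one row vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][c] for j, hj in enumerate(hidden)) + b2[c]
            for c in range(len(b2))]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


# Toy dimensions: d_model = 4 and hidden = 8 (the paper uses 512 and 2048).
d_model, d_hidden = 4, 8
W1 = [[0.1 * (i - j) for j in range(d_hidden)] for i in range(d_model)]
b1 = [0.0] * d_hidden
W2 = [[0.05 * (j + c) for c in range(d_model)] for j in range(d_hidden)]
b2 = [0.0] * d_model

f = [0.5, -1.0, 2.0, 0.0]  # one row of the self-attention output
e_out = add_and_norm(f, feed_forward(f, W1, b1, W2, b2))
print(e_out)
```

The same `add_and_norm` step would wrap the attention output first and the FF output second, applied independently to every row of the feature matrix.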

b: DECODER
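As shown in Fig. 3 (b), each Decoder consists of two Multi-Head Attention units (i.e., the $GA_{EV}$ unit and the $SA_V$ unit) and an FF network. The $GA_{EV}$ unit acts as a fine-grained feature extractor for question-guided attention to the image. First, the image features $V = [v_1, v_2, \ldots, v_k] \in \mathbb{R}^{k \times 512}$ and the attended question features $E'$ are transformed into $Q_V \in \mathbb{R}^{k \times 512}$, $K_E \in \mathbb{R}^{u \times 512}$, and $V_E \in \mathbb{R}^{u \times 512}$ through three linear layers. Then, to focus on the image regions relevant to the keywords in the question, the $GA_{EV}$ unit takes $Q_V$, $K_E$, and $V_E$ as inputs and outputs the features $\tilde{V} = F^{V}_{V \leftarrow E} \in \mathbb{R}^{k \times 512}$, with a residual connection and layer normalization, as follows:

$$\tilde{V} = F^{V}_{V \leftarrow E} = LN\left(V + GA_{EV}(Q_V, K_E, V_E)\right)$$

In the Decoder, the $SA_V$ unit is used to learn the correlation between each image region pair $\langle v_i, v_j \rangle$. First, $Q_{\tilde{V}} \in \mathbb{R}^{k \times 512}$, $K_{\tilde{V}} \in \mathbb{R}^{k \times 512}$, and $V_{\tilde{V}} \in \mathbb{R}^{k \times 512}$ are obtained from $\tilde{V}$ through three independent linear layers. Second, taking $Q_{\tilde{V}}$, $K_{\tilde{V}}$, and $V_{\tilde{V}}$ as inputs, the $SA_V$ unit outputs the image features $F^{V}_{\tilde{V} \leftarrow \tilde{V}} \in \mathbb{R}^{k \times 512}$, again with a residual connection and layer normalization, as follows:

$$F^{V}_{\tilde{V} \leftarrow \tilde{V}} = LN\left(\tilde{V} + SA_V(Q_{\tilde{V}}, K_{\tilde{V}}, V_{\tilde{V}})\right)$$

The function of the FF network in the Decoder is the same as that in the Encoder. Taking $F^{V}_{\tilde{V} \leftarrow \tilde{V}}$ as the input of the FF network, its output is $FF(F^{V}_{\tilde{V} \leftarrow \tilde{V}}) \in \mathbb{R}^{k \times 512}$. Then, with the residual connection and layer normalization mentioned above, the attended image features $V' = \mathrm{Decoder}(V, E') \in \mathbb{R}^{k \times 512}$ are obtained as follows:

$$FF(F^{V}_{\tilde{V} \leftarrow \tilde{V}}) = \max\left(0, F^{V}_{\tilde{V} \leftarrow \tilde{V}} W_3 + b_3\right) W_4 + b_4$$

$$V' = LN\left(F^{V}_{\tilde{V} \leftarrow \tilde{V}} + FF(F^{V}_{\tilde{V} \leftarrow \tilde{V}})\right)$$

where $\max(\cdot)$ takes the larger of its two arguments. The input, hidden, and output dimensions of the FF network are 512, 2048, and 512, and $W_3 \in \mathbb{R}^{512 \times 2048}$, $W_4 \in \mathbb{R}^{2048 \times 512}$, $b_3 \in \mathbb{R}^{k \times 2048}$, and $b_4 \in \mathbb{R}^{k \times 512}$ are four parameter matrices.

As shown in Fig. 3, $N$ is the number of stacked Encoder and Decoder layers. We define $MEDA^{1}$ (one-layer MEDA) as follows. First, taking the original question features $E^{0} = E$ as the input, a one-layer Encoder outputs $E^{1}$. Then, taking $E^{1}$ and the original image features $V^{0} = V$ as the two inputs, a one-layer Decoder outputs $V^{1}$.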
Finally, we obtain the question-image feature pair $(E^{1}, V^{1})$.
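The one-layer data flow above, repeated over $N$ stacked layers, can be sketched as a simple loop. The function `medan_forward` and the toy integer "features" are hypothetical stand-ins chosen only to make the ordering visible: within each layer the Encoder runs first, and the Decoder of that layer receives the freshly encoded question features.

```python
def medan_forward(E, V, layers):
    """Run stacked (encoder, decoder) pairs, one pair per MEDA layer."""
    for encoder, decoder in layers:
        E = encoder(E)     # E^n = Encoder(E^{n-1})
        V = decoder(V, E)  # V^n = Decoder(V^{n-1}, E^n)
    return E, V


# Toy stand-ins (assumptions): integers instead of feature matrices.
def toy_encoder(E):
    return E + 1


def toy_decoder(V, E):
    return V + E


E_out, V_out = medan_forward(0, 0, [(toy_encoder, toy_decoder)] * 3)
print(E_out, V_out)  # prints "3 6"
```

With three toy layers the question value goes 1, 2, 3 and the image value accumulates 1, 3, 6, which confirms that each Decoder sees the question features of its own layer rather than those of the final layer.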
$MEDA^{n}$ is obtained by cascading MEDA layers in depth, where $n$ is the number of stacked Encoder-Decoder layers. Denoting the input features of $MEDA^{n}$ as $E^{n-1}$ and $V^{n-1}$, its output features are $E^{n}$ and $V^{n}$, which are then fed to $MEDA^{n+1}$ as its inputs in a recursive manner. The formula is as follows:

$$[E^{n}, V^{n}] = MEDA^{n}\left([E^{n-1}, V^{n-1}]\right)$$

C. IMAGE-QUESTION FEATURE FUSION AND OUTPUT CLASSIFIER
As shown in Fig. 2 (c), after $MEDA^{n}$, we get the question features $E^{n} = [e^{n}_{1}, e^{n}_{2}, \ldots, e^{n}_{u}] \in \mathbb{R}^{u \times 512}$ and the image features $V^{n} = [v^{n}_{1}, v^{n}_{2}, \ldots, v^{n}_{k}] \in \mathbb{R}^{k \times 512}$. To fuse the multimodal features simply, the question features $E^{n}$ and the image features $V^{n}$ must each be compressed into $\mathbb{R}^{512}$. First, we design an MLP that consists of two fully connected layers to transform $E^{n}$ and $V^{n}$. Then, we compute the attention weights $\alpha^{E}$ (or $\alpha^{V}$) of the question word features $e^{n}_{j}$ (or image region features $v^{n}_{i}$) through a softmax function. Finally, the attended question features $\bar{e}^{n} \in \mathbb{R}^{512}$ (or image features $\bar{v}^{n} \in \mathbb{R}^{512}$) are obtained as the sum of the question word features $e^{n}_{j}$ (or image region features $v^{n}_{i}$) weighted by $\alpha^{E}$ (or $\alpha^{V}$). Taking $E^{n}$ as an example, the attended features $\bar{e}^{n}$ are obtained as follows:

$$\alpha^{E} = \mathrm{softmax}\left(\mathrm{MLP}(E^{n})\right)$$

$$\bar{e}^{n} = \sum_{j=1}^{u} \alpha^{E}_{j} \, e^{n}_{j}$$

where $\mathrm{MLP}(\cdot)$ denotes the two-layer transformation whose input, hidden, and output dimensions are 512, 512, and 1, and $\alpha^{E} = [\alpha^{E}_{1}, \alpha^{E}_{2}, \ldots, \alpha^{E}_{u}] \in \mathbb{R}^{u}$ are the learned attention weights of the words in the question. Similarly, we can get the attended image features $\bar{v}^{n}$ from $V^{n}$.
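The attention-weighted compression described above can be sketched as follows. This minimal example assumes the per-row scores have already been produced by the MLP (here they are just hand-picked numbers), and the function name `attention_pool` and the toy 2-dimensional features are illustrative assumptions.

```python
import math


def attention_pool(features, scores):
    """Softmax the per-row scores, then return the weighted sum of the rows.

    features: n x d list of lists; scores: n raw scores (stand-ins for the
    MLP outputs). Returns (pooled, alpha).
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * row[c] for a, row in zip(alpha, features))
              for c in range(dim)]
    return pooled, alpha


# Toy example: 3 word features of dimension 2 (the paper uses u x 512).
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_alpha = attention_pool(word_feats, [0.0, 0.0, 0.0])
peaked_pooled, peaked_alpha = attention_pool(word_feats, [10.0, 0.0, 0.0])
print(uniform_pooled, peaked_alpha)
```

Equal scores reduce the operation to mean pooling, while one dominant score makes the pooled vector approach that single word's feature, which is the behavior the attention weights are meant to provide.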
After obtaining $\bar{e}^{n}$ and $\bar{v}^{n}$, we design a linear multimodal fusion method to get the fusion feature $f \in \mathbb{R}^{1024}$. Then, the fusion feature $f$ is projected into a vector $s \in \mathbb{R}^{L}$, where $L$ is the number of the most frequent answers in the training set. Finally, a sigmoid function [35] is used to obtain the final classification result $A \in \mathbb{R}^{L}$ as follows:

$$f = LN\left(W_e^{T} \bar{e}^{n} + W_v^{T} \bar{v}^{n}\right)$$

$$s = \mathrm{Linear}(f) \tag{18}$$

$$A = \mathrm{sigmoid}(s)$$

where $W_e^{T} \in \mathbb{R}^{512 \times 1024}$ and $W_v^{T} \in \mathbb{R}^{512 \times 1024}$ are two linear projection matrices, each equivalent to a linear layer with an input dimension of 512 and an output dimension of 1024; $LN(\cdot)$ is used to stabilize training, and $\mathrm{sigmoid}(\cdot)$ is used for classification.
In particular, following [39], we use binary cross-entropy (BCE) as the loss function to train the classifier over the $L$ answers.
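The sigmoid output and the BCE loss mentioned above can be sketched as follows. This is an illustrative example, assuming hard 0/1 targets and a mean over the answer dimension; the target encoding and the reduction are assumptions, since this section does not specify them, and the toy vector has 4 answers rather than $L$.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(scores, targets, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(scores) and targets."""
    total = 0.0
    for s, t in zip(scores, targets):
        p = sigmoid(s)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(scores)


# Toy example: 4 candidate answers, the second one is correct.
targets = [0.0, 1.0, 0.0, 0.0]
good_scores = [-4.0, 4.0, -4.0, -4.0]
bad_scores = [4.0, -4.0, 4.0, 4.0]
print(bce_loss(good_scores, targets), bce_loss(bad_scores, targets))
```

Because each answer is scored by its own sigmoid, the classifier treats the $L$ answers as independent binary decisions rather than as one softmax distribution.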
Note that $\mathrm{Linear}(\cdot)$ in the aforementioned equations represents a linear layer, which is used for linear transformation. The input and output dimensions of every $\mathrm{Linear}(\cdot)$ are 512, except for the one in (3), which has a 2048-dimensional input and a 512-dimensional output, and the one in (18), which has a 1024-dimensional input and an $L$-dimensional output.

IV. EXPERIMENTS
In this section, we verify the performance of the proposed MEDAN model through a series of experiments. Section A introduces the benchmark dataset VQA-v2 used in our experiments. Section B describes the hyper-parameters used in the experiments. Section C presents extensive ablation studies on the optional hyper-parameters described in Section B (e.g., the number $N$ of stacked Encoder and Decoder layers and the number of heads $h$ in Multi-Head Attention). Section D explores the effectiveness and shortcomings of MEDAN through attention visualization. Section E compares the performance of MEDAN with several state-of-the-art models.

A. DATASETS
In our experiments, we use the benchmark dataset VQA-v2 [40] for training and evaluation. It consists of images from the Microsoft COCO dataset [41] and human-annotated question-answer pairs for these images. Compared to the dataset VQA-v1 [9], VQA-v2 has more annotations and is better balanced. The VQA-v2 dataset is split into a train set (82783 pictures and 443757 question-answer pairs), a validation set (40504 pictures and 214354 question-answer pairs), and a test set (81434 pictures and 447793 question-answer pairs). The test set is further divided into a test-dev set of 25% and a test-std set of 75%. Each image has 3 questions and each question has 10 answers. The answer with the highest frequency is regarded as the correct answer. Additionally, all questions are divided into three types: Yes/No, Number, and Other.
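Selecting the most frequent of the 10 human answers, as described above, can be sketched as follows. The helper name `most_frequent_answer`, the lowercase and whitespace normalization, and the sample answers are assumptions made for illustration; the dataset's own preprocessing may differ.

```python
from collections import Counter


def most_frequent_answer(answers):
    """Return (answer, count) for the most common normalized answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0]


# Hypothetical set of 10 annotator answers for one question.
human_answers = ["2", "2", "two", "2", "3", "2", "2", "2", "4", "2"]
answer, count = most_frequent_answer(human_answers)
print(answer, count)  # prints "2 7"
```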

B. EXPERIMENTAL DETAILS
In our experiments, we use the following hyperparameters. The input and output dimensions of the Multi-Head Attention are 512, and the number of heads is $h \in \{1, 2, 4, 8\}$. Following the strategy in [39], we set the number of most frequent answers to $L = 3129$.
The number of stacked Encoder and Decoder layers is $N \in \{1, 2, 4, 6, 8\}$. Additionally, the dropout rate in all fully connected layers is 0.1, and the batch size for training is set to 64.
To train the MEDAN model, we use two strategies. One uses the Adam solver [23] with parameters ($\beta_1 = 0.9$, $\beta_2 = 0.98$) following [2], and the other uses the AdamW solver [24] with the same $\beta_1$ and $\beta_2$. We also adopt the same learning rate strategy as [2]. All the models in our experiments are trained for up to 14 epochs. When the accuracy is evaluated on the validation set, only the train set is used for training. When testing on the test set, we use the train set, the validation set, and the vg set for training, where the vg set comes from the Visual Genome dataset [36]. Using this additional VQA data for training improves the performance. The implementation of the proposed MEDAN model is based on PyTorch. The hardware and software used in our experiments are shown in Table 1.
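For reference, the hyperparameters stated in this paper can be collected in one place. The sketch below is a hypothetical configuration object, not the authors' code: the class name `MedanConfig` and its field names are invented, the defaults for the number of heads and layers follow the best single model reported in the ablation studies, and the learning rate is left out because it is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedanConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""
    d_model: int = 512         # attention input/output dimension
    ff_hidden: int = 2048      # hidden size of the FF network
    num_heads: int = 8         # h, best single model
    num_layers: int = 6        # N, best single model
    dropout: float = 0.1
    batch_size: int = 64
    num_answers: int = 3129    # L
    max_epochs: int = 14
    max_regions: int = 100     # padded k
    max_words: int = 14        # padded u
    optimizer: str = "adamw"   # "adam" or "adamw"
    betas: tuple = (0.9, 0.98)

    def __post_init__(self):
        if self.optimizer not in ("adam", "adamw"):
            raise ValueError("optimizer must be 'adam' or 'adamw'")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


cfg = MedanConfig()
adam_cfg = MedanConfig(optimizer="adam")
print(cfg)
```

Keeping the two solver variants as instances of one frozen configuration makes it explicit that they differ only in the optimizer field.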

C. ABLATION STUDIES OF MEDAN
We conduct extensive ablation studies with the AdamW solver on the VQA-v2 dataset to explore the effect of the variables $N$ and $h$ in our model. The results are shown in Table 2. Additionally, we use two different solvers (i.e., AdamW and Adam) to evaluate the performance on the validation set, as shown in Fig. 5.

1) DEPTH OF MEDA
First, as shown in Table 2 (a), we explore the effect of the number of layers $N$ with $h = 8$. In our best single model, the default number of layers $N$ is set to 6. During the ablation studies, we evaluate the performance for $N \in \{1, 2, 4, 6, 8\}$, and the results demonstrate that increasing the number of layers up to $N = 6$ steadily improves the performance of MEDAN. The improvement may be due to the use of residual connections [14]. When $N = 8$, there is a slight further improvement, but the training time and the number of parameters increase considerably. Therefore, considering the overall performance of the model, we choose $N = 6$ in our best model.

2) NUMBER OF HEADS
Second, as shown in Table 2 (b), we study the effect of the number of parallel heads $h$ in Multi-Head Attention with $N = 1$. The default number of heads $h$ is set to 8 in our best model. On the premise that the overall dimension of Multi-Head Attention is 512, we evaluate different $h \in \{1, 2, 4, 8\}$ on the validation set. As the results show, when $h = 8$ or $h = 16$, we get the best performance. Considering the training time, we choose $h = 8$ in our best model.
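Since the overall dimension is fixed at 512, varying $h$ trades the number of heads against the width of each head. The short calculation below assumes the dimension is split evenly across heads, as in the standard Transformer [25]; the symbol $d_h$ for the per-head dimension is introduced here only for illustration.

```latex
d_h = \frac{d_{model}}{h} = \frac{512}{h},
\qquad
h = 1, 2, 4, 8 \;\Longrightarrow\; d_h = 512,\ 256,\ 128,\ 64 .
```

Under this assumption, the default $h = 8$ gives eight heads of 64 dimensions each, so the total attention width stays the same across the ablation settings.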

3) TWO DIFFERENT SOLVERS
In our experiments, we use two different solvers to train the model. Fig. 5 (a), (b), (c), and (d) show the per-type and overall accuracies, respectively. Although the results on the validation set show that AdamW is slightly better than Adam for training the model, we still used both solvers to evaluate the performance on the test set, and the final results demonstrate that both of them are helpful for our proposed MEDAN model.

prediction (P) are presented on the right. The brightness of a region represents its correlation with the question, and the darkness of a word represents its importance in the attention weights. For the correct samples, we find that the learned question attention is more focused on keywords, and that the learned image attention captures a better correlation between keywords and the relevant image regions. For the incorrect samples, attention learning is uncertain: sometimes it cannot focus on the real keywords (e.g., ignoring the phrase 'looking up' in the third example), and therefore focuses on irrelevant image regions, which leads to incorrect predictions. These visualization results will help us further improve the model in the future.

E. COMPARISON WITH THE STATE-OF-THE-ART
In Table 3, we compare our MEDAN model with the current state-of-the-art models on the VQA-v2 dataset, where all results are from single models. Among them, BUTD [1] is the winner of the VQA Challenge 2017. This model proposed the bottom-up attention method based on Faster R-CNN for the extraction of visual features, which is used by all subsequent models. MFH [30] is a strong bilinear pooling method at present. The BAN model [12] achieves good performance by stacking 12 layers, and with an additional counting module [42], its object counting performance is improved significantly. DFAF [3] and MCAN [2] are the best models based on deep co-attention. Our proposed MEDAN model is also based on deep co-attention. As shown in Table 3, when trained with the Adam solver, our model is 0.56 and 0.66 points higher than DFAF on test-dev and test-std. Although it is 0.03 points lower than MCAN on test-dev, it is 0.11 points higher on test-std. When trained with the AdamW solver, our model is 0.54 and 0.16 points higher than DFAF and MCAN on test-dev, and 0.64 and 0.08 points higher than DFAF and MCAN on test-std. By using Gated Self-Attention instead of Self-Attention, the MUAN model, which is also based on deep co-attention, achieves the highest accuracy. Although the accuracy of our model is slightly lower than that of MUAN, the advantage of our model is still obvious. We conjecture that combining the two models may further improve the accuracy, which is an instructive direction for our subsequent research. In short, these comparisons demonstrate the effectiveness of our proposed MEDAN.

V. CONCLUSION
In this paper, we proposed a novel model, Multimodal Encoder-Decoder Attention Networks (MEDAN), for VQA. The core idea is to design a Multimodal Encoder-Decoder Attention (MEDA) layer, which consists of an Encoder module and a Decoder module that can be stacked in depth. The core of the Encoder is a question self-attention ($SA_E$) unit, which is used to model fine-grained question features. The Decoder mainly contains a question-guided-attention ($GA_{EV}$) unit and an image self-attention ($SA_V$) unit, which aim to model fine-grained image region features. By stacking the Encoder-Decoder layers, our proposed MEDAN model achieves a new state-of-the-art VQA performance.