Double Attention for Multi-Label Image Classification

Multi-label image classification is an essential task in image processing. Improving the correlation between labels by learning multi-scale features from images is a very challenging problem. We propose a Double Attention Network (DAN) to improve the correlation between image feature regions and labels, as well as the correlation between labels. Firstly, a dynamic learning strategy is used to extract the multi-scale features of the image to solve the problem of inconsistent object scales in the image. Secondly, in order to improve the correlation between image feature regions and labels, we use a spatial attention module to focus on the important regions of the image and learn their salient features, while we use a channel attention module to model the correlation between channels and thereby improve the correlation between labels. Finally, the output features of the two attention modules are fused for multi-label image classification. Experiments on the MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE datasets demonstrate that our model significantly outperforms state-of-the-art models. Besides, visualization analyses show that our model has a strong ability to learn salient image features and capture label correlations.


I. INTRODUCTION
Multi-label image classification has always been a research hotspot in the field of computer vision; it aims to recognize the different objects and attributes in an image. However, multi-label images often contain complex backgrounds, inconspicuous objects, and occlusion between objects, which makes the multi-label classification task more difficult than single-label classification. The key to multi-label image classification is to effectively learn the salient features of each object in the image and to capture the correlations between labels, which improves the prediction performance of the model.
Due to the powerful representation ability of convolutional neural networks [1], [2], [3] and the abundance of labeled datasets such as ImageNet [4], deep neural network methods have made significant progress on the multi-label image classification task. However, these methods often overlook three problems: the scale inconsistency of image objects, the correlation between image feature regions and labels, and the correlation between labels. As shown in Fig. 1, the regions occupied by the tennis ball and the traffic light in the image are almost negligible. Moreover, the relative size between a person and a tennis racket is significantly different from the relative size between a person and a bus. If a fixed scaling strategy is adopted without considering the object sizes in each image, the learned features are sub-optimal. As the scenes of multi-label images are complicated, the semantic relationship between image feature regions and labels can be established only if the model can focus on the important regions of the image and ignore the unimportant ones. Besides, from the label network in Fig. 1, we can see that the labels related to the same object differ across scenes. The labels related to the person in Fig. 1(a) are, e.g., tennis racket, tennis ball, and chair, whereas the labels related to the person in Fig. 1(b) are, e.g., bus, handbag, and backpack; different scene images determine different labels. Although the chair and the handbag do not appear in Fig. 1(a) and Fig. 1(b), the predicted labels can be made more consistent with the actual scene of the image through label correlation.
To solve the above three problems, we introduce a Double Attention Network (DAN), which is composed of the following three modules: a feature extraction module, a spatial attention module, and a channel attention module. Experiments on the MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE datasets show that our model is significantly better than state-of-the-art models. Besides, visual analysis proves that our proposed model can effectively learn the salient features of images and capture the correlation between labels.

FIGURE 1. The overview of the Double Attention Network (DAN).
The contributions of our work are summarized as follows: (1) We introduce a new end-to-end trainable network for the multi-label image classification task, which solves the problem of scale inconsistency of image objects. In addition, we propose two attention modules to improve the correlation between image feature regions and labels, as well as the correlation between labels. (2) We conduct experiments on three publicly available datasets: MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE. The results demonstrate that our proposed model is significantly better than state-of-the-art models.

II. RELATED WORKS

A. Multi-Label Image Classification
Recently, methods based on deep neural networks have made significant progress in various image processing tasks such as semantic segmentation [5], object detection [6], and action recognition [7]. At the same time, multi-label image classification methods based on deep neural networks [8], [9], [10], [11], [12], [13], [14] have become increasingly popular. Many researchers first applied CNN methods to multi-label image classification because of the powerful effect of convolutional neural networks (CNNs) on single-label image classification. For example, Gong [8] achieved a significant performance improvement by combining a convolutional network with a top-K ranking objective, proving that deep neural network methods are better than traditional methods. Yu [11] constructed a multi-instance and global-prior deep dual-flow network to take advantage of both global and local information. Besides, recurrent neural networks (RNNs) have been successfully applied to machine translation [15], image captioning [16], and visual question answering [17]; similarly, multi-label classification can be regarded as a sequence generation problem. For example, Wang [13] introduced a unified CNN-RNN classification framework to model the co-occurrence dependency of labels in a joint image-label embedding space, thereby improving performance. Liu [14] used a semantically regularized embedding layer as the interface between the CNN and the RNN, because regularization can partially or entirely decouple the learning problems, so that each problem can be trained more effectively and joint training becomes more efficient. Although these methods have achieved promising performance, they ignore the inconsistency of object scales in the image and fail to address the correlation between image feature regions and labels, as well as the correlation between labels.

B. Multi-Scale Features
Images in real life often contain multiple objects with different scales and postures. The low-level features (spatial features) of small objects in an image are usually more apparent, but their high-level features (semantic features) are easily lost under the influence of convolution and pooling operations. The application of multi-scale features in visual tasks such as semantic segmentation [18] and object detection [19] effectively improves model performance. However, most popular multi-scale features are built with hand-designed, fixed scaling strategies (such as SIFT or feature pyramids), whereas the scaling strategy should depend on the specific instance. Furthermore, different scaling strategies should be adopted for different images. Therefore, a flexible multi-scale feature strategy based on the real-world image is needed to eliminate the effect of object scale inconsistency in multi-label classification.

C. Attention
Attention allows the model to focus only on the regions of interest and suppresses unnecessary features; it has been widely used in image processing tasks. For example, SENet [20] considers the relationships between channels, adds an attention mechanism over feature channels, automatically learns the importance of each channel, focuses on salient features, and suppresses unimportant ones. On this basis, CBAM [21] combines channel attention and spatial attention and effectively improves prediction accuracy compared with SENet. Besides, GSoP-Net [22] uses second-order statistics of the global image in a deep convolutional neural network to capture more discriminative representations, and its performance is better than SENet's. The attention mechanism is also widely used in semantic segmentation. For example, in order to fully realize large-scale information flow in the feature map, PSANet [42] adopts a bi-directional signal flow mechanism and proposes a point-wise spatial attention model that significantly improves the scene parsing performance of the baseline model. CCNet [43] obtains the context of each pixel through criss-cross paths, and each pixel captures the long-range dependencies of all pixels through a recurrent operation. OCNet [44] employs a self-attention method to learn an object context map that records the similarities between all pixels and an associated pixel p. Recently, attention has also been applied to the multi-label image classification task. Lyu [23] introduced a dynamic attention mechanism in which the attention regions generated by the model guide an LSTM to predict the label sequence, but the model has to consider the input order of labels. RSN [24] used a saliency prediction model trained on human gaze to learn distinguishing features, but needs to be trained in multiple stages. SRN [25] captured the spatial relationships between labels by learning attention heat maps, but failed to consider the correlation between the attention regions and each label. Therefore, we propose two attention mechanisms to solve the above problems: spatial attention learns the correlation between image regions and labels, and channel attention learns the correlation between labels, so the prediction accuracy of the model is greatly improved.

III. NETWORK STRUCTURE
Our DAN is composed of the feature extraction module, the spatial attention module, and the channel attention module. Fig. 2 shows the structural framework of the DAN model. Next, we first introduce the three modules, and then introduce the strategies for fusing the two attention modules.

A. Feature Extraction Module
Real-world images usually contain rich semantic information, which can be extracted by deep neural networks. However, a fixed scaling strategy is usually used to extract multi-scale features; it is not adapted to the specific image, so the extracted semantic features are often sub-optimal. For example, the feature extraction process of one layer in a convolutional neural network can be expressed as:

$$F(x) = \sigma\Big(\mathrm{BN}\Big(\sum_{i=1}^{n} f_i(x)\Big)\Big) \quad (1)$$

where $x$ is the input of the layer, $n$ is the number of branches to be aggregated, $f_i$ is the convolution operation, $\mathrm{BN}$ is the batch normalization operation, and $\sigma$ is the nonlinear activation function. In addition, several $F(x)$ layers are usually stacked to process the feature information at one spatial resolution, and stages with decreasing spatial resolutions are stacked to integrate a pyramid scale policy into the network architecture. For example, a network of 3 stages with 2 layers in each stage can be expressed as:

$$N(x) = F_{3,2}\Big(F_{3,1}\Big(D_{r_3}\Big(F_{2,2}\Big(F_{2,1}\Big(D_{r_2}\Big(F_{1,2}\Big(F_{1,1}\Big(D_{r_1}(x)\Big)\Big)\Big)\Big)\Big)\Big)\Big)\Big) \quad (2)$$

where $D_{r_i}$ is the spatial resolution reduction operation and $r_i$ is the reduction rate. In a convolutional neural network, the spatial resolution is usually reduced by increasing the stride of the convolution operation, for example by stacking multiple bottleneck layers in ResNeXt and using a convolution with stride 2. It can be seen that applying the scaling strategy at fixed levels and stages ties each layer to a single scale, which is not conducive to extracting features from actual images.
Therefore, as shown in Fig. 3, we adopt the ELASTIC structure proposed in [26] to dynamically extract the multi-scale features of images, adding down-samplings and up-samplings to the parallel branches of each layer of the network. The ELASTIC bottleneck layer down-samples the input features on half of its paths, then up-samples the processed features and adds them to the original-resolution features. Thus, the feature extraction process of one layer in the network can be expressed as:

$$F(x) = \sigma\Big(\mathrm{BN}\Big(\sum_{i=1}^{n} U_{r_i}\big(f_i(D_{r_i}(x))\big)\Big)\Big) \quad (3)$$

where $D_{r_i}$ and $U_{r_i}$ denote the down-sampling and up-sampling operations, respectively. From this formula, it can be seen that the spatial resolution of the feature is unchanged after the information is processed. By adding the ELASTIC structure to the network, the different branches of each layer can process features at different scales, so cross-scale information can be captured. If ELASTIC is applied to all blocks in a network, the stacking of multiple layers produces exponentially many scaling possibilities, since each layer chooses its own scaling strategy. Interpolating between the maximum and minimum scales of the feature allows the network to learn feature information at various scales, thereby effectively improving the performance of the model. We use the ResNeXt-50 network as the feature extractor and add the ELASTIC structure to the conv_2, conv_3, and conv_4 blocks. Note that adding the ELASTIC structure slightly increases the parameter count of the model, but the improvement in prediction accuracy is considerable. For more details of the ELASTIC structure, please refer to [26].
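To make the dynamic scaling concrete, the following is a minimal PyTorch sketch of an ELASTIC-style bottleneck in which half of the branches operate at half resolution and are up-sampled back; the branch widths, the pooling choice, and the interpolation mode are our assumptions rather than the exact configuration of [26].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticBottleneck(nn.Module):
    """Sketch of an ELASTIC-style bottleneck: half of the parallel
    branches process a down-sampled copy of the input and are
    up-sampled back, so the output resolution is unchanged."""

    def __init__(self, channels, branch_channels=32, n_branches=8):
        super().__init__()
        mid = branch_channels * n_branches
        # Branches at the original resolution.
        self.high = nn.Sequential(
            nn.Conv2d(channels, mid // 2, 1, bias=False),
            nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 2, mid // 2, 3, padding=1,
                      groups=n_branches // 2, bias=False),
            nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
        )
        # Branches that operate at half resolution.
        self.low = nn.Sequential(
            nn.Conv2d(channels, mid // 2, 1, bias=False),
            nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 2, mid // 2, 3, padding=1,
                      groups=n_branches // 2, bias=False),
            nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
        )
        self.project = nn.Sequential(
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        h = self.high(x)
        # Down-sample, process, then up-sample back to the input size.
        l = F.avg_pool2d(x, 2)
        l = self.low(l)
        l = F.interpolate(l, size=h.shape[-2:], mode='bilinear',
                          align_corners=False)
        out = self.project(torch.cat([h, l], dim=1))
        return F.relu(x + out)  # residual connection
```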
Suppose the training set is $\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}^{C_d}$, where $N$ is the number of images in the training set and $C_d$ is the number of categories in the dataset. The feature generated by conv_5 of the ResNeXt-50 network, $x^{RE} \in \mathbb{R}^{C \times H \times W}$, is used as the input of the spatial attention module and the channel attention module, where $C$ is the number of channels of the feature map, $H$ is its height, and $W$ is its width. The feature generated by the network is calculated as follows:

$$x^{RE} = f_{\mathrm{ResNeXt\text{-}ELASTIC}}(x; \theta) \quad (5)$$

In Equation (5), $f_{\mathrm{ResNeXt\text{-}ELASTIC}}$ denotes the first five convolution blocks of the ResNeXt-50 network with the ELASTIC structure added, and $\theta$ denotes the parameters of the network.

B. Spatial Attention Module
Images usually contain multiple objects with different scales and postures, which are distributed in different positions, and there may even be occlusion relationships between the objects. Under image-level supervision, identifying the objects in an image and establishing their connection with the labels is a challenging task. This requires the model to accurately locate the position of each object in the image, and attention mechanisms can satisfy this requirement. Therefore, as shown in Fig. 4, our proposed spatial attention module makes the model focus only on the important regions of the image while ignoring the unimportant ones, and learns salient image features, thus improving the correlation between image feature regions and labels.
The feature obtained by the feature extraction module is $x^{RE}$. To generate the spatial attention map, we first apply a convolution layer with a $1 \times 1$ kernel, followed by BN and the ReLU function, to obtain the feature map $X$:

$$X = \sigma\big(\mathrm{BN}(f_{1\times1}(x^{RE}))\big) \quad (6)$$

We then apply a convolution layer with a $3 \times 3$ kernel and the Sigmoid function to obtain the spatial attention map $A$:

$$A = \delta\big(f_{3\times3}(X)\big) \quad (7)$$

where $\sigma$ is the ReLU function, $\delta$ is the Sigmoid function, $f_{1\times1}$ is the convolution layer with a $1 \times 1$ kernel, and $f_{3\times3}$ is the convolution layer with a $3 \times 3$ kernel. Besides, we multiply $X$ and $A$, and element-wise sum the result with $A$ to get the final output $x^{SA}$:

$$x^{SA} = X \otimes A \oplus A \quad (8)$$

Equation (8) highlights the features at the peaks of the attention map while preventing the low-value areas of the attention map from degenerating to 0, so that the relevant regions of the feature map are attended to. As a result, the spatial attention module further improves the performance of our model, which will be demonstrated by the ablation analysis later.
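The following is a minimal PyTorch sketch of the SAM computation in Equations (6)-(8); keeping the attention map at the full $C$ channels is our assumption where the text does not state the width.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention module (SAM): a 1x1 conv with
    BN/ReLU produces X (Eq. 6), a 3x3 conv with a sigmoid produces
    the attention map A (Eq. 7), fused as X*A + A (Eq. 8)."""

    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attend = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_re):
        X = self.reduce(x_re)               # Eq. (6)
        A = torch.sigmoid(self.attend(X))   # Eq. (7)
        return X * A + A                    # Eq. (8)
```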

C. Channel Attention Module
The high-level features extracted by a deep neural network carry rich semantic information; each channel of the feature maps can be considered the response to a specific category, and different responses are interrelated. Besides, an image usually has multiple labels, and there are specific correlations between the labels. By exploiting the interdependencies between channel maps, the correlation between the labels of an image can be mined to improve the prediction accuracy of the model. Therefore, as shown in Fig. 5, we modify the channel attention module introduced in [5] to suit the multi-label image classification task and use it to explicitly model the correlation between channels. We first reshape the output feature $x^{RE}$ of the feature extraction module into $B \in \mathbb{R}^{C \times M}$, where $M = H \times W$ is the number of pixels. We then perform matrix multiplication between $B$ and its transpose and apply a softmax to obtain the channel attention map $X \in \mathbb{R}^{C \times C}$. Finally, we perform matrix multiplication between $X$ and $B$ and element-wise sum the result with $B$ to obtain the final output $x^{CA}$. The final feature of each channel is thus a weighted sum of the features of all channels and the original features, which models the long-range dependencies between channels and implicitly improves the correlation between labels.
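A minimal sketch of the CAM computation is given below; the softmax over channel affinities and the learnable scale initialized to zero follow the DANet-style module of [5] and are assumptions where the text is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module (CAM): a C x C attention
    map is built from the Gram matrix of the reshaped feature, and the
    re-weighted channels are summed with the original feature."""

    def __init__(self):
        super().__init__()
        # Learnable scale, initialized to 0 as in DANet (assumption).
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x_re):
        n, c, h, w = x_re.shape
        B = x_re.view(n, c, h * w)                # B in R^{C x M}, M = H*W
        energy = torch.bmm(B, B.transpose(1, 2))  # C x C channel affinities
        X = F.softmax(energy, dim=-1)             # channel attention map
        out = torch.bmm(X, B)                     # weighted sum over channels
        out = out.view(n, c, h, w)
        return self.gamma * out + x_re            # element-wise sum with input
```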

D. Feature Fusion
The features extracted by the spatial attention module and the channel attention module are complementary. In order to obtain a better feature representation, the two features need to be fused. Next, we analyze and discuss several different feature fusion strategies.
Specifically, we pool the output features of the two attention modules separately and then concatenate them along the channel dimension. The resulting final feature map can be expressed as follows:

$$x^{f} = \mathrm{concat}\big(\mathrm{avgpool}(x^{SA}), \mathrm{avgpool}(x^{CA})\big)$$

where avgpool is the average pooling operation and concat is the concatenation operation. Finally, we perform a fully connected operation on the generated feature map. In addition, we can sum the output features of the two attention modules, i.e., perform an element-wise summation on the two feature maps (the channel dimension is still $C$), and then use a convolution layer with a $1 \times 1$ kernel to generate the final feature map $x^{f}$:

$$x^{f} = f_{1\times1}\big(\mathrm{sum}(x^{SA}, x^{CA})\big)$$

where sum is the element-wise summation operation and $f_{1\times1}$ is the convolution layer with a $1 \times 1$ kernel.
On the other hand, for the subtraction strategy, we only need to replace the element-wise summation with an element-wise subtraction to generate the final feature map:

$$x^{f} = f_{1\times1}\big(\mathrm{sub}(x^{SA}, x^{CA})\big)$$

where sub is the element-wise subtraction operation. The feature maps generated by the above two fusion methods are then pooled and fully connected, respectively. Besides, we can process the output features of the two attention modules separately, i.e., apply an average pooling operation and a fully connected operation to each feature map, train two classifiers, calculate the error of each with the loss function, and finally sum the two errors for back-propagation. We abbreviate this method as loss_sum; the calculation process is as follows:

$$loss\_sum = \sum_{i=1}^{2} loss_i\big(FC(\mathrm{avgpool}(x^{i}))\big)$$

where $FC$ is the fully connected operation and $loss_i$ is the loss function. Note that our feature fusion methods are simple and efficient, and our model can be trained in an end-to-end manner. In the following experiments, we make a detailed comparison of the feature fusion strategies.
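As an illustration, the fusion strategies might be implemented as follows; this is a sketch under our own naming (fuse_features and conv1x1 are illustrative), and loss_sum appears only as a comment since it fuses losses rather than features.

```python
import torch
import torch.nn.functional as F

def fuse_features(x_sa, x_ca, strategy="sum", conv1x1=None):
    """Sketch of the fusion strategies; conv1x1 is a 1x1 conv module
    (e.g., nn.Conv2d(C, C, 1)) used by the sum/sub variants."""
    if strategy == "concat":
        # Global-average-pool each branch, then join along channels.
        p_sa = F.adaptive_avg_pool2d(x_sa, 1).flatten(1)
        p_ca = F.adaptive_avg_pool2d(x_ca, 1).flatten(1)
        return torch.cat([p_sa, p_ca], dim=1)
    if strategy == "sum":
        return conv1x1(x_sa + x_ca)   # element-wise sum, then 1x1 conv
    if strategy == "sub":
        return conv1x1(x_sa - x_ca)   # element-wise subtraction
    # loss_sum is handled at training time: two pooled/FC heads are
    # trained with separate losses whose sum is back-propagated.
    raise ValueError(strategy)
```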

IV. EXPERIMENTS
In order to evaluate the effectiveness of the proposed model, we compare it, using common evaluation metrics, with state-of-the-art models on three public datasets: MS-COCO 2014 [27], Pascal VOC 2007 [28], and NUS-WIDE [29]. In the following, we first introduce the settings of the three datasets and the experiments, then introduce the metrics used in the experiments, and compare our model with state-of-the-art models. Next, we experimentally compare the four proposed feature fusion strategies and conduct ablation experiments to prove the effectiveness of each module of the model. Finally, we visually analyze the correlation between the labels of an image and the attention feature maps of the model.

A. Experimental Settings

We implement our model in PyTorch [30] and use two NVIDIA Tesla K80 GPUs to train and test it. The experiments use ResNeXt-50 pretrained on ImageNet. In the training stage, we perform a single random crop of size 224 x 224 on the input image and then apply a random horizontal flip. The RGB value of each pixel is scaled to [0, 1], then the mean is subtracted and the result is divided by the standard deviation for normalization. In the test stage, we resize the image to 224 x 224; no other data augmentation is used. The initial learning rate is set to 0.001 and decays by a factor of 10 every 10 epochs. A total of 40 epochs are trained, using SGD with momentum 0.9 and weight decay 0.0005. Due to GPU memory limitations, we set the batch size to 64, and we train the model with the binary cross-entropy (BCE) loss, which is commonly used as the baseline in this field.
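As a rough illustration, the configuration above could be wired up as follows; this is a minimal sketch in which the plain ResNeXt-50 backbone stands in for the full DAN, and the ImageNet normalization statistics are an assumption.

```python
import torch
from torch import nn, optim
from torchvision import models, transforms

# Backbone only; the ELASTIC blocks and attention modules are omitted here.
model = models.resnext50_32x4d(pretrained=True)

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),                # single random crop to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                            # scales RGB values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),  # (assumed)
])

criterion = nn.BCEWithLogitsLoss()                    # multi-label BCE loss
optimizer = optim.SGD(model.parameters(), lr=0.001,
                      momentum=0.9, weight_decay=0.0005)
# Decay the learning rate by a factor of 10 every 10 epochs, 40 epochs total.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```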

B. Evaluation Metrics
In order to make a fair comparison with state-of-the-art models, we follow [8], [9], [10], [11] and adopt the average precision (AP) score of each category and the mean average precision (mAP) score over all categories for evaluation. In addition, we report precision, recall, and F1-measure with reference to previous works [31], [32], [33], [34]. If the predicted probability of a label is greater than 0.5, the label is positive; otherwise, the label is negative. Besides, when calculating the top-3 metrics, a label among the three predictions with the highest confidence would conventionally be counted as positive even if its confidence is below 0.5. However, it is actually more useful to output a variable number of labels for each image. In other words, the labels with confidence lower than 0.5 in the top-3 are removed, and only the labels with confidence greater than or equal to 0.5 are retained; we therefore calculate the top-3 metrics per image. Specifically, the overall precision, recall, and F1-measure (OP, OR, OF1) and the per-class precision, recall, and F1-measure (CP, CR, CF1) are adopted, which are defined as follows:

$$OP = \frac{\sum_i N_i^c}{\sum_i N_i^p}, \quad OR = \frac{\sum_i N_i^c}{\sum_i N_i^g}, \quad OF1 = \frac{2 \times OP \times OR}{OP + OR}$$

$$CP = \frac{1}{C}\sum_i \frac{N_i^c}{N_i^p}, \quad CR = \frac{1}{C}\sum_i \frac{N_i^c}{N_i^g}, \quad CF1 = \frac{2 \times CP \times CR}{CP + CR}$$

where $C$ is the number of labels, $N_i^c$ is the number of images correctly predicted for the $i$-th label, $N_i^p$ is the number of predicted images for the $i$-th label, and $N_i^g$ is the number of ground-truth images for the $i$-th label.
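For concreteness, a sketch of these metrics over binarized predictions might look as follows; the guards against empty labels are our addition.

```python
import numpy as np

def multilabel_metrics(pred, gt):
    """OP/OR/OF1 and CP/CR/CF1 for binary matrices of shape
    (num_images, num_labels); pred = (probabilities > 0.5)."""
    n_correct = np.logical_and(pred, gt).sum(axis=0)  # N_i^c per label
    n_pred = pred.sum(axis=0)                         # N_i^p per label
    n_gt = gt.sum(axis=0)                             # N_i^g per label

    op = n_correct.sum() / max(n_pred.sum(), 1)       # overall precision
    orec = n_correct.sum() / max(n_gt.sum(), 1)       # overall recall
    of1 = 2 * op * orec / max(op + orec, 1e-8)

    cp = np.mean(n_correct / np.maximum(n_pred, 1))   # per-class precision
    cr = np.mean(n_correct / np.maximum(n_gt, 1))     # per-class recall
    cf1 = 2 * cp * cr / max(cp + cr, 1e-8)
    return op, orec, of1, cp, cr, cf1
```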

C. Comparison with State-of-the-Art Models

1) COMPARISON ON MS-COCO 2014
The comparison results are presented in Table 1. Our proposed model is substantially superior to other models on most metrics. The best existing method is ResNet-SRN, which learns an attention map across all labels, mines the potential relationships between labels through learned convolutions, and improves performance by combining regularized classification results with the ResNet-101 network. Different from this method, we first solve the problem of inconsistent multi-object scales in the image, and then use the correlation between image feature regions and labels, as well as the correlation between labels, to improve the prediction accuracy of the model; the classification results are better than ResNet-SRN's. Specifically, our model obtains mAP, CF1, and OF1 of 77.5%, 72.1%, and 76.0%, respectively, which are 0.4%, 0.9%, and 0.2% higher than the best method, ResNet-SRN. Importantly, our model is trained end-to-end, while ResNet-SRN requires many complex steps. Besides, the ResNeXt-101 network is not adopted in our model because of GPU memory limitations; otherwise, the performance would be further improved.

2) COMPARISON ON PASCAL VOC2007
The experimental comparison results on Pascal VOC 2007 are presented in Table 2. Our proposed model performs much better than other models. Among the recognition accuracies of the 20 labels of the dataset, our model is higher than previous methods on 11 labels, and its mAP of 91.1% is the best result. For small objects in the image, such as bird, cat, and TV, our model obtains the highest accuracy, improving by 4.7%, 2.4%, and 3.6%, respectively, over the RLSD model, indicating that our model can solve the problem of inconsistent object scales. Besides, in Pascal VOC 2007, person is the most frequently occurring label. From the table, it can be found that some labels with a semantic relationship show a specific improvement in recognition accuracy: for person and {bike, boat, bus, car, horse, motor}, which frequently co-occur, our recognition accuracy is increased by {0.8%, 0.8%, 0.5%, 0.9%, 0.9%, 1.9%}, respectively, compared with the DELTA model.
Besides, there are specific semantic dependencies among other labels, such as table and tv, and table and plant. When only the Att-Image model [37] is used, the recognition accuracies of table, tv, and plant are 76.6%, 83.0%, and 67.0%, respectively, while ours are 86.8%, 89.5%, and 78.3%, improvements of 10.2%, 6.5%, and 11.3%, respectively. This shows that our model can learn the dependency relationships between labels well and improve the prediction accuracy of the model.

3) COMPARISON ON NUS-WIDE
The quantitative results are shown in Table 3; they are similar to those on MS-COCO. Our proposed DAN performs better than state-of-the-art methods on most metrics. The C-P, C-R, and O-F1 of our DAN model substantially exceed the previous state-of-the-art results, while O-P and O-R are slightly behind the state-of-the-art models. It should be noted that the RLSD method uses bounding box information for multi-label image classification, while our proposed DAN still outperforms it by a large margin.
Besides, MS-CMA [46] uses 448 x 448 images to train and test the model, while our DAN uses only 224 x 224 images. Larger images are more conducive to learning small target objects, which benefits the final results. However, we achieve an mAP as good as MS-CMA's with smaller images, which shows that our DAN can learn the multi-scale features of images very well.

4) COMPARISON OF FEATURE FUSION STRATEGIES

The comparison results are presented in Table 4. For fairness, we change only the final feature fusion strategy of the model, while the other structures remain unchanged. We make an experimental comparison on the Pascal VOC 2007 dataset and select four object subclasses for comparison, namely bottle, cat, bike, and chair, which have the characteristics of small object size, diverse shapes, or mutual occlusion. From the comparison results, we can observe that: (1) No matter which fusion strategy is used, the experimental results of our DAN model are better than the baseline model ResNeXt-50.
(2) The loss_sum fusion strategy fails to effectively learn the spatial features and channel features, focusing only on the cross-entropy losses of the two branches, and results in the worst performance. (3) The performance of the summation fusion strategy is better than that of the concatenation strategy, which indicates that summation can better learn effective image features. For example, on the cat and chair categories, the AP of the summation method is 0.8% and 0.7% higher than that of the concatenation method, respectively. Therefore, we choose summation as the feature fusion strategy of our DAN model.

5) COMPARISON OF MODEL PARAMETERS
This section compares the number of model parameters; the comparison results are presented in Table 5. Our DAN achieves the best performance on MS-COCO 2014 and Pascal VOC 2007 with a comparatively small number of parameters. Our DAN and the DELTA model achieve the best mAP on VOC 2007, but the parameters of our model are reduced by 79%.
Although DAN has more parameters than Deep MIML and MIML-FCN, its performance on MS-COCO 2014 is higher by 17.0% and 14.0%, respectively, which shows that our DAN offers an excellent accuracy-to-parameter trade-off.

6) ABLATION STUDY
To evaluate the effectiveness of each module, we decompose the model and perform experiments on the MS-COCO 2014 dataset. The ablation results are presented in Table 6. We use ResNeXt-50 as the backbone, and the performance of the model improves considerably after adding each component.
As can be seen from the experimental results in Table 6, after adding the ELASTIC structure, the prediction accuracy of the model improves significantly over ResNeXt-50 on all metrics, which indicates that the ELASTIC structure can effectively solve the problem of inconsistent object scales in the image and thus better identify small objects. After adding the CAM module, the model performance is further improved; the experimental results on the Pascal VOC 2007 dataset also show that our model can effectively capture the correlation between labels. With the addition of the SAM module, the model can focus on the salient features in the image, further improving performance. After reasonably integrating the above modules, our proposed DAN achieves the best performance, with an mAP of 77.5%, which also reflects that the CAM and the SAM learn complementary information. In addition, DAN takes only 36 hours to train for 40 epochs.

D. Visual Analysis
First, we use the prediction results of the DAN-CAM model to qualitatively analyze the correlation between labels, and then we use the Grad-CAM [47] method to visually analyze the correlation between the important regions of the image and the labels. We describe each in detail below.

1) LABELS CORRELATION ANALYSIS OF CAM MODULE
The experimental results in Table 2 and Table 6 quantitatively illustrate the effectiveness of our CAM module in improving the correlation between labels. In this section, we use the CAM module to predict the relevant labels of images from a qualitative perspective. In the experiments, we use the MS-COCO 2014 dataset, which contains 80 object categories that can be divided into different groups according to the scene of the image, namely label groups. We choose four different scenes, namely sports, traffic, office, and food, to show that our CAM can learn the correlation between labels according to the scene category of an image and thereby improve the performance of the model. In addition to multi-label images, we also deliberately select single-label images from the dataset. In the experimental results, besides the ground truth and the labels whose predicted probability is greater than 0.5, we also give the first three labels whose predicted probability is less than 0.5, to check whether the model can predict the corresponding label group according to the actual scene. Our DAN-CAM model contains only the ELASTIC structure and the CAM module, without the SAM module. The experimental results are shown in Fig. 6.
In Fig. (A-1) and Fig. (C-4), due to the small size of the dog and the bed, the ResNeXt-50 model has difficulty learning effective features and fails to predict correctly, whereas our model predicts them correctly. For the single-label images, our model DAN-CAM can not only predict the correct label but also predict the corresponding label group based on the actual scene. That is to say, the stop sign often appears with the car, person, and bicycle, but not with labels such as oven, dining table, and bottle (except in extreme cases). This shows that our CAM module can learn the other labels associated with an image according to its most important object (the label with the highest probability), so as to improve the correlation between the image scene and the labels. In addition, we divide Fig. (B-2) and Fig. (C-2) into two different scenarios, because Fig. (B-2) contains a bicycle and Fig. (C-2) contains a laptop, although both images contain one person and one bench. Compared with the ResNeXt-50 model, our DAN-CAM not only predicts the correct labels but also does not miss any, which indicates the good predictive ability of our model; hence, our CAM module can check and correct the labels predicted by ResNeXt-50 very well. The rightmost column in Fig. 6 shows the failure cases of our model. In Fig. (A-4), our model predicts incorrectly because the logo on the snowboard is a person. However, the ResNeXt-50 model predicts the light, which does not conform to the scene. The same situation appears in Fig. (B-4), indicating that our model enhances other relevant labels according to the label with the highest probability in the image and does not predict labels that do not conform to the scene.
Therefore, compared with ResNeXt-50, our DAN-CAM not only predicts more accurately and comprehensively, but also predicts labels more in line with the actual scene of the image. From another perspective, our model learns the label group related to an image according to its salient object (the label with the highest probability); even for a single-label image, the related label group can be learned.

2) VISUAL ANALYSIS OF DAN MODEL
In order to further verify the correlation between image feature regions and labels, we use the Grad-CAM [47] method to visualize the learned feature maps. The visual results are presented in Fig. 7. Each row shows the input image, the predicted label probability distribution diagram, and the attention heat maps of the top-3 labels.
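For reference, the Grad-CAM computation can be sketched as follows; this hook-based implementation is a generic sketch of [47], not the exact code used in our experiments, and the choice of target layer (e.g., model.layer4) is an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM sketch: weight the target layer's activations
    by the spatially averaged gradients of the chosen class logit."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))

    logits = model(image.unsqueeze(0))      # (1, num_labels)
    model.zero_grad()
    logits[0, class_idx].backward()         # gradient of one label's logit
    h1.remove(); h2.remove()

    A, G = acts[0], grads[0]                # (1, C, h, w)
    weights = G.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * A).sum(dim=1))  # (1, h, w)
    cam = cam / (cam.max() + 1e-8)          # normalize to [0, 1]
    return cam.squeeze(0)

# Usage (illustrative): heatmap = grad_cam(model, img, model.layer4, 0)
```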
As can be seen from the figure, if the object corresponding to a label exists in the image, our SAM module can highlight the corresponding feature region very well. For example, in the fourth row, the image contains multiple objects, such as the car, person, and clock, and our model highlights the regions where these objects are located. Similar phenomena can be found in the other examples. At the same time, observing the predicted label probability distribution diagrams, on the one hand, the prediction probabilities of labels relevant to the image are much higher than 0.5, while those of irrelevant labels are much lower than 0.5, which shows that our DAN learns image features well and is robust. On the other hand, the labels predicted with probabilities below 0.5 are closely related to the image scene. For example, in the second row, the traffic light often appears in traffic scenes and is likely to co-occur with the car and bus. Moreover, the probability of a person co-occurring with a backpack in a traffic scene is also high: the image in the first row of Fig. 7 contains one person and one backpack, which shows that the CAM module can effectively learn the correlation between labels. This means that the predicted labels are more consistent with the actual scene of the image.

V. CONCLUSION
In this work, we introduce a new DAN model for the multi-label image classification task. DAN is composed of three essential modules: the feature extraction module, the spatial attention module, and the channel attention module. The feature extraction module uses a dynamic learning strategy to extract the multi-scale features of the image and solves the problem of inconsistent scales of image objects. Our spatial attention module focuses on the important regions of the image and ignores the unimportant ones; it can effectively learn the salient features of the image and improve the correlation between image feature regions and labels. Our channel attention module models the correlation between channels and captures the correlation between labels. Our ablation experiments verify the effectiveness of each module of our model. Experimental results on the MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE datasets show that our model is significantly superior to state-of-the-art models. Besides, the results of the visual analysis prove that the SAM can effectively learn salient features of the image and accurately locate the positions of objects, while the CAM can capture the correlation between labels.

FIGURE 2. The structure of the Double Attention Network (DAN).


TABLE 2. Comparison results of different models on Pascal VOC 2007.


FIGURE 6. Sample images in the MS-COCO validation set and predicted labels using different models. Note: labels in bold denote correct predictions; underlined labels denote missed labels in prediction; italic labels indicate a prediction probability less than 0.5.

FIGURE 7. Input images (left), predicted label probability distribution diagrams (middle), and attention heat maps (right) of the top-3 labels. The ground-truth labels are highlighted in red, with the dotted red line indicating a probability of 0.5.
