Enhanced Visual Attention-Guided Deep Neural Networks for Image Classification

A fully connected layer is essential for a convolutional neural network (CNN), which has been shown to be successful in classifying images in several related applications. A CNN begins with convolution and pooling operations that decompose an input image into features. The result is then fed into a fully connected neural network, which drives the final classification decision for the input image. However, the feature maps learned by a CNN are sometimes not good enough to yield good classification results when fed into the fully connected layers. In this article, a visual attention learning module is proposed to enhance the classification capability of the fully connected layers in a CNN. By learning better feature maps that emphasize salient regions and weaken meaningless regions, better classification performance can be obtained by integrating the proposed module into the fully connected layers. The proposed visual attention learning module can be imposed on any existing CNN-based image classification model to achieve incremental improvements with negligible overhead. Based on our experiments, the proposed method achieves average top-1 accuracies of 95.32%, 92.73%, and 66.50% on our collected Underwater Fish dataset, the public Animals-10 dataset, and the public Stanford Cars dataset, respectively.


I. INTRODUCTION
Image classification, or visual object recognition, is a fundamental problem [1]-[3] in several computer vision-based applications, such as fish species recognition for underwater exploration [4] and visual understanding-based autonomous driving [5]. Conventional image classification approaches usually rely on handcrafted features, e.g., [6], to analyze images. For example, an image retrieval framework was presented in [7] to retrieve digital images from huge databases based on texture analysis techniques for extracting discriminant features, including color and shape features. However, in recent years, with the rapid development of deep learning techniques [8] and their great success in numerous perceptual tasks, e.g., image classification [9]-[14] and image restoration [15]-[18], several CNN-based deep neural networks have been presented for image classification. For example, a deep CNN, called AlexNet [9], was presented to classify the 1.2 million high-resolution images of the ImageNet dataset into 1000 different classes. In addition, a very deep CNN, called VGGNet [10], was proposed for large-scale image recognition. Its main contribution is to evaluate the network performance with increasing depth using an architecture with very small convolution filters. Moreover, a deep CNN architecture, called Inception or GoogLeNet [11], was also presented for large-scale visual recognition. The key is to allow for increasing the depth and width of the network while keeping the computational budget constant. Furthermore, a residual learning framework, called ResNet [12], was proposed to ease the training of deeper networks for image recognition. It explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Moreover, an inverted residual network structure, called MobileNetV2 [13], was presented to improve the state-of-the-art performance of mobile models on multiple tasks, including ImageNet classification.
The associate editor coordinating the review of this manuscript and approving it for publication was Chang-Hwan Son.
On the other hand, to strengthen the representational power of a CNN, several approaches have recently been presented that enhance the quality of spatial encodings and/or recalibrate channel-wise feature responses. For example, an architectural unit, termed the squeeze-and-excitation (SE) block [19], was proposed to model interdependencies between channels. The building blocks can be stacked and easily embedded into any CNN architecture, e.g., by insertion after the non-linearity following each convolution, for performance improvement. Furthermore, a convolutional block attention module (CBAM) [20] was presented, which sequentially infers attention maps along two separate dimensions, i.e., channel and spatial. CBAM can also be integrated into any CNN architecture to achieve better performance. Among other recently developed deep attention models, a non-local neural model inspired by the classical non-local means method was presented in [21] for capturing long-range dependencies, e.g., across successive video frames. In addition, a residual attention network, built by stacking attention modules that generate attention-aware features, was proposed in [22]. Moreover, an efficient channel attention (ECA) module for deep CNNs was presented in [23], which captures cross-channel interaction in an efficient way.
For advanced applications of visual attention modules, a deep model was presented in [24] to consider the leaf spot attention mechanism. In addition, a deep architecture denoted by region-of-interest-aware deep CNN was proposed in [25] for making deep features more discriminative to increase classification performance.
In addition, visualization of the inner workings of a CNN [26] has shown that the final convolutional layer usually dominates the decision made by the CNN [27]. That is, the last convolutional layer can produce a feature map, or a coarse localization map, highlighting the important spatial regions in the input image for predicting the concept. Therefore, in this article, we propose to embed an enhanced visual attention layer only after the final convolutional layer of a CNN, in order to learn better feature maps that balance high-level semantics and detailed spatial information.
The main features and contributions of this article are threefold: (i) the proposed enhanced visual attention module directly improves the last fully connected layer by enhancing the last learned feature maps of a CNN for better classification capability, without needing extra convolutional layers; (ii) the proposed method uses the Huber loss function [28], [29], instead of the MSE loss, to guide the last learned feature map toward the corresponding ground truth, which reduces the effects of possible outlier samples; and (iii) the proposed module needs to be embedded into a CNN only once and can fit any CNN architecture for image classification with almost negligible extra overhead.
The rest of this article is organized as follows. Sec. II presents the proposed enhanced visual attention-guided deep neural networks for image classification. Experimental results are presented in Sec. III, and Sec. IV concludes this article.

II. PROPOSED ENHANCED VISUAL ATTENTION-GUIDED DEEP NEURAL NETWORKS FOR IMAGE CLASSIFICATION

A. OVERVIEW OF THE PROPOSED ENHANCED VISUAL ATTENTION MODULE
To strengthen the feature maps of salient regions and suppress those of insignificant regions in an input image, we propose to separate the feature maps learned by the last convolutional layer into one channel (map) and the remaining channels. As illustrated in Fig. 1, through the channel separation process, the C learned feature maps of a CNN are split into the feature map of the C-th channel and the remaining (C − 1) feature maps from the first to the (C − 1)-th channels. The selected map from the C-th channel is then fed into the learned enhanced visual attention module (described later) for refinement. The enhanced feature map output by the module can be viewed as a weighting coefficient map, which is used to enhance each of the remaining (C − 1) feature maps by element-wise multiplication. The refined feature maps can better capture the salient region, or the main object region, of the input image. They are then fed into the final fully connected layer of the CNN to generate the image classification result.
More specifically, let $X_c \in \mathbb{R}^{H \times W}$, $c = 1, 2, \ldots, C$, denote the $c$-th feature map learned by the last convolutional layer of a CNN. The weighting coefficient map $\omega \in \mathbb{R}^{H \times W}$, learned by the proposed enhanced visual attention module by enhancing the last channel $X_C$, is used to refine each $X_c$, $c = 1, 2, \ldots, C - 1$. The refined version $F_c$ is expressed as:

$$F_c = X_c \odot \mathrm{abs}(\omega), \quad c = 1, 2, \ldots, C - 1, \tag{1}$$

where $F_c \in \mathbb{R}^{H \times W}$ is the refined version of $X_c$, and $H$ and $W$ denote the height and the width of the feature map, respectively. The function $\mathrm{abs}(\cdot)$ replaces each coefficient of $\omega$ with its absolute value, and the operation $\odot$ denotes element-wise multiplication. In our method, $\omega$ is used to highlight the significant region in the input image and suppress insignificant information for image classification, as illustrated in Fig. 2.
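The channel separation and refinement described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: feature maps are represented as nested lists, the function name `refine_feature_maps` is hypothetical, and the weighting coefficient map `omega` is passed in as a given input because the internal structure of the attention module is not specified in this subsection.

```python
from typing import List

Map = List[List[float]]  # one H x W feature map stored as nested lists


def refine_feature_maps(feature_maps: List[Map], omega: Map) -> List[Map]:
    """Return F_c = X_c * abs(omega) (element-wise) for c = 1, ..., C-1.

    feature_maps holds all C maps of the last convolutional layer. The
    C-th (last) map is set aside, since it is the input of the attention
    module; omega is assumed to be that module's output for this map.
    """
    remaining = feature_maps[:-1]  # the first C-1 maps
    return [
        [
            [x * abs(w) for x, w in zip(x_row, w_row)]
            for x_row, w_row in zip(x_c, omega)
        ]
        for x_c in remaining
    ]
```

Calling `refine_feature_maps(maps, omega)` with C maps returns C − 1 refined maps of the same spatial size; positions where `omega` is close to zero are suppressed regardless of the sign of `omega`.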

B. TRAINING CONVOLUTIONAL NEURAL NETWORKS WITH THE PROPOSED ENHANCED VISUAL ATTENTION MODULE
The motivation for enhancing the last learned feature maps of a CNN in this article comes mainly from the fact that the final convolutional layer (immediately before the final fully connected layer) of a CNN usually captures higher-level semantic features for the final decision output of the CNN. Moreover, based on the visualization of a CNN for classification [27], the output can be semantically visualized by the weighted combination of the feature maps learned by the last convolutional layer, as illustrated in Fig. 3. Therefore, it is reasonable to refine the representational power of the feature maps learned by the last convolutional layer of a CNN to capture richer semantic information and obtain a better prediction result.
[Fig. 2 caption: The term $X_c$, $c = 1, 2, \ldots, C - 1$, denotes the original learned feature map, and $\omega$ denotes the weighting coefficient map (the refined version of $X_C$, i.e., the C-th channel) learned by the proposed enhanced visual attention module, which is used to refine each $X_c$, $c = 1, 2, \ldots, C - 1$, where $F_c = X_c \odot \mathrm{abs}(\omega)$ and the abs() operation is omitted in this figure.]
To realize this idea, this article proposes to embed an additional layer, called the enhanced visual attention module, into any existing CNN. This module immediately follows the last convolutional layer of the CNN, called the host CNN, and enhances the feature maps generated by that layer. To train the host CNN with the proposed enhanced visual attention module embedded, as shown in Fig. 4, we first split the C learned feature maps from the last convolutional layer of the host CNN into the C-th feature map and the remaining (C − 1) feature maps. Our main goal is to refine the C-th feature map so that it enriches the significant information for image classification and suppresses the insignificant information. Therefore, we calculate the feature loss between the C-th feature map and the corresponding ground truth (described later). On the other hand, the remaining (C − 1) feature maps are also connected to the original fully connected layer of the host CNN, called the 1st fully connected layer. Moreover, to guide the refined feature maps toward the correct prediction output, all of the refined feature maps are also connected to another fully connected layer, called the 2nd fully connected layer, which is exactly the same as the 1st one.
To guide the selected feature map from the C-th channel toward the corresponding ground truth map, the Huber loss [28], [29] is used as the loss function. The Huber loss function has been shown to be more robust to outliers than the commonly used MSE (mean squared error) function. The Huber loss function for each pair of the selected feature map $X_C$ and its corresponding ground truth $Y_C$ is expressed as:

$$\mathcal{H}(X_C, Y_C) = \frac{1}{H \times W} \sum_{a=1}^{H} \sum_{b=1}^{W} h_\delta\left(X_{C,a,b} - Y_{C,a,b}\right), \tag{2}$$

$$h_\delta(r) = \begin{cases} \frac{1}{2} r^2, & |r| \le \delta, \\ \delta \left(|r| - \frac{1}{2}\delta\right), & \text{otherwise}, \end{cases} \tag{3}$$

where $X_{C,a,b}$ and $Y_{C,a,b}$ denote the $(a, b)$-th element of $X_C$ and $Y_C$, respectively. $H$ and $W$ are the height and the width of each feature map, respectively. The parameter $\delta$ is a threshold, empirically set to 0.7. During the training process, all of the element values of the feature maps are first normalized using the min-max normalization method [30], [31].

On the other hand, the loss functions used to guide the image classification outputs of the two fully connected layers toward the corresponding ground truths are the commonly used cross entropy functions. The cross entropy loss functions for the 1st and 2nd fully connected layers are, respectively, expressed as:

$$CE_1 = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} Y_{1,i,j} \log P_{1,i,j}, \tag{4}$$

$$CE_2 = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} Y_{2,i,j} \log P_{2,i,j}, \tag{5}$$

where $i$, $j$, $N$, and $C$ denote the serial number of the $i$-th training image, the serial number of the $j$-th class, the total number of training images, and the total number of classes for image classification, respectively (in Eqs. (4) and (5), $C$ is the number of classes rather than the number of channels). The terms $Y_{1,i,j}$ and $Y_{2,i,j}$ are two binary indicators (ground truths) for the 1st and 2nd fully connected layers, respectively. If the class label $j$ is the correct prediction for the observation $i$, the indicator is 1. Otherwise, the indicator is 0. $P_{1,i,j}$ and $P_{2,i,j}$ are the predicted probabilities, generated by the 1st and 2nd fully connected layers, respectively, that the observation $i$ belongs to the class $j$. The two fully connected layers used in the proposed training process have exactly the same structure and are trained with the same data.

Based on the Huber loss defined in Eq. (2) and the cross entropy losses defined in Eqs. (4) and (5), the total loss function for training the host CNN with the proposed enhanced visual attention module embedded is expressed as:

$$L = \lambda_1 CE_1 + \lambda_2 CE_2 + \lambda_3 \mathcal{H}, \tag{6}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients that control the weight of each respective loss. Our guideline for empirically tuning the weighting coefficients is as follows. The term $CE_1$ guides the original feature maps before refinement toward the final results, and therefore its weighting coefficient $\lambda_1$ is set to be smaller. The term $CE_2$ guides the refined feature maps toward the final results, and therefore its weighting coefficient $\lambda_2$ is set to be larger. The term $\mathcal{H}$ guides the selected feature map toward the corresponding ground truth map for further feature refinement, and therefore its weighting coefficient $\lambda_3$ is also set to be larger. Based on this guideline, the three weighting coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ are empirically set to 0.2, 0.4, and 0.4, respectively, where $\lambda_1 + \lambda_2 + \lambda_3 = 1.0$. With the proposed loss function defined in Eq. (6), the training process guides both the original learned feature maps and the refined feature maps toward the correct prediction output, while guiding the selected feature map toward its corresponding ground truth saliency map for refining the other feature maps. Therefore, the learned deep model is expected to generalize well and to be neither underfit nor overfit.
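As a rough illustration of how the three loss terms described above fit together, the following plain-Python sketch computes a min-max normalized map, a Huber loss with the threshold 0.7, a cross entropy loss, and their weighted sum with the weights 0.2, 0.4, and 0.4. Several details are assumptions made only for this sketch: the Huber and cross entropy terms are averaged (over pixels and over images, respectively), class labels are given as integer indices rather than one-hot indicators, a small `eps` is added inside the logarithm for numerical safety, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Map = List[List[float]]


def min_max_normalize(m: Map) -> Map:
    """Rescale all elements of a map to [0, 1]; a constant map becomes zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def huber_loss(x: Map, y: Map, delta: float = 0.7) -> float:
    """Mean Huber loss between a feature map x and its ground truth y."""
    total, count = 0.0, 0
    for x_row, y_row in zip(x, y):
        for xv, yv in zip(x_row, y_row):
            r = abs(xv - yv)
            # Quadratic for small residuals, linear for large ones.
            total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
            count += 1
    return total / count


def cross_entropy(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross entropy; probs[i][j] is the predicted probability of class j."""
    eps = 1e-12  # assumption: guards against log(0)
    return -sum(math.log(p[t] + eps) for p, t in zip(probs, labels)) / len(labels)


def total_loss(ce1: float, ce2: float, huber: float,
               lambdas: Sequence[float] = (0.2, 0.4, 0.4)) -> float:
    """Weighted sum lambda1 * CE1 + lambda2 * CE2 + lambda3 * Huber."""
    l1, l2, l3 = lambdas
    return l1 * ce1 + l2 * ce2 + l3 * huber
```

In this sketch, `ce1` would come from the 1st fully connected layer (original maps), `ce2` from the 2nd one (refined maps), and `huber` from comparing the normalized C-th map with its normalized saliency ground truth.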

C. TESTING CONVOLUTIONAL NEURAL NETWORKS WITH THE PROPOSED ENHANCED VISUAL ATTENTION MODULE
In the testing process of the proposed method, unlike the network structure used in the training stage, only one fully connected layer (the one used in the host CNN) is used. As illustrated in Fig. 5, in the testing stage, each input image is fed into the host CNN and passes through the deep network. After the feature maps are generated by the last convolutional layer of the CNN, the proposed module splits the C maps into the C-th map and the remaining (C − 1) maps. The selected C-th channel is re-mapped to its refined version by our module, and this refined version is used as a weighting coefficient map. The weighting coefficient map is then used to enhance each of the remaining (C − 1) maps by element-wise multiplication via Eq. (1). The enhanced feature maps are connected to the fully connected layer, which generates the final prediction output.
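The test-time path described above (split, refine, single fully connected layer, prediction) can be sketched as follows. This is a toy illustration rather than the authors' implementation: `attention_module` is a hypothetical callable standing in for the trained module, and the fully connected layer is modeled as a plain weight matrix `fc_weights` with a bias vector `fc_bias`, both assumed to be given.

```python
from typing import Callable, List, Sequence

Map = List[List[float]]


def classify(feature_maps: List[Map],
             attention_module: Callable[[Map], Map],
             fc_weights: Sequence[Sequence[float]],
             fc_bias: Sequence[float]) -> int:
    """Return the predicted class index for one image at test time."""
    *remaining, last = feature_maps      # split off the C-th map
    omega = attention_module(last)       # weighting coefficient map
    # Refine the first C-1 maps element-wise and flatten them.
    features = [
        x * abs(w)
        for x_c in remaining
        for x_row, w_row in zip(x_c, omega)
        for x, w in zip(x_row, w_row)
    ]
    # Single fully connected layer: one score per class.
    scores = [
        sum(wt * f for wt, f in zip(row, features)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Only one set of fully connected weights appears here, which mirrors the statement that the second training-time branch is not needed at test time.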

III. EXPERIMENTAL RESULTS

A. NETWORK TRAINING AND PARAMETER SETTINGS
To evaluate the performance of the proposed enhanced visual attention module, we selected four classic convolutional neural networks as our host CNNs and embedded the proposed module into them. The four host CNNs are VGG16 [10], ResNet-50 [12], MobileNet V2 [13], and ShuffleNet V2 [14]. To train each host CNN with the proposed module embedded, we used three image datasets for image classification: our collected Underwater Fish dataset of 10 classes (Fig. 6), some images of which were collected from [32]; the public Animals-10 dataset of 10 classes (Fig. 7) [33]; and the public Stanford Cars dataset of 196 classes (Fig. 8) [34]. In all experiments presented in this article, the images are in the RGB color space, with the number of input channels set to 3. Moreover, because it is not easy to obtain ground truths for the feature maps learned by the last convolutional layer of a CNN, we applied PoolNet [35] and BASNet [36] to generate the ground truths of the feature maps for our training images; examples are shown in Fig. 9. Both deep networks [35], [36] are mainly designed for salient object detection and generate the corresponding saliency map. The numbers of training images, testing images, and ground truth saliency maps for the three datasets are summarized in Table 1.
It should be noted that using ground truth saliency maps to guide the last learned feature map of a CNN in the training process introduces richer information than using only the classification labels for network learning. However, the existing saliency detection models applied here [35], [36] may generate wrong saliency maps, and therefore we applied the Huber loss function to reduce the effects of possible outliers. In addition, our total loss function includes the other two terms, which are based on the classification labels and guide the two fully connected layers.
In addition, to train each host CNN with each of the evaluated attention modules embedded, we used the RMSprop (root mean square propagation) optimizer [37] with the momentum set to 0.9, the learning rate decay set to 0.98 per epoch, and the input image size set to 224 × 224. The evaluated attention modules are the four compared state-of-the-art modules [19], [20], [22], [23], described in Sec. III.B, and the proposed module. The other parameter settings are summarized in Table 2. In Table 2, for each host CNN trained on a dataset, the same parameters are used regardless of which attention module is embedded into this host CNN.
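To make the stated decay of 0.98 per epoch concrete, the sketch below computes the learning rate at a given epoch. It assumes the decay is multiplicative (the rate is multiplied by 0.98 after every epoch), which the text does not state explicitly; the function name is hypothetical, and the initial learning rate must be supplied by the caller because the values in Table 2 are not reproduced here.

```python
def learning_rate_at_epoch(initial_lr: float, epoch: int, decay: float = 0.98) -> float:
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * decay ** epoch


if __name__ == "__main__":
    # 0.01 is a placeholder initial rate chosen only for illustration.
    for e in (0, 10, 50):
        print(e, learning_rate_at_epoch(0.01, e))
```

Under this assumption the rate shrinks slowly, so the optimizer still takes reasonably large steps in the early epochs.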
Moreover, the selection of the datasets used for model training and testing in our experiments mainly follows three principles. First, the proposed framework focuses on classifying images that contain dominant objects, and therefore prefers datasets consisting of images with clear objects and suitable labels. Second, the selected datasets should be representative and popular in the image processing and computer vision community. Third, the selected datasets may be useful for our recently executed project on applications of unmanned underwater vehicles (UUVs). The selection of our collected Underwater Fish dataset mainly follows the first and the third principles. In addition, one of the data sources [32] of our Underwater Fish dataset is also popular in recent related research works, e.g., [38], [39]. On the other hand, the selections of both the public Animals-10 dataset [33], also used in recent works, e.g., [40], [41], and the public Stanford Cars dataset [34], also used in recent works, e.g., [42], [43], are mainly based on the first two principles.

B. QUANTITATIVE RESULTS
To evaluate the image classification performance of each selected host CNN with the proposed enhanced visual attention module embedded, we report the top-1 and top-5 accuracies obtained on the respective dataset. Moreover, we also compared the SE (squeeze-and-excitation) module [19], CBAM (convolutional block attention module) [20], RAN (residual attention network) module [22], and ECA (efficient channel attention) module [23] with the proposed module by embedding the respective attention module into the four selected host CNNs. To obtain a significant performance improvement over each original host CNN, the compared attention modules usually need to be embedded into the host CNN multiple times, for example, after each convolutional layer. In contrast, the proposed module needs to be embedded only once, after the last convolutional layer of each host CNN. Tables 3-5, respectively, show the top-1 and top-5 accuracies (suggested by [44]) obtained by the four host CNNs, VGG16 [10], ResNet-50 [12], MobileNet V2 [13], and ShuffleNet V2 [14], with and without embedding the SE [19], CBAM [20], RAN [22], ECA [23], and the proposed modules, on our Underwater Fish, the Animals-10 [33], and the Stanford Cars [34] datasets. As shown in Tables 3-5, embedding the proposed module into the host CNNs significantly improves the top-1 and top-5 accuracies, compared with those obtained by the original host CNNs and by the host CNNs with the compared attention modules embedded. That is, the proposed enhanced visual attention module can be widely embedded into any existing CNN architecture, enhance the features learned by the last convolutional layer, and generalize to many datasets for image classification.
TABLE 5. Quantitative results in terms of the top-1 and top-5 accuracies, the number of parameters, denoted by params (in M or Mega), the FLOPs (in G or Giga), and the average run time per image (in milliseconds) obtained by each host CNN with/without embedding the compared attention modules and the proposed module on the Stanford Cars dataset [34]. For params and FLOPs, only the increment over the corresponding value of the corresponding host CNN is shown.
On the other hand, to evaluate the image classification accuracies obtained by the proposed method for different input image sizes, we report the related results in Table 6. Table 6 shows the top-1 and top-5 accuracies obtained by embedding the proposed module into MobileNetV2 for image sizes of 224 × 224, 112 × 112, and 56 × 56, respectively. As shown in Table 6, a larger input image size leads to better classification accuracy. However, even if the input image size is relatively small, the proposed method still achieves acceptable results.
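For readers unfamiliar with the metrics used in this subsection, the following sketch shows one common way to compute top-k accuracy from per-class scores: a prediction counts as correct when the true label is among the k highest-scoring classes. The function name and the nested-list data layout are assumptions made for illustration, and the toy scores in the usage part are not taken from the reported experiments.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```

Top-1 accuracy is the special case k = 1, and top-5 is most informative for datasets with many classes, such as the 196-class Stanford Cars dataset.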

C. NETWORK COMPLEXITY ANALYSIS
The proposed method was implemented in the Python programming language with PyTorch [45] on a personal computer equipped with an Intel® Core™ i7-4790 CPU (3.6 GHz), 16 GB of memory, and an NVIDIA GeForce RTX 2080 Ti GPU. To analyze the complexities of the evaluated host CNNs with the proposed module embedded, we report the numbers of network parameters, the FLOPs (floating point operations) for network testing, and the average run time per image (in milliseconds). Tables 3-5 show these three quantities for the four evaluated host CNNs with and without embedding the SE [19], CBAM [20], RAN [22], ECA [23], and the proposed modules on the three datasets. Based on Tables 3-5, the additional network complexity induced by embedding the proposed module is almost negligible. The main reason is that the proposed module needs to be embedded only once into each host CNN, where only one additional element-wise multiplication is required for enhancing each feature map. It can also be observed from Tables 3-5 that, for all the host CNNs, the run time for testing an image with the proposed module embedded is lower than that obtained by embedding the compared state-of-the-art deep attention modules [19], [20], [22], [23]. Therefore, the proposed enhanced visual attention module can be easily embedded into any CNN architecture with negligible extra burden.
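A back-of-the-envelope count helps show why the element-wise refinement adds so little computation. The sketch below counts only the multiplications needed to scale the first C − 1 maps by the weighting coefficient map; it does not include the cost of the attention module itself, whose structure is not specified in this section. The function name is hypothetical, and the example shape of 2048 × 7 × 7 is the usual last-stage output of ResNet-50 for a 224 × 224 input, used here as an assumption rather than a figure from the tables.

```python
def extra_multiplications(channels: int, height: int, width: int) -> int:
    """Multiplications needed to scale C-1 maps of size H x W element-wise."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return (channels - 1) * height * width


if __name__ == "__main__":
    # Assumed ResNet-50 final feature shape for a 224 x 224 input.
    print(extra_multiplications(2048, 7, 7))
```

The result is on the order of 10^5 multiplications, which is tiny next to the billions of operations a typical host CNN performs per image.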

IV. CONCLUSION
In this article, we have proposed an enhanced visual attention module that can be embedded into any existing CNN for image classification. By enhancing the features learned by the last convolutional layer of a CNN, so that they capture richer semantic information for image classification while suppressing insignificant information, the CNN with the proposed module embedded achieves a significant improvement in classification performance with negligible extra overhead. In future work, we expect to extend our module to enhance CNNs for other purposes, such as image regression.