Loss-Based Attention for Interpreting Image-Level Prediction of Convolutional Neural Networks

Although deep neural networks have achieved great success on numerous large-scale tasks, poor interpretability remains a notorious obstacle to practical applications. In this paper, we propose a novel and general attention mechanism, loss-based attention, upon which we modify deep neural networks to mine significant image patches that explain which parts determine the image-level decision. This is inspired by the fact that some patches contain significant objects or their parts for the image-level decision. Unlike previous attention mechanisms that adopt different layers and parameters to learn weights and image prediction, the proposed loss-based attention mechanism mines significant patches by utilizing the same parameters to learn patch weights and logits (class vectors) and image prediction simultaneously, so as to connect the attention mechanism with the loss function for boosting the patch precision and recall. Additionally, unlike popular networks that utilize max-pooling or stride operations in convolutional layers without considering the spatial relationship of features, the modified deep architectures first remove these operations to preserve the spatial relationship of image patches and greatly reduce their dependencies, and then add two convolutional or capsule layers to extract patch features. With the learned patch weights, the image-level decision of the modified deep architectures is a weighted sum over patches. Extensive experiments on large-scale benchmark databases demonstrate that the proposed architectures obtain better or competitive performance compared with state-of-the-art baseline networks, with better interpretability.


I. INTRODUCTION
Over the past few years, convolutional neural networks (CNNs) have exhibited a powerful capability for discriminative feature extraction and achieved tremendous success on many computer vision and pattern recognition tasks [1]-[5]. However, CNNs still confront several limitations. One notorious drawback is poor interpretability, i.e. it is difficult to understand how they reach their decisions and which objects or parts determine the image-level prediction [6], [7].
To enhance the interpretability of CNNs, most existing studies focus on understanding the representations of pre-trained CNNs or learning CNNs with interpretable/disentangled middle- or high-layer representations [8]. These methods usually collect evidence from feature maps or filters to discover the significant image regions or object parts for an image-level decision, instead of directly and explicitly explaining the significant parts during training. Additionally, they are often based on current popular CNNs, most of which do not maintain the spatial relationship of features in one image because of pooling. This makes the effect of any image part on a hidden activation highly dependent on other parts, thereby increasing the difficulty of interpretation, e.g. of determining which parts decide the image-level prediction. To better preserve the spatial relationship of features, capsule networks [9], [10], which utilize vector-output capsules to replace the scalar-output feature detectors of CNNs, employ dynamic routing as a substitute for the popular max-pooling operator, because max-pooling only extracts the most salient information in a local pool and potentially loses some useful information. Nevertheless, dynamic routing is an extremely expensive procedure that incurs very high computation and memory costs; in particular, multiple routing layers require much training and inference time [11]. Additionally, dynamic routing cannot explicitly take into account the significance of patches in an image, because it directly calculates the class probability of each capsule instead of each patch. However, discovering significant patches in one image is beneficial to understanding the image-level decision and even to improving image prediction accuracy, because some patches might contain the significant objects or their parts.
Attention mechanisms [12] can be utilized to discover the significant patches, because they are capable of assigning large weights to significant patches while giving small weights to trivial patches. However, current attention mechanisms [13] are mostly applied to popular CNNs, such as VGG [14], GoogleNet [15] and ResNet [5], which often do not preserve the spatial relationship of patches in an image. More importantly, they usually learn patch weights and image prediction with different layers and parameters, so that the image classification accuracy depends heavily on the effectiveness of the learned patch weights. Unfortunately, attention mechanisms easily assign large weights to trivial patches, thereby potentially decreasing model performance.
To better explain the image-level decision of deep neural networks (DNNs), in this paper we propose a general attention mechanism to mine significant patches in an image for decision-making, which considers the patches' spatial relationship yet uses no additional annotations. The proposed attention mechanism can be applied to different deep architectures, including convolutional and capsule networks, so that their image-level decision is a weighted sum of patches. The three major contributions of this paper are as follows:

• We propose a novel loss-based attention mechanism, namely Loss-Attention, which uses the same parameters to learn patch weights and logits (class vectors) and image prediction simultaneously, thereby connecting the attention mechanism with the loss function. Specifically, the proposed attention mechanism mines significant patches, and the new loss function further boosts their precision and recall.
• Based upon Loss-Attention, we propose two deep architectures by modifying current popular CNNs so that they preserve the spatial relationship of patches in an image for better interpretation, i.e. the image-level decision is a weighted sum of patches. One architecture exploits convolutional layers and the other adopts capsule layers. For clarity, we present the idea of the proposed convolutional architecture in Fig. 1. The proposed capsule architecture is very similar to Fig. 1 and can be found in the released code.
• Extensive experiments on multiple large-scale benchmark databases demonstrate that the proposed deep architectures obtain higher or competitive classification accuracy compared with current popular convolutional or capsule networks, with better interpretability. It is worth noting that our proposed capsule architecture obtains competitive or even better performance than the popular convolutional networks on large-scale complex databases.

II. RELATED WORK
In this section, we will briefly review some related work including visual interpretability of CNNs, part-based models, capsule networks, and attention-based deep multiple instance learning (MIL).

A. Visual Interpretability of CNNs
Numerous methods have been proposed to explore the visual interpretability of CNNs, including network visualization, model diagnosis, the disentanglement of CNN representations, and explainable models. References [16], [17] are popular network visualization methods, which exhibit the image appearance that maximizes the score of a given unit. Another popular network visualization technique is the up-convolutional net [18], which inverts CNN feature maps into images. Model diagnosis methods [7], [19]-[21] analyze CNN features to visualize the image regions that contribute the most to the decision-making of CNNs. Disentangling CNN representations means decomposing complex feature maps in conv-layers into human-interpretable representations. References [6], [22] select units from feature maps to describe "scenes" and [23] discovers objects from feature maps of unlabeled images. Reference [24] mines object-part concepts from a pre-trained CNN by extracting certain neural units from feature maps of a filter, using some object part annotations. Most of the aforementioned methods focus on understanding a pre-trained CNN, whereas explainable models aim to learn disentangled representations of neural networks with clear semantic meanings. Reference [25] is a popular interpretable method, which automatically assigns each filter in a high conv-layer to an object part during training. Additionally, visual interpretability methods usually generate class-discriminative representations, fine-grained representations or both. Unlike previous fine-grained approaches [17], [26] that learn pixel-space representations, the proposed method is similar to the class-discriminative methods [21], [22], because it learns patch weights by using the class information of the corresponding image.
Previous methods collect evidence from filters or feature maps to implicitly explain the decision-making of current CNNs, which do not consider the spatial relationship of features. By contrast, the proposed method considers the patches' spatial relations to directly and explicitly utilize a weighted sum of patches for an image-level decision, and mines the significant patches, which contain the objects or parts that determine the image-level prediction.

B. Part-Based Models
Object parts play a significant role in object recognition, because they are able to capture localized discriminative features of an object. Numerous detection methods are based on object parts. One popular method is the deformable part model (DPM) [27], which learns part constellation models with the latent discriminative support vector machine (SVM). However, these methods require ground-truth bounding box annotations. Recently, some CNN-based methods learn or select object parts without any additional part or bounding box annotations. Reference [23] learns part models by finding constellations of neural activation patterns. Reference [28] utilizes elastic non-negative matrix factorization to analyze the response of a pre-trained CNN and extract salient image regions. Reference [29] proposes a multi-attention CNN in order to reinforce part generation and feature learning. These methods are usually based on pre-trained CNNs, and most of them cannot directly and explicitly measure the significance of object parts for the image-level decision during training. By contrast, the proposed method modifies the architectures of CNNs to preserve the spatial relationship of patches, so that the image-level decision is a weighted sum of patches, and it can directly mine significant objects or their parts during training.

C. Capsule Networks
A capsule is constituted by a group of neurons [9] and thus outputs an activity vector instead of a scalar to represent different properties of a specific entity, such as an object or its part. Because CNNs cannot preserve the spatial relationship of features when using pooling layers, e.g. max-pooling, [10] proposes dynamic routing using "routing-by-agreement" between capsules as a substitute for max-pooling, which yields better performance and more benefits for image interpretation. Reference [30] adopts EM routing for matrix capsules, representing each entity by a pose matrix. Reference [31] formulates dynamic routing as an optimization problem. DeepCaps [11] proposes 3D-convolution-based routing to replace the original dynamic routing and significantly decrease computation costs. Although capsule networks have achieved promising performance on several popular simple databases and shown strong benefits for image interpretation, their performance on complex databases is still not on a par with that of CNNs. Additionally, the routing strategy can be viewed as an attention mechanism [30], but it differs from the proposed Loss-Attention in two respects: (i) The vector outputs of capsules have distinct lengths in Loss-Attention, while the routing strategy usually squashes the vector outputs of capsules to equal length. This means that Loss-Attention and the routing strategy calculate the significance of capsules in different ways. (ii) Loss-Attention aims to discover the significant patches in an image so that the image-level decision is a weighted sum of patches, but the routing strategy fails to explicitly explore the significance of patches for the image-level decision.

D. Attention-Based Deep MIL
MIL has been widely applied to real-world applications [32], [33], where only a general statement of the category is given for multiple instances. For example, one bag is composed of tens or hundreds of instances and is usually described by a single bag label, with no label information associated with the instances. Although attention mechanisms [12], [34] with DNNs have been successfully used in many tasks, such as image captioning and classification, few efforts focus on attention mechanisms for deep MIL. One popular method is attention-based deep MIL (ADMIL) [13], which proposes two attention mechanisms that use a two-layered neural network to learn instance weights. However, these two attention mechanisms might attain inferior performance to mean-pooling [35] on large-scale image classification in many cases, because they can easily assign large weights to trivial patches. To reduce the effect of trivial patches, a loss-based attention mechanism [36] has been proposed to simultaneously learn instance weights and generate bag-level prediction. However, its attention mechanism is based on the softmax+cross-entropy function, so it may be ineffective at removing trivial patches and is only suitable for single-label applications. By contrast, the proposed Loss-Attention is based on the $\ell_{2,1}$-norm to encourage row-sparsity. It can be applied to both single-label and multi-label scenarios, and it simultaneously learns patch weights and logits (class vectors), produces image-level prediction, and removes the trivial patches.

A. Preliminaries
We first briefly review two popular loss functions, softmax+cross-entropy and sigmoid+binary-cross-entropy, which will be utilized in the proposed objective for tackling single-label and multi-label tasks, respectively, and the $\ell_{2,1}$-norm used in our attention mechanism. For brevity, we introduce the two loss functions using only one training sample.

1) Softmax+Cross-Entropy:
Consider a single-label training sample $X \in \mathbb{R}^{C_0 \times H \times W}$ with its corresponding one-hot label vector $y = \{y_k\}_{k=1}^{K} \in \{0, 1\}^{K}$, and an $L$-layer deep neural network $f_\theta(\cdot)$ with the parameters $\{\theta_l\}_{l=1}^{L}$, where $C_0$, $H$ and $W$ denote image channels, height and width, respectively, $K$ is the number of classes, and $\theta_l$ represents the parameters of the $l$-th layer in the neural network. Let $z = \{z_k\}_{k=1}^{K} = f_\theta(X) \in \mathbb{R}^{K}$ be the output for $X$ in the $L$-th layer of the network, and $s(z) \in \mathbb{R}^{K}$ be the estimated class probability of $X$, where $s(\cdot)$ denotes the softmax function and $\sum_{k=1}^{K} s(z_k) = 1$. To measure the dissimilarity between the true class probability $y$ and the estimated class probability $s(z)$, the cross-entropy loss is [37]:

$$L_{ce} = -\sum_{k=1}^{K} y_k \log s(z_k). \tag{1}$$

Because $X$ is a single-label sample and $y \in \{0, 1\}^{K}$, we have $\sum_{k=1}^{K} y_k = 1$. Suppose that $X$ belongs to the $t$-th class, i.e. $y_t = 1$ and $\sum_{k=1, k \neq t}^{K} y_k = 0$; then Eq. (1) equals:

$$L_{ce} = -\log s(z_t) = -\log \frac{e^{z_t}}{\sum_{k=1}^{K} e^{z_k}}. \tag{2}$$

2) Sigmoid+Binary-Cross-Entropy:
When $X$ is a multi-label training sample, because the softmax function is usually suitable for single-label classification tasks and exhibits inferior performance on multi-label applications, $\sigma(z) \in [0, 1]^{K}$ is often employed to handle multi-label tasks, where $\sigma(\cdot)$ denotes the sigmoid function. Binary-cross-entropy is defined as:

$$L_{bce} = -\left[ y^{T} \log \sigma(z) + (1_K - y)^{T} \log\left(1_K - \sigma(z)\right) \right], \tag{3}$$

where $1_K \in \mathbb{R}^{K}$ is a vector with all entries being ones.
3) $\ell_{2,1}$-Norm:
For a matrix $Z = [z_1, z_2, \cdots, z_N]^{T} \in \mathbb{R}^{N \times K}$, the $\ell_{2,1}$-norm of $Z$ is defined as:

$$\|Z\|_{2,1} = \sum_{n=1}^{N} \|z_n\|_2 = \sum_{n=1}^{N} \sqrt{\sum_{k=1}^{K} z_{nk}^2}. \tag{4}$$

Eq. (4) can encourage the row-sparsity of $Z$ [38], [39], because it is the minimum convex hull of the $\ell_{2,0}$-norm of $Z$, i.e. $\|Z\|_{2,0}$, which counts the number of non-zero rows of $Z$.
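The three quantities reviewed above can be checked numerically with a few lines of plain Python. The sketch below is only an illustration for a single sample; the function names (`softmax`, `cross_entropy`, `binary_cross_entropy`, `l21_norm`) are our own choices and the binary-cross-entropy is summed over classes without any averaging, which is an assumption about the normalization.

```python
import math

def softmax(z):
    """Softmax s(z), shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy between a one-hot label y and softmax(z)."""
    s = softmax(z)
    return -sum(yk * math.log(sk) for yk, sk in zip(y, s) if yk > 0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_cross_entropy(z, y):
    """Binary-cross-entropy summed over the K classes (no averaging assumed)."""
    total = 0.0
    for zk, yk in zip(z, y):
        p = sigmoid(zk)
        total -= yk * math.log(p) + (1 - yk) * math.log(1 - p)
    return total

def l21_norm(Z):
    """Sum of the Euclidean norms of the rows of Z."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in Z)

# A matrix with one all-zero row: only the non-zero row contributes.
example_norm = l21_norm([[3.0, 4.0], [0.0, 0.0]])
```

Because each row enters the norm only through its length, driving the norm down tends to zero out whole rows, which is the row-sparsity effect the text relies on.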

B. Loss-Based Attention
Traditional attention mechanisms [13] learn patch weights and image prediction using different layers and parameters, and thus the image classification accuracy is significantly affected by the effectiveness of the learned patch weights. To address this issue, we learn the patch weights and logits and generate the image prediction simultaneously, in order to connect the attention mechanism and the loss function. Specifically, the proposed attention mechanism is based on the $\ell_{2,1}$-norm [40] and connects with the loss function by sharing the same parameters with a fully connected layer for image classification and calculating patch weights from their logits. For clarity, we show the difference between traditional attention mechanisms and the proposed one in Fig. 2. The proposed loss function employs the learned weights to ensure that the selected patches belong to the same class as their image.

1) Attention Mechanisms:
Because convolutional and capsule neural networks are two different architectures, which have distinct outputs for one training sample $X \in \mathbb{R}^{C_0 \times H \times W}$, in the following we present general attention mechanisms for these two architectures based on their outputs. To avoid abuse of notation, we still utilize $f_\theta(\cdot)$ to represent the $L$-layer convolutional or capsule neural network.

a) Attention for convolutional neural networks:
Suppose that the image $X$ is divided into $M$ patches, $H = \{h_m\}_{m=1}^{M} \in \mathbb{R}^{C \times M}$ is the output of the $(L-1)$-th layer, and $\theta_L \in \mathbb{R}^{C \times K}$ denotes the parameters of the $L$-th layer, where $h_m \in \mathbb{R}^{C}$ represents the feature representation of the $m$-th patch of the image $X$, and $C$ is the number of channels. Let $P = \{p_m\}_{m=1}^{M}$ be the $L$-th-layer output for image patches, where $p_m \in \mathbb{R}^{K}$ is the logit (class vector) of the $m$-th patch, calculated as $p_m = h_m \theta_L$. The proposed attention mechanism is given by Eq. (5), where $\alpha_j$ is the attention weight of the $j$-th patch of $X$, $\xi \in [0, 1]$ is a threshold to remove the trivial patches, and $z \in \mathbb{R}^{K}$ is the $L$-th-layer output for $X$. It is worth noting that Eq. (5a) utilizes the $\ell_{2,1}$-norm, i.e. $\sum_{m=1}^{M} \sqrt{\sum_{k=1}^{K} p_{mk}^2}$, to encourage the row-sparsity of $P \in \mathbb{R}^{M \times K}$, so as to enhance the weights of significant patches and decrease the weights of trivial patches. Additionally, we empirically set the maximum of $\xi$ to 1 during training, because all patch weights might become zero when $\xi > 1$.
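To make the description concrete, the following is a minimal sketch of an attention step of this kind, not a reproduction of Eq. (5). It assumes that each patch weight is the Euclidean norm of its logit vector divided by the $\ell_{2,1}$-norm of $P$, that weights below $\xi / M$ are set to zero, and that the surviving weights are not renormalized; the thresholding rule and the names `patch_logits` and `loss_attention_sketch` are our assumptions.

```python
import math

def patch_logits(H, theta):
    """p_m = h_m * theta for every patch. H is M x C, theta is C x K."""
    C, K = len(theta), len(theta[0])
    return [[sum(h[c] * theta[c][k] for c in range(C)) for k in range(K)]
            for h in H]

def loss_attention_sketch(P, xi=0.5):
    """Return (alpha, z): thresholded patch weights and the weighted sum of logits.

    Assumptions (not taken from the paper): weights below xi / M are zeroed,
    and the remaining weights are used as-is, without renormalization.
    """
    M, K = len(P), len(P[0])
    norms = [math.sqrt(sum(v * v for v in p)) for p in P]
    total = sum(norms)                      # l2,1-norm of P
    if total == 0.0:
        return [0.0] * M, [0.0] * K
    alpha = [n / total for n in norms]      # weights sum to 1 before thresholding
    alpha = [a if a >= xi / M else 0.0 for a in alpha]
    z = [sum(alpha[m] * P[m][k] for m in range(M)) for k in range(K)]
    return alpha, z

# Three patches, two classes: the first patch has by far the longest logit vector.
P_example = [[3.0, 4.0], [0.3, 0.4], [0.0, 1.0]]
alpha_example, z_example = loss_attention_sketch(P_example, xi=0.5)
```

In this toy case the weights before thresholding are roughly 0.77, 0.08 and 0.15, so with a threshold of 0.5 / 3 only the first patch contributes to the image-level logits. The same parameters $\theta_L$ produce both the patch logits and, through their norms, the patch weights, which is the parameter sharing the text describes.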

b) Attention for capsule neural networks: Suppose that $H = \{H_m\}_{m=1}^{M}$ is the output of the $(L-1)$-th layer, where $H_m = \{h_{cm}\}_{c=1}^{C} \in \mathbb{R}^{C \times D}$ represents the feature representation of the $m$-th patch of the image $X$, $C$ is the number of channels and $D$ is the capsule dimension. Let $\theta_L \in \mathbb{R}^{D \times K}$ denote the parameters of the $L$-th layer, and $P$ be the $L$-th-layer output corresponding to $H$, i.e. $P_m = \{p_{cm}\}_{c=1}^{C}$ is the $L$-th-layer output corresponding to $H_m$, where $p_{cm} = h_{cm} \theta_L \in \mathbb{R}^{K}$. The proposed attention mechanism is given by Eq. (6), where $\alpha_{rj}$ denotes the attention weight of the $j$-th patch of $X$ at the $r$-th channel, and $\mathrm{sgn}(\cdot)$ is a function defined as: $\mathrm{sgn}(\alpha_m) = 0$ if $\alpha_m = 0$, and $\mathrm{sgn}(\alpha_m) = 1$ when $\alpha_m > 0$.

2) Loss Function via Attention Weights:
Based on the attention mechanism Eq. (5) or (6), we can obtain the weight of each image patch. However, when directly utilizing the loss in either Eq. (2) or Eq. (3) for model training, it might have two issues: (i) a trivial patch with a large weight, although ξ can remove some trivial patches; (ii) low significant patch recall. For better illustrating these two issues, based on the output of convolutional networks for the sample X, we present two propositions as follows. Their detailed proofs are shown in the Appendix.
Eq. (7) suggests that when $L_{ce} \to 0$, at least one patch of the image $X$ belongs to the $t$-th class.
Specifically, for any patch, if $q_{mk} / q_{mt} \to 0$ ($\forall k \neq t$) and $\alpha_m \gg 0$, then $L_{ce} \to 0$. However, $L_{ce} \to 0$ cannot theoretically guarantee that a patch with a large weight, or more than one patch, is assigned to the $t$-th class, so a large weight may be assigned to a trivial patch and the recall of significant patches may be low. For Eq. (8), when $L_{bce} \to 0$, at least one significant positive image patch and one negative patch will be assigned weights larger than zero. Unfortunately, this still cannot guarantee that more than one positive or negative significant patch is selected, and it is also very likely to assign a large weight to a trivial patch. Similar findings can be obtained from the attention mechanism for capsule networks.
To alleviate the aforementioned two issues, based on Eqs. (2) and (3), we introduce regularization terms using the weights obtained from the proposed attention mechanism, Eq. (5) or (6), and present loss functions to handle single-label and multi-label tasks, respectively. Specifically, given training data $\Psi = \{X_i\}_{i=1}^{N}$, let $B$ denote the index set of selected training samples in each mini-batch, $y_i$ be the one-hot label vector of $X_i$, and $z_i$ represent its $L$-th-layer output in convolutional or capsule neural networks. The proposed loss function for single-label tasks is given by Eq. (9), where $|B|$ denotes the number of selected images in the mini-batch, the regularization term enforces selected patches to share the same class with the image, $\gamma(\tau)$ is an unsupervised weighting function that balances image and patch classification, and $\tau$ is the number of current training epochs.
Based on Eq. (3), the proposed loss function for multi-label tasks is given by Eq. (10).
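The idea of adding a patch-level regularizer to the image-level loss can be illustrated for a single image as follows. This is one plausible instantiation under our own assumptions, not the paper's Eq. (9): the regularizer is taken to be the attention-weighted cross-entropy of the selected patch logits against the image label, and the weighting $\gamma(\tau)$ is passed in as a plain number `gamma` because its form is not specified here.

```python
import math

def log_softmax_at(logits, t):
    """Log of the softmax probability of class t, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[t] - log_sum_exp

def single_label_loss_sketch(z, P, alpha, t, gamma):
    """Image-level cross-entropy plus a weighted patch-level term.

    z:     image-level logits (length K)
    P:     patch logits (M x K)
    alpha: patch weights (length M); patches with zero weight are ignored
    t:     index of the true class
    gamma: value of the balancing weight at the current epoch (assumed given)
    """
    image_term = -log_softmax_at(z, t)
    patch_term = -sum(a * log_softmax_at(p, t)
                      for a, p in zip(alpha, P) if a > 0)
    return image_term + gamma * patch_term

# With gamma = 0 the sketch reduces to the plain cross-entropy of Eq. (2).
plain = single_label_loss_sketch([0.0, 0.0], [[0.0, 0.0]], [1.0], t=0, gamma=0.0)
regularized = single_label_loss_sketch([0.0, 0.0], [[0.0, 0.0]], [1.0], t=0, gamma=1.0)
```

Under this reading, a selected patch whose logits favor a class other than $t$ increases the loss, which pushes selected patches toward the image label.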

IV. NETWORK ARCHITECTURES
Most current CNNs do not preserve the spatial relationship of features in one image. This is because they usually adopt max-pooling or stride operations followed by large convolution kernels (whose size is larger than 1), and thus the effect of any part of the input on a hidden activation depends on other parts. Additionally, the activity of one hidden unit depends on the activity of other hidden units [41]. These two factors significantly increase the difficulty of interpreting CNNs. To maintain the spatial relationship of patches and reduce the complex dependency between input and hidden activations for better interpretation, i.e. so that the image-level decision is a weighted sum of patches, we propose two schemes (one with convolutional layers and the other using capsule layers) by modifying CNNs. In the following, we present the two schemes based on one popular network, VGG-11 [14] (the left architecture of Fig. 3). The modification of other networks, such as ResNet, is similar. Due to limited space, we provide more details in the released code.

A. Convolutional Architecture
We first remove the max-pooling operations and two fully-connected layers in VGG-11, to preserve the spatial relationship of patches within an image and reduce the complex dependency between input and hidden activations. Next, we introduce one convolutional layer with 512 channels, kernel size 9 × 9 and stride 4 to determine the size and number of patches and extract their features, and another convolutional layer with 512 channels, kernel size 1 × 1 and stride 1 for the nonlinear mapping of patch features. The 1 × 1 kernel is to reduce the dependency among patch features. Then we add an attention layer using Eq. (5) to select significant patches based on the attained patch features. For clarity, we present this architecture in the middle part of Fig. 3.
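The number of patches produced by the 9 × 9, stride-4 layer follows from the standard convolution output-size formula. The sketch below assumes that this layer uses no padding and that the preceding backbone convolutions keep the 32 × 32 resolution once max-pooling is removed (as 3 × 3 kernels with padding 1 do); both are our assumptions, although the resulting 6 × 6 grid agrees with the spatial size stated for the capsule architecture.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Backbone conv (3 x 3, stride 1, padding 1) keeps a 32 x 32 map at 32 x 32.
backbone_side = conv_output_size(32, kernel=3, stride=1, padding=1)

# Patch-extraction layer: 9 x 9 kernel, stride 4, no padding assumed.
patch_side = conv_output_size(backbone_side, kernel=9, stride=4)
num_patches = patch_side * patch_side

# The following 1 x 1, stride-1 layer leaves the patch grid unchanged.
mapped_side = conv_output_size(patch_side, kernel=1, stride=1)
```

Under these assumptions a 32 × 32 input yields a 6 × 6 grid, i.e. M = 36 patches, each described by a 512-dimensional feature vector.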

B. Capsule Architecture
We first remove the max-pooling and two fully-connected layers in VGG-11. Then we add two capsule layers: the first capsule layer has 32 channels, kernel size 9 × 9, stride 4, and capsule dimension 16, and the second capsule layer has 64 channels, kernel size 1 × where $H_d \in \mathbb{R}^{32 \times 6 \times 6}$, $b \in \mathbb{R}^{32}$ is used to remove trivial image pixels in each channel, and $1_{1 \times 6 \times 6} \in \mathbb{R}^{1 \times 6 \times 6}$ is a matrix with all entries being ones that expands $b$ to the same size as $H_d$. Based on Eq. (11), $H \in \mathbb{R}^{32 \times 6 \times 6}$ is fed into the second capsule layer. Afterward, we adopt an attention layer using Eq. (6) to assign a weight to each capsule and select significant patches. For better illustration, we present this capsule architecture in the right part of Fig. 3.
The size of input images is 32 × 32 in Fig. 3. Larger input images incur much higher computation and memory costs. In this case, we can utilize stride operations in the convolutional layers of the backbone network only to reduce the image size, and then adopt the proposed two convolutional or capsule layers and the attention layer to preserve the spatial relationship of patches and discover the significant ones. Note that we do not adopt max-pooling to reduce the image size, because it might lose some useful information. Moreover, the proposed schemes can also be applied to other CNNs, such as ResNet [5], for which we first remove the stride operations in convolutional layers and then add the proposed convolutional or capsule layers and the attention layer. In addition to VGG-11, in our experiments we apply the proposed two schemes to the popular network ResNet18.

V. EXPERIMENTAL RESULTS AND ANALYSIS
To evaluate the proposed architectures, we conduct experiments on multiple large-scale benchmark databases for image classification and patch interpretability.

A. Implementation Details
We implement the proposed architectures using the PyTorch framework and mostly adopt VGG11_bn [14] and ResNet18 [5] as our backbone networks. We employ the SGD optimizer to update model parameters and train the model for 200 epochs in total with a batch size of 128. By default, we first train the model for 100 epochs using the learning rate η

B. Experimental Settings
Because the proposed architectures utilize VGG11_bn and ResNet18 as their backbone networks, we compare them with the baseline methods VGG11_bn and ResNet18. Additionally, because the proposed method adds convolutional layers, which might increase model parameters, for a fair comparison we report the classification results of VGG16_bn and ResNet50, which have more parameters than our convolutional architectures. Moreover, to better illustrate the strength of the proposed Loss-Attention, we present the results of Mean-pooling, Attention and Gated-Attention [13], and Dynamic Routing [10] using our modified architectures. Mean-pooling means assigning the same weight to every patch. Note that Attention and Gated-Attention utilize the same training procedure as our method, but Mean-pooling and Dynamic Routing do not. Instead, we adopt a different learning procedure for Mean-pooling and Dynamic Routing as follows: we adopt the Adam optimizer [42], initializing the momentum parameters as $\beta_1 = 0.9$ and $\beta_2 = 0.99$, and also train the model for 200 epochs. The learning rate ramps up to the maximum 0.003 during the first 80 epochs by using the function $e^{-\|1 - T\|_F^2}$. Then the learning rate remains unchanged during the following 40 epochs; afterward, it decreases to 0.0003, and it becomes 0.00003 during the last 40 epochs. The Adam momentum parameter $\beta_2$ becomes 0.999 after the first 80 epochs. We run each experiment 4 times and calculate the average accuracy. Note that for the proposed method, the batch size, optimizer type, learning rate and its schedule are the same as those of the backbone network. However, the performance of Mean-pooling and Dynamic Routing might be greatly affected by the choice of optimizer, e.g. Adam or SGD (see Table I). The major possible reason is that Mean-pooling and Dynamic Routing cannot provide significant patches, so that trivial patches significantly affect the gradient update. Additionally, Adam can be viewed through the lens of clipping, thereby leading to better performance in heavy-tail noise settings [43].
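The schedule used for Mean-pooling and Dynamic Routing can be written down as a small function of the epoch index. The sketch below assumes that $T$ is the fraction of the 80-epoch ramp-up completed (`epoch / 80`), that epochs are counted from 0, and that the 0.0003 phase lasts 40 epochs so that the four phases add up to 200; these details and the function names are our assumptions.

```python
import math

def learning_rate(epoch, max_lr=0.003, ramp_epochs=80, hold_epochs=40, step_epochs=40):
    """Learning rate at a given epoch (0-based) for the Adam-based procedure."""
    if epoch < ramp_epochs:
        t = epoch / ramp_epochs            # assumed definition of T in [0, 1)
        return max_lr * math.exp(-(1.0 - t) ** 2)
    if epoch < ramp_epochs + hold_epochs:
        return max_lr                      # constant phase
    if epoch < ramp_epochs + hold_epochs + step_epochs:
        return max_lr * 0.1                # 0.0003
    return max_lr * 0.01                   # 0.00003 for the last 40 epochs

def adam_beta2(epoch, ramp_epochs=80):
    """beta_2 is 0.99 during the ramp-up and 0.999 afterwards."""
    return 0.99 if epoch < ramp_epochs else 0.999

schedule = [learning_rate(e) for e in range(200)]
```

With this reading, the rate starts at about 0.003 / e at epoch 0 and rises monotonically to 0.003 at epoch 80.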

C. Experiments for Image Classification
We run experiments to evaluate the proposed architectures on image classification using the following popular single-label databases:
• CIFAR-10 [45] consists of 60K color images in 10 classes, each of which contains 6K images. These images are divided into a training set of 50K examples and a testing set of 10K examples. Each image is aligned and cropped to 32 × 32 pixels.
• CIFAR-100 [45] is composed of 60K color images belonging to 100 classes, with 600 images per class. These images are also divided into 50K training and 10K testing images. Each image has a size of 32 × 32.

1) Experimental Results:
On the three databases, Loss-Attention adopts Eq. (9) for classification. Besides the four comparative methods, Mean-pooling, Attention and Gated-Attention, and Dynamic Routing, we also present the results of several popular capsule networks [10], [11], [47], [48] to better evaluate the proposed capsule architecture.
Table II presents the classification accuracy of different deep methods. For convolutional networks, when using VGG11_bn as the backbone network, Mean-pooling, Attention, Gated-Attention and Loss-Attention obtain superior performance over VGG11_bn and VGG16_bn on CIFAR-10 and CIFAR-100, and Loss-Attention achieves better classification accuracy than the other methods on all three databases. Additionally, when using ResNet18 as the backbone network, Loss-Attention also attains better accuracy than the others on CIFAR-10 and CIFAR-100, and achieves competitive performance with the best competitors on SVHN. These results suggest that the proposed architectures, whose image-level decision is a weighted sum of patches, can obtain better or competitive classification performance compared with popular CNNs.
For capsule networks, Loss-Attention obtains superior performance over Dynamic Routing and other deep capsule methods [10], [11], [47], [48] when using VGG11_bn and ResNet18 as backbone networks. Moreover, Loss-Attention with the capsule architecture achieves competitive or even better classification accuracy than Loss-Attention with the convolutional architecture on CIFAR-10 and SVHN. The capsule architecture attains slightly worse accuracy than the convolutional one on CIFAR-100, probably because its capsule dimension is similar to the number of classes. These results suggest that capsule networks with Loss-Attention can obtain superior or similar performance to convolutional ones on complex databases. It is worth noting that when using our proposed architecture with VGG11_bn and ResNet18 as backbone networks, Dynamic Routing attains better performance than the deep capsule methods [10], [11], [47], [48] on CIFAR-10, and it attains only slightly worse accuracy than DeepCaps on SVHN.
The proposed method can also be applied to deeper versions of ResNet or to other architectures. Table III displays the accuracy of Loss-Attention with ResNet50 and GoogleNet [15] as backbone networks on CIFAR-10 and CIFAR-100. It suggests that Loss-Attention outperforms the baselines (ResNet50 and GoogleNet). Additionally, Loss-Attention can achieve better performance on large-scale databases. For example, when using ResNet18 as the backbone, the accuracy of Loss-Attention and ResNet18 is 56.57% and 55.40%, respectively, on ImageNet [2], where each image is resized to 32 × 32. Loss-Attention using ResNet18 as the backbone takes one week to train a model for ImageNet with 4 GPUs and a batch size of 128. When we utilize ResNet50 and GoogleNet as the backbone, the time cost for model training is respectively 4.5 and 6 times more than using ResNet18. Hence, we do not show their results on ImageNet because of limited resources and space.

D. Experiments for Image Patch Interpretability
Because test images in the aforementioned databases do not contain bounding boxes, we run experiments for image patch interpretability on two popular databases with bounding boxes:
• Tiny ImageNet [49] is a single-label database, which has 200 classes, with each category consisting of 500 training, 50 validation and 50 test images. Among them, validation and test images have bounding boxes. We adopt the training images as the training set and the validation images for testing. Each image has a size of 64 × 64.
• Microsoft COCO [50] is a multi-label database, which consists of around 328,000 images belonging to 91 object types. We utilize the 2014 training and validation sets, including 82,081 training and 40,137 validation images. We adopt the training images for training and the validation images for testing. We crop and resize each image to 64 × 64 pixels.
Note that we do not resize each image to 32 × 32, in order to illustrate that the proposed architecture can handle a larger image size (> 32 × 32).

1) Experimental Settings:
Because the size of images in the two databases is 64 × 64, we adopt stride 2 in the fourth convolutional layer of VGG11_bn and in the sixth layer of ResNet18, and remove max-pooling or stride operations in the other layers. A patch is viewed as correct if it shares at least one label with its corresponding image and more than half of its area lies inside the bounding box. Additionally, we show the image localization accuracy of Attention, Gated-Attention and Loss-Attention on Tiny ImageNet by using the estimated bounding box, which is the minimum square containing the selected patches. For Loss-Attention, we select the patches with weights larger than 0, and for Attention and Gated-Attention, we choose the patches with weights larger than ξ/M. The estimated bounding box is considered correct if its intersection over union (IoU) with the ground-truth box is larger than 0.5. We then report the average precision (AP) of image localization. Moreover, we present the image classification accuracy (Accuracy for Tiny ImageNet and mAP for COCO, where mAP is defined in [44]) of the aforementioned methods and the baselines VGG11_bn and ResNet18. Note that we do not report the performance of Dynamic Routing on Tiny ImageNet due to its high memory cost for a large number of classes. We also do not show the image localization AP on COCO, because many images contain multiple bounding boxes belonging to one category of objects and the attention methods cannot directly handle this case. For Loss-Attention, we utilize the aforementioned parameter settings for image classification, and we adopt Eq. (9) for Tiny ImageNet and Eq. (10) for COCO to train models.
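The evaluation criteria above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the experiments: the (x0, y0, x1, y1) box layout, the function names, the use of predicted patch labels, and the choice to centre the enclosing square on the extent of the selected patches (without clipping it to the image) are all assumptions made here.

```python
from typing import Iterable, List, Tuple

# Boxes are (x0, y0, x1, y1) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = overlap(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def patch_is_correct(patch: Box, gt_box: Box,
                     patch_labels: Iterable[int],
                     image_labels: Iterable[int]) -> bool:
    """Correct if the patch shares a label with the image and more than
    half of the patch area lies inside the ground-truth bounding box."""
    shares_label = bool(set(patch_labels) & set(image_labels))
    mostly_inside = overlap(patch, gt_box) > 0.5 * area(patch)
    return shares_label and mostly_inside


def enclosing_square(patches: List[Box]) -> Box:
    """Smallest axis-aligned square containing all selected patches.
    Assumption: the square is centred on the patch extent, unclipped."""
    x0 = min(p[0] for p in patches)
    y0 = min(p[1] for p in patches)
    x1 = max(p[2] for p in patches)
    y1 = max(p[3] for p in patches)
    half = max(x1 - x0, y1 - y0) / 2.0
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def localization_is_correct(patches: List[Box], gt_box: Box,
                            threshold: float = 0.5) -> bool:
    """The estimated box counts as correct when its IoU exceeds 0.5."""
    return iou(enclosing_square(patches), gt_box) > threshold
```

For instance, two selected 16 × 16 patches lying side by side at the top of a 32 × 32 ground-truth box give an enclosing square whose IoU with that box is 0.6, so the localization would be counted as correct under the 0.5 threshold.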

2) Experimental Results:
Tables IV-V present the performance of different deep methods on Tiny ImageNet and COCO. Attention and Gated-Attention obtain better image classification accuracy, patch precision and recall than Mean-Pooling on Tiny ImageNet, while they achieve significantly worse performance than Mean-Pooling on COCO. This might be because they assign large weights to trivial patches and obtain low patch recall, thereby decreasing the model performance. Loss-Attention obtains better image classification and localization accuracy, and a better patch F-score, than Mean-Pooling, Attention and Gated-Attention on the two databases. The proposed attention mechanism can remove trivial patches, and the introduced regularization term in the loss function can further boost the patch precision and recall, thereby decreasing the effect of trivial patches on model performance. Loss-Attention with the modified convolutional architecture also outperforms the baseline methods on image classification. For example, when using VGG11_bn as the backbone network of convolutional architectures, Loss-Attention attains 0.83% higher image classification accuracy, 1.78% higher AP and 13.27% higher F-score than the best competitors on Tiny ImageNet. It achieves 7.70% higher mAP and 1.70% higher F-score than the best competitors on COCO. Additionally, Loss-Attention achieves better image classification and patch precision than Dynamic Routing on COCO. Moreover, Loss-Attention with a convolutional architecture achieves better image classification than that with a capsule architecture on Tiny ImageNet and COCO. This might be because the capsule architecture adopts a small capsule dimension, which is smaller than or close to the number of classes on the two databases.
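To make the role of the patch F-score concrete, the following minimal sketch computes patch precision, recall and F-score from sets of patch indices. It assumes the standard definitions (precision over selected patches, recall over relevant patches, F-score as their harmonic mean); the function name and the index-set representation are illustrative and not taken from the experiments.

```python
from typing import Set, Tuple


def patch_scores(selected: Set[int], relevant: Set[int]) -> Tuple[float, float, float]:
    """Patch precision, recall and F-score (harmonic mean of the two).

    selected: indices of patches chosen by the attention mechanism.
    relevant: indices of patches that satisfy the correctness criterion.
    """
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score


# A method that selects very few patches can reach perfect precision
# while its low recall keeps the F-score small.
relevant = set(range(8))
print(patch_scores({0, 1}, relevant))              # precision 1.0, recall 0.25
print(patch_scores({0, 1, 2, 3, 8, 9}, relevant))  # lower precision, recall 0.5
```

In this toy case the second selection has lower precision but a higher F-score (4/7 versus 0.4), which is the kind of trade-off the F-score columns are meant to summarize.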
To better illustrate the effectiveness of the proposed architectures, Fig. 4 displays heat maps of sample images from COCO by using Grad-CAM [21] and the convolutional architecture+Loss-Attention with ResNet18 as the backbone. It suggests that both Grad-CAM and Loss-Attention can generate class-discriminative representations, but Loss-Attention produces more accurate representations. This is because Loss-Attention selects significant patches while removing trivial ones. For clarity, Fig. 5 presents selected patches of some images from COCO by using the convolutional architecture+Loss-Attention with ResNet18. Fig. 6 presents the estimated bounding boxes of some images from COCO with the convolutional architecture. Similar observations can be found when using VGG11_bn as the backbone network or the capsule architecture. These results suggest that the proposed architectures can be viewed as a weighted sum of patches, and that Loss-Attention can effectively mine the significant patches containing objects or their parts to interpret the image-level decision, i.e. which parts of the image determine the decision-making.

E. Ablation Study and Parameter Analysis
Here, we evaluate the essential parameters γ max and ξ in the proposed Loss-Attention, with the convolutional architecture using VGG11_bn and ResNet18 as backbone networks on Tiny ImageNet. Table VI presents the results of Loss-Attention when setting γ max = 0 or ξ = 0.
It shows that when γ max = 0.1, ξ = 0.1 achieves higher patch precision yet lower recall than ξ = 0; when ξ = 0.1, γ max = 0.1 attains better image classification and localization, patch precision and recall than γ max = 0. Similar findings can be observed on other databases. Fig. 7 presents the effects of γ max within [0, 5] and ξ within [0, 1] on Loss-Attention. Fig. 7(a)-(d) show that when ξ = 0.1, Loss-Attention attains the best image classification accuracy for γ max = 0.1 and the best AP for γ max = 0.5, after which the accuracy decreases as γ max increases. Patch recall has a similar trend to the image classification accuracy, while patch precision gradually grows as γ max increases. These results suggest that γ max can increase the patch precision when γ max ∈ [0, 5], boost the image classification accuracy and patch recall when γ max ∈ [0, 0.1], and improve the localization accuracy when γ max ∈ [0, 0.5]. Fig. 7(e)-(h) illustrate that when γ max = 0.1, AP and patch precision grow as ξ increases, while the image classification accuracy and patch recall decrease. Similar findings can be observed on COCO, so we do not show them for brevity.
Both Table VI and Fig. 7 indicate that the regularization term can be used to boost the image classification and localization accuracy, patch precision and recall. Additionally, ξ can be used to trade off image classification accuracy and patch recall against AP and patch precision.

F. Discussion and Analysis
Based on the experimental results of image classification in Tables II-V, we can see that the proposed convolutional and capsule architectures significantly reduce the dependencies among different parts of the input by removing max-pooling or stride operations, so that their image-level decision is a weighted sum of patches. However, they can still achieve competitive or even better performance than the popular CNNs VGG11_bn, VGG16_bn, ResNet18, ResNet50 and GoogleNet. This is mainly attributed to the loss-based attention mechanism, which can effectively mine the significant image patches. As shown in Tables II-V, the Mean-Pooling, Attention and Gated-Attention mechanisms cannot always outperform the backbone when they adopt the same backbone networks, but the loss-based attention mechanism usually has superior performance over all of them. Table II shows that Dynamic Routing with the modified capsule architecture outperforms previous capsule networks on CIFAR-10 and achieves competitive performance to the best competitor on SVHN. This might be because we adopt Adam for Dynamic Routing to handle trivial patches [43] and a different training procedure, i.e. gradually increasing the learning rate to smooth the training process, which is usually able to improve the model generalization performance [51]. Additionally, when we utilize the same training procedure as that of previous capsule networks, Dynamic Routing with the modified capsule architecture usually achieves much worse accuracy. This might suggest that previous capsule networks can achieve better performance by using the same procedure as ours. Moreover, the capsule networks with Loss-Attention can achieve better or competitive performance to convolutional networks on CIFAR-10, CIFAR-100 and SVHN. This implies that the performance of capsule networks can be on a par with that of CNNs on complex databases.
Experiments for image patch interpretability (Tables IV-V) suggest that a better patch precision or recall does not always result in a higher image classification or localization accuracy for the proposed convolutional architectures. This is because the image-level prediction is determined by a weighted sum of patches, i.e. each patch has different significance, while the patch precision or recall only shows how many significant patches are selected and does not consider their significance. Therefore, patch precision or recall alone is not necessarily correlated with classification and localization performance. However, as shown in Tables IV-V, a better F-score usually leads to better classification and localization performance for the proposed convolutional architectures on Tiny ImageNet and COCO. The ablation study demonstrates that the parameter ξ can remove trivial patches to improve the image localization accuracy and patch precision, and that the introduced regularization term can further boost patch precision and recall.
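The point that patch precision and recall ignore patch significance can be illustrated with a toy weighted sum. The sketch below is hypothetical: the logits and weights are made-up numbers, the weights are assumed to be normalized to sum to 1, and a patch is treated as selected when its weight is larger than 0.

```python
from typing import List, Sequence, Set


def image_logits(patch_logits: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Image-level logits as the weighted sum of per-patch logits."""
    num_classes = len(patch_logits[0])
    return [sum(w * p[c] for w, p in zip(weights, patch_logits))
            for c in range(num_classes)]


def predict(patch_logits: Sequence[Sequence[float]],
            weights: Sequence[float]) -> int:
    """Index of the class with the largest image-level logit."""
    logits = image_logits(patch_logits, weights)
    return max(range(len(logits)), key=logits.__getitem__)


def selected_patches(weights: Sequence[float]) -> Set[int]:
    """Patches with a weight larger than 0 count as selected."""
    return {j for j, w in enumerate(weights) if w > 0}


# Three patches, two classes (illustrative values only).
patch_logits = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights_a = [0.6, 0.2, 0.2]  # emphasizes the first patch
weights_b = [0.2, 0.4, 0.4]  # emphasizes the other two patches

# Both weightings select the same patches, so patch precision and
# recall are identical, yet the image-level predictions differ.
print(selected_patches(weights_a) == selected_patches(weights_b))
print(predict(patch_logits, weights_a), predict(patch_logits, weights_b))
```

Here the two weightings are indistinguishable in terms of which patches are selected, but the first predicts class 0 and the second class 1, because the prediction depends on how much weight each selected patch receives.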
The modified deep architectures consider the spatial relationship of features, and obtain competitive or even higher accuracy than baseline networks with better interpretability. However, because max-pooling or stride operations are removed, they have two major disadvantages: (i) higher GPU memory consumption and (ii) higher computational cost. These are caused by feeding larger inputs into the next layer and by using more parameters, e.g. the layer with 512 channels, kernel size 9 × 9 and stride 4 in Fig. 3. In practice, when the image size is larger than 32 × 32, we can add stride to several layers before the two introduced convolutional layers to reduce the feature map size. For clarity, Table VII presents the classification accuracy and training time of Loss-Attention with ResNet18 on CIFAR-10 for a larger image size of 64 × 64. It illustrates that Loss-Attention using one layer with stride 2 takes less time. Additionally, adding stride 2 to one layer achieves higher accuracy than using no stride. This is because a larger image size usually generates more patches, which increases the difficulty of mining significant patches. The time cost of Loss-Attention mainly depends on the number of layers with stride 2 in the backbone, because the stride reduces the input size. Meanwhile, the accuracy is very close when the stride is used in the backbone. Moreover, the added convolutional layer using stride 2 consumes only slightly more time than that using stride 4, with almost the same accuracy. We set the kernel size and stride to 9 × 9 and 4, respectively, in our experiments, because we follow the setting in Dynamic Routing for a fair comparison and for better interpretation of the image-level decision. In practice, if we only want to obtain better accuracy than the backbone with low computational complexity, the stride can be used in more convolutional layers.
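The relation between image size, backbone stride and the number of patches can be estimated with the standard convolution output-size formula. The sketch below is a rough illustration under stated assumptions: the 9 × 9 layer is taken to use no padding, every stride-2 backbone layer is taken to halve the feature map exactly, all other backbone layers are taken to preserve resolution, and the function names are hypothetical. The actual patch counts of the architectures in Fig. 3 may differ.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def num_patches(image_size: int, backbone_stride2_layers: int,
                kernel: int = 9, stride: int = 4) -> int:
    """Estimated number of patches after the patch-extraction layer.

    Assumptions: each stride-2 backbone layer halves the feature map,
    and the kernel x kernel extraction layer uses no padding.
    """
    feature_size = image_size // (2 ** backbone_stride2_layers)
    side = conv_output_size(feature_size, kernel, stride)
    return side * side


print(num_patches(32, 0))            # 6 x 6 = 36 patches
print(num_patches(64, 0))            # 14 x 14 = 196 patches
print(num_patches(64, 1))            # 6 x 6 = 36 patches
print(num_patches(64, 1, stride=2))  # 12 x 12 = 144 patches
```

Under these assumptions, a 64 × 64 input without any backbone stride yields roughly five times as many patches as the same input with one stride-2 layer, which is consistent with the observation that more patches make it harder to mine the significant ones and increase the cost.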

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose a general attention mechanism and modify previous convolutional and capsule networks to mine significant patches, which contain the objects or object parts that determine the image-level prediction. The proposed Loss-Attention shares the parameters between the attention mechanism and the loss function to learn patch weights, logits and the image prediction simultaneously, in order to connect the attention mechanism with the loss function for boosting patch precision and recall. The modified deep architectures consider the spatial relationship of features by removing max-pooling or stride operations in convolutional layers, so that the image-level decision is a weighted sum of patches. Extensive experiments on multiple large-scale benchmark databases demonstrate the superior performance of the proposed deep architectures over comparative popular deep neural networks, with better interpretability.
Although the proposed architectures can attain promising performance on single-label image localization, they still cannot locate multiple objects belonging to one category in an image. This might be because our method focuses on patch interpretation rather than region proposal selection. However, it is promising to extend and apply our method to weakly supervised localization in more general scenarios. Additionally, our capsule architecture utilizes convolutional layers as the backbone, and in the future it is promising to design different capsule networks based on the two proposed capsule layers to handle large-scale tasks.

In Eq. (13), the fourth inequality is derived from log(1 + a) ≤ a for all a > −1.
Therefore, Proposition 2 is proved.

Fig. 1. The idea of the proposed convolutional architecture, which uses a weighted sum of patches for the image-level decision. We remove max-pooling or stride operations in convolutional layers to preserve the spatial relationship of patches, and we only utilize stride in one convolutional layer to extract patch features for patch logit generation. A detailed convolutional architecture is displayed in the middle panel of Fig. 3. [α_{i1}, ⋯, α_{iM}]^T denotes the weights of the patch logits p_{i1}, ⋯, p_{iM}, respectively. M is the number of patches.
Fig. 2. Two different architectures of attention mechanisms. Left: Traditional attention mechanism. Right: The proposed attention mechanism. h_1, h_2, ⋯, h_M represent the feature representations of patches, θ_{a1}, θ_{a2} and θ_{al} are the parameters of the attention mechanism for weight generation, α_1, α_2, ⋯, α_M are the weights of patches, and θ_L denotes the parameters for image prediction. Note that in the proposed attention mechanism, θ_L is used for both the attention mechanism, Eq. (5) or Eq. (6), and image prediction.
Fig. 4. Heat maps of sample images from COCO by using Grad-CAM [21] and the convolutional architecture+Loss-Attention. Both of them adopt ResNet18 as the backbone network. The first and second rows show heat maps of Grad-CAM and Loss-Attention, respectively.
Fig. 5. Selected patches of some images from COCO by using the convolutional architecture+Loss-Attention with ResNet18 as the backbone network.
Fig. 6. Predicted bounding boxes of some images from COCO by using the convolutional architecture+Loss-Attention with ResNet18 as the backbone network.
Fig. 7. The effect of the parameters γ max and ξ in Loss-Attention with the convolutional architecture using VGG11_bn and ResNet18 as backbone networks on Tiny ImageNet.