Global Learnable Pooling With Enhancing Distinctive Feature for Image Classification

Pooling layers appear widely in deep networks because they aggregate information in a local region and enable fast downsampling. Since the layers closer to the output learn the high-level semantic information most relevant to classification, global average pooling can suppress the contribution of locally high-magnitude features within the global region. Moreover, the gradient of these distinctive features is considerably attenuated by the large region size of global average pooling. In this paper, we propose a global learnable pooling operation, codenamed GLPool, to enhance distinctive high-level features over the global region. Because it is located directly before the classification layer, GLPool has a strong influence on network performance. GLPool is not a hand-crafted pooling operation and adapts to inputs of any size. With only a few added parameters, GLPool is a plug-and-play layer. Visualization via class activation maps (CAM) on GoogLeNet and ShuffleNet-v2 shows that GLPool learns more concentrated, high-level distinctive features than global average pooling. Experiments on several classical deep models demonstrate significant performance improvements on the ImageNet32 and CIFAR100 datasets, which are especially pronounced for lightweight networks.


I. INTRODUCTION
Convolution and pooling operations play a very significant role in deep learning. Since pooling operations are parameterless and consume almost no computing resources, they are widely used to reduce the size of feature maps in the network. Inheriting the design concept of LeNet [1], both AlexNet [2] and VGG [3] apply convolution and pooling as the main components of the network. Because pooling aggregates information within a local region to keep the useful features, it is also applied to extract features in GoogLeNet [4]-[6], which considerably increases feature richness. To solve the problem of fully connected layers in convolution networks, NIN (Network in Network) [7] proposes the global average pooling (GAP) operation, which effectively avoids over-fitting and is more in line with the computation style of CNNs. GAP is widely used in deep convolution networks, including ResNet [8], [9], DenseNet [10], MobileNet [11]-[13], ShuffleNet [14], [15], and so on. GAP has also been used in tracking [16].
The most widely used average [17], maximum [18], and stochastic [19] pooling methods face the problem of losing detailed information or interrupting gradient transfer. As discussed in [20], numerous improved methods have been proposed, such as mixed pooling [21], S3Pool [22], rank-based pooling [23], [24], and so on. However, these operations cannot guarantee that the classification error is minimized during the training stage [25]. LEAP [25] guarantees effective feature selection during dimensionality reduction and avoids the sharp increase in parameters caused by using convolution [26]. However, most networks still use global average pooling at the end, so they still face the problems above.
It is worth noting that visualization research shows that the closer a layer is to the classification layer, the more high-level, classification-relevant semantic information the network learns there [27]. Although GAP [7] and pyramid pooling [28], [29] can gather all the information on the feature map, the averaging operation suppresses the contribution of some distinctive features in local regions. Besides, the gradient of the distinctive features is considerably attenuated due to the large region size of GAP. Accordingly, the closer the pooling layer is to the classification layer, and the larger its region size, the greater its impact on network performance. In this paper, we propose a global learnable pooling operation to enhance the learning of distinctive features while aggregating global information. Like GAP, each GLPool operator acts on only one feature map to obtain its global information. The slightly increased parameters allow GLPool to better highlight the contribution of distinctive features. Besides, GLPool is not a hand-crafted pooling operation and adapts to inputs of any size. Because GLPool is simple to implement, it is also a plug-and-play layer that can replace GAP in many networks.
The proposed GLPool differs from LEAP in several ways and can be seen as an extension of it. GLPool aims to enhance the distinctive features in the global region, while LEAP works in local regions. Besides, LEAP requires the pool size to be set manually, while GLPool adapts to the network. Moreover, LEAP sits in the early and middle stages of the network, while GLPool sits in front of the classification layer. Precisely because of these different locations, LEAP and GLPool are complementary and can be used together. GLPool is also similar to the conditional layer [30], [31]. Due to its globally learnable nature, GLPool makes it straightforward to implement learnable pyramid pooling methods.
As can be seen from Figure 1, GAP merely aggregates all the information of each feature map, whereas GLPool enhances features at different positions and of different sizes on different feature maps based on the network training results. We also draw class activation maps (CAM) [32] to illustrate that GLPool contributes more prominent information than GAP and helps the network activate more accurate and concentrated feature points in the feature map.
To illustrate that GLPool can effectively replace the GAP layer, we have improved multiple well-known networks, containing residual, inception, and lightweight networks. All the improved networks have achieved significant performance improvements in the ImageNet32 and CIFAR100 datasets.
In summary, the main contributions of this article are as follows:
• We propose the GLPool operation, which replaces GAP to enhance the learning of high-level distinctive information for classification-related features. GLPool is a plug-and-play layer that can be ported to many networks.
• We put forward the corollary that the larger the pooling region and the closer it is to the output, the more prominent its influence on the network. This indicates that GAP has a more noticeable influence on network performance than the other pooling operations. GLPool can be applied together with LEAP to construct a fully convolutional network.
• We use CAM to illustrate that GLPool learns more concentrated and significant feature regions than GAP. With the help of the GLPool method, residual, inception, and lightweight networks achieve notable performance improvements on the ImageNet32 and CIFAR100 datasets.

The rest of the paper is organized as follows. Section II introduces related works about pooling. Section III describes the GLPool method and the modified networks in detail. In Section IV, the experiments and analyses are presented. Finally, we give a brief conclusion in Section V.

II. RELATED WORKS
A detailed and well-organized survey of pooling methods is given in [20]. In this section, we review the works most related to our GLPool method.

A. GAP
Global average pooling was proposed in NIN [7]. It aggregates all the information on each feature map and takes the average, mapping each feature map to a single classification-related value. GAP greatly reduces the network parameters, effectively avoids over-fitting, and is more in line with the computation style of CNNs. A variant of global average pooling is AlphaMEX [33], which uses a log-mean-exponential function to extract the features. In addition, various pyramid pooling methods are based on GAP, such as spatial pyramid pooling [28], [34], [35] and concentric circle pooling [29].
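As a concrete reference point, GAP collapses each of the C feature maps to its mean, producing one classification-related value per channel. A minimal NumPy sketch (the function name and shapes are our own illustrative choices, not code from the paper):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Global average pooling: collapse each (n, n) feature map
    to its mean, giving one value per channel."""
    # feature_maps has shape (C, n, n); the output has shape (C,)
    return feature_maps.mean(axis=(1, 2))

# Example: 4 channels of 7x7 high-level features
x = np.arange(4 * 7 * 7, dtype=float).reshape(4, 7, 7)
y = global_average_pool(x)
print(y.shape)  # (4,)
```

This C-dimensional output is what feeds the classification layer, which is why GAP removes the need for a large fully connected layer.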

B. LEARNING-BASED POOLING
Learning-based pooling is guaranteed to minimize the training error during the training stage [25]. Convolution [26] can serve as a pooling operation by setting the stride to 2, but conventional convolutions contribute a large number of parameters. The LEAP [25] operator in each feature channel only uses information from the corresponding channel, so only a few parameters are needed. LEAP also shares pooled-region parameters to implement feature selection and dimensionality reduction over the entire feature map. A convolution built on the LEAP method is very similar to depthwise separable convolution [36]. Depthwise convolution, which likewise operates on one feature map at a time, is also used to achieve dimensionality reduction [11], [15].

III. GLPool METHOD
To better show the reason for using GLPool instead of GAP, we give a detailed calculation process. The architectures of GAP and GLPool are shown in Figure 2 (a) and (b), respectively.

A. GLPool OPERATION
GAP aggregates all the information on each feature map into a single classification-related value. GAP is located at the end of the network, so the feature maps entering GAP all carry high-level semantic information related to the classification task. Before the advent of global average pooling, a fully connected layer connected all high-level semantics to the classification layer; however, its massive number of connections makes it prone to over-fitting. GAP instead aggregates all the high-level semantic information of each feature map and averages it, preventing over-fitting.
Hence, the output y^k of the k-th feature map {x^k_{ij}}_{n×n} after the GAP is

y^k = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} x^k_{ij}.   (1)

Since the global average pooling layer is located before the classification layer, its input feature maps contain the most relevant information for the classification task. Therefore, the suppression of distinctive high-level information by the global averaging operation has a significant impact on network performance. Consequently, global average pooling affects the network strongly due to both its location and its region size.
Hence, we propose the GLPool method, which assigns weights to the pooling region and adaptively adjusts them through network training, thereby enhancing the high-level semantic information of each feature map.
After applying GLPool, the output y^k of the k-th feature map {x^k_{ij}}_{n×n} is

y^k = \sum_{i=1}^{n} \sum_{j=1}^{n} w^k_{ij} x^k_{ij},   (2)

where W^k = (w^k_{ij})_{n×n} denotes the learnable matrix on the k-th feature map. Because this method obtains the global information of each feature map, we do not apply an activation function, so as to preserve the impact of each feature map on classification.
Hence, GLPool also has the advantage of adapting to inputs of arbitrary size. GLPool can enhance different local regions on different feature maps, such as the head, eyes, or ears of the panda shown in Figure 1. The outputs of both GAP and GLPool are Y_G = (y^1, y^2, · · · , y^C), where C is the number of channels.
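A minimal NumPy sketch of the GLPool forward pass in Equation (2); the function name, shapes, and the exact Glorot scaling below are our own illustrative choices, not the released implementation. Note that fixing every weight to 1/n² recovers GAP exactly, so GLPool strictly generalizes Equation (1):

```python
import numpy as np

def glpool_forward(x, W):
    """GLPool: y^k = sum_{i,j} w^k_ij * x^k_ij for each channel k.
    x and W both have shape (C, n, n); the output Y_G has shape (C,)."""
    return (W * x).sum(axis=(1, 2))

C, n = 8, 7
rng = np.random.default_rng(0)
x = rng.standard_normal((C, n, n))

# With all weights fixed at 1/n^2, GLPool reduces to GAP.
W_gap = np.full((C, n, n), 1.0 / (n * n))
assert np.allclose(glpool_forward(x, W_gap), x.mean(axis=(1, 2)))

# In practice W is learnable; one plausible Glorot-normal-style
# initialization scaled by the pooling region size (our assumption):
W = rng.standard_normal((C, n, n)) * np.sqrt(2.0 / (n * n + 1))
y = glpool_forward(x, W)
print(y.shape)  # (8,)
```

Because the reduction is still per-channel, the layer keeps GAP's property of producing one value per feature map regardless of the spatial input size.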
Our GLPool method is somewhat similar in form to a convolution whose kernel size equals the feature map size, which also illustrates the operational similarity between convolution and pooling. However, pooling focuses on the down-sampling process, and adding weights makes selective down-sampling easier to achieve, whereas convolution focuses on learning and extracting features through its weights. Besides, there is no activation function in the GLPool method. We apply GLPool to many kinds of networks and study the universality of the method. We also analyze the influence of pooling location and region size on network performance and give a mathematical explanation from the angle of the gradient. To verify the validity of this hypothesis, we use CAM to visualize the results of network training. These constitute our second and third contributions.
Since the weights of GLPool are determined through network learning, we can borrow the weight initialization methods of convolution. We apply the Glorot normal initialization, whose scale is determined by the size of the global pooling region.

B. THE BACKPROPAGATION PROCESS
GAP also has a great impact on gradient transfer due to its large pooling size, where 7 × 7 and 8 × 8 regions are commonly applied.
To illustrate this, let J(Y_T, Y_P) be the final loss of the network, where Y_T and Y_P are the true and predicted labels. Y_P is obtained by the softmax function:
Y_P = f(W_s Y_G + b_s),   (3)

where W_s and b_s are the weight and bias of the softmax layer and f(·) is the softmax function. Hence, the gradient reaching the pooling output is obtained by combining Equation (3):

\frac{\partial J}{\partial Y_G} = W_s^{\top} \frac{\partial J}{\partial Z}, \quad Z = W_s Y_G + b_s.   (4)
The size of the GAP region is generally large, such as the commonly used 7×7. This means that, in backpropagation, the loss transferred from the classification layer is divided by 49 so that the sum of the gradients before and after the pooling remains the same. The loss reaching the input of the global average pooling layer is therefore

\frac{\partial J}{\partial x^k_{ij}} = \frac{1}{n^2} \frac{\partial J}{\partial y^k},   (5)

where n is the size of the pooling region, generally 7. The larger the pooling region, the stronger this attenuation, which seriously weakens the impact of the significant features on the network gradient.
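To make the attenuation concrete, the following sketch (our own illustration, not the paper's code) shows how a 7×7 GAP spreads the upstream gradient across the region, so each position receives only 1/49 of it:

```python
import numpy as np

def gap_backward(dy, n):
    """Backward pass of GAP for one channel: the scalar upstream
    gradient dy is spread uniformly, attenuated by 1/n^2."""
    return np.full((n, n), dy / (n * n))

n = 7
dx = gap_backward(1.0, n)
# Each of the 49 positions receives only 1/49 of the gradient,
# while the total gradient flowing through the layer is preserved.
assert abs(dx[0, 0] - 1 / 49) < 1e-12
assert abs(dx.sum() - 1.0) < 1e-9
```

A salient feature at one position thus contributes to the update no more than any background position does.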
Since the region of GAP is much larger than that of the other pooling layers in the network, improving it can affect network performance more significantly.
Combining Equation (2) in forward propagation, the loss reaching the input of GLPool is

\frac{\partial J}{\partial x^k_{ij}} = w^k_{ij} \frac{\partial J}{\partial y^k},   (6)

and the gradient of the weights is

\frac{\partial J}{\partial w^k_{ij}} = x^k_{ij} \frac{\partial J}{\partial y^k}.   (7)

The weighted GLPool method can effectively preserve the gradient transfer of salient features, thereby highlighting the distinctive high-level features of the current feature map and giving the network better classification results.
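These GLPool gradients, ∂J/∂x = w · ∂J/∂y and ∂J/∂w = x · ∂J/∂y, can be checked numerically. In this sketch (illustrative names and shapes, single channel for brevity) the analytic gradients match a finite-difference estimate:

```python
import numpy as np

def glpool_forward(x, W):
    # Single-channel GLPool: weighted sum over the whole region.
    return (W * x).sum()

def glpool_backward(x, W, dy):
    """Analytic gradients: dJ/dx_ij = w_ij * dJ/dy, dJ/dw_ij = x_ij * dJ/dy."""
    return W * dy, x * dy

rng = np.random.default_rng(1)
n = 4
x = rng.standard_normal((n, n))
W = rng.standard_normal((n, n))
dx, dW = glpool_backward(x, W, dy=1.0)

# Finite-difference check on one weight entry
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
num = (glpool_forward(x, Wp) - glpool_forward(x, W)) / eps
assert abs(num - dW[0, 0]) < 1e-4
```

Unlike GAP, an input position whose learned weight is large keeps a proportionally large gradient, which is the mechanism behind the claimed enhancement of salient features.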
According to the gradient derivations of global pooling and convolution, we arrive at the following surmise: the larger the pooling region and the closer it is to the output, the more prominent its influence on the network. As can be seen from Equations (1) and (5), the gradient reaching the pooling layer's input is inversely proportional to the size of the pooling region. Moreover, due to the nature of the network, the further a layer is from the output, the more its gradient is affected by pooling operations. This surmise is verified in the experiments section: the network improves more comprehensively when GAP is replaced.

IV. EXPERIMENTS
In this section, we validate the effectiveness of the proposed GLPool method on several classical deep network structures. We first present image classification studies on ImageNet32. Then, we show the effect of GLPool via class activation maps (CAM) and visualizations of intermediate network outputs. To further demonstrate the performance of GLPool, we also perform experiments on the CIFAR100 dataset. The results also demonstrate that GLPool has a greater impact on the network than LEAP.

A. ADD GLPool FOR DEEP NETWORKS
We add the GLPool method to existing deep networks. Since there are many deep models, we choose three kinds: (1) ResNet [8], which represents the depth of the network; (2) GoogLeNet [4], which represents the width of the network; and (3) ShuffleNet [14] and ShuffleNet-v2 [15], which are emerging lightweight networks.
Each of the above models has a GAP layer in front of the classification layer. We replace this layer with a GLPool layer, which still allows the network to accept images of arbitrary size. To handle the 32 × 32 input size of the datasets, the first two layers of the network are also replaced by a single convolution operation. To illustrate that GLPool has a greater impact on the network than LEAP, we also add LEAP to ShuffleNet, placing the pooling inside the module to achieve dimensionality reduction.
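In code, the swap amounts to replacing GAP's fixed reduction with a learned one, leaving the rest of the classification head untouched. A framework-agnostic NumPy sketch of the head (all names and shapes are our own, not from the released code); initializing the GLPool weights to 1/n² makes the swap exactly drop-in:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def head_with_gap(x, W_s, b_s):
    """Original head: GAP -> softmax classifier."""
    y_g = x.mean(axis=(1, 2))
    return softmax(W_s @ y_g + b_s)

def head_with_glpool(x, W_pool, W_s, b_s):
    """Modified head: GLPool -> softmax classifier.
    Only the per-channel reduction changes; W_s and b_s are untouched."""
    y_g = (W_pool * x).sum(axis=(1, 2))
    return softmax(W_s @ y_g + b_s)

C, n, classes = 16, 7, 10
rng = np.random.default_rng(2)
x = rng.standard_normal((C, n, n))
W_s = rng.standard_normal((classes, C)) * 0.1
b_s = np.zeros(classes)

# GAP-equivalent initialization: the two heads agree exactly,
# then training is free to move the pooling weights away from uniform.
W_pool = np.full((C, n, n), 1.0 / (n * n))
assert np.allclose(head_with_gap(x, W_s, b_s),
                   head_with_glpool(x, W_pool, W_s, b_s))
```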

B. TRAINING THE NETWORK
Theoretically, the above network structures can be trained with standard backpropagation. In this paper, we apply SGD as the optimizer, with a weight decay of 10^{-4} and a momentum [8], [10] of 0.9. The initial learning rate is set to 0.01 and is divided by 10 at epochs {20, 30, 40}. We set the batch size to 128. Besides, all the convolution layers in the deep networks use L2 regularization [37] with a coefficient of 0.002.
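The step schedule above can be written as a small helper (the function name is ours; the milestones and base rate are as stated):

```python
def learning_rate(epoch, base_lr=0.01, milestones=(20, 30, 40)):
    """Step schedule: divide the base rate by 10 at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (10 ** drops)

# The schedule used for all experiments (initial rate 0.01):
assert abs(learning_rate(10) - 1e-2) < 1e-12
assert abs(learning_rate(25) - 1e-3) < 1e-12
assert abs(learning_rate(35) - 1e-4) < 1e-12
assert abs(learning_rate(45) - 1e-5) < 1e-13
```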
To illustrate the advantages of the GLPool operation, we experiment on the ImageNet32 [40], [41] and CIFAR100 datasets. ImageNet32 is a downsampled version of ImageNet, which likewise contains 1.28 million training images and 50K validation images from 1000 classes. The dataset is available on the official ImageNet website.
All networks were implemented with the tensorflow.keras framework and trained on a Tesla V100 GPU with 32 GB of memory. ResNet34 consists of basic residual units with the configuration {3, 4, 6, 3}. ResNet50 consists of bottleneck units with the configuration {3, 4, 6, 3}. ShuffleNet applies group convolution with 3 and 4 groups. ShuffleNet-v2 has two complexity settings with different channel numbers. The code is available at https://github.com/HedwigZhang/GLPool.

C. THE EXPERIMENT RESULTS ON ImageNet32
We conduct experiments on several deep models to show the improvement after applying GLPool. Besides, we design a series of ablation studies to show that GLPool consistently improves performance under the same hyper-parameter conditions, such as the coefficient of the regularization loss and the learning-rate schedule. Table 1 shows the classification accuracy of the networks after applying the GLPool method.

It can be seen from the bolded values in Table 1 that all the experimental networks achieve significant performance improvements after adding the GLPool operation. For ResNet34, GLPool improves the network by 1 percentage point on the test set and 5 percentage points on the training set. The accuracy improvement of ResNet50 is slightly larger than that of ResNet34. For GoogLeNet, the representative inception network, the accuracy improvement on the training and test sets is more notable than for the residual networks. This is because GoogLeNet contains multiple max-pooling layers in addition to global average pooling. After adopting GLPool, the performance improvement is noticeable, exceeding ResNet34 and reaching accuracy similar to ResNet50.
On the two lightweight structures, ShuffleNet and ShuffleNet-v2, GLPool helps the networks achieve the most significant performance improvement. As shown in Table 1, ShuffleNet improves by 6.4 percentage points in classification accuracy and 5 percentage points in top-5 accuracy. ShuffleNet-v2 improves by more than 4 percentage points under both complexity configurations. This is because the main body of ShuffleNet is depthwise convolution, whose ability to learn spatial information is weaker than that of traditional convolution. GLPool compensates for this weakness, yielding the most significant performance improvement and illustrating that GLPool enhances the high-level distinctive features.

2) ADDED PARAMETERS
One reason traditional convolution is not broadly used for dimensionality reduction is its huge number of parameters. In contrast, the number of parameters added by GLPool is very small, as Table 1 also illustrates. The parameters added by GLPool in these networks amount to less than 0.1M, and the added complexity is less than 8M. The table also shows that the added complexity is directly proportional to the number of channels entering the GLPool operation. Figure 3 further shows that GLPool only slightly increases the parameters while significantly improving performance; there is a positive correlation between network performance and line length in the figure. The training process is shown in Figure 4, where dark-colored lines represent the networks using GLPool. As the number of iterations increases, the networks applying the GLPool method show better learning capability.
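The added cost is easy to bound: GLPool learns exactly one n × n matrix per channel, i.e. C · n · n extra weights. A back-of-the-envelope check for a typical final feature map (1024 channels, 7×7 is our illustrative assumption, not a figure read from Table 1):

```python
def glpool_extra_params(channels, n):
    """GLPool learns one n x n weight matrix per feature channel."""
    return channels * n * n

# e.g. a 1024-channel, 7x7 final feature map
added = glpool_extra_params(1024, 7)
print(added)            # 50176
assert added < 100_000  # comfortably under 0.1M, matching the paper's claim
```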

3) REGULARIZATION COEFFICIENT
Deep learning is easily affected by the network's hyper-parameters. Hence, we add experiments reflecting the influence of the GLPool method under different regularization coefficients. We choose ShuffleNet-v2 as the experimental network. To further improve performance, we also add Nesterov momentum to the SGD optimizer.
As shown in Tables 2 and 3, the GLPool method significantly improves network performance under different regularization coefficients. The subscripted values in the tables indicate the improvement in accuracy.

4) VISUALIZATION RESULTS VIA CAM
The class activation map (CAM) is a very useful technique that indicates the discriminative image regions used by the network to identify a category [32]. It was originally used to show the contribution of each feature map entering the GAP layer. Since GLPool also operates on whole feature maps, CAM can likewise be used to display the active regions of GLPool.
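For a GAP-based head, the CAM for class c is the weighted sum CAM_c(i, j) = Σ_k w^c_k x^k_{ij}, using the classifier weights for that class; since GLPool also keeps one scalar output per channel, the same construction applies to its feature maps. A NumPy sketch under these assumptions (names and shapes are ours):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM_c(i, j) = sum_k w^c_k * x^k_ij.
    feature_maps: (C, n, n); class_weights: (C,) for one class."""
    return np.tensordot(class_weights, feature_maps, axes=1)

C, n = 8, 7
rng = np.random.default_rng(3)
x = rng.standard_normal((C, n, n))
w_c = rng.standard_normal(C)  # classifier weights for the chosen class
cam = class_activation_map(x, w_c)
print(cam.shape)  # (7, 7); typically upsampled to image size for display
```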
It can be seen from Figure 5 that the feature activation areas after using GLPool are more concentrated and distinguishable. To show that this is not a special case, we also randomly select 4 pictures from each category for CAM visualization, as shown in Figure 6.
Within the same category, GLPool still performs in a more concentrated and accurate way. For the banana shown in Figure 6 (a), the activation regions of the first and last pictures under GAP are biased and not accurately positioned on the banana, while the activation region of GLPool is more focused on the target itself. For the sports car category, the activation region of GAP spreads into the surrounding space. The same phenomenon occurs in the last category: the active region of the second image under GAP falls on the back of the panda, whereas all the figures using GLPool focus on the giant panda's most distinctive head features, as can be seen from Figure 6 (c). In summary, GLPool contributes more to learning salient features, so better classification accuracy can be achieved.

D. EXPERIMENT RESULTS ON CIFAR100
We also experiment with several deep models on CIFAR100, another widely used classification dataset. Several ablation studies further show that the effect of GLPool is much more important than that of LEAP inside the module. As shown in Table 4, the GLPool method also helps the deep networks improve performance on the CIFAR100 dataset.
Compared with Table 1, each network improves after using the GLPool method, but the improvement is smaller than on ImageNet32. The main reason is the amount of data: the networks are not fully trained on CIFAR100, which has far fewer images than ImageNet32, and this restricts the advantages of the GLPool method.
Except on GoogLeNet, where the performance improvement is not noticeable, the improvement is very pronounced on the other networks. On the training set, ResNet34 with the GLPool method improves the classification accuracy by more than 5 percentage points; on the test set, the accuracy improves by more than 2 percentage points. The GLPool method even achieves a performance improvement of more than ten percentage points on the lightweight networks, and the effect on the test set is also very significant.
As the training process in Figure 7 shows, the networks with the GLPool method converge faster and ultimately achieve better performance.

1) GLPool HAS MORE PROMINENT INFLUENCE ON THE NETWORK
To verify the surmise proposed earlier, we also introduce LEAP into the network. Since ShuffleNet applies average pooling for dimensionality reduction in some units, we design 4 networks for the experiments. The first is the original ShuffleNet, which uses average pooling and GAP. The second replaces average pooling with LEAP while retaining GAP. The third uses our GLPool instead of GAP while retaining average pooling. The last applies both LEAP and GLPool. For brevity, let ''P'' denote average pooling.
The experimental results in Table 5 show that ShuffleNet in both configurations achieves a more significant performance improvement after replacing GAP with GLPool, while the improvement from using LEAP alone is not obvious. GLPool has the more considerable impact on the network, which further verifies that the larger the pooled region, the greater the influence on the network. When the network uses LEAP and GLPool at the same time, the performance improves further, which shows that the two methods are complementary and can be used together.

2) CAM RESULTS ON CIFAR100
To verify the above assumption, we also perform CAM visualization on the ShuffleNet-v2 model trained on the CIFAR100 dataset. As can be seen from Figure 8, the CAM of ShuffleNet-v2 with GLPool is more focused on high-level distinctive feature areas. For example, the features of the bird's beak and tail, the cat's head and body, the deer's antlers, and the bow of the ship are all remarkable. GAP also learns background information, such as the water waves around the ship, which may have a negative influence on classification performance.

3) EXPLAINING THE DISTINCTIVE FEATURE LEARNING ABILITY OF GLPool FROM ANOTHER ANGLE
Since depthwise convolution does not learn information across channels, ShuffleNet-v2 uses a pointwise convolution before the GAP layer to enhance spatial information learning. However, this pointwise convolution has 1024 channels and adds about 1M parameters, which is very large for ShuffleNet-v2. Therefore, if ShuffleNet-v2 still achieves a performance improvement after replacing GAP with GLPool and cutting this pointwise convolution layer, it means that GLPool is very helpful for learning distinctive high-level features.
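The roughly 1M figure can be sanity-checked: a 1×1 (pointwise) convolution with C_in input channels and 1024 output channels has C_in · 1024 weights, ignoring biases. A quick check under an assumed C_in of about 1000 (the exact input channel count depends on the network's width setting):

```python
def pointwise_conv_params(c_in, c_out):
    """A 1x1 (pointwise) convolution has c_in * c_out weights (no bias)."""
    return c_in * c_out

# Assuming roughly 1000 input channels feeding the 1024-channel layer
print(pointwise_conv_params(1000, 1024))  # 1024000, i.e. about 1M parameters
```

Removing this layer therefore saves far more parameters than GLPool adds (under 0.1M), so the trade described here is heavily in GLPool's favor.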
On the ImageNet32 dataset, ShuffleNet-v2 sheds about 1M parameters and improves performance by 2 percentage points after removing the pointwise convolution and adding GLPool. On the CIFAR100 dataset, the parameters are reduced by 0.5M, and the performance almost reaches the best level of the network without clipping the pointwise convolution layer. Because CIFAR100 contains far less data than ImageNet32, the improvement after using GLPool instead of GAP is more obvious. This further illustrates that GLPool has a stronger ability to learn distinctive high-level features.

E. DISCUSSION
Our experiments are mainly on datasets with a size of 32×32. Due to the nature of convolution, our GLPool method can also be applied to the original deep networks with an input size of 224 × 224. In ResNet, there is a convolution layer with a 7×7 kernel and a max-pooling layer with a 3×3 region before the residual units; the max-pooling layer can be directly replaced by LEAP. The same applies to GoogLeNet and ShuffleNet. Under the same hyper-parameters, the networks with the GLPool method show clear improvement.

V. CONCLUSION
In this paper, we propose the global learnable pooling (GLPool) method for existing deep networks. GLPool's adaptive adjustment through network training enhances the learning of the distinctive features in the global information that are most relevant to classification. Applying GLPool effectively enhances the distinctive features without significantly increasing the parameters, as further illustrated by the CAM visualizations. Experimental results on the ImageNet32 and CIFAR100 datasets show that many deep networks achieve significant performance improvements after applying the GLPool operation. The experiments also verify that the larger the pooling region and the closer it is to the output, the more prominent its influence on the network. Because GLPool is simple to implement, it is a plug-and-play module that can be generalized to other deep networks.