Rethinking CAM in Weakly-Supervised Semantic Segmentation

Weakly supervised semantic segmentation (WSSS) generally utilizes the Class Activation Map (CAM) to synthesize pseudo labels. However, current methods obtain the CAM from the salient features of a specific layer, which highlights only the most discriminative regions and in turn leads to rough segmentation results for WSSS. In this paper, we rethink the potential of the ordinary classifier and find that if the features of all layers are applied, the classifier can produce a CAM with complete discriminative regions. Inspired by this, we propose Fully-CAM for WSSS, which fully exploits the potential of the ordinary classifier and yields more accurate segmentation results. Precisely, Fully-CAM first weights the features with their corresponding gradients to yield a CAM for each layer, and then fuses these per-layer CAMs to generate an ultimate CAM with complete discriminative regions. Furthermore, Fully-CAM is encapsulated as a plug-in that can be mounted on any trained ordinary classifier with convolution layers, allowing the classifier to exceed its previous performance without extra training.


I. INTRODUCTION
Fully-supervised semantic segmentation (FSSS) [1], [2], [3], [4] aims to classify each pixel of an image. With the development of deep learning, FSSS, as a basic computer vision task, has reached a major milestone. Unlike other general tasks such as object detection and classification, it is a data-driven task that requires dense pixel-level mask labels for training, and the cost of obtaining such labels is huge: object detection requires only bounding boxes as supervision, and classification only category labels, whereas FSSS requires dense pixel-level annotation, whose labeling time cost is obviously far higher than that of other tasks.
The associate editor coordinating the review of this manuscript and approving it for publication was Byung Cheol Song.

Therefore, much work has focused on weakly-supervised semantic segmentation (WSSS) in recent years. WSSS synthesizes pixel-level pseudo labels from low-level labels such as scribbles [5], [6], bounding boxes [7], [8], [9], points [10], [11], and image-level classification labels [12], [13], [14], [15]. The image-level classification label is one of the most popular forms of supervision because it is straightforward to obtain; simultaneously, it is also the most challenging for WSSS. The process of image-level WSSS methods is as follows: (1) the image-level classification label is used as the supervision to train a classifier, which is usually a fully convolutional network (FCN) followed by a global average pooling (GAP) layer, and the features output by the last layer of the classifier are used as a coarse localization named the Class Activation Map (CAM) [16]; (2) the CAM is refined to synthesize more accurate pixel-level pseudo labels; (3) a supervised semantic segmentation network is trained with these pseudo labels and its performance is tested. To synthesize more accurate pixel-level pseudo labels based on CAM, DSRG [17] proposed using the CAM as seed points for region growing and expansion, and AffinityNet [13] proposed predicting the semantic similarity between adjacent pixel pairs in the image to diffuse the CAM.
Generally speaking, a high-quality CAM has a positive impact on the segmentation results. As mentioned in many related works [18], [19], the quality of the traditional CAM (as shown in Fig.1) is poor: it can only highlight the salient features of the object. Most previous works [18], [19], [20] on WSSS attributed the poor localization ability of CAM to the fact that an ordinary classifier can only highlight the most discriminative regions of each class. Consequently, most works try to improve the CAM through complex training methods: Puzzle-CAM [19] proposed a separate-and-merge training method to narrow the gap between global and local CAMs; SEAM [21] uses an equivariant attention mechanism to fuse the CAMs of various transformed images and generate complete localization; CIAN [20] constructs the affinity matrix of two images via a self-attention mechanism and mines their common and uncommon category locations; AdvErasing [22] gradually erases the most discriminative regions of the CAM and guides the network to focus on other areas of the object. These strategies not only increase the consumption of computing resources but also increase the difficulty of training. Nonetheless, we argue that they address only the symptom: the real reason for the poor localization ability of CAM is the insufficient utilization of information.
As far as we know, Convolutional Neural Network (CNN) based methods place different emphases on the features extracted in different layers. Generally, an object's low-level features (e.g., contour and texture) are extracted in the shallow layers, and its high-level features (e.g., abstract features that are difficult to interpret) are extracted in the deep layers. The CAMs of the ordinary classifier's layers also follow certain rules: the CAMs from the shallow layers of the classifier network have clear object contours but contain redundant noise, whereas the CAMs from the deep layers concentrate on the object's discriminative regions but lose the overall contour of the object. Meanwhile, many works reuse features to improve their performance. For example, U-Net [4] integrates low-level and high-level features during upsampling and obtains a more accurate semantic segmentation result, and ResNet [23] reuses previously extracted features through residual blocks to obtain higher classification accuracy. These works show that applying more features can overcome some limitations of traditional models. Therefore, more features could participate in the object-localization task in WSSS (see (i) in Fig.5): by applying the features of the previous layers, the whole area of the object is highlighted.
In this paper, we rethink the potential of the ordinary classifier's CAM and find that the ordinary classifier already has sufficient capability to obtain a CAM with more complete discriminative regions without complex training. To fully exploit the potential of the ordinary classifier, we propose a simple framework named Fully-CAM that applies the features from all convolution layers to obtain CAMs with complete discriminative regions for WSSS. The process of obtaining the CAM can be divided into three steps: obtaining the features of each convolution layer in the forward pass; obtaining the gradients of each feature in the backward propagation; and generating the ultimate CAM in the generation step. Specifically, in the backward propagation, we design the Computing Gradients Module (CGM) to obtain the gradients of all features at once. In the generation step, we design the Fusing Localization Module (FLM) to generate the ultimate CAM by fusing all the features weighted by their gradients. The main advantage of the proposed Fully-CAM is that it allows all of the classifier's features to participate in localizing objects. As is well known, previous methods used a specific layer's features to determine the localization, which we regard as insufficient utilization of information: it gives that layer's CAM an absolute monopoly over the localization task. In contrast, our method allows all the features to participate in localization and to complement each other's weaknesses with their strengths, which makes the ultimate CAM localize objects accurately. We also conduct extensive ablation studies and experimentally verify that the proposed Fully-CAM achieves additional performance gains.
Our main contributions are as follows:
• We experimentally verify that an ordinary classifier without complex training has enough capability to localize the whole object region.
• To make our method widely applicable, Fully-CAM is designed as a plug-in that can be mounted on any trained ordinary classifier with convolution layers without retraining, allowing it to exceed its previous performance.
• We achieve additional performance gains over the previous method in WSSS through our CAMs on the PASCAL VOC 2012 val/test sets with only image-level classification labels.

II. RELATED WORK
Image-level weakly supervised semantic segmentation mainly studies two aspects: improving the quality of the CAM so that it highlights the whole discriminative region of each object, and synthesizing more accurate pixel-level pseudo labels. Both are inseparable from obtaining the CAM. We first introduce the related progress on CAM and then the related work in WSSS.

A. CLASS ACTIVATION MAP
CAM plays a significant role in interpreting CNNs because it visualizes the basis of a model's decision. At present, there are two mechanisms for obtaining CAM. One is the traditional method [16], which obtains the CAM by weighting the features with the weights of the fully connected layer; the other is Grad-CAM [24], which obtains the CAM by weighting the features with the gradients of backward propagation. The traditional method has strict requirements on the classifier's network structure: the classifier must be an FCN followed by a GAP layer and a fully connected layer. Sometimes the fully connected layer can be removed, and the output of the GAP can be directly used as the predicted confidence of each class. This strict constraint on the network means that the traditional method can only obtain the last convolution layer's CAM. The later proposal of Grad-CAM made it possible to obtain any layer's CAM and to visualize a CNN of any structure, because Grad-CAM uses the gradients of backward propagation as the weights for the features. Grad-CAM is flexible, but it ignores the spatial importance of individual pixels, resulting in unclear CAMs. Note that both visualization methods described above operate on only a specific layer.
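To make the single-layer nature of this weighting concrete, the Grad-CAM scheme for one layer can be sketched in a few lines of framework-free Python (a minimal sketch with nested lists standing in for tensors; the function name and list layout are ours, not from the cited papers):

```python
def grad_cam(features, grads):
    """Grad-CAM for one layer: each channel's weight is the global
    average of its gradients; channels are then combined and ReLU'd."""
    # features, grads: K x H x W nested lists (channels, height, width)
    K, H, W = len(features), len(features[0]), len(features[0][0])
    weights = [sum(sum(row) for row in grads[k]) / (H * W) for k in range(K)]
    return [[max(0.0, sum(weights[k] * features[k][i][j] for k in range(K)))
             for j in range(W)] for i in range(H)]
```

The traditional CAM [16] is the special case where the channel weights come from the fully connected layer instead of the gradients; either way, only one layer's features decide the final map.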

B. WEAKLY-SUPERVISED SEMANTIC SEGMENTATION
Compared with FSSS, WSSS uses low-level labels to generate pixel-level pseudo labels to guide training, e.g., scribbles [5], [6], bounding boxes [7], [8], points [10], [11], and image-level classification labels [12], [13], [14], [15]. Most advanced methods utilize image-level labels to train models, and most works use the CAM obtained by the classifier to synthesize pseudo labels. DSRG [17] combines deep learning with the seeded region growing method, using the CAM as seed points instead of manually selected seeds to expand over the entire region; AdvErasing [22] uses two classifiers, one to generate the CAM and the other to iteratively erase the most discriminative areas of the CAM, guiding the network to focus on other areas so as to highlight the entire object; NL-CCAM [25] uses a linear function to calculate the coefficient of each CAM and weights the CAMs to make the foreground more prominent; AffinityNet [13] proposed predicting the semantic similarity between adjacent pixel pairs in the image to diffuse the CAM; IRNet [12] generates a transition matrix from AffinityNet and extends the method to weakly supervised instance segmentation. Some advanced methods also use the attention mechanism to improve the CAM in WSSS: CIAN [20] constructs the affinity matrix of two images via self-attention and mines their common and uncommon category locations, and SEAM [21] proposed consistency regularization on the CAMs predicted from various transformed images for self-supervised learning.

III. APPROACH
The overall pipeline of Fully-CAM is illustrated in Fig.2.
Our framework consists of a training stage (not required) and an inference stage. In the training stage, we use the most common method to train a classification model, which provides the basis for generating the CAM in the inference stage.
In the inference stage, there are three steps to obtain the CAM: forward pass, backward propagation, and generation. The forward pass obtains the feature of each convolution layer's output and the predicted confidence score of the classifier; the backward propagation obtains the gradient of each feature; and the generation step obtains the ultimate CAM from the features and gradients. In the backward propagation, we design the Computing Gradients Module (CGM) to obtain the gradients of a specific class. In the generation step, the Fusing Localization Module (FLM) is designed to generate the ultimate CAM from the gradients and the features. It first generates the CAM of a single input image by fusing the feature maps weighted by their gradients, and then fuses the CAMs of different transformed images to generate the ultimate CAM.

A. TRAINING OF ORDINARY CLASSIFIER
Different from other WSSS methods, our method employs the most ordinary classifier; in other words, it is applicable to any classifier with convolution layers. We define I as the input image and f as the feature extractor. In previous WSSS methods, the classification head often consists of a convolution layer Conv, whose number of output channels equals the number of classes C, followed by a global average pooling layer GAP. The advantage of such classification heads is that they can obtain the CAM more conveniently, but the structure of the trained classifier must be modified and the model retrained; the confidence score y_pred is obtained by

y_pred = GAP(Conv(f(I))). (1)

Nowadays, the classification head of most mature classifiers consists of a GAP layer followed by a fully connected layer FC, and the confidence score y_pred is obtained by

y_pred = FC(GAP(f(I))). (2)

Since we classify multi-label data, we use the binary cross-entropy (BCE) loss, where σ denotes the Sigmoid function:

loss = BCE(σ(y_pred), y_gt). (3)

In our method, we generalize the classification head to make it universal, so that it can be used directly without modification.
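The GAP operation and the multi-label BCE loss above can be sketched in plain Python (nested lists in place of tensors; function names are illustrative, not from any library):

```python
import math

def gap(feature_maps):
    """Global average pooling: one scalar per channel map."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def bce_loss(logits, targets):
    """Multi-label BCE: sigmoid each class logit, then average the
    binary cross-entropy against the 0/1 ground-truth vector."""
    eps = 1e-7
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(1.0 / (1.0 + math.exp(-z)), eps), 1.0 - eps)  # sigmoid
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(logits)
```

In a framework such as PyTorch, the same head would be GAP followed by an FC layer with a sigmoid BCE loss; the point here is only the shape of the computation.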

B. COMPUTING GRADIENTS MODULE
We all know that there may be objects of multiple classes in an image, and we need to distinguish the discriminative regions of different classes. This section introduces in detail how CGM obtains the gradients of a layer's feature map for a specified class. Fig.3 shows the process of CGM. Formally, let classifier denote the image classifier and θ represent its parameters. For a given image I, inputting I into the classifier yields the predicted scores, with the score under a specific class c_i defined as

y_pred^{c_i} = classifier(I; θ). (4)

Let A_n be the output feature maps of the n-th convolution layer in the network; the shape of A_n is (1, K, W, H). A_nk (k ∈ [1, K]) is the k-th feature map within A_n, and its shape is (W, H). The gradients of the prediction scores y_pred over all classes c_i ∈ C with respect to the feature map A_nk can be obtained by

g_nk = ∂y_pred / ∂A_nk. (5)

Here g_nk represents the gradients for all classes C. Note that g_nk is a three-dimensional matrix of shape C × W × H: since the gradient of backward propagation is computed for each predicted class score y_pred^{c_i}, the number of channels in g_nk is C. Next, since only the gradients of a specific class are needed, we have to filter the gradients. Let c denote the c-th class in the ground truth y_gt of image I, and let y_c be a vector of shape 1 × C that serves as the one-hot label of the target class c. The gradients g_nkc (1 × W × H) of the target class c in the feature map A_nk can be obtained by

g_nkc = y_c · g_nk. (6)

Since the shape of g_nkc should be the same as that of A_nk, we squeeze g_nkc, and ĝ_nkc of shape W × H is obtained by

ĝ_nkc = squeeze(g_nkc, 0). (7)
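The one-hot filtering of (6)-(7) amounts to selecting one channel of the C × W × H gradient tensor; a plain-Python sketch (list-based, with an illustrative function name):

```python
def select_class_gradients(g, class_index):
    """CGM filtering: multiply the C x H x W gradients by the one-hot
    vector y_c and sum over classes, leaving the H x W slice for the
    target class -- equivalent to g[class_index] after the squeeze."""
    C = len(g)
    y_c = [1.0 if c == class_index else 0.0 for c in range(C)]
    H, W = len(g[0]), len(g[0][0])
    return [[sum(y_c[c] * g[c][i][j] for c in range(C))
             for j in range(W)] for i in range(H)]
```

In practice a framework would simply index the class channel; writing it as a one-hot product mirrors equations (6)-(7).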

C. FUSING LOCALIZATION MODULE
The Fusing Localization Module is designed to generate a CAM with complete discriminative regions by fusing the CAMs from different convolution layers and from different transformed input images. Before the fusion, we first introduce how to generate the CAM of a certain convolution layer for a specific class. To obtain the CAM of the n-th convolution layer, we multiply the activation value at each location of the feature map by its gradient as the weight, obtaining the CAM of the k-th feature map of the n-th convolution layer, CAM_nkc, where (i, j) denotes the spatial location:

CAM_nkc^{ij} = ReLU(ĝ_nkc^{ij}) · ReLU(A_nk^{ij}). (8)

ĝ_nkc^{ij} indicates the influence of the target class c on A_nk^{ij}; if the gradient is negative, it is irrelevant to A_nk^{ij}. Similarly, A_nk^{ij} may also be negative, which we regard as information redundancy. Moreover, keeping negative values would involve many floating-point operations, so we set all negative values to zero for ease of computation. We have now obtained the CAM of the k-th feature map of the n-th convolution layer, but we cannot fuse it directly. The reason is that there is a huge numerical gap between the values of the CAMs of different feature maps and convolution layers; if they were simply accumulated, the CAMs with large values would play an absolutely dominant role. To let each CAM reflect its own characteristics, we normalize each CAM so that its values lie in the range [0, 1]. Then the normalized CAM_nkc are linearly combined along the channel dimension to obtain the CAM CAM_nc, which is formulated as follows:

CAM_nc = Σ_k N(CAM_nkc), (9)

where N(·) denotes the normalization to [0, 1]. We can obtain the CAMs of all convolution layers through the above steps. However, due to the kernel size, stride, and padding of the convolutions and the downsampling in the network, the obtained CAMs differ in size. As shown in the inference part of Fig.2, because the features of different layers have different sizes, we need to restore the CAMs to the size of image I through linear interpolation.
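The per-layer CAM construction above (elementwise gradient weighting, zeroing negatives, per-channel [0, 1] normalization, channel-wise summation) can be sketched in plain Python. This is a minimal sketch: min-max scaling is one reasonable reading of the [0, 1] normalization, and the list layout is ours:

```python
def layer_cam(features, grads):
    """Per-layer CAM: weight each activation by its own gradient,
    zero out negatives, min-max normalize each channel's map to
    [0, 1], then sum the maps along the channel dimension."""
    K, H, W = len(features), len(features[0]), len(features[0][0])
    cam = [[0.0] * W for _ in range(H)]
    for k in range(K):
        m = [[max(0.0, grads[k][i][j]) * max(0.0, features[k][i][j])
              for j in range(W)] for i in range(H)]
        lo = min(min(row) for row in m)
        hi = max(max(row) for row in m)
        if hi > lo:                      # skip constant (e.g. all-zero) maps
            m = [[(v - lo) / (hi - lo) for v in row] for row in m]
        for i in range(H):
            for j in range(W):
                cam[i][j] += m[i][j]
    return cam
```

Note the contrast with Grad-CAM: here every spatial location keeps its own gradient as a weight, rather than one globally averaged weight per channel.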
The restored CAM of the n-th convolution layer is obtained by

ĈAM_nc = resize(CAM_nc), (10)

where resize(·) denotes linear interpolation to the size of image I. Finally, the CAMs from all convolution layers are fused to generate the ultimate CAM of a specific class of the image by

CAM_c = Σ_n ĈAM_nc, (11)

where CAM_c is the ultimate CAM and ĈAM_nc represents the CAM from the n-th convolution layer. Through (4) to (11), all the features from all the convolution layers produce the ultimate CAM. Different from previous approaches (such as traditional CAM [16] and Grad-CAM [24]), whether a certain location of the image is highlighted, and to what degree, is determined not by one or several features but by all the features captured by the network. Fully-CAM, which uses all the features captured by the network, can therefore achieve more accurate and fine-grained localization than other methods. Although Fully-CAM already captures accurate and detailed location information, we use a small trick to further improve the localization. As shown in Fig.4, we feed the original and transformed images into the network, obtain the corresponding CAMs, and integrate the information from both. For example, we use the flipped image to obtain the CAM of the flipped image; this CAM must be flipped back to match the CAM of the original image before the CAMs of the original and flipped images are fused. Here we denote scaling, flipping, and other transformations as t, the inverse transformations as t^{-1}, and the CAM-generation process of (4)-(11) as τ. Therefore, for the n-th transformation t_n we obtain

CAM_{t_n} = τ(t_n(I)). (12)

By (12), we get the CAM of the transformed image t_n(I); we then inversely transform it by t_n^{-1}(CAM_{t_n}) and exchange information among the inversely transformed CAMs to obtain the enhanced ultimate CAM ĈAM. In this way, the information can be utilized to the greatest extent, useless information can be filtered out, and the localization accuracy can be improved.
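The transform-and-invert fusion around (12) can be sketched as follows (plain Python; we average the inversely transformed CAMs, which is one reasonable choice of fusion operator — the paper's exact "exchange information" step is not spelled out, so this is an assumption):

```python
def hflip(m):
    """Horizontal flip of an H x W map; hflip is its own inverse."""
    return [row[::-1] for row in m]

def enhanced_cam(image, tau, transforms, inverses):
    """Run the CAM generator tau on each transformed copy of the
    image, undo the transform on each result, and average them."""
    H, W = len(image), len(image[0])
    out = [[0.0] * W for _ in range(H)]
    for t, t_inv in zip(transforms, inverses):
        cam = t_inv(tau(t(image)))      # CAM_{t_n} mapped back by t_n^{-1}
        for i in range(H):
            for j in range(W):
                out[i][j] += cam[i][j]
    n = len(transforms)
    return [[v / n for v in row] for row in out]
```

Scaling transforms would additionally need a resize back to the original resolution before the average, exactly as in (10).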

IV. EXPERIMENTS
A. DATASET & IMPLEMENTATION DETAILS
1) DATASET
The PASCAL VOC 2012 dataset [26], the most representative dataset in WSSS, is used in our experiments. It includes 4,369 images: 1,464 for training, 1,449 for validation, and 1,456 for testing. Note that, to be consistent with the practice of previous works [13], [19], [21], [27], [28], we also introduce the Semantic Boundary Dataset [29] as an augmented training set with 10,582 images. Mean Intersection-over-Union (mIoU) is used to measure the performance of different methods.
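For reference, mIoU over flattened label arrays can be computed as follows (a standard sketch, not code from the benchmark toolkit; following common practice, classes absent from both prediction and ground truth are skipped):

```python
def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union: per-class IoU averaged over the
    classes that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```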

2) IMPLEMENTATION DETAILS
Our experiments are implemented in PyTorch 1.10 with ResNet-50 and ResNet-101 as the backbone networks for WSSS. We follow the previous work [19] to set the experimental parameters. Specifically, we use the Adam optimizer with an initial learning rate of 0.1 and a weight decay of 0.0005, set α = 4 as the maximum, and linearly ramp α up to its maximum value over the first half of the epochs. The batch size is 32 with 15 epochs on four NVIDIA 3080 GPUs for training the classifier, 24 with three epochs on three NVIDIA 3080 GPUs for training IRNet [12], and 24 with 50 epochs and an initial learning rate of 0.007 on four NVIDIA 3080 GPUs for training DeepLab. For data augmentation, we first randomly resize the image within the range 320 to 640 and randomly flip it, with a crop size of 512. In the inference stage, we randomly flip the image and use multi-scale inference (scale ratios {0.5, 1, 1.5, 2}) on a single 3080 GPU.

B. ABLATION STUDIES
Our Fully-CAM has three essential aspects: (1) the CAMs of all features are weighted and fused; (2) the CAMs of different transformations are fused; (3) Fully-CAM is plug and play, which improves the performance of any trained ordinary classifier with a global average pooling layer without extra training. We perform experiments to study the effect of different aspects of our model.
Previous work only used the features of a specific layer as the basis of the CAM, which is insufficient utilization of information. Thus, we first validate the influence of applying the features of different layers (see Table.2). Since the network contains too many convolution layers to test each layer's features separately, we manually selected several representative convolution layers and finally applied all the features of the network. It is easy to see that as the features of more and more convolution layers are applied, the localization quality of the CAM improves steadily, with mIoU rising from 44.76% to 53.88%. This further proves the importance of features for CAM.
Furthermore, the three ways to obtain CAM (traditional CAM [16], Grad-CAM [24], and Grad-CAM++ [35]) each have their advantages. Traditional CAM can only obtain the localization from the last layer's feature maps; Grad-CAM and Grad-CAM++ can obtain the localization from any layer's feature maps, but noise affects their localization. Their common point is that the localization is obtained only from the feature maps of a specific layer, which, as argued above, is a manifestation of insufficient use of information. As shown in Table.3, if the localization is obtained only from a specific layer's feature maps, its quality is difficult to improve. In contrast, our method achieves better results.
We believe that the classifier pays attention to different regions for different transformed input images, and we use random-resize and random-flip data augmentation when training the classifier. Therefore, we performed ablation experiments on multi-scale and flipping in the inference step. Table.4 shows the effectiveness of the introduced transformations: as the number of transformed images increases, the mIoU of the CAM rises from 51.28% to 53.88%. To further study the plug-and-play advantage of Fully-CAM, we mount our framework on multiple trained ordinary classifiers (see Table.5). Fully-CAM significantly improves the CAM of multiple backbones: there is a 2% improvement on VGG-16 and an increase of about 6% for ResNet-50 and ResNet-101. Given this phenomenon, we speculate that it is related to the gradients: ResNet has residual blocks, which significantly alleviate the vanishing gradient problem, whereas VGG-16, although not very deep, may still be affected by it.
Based on the above ablation studies, Fully-CAM exploits the potential of the ordinary classifier and yields the best CAM performance. Fig.5 illustrates the qualitative comparison between Fully-CAM and the traditional CAM based on ResNet-50; our method produces a more complete and accurate localization. Table.1 and Table.6 show the experimental results of our method and existing methods. To improve the accuracy of the pixel-level pseudo labels, we follow the previous work [12] and train an IRNet based on our revised CAM. The pseudo labels, after IRNet and the dense Conditional Random Field (dCRF) are applied, are used to train the semantic segmentation network DeepLab [36] with ResNet-101 for WSSS. As shown in Table.6, we achieve mIoU of 68.2% and 68.9% on the val and test sets, respectively, with the ResNet-101 backbone. These tables show that our method achieves better mIoU than most methods, illustrating that we have fully explored the potential of CAM.

C. COMPARISON WITH EXISTING METHODS
Furthermore, a qualitative comparison of the segmentation networks trained with pixel-level pseudo labels is shown in Fig.6. IRNet and our method use the same ResNet-50 backbone and the same training method. Clearly, the CAM obtained by our method dramatically improves the semantic segmentation performance: the original segmentation result (as shown in (c) of Fig.6) is rough and lacks many pixel labels, whereas our result (as shown in (d) of Fig.6) greatly makes up for these deficiencies and is more complete and refined. Admittedly, although we have surpassed most of the current advanced methods, there is still a gap between our work and the state of the art. Nevertheless, our research is one of the few to improve the performance of the CAM itself, compared with other works [37], [38], [39]. CAM has always been an indispensable part of WSSS, and we have experimentally proved that ordinary classifiers can exceed their original performance without additional training through our method.

V. CONCLUSION
In this work, we first profoundly rethink CAM. We find that the reason for the poor localization ability of CAM is not that the classifier can only highlight the most discriminative regions, but the insufficient use of information. Then, to fully explore the potential of the classifier, we visualize the CAM of each convolution layer of the classifier and find that the classifier can highlight whole object regions. Next, we propose Fully-CAM, designed as a plug-in unit that lets all feature maps participate in the localization task. Without complex training, the ultimate CAM highlights the whole area of the object. Finally, our CAM is used in the previous work, which significantly improves the performance of the previous method on the PASCAL VOC 2012 dataset. In the future, we will work on weakly supervised object detection with only image-level labels, since CAM is necessary for weakly-supervised tasks with image-level labels; we believe that the good CAMs obtained by our method can improve the performance of weakly supervised object detection.

XI WU received the Ph.D. degree from Southwest Jiaotong University. He is currently a Professor and the Deputy Dean of the Department of Computer Science, Chengdu University of Information Technology. He is also the Deputy Director of the Collaborative Innovation Center for Image and Geospatial Information of Sichuan Province, China. His main research interest includes computational intelligence cooperated with cognitive studies. He is also interested in novel methods for the analysis of imaging data, an area he entered after joining the Department of Computer Science, Chengdu University of Information Technology.