GSM-HM: Generation of Saliency Maps for Black-Box Object Detection Model Based on Hierarchical Masking

Interpretability of DNN-based object detection has been a rising concern in the research community. The first step toward this goal is a saliency map that visualizes the importance (saliency) of each pixel in an image for the object detected by a specific model. Black-box methods generate a saliency map without looking into the internals of a model, and are thus applicable to all models without adaptation. In addition, they evaluate pixel saliency more reliably than white-box methods, by measuring the effect of removing those pixels from the image. However, current black-box methods produce the absence of pixels with random image masks. Even with the great number of random masks required for sufficient coverage, the quality of the estimated pixel saliency is not assured to be satisfactory. In this work, we propose a more effective black-box framework based on hierarchical masking. In this framework, called GSM-HM, pixel saliency is evaluated at multiple levels, with each lower level refining the saliency information of the level above it. This hierarchical framework significantly reduces the masking effort spent on less valuable pixels, and can therefore produce saliency maps of higher quality. In our experiments, the quality of a generated saliency map is evaluated with four different metrics: deletion, insertion, convergence, and RAM (the ratio of average to maximum). Compared with D-RISE, a recent black-box method, GSM-HM generates more accurate saliency maps as measured by these metrics.

In this paper, we propose a black-box saliency map generation method with hierarchical masking, which updates the saliency scores of pixels hierarchically and refines the saliency map smoothly. Our contributions are as follows:
• We design a hierarchical masking framework consisting of a coarse-grained phase that identifies the approximate saliency areas for an object, and a fine-grained phase that refines the saliency scores of these areas.
• We propose an adaptive mask generation mechanism using l-nearest neighbors. It adapts automatically to objects of different sizes.

• We propose a new evaluation metric to better evaluate the quality of saliency maps.

• Compared with D-RISE, the saliency maps generated by our hierarchical framework have less ''noise'' and better represent the saliency areas of objects.

Saliency maps can be a useful tool for understanding, evaluating and optimizing object detection models. There exist two basic approaches for generating saliency maps.

A. GRADIENT-BASED APPROACH
In gradient-based methods, the saliency map is calculated from gradients in the model. Simonyan et al. proposed a gradient-based method named Gradient [12]. This method obtains the derivative of each pixel of the input image by backward propagation, and then rearranges the derivative vector to obtain the saliency map of the input image. For multi-channel images, Gradient takes the maximum derivative for each pixel across all channels. CAM [13] replaces the fully connected layer in the classifier with a global average pooling layer, and obtains the weight of each feature map after global average pooling according to the softmax results. The class activation map is then generated as a weighted sum of these feature maps. However, CAM has to modify the architecture of the model, which imposes additional overhead on model training. A more general method, Grad-CAM [9], was therefore proposed. Grad-CAM calculates the average gradient of the last convolutional layer as the weight, and sums the feature maps of the last layer according to these weights to obtain the saliency map. There are other methods based on CAM, such as Grad-CAM++ [14]. Based on Grad-CAM, LayerCAM [15] was proposed to generate a saliency map for every layer of a convolutional neural network. LayerCAM can obtain more fine-grained salient information about an object from the class activation maps of shallow layers.
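The Grad-CAM weighting step described above can be sketched as follows (a minimal NumPy sketch; the feature maps and their gradients are assumed to have been extracted from the network beforehand, and all array names are illustrative, not the authors' implementation):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Combine conv feature maps into a class activation map, Grad-CAM style.

    feature_maps: (C, H, W) activations of the last conv layer.
    gradients:    (C, H, W) gradients of the class score w.r.t. those activations.
    """
    # Global-average the gradients to get one weight per channel.
    weights = gradients.mean(axis=(1, 2))              # (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] for visualization.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The per-channel averaging is what distinguishes Grad-CAM from plain gradient maps: one scalar weight per feature map, rather than a per-pixel derivative.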

As explained earlier, this approach has the problem that a class activation map is not equivalent to a saliency map: the absence of part of an area with strong activation might have no appreciable effect on object detection.

B. BLACK-BOX APPROACH
Black-box methods are not based on gradient information, so there is no need to examine the internals of the deep models. Instead, the impact of an area on object detection is evaluated by masking it off the input image. This makes black-box methods applicable to all deep models without any adaptation. The saliency map generated this way is also more objective, as the impact on object detection is evaluated from the actual absence of an area.

For comparison, we describe a baseline black-box framework for saliency map generation, which is essentially the framework used in D-RISE [10]. Suppose we have a DNN-based object detector D that can output bounding boxes, labels and confidence scores of detected objects for a given image I. Our goal is to generate a saliency map for each of the detected objects in this image.

First, the input image I is perturbed by a mask M that covers some pixels of the image, and the masked image I' is tested by the object detector D. For a specific object O in image I, we obtain the original confidence score CS_I and the masked confidence score CS_I' from the masked image I', as shown in Fig. 1. In addition, we obtain the IoU of the bounding boxes in I and I', which is denoted as IoU_I,I' in Fig. 1. The saliency score of mask M, SS_M, can be calculated with formula (1). Essentially, the larger SS_M is, the more likely it is that at least part of the masked pixels are salient for object O. Thus we can generate a saliency map specific to mask M, where the masked pixels are assigned the saliency score SS_M.
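The accumulation of per-mask scores into a pixel-level saliency map can be sketched as follows (a simplified sketch of the baseline scheme, assuming binary masks and a black-box scoring callable `saliency_score` standing in for formula (1); all names are illustrative):

```python
import numpy as np

def baseline_saliency_map(image, masks, saliency_score):
    """RISE/D-RISE-style baseline: average per-mask scores over masked pixels.

    image:          (H, W, 3) input image.
    masks:          iterable of (H, W) binary arrays, 1 = pixel masked off.
    saliency_score: callable(masked_image) -> float, i.e. SS_M obtained by
                    running the detector on the masked image.
    """
    h, w = image.shape[:2]
    saliency = np.zeros((h, w))
    counts = np.zeros((h, w))
    for m in masks:
        masked = image * (1 - m[..., None])   # black out masked pixels
        score = saliency_score(masked)
        saliency += score * m                 # credit all masked pixels equally
        counts += m
    return saliency / np.maximum(counts, 1)   # average over mask coverage
```

The averaging over coverage is exactly what makes many diversified masks necessary: a pixel's score is trustworthy only once it has appeared in enough different masks.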

However, a single saliency map obtained from a specific mask has two drawbacks. On one hand, it may miss other salient pixels that are not covered by the mask. On the other hand, some masked pixels might actually not be salient for this object; they are assigned a high saliency score simply because they happen to be masked together with other salient pixels. Therefore, in the baseline framework, a large set of diversified masks is randomly generated and applied to the input image.

Next, the masked images are fed into the object detector to obtain the saliency scores for the corresponding masks. These mask saliency scores are then used to update the cell saliency values. When this process terminates at the lowest level, the final saliency map is generated. GSM-HM is divided into a coarse-grained phase, followed by a fine-grained phase, each having several levels. At the coarse-grained levels, we obtain relatively accurate prior saliency knowledge of cells. At the fine-grained levels, the saliency area of an object is continuously refined.

The first step in GSM-HM is mask generation at a level l, and its algorithm is described in this part.
Based on the fact that the salient pixels for an object naturally form a set of continuous areas of different sizes, we propose an l-nearest neighbors mask generation mechanism. For each mask at level l, a cell is first selected as the center of the masking area with a probability weighted by its saliency value SV (cells with higher saliency values thus have higher chances of selection). Next, the l-nearest neighbors of the center cell are selected to form the whole masking area. Fig. 4(b) shows the blue cell as the center cell of the masking area, with the yellow cells as its neighbors.

The increase in the number of masked cells and the shrinking of the masked area, together with the weighted selection mechanism for the center cell, facilitate generating high-quality saliency maps for both large and small objects, for the following reasons. First, the weighted selection mechanism, which is based on saliency information from previous levels, guides the generation of masks toward areas with higher saliency. This means we have more chances to refine the shapes of salient areas at lower levels. Second, at higher levels, the large masked areas can very quickly identify potential saliency areas, although they may contain a significant portion of non-salient pixels for small objects. Last, at lower levels, the masked areas consist of smaller cells, which help refine the saliency areas identified at upper levels. In particular, the refinement of the saliency areas more often confirms saliency in the case of large objects, and more often excludes saliency in the case of small objects. The process of the l-nearest neighbors mask generation algorithm is described below.
At each level, cells with non-zero saliency values are put into C as candidates for center cells. Note that level 0 is a special level because it has no previous level, so all cells at level 0 are set to the same cell saliency value and put into the center cell candidate set C. The center cell is randomly sampled from C, weighted by the cell saliency values. When a cell is chosen as the center cell, it is removed from C. Then, the l-nearest neighbors mask is generated from the center cell and its neighbors.

We design different mask generation termination conditions for the coarse-grained and fine-grained levels. Generally, the mask generation termination condition is a fixed number of masks; for example, the number of masks used by D-RISE [10] is 5000. However, since properties of objects such as size and shape differ, a fixed number of masks may not achieve ideal performance for all objects. To improve the adaptability of the masks to objects, we design a mask generation termination condition based on the sum of cell saliency values over the candidate set, SV_C, as shown in (2).
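The sampling loop described in this part can be sketched as follows (a simplified sketch; the cell-coordinate grid, the saliency dictionary `SV`, and the Euclidean neighbor ordering are assumptions based on the description, not the authors' exact implementation):

```python
import numpy as np

def generate_masks(SV, l, rng=None):
    """Yield l-nearest-neighbor masks, one per weighted-sampled center cell.

    SV: dict mapping cell coordinates (row, col) -> saliency value (> 0).
    l:  number of nearest neighbor cells masked around each center.
    """
    rng = rng or np.random.default_rng()
    candidates = dict(SV)  # the center cell candidate set C
    while candidates:
        cells = list(candidates)
        weights = np.array([candidates[c] for c in cells], dtype=float)
        # Weighted random sampling of the center cell by saliency value.
        idx = rng.choice(len(cells), p=weights / weights.sum())
        center = cells[idx]
        del candidates[center]  # remove the chosen center from C
        # Its l nearest neighbors (squared Euclidean distance on the grid).
        by_dist = sorted(SV, key=lambda c: (c[0] - center[0]) ** 2
                                           + (c[1] - center[1]) ** 2)
        yield {center, *by_dist[1:l + 1]}  # by_dist[0] is the center itself
```

Each yielded set of cells would then be rasterized into a pixel mask for the detector; the loop ends exactly when C becomes empty, which is the coarse-grained termination condition.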
At the coarse-grained levels, we aim to obtain relatively accurate prior knowledge about cell saliency, so the mask generation termination condition at these levels is that the center cell candidate set C becomes ∅.

At the fine-grained levels, due to the large number of cells, a large number of masks would need to be generated if every cell in C were selected as a center. To reduce the number of masks at each of these levels, we set a termination condition on SV_C, where C contains the cells that have not yet been selected as mask centers. The rationale is that if the sum of the saliency scores of the remaining candidate cells in C falls below a certain level, the need to try more masks becomes low.

The termination conditions of the mask generation process are different for the coarse-grained and fine-grained levels, as described in Section III-D. At each fine-grained level, we set the mask generation termination threshold δ = 0.5. We set the weight for cell saliency calculation to α = 0.5. D-RISE [10] is used as the comparison method.
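The fine-grained termination test can be sketched as follows (a sketch under the assumption that generation stops once the saliency mass remaining in C falls below the fraction δ of the level's total; the exact formula is the paper's equation for SV_C, so the normalization here is illustrative):

```python
def should_terminate(candidates, SV, delta=0.5):
    """Stop generating masks when the saliency mass left in C is small.

    candidates: set of cells not yet used as mask centers (the set C).
    SV:         dict cell -> saliency value for all cells at this level.
    delta:      termination threshold (the paper uses delta = 0.5).
    """
    total = sum(SV.values())
    remaining = sum(SV[c] for c in candidates)  # SV_C
    return total == 0 or remaining < delta * total
```

Because high-saliency cells are sampled first, the remaining mass drops quickly, so this condition typically fires long before every cell has served as a center.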

To measure the quality of saliency maps generated by different techniques, this paper adopts four different evaluation metrics, namely Deletion [10], Insertion [10], Convergence [20], and RAM (the ratio of average to maximum). These four metrics evaluate a saliency map from different aspects.

The right part of the formula was also used for calculating the saliency score of a mask in (1). DDR_I,I' drops faster if the correlation between the saliency area and the object is stronger. Different from deletion, insertion fills the pixels into the original image one by one.

VOLUME 10, 2022
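Deletion-style curves are conventionally summarized by their area under the curve. A generic sketch (assuming a scalar detection-score callable `score` and a saliency map `sal`; both names and the batching scheme are illustrative, not the paper's exact protocol):

```python
import numpy as np

def deletion_auc(image, sal, score, steps=10):
    """Delete pixels in decreasing saliency order; lower AUC = better map.

    image: (H, W, C) input image.
    sal:   (H, W) saliency map for one detected object.
    score: callable(image) -> float detection score for that object.
    """
    order = np.argsort(sal.ravel())[::-1]          # most salient pixels first
    flat = image.copy().reshape(-1, image.shape[-1])
    scores = [score(flat.reshape(image.shape))]
    per_step = max(1, len(order) // steps)
    for i in range(0, len(order), per_step):
        flat[order[i:i + per_step]] = 0            # black out the next batch
        scores.append(score(flat.reshape(image.shape)))
    s = np.asarray(scores, dtype=float)
    # Trapezoidal area under the score-vs-fraction-deleted curve.
    return float((s.sum() - 0.5 * (s[0] + s[-1])) / (len(s) - 1))
```

Insertion is the mirror image: start from a blank (or blurred) image and add pixels in the same order, where a faster rise, and thus a higher AUC, is better.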

In GSM-HM, a saliency map is generated at each level of granularity, as shown in Fig. 5. At the initial level, we can obtain some rough saliency areas of an object and exclude some irrelevant areas. Compared to non-hierarchical random methods like D-RISE, this helps reduce the effort spent on a large portion of the image during saliency evaluation at lower levels, which have increasingly larger numbers of cells to test. Fig. 6 gives the saliency maps generated by GSM-HM and D-RISE for several objects of different sizes and shapes, respectively. We have three major observations.

Table 1 gives a comparison between GSM-HM and D-RISE on the mean value of each of the four evaluation metrics over the test dataset. It can be seen that GSM-HM is consistently better than D-RISE. A more in-depth analysis of the four metrics is given in the rest of this part.

Since the magnitudes of saliency maps generated by different methods differ, we first normalize the saliency maps.

FIGURE 7. An example of deletion and insertion metrics in GSM-HM and D-RISE.
Then the convergence metric is calculated with formula (7). From the mean convergence values shown in Table 1, we can see that the convergence of GSM-HM is considerably better. Table 2 gives some examples of convergence results. Each pair of D-RISE results has a larger Euclidean distance, which means the saliency areas of different runs are significantly different. The poor convergence performance of D-RISE is caused by the high saliency of irrelevant areas. The example in Table 3 further illustrates the reason for the poor convergence of D-RISE: areas far from the motorcycle object have different degrees and distributions of saliency, and this kind of difference causes D-RISE's poor performance on the convergence metric. In contrast, in our GSM-HM framework, salient areas are refined with the guidance of prior knowledge obtained at coarse granularities, and the salient areas are closely restricted to the object itself; GSM-HM thus achieves much better convergence by excluding irrelevant areas from object saliency.

When calculating the RAM metric with formula (8), a max-min normalization of the saliency map is needed, so that the results of different methods are on the same scale.
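Both metrics reduce to simple array operations; a sketch follows (assuming, based on the text, that convergence is the Euclidean distance between max-min-normalized maps from two independent runs and that RAM is the mean-to-max ratio of a normalized map; the paper's exact definitions are formulas (7) and (8)):

```python
import numpy as np

def normalize(sal):
    """Max-min normalize a saliency map to [0, 1]."""
    lo, hi = sal.min(), sal.max()
    return (sal - lo) / (hi - lo) if hi > lo else np.zeros_like(sal)

def convergence(sal_a, sal_b):
    """Euclidean distance between two runs' normalized maps (lower = better)."""
    return float(np.linalg.norm(normalize(sal_a) - normalize(sal_b)))

def ram(sal):
    """Ratio of average to maximum saliency (lower = less diffuse 'noise')."""
    n = normalize(sal)
    return float(n.mean() / n.max()) if n.max() > 0 else 0.0
```

A map that concentrates saliency on the object has a low mean relative to its peak, hence a low RAM; a map with widespread background ''noise'' has a high one.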

As can be seen from Table 1, the RAM value of D-RISE is higher than that of GSM-HM. As described earlier, the RAM metric is introduced to study the ''noise'' in a saliency map. As expected, the ''noise'' in the saliency maps generated by D-RISE leads to high RAM values. Table 4 gives a concrete example for the truck object in the image.

GSM-HM also generates saliency maps with less ''noise'' than D-RISE. We also perform quantitative comparisons with D-RISE using four metrics. Experimental results demonstrate that our method is able to generate more accurate object-related saliency maps.

In future work, how to generate accurate saliency maps with significantly fewer masks is a research topic that deserves investigation. The current masking method also needs to be improved, as it may introduce new artifacts when covering an area with black pixels. We also have a strong interest in using saliency maps as a tool for investigating some tricky issues raised in object detection.