Position-Aware Anti-Aliasing Filters for 3D Medical Image Analysis

Maximum pooling, average pooling, and strided convolution are three widely adopted down-sampling approaches in deep-learning-based 3D medical image analysis. However, these methods have their own pros and cons. Maximum pooling and strided convolution are advantageous in capturing discriminative features but often lead to the aliasing problem. In comparison, average pooling anti-aliases the representations but produces less discriminative representations. To address this shortcoming, anti-aliased maximum pooling (MaxBlurPool) uses low-pass filters to mitigate the aliasing effect. However, these filters are fixed by design and therefore cannot adapt to different spatial positions. In this paper, we propose Position-aware Anti-aliaSing filterS (PASS) to learn spatially adaptive low-pass filters. Compared to maximum pooling, PASS integrates a one-layer local attention module, whose computational cost is minimal. Thus, PASS can be incorporated into existing network architectures with minor effort. In comparison to previous anti-aliased counterparts, PASS brings consistent and clear performance gains on brain tumor segmentation, pulmonary nodule detection, and cerebral hemorrhage detection. Besides, PASS also greatly improves model robustness under adversarial attacks.

rich representations from the high-level layers of DCNNs, which are often more generalizable than low-level features. In 3D medical image analysis with CNNs, maximum pooling and average pooling are two widely adopted down-sampling approaches. Down-sampling can also be implemented by increasing the stride of convolution; the resulting operation is often named strided convolution. (The associate editor coordinating the review of this manuscript and approving it for publication was Ramakrishnan Srinivasan.)

As aforementioned, maximum and average pooling adopt two different ways to implement down-sampling, and they face different problems accordingly. Maximum pooling captures the most predominant parts, which makes the produced features (cf. Fig. 1b) discriminative. Nonetheless, maximum pooling layers in DCNNs inevitably result in aliasing because of the preserved high-frequency signals. This characteristic makes high-level semantic representations sensitive to small shifts [1]. In comparison, average pooling is anti-aliased and thus helps preserve shift invariance in DCNNs. However, the outputs of average pooling are often less discriminative than those of maximum pooling, as shown in Fig. 1c. As a result, the performance of average pooling is often inferior to that of maximum pooling in a range of tasks [2]. For strided convolution, Zhang [1] pointed out that it suffers from the same issue as maximum pooling does.

On the other hand, applying low-pass filtering is the default anti-aliasing solution in traditional signal processing. Inspired by this, anti-aliased maximum pooling (MaxBlurPool) integrates low-pass filters into maximum pooling. Our PASS can likewise be integrated into existing 3D medical imaging models as a plug-and-play component.

In PASS, we calculate the Hadamard product of the local attention matrix and a pre-defined Gaussian blur filter, and the result is passed to blur normalization to ensure it is a low-pass filter.
We then convolve the feature map produced by maximum pooling (stride=1) with the obtained position-aware blur filter. Finally, we sub-sample the convolved result to acquire the down-sampled feature map. In the following, we describe each step in detail.
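As a sketch of the pre-defined Gaussian blur filter mentioned above (the kernel size and standard deviation here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def gaussian_blur_filter(k=3, sigma=1.0):
    """Build a k x k x k Gaussian blur filter, normalized to sum to 1
    so that it acts as a low-pass (weighted-averaging) filter."""
    ax = np.arange(k) - (k - 1) / 2.0
    u, v, w = np.meshgrid(ax, ax, ax, indexing="ij")
    b = np.exp(-(u ** 2 + v ** 2 + w ** 2) / (2.0 * sigma ** 2))
    return b / b.sum()

B = gaussian_blur_filter(3, 1.0)
assert B.shape == (3, 3, 3)
assert np.isclose(B.sum(), 1.0)       # weights sum to 1 (low-pass)
assert B[1, 1, 1] == B.max()          # centre has the largest weight
```

Because the weights sum to one and decay with distance from the centre, convolving with this filter attenuates high-frequency content, which is exactly the anti-aliasing behaviour the method relies on.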

Step (i): Non-strided maximum pooling (stride=1). Most maximum pooling operations employ a stride of 2, which can be decomposed into two procedures: maximum pooling with stride=1 (i.e., non-strided maximum pooling) and sub-sampling with stride=2. Suppose F^l ∈ R^(H×W×D×C) denotes the input features to the l-th layer.
Step (i) can be summarized as follows:

F̃^l = MaxPool_{2,1}(F^l),

where the subscripts {2, 1} denote the kernel size and stride of maximum pooling, respectively. The goal of non-strided maximum pooling is to preserve the discriminative features in the feature maps.
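The decomposition in Step (i) can be checked with a minimal 1-D numpy sketch (the helper `maxpool1d` is ours for illustration, not from the paper's code):

```python
import numpy as np

def maxpool1d(x, k, stride):
    """1-D max pooling over valid windows of size k with the given stride."""
    return np.array([x[i:i + k].max() for i in range(0, len(x) - k + 1, stride)])

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

direct = maxpool1d(x, k=2, stride=2)           # standard MaxPool_{2,2}
decomposed = maxpool1d(x, k=2, stride=1)[::2]  # MaxPool_{2,1}, then stride-2 sub-sampling

assert np.array_equal(direct, decomposed)      # the two are identical
```

The equivalence is what lets PASS insert its anti-aliasing filter between the non-strided pooling and the sub-sampling without changing the baseline behaviour.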

Step (ii): Position-aware anti-aliasing. As aforementioned, maximum pooling inevitably produces aliasing effects because of the maximization operation. Although the obtained features are discriminative, they also adversely affect shift robustness: in practice, the learned high-level representations with rich semantics may vary a lot with a small shift in the input [1], which severely deteriorates the performance and robustness.

We mitigate the above problem by applying position-aware anti-aliasing to the result of non-strided maximum pooling. Specifically, a one-layer local attention module predicts a local weight matrix P^l_i ∈ R^(K×K×K) for each spatial position i. The pre-defined blur filter B ∈ R^(K×K×K) is derived from a multivariate Gaussian distribution, which can be formalized as follows:

B(u, v, w) ∝ exp(−(u² + v² + w²) / (2σ²)),

where (u, v, w) indexes the K×K×K kernel and σ is the standard deviation. Next, we calculate the Hadamard product of the local weight matrix P^l_i and the blur filter B:

Q^l_i = P^l_i ⊙ B,

where ⊙ stands for the Hadamard product operator, and Q^l_i ∈ R^(K×K×K). To ensure Q^l_i is a low-pass filter, we apply blur normalization to Q^l_i:

Q̂^l_i = softmax(Q^l_i).

In practice, we found the softmax function works well for blur normalization. We then convolve the non-strided max pooling output F̃^l with the normalized position-aware filters, yielding the anti-aliased feature map F̂^l.

Step (iii): Sub-sampling (stride=2). Finally, we apply a sub-sampling operation with stride=2 to reduce the spatial dimension of F̂^l, resulting in the output feature map F^(l+1). F^(l+1) serves as the input to the next layer in DCNNs.
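Putting the three steps together, here is a toy 2-D numpy sketch (the real method operates on 3-D feature maps, and random logits stand in for the learned one-layer attention module; names like `pass_downsample_2d` are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pass_downsample_2d(F, attn_logits, B, k=3):
    """Sketch of PASS on a 2-D map: (i) non-strided 2x2 max pooling,
    (ii) per-position blurring with Q_i = softmax(P_i * B),
    (iii) stride-2 sub-sampling."""
    H, W = F.shape
    # (i) max pooling with kernel 2, stride 1 (valid windows only)
    M = np.zeros((H - 1, W - 1))
    for i in range(H - 1):
        for j in range(W - 1):
            M[i, j] = F[i:i + 2, j:j + 2].max()
    # (ii) position-aware anti-aliasing
    p = k // 2
    Mp = np.pad(M, p, mode="edge")
    out = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            P = attn_logits[i, j]                       # k x k local attention weights
            Q = softmax((P * B).ravel()).reshape(k, k)  # Hadamard product + blur normalization
            out[i, j] = (Mp[i:i + k, j:j + k] * Q).sum()
    # (iii) sub-sample with stride 2
    return out[::2, ::2]

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8))
ax = np.arange(3) - 1.0
u, v = np.meshgrid(ax, ax, indexing="ij")
B = np.exp(-(u ** 2 + v ** 2) / 2.0)      # fixed Gaussian blur filter
attn = rng.standard_normal((7, 7, 3, 3))  # stand-in for learned attention logits
out = pass_downsample_2d(F, attn, B)
assert out.shape == (4, 4)
```

Because each normalized filter Q̂ is non-negative and sums to one, every output value is a convex combination of its neighbourhood, so the result never exceeds the input maximum — the blur cannot amplify high-frequency content.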

Training loss function. For brain tumor segmentation, the training loss function is a weighted summation of the cross-entropy loss and the dice loss. Specifically, the cross-entropy loss is formulated as:

L_CE = −(1/U) Σ_{u=1..U} Σ_{z=1..Z} y_u^z log p_u^z,

where Z stands for the number of categories, U denotes the number of voxels in the predicted segmentation mask, y_u^z denotes the ground-truth binary label of category z at the u-th voxel, and p_u^z is the corresponding predicted probability of category z.

On the basis of the above notations, the dice loss is formulated as follows:

L_Dice = 1 − (1/Z) Σ_{z=1..Z} [ 2 Σ_u y_u^z p_u^z / (Σ_u y_u^z + Σ_u p_u^z) ].

The training loss function for segmentation is a weighted combination of the cross-entropy loss and the dice loss. The cross-entropy loss measures the pixel-level classification accuracy while the dice loss alleviates the data imbalance problem:

L = L_CE + λ L_Dice,

where λ balances the two terms.

We implement PASS using PyTorch [21]. For fairness, we carefully tune hyper-parameters for each dataset, where the baselines and our PASS share the same training protocol. We save the checkpoint which produces the lowest loss value and use it for testing.

Experimental results using different down-sampling methodologies are presented in Table 1. Comparing MaxPool with AvgPool, we see that maximum pooling is more advantageous in segmenting the whole tumor (WT) while AvgPool performs better on the tumor core (TC) and enhancing tumor (ET). Considering that WT is much larger than TC and ET, the above comparisons demonstrate that aliasing adversely affects the segmentation of small objects. By comparing StridedConv with MaxPool, we find that these two down-sampling approaches display similar performance; both outperform AvgPool on the segmentation of WT while showing slightly worse results on TC and ET. This phenomenon verifies the conclusion provided by [1] that StridedConv and MaxPool have similar characteristics.
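The cross-entropy and dice losses described earlier can be sketched in numpy as follows (the balancing weight `lam` and the small constant `eps` are hypothetical placeholders; the paper's actual values are not given here):

```python
import numpy as np

def ce_loss(p, y, eps=1e-8):
    """Cross-entropy averaged over U voxels.
    p, y: arrays of shape (U, Z); p holds predicted probabilities,
    y holds one-hot ground-truth labels."""
    return -np.mean(np.sum(y * np.log(p + eps), axis=1))

def dice_loss(p, y, eps=1e-8):
    """Soft dice loss averaged over the Z classes."""
    inter = np.sum(p * y, axis=0)
    denom = np.sum(p, axis=0) + np.sum(y, axis=0)
    return 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))

def total_loss(p, y, lam=1.0):
    # weighted combination; lam is an assumed placeholder weight
    return ce_loss(p, y) + lam * dice_loss(p, y)

# sanity check: perfect predictions drive both losses towards zero
y = np.eye(3)[[0, 1, 2, 1]]   # 4 voxels, 3 classes, one-hot
assert abs(ce_loss(y, y)) < 1e-6
assert dice_loss(y, y) < 1e-6
```

The dice term divides each class's overlap by its total mass, so small classes (such as TC and ET here) contribute as much to the loss as large ones, which is why it alleviates class imbalance.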

For state-of-the-art approaches, GTPool surpasses MaxPool, AvgPool, and StridedConv on all three tumor categories by clear margins. This is consistent with the superiority of the gate function used in GTPool, which incorporates the advantages of MaxPool and AvgPool by mixing their outputs. Similar to GTPool, MaxBlurPool also integrates a low-pass filter with maximum pooling for anti-aliasing while maintaining the discriminative features. Compared to GTPool, MaxBlurPool achieves consistent improvements on the three tumor classes. When we replace MaxBlurPool with our PASS, we observe very obvious improvements, especially on TC and ET, which are smaller and thus harder to segment than WT. These improvements reflect that anti-aliased yet discriminative representations may aid the discovery of small objects. Besides, we find PASS is complementary to Attention U-Net even though Attention U-Net employs multiple attention modules in the network. The reason might be that the attention modules of Attention U-Net mainly lie

indicating that aliasing may adversely affect the discovery of small objects, as the learned high-level semantics are highly influenced by aliasing. Again, StridedConv displays similar performance as MaxPool does. By integrating the advantages of maximum and average pooling, GTPool brings a 0.5-percent improvement over AvgPool. MaxBlurPool obviously surpasses GTPool by incorporating an explicit low-pass filter into maximum pooling. Once again, our PASS surpasses MaxBlurPool by 2 percent, again verifying the advantage of learning position-aware anti-aliased representations.

A t-test validation is conducted on all three datasets. We compute p-values between the best and the second-best results. Specifically, on BraTS-2018, we calculate two p-values based on the mean dice scores of 3D U-Net and Attention U-Net. The p-values on brain tumor segmentation, pulmonary nodule detection, and in-house cerebral hemorrhage detection are 6.25e-3 (3D U-Net) / 7.32e-3 (Attention U-Net), 9.34e-3, and 3.74e-3, respectively. All p-values are smaller than 0.01, indicating that the performance improvements brought by our PASS are statistically significant at the 1% significance level.
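For illustration, the t statistic underlying such a paired test can be computed as follows (the per-fold dice scores below are hypothetical, not the paper's; the p-value would then be read off the t distribution, e.g. with `scipy.stats`):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired t-test between two sets of per-case scores."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

# hypothetical per-fold dice scores for two methods
pass_scores = [0.842, 0.851, 0.848, 0.839, 0.855]
blur_scores = [0.826, 0.838, 0.831, 0.824, 0.840]
t = paired_t_statistic(pass_scores, blur_scores)
assert t > 0  # the first method scores higher on every fold
```

A large positive t with the corresponding p-value below 0.01 is what supports the significance claim above.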

In this section, we conduct ablation studies to investigate the impacts of different modules in PASS. All ablative experiments were performed on brain tumor segmentation (BraTS-2018).

TABLE 3. In-house cerebral hemorrhage detection. ↑ means higher is better. The best result is bolded. The p-value is calculated between our PASS and MaxBlurPool.

TABLE 4.
Ablation study on brain tumor segmentation. The baseline is 3D U-Net with maximum pooling (row 0). The evaluation metric is mean dice score. Conv and IN represent the convolution layer and instance normalization layer, respectively. B stands for the blur filter and K denotes the blur kernel size.
pulmonary nodule detection and cerebral hemorrhage detection, respectively. These comparisons further validate the anti-aliasing characteristic of PASS, which helps improve model robustness. Additionally, we observe that average pooling performs better than maximum pooling and strided convolution, again indicating that anti-aliased features do help models resist adversarial perturbations.

2) VISUAL ANALYSIS

We follow [27] to visualize the segmentation results in Fig. 3. We can see that our PASS greatly reduces small-sized false-positive predictions. For instance, MaxBlurPool produces many isolated noisy predictions because it cannot adaptively anti-alias different contents. In comparison, our PASS greatly reduces false-positive segmentations. In addition, we see that PASS performs best on the segmentation of the tumor core, which is consistent with the result reported in Table 1.

We propose Position-aware Anti-aliasing Filters (PASS) to adaptively anti-alias high-level representations with rich semantics. PASS introduces a position-aware local attention module to typical maximum pooling. PASS comprises only one convolutional layer, making it computationally efficient to replace existing down-sampling methods. Compared to typical pooling strategies and previous anti-aliasing counterparts, our PASS produces observable and consistent improvements on a variety of medical imaging tasks, including brain tumor segmentation, pulmonary nodule detection, and cerebral hemorrhage detection.