Abstract:
Recently, hybrid models integrating a CNN and a Transformer (ConvFormer), shown in Fig. 23.2.1, have achieved significant advancements in semantic segmentation tasks [1]–[4], which are critical for autonomous driving and embodied intelligence. The CNN enhances the multi-scale feature-extraction ability of the Transformer to achieve pixel-level classification, but the large token length (TL) demanded by semantic segmentation (>16K TL) incurs significant computation and memory overheads. Prior NN accelerators [5]–[12] demonstrate that sparse computing and pruning can effectively reduce computation and weight storage, but most of them focus on pure CNN or Transformer models in simpler vision or language-processing tasks (1–4K TL).

Moreover, the performance bottlenecks of ConvFormers stem from their memory-intensive Backbone and compute-intensive Segmentation Head (Seg. Head), raising three challenges for hardware acceleration:
1) Conventional sparse attention [5]–[9] fails to buffer the attention feature map (Fmap) on-chip when the TL exceeds 16K, even at 90% sparsity, resulting in massive external memory access (EMA).
2) While Layer-Fusion (LF) [13]–[18] is a common technique to reduce Fmap EMA, it is infeasible to buffer key (K), value (V), and convolution weights on-chip simultaneously; moreover, different fused attention-convolution layers may cover various vanilla attention (VA) tiles, leading to enormous redundant KV and weight EMA.
3) In the Seg. Head, the Fmap sparsity is extremely low, limiting the effectiveness of conventional zero-skipping strategies [11], [12] designed to reduce computational work.
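As a rough illustration of the first challenge, the sketch below estimates the attention feature-map footprint at different token lengths. It is a back-of-envelope calculation only: the 2 MB on-chip attention buffer and the 1-byte element width are assumed values for illustration, not figures from the paper.

```python
# Sketch: attention Fmap storage vs. token length, with and without 90% sparsity.
# Buffer size and element width below are illustrative assumptions.

def attention_fmap_bytes(token_len: int, sparsity: float = 0.0, bytes_per_elem: int = 1) -> int:
    """Storage for one token_len x token_len attention score map, keeping only non-zeros."""
    dense_elems = token_len * token_len
    kept_elems = int(dense_elems * (1.0 - sparsity))
    return kept_elems * bytes_per_elem

ON_CHIP_BUFFER_BYTES = 2 * 1024 * 1024  # assumed 2 MB attention buffer (hypothetical)

for tl in (1024, 4096, 16384):
    dense = attention_fmap_bytes(tl)
    sparse = attention_fmap_bytes(tl, sparsity=0.90)
    fits = "fits on-chip" if sparse <= ON_CHIP_BUFFER_BYTES else "exceeds buffer -> EMA"
    print(f"TL={tl:6d}: dense={dense / 2**20:7.1f} MB, 90%-sparse={sparse / 2**20:6.1f} MB ({fits})")
```

Under these assumptions, a 1–4K TL score map shrinks to well under 2 MB at 90% sparsity, whereas a 16K TL map still occupies roughly 25 MB, which is consistent with the abstract's point that conventional sparse attention cannot keep the Fmap on-chip at segmentation-scale token lengths.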
Date of Conference: 16-20 February 2025
Date Added to IEEE Xplore: 06 March 2025