Abstract:
Recently, hybrid models integrating a CNN and a Transformer (ConvFormer), shown in Fig. 23.2.1, have achieved significant advancements in semantic segmentation tasks [1]–[4], which are critical for autonomous driving and embodied intelligence. The CNN enhances the multi-scale feature-extraction ability of the Transformer to achieve pixel-level classification, but the large token length (TL) demanded by semantic segmentation (>16K TL) incurs significant computation and memory overheads. Prior NN accelerators [5]–[12] demonstrate that sparse computing and pruning can effectively reduce computation and weight storage, but most of them focus on pure CNN or Transformer models in simpler vision or language-processing tasks (1–4K TL).

Moreover, the performance bottlenecks of ConvFormers stem from their memory-intensive Backbone and compute-intensive Segmentation Head (Seg. Head), raising three challenges for hardware acceleration:
1) Conventional sparse attention [5]–[9] fails to buffer the attention feature map (Fmap) on-chip when the TL exceeds 16K, even at 90% sparsity, resulting in massive external memory access (EMA).
2) While Layer-Fusion (LF) [13]–[18] is a common technique to reduce Fmap EMA, it is infeasible to buffer key (K), value (V), and convolution weights on-chip simultaneously; moreover, different fused attention-convolution layers may cover various vanilla attention (VA) tiles, leading to enormous redundant KV and weight EMA.
3) In the Seg. Head, the Fmap sparsity is extremely low, limiting the effectiveness of conventional zero-skipping strategies [11], [12] designed to reduce computational work.
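As a rough illustration of the first challenge, the sketch below estimates the attention feature-map footprint at different token lengths. It is a back-of-envelope calculation only: the 2 MB on-chip attention buffer and the 1-byte element width are assumed values for illustration, not figures from the paper.

```python
# Sketch: attention Fmap storage vs. token length, with and without 90% sparsity.
# Buffer size and element width below are illustrative assumptions.

def attention_fmap_bytes(token_len: int, sparsity: float = 0.0, bytes_per_elem: int = 1) -> int:
    """Storage for one token_len x token_len attention score map, keeping only non-zeros."""
    dense_elems = token_len * token_len
    kept_elems = int(dense_elems * (1.0 - sparsity))
    return kept_elems * bytes_per_elem

ON_CHIP_BUFFER_BYTES = 2 * 1024 * 1024  # assumed 2 MB attention buffer (hypothetical)

for tl in (1024, 4096, 16384):
    dense = attention_fmap_bytes(tl)
    sparse = attention_fmap_bytes(tl, sparsity=0.90)
    fits = "fits on-chip" if sparse <= ON_CHIP_BUFFER_BYTES else "exceeds buffer -> EMA"
    print(f"TL={tl:6d}: dense={dense / 2**20:7.1f} MB, 90%-sparse={sparse / 2**20:6.1f} MB ({fits})")
```

Under these assumptions, a 1–4K TL score map shrinks to well under 2 MB at 90% sparsity, whereas a 16K TL map still occupies roughly 25 MB, which is consistent with the abstract's point that conventional sparse attention cannot keep the Fmap on-chip at segmentation-scale token lengths.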
Date of Conference: 16-20 February 2025
Date Added to IEEE Xplore: 06 March 2025