
RTSA: A Run-Through Sparse Attention Framework for Video Transformer



Abstract:

In the realm of video understanding tasks, Video Transformer models (VidT) have recently delivered impressive accuracy improvements on numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms optimize only one of the two key modules on VidT's critical path: linear projection or self-attention. Because the battery power of edge devices varies, the resolution of the video they generate also changes, so either the linear projection or the self-attention stage can become the bottleneck; existing approaches therefore lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that sparsifies and accelerates both stages simultaneously. On the algorithm side, unlike current methodologies that perform sparse linear projection by exploiting redundancy within each frame, we extract the additional redundancy that naturally exists between frames. Moreover, for sparse self-attention, existing pruning algorithms provide sparsity patterns that are either too coarse-grained or too fine-grained, so they cannot simultaneously achieve high sparsity, low accuracy loss, and high speedup, and they yield either compromised accuracy or reduced efficiency. We therefore prune the attention matrix at a medium granularity, the sub-vector. The sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that using distinct computational units for sparse linear projection and sparse self-attention leads to pipeline imbalance, because the bottleneck shifts between the two stages. To effectively eliminate pipeline stalls, we design an RTSA architecture that supports sequential execution of both sparse linear projection and sparse self-attention.
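For intuition only, the sketch below illustrates the kind of column-wise sub-vector pruning the abstract describes: each column of an attention matrix is isolated, split into fixed-length sub-vectors, and the least important sub-vectors are zeroed out. The sub-vector length, the L2-norm importance score, and the keep ratio are illustrative assumptions, not the paper's actual criteria or implementation.

```python
import numpy as np

def subvector_prune_attention(attn, subvec_len=4, keep_ratio=0.5):
    """Minimal sketch of column-wise sub-vector pruning (assumed scoring).

    attn: (N, N) attention score matrix for one head (hypothetical shape).
    Each column is split into sub-vectors of length `subvec_len`; the
    sub-vectors with the smallest L2 norms are zeroed, keeping `keep_ratio`
    of them per column.
    """
    n_rows, n_cols = attn.shape
    assert n_rows % subvec_len == 0, "rows must divide evenly into sub-vectors"
    n_sub = n_rows // subvec_len

    pruned = attn.copy()
    for c in range(n_cols):
        col = pruned[:, c].reshape(n_sub, subvec_len)   # isolate the column
        norms = np.linalg.norm(col, axis=1)             # score each sub-vector
        n_keep = max(1, int(round(keep_ratio * n_sub)))
        drop = np.argsort(norms)[: n_sub - n_keep]      # weakest sub-vectors
        col[drop, :] = 0.0                              # prune at sub-vector granularity
        pruned[:, c] = col.reshape(-1)                  # write the column back
    return pruned

# Example: 8x8 single-head attention matrix, keep half of the
# 2-element sub-vectors in each column.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
sparse_attn = subvector_prune_attention(attn, subvec_len=2, keep_ratio=0.5)
```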
Published in: IEEE Transactions on Computers (Volume: 74, Issue: 6, June 2025)
Page(s): 1949 - 1962
Date of Publication: 03 March 2025

