Abstract:
In recent years, the Transformer architecture has achieved outstanding results in many areas. However, it relies on the self-attention mechanism, which requires a large amount of computation with quadratic complexity and substantial memory resources, hindering its adoption on edge devices. Most existing studies reduce algorithm complexity by selecting a subset of elements, rather than all elements, to participate in the attention computation, but they do not explicitly consider the efficiency of deploying their methods on edge devices, where floating-point matrix multiplication is a major bottleneck. This paper proposes a software-hardware collaborative self-attention module that adopts bitwise operations in place of traditional floating-point matrix multiplication, while retaining the attention mechanism's ability to capture complex long-range dependencies and reducing algorithm complexity and memory consumption. In addition, we design a dedicated acceleration operator on a Xilinx ZCU104 FPGA. Experimental results show that the proposed operator achieves a speedup of more than 1300× over the traditional self-attention operator, with only a 0.8% performance loss on the CIFAR image classification task.
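The abstract does not detail how bitwise operations substitute for floating-point matrix multiplication, so the following is only a minimal illustrative sketch of one common approach (sign-binarized attention scores via XNOR/popcount), not the authors' implementation. All function names, the sign-binarization scheme, and the use of NumPy here are assumptions for illustration.

```python
# Illustrative sketch (assumption, not the paper's method): compute an
# attention score matrix with bitwise operations instead of floating-point
# matrix multiplication. Queries and keys are binarized to {-1, +1}, packed
# into bytes, and their dot products are recovered via XOR + popcount.
import numpy as np

def binarize_and_pack(x):
    """Binarize an (n, d) float matrix to sign bits and pack them into bytes."""
    bits = (x >= 0).astype(np.uint8)  # bit 1 encodes +1, bit 0 encodes -1
    return np.packbits(bits, axis=1)

def bitwise_attention_scores(q, k):
    """Approximate q @ k.T as dot products of sign vectors, using only bit ops."""
    d = q.shape[1]
    q_bits = binarize_and_pack(q)
    k_bits = binarize_and_pack(k)
    scores = np.empty((q.shape[0], k.shape[0]), dtype=np.int32)
    for i in range(q.shape[0]):
        # XOR marks disagreeing bits; the Hamming distance gives the score:
        # dot = (#agreeing - #disagreeing) = d - 2 * hamming.
        xor = np.bitwise_xor(q_bits[i], k_bits)
        hamming = np.unpackbits(xor, axis=1, count=d).sum(axis=1)
        scores[i] = d - 2 * hamming
    return scores

# Usage: the bitwise scores match an explicit float matmul of the sign vectors.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 64)), rng.standard_normal((6, 64))
approx = bitwise_attention_scores(Q, K)
exact = np.where(Q >= 0, 1, -1) @ np.where(K >= 0, 1, -1).T
assert np.array_equal(approx, exact.astype(np.int32))
```

On hardware such as the ZCU104 FPGA targeted by the paper, the appeal of this style of computation is that XOR and popcount map to cheap LUT logic, avoiding DSP-heavy floating-point multipliers; the exact binarization and scaling scheme used by the authors is not specified in the abstract.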
Published in: 2024 10th IEEE International Conference on High Performance and Smart Computing (HPSC)
Date of Conference: 10-12 May 2024
Date Added to IEEE Xplore: 19 July 2024