Abstract:
Transformer models have achieved state-of-the-art results in many fields, such as natural language processing and computer vision, but their large number of matrix multiplications (MM) results in substantial data movement and computation, causing high latency and energy consumption. In recent years, computing-in-memory (CIM) has been demonstrated as an efficient MM architecture, but a Transformer's attention mechanism raises new challenges for CIM in both memory access and computation (Fig. 29.3.1): 1a) Unlike conventional static MM with pre-trained weights, the attention layers introduce dynamic MM (QK^T, A'V), whose weights and inputs are both generated at runtime, leading to redundant off-chip memory access for intermediate data. 1b) A CIM pipeline architecture can mitigate this problem, but introduces a new challenge: since the K generation direction does not match the conventional CIM write direction, the QK^T pipeline needs a large transpose buffer with extra overhead. 2) Compared with fully connected (FC) layers, attention layers dominate a Transformer's computation and require >8b precision to maintain accuracy, so previous analog CIMs [1]-[2] with ≤8b precision support cannot be directly used. Reducing the amount of computation for attention layers is critical for efficiency improvement.
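To make the static/dynamic MM distinction concrete, the following is a minimal NumPy sketch of one attention head; the shapes and variable names are illustrative assumptions, not taken from the paper. The projection MMs use pre-trained weights (suitable for conventional CIM, where weights are written once), whereas QK^T and A'V have both operands produced at runtime.

```python
import numpy as np

# Illustrative shapes (assumptions, not from the paper).
seq_len, d_model, d_head = 128, 512, 64

# Pre-trained projection weights: operands of *static* MM, known before
# runtime, so they can be written into CIM arrays ahead of time.
W_q = np.random.randn(d_model, d_head).astype(np.float32)
W_k = np.random.randn(d_model, d_head).astype(np.float32)
W_v = np.random.randn(d_model, d_head).astype(np.float32)

X = np.random.randn(seq_len, d_model).astype(np.float32)  # runtime activations

# Static MM: one operand (the weight matrix) is fixed before inference.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Dynamic MM: both operands (Q and K, then A' and V) are generated at runtime,
# so the intermediates must either go off-chip or be kept in a CIM pipeline.
S = Q @ K.T / np.sqrt(d_head)                  # QK^T (note K must be transposed)
A = np.exp(S - S.max(axis=-1, keepdims=True))  # softmax, numerically stable
A_prime = A / A.sum(axis=-1, keepdims=True)
out = A_prime @ V                              # A'V
```

The explicit `K.T` above is the point of challenge 1b: K is produced row by row at runtime, which does not match the write direction a conventional CIM array expects, hence the need for a transpose buffer in a QK^T pipeline.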
Date of Conference: 20-26 February 2022
Date Added to IEEE Xplore: 17 March 2022