FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy


Abstract:

Weight-only quantization has emerged as a promising technique for alleviating the computational burden of large language models (LLMs) by employing low-precision integer (INT) weights while retaining full-precision floating-point (FP) activations to ensure inference quality. Despite the memory footprint reduction achieved by lowering the bit precision of the weight parameters, actual computing performance often does not improve significantly: owing to the lack of dedicated FP-INT arithmetic units, FP-INT multiply-accumulate (MAC) operations are performed on the floating-point unit (FPU) after dequantizing the INT weight values to FP values. In this study, we investigate the impact of introducing a dedicated FP-INT unit on overall performance and find that such specialization does not yield substantial improvements. As an alternative approach, we propose FIGNA, an accelerator based on INT units designed specifically for FP-INT MAC operations. A key feature of FIGNA is its ability to achieve the same numerical accuracy as the FPU while relying solely on integer units. This is a departure from prior methods, which used integer units with numerical approximations of FP arithmetic results and claimed similar inference accuracy through dedicated network training. Through comprehensive experiments on FP-INT quantized networks for LLMs, including OPT and BLOOM, we demonstrate the superior performance of FIGNA compared to conventional FPUs in terms of performance per area (TOPS/mm^2) and energy efficiency (TOPS/W) across various input and weight precision combinations. For instance, in the FP16-INT4 case, FIGNA shows 6.34x higher TOPS/mm^2 and 2.19x higher TOPS/W compared to the baseline.
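
The contrast drawn in the abstract, between dequantizing INT weights for an FP MAC and doing the MAC on integer units, can be illustrated with a minimal Python sketch. This is not the FIGNA datapath, which the abstract does not describe. It is a generic illustration under stated assumptions: a single per-tensor weight scale, activations given as Python floats, and the hypothetical function names dequant_fp_mac and int_mac_shared_exponent. The integer path rewrites every activation as an integer over a shared power-of-two denominator, so the accumulation itself involves no rounding.

```python
def dequant_fp_mac(acts, q_weights, scale):
    """Baseline path: dequantize each INT weight to FP, then accumulate in FP."""
    acc = 0.0
    for a, q in zip(acts, q_weights):
        acc += a * (q * scale)  # FP multiply and FP add, each may round
    return acc


def int_mac_shared_exponent(acts, q_weights, scale):
    """Illustrative integer path (an assumption, not the paper's design).

    Each finite float is exactly n / d with d a power of two, so all
    activations can be expressed as integers over one shared denominator.
    """
    ratios = [float(a).as_integer_ratio() for a in acts]
    shared_den = max(d for _, d in ratios)  # largest power-of-two denominator
    # Align every activation to the shared denominator (exact integer scaling).
    mantissas = [n * (shared_den // d) for n, d in ratios]
    # Integer multiply-accumulate: exact, no rounding inside the loop.
    int_acc = sum(m * q for m, q in zip(mantissas, q_weights))
    # A single conversion back to FP, followed by the weight scale.
    return (int_acc / shared_den) * scale
```

For example, with activations [0.5, -1.25, 3.0], INT weights [3, -2, 7], and a scale of 0.125, both functions return 3.125. In general the baseline path rounds at every multiply and add, whereas this integer path rounds only in the final conversion and scaling, so the two can differ in the last bits. Python's arbitrary-precision integers also hide the bit-width choices a hardware design would have to make.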
Date of Conference: 02-06 March 2024
Date Added to IEEE Xplore: 02 April 2024
Conference Location: Edinburgh, United Kingdom
