I. Introduction
Public key cryptographic protocols, such as ECC and RSA, are fundamental to modern Internet security. However, as security requirements increase, longer keys make basic finite field operations, particularly modular multiplication, more expensive in hardware resources. To avoid the costly division operations involved in modular reduction, the Barrett modular multiplication (BMM) and Montgomery modular multiplication (MMM) algorithms are widely used [1], [2]. Even so, the challenge of performing large bit-width multiplication remains. Among the algorithms developed for this purpose, the FFT-based Schönhage-Strassen algorithm [3] and the more recent Fürer algorithm [4] target extremely large bit-width integer multiplication. For the operand sizes typical of public key cryptography, Toom-Cook multiplication (TCM) [5], with complexity $O(n^{\log_k(2k-1)})$, where $n$ is the bit-width and $k$ is the degree of TCM, is more appropriate. However, unlike TCM-2 (Karatsuba multiplication) [6], when $k$ exceeds 2 the algorithm requires exact division operations [7]. This requirement poses a significant obstacle to improving performance in hardware designs.

Ding et al. [8] first proposed a TCM-based modular multiplier over NIST prime fields. Their design uses shift operations instead of exact divisions in the interpolation step. Building on this, Gu and Li [9] integrated the Montgomery algorithm to implement a modular multiplier for arbitrary prime fields.

In this brief, we improve the TCM algorithm for hardware implementation and, based on it, propose a novel BMM algorithm. In the algorithm, we provide a new precomputation method, with detailed mathematical derivations, that eliminates the redundant factors inherent in TCM. These derivations also yield the parameters needed to minimize the error range of the modular multiplication result.
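For readers unfamiliar with how Barrett reduction replaces division by a precomputed constant and shifts, the following is a minimal sketch of the classic textbook BMM, not the TCM-based algorithm proposed in this brief. The function names `barrett_precompute` and `barrett_mulmod`, and the choice of the NIST P-256 prime in the demo, are assumptions made purely for illustration. The sketch also reports how many final correction subtractions were needed, which is the "error range" of the quotient estimate (at most 2 in this classic form).

```python
def barrett_precompute(m):
    """Return (k, mu) for modulus m, where k is the bit length of m
    and mu = floor(2^(2k) / m). This is the only true division, and it
    is done once per modulus (hypothetical helper name)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu


def barrett_mulmod(a, b, m, k, mu):
    """Compute a*b mod m for 0 <= a, b < m using classic Barrett reduction.

    Returns (r, corrections), where corrections counts the final
    conditional subtractions of m.
    """
    x = a * b                              # full-width product, x < 2^(2k)
    # Quotient estimate using only a multiplication and two shifts.
    q = ((x >> (k - 1)) * mu) >> (k + 1)
    r = x - q * m                          # q underestimates floor(x/m) by at most 2
    corrections = 0
    while r >= m:                          # final correction step
        r -= m
        corrections += 1
    return r, corrections


if __name__ == "__main__":
    # NIST P-256 prime, used here only as an example modulus.
    p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
    k, mu = barrett_precompute(p256)
    a, b = p256 - 1, p256 - 2
    r, corrections = barrett_mulmod(a, b, p256, k, mu)
    print(r == (a * b) % p256, corrections)
```

In hardware, the product `a * b` and the multiplication by `mu` are the large bit-width multiplications that a method such as TCM would accelerate, and a tighter quotient estimate means fewer correction subtractions.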