Skip to Main Content
The Montgomery Multiplication is one of the cornerstones of public-key cryptography, with important applications in the RSA algorithm, in Elliptic-Curve Cryptography, and in the Digital Signature Standard. The efficient implementation of this long-word-length modular multiplication is crucial for the performance of public-key cryptography. Along with the strong momentum of shifting from single-core to multicore systems, we present a parallel-software implementation of the Montgomery multiplication for multicore systems. Our comprehensive analysis shows that the proposed scheme, pSHS, partitions the task in a balanced way so that each core has the same amount of job to do. In addition, we also comprehensively analyze the impact of intercore communication overhead on the performance of pSHS. The analysis reveals that pSHS is high performance, scalable over different number of cores, and stable when the communication latency changes. The analysis also tells us how to set different parameters to achieve the optimal performance. We implemented pSHS on a prototype multicore architecture configured in a Field Programmable Gate Array (FPGA). Compared with the sequential implementation, pSHS accelerates 2,048-bit Montgomery multiplication by 1.97, 3.68, and 6.13 times on, respectively, two-core, four-core, and eight-core architectures with communication latency equal to 100 clock cycles.