Performance Evaluation of Instruction Set Extensions for Long Integer Modular Arithmetic on a SPARC V8 Processor
Grossschadl, J.; Tillich, S.; Szekely, A.
Digital System Design Architectures, Methods and Tools, 2007. DSD 2007. 10th Euromicro Conference on
Volume , Issue , 29-31 Aug. 2007 Page(s):680 - 689
Digital Object Identifier 10.1109/DSD.2007.4341542
Summary:Many important algorithms for public-key cryptography rely on computation-intensive arithmetic operations like modular exponentiation on very long integers, typically in the range of 512 and 2048 bits. Modular exponentiation is generally realized through a sequence of modular multiplications and spends the majority of execution time in simple inner loops. Speeding up these performance-critical inner loop operations with custom instructions has, therefore, a significant impact on the total execution time of public-key cryptosystems. In this paper we analyze the performance of instruction set extensions for long integer arithmetic on a SPARC V8 processor. We discuss various implementation options and optimization opportunities for both modular multiplication and exponentiation. In particular, we introduce a partial loop unrolling (PLU) technique for modular multiplication which allows to achieve large performance gains at the cost of a moderate increase in code size, while maintaining the full flexibility of a "rolled-loop" implementation. In addition, we study window methods for modular exponentiation and analyze their impact on performance and memory requirements. Our experimental results, obtained with an FPGA prototype of the LEON-2 SPARC V8 core, show that a full 1024-bit modular exponentiation can be performed in about 12.5 ldr 106 clock cycles, which is a reasonable value for embedded devices like smart cards or sensor nodes.
View citation and abstract |