High-Throughput Bilinear Pairing Processor for Server-Side FPGA Applications

This study focuses on the acceleration of cryptographic pairing operations on field-programmable gate arrays (FPGAs) for server-side applications. Previous studies on FPGA pairing implementations focused on area efficiency for embedded devices, trying to achieve maximum performance with minimal circuit resources. However, such architectures are likely to be inefficient for server-side applications, where the primary interest is maximum performance when the FPGA's resources are fully used. They are inefficient for two reasons: low utilization of the digital signal processors (DSPs) and low operating frequency. In this study, we propose a high-throughput pairing processor architecture for server-side FPGAs that takes full advantage of the DSPs. First, we propose a loop-unrolled modular multiplication algorithm suitable for a server-side FPGA. The algorithm shows the highest throughput and area efficiency compared with algorithms from previous studies. Second, we design a pairing processor architecture that embeds the proposed modular multiplier and maintains its high throughput by supporting redundant adders and interleaved execution. We evaluate the BN254 and BLS12_381 pairings on the proposed processor architecture, and the results show that it achieves throughput approximately two and five times higher, respectively, than previous studies.

I. INTRODUCTION

To reduce the amount of data flowing through a network, aggregate signatures [1] that reduce the number of transactions have been proposed. In addition, to meet the demand for increased data confidentiality, searchable encryption has been introduced to enable searching over data in the ciphertext state [2]. Some of these advanced cryptographic schemes are based on an operation known as the (elliptic curve-based) bilinear pairing. A cryptographic system based on pairing is known as pairing-based cryptography (PBC). However, the pairing is computationally intensive, typically requiring more than 10 000 modular multiplications on a finite field [3]. Key-size updates due to a recent theoretical attack [4] further increase the computational complexity of the pairing. For PBC to be widely used worldwide, it is important to accelerate pairing computations. In this study, we focus on PBC for server-side applications, such as aggregate signature verification (see Section II-B), that require a large number of pairing computations in a batch manner. Field-programmable gate array (FPGA) implementation is a promising option for accelerating cryptographic operations in server-side applications. Server-side FPGAs such as the Xilinx Virtex Ultrascale+ XVU9P used in the AWS, Alibaba, and Huawei clouds [5], [6] provide far more resources than a single crypto core requires; therefore, increasing throughput with a multicore strategy [7] is an important factor. However, existing studies of FPGA-based pairing processors are unsuitable for a multicore strategy in server-side applications for the following two reasons.

A. Utilized Resource Imbalance
Several types of circuit resources are available in FPGAs, including look-up tables (LUTs), flip-flops (FFs), and digital signal processors (DSPs). If the utilization of these resources is unbalanced (i.e., one resource is used far more heavily than the others), the number of cores in a multicore implementation is bounded by that resource, implying that other resources remain unused and do not contribute to performance (see Fig. 1). Existing studies of FPGA pairing processors [8], [9], [10], [11], [12], [13], [14] evaluate only single-core performance and thus do not take the resource balance required for multicore performance into account.

B. Low Operating Frequency
Existing pairing processors on FPGAs [8], [9], [10], [11], [14], [15] run at 200-300 MHz, which is lower than the maximum operating frequency of the FPGAs, 600-800 MHz. This is mainly because the existing studies focus on low-latency computation, whereas server-side applications require high throughput rather than low latency to respond to a large number of cryptographic requests.
To solve these problems, this article: 1) introduces a metric, the slice-DSP ratio (SDR), that captures how fully the circuit resources of an FPGA are used and 2) using the SDR metric, proposes a methodology to maximize the performance of a crypto core for server-side applications.
To implement a cryptographic core on an FPGA, we can primarily use two resources: a logic slice, consisting of LUTs and FFs, and a DSP, a dedicated unit for sum-of-products operations. Balanced utilization of these two resources maximizes performance under a multicore strategy. We introduce the SDR as a metric of resource utilization imbalance. The Xilinx Virtex Ultrascale+ XVU9P, a standard server-side FPGA adopted in the AWS, Alibaba, and Huawei clouds, provides approximately 295 000 slices and 6840 DSPs. The closer the SDR of a cryptographic core is to 43 (≈ 295 000/6840), the more "balanced" the resource utilization, resulting in high performance due to full resource utilization, as shown in Fig. 1. In previous studies [8], [9], [11], most pairing cores had SDR > 43, which is unfavorable for maximizing multicore performance because it does not take full advantage of the FPGA's DSPs.
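As a minimal illustration of the metric (a sketch in Python; the resource totals are the XVU9P figures quoted in this article), the SDR of a core and the device's balanced ratio can be computed as follows.

```python
# Minimal sketch of the SDR metric. The totals are the XVU9P figures quoted
# in this article (147 780 CLBs x 2 slices = 295 560 slices, 6840 DSPs).
XVU9P_SLICES = 295_560
XVU9P_DSPS = 6_840

def sdr(slices_used: int, dsps_used: int) -> float:
    """Slice-DSP ratio of a core: slices consumed per DSP consumed."""
    return slices_used / dsps_used

device_ratio = XVU9P_SLICES / XVU9P_DSPS  # ~43.2, the balanced target
# A core with SDR >> 43 exhausts slices first and strands DSPs;
# a core with SDR << 43 exhausts DSPs first and strands slices.
```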
Our methodology for bringing the SDR of the pairing circuit close to 43 is to construct the modular multiplier with as few slices as possible (i.e., with a low SDR). Since the components of the pairing circuit other than the modular multiplier (e.g., the modular adders and sequencers) are composed only of slices, the SDR of the entire pairing circuit is balanced between the low-SDR modular multiplier and the high-SDR remaining components.
To reduce the SDR of the modular multiplier, we propose a new modular multiplication algorithm designed for the DSPs of cloud FPGAs. The algorithm is carefully designed to take full advantage of several DSP functions, namely the asymmetric multiplier, the ternary post-adder, and the pipeline registers, achieving high throughput at a low SDR by moving functions previously implemented in slices into the DSPs. The low SDR of the modular multiplier allows us to implement the other components of the pairing computation, such as the modular adders, with a large number of logic slices to improve performance. Although adders and sequencers can typically become a performance bottleneck, our pairing processor architecture avoids this degradation through deep pipelining and redundant adders that consume many logic slices, resulting in high throughput for the entire pairing computation.
As a demonstration of the proposed method, we show that the proposed modular multiplier achieves a low SDR of less than 22, an operating frequency of over 600 MHz, and more than twice the throughput per area of previous studies. Furthermore, we evaluate the pairing processor built around the proposed modular multiplier on the BN254 and BLS12_381 curve pairings. The evaluation results show that our implementations have well-balanced SDRs (46.96 and 41.24) and achieve approximately two and five times the multicore throughput, respectively, of previous studies. Our source code is available on the web. The remainder of this article is organized as follows. Section II discusses the preliminaries for this study, including the mathematical background of pairing and related work. Section III presents the proposed method and the ways in which it achieves high performance on server-side FPGAs. Section IV presents the evaluation results of the proposed method on the XVU9P FPGA and comparisons with previous studies. Section V discusses performance requirements, accelerator deployment, and extensions of the proposed method. Finally, Section VI concludes this article.

II. PRELIMINARIES

A. Pairing
The map e : G_1 × G_2 → G_3 is called an (admissible) "pairing" if it satisfies the following three properties. 1) Non-degeneracy: e(P, Q) ≠ 1. 2) Bilinearity: e([a]P, [b]Q) = e(P, Q)^{ab}. 3) Computability: e can be computed efficiently in polynomial time. In the above expression, G_1 and G_2 are written additively and G_3 multiplicatively, and [a]P represents the scalar multiplication of P by an integer a.
Let p be a prime number, E an elliptic curve defined over F_p, and k the embedding degree. Our target pairing is the optimal Ate pairing [18], where G_1 and G_2 are defined as subgroups of the elliptic curve group E(F_p), and G_3 is a subgroup of F*_{p^k}. Although bilinearity, the unique property of the pairing, enables various functionalities in advanced cryptography, the security parameters for using the pairing securely have not yet been settled with solid consensus in the research community. In 2018, Barbulescu and Duquesne [4] updated the security-bit lengths for secure pairing computation, and the parameters may be updated again in the future.

Fig. 2. Typical operations required for the optimal Ate pairing computation. A hardware architecture suited to F_{p^2} operations is effective in speeding up the pairing, since all operations can be decomposed into F_{p^2} operations.
Because these curves have the embedding degree k = 12, the pairing computation involves F_{p^12} arithmetic. We use a tower extension in which F_{p^12} is constructed over F_{p^2}; it is therefore important to accelerate the F_{p^2} operations in an optimal Ate pairing architecture. Fig. 2 shows a top-down view of the operations required for a typical optimal Ate pairing computation. The optimal Ate pairing can be decomposed into two functions: f, called the Miller loop, and its exponentiation to the power (p^k − 1)/r, called the final exponentiation, where r is the order of the elliptic curve group. Each function is decomposed into finer-grained operations: elliptic curve operations, F_{p^12} operations, and F_{p^2} operations. The lowest-level operations are the prime-field (F_p) operations, i.e., addition and multiplication modulo p, which can be implemented with log p-bit adders and modular multipliers. According to [22], the pairing computation for a 254-bit prime requires approximately 10 000 F_p multiplications and 57 000 F_p additions, indicating that high-throughput adders and modular multipliers are effective in speeding up the pairing computation. Because most operations for pairing can be decomposed into F_{p^2} operations, as shown in Fig. 2, we design a hardware architecture optimized for F_{p^2} operations by combining modular adders and multipliers.

B. BLS Signature
The BLS digital signature scheme [1], [23] is an example of PBC. The BLS signature has a shorter signature length than the conventional elliptic curve digital signature algorithm (ECDSA) and offers signature aggregation, which combines multiple signatures into a single signature. Owing to these advantages, some decentralized applications, such as DFINITY [24] and Ethereum [25], adopt the BLS signature to compress data and storage size.
Here, G_1 and G_2 are elliptic curve groups of the same prime order; g_1 and g_2 are the generators of G_1 and G_2, respectively; and H_0 is a hash function H_0 : M → G_1. The BLS signature consists of three functions: KeyGen, Sign, and Verify.
In addition, the BLS signature supports the following Agg and AggVerify functions, whose flow is sketched below.
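The toy model below (Python) illustrates the protocol flow only: the bilinear map e(a, b) = g^{ab} over small integers is a stand-in for an elliptic curve pairing and is cryptographically worthless, and all constants, names, and the hash construction are illustrative assumptions rather than the article's parameters. It does, however, satisfy bilinearity and non-degeneracy, so the KeyGen/Sign/Verify and Agg/AggVerify identities can be checked end to end.

```python
# Toy BLS flow over a toy bilinear map (NOT secure; structure only).
import hashlib
import random

r = 1009                      # toy group order (prime)
p = 10091                     # toy field prime with r | p - 1
g = pow(2, (p - 1) // r, p)   # element of order r in F_p*
g2 = 1                        # generator of the (additive) toy G2 = Z_r

def pairing(a: int, b: int) -> int:
    """Toy bilinear map e: Z_r x Z_r -> <g>, e(a, b) = g^(a*b)."""
    return pow(g, (a * b) % r, p)

def hash_to_g1(msg: bytes) -> int:
    """Toy H0: M -> G1 (never the identity)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % (r - 1) + 1

def keygen():
    sk = random.randrange(1, r)
    return sk, (sk * g2) % r                 # pk = [sk] g2

def sign(sk: int, msg: bytes) -> int:
    return (sk * hash_to_g1(msg)) % r        # sigma = [sk] H0(msg)

def verify(pk: int, msg: bytes, sigma: int) -> bool:
    # e(sigma, g2) == e(H0(msg), pk) by bilinearity
    return pairing(sigma, g2) == pairing(hash_to_g1(msg), pk)

def aggregate(sigmas) -> int:
    return sum(sigmas) % r                   # Agg: sum of G1 points

def agg_verify(pks_msgs, agg_sigma: int) -> bool:
    # n + 1 pairings in total: the server-side bottleneck of Section II-C.
    rhs = 1
    for pk, msg in pks_msgs:
        rhs = rhs * pairing(hash_to_g1(msg), pk) % p
    return pairing(agg_sigma, g2) == rhs

keys = [keygen() for _ in range(3)]
msgs = [b"m0", b"m1", b"m2"]
sigs = [sign(sk, m) for (sk, _), m in zip(keys, msgs)]
assert all(verify(pk, m, s) for (_, pk), m, s in zip(keys, msgs, sigs))
assert agg_verify([(pk, m) for (_, pk), m in zip(keys, msgs)], aggregate(sigs))
```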

C. Importance of Server-Side FPGAs to Accelerate Pairing Computations
The main bottleneck of the BLS signature is the n + 1 pairing computations involved in AggVerify. As an example application of the BLS signature [26], [27], we take a wireless sensor network (WSN), where millions of sensor nodes sense physical quantities and the sensed data are accumulated on server nodes. Signing the sensed data on the sensor nodes enables authenticity verification on the server nodes, which prevents data poisoning and adversarial example attacks. However, the signature length (64 B in typical ECDSA, for instance) is significantly larger than the sensed data (a few bytes, we estimate, in many cases), so the load on the network bandwidth must be addressed when many nodes send signed data. Introducing the BLS signature into the WSN significantly saves network bandwidth by aggregating multiple signatures on intermediate nodes such as gateways. The server nodes must then process n + 1 pairing computations to verify the aggregated signature, which can become a bottleneck.
Hardware implementation is a promising approach for accelerating pairing computation. Among hardware implementations, we focus on reconfigurable FPGAs because the security parameters can be updated. The server side can use high-performance FPGAs, unlike IoT end devices, and FPGAs such as the Xilinx Virtex Ultrascale+ XVU9P have recently become available in cloud services. Our goal is therefore a high-throughput implementation on the Xilinx Virtex Ultrascale+ XVU9P. We propose a high-throughput implementation that can handle many pairing requests, exploiting the fact that pairing requests on the server nodes arrive in a batch-like manner when an aggregated signature arrives.

D. Related Studies
Sakamoto et al. [15] proposed a pairing processor based on Yao et al.'s work [8], and the two pairing processors have similar architectures. While Sakamoto et al. aimed to develop a low-latency, large-scale pairing processor with fully unrolled quotient-pipelining Montgomery multiplication (QPMM), Yao et al. aimed to achieve good area-speed efficiency with the residue number system (RNS). These studies implemented pairing processors with 16-18 pipeline stages on FPGAs and demonstrated that their pipelines run at high efficiencies of approximately 100%. This indicates that the pairing computation has few computational dependencies and that performance can be maintained with a deeper pipeline. We therefore expect to improve the throughput of a pairing processor while maintaining latency even when the pipeline is deepened to maximize the DSP operating frequency.
Accelerating modular multiplication is a major topic in cryptography research, as modular multiplication is the dominant operation in computing many public key cryptosystems.Barrett reduction [28] and Montgomery reduction [29] are the most well-known modular multiplication algorithms, with the latter being the mainstream method.
The key to implementing modular multiplication on FPGAs is the effective use of DSPs, the dedicated hardware elements for performing multiplications. The operating frequency of the DSP is crucial for a fast modular multiplier, and some studies [16], [17] aimed to operate the DSPs at the maximum frequency given in the datasheets.
Suzuki [16] implemented a modular multiplier that operates at the maximum operating frequency of 400 MHz on Virtex-4 devices by adequately mapping the QPMM algorithm [30] onto 17 DSPs. Gallin and Tisserand [17] implemented a 128-bit modular multiplier operating at 349 MHz, against the maximum operating frequency of 390 MHz for the Spartan-6 device, by adequately mapping the coarsely integrated operand scanning (CIOS) Montgomery multiplication onto nine DSPs. Although these studies proposed methodologies for maximizing the DSP operating frequency, their results were validated only on small-scale (older generation) FPGAs. We should take advantage of the abundant resources (thousands of DSPs) in the modern large-scale FPGAs used in cloud services.
Gallin and Tisserand [17] also implemented an elliptic curve processor using their proposed modular multiplier. An elliptic curve processor requires adders and memories, not only a modular multiplier, and these must also be designed appropriately to maintain a high operating frequency. In their study, the operating frequency of the elliptic curve processor dropped from 349 to 298 MHz.

III. PROPOSED PAIRING PROCESSOR ARCHITECTURE
A. Architecture Overview

Fig. 3 shows the overall architecture of our pairing processor, which is an improved version of the one from our previous study [15]. This architecture is mainly designed to accelerate the F_{p^2} arithmetic that appears frequently in the pairing computation. For example, we can efficiently calculate the F_{p^2} multiplication and squaring using Karatsuba-like formulas:

z_0 = x_0 y_0 − x_1 y_1,  z_1 = (x_0 + x_1)(y_0 + y_1) − x_0 y_0 − x_1 y_1  (5)

z_0 = (x_0 + x_1)(x_0 − x_1),  z_1 = 2 x_0 x_1  (6)

where z_0, z_1, x_0, x_1, y_0, y_1 ∈ F_p. Fig. 4 shows the ways in which the above equations are processed in the proposed architecture. As shown in Fig. 4, the architecture computes an F_{p^2} multiplication or squaring in three or two cycles on average, respectively. Sections III-B and III-C explain how this architecture achieves both high performance and a low SDR.
Fig. 4. Proposed architecture's pipeline data flow when processing F_{p^2} multiplication and squaring, which take three and two cycles on average, respectively. The sequencer module first issues three opcodes (control signals) for the F_{p^2} multiplication, followed by two opcodes for the F_{p^2} squaring, and the figure shows the computation process according to these control signals. All computations are drawn as same-sized boxes although their computation times differ; the figure is not cycle-accurate, and the interleaved execution is omitted to simplify the explanation. When interleaved execution is enabled, τ data are read from the BRAMs for each issued opcode (equivalently, the figure shows the situation in which all threads process the same data).

The pairing computation is performed over a finite field; therefore, all additions in the architecture require modular adders rather than simple adders. Because modular arithmetic usually involves a comparison operation, the addition result must be fully determined before the comparison. This implies that we cannot take advantage of a fast redundant adder such as a carry-save adder (CSA), and it is difficult to increase the operating frequency owing to the long carry chain. To solve this problem, we adopt the lazy-reduction technique, which separates the modulo operations from the modular adders and pushes all of them into the modular multiplier. This enables every adder in the architecture to be converted to a redundant adder, thus increasing the operating frequency.
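The tiny check below (Python) illustrates why lazy reduction is sound: deferring every modulo to the multiplier leaves the final result unchanged, so the intermediate adders never need a comparison. The modulus is an arbitrary stand-in, not the article's M.

```python
# Lazy reduction: additions skip the modulo; only the multiplier reduces.
import random

m = 2**255 - 19  # stand-in prime modulus, for illustration only

def strict(a, b, c, d):
    return ((((a + b) % m) + c) % m) * d % m   # reduce after every addition

def lazy(a, b, c, d):
    s = a + b + c        # adders never reduce (and may be redundant adders)
    return s * d % m     # all reduction deferred to the modular multiplier

for _ in range(1000):
    v = [random.randrange(m) for _ in range(4)]
    assert strict(*v) == lazy(*v)
```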
Although the CSA is the most popular redundant adder for eliminating a long carry chain, it has the disadvantage of doubling the number of FFs used. Consequently, on Xilinx FPGAs, which have dedicated carry-propagation logic (CARRY8), a ripple-carry adder (RCA) is often more efficient. To avoid the full carry chain of an RCA, we employ the partially redundant partitioned adder (PRPA), which divides an adder into κ sub-adders and saves a γ-bit carry for each sub-adder. For a non-redundant R-bit integer A = Σ_{j=0}^{κ−1} a_j 2^{rj}, we define the partially redundant κ-partitioned integer Â = Σ_{j=0}^{κ−1} â_j 2^{rj}, where each â_j is an (r + γ)-bit word and r = R/κ (we assume κ | R). In the PRPA, each sub-addition is performed at a width of r + γ bits to eliminate the original R-bit carry chain. The required carry width γ depends on the number of additions between the output of the modular multiplier and its next input (γ = 8 bits is sufficient for our scheduling). The saved carry bits are consumed by the ToUint module over multiple cycles to maintain a high operating frequency. We implemented a 256-bit two-input RCA on an xcvu9p-2l device and found that it operates at approximately 400 MHz. Because the BLS12_381 pairing requires 381-bit additions, we selected κ = 4 with an operating margin.
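The following small executable model (Python; the parameter values R = 256, κ = 4, γ = 8 are taken from the discussion above, everything else is assumed for illustration) shows the PRPA contract: sub-additions never carry across partition boundaries, and the saved carries are folded back in later, as the ToUint module does.

```python
# A toy model of the PRPA: kappa independent (r + gamma)-bit sub-adders.
R, KAPPA, GAMMA = 256, 4, 8
r = R // KAPPA  # each sub-adder is r + gamma = 72 bits wide

def to_prpa(x: int) -> list[int]:
    """Split a non-redundant R-bit integer into kappa r-bit partitions."""
    return [(x >> (r * j)) & ((1 << r) - 1) for j in range(KAPPA)]

def prpa_add(a: list[int], b: list[int]) -> list[int]:
    """Add partition-wise: no carry ripples between partitions."""
    out = [aj + bj for aj, bj in zip(a, b)]
    # roughly 2**GAMMA accumulations fit before the saved carries overflow
    assert all(s < 1 << (r + GAMMA) for s in out)
    return out

def to_uint(a: list[int]) -> int:
    """Fold the saved gamma-bit carries back in (the ToUint module's role)."""
    return sum(aj << (r * j) for j, aj in enumerate(a))

x, y = (0xDEADBEEF << 128) | 0xCAFE, (1 << 200) + 12345
assert to_uint(prpa_add(to_prpa(x), to_prpa(y))) == x + y
```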

B. Operation Modules
1) Sequencer: The sequencer issues control signals that govern the operation of each module. It has a built-in ROM with 2048 entries, containing a pairing computation program optimized for the 18-stage pipeline architecture of [15]. Because the pipeline of our architecture has more than 18 stages (89 stages for the BN254 pairing and 121 stages for the BLS12_381 pairing), this program cannot run efficiently as-is. We therefore make our architecture support interleaved execution (multithreading), in which a core concurrently executes multiple pairings. For a thread count τ, the sequencer issues a control signal every τ cycles and issues the memory address for the corresponding thread every cycle.
Let σ be the number of pipeline stages of the entire pairing processor. Our program works under the condition σ < 18τ; therefore, we set τ = 5 for the BN254 pairing and τ = 7 for the BLS12_381 pairing (see the small check below). Our pairing program is optimized by various techniques described in [3], including twisted elliptic curves, sparse multiplication, and compressed squaring (see [15] for the detailed scheduling).
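The thread-count choice follows directly from the condition σ < 18τ, as this one-line sketch (Python, with the stage counts from the text) confirms:

```python
# Smallest tau satisfying sigma < 18 * tau (18 = depth the program assumes).
def min_threads(sigma: int, base_stages: int = 18) -> int:
    return sigma // base_stages + 1

assert min_threads(89) == 5    # BN254: 89-stage pipeline  -> tau = 5
assert min_threads(121) == 7   # BLS12_381: 121 stages     -> tau = 7
```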
2) Preadder: The preadder shown in Fig. 5 is a two-input two-output module that generates the inputs to the subsequent modular multiplier. The data path is divided into two parts: the left and right sides produce the multiplicand and the multiplier of the modular multiplier, respectively. The main arithmetic units of the preadder are four adders (adders 1-4, constructed as PRPAs), where adder 3 is a subtractor taking the two's complement value as its right-side input. The module has three operational modes, as shown in Fig. 5, and first-in first-out (FIFO) registers support the τ-way interleaved execution.

Fig. 5. Preadder block diagram. A preadder has three operational modes controlled by the control signal csig_pre: 1) a pass-through mode that directly outputs di when csig_pre = 0; 2) a one-step-delay mode that outputs the sum of di_pre,l and di_pre,r from τ cycles before when csig_pre = 1; and 3) a two-step-delay mode that outputs the sum of di_pre,l and di_pre,r from 2τ cycles before when csig_pre = 2.
3) ToUint Module: The ToUint module is a two-input two-output module that converts values from the PRPA form to the non-redundant form and makes them unsigned. The unsigned conversion is realized by adding 512 M to the input values, because the minimum possible input value is −512 M. Carrying up the redundant γ bits takes κ = 4 cycles, which eliminates the long carry propagation.
4) Cmul: The Cmul module is a two-input two-output module that multiplies by constants α ∈ {2, 3, 4, 6}; it consists of constant shifters and a PRPA. Multiplication by two is useful for accelerating the F_{p^2} squaring, which has the term 2 x_0 x_1 [see (6)]. Multiplications by 3, 4, and 6 are useful for accelerating the F_{p^12} and elliptic curve operations.
5) Postadder: The postadder module shown in Fig. 6 is a one-input one-output module that accumulates the output of the QPMM. The postadder consists of three accumulators: two are mainly used to produce the two terms of an F_{p^2} element, as shown in Fig. 4, and the remaining one is used to calculate F_{p^12} elements. Each accumulator has eight operational modes, controlled as shown in Fig. 6. Like the preadder, the postadder has FIFO registers to support the interleaved execution.
6) F_p Inverter: In our pairing implementation, an F_p inversion occurs only once per pairing; however, the inversion can become a bottleneck if executed on the QPMM via Fermat's little theorem, because that computation causes many pipeline bubbles. To remove this bottleneck, our pairing processor has a dedicated F_p inversion module, which implements the Montgomery inverse algorithm [31] with a τ-stage pipeline.

Fig. 6. Postadder block diagram. This module has three accumulators, each consisting of a PRPA, two selectors, and a register. The operational mode of an accumulator depends on the selector outputs controlled by the control signal csig_post,1. When a sub-signal, i.e., a bit of csig_post,1, equals one, the right-hand selector selects one of the inverted inputs, realizing subtraction via the two's complement value. The postadder output is selected from the three accumulators by the control signal csig_post,out.

7) QPMM:
The improvement of the modular multiplier module is the main part of our proposed method; the detailed description is presented in Section III-C. As discussed so far, our architecture consumes many slices for the adders that accelerate F_{p^2} operations and many FFs to support the interleaved execution. Therefore, to maximize the throughput of multicore implementations, the modular multiplier must be implemented with a small SDR by efficiently utilizing the functions of the DSP primitives.

C. Low-SDR High-Throughput Modular Multiplication Algorithm
The basic idea for lowering the SDR is to improve performance while spending DSPs liberally in a loop-unrolled implementation. Sections III-C1 and III-C2 first describe the features of the Xilinx DSP48E2 primitive and then propose a modular multiplication algorithm suited to its asymmetric multiplier and sum-of-products functions.
1) DSP48E2: Many FPGAs contain DSP hardware macros to accelerate signal-processing applications, and the DSP specifications vary by vendor and device. Virtex Ultrascale+ devices are the mainstream FPGAs available on many cloud services, including AWS F1 and the Huawei and Alibaba clouds. This study proposes a modular multiplication algorithm for DSP48E2, the DSP primitive of Virtex Ultrascale+ FPGAs.
A simplified block diagram of DSP48E2 is shown in Fig. 7. DSP48E2 takes a 27-bit A, an 18-bit B, a 48-bit C, and PCIN as inputs and outputs a 48-bit P and PCOUT. PCIN is a dedicated line connected to the PCOUT of the adjacent DSP; it is used to input the result of the adjacent DSP with minimum delay. DSP48E2 is a typical multiply-accumulate (MAC) unit, with a 27 × 18-bit signed multiplier (26 × 17-bit unsigned multiplier) followed by a 48-bit three-input adder. DSP48E2 has optional pipeline registers at various locations, which can be used to increase the operating frequency at the cost of latency. The maximum operating frequency of the DSP is listed in the datasheet [32]; for example, it is up to 644 MHz for the Virtex Ultrascale+ xcvu9p-2l devices available in cloud services.
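A behavioral sketch (Python) of the DSP48E2 datapath fragment used below may help fix the widths: an unsigned 26 × 17-bit product feeding a 48-bit three-input adder, with PCIN cascaded from a neighboring DSP. This is a width model only; it ignores the pipeline registers and the primitive's other features.

```python
# Width-accurate behavioral model of one DSP48E2 multiply-accumulate.
MASK48 = (1 << 48) - 1

def dsp48e2_mac(a: int, b: int, c: int = 0, pcin: int = 0) -> int:
    """P = A * B + C + PCIN, truncated to the 48-bit output width."""
    assert a < (1 << 26) and b < (1 << 17), "unsigned multiplier limits"
    return (a * b + c + pcin) & MASK48

# Two cascaded DSPs evaluate t = m_j * q + a_j * b_i (one PE in Fig. 8):
# p1 = dsp48e2_mac(m_j, q); t = dsp48e2_mac(a_j, b_i, pcin=p1)
```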
2) High-Throughput Modular Multiplication Algorithm for DSP48E2: We propose a modified QPMM algorithm [30, Algorithm 4] suited to DSP48E2. The naive QPMM algorithm is shown in Algorithm 1. QPMM is a variant of Montgomery multiplication that eliminates the computational dependency in the conventional high-radix Montgomery multiplication algorithm [30, Algorithm 1]. The essential improvement of QPMM is the ability to delay by d cycles the timing at which q_i is required. This eliminates the computational dependency between lines L1 and L2 and allows them to be executed in parallel. A larger value of d lengthens the multiplication width; therefore, d should be minimal. Moreover, because q_i is the lower k bits of S_i, carry bits must not propagate beyond k bits during the L2 calculation. Thus, we can efficiently calculate most additions in the L2 calculation using redundant-representation adders such as the CSA.
To execute the QPMM algorithm on the DSP48E2, we must replace the L2 calculation with a multiple-precision operation matched to the DSP48E2 operation width. Because the multiplier of DSP48E2 is asymmetric, we represent the multiplicand A and the multiplier B using the different radixes 2^ℓ and 2^k (typically ℓ = 26, k = 17), respectively:

A = Σ_{j=0}^{m−1} a_j 2^{ℓj},  B = Σ_{i=0}^{n−1} b_i 2^{ki}.  (9)

Algorithm 1: QPMM [30].

Consider all values except B in the 2^ℓ-radix representation. The sum of q_{i−d} M and b_i A becomes a sum of products over the same index j:

q_{i−d} M = Σ_j m_j q_{i−d} 2^{ℓj}  (10)

b_i A = Σ_j a_j b_i 2^{ℓj}  (11)

and, letting t_{i,j} = m_j q_{i−d} + a_j b_i, the sum of (10) and (11) is

q_{i−d} M + b_i A = Σ_j t_{i,j} 2^{ℓj}.  (12)

DSP48E2 can calculate (12) easily; however, the length of the accumulated word s_{i,j} increases by c bits as the index i progresses. Let the lower 2k bits of s_{i,j} be sl_{i,j} = s_{i,j} mod 2^{2k} and the remaining upper bits be su_{i,j} = ⌊s_{i,j}/2^{2k}⌋. The accumulation is then rewritten using sl_{i,j} and su_{i,j} as (14), in which each s_{i,j} is at most k + ℓ + 1 bits. When ℓ = 26 and k = 17, k + ℓ + 1 = 44; since DSP48E2 can output up to 48 bits, we can efficiently calculate these sum-of-products operations within the DSP.
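A quick numeric check (Python) of the regrouping in (10)-(12), under the stated radixes ℓ = 26 and k = 17; the random operands and the digit count are illustrative:

```python
# Verify q*M + b_i*A == sum_j (m_j*q + a_j*b_i) * 2^(l*j), digit-wise.
import random

l, k = 26, 17
n_dig = 11  # enough 26-bit digits for a 254-bit value plus slack (m = 11)

def digits(x: int, w: int, n: int) -> list[int]:
    return [(x >> (w * j)) & ((1 << w) - 1) for j in range(n)]

M = random.getrandbits(254) | (1 << 253) | 1   # odd 254-bit stand-in modulus
A = random.randrange(M)
b_i = random.getrandbits(k)                    # one radix-2^k digit of B
q = random.getrandbits(k)                      # the delayed quotient q_{i-d}

m, a = digits(M, l, n_dig), digits(A, l, n_dig)
t = [mj * q + aj * b_i for mj, aj in zip(m, a)]

assert q * M + b_i * A == sum(tj << (l * j) for j, tj in enumerate(t))
# each t_j fits in k + l + 1 = 44 bits, within the DSP's 48-bit adder
assert all(tj < 1 << (k + l + 1) for tj in t)
```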
We propose a QPMM variant algorithm using (14), shown as Algorithm 2. By assigning the four-term sum-of-products in Line 6 to two DSPs, as shown in Fig. 8, we can complete most QPMM operations within the DSPs. We refer to the combination of these two DSPs as a processing element (PE). Fig. 8 shows PEs of four different latencies, obtained using the optional registers in the DSPs. The overall structure of the unrolled implementation of Algorithm 2 using these PEs is shown in Fig. 9. The demultiplexers (DeMUXs) are virtually free because they are merely distributions of wires, and most functions except for a few adders are completed within the DSPs, which is expected to contribute to a low SDR.
The algorithm parameters are k = 17, ℓ = 26, n = 17, m = 11 for BN254 and k = 17, ℓ = 26, n = 25, m = 16 for BLS12_381, selected to extend R for the lazy reduction (R = 2^289 for BN254 and R = 2^425 for BLS12_381). Ma et al. [33] reported that each 2-bit increase of R doubles the admissible range of the QPMM inputs A and B. These parameters extend the maximum input of Algorithm 2 from 2 M to 1024 M.

IV. EVALUATION

A. Evaluation Platform
We implemented the proposed method on a VCU118 evaluation board and evaluated the area and timing performance after place and route (PAR) with Vivado 2020.1. We used the default and performance-explore synthesis strategies for the modular multiplier and the pairing core evaluations, respectively. The VCU118 board contains a 16-nm Virtex UltraScale+ xcvu9p-2l FPGA, which has 147 780 CLBs (1 182 240 six-input LUTs and 2 364 480 FFs), 6840 DSPs, and 2160 36-kb block random access memories (BRAMs). The xcvu9p is a high-performance, large-scale FPGA for server-side applications and is adopted in the Alibaba, Huawei, and AWS clouds. Because the area notations differ between Ultrascale and other devices, we converted certain values in the implementation results such that one RAMB36 = two RAMB18s and one CLB = two slices.

B. Modular Multiplier Evaluation
We implemented Algorithm 2 in a fully unrolled manner on the VCU118 board. The performance evaluation results for BN254 and BLS12_381 (254- and 381-bit multipliers) at each PE latency are shown in Table I. As the latency increases, the operating frequency improves; at λ_PE = 4, the 254-bit modular multiplier operates at 623 MHz, close to the maximum operating frequency of 644 MHz [32]. The latency and resource consumption worsen as λ_PE increases; however, we do not consider this a problem because our goal is to maximize throughput. Even at λ_PE = 4, the SDR remains below 43, at around 20, indicating that the DSPs are used effectively. Based on these results, we adopted the λ_PE = 4 design, which has the best throughput, for the pairing processor evaluation in Section IV-C.
Table II compares our proposed multipliers (λ_PE = 4) with those from previous studies. The proposed method achieves the highest throughput, with the largest slice and DSP consumption, among the compared methods, while its latency remains comparable to the others. The proposed method also has the best TP/ESlices, a measure of area efficiency, indicating that it is suitable for server-side FPGAs, where many resources are available. Furthermore, the proposed method achieves the lowest SDR. This indicates that the slice consumption does not become a bottleneck when the proposed multiplier is embedded in a pairing processor that spends many slices on accelerating the modular adders.
References [34] and [35] evaluate modular multipliers on the same generation of devices as ours. Reference [35] is a low-latency architecture that uses a large number of DSPs and LUTs to complete the modular multiplication in a single cycle, achieving a latency of about 15% of our implementation's; meanwhile, its throughput is less than a tenth of ours, making it unsuitable for server-side applications. To compare our implementation with [34], we need to take resource utilization into account; however, we cannot use the SDR and ESlice metrics because Noyez et al. [34] do not report slice counts. If [34] were implemented with 62 cores in parallel, the throughput and resource utilization would be approximately 62 times larger, giving throughput/LUTs/FFs/DSPs = 78.05/39 618/81 158/372, which is comparable to our implementation in throughput and DSP count. In this case, our implementation achieves 1.5 times the throughput with about 1/12 of the LUTs and 1/3 of the FFs. The comparisons in Table II do not consider the difference in device generations. The proposed method was evaluated on the latest FPGA, Virtex Ultrascale+, whereas most existing methods were evaluated on Virtex-7, which is two generations older. The manufacturing process shrank from 28 to 16 nm between these generations; therefore, we should consider the resulting increase in operating frequency.

Fig. 9. Fully unrolled QPMM design (see Algorithm 2) using PEs. DeMUX elements can be implemented as wire splitting at no cost. Hence, although the right-most adders and the final four-cycle addition must be implemented with a few slice resources, almost the entire QPMM circuit can be implemented with DSP resources. This is why the module achieves a low SDR and high performance, and why it can run at the DSP's maximum frequency without the slices becoming a bottleneck.

TABLE II PERFORMANCE COMPARISONS OF THE PROPOSED MODULAR MULTIPLIERS AND THOSE FROM PREVIOUS STUDIES

TABLE III SINGLE-CORE PAIRING PERFORMANCE EVALUATION
Regarding the maximum operating frequency of the DSPs, Ultrascale+ reaches 644 MHz while Virtex-7 reaches 650 MHz, so there is no significant difference. If a design from a previous study that already runs near the DSP's maximum frequency [17] were evaluated on the VCU118, it would again be bounded by the maximum frequency and show almost the same performance. Conversely, it is difficult to predict the performance gain for implementations that are not near the maximum frequency limit. As an example, we estimate the frequency difference across one device generation to be at most 1.22 times, based on the results of [36], which evaluated the same multipliers on different FPGAs, Virtex-6 and Virtex-7. Even assuming a frequency difference of two generations (1.48×), our implementation achieves the best TP/ESlices.

C. Pairing Processor Evaluation
1) Single-Core Evaluation: Table III shows the performance comparison of single-core pairing implementations. In Table III, our pairing processors are evaluated with λ_PE = 4, because this parameter gives the best throughput and the SDR closest to 43. Note that the trade-off between the latency and throughput of the pairing computation can be adjusted by using the modular multiplier designs with λ_PE ∈ {1, 2, 3}.
For BLS12_381, our implementation executes τ = 7 pairings concurrently and completes them in 452 µs. As shown in Table III, Opasatian and Ikeda's implementation [13] shows the best latency and throughput. They implemented a low-latency core using an LUT-based modular reduction technique and evaluated it on the same generation of FPGA as ours; the throughput of their implementation exceeds ours by approximately 4%. However, their implementation uses a large number of LUTs to perform the modular reduction, which means its SDR exceeds 43 and is expected to worsen multicore performance. Since their paper reports only the LUT utilization (225 607), we estimated their slice utilization as 56 401 = 225 607/4. Using this value, the SDR of their implementation is 87.03, more than twice that of our implementation, indicating that under a multicore implementation the slice count restricts the number of cores and many DSPs remain unused. In contrast, our implementation has a well-balanced SDR (41.24) and is expected to perform better under multicore evaluation. Actual multicore performance comparisons are given in Section IV-C2.
Devlin [11] implemented a large-scale, server-side pairing processor and evaluated it on the xcvu37p, a device of the same generation as our xcvu9p. These devices have almost the same performance and circuit resources; thus, their work is the most comparable to our implementation results. Our implementation achieves approximately 9.8 times the throughput and 0.71 times the latency of Devlin's implementation. Although our implementation uses more than twice as many DSPs as Devlin's, its slice usage is less than half; their implementation was slice-bottlenecked and did not take advantage of the fast DSP primitives. Furthermore, our implementation has SDR = 41.24, whereas theirs has SDR = 236.95, meaning the performance gap will widen for multicore implementations. Compared to the state-of-the-art software implementation [42], our implementation shows approximately ten times the throughput at 0.7 times the latency.
For the BN254 curve, our implementation runs at 590 MHz, close to the DSP's maximum frequency of 644 MHz, and completes τ = 5 pairings in 187 µs. Compared with previous studies, our pairing implementation achieves the best throughput, comparable latency, and the best area efficiency (TP/ESlices). The previous work [15] is the only one evaluated on the same device; it aims for a low-latency pairing implementation and achieves the best latency of 102 µs. The proposed method improves throughput by 2.7 times at the cost of 1.8 times worse latency, which seems a good compromise. Note that our implementation allows the trade-off between latency and throughput to be tuned by changing the number of pipeline stages in the modular multiplier.
It is difficult to directly compare performance on different generations of FPGAs (Virtex-6 or -7), but we can highlight some advantages of our architecture. First, our implementation has an SDR, a device-generation-independent metric, close to 43, indicating a good balance of DSP and slice usage. In addition, because the operating frequency is the main difference when the same architecture is implemented on different FPGAs, we can predict the performance on other FPGAs by considering the frequency difference between them. In [36], a modular multiplier was evaluated on Virtex-6 and -7 devices, and the frequency difference was approximately 1.22 times. Assuming a 1.22-times frequency difference per generation, we infer a difference of approximately 1.82 times over three generations. If the proposed architecture were implemented on a Virtex-6 FPGA, a device three generations older than Ultrascale+, the frequency can be estimated at 324 MHz and the throughput at 14 616 pairings/s. Even in this case, the throughput of the proposed architecture is the largest among existing studies.
2) Multicore Evaluation: This section compares the multicore pairing performance on the target FPGA xcvu9p (see Table IV). While the performance of our implementation is an actual PAR result, most of the values in Table IV are ideal estimates, as few previous studies provide multicore evaluation results. These estimates assume that circuit resources and throughput are proportional to the number of cores and that the frequency remains constant as the number of cores increases. The number of cores is the maximum number of pairing processor cores implementable on an xcvu9p in terms of FPGA resources. Formally, #Cores = min(⌊295 560/#Slices⌋, ⌊6840/#DSPs⌋), where 295 560 and 6840 are the numbers of slices and DSPs available on the xcvu9p, respectively. For Devlin's design [11], for example, #Cores = min(⌊295 560/81 750⌋, ⌊6840/345⌋) = min(3, 19) = 3, where the slice resources limit #Cores.
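The core-count rule above, written out (Python; the resource totals and the [11] example values are those quoted in the text):

```python
# Table IV's core-count rule: the scarcer resource bounds the core count.
XVU9P_SLICES, XVU9P_DSPS = 295_560, 6_840

def n_cores(slices_per_core: int, dsps_per_core: int) -> int:
    return min(XVU9P_SLICES // slices_per_core, XVU9P_DSPS // dsps_per_core)

assert n_cores(81_750, 345) == min(3, 19) == 3  # Devlin [11] is slice-bound
```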
As shown in Table IV, our implementations present the best throughput for both the BN254 and BLS12_381 pairings. Comparisons with [11], [13], and [14], which were evaluated on FPGAs of the same generation, are the most important for a fair evaluation. Devlin [11] uses too many slices (81 750 of 295 560), which limits the number of implementable cores to three; our implementations achieve better latency and several times better throughput than [11] and [14]. With an SDR close to 43, our architecture can maximize performance in multicore environments. For the comparison with Opasatian's implementation [13], which is the most comparable to ours, we estimated its performance under two conditions (the best and worst cases). Since their study provides only the LUT count as resource utilization, we must estimate their slice utilization to calculate their multicore performance. In the best case, slices are converted as four LUTs = one slice, giving #Cores = 5 and a throughput of 80 385. However, typical synthesis results do not follow such an ideal conversion. In the synthesis results of our implementation, the ratio between LUTs and slices is 0.76:1; using this ratio, the #Cores and throughput of Opasatian's implementation are estimated as 1 and 16 077, respectively.
As a result, our implementations show several times better throughput than the previous studies for a multicore implementation using the maximum resources of the xcvu9p.

TABLE V POWER AND ENERGY COMPARISONS OF THE SINGLE-CORE PROPOSED PAIRING PROCESSOR AND PREVIOUS STUDIES
Furthermore, the performance estimates for the previous studies do not include the negative effects of multicore implementation, such as increased routing delay; taking these into account would widen the performance differences further.
3) Energy Consumption Evaluation: Table V presents the power and energy consumption of the proposed architecture as estimated by Vivado. For previous studies that did not provide power data, we estimated it based on the CMOS dynamic power equation [44, eq. (5.10)]

P = α f C V²

where P, α, f, C, and V denote the power, switching rate, clock frequency, total switching capacitance, and supply voltage, respectively. Assuming that α and V are the same for all implementations enables us to estimate power as P ∝ f C. We furthermore assume that C is proportional to the circuit area. As the area metric for calculating the C values, we use equivalent slices (ESlices = #Slices + 43 × #DSPs), where 43 (the ratio of the number of slices to the number of DSPs on the xcvu9p) is the weight that converts a DSP into slices. As a result, we can estimate the power of previous works from the ratios of frequency and ESlices. Vivado estimated the dynamic power of our BLS12_381 implementation at 458 MHz to be 12.035 W, so the power consumption of [11] is estimated as 12.035 × (200/458) × (96 585/66 552) = 7.62 W. Table V shows that our architecture consumes a significant amount of power, but its energy per pairing is comparable to other studies. Note that our architecture is designed for cloud FPGAs, which are charged per FPGA; users therefore do not need to worry about power consumption, and using as many circuit resources as possible is reasonable.
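The scaling estimate above as executable arithmetic (Python; all numbers are those quoted in the text):

```python
# P ~ f * C with C approximated by equivalent slices (ESlices).
def eslices(n_slices: int, n_dsps: int) -> int:
    return n_slices + 43 * n_dsps  # 43 = slices-per-DSP weight on xcvu9p

def scaled_power(p_ref: float, f_ref: float, esl_ref: int,
                 f: float, esl: int) -> float:
    return p_ref * (f / f_ref) * (esl / esl_ref)

# Our BLS12_381 core: 12.035 W at 458 MHz, 66 552 ESlices.
# Devlin [11]: 200 MHz, 96 585 ESlices.
print(round(scaled_power(12.035, 458, 66_552, 200, 96_585), 2))
# prints 7.63, i.e., the ~7.62 W figure quoted above
```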
V. DISCUSSION

A. Performance Requirement for the Pairing Computation
Blockchain-based cryptocurrency is one of the applications that actively use PBC. Some modern blockchains [24], [25] use the BLS signature, which requires pairing computations for signature verification, to check the validity of transactions. When a transaction is issued in a blockchain-based cryptocurrency, the payer signs the transaction and sends it to the blockchain's peer-to-peer network. Each peer on the blockchain verifies the signature and adds the transaction to the next block if the verification result is valid. Because signature verification requires pairing, the speed of pairing is directly related to the number of transactions the blockchain can handle. The maximum throughput of 124 216 transactions per second of our BLS12_381 implementation exceeds the current VISA credit-card capacity of 65 000 transactions per second [45, p. 8]. If blockchain payments are used as much as or more than VISA credit cards, the proposed architecture will be able to handle all transactions. In terms of latency, the proposed architecture will become attractive in the future when blockchain payments are used for latency-critical payments.

B. Evaluation as a PBC Accelerator
When we use an FPGA as an accelerator for a general-purpose CPU rather than as a standalone encryption circuit, there is an overhead due to the connection interface between the FPGA and the CPU. We experimentally estimated this overhead in a system where the FPGA is connected to the server PC via PCIe 3.0 ×16. The experimental results show that 2 × 320-bit and 256 × 320-bit data transfers take about 17 and 22 µs, respectively, independent of the transfer direction, where 320 bits is the size of an F_p element for the BN254 pairing. The transfer time t [µs] can be interpolated as t = (5/81 280)x + 16.96, where x is the number of bits transferred. This implies that the CPU execution time of the operating system or application is more dominant than the physical data transfer; the transfer time is not proportional to the amount of data.
Using the above equation, we estimate the performance of the pairing accelerator. Our τ = 7-core BLS12_381 pairing implementation takes 6 × 7 F_p elements as input and produces 12 × 7 F_p elements as output, where the size of an F_p element is 436 bits. Assuming input and output data transfer times of 18 and 19 µs, the latency and throughput become about 7% worse, at 581 µs and 84 337 pairings/s, respectively.
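Plugging the element counts into the measured transfer model reproduces the 18 and 19 µs figures (a Python check; numbers from the text):

```python
# t(x) = (5 / 81 280) * x + 16.96 us for x transferred bits (PCIe 3.0 x16).
def transfer_us(bits: int) -> float:
    return (5 / 81_280) * bits + 16.96

t_in = transfer_us(6 * 7 * 436)    # inputs:  ~18.1 us
t_out = transfer_us(12 * 7 * 436)  # outputs: ~19.2 us
print(f"{t_in:.1f} us in, {t_out:.1f} us out")
```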
To achieve the same throughput with a CPU, whose single-core throughput is 1538 pairings/s as shown in Table III, roughly ⌈84 337/1538⌉ = 55 CPU cores must be operated in parallel. With some recent high-end CPUs, it is not impossible to achieve the same throughput as the FPGA; however, it is not reasonable to devote all computational resources to pairing computations, because server CPUs must also provide many services other than cryptographic operations. In addition, FPGA accelerators have an advantage in power consumption: the 64-core Intel Xeon 8592+ CPU has a thermal design power of 350 W, which is tens of times higher than that of the FPGAs (see Table V). Moreover, when all cores are used, the expected performance may not be achieved due to thermal throttling.

C. Applicability to Post-Quantum and Other Cryptography
Our pairing architecture is designed to efficiently perform the F_p and F_{p^2} operations that are the primitive operations of elliptic curve-based pairing computation. Therefore, our architecture can efficiently execute cryptosystems constructed over F_p, such as isogeny-based cryptography and elliptic curve cryptography. Isogeny-based cryptography is one family of post-quantum cryptography (PQC). SQISign [46] is an isogeny-based signature scheme submitted to NIST as a candidate for PQC standardization. SQISign has the advantage of small key and signature sizes and the disadvantage of high computational complexity, typically taking tens to thousands of milliseconds on a standard CPU. We believe that FPGA acceleration is a promising solution to this disadvantage, as it is for the pairing computation in this article. Our architecture is basically suitable for SQISign, which is constructed over F_{p^2}; simply modifying the instruction scheduling would allow us to execute SQISign. Since SQISign uses a 252- to 506-bit p depending on the security level, the data path width must be widened to support the higher security levels (our BLS12_381 implementation supports up to a 381-bit p).

D. Tamper Resistance
The pairing architecture proposed in this article is primarily designed to maximize throughput and has no tamper resistance against physical attacks. However, recent studies report that attackers can remotely perform side-channel [47] and fault attacks [48] against FPGAs in the cloud; embedding tamper-resistance techniques into our architecture therefore remains future work.
Side-channel attacks include timing attacks, which exploit differences in processing time, and differential power analysis (DPA)-type attacks, which exploit differences in power consumption. Most of the pairing computations in this study are implemented with constant-time algorithms and are secure against timing attacks because no timing differences exist. The F_p inversion is the only variable-time part; to make it secure, we must ensure that the computation always completes in the same number of cycles, e.g., by inserting dummy operations.
Against DPA-type attacks and fault attacks, Kim et al. [49] propose the use of randomized projective coordinates, in which a pairing input is represented in a kind of randomized redundant form. To apply this countermeasure to our architecture, we need to implement a process that adds a random value to the input. This can be done by changing the instruction scheduling and is expected to incur a time overhead of a few percent at most. The FPGA resource consumption of the random number generator is predicted to be negligible compared with the relatively large multicore pairing circuit.

VI. CONCLUSION
First, we proposed an unrolled QPMM algorithm [30] suitable for a server-side FPGA, the XVU9P, which provides DSP48E2 primitives as dedicated multipliers. The proposed method takes full advantage of DSP48E2 functions such as the asymmetric multiplier and the three-input post-adder, completing most of the algorithm within the DSPs and achieving the highest throughput (188.98 Gb/s), the best area efficiency (6873 TP/ESlice), and a low SDR (20.16).
We further designed a pairing processor architecture that embeds the proposed modular multiplier. By supporting redundant adders and interleaved execution, the proposed architecture successfully maintains a high frequency and achieves a BLS12_381 pairing throughput of 15 477 pairings/s. In addition, the proposed pairing architecture has a well-balanced SDR (41.24), indicating that it maximizes performance under a multicore implementation. The multicore evaluation showed that the throughput of the proposed method is more than five times that of previous studies.

Fig. 1. Abstract images of a multicore implementation with (a) high-SDR (≫ 43) cores and (b) low-SDR (< 43) cores. (a) leaves large unused regions including many DSPs, indicating that the FPGA does not achieve maximum performance. (b) uses most of the DSP resources; the remaining unused slices are effectively available for other peripherals such as I/O interfaces.

Fig. 3. Abstract overall architecture of the proposed pairing processor. The brackets [ , ) represent the range of the output values of each module, where M is a constant of the modular multiplication algorithm. This architecture adopts the lazy-reduction technique for fast modular additions; hence, modules other than the QPMM extend the output ranges. The QPMM module performs all reduction operations, reducing values of up to 1024 M − 1 to at most 2 M − 1.

TABLE I PERFORMANCE EVALUATION RESULTS OF OUR PROPOSED 254- AND 381-BIT MODULAR MULTIPLIERS IMPLEMENTED ON VCU118. λ_PE AND λ_MUL REPRESENT THE LATENCY OF THE DSP AND OF THE ENTIRE MODULAR MULTIPLICATION, RESPECTIVELY

TABLE IV MULTICORE PERFORMANCE EVALUATION. VALUES IN ITALIC ARE OUR ROUGH (BEST-CASE) ESTIMATES. THROUGHPUT IN BRACKETS IS THE STRICT (WORST-CASE) ESTIMATE