PipeFL: Hardware/Software Co-Design of an FPGA Accelerator for Federated Learning

Federated learning solves the problems of data silos and data fragmentation while preserving privacy. However, the cryptographic algorithms used in federated learning bring a significant increase in computational complexity, which limits the speed of model training. In this paper, we propose a hardware/software (HW/SW) co-designed field-programmable gate array (FPGA) accelerator for federated learning. First, we analyze the time consumption of each stage in federated learning and the cryptographic algorithms involved, and identify the performance bottleneck. Second, a HW/SW co-designed architecture is introduced, which can speed up encryption, decryption and ciphertext-space computation at the same time without reconfiguring the FPGA circuit. In the HW part, we propose a Hardware-aware Montgomery Algorithm (HWMA) that exploits data parallelism and pipelining, and design an FPGA architecture that decouples data access from computation. In the SW part, an Operator Scheduling Engine (OSE) is designed, which can flexibly resolve the target algorithm into multiple HWMA calls and complete the remaining non-computation-intensive calculations. Finally, evaluations of both specific algorithms and practical applications are carried out. Experimental results show that, when deployed on an Intel Stratix 10 FPGA, our accelerator increases the throughput of 2048-bit modular multiplication, modular exponentiation and the Paillier algorithm to more than 3x that of the CPU. When integrated into an industrial-grade open-source federated learning framework, the end-to-end training time of linear regression and logistic regression can be shortened by 2.28x and 3.30x respectively, which is more than 2x faster than the best reported results of FPGA accelerators.


I. INTRODUCTION
In recent years, with the continuous update and iteration of algorithms and specialized hardware, machine learning and deep learning have been developing rapidly, but these algorithms usually have a huge demand for training data. Traditional distributed machine learning usually assumes that all the training datasets can be managed and used by one data center, which may leak and threaten the privacy contained in the data. With the implementation of the European Union's General Data Protection Regulation [1], model-related service providers must meet the prerequisite of ensuring the privacy of the user. Against this background, the federated learning technology [2] proposed by Google has gained more and more attention.

Among heterogeneous hardware platforms, field-programmable gate arrays (FPGAs) [10], [11], [12] have gained more and more attention in recent years because of their reconfiguration capability, rich logic implementation resources, and rapid development cycle supported by high-level synthesis (HLS). Therefore, how to use heterogeneous hardware to accelerate federated learning has been widely discussed in both academia and industry.

The main contributions of this paper are as follows:
• We make a quantitative analysis of the training process of federated learning and the PHE algorithms involved, and identify the performance bottleneck of the current federated learning training stage, which lays a theoretical foundation for this paper and future work.
• A hardware/software (HW/SW) co-design method is proposed targeting CPU+FPGA-based heterogeneous platforms, which can accelerate all time-consuming algorithms in federated learning that use ModMult or ModExp as the basic operation without reprogramming the FPGA repeatedly.

• The accelerator proposed in this paper takes the real application in federated learning into consideration, uses three OpenCL kernels to decouple data access and computation, and exploits both data parallelism and pipeline parallelism to reach state-of-the-art throughput.
• The proposed accelerator is implemented on an Intel Stratix 10 GX2800 FPGA and its performance is verified. At a bit width of 2048 bits, the throughput of the ModMult operation reaches 1121.2 kOP/s, and the throughput of the ModExp operation is 743.1 OP/s, which are 5.3x and 3.0x of CPU performance respectively. When integrated into an industrial-grade open-source federated learning framework, the end-to-end linear regression and logistic regression training time can be reduced by 2.28x and 3.30x respectively compared with the CPU. Our design shows more than 3x improvement in throughput over the best FPGA-based federated learning accelerator reported in the literature.

This paper is organized as follows: Section II analyzes existing accelerator designs, summarizes their common characteristics, and puts forward directions for further optimization in the federated learning scenario. Section III conducts an in-depth analysis of the encryption algorithms in federated learning and points out the bottlenecks of the current algorithms as theoretical guidance. Based on the previous analysis, Section IV describes the proposed HW/SW co-design in detail. In Section V, two experiments are designed to evaluate the acceleration effect of the proposed accelerator on both specific algorithms and end-to-end federated learning applications. The last section, Section VI, gives conclusions and puts forward future research directions.

II. RELATED WORK
In view of the challenges brought by encryption algorithms, researchers have proposed several accelerator schemes for encryption algorithms. Among them, due to its hardware friendliness, the Montgomery modular multiplication algorithm [13] is the most widely used. San et al. [14]

To better illustrate the performance bottleneck of federated learning, we tracked and recorded the time taken to perform a two-party vertical federated learning task on the CPU, and quantitatively analyzed the time cost. There are 14 iterations from training start to convergence, and the result is shown in Fig. 1.

It can be seen from Fig. 1 that encryption, decryption and ciphertext-space calculation account for 86.4% of the total time, whereas key generation and other operations (including some plaintext calculation, gradient update and local data access time) account for only 13.6%. This is due to the sharp increase in the overall amount of computation caused by cryptography-related operations. In addition to the parts shown in the figure, the datasets of all parties need to be aligned through private set intersection before the training stage. The time consumed by this process is closely related to the amount of data held by the participants, and it will consume significant additional time when the datasets are large.

Through the above analysis, accelerating cryptography-related computation has become a key factor in accelerating federated learning. In mature federated learning frameworks such as FATE [20], the RSA [5] algorithm is used in the dataset alignment phase and the Paillier [4] algorithm is used in the training phase. Therefore, the related operations involved in these two PHE algorithms are the main acceleration targets of this work. Next, we review these two algorithms in their specific application scenarios, and summarize the characteristics of federated learning tasks, which leads to the motivation of this paper.

In RSA key generation, two large primes p and q are selected and the modulus is computed as n = p × q (1), and the Euler function ϕ(n) can be computed as ϕ(n) = (p − 1)(q − 1). Then, a positive integer e that is coprime with ϕ(n) and smaller than ϕ(n) is selected, and the private key d can be obtained by calculating the modular multiplicative inverse of e with respect to ϕ(n), i.e., e · d ≡ 1 (mod ϕ(n)).

For the Paillier algorithm, key generation produces the public key (n, g) and the private key (λ, µ). The selection of p, q and the calculation of n are the same as in the RSA algorithm. Then, an integer g is selected to satisfy g ∈ Z*_{n²} (the multiplicative group of integers modulo n²); in FATE, g is selected as g = n + 1 to simplify computation. λ is the least common multiple (lcm) of (p − 1) and (q − 1), i.e., λ = lcm(p − 1, q − 1), and µ can be calculated by µ = (L(g^λ mod n²))^(−1) mod n, where L(x) = (x − 1)/n. A plaintext m is encrypted as c = g^m · r^n mod n², where the random number r should satisfy 0 < r < n and r ∈ Z*_{n²}.
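To make the operations involved concrete, the following minimal Python sketch (our own illustration, not the FATE implementation; the function names are ours) generates a Paillier key pair with g = n + 1 and performs encryption and decryption. The `pow` calls are exactly the ModExp operations, and the multiplications modulo n² are the ModMult operations, that the accelerator targets.

```python
import math
import secrets

def paillier_keygen(p, q):
    """Paillier key generation with g = n + 1 (the choice used in FATE)."""
    n = p * q
    g = n + 1
    lam = math.lcm(p - 1, q - 1)        # lambda = lcm(p - 1, q - 1)
    # With g = n + 1, L(g^lambda mod n^2) = lambda mod n, so mu is simply
    # the modular inverse of lambda modulo n.
    mu = pow(lam, -1, n)
    return (n, g), (lam, mu)

def paillier_encrypt(pub, m):
    """Encrypt plaintext m: c = g^m * r^n mod n^2 (ModExp + ModMult)."""
    n, g = pub
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1    # random r with 0 < r < n
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def paillier_decrypt(pub, priv, c):
    """Decrypt ciphertext c: m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pub
    lam, mu = priv
    n2 = n * n
    l = (pow(c, lam, n2) - 1) // n      # L(x) = (x - 1) / n
    return (l * mu) % n
```

Homomorphic addition of two ciphertexts is a single multiplication modulo n² (a ModMult), which is why the ciphertext-space computation during training benefits from the same accelerator as encryption and decryption.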

The process of training is shown in Fig. 3.
In Fig. 3, D is the ID set of the data shared by the participants. From the training process, the following characteristics of federated learning tasks can be summarized:
3) A relatively large batch size is commonly used for training. A large batch size is conducive to improving the training speed, but an excessively large batch size is detrimental to the model's generalization ability [21]. Thus, in federated learning, the batch size of common training tasks such as logistic regression is usually set to about 10K for large datasets.

Corresponding one-to-one with the above findings, this paper proposes the following innovations:

The CPU is referred to as the host or SW part. It is used to control processes such as platform management and context creation. The timeline of the HW/SW partition is shown in Fig. 5.

Since the basic operation of most cryptography-related algorithms in federated learning is ModMult or ModExp, we implement the HWMA with three dedicated units in hardware. These three units will be illustrated in the next subsection.

On the CPU (SW part), we propose an Operator Scheduling Engine (OSE), which realizes the initialization of the OpenCL framework, performs pre-calculation (e.g., key generation and modular multiplicative inverse computation), and resolves the target algorithm into multiple HWMA calls. It should be noted that while the FPGA performs HWMA operations, the OSE performs non-computation-intensive operations (e.g., plaintext operations and loss computation) at the same time. Therefore, the time consumption of these calculations can overlap, further improving the acceleration effect. Both the OpenCL initialization and the pre-calculation phase need to be performed only once in an entire federated learning task, while the other phases are executed multiple times according to different scenarios.

Each HWMA call in Fig. 5 is realized by the following steps: First, the OSE sends instructions to the DLU through the OpenCL framework and ensures that the required data is in global memory. Next, the DLU on the FPGA reads the data and sends it into the MMU for accelerated computation. Then, the results of the MMU are transmitted back into global memory through the DSU, and will be used for subsequent calculations or fetched by the OSE.
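As an illustration of this call sequence, the following pyOpenCL sketch shows one HWMA call from the host side. The kernel names, argument lists and buffer layout are our assumptions for illustration, not the actual OSE code; the MMU is assumed to be an autorun kernel that needs no explicit launch.

```python
import numpy as np
import pyopencl as cl

def hwma_call(ctx, dlu_queue, dsu_queue, program, x_words, y_words, ctrl):
    """One HWMA call: host -> global memory -> DLU -> MMU -> DSU -> host.
    x_words / y_words are operand vectors packed as 32-bit words."""
    mf = cl.mem_flags
    buf_x = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x_words)
    buf_y = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y_words)
    buf_ctrl = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=ctrl)
    result = np.empty_like(x_words)
    buf_res = cl.Buffer(ctx, mf.WRITE_ONLY, result.nbytes)

    # DLU and DSU are single work-item kernels launched on separate command
    # queues, so data load and store overlap with the MMU computation.
    # The DSU receives its control word from the DLU over an on-chip channel,
    # so the host only hands it the result buffer.
    program.data_load_unit(dlu_queue, (1,), None, buf_x, buf_y, buf_ctrl)
    ev_store = program.data_store_unit(dsu_queue, (1,), None, buf_res)

    # Only the DSU event must finish before the results are read back;
    # meanwhile the host (OSE) can run plaintext work in parallel.
    cl.enqueue_copy(dsu_queue, result, buf_res, wait_for=[ev_store])
    return result
```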

The accelerator hardware architecture proposed in this paper is divided into three main parts: the Data Load Unit (DLU), the Data Store Unit (DSU) and the Montgomery Multiply Unit (MMU). The overall architecture of the designed on-chip system is shown in Fig. 6. All parameters used in the accelerator design are listed in Table 1.

The task of the Data Load/Store Unit (DLU/DSU) is to realize the data interaction between the FPGA device and the CPU, which is controlled and launched by the SW side. The DLU and DSU form the data transmission path between the FPGA on-chip computing resources and global memory. We designed command queues for the DLU and DSU respectively; when computing tasks are enqueued by the pyOpenCL host, the DLU and DSU are launched at the same time.

Once launched, the DLU begins to access global memory with the predefined data structure, fetching the data necessary for FPGA computing. The task scheduler in the DLU has two functions: one is controlling whether the key needs to be updated in the DSU and MMU according to F_u, and the other is controlling the number of required memory accesses according to V_num. These two parameters are concatenated bit-wise into a ctrl parameter and passed to the task scheduler in the DLU by the SW program through global memory; afterwards, ctrl is passed to the DSU and MMU through the on-chip control channel, which reduces the number of control flows. For example, if a calculation task has a batch size of 4096, vs is 2, and there is no need to update the key m, then the task scheduler will receive V_num = 2048 and F_u = False. It will then make the DLU perform 2048 off-chip memory accesses for the data X and Y, skip the memory accesses for m, and meanwhile forward the control signal.
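A minimal sketch of how such a control word could be packed on the SW side is shown below; the exact bit layout (V_num in the low bits, F_u in the top bit) is our assumption, not specified in the text.

```python
F_U_BIT = 31   # assumed layout: top bit = key-update flag, lower bits = V_num

def pack_ctrl(v_num: int, f_u: bool) -> int:
    """Concatenate V_num and F_u bit-wise into one 32-bit ctrl word."""
    assert 0 <= v_num < (1 << F_U_BIT)
    return (int(f_u) << F_U_BIT) | v_num

def unpack_ctrl(ctrl: int):
    """Recover (V_num, F_u); the task scheduler performs the mirror operation."""
    return ctrl & ((1 << F_U_BIT) - 1), bool(ctrl >> F_U_BIT)

# Example from the text: batch size 4096, vs = 2, key m unchanged.
ctrl = pack_ctrl(v_num=4096 // 2, f_u=False)   # V_num = 2048, F_u = False
```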

Each time a set of vectors is obtained, the DLU immediately writes it into the corresponding first-in-first-out queues (FIFOs) on the FPGA for computing in the MMU. After startup, the DSU first receives control instructions from the DLU, and then keeps reading the result FIFO under the control of the task scheduler until all calculation tasks are completed. The DSU is also responsible for the final subtraction of the modulus m in the Montgomery algorithm.

K denotes the number of 32-bit integers, consistent with the data structure definition in Section IV-C1, and is determined by the bit width of the input data. For example, when the calculation of 2048-bit input data needs to be accelerated, the value of K is 64. Compared with the original Montgomery algorithm, the optimizations in our HWMA are mainly as follows:
1) The frequency of updating m and m_inv is reduced. As summarized in Section III-C, in the PHE algorithms of federated learning, the key is updated only once during a training task, and it usually appears as the modulus of the Montgomery algorithm. Therefore, we enable the accelerator to receive the control signal from the DLU and update the key only when needed, so as to avoid repeatedly transmitting the same key for multiple vectors.

In addition to the above optimizations in the MMU, we successfully decouple computation and data access through on-chip FIFOs. Since the calculation of subtracting m at the end of the original Montgomery algorithm involves a branch structure, which makes the operations on the different large numbers in one SIMD vector vary, placing it directly in the MMU would prevent the outer loop from forming an efficient pipeline. Therefore, we move it to the DSU and remove the branch structure in the loop through a mask calculation, thus successfully keeping the pipeline efficient.

The AP is responsible for parsing the target algorithm into ModMult and ModExp operations and invoking the ModMult/ModExp Module on the CPU accordingly; the ModMult/ModExp Module then calls the hardware operators on the FPGA to perform high-speed HWMA operations, thus accelerating the target algorithms. For example, for each encryption or decryption operation of the RSA algorithm, the AP parses it as one call to the ModExp Module. In a real application, the AP uses the pre-defined batch size and the vs of the FPGA operator to automatically generate the number of required HW operator calls V_num. Table 2 shows the AP's parsing of some commonly used algorithms in the industrial-grade open-source federated learning framework FATE [20].
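For reference, the computation performed per HWMA call can be modeled by the word-serial Montgomery multiplication below (a plain-Python sketch with K 32-bit words, not the actual OpenCL kernel); the conditional subtraction at the end is exactly the step that PipeFL moves out of the MMU into the DSU.

```python
W = 32                     # word width in bits
MASK = (1 << W) - 1

def mont_setup(m, K):
    """Per-key precomputation (done once in the OSE): m_inv = -m^{-1} mod 2^W
    and R^2 mod m, where R = 2^(W*K)."""
    m_inv = (-pow(m, -1, 1 << W)) & MASK
    r2 = pow(1 << (W * K), 2, m)
    return m_inv, r2

def mont_mult(x, y, m, m_inv, K):
    """Word-serial Montgomery multiplication: returns x * y * R^{-1} mod m."""
    acc = 0
    for i in range(K):
        x_i = (x >> (W * i)) & MASK           # i-th 32-bit word of x
        acc += x_i * y
        q = ((acc & MASK) * m_inv) & MASK     # make the low word divisible by 2^W
        acc = (acc + q * m) >> W              # exact division by 2^W
    # Final correction: in PipeFL this branch lives in the DSU (as a mask
    # computation) so the MMU pipeline stays branch-free.
    return acc - m if acc >= m else acc
```

A plain modular product x·y mod m then takes two calls, e.g. `mont_mult(mont_mult(x, y, m, m_inv, K), r2, m, m_inv, K)`, which converts the Montgomery-domain result back to the normal domain.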

It is worth noting that, thanks to the flexibility of the proposed architecture, when a new algorithm other than those listed in Table 2 needs to be accelerated, as long as the algorithm can be realized with ModMult or ModExp calculations, the acceleration can be achieved by defining a new parsing method in the AP on the SW side, without modifying or reprogramming any HW circuit on the FPGA.

The workflow of the ModMult Module is shown in the orange part of Fig. 10. When the key n does not need to be updated, the ModMult Module transmits F_u = False when calling the DLU, making the FPGA skip the step of updating m. It should be noted that when an input parameter of the HWMA is not composed of V_num vectors (e.g., p in Algorithm 3), the module will expand it into a vector to meet the input requirements of the hardware.
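As a sketch of this extensibility (with hypothetical names and a made-up registry; the actual parsing rules are those listed in Table 2), registering a new algorithm in the AP amounts to describing it as a list of ModMult/ModExp Module calls:

```python
# Hypothetical registry mirroring the role of the AP described above: each
# entry resolves one application-level operation into Module calls.
PARSERS = {}

def register(name):
    def wrap(fn):
        PARSERS[name] = fn
        return fn
    return wrap

@register("rsa_encrypt")
def parse_rsa_encrypt(msg_batch, e, n):
    # One RSA encryption batch -> one ModExp Module call.
    return [("ModExp", dict(base=msg_batch, exp=e, mod=n))]

@register("paillier_encrypt")
def parse_paillier_encrypt(msg_batch, r_batch, n):
    # With g = n + 1, g^m mod n^2 = (1 + m*n) mod n^2, so only r^n needs a
    # ModExp, followed by one ciphertext-space ModMult.
    n2 = n * n
    g_pow = [(1 + m * n) % n2 for m in msg_batch]
    return [("ModExp", dict(base=r_batch, exp=n, mod=n2)),
            ("ModMult", dict(x=g_pow, mod=n2))]
```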

The ModExp Module provides hardware-accelerated ModExp operations. In order to efficiently call the designed FPGA operator to realize the ModExp function, we use the classical square-and-multiply method, which decomposes each ModExp into a sequence of HWMA calls.
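A plain-Python sketch of this decomposition is given below, reusing the mont_setup, mont_mult and W definitions from the earlier sketch; it is our illustration of the standard left-to-right square-and-multiply scheme, not the exact scheduling used by the ModExp Module.

```python
def mod_exp(base, exp, m, K):
    """Compute base^exp mod m as a chain of Montgomery multiplications,
    i.e. one ModExp expanded into many HWMA calls."""
    m_inv, r2 = mont_setup(m, K)
    acc = (1 << (W * K)) % m                        # Montgomery form of 1
    base_m = mont_mult(base % m, r2, m, m_inv, K)   # base into Montgomery form
    for bit in bin(exp)[2:]:                        # left-to-right over exponent bits
        acc = mont_mult(acc, acc, m, m_inv, K)      # square
        if bit == "1":
            acc = mont_mult(acc, base_m, m, m_inv, K)   # multiply
    return mont_mult(acc, 1, m, m_inv, K)           # back out of Montgomery form
```

For a 2048-bit exponent this amounts to roughly 3,000 Montgomery multiplications per ModExp, which motivates batching many independent operands into V_num vectors as described above.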

In the design of the OpenCL HW code, we use the __attribute__((autorun)) attribute to configure the MMU as a kernel that runs automatically when powered on. Both the DLU and DSU are configured as single work-item kernels, and multiple parallel processing elements (PEs) are generated through the #pragma unroll directive.

Table 3 and Table 4 show the acceleration effect of our accelerator on common algorithms in federated learning, including ModMult/ModExp (RSA algorithm) and the Paillier algorithm. In order to facilitate comparison with other work, we carried out experiments at modulus bit widths of 1024 bits and 2048 bits respectively, and measured the throughput of the CPU as the baseline. When the modulus bit width is increased from 1024 bits to 2048 bits, there are two main reasons for the decrease in throughput: the first is that a higher data bit width leads to higher computational complexity; the second is that the data path and computation logic for a larger bit width occupy more FPGA resources, resulting in lower parallelism. In our final deployment, vs is configured as 4 at a 1024-bit modulus bit width and 2 at 2048 bits to ensure a relatively high frequency.

The mark "-" in the tables indicates that the result is not reported in the corresponding study, and "N/A" indicates that the design does not support the corresponding algorithm. From the comparison in Table 3 and Table 4, it can be seen that in the case of a 1024-bit modulus, our PipeFL achieves 4.0x, 2.1x and 2.6x performance improvements in the ModMult, ModExp and Paillier algorithms respectively compared with the server CPU. When the modulus bit width increases to 2048 bits, the improvements over the CPU increase to 5.3x, 3.0x and 3.2x respectively.

It can also be seen from Table 3 and Table 4 that our accelerator supports the ModMult, ModExp and Paillier algorithms at the same time using a unified hardware circuit, whereas most existing works [14], [16], [17], [19] are fixed to only one algorithm, which makes it impossible for them to accelerate the multiple algorithms involved in one federated learning task. Besides, as the target platform of many works is an embedded FPGA and their parallelism is not designed for large-scale data to improve throughput, their reported performance is only slightly higher (e.g., [14], [18]) or even worse (e.g., [16], [17]) than that of our baseline implementation on the CPU. Table 4 shows that Yang et al. [19] implemented their Paillier accelerator on a Xilinx high-end FPGA, and its performance on the Paillier algorithm reaches about 5200 OP/s, which is 3.49x ours. There are two main reasons for their higher performance. One is that their design only supports the Paillier algorithm, so they can achieve a more efficient on-chip data flow. The other is that their target FPGA has 19% more DSP resources and 41% more RAM resources than ours, which further improves their performance. However, due to the lack of flexibility, when their accelerator is integrated into practical federated learning applications, the end-to-end performance is not as good as ours, which will be illustrated in detail in the next subsection.

As far as we know, only Yang et al. [19] have reported performance in an actual federated learning application. Therefore, we use the same datasets and training tasks for evaluation. Two Kaggle datasets, on breast cancer [27] and motor temperature [28], are used to train linear regression and logistic regression models respectively.

As our CPU is more powerful than the Intel Xeon E5-2686 v4 used in the comparative work, and we have made some streamlining of the FATE framework, our pure-SW implementation of federated learning training is already about 30% faster. In order to make the comparison fairer, we compare the performance improvement relative to each work's own CPU implementation; the results are shown in Fig. 11.

Our PipeFL adopts a flexible HW/SW co-design method, which can speed up Paillier encryption, decryption and ciphertext-space calculation in federated learning at the same time, whereas the work of Yang et al. [19] can only speed up Paillier encryption and decryption. Therefore, although their work has higher performance in Paillier encryption and decryption than ours, its effect on end-to-end training tasks is not as significant as ours.

From the comparison results shown in Fig. 11, we can see that our accelerator achieves 1.33 s and 1.02 s per iteration in end-to-end linear regression and logistic regression tasks respectively, achieving 2.28x and 3.30x time reduction over the SW implementation. Compared with the previous work, our training time is reduced by 2.48x and 3.04x respectively.

VI. CONCLUSION
Federated learning has been widely studied in recent years. Its privacy protection features make it suitable for China's national cloud strategy and worldwide privacy protection policies such as the European Union's General Data Protection Regulation. It has been applied in many scenarios such as government, telecom operators, finance and hospitals. However, the computational overhead caused by cryptography-related algorithms is a key factor that restricts the large-scale commercial use of federated learning.

In this paper, a flexible HW/SW co-designed FPGA accelerator for federated learning is proposed. The accelerator proposed in this research has been used as an infrastructure of the China Telecom cloud computing service.

In the future, we will try to apply the designed accelerator to more privacy computing scenarios and verify its acceleration effect on other algorithms based on ModMult and ModExp.

In addition, we also plan to combine more technologies, such as compression and sparse coding, to further improve performance.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their valuable comments and constructive suggestions, which helped in improving the quality of the paper.