Architecture and Application Co-Design for Beyond-FPGA Reconfigurable Acceleration Devices

In recent years, field-programmable gate arrays (FPGAs) have been increasingly deployed in datacenters as programmable accelerators that can offer software-like flexibility and custom-hardware-like efficiency for key datacenter workloads. To improve the efficiency of FPGAs for these new datacenter use cases and data-intensive applications, a new class of reconfigurable acceleration devices (RADs) is emerging. In these devices, the FPGA fine-grained reconfigurable fabric is a component of a bigger monolithic or multi-die system-in-package that can incorporate general-purpose software-programmable cores, domain-specialized accelerator blocks, and high-performance networks-on-chip (NoCs) for efficient communication between these system components. The integration of all these components in a RAD results in a huge design space and requires re-thinking the implementation of applications that need to be migrated from conventional FPGAs to these novel devices. In this work, we introduce RAD-Sim, an architecture simulator that allows rapid design space exploration for RADs and facilitates the study of complex interactions between their various components. We also present a case study that highlights the utility of RAD-Sim in re-designing applications for these novel RADs by mapping a state-of-the-art deep learning (DL) inference FPGA overlay to different RAD instances. Our case study illustrates how RAD-Sim can capture a wide variety of reconfigurable architectures, from conventional FPGAs to devices augmented with hard NoCs, specialized matrix-vector blocks, and 3D-stacked multi-die devices. In addition, we show that our tool can help architects evaluate the effect of specific RAD architecture parameters on end-to-end workload performance. Through RAD-Sim, we also show that novel RADs can potentially achieve $2.6\times $ better performance on average compared to conventional FPGAs in the key DL application domain.

INDEX TERMS Deep learning, field-programmable gate arrays, hardware acceleration, network-on-chip, reconfigurable computing.

FIGURE 1. Example 3D-stacked RAD instance integrating an FPGA fabric and an ASIC base die with different accelerator blocks, more on-chip memory, external memory controllers, and a general-purpose processor subsystem.

I. INTRODUCTION
Over successive generations, FPGAs have incorporated embedded hard blocks such as on-chip memories, fracturable multi-precision multipliers, and high-speed transceivers to enhance their efficiency [1]. However, with the increasing deployment of FPGAs as datacenter accelerators, we are witnessing a more radical transition from conventional FPGAs to more complex beyond-FPGA reconfigurable acceleration devices (RADs). These are heterogeneous devices that integrate a traditional reconfigurable fabric with other forms of compute and communication infrastructure, such as software-programmable cores, domain-specialized accelerator blocks, and hard NoCs, as in the example RAD instance in Fig. 1.

What makes the design problem even more complicated is that FPGA applications cannot be effortlessly migrated to novel RADs or readily make the best use of the specialized accelerator blocks and NoC-based communication for improved performance. Therefore, architects need to re-think the implementation of applications while designing their novel RADs, which creates a more challenging architecture and application co-design problem.

The design of conventional FPGA fabrics has been extensively studied with well-established research tools for exploring and evaluating new architecture ideas, such as the Verilog-to-Routing (VTR) flow [4]. These tools help answer questions on how to best architect the fine-grained programmable routing fabric and logic blocks, what type of hard blocks to integrate in the fabric, and the effect of these architecture enhancements on FPGA computer-aided design (CAD) algorithms and compile time. However, these tools are inadequate for RAD architecture exploration as they lack the following desired qualities:
1) Application-driven: These tools focus on optimizing FPGA architectures based on application-agnostic performance metrics such as the maximum operating frequency of given benchmark circuits. For complex RADs with coarse-grained accelerator blocks and latency-insensitive NoCs, architecture exploration must be driven by end-to-end application-specific performance. In other words, the key metric is how fast a given application is executed on a candidate RAD (cycles or runtime) rather than how fast a given circuit is clocked on an FPGA fabric (clock frequency).
2) Higher level of abstraction: Conventional FPGA architecture exploration is typically driven by applications written in a hardware description language (HDL), which can create a productivity bottleneck when re-designing applications for RADs.
3) Rapid design space exploration: FPGA application designers usually rely on register-transfer level (RTL) simulation for functional verification of their applications. For RAD architecture and application co-design, RTL simulation would be very slow for such large complex systems and would require developing a tremendous amount of system components in HDL, such as NoC routers, accelerator blocks, and memory controllers. This labour-intensive approach would significantly limit the turn-around time for RAD architecture exploration, especially at early stages of the design process.
4) Packet routing: Mapping application designs to a conventional FPGA architecture involves placing logic blocks and routing wires between them on the programmable fabric. It has no notion of the packet-switched NoC-based communication between modules that is the backbone of novel RADs.
• Studying a variety of RADs ranging from multi-die NoC-connected FPGAs to devices augmented with specialized matrix-vector accelerator blocks and multi-active-die 3D-stacked architectures.

• Showcasing novel 3D-stacked RADs that can achieve 2.6× higher performance on average when compared to current conventional FPGAs, with up to 145 TOPS effective performance on key DL workloads.

We also open-source RAD-Sim along with the NPU example design.

Also, as the number of solid-state drives (SSDs) per server increases, FPGAs can perform near-data processing in SmartSSDs to alleviate the processor-to-storage bandwidth bottleneck [10].

Secondly, network-connected datacenter FPGAs can be flexibly combined into datacenter-scale service accelerators that offer low-latency processing for key datacenter services at a fraction of the power budget, as in Microsoft's Brainwave for DL inference [11] and Bing's search engine [6]. In both use cases, processing pipelines are frequently changed or upgraded, which justifies the use of FPGAs as they offer faster time-to-solution and less development effort compared to taping out specialized fixed-function chips. In addition, FPGAs offer a variety of high-bandwidth I/Os that enable efficient data steering at the crossroads between different datacenter server endpoints such as network, storage, CPU cores, and accelerators.

However, the FPGA's fine-grained programmable routing fabric is struggling to keep up with the ever-increasing FPGA transceiver bandwidth and data flow of key datacenter workloads [12]. To mitigate these challenges, prior academic research has shown that embedding hard packet-switched NoCs in FPGA fabrics can offer tremendous on-chip data steering bandwidth at a minimal area cost and without affecting the FPGA's flexibility [13], [14]. As a result, hard NoCs were recently adopted in commercial FPGAs from Xilinx [15], Achronix [16], and Intel [17]. Besides their programmable routing and logic, modern FPGAs incorporate a variety of hardened ASIC-style blocks that ideally capture common functionalities across as many applications as possible without sacrificing the FPGA's flexibility. Taking DL acceleration as an example, the composition of layers, the data manipulation between them, vector operations, and pre/post-processing stages might significantly differ between workloads, which can benefit from the FPGA's reconfigurability. However, all of them include many dot-product operations that can benefit from the increased efficiency of hardening as high-performance tensor blocks.

… scalable multi-core CPU simulation and more accurate performance and power modeling of mobile cores, respectively. GPGPU-Sim [32] is another academic simulator for contemporary Nvidia GPU architectures that can run CUDA or OpenCL workloads and supports advanced features such as TensorCores and CUDA dynamic parallelism. Unlike these examples, our work does not target classic von Neumann architecture exploration, but rather focuses on novel RADs that combine traditional FPGA fabrics with other styles of compute architectures. To evaluate RAD architectures, the input to the simulator is not just compiled application instructions. Instead, it can be a mix of instructions for any software-programmable RAD components (e.g. coarse-grained accelerator blocks) and custom user-defined modules implemented on the FPGA fabric.

Many simulators have also been implemented to evaluate custom application-specific accelerator architectures, such as those in [33], [34], and [35]. Aladdin [36] is a more general accelerator simulator for estimating the performance and power of specialized dataflow hardware from a high-level C description. More recently, gem5-Aladdin [37] integrated the gem5 CPU simulator with Aladdin to model systems-on-chip (SoCs) that include both CPUs and accelerator functional units, with a main focus on system-level considerations such as memory interfaces and cache coherency. Similarly, RAD-Sim can model specialized accelerator blocks as components of a RAD architecture. However, it accepts any user-specified accelerator design written in SystemC and is not limited to dataflow accelerators controlled by finite-state machines as in Aladdin. RAD-Sim also combines accelerator blocks with other application modules implemented on the RAD's reconfigurable fabric and with packet-switched NoCs for system-level communication; evaluating such combined systems is not possible in gem5-Aladdin.
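To make this concrete, below is a minimal sketch (ours, not taken from RAD-Sim's codebase; the port names and 32-bit data width are illustrative assumptions) of the kind of user-defined SystemC block such a simulator can accept: a clocked module with valid-qualified input and output streams.

    #include <systemc.h>

    // Illustrative user-defined block: accumulates a stream of 32-bit words.
    SC_MODULE(AccumulateBlock) {
      sc_in<bool> clk, rst;
      sc_in<sc_uint<32>> rx_data;   // input stream, e.g. from a NoC adapter
      sc_in<bool> rx_valid;
      sc_out<sc_uint<32>> tx_data;  // running sum forwarded downstream
      sc_out<bool> tx_valid;

      sc_uint<32> acc;  // internal architectural state

      void Tick() {
        if (rst.read()) {
          acc = 0;
          tx_valid.write(false);
        } else if (rx_valid.read()) {
          acc = acc + rx_data.read();
          tx_data.write(acc);
          tx_valid.write(true);
        } else {
          tx_valid.write(false);
        }
      }

      SC_CTOR(AccumulateBlock) : acc(0) {
        SC_METHOD(Tick);
        sensitive << clk.pos();
      }
    };

Because such a block is plain SystemC, it can contain arbitrary control flow and state, rather than only FSM-controlled dataflow.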

SIAM [38] is a recent example of an architecture simulator focusing on emerging compute technologies. It models chiplet-based in-memory compute for deep neural networks, and integrates architecture, NoC, network-on-package, and DRAM models to simulate an end-to-end system. Although our work similarly aims to model complete systems integrating different components, including NoCs and specialized accelerator blocks, it is not limited to modeling in-memory DL compute and focuses mainly on the reconfigurable computing domain. For modeling RADs, another key difference is that both the placement of compute modules and their attachment to NoC routers have to be flexible (i.e. not an architecture choice, but programmed at application design time) due to the reconfigurability of the FPGA fabric.

In this section, we first introduce an overview of our complete RAD architecture exploration flow. The flow consists of two main components: RAD-Sim, an architecture simulator, and RAD-Gen, an ASIC implementation flow for hardened RAD components. For NoC routers and any other hardened functionalities, RAD-Gen pushes these modules through the ASIC implementation flow to provide architects with silicon area footprint, timing, and power results for these blocks. Both RAD-Sim and RAD-Gen share the same front-end, which takes as input the RAD architecture parameters and NoC specifications. RAD-Gen then modifies a parameterizable NoC router implementation based on the user-specified inputs, and pushes the RTL implementations of the NoC and other system modules through existing ASIC implementation tools targeting either proprietary standard cell libraries or open-source ones (e.g. FreePDK [39] and OpenRAM [40]). To perform power analysis of a RAD/application combination, RAD-Gen is used to obtain energy-per-operation results for the implementation of the RAD's ASIC components in a given process technology. These results, coupled with toggle rates/activities collected by RAD-Sim for a specific simulated application, can be used to estimate the overall power consumption.

On the other hand, an enhanced FPGA CAD flow is used to synthesize, place, and route the application design modules to be implemented on the reconfigurable fabric of a candidate RAD. An enhanced version of the VTR flow (in development) can directly model NoC routers/adapters as hard blocks embedded in the FPGA fabric. However, we can also model them in commercial CAD tools by creating reserved logic lock regions of appropriate size and locations, and connecting design module interfaces to registers placed in these regions.
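As a concrete illustration of the RAD-Gen power-analysis step described above, the sketch below combines energy-per-operation numbers (as RAD-Gen would produce) with per-block activity counts (as RAD-Sim would collect); all names and values are placeholders of our own choosing, not outputs of either tool.

    #include <cstdio>

    // Energy-per-op from the ASIC flow (RAD-Gen) and activity counts from
    // simulation (RAD-Sim); names and numbers are illustrative placeholders.
    struct BlockActivity {
      const char* name;
      double energy_per_op_pj;
      double ops;
    };

    int main() {
      const double runtime_ns = 50000.0;  // simulated runtime of the workload
      const BlockActivity blocks[] = {
          {"noc_router", 1.8, 4.2e6},
          {"mvu_slice", 12.5, 1.1e6},
      };
      double total_pj = 0.0;
      for (const BlockActivity& b : blocks) {
        total_pj += b.energy_per_op_pj * b.ops;
      }
      // Average power = total energy / runtime; 1 pJ per ns equals 1 mW.
      std::printf("Average power: %.1f mW\n", total_pj / runtime_ns);
      return 0;
    }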

Additionally, the embedding of hard NoCs in FPGA fabrics presents a new placement problem: modules must be placed not only where they have sufficient fabric resources and minimize traditional programmable routing delay, but also so that their connection to NoC adapters on nearby routers does not cause undue NoC congestion. RAD-Sim can evaluate NoC performance (latency and congestion) given a specific placement solution and expected application NoC traffic patterns. The enhanced FPGA CAD tools can then use these metrics to adjust module placement and assignment to NoC adapters/routers, and iterate again if latency constraints are not met. This is similar in concept to invoking static timing analysis during the placement stage in the conventional FPGA CAD flow to evaluate the expected critical paths of a design in order to guide optimization. While in this work we focus mainly on hard NoCs, RAD-Sim can also readily model application designs that include soft NoCs, either as a design component or as a pre-placed and routed interconnect overlay such as [41], [42].

Each NoC adapter consists of three main stages: module interfacing, encoding/decoding, and NoC interfacing. For the slave adapter, an input arbiter selects one of the (possibly multiple) AXI-S interfaces connected to the same NoC router. Once an AXI-S transaction is buffered, it is packetized into a number of NoC flits and mapped to a specific NoC virtual channel (VC). Then, these flits are pushed into an asynchronous FIFO to be injected into the NoC depending on the router channel arbitration and switch allocation mechanisms. The master adapter works in a similar way but in reverse: flits are ejected from the NoC and, once a tail flit is received, they are depacketized into an AXI-S transaction which is then steered to its intended module interface. The adapters implemented in RAD-Sim are parameterized to allow experimentation with different arbitration mechanisms, VC mapping tables, and FIFO/buffer sizes. They also support up to three distinct clock domains, where the connected module, adapter, and NoC can all be operating at different clock frequencies. This enables experimentation with scenarios where stages of the NoC adapters are either hardened or implemented in the FPGA's soft logic.

Table 1 lists some of the user input parameters of RAD-Sim. Besides these parameters, RAD-Sim takes as an input a NoC placement file that specifies the assignment of all hard accelerator block and fabric module ports to specific NoC routers/adapters. This is currently passed as a user-specified manual assignment; in our future work, we also plan to enable automatic creation of this file such that NoC latency constraints specified by the user are met and/or overall application performance is optimized. As described in Sec. III-A, this router assignment file could be automatically created by an enhanced FPGA placement algorithm that repeatedly adjusts the routers to which modules connect (essentially placing the router interfaces) as placement proceeds and invokes RAD-Sim to quantify the effect of these adjustments on system performance.
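A minimal sketch of the packetization step in the slave adapter follows; the field widths and flit layout are assumptions for illustration (RAD-Sim's actual flit format may differ), with the VC already chosen upstream by the mapping table.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // A flit carries routing metadata plus a slice of the transaction payload.
    struct Flit {
      bool head, tail;
      uint8_t vc;            // virtual channel chosen by the mapping table
      uint16_t dest_router;  // decoded from the AXI-S destination field
      std::vector<uint8_t> payload;
    };

    // Split one buffered AXI-S transaction into head/body/tail flits.
    std::vector<Flit> Packetize(const std::vector<uint8_t>& tdata,
                                uint16_t dest_router, uint8_t vc,
                                size_t flit_payload_bytes) {
      std::vector<Flit> flits;
      for (size_t off = 0; off < tdata.size(); off += flit_payload_bytes) {
        const size_t end = std::min(off + flit_payload_bytes, tdata.size());
        Flit f;
        f.head = (off == 0);
        f.tail = (end == tdata.size());
        f.vc = vc;
        f.dest_router = dest_router;
        f.payload.assign(tdata.begin() + off, tdata.begin() + end);
        flits.push_back(f);
      }
      return flits;
    }

The resulting flits would then be pushed into the adapter's asynchronous injection FIFO, as described above.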

In addition, RAD-Sim provides users with various telemetry utilities to record specific simulation events and traces, along with different scripts to visualize the collected data. This can be very useful in reasoning about the complex interactions between the different components of a RAD and understanding the effect of changing various architecture parameters on the overall application performance. Fig. 4 shows example telemetry visualizations for a simple test traffic case. The telemetry utilities are used to record various timestamps in the transaction lifetime, such as transaction initiation at the source module, packetization, injection/ejection, depacketization, and receipt at the destination module. Fig. 4(a) shows the latency in nanoseconds and the number of NoC router hops for each of the 62 issued transactions. The graph shows how the number of hops and communication latency increase as the distance between the source and destination modules increases, then drop when moving to the next row in the 4 × 4 mesh of routers. Fig. 4(b) shows another visualization produced by RAD-Sim that breaks down the latency of each transaction into time spent in the injection adapter, the NoC, and the ejection adapter. This can highlight the overhead introduced when experimenting with different adapter implementations.
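The per-transaction breakdown in Fig. 4(b) can be derived from such timestamps; a sketch of the arithmetic, with field names of our own choosing rather than RAD-Sim's:

    #include <cstdint>
    #include <cstdio>

    // Timestamps (in ns) captured at each event in a transaction's lifetime.
    struct TransactionTrace {
      uint64_t t_initiated;  // issued by the source module
      uint64_t t_injected;   // head flit enters the NoC
      uint64_t t_ejected;    // tail flit leaves the NoC
      uint64_t t_received;   // depacketized at the destination module
    };

    void PrintBreakdown(const TransactionTrace& t) {
      std::printf("injection adapter: %llu ns, NoC: %llu ns, ejection adapter: %llu ns\n",
                  (unsigned long long)(t.t_injected - t.t_initiated),
                  (unsigned long long)(t.t_ejected - t.t_injected),
                  (unsigned long long)(t.t_received - t.t_ejected));
    }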

In this section, we present a case study to showcase the capabilities of RAD-Sim by migrating a state-of-the-art DL FPGA benchmark, the NPU, from conventional FPGAs to novel RADs. This study highlights how RAD-Sim can pinpoint performance bottlenecks and allow rapid experimentation with potential solutions, both by re-designing the application to better suit RAD architectures and by changing the parameters of the RAD architecture itself.

In this section, we present a brief overview of the NPU overlay that we use as a vehicle for our case study. The NPU is a state-of-the-art FPGA soft processor (i.e. a software-programmable processor implemented on an FPGA's programmable fabric) with an instruction set and compute pipeline specialized for the acceleration of memory-intensive DL models such as multi-layer perceptrons (MLPs), recurrent neural networks (RNNs), gated recurrent units (GRUs), and long short-term memory models (LSTMs). The NPU architecture is similar to that of the Microsoft Brainwave architecture [11] and achieves an order of magnitude higher performance on Intel's DL-targeted FPGA, the Stratix 10 NX, when compared to same-generation GPUs [43].

Fig. 5 depicts the NPU overlay architecture, which consists of several coarse-grained compute blocks chained together such that the output of one block is forwarded to the next. The key block in the NPU architecture is a massively parallel matrix-vector multiplication unit (MVU). It consists of T tiles, each of which has D sets of C dot-product engines (DPEs) with L multiplication lanes each. Each DPE is tightly coupled with a register file (RF) that stores all the model weights persistently on-chip and makes use of the tremendous on-chip bandwidth of the FPGA's BRAMs. An MVU tile computes a row block of a matrix-vector multiplication operation, and the tiles' partial results are then reduced and accumulated over multiple time steps (if needed) to produce the final MVU result. This is followed by an external vector register file (eVRF), used to skip the MVU for instructions that do not include a matrix-vector multiplication, and then two identical vector elementwise multi-function units (MFUs), followed by a loader (LD) block that writes pipeline results back to the different NPU register files. We refer interested readers to [44] and [43] for more details about the NPU architecture and front-end.

We validated our SystemC model of the NPU against its RTL implementation; it achieves an average error of only 5.1% and a maximum error of 10.8% compared to cycle-accurate RTL simulation. These small differences result from minor discrepancies between our SystemC and RTL implementations of the NPU architecture that can be tuned to further reduce this gap. However, the SystemC simulations are 26× faster than the RTL simulations on average, with speedups ranging from 6.5× to 100× depending on the workload size. The speed of SystemC models contributes to the larger architecture space we can explore in RAD-Sim, and the close agreement in performance results between the SystemC and RTL simulations means we can trust the RAD-Sim results to have high fidelity for this case study.
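As a first-order illustration of how these parameters set MVU throughput, the sketch below estimates ideal cycle counts for an M × N matrix-vector multiply. The scheduling assumptions (rows spread across DPEs, L elements consumed per DPE per cycle) and the example T, D, C, L values are ours for illustration, not the NPU's published configuration.

    #include <cstdio>

    static unsigned long long CeilDiv(unsigned long long a, unsigned long long b) {
      return (a + b - 1) / b;
    }

    // Ideal cycles for an M x N matrix-vector multiply: the M rows are spread
    // across the T*D*C dot-product engines, and each engine consumes L elements
    // of its assigned row per cycle.
    unsigned long long MvuCycles(unsigned long long M, unsigned long long N,
                                 unsigned long long T, unsigned long long D,
                                 unsigned long long C, unsigned long long L) {
      unsigned long long row_passes = CeilDiv(M, T * D * C);
      unsigned long long cycles_per_pass = CeilDiv(N, L);
      return row_passes * cycles_per_pass;
    }

    int main() {
      // Hypothetical configuration: T=7, D=4, C=10, L=40.
      std::printf("%llu cycles\n", MvuCycles(1536, 1536, 7, 4, 10, 40));
      return 0;
    }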

To map the NPU to a RAD instance incorporating a NoC, all communication channels between NPU blocks have to be latency-insensitive (LI). All the feedforward communication between the five chained NPU blocks already uses elastic FIFO interfaces. However, there are two main latency-sensitive channels that need to be modified (highlighted in red in Fig. 5). The first is the connection from the LD block to all the different RFs, which is used for writing back the pipeline results and issuing instruction tag updates for data hazard resolution. The second is the inter-tile reduction connection between all T tiles and the accumulator within the MVU. Since the MVU alone constitutes 52%, 77%, and … of the NPU's resources, … Finally, adding the AXI-S wrappers causes less than 3% performance degradation when they are set to the full widths of the NPU block interfaces.
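For intuition, here is a minimal SystemC sketch of a latency-insensitive register stage of the kind such AXI-S wrappers introduce; the signal names follow AXI-Stream conventions, but this is our illustration rather than the actual wrapper RTL.

    #include <systemc.h>

    // One-deep elastic buffer: accepts a beat only when empty, presents it
    // until the consumer asserts ready. Chaining such stages gives a
    // latency-insensitive channel at the cost of extra cycles, consistent
    // with the small overhead reported above.
    SC_MODULE(AxisRegSlice) {
      sc_in<bool> clk;
      sc_in<sc_bv<512>> s_tdata;  sc_in<bool> s_tvalid;  sc_out<bool> s_tready;
      sc_out<sc_bv<512>> m_tdata; sc_out<bool> m_tvalid; sc_in<bool> m_tready;

      sc_bv<512> buf;
      bool full;

      void Tick() {
        bool could_accept = !full;  // matches the s_tready the producer saw
        if (full && m_tready.read()) full = false;  // beat consumed downstream
        if (could_accept && s_tvalid.read()) {      // capture a new beat
          buf = s_tdata.read();
          full = true;
        }
        m_tdata.write(buf);
        m_tvalid.write(full);
        s_tready.write(!full);
      }

      SC_CTOR(AxisRegSlice) : full(false) {
        SC_METHOD(Tick);
        sensitive << clk.pos();
      }
    };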

Now that the NPU is fully LI, we can map it to RAD architectures where communication and computation are decoupled by a NoC. As a start, we map the LI AXI-S-wrapped NPU modules to a simple RAD with only an FPGA fabric and an ideal (unrealistic) NoC. This ideal NoC implements point-to-point connections between the NPU modules without any additional arbitration or latency due to traversing multiple NoC links, and has no bandwidth contention between different traffic streams traversing the NoC at the same time. We experiment with NoC routers with 1024-bit and 512-bit interfaces. Although this limits the inter-module communication bandwidth between the NPU modules compared to the baseline design, these router interface widths are not unrealistic; 512-bit interfaces are a common design choice for NoC adapters in prior academic research [13] and in the Xilinx Versal NoC architecture [45]. Even in the case of an idealized NoC, however, this significantly throttles the NPU performance to only 23% and 13% of the original performance for interface widths of 1024 and 512 bits, respectively. This experiment highlights that migrating application designs as-is from FPGAs to novel RADs with embedded NoCs can lead to very poor performance; instead, migration requires careful consideration of inter-module communication bandwidth.

Fig. 8(a) represents the NPU as a graph of modules whose edges are annotated with the required communication widths. Therefore, the boundaries between NPU modules need to be re-structured in a way that limits the widths of the edges in Fig. 8(a). We refer to this as a bandwidth-driven design approach; it follows three principles: …

The graph representation in Fig. 8(b) shows how we re-structure the NPU architecture following these principles. … Each MVU slice sends its results to its corresponding vector elementwise (EW) slice modules independently, as shown in Fig. 8(b). The data width of the MVU write-back channel is set to match the int8 numerical precision of the MVU, since it is now an independent channel and does not talk to the other blocks (eVRF and MFUs) using int32 precision. This limits the width of this channel to 640 bits at no additional cost. Finally, to parallelize MVU write-backs and tag updates for the NPU's data hazard resolution, we also add message-passing channels from one MVU slice to the next. By doing this, the LD can send only one write-back or tag update message to the first MVU slice; this message is then passed between slices, and the LD can start sending messages to other NPU blocks in parallel.
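The width-throttling effect observed in the ideal-NoC experiment, and the motivation for limiting edge widths, follow from simple arithmetic: a channel's peak bandwidth is its width times its clock frequency. A small sketch using the 300 MHz fabric clock from this case study (treating width and clock as the only constraints is our simplification):

    #include <cstdio>

    // Peak channel bandwidth in Gb/s for a given width and clock.
    double GbPerSec(double bits, double mhz) { return bits * mhz / 1000.0; }

    int main() {
      const double fabric_mhz = 300.0;  // fabric clock used in the case study
      std::printf("512-bit  @ 300 MHz: %6.1f Gb/s\n", GbPerSec(512, fabric_mhz));
      std::printf("1024-bit @ 300 MHz: %6.1f Gb/s\n", GbPerSec(1024, fabric_mhz));
      // The original NPU connected blocks with direct fabric wiring thousands
      // of bits wide, so even an ideal NoC at these widths cuts the available
      // inter-module bandwidth by a large factor.
      return 0;
    }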

The results in Fig. 7 show that, with the same amount of compute resources and an AXI-S interface data width restricted to 512 bits, the bandwidth-driven re-structuring of the NPU gains back most of the performance lost to bandwidth limitations. On some workloads, it even exceeds the performance of the original NPU (which used very wide, latency-sensitive communication) due to the added parallelism in tag updates through the MVU slice-to-slice message-passing channels. The total cost of re-designing the NPU to be fully LI and have bandwidth-friendly 512-bit AXI-S interfaces is an average 23% degradation in performance compared to the original latency-sensitive NPU in [43]. With a slight increase in AXI-S interface width to 640 bits, performance increases by 10% on average due to matching the full width of the LD write-back and MVU slice-to-slice communication channels. This brings the fully LI NPU to within 87% of the original NPU performance on average, as shown in Fig. 8(b).

After restructuring the NPU to be more modular, LI, and bandwidth-friendly as described above, we experiment with mapping it to a realistic NoC. We again assume a RAD instance with a conventional FPGA fabric (similar to a Stratix 10 NX) and no accelerator blocks, but this time we use a realistic 9 × 9 mesh NoC. For this experiment, we use 512-bit AXI interfaces for all the NPU modules. We assume the restructured NPU modules run at 300 MHz, similar to …

… This data highlights an opportunity: interleaving the execution of multiple instruction streams (i.e. threads) in the MVU slice could fill these idle gaps. Fig. 9(b) illustrates the graph … and four interleaved thread executions, respectively. Fig. 10 shows that interleaving two and four threads can increase the overall performance by 38% and 57% on average (and up to 77% and 135%), respectively, compared to a single-thread implementation. However, adding support for each additional thread utilizes 17%, 23%, and 20% more ALMs, BRAMs, and tensor blocks (TBs), respectively. Therefore, it is not feasible to implement more than one thread on the (already full) Stratix 10 NX 2100 used by the baseline NPU. Nevertheless, it is feasible to implement more threads when exploring RADs with bigger/multiple FPGA fabrics or hard accelerator blocks that free up fabric resources, as we discuss in the next section.
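A toy model of this interleaving idea is sketched below: two instruction streams share one issue slot, and whenever one stalls (e.g. waiting on a NoC round trip), the other can issue. The round-robin policy and the stall pattern are assumptions for illustration only, not the NPU's actual scheduler.

    #include <cstdio>
    #include <vector>

    // Each stream has work left and may be stalled waiting on the NoC.
    struct Thread {
      int remaining_instrs;
      int stall_until_cycle;
    };

    int main() {
      std::vector<Thread> threads = {{100, 0}, {100, 0}};
      int cycle = 0, issued = 0;
      size_t next = 0;  // round-robin pointer
      while (issued < 200) {
        bool progressed = false;
        for (size_t k = 0; k < threads.size(); k++) {
          Thread& t = threads[(next + k) % threads.size()];
          if (t.remaining_instrs > 0 && t.stall_until_cycle <= cycle) {
            t.remaining_instrs--;
            issued++;
            // Assume every 10th instruction waits 5 cycles on a NoC round trip.
            if (t.remaining_instrs % 10 == 0) t.stall_until_cycle = cycle + 5;
            next = (next + k + 1) % threads.size();
            progressed = true;
            break;
          }
        }
        if (!progressed) {
          // An idle cycle that a single-threaded MVU slice could not fill.
        }
        cycle++;
      }
      std::printf("200 instructions issued in %d cycles\n", cycle);
      return 0;
    }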

In the previous section, we showed that RAD-Sim can highlight performance bottlenecks and help architects experiment with application re-design ideas (e.g. bandwidth-driven restructuring and multi-threading for our NPU example) to alleviate these bottlenecks. In this section, we illustrate how RAD-Sim can capture a variety of RAD architectures by mapping the NPU to three example RAD instances, ranging from a multi-die FPGA using passive interposers, to a monolithic FPGA with a side accelerator complex, to a device using active 3D die stacking. Additionally, we show how RAD-Sim can be used to fine-tune specific architecture parameters and quantify their effect on end-to-end performance. The intention of the experiments presented in this section is by no means to perform a detailed architecture study to find the best RAD architecture for a specific application; that is ongoing work combining the use of both the RAD-Sim and RAD-Gen components of our flow. Instead, we aim to illustrate that RAD-Sim can capture a wide variety of RAD styles and also guide the fine-tuning of low-level architecture parameters of these devices.

Use of LI bandwidth-driven design (as illustrated for the NPU in the previous section) and a system-level NoC completely decouples the application's compute from its inter-module communication. This raises the interconnect abstraction level and enables the exploration of complex RADs that span multiple dice and incorporate hard accelerator blocks. In this case, the conventional FPGA CAD tools do not need to optimize the timing and routability of signals crossing the boundaries between dice through interposers, or of signals trying to reach the programmable routing interfaces of a hard accelerator block. If each application module meets timing separately and can be connected to a NoC adapter, the evaluation of end-to-end application performance on a given RAD instance is raised to the cycle-level simulation of soft/hard modules and NoC latency; this is exactly what is captured by RAD-Sim.

We estimate performance by using RAD-Sim to map the NPU to the three example RADs. We set an FPGA fabric operating frequency of 300 MHz (matching the NPU operating frequency in [43]) and conservatively assume that the hard accelerator blocks run at 600 MHz. We scale the operating frequency of the 28nm NoC routers from [48] to 1.5 GHz in a 14nm process technology, and we assume that the NoC adapters operate at 4× the fabric speed, similarly to [48]. The RTL implementation of the NoC router used in [48] is heavily parameterizable and compatible with BookSim parameters (it was developed by the same developers as BookSim). More details about this RTL implementation and its source code can be found in [49]. In all experiments, we use a mesh NoC topology (dimensions specified in Table 2 for each case) with 166-bit links, 3 VCs, an input-queued router architecture, and dimension-order packet routing. The depths of the NoC adapters' injection/ejection FIFOs and output buffers (see Fig. 3) are set to 16 and 2, respectively. We also manually assign NPU module AXI-S ports to specific routers in a reasonable (but possibly sub-optimal) placement.
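For reference, the NoC-related settings above can be summarized in one place as follows; the struct layout is ours, while the values restate the experimental setup.

    // All values restate the setup described above; the struct itself is ours.
    struct NocConfig {
      int mesh_dim_x = 0;                   // per RAD instance, from Table 2
      int mesh_dim_y = 0;
      int link_width_bits = 166;
      int num_vcs = 3;
      const char* router_arch = "input-queued";
      const char* routing = "dimension-order";
      int adapter_fifo_depth = 16;          // injection/ejection FIFOs
      int adapter_output_buffer_depth = 2;
      double router_clock_ghz = 1.5;        // 28nm design scaled to 14nm
      double adapter_clock_mult = 4.0;      // relative to the 300 MHz fabric
    };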

To determine FPGA resource utilization, we synthesize, place, and route the parts of the NPU to be implemented on the FPGA fabrics using Intel Quartus Prime Pro 21.2 on a Stratix 10 NX 2100 device. We use reserved logic lock regions at the appropriate locations for NoC routers/adapters, mark them as empty design partitions, and connect the NPU modules to them based on our manual module assignment to different routers. We conservatively size each logic lock region as a grid of 10 × 10 logic array blocks (LABs) … We also verify that the matrix-vector multiplication units …

Fig. 12(b) shows the relative performance comparison between the baseline latency-sensitive NPU on Stratix 10 NX from [43] and the re-designed NPU mapped to the three RAD instances we use in our study. Although RAD1 uses two FPGA fabrics, it does not benefit from any increase in MVU compute resources compared to the baseline NPU; it only uses the resources of the second FPGA to add support for 4 interleaved thread executions. With the overhead of LI re-design and higher-latency NoC communication, RAD1 achieves only 12% better performance on average compared to the baseline NPU. In comparison, the single-die RAD2 achieves 1.2× (1.32×) the performance of RAD1 (the baseline NPU) by exploiting the hardened MVU slices in the side coarse-grained accelerator blocks and interleaving four thread executions. Finally, the base die of RAD3 can implement the MVU slices of 4 NPU instances, freeing up the FPGA resources to implement the rest of their vector EW, LD, and instruction dispatch units. This results in a significant 2.6× increase in average performance compared to the baseline NPU on a same-form-factor FPGA without 3D stacking. In addition, since the hard matrix-vector multiplication units in RAD2 and RAD3 are designed to have bigger RFs, they can both run a new set of bigger workloads, shown in Fig. 12(c), that cannot fit in the on-chip memory of the baseline NPU. These results show that RAD3 can achieve up to 145 TOPS on the LSTM-1536 workload.

Besides its ability to model a variety of RAD architectures, RAD-Sim also enables us to study the effect of different architecture parameters on the performance of application designs. As an example, Fig. 13 shows the impact of changing the VC buffer size in the NoC routers of RAD2 for some of the NPU workloads (other workloads show the exact same trend but are omitted for brevity). Increasing the VC buffer size increases the silicon area footprint of the NoC routers, but provides more distributed storage for the packets traversing the NoC, which can help avoid frequent NoC back-pressure and decrease the overall communication latency. The results show that, for the NPU traffic patterns over the NoC, VC buffer depths of less than 8 flits can throttle performance, while increasing them beyond 8 flits yields little or no additional performance benefit.

… we experiment with;
it shows that runtime varies mainly depending on the number of simulation cycles for the different …

Early deployments of FPGAs in datacenters were mainly motivated by their faster time-to-solution compared to custom ASICs and their diverse high-bandwidth I/O interfaces that allow them to accelerate key datacenter functionalities on-the-fly at the data crossroads between different server endpoints. Building on these strengths, we have started to witness the emergence of novel RADs that combine the hardware flexibility of FPGAs, the high performance of domain-specialized accelerators, and the efficiency of packet-switched NoCs for system-level communication. In addition, advances in 3D chip fabrication and integration technologies are unlocking a whole new design space of multi-die RADs. However, RAD architects lack the tools to rapidly explore this huge design space and evaluate the effect of their design choices on end-to-end application performance. To this end, we developed RAD-Sim, an application-driven architecture simulator for modeling and evaluating candidate RAD architectures. It also allows early co-optimization of key application designs migrated from conventional FPGAs and the architecture parameters of a proposed RAD. We showcase RAD-Sim through a case study that maps the state-of-the-art NPU DL inference overlay on different example RAD instances. RAD-Sim's telemetry and visualization features pinpoint bottlenecks in the NPU on RADs with embedded NoCs, which we address with a new bandwidth-driven design approach and by adding multi-threading to increase tolerance of NoC latency. Our study also demonstrates that 3D-stacked RADs can increase average performance by 2.6× compared to current FPGAs and achieve up to 145 TOPS on key DL workloads. We open-source both RAD-Sim and the NPU example design for the broader research community to leverage in driving further innovations in RAD architecture.
