End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators

Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data science evolves at a quick pace, and manual design of custom accelerators has high non-recurring engineering costs: general solutions are needed to automatically and rapidly transition from the formulation of a new algorithm to the deployment of a dedicated hardware implementation. Our solution is the SOftware Defined Architectures (SODA) Synthesizer, an end-to-end, multi-level, modular, extensible compiler toolchain providing a direct path from machine learning tools to hardware. The SODA Synthesizer frontend is based on the multilevel intermediate representation (MLIR) framework; it ingests pre-trained machine learning models, identifies kernels suited for acceleration, performs high-level optimizations, and prepares them for hardware synthesis. In the backend, SODA leverages state-of-the-art high-level synthesis techniques to generate highly efficient accelerators, targeting both field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). In this paper, we describe how the SODA Synthesizer can also assemble the generated accelerators (based on the finite state machine with datapath model) into a custom system driven by a distributed controller, building a coarse-grained dataflow architecture that does not require a host processor to orchestrate the parallel execution of multiple accelerators. We show the effectiveness of our approach by automatically generating ASIC accelerators for layers of popular deep neural networks (DNNs). Our high-level optimizations result in up to 74x speedup on isolated accelerators for individual DNN layers, and our dynamically scheduled architecture yields an additional 3x performance improvement when combining accelerators to handle streaming inputs.


INTRODUCTION
Next-generation edge systems will operate under conditions where exporting all the acquired data for centralized processing is inconvenient or impossible [1]. Monitoring infrastructure for highly dynamic systems (e.g., sensor networks) will need to operate in low-power settings with limited bandwidth available for communication [2]. Autonomous vehicles will need to make critical decisions in real time in a distributed setting. Experimental instruments such as the ones owned by the US Department of Energy (e.g., particle accelerators, mass spectrometers, and electron microscopes) already generate volumes of data that are impossible to store or transfer without pre-processing [3]. Such extreme conditions require highly specialized processing systems to support autonomous learning and artificial intelligence, optimized along a variety of metrics that include energy, performance, latency, size, and more. Designing and implementing domain-specific systems is challenging and expensive due to the extreme diversity and fast-paced growth of applications and algorithms, especially in the field of machine learning (ML). There is no "one-size-fits-all" solution, and developing specialized accelerators requires significant efforts by large teams of expert hardware designers.
To address these problems, we have developed the SOftware Defined Architectures (SODA) Synthesizer [4], [5], [6]: an open-source, multi-level, modular, extensible, no-human-in-the-loop hardware compiler that translates high-level ML models into domain-specific accelerators. Our tool generates highly specialized designs in a hardware description language (HDL), which can be synthesized with both commercial and open-source tools on field-programmable gate arrays (FPGAs) or as application-specific integrated circuits (ASICs). The SODA Synthesizer comprises a compiler-based frontend that leverages the Multi-Level Intermediate Representation (MLIR) framework, and a compiler-based backend that integrates state-of-the-art high-level synthesis (HLS) methodologies. The SODA Synthesizer allows for the exploration of design metrics through compilation passes and parameters, and enables the identification of optimal trade-offs depending on the target application requirements.
HLS tools typically generate highly specialized, power-efficient hardware designs using the finite state machine with datapath (FSMD) paradigm, which is particularly suited for extracting instruction-level parallelism. However, the FSM controller has a notable limitation: it is not scalable enough to deal with multiple, parallel execution flows (e.g., in the presence of coarse-grained parallelism). In these conditions, which are common for compute- and memory-intensive ML algorithms (e.g., deep neural networks), the complexity of a centralized, statically scheduled FSM controller grows exponentially, leading to significant area and performance overheads [7]. A system-on-chip (SoC) can use a central general-purpose microcontroller to drive multiple accelerators implementing different layers of an ML model; however, in such a system the data movement between the host microcontroller, the accelerators, and the memory quickly becomes a performance bottleneck. In this work, we have extended the SODA Synthesizer to enable the automatic generation of a second type of system: a dynamically scheduled architecture where custom ML accelerators (based on the FSMD model) are composed in a dataflow system and are driven by a distributed controller. In this architecture, multiple accelerators can perform computations in parallel on different portions of streaming input data, without requiring orchestration from the host microcontroller, and can communicate with each other without going through external memory.
In a previous work [8], we implemented a solution to synthesize parallel C code, annotated with OpenMP-like directives, into a similar dataflow architecture, with support for spatial parallelism, resource reuse, and memory access parallelism. That approach could identify certain degrees of parallelism by analyzing program dependencies, but it was constrained by conservative alias analysis: user-provided annotations were needed to simplify the dependency analysis and to expose dynamic parallelism. ML frameworks, instead, naturally represent models as computational graphs describing how the data flows across operators, and MLIR directly interfaces with ML frameworks, offering promising opportunities for domain-specific optimizations. By leveraging the MLIR framework, the SODA Synthesizer can take advantage of such optimization opportunities. MLIR representations capture hierarchy and parallelism of computational graphs, facilitating generation and mapping to dataflow architectures. Knowing precisely how the data flows across operators and memory regions removes the need for complex alias analysis.
In summary, the contributions of this paper are: an automated, modular, multi-level, compiler-based design flow from high-level ML frameworks to optimized FPGA or ASIC accelerators implemented following the FSMD model; a search and outlining methodology to automatically extract accelerators and their dependencies from an MLIR input specification; a system integration methodology to assemble FSMD accelerators into a coarse-grained, dynamically scheduled dataflow architecture with distributed control; and a comparison between a standard SoC design with a centralized microcontroller and a custom system built with our distributed controller methodology. The rest of the paper is structured as follows: we summarize related work in Section 2; the SODA Synthesizer is introduced in Section 3 and detailed in Sections 4 and 5; we show experimental results in Section 6 and draw conclusions in Section 7.

RELATED WORK
A large number of designs (including several based on the dataflow model) have been proposed as specialized ML accelerators, and existing HLS tools have been extended with dataflow concepts before. In this section, we summarize relevant previous works and highlight the differences between existing approaches and our work.

Hardware Acceleration for Machine Learning
Available commercial solutions offer acceleration of ML algorithms through specialized functional units (e.g., the Tensor Cores in NVIDIA GPUs) or entire chips based on tensor processing (e.g., Google TPU [9]). Some of them, including SambaNova, GraphCore, and Cerebras, exploit the dataflow paradigm, with varying degrees of generality in their processing elements. Research and industry also proposed many FPGA-based accelerators for ML inference [10], [11], often supporting specialized numeric formats to reduce resource utilization and increase efficiency.
One challenge is to design an accelerator that can support multiple classes of algorithms, rather than focusing solely on deep neural networks (DNNs) [12]. Efforts in this direction include the PuDianNao [13] and SpiNNaker [14] architectures, or the Tabla [15] and Eyeriss [16] frameworks for the generation of accelerators. DNNBuilder [17] and the related design space exploration flow DNNExplorer [18] employ configurable and composable layer-wise accelerators to implement several types of DNNs. SIGMA [19] exploits reconfigurable interconnect with specialized matrix multiplication units. Other designs with reconfigurable interconnect also support dataflow models [20], [21]. Rather than proposing yet another accelerator design, we provide a methodology to design and implement new FPGA/ASIC accelerators starting from a high-level description of the input ML algorithm. By leveraging high-level and low-level (HLS) compiler-based tools, SODA provides a more general solution: in fact, it can generate hardware designs for virtually any computational pattern, as long as a lowering to MLIR is available.
One common approach to reduce the design effort of processing elements at the register-transfer level is to compile and map a high-level description of the input algorithm onto parametrized hardware modules and architecture templates. VeriGOOD-ML [22] uses the PolyMath compiler [23] to map ML models in the ONNX format to three different architecture templates designed for different types of neural networks. GEMMINI [24] offloads operations from specific layers of ONNX models to a systolic array connected to a RISC-V core, after building the systolic array itself starting from a parametrized generator in Chisel. TVM's VTA architecture [25] is a configurable FPGA co-processor for matrix multiplication; the TVM high-level framework then compiles each ML model into instructions for VTA. All these solutions can generate specialized accelerators, but they can only support ML layers and operators that have a direct mapping to the available hardware templates. Alternative approaches translate high-level abstractions into a form that can be ingested by commercial HLS tools (typically, C/C++ code with tool-specific optimization directives). PyLog [26] defines a high-level compilation infrastructure to transform Python programs into annotated C/C++ code, which is subsequently fed to Xilinx Vivado HLS. HeteroCL [27] provides a Python-based domain-specific language to partition an algorithm between a general-purpose processor and an FPGA, and to insert hardware-specific information in the code, which is then compiled into annotated C/C++ for different backend HLS tools. The hls4ml [28] framework translates input models by selecting operators from a library of C/C++ templates optimized for Vivado HLS. ScaleHLS [29] aims at facilitating and optimizing HLS through high-level transformations implemented in MLIR, exploiting different levels of abstraction and finally generating annotated C code for Vivado HLS. These tools provide a bridge between high-level programming frameworks and hardware generation, but they have limited flexibility: they only support specific high-level frameworks and backend HLS tools, and they generate code at a different (higher) level of abstraction after applying hardware-related optimizations, potentially losing a considerable amount of semantic information in the process. With SODA, instead, we bring together MLIR and HLS to build an integrated open-source toolchain, optimizing input ML models at the appropriate levels of abstraction, without the need to generate intermediate C/C++, and offering a wide choice of FPGA and ASIC targets in the backend.

Generation of Dataflow Accelerators
Our methodology generates a custom architecture that dynamically invokes, in a dataflow fashion, highly optimized, statically scheduled accelerators based on the FSMD model. This requires a distributed controller that activates each module as soon as its inputs are available. Previous HLS research tried to decompose and distribute the classical FSM controller to reduce its complexity, restructuring it in a hierarchical way [30], but this is not very efficient in managing concurrent execution of independent units.
The Bluespec compiler [31] implements an event-driven execution paradigm based on rules and atomic transactions that is similar to our approach. Our approach performs the synthesis of an event-driven dataflow architecture starting from outlined kernels in an MLIR description, i.e., from a high-level abstract representation of the application code. The Bluespec compiler instead uses as input BSV, a language that, while higher-level than Verilog and VHDL, is closer to a behavioral HDL than to a software description. Additionally, our kernels are FSMD accelerators, rather than functional units. Dynamatic [32] proposes an HLS methodology that generates dynamically scheduled designs using the dataflow paradigm, mainly focusing on supporting dataflow at the instruction level, rather than at the task/function level. It does not support resource sharing, and it abstracts memory by decoupling it from the accelerator through a single load/store queue, thus not taking advantage of memory-level parallelism. Dynamic scheduling leads to simpler designs when exploiting parallelism across basic block boundaries; however, FSMDs provide very high quality of results (both in performance and area) when the focus is optimizing for instruction-level parallelism inside a function or a basic block. Dynamatic has been extended to couple dynamic with static scheduling [33], supporting resource reuse and a simple memory abstraction, but it does not consider coarser-grained parallelism in the input specifications, and it only works on C inputs.
Several other research projects propose domain-specific languages and frameworks to generate accelerators that can exploit coarse-grained parallelism by combining dataflow concepts with FSMDs, as we do in our methodology. For example, Spatial [34] allows marking both dataflow modules (akin to our distributed controller) and FSM modules at different levels of the code hierarchy. Xilinx Vivado/Vitis HLS tools support dataflow pipelining mechanisms across functions or loops by annotating the input C specification with a custom pragma [35]; while this solution allows overlapping execution of functions and loops in a pipelined fashion, it only works when a similar initiation interval can be found for all the functions/loops. All these approaches still require users to describe, to some extent, the behavior and desired features of a circuit in their code, while the SODA Synthesizer provides a completely automated path from high-level frameworks to hardware, with no additional information required.

THE SODA SYNTHESIZER

Fig. 1 provides an overview of the SODA Synthesizer [4], [5], [6]; components extended to support the generation of dataflow architectures are highlighted in blue. The tool is composed of two main parts: a compiler-based frontend and a compiler-based hardware synthesis engine. The frontend is based on MLIR [36], a framework that allows building reusable compiler infrastructure, inspired by (and contributed to) the LLVM project. The SODA Synthesizer frontend interfaces with high-level programming frameworks, partitions the input applications by identifying key computational kernels for hardware acceleration, and performs high-level optimizations that improve the subsequent generation of custom accelerators and systems. The frontend then generates an LLVM IR as output, which is the starting point for hardware generation. The SODA backend integrates Bambu [37], a state-of-the-art open-source HLS tool, to generate the hardware accelerators. To compile code that will be executed on a host processor, instead, SODA uses standard LLVM tools. The frontend compiler and HLS backend are available at https://gitlab.pnnl.gov/sodalite/soda-opt and https://panda.dei.polimi.it, respectively.

Frontend Compiler
SODA-OPT (Fig. 2) is the high-level compilation frontend of the SODA Synthesizer. SODA-OPT performs search, outlining, optimization, and dispatching passes on the input program, preparing it for hardware synthesis targeting FPGAs or ASICs. To implement all these functionalities, SODA-OPT leverages and extends the MLIR framework. MLIR allows developers to define dialects, i.e., self-contained IRs that respect MLIR's meta-IR syntax. Dialects model code at different levels of abstraction, creating specialized representations that facilitate the implementation of new compiler optimizations. For example, dialects that are maintained in-tree with the MLIR framework include abstractions for linear algebra, polyhedral analysis, structured control flow, and others. We will refer to these as built-in dialects in the rest of the paper.
Built-in dialects are the entry points to the SODA Synthesizer frontend. SODA-OPT introduces new constructs specific to hardware generation, but it also exploits existing dialects and optimizations: this enables high-level programming frameworks to leverage our toolchain simply by providing a translation to built-in MLIR dialects. Several frameworks have already implemented their own MLIR dialects, optimization passes, and lowering methods, including TensorFlow, ONNX-MLIR, and TORCH-MLIR. One entry point to the SODA Synthesizer is through the TensorFlow tf-mlir-translate and tf-opt tools, which compile ML models defined and trained in TensorFlow into an MLIR representation.
SODA-OPT implements analysis and transformation passes that parse MLIR inputs lowered from high-level frameworks, identify key operation groups, and outline them into separate MLIR modules. The operations selected for hardware acceleration undergo an optimization pipeline with progressive lowerings through different MLIR dialects (linalg → affine → scf → std → llvm), and they are finally translated into an LLVM IR purposely restructured for hardware generation. SODA-OPT can lower the remaining operations in two different ways, depending on the desired target: they can represent the orchestrating code executed by a host microcontroller, or the relationship between the accelerators in our dataflow architecture. In the first case, SODA-OPT produces another LLVM IR file including runtime calls to control the generated accelerators. In the second case, operations are transformed into a function-based representation (task graph) that allows Bambu to generate the required distributed controller logic and memory interface; accelerators and controller modules will then be assembled together to form the dataflow architecture.

High-Level Synthesis Backend
The SODA Synthesizer backend, Bambu, leverages state-of-the-art HLS techniques to synthesize the LLVM IR produced by the SODA-OPT frontend into an accelerator design. Bambu includes frontends based on standard open-source compilers (GCC or Clang), supporting C, C++, and LLVM IR inputs, among others. It builds an internal IR, performs HLS steps such as resource allocation, scheduling, and binding, and finally generates the designs in a hardware description language (Verilog or VHDL).
Bambu synthesizes Register-Transfer Level (RTL) designs in Verilog following the finite state machine with datapath (FSMD) model, and we have extended it with novel methodologies that enhance modularity and generate dynamically scheduled accelerators. We enabled the reuse across an entire design of synthesized modules representing functions within a larger specification [38], providing opportunities for modular and hierarchical designs. We further extended Bambu to allow the integration of FSMD modules as processing elements in a coarse-grained dataflow design [8], and in multithreaded parallel accelerators [39]. We initially developed these synthesis methodologies by integrating support for parallel C specifications annotated with a set of OpenMP directives: users identify parallel sections in the input code through annotations, allowing Bambu to generate custom accelerator modules, and to combine them in a top-level, dynamically scheduled architecture. MLIR descriptions are naturally parallel and hierarchical, and the MLIR framework facilitates the implementation of the required analyses and transformations. Hence, a multi-level, extensible compiler approach as the one we implemented in SODA-OPT provides opportunities to significantly improve system-level design: identifying kernels that need to be accelerated, analyzing their interactions, and composing them in a system are tasks that are better solved at the MLIR level, allowing the HLS engine to focus on the generation of optimized accelerators.
The resource library provides Bambu with RTL descriptions of functional units to implement the operations present in the IR (adders, subtractors, multipliers, etc.), with different versions for different data types. It also contains the architectural templates, controller logic, and interfaces that enable the integration of synthesized modules in a top-level design. To effectively drive synthesis algorithms, Bambu relies on a characterization process for the components in the resource library in terms of performance (e.g., latency of the critical path) and area for each target technology or device.
Bambu provides several options to connect accelerators to memories: for example, it can generate one load port and one store port for a whole module (representing a function), or load and store ports for each function argument (if they do not alias). Then it instantiates and connects multi-ported scratchpads (or BRAMs for FPGAs) to such ports. By default, Bambu connects a dual-ported scratchpad memory to each pair of load/store ports, assuming a fixed latency of 1 clock cycle for reads and 2 for stores.
The SODA toolchain interfaces with both commercial and open-source logic synthesis tools. Bambu supports FPGA devices from several vendors, and we introduced the option of targeting ASICs through the OpenROAD flow, employing the OpenPDK 45 nm cell technology library. Thus, the SODA toolchain provides a completely open-source, end-to-end, compiler-based hardware generation flow from high-level programming environments to silicon. We also added support for the Synopsys Design Compiler, targeting both the OpenPDK 45 nm and the GlobalFoundries 12/14 nm technology nodes. For each new tool and technology, we ran the Bambu characterization process, collecting all area and performance metrics needed to update the resource library and the models estimating interconnection costs.
Finally, the SODA toolchain also provides verification features to ensure that the generated designs are functionally correct. Bambu includes a suite of tools that enable automatic testbench generation and validation of results, supporting external open-source and commercial simulators. One of these is the open-source tool Verilator [40], which generates optimized models for the accelerators that it simulates, and drives them through C++ or SystemC top modules. The SODA frontend feeds simulation inputs to Bambu; Bambu, in turn, generates testbenches, scripts, and glue code to drive the execution of Verilator, and automatically verifies that the output values of the simulated accelerators correspond to the results from the execution of the original application with the same inputs.

KERNEL SELECTION AND OPTIMIZATION
As mentioned at the beginning of Section 3, the SODA-OPT frontend performs search, outlining, optimization, and dispatching passes to select relevant kernels from the input model, and prepare them for hardware generation. We exploit existing and custom MLIR dialects, leveraging the possibility of working at different levels of abstraction in different stages of the compilation process. For example, high-level built-in dialects such as linalg and tosa maintain semantics from the input specification (e.g., ML operators) that simplify the identification of kernels, while lower-level abstractions such as affine and scf (also built-in) provide opportunities for code optimizations. We introduced the soda dialect to partition input ML models into kernels that will be translated into hardware accelerators, and logic that controls their execution. Table 1 describes the soda dialect operations; in the following we will detail the search and outlining processes that use them, and the subsequent optimization and dispatching phases.

Search Phase
SODA-OPT automatically identifies operations that are well suited for acceleration by matching key patterns at the earliest stages of the compilation process (search phase). Searched patterns are mostly linear algebra operations or affine structures wrapping arithmetic operations, selected among the most common computational kernels in ML applications. Users can easily extend SODA-OPT by adding new patterns of interest beyond ML, as could happen when the input is a scientific computing application lowered from a domain-specific framework. Search passes wrap a soda.launch operation around the operations to be outlined, and inject a soda.terminator operation at its end. Looking at Fig. 3a, representing a small portion of a CNN, a user might decide to separately accelerate each node in the computational graph (one reshape operation, one convolution, one bias add, and one ReLU activation function). When the model is lowered to the MLIR linalg dialect, each of them is represented by a linalg.generic construct (Fig. 3b), which SODA-OPT can mark with launch and terminator operations. When targeting the dataflow architecture, SODA-OPT individually marks for outlining all operations in the MLIR file, so that each of them will be synthesized as a dataflow stage.
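As an illustration of the search logic, the following minimal Python sketch mimics how operations could be marked for outlining; the Op class, the pattern set, and the marking flag are hypothetical stand-ins for the actual MLIR data structures and SODA-OPT passes, not their real implementation.

    from dataclasses import dataclass

    # Illustrative sketch of the search phase (not the actual SODA-OPT code):
    # walk the operations of a lowered MLIR function and mark those matching
    # known compute patterns so they can later be outlined. Setting op.marked
    # stands for wrapping the operation in soda.launch / soda.terminator.
    @dataclass
    class Op:
        name: str              # e.g. "linalg.generic"
        marked: bool = False

    SEARCH_PATTERNS = {"linalg.generic", "linalg.matmul", "linalg.conv_2d", "affine.for"}

    def mark_kernels(func_ops, dataflow_target=False):
        """Mark the operations selected for acceleration; return the marked subset."""
        marked = []
        for op in func_ops:
            # For the dataflow architecture every operation becomes its own stage;
            # otherwise only operations matching the search patterns are selected.
            if dataflow_target or op.name in SEARCH_PATTERNS:
                op.marked = True
                marked.append(op)
        return marked

    # Example: the four nodes of Fig. 3a, each lowered to a linalg.generic.
    ops = [Op("linalg.generic") for _ in range(4)]
    print(len(mark_kernels(ops)))  # -> 4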

Outline Phase
Then, at the beginning of the outlining phase, SODA-OPT extracts each marked region of code into a separate MLIR module, inlining any functions invoked inside it. SODA-OPT adds an attribute to the module to indicate the target architecture (centralized or dataflow), and to later select the corresponding backend compilation/synthesis flow. The outlining process proceeds by analyzing use-def chains of values inside each module to generate the interface of the top-level kernel functions, adding to their arguments also the memory buffers allocated outside the soda.launch region but referenced inside it. Constant values are instead pulled inside the kernel. The process ends with the generation of a soda.module containing a soda.func replacing each soda.launch block. Outlined kernels are finally substituted by soda.launch_func operations in the top-level code that will orchestrate their execution (Fig. 3c).

TABLE 1
Operations of the soda Dialect

soda.launch        Marks the beginning of a region of operations to be outlined.
soda.terminator    Marks the end of a region of operations to be outlined.
soda.module        Holds the list of outlined operations; it will become a unique accelerator module.
soda.func          Defines an outlined function with its interface.
soda.return        Indicates the end of an outlined function.
soda.launch_func   Calls the outlined function from the control logic partition of the code.
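The use-def-based interface derivation can be pictured with a small Python sketch; the toy representation below (operations as name/value-list pairs) is purely illustrative and simply encodes the rule stated above: externally allocated buffers become arguments, while constants are cloned into the kernel.

    # Illustrative sketch (not the actual SODA-OPT pass) of how a kernel interface
    # can be derived from use-def chains: every value used inside the outlined
    # region but defined outside of it becomes a function argument, except
    # constants, which are cloned into the kernel body.
    def derive_interface(region_ops, defined_outside, constants):
        """region_ops: list of (op_name, used_values); returns the argument list."""
        args = []
        for _, used_values in region_ops:
            for value in used_values:
                if value in constants:
                    continue                      # constants are pulled inside the kernel
                if value in defined_outside and value not in args:
                    args.append(value)            # e.g. memref buffers allocated by the caller
        return args

    # Example: a convolution reading buffers allocated outside the launch region
    # and a constant used as a loop bound.
    region = [("linalg.conv_2d", ["%input", "%weights", "%c0", "%output"])]
    print(derive_interface(region, {"%input", "%weights", "%output"}, {"%c0"}))
    # -> ['%input', '%weights', '%output']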

Optimization Phase
After outlining, each kernel is optimized separately, passing through progressive lowering steps that transform its code into LLVM IR. SODA-OPT exploits several dialect-specific optimization passes from built-in dialects, together with some custom, HLS-oriented transformations. It provides a modular optimization pipeline that restructures the kernels so that the final low-level IR is well suited for hardware synthesis. The main available optimizations are summarized in Table 2: loop unrolling increases instruction-level parallelism, loop tiling can balance computation and data movement, alias analysis adds opportunities for data-level parallelism, and other typical compiler optimizations remove unnecessary operations (scalar replacement of aggregates, dead code elimination, common sub-expression elimination). Temporary buffer allocation and alloca buffer promotion are custom SODA-OPT optimizations that reduce expensive accesses to external memory by generating registers or memories internal to the kernel that allow reusing values from input arguments. The optimized LLVM IR presents simpler dependency chains, few or no redundant instructions, and regular load-compute-store patterns: such characteristics improve the resource allocation and static scheduling of operations performed by the HLS engine, resulting in significant performance gains. The pipeline is not monolithic: developers can easily enable, disable, reuse, or modify optimizations, providing ample opportunities to customize the process for different applications and implement automated exploration strategies. In fact, optimizations significantly influence the generated hardware designs in terms of performance, area, and power, and they are all implemented as compiler passes: users can thus perform an exhaustive exploration of the design space without manual interventions on the code.
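The modularity of the pipeline can be illustrated with a short sketch in which the optimization sequence is just an ordered, configurable list of passes; the pass names below are illustrative (some mirror standard MLIR passes), and the configuration interface shown is a hypothetical simplification, not the actual SODA-OPT one.

    # Minimal sketch of the pipeline's modularity: an ordered list of passes that
    # can be toggled without touching the kernel code itself.
    DEFAULT_PIPELINE = [
        "affine-loop-tile",           # balance computation and data movement
        "affine-loop-unroll",         # expose instruction-level parallelism
        "promote-buffers-to-alloca",  # keep reused values in kernel-local storage
        "sroa",                       # scalar replacement of aggregates
        "cse",                        # common sub-expression elimination
        "dce",                        # dead code elimination
    ]

    def configure(disabled=()):
        """Drop the passes a user disabled while preserving the pipeline order."""
        return [p for p in DEFAULT_PIPELINE if p not in disabled]

    print(configure(disabled={"affine-loop-tile"}))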

Dispatch Phase
Dispatching separates the kernels from the logic that orchestrates their execution: at the end of the compilation, SODA-OPT generates a separate file for each kernel that does not  contain references to the rest of the code, and collects all orchestrating logic in another file. Bambu generates an FSMD accelerator for each of the IR files containing the kernels, later integrated in one of two possible system-level architectures. We currently target two types of architectures: a conventional system-on-chip where a microcontroller drives one or more accelerators connected through a bus (centralized architecture, Fig. 3d), and a single accelerator where kernels are connected together in a dynamically scheduled dataflow architecture (Fig. 3e). In the first case, the orchestrating logic extracted by SODA-OPT will contain function calls for the outlined kernels, which will be substituted by driver calls to the corresponding accelerators in the compiled host program. Instead, when targeting the dataflow architecture, SODA-OPT generates a task graph representing interactions between the kernels, containing information that will be used to assemble the accelerators and the distributed controller. In particular, the task graph includes the name of each kernel with the direction (input/output) of its arguments, and the sizes of exchanged data structures retrieved by leveraging the memref dialect.
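For the dataflow target, the task graph can be thought of as a small data structure listing, for each kernel, its arguments with their direction and size, plus producer/consumer edges; the sketch below uses the operators of Fig. 3 with symbolic placeholder sizes and illustrative field names (the actual SODA-OPT output format may differ).

    # Hypothetical sketch of the information carried by the task graph emitted for
    # the dataflow target: one entry per outlined kernel, with the direction of each
    # argument and the size of the exchanged buffers (derived from memref types in
    # practice), plus producer/consumer edges.
    task_graph = {
        "kernels": {
            "reshape":  [("in", "INPUT_SIZE"), ("out", "INPUT_SIZE")],
            "conv2d":   [("in", "INPUT_SIZE"), ("in", "WEIGHT_SIZE"), ("out", "FMAP_SIZE")],
            "bias_add": [("in", "FMAP_SIZE"), ("in", "BIAS_SIZE"), ("out", "FMAP_SIZE")],
            "relu":     [("in", "FMAP_SIZE"), ("out", "FMAP_SIZE")],
        },
        "edges": [("reshape", "conv2d"), ("conv2d", "bias_add"), ("bias_add", "relu")],
    }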

DATAFLOW ARCHITECTURE GENERATION
The process to generate the dynamically scheduled dataflow architecture starts by instantiating Distributed Controller (DC) components that will activate the FSMD accelerators at runtime. The DC starts the execution of each FSMD (synthesized from the kernels outlined by SODA-OPT) according to the data dependencies described in the task graph (also generated by SODA-OPT). The generation process then continues by instantiating a Hierarchical Memory Interface (HMI) that manages concurrent access to a shared memory from multiple accelerators. The designs of the DC and of the HMI derive from the ones presented in [8], but they are now integrated in the SODA Synthesizer, where the generation process can take advantage of the outlining, analysis, and transformations performed by SODA-OPT.

Distributed Controller
The DC employs dedicated hardware components to check, at runtime, when to start the execution of the FSMD accelerators, allowing concurrent execution of multiple modules even when their latency depends on the inputs, or when they simultaneously access a shared memory. In this way, the DC allows pipelined execution of kernels, which is essential to run ML inference on streaming inputs with low latency. The DC generation flow instantiates a dedicated component named Execution Manager (EM, Fig. 4) for each FSMD accelerator. The EM collects token signals and triggers the execution of FSMD accelerators once the activating conditions for an operation are verified. Specifically, EMs are composed of three parts: the Activation Manager (AM), the Operation Manager (OM), and the Status Manager (SM). The AM is responsible for collecting token signals denoting completion of producer operations and verifying whether they correspond to an activating condition. Activating conditions for each FSMD accelerator are derived from the data dependencies between operators in the task graph provided by SODA-OPT: once all necessary tokens are received, the AM notifies the OM to start execution. In case the associated module is shared among multiple operations, the OM checks for resource availability by interacting with a dedicated arbiter, the Resource Manager (RM). When execution of an FSMD accelerator starts, the SM sends the required control signals to the accelerator, and prevents the RM from accepting new requests until the operation is completed, i.e., when a completion signal (FU-done in Fig. 4) is received from the module itself. Each FSMD accelerator produces the completion signal, which notifies that the output of the operation is ready for its consumers.
The FU-done signal is received by all EMs bound to the same shared module; however, it is ignored if the SM does not indicate that the operation is running. This procedure allows each EM to discriminate between the end of their associated operation and the end of other operations mapped on the same module. Finally, the EM emits OP-done token signals to notify the end of the execution to EMs associated with consumer operations. All EM components are based on combinational logic, so they do not add delay cycles to the execution time of the FSMD modules.
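The token protocol can be summarized with a behavioral software sketch of one Execution Manager and its Resource Manager; this is a simplified simulation model under the assumptions stated above (a single shared module per RM, requests retried on the next token), not the generated RTL.

    # Behavioral sketch of an EM/RM pair: the AM waits for all producer tokens,
    # the OM asks the RM for the shared module, and the SM filters FU-done signals
    # so that only the EM whose operation is actually running reacts to them.
    class ResourceManager:
        def __init__(self):
            self.busy_with = None
        def request(self, em):
            if self.busy_with is None:          # module is free: grant it
                self.busy_with = em
                return True
            return False                        # structural conflict: retry later
        def release(self, em):
            if self.busy_with is em:
                self.busy_with = None

    class ExecutionManager:
        def __init__(self, producers, resource_manager):
            self.pending = set(producers)       # AM state: tokens still missing
            self.rm = resource_manager
            self.running = False                # SM state
            self.done = False
        def on_token(self, producer):           # OP-done token from a producer EM
            self.pending.discard(producer)
            if not self.pending and not self.running and self.rm.request(self):
                self.running = True             # SM asserts start signals to the FSMD
        def on_fu_done(self):                   # completion signal from the shared module
            if not self.running:
                return []                       # another operation finished: ignore
            self.running, self.done = False, True
            self.rm.release(self)
            return ["OP-done"]                  # token forwarded to consumer EMs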
In statically scheduled designs, operations that execute concurrently are not allowed to share the same hardware module, thus avoiding resource conflicts. Instead, in our dataflow design, RMs dynamically resolve resource conflicts at runtime. During the synthesis process, the module binding phase maps operations to resources: in our case, operations are neural network layers (or other coarse-grained linear algebra algorithms), and resources are the statically scheduled FSMD accelerators. Module binding aims at heuristically reducing the number of resource conflicts stalling the execution. Binding is implemented with a heuristic algorithm that solves the clique covering problem on a Weighted Compatibility Graph (WCG) [41], where nodes represent operations and edges represent compatibility relations (i.e., if two operations are connected, they can share a hardware resource). While clique covering is an NP-complete problem, we use a well-known heuristic on relatively small graphs: nodes in our approach correspond to layers in a neural network, or in any case to large portions of the input application. Typically the size of the graphs is at most in the hundreds of nodes, allowing the module binding phase to complete in a few minutes on the largest cases.
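A simplified, unweighted version of the binding step can be sketched as a greedy clique construction over the compatibility relation; the actual implementation follows the weighted formulation of [41], so the code below is only an illustration of the idea.

    # Greedy sketch of module binding as clique covering on a compatibility graph:
    # operations placed in the same clique share one FSMD accelerator.
    # "compatible(a, b)" returns True when two operations may share a module.
    def bind_modules(operations, compatible):
        cliques = []                            # each clique -> one shared accelerator
        for op in operations:                   # e.g. in topological order of the task graph
            for clique in cliques:
                if all(compatible(op, other) for other in clique):
                    clique.append(op)           # reuse an existing module
                    break
            else:
                cliques.append([op])            # allocate a new module
        return cliques

    # Example: two convolutions of the same kind can share a module, the ReLU cannot.
    ops = ["conv_a", "conv_b", "relu"]
    same_kind = lambda a, b: a.split("_")[0] == b.split("_")[0]
    print(bind_modules(ops, same_kind))         # -> [['conv_a', 'conv_b'], ['relu']]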
After binding, the distributed controller generation process defines tie-breaking rules for the RMs, determining how to resolve structural conflicts that may occur at runtime if operations concurrently request the same module. When operations require the same module at different times, there is no competition and RMs simply process the requests following the order of arrival. Our implementation defines the tie-breaking rules based on the topological order of the operations in the input task graph. A different method may lead to a different execution order, but the execution output would remain the same because the system is built to respect dependencies between operations.
The high regularity of the DC architecture facilitates the automated synthesis process. The synthesis process allocates one RM for each shared module, according to the results of module binding. Then, it traverses the call graph, instantiating an EM for each operation, with custom AMs synthesized according to the operation dependencies. Fig. 5 shows a schematic of the overall architecture design for the ML model of Fig. 3. After SODA-OPT optimization and dispatching, the task graph contains four calls to the different kernel functions. Bambu synthesizes the four kernel functions using the standard FSMD approach, and the necessary DC components from the task graph describing the dependencies between functions. FSMD modules will then be assembled with their EMs, RMs, and with the memory interface (Section 5.2).

Hierarchical Memory Interface
After generating the datapath and the DC, the accelerators need to be connected to the memory. Our dynamically scheduled design leverages a specialized memory architecture (Hierarchical Memory Interface, or HMI) to manage concurrent access to shared memory from independent accelerators. The HMI is a multi-ported memory controller that dynamically assigns concurrent memory requests from different resources to multiple external memory channels, computing destination addresses at runtime with no additional delay. If the destination addresses of different memory operations collide, the HMI serializes the memory accesses. The HMI extends the design of the custom Memory Interface Controller (MIC) described in [42]. It is composed of several replicated memory interfaces (MIs), interconnected in a chain. Each MI performs only one memory operation at a time, but all MIs can operate in parallel. The concept of hierarchy appears in the way signals are propagated across the architecture. Fig. 6 shows the schematic representation of the HMI for two modules x and y. Additional modules would be chained in the same way. Each MI provides the following ports: 1) sel_store: write access request; 2) sel_load: read access request; 3) addr: memory address; 4) w_data: data to write; 5) r_data: loaded data; 6) ready: completion of the memory access. The top-level module is the only one that directly interfaces with the memory. The propagation scheme requires that only one module at a time sets the sel_store and sel_load signals, which identify memory access requests. Statically scheduled designs ensure this behavior by pre-determining the order of operations and executing only one operation at a time. However, this can degenerate into sequential execution of modules that could instead execute simultaneously for part of their computation. We avoid this issue by integrating additional control logic in the HMI that exploits the presence of the RM and SM blocks from the dataflow architecture. An RM intercepts memory access requests (req). If it accepts a request, it notifies a dedicated SM component, associated with the MI of each module. For example, in Fig. 6, SM_x is associated with the MI of module x, while SM_y is associated with the MI of y. If the top module encapsulating x and y also needed to access memory, a third SM and a third MI would simply be added to the arbitration scheme.
Load and store ports from communicating FSMD accelerators are connected to the HMI which, in turn, connects them to a multi-ported shared memory. Our dataflow design can either connect to high-performance multi-banked scratchpad memories or to external multi-ported DRAM controllers (e.g., Xilinx AXI DRAM controllers for FPGAs). SODA-OPT analysis passes compute the amount of data exchanged between kernels, and consequently determine the required size of the shared memory, accounting for double buffering and concurrent execution of the accelerators.
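As a rough illustration, the shared-memory sizing could be computed as the sum of the inter-kernel buffer sizes, doubled to account for double buffering; the sketch below encodes this assumption (the actual SODA-OPT analysis derives buffer sizes from memref types and may size the memory differently).

    # Hypothetical sketch of the shared-memory sizing analysis, assuming double
    # buffering requires two copies of every buffer exchanged between kernels, so a
    # producer can fill one copy while the consumer drains the other. Placeholder
    # byte counts are used here.
    def shared_memory_bytes(edges):
        """edges: list of (producer, consumer, buffer_bytes) from the task graph."""
        return sum(2 * size for _, _, size in edges)   # x2 for double buffering

    edges = [("conv2d", "bias_add", 1_000_000), ("bias_add", "relu", 1_000_000)]
    print(shared_memory_bytes(edges))                  # -> 4000000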

EXPERIMENTAL RESULTS
In this section, we present results of our end-to-end hardware generation flow. We first synthesize isolated ML operators from representative DNN models, and then we evaluate the difference between the two available architectures (centralized and dataflow) composed of multiple kernels.
In all experiments we maintained the following setup: we target ASIC devices at the 45 nm technology node through the OpenROAD flow, with an operating frequency of 500 MHz. We use Bambu with its Clang 12 frontend and the O2 optimization level. Each synthesized kernel has two ports connecting it to a shared memory with a 2-cycle read latency and a 1-cycle write latency. Models are synthesized using 32-bit floating point units. Our flow is also able to generate several solutions for memory interfacing, including instantiating dedicated load/store ports for each input/output argument to an operator, or a parametric number of load/store ports. However, for this analysis, we only employ two ports per accelerator, because we then combine the accelerators in larger designs with multiple intercommunicating accelerators. Hence, in the complete architectures, memory parallelism is exploited by having different accelerators operating concurrently, limiting the growth of the HMI complexity.

Effectiveness of the Optimization Pipeline
We automatically outline and synthesize individual operators from the ResNet50 and MobileNetV2 DNN models (Fig. 7), in two different configurations. In the baseline configuration, we outline, lower to LLVM IR, and synthesize each kernel without applying optimization passes. In the optimized configuration, we add the SODA-OPT high-level optimization pipeline, with the goal of reducing execution time. As discussed in Section 4, SODA-OPT automatically optimizes IRs, for example, to present increased instruction-level parallelism and a reduced number of redundant instructions. The transformations allow Bambu to compute efficient schedules and to best leverage the available hardware resources during HLS, as shown by the increase in performance. By enabling the high-level optimizations in SODA-OPT, we observe an average speedup of 7.2x in the execution time (clock cycles) over the baseline for ResNet50 layers (Fig. 8a). For operations from the MobileNetV2 model, we see an average speedup of 23.5x over the baseline, with peaks of 52-74x in the convolutional layers (Fig. 8b). Table 3 shows the post-floorplanning characteristics of the optimized accelerators generated by the SODA Synthesizer. We computed the efficiency (GFLOPS/W) by counting the total number of floating point arithmetic operations performed during the whole execution of an accelerator, divided by the execution time and the power consumption reported by OpenROAD after floorplanning. All the accelerators provide an efficiency well above 1 GFLOPS/W, with power consumption ranging from 20 to 440 mW.
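In formula form, the efficiency reported in Table 3 corresponds to the expression below, where N_FLOP is the total number of floating point operations executed by the accelerator, t_exec its execution time, and P the post-floorplanning power estimate:

    \mathrm{Efficiency}\;[\mathrm{GFLOPS/W}] \;=\;
        \frac{N_{\mathrm{FLOP}}}{t_{\mathrm{exec}}\,[\mathrm{s}] \times P\,[\mathrm{W}] \times 10^{9}}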
In all cases, after enabling SODA-OPT we observe a trade-off between performance and area/power consumption, with power and area overheads that increase linearly with the obtained speedup. This is expected, as the SODA-OPT default optimization pipeline generates bigger designs by allocating more resources in parallel to reduce the execution time (especially through loop unrolling). With the selected benchmarks, we can see that simple operators, such as ReLU, achieve an efficiency up to hundreds of GFLOPS/W, while more complex operators, such as convolutions, reach 10 GFLOPS/W. In fact, an increase in the amount of allocated computational resources increases power consumption: for this reason, smaller kernels (e.g., ReLU) have lower power overhead and higher efficiency. Table 4 shows characteristics and quality metrics of popular architectures used for training and inference of DNNs, compared to one of the highly specialized accelerators generated by the SODA Synthesizer. To perform this comparison, we used SODA to generate dense matrix multiplication (MatMul) accelerators with 8 memory ports and different number formats. For programmable devices, peak rates may or may not be achieved depending on how optimized the input code is; instead, our approach implements fully specialized accelerators for specific operations (e.g., matrix multiplication, matrix-vector multiplication, or ML operators and models). Therefore, the efficiency results reported in Table 4 are derived from a theoretical peak throughput, except for the accelerator generated by SODA, where we calculated the actual throughput and efficiency based on execution time. SODA generates FSMD accelerators where the number of functional units depends on the amount of exposed parallel arithmetic operations in the kernel (which is controlled by high-level optimizations), while other devices based on systolic arrays contain thousands of processing units that may or may not be fully utilized depending on the operation. Considering these differences, and the significantly older technology node, our efficiency results at FP32 are comparable to those of the other FP32 accelerators, which are also considerably larger devices and might not meet edge requirements.

Qualitative Comparison With Other ML Accelerators
The designs in the bottom half of the table support ML-specific floating point number formats (e.g., bfloat16 or tensorfloat32) or integer/fixed-point formats. FPGA-based custom accelerators typically focus on integer/fixed-point formats, as implementing floating point units on fine-grained reconfigurable devices is inefficient. For example, DNNBuilder, targeting the Kintex UltraScale FPGA, leverages a fixed-point 16-bit format. Custom number formats increase efficiency with limited loss of accuracy; the SODA toolchain, thanks to its modularity, can easily be extended to parse quantized models and generate functional units with specialized number formats. The SODA-generated fixed-point 16-bit MatMul reaches an efficiency >150 GOPS/W.

Comparison Between Dataflow Architecture and Centralized Architecture
We used the blocks of DNN layers in Fig. 7 to compare the performance of the two different ways in which we can connect synthesized accelerators: the centralized architecture and the dataflow architecture. The centralized architecture is a system like the one depicted in Fig. 3d, where individual accelerators are attached to a central bus, a microcontroller drives their execution, and the data they exchange is stored in and retrieved from an external memory. The dataflow architecture, instead, is a system that uses our distributed controller to orchestrate the execution of accelerators accessing a shared memory, similar to the one in Fig. 3e. For all experiments, each individual operator in the DNN graph is outlined, processed by SODA-OPT with the full optimization pipeline, synthesized by Bambu, and simulated with Verilator. While the simulation already accounts for shared memory accesses, we estimate the cost of communication between accelerators and external memory taking into consideration the type and size of the inputs and outputs of each kernel. We consider a memory bandwidth of 6400 MB/s, typical of DDR3 RAM modules using 45 nm technology cells, and calculate transfer times as seen by the accelerator, in terms of clock cycles at 500 MHz. In the centralized architecture, accelerators communicate with each other through the external memory. In the dataflow architecture, only the graph inputs and outputs go through the external memory, while intermediate results are kept within a shared internal scratchpad memory. We assume this shared scratchpad memory to have as many ports as independent accelerators, so that, using the HMI memory interface described in Section 5, the architecture can support conflict-free concurrent accelerator execution, allowing for efficient pipelined execution of streaming workloads. We assume a latency of 2 cycles for read and 1 cycle for write operations. This assumption is reasonable, as there exist high-performance scratchpad designs with up to 16 independent banks, enough to support the benchmarks in our experiments.
To model the overall latency of the centralized architecture, we simply add the execution time of each accelerator to the time it takes to transfer data to/from external memory before and after its execution. We compute the streaming latency by multiplying the result by the number of inputs in the stream. In fact, although the synthesized accelerators can execute in parallel on different inputs, the application host code derived from the original MLIR representation of the DNN model only invokes them sequentially. The model to estimate the performance of the dataflow architecture, instead, takes into account the support for concurrent and pipelined execution provided by the distributed controller. We first identify the longest path in a directed acyclic graph where vertices correspond to kernel or memory latencies and edges replicate the edges in the application dataflow graph; the sum of latencies along the critical path corresponds to the overall execution time for a single input. In this way, we account for fork-join patterns in the application dataflow graph, where multiple branches can be executed in parallel and the overall latency is determined by the slowest branch. In streaming execution, the dataflow architecture latency becomes the latency of a single input execution plus N - 1 times the initiation interval, where N is the number of elements in the input stream and the initiation interval is the latency of the slowest kernel or memory transfer. Table 5 provides the execution latency in clock cycles for the two blocks of layers in Fig. 7, and uses the results from the centralized architecture as a baseline to assess the performance improvement provided by the dataflow architecture. For the ResNet50 block, using our dataflow architecture, accelerators implementing layers in the upper branch of Fig. 7a can execute in parallel with accelerators implementing layers in the lower branch. Compared to the centralized architecture baseline, this results in a speedup of 1.4x during single input execution, and a speedup of 3.5x when streaming a batch of 100 inputs. For MobileNetV2, although Fig. 7b does not have parallel branches, we still observe significant savings due to the reduced accesses to external memory; in fact, the centralized system spends 57.9x and 5,791.2x more cycles to transfer data between accelerators and external memory, with a single input and when streaming a batch of inputs, respectively.
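The two analytic models can be condensed into a short Python sketch; it assumes a linear chain of kernels (so the critical path is simply the sum of the stage latencies), cycle counts at 500 MHz, and the 6400 MB/s DDR3 bandwidth used in the text, and it is an approximation of the estimates described above rather than the exact evaluation scripts.

    # Sketch of the two latency models (cycles at 500 MHz, 6400 MB/s DRAM bandwidth).
    CYCLES_PER_BYTE = 500e6 / 6400e6       # cycles spent per byte moved to/from DRAM

    def transfer_cycles(num_bytes):
        return num_bytes * CYCLES_PER_BYTE

    def centralized_latency(compute_cycles, dram_bytes_per_kernel, n_inputs):
        """Every kernel reads its inputs from and writes its outputs to external
        DRAM; the host invokes the kernels sequentially for each stream element."""
        per_input = sum(c + transfer_cycles(b)
                        for c, b in zip(compute_cycles, dram_bytes_per_kernel))
        return per_input * n_inputs

    def dataflow_latency(compute_cycles, graph_io_bytes, n_inputs):
        """Only the graph inputs/outputs touch DRAM; intermediate results stay in
        the shared scratchpad, and execution is pipelined across stream elements."""
        single_input = sum(compute_cycles) + transfer_cycles(graph_io_bytes)
        initiation_interval = max(compute_cycles + [transfer_cycles(graph_io_bytes)])
        return single_input + (n_inputs - 1) * initiation_interval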

Discussion
Our experiments show how the SODA Synthesizer high-level optimizations and dataflow methodology provide significant performance gains, with reasonable area and power overheads, while requiring minimal user interaction. Outlining each layer for acceleration, as we did in the experiments, can lead to imbalanced execution times and utilization (e.g., a ReLU node remaining idle for most of the time waiting until the convolution node has finished). In the future, the outlining strategy can be improved to better exploit optimization at the level of the computational graph. Operators (e.g., convolution, bias, ReLU) can be fused together or further partitioned into smaller primitives, aiming to generate accelerators with similar computational intensity and a higher utilization of resources. For kernels that are memory-bound, the HMI design can be further extended to support FSMD accelerators with multiple ports and better manage buffers between nodes, to increase memory parallelism in the dataflow  architecture. In summary, our dataflow generation methodology enables the synthesis of input DNN models in a system of specialized accelerators, generates efficient accelerators through high-level optimizations, and does not require any manual code modification to identify the accelerators in the input specification.

CONCLUSION
This article presents the SODA Synthesizer, an open-source, multi-level, no-human-in-the-loop hardware compiler able to transform specifications from high-level software frameworks (machine learning in particular) into efficient FPGA/ASIC accelerators. Its frontend, SODA-OPT, leverages the MLIR framework to identify kernels for acceleration, to generate orchestrating code, and to implement a set of high-level optimizations that restructure the kernels to improve the results of the hardware generation backend, i.e., the state-of-the-art HLS tool Bambu. SODA also implements a methodology to assemble highly optimized FSMD accelerators in a coarse-grained, dynamically scheduled dataflow design, which provides better performance compared to a centralized architecture with a microcontroller driving the execution of accelerators, especially in the case of streaming inputs. We show that our high-level optimization pipeline effectively yields better HLS results (up to 74x speedup compared to an unoptimized baseline), and that the dataflow architecture can provide a further 3x speedup thanks to reduced accesses to external memory, concurrent execution, and pipelining.
Future works involve further extending the methodology for the generation of the dataflow system of accelerators, both at the frontend level and at the backend. At the frontend, SODA can exploit semantic information of the computational graph to better balance the custom accelerators generated for each operator or layer. At the backend, there are opportunities to further improve interfacing between accelerators and memory.
Our compiler-based toolchain is modular by construction, and it will easily allow further development to introduce new optimization techniques, automate design space exploration, and explore different architectural models.
Serena Curzel received the BS degree in electronics and telecommunication engineering from the Università degli Studi di Trento, Italy, in 2016, and the MS degree in electronics engineering from Politecnico di Milano, Italy, in 2019, where she is currently working toward the PhD degree in information technology. Her main research interests include the acceleration of domain-specific applications (including deep neural networks) and high-level synthesis. Since 2021, she has been a PhD intern at Pacific Northwest National Laboratory, where she is collaborating with the HPC Group to develop new HLS techniques and compiler optimizations for machine learning hardware.
Nicolas Bohm Agostini received the bachelor's degree in electrical engineering from the Universidade Federal do Rio Grande do Sul (UFRGS, Brazil), in 2015. He is currently working toward the PhD degree with the Department of Electrical and Computer Engineering, Northeastern University. His primary research interests include computer architecture and high-performance computing. He enjoys teaching and mentoring, which he exemplified by instructing courses such as compilers, GPU programming, and embedded robotics during his graduate career. Recently, he has been working on projects focused on accelerating machine learning and linear algebra applications by proposing compiler extensions for different targets or new computer architecture features.
Vito Giovanni Castellana received the MSc (cum laude) degree in computer engineering and the PhD degree in computer science and engineering from Politecnico di Milano, Italy, in 2010 and 2014, respectively. He is a senior computer scientist with the High-Performance Computing Group, Pacific Northwest National Laboratory, which he joined in 2012. His research interests include design automation and high-level synthesis, parallel programming, big data and graph analytics, and compiler technologies.

Ankur Limaye received the PhD degree in electrical and computer engineering from the University of Arizona, Tucson, Arizona, in 2020. He is a postdoctoral research associate with the High-Performance Computing Group, Pacific Northwest National Laboratory, Richland, WA. His research interests include computer architecture, HW/SW co-design, and workload characterization and performance analysis.

Marco
Joseph Manzano received the PhD degree from the University of Delaware, in 2011. He is a senior computer scientist with the High-Performance Computing Group, Pacific Northwest National Laboratory (PNNL). His interests span areas such as compilers, runtime systems, cybersecurity, performance modeling, and benchmarking.
Jeff Zhang received the PhD degree from the Electrical and Computer Engineering Department, New York University, in 2020. He is a postdoctoral fellow with the Architecture, Circuits, and Compilers Group, Harvard University. He also has research internship experience with Samsung Semiconductor and Microsoft Research. His general research interests include deep learning, computer architecture, and EDA, with particular emphasis on energy-efficient and fault-tolerant design for AI/ML systems and hardware accelerators.