A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective

With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on the premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and the evaluation of more complex applications.


I. INTRODUCTION
With the end of Dennard's scaling [1] and the looming threat that even Moore's law [2] is about to end [3], computing is perhaps facing its most challenging moments. Today, computer researchers and practitioners are aggressively pursuing and exploring alternative forms of computing in order to try to fill the void that an end of Moore's law would leave behind. There is already a plethora of emerging technologies with the promise of overcoming the limits of technology scaling, such as quantum or neuromorphic computing [4], [5]. However, not all post-Moore architectures are intrusive, and some merely require us to step away from the comforts that the von Neumann architecture offers. Among the more salient of these technologies are reconfigurable architectures [6].
Reconfigurable architectures are systems that attempt to retain some of the silicon plasticity that an ASIC solution usually throws away. These systems, at least conceptually, allow the silicon to be malleable and its functionality dynamically configurable. A reconfigurable system can, for example, mimic a processor architecture for some time (e.g., a RISC-V core [7]), and then be changed to mimic an LTE baseband station [8]. This property of reconfigurability is highly sought after, since it can mitigate the end of Moore's law to some extent: we do not need more transistors, we just need to spatially configure the silicon to match the computation in time.
Recently, a particular branch of reconfigurable architecture, the Field-Programmable Gate Array (FPGA) [9], has experienced a surge of renewed interest for use in High-Performance Computing (HPC), and recent research has shown performance or power benefits for multiple applications [10]-[14]. At the same time, many of the limitations that FPGAs have, such as slow configuration times, long compilation times, and (comparably) low clock frequencies, remain unsolved. These limitations have been recognized for decades (e.g., [15]-[17]), and have been used to drive forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architectures (CGRAs).
CGRAs trade some of the flexibility that FPGAs have in order to overcome these limitations. A CGRA can operate at higher frequencies, can provide higher theoretical compute performance, and can drastically reduce compilation times. While CGRAs have traditionally been used in embedded systems (particularly for media processing), they have lately also been considered for HPC. Even traditional FPGA vendors such as Xilinx [18] and Intel [19] are creating and/or investigating coarser versions of their existing reconfigurable architectures to complement other forms of compute.
In this paper, we survey the literature of CGRAs, summarizing the different architectures and systems that have been introduced over time. We complement surveys written by our peers by focusing on understanding the trends in performance that CGRAs have been experiencing, providing insights into where the community is moving and any eventual gaps in knowledge that can/should be filled.
The contributions of our work are as follows:
The remainder of the paper is organized in the following way. Section II introduces the motivation behind CGRAs, as well as their generic design for the unfamiliar reader. Section III positions this survey against existing surveys on the topic. Section IV quantitatively summarizes each architecture that we reviewed, describing key characteristics and the premise behind the respective architecture. Section V analyzes the reviewed architectures from different perspectives (Sections VII, VIII, and VI), which we finally discuss at the end of the paper in Section IX.

II. INTRODUCTION TO CGRAS
Before summarizing the now three decades of Coarse-Grained Reconfigurable Architecture (CGRA) research, we start by describing the main aspirations and motivations behind them. To do so, we need to look at the CGRA's predecessor: the Field-Programmable Gate Array (FPGA).
FPGAs are devices that were developed to reduce the cost of simulating and developing Application-Specific Integrated Circuits (ASICs). Because any bug or fault left undiscovered after ASIC tape-out would incur a (potentially) great economic loss, FPGAs were (and still are) crucial to digital design. In order for FPGAs to mimic any digital design, they are made to have a large degree of fine-grained reconfigurability. This fine-grained reconfigurability was achieved by building FPGAs to contain a large number of on-chip SRAM cells called Look-Up Tables (LUTs) [20]. Each LUT was interfaced by a few input wires (usually 4-6) and produced an output (and its complement) as a function of the SRAM content and its inputs. Hence, depending on the sought-after functionality to be simulated, LUTs could be configured and, through a highly reconfigurable interconnect, connected to each other to finally yield the expected design. The design would naturally run one to three orders of magnitude slower than the final standard-cell ASIC, but would nevertheless be an invaluable prototyping tool.
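To make the LUT mechanism concrete, the following minimal Python sketch models a k-input LUT as 2^k configuration bits that are addressed by the inputs. The class and method names (LUT, from_function, evaluate) are our own illustrative choices and do not correspond to any vendor primitive; the full-adder example only illustrates how two configured LUTs that share inputs yield a small design.

```python
from itertools import product


class LUT:
    """A k-input look-up table: 2**k SRAM bits addressed by the inputs."""

    def __init__(self, k, sram_bits):
        if len(sram_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.sram = list(sram_bits)

    @classmethod
    def from_function(cls, k, fn):
        # "Configure" the LUT by tabulating fn over every input combination.
        bits = [int(bool(fn(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # The inputs form the SRAM address (first input is the most significant bit).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        out = self.sram[address]
        return out, 1 - out  # the output and its complement


# Two 3-input LUTs sharing the same inputs implement a full adder.
sum_lut = LUT.from_function(3, lambda a, b, cin: a ^ b ^ cin)
carry_lut = LUT.from_function(3, lambda a, b, cin: (a + b + cin) >= 2)


def full_adder(a, b, cin):
    return sum_lut.evaluate(a, b, cin)[0], carry_lut.evaluate(a, b, cin)[0]
```

Changing the function only changes the SRAM content, not the structure, which is what makes the fabric reconfigurable at such a fine grain.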
By the early 1990s, FPGAs had already found other uses (aside from digital development) within the telecommunication, military, and automobile industries. The FPGA was seen as a compute device in its own right, rather than only something used in the niche market of prototyping digital designs, and there was some aspiration to use it for general-purpose computing. Despite this, several limitations of FPGAs were quickly identified that prohibited coverage of a wide range of applications. For example, unlike software compilation tools that take minutes to compile applications, the FPGA Electronic Design Automation (EDA) flow took significantly longer, often requiring hours or even days of compilation time. Similarly, if the expected application could not fit on a single device, the long reconfiguration overhead (the time it takes to program the FPGA) discouraged time-sharing or context-switching of its resources. Another limitation was that some important arithmetic operators did not map well to the FPGA; for example, a single integer multiplication could often consume a large fraction of the FPGA resources. Finally, FPGAs were relatively slow, running at a low clock frequency.
Many of these challenges and limitations of applying FPGAs for general-purpose computing hold to this day.
Many early reconfigurable computing pioneers looked at the limitations of FPGAs and considered what would happen if one increased the granularity at which the device was programmed. By increasing the granularity, larger and more specialized units could be built, which would increase the performance (clock frequency) of the device. Also, since the larger units require less configuration state, reconfiguring the device would be significantly faster, allowing fine-grained time-sharing (multiple contexts) of the device. Finally, by coarsening the units of reconfiguration, one could include those units that map poorly onto FPGAs (e.g., multipliers) directly in the fabric, making better use of the silicon and increasing the generality of the device. These new devices would later be called Coarse-Grained Reconfigurable Architectures (CGRAs).
An example of what a CGRA looks like from the architecture perspective is shown in Figure 1. In Figure 1:a we see a mesh of reconfigurable cells (RCs) or processing elements (PEs), which are the smallest units of reconfiguration that perform work, and it is through this mesh that a user (or compiler) decides how data flows through the system. There are multiple ways of bringing data into and out of the fabric. One common way is to map the device into the memory of a host processor (memory-mapped) and have the host processor orchestrate the execution. A different way is to include (generic) address generators (AGs) that can be configured to access external memory using some pattern (often corresponding to the nested loops of the application) and push the data through the array. A third option is to have the reconfigurable cells do both the computation and the address generation. Figure 1:b illustrates the internals of an RC, which includes an ALU (integer and/or floating-point capable), two multiplexers (MUXs), and a local static RAM (SRAM) used for storage. The two multiplexers decide which of the external inputs to operate on. The inputs are usually the outputs of adjacent RCs, the local SRAM scratchpad, a constant, or a previous output (e.g., for accumulations). The output of the ALU is similarly connected to adjacent RCs, the local SRAM, or back to one of the MUXs. Operation of the RC is governed by a configuration register, briefly shown in Figure 1:c. For simplicity, we show a single register that holds the state; however, in many architectures, each RC can hold multiple configurations that are cycled through over the application lifetime. Each configuration can, for example, hold the computation for a particular basic block (where live-out variables are stored in SRAM) or for discrete kernels. Figure 1 illustrates what a majority of today's CGRAs look like, but at the same time there are multiple variations.
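The reconfigurable cell described above can be sketched in a few lines of Python. This is a behavioural toy model under our own assumptions, not the design of any particular CGRA: the field names of Config, the set of ALU operations, the source names ("north", "west", "prev", "sram", "const"), and the round-robin cycling of contexts are all hypothetical.

```python
import operator
from dataclasses import dataclass

ALU_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass(frozen=True)
class Config:
    """One configuration-register entry (field names are illustrative)."""
    op: str                   # ALU operation to perform
    mux_a: str                # source selected by the first multiplexer
    mux_b: str                # source selected by the second multiplexer
    const: int = 0            # constant operand, used when a mux selects "const"
    sram_addr: int = 0        # scratchpad address used by "sram" and write_sram
    write_sram: bool = False  # also store the ALU output in the scratchpad


class ReconfigurableCell:
    def __init__(self, contexts, sram_words=16):
        self.contexts = list(contexts)  # multiple configurations, cycled through
        self.sram = [0] * sram_words
        self.prev_out = 0               # feedback path, e.g. for accumulations

    def _select(self, source, cfg, neighbours):
        if source == "const":
            return cfg.const
        if source == "prev":
            return self.prev_out
        if source == "sram":
            return self.sram[cfg.sram_addr]
        return neighbours[source]       # output of an adjacent cell

    def step(self, cycle, neighbours):
        cfg = self.contexts[cycle % len(self.contexts)]
        a = self._select(cfg.mux_a, cfg, neighbours)
        b = self._select(cfg.mux_b, cfg, neighbours)
        self.prev_out = ALU_OPS[cfg.op](a, b)
        if cfg.write_sram:
            self.sram[cfg.sram_addr] = self.prev_out
        return self.prev_out


# A single context that accumulates the values arriving from the north.
acc = ReconfigurableCell([Config(op="add", mux_a="north", mux_b="prev")])
running_sums = [acc.step(c, {"north": x}) for c, x in enumerate([1, 2, 3, 4])]

# Two contexts cycled through: first 2 * north, then add the west input.
cell = ReconfigurableCell([
    Config(op="mul", mux_a="north", mux_b="const", const=2),
    Config(op="add", mux_a="prev", mux_b="west"),
])
outputs = [cell.step(c, {"north": 3, "west": 4}) for c in range(2)]
```

In this sketch the entire behaviour of a cell is determined by a handful of configuration fields, which reflects why a coarse-grained cell needs far less configuration state than the equivalent logic built from LUTs.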
For example, early CGRAs often included fine-grained reconfigurable elements (Look-Up Tables, LUTs) inside the fabric. While the mesh topology is by far the most common, some work chose a ring or linear-array topology. Finally, the flow control of data in the network can be of varying complexity (e.g., token or tagged-token). We describe many of these in our summary in the sections that follow.
vertical and horizontal axis, and the second layer are four quadrants composed into a mesh. Unlike previous CGRAs, MorphoSys had a dedicated multiplier inside the ALUs. A CGRA based on MorphoSys was also realized in silicon nearly seven years from its inception [39].
While most of the CGRAs described so far used a mesh topology of interconnection (with some connectivity), other topologies have been considered. RaPiD [40], [41] is a CGRA that arranged its reconfigurable processing elements in a single dimension. Here, each processing element is composed of a number of primitive blocks, such as ALUs, multipliers, scratchpads, or registers. These primitive blocks are connected to each other through a number of local, segmented, tri-stated bus lines that can be configured to form a data-path, a so-called linear array. These processing elements can themselves be chained together to form the final CGRA. Interestingly, RaPiD can be partially reconfigured during execution in what the authors called "virtual execution". RaPiD itself does not access data; instead, a number of generic address-pattern generators interface with external memory and stream the data through the compute fabric.
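A generic address-pattern generator of the kind mentioned above can be approximated as a set of nested counters with per-level strides. The sketch below is an assumption-laden illustration in Python, not RaPiD's actual generator: the function name address_generator and its parameters (base, extents, strides) are hypothetical.

```python
def address_generator(base, extents, strides):
    """Yield the addresses of a loop nest, outermost loop first.

    extents[i] is the trip count of loop level i and strides[i] is the
    address increment contributed by one step of that loop.
    """
    if len(extents) != len(strides):
        raise ValueError("one stride is needed per loop level")
    if any(extent <= 0 for extent in extents):
        raise ValueError("trip counts must be positive")
    counters = [0] * len(extents)
    while True:
        yield base + sum(c * s for c, s in zip(counters, strides))
        # Increment the innermost counter and carry outwards on wrap-around.
        level = len(counters) - 1
        while level >= 0:
            counters[level] += 1
            if counters[level] < extents[level]:
                break
            counters[level] = 0
            level -= 1
        if level < 0:
            return  # the outermost loop wrapped: the pattern is complete


# Walk a 2x3 row-major matrix column by column and stream it out.
memory = [10, 11, 12, 13, 14, 15]
addresses = list(address_generator(base=0, extents=(3, 2), strides=(1, 3)))
stream = [memory[a] for a in addresses]
```

Because the access pattern lives entirely in the generator's configuration, the compute fabric can stay passive and simply consume the resulting stream.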
The KressArray [42]-[44] was one of the earliest CGRA designs to be created, and the project spanned nearly a decade with multiple versions and variants of the architecture. It features a hierarchical topology, where the lowest tier was composed of a mesh of processing elements. The processing elements interfaced with their neighbours, and also included predication signals (to map if-then-else primitives). Generic address generators supported the CGRA fabric by continuously streaming data to the architecture.
Chimaera [45] is a co-processor conceptually similar to GARP, with an array of reconfigurable processing elements operating at a quite fine granularity (similar to modern FPGAs) that can be reconfigured to perform a particular operation. It is closely coupled to the host processor, to the point where the register file is (in part) shadowed and shared. Mapping applications to the architecture was assisted by a "simple" C compiler, and the authors demonstrated performance on Mediabench [46] and Honeywell [47].
PipeRench [48] applied a novel network topology that was a hybrid between that of a mesh and a linear array. Here, a large number of linear arrays were layered, where each layer sent data uni-directionally to the next layer. Several later CGRAs would adopt this kind of structure, including data-flow machines (e.g., TARTAR) and loop accelerators (e.g., FPCA). The layers themselves in PipeRench were fairly fine-grained and comparable to GARP, as they had reconfigurable Look-Up Tables rather than fixed-function ALUs within. PipeRench introduced a virtualization technique that treated each separate layer as a discrete accelerator, where a partial reconfiguration traveled alongside its associated data, reconfiguring the next layer according to its functionality in a pipelined fashion, which was new at the time. PipeRench was also later implemented in silicon [49].
The DREAM [50] architecture was explicitly designed to target (then) next-generation 3G networks, and its authors argue that CGRAs are well suited for the upcoming standard with respect to software-defined radio and the flexibility to hot-fix bugs (through patches) and firmware. The system has a hierarchy of configuration managers and a mesh of simple, ALU-based RPEs operating on 16-bit operands and with limited support for complex operations such as multiplications (since operations were realized through Look-Up Tables).
So far, all described architectures have computed using integer arithmetic. Imagine [51] is among the early architectures that included hardware floating-point arithmetic units. The architecture itself is similar to RaPiD: it is a linear array, where each processing element has a number of resources (scratchpads, ALUs, etc.), all connected using a global bus. Similar to RaPiD, the processing elements are passive, and external drivers are responsible for streaming data along the connected processing elements. The Imagine architecture had a prototype realized six years after its seminal paper [52].
1) Modern Coarse-Grained Reconfigurable Architectures: Most modern CGRA architectures can trace their lineage back to those described in the previous section, and a majority of these architectures follow the generic template that was described in the previous section. However, while the overall template remains similar, many recent architectures specialize themselves towards a certain niche use (low-power, Deep-Learning, GPU-like programmability, etc.).
The ADRES CGRA system [53], [54] has been a remarkably successful architecture template for embedded architectures, and is still widely used. ADRES, like many previous and future CGRAs, consists of a mesh of processing elements where each element has neighbor (or Manhattan distance-2) connectivity. Inside each element we find an ALU of varying capability and a register file, alongside the multiplexers configured to bring in data from neighbours. The first row in the mesh, however, is unique, as it only contains the ALU (and no scratchpad/RF to store state). Instead, an optional processor can extend its pipeline to support interfacing that very first row in a Very Long Instruction Word [55] (VLIW) fashion. ADRES, by design, is thus heterogeneous. ADRES comes with a compiler called DRESC [56]. ADRES as an architecture has been (and still is) a popular platform for CGRA research, such as when exploring multi-threaded CGRA support [57], topologies [58], asynchronous further-than-neighbor communication (e.g., HyCube [59]), or CGRA design frameworks/generators (e.g., CGRA-ME [60], [61]). Furthermore, ADRES has been taped out in silicon, for example in the Samsung Reconfigurable Processor (SRP) and the follow-up UL-SRP [62] architecture.
The Dynamically Reconfigurable ALU Array (DRAA) [63] is a generic CGRA template proposed in 2003 to encourage compilation research on CGRA architectures. Architecture-wise, DRAA allows changing many of the parameters that define a CGRA, such as the data-path width, the interconnections, the size of the register file, etc. Preceding both DySER and ADRES, DRAA as a template has been used to e.g. study the memory hierarchy of CGRAs [64].
The TRIPS/EDGE [65], [66] microarchitecture was a long-running, influential project that attempted to move away from the traditional approach of exploiting instruction-level parallelism in modern processors. The premise behind TRIPS was that as technology reduced the sizes of transistors, wire delays and paths would dominate latency, and that it would be hard to scale the communication wires of traditional super-scalar processors [67]. Instead, by tightly coupling functional units in (for example) a mesh, direct neighbor communication could easily be scaled. In effect, TRIPS/EDGE replaced the traditional super-scalar Out-of-Order pipeline with a large CGRA array: single instructions were no longer scheduled, but instead a new compiler [68], [69] was developed that scheduled entire blocks ("CGRA configurations") temporally on the processor, allowing up to 16 instructions to be executed at a single time (and many more in-flight). The TRIPS architecture was taped out in silicon [70], [71] and, despite being discontinued, represented a milestone of true high-performance computing with CGRAs. An interesting observation, albeit not necessarily related to CGRAs, is that the EDGE ISA has received renewed interest as an alternative to express large amounts of ILP in FPGA soft processors [72].
The DySER [73] architecture integrates a CGRA into the backend of a processor's pipeline to complement (unlike e.g. TRIPS, which replaces) the functionality of the traditional (super-)scalar pipeline, and has been integrated in the OpenSPARC [74] platform [75]. The key premise behind DySER is that there are many local hot regions in program code and that higher performance can be obtained by specializing on accelerating these inside the CPU. DySER was evaluated using both a simulator (m5 [76]) and an FPGA implementation on well-known benchmarks (PARSEC [77] and SPECint) and compared with both CPU and GPU approaches, showing between 1.5x-15x improvements over SSE and comparable flexibility and performance to GPUs. Recently (2016 onwards), DySER has been the focus of much of the FPGA-overlay scene (see Section IV-F). Other work similar to DySER that integrates CGRA-like structures into processing cores with various goals includes CReAMS/HARTMP [78], [79] (which applies dynamic binary translation) and CGRA-sharing [80] (conceptually similar to what the AMD Bulldozer architecture [81] and UltraSPARC T1/T2 did with their floating-point units).
AMIDAR [82] is another exciting long-running project that (amongst others) uses a CGRA to accelerate performance-critical sections. The AMIDAR CGRA extends the traditional CGRA PE architecture with a direct interface to memory (through DMA). There is support for multiple contexts and hardware support for branching (through dedicated condition-boxes operating on predication signals), which also allows speculation. The AMIDAR CGRA has been implemented and verified on an FPGA platform, and early results show that it can reach over 1 GHz of clock frequency when mapped to a 45 nm technology.
The MORA [83] architecture is a platform for CGRA-related research. MORA targets media processing, and hence provides an 8-bit architecture with processing elements covering the most commonly used operations. MORA itself is similar to the previous MATRIX, with a simple 2D mesh structure with neighbour communication. Each processing element has a scratchpad (256 bytes large). MORA is programmable using a domain-specific language developed over C++ [84].
The CGRA Express [85] is yet another architecture that follows the concept of being a mesh with simple, ALU-like structures. The premise and motivation for the work is that most existing CGRA applications are optimized for maximal graph coverage rather than sequential speed. The hypothesis is that, depending on the operators each PE is configured to use, one can exploit the resulting positive clock slack of the operators and cascade (fuse) more operations per clock cycle rather than blindly registering the intermediate output. This, in turn, allows more instructions to be executed per cycle (or the frequency to be reduced) with little performance loss. In their architecture, the authors add an extra bypass network that can be configured to not be pipelined. They show both power and performance benefits on multimedia benchmarks when comparing execution with and without their approach. The work could conceptually be seen as the opposite of what modern FPGAs (e.g., Stratix 10) do with HyperFlex [86], but for CGRAs.
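The idea of using clock slack to cascade operations can be illustrated with a simple greedy packing of a dependent chain of operations into clock cycles. This is only a sketch of the general principle and not the mapping algorithm of CGRA Express; the function name fuse_chain and the delay values (in picoseconds) are made-up illustrative numbers.

```python
def fuse_chain(delays_ps, clock_period_ps):
    """Greedily pack a chain of dependent operations into clock cycles.

    delays_ps[i] is the combinational delay of operation i, which consumes
    the result of operation i - 1. Operations grouped together execute
    back-to-back within one cycle without an intermediate register.
    """
    groups, current, used = [], [], 0
    for index, delay in enumerate(delays_ps):
        if delay > clock_period_ps:
            raise ValueError("operation does not fit in a single clock cycle")
        if current and used + delay > clock_period_ps:
            groups.append(current)  # no slack left: register and start a new cycle
            current, used = [], 0
        current.append(index)
        used += delay
    if current:
        groups.append(current)
    return groups


# Hypothetical chain: shift, add, and, mul, add at a 3000 ps clock period.
delays = [900, 1200, 600, 2800, 1100]
fused = fuse_chain(delays, clock_period_ps=3000)
cycles_registered = len(delays)  # baseline: every output is registered
cycles_fused = len(fused)
```

With these assumed delays, the three cheap operations share one cycle while the multiplication and the final addition each need their own, so the chain completes in three cycles instead of five.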
The authors of the Polymorphic Pipeline Array (PPA) [87] performed an interesting pilot study that drove the parameters of their CGRA: they simulated a large number of benchmarks scheduled on a hypothetical (infinite) CGRA, with focus on modulo scheduling and loop unrolling. They revealed that even with an infinitely large CGRA, the performance will be bound as a function of the instruction-level parallelism in the loops and the limitations of modulo scheduling, and they argue that there is a definitive need to include other forms of parallelism to scale on CGRAs. While the PEs themselves follow a standard layout, the authors propose an interesting technique that allows multiple (unique) kernels to be executed concurrently on the CGRA, where the kernels communicate with each other either through DMA or shared memory. Kernels can also be resized to fully exploit the CGRA array.
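Why adding PEs stops helping can be seen from the standard lower bound on the initiation interval (II) used in modulo scheduling: the II can never be smaller than either the resource-constrained bound or the recurrence-constrained bound. The sketch below evaluates this textbook bound; it is not the PPA authors' simulator, and it assumes homogeneous PEs that each execute one operation per cycle. The loop parameters are hypothetical.

```python
from math import ceil


def min_initiation_interval(num_ops, num_pes, recurrences):
    """Lower bound on the initiation interval of a modulo-scheduled loop.

    recurrences holds one (latency, distance) pair per loop-carried
    dependence cycle, where distance is the number of iterations the
    cycle spans (at least 1).
    """
    res_mii = ceil(num_ops / num_pes)  # resource-constrained bound
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)


# A hypothetical loop body of 12 operations with one recurrence of latency 4.
loop = dict(num_ops=12, recurrences=[(4, 1)])
bounds = {pes: min_initiation_interval(num_pes=pes, **loop)
          for pes in (2, 4, 16, 1024)}
```

For this assumed loop the bound drops from 6 to 4 when going from 2 to 4 PEs and then stays at 4 regardless of array size, since the recurrence rather than the resources limits the throughput.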
The premise behind SIMD-RA [88] is similar to that of PPA: CGRAs rely too much on instruction-level parallelism, and opportunities from other forms of parallelism are lost. SIMD-RA focuses on embedding support to modularize the CGRA array into multiple discrete, controllable regions that (may) operate in SIMD fashion. The authors found that using SIMD not only yielded better performance, but was also more area-efficient compared to only using software pipelining.
SmartCell [89] is a CGRA that aspires to be low-power with high performance, supporting both SIMD- and MIMD-type parallelism. The architecture is effectively a 2D mesh, but with the mesh divided into 2x2 quadrants of processing elements. These 2x2 islands share a reconfigurable router, and inter-quadrant communication is limited to the connectivity of these routers. The processing elements themselves are fairly standard and contain instruction memory whose instruction (configuration) is set either per processing element (MIMD) or sequenced globally (SIMD).
BilRC [90] is a heterogeneous mesh composed of three different blocks: generic ALU blocks, multiplication/shifter nodes, and memory blocks, following the (by now) traditional recipe of a CGRA. Unique to BilRC is that the architecture explicitly exposes the triggering of instructions, allowing the programmer and/or application fine-grained control over the amount of parallelism and over when instructions are triggered.
The lack of floating-point support in CGRAs has also been a driving force for research. FloRA [91] is a 16-bit IEEE-754 floating-point capable CGRA. The architecture itself is composed of 64 RCs, and each RC is fairly standard and does not include a dedicated floating-point core; instead, multiple (2) RCs can be combined to enable single-precision (32-bit) floating-point support, where mantissa and exponent computation is distributed among the pair.
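To illustrate how a floating-point operation decomposes into an exponent part and a mantissa part that two integer units could compute side by side, the sketch below implements a single-precision multiplication in that style. It is a conceptual illustration only and does not describe FloRA's actual datapath; the function names are hypothetical, and the sketch assumes normal, finite, non-zero operands, truncates instead of rounding, and ignores overflow and underflow.

```python
import struct


def unpack_fp32(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = (bits & 0x7FFFFF) | 0x800000  # restore the hidden leading one
    return sign, exponent, mantissa


def pack_fp32(sign, exponent, mantissa):
    bits = (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def exponent_unit(exp_a, exp_b):
    # First integer unit: add the biased exponents and remove one bias.
    return exp_a + exp_b - 127


def mantissa_unit(man_a, man_b):
    # Second integer unit: multiply the 24-bit significands.
    product = man_a * man_b            # up to 48 bits wide
    carry = product >> 47              # 1 if the product lies in [2, 4)
    return product >> (23 + carry), carry  # normalize and truncate


def fp32_multiply(a, b):
    sign_a, exp_a, man_a = unpack_fp32(a)
    sign_b, exp_b, man_b = unpack_fp32(b)
    mantissa, carry = mantissa_unit(man_a, man_b)
    exponent = exponent_unit(exp_a, exp_b) + carry
    return pack_fp32(sign_a ^ sign_b, exponent, mantissa)


product_a = fp32_multiply(1.5, 2.5)
product_b = fp32_multiply(3.0, 3.0)
```

The only coupling between the two halves in this sketch is the single normalization carry, which hints at why splitting the work across a pair of cells is feasible.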
Feng et al. [92] introduce a floating-point capable architecture specifically designed for radar signal processing. Despite the familiar mesh-based interconnection, the design deviates from the traditional approach since its processing elements are fairly diverse and heterogeneous. The CGRA itself was taped out in silicon and could reach up to 70 GFLOP/s floating-point performance.
A PRET-driven (Precision Timed) CGRA aimed towards predictable real-time processing was developed by Siqueira et al. [93]. Interestingly, the CGRA has support for threads, which is a concept used more in High-Performance than in Real-Time computing. The architecture is similar to what an SMT (Simultaneous Multi-Threading) architecture looks like, where each processing element has a number of duplicated resources (primarily the register files) that are unique to each thread. Aside from having deterministic timing inside the CGRA, the authors also implement predictable external memory access timing, as required for real-time systems.
The recent SPU [94] architecture aspires to provide a CGRA for general-purpose computing. The main novelty is that SPU extends existing CGRA designs with support for two types of computational patterns: what they call "stream-joins" (e.g. sparse vector multiplication inner-product) and alias-free scatter/gather (regular loops with indirection). This is achieved by extending the typical CGRA with options to conditionally consume input tokens (re-use values), reset accumulators, or conditionally discard output tokens. Address-Generation units (linear and non-linear) reside inside on-chip SRAM controllers. The SPU targets general-purpose workloads with some favor towards deep-learning applications.
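The stream-join pattern mentioned above can be illustrated with a sparse inner product over two index-sorted vectors. The Python sketch below shows the computational pattern only, not SPU's hardware mechanism; the function name sparse_dot and the coordinate-list layout (parallel index and value arrays) are our own assumptions.

```python
def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Inner product of two sparse vectors given as index-sorted streams."""
    i = j = 0
    acc = 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            acc += a_val[i] * b_val[j]  # indices match: multiply-accumulate
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1  # consume only from a; the token of b is re-used
        else:
            j += 1  # consume only from b; the token of a is re-used
    return acc


a_indices, a_values = [0, 3, 5, 9], [1, 2, 3, 4]
b_indices, b_values = [3, 4, 5, 10], [10, 20, 30, 40]
result = sparse_dot(a_indices, a_values, b_indices, b_values)
```

The data-dependent choice of which stream to advance is what a conventional statically scheduled CGRA handles poorly, and it corresponds to the conditional consumption of input tokens described in the text.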
The premise of Soft-Brain [95] is to combine both vector-level (for regular, efficient memory access) and data-flow (for parallelism) computation in CGRAs to reach high performance and power efficiency. The architecture consists of a number of stream engines (essentially the address generators of prior work) and the CGRA substrate itself. The input to the CGRA substrate is a number of vector ports, which essentially carry the vector data itself (512-bit) as fetched from memory, from the on-chip scratchpad, or fed back from the output of the CGRA. A stream controller orchestrates the execution of the system.
The Chameleon [96] CGRA is an early commercial CGRA that focuses on competing with early DSPs and FPGAs. Here the CGRA is layered, and each layer is called a slice. Each slice has three tiles, where each tile has a number of scratchpad memories that interface with eight processing elements that can be reconfigured. The idea is to load the local scratchpad with data, configure the processing elements associated with the scratchpad with some functionality, and then pipe the data through and onto other slices. The Chameleon was implemented in a 0.25 um process running at a 125 MHz clock frequency. The architecture itself operates on a 32-bit datapath width, but can be configured to divide the data stream into two 16-bit or four 8-bit streams as well.
SiLAGO [97] is a methodology for creating CGRA-based platforms. The premise behind the method is to use reconfigurable CGRA processing elements (based on DRRA [98]) as building blocks for future systems in order to reduce production cost with little impact on performance (compared to hand-made ASICs). Platforms based on SiLAGO and DRRA include, amongst others, specialized architectures for Deep-Learning [99], brain-simulation computing [100], and genome identification [101]. The Q100 [102] is similar in concept but specializes in database computing and provides tiles for computing on data-flow streams, from which users can assemble larger systems.
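Dividing a 32-bit datapath into narrower streams, as described for the Chameleon, amounts to preventing carries from crossing lane boundaries. The following Python sketch shows such a partitioned addition; the function name is hypothetical, and the wrap-around behaviour on lane overflow is our assumption, since the source does not state whether the hardware wraps or saturates.

```python
def partitioned_add(a, b, lane_bits):
    """Add two 32-bit words as independent lanes of lane_bits each."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Masking the sum drops the carry, so it cannot reach the next lane.
        result |= (lane_sum & lane_mask) << shift
    return result


word_a, word_b = 0x0000FFFF, 0x00000001
as_one_32bit = partitioned_add(word_a, word_b, 32)
as_two_16bit = partitioned_add(word_a, word_b, 16)
as_four_8bit = partitioned_add(word_a, word_b, 8)
```

The same two input words thus give three different results depending on the configured lane width, because the carry out of each lane is discarded rather than propagated.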

B. Larger CGRAs
Most CGRA systems (e.g., ADRES, TRIPS, DySER) limit the size of the array to around (or fewer than) 64 processing elements, and only a few of the CGRAs mentioned so far are larger than that (PipeRench has 256 PEs, GARP has 768), but they are relatively fine-grained. The limited size of these CGRAs is likely due to their application domains, which mostly involved image, audio, or telecommunication applications. However, in recent years, even larger CGRA-based systems targeting e.g. High-Performance Computing applications have started to emerge.
The eXtreme Processing Platform [103] (XPP) was a CGRA that focused on multiple levels of parallelism, including pipeline processing, data-flow computing, and task-level execution. XPP's interconnection was deep and complex, consisting of multiple levels of various functionality. At the lowest tier, small processing elements containing a scratchpad, an ALU, and an associated configuration manager reside in a mesh-like connectivity called a cluster. These clusters themselves are connected through switch-boxes running along their vertical and horizontal axes. Each tier has a configuration manager that is responsible for all functionality of that layer (and below), allowing fine-grained control and partitioning of the functionality of the system. XPP was token-driven, and execution of an operation occurs only when data is present at its inputs.
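The token-driven firing rule can be sketched with a small data-flow interpreter: a node executes only once a token is present on each of its inputs. This Python sketch illustrates the general rule and is not a model of XPP itself; the class TokenNode, the function run, and the use of unbounded input queues are simplifying assumptions.

```python
from collections import deque


class TokenNode:
    def __init__(self, name, fn, n_inputs):
        if n_inputs < 1:
            raise ValueError("a node needs at least one input port")
        self.name = name
        self.fn = fn
        self.inputs = [deque() for _ in range(n_inputs)]
        self.consumers = []  # (node, port) pairs fed by this node's output

    def connect(self, node, port):
        self.consumers.append((node, port))

    def can_fire(self):
        # Firing rule: every input port must hold at least one token.
        return all(len(queue) > 0 for queue in self.inputs)

    def fire(self):
        args = [queue.popleft() for queue in self.inputs]
        result = self.fn(*args)
        for node, port in self.consumers:
            node.inputs[port].append(result)
        return result


def run(nodes):
    """Fire nodes until none has a token on every input; return the trace."""
    trace = []
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.can_fire():
                trace.append((node.name, node.fire()))
                progress = True
    return trace


# (a + b) * c with a = 2, b = 3, c = 4
add = TokenNode("add", lambda x, y: x + y, 2)
mul = TokenNode("mul", lambda x, y: x * y, 2)
add.connect(mul, 0)
add.inputs[0].append(2)
add.inputs[1].append(3)
mul.inputs[1].append(4)
mul_ready_before = mul.can_fire()
trace = run([mul, add])
```

Although the multiplier is listed first, it has to wait for the adder's token, so the order of execution follows from data availability rather than from a global schedule.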
The High-Performance Reconfigurable Processor [104] (HiPReP) is an on-going CGRA research platform capable of floating-point computations. HiPReP has dedicated floating-point circuitry (unlike e.g. FloRA). Processing elements are organized in a mesh with a heterogeneous (in terms of multiple contexts). A number of scratchpad memories sit at the fringes of each tile and are used to store streamed data. Controlling the operation of the processing elements is done through an instruction pointer that is governed by hierarchical sequencers (one per tile and one global). The sequencer, effectively a programmable FSM, dictates which context is being executed, and can (re)act on signals from the tiles themselves. DRP was commercially taped out in the DRP-1 prototype by NEC.
The commercial DAPDNA-2 [120] processor produced by IPFlex contains up to 376 32-bit processing elements, organized as 8x8 PE quadrants in a mesh. The architecture is heterogeneous, with discrete tiles supporting ALU operations, scratchpads, programmable delay lines, simple address generators (counters), or I/O buffers. The processing elements contain both multiplication and arithmetic units and also support optional pre-processing of inputs through rotation/masking units. The tiles are interconnected using horizontal and vertical busses that run in-between and through the mesh, and crossing the quadrants can only be done at border tile locations. Performance of selected applications (FIR, FFTs, image processing) shows two orders of magnitude better performance over the then state-of-the-art Pentium 4 processor.
The 167-processor architecture [121] sits on the border between CGRAs and conventional multi-core processors, but we include it here since the processing elements are simple and communication between them is only performed using direct (yet dynamically configured) connectivity (and not through shared memory or cache coherence as done in multi-cores). The main focus of this work is low power, and through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [122], etc.), the authors show performance of up to 196.7 GMAs/Watt when fabricated in 65 nm technology. Other similar architectures, based on programmable cores with limited connectivity, are the IMAPCAR [123]/Imap-CE CGRA [124] from NEC, aimed towards image recognition in automobiles.
The Rhyme-Redefine [125], [126] architecture is a CGRA targeting High-Performance Computing (HPC) kernels. It follows a fairly typical CGRA design, where processing elements are connected in a torus network. The premise of the work is that there is a need to exploit multiple levels of parallelism (instruction-, loop-, and task-level parallelism), albeit the current implementation focuses primarily on instruction-level parallelism through modulo scheduling. Rhyme-Redefine supports floating-point computations.
Plasticine [127] is a recent, large CGRA that focuses on parallel patterns. At the highest abstraction layer, it is built of a mesh of units. There are two types of units: compute and memory units, both of which are programmable with patterns. Inside the compute units we find the raw functional units (the ALUs) as well as programmable state for controlling them. The compute units are built with both SISD- and SIMD-type parallelism in mind, and vector operations map natively to these units. Similarly, inside the memory units, we find a small set of ALUs coupled with programmable logic to interface the SRAM local to the units. The mesh itself interfaces external memory through a set of address generators and coalescing units. More importantly, Plasticine targets floating-point intensive applications, which is also shown in the evaluation (only three out of 13 applications are integer-only). Plasticine is programmable using Spatial [128], a custom language based on patterns for data-flow computing.
Recently, the Cerebras Wafer Scale Engine [129] has been created explicitly for high-end deep-learning training. Little information is publicly available, but the architecture seems to be a hybrid solution between general-purpose processing code and specialized tiles for tensor computations, and could make it the single largest CGRA architecture to date with a size of over 46,225 mm².

C. Linear-Arrays and Loop-Accelerators
VEAL [130] is a linear array that explicitly targets accelerating small, compute-intensive loop bodies. Similar to the aforementioned PPA (Polymorphic Pipeline Array), the authors behind VEAL performed a rigorous empirical evaluation of the benchmarks they target, and demonstrate that one of the main limitations to the performance of mapping said benchmarks to CGRA fabrics is not the number of resources, but the amount of instruction-level parallelism extracted by modulo scheduling. VEAL is a linear array fed by a number of custom address generators, which broadly correspond to the induction variables of the loops that are executed. An interesting observation is that VEAL is among the few CGRA works that use double-precision arithmetic. Another loop accelerator similar to VEAL is FPCA [131].
The BERET [132] architecture is yet another linear array, designed to accelerate hot traces found by the host general-purpose processor. One of BERET's main contributions was the identification of a small set of graphs that the processing elements should cover (called SEBs); the set was empirically extracted from the benchmarks and has since been used in other studies (e.g. SEED [133], which is similar but improved in concept).
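To illustrate why adding resources does not necessarily help a modulo-scheduled loop, the sketch below computes the textbook lower bound on the initiation interval (II) as the maximum of a resource-constrained bound and a recurrence-constrained bound. This is a generic illustration and not VEAL's actual scheduler; the function names, the loop body, and the unit counts are all hypothetical.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: each unit type must issue all of its ops within one II."""
    return max(ceil(op_counts[kind] / unit_counts[kind]) for kind in op_counts)

def rec_mii(recurrences):
    """Recurrence bound; each entry is (latency in cycles, iteration distance)."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def min_ii(op_counts, unit_counts, recurrences):
    """Lower bound on the initiation interval of a modulo schedule."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop body: 8 ALU ops, 2 multiplies, and one 6-cycle
# dependence chain that feeds the next iteration (distance 1).
loop_ops = {"alu": 8, "mul": 2}
recs = [(6, 1)]

small = min_ii(loop_ops, {"alu": 4, "mul": 1}, recs)   # modest array
large = min_ii(loop_ops, {"alu": 16, "mul": 4}, recs)  # 4x the resources
print(small, large)
```

In this example the loop-carried dependence, not the number of functional units, sets the bound, so quadrupling the resources leaves the minimum II unchanged.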

D. Deep-Learning CGRAs
Deep-Learning [134], in particular the computationally regular Convolutional Neural Networks (CNNs), has lately become a target for specialized CGRAs. Here the focus is to limit the generality and reconfigurability of traditional CGRAs to fit the computational patterns of CNNs, and instead spend the gained logic on supporting specialized operations for the intended deep-learning workloads (such as compression, multicasting, etc.). Furthermore, these architectures often favor smaller (or mixed) number representations, since deep learning is often amenable to lower-precision calculations [135].
The DT-CGRA [136], [137] architecture follows a CGRA design with fairly coarse processing elements that include up to three multiply-accumulate instructions. The processing elements also include programmable delay lines to more easily map temporally close data. Data inside the PEs is synchronized through tokens, using FIFO empty/full signals as a proxy. To support the different access patterns that modern deep-learning layers have (stride, type, etc.), the CGRA mesh is driven by a number of stream-buffer units that are programmable in a VLIW fashion and that control address generation to external memory in order to stream data.
The Sparse CNN (SCNN) [138] is a deep-learning architecture that targets (primarily) CNNs and can exploit sparseness in both activations and kernel weights. The architecture itself is composed of a mesh of RCs, where each element also includes a 4x4 multiplier array and a bank of accumulation registers. These RCs are driven (and orchestrated) by a layer sequencer, which fetches and broadcasts (compressed) weights and activations. SCNN supports inter-PE parallelism in the form of spatial blocking/tiling, where each block is artificially enlarged with a halo region, which is exchanged between adjacent tiles at the end of the computation. The authors also implement a dense version (DCNN) of the architecture in order to measure the area overhead and the power and performance gains of including sparsity in the accelerator.
Liang et al. [139] introduce a CGRA accelerator that targets reinforcement learning. The processing elements themselves are fairly static, with support for addition, multiplication, or a fusion of both. Additionally, a number of different activation functions (ReLU, sigmoid, and tanh) can be selected using the configuration register, and data can be temporarily stored in a local scratchpad. Unlike most existing CGRAs, which place address generators in discrete units outside the RCs, Liang et al.'s RCs include them inside. Global communication lines allow the user to control the reinforcement training experience of the system.
The Eyeriss deep-learning inference engines [140], [141] follow a CGRA design methodology as well, although they specialize in reconfiguring the network access patterns rather than the compute (which is mostly based on multiply-accumulate operations). The CGRA itself is a mesh with a variety of options for point-to-point and broadcast operations, highly suitable for deep-learning convolution patterns. Additionally, the platform supports compression of data and exploits sparseness of intermediate activations to increase observed bandwidth. The Eyeriss architecture, depending on the type of neural network used, can utilize nearly 100% of the CGRA resources when inferring AlexNet.
One of the most recent (and perhaps radical) changes to FPGAs comes in the form of support for deep-learning CGRAs. The Xilinx Versal [142], [143] series dedicates a large part of the silicon to a mesh-like CGRA structure (the AI engine) of programmable, neighbor-communicating processing elements. The elements themselves are fairly general-purpose, but are marketed as targeting deep-learning and telecommunication applications. To guard against the AI engine missing crucial deep-learning functionality that has yet to emerge, the AI engine can directly interface with the remaining parts of the reconfigurable (FPGA) silicon, which takes the form of the fine-grained reconfigurable cells that Xilinx is known for. The system itself is an attempt to combine the best of both the fine-grained and coarse-grained reconfigurable worlds.
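The halo-based tiling described above can be sketched in one dimension. In this simplified, dense illustration (not SCNN's actual sparse dataflow), each tile accumulates partial sums into a local buffer that is enlarged by a halo of `len(w) - 1` entries, and the halos are merged into the neighboring tiles' outputs afterwards. All function names and the choice of a 1-D convolution are assumptions made for illustration.

```python
def conv1d_full(x, w):
    """Reference: full 1-D convolution, output length len(x) + len(w) - 1."""
    out = [0] * (len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i + j] += xi * wj
    return out

def conv1d_tiled(x, w, tile):
    """Same result, computed tile by tile with an output halo per tile."""
    halo = len(w) - 1
    out = [0] * (len(x) + halo)
    for start in range(0, len(x), tile):
        chunk = x[start:start + tile]
        local = [0] * (len(chunk) + halo)  # tile outputs plus halo region
        for i, xi in enumerate(chunk):
            for j, wj in enumerate(w):
                local[i + j] += xi * wj
        # "Halo exchange": partial sums that spill past the tile boundary
        # are added to the outputs owned by the next tile.
        for k, v in enumerate(local):
            out[start + k] += v
    return out

x = list(range(10))
w = [1, 2, 3]
print(conv1d_tiled(x, w, tile=4) == conv1d_full(x, w))
```

The tiles can be processed independently of each other; only the halo entries require communication between adjacent tiles once the local computation has finished.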

E. Low-Power CGRAs
CGRAs have also been shown to be competitive in terms of power consumption, particularly when compared to existing (low-power) processors and DSP engines. The CGRAs in this domain follow the same concepts as earlier CGRA designs, but focus on both technology and architecture improvements to reduce the static and/or dynamic power of the fabric.
These CGRAs tend to focus on reducing the frequency and voltage as much as possible. Since the dynamic power consumption of a system is a function of both frequency and voltage ($P_{\mathrm{dynamic}} = C \cdot V^{2} \cdot f_{\mathrm{clk}}$), reducing the frequency, which in turn permits a lower supply voltage, can have a dramatic effect on power consumption. Several CGRAs in this area operate at near-MHz frequencies, and some even remove the clock altogether.
The Cool Mega Array [144], [145] (CMA-1 and CMA-2) architecture builds on the following two premises: (i) the clock (clock tree, flip-flops, state, etc.) is a culprit behind much of the power consumed on modern chips, and (ii) applications have adequate parallelism to freely trade silicon for performance where needed. The CMA-1 was a typical CGRA mesh architecture, but created without a single clock. The architecture focuses on stream computing, where a processor presents inputs to the CGRA that, in due time, are computed using the clock-less fabric. The architecture (and its follow-up, CMA-2) is power-efficient, and experiments on taped-out versions showed that the leakage power of the chip could be as low as 1 mW. The CMA architecture manages to reach up to 89.28 GOPS/Watt using a 24-bit data-path. The CMA architecture continues to be researched, and recent work has focused on improving performance (through variable-latency pipelines in VPCMA [146]) or further reducing power consumption through body biasing.
The SYSCORE [147] architecture is a similar architecture that focuses on low power consumption, but it leverages dynamic voltage and frequency scaling (DVFS) for power benefits and uses a fixed-point (rather than floating-point) number representation. As with CMA-1/2, it has a 24-bit data-path with a standard mesh-like arrangement of CGRA tiles.
Lopes et al. [148] evaluated a standard mesh-like CGRA for use in real-time bio-signal processing. The CGRA they constructed had the additional benefit of being able to power down sections of the CGRA when unused, to further extend battery life. Another bio-medical CGRA, introduced by Duch et al. [149], uses a mesh-like composition and a 1 MHz clock frequency to accelerate electrocardiogram (ECG) analysis kernels.
The Samsung UL-SRP [62] was designed for bio-medical applications. It is based on ADRES and features both a high-power/high-performance mode and a low-power/low-performance mode to cover different needs and use cases.
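The dynamic-power relation above can be explored numerically. The following minimal sketch evaluates $P = C \cdot V^2 \cdot f$ for a few operating points; the capacitance, voltage, and frequency values are purely illustrative assumptions and do not correspond to any surveyed chip.

```python
def dynamic_power(c_eff, v_dd, f_clk):
    """Dynamic power in watts: P = C * V^2 * f.

    c_eff: effective switched capacitance in farads (assumed value)
    v_dd:  supply voltage in volts
    f_clk: clock frequency in hertz
    """
    return c_eff * v_dd ** 2 * f_clk

# Illustrative operating points (hypothetical numbers).
base = dynamic_power(1e-9, 1.0, 100e6)           # nominal voltage and clock
half_f = dynamic_power(1e-9, 1.0, 50e6)          # halve the clock only
half_f_low_v = dynamic_power(1e-9, 0.7, 50e6)    # halve the clock, lower V_dd

print(f"frequency only:        {half_f / base:.3f}x")
print(f"frequency and voltage: {half_f_low_v / base:.3f}x")
```

Halving the frequency alone halves the dynamic power, while the quadratic voltage term is what makes combined voltage and frequency scaling (DVFS) so effective.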
The PULP [150] cluster system features a 16 RC mesh to improve performance and energy-consumption for near-sensor
Not strictly a CGRA, ZUMA [159] is an early effort to virtualize the fine-grained resources of an FPGA using a "virtual FPGA", for reasons of portability, compatibility, and FPGA-like reconfigurability inside of FPGA designs. Similar to a real FPGA, ZUMA discretizes the FPGA into logic clusters that contain a crossbar and a K-input Look-Up Table (LUT) with an optional flip-flop capturing the output. Each cluster is connected to a switch-box that can be programmed to route the data around. The area cost of using the virtual FPGA can be as low as 40% more than the bare FPGA. Earlier work includes FIRM-core [160], and more recent efforts include the vFPGA [161].
Intermediate Fabrics (IF) [154] coarsen the FPGA logic by creating a mesh of computational elements of varying sizes, such as multipliers and square-root functions; small connectivity boxes (routers) govern the traffic throughout the data-path. IF was evaluated on image processing (stencil) kernels, and showed on average a 17% drop in clock frequency against a gain of 700x in compilation time over using the FPGA alone.
The MIN Overlay architecture [162] approaches the CGRA design differently; it uses a one-dimensional strip of processing elements whose outputs are connected to each other through an all-to-all interconnect. Hence, data-flow graphs are spliced and fit onto the linear array, and different parts of the graph are scheduled in time on the array and the interconnect. Different combinations and compositions of the processing elements were evaluated, and the clock frequency was, for the most part, 100 MHz, competitive with soft processor cores at the time. Other, arguably less configurable, overlays follow a similar one-dimensional strip design, such as the VectorBlox MXP Matrix Processor [163]. The FPCA loop accelerator described earlier was also prototyped on FPGAs. The READY [164] architecture extends the linear array concept further by also having multiple threads running on the overlay.
An example of a layered CGRA overlay for FPGAs is the VDR architecture [165]. Here, computational resources are laid out in one-dimensional strips, where each strip is fully connected to downstream units. Links are unidirectional, and a synchronization protocol guides data throughout the data-path. The VDR architecture runs at a clock frequency of 172 MHz, and was shown to be between 3 and 8 times faster than the NiosII processor [166] (a well-known soft-core used in FPGA design). Another architecture similar to VDR is the RALU [167].
A flurry of innovative overlays was introduced from 2012 onwards, all centered around the Digital Signal Processing (DSP) blocks of modern FPGAs. The DSP blocks were originally included to allow the use of expensive operations that do not necessarily map well to FPGAs (e.g. multipliers). DSP blocks have since continuously evolved to include more diverse (various-size multiplication, accumulation, etc.) or more complex (e.g. single-precision arithmetic [153]) functionality. Some vendors (Xilinx) directly exposed the interface that controls the different functionality of the DSP blocks to the FPGA fabric, and it was not long before the idea emerged to base CGRA architectures around said DSP blocks. ReMorph [168] was one of the early architectures to adopt this style of reasoning. Several different architectures have been explored around the concept of DSPs, including various topologies (e.g. trees [169]) or adaptations of existing architectures (e.g. DySer using DSP blocks [170]). The strength of these architectures lies in their near-native performance, where small overlays built around DSPs can run at 390 MHz (or higher).
Quickdough [171], [172] is a design framework for using CGRA overlays on FPGAs, specifically targeting them to assist the CPU in accelerating compute-intensive program code. The overlay itself follows the standard layout with a mesh of processing elements, each containing a small instruction memory that sequences the ALU within the processing element. The mesh can interface with external memory by enqueuing requests to an address unit. Unique to the architecture is that its two parts (the address unit and the PE mesh) run at two distinctly different frequencies.
Most FPGA overlays presented so far focus exclusively on integer computation, likely because most FPGA overlay work targets Xilinx devices, whose DSP units do not contain hardened floating-point operations. The Mesh-of-ALU [173] is an exception that targets both integer and floating-point computation. The architecture is similar to other mesh-based approaches, but the work demonstrates the high (at the time) performance capabilities of FPGAs in floating-point operations as well, reaching nearly 20 GFLOP/s on a Stratix IV [174] device. Using floating-point processing elements seems to incur a 33% area overhead, yielding a smaller CGRA mesh, and also an (arguably negligible) 13% reduction in clock frequency.
A different overlay architecture that targets floating-point operations is the TILT array [175], [176]. The TILT array architecture is conceptually very similar to the MIN overlay. A linear array of processing elements is arranged to communicate through an all-to-all crossbar, which saves state into an on-chip RAM and relays information to the computation in the next cycle. The authors illustrated the benefits of TILT over High-Level Synthesis (OpenCL), with both comparable performance and improved productivity, reaching an operating frequency of up to 387 MHz on a Stratix V [177].
The URUK [178] architecture takes a different approach to how the ALUs inside the overlay should be implemented. Rather than having a fixed function, URUK leverages partial reconfiguration [179], changing the RCs' functionality over time.
Finally, tools for automatically creating CGRA overlays for FPGAs are emerging, such as the Rapid Overlay Builder [180] and CGRA-ME [60] that simplify generation and (in the case of CGRA-ME also) compilation of applications and overlays.
An interesting observation is that out of the 14 unique FPGA architectures that we surveyed here, 9 chose Xilinx as the target platform while 5 focused on Intel (then Altera) FPGAs. There seems to be a preference for Xilinx architectures, which we believe is due to the more dynamic control that Xilinx offers in their DSP blocks compared to Intel. On the other hand, Intel DSPs have (from Arria 10 onwards) hardened support for IEEE-754 single-precision floating-point operations, encouraging floating-point heavy architectures to use those systems.

V. CGRA TRENDS AND CHARACTERISTICS
A. Method and Materials
For all previously surveyed and summarized work, we collected several metrics associated with each study. These were:
1) Year of publication,
2) Size of the CGRA array in terms of unique RCs,
3) Data-path width of the CGRA (e.g. MATRIX operates on 4-bit while RaPiD operates on 16-bit),
4) Clock frequency of operation ($f_{max}$) in MHz as reported in the study,
5) Power consumption in Watt. For studies that empirically measured this metric, we collected the (benchmark, power) tuple. Otherwise we used what is reported in the study (often the post place-and-route power estimation),
6) Technology (in nm) of the architecture when taped out in silicon, or of the standard cell library used with the EDA tools,
7) Area (mm²) of the fully synthesized chip as reported in the study. In some cases we had to manually calculate it based on the individual RC size or based on the gates used (after verification with the authors). For FPGAs we used the chip (BGA) package size and assumed a chip-to-die ratio of 7:1, as has been reported in [181],
8) Peak performance, including peak operations per second (OP/s), peak multiply-accumulates per second (MAC/s), and peak floating-point operations per second (FLOPS) as reported in the paper. We differentiate between integer MAC/s and OP/s because some architectures (e.g. EGRA) do not balance them, leading to a large theoretical OP/s but not a proportionally large MAC/s,
9) Obtained performance out of the theoretical peak (%). We used what the authors reported. For those cases where the authors did not report obtained performance (e.g. only reported absolute time), we calculated this metric manually where applicable, for example when the authors report both the input dimension and the execution time (in seconds or cycles) of known applications such as (non-Strassen) matrix multiplication, FIR filters, matrix-vector multiplication, etc.
For items 8 and 9 we ignored studies that only showed relative performance improvements, as it is hard to reason about the performance of a baseline unless it is explicitly stated. All metrics included have either been directly reported in the seminal publication, been verified by the authors, or been derived by us where we were confident in our understanding of the architecture. We position and relate our obtained CGRA characteristics against those of modern GPUs. We used NVIDIA GPUs as references, with data collected from [182] and integer performance calculated using methods described in [183].
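As an illustration of how the obtained fraction of peak can be derived when a study reports only the input dimensions and the cycle count, the sketch below works through a (non-Strassen) matrix multiplication. The helper names, the convention of counting a multiply-accumulate as two operations, and all numbers are assumptions for illustration only and are not taken from any surveyed paper.

```python
def matmul_ops(m, n, k):
    """Operations in a classic (M x K) by (K x N) matrix multiplication.

    Each of the m * n outputs needs k multiplies and k adds; counting
    both gives 2 * m * n * k (an assumed convention).
    """
    return 2 * m * n * k

def fraction_of_peak(ops, cycles, peak_ops_per_cycle):
    """Achieved operations per cycle divided by the theoretical peak."""
    return (ops / cycles) / peak_ops_per_cycle

# Hypothetical study: a 64-RC array, one MAC (2 ops) per RC per cycle,
# multiplying two 64x64 matrices in 8192 cycles.
ops = matmul_ops(64, 64, 64)
frac = fraction_of_peak(ops, cycles=8192, peak_ops_per_cycle=64 * 2)
print(f"{ops} ops, {100 * frac:.1f}% of peak")
```

When a study reports execution time in seconds instead of cycles, the cycle count can be recovered by multiplying the time by the reported clock frequency before applying the same calculation.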

VI. OVERALL ARCHITECTURAL TRENDS
We start by analyzing data that is associated with time, and how the CGRAs have grown as a function of time. Figure 5 overviews how CGRAs have changed through time with respect to various metrics. The total number of RCs, as a function of the respective publication year, is shown in Figure 5:a. We see that a majority of CGRAs are quite small (median: 64 RCs), and FPGA-based CGRAs are even smaller (median: 25 RCs). This is in line with the reasoning that most CGRAs focus on small kernels in the embedded application domain, favoring ILP over other forms of parallelism (e.g. thread- or task-level). There are several exceptions to this, such as GARP, which was an early CGRA that used 1/2-bit reconfigurable data-paths and thus needed a large number of RCs to implement various functionality. Another exception is TARTAN, where the authors' largest evaluated version has up to 25,600 RCs, making it likely the largest CGRA ever simulated; this awe-inspiring size was reached by severely restricting the functionality of the RCs (e.g. there is no multiplication support). Thirdly, the Plasticine architecture can have up to 6208 RCs of varying sorts. Figure 5:b shows transistor scaling of CGRAs and NVIDIA GPUs. As expected, the transistor dimensions have continuously grown smaller, as predicted by Moore. Note, however, that both FPGAs and GPUs are (on average) one transistor generation ahead of CGRAs, likely because most CGRAs are developed by academia and thus restricted to the standard cell libraries available at the time (which usually are not the most recent). Figure 5:c shows the area of the CGRAs as reported either by the ASIC synthesis tools, by the authors' estimation, or from the final taped-out chip. We also include the full FPGA die sizes (that FPGA-based CGRAs have access to), and we position these against the die sizes of modern NVIDIA GPUs.
We can see that the trend of CGRA research is, as with the number of RCs, to favor smaller CGRAs, and the average size of the CGRAs is 13.117 mm². Compared to GPUs, which have monotonically increased in size over time, CGRAs have almost done the inverse and decreased in size. There are two major exceptions: the first is the Imagine architecture, which reported an amazing size of 1000 mm² (later 144 mm² in the follow-up paper 6 years later), larger than any CGRA or GPU reported to this day. The other large architecture is the CUDA-programmable SGMF at 800 mm². Figure 5:d shows how the reported power consumption of ASIC-based CGRAs has changed over time, compared to the Thermal Design Power (TDP) reported for NVIDIA GPUs. The CGRAs are experiencing on average an exponential decrease in power consumption, which is likely due to smaller standard cell libraries coupled with small CGRA size (Figure 5:a,c,d). On the other hand, NVIDIA GPUs consume more and more power as time goes on (albeit even that is drawing to a halt due to Moore's law). The highest and most power-consumption CGRA, out of those reporting, is the Plasticine architecture consuming a by limitation in the fabric itself; however, it is interesting to see that the operating frequency of FPGA-based CGRAs rivals that of most ASIC CGRAs. Figure 5:f shows the data-path widths that CGRA research tends to adopt. Most architectures adopt either a 16-bit (28%) or 32-bit (56%) data-path width; those targeting a 16-bit data-path are usually tailored towards a specific application, such as telecommunication or deep learning, while those that target 32-bit (or beyond) are more general-purpose. A few (13.3%) target 8-bit architectures, but often have support for chaining 8-bit operations for 16-bit use. Matrix and GARP target very fine-grained reconfigurability, with 4-bit and 1-/2-bit data-paths respectively.
Despite this, we can expect future architectures to include more support for low- or hybrid-precision, since it is a reliable way of obtaining more performance while mitigating memory-boundness for applications that permit it. Figure 5:g shows the power consumption of CGRAs and GPUs as a function of their respective die sizes. This graph complements the graph in Figure 5:d to show that the low power consumption of CGRAs is mainly because they are small, with (out of those CGRAs that report both power and area) only Plasticine coming close to the trend of GPUs.
Finally, Figure 5:h shows how the individual RC area has changed over time, and we see that the size of RCs has followed technology scaling and continuously decreased. However, when normalizing the CGRAs' manufacturing technology to 16 nm, we noticed a different trend, where the area of individual RCs is increasing due to the incorporation of more complex elements (such as wider data-paths, more complex arithmetic units, etc.). Figure 6 overviews data associated with the pure performance of the CGRAs, often positioned against that of NVIDIA GPUs.
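The effect of normalizing RC area to a common technology node can be illustrated with a short sketch. It assumes ideal quadratic area scaling with feature size, which is a common first-order approximation; the exact normalization procedure used for Figure 5:h is not spelled out in the text, and the two data points below are hypothetical.

```python
def normalize_area(area_mm2, node_nm, target_nm=16.0):
    """Scale an area to a target node, assuming area ~ (feature size)^2."""
    return area_mm2 * (target_nm / node_nm) ** 2

# Two hypothetical RCs: an older, simple one and a newer, richer one.
old_rc = normalize_area(0.40, node_nm=130.0)  # 0.40 mm^2 at 130 nm
new_rc = normalize_area(0.02, node_nm=28.0)   # 0.02 mm^2 at 28 nm

print(f"old RC at 16 nm: {old_rc:.5f} mm^2")
print(f"new RC at 16 nm: {new_rc:.5f} mm^2")
```

Although the newer RC is 20x smaller in raw area, it is slightly larger once both are expressed at the same node, which is the kind of trend reversal described above.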

VII. INTEGER AND FLOATING-POINT PERFORMANCE ANALYSIS
Figure 6a-f shows the obtained integer performance. Here we distinguish between generic operations and MAC-based operations in order to reveal architectures that are starved of multipliers. For example, the TARTAN CGRA can execute a large number of operations per unit time, but has no support for multiplication, leading to a comparatively very low multiply-add performance. Figure 6a-b shows the GARP and Matrix architectures as the sole candidates for low-precision arithmetic, and that while both of these have comparably high performance (for their time), their multiplication (MAC) performance is lacking (in GARP, the overhead was 32x compared to an addition). Figure 6:c shows 8-bit integer performance, which has recently been of interest to the deep-learning inference community, and where the next-generation Xilinx Versal architecture will be capable of reaching thousands of GOP/s of 8-bit integer performance. Figure 6:d shows 16-bit integer performance, showing continuous growth over the years. Note how the TARTAN architecture claims to reach performance levels similar to those of the upcoming Xilinx Versal CGRA, despite being more than a decade old. Figure 6:e is a special case that covers only a few CGRAs (e.g. Cool Mega Array-1 and SYSCORE, both of which use a 24-bit data-path); despite their low visible performance, these devices are actually very power-efficient (see next section for discussion). Finally, 32-bit integer performance is shown in Figure 6:f, where we also include NVIDIA GPU integer performance for comparison. We see that CGRA performance has historically been comparable to that of NVIDIA GPUs, and even FPGAs are becoming a valid way of obtaining integer performance through CGRAs. Figure 6:g shows the peak floating-point performance that CGRAs have reported over the years. The small number of floating-point capable CGRAs prohibits us from drawing any reasonable trend line, unlike the one for GPUs, which grows exponentially with the years (together with the die area, see Figure 5:c).
However, those CGRAs that do include floating-point units can compete with the performance of modern GPUs, and sometimes even outperform them. For example, the Plasticine architecture is capable of delivering 24.6 TFLOP/s of performance, rivaling GPUs from that generation, and the earlier Redefine and SGMF architectures could deliver 500 and 840 GFLOP/s respectively. Even earlier, the Wavescalar architecture was capable of 500 GFLOP/s, which was well ahead of GPUs at that time. At lower performance, we find architectures such as FLoRa (600 MFLOP/s) and the loop accelerator VEAL (5.4 GFLOP/s). Figure 6:h shows the distribution of the number of CGRAs that support floating-point versus those that support integer computations. Floating-point support is clearly underrepresented, with more than four times as many CGRAs supporting integer computations. Figure 7:a shows the number of instructions per cycle (IPC) that applications experienced when executing on different CGRAs. We see that a majority of CGRAs operate in a fairly low performance domain, primarily due to their size, and most execute around 12 IPC (median). There are corner cases, such as the Rhyme-Redefine architecture, which aims to explore CGRAs in High-Performance Computing and reaches 300+ IPC on selected workloads, or the Deep-Learning SDT-CGRA architecture, which reaches 172 IPC on inference. Similarly, Eyeriss is capable of occupying 100% of its resources when inferring AlexNet, yielding an astounding 700+ IPC. Most FPGA-based CGRAs also execute less than 100 IPC; this is primarily because most FPGA CGRAs are rather small (see Figure 5:a). Figure 7:b shows the performance applications experience when running on different CGRA architectures as a function of topology size, where we group CGRA architectures into three groups: small (<16), medium (16 to 64), and large (>64).
As expected, we see that the performance and obtained IPC grow as the architectures become larger: applications experience on average 12.60 IPC on small-sized CGRAs, 27.69 IPC on medium-sized CGRAs, and 79 IPC on large CGRAs, with outliers being capable of reaching much more. A complementary graph is seen in Figure 7:c, where we see the obtained performance as a fraction of the raw peak performance. reach 0.374 PFLOP/s of performance. From a 32-bit integer performance perspective, most architectures would outperform the NVIDIA Volta-100. The highest performance on the graph is actually a 4x4 CGRA provided as an example by CGRA-ME [60] that, albeit unusable by itself (only compute and no orchestration is available), would reach a compute capacity of nearly 1 Peta-OP/s. While this extrapolation is indeed limited, it aims to show that CGRAs have the architectural capability of competing with modern GPU designs, assuming we can fully utilize these (potentially over-provisioned) computing resources.
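The extrapolation procedure itself is not fully recoverable from the text here, so the following is only one plausible first-order sketch: it assumes that a CGRA tile can be replicated to fill a larger die and that peak performance scales linearly with area. It ignores memory, interconnect, orchestration, and power limits, and the function name and all numbers are hypothetical.

```python
def extrapolate_peak(peak_ops_per_s, area_mm2, target_area_mm2):
    """Scale peak performance linearly with die area (assumed model).

    Both areas are assumed to be expressed at the same technology node.
    """
    return peak_ops_per_s * (target_area_mm2 / area_mm2)

# Hypothetical CGRA: 100 GOP/s peak in 2 mm^2, replicated to fill a
# hypothetical 800 mm^2 die (roughly the scale of a large GPU).
scaled = extrapolate_peak(100e9, area_mm2=2.0, target_area_mm2=800.0)
print(f"extrapolated peak: {scaled / 1e12:.1f} TOP/s")
```

Such a number is an upper bound on raw compute capacity, not a prediction of achievable performance, which is why the utilization question raised in the text matters.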

IX. DISCUSSION AND CONCLUSION
As we saw in the previous section, a vast majority of CGRAs are fairly small and run at a (comparably) low frequency. This, in turn, leads to very power-efficient designs, allowing CGRAs to be placed into embedded devices such as mobile phones or wearables and operate for hours. This power efficiency, with respect to the performance they provide, allows CGRAs to compete with (and possibly perform better than) GPUs, which in turn could lead to a partial remedy for dark silicon. From the analysis that we did in this study, CGRAs should be considered a serious competitor to GPUs, particularly in a future post-Moore era when power efficiency becomes more important.
However, in order to reap the better compute densities and better power efficiency that CGRAs offer, larger architectures must be more thoroughly researched. Larger CGRA architectures, in particular those aimed at aiding or accelerating general-purpose computing, will be challenging to keep occupied. As we saw in the final graph, a CGRA scaled to the level of a V100 will potentially have a peak performance of hundreds of teraflops, but the main question is: will we be able to map and fully utilize all those computing resources for anything but the most trivial kernels?
Several authors have already pointed out that in order to harvest larger CGRAs, we need to complement current ways of extracting instruction-level parallelism (primarily modulo scheduling) with other forms of concurrency. While modern CGRAs (e.g. Plasticine, SIMD-RA) do exploit SIMD-level parallelism, further research is without doubt needed, and support for programming models such as CUDA (SGMF moves in this direction), multi-threading, or even multi-tasking (e.g. OpenMP [186]) should be more aggressively pursued from both an architectural and a programmability viewpoint in order to leverage future, large-sized CGRAs. For example, the recently added dependency features in frameworks such as OpenMP and OmpSs [187] match very well with clustered CGRAs that have islands of both compute and scratchpad, where the dependencies would dictate how data flows on these CGRAs (exploiting both inter- and intra-task level parallelism and data locality).
Another limitation of existing architectures is the application domain that they accelerate. A large majority of CGRAs target embedded applications such as filters, stencils, decoders, etc. Studies that integrate the CGRA into the backend of a processor (e.g. TRIPS, DySER) tend to have a more diverse set of benchmarks available, and those studies (e.g. TARTAN, SEED, SGMF) that rely only on simulation (without hardware being developed) have the richest set of application support. Despite this, CGRAs suffer from a problem similar to the one that current FPGAs struggle with: we limit ourselves to small, simple kernels, rather than studying the impact of these architectures on more complex applications. To give a concrete example, there is to this day no reconfigurable architecture that has seriously considered any of the proxy applications that drive HPC system procurement, such as the RIKEN Fiber [188] or ECP benchmark suites [189]. For FPGAs and High-Level Synthesis, this might make sense, since there is always the danger that these large kernels might not fit onto a single FPGA; CGRAs, however, can store multiple contexts and kernels with little overhead in switching between them, opening up possibilities for whole-application execution as well as opportunities to exploit inter-kernel temporal and spatial data locality.
A different challenge with the present (and similar future) surveys lies in the amount of reporting that the different studies do. For example, studies that apply a simulation methodology often have wider benchmark coverage, but fail to report hardware details (e.g. area or RTL information). At the same time, many CGRAs that were actually implemented in hardware (or RTL) do report area and power consumption, but limit the benchmark selection and information. This, in turn, leads to gaps in several graphs, where a clear high-performance candidate is represented in one graph but, due to limited information, is absent from another. This is most clearly seen in graphs that show a derived metric, such as performance per power (OPs/W). Similarly, many papers prefer to report relative performance improvements rather than absolute numbers, leading to difficulty in reasoning about performance across a wide range of CGRAs. We include this in the discussion section mainly as an observation for future studies that may attempt to perform a similar survey.
Overall, this survey has shown that there is plenty of room for CGRA research to grow and to continue to be an active research subject for use in future architectures, particularly in striving to design high-performance CGRAs that aim at niche or general-purpose computation at scale. As transistor dimensions stop shrinking and Moore's law no longer allows us the architectural freedom of carelessly spending silicon, it is here that reconfigurable architectures such as CGRAs might excel at providing performance in a post-Moore era.