Fanout-Bounded Logic Synthesis for Emerging Technologies

In logic circuits, the number of fanouts a gate can drive is limited, and such limits are tighter in emerging technologies such as superconducting electronic circuits. Moreover, some such technologies, e.g., adiabatic quantum-flux-parametron (AQFP), pose additional constraints such as the need for balanced input-to-output paths to ensure proper signal propagation. In this work, targeting emerging technologies, we study the problem of resynthesizing a logic network with bounded-fanout gates while minimizing area for a given depth. Namely, we 1) formulate the fanout-bounded synthesis (FBS) problem for a fixed target logic depth as an integer linear program (ILP); 2) propose a scalable top-down approach to construct a feasible solution to the ILP; and 3) extend both the exact and the heuristic approaches to the setting of path-balanced networks. Using our ILP, we obtain the global optimum solutions for a number of benchmarks that serve as ground truth for evaluating heuristic algorithms in both general and path-balanced FBS. Our heuristic algorithm for general FBS achieves an 11.82% better area than the state of the art with matching or better delays while attaining the optimum/near-optimum area for several considered benchmarks. For the path-balanced setting, our heuristic approach achieves 8.76% better delay on average with an average area improvement of 0.5% when using AQFP as the exemplar technology, while achieving more than 17% better delays on several benchmarks.


Dewmini Sudara Marakkalage, Graduate Student Member, IEEE, and Giovanni De Micheli, Life Fellow, IEEE

I. INTRODUCTION
In digital electronics, the ability to have multiple fanouts per gate allows for compact implementations of complex logic functions. However, increasing the number of fanouts of a gate can negatively impact delay performance, and the maximum number of fanouts a gate can support is typically limited. Therefore, it is important to develop synthesis algorithms that effectively utilize fanouts.
In conventional CMOS technology, fanout optimization has been well studied, both as a means of improving the critical path delay [1], [2], [3], [4], [5], and as a method of optimizing special high-fanout nets, such as clock and reset signals [6]. However, the techniques developed for CMOS technology are not generally transferable to emerging technologies such as superconducting electronics (e.g., adiabatic quantum-flux-parametron (AQFP) [7], RQL [8], and RSFQ [9]), field-coupled nanocomputing technologies (e.g., QCA [10]), and spintronics [11], which generally have tight, explicit fanout bounds and/or significantly different timing models (e.g., clocked gates). Thus, the allowed circuit transformations in such technologies can be fundamentally different. For instance, in CMOS technology, the delay increase caused by a high number of fanouts can be mitigated by techniques such as transistor sizing. However, this option is not available for post-CMOS technologies. Instead, when designing for emerging technologies that have globally imposed, hard fanout limits, fanout-bounding is achieved through a combination of gate duplications and buffer insertions. This procedure tends to consume a significant portion of resources as compared to CMOS, so it is typically considered relatively early in the synthesis process, e.g., in the logic synthesis stage.
The authors are with the Integrated Systems Laboratory, Swiss Federal Institute of Technology Lausanne, 1015 Lausanne, Switzerland (e-mail: dewmini.marakkalage@epfl.ch).
Digital Object Identifier 10.1109/TCAD.2023.3339440
Motivated by the aforementioned differences from CMOS, we first consider the following general fanout-bounded synthesis (FBS) problem in the unit-delay model: given an input logic network and the fanout bounds and area costs of different gate types/buffers, resynthesize the logic network by means of gate duplications and buffer insertions such that each gate meets its respective fanout bound while the total area is minimized. Note that the unit-delay model encompasses many emerging technologies that have clocked gates (e.g., AQFP and QCA). Zhang and Jiang [12] recently studied this general FBS problem (in the same unit-delay model) and presented an algorithm composed of several heuristics, where the main idea was to duplicate gates if doing so locally reduces the number of buffers (see Section III for more details).
In this work, we revisit the FBS problem by taking a rigorous approach: namely, we present the first known integer linear programming (ILP) formulation of this problem for a fixed target delay and use it to obtain optimum area FBS solutions for a number of EPFL [14] benchmarks and benchmarks of [15]. Our ILP uses the number of copies and buffers associated with different gates and levels as variables and has constraints to ensure that there are sufficiently many gate copies and buffers to support all fanouts subject to fanout bounds. As we see in Section IV, this formulation is versatile and can be extended, for example, to facilitate different types of gates and buffers as well as different fanout constraints for primary inputs.
We then present a scalable top-down synthesis algorithm for the general FBS problem based on a heuristic different from that of [12], where we give preference to adding buffers over duplicating gates. Specifically, the main idea of the new approach is to duplicate gates only if the critical path delay would be increased otherwise. As we explain in Section III, our heuristic exploits several improvement opportunities we identified in the algorithm of [12]. We also present an additional optimization step on top of the proposed top-down approach, which can be used as a high-effort optimization step to obtain even better results. Our basic top-down heuristic achieves a 10.9% better area as compared to the state of the art [12], while the top-down approach with the additional optimization step allows an 11.82% improvement on average. Notably, the critical path delays of the resulting output networks of our top-down approaches are less than or equal to those obtained by the state of the art because they retain the same logic depth as the original fanout-unbounded network.
Next, we consider the FBS with the additional requirement of path-balancing, which is a crucial constraint of several emerging technologies, such as AQFP and QCA. In these technologies, gates can only drive at most one fanout, and special branching cells called splitters are required to support multiple fanouts. By considering the splitters as buffers with a fanout capacity of at least two, synthesizing for such technologies can be considered a special case of FBS. However, what makes synthesizing for these technologies more challenging is the constraints on the arrival times of fanins of a gate. For example, in AQFP technology, the signal propagation between gates is facilitated by a multiphase clocking scheme, which requires all fanins of a gate to be clocked in the same phase (see Section II for details). One way to ensure this same-phase-fanins constraint is to require that all fanins of a gate be at the same logic level by adding extra buffers as necessary, which is referred to as path-balancing in the literature.
For FBS in the path-balanced case, we first adapt the aforementioned general-case ILP to account for the path-balancing constraints, considering both scenarios where gate duplications are enabled and disabled. (The latter setting has been studied as the AQFP splitter/buffer insertion problem in a series of research work [15], [16], [17], [18] as we describe in Section III.) Then, as with the general FBS problem, we present a scalable heuristic algorithm for path-balanced FBS focusing on AQFP technology. This algorithm starts with a top-down approach resembling our heuristic for the general setting (with some differences to avoid excessively duplicating gates) to determine initial gate/buffer counts, and then follows on with additional optimizations to mitigate the overhead of path-balancing buffers. Remarkably, as compared to the optimum delays in the setting without gate duplications, our heuristic with gate duplications achieves 8.76% better delays on average together with a 0.5% average area improvement.
In the remainder of this article, we first summarize some concepts useful to better understand our work, including the logic network structures we use, timing and node equivalence concepts, and a brief introduction to AQFP technology (Section II). Then, we discuss some prior work on general FBS as well as splitter-buffer insertion for AQFP technology (Section III). Next, in Section IV, we describe our ILP formulation for the general FBS, and in Section V, we present our scalable top-down algorithm and related further optimizations. Following that, in Section VI, we extend our approaches from Sections IV and V to facilitate the path-balancing constraints. Finally, in Section VII, we present our experimental results, and in Section VIII, we conclude with a brief discussion on the results and possible future directions.

II. BACKGROUND
In this section, we first introduce two representations of logic that we use in our algorithms, namely, and-inverter graphs (AIGs) and majority-inverter graphs (MIGs). Next, we describe the notion of static timing analysis for the unit-delay model and the concept of node equivalence. Finally, we briefly introduce the AQFP technology on which we demonstrate our FBS approaches for the path-balanced setting.

A. And-Inverter Graphs/Majority-Inverter Graphs
The AIG is a directed acyclic graph (DAG) representation of logic where nodes represent either primary inputs or 2-input AND gates which, respectively, have in-degree zero or two. AIGs have two possible types of directed edges, representing noninverted or inverted fanins. The AIG is a universal logic representation, meaning that an AIG can represent an arbitrary logic function, and is supported by numerous logic synthesis tools and libraries, such as ABC [19] and mockturtle [20], owing to its simplicity and wider compatibility with many logic synthesis algorithms. At the same time, structural hashing can be easily implemented in AIGs, enabling efficient collapsing of logically equivalent nodes. We use AIG as the preferred logic representation in Sections IV and V.
The MIG is defined similarly to the AIG; the only differences are that the internal nodes represent 3-input majority gates and have in-degree three. The 3-input majority gate outputs 1 if and only if at least two of the inputs are 1. When one input is tied to constant 0 or 1, the majority gate acts as a 2-input AND gate or a 2-input OR gate, respectively. Thus, MIG is also a universal logic representation.
The use of majority gates in logic synthesis has been studied extensively in the past [21], [22], [23]. Recently, Amarù et al. [24], [25] proposed MIG as a new paradigm for logic synthesis. Due to the majority gate being the natural gate in several superconducting technologies (as we see in Section II-D), it is preferable to consider a majority-gate-based logic representation when synthesizing for such technologies. Consequently, we use MIG as the preferred logic representation in Section VI.
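The claim that a majority gate with one input tied to a constant behaves as a 2-input AND or OR can be checked exhaustively. The following is a minimal sketch; the function name `maj3` is only an illustrative choice.

```python
from itertools import product

def maj3(a: int, b: int, c: int) -> int:
    """3-input majority: 1 if and only if at least two inputs are 1."""
    return 1 if a + b + c >= 2 else 0

# Tie the third input to a constant and compare against AND / OR
# over all four assignments of the two remaining inputs.
and_ok = all(maj3(a, b, 0) == (a & b) for a, b in product((0, 1), repeat=2))
or_ok = all(maj3(a, b, 1) == (a | b) for a, b in product((0, 1), repeat=2))
print(and_ok, or_ok)
```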

B. Static Timing Analysis
In this work, we use the unit-delay model, which assumes that a signal incurs a unit delay when it passes through a gate. The arrival time of a node $n$, denoted by $t^{\mathrm{arr}}_n$, is defined as follows: if $n$ is a primary input, $t^{\mathrm{arr}}_n = 0$. Otherwise, $t^{\mathrm{arr}}_n = 1 + \max_{m \in \mathrm{FI}(n)} t^{\mathrm{arr}}_m$, where $\mathrm{FI}(n)$ denotes the set of fanin nodes of $n$. Note that the arrival time of a node is equal to the maximum length of a path from the node to any primary input. Hence, we sometimes use the term level to refer to the arrival time. The overall circuit delay (depth of the circuit) is defined as the maximum arrival time of any primary output.
For a given target delay $D$, the required time $t^{\mathrm{req}}_n$ of a node $n$ is defined as follows: if $n$ has no fanout nodes which are internal to the logic network (i.e., all fanouts are primary outputs), $t^{\mathrm{req}}_n = D$. Otherwise, $t^{\mathrm{req}}_n = \min_{m \in \mathrm{FO}(n)} t^{\mathrm{req}}_m - 1$, where $\mathrm{FO}(n)$ denotes the set of fanout nodes of $n$.
A critical path in a network is an input-to-output path of nodes where each node $n$ on the path satisfies $t^{\mathrm{req}}_n = t^{\mathrm{arr}}_n$. We say a node is critical if it lies on at least one critical path.
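To make the unit-delay timing quantities concrete, the sketch below computes arrival and required times on a small hypothetical network. It assumes the usual backward recurrence for required times: a node without internal fanouts is required at the target delay, and any other node is required one unit before its earliest-required fanout. The dictionary encoding and all names are illustrative.

```python
def arrival_times(fanins):
    """fanins maps each node to its list of fanin nodes ([] for primary inputs)."""
    arr = {}

    def visit(n):
        if n not in arr:
            arr[n] = 0 if not fanins[n] else 1 + max(visit(m) for m in fanins[n])
        return arr[n]

    for n in fanins:
        visit(n)
    return arr

def required_times(fanins, arr, target_delay):
    fanouts = {n: [] for n in fanins}
    for n, fis in fanins.items():
        for m in fis:
            fanouts[m].append(n)
    req = {}
    # A fanout always has a strictly larger arrival time than its fanin, so
    # visiting nodes by decreasing arrival time handles fanouts first.
    for n in sorted(fanins, key=lambda x: -arr[x]):
        if not fanouts[n]:
            req[n] = target_delay
        else:
            req[n] = min(req[m] for m in fanouts[n]) - 1
    return req

# Hypothetical network: g1 = f(a, b), g2 = f(g1, c), g3 = f(b, c); outputs g2, g3.
fanins = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["b", "c"]}
arr = arrival_times(fanins)
req = required_times(fanins, arr, target_delay=2)
critical = sorted(n for n in fanins if arr[n] == req[n])
print(arr, req, critical)
```

In this example the nodes with equal arrival and required times are exactly those on the path through g1 and g2.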

C. Node Equivalence
In general, we say two nodes m and n in a logic network are equivalent if their outputs are equal under all possible value combinations of primary inputs. If the input graph contains two or more equivalent nodes, their fanouts can be redistributed among themselves at the discretion of a synthesis algorithm without altering the overall output of the circuit. However, for a network with many primary inputs, the computation needed to identify all sets of equivalent nodes can be prohibitively expensive. Thus, a more practical approach is to find equivalent nodes by considering a node's function with respect to a small cut, i.e., a set of nodes that separates the considered node from primary inputs. An example of this type of weaker equivalence checking is structural hashing, which was originally used in IBM CAD tools [26]; for AIGs, a widely used structural hashing technique is to identify each gate with a signature consisting of the gate's fanins and flags denoting which fanins are inverted.
In this work, we do not explicitly check for equivalent nodes; instead, we allow the AIG data structure to internally use structural hashing to collapse any equivalent nodes. For the output logic network, our algorithms may explicitly duplicate some gates, hence we disable structural hashing for the output.
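The signature-based structural hashing described above can be sketched as follows. This is a toy illustration, not the data structure of ABC or mockturtle; the class `StrashAig` and its methods are hypothetical names.

```python
class StrashAig:
    """Toy AIG that reuses an AND node when its signature already exists."""

    def __init__(self):
        self.table = {}      # signature -> node id
        self.num_nodes = 0   # primary inputs and AND gates share one id space

    def create_pi(self) -> int:
        self.num_nodes += 1
        return self.num_nodes - 1

    def create_and(self, a: int, a_inv: bool, b: int, b_inv: bool) -> int:
        # Signature: the two (fanin, inverted?) pairs in a canonical order,
        # so that AND(a, b) and AND(b, a) map to the same node.
        key = tuple(sorted([(a, a_inv), (b, b_inv)]))
        if key not in self.table:
            self.table[key] = self.num_nodes
            self.num_nodes += 1
        return self.table[key]

aig = StrashAig()
x, y = aig.create_pi(), aig.create_pi()
g1 = aig.create_and(x, False, y, True)
g2 = aig.create_and(y, True, x, False)   # same signature as g1, so it is reused
g3 = aig.create_and(x, False, y, False)  # different inversion flags, new node
print(g1 == g2, g1 != g3, aig.num_nodes)
```

Disabling the table lookup in `create_and` would correspond to the situation where duplicated gates must be kept as distinct nodes.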

D. AQFP Logic Circuits
AQFP is a superconducting electronics technology with very low power consumption due to adiabatic operations. In AQFP, logic gates are constructed using superconductive inductors and Josephson junctions (JJs), which are based on the Josephson effect [27]. The number of JJs in an AQFP circuit is commonly used as a proxy for the area cost.
For AQFP, Takeuchi et al. [28] proposed a simple cell library based on four primitive cells (buffer, inverter, constant, and branch), where a gate is created using an array of primitive cells together with a branch, while a splitter is constructed using a buffer and a branch. The majority-3 gate consists of three buffer cells together with a branch. The different fanin-inverted versions of a majority-3 gate are constructed by substituting a subset of buffer cells with inverter cells [28]. Analogously, 2-input AND and OR gates are constructed by substituting a buffer cell with a constant 0 or 1 cell. Each of the three primitive cells, buffer, inverter, and constant, consists of two JJs, and hence a splitter also uses two JJs. All gates (majority-3, AND-2, and OR-2), as well as all their input-inverted versions, use six JJs each.
In AQFP logic, the majority-3 gate is the elementary gate, as the other gates, AND and OR, are derived from it. Moreover, these derived gates all have the same area as the original majority-3 gate. As such, Cai et al. [29] proposed that majority-gate-based logic synthesis is more suitable when optimizing logic networks for the AQFP technology.
The output signals of AQFP gates are rather weak and unable to drive more than one fanout. Instead, when driving multiple fanouts, splitters (or a tree of splitters) must be used to boost the output signal. Depending on the implementation details, a splitter's branching capacity can vary (usually 3 or 4 [28], [30]), and in our logic synthesis experiments for AQFP, we assume it is 4.
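As a rough illustration of the JJ arithmetic above, the following sketch counts the fewest splitters a single-fanout gate needs to reach a given number of sinks and the resulting JJ cost. It deliberately ignores logic levels and path balancing, so it is only a lower-bound style estimate; the function names are illustrative.

```python
import math

JJ_PER_GATE = 6      # majority-3, AND-2, OR-2 and their input-inverted versions
JJ_PER_SPLITTER = 2  # a splitter is a buffer cell together with a branch

def min_splitters(num_fanouts: int, capacity: int = 4) -> int:
    """Fewest splitters that let a single-fanout gate reach num_fanouts sinks.

    Each splitter consumes one signal and provides `capacity` signals, a net
    gain of capacity - 1. Levels and path balancing are ignored here.
    """
    if num_fanouts <= 1:
        return 0
    return math.ceil((num_fanouts - 1) / (capacity - 1))

def gate_with_fanouts_jj(num_fanouts: int, capacity: int = 4) -> int:
    """JJ count of one gate plus the splitters needed for its fanouts."""
    return JJ_PER_GATE + JJ_PER_SPLITTER * min_splitters(num_fanouts, capacity)

for k in (1, 4, 5, 8):
    print(k, min_splitters(k), gate_with_fanouts_jj(k))
```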
As with many superconducting technologies, AQFP gates are clocked. The logic values are propagated between consecutive gates when their active periods overlap. This overlap is achieved by ensuring that, for each gate n, all fanins of n are clocked by the same phase and n itself is clocked by the next available phase (e.g., for a 4-phase clocking scheme, if fanins of n are activated by a clock in some phase φ, then n is activated by a clock in phase φ + π/2). To achieve this kind of overlap throughout the network, the usual practice is to: 1) ensure all fanins of a gate are in the same logic level; 2) map consecutive logic levels to consecutive rows of gates/buffers in the physical circuit; and 3) activate consecutive rows of gates by clock signals in consecutive phases. We remark that, in general, it is not mandatory to have all fanins exactly in the same logic level, but it is sufficient to have them in the same logic level modulo the number of clock phases. Even this requirement can be eliminated by using a more elaborate clocking scheme where nonconsecutive clock phases can also overlap [31]. To keep things simple and allow comparisons with recent work on AQFP logic synthesis, in this work, we work in the former setting.
Depending on the design of registers and the clocking mechanism used, there can be different requirements on whether splitters are needed for primary inputs, whether path balancing is needed for primary inputs, and whether path balancing is needed for primary outputs [32]. In our proposed FBS approaches for the path-balanced setting, we assume that splitters are needed for primary inputs (which is a notable difference from the general FBS setting, where we assume primary inputs have unbounded fanout capacity to be consistent with [12]) and that path-balancing is needed for primary inputs and primary outputs (i.e., all primary outputs are at the same level).
To illustrate synthesis for AQFP under fanout and path-balancing constraints, consider the example logic network on the left of Fig. 1 and two of its fanout-bounded, path-balanced versions in the middle and on the right. The one in the middle does not have any duplicated gates, while the one on the right has one gate duplication. In this example, duplicating gates benefits both the area and the delay; the delay is reduced by one logic level and the area is reduced by two JJs.

III. RELATED WORK
In this section, we first discuss some notable work related to FBS and briefly explain how our approach differs from the existing methods. Then, we also discuss some work related to the AQFP splitter/buffer insertion problem, which can be viewed as a special case of FBS with path-balancing constraints.

A. General Fanout-Bounded Synthesis
An early theoretical work on general FBS using gate duplications and buffers by Hoover et al. [33] presented an algorithm that limits the number of fanouts of each gate by any given constant c ≥ 2 at the expense of a constant factor increase in both the total number of gates and the depth. Their algorithm assumes the natural setting that the input consists of bounded-fanin gates.
A vital ingredient of their work that is pertinent to FBS in general is the minimum-size minimum-height buffer tree construction. Namely, given the levels of fanouts of a gate, construct a tree consisting of the gate and a set of buffers such that 1) the gate is at the root; 2) the total number of buffers is minimized; and 3) the height of the tree is minimized. In the case where the gates and buffers have the same fanout bound t ≥ 2, Golumbic [34] showed how to construct such a tree using a slightly modified Huffman-coding-like algorithm [35].
Recently, Zhang and Jiang [12] studied the problem of general FBS in the unit-delay model and proposed an algorithm consisting of several heuristic optimizations. The main idea of their work is to duplicate gates if that results in a buffer reduction in the local neighborhood without significantly affecting the critical path delay. To this end, they proposed a recursive evaluation procedure to determine the number of duplicates for each gate. After the duplicate count for each gate has been determined, for each node in the reverse topological order, their algorithm constructs "skewed" buffer trees using an algorithm similar to [35]. Finally, for each set of equivalent nodes, their buffer trees are considered together and the load is redistributed. This step does not alter the levels of the nodes but may remove some redundant equivalent nodes.
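The buffer-tree construction can be illustrated with a Huffman-like merging sketch for a uniform fanout bound t. It is written in the spirit of the construction attributed to [34], but it is an assumption-laden illustration rather than a reproduction of that algorithm: sinks are merged from the highest level downward, each driver is placed one level below its lowest-level sink, and the first merge is sized as in t-ary Huffman coding so that the node count stays minimal. The function name and the encoding are hypothetical.

```python
import heapq

def buffer_tree(sink_levels, t):
    """Return (root_level, num_buffers) for a tree whose root (the gate) and
    whose buffers all have fanout bound t."""
    assert t >= 2 and sink_levels
    # Max-heap on level via negation: highest-level signals are merged first.
    heap = [-lvl for lvl in sink_levels]
    heapq.heapify(heap)
    k = len(heap)
    # Size of the first merge, chosen so every later merge takes exactly t items.
    size = k if k <= t else 2 + (k - 2) % (t - 1)
    merges = 0
    while True:
        group = [-heapq.heappop(heap) for _ in range(size)]
        driver_level = min(group) - 1  # driver sits below its lowest sink
        merges += 1
        if not heap:
            # The last driver is the gate itself; all earlier ones are buffers.
            return driver_level, merges - 1
        heapq.heappush(heap, -driver_level)
        size = t

print(buffer_tree([5, 5, 5, 3, 3], 4))
print(buffer_tree([4, 4, 4, 4, 4], 2))
```

In the first call, the two level-5 sinks that cannot be served directly by the root are grouped under one buffer at level 4, and the root can still sit at level 2, which is the highest level allowed by the level-3 sinks.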
After further analyzing the algorithm of Zhang and Jiang, we identify the following optimization opportunities.
1) The computed numbers of gate duplicates in the recursive evaluation step do not guarantee that the fanout-bounded version achieves the same minimum possible logic depth as the original, fanout-unbounded network. (Note that the original depth is always achievable using gate duplicates under the assumption that the number of fanouts for a primary input is unbounded.)
2) The priority-queue-based method used in [12] for skewed buffer tree construction, although it achieves the best possible size for the buffer tree, is not guaranteed to achieve the best possible level for the root node unless the fanout bound is two. However, for fanout bounds ≥ 3, it is always possible to obtain the best size for the buffer tree as well as the optimal level for the root node using the method proposed by Golumbic [34].
3) In [12], it is not stated how the fanouts are initially assigned to the duplicated copies prior to the skewed buffer tree construction or how their initial levels are determined. For instance, if all copies of a gate are naively placed at the same level when it is possible to place some copies at higher levels, the critical path delay can be adversely affected. However, it is difficult for an algorithm to make such decisions unless it already knows the levels of the fanouts.
4) The buffer forest rebalancing step does not guarantee that we get the minimum possible duplicate count (even locally for a considered set of equivalent nodes). This is because the rebalancing step is run only after fixing the levels of the duplicated nodes.
In our scalable algorithm for general FBS, we capitalize on all these optimization opportunities. Specifically, by reconstructing the network in the reverse topological order, our algorithm has the full knowledge of the levels of fanouts of a gate before the gate itself is synthesized. In Section V, we describe in detail how our top-down approach enables exploiting each aforementioned opportunity.

B. Path-Balanced Fanout-Bounded Synthesis
As for the path-balanced setting, there is a line of work on satisfying fanout and path-balancing constraints for the AQFP technology (e.g., [16], [17], and [18]), but these works mainly consider doing so without gate duplications. In the literature, this problem is often referred to as the AQFP splitter/buffer insertion problem, and it is a special case of the path-balanced FBS.
In early work on AQFP splitter/buffer insertion, the main idea was to optimize individual fanout nets using different approaches such as dynamic programming and local retiming-like methods for pushing buffers from fanins to fanouts. The work of Lee et al. [16] took a rigorous approach where they presented an exact formulation of the problem as a satisfiability modulo theory (SMT) problem using the theory of integer linear arithmetic. Namely, they use the logic depth of each gate as an SMT variable and, for each fanout net, they consider constraints that must be satisfied by any valid splitter/buffer insertion. In contrast, our proposed method uses an ILP to encode the problem and uses the number of gate copies/buffers of each fanout net in each level as variables, which supports gate duplications.
The work in [16] also presented a more elaborate retiming algorithm where an initial splitter/buffer-inserted network is further optimized by identifying collections of tightly connected gates (chunks) where buffers can be pushed forward (from inputs to outputs) or vice versa to reduce the buffer count. This retiming technique was later used in [17] for area recovery in delay-optimal AQFP synthesis. More recently, Fu et al. [18] presented a dynamic programming approach to globally optimize splitters and buffers in AQFP synthesis and an ILP-based solution to approximate the optimum solution.
As a final remark, we emphasize the lack of gate duplications in the existing work on splitter/buffer insertion. However, duplicating gates is an important option that warrants increased attention because it can reduce both the area and the delay, as we see in the example of Fig. 1.

IV. GLOBALLY OPTIMUM GENERAL FANOUT-BOUNDED SYNTHESIS
In this section, we present our ILP formulation of FBS in the unit-delay model. Given an input logic network, a predefined target logic depth $D$, the gate and buffer costs (e.g., area), and their respective fanout bounds, the proposed ILP finds the minimum cost logic network that meets all fanout bounds, has logic depth at most $D$, and is functionally equivalent to the input logic network.
We remark that we do not aim to make any logic restructuring; instead, our ILP determines how to duplicate gates and add buffers to the input logic network. For instance, consider the logic network shown on the left of Fig. 2, where the primary inputs $(i_1, \ldots, i_4)$ are shown on the bottom and the primary outputs $(o_1, \ldots, o_5)$ are at the top. If we assume gates and buffers both have fanout capacity 2, then one possible solution to the FBS problem is the network shown on the right, where we have two gate duplications ($n_1$ and $n_3$) and two added buffers (shown as blue triangles).
To derive the ILP, we start with the following notation. Let $I$ be the set of all primary inputs of the input network, let $G$ be the set of all gates, and let $N = I \cup G$ be the set of all nodes. For example, in the example network shown in Fig. 2, $I = \{i_1, \ldots, i_4\}$, $G = \{n_1, n_2, \ldots, n_7\}$, and $N = \{i_1, \ldots, i_4, n_1, \ldots, n_7\}$. For a node $n \in N$, let $\mathrm{FO}(n)$ be the collection of fanout nodes of $n$. Let $k_n$ be the number of primary outputs directly connected to node $n$. Thus, for example, for the network in Fig. 2, we have $\mathrm{FO}(n_1) = \{n_3, n_4\}$ and $\mathrm{FO}(n_3) = \{n_4, n_5, n_6\}$. Let $c_{\mathrm{gate}}$ be the cost (area) of a gate (we assume the network is homogeneous, but our ILP can easily be generalized to support different types of gates), let $c_{\mathrm{buff}}$ be the cost of a buffer, let $f_{\mathrm{gate}}$ be the fanout capacity of a gate, and let $f_{\mathrm{buff}}$ be the fanout capacity of a buffer.
For example, the setting studied in [12] for FBS assumed gates and buffers each have fanout capacity 2 and considered the optimization of the total node count. For this case, we thus have $f_{\mathrm{gate}} = f_{\mathrm{buff}} = 2$ and $c_{\mathrm{gate}} = c_{\mathrm{buff}} = 1$.
Let $n \in N$ be a node in the original graph. We say a node $m$ in a fanout-bounded circuit is $n$-equivalent if one of the following holds.
1) $n$ is a primary input and $m$ is the corresponding primary input in the fanout-bounded version.
2) $n$ is a gate with fanins $n_1, n_2$ and $m$ is a gate with fanins $m_1, m_2$ such that $m_1$ is $n_1$-equivalent and $m_2$ is $n_2$-equivalent.
3) $m$ is a buffer such that its fanin $m_1$ is $n$-equivalent.
Note that by the third criterion, any buffer in a buffer tree rooted at an $n$-equivalent gate is also $n$-equivalent. According to this definition, in the example fanout-bounded network (assuming $f_{\mathrm{gate}} = f_{\mathrm{buff}} = 2$) shown on the right of Fig. 2, there are two $n_1$-equivalent gates and two $n_2$-equivalent gates. Moreover, the two buffers represented as blue triangles in level 2 are $n_2$-equivalent.
1) Variables: We use two kinds of integer variables. For each node $n \in N$ and for each level $\ell \in \{1, \ldots, D\}$, we introduce variables $g_{n,\ell}$ to denote the number of gate copies in level $\ell$ in the fanout-bounded circuit that are $n$-equivalent. Similarly, we introduce variables $b_{n,\ell}$ to denote the number of buffers in level $\ell$ in the fanout-bounded circuit that are $n$-equivalent. For example, for the logic network shown in Fig. 2, the variables corresponding to the gate copies and buffers that are present in the fanout-bounded network take positive values, and $g_{n_q,\ell} = 0$ for all remaining variables $g_{n_q,\ell}$ with $q \le 7$ and $\ell \le 3$.
2) Constraints: Next, we introduce constraints to ensure that the values of variables indeed correspond to a valid fanout-bounded logic network that is equivalent to the input network. To this end, we first have that $g_{n,0} = 0$ and $b_{n,0} = 0$ for all $n \in N$ since there cannot be any gates or buffers in the same level as the primary inputs. (In fact, these variables are redundant and we can write the ILP without them, but having these variables with the above constraint makes it easier to specify the remaining constraints in a concise manner.)
Next, consider a fixed level $\ell \in \{1, \ldots, D\}$ and a fixed gate $n \in G$. We denote by $\mathrm{avl}(n, \ell)$, which stands for "availability of $n$-equivalent signals by level $\ell$," the total fanout capacity of all $n$-equivalent gates/buffers that are placed in levels strictly less than $\ell$. Note that
$$\mathrm{avl}(n, \ell) = \sum_{\ell' = 0}^{\ell - 1} \left( f_{\mathrm{gate}} \cdot g_{n,\ell'} + f_{\mathrm{buff}} \cdot b_{n,\ell'} \right),$$
which is a linear function of the ILP variables. We denote by $\mathrm{req}(n, \ell)$, which stands for the "requirement of $n$-equivalent signals by level $\ell$," the total fanout requirement of $n$-equivalent gates/buffers by all gates and buffers in level $\ell$ or below. Note that each copy of a fanout of an $n$-equivalent gate increases the fanout requirement by one, and each $n$-equivalent buffer also increases the fanout requirement by one. Namely, we can write
$$\mathrm{req}(n, \ell) = \sum_{\ell' = 0}^{\ell} \Big( \sum_{m \in \mathrm{FO}(n)} g_{m,\ell'} + b_{n,\ell'} \Big),$$
which is again a linear function of the ILP variables.
Now, observe that, in any variable assignment that corresponds to a valid fanout-bounded network with depth $D$, it must hold that $\mathrm{avl}(n, \ell) \ge \mathrm{req}(n, \ell)$ for all $n \in G$ and $\ell \in \{1, \ldots, D\}$. To see this, consider any valid depth-$D$ fanout-bounded version of the input network, and let $g_{n,\ell}$, $b_{n,\ell}$ be the corresponding ILP variable values. Fix any gate $n \in G$ and let $\ell = 1$. Note that for any gate $m \in \mathrm{FO}(n)$, $g_{m,1}$ must be 0. Otherwise, there must be a copy of $n$ at level 0, which is a contradiction as $n$ is not a primary input. Similarly, there cannot be any $n$-equivalent buffer at level 1 either. Thus, it must hold that $\mathrm{avl}(n, 1) = 0 \ge 0 = \mathrm{req}(n, 1)$. Now, suppose that $\mathrm{avl}(n, \ell) \ge \mathrm{req}(n, \ell)$ holds for any valid depth-$D$ fanout-bounded version. We inductively show that $\mathrm{avl}(n, \ell + 1) \ge \mathrm{req}(n, \ell + 1)$ must also hold. Observe that the total number of connections between $n$-equivalent gates/buffers and their fanouts that must cross the boundary between levels $\ell$ and $\ell + 1$ is at least $\sum_{m \in \mathrm{FO}(n)} g_{m,\ell+1} + b_{n,\ell+1}$. The total remaining capacity of $n$-equivalent gates/buffers that are at levels below $\ell$ is $\mathrm{avl}(n, \ell) - \mathrm{req}(n, \ell)$. Thus, the additional capacity needed to support all crossing connections must be provided by $n$-equivalent gates/buffers that are at level $\ell$. Namely, we must have
$$\mathrm{avl}(n, \ell) - \mathrm{req}(n, \ell) + f_{\mathrm{gate}} \cdot g_{n,\ell} + f_{\mathrm{buff}} \cdot b_{n,\ell} \ge \sum_{m \in \mathrm{FO}(n)} g_{m,\ell+1} + b_{n,\ell+1},$$
or equivalently, $\mathrm{avl}(n, \ell + 1) \ge \mathrm{req}(n, \ell + 1)$ after rearranging.
Finally, we ensure that enough capacity remains in the $n$-equivalent gates/buffers to support the respective primary outputs (if any). Namely, for all $n$, it must hold that $\mathrm{avl}(n, D + 1) - \mathrm{req}(n, D) \geq k_n$.
The same can be achieved by viewing all primary outputs connected to a gate $n$ as $n$-equivalent buffers placed at level $D + 1$, and simply adding the constraint $\mathrm{avl}(n, D + 1) \geq \mathrm{req}(n, D + 1)$.
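The constraints above can be checked mechanically for a candidate assignment. The following is a minimal Python sketch, not the authors' implementation: it assumes the variables are stored as dictionaries keyed by (node, level), that `fanouts` lists the gate fanouts of each gate, and that `po_count` plays the role of $k_n$. All function and variable names are illustrative.

```python
from typing import Dict, List, Tuple

Counts = Dict[Tuple[str, int], int]  # (node, level) -> number of copies


def avl(n: str, level: int, g: Counts, b: Counts, f_gate: int, f_buff: int) -> int:
    """Fanout capacity of n-equivalent gates/buffers in levels strictly below `level`."""
    return sum(
        f_gate * g.get((n, lv), 0) + f_buff * b.get((n, lv), 0) for lv in range(level)
    )


def req(n: str, level: int, g: Counts, b: Counts, fanouts: Dict[str, List[str]]) -> int:
    """Number of n-equivalent signals needed by gates/buffers in `level` or below."""
    return sum(
        sum(g.get((m, lv), 0) for m in fanouts[n]) + b.get((n, lv), 0)
        for lv in range(level + 1)
    )


def is_feasible(gates, fanouts, po_count, g, b, depth, f_gate, f_buff) -> bool:
    """Check the level-0, per-level, and primary-output constraints for every gate."""
    for n in gates:
        if g.get((n, 0), 0) or b.get((n, 0), 0):
            return False  # nothing may sit in the primary-input level
        for level in range(1, depth + 1):
            if avl(n, level, g, b, f_gate, f_buff) < req(n, level, g, b, fanouts):
                return False
        spare = avl(n, depth + 1, g, b, f_gate, f_buff) - req(n, depth, g, b, fanouts)
        if spare < po_count.get(n, 0):
            return False
    return True


# Toy instance (hypothetical): gate n drives m1, m2, m3; each m drives one output.
FANOUTS = {"n": ["m1", "m2", "m3"], "m1": [], "m2": [], "m3": []}
PO = {"m1": 1, "m2": 1, "m3": 1}

# One copy of n at level 1, one n-buffer at level 2, m1 at level 2, m2 and m3 at level 3.
good_g = {("n", 1): 1, ("m1", 2): 1, ("m2", 3): 1, ("m3", 3): 1}
good_b = {("n", 2): 1}

# All three fanouts at level 2 with no buffer: n's capacity of 2 is exceeded.
bad_g = {("n", 1): 1, ("m1", 2): 1, ("m2", 2): 1, ("m3", 2): 1}
bad_b: Counts = {}
```

With a fanout capacity of 2 for both gates and buffers and depth 3, `is_feasible(FANOUTS, FANOUTS, PO, good_g, good_b, 3, 2, 2)` accepts the first assignment, while the second fails at level 2 because $\mathrm{avl}(n, 2) = 2 < 3 = \mathrm{req}(n, 2)$.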
We thus obtain an ILP formulation for FBS under a predetermined depth bound $D$, where the objective is to minimize the total area of all gate copies and buffers subject to the constraints above. Let OPT be the optimum area of a fanout-bounded version of the input network with maximum depth $D$. Since any such valid network corresponds to a feasible solution for the ILP, it is clear that the value of the ILP is at most OPT. We now give an algorithm (Algorithm 1) to transform any feasible ILP solution into a fanout-bounded network of maximum depth $D$ that is equivalent to the original network, thus showing that our ILP in fact finds the optimum area.
The algorithm first sorts all variables $g_{n,\ell}$, $b_{n,\ell}$ in increasing order of $\ell$. Then, considering the variable values in that order, it constructs the $g_{n,\ell}$ gate copies or $b_{n,\ell}$ buffers in a new network. To facilitate this construction, for each $n \in N$, the algorithm maintains a queue of currently constructed $n$-equivalent gates/buffers together with their remaining fanout capacities. Each time it uses such a gate/buffer, it decrements the remaining capacity; once the capacity reaches zero, the corresponding gate/buffer instance is removed from the queue. Since the algorithm constructs gates/buffers in a level-by-level fashion using a feasible variable assignment, the algorithm always has sufficient equivalent signals in the corresponding queues when executing lines 11 and 15.
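The queue bookkeeping described above can be sketched as follows. This is only an illustration of the capacity-tracking idea, assuming one FIFO queue per original node; the class name `SignalPool` and the instance labels are hypothetical and do not come from Algorithm 1.

```python
from collections import deque


class SignalPool:
    """FIFO pool of equivalent signals, each with a remaining fanout capacity."""

    def __init__(self) -> None:
        self._queue = deque()  # entries are [instance_name, remaining_capacity]

    def add(self, name: str, capacity: int) -> None:
        """Register a newly built gate/buffer instance (capacity assumed >= 1)."""
        self._queue.append([name, capacity])

    def take(self) -> str:
        """Use one fanout of the oldest instance that still has spare capacity."""
        if not self._queue:
            raise LookupError("no equivalent signal with spare capacity")
        entry = self._queue[0]
        entry[1] -= 1
        if entry[1] == 0:
            self._queue.popleft()  # exhausted instances leave the queue
        return entry[0]


pool = SignalPool()
pool.add("n@level1", 2)      # a gate copy with fanout capacity 2
pool.add("buf_n@level2", 2)  # an n-equivalent buffer with fanout capacity 2
used = [pool.take() for _ in range(4)]
```

Because instances are added level by level, the oldest entry is always at the lowest level, so taking from the front of the queue never connects a consumer to a signal above it as long as the assignment is feasible.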

V. TOP-DOWN HEURISTIC APPROACH FOR GENERAL FANOUT-BOUNDED SYNTHESIS PROBLEM
In this section, we first present our scalable top-down heuristic algorithm that greedily finds a feasible solution to the derived ILP. We then propose an additional optimization step that can be integrated with the top-down approach and that allows further area reductions in certain cases.
Although solving the ILP introduced in Section IV gives the optimum solution, solving it for the large networks often encountered in practice is prohibitively expensive, and hence not viable in many practical settings. On the other hand, the top-down approach we propose in this section scales to very large networks as it runs in $O(S \log S)$ time, where $S$ is the size of the input network (i.e., the number of wires in the network). Although this approach is not optimum in general, it achieves optimum or near-optimum areas for several of the benchmarks considered in our experiments.
In the proposed approach, we consider the gates $n \in G$ in reverse topological order, and for each $n$ in this order, we determine values for the variables $g_{n,\ell}$ and $b_{n,\ell}$ such that the constraints $\mathrm{avl}(n, \ell) - \mathrm{req}(n, \ell) \geq 0$ and $\mathrm{avl}(n, D + 1) - \mathrm{req}(n, D) \geq k_n$ are satisfied. Since we consider the nodes in reverse topological order, when we consider a node $n$, we already know the levels of all fanouts of $n$-equivalent gates/buffers except for those fanouts that arise due to fanins of $n$-equivalent buffers. We call the fanouts whose levels are already known the external fanouts of $n$-equivalent gates/buffers.
When determining the values for $g_{n,\ell}$ and $b_{n,\ell}$, we prefer minimizing the number of gate duplicates by utilizing buffers as much as possible to support the fanout requirement. This decision is motivated by the following facts. First, duplicating a gate will increase the fanout requirement of other nodes: for example, suppose that $n$'s fanins are $m_1$ and $m_2$. Then, duplicating an $n$-equivalent gate increases the fanout load of $m_1$- and $m_2$-equivalent gates/buffers. This is in contrast to adding a buffer, which only increases the fanout load by one. Second, it is natural to assume that the area of a buffer is not more than that of a gate, and the fanout capacity of a buffer is usually more than that of a gate. Thus, in terms of area, replacing a gate copy with a buffer is always beneficial.
However, we cannot completely eliminate gate duplication because the addition of buffers can increase the number of logic levels (i.e., the critical path length). Recall that $t^{\mathrm{arr}}_n$ is the minimum level node $n$ can be at even if we assume unbounded fanout capacities. Thus, for any $\ell < t^{\mathrm{arr}}_n$, setting $g_{n,\ell}$ to a nonzero value makes the solution infeasible. Similarly, for any $\ell \leq t^{\mathrm{arr}}_n$ (note the inclusion of equality), setting $b_{n,\ell}$ to a nonzero value also makes the solution infeasible.
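For illustration, the arrival level described above (the lowest level a node can occupy when fanout capacities are unbounded) can be computed in one topological pass. The sketch below assumes the network is given as a dictionary from each node to its list of fanins, with primary inputs mapped to an empty list and placed at level 0; the function name is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_levels(fanins):
    """Map each node to its minimum level, assuming unbounded fanout capacity."""
    levels = {}
    # static_order() yields every node after all of its fanins (predecessors).
    for n in TopologicalSorter(fanins).static_order():
        ins = fanins.get(n, [])
        levels[n] = 1 + max(levels[m] for m in ins) if ins else 0
    return levels


# Toy network: x = f(a, b), y = f(x, a); a and b are primary inputs.
toy = {"a": [], "b": [], "x": ["a", "b"], "y": ["x", "a"]}
toy_levels = arrival_levels(toy)
```

In this toy network, `x` gets level 1 and `y` gets level 2, so by the rule above no copy of `y` may be placed below level 2 and no `y`-equivalent buffer may be placed at level 2 or below.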
For given levels of external fanouts of $n$-equivalent gates/buffers and the minimum possible level (i.e., $t^{\mathrm{arr}}_n$) for an $n$-equivalent gate, we use Algorithm 2 to determine the values of the $g_{n,\ell}$ and $b_{n,\ell}$ variables by considering each node in reverse topological order. We then use Algorithm 1 to construct the corresponding fanout-bounded logic network.
We remark that our top-down approach is fundamentally different from the work of Zhang and Jiang [12]. In [12], a set of $n$-equivalent gates and their corresponding levels are already determined when the buffer-forest rebalancing algorithm is run in order to reduce the number of gate duplicates. This can lead to some redundant gate copies that remain in the network even after rebalancing is performed. In contrast, our algorithm uses Algorithm 2 to decide the set of $n$-equivalent gates that we absolutely need along with their levels; thus, redundant gate copies are never created. Moreover, in the "skewed buffer tree construction" and "buffer-forest rebalancing" algorithms of [12], there can be situations where they do not construct the best buffer tree/forest when $f_{\mathrm{gate}}, f_{\mathrm{buff}} > 2$ and $c_{\mathrm{gate}} > c_{\mathrm{buff}}$. To see this, suppose that $f_{\mathrm{gate}} = f_{\mathrm{buff}} = 3$ and $c_{\mathrm{gate}} > c_{\mathrm{buff}}$ and consider the fanout net shown in Fig. 3(a). The algorithm of [12] may either decide to duplicate node $n$ and produce the forest shown in Fig. 3(b), which has a cost of $2 \cdot c_{\mathrm{gate}}$, or it may construct the skewed buffer tree shown in Fig. 3(c), where the node $n$ is placed at level 4. However, the buffer tree shown in Fig. 3(d) is better than both options; it has a lower area than the one in Fig. 3(b) and gives a better placement for node $n$ than the one in Fig. 3(c). In contrast to [12], our algorithm always constructs the optimum buffer forest for given levels of external fanouts and $t^{\mathrm{arr}}_n$. Namely, for $r = 1, 2, \ldots$, we consider $r$ copies for the root gate, employ a modified version of the algorithm of Golumbic [34] to derive $r$ buffer trees, and find the minimum value of $r$ such that the roots of all trees meet the arrival time requirement.

Fig. 3. Fanout net for a node $n$ with levels of fanouts already decided (a), two possible outcomes for the fanout net of $n$ if the algorithm of [12] is used (b) and (c), and the optimum buffer tree for $n$ (d) when $f_{\mathrm{buff}} = f_{\mathrm{gate}} = 3$ and $c_{\mathrm{gate}} > c_{\mathrm{buff}}$.

Algorithm 2 (remaining steps)
10 Add $f_{\mathrm{buff}} - s$ many copies of $\infty$ to folev$_n$ (i.e., dummy fanouts with unbounded required time).
11 Use the skewed buffer tree construction from [12] until we have $t$ buffer trees.
12 if the root levels of all buffer trees are at least $t^{\mathrm{arr}}_n$ then
13 Set $g_{n,\ell}$ and $b_{n,\ell}$ according to the construction.
return variable values

A. Improved Top-Down Approach With Overduplication
Recall that in our vanilla top-down approach, for each fanout net, we find the smallest buffer forest that does not increase the overall critical path length. The intuition behind settling for the smallest buffer forest is to minimize duplication of gates, and hence avoid unnecessarily increasing the load on the fanins of those gates.
One potential drawback of this frugal approach is as follows: consider a scenario where we may have the option of placing two copies of a node $n$ at level $t^{\mathrm{arr}}_n + 1$. However, we may end up placing a single copy of $n$ at level $t^{\mathrm{arr}}_n$ instead, thus forcing more duplications for $n$'s fanin nodes as their fanout nets do not have enough slack to add buffers. To illustrate this point, assuming that $f_{\mathrm{gate}} = f_{\mathrm{buff}} = 3$ and $c_{\mathrm{gate}} > c_{\mathrm{buff}}$, consider the time our algorithm processes the fanout net of node $n$ in the setting shown at the top of Fig. 4, where the levels are already decided for all nodes except $n$, $n_1$, and $n_2$. Since the naive top-down approach greedily tries to minimize the number of duplicates for $n$, it will be placed at level 2 (no duplication) with one buffer at level 3, as shown in the middle of Fig. 4. This forces both $n_1$ and $n_2$ to be duplicated (unless the critical path length is to be increased), which results in an overall cost of $5 \cdot c_{\mathrm{gate}} + c_{\mathrm{buff}}$ for the fanout nets of $n$, $n_1$, and $n_2$. However, if we allow the locally suboptimal choice of duplicating $n$, it is possible to place two copies of $n$ in level 3. This allows more room for the fanout nets of $n_1$ and $n_2$ to have buffers, resulting in the outcome shown at the bottom of Fig. 4 with an overall cost of $4 \cdot c_{\mathrm{gate}} + 2 \cdot c_{\mathrm{buff}}$ (which is a strictly better cost when $c_{\mathrm{gate}} > c_{\mathrm{buff}}$). As such, allowing more duplicates than absolutely necessary (i.e., overduplication) can be good if that provides more room for the fanins to have buffers and prevents them from being duplicated.
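As a quick check of this comparison, the difference between the two overall costs quoted for Fig. 4 can be written out explicitly (this only restates the costs given in the text):

$$\left( 5 \cdot c_{\mathrm{gate}} + c_{\mathrm{buff}} \right) - \left( 4 \cdot c_{\mathrm{gate}} + 2 \cdot c_{\mathrm{buff}} \right) = c_{\mathrm{gate}} - c_{\mathrm{buff}} > 0 \quad \text{whenever } c_{\mathrm{gate}} > c_{\mathrm{buff}}.$$

In other words, overduplicating $n$ in this example trades one gate copy for one buffer.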
In an improved version of our top-down approach, we incorporate this idea of overduplication as follows. For the fanout net of a considered node $n$, instead of stopping the algorithm at the minimum possible number of trees $t$, we continue increasing $t$ and construct the corresponding buffer forests. For each such buffer forest, we consider the overall area incurred by the fanout net of the considered node and the fanout nets of its fanin nodes, assuming that we do not use overduplication for those fanin nodes. Then, for node $n$, we choose the buffer forest that gives the minimum overall area computed in the above step.
There are two issues with this approach. First, due to the top-down implementation, when considering node $n$, all levels of its fanouts (including their potential copies) are known. However, for a fanin $m$ of $n$, there can be some fanouts that are yet to be considered by the algorithm, and hence their final levels are not known. Second, suppose that a node $m$ has $k$ fanouts. The cost of the fanout net of $m$ will be re-evaluated for each of those fanouts, i.e., the fanout net of $m$ is evaluated at least $k$ times. Since each evaluation also takes time at least linear in $k$, the total work involved in evaluating a node's fanout net can be very expensive for high-fanout nodes.
To circumvent the first issue, we propose to use a proxy level for the so-far unconsidered nodes; namely, we use their maximum possible level (i.e., the required time) as the proxy level. To mitigate the effects of the second issue, we set a constant bound $F_{\max}$ (e.g., 10) and ignore nodes with more than $F_{\max}$ fanouts when computing the overall area impact.
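The selection step with the $F_{\max}$ cutoff can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' code: each candidate forest is assumed to arrive with a precomputed area for the node itself and an estimated area for each fanin's fanout net, and the tuple layout, function name, and numeric costs are all hypothetical.

```python
def pick_forest(candidates, fanout_count, f_max=10):
    """Return the candidate with the least combined area.

    candidates: list of (num_trees, own_area, {fanin: estimated_area}),
                assumed to be ordered by increasing num_trees.
    fanout_count: number of fanouts of each fanin node.
    Fanins with more than f_max fanouts are ignored in the estimate.
    """
    def total_area(candidate):
        _, own_area, fanin_area = candidate
        return own_area + sum(
            area for m, area in fanin_area.items() if fanout_count[m] <= f_max
        )

    # min() keeps the first minimum, so ties favor fewer gate copies.
    return min(candidates, key=total_area)


# Numbers mirroring Fig. 4 with illustrative costs c_gate = 6 and c_buff = 2.
C_GATE, C_BUFF = 6, 2
options = [
    # one copy of n plus one buffer; n1 and n2 each need two copies
    (1, C_GATE + C_BUFF, {"n1": 2 * C_GATE, "n2": 2 * C_GATE}),
    # two copies of n; n1 and n2 each need one copy plus one buffer
    (2, 2 * C_GATE, {"n1": C_GATE + C_BUFF, "n2": C_GATE + C_BUFF}),
]
chosen = pick_forest(options, {"n1": 3, "n2": 3})
chosen_high_fanout = pick_forest(options, {"n1": 50, "n2": 50})
```

With low-fanout fanins the overduplicated forest (two trees) wins, matching the $4 \cdot c_{\mathrm{gate}} + 2 \cdot c_{\mathrm{buff}}$ outcome; when both fanins exceed the cutoff they drop out of the estimate and the single-tree forest is kept.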

VI. PATH-BALANCED FANOUT-BOUNDED SYNTHESIS
In this section, we focus on FBS with the additional requirement of path-balancing.

A. ILP Formulation for the Global Optimum
Recall that the path-balancing constraint states that all input-to-output paths are of the same length. Equivalently, for a gate in level $\ell$, all of its fanins must be in level $\ell - 1$. Thus, for a gate or primary input $n$ in the input network and for a level $\ell$ in the output network, the following must hold: the total available fanout capacity of all $n$-equivalent nodes in level $\ell - 1$ must be at least the total number of $n$-equivalent signals required by nodes in level $\ell$. We can easily incorporate this constraint into the ILP of Section IV by simply redefining $\mathrm{avl}(n, \ell)$ and $\mathrm{req}(n, \ell)$ as
$$\mathrm{avl}(n, \ell) = f_{\mathrm{gate}} \cdot g_{n,\ell-1} + f_{\mathrm{buff}} \cdot b_{n,\ell-1}, \qquad \mathrm{req}(n, \ell) = \sum_{m \in FO(n)} g_{m,\ell} + b_{n,\ell}.$$
As discussed in Section II, FBS with path-balancing constraints is a generalization of splitter/buffer insertion for AQFP technology, and AQFP technology can have different assumptions on the need for buffers/splitters on primary inputs and primary outputs. In particular, the requirement that all input-to-output paths must be of the same length falls under the assumption that both primary inputs and primary outputs need path-balancing.
However, our ILP is versatile as it can be adapted to different AQFP-technology-specific assumptions. For example, we can remove the path-balancing requirement on primary inputs by retaining the definitions of $\mathrm{avl}(n, \ell)$ and $\mathrm{req}(n, \ell)$ from Section IV for nodes $n \in I$, i.e., the primary inputs. Similarly, we can remove the path-balancing requirement on primary outputs by retaining those definitions only in the constraint $\mathrm{avl}(n, D + 1) - \mathrm{req}(n, D) \geq k_n$. Moreover, if we also need to enforce that primary inputs need splitters to support multiple fanouts, we can add constraints dictating $g_{n,0} = 1$ and $g_{n,\ell} = 0$ for all $n \in I$ and $\ell > 0$. (In the ILP for the general fanout-bounded setting with no fanout limit on primary inputs, we simply omitted these constraints. This allows the ILP solver to place as many copies of primary inputs anywhere in the network, which is effectively equal to assuming unbounded fanout capacity.) In addition to supporting the different AQFP-specific assumptions, we can also change the ILP to match the original splitter/buffer insertion problem where duplicating gates is not an option. To this end, we simply have to introduce a new constraint $\sum_{1 \leq \ell \leq D} g_{n,\ell} = 1$ for all gates $n \in G$.

B. Scalable Heuristic Approach
In the path-balanced setting, we need buffers not only to support multiple fanouts but also to ensure that all input-to-output paths are of the same length. If we naively use the same top-down approach from the general FBS for the path-balanced setting, it can unnecessarily increase the area due to path-balancing buffers. To see this, suppose that we have a gate $n$ whose arrival time is 1, but its only fanout is determined to be in level 3 by our top-down algorithm. In this case, the algorithm prefers to keep $n$ in level 2 (as opposed to 1) because the main idea of the algorithm from Section V was to keep gates in the highest level possible to give sufficient room for their fanins to have buffers. Now suppose that $n$'s fanins have no other fanouts, in which case we will need two path-balancing buffers at $n$'s fanins. However, if we placed $n$ in level 1 instead, we would need only one path-balancing buffer at $n$'s output. In general, the situation can be much worse: for example, we could have a block of logic that has $k_1$ outputs and $k_2$ inputs in place of $n$. If $k_1 < k_2$, moving the whole logic block down by one level can save $k_2 - k_1$ buffers. On the other hand, if $k_2 < k_1$, then the algorithm's choice to keep the logic block in the highest possible level is meaningful.
Taking such scenarios into account, for the path-balanced setting, we start with a top-down approach similar to Section V to determine initial gate/buffer counts in different levels (i.e., values for variables $g_{n,\ell}$ and $b_{n,\ell}$), but we then perform an additional optimization to modify these gate/buffer counts to further reduce the area. To this end, we first identify (gate, level) pairs that may correspond to potential path-balancing buffers. If all fanins of a gate are path-balancing buffers, we can push the buffers toward the output of the gate. In general, this can be done on blocks of logic whose inputs all correspond to path-balancing buffers.
Retiming techniques of this kind have already been considered in the past [16], [17], but they work on existing AQFP netlists. Our proposed method is more general and works on gate/buffer counts in each level, before constructing the netlist, and hence it is able to capture more retiming opportunities. To illustrate, consider the part of a netlist shown in Fig. 5(b), where we assume that the splitter fanout capacity is 2 for the sake of simplicity. The existing retiming techniques can optimize this by moving node $a$ one level down to obtain the configuration in Fig. 5(c), saving one buffer in the process. However, these algorithms fail to identify an optimization opportunity for the configuration in Fig. 5(a) because one of the fanins of node $a$ is not a path-balancing buffer but a splitter. Our approach instead works on gate/buffer counts in each level and hence is able to identify the optimization opportunity in both scenarios. Namely, for each fanin $x$ of node $a$, we check if we have a potential path-balancing buffer by checking: 1) if we have $x$-equivalent buffers in the lower level and 2) if we can isolate one path-balancing buffer (a buffer with fanout one) from those. To check the first condition for a fanin $x$ of node $a$ in level $\ell$, we check if $b_{x,\ell-1} > 0$. For the second condition, we check whether the remaining $x$-equivalent nodes in level $\ell - 1$ can still satisfy the requirement of the remaining fanouts of $x$ in level $\ell$ after dedicating a single $x$-equivalent buffer to supply node $a$'s fanin; namely, we check if $\mathrm{avl}(x, \ell - 1) - f_{\mathrm{buff}} \geq \mathrm{req}(x, \ell) - 1$. After optimizing the gate/buffer counts with this improved retiming step, we construct an AQFP netlist using Algorithm 1. Then, we also run the state-of-the-art retiming from [17] on the constructed circuit to further optimize buffer counts.
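The two-part check on a fanin $x$ can be written as a small predicate. The sketch below is an assumption-laden illustration: the caller is expected to supply the already evaluated quantities (the number of $x$-equivalent buffers in the lower level, the available capacity term, and the requirement term from the inequality above), and the function and parameter names are hypothetical.

```python
def has_isolable_balancing_buffer(b_below: int, avl_term: int, req_term: int,
                                  f_buff: int) -> bool:
    """Return True if one buffer in the lower level can be dedicated to a single fanin.

    b_below:  number of x-equivalent buffers in the lower level
    avl_term: available x-equivalent fanout capacity (left side of the inequality)
    req_term: required number of x-equivalent signals (right side of the inequality)
    f_buff:   fanout capacity of a buffer/splitter
    """
    if b_below <= 0:
        return False  # condition 1: there must be a buffer to isolate
    # condition 2: after giving up one whole buffer, the rest must still
    # serve every remaining fanout.
    return avl_term - f_buff >= req_term - 1


# Splitter capacity 2, as assumed for Fig. 5 in the text; the counts are made up.
two_buffers_three_sinks = has_isolable_balancing_buffer(2, 4, 3, 2)
one_buffer_two_sinks = has_isolable_balancing_buffer(1, 2, 2, 2)
one_buffer_one_sink = has_isolable_balancing_buffer(1, 2, 1, 2)
```

The middle case fails because the only buffer is acting as a splitter for two sinks, so it cannot be treated as a pure path-balancing buffer.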
In addition to the retiming, our initial top-down heuristic has some minor differences with respect to Section V. Namely, when computing the arrival times for signals, if the originating gate of a signal has more than one fanout, we assume a delay of 2 (instead of 1), accounting for an additional splitter at the output of that gate. This is an AQFP-specific setting: in AQFP, the gates can only support one fanout, and if we always assume a delay of 1 for a gate, the top-down algorithm can end up excessively duplicating gates to meet this delay bound. However, if a gate is on the critical path, has only two fanouts, and its fanins will have splitters added at their outputs (i.e., they have multiple fanouts), then it is likely that we can duplicate the gate with only a small additional cost. So for such gates, we take the delay to be 1 when computing the arrival time.

VII. EXPERIMENTAL RESULTS
In this section, we present the experimental results obtained from our ILP formulations and heuristic FBS algorithms for both the general and path-balanced settings. All our experiments were run on a MacBook Pro M1 with a 10-core CPU, a 16-core GPU, and 32 GB of RAM.
Note that in all our experiments for the general FBS setting, the benchmarks are preprocessed with a single round of the resyn2 command in ABC [19] for a fair comparison with prior work [12]. No such preprocessing was done in the experiments for the path-balanced FBS setting.

A. Global Optimum for General Fanout-Bounded Synthesis
First, for a set of small benchmarks, we use the ILP to find the global optimum solutions; namely, using the minimum possible circuit delay as the delay bound, we write the ILP introduced in Section IV, and then solve it using the Gurobi optimizer [36]. In the ILP formulation, we use the same setting as [12], where we have fanout capacity 2 and unit area for both AND gates and buffers.
The results are shown in Table I, where the first eight benchmarks are from the EPFL logic synthesis benchmark suite [14] and the rest of the benchmarks are a subset of those used in [15].

B. Heuristics for General Fanout-Bounded Synthesis
Next, we evaluate our top-down FBS approaches on the benchmarks of [15] and on EPFL benchmarks [14].
For the benchmarks of [15], we present the results in Table II. As we see, our initial top-down approach already achieves the optimum on several benchmarks. Our top-down approach with overduplication performs even better and achieves results that are optimum or closer to optimum on some additional benchmarks. We recall that neither of our approaches increases the number of logic levels of the input network (computed with no restrictions on the fanout capacity of gates).
For the EPFL benchmarks, we present the results in Table III together with the results of [12] for comparison. We remark that the measure of quality of results (QoR) used in [12] is slightly different, and if we were to use their QoR measure on our results, our approach would score even higher. Namely, the QoR measure used in [12] is $\mathrm{size}(G)/\mathrm{size}(G') + \mathrm{depth}(G)/\mathrm{depth}(G')$, where $G$ is the original input network and $G'$ is the fanout-bounded version produced by the algorithm. In our approach, the depths of $G$ and $G'$ are always equal, whereas in [12], $\mathrm{depth}(G) \leq \mathrm{depth}(G')$ with strict inequality for some benchmarks (e.g., see the results for benchmark "sqrt").
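To make the effect of the depth term concrete, the QoR measure of [12] as stated above can be evaluated directly. The helper below is a minimal sketch with a hypothetical function name and made-up sizes and depths; it is not taken from either work.

```python
def qor(size_in: int, depth_in: int, size_out: int, depth_out: int) -> float:
    """QoR of [12] as stated in the text: size(G)/size(G') + depth(G)/depth(G')."""
    return size_in / size_out + depth_in / depth_out


# Same output size, but one result preserves the depth and the other adds two levels.
same_depth = qor(size_in=100, depth_in=10, size_out=125, depth_out=10)
deeper = qor(size_in=100, depth_in=10, size_out=125, depth_out=12)
```

A result that preserves the depth keeps the second term at exactly 1, so for equal output sizes it scores higher under this measure than a result that increases the depth.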
In our top-down approach without overduplication, the average improvement over all standard EPFL benchmarks is 10.93%. However, for the benchmark "bar," our algorithm's result is 12.2% worse than that of [12]. Remarkably, combining the top-down algorithm with the overduplication step from Section V-A achieves the same results as [12] for that benchmark, while increasing the average improvement over all EPFL benchmarks to 11.82%. Notably, our method results in fanout-bounded circuits that are much closer to the optimum results on several benchmarks (e.g., adder, cavlc, int2float, and router).

C. Global Optimum Splitter/Buffer Insertion for AQFP
In this section, we present the results of our ILP-based global optimization algorithm for FBS in the path-balanced setting targeting the AQFP technology. To this end, we set $f_{\mathrm{buff}} = 4$ and $f_{\mathrm{gate}} = 1$ to capture the fanout constraints commonly used in prior work on the AQFP technology. We use the number of JJs as the area cost, and hence we set $c_{\mathrm{gate}} = 6$ and $c_{\mathrm{buff}} = 2$. Recall that, according to the ILP formulation, the global optimum means the minimum area for a fixed depth.
In our experiments, we consider two scenarios: one without gate duplications and one with gate duplications. To the best of our knowledge, no prior work on AQFP splitter/buffer insertion considers gate duplications.
In Table IV, we present our optimum results on the same benchmarks used by [15] for the case with no gate duplicates and compare them with the results of four prior works [15], [16], [17], [18] in the same setting. In this experiment, we use the minimum achievable delay without duplicating gates as the target depth bound. In the table, the optimum area for the target depth is shown in blue. The term "opt" in the last columns means that the ILP solver was able to find the optimum solution. On the other hand, the term "tle" (time-limit-exceeded) means that the solver failed to find the optimum solution within the given time limit of 300 s, so the presented results for "tle" rows are based on a tentative feasible solution found by the solver. Note that having the global optimum results in this setting allows for an objective evaluation of other heuristic algorithms.
In Table V, we present the optimum results obtained considering different target logic depths on the same benchmarks for the setting with gate duplicates, which can be used to evaluate future algorithms in this setting. To obtain these results, we start with the minimum delay achievable without gate duplications as the target delay and proceed by gradually decreasing the target delay. In the table, for each benchmark, the minimum observed area is shown in blue, where ties are broken using the overall delay. Note that these results serve as a proof of concept that allowing gate duplications can help improve both the area and delay in AQFP synthesis.

D. Heuristic Splitter/Buffer Insertion for AQFP
Finally, we run our scalable heuristic algorithm for path-balanced FBS on the same benchmarks used by [15] and compare our results with the latest scalable algorithm for AQFP splitter/buffer insertion [17] in Table VI. For all benchmarks, our approach achieves the same or significantly better delays as compared to the optimum delay achieved by the method in [17]. For some benchmarks with significant delay improvements, there is a considerable area overhead, which is likely caused by duplicated gates. However, some other benchmarks with higher delay improvements show considerable area improvements as well, which can be attributed to the decrease in path-balancing buffers. The average delay improvement of our approach is 8.76% while the average area improvement is 0.5%. Notably, our heuristic algorithm achieves more than 17% delay improvements on several benchmarks.

VIII. CONCLUSION
In this work, we took a rigorous approach for the FBS of circuits in the unit-delay model. To this end, we formulated the problem of FBS for a fixed target delay as an ILP and obtained the global optimum solutions for a number of benchmarks. We then showed how to find a feasible solution to the ILP using a scalable top-down approach while mitigating some shortcomings of earlier work. As compared to the known best results for this problem, our algorithm produced an 11.82% improved area while achieving matching or better delays. As we see in Section VII, the overduplication heuristic with a local cost function improved the area reduction. It will be interesting to find a more elaborate but efficiently computable cost function for evaluating heuristic choices such as the one we introduced in Section V-A. We also believe that a deeper analysis of the benchmark "bar" might hint at what kind of real-world circuit patterns benefit more from such heuristics.
We extended both our optimum and heuristic approaches to the setting with path-balancing constraints and demonstrated their effectiveness considering the splitter/buffer insertion problem in the AQFP technology. Our globally optimum results considering different target depths for the setting with gate duplications show that there exists a large gap in existing AQFP splitter/buffer insertion techniques. Remarkably, our scalable heuristic algorithm for this setting was able to exploit many optimization opportunities by considering the duplication of gates on critical paths. In particular, several benchmarks showed over 17% delay improvements under our method, including two benchmarks (adder8 and alu32) that also showed over 10% area improvements. However, comparing the results of our heuristic with the globally optimum solutions, it is clear that there are many opportunities for further improvements.
Considering these promising findings, we envision that FBS will have a bigger role to play in logic synthesis for emerging technologies with unconventional design constraints, and we hope that our work will motivate more research in this direction that would ultimately lead to better heuristics. The globally optimum solutions presented in this work can serve as the ground truth for evaluating such heuristics.

Manuscript received 19 June 2023; accepted 9 November 2023. Date of publication 5 December 2023; date of current version 23 April 2024. This work was supported by the SNF Grant "Supercool: Design Methods and Tools for Superconducting Electronics" under Grant 200021_1920981. This article was recommended by Associate Editor S. Gao. (Corresponding author: Dewmini Sudara Marakkalage.)

… $t^{\mathrm{req}}_n = D$. Otherwise, $t^{\mathrm{req}}_n = \min_{m \in FO(n)} t^{\mathrm{req}}_m - 1$, where $FO(n)$ denotes the set of fanout nodes of node $n$.

Fig. 1. Example logic network (left) and two of its possible fanout-bounded, path-balanced versions targeting AQFP technology assuming a fanout capacity of 1 for gates and 3 for splitters. (Buffers and splitters are shown by triangles.) The version in the middle does not use any gate duplication whereas the version on the right allows gate duplication, resulting in a reduction in both the overall number of logic levels and the total area.

Fig. 2. Example logic network (left) and a possible fanout-bounded version assuming a fanout limit of 2 (right).

Fig. 4. Intermediate step of FBS with levels decided for all nodes except $n$, $n_1$, $n_2$ (top), the fanout nets synthesized by the naive top-down approach (middle), and the fanout nets synthesized if overduplication is allowed (bottom) when $f_{\mathrm{buff}} = f_{\mathrm{gate}} = 3$ and $c_{\mathrm{gate}} > c_{\mathrm{buff}}$.

Fig. 5. Two possibilities for a part of an AQFP netlist (a) and (b), and their retimed version (c).

Algorithm 1
input: Input network ntk, parameters $f_{\mathrm{gate}}$, $f_{\mathrm{buff}}$, and a feasible ILP solution $g_{n,\ell}$, $b_{n,\ell}$ for $n \in N$ and $0 \leq \ell \leq D$.
output: A fanout-bounded version of ntk.
...
Let data be an empty list.
5 for all nonzero $g_{n,\ell}$ do Add ($\ell$, $n$, "gate") to data
6 for all nonzero $b_{n,\ell}$ do Add ($\ell$, $n$, "buff") to data
7 Sort data in the ascending order of levels.
...
Create a new gate by choosing the first available equivalent fanins in newsig.
...

Algorithm 2: Algorithm for Determining $g_{n,\ell}$ and $b_{n,\ell}$ Values for a Node $n \in N$, Given min
input: Input network ntk, parameters $f_{\mathrm{gate}}$, $f_{\mathrm{buff}}$, a node $n$, $t^{\mathrm{arr}}_n$, and a list folev$_n$ of levels of $n$'s fanouts.
output: Values of $g_{n,\ell}$, $b_{n,\ell}$ variables for $\ell = 1, \ldots, D$.
1 Set $g_{n,\ell}$, $b_{n,\ell}$ = 0 for all $\ell$
2 for $t$ = 1 to length(folev$_n$) do
3   Let rem ← length(folev$_n$) − $t \cdot f_{\mathrm{gate}}$
4   if rem ≤ 0 then
5     for $i$ = 1 to length(folev$_n$) in steps of $f_{\mathrm{gate}}$ do
6       Increment $g_{n,\,\mathrm{folev}_n[i]-1}$.
7     return variable values
8   $s$ ← rem mod ($f_{\mathrm{buff}}$ − 1)
9   if $s$ > 0 then
10    ...

TABLE I: GLOBAL OPTIMUMS FOR GENERAL FBS
TABLE II: RESULTS OF THE TOP-DOWN FBS ON BENCHMARKS OF [15]
TABLE III: RESULTS OF THE TOP-DOWN FBS ALGORITHM ON EPFL BENCHMARKS
TABLE IV: RESULTS OF AQFP SPLITTER/BUFFER INSERTION WITHOUT GATE DUPLICATION
TABLE V: RESULTS OF AQFP SPLITTER/BUFFER INSERTION WITH GATE DUPLICATION
TABLE VI: RESULTS OF SCALABLE HEURISTIC APPROACH FOR AQFP