Impact of Sequential Design on the Cost of Adiabatic Quantum-Flux Parametron Circuits

Adiabatic quantum-flux-parametron (AQFP) logic is a superconductor logic family whose energy efficiency approaches theoretical limits. Because AQFP logic gates depend on a polyphase excitation current to perform their computation, gate fanins must arrive at the appropriate excitation phase. Such a technology constraint has conventionally been treated by inserting buffers to balance shorter paths. However, path-balancing buffers account for a large portion of the circuit area, limiting the scalability of AQFP circuits. In this article, we examine the necessity of AQFP design constraints and propose a more relaxed set of constraints, which still guarantees the correct operation of AQFP sequential circuits. In particular, we propose to consider phase alignment instead of path balancing. Experimental results show that adopting the relaxed constraints reduces the number of buffers by 73% on average, and by up to 90% in some particularly imbalanced benchmarks.


Siang-Yun Lee, Christopher Lawrence Ayala, Senior Member, IEEE, and Giovanni De Micheli, Life Fellow, IEEE

I. INTRODUCTION
Adiabatic quantum-flux-parametron (AQFP) logic is an emerging superconductor digital logic technology that is receiving increased interest. By operating in the adiabatic mode, its energy efficiency approaches theoretical limits [1]. Because AQFP is a promising and attractive alternative to CMOS-based digital families for high-performance computing, design automation algorithms specialized for AQFP circuits have also been developed rapidly in recent years. As switching energy dissipation in AQFP is related to the number of Josephson junctions (JJs), reducing the JJ count of AQFP circuits has been the primary optimization goal, along with reducing circuit latency. This, in turn, also helps to reduce the overall circuit area, as AQFP primitives have a large footprint due to their output transformer. Surprisingly, research has found that a large portion of the JJs in AQFP benchmark circuits is dedicated to buffering cells that fulfill technology constraints.
The AQFP technology imposes some special constraints uncommon to classical CMOS technologies. First, because every gate in an AQFP circuit is clocked, all input signals for any given logic gate have to arrive before the clock. To ensure this, shorter data paths need to be delayed with clocked buffers. Moreover, the output signal of AQFP logic gates cannot be directly branched to feed into multiple fanouts. Instead, splitters, which are also clocked, are placed at the output of multifanout gates to amplify the output current. The AQFP splitter cell is based on the buffer cell and has the same cost of 2 JJs [2]. Thus, buffer insertion and splitter insertion are often considered together as an optimization problem [3], [4], [5], [6]. In previous works, the buffer and splitter insertion problem has been formulated as follows: All paths should be balanced to the same length (path balancing), and all gates, including primary inputs (PIs), with multiple fanouts must be branched (fanout branching).
While the path-balancing and fanout-branching constraints are absolutely required for the correct operation of an AQFP combinational circuit without memory devices, in the context of a sequential computing model where combinational inputs and outputs are connected to registers, these constraints may be too conservative. According to the architectural clocking scheme currently used in AQFP sequential circuits, registers generally hold their values throughout the architectural clock cycle, and their outputs can be taken by the next-stage combinational circuit multiple times. In other words, the same computation is repeated in waves in an AQFP combinational circuit. With a careful analysis, we argue that it is not always necessary to balance all paths to equal length. Instead, aligning the gate-level clock phases is enough.
In this work, we provide an overview of AQFP sequential circuit design and discuss how architectural clocking and register design affect the technology constraints. We argue that the commonly adopted constraint formulation is sometimes too conservative and propose relaxations to the constraints. Consequently, we also investigate how the relaxation of constraints affects the number of buffers needed, and discuss possible tradeoffs when the constraints are relaxed. Our contributions are threefold.
1) We re-examine the formulation of AQFP technology constraints and propose possible relaxations of these constraints: phase alignment instead of path balancing, and flexibility in the splitting capacity and phases of combinational inputs. We also discuss a potential issue with clock skew and the tradeoff of adopting relaxed constraints.
2) We implement and open source the first buffer-insertion framework that considers detailed and realistic constraints and possible relaxations. The framework is parameterized for easy customization of the constraint specification.
3) We investigate the influence of technology constraints on JJ count. Using the relaxed constraints, an average of 73% of buffers can be saved. This observation can help scale up AQFP circuits, which were previously bottlenecked by the large number of buffers.

II. ADIABATIC QUANTUM-FLUX-PARAMETRON
In AQFP digital circuits, logic "0" and "1" are represented by different current directions of the same magnitude, instead of low and high voltages as in CMOS. The basic circuit components in AQFP logic are the buffer cell and the branch cell. A majority-3 (MAJ3) logic gate can be constructed by combining three buffer cells with a reverted branch cell (i.e., a 3-to-1 merger). Other preliminary logic gates, such as the AND2 and OR2 gates, can be built from the MAJ3 gate with a constant input (constant 0 for AND2 and constant 1 for OR2) made of an asymmetric buffer cell. Input negation of logic gates is realized using a negative mutual inductance and incurs no extra cost [7]. The commonly used cost metric for AQFP circuits is the JJ count. A buffer costs 2 JJs, a branch cell has zero JJ cost, and a logic gate based on MAJ3 costs 6 JJs [7].
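The cost model above reduces to a one-line formula. The following minimal Python sketch tallies JJs from gate and buffer counts; the function name and the example counts are illustrative assumptions, not figures taken from this article.

```python
JJ_PER_BUFFER = 2  # buffer cell (a splitter is buffer-based and costs the same)
JJ_PER_GATE = 6    # any MAJ3-based logic gate (MAJ3, AND2, OR2)


def jj_count(num_gates: int, num_buffers: int) -> int:
    """JJ count of a network; branch cells and input negations are free."""
    return JJ_PER_GATE * num_gates + JJ_PER_BUFFER * num_buffers


# Hypothetical network with 7 gates and 16 buffers/splitters.
print(jj_count(num_gates=7, num_buffers=16))  # 74
```

Because every buffer adds 2 JJs while contributing no logic, the buffer count directly drives the total once it exceeds a few times the gate count.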

A. Gate-Level Clocking Schemes
Logic gates, buffers, and splitters in AQFP are periodically activated and reset by an alternating excitation current [1]. A gate takes its inputs, computes, and provides its output in the presence of the excitation current. In the absence of the excitation current, an AQFP gate produces no output current (i.e., neither logic "0" nor logic "1"). Thus, two cascaded gates must be fed with consecutive clocking phases, where the capturing gate is activated later than, but overlapping with, the activation of the launching gate, such that the information can be propagated along the circuit. We call the capturing gate a fanout of the launching gate, and the launching gate a fanin of the capturing gate.
Various clocking schemes have been proposed. 3-phase clocking was used in earlier works [1], [7], [8], where three excitation currents with a phase shift of 120° to each other are fed into different levels of gates. A few years later, 4-phase clocking was proposed [9] and has remained the most commonly used clocking scheme until now. In 4-phase clocking, the phase shift decreases to 90°, the number of alternating current sources decreases to 2, and the number of clocking phases in each clock cycle increases to 4, allowing for slightly lower latencies by enabling a logical depth of 4 gates instead of 3 per cycle. In both 3- and 4-phase clocking, logic gates in each level are assigned to one of the three or four phases, and phase synchronization must be ensured: Any fanin of a gate g must be at the phase preceding that of g.

TABLE I TRUTH TABLE OF D-LATCH AND NDRO
Another clocking scheme is delay-line clocking [10], where a single alternating excitation current is used and transmission lines are inserted between levels to delay the clock. Delay-line clocking not only allows for even lower latency but also enables the phase-skipping operation [11], [12], reducing the number of path-balancing buffers.
In this article, we use $p_{\mathrm{clk}}$ to denote the number of phases in a (gate-level) clock cycle. Typically, $p_{\mathrm{clk}} = 3$ or $4$.

B. Memory Devices and Architectural Clocking
To implement sequential circuits using a finite-state-machine model similar to that of CMOS digital systems, AQFP memory devices are needed. At least two possible designs have been proposed in the literature: 1) D-latch and 2) QFPL-based NDRO.
A simplified AQFP feedback delay latch (D-latch) is depicted in Fig. 1, where the 4-phase clocking scheme is used. A D-latch takes an Enable signal E and a Data signal D as inputs. Its operation is illustrated by the truth table shown in Table I. When E = 0, the majority gate has input values $(0, 1, Q_n)$, thus keeping the same internal state $Q_{n+1} = Q_n$; when E = 1, the majority gate has input values $(D, D, Q_n)$; thus, the internal state is overwritten by the new data D [13].
A quantum-flux-parametron latch (QFPL) is a special AQFP gate that can hold its state when the excitation current is low. The internal state of a QFPL is updated only when its two inputs A and B present the same value; otherwise, it keeps the previous state. Combining a QFPL and some logic gates, a nondestructive-read-out (NDRO) can be made, as shown in Fig. 2. An NDRO also takes an Enable signal E and a Data signal D as inputs and has the same truth table as in Table I. When E = 0, we have A = 0 and B = 1; thus, the QFPL holds its previous state; when E = 1, then A = B = D and the new data is written into the QFPL [14].
For a D-latch, an update to the state, caused by a new value at the input D enabled with E = 1, is propagated through the circuit and changes the output $Q_{n+1}$ 4 phases later. In contrast, an update to the state of an NDRO is available at the output 3 phases later.
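The behavior of both memory devices can be checked with a small functional model. The sketch below is an assumption-laden illustration: the text only states which values reach the majority gate (or the QFPL) for each value of E, so the gating logic in `gated_inputs` is one hypothetical realization, and the 3- or 4-phase output delay is ignored.

```python
from itertools import product


def maj3(a: int, b: int, c: int) -> int:
    """Majority of three bits."""
    return 1 if a + b + c >= 2 else 0


def gated_inputs(e: int, d: int) -> tuple:
    """Assumed input logic: yields (0, 1) when E = 0 and (D, D) when E = 1."""
    return d & e, d | (1 - e)


def d_latch_next(e: int, d: int, q: int) -> int:
    """Next state of the D-latch: MAJ3 over the gated inputs and the fed-back state."""
    a, b = gated_inputs(e, d)
    return maj3(a, b, q)


def qfpl_next(a: int, b: int, q: int) -> int:
    """QFPL update rule: overwrite the state only when both inputs agree."""
    return a if a == b else q


def ndro_next(e: int, d: int, q: int) -> int:
    """Next state of the NDRO built around a QFPL."""
    a, b = gated_inputs(e, d)
    return qfpl_next(a, b, q)


# Both devices hold Q_n when E = 0 and store D when E = 1.
for e, d, q in product((0, 1), repeat=3):
    assert d_latch_next(e, d, q) == ndro_next(e, d, q) == (d if e else q)
    print(f"E={e} D={d} Q_n={q} -> Q_n+1={d_latch_next(e, d, q)}")
```

The loop enumerates all eight input combinations, which is the complete truth table shared by the two devices.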
In a classical sequential circuit model, the data D inputs of registers come from the outputs of the previous-stage combinational circuit, the outputs Q of registers are connected to the inputs of the next-stage combinational circuit, and the enable E input of registers comes from an architectural clock (in contrast to the gate-level clock discussed in Section II-A). In the CMOS paradigm, the enabling signal of registers is the rising edge or falling edge of a periodic clock signal. In contrast, in AQFP, the enabling signal E is kept at 0 most of the time and becomes 1 once every k gate-level clock cycles, where the value k depends on the length of the critical combinational path. In this article, we denote the number of phases in an architectural clock cycle by $p_{\mathrm{arch}} = k \cdot p_{\mathrm{clk}}$.

A. Terminology
This work mainly focuses on the network-level abstraction of an AQFP circuit. A logic network is a directed acyclic graph (DAG) defined by a pair $(V, E)$ of a set $V$ of nodes and a set $E$ of directed edges. The node set $V = I \cup O \cup G$ is disjointly composed of a set $I$ of PIs, a set $O$ of primary outputs (POs), and a set $G$ of (logic) gates chosen from an AQFP technology library (e.g., composed of AND2, OR2, and MAJ3 with optional input negation). Each PI has in-degree 0 and unbounded out-degree, whereas each PO has in-degree 1 and out-degree 0. The out-degree of each gate is unbounded and the in-degree is a fixed number depending on the type of the gate. This abstraction models the combinational part of digital circuits. In practice, PIs of a logic network are often provided by the register outputs of the previous sequential stage and POs are connected to the register inputs of the next stage.
After synthesizing and optimizing a logic network using some logic synthesis algorithms, buffers and splitters need to be inserted to fulfill technology constraints, producing a buffered network. A buffered network $N' = (V', E')$ is a DAG extended from a logic network $N = (V, E)$. In particular, the node set $V' = V \cup B$ is supplemented with a set $B$ of buffers. A buffer node models an AQFP buffer cell (when having out-degree 1) or an AQFP splitter cell (when having out-degree larger than 1) and always has in-degree 1.
When not specified, a network may be a logic network or a buffered network. In addition, a schedule can be defined for a network. A schedule of a network $N = (V, E)$ is a function $l : V \to \mathbb{Z}_{\geq 0}$ that assigns a nonnegative integer $l(n)$ to each node $n \in V$, called the level of $n$. The depth of a network $N$ with a PO set $O$ and associated with a schedule $l$ is defined as

$$d(N) = \max_{o \in O} l(o).$$

We define the following properties for a buffered network $N' = (V', E')$ and its associated schedule $l$ related to the AQFP technology constraints, subject to some parameters: The splitting capacities $s_i = 1$, $s_g = 1$, and $s_b \geq 1$ of PIs, gates, and buffers, respectively, are the maximum number of fanouts each type of node may have. The clocking scheme $p_{\mathrm{clk}}$ is the number of phases in a gate-level clock cycle. $\Phi_{\mathrm{ro}}$ is the set of phase differences, relative to its input phase, at which a register may produce its output.
1) Path balancing: $N'$ is path-balanced if

$$\forall (n_1, n_2) \in E' : l(n_1) = l(n_2) - 1, \tag{1}$$
$$\forall i \in I : l(i) = 0, \text{ and} \tag{2}$$
$$\forall o \in O : l(o) = d(N'). \tag{3}$$

2) Phase alignment: $N'$ is phase-aligned if

$$\forall (n_1, n_2) \in E' : l(n_1) < l(n_2) \text{ and } l(n_1) \equiv l(n_2) - 1 \pmod{p_{\mathrm{clk}}}, \tag{4}$$
$$\forall i \in I : \exists \varphi \in \Phi_{\mathrm{ro}} : l(i) \equiv \varphi \pmod{p_{\mathrm{clk}}}, \text{ and} \tag{5}$$
$$\forall o \in O : l(o) \bmod p_{\mathrm{clk}} = 0. \tag{6}$$

3) Fanout branching: $N'$ is properly branched if every PI has an out-degree no larger than $s_i$, every gate has an out-degree no larger than $s_g$, and every buffer has an out-degree no larger than $s_b$.
In the following, we discuss which subset of these properties shall be required as AQFP technology constraints and the values of the parameters involved.
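The three properties can be turned into simple checkers. The Python sketch below is only an illustration under a stated reading of the definitions: path balancing requires every edge to advance exactly one level, PIs to sit at level 0, and POs at the network depth; phase alignment requires every edge to advance one level modulo $p_{\mathrm{clk}}$, PI levels to match a phase in $\Phi_{\mathrm{ro}}$ modulo $p_{\mathrm{clk}}$, and PO levels to be multiples of $p_{\mathrm{clk}}$. The class and function names are hypothetical and do not come from any library.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BufferedNetwork:
    levels: dict                               # node -> level l(n)
    edges: list                                # (driver, sink) pairs
    pis: set                                   # primary inputs I
    pos: set                                   # primary outputs O
    buffers: set = field(default_factory=set)  # buffer/splitter nodes B


def depth(net):
    return max(net.levels[o] for o in net.pos)


def is_path_balanced(net):
    l = net.levels
    return (all(l[a] == l[b] - 1 for a, b in net.edges)
            and all(l[i] == 0 for i in net.pis)
            and all(l[o] == depth(net) for o in net.pos))


def is_phase_aligned(net, p_clk, phi_ro):
    l = net.levels
    pi_phases = {phi % p_clk for phi in phi_ro}
    return (all(l[a] < l[b] and (l[b] - 1 - l[a]) % p_clk == 0 for a, b in net.edges)
            and all(l[i] % p_clk in pi_phases for i in net.pis)
            and all(l[o] % p_clk == 0 for o in net.pos))


def is_properly_branched(net, s_i, s_g, s_b):
    fanout = Counter(a for a, _ in net.edges)

    def capacity(n):
        if n in net.pis:
            return s_i
        return s_b if n in net.buffers else s_g

    return all(count <= capacity(n) for n, count in fanout.items())


# PI "c" feeds gate g2 directly, skipping one full 4-phase clock cycle.
net = BufferedNetwork(
    levels={"a": 0, "b": 0, "c": 0, "g1": 1, "u1": 2, "u2": 3, "u3": 4,
            "g2": 5, "v1": 6, "v2": 7, "o": 8},
    edges=[("a", "g1"), ("b", "g1"), ("g1", "u1"), ("u1", "u2"), ("u2", "u3"),
           ("u3", "g2"), ("c", "g2"), ("g2", "v1"), ("v1", "v2"), ("v2", "o")],
    pis={"a", "b", "c"}, pos={"o"}, buffers={"u1", "u2", "u3", "v1", "v2"},
)
print(is_path_balanced(net))                           # False
print(is_phase_aligned(net, p_clk=4, phi_ro={0}))      # True
print(is_properly_branched(net, s_i=1, s_g=1, s_b=3))  # True
```

In this toy network, path balancing would require four extra buffers between `c` and `g2`, whereas phase alignment accepts the direct connection because the skipped distance is a multiple of the clock cycle.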

B. Phase Alignment Instead of Path Balancing
Existing works on AQFP sequential architectural design [14], [15], logic synthesis [2], [16], [17], [18], [19], [20], and buffer insertion and optimization [3], [4], [5], [6] conventionally adopt a more conservative set of constraints: path balancing and fanout branching. Notice that fulfilling path balancing, with an additional constraint that $d(N) \bmod p_{\mathrm{clk}} = 0$, implies fulfilling phase alignment with $\Phi_{\mathrm{ro}} = \{0\}$. While this ensures correct and robust operation of the AQFP circuit even with fast clock frequencies, enforcing these constraints often leads to bulky circuits in which more than half, and sometimes up to 90%, of the area is taken by buffers. In this section, we argue that in the context of synthesizing combinational logic between register stages, assuring phase alignment, instead of the stronger path-balancing constraint, is enough.
In [14], Saito et al. proposed that when registers in several sequential stages share the same enable signal, which arrives once per $p_{\mathrm{arch}}$ phases matching the depth of the deepest stage, shallower stages do not need to be balanced to the same length as the deepest stage. The main reason is that memory devices output their value every $p_{\mathrm{clk}}$ phases and do not change their internal state for the entire architectural clock cycle until the enable signal arrives. Thus, although shallower stages finish their computation before the registers are enabled to take the next values, the same computation is repeated every (gate-level) clock cycle, and the same computational results are produced repeatedly until the registers are enabled again to accept them.
With similar reasoning, we extend the argument further and propose that the path-balancing constraint can be relaxed to phase alignment, formally stated as follows.
Proposition 1: In an AQFP sequential circuit, let $d$ be the longest path length between any two register stages, and let $\varphi_{\mathrm{ro}}$ be the phase difference between the register output $Q_{n+1}$ and the inputs D, E. Suppose that the register enable signal E is 1 for one phase in every $p_{\mathrm{arch}} = k \cdot p_{\mathrm{clk}}$ phases, where $p_{\mathrm{arch}} \geq \varphi_{\mathrm{ro}} + d$. Then fulfilling the phase-alignment constraint (4)-(6), in addition to fanout branching, is enough to ensure correct sequential operation of the circuit.
Proof: Without loss of generality, consider the computation propagated from one register stage $I$, through a combinational circuit $N$, to the next register stage $O$, in one architectural clock cycle. Suppose that E = 1 at time $t = 0$ and at time $t = p_{\mathrm{arch}}$ (the unit of time is the number of phases) and that E = 0 at all other times. Let the (multi-input, multi-output) Boolean function computed by $N$ be $f_N$ and let the values presented at the outputs of registers $I$ at time $t = \varphi_{\mathrm{ro}}$ be $\vec{x}$. We will prove that the values presented at the inputs of registers $O$ at time $t = p_{\mathrm{arch}}$ are exactly $f_N(\vec{x})$.
First, observe that the same $\vec{x}$ is produced at $I$ every $p_{\mathrm{clk}}$ phases until (excluding) $t = p_{\mathrm{arch}} + \varphi_{\mathrm{ro}}$, i.e., at

$$t = \varphi_{\mathrm{ro}} + j \cdot p_{\mathrm{clk}}, \quad j \in \{0, 1, \ldots, k - 1\}.$$

Comparing these times against (5), we conclude that for every combinational input $i$, its value is ready at time $t = l(i)$ corresponding to its assigned level, as well as every $p_{\mathrm{clk}}$ phases afterward.
Next, consider a gate $n$ with two fanins $n_{i1}$ and $n_{i2}$ and suppose that the values of $n_{i1}$ and $n_{i2}$ are ready at times corresponding to their assigned levels, as well as every $p_{\mathrm{clk}}$ phases after these times, i.e., $t = l(n_{i1}) + j \cdot p_{\mathrm{clk}}$ and $t = l(n_{i2}) + j \cdot p_{\mathrm{clk}}$, respectively, where $j \in \mathbb{Z}_{\geq 0}$. By (4), we know that at time $t = l(n) - 1$, both fanins of $n$ provide their correct values; thus, $n$ computes its correct value at time $t = l(n)$. Moreover, as $n_{i1}$ and $n_{i2}$ produce the same values every $p_{\mathrm{clk}}$ phases, the same correct computation also repeats every $p_{\mathrm{clk}}$ phases from $t = l(n)$ onward. Notice that this argument does not require $l(n_{i1})$ and $l(n_{i2})$ to be equal. By induction, we conclude that all gates compute and produce the correct value from the time corresponding to their assigned levels and every $p_{\mathrm{clk}}$ phases afterward. Notice that in this analysis, the requirements for the architectural clock period, $p_{\mathrm{arch}} = k \cdot p_{\mathrm{clk}}$ and $p_{\mathrm{arch}} \geq \varphi_{\mathrm{ro}} + d$, must hold regardless of whether path-balancing or phase-alignment constraints are adopted. In other words, the proposed relaxation does not affect architectural clock frequency or latency.
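The induction step can be illustrated with an idealized phase-by-phase simulation. This is a sketch under simplifying assumptions that are not part of the article: a node is excited whenever the time is congruent to its level modulo $p_{\mathrm{clk}}$, its output is valid only during that phase, a fanout reads it exactly one phase later, and register outputs repeat the same value throughout the simulated window. All names are hypothetical.

```python
from itertools import product


def simulate(levels, fanins, funcs, pi_values, p_clk, t_end):
    """Return {node: {t: value}} listing the phases at which each node holds a valid value."""
    out = {n: {} for n in levels}
    for t in range(t_end + 1):
        for n, lvl in levels.items():
            if (t - lvl) % p_clk != 0:
                continue                     # n is not excited at phase t
            if n in pi_values:               # register output, repeats the same value
                if t >= lvl:
                    out[n][t] = pi_values[n]
            else:
                ins = [out[f].get(t - 1) for f in fanins[n]]
                if None not in ins:          # all fanins were valid one phase earlier
                    out[n][t] = funcs[n](*ins)
    return out


# g2 has fanins at levels 4 (u3) and 0 (c): phase-aligned but not path-balanced.
levels = {"a": 0, "b": 0, "c": 0, "g1": 1, "u1": 2, "u2": 3, "u3": 4, "g2": 5}
fanins = {"g1": ["a", "b"], "u1": ["g1"], "u2": ["u1"], "u3": ["u2"], "g2": ["u3", "c"]}
buf = lambda x: x
funcs = {"g1": lambda x, y: x & y, "u1": buf, "u2": buf, "u3": buf,
         "g2": lambda x, y: x | y}

for a, b, c in product((0, 1), repeat=3):
    out = simulate(levels, fanins, funcs, {"a": a, "b": b, "c": c}, p_clk=4, t_end=13)
    assert sorted(out["g2"]) == [5, 9, 13]           # valid at l(g2) and every p_clk after
    assert set(out["g2"].values()) == {(a & b) | c}  # always the correct function value
```

Gate `g2` is also excited at $t = 1$, but its fanin `u3` has no valid value yet, so no output is recorded; from $t = l(g2) = 5$ onward, the correct value repeats every four phases even though the two fanins sit at different levels.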

C. PI Capacity and Phases
Based on the conventional D-latch shown in Fig. 1, which adopts the 4-phase clocking scheme, we modify the design as shown in Fig. 3 to demonstrate that memory devices can have an output capacity larger than 1 and can make their output signal available at multiple phases. In Fig. 3, buffers are replaced by splitters to drive up to $s_b - 1$ fanouts at various phases, not only phase 4. Adopting such D-latches as registers in a sequential circuit, PIs of the combinational network now have a splitting capacity $s_i = s_b - 1$ (where $s_b$ is usually 3 or 4) instead of 1.
With the modified D-latch design in Fig. 3, instead of $\Phi_{\mathrm{ro}} = \{4\}$ as with the D-latches in Fig. 1, we may use $\Phi_{\mathrm{ro}} = \{3, 4, 5\}$ for a more relaxed phase-alignment requirement because register outputs can be provided at various phases in the feedback loop of the D-latch.

D. Consideration of Clock Skews
The analysis above assumes an ideal clock with zero clock skew. However, in real circuits, clock skews may arise when the clock signal travels along many logic levels. In other words, a gate receiving a phase-1 clock close to the clock source may be activated earlier than another gate that also receives a phase-1 clock but is farther away from the clock source. This difference in clock timing is called clock skew. One typical superconductor electronics process used to manufacture AQFP circuits is the National Institute of Advanced Industrial Science and Technology (AIST) 10 kA cm$^{-2}$ Nb four-layer high-speed standard process (HSTP). In this process, microstriplines with a ground layer are used to deliver the ac power-clock signals to the AQFPs. To first order, the transport delay of a 5-µm-long microstripline in this process is approximately 6.20 ps mm$^{-1}$ [2]. This results in a nonzero clock skew that accumulates along the meandering power-clock network of the AQFPs [15]. With a nonzero clock skew, there is an upper limit on how many phases can be skipped without any buffer in between, in addition to the phase-alignment constraint.
For large AQFP circuit designs such as a microprocessor, a meandering power-clock network may span an entire chip, which is typically in the range of 5 × 5 mm to 10 × 10 mm in present-day superconductor fabrication processes. The accumulated skew at this scale is significant enough to produce timing errors at gigahertz-range operating frequencies. In this case, it is important to physically constrain the clock skew by using microwave power dividers [15] or microwave H-tree networks [21] to reduce the physical size of the local meandering microstripline power-clock networks and, thus, reduce the accumulated clock skew. Timing characterization of AQFP cells indicates that for 5-GHz sinusoidal clocks, data can still be successfully captured with a clock skew of up to 30 ps between the launching and capturing AQFP [2], [22], [23]. This provides a nominal baseline target for how the power-clock network should be designed, and it also provides an upper limit on how much phase skipping can be tolerated.
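The two figures quoted above give a rough budget for the power-clock line. The sketch below assumes, as a simplification not stated in the article, that skew grows linearly with the length of line separating the launching and capturing gates; the function names are hypothetical.

```python
DELAY_PS_PER_MM = 6.20  # first-order microstripline delay quoted above [2]
MAX_SKEW_PS = 30.0      # tolerated launch-to-capture skew at 5 GHz quoted above


def accumulated_skew_ps(line_length_mm: float) -> float:
    """Skew between two gates separated by this much power-clock line (linear model)."""
    return DELAY_PS_PER_MM * line_length_mm


def max_line_length_mm(max_skew_ps: float = MAX_SKEW_PS) -> float:
    """Longest line segment between launching and capturing gates within the skew budget."""
    return max_skew_ps / DELAY_PS_PER_MM


print(f"{accumulated_skew_ps(10.0):.1f} ps")  # 62.0 ps
print(f"{max_line_length_mm():.2f} mm")       # 4.84 mm
```

Under this linear model, a single traversal of a 10-mm chip edge already exceeds the 30-ps budget, which is consistent with the need for power dividers or H-tree networks at that scale.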

IV. BUFFER/SPLITTER INSERTION AND OPTIMIZATION
Research on the problem of buffer and splitter insertion and optimization for AQFP circuits has gained more interest in recent years. While it is possible to consider buffer and splitter cost early in logic synthesis, together with logic restructuring [18], [19], many works simplify the problem and provide fairer comparisons by considering only the problem of inserting the fewest buffers and splitters into a logic network, without logic restructuring, to fulfill the technology constraints (the buffer insertion problem). Starting as a postprocessing step with relatively lightweight optimization [2], [16], the buffer insertion problem has been identified as a scheduling problem in [4]. As finding the optimal schedule in terms of buffer count is likely an NP-hard problem due to the interplay between buffers and splitters, various heuristics have been proposed and improvements were made rapidly [3], [4], [5], [6]. All of these works assume the more conservative constraints, i.e., path balancing and fanout branching.
In this section, we first briefly introduce the buffer insertion algorithms this work is based on and then present an adapted framework considering the relaxed constraints, i.e., phase alignment and fanout branching, as discussed in Section III.

A. Related Works
1) Irredundant Local Insertion: In [4], a local insertion algorithm is proposed, which constructs the minimal buffer tree at the fanout of a gate g, given the level assignment of its fanout gates FO(g). The algorithm runs in linear time and is optimal subject to a given schedule.
2) Depth-Optimal Scheduling: [5] proposes to leverage the local insertion algorithm to determine the latest possible level for the root gate g, and to run this algorithm in an as-late-as-possible (ALAP) fashion, i.e., for each gate in a reversed topological order from POs to PIs. It is formally proved that this yields a depth-optimal schedule. Intuitively, buffered networks with smaller depths likely also have fewer buffers.
3) Buffer Optimization by Chunked Movement: The optimality guarantee given by the local insertion algorithm is subject to a given schedule. In other words, it is possible to optimize a schedule for a smaller buffer count. Thus, given an initial schedule, the chunked movement algorithm proposed in [4] finds chunks of gates and buffers that are tightly connected and tries to move them together to reduce the number of buffers.
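To make the local insertion step in 1) more concrete, the following Python sketch counts the buffers and splitters in a minimal tree at the fanout of one gate under path balancing. It is a simplified counting illustration written for this text, not the implementation of [4]: it works level by level from the deepest fanout toward the gate, and at each level places just enough buffers to drive everything one level above. The function name and signature are hypothetical.

```python
from collections import Counter
from math import ceil


def count_buffer_tree(gate_level, fanout_levels, s_b, s_g=1):
    """Buffers/splitters needed at the fanout of one gate, or None if the schedule is infeasible."""
    if not fanout_levels:
        return 0
    if min(fanout_levels) <= gate_level:
        return None                      # a fanout must sit above its driver
    sinks_at = Counter(fanout_levels)    # number of fanout gates per level
    bufs_above = 0                       # buffers placed one level above the current one
    total = 0
    for level in range(max(fanout_levels) - 1, gate_level, -1):
        to_drive = sinks_at[level + 1] + bufs_above
        bufs_above = ceil(to_drive / s_b)  # each buffer drives at most s_b nodes
        total += bufs_above
    # The gate itself must be able to drive whatever remains directly above it.
    if sinks_at[gate_level + 1] + bufs_above > s_g:
        return None
    return total


print(count_buffer_tree(0, [2, 2, 3], s_b=3))     # 2
print(count_buffer_tree(0, [1, 1], s_b=3))        # None
print(count_buffer_tree(0, [3, 3, 3, 3], s_b=3))  # 3
```

The second call returns `None` because a gate with $s_g = 1$ cannot drive two fanouts at the very next level; a scheduler would respond by moving the fanouts later or the gate earlier.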

B. Adapted Framework Considering Relaxed Constraints
We integrated the three algorithms mentioned in Section IV-A in a buffer insertion framework. The implementation is available in the open-source logic synthesis library mockturtle [24].
The overall buffer insertion flow combines the state-of-the-art algorithms. First, an initial schedule is obtained using depth-optimal scheduling, and the minimum number of buffers needed for this schedule is counted using the irredundant local insertion algorithm. Then, the schedule is optimized for a smaller buffer count using the chunked movement algorithm, until no more improvement can be made.
To experiment with different formulations of the technology constraints, we adapted the algorithms to support customizable parameters involved in the constraints. These parameters include the following.
1) Buffer's splitting capacity $s_b$: The maximum out-degree of buffers. This is the same as in previous works.
2) PI's splitting capacity $s_i$: The maximum out-degree of PIs. $s_i$ was fixed to 1 in previous works. However, as discussed in Section III-C, it is possible to have $s_i = s_b - 1$. Thus, we make this an integer parameter to be specified by the user.
3) A flag to switch between path balancing (1)-(3) and phase alignment (4)-(6): If phase alignment is adopted, modifications in the algorithms are made. First, levels of PIs and POs are no longer fixed in either scheduling or chunked movement. Also, the chunked movement algorithm is modified to include PIs and POs in chunks and to try to move them. Special care is taken to ensure that PIs and POs are always assigned to a legal phase with respect to $p_{\mathrm{clk}}$ and $\Phi_{\mathrm{ro}}$. Finally, chains of single-fanout buffers whose length is a multiple of $p_{\mathrm{clk}}$ are removed in a postprocessing step.
4) Number of phases in a gate-level clock cycle $p_{\mathrm{clk}}$: When adopting path balancing, as in previous works, this parameter is not relevant. However, when relaxing path balancing to phase alignment, $p_{\mathrm{clk}}$ is involved in the constraints.
5) Possible phase differences between register input and output $\Phi_{\mathrm{ro}}$: The set of phases at which PIs are allowed to be scheduled (5). In previous works, PIs can only be scheduled at level 0 (2).
6) If clock skew is of concern, as discussed in Section III-D, a user-specified maximum phase skip is enforced on every unbalanced path.
By default, our framework considers phase-alignment constraints and uses parameters $s_b = 3$, $s_i = 2$, $p_{\mathrm{clk}} = 4$, and $\Phi_{\mathrm{ro}} = \{3, 4, 5\}$. This setting is expected to be the most realistic and to result in the smallest buffered networks.
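The parameter list and the chain-removal postprocessing step can be mirrored in a few lines of Python. This is a hypothetical sketch, not the interface of the mockturtle implementation: the class and field names are invented for illustration, the defaults follow the setting stated above, and two assumptions are made, namely that whole multiples of $p_{\mathrm{clk}}$ may also be removed from longer chains and that the phase skip of an edge equals the number of buffers removed from it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InsertionParams:
    """Hypothetical parameter bundle; defaults follow the setting stated above."""
    s_b: int = 3
    s_i: int = 2
    p_clk: int = 4
    phi_ro: frozenset = frozenset({3, 4, 5})
    phase_alignment: bool = True
    max_phase_skip: Optional[int] = None  # None: no clock-skew limit


def buffers_left_in_chain(chain_length: int, params: InsertionParams) -> int:
    """Buffers remaining in a single-fanout chain after the postprocessing step."""
    if not params.phase_alignment:
        return chain_length               # path balancing keeps every buffer
    removable = (chain_length // params.p_clk) * params.p_clk
    if params.max_phase_skip is not None:
        # Assumed: skipped phases must not exceed the user-specified limit.
        limit = (params.max_phase_skip // params.p_clk) * params.p_clk
        removable = min(removable, limit)
    return chain_length - removable


print(buffers_left_in_chain(4, InsertionParams()))                       # 0
print(buffers_left_in_chain(6, InsertionParams()))                       # 2
print(buffers_left_in_chain(9, InsertionParams(max_phase_skip=4)))       # 5
print(buffers_left_in_chain(9, InsertionParams(phase_alignment=False)))  # 9
```

Keeping the flag and the skip limit in one immutable object makes it straightforward to rerun the same network under several constraint formulations and compare buffer counts.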

V. IMPACT OF TECHNOLOGY CONSTRAINTS ON JJ COUNT
In this section, we demonstrate the impact of the proposed relaxation of technology constraints on the number of buffers and, thus, on the JJ count of an AQFP circuit. First, a small example circuit is presented, for which the optimum can be easily derived. Then, experimental results comparing different constraint formulations are listed.

A. Motivational Example
We use a 1-b full adder circuit as an example. In Fig. 4, PIs are at the bottom and POs on top; ellipse nodes are MAJ gates whose constant inputs are omitted for simplicity (i.e., AND gates or OR gates) and negated fanins are dashed; and square blue and red nodes are buffers and splitters, respectively.
The buffered network when adopting conventional constraints (path balancing and fanout branching, $s_i = 1$) is shown in Fig. 4(a), which is the optimal insertion with 16 buffers already shown in state-of-the-art works [4]. If $s_i$ is increased to 2 as discussed in Section III-C, splitters at the first level are no longer needed, decreasing the network depth by 1 and reducing the number of buffers to 13, as shown in Fig. 4(b).
Moreover, as discussed in Section III-B, when enforcing the phase-alignment constraint instead of path balancing, the number of buffers further reduces to 5, which is less than a third of that in the initial buffered network. This adjustment is done in two steps as described in Section IV-B. First, relaxing the constraints on PIs and POs [(5), (6) instead of (2), (3)] results in Fig. 4(c) with 9 buffers. Then, removing buffer chains [(4) instead of (1)] saves four more buffers.
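The buffer counts of this example can be tallied step by step. The JJ saving in the last line is a derived figure, obtained with the 2-JJ buffer cost from Section II.

```latex
\begin{align*}
\text{Fig. 4(a)} \to \text{Fig. 4(b)}: &\quad 16 - 13 = 3 \text{ buffers saved by } s_i = 2,\\
\text{Fig. 4(b)} \to \text{Fig. 4(c)}: &\quad 13 - 9 = 4 \text{ buffers saved by relaxing PI and PO levels},\\
\text{Chain removal}: &\quad 9 - 4 = 5 \text{ buffers remaining},\\
\text{Overall}: &\quad \tfrac{5}{16} \approx 0.31 < \tfrac{1}{3}, \qquad
2 \cdot (16 - 5) = 22 \text{ JJs saved}.
\end{align*}
```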

B. Experimental Results on Constraint Relaxation
Table II shows the experimental results on the commonly used benchmark suite consisting of ISCAS benchmarks and some arithmetic circuits. Five sets of constraints are presented and compared. To have a fair comparison, all of them use $s_b = 3$ and $p_{\mathrm{clk}} = 4$, and the buffered networks are obtained using our adapted buffer insertion framework described in Section IV-B. Columns "#Bufs." list the number of buffers in the (optimized) buffered networks, columns "#JJs" list the JJ count of the buffered networks (obtained by #JJs = 6 · #Gates + 2 · #Bufs.), columns "ΔB" and "ΔJJ" list the reduction in buffer count and JJ count, respectively, and column "MPS" lists the maximum phase skip. "Baseline" is the most conservative constraint used in related works [2], [3], [4], [5], [6], [16], i.e., path balancing and fanout branching, plus an additional but realistic constraint that the network depth must be a multiple of $p_{\mathrm{clk}} = 4$ (footnote 5). Column "A" uses the improved D-latch design discussed in Section III-C but still adopts path balancing. In contrast, column "B" still uses the classical register design but does not balance PIs and POs. Column "A+B" combines both improvements. Finally, column "Best" further removes buffer chains in "A+B," shifting from path balancing to phase alignment and achieving the best constraint relaxation proposed in this article.
We observe from this experiment that considering phase alignment instead of path balancing removes about 70% of the buffers in AQFP circuits: about 40% are buffers balancing PIs and POs, and the other 30% are chains of buffers within the network.

C. Experimental Results Using Larger Benchmarks
Table III shows the results of a similar experiment on the EPFL benchmark suite [25], which consists of benchmarks up to 100× larger than those in the previous section. For the sake of simplicity, only the settings corresponding to columns "Baseline" and "Best" in Table II are shown. The number of buffers ("#Buf.") and the buffer-to-gate ratio ("$R_{\mathrm{B\text{-}G}}$," the number of buffers divided by the number of gates) are listed for the two settings, as well as the reduction percentage of buffer count after relaxation ("ΔB").
TABLE III EXPERIMENTAL RESULTS ON EPFL BENCHMARKS

Footnote 5: Many related works do not impose this constraint, although it is necessary. Enforcing this constraint adds about 1.7% buffers on this benchmark suite.

It can be observed that many benchmarks have a high buffer-to-gate ratio when adopting the conventional conservative constraints, especially the arithmetic circuits (upper half). This is likely due to the imbalanced nature of these circuits. By relaxing the path-balancing constraint to phase alignment, a large portion of path-balancing buffers is eliminated, drastically reducing the number of buffers and making the buffer-to-gate ratio more reasonable. Take the adder benchmark as an example: with merely 384 gates in the original network, state-of-the-art buffer insertion algorithms adopting conservative constraints need to insert around 50k buffers to balance every path, 130× the number of gates. Most of the JJs in the circuit, and most of the energy dissipation, are wasted on these buffers. The resulting bulky buffered network also makes the subsequent physical design and fabrication steps difficult. However, simply by relaxing the constraints to phase alignment, only about 1.9k buffers are actually needed, reducing the buffer count by 96%.
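The adder figures quoted above can be cross-checked with a short calculation. The sketch below uses the rounded buffer counts from the text ("around 50k" and "about 1.9k"), so the outputs are approximate; the function names are illustrative.

```python
def buffer_to_gate_ratio(num_buffers: int, num_gates: int) -> float:
    """Number of buffers divided by the number of gates."""
    return num_buffers / num_gates


def reduction_percent(before: int, after: int) -> float:
    """Relative reduction in buffer count, in percent."""
    return 100.0 * (before - after) / before


gates = 384
baseline, relaxed = 50_000, 1_900  # rounded counts quoted in the text

print(round(buffer_to_gate_ratio(baseline, gates)))    # 130
print(round(buffer_to_gate_ratio(relaxed, gates), 1))  # 4.9
print(round(reduction_percent(baseline, relaxed), 1))  # 96.2
```

With the relaxed constraints the ratio drops from roughly 130 buffers per gate to about 5, which is the sense in which the ratio becomes more reasonable.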

A. Tradeoff Between Throughput and Maximum Phase Skip
A disadvantage of replacing path balancing with phase alignment is that the possibility of wave-pipelining is lost. Wave-pipelining, or multithreaded gate-level pipelining, is a technique to increase throughput by propagating more than one computation in one (architectural) clock cycle, which has been researched for classical CMOS-based digital systems [26] as well as emerging technologies [27], [28]. One important requirement for a wave-pipelined system is path balancing, which makes AQFP circuits a natural candidate for this technique, although related research has not been published yet.
If an AQFP circuit is fully path-balanced, up to $k = p_{\mathrm{arch}} / p_{\mathrm{clk}}$ waves may be propagated between two register stages at the same time, increasing its throughput by k×. When phase alignment is adopted instead to reduce JJ count, a tradeoff between throughput and buffer count (and thus energy and area) arises. In such a case, the number of waves allowed is bounded by the maximum phase skip; conversely, given a desired throughput, a maximum allowed phase skip must be enforced, which can be achieved with our framework. Related work for the single-flux quantum (SFQ) technology family has been proposed [28], which uses ILP for scheduling and buffer insertion under similar constraints. However, for AQFP, because splitters are also clocked, this formulation cannot guarantee optimality and is also less scalable than our approach. Future AQFP circuit designers may choose path-balanced, wave-pipelined circuits for smaller components requiring higher throughput, and phase-aligned, nonpipelined circuits for larger parts consuming more energy.
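The two quantities in this tradeoff can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the phase skip of a connection is assumed here to be the number of schedule levels it jumps over, which may differ from the unit used by the framework. The exact bound linking the two quantities is not reproduced.

```python
def max_waves(p_arch: int, p_clk: int) -> int:
    """Upper bound k = p_arch / p_clk on concurrent waves in a fully
    path-balanced circuit (rounded down to a whole number of waves)."""
    if p_clk <= 0:
        raise ValueError("p_clk must be positive")
    return p_arch // p_clk


def max_phase_skip(level: dict[str, int], edges: list[tuple[str, str]]) -> int:
    """Largest skip over all connections; 0 means fully path-balanced.

    ASSUMPTION: the skip of a connection (u, v) is level[v] - level[u] - 1,
    i.e., the number of levels between driver and receiver.
    """
    return max((level[v] - level[u] - 1 for u, v in edges), default=0)


# Hypothetical schedule: g2 reads input b directly across four levels.
level = {"a": 0, "b": 0, "g1": 1, "g2": 5}
edges = [("a", "g1"), ("b", "g2"), ("g1", "g2")]

print(max_waves(16, 4))              # 4
print(max_phase_skip(level, edges))  # 4
```

A designer targeting a given throughput would cap the value returned by `max_phase_skip` during buffer insertion and accept the extra buffers this requires.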

B. N-Phase Clocking
Another buffer reduction method leveraging an n-phase clocking scheme has recently been proposed [11]. The basic idea is to multiply the number of phases in one (gate-level) clock cycle by an integer r, such that any chain of r buffers can be reduced to 1. The n-phase clocking technique is also very effective in reducing the number of buffers in AQFP circuits, but it does not diminish the value of this work. Instead of comparing against n-phase clocking, we argue that these are two independent techniques that may work in collaboration to achieve the best results. Using the terminology of our constraint formulation, n-phase clocking can be seen as using fractions instead of integers as the range of the schedule, i.e., a gate may be assigned to levels 1/r, 2/r, etc. The n-phase clocking technique relaxes the path-balancing constraint by changing the clocking scheme, whereas we develop our argument from analysis of the sequential circuit model. Thus, these two relaxations affect the constraints independently, and formally considering them together remains future work. Also, as both techniques have their own drawbacks, engineers may choose between the two depending on the application requirements.
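The fractional-schedule view can be illustrated with a short sketch. The function names are hypothetical, and the buffer count below is a literal, conservative reading of "any chain of r buffers can be reduced to 1": only complete groups of r buffers are collapsed, and leftover buffers are kept. The scheme in [11] may reduce chains differently.

```python
from fractions import Fraction


def fractional_levels(depth: int, r: int) -> list[Fraction]:
    """Schedule levels 1/r, 2/r, ..., depth obtained when each integer
    level is split into r steps."""
    return [Fraction(i, r) for i in range(1, depth * r + 1)]


def collapsed_chain_length(chain: int, r: int) -> int:
    """Buffers left after replacing each complete group of r chained
    buffers by a single buffer.

    ASSUMPTION: a leftover of fewer than r buffers is kept unchanged.
    """
    groups, leftover = divmod(chain, r)
    return groups + leftover


print(fractional_levels(1, 4))       # levels 1/4, 1/2, 3/4, 1
print(collapsed_chain_length(8, 4))  # 2
print(collapsed_chain_length(9, 4))  # 3
```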

C. Physical Design and Postphysical-Design Legalization
In this article, we propose to relax path-balancing constraints to phase alignment, which will have an impact on physical design because current tools generally expect a path-balanced netlist as their input. Although adapting a physical design tool accordingly to generate realistic layouts is beyond the scope of this article, Fig. 4 serves as a good visualization of how a real layout would appear. Moreover, to truly exploit the possible area reduction due to the lower buffer count, the placement algorithm needs to be adapted to allow circuit folding. That is, instead of placing logic gates scheduled at the same level in the same physical row and having as many rows as logic levels, some gates could be placed in different rows with empty slots because of phase skipping. However, this would affect wire lengths and clock synthesis, with additional physical and timing constraints to be carefully considered.
The real clock skew between two gates in an AQFP circuit depends not only on the number of phases in between but also on the microstripline length of the power-clock network between them [15]. Moreover, the interconnect delay of data signals and longer wire lengths must also be considered to ensure the correct operation of an AQFP circuit. If the physical distance between the launching and capturing gates is too long (> 0.7 mm for buffer-to-buffer connections), we may need to insert repeaters or use current boosters. However, these values are only available after physical design and are hard to predict during the buffer insertion stage. Thus, an estimate must be used in buffer insertion. More careful analysis and legalization, which may result in extra buffers being inserted, have to be done during or after physical design. Such overhead may occur in any AQFP synthesis flow, regardless of whether the proposals of this work are adopted, but a higher phase skip may make the circuit more prone to these issues, especially when operating at high frequency.
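A post-placement wire-length check of this kind might be sketched as below. This is a rough estimate under stated assumptions: the function name is hypothetical, the 0.7 mm limit is the buffer-to-buffer figure quoted above and is applied uniformly, repeaters are assumed to be evenly spaced, and the effect of clocked repeaters on phase assignment is ignored. Lengths are integers in micrometres to avoid floating-point rounding.

```python
MAX_WIRE_UM = 700  # 0.7 mm buffer-to-buffer limit quoted in the text


def repeaters_needed(length_um: int, limit_um: int = MAX_WIRE_UM) -> int:
    """Estimate how many repeaters split a wire into segments no longer
    than limit_um.

    ASSUMPTIONS: repeaters are evenly spaced and each one restores the
    full limit; phase effects of clocked repeaters are ignored.
    """
    if length_um <= limit_um:
        return 0
    segments = -(-length_um // limit_um)  # ceiling division on integers
    return segments - 1


for length in (500, 701, 1400, 1500):
    print(length, repeaters_needed(length))
# 500 0
# 701 1
# 1400 1
# 1500 2
```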
Assuming a layout similar to Fig. 4(d), we expect the power-clock margins to remain unchanged. However, we expect timing margins to shrink because larger phase skipping will likely incur more skew beyond the ideal timing of the capturing clock phase. Thus, timing-aware placement [29] is important to make sure the circuit still meets sufficient timing margins.

D. Limitations and Future Directions
In this work, we examine experimentally how assumptions on technology constraints impact the AQFP circuit cost using a post-logic-synthesis buffer insertion framework. Technology-aware logic synthesis is not considered because these algorithms need to be adapted to consider the relaxed constraints. Also, the buffer insertion and optimization algorithms in our framework do not guarantee optimum solutions because the AQFP buffer insertion problem is likely NP-hard due to the interplay between buffers and clocked splitters, and a scalable and globally optimal algorithm does not exist yet [4]. Nevertheless, this research is dedicated to exploring different possibilities in formulating the technology constraints and to demonstrating their impact. In fact, it divides future research on the AQFP buffer insertion problem into two independent directions. On the one hand, considering path balancing makes the problem computationally easier and maintains the possibility of wave-pipelining. Thus, existing algorithms are still valuable and are worth further improving, and wave-pipelining can be explored. On the other hand, considering phase alignment largely reduces JJ count, as shown in Section V-B, but its optimization problem becomes harder because of the increased flexibility. Thus, a second line of research is opened to better solve this newly defined computational problem.
Finally, by definition of $d(N)$, we know that all combinational outputs $o$ are ready since time $t = l(o) \le d(N) \le \phi_{ro} + d$; thus, at time $t = p_{\mathrm{arch}} \ge \phi_{ro} + d$, correct values $f_N(\vec{x})$ are presented at the inputs of registers $O$. Equation (6) ensures that register inputs are placed at the correct phase.