MANA: A Monolithic Adiabatic iNtegration Architecture Microprocessor Using 1.4-zJ/op Unshunted Superconductor Josephson Junction Devices

We conducted the first successful demonstration of an adiabatic microprocessor based on unshunted Josephson junction (JJ) devices manufactured using a Nb/AlOx/Nb superconductor IC fabrication process. It is a hybrid of RISC and dataflow architectures operating on 4-b data words. We demonstrate register file R/W access, ALU execution, hardware stalling, and program branching performed at 100 kHz under the cryogenic temperature of 4.2 K. We also successfully demonstrated a high-speed breakout chip of the microprocessor execution units up to 2.5 GHz. We use a logic primitive called the adiabatic quantum-flux-parametron (AQFP), which has a switching energy of 1.4 zJ per JJ when driven by a four-phase 5-GHz sinusoidal ac-clock at 4.2 K. These demonstrations show that AQFP logic is capable of both processing and memory operations and that we have a path toward practical adiabatic computing operating at high-clock rates while dissipating very little energy.

communications technology expected to rise to over 20% by 2030, half of which is attributed to data centers [4], [5], there is a need to consider other technological alternatives for the future development of our information and communications infrastructure, and this is a need that JJ-based superconductor logic may fulfill.
Adiabatic quantum-flux-parametron (AQFP) logic is a superconductor logic family shown to operate with a switching energy of 1.4 zJ per JJ when driven by a four-phase 5-GHz ac power-clock at 4.2 K in experiments using unshunted JJ devices [11]. As expected from adiabatic circuits, the switching energy of the AQFP also decreases linearly as the power-clock frequency decreases, or more specifically, as the rise time of the power-clock increases [12]. With a cryogenic cooling overhead taken into account [3], the switching energy increases 1000× to 1.4 aJ at 5 GHz, which is still ∼80× more efficient than a 7-nm multi-gate technology with VDD = 0.8 V [13]. Fig. 1(a) shows the schematic of the AQFP. The AQFP is composed of two components: 1) the two-junction dc superconducting quantum interference device (dc-SQUID), which corresponds to the J1-L1-L2-J2 loop, and 2) the output transformer, which corresponds to the mutual inductance pair of Lq and Lout. An excitation clock line (Lx) and a dc-offset line (Ld) are coupled to the dc-SQUID via L1 and L2. An input data current is provided through Lin. When an ac current (power-clock) ramps up through the clock excitation line (Lx), a single-flux-quantum (SFQ) is induced in the J1-L1-Lq loop if the input current through Lin was positive (logic "1") or the J2-L2-Lq loop if it was negative (logic "0"). A large current is induced by the SFQ through Lq whose direction is also positive for logic "1" or negative for logic "0." The current through Lq induces an output current through Lout via their mutual inductance. This is the active state of the AQFP. When the ac power-clock ramps down, the SFQ in either loop of the AQFP dissipates, and the output current goes to zero. This is the reset or null state of the AQFP. This structure effectively behaves as a clocked buffer. A clocked inverter is created by inverting the coupling coefficient between Lq and Lout.
A constant 1/0 cell is created by adding physical asymmetry to the L1-L2 inductors of the dc-SQUID. By combining these minimal variations of the AQFP, majority-based Boolean logic cells can be created [14], [15].
Our AQFP circuits are clocked by a four-phase current-based ac power-clock. Two sinusoidal ac sources, AC1 and AC2, plus a dc offset create the four phases [14], as shown in Fig. 1(b). The power-clock excitation lines are typically routed in a meandering fashion, as shown in Fig. 1(c), for a simple four-stage shift register composed of four AQFP buffers connected in series. The meandering directions create positive and negative versions of the ac power-clocks and the dc offset. AC1 and AC2 are in quadrature (90° out of phase); when each is combined with the positive and negative dc offset, all four phases of the clock can be generated. The nominal peak ac amplitude is approximately 0.9 mA and the nominal dc offset is 1.2 mA. These power-clock lines never physically "sink" into the AQFP but rather provide an excitation current through inductive coupling.
The simulation of a four-stage AQFP shift register is shown in Fig. 2, which was performed using the freely available superconductor analog circuit simulator Josephson SIMulator (JSIM) [16]. The four power-clock phases are annotated on the ac signals as observed directly from the ac sources before they are shifted by the dc offsets in the AQFP circuits. Input data are provided in the form of a positive or negative current which the first-stage AQFP samples when the first phase of the power-clock arrives. Upon arrival of the power-clock, the AQFP generates a corresponding positive or negative output current, which, in turn, is sampled by the next AQFP in series. As the power-clock ramps down, so does the AQFP output and eventually, it goes into the reset or null state where the output current is zero. Then, on the next power-clock cycle, the process repeats. Because of the overlap between active clock phases, the AQFP can transfer data from one phase to the next, as shown in the simulation waveform. The overlaps between active phases are visualized in Fig. 1(b). It is only during these overlaps between adjacent phases that data can transfer from the earlier phase of the two to the later phase of the two, as the earlier phase provides a signal output that the later phase can sample to produce its own output. In this clocking approach, we also cannot skip phases. For example, in Fig. 1(b), Phase 1 (P1) has a 180° phase difference with P3; therefore, there is no way for the AQFP circuits to properly sample outputs between these phases. Data also cannot be transferred from P1 to P4 because P4 is already active before P1 starts; thus, data can only be transferred from P4 to P1 (not vice versa). In short, data propagates from one phase to the next with each phase consisting of only one logic stage. Because we are using four-phase clocking, we can only propagate through four stages of logic in a single cycle. This severely limits how much logic we can perform in a single cycle, and it is what drove the design decisions discussed in Section II.
[Fig. 2 caption, partial: The bottom two traces show the switching energy profile of the first-stage buffer. The "X" observed in the D trace is a random output event, as the shift register has yet to be initialized with valid data.]
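The phase-to-phase transfer rule described above can be summarized in a few lines of Python. This is only an illustrative sketch: the function names and the 1-to-4 phase numbering are assumptions made here, and the code models the rule (data moves only from a phase to its immediate successor, with P4 feeding P1), not the circuit itself.

```python
import math

NUM_PHASES = 4  # four-phase power-clock, phases numbered P1..P4 (assumed numbering)


def can_transfer(src_phase: int, dst_phase: int) -> bool:
    """True if a gate on dst_phase can sample the output of a gate on src_phase.

    Only adjacent phases overlap, so the destination must be the immediate
    successor of the source; P4 wraps around to P1.
    """
    return dst_phase == src_phase % NUM_PHASES + 1


def cycles_for_stages(logic_stages: int) -> int:
    """Clock cycles needed to traverse a chain with one logic stage per phase."""
    return math.ceil(logic_stages / NUM_PHASES)


if __name__ == "__main__":
    print(can_transfer(1, 2), can_transfer(4, 1))  # adjacent phases
    print(can_transfer(1, 3), can_transfer(1, 4))  # skipped / reversed phases
    print(cycles_for_stages(4), cycles_for_stages(5))
```

Under this model, any path longer than four logic stages necessarily spans more than one clock cycle, which is the constraint that shapes the architecture.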
Also shown in the waveform is the total dissipated energy of the first buffer, which exhibits a typical adiabatic switching profile where energy is dissipated and then recovered. The switching energy trace is obtained by performing the rolling integral shown in (1).
An example of an AQFP-based full-adder using majority logic gates is shown in Fig. 3 as a schematic (a) and physical layout (b). The majority gates are a mix-and-match of normal AQFP buffers, inverters, and constant cells tied together in parallel. Gates are arranged in logic rows where each row is clocked by one phase of the power-clock. Data propagate from one row to the next row, phase by phase. As previously mentioned, it is not possible to skip phases or propagate through multiple gates within the same phase. Furthermore, all connections are one-to-one, meaning we have no notion of tri-state buffers that can drive the same interconnect line, and gates do not have passive fan-out. Active fan-out cells up to FO3 are used to split data signals. The power-clock lines are made of superconductor microstriplines whereas the gate-to-gate interconnects are made of shielded striplines. Due to the parasitic inductances of the interconnect, the AQFP cells have a limited driving distance of up to only 1 mm [15], after which another buffer must be inserted as a repeater to re-amplify the signal.
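To make the majority-logic idea concrete, the following Python sketch builds AND and OR from a three-input majority gate with a constant input, and a full adder from three majority gates and two inverters. The decomposition used here is a standard textbook form chosen for illustration; it is an assumption and may not match the gate arrangement of Fig. 3.

```python
from itertools import product


def maj(a: int, b: int, c: int) -> int:
    """Three-input majority: 1 if at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def inv(a: int) -> int:
    """Inverter (in AQFP, a buffer with inverted output coupling)."""
    return a ^ 1


def and2(a: int, b: int) -> int:
    """AND as a majority gate with one input tied to a constant 0 cell."""
    return maj(a, b, 0)


def or2(a: int, b: int) -> int:
    """OR as a majority gate with one input tied to a constant 1 cell."""
    return maj(a, b, 1)


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Full adder from three majority gates (illustrative decomposition)."""
    carry = maj(a, b, cin)
    total = maj(inv(carry), maj(a, b, inv(cin)), cin)
    return total, carry


# Exhaustive check against integer addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin
```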
As previously mentioned, there are a number of other JJ-based superconductor logic families (ERSFQ, eSFQ, RQL, LR-biased RSFQ, and LV-RSFQ). They all fall under the class of SFQ logic families where the data tokens of logic "1" or "0" are represented as the presence or absence of an SFQ voltage pulse, respectively [17]. AQFP logic differs from these SFQ logic families in that data is not transmitted as SFQ voltage pulses but as positive (logic "1") or negative (logic "0") current with a reset state (zero current). But the most important difference is that AQFP logic operates adiabatically, which limits the clock rate to around 10 GHz in order to remain in the adiabatic regime. The SFQ logic families are non-adiabatic, which means they are capable of running at extremely fast clock rates as high as 770 GHz [18] at the cost of much higher switching energy. The switching energy of AQFP is so low that, even at its modest clock rates, its energy-delay product (EDP) is about 2-3 orders of magnitude lower than that of SFQ logic [19].
This work describes a proof-of-concept microprocessor as a first step toward demonstrating adiabatic data processing and storage using practical circuits. We call this microprocessor MANA: Monolithic Adiabatic iNtegration Architecture. The efforts described herein build upon previous work, including the use of an established AQFP cell library based on unshunted JJs for a four-layer Nb/AlOx/Nb superconductor IC process with a superconducting critical current density of 10 kA/cm² manufactured by the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan [14]. The fabrication process is referred to as the AIST high-speed standard process (HSTP). We also use a semi-custom AQFP design environment described in [20], which consists of a rudimentary logic synthesis framework [21] and a genetic-algorithm-based place and route tool [22]. Some of the components of MANA are variations of previously demonstrated circuits, including a 16 × 1 b register file (RF) in [23], 8-b Kogge-Stone adders in [11], [24], and a 4-b data shifter in [22]. Preliminary low-frequency demonstrations of actual MANA units have been reported for the ALU and shifter in [20].
The remaining sections of this article are organized as follows: Section II describes the architecture of MANA and how it was developed specifically for AQFP logic. Section III describes the implementation of MANA and the general challenges of designing the microprocessor. Section IV shows the experimental demonstrations of the MANA chip and a high-speed breakout chip of the execution components of MANA. Section V makes a brief comparison with other adiabatic circuits and microprocessors and discusses the challenges in moving the technology forward. Section VI estimates the performance of a refrigerated supercomputer implemented in AQFP logic. Section VII concludes this article.

II. ARCHITECTURE
We developed the microarchitecture of MANA as a demonstration vehicle to show that AQFP logic is capable of performing computation, including the processing of logic and the storage of data into AQFP-based memory structures, all within a single technology, single logic family, and single chip. Optimizing raw performance in terms of throughput and instructions per cycle (IPC) was not our priority. Instead, the architecture was tailored specifically to AQFP logic, taking into account the following.
1) Four-Phase Clocking: Data can only propagate through four levels of logic per clock cycle; therefore, program control must be very simple to achieve a peak IPC of 1.
2) Lack of System Integration Tools: Integration of components at the chip level is done by hand; therefore, a design with a large number of control signals broadcast all over the chip is not feasible. Instead, the architecture should be simple so that data and control signals have a straightforward dataflow but still perform a meaningful computation.
To this end, we combined aspects of conventional RISC design [25] with some concepts taken from dataflow architectures [26], [27]. The microarchitecture diagram of MANA is shown in Fig. 4. Table I lists the core set of supported instructions. MANA operates on 4-b data words via 16-b instruction words whose formats are shown in the inset of [...] a large instruction memory due to its design complexity; therefore, we only emulate its behavior in the experiment. The small IB is sufficient to perform key demonstrations of the implemented processor. MANA instructions have an NPC (Next PC) field which indicates which of the other three instructions in the IB will be executed next. If ID determines that the NPC field refers to the current instruction (e.g., the instruction in the 11₂ slot of IB has an NPC of 11₂), this notifies IFETCH to increment PC and request the next block of four instructions to be serially loaded into IB. A branch instruction whose branch address is outside the block of four instructions would also notify IFETCH. This eliminates the need to calculate the NPC within the block of four instructions. This concept was inspired by dataflow architectures, where instructions have an understanding of what would be executed next without the need of a PC. Furthermore, MANA instructions also have an S bit, also known as the stall bit. 
This S bit is encoded by the compiler to notify the IDI stage on whether the next instruction has a dependence on the current instruction. The simple detection for this bit makes it very easy for the IDI stage to decide whether it must stall the processor to resolve both data and control hazards in a single cycle. When an S bit is detected, it prevents the next instruction from being fetched from the IB, and it propagates through the pipeline with the current instruction in flight. It returns to the IDI stage during writeback as an acknowledgment signal to notify that the next instruction can finally be issued.
Thanks to the S bit and NPC field, program control decisions are made at the very start of the pipeline. Once those program control decisions are determined, the rest of the instruction is decoded by ID, and all necessary control signals are generated and buffered appropriately. This actually makes MANA a two-stage pipeline from a pure computer architecture point of view. The first stage checks on the last S bit status and performs either a hardware stall or enables the fetching of the next instruction. This first stage is very simple and can be completed within four stages of logic (one clock cycle in four-phase clocking) so that a peak IPC of 1 can be obtained. The second stage consists of everything else in the processor, including the enabled fetching of the instruction from IB. It has a total latency of 26 cycles. The gate-level clocking nature of AQFP logic allows the multi-cycle second stage of MANA to accommodate 26 instructions in-flight although MANA will only issue a maximum of four instructions successively due to the small four-instruction-word IB. To keep the architecture as simple as possible, we do not support other pipeline hazard resolution mechanisms, such as data forwarding. This kept the number of feedback loops small, and thus, kept manual timing closure feasible.
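The interplay of the NPC field and the S bit can be illustrated with a small behavioral model. The sketch below is hypothetical: the `Instr` fields, the `run_block` helper, and the cycle accounting (one issue cycle per instruction plus the 27-cycle stall reported in the MANA chip test whenever the S bit is set) are simplifying assumptions, and taken branches are not modeled.

```python
from dataclasses import dataclass

STALL_CYCLES = 27  # hardware stall length reported in the MANA chip test


@dataclass
class Instr:
    name: str
    npc: int      # IB slot (0-3) of the next instruction to execute
    s_bit: bool   # set by the compiler when the next instruction depends on this one


def run_block(ib: list[Instr], start: int = 0, max_steps: int = 16):
    """Walk one four-instruction block and return (issue order, cycle count)."""
    slot, cycles, trace = start, 0, []
    for _ in range(max_steps):
        ins = ib[slot]
        trace.append(ins.name)
        cycles += 1                  # one cycle to issue (peak IPC of 1)
        if ins.s_bit:
            cycles += STALL_CYCLES   # wait for the writeback acknowledgment
        if ins.npc == slot:          # NPC refers to itself: request the next block
            break
        slot = ins.npc
    return trace, cycles


if __name__ == "__main__":
    block = [
        Instr("i0", npc=1, s_bit=False),
        Instr("i1", npc=2, s_bit=True),
        Instr("i2", npc=3, s_bit=False),
        Instr("i3", npc=3, s_bit=False),  # last slot points to itself
    ]
    print(run_block(block))
```

Because the next slot is read directly from the instruction, the model never computes an address inside the block, which mirrors the motivation given for the NPC field.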
The RFX stage is a two-read/one-write 16 × 4 b RF with registers $14 and $15 to hold I/O data, and $0 as a constant zero register. External I/O data are transferred in and out of $14 and $15 serially such that $14 contains the higher nibble of the memory byte data and $15 contains the lower nibble. Control flags are available to notify the processor that data in $14 and $15 are valid data from I/O; likewise, a similar set is available to notify I/O devices that data in $14 and $15 are valid from the processor.
The EX stage has a 4-b integer ALU and a 4-b data shifter unit. The units are connected in series, so data must propagate through both units regardless of the instruction.
The WB stage routes processed data back to the appropriate destination register in the RFX stage. This stage also includes the buffering of control signals, which are tapped off when needed as the instruction flows through the pipeline, and clocked signal repeaters to transmit data across the long distances of the WB path.

A. Design Environment
Before implementing MANA, we developed a semi-custom design environment [20] which features a four-phase AQFP cell library [14] in a Cadence design environment that has been augmented with scripts and external programs to do AQFP-based combinational logic synthesis [21] and component-level place and route using a genetic algorithm and left-edge channel routing [22], [28]. Place and route tools operate on adjacent logic rows and are aware of the limited driving ability of AQFP logic cells. Together, they attempt to keep every interconnect in the routing channel shorter than 1 mm. If this is not possible, buffers are inserted as signal repeaters automatically. The design environment also includes gate-level models specified in SystemVerilog [29] with timing information extracted from freely available superconductor circuit simulators, such as JSIM and Josephson SIMulator (JoSIM) [16], [30], [31]. The AQFP cells require accurate extraction of the 3-D inductors and their mutual coupling factors, all of which can be extracted through a tool called InductEx [32]. The post-extraction simulation results align well with the characterization experiments of fabricated cell tests of our library [14], [15].

B. Combinational Logic Design
We applied our combinational logic synthesis along with place and route on the ID unit of the IDI stage and the data shifter unit of the EX stage. The shifter was our first example in which we showed a full demonstration from a behavioral Verilog description to a working circuit in experiments [20], [22]. In this work, the ID unit was also synthesized using a subset of an RTL description of MANA. It performs a straightforward decoding of input signals, mainly the 16-b instruction word, into control signals and source/destination addresses.
We designed an integer ALU by hand using a majority logic-based parallel prefix carry look-ahead Kogge-Stone adder as its structural foundation composed of majority-based carry-merge blocks (black and gray cells) [24]. As shown in Fig. 5 via the ALU slices, we modified the first stage of the adder normally used to generate the bitwise carry propagate (p_i) and generate (g_i) signals so that bitwise logic operations can also be performed in-place. When performing addition, ALU slices will produce the appropriate p_i and g_i signals as expected in a Kogge-Stone adder. When performing subtraction, the ALU slices will generate the two's complement of operand B by inverting B in each slice and adding "1" in the least significant bit by introducing a carry-in. In the case of a logic operation, we produce the bitwise logic result in each of the ALU slices and push the result out as a p_i signal. No bitwise g_i signals are produced in this case, so the logic result propagates through the adder tree unchanged. We first experimentally validated the design of the ALU and data shifter each in separate breakout chips at low frequency in [20], but in this work, we combined them together in a high-speed test chip described later. Furthermore, we integrated architectural glue logic into the ALU, such as the generation of carry (C), negative (N), and zero (Z) flags.
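The way a single adder tree serves addition, subtraction, and bitwise logic can be sketched behaviorally. The Python model below is an illustration under stated assumptions, not the gate-level design: it uses ordinary AND/OR/XOR operators instead of majority gates, folds the carry-in into the prefix network as an extra position, and assumes conventional definitions for the C, N, and Z flags.

```python
WIDTH = 4  # MANA operates on 4-b data words


def to_bits(x: int) -> list[int]:
    """Integer to list of bits, LSB first."""
    return [(x >> i) & 1 for i in range(WIDTH)]


def prefix_add(p: list[int], g: list[int], cin: int) -> tuple[list[int], int]:
    """Kogge-Stone style parallel prefix; index 0 carries the carry-in."""
    n = len(p)
    G, P = [cin] + list(g), [0] + list(p)
    d = 1
    while d <= n:
        new_g, new_p = G[:], P[:]
        for i in range(d, n + 1):
            new_g[i] = G[i] | (P[i] & G[i - d])  # carry-merge ("black cell")
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        d *= 2
    # After the prefix tree, G[i] is the carry into bit i and G[n] is the carry out.
    return [p[i] ^ G[i] for i in range(n)], G[n]


def alu(op: str, a: int, b: int) -> tuple[int, dict[str, int]]:
    A, B, cin = to_bits(a), to_bits(b), 0
    if op == "sub":                       # two's complement: invert B, carry-in = 1
        B, cin = [x ^ 1 for x in B], 1
    if op in ("add", "sub"):
        p = [x ^ y for x, y in zip(A, B)]
        g = [x & y for x, y in zip(A, B)]
    else:                                 # logic result rides on p; no g is produced
        fn = {"and": lambda x, y: x & y,
              "or": lambda x, y: x | y,
              "xor": lambda x, y: x ^ y}[op]
        p, g = [fn(x, y) for x, y in zip(A, B)], [0] * WIDTH
    s, cout = prefix_add(p, g, cin)
    result = sum(bit << i for i, bit in enumerate(s))
    # Flag definitions are assumed: C = carry out, N = MSB, Z = result is zero.
    return result, {"C": cout, "N": s[-1], "Z": int(result == 0)}
```

With all g_i forced to zero and no carry-in, every carry in the tree is zero, so the sum output equals p_i; this is how the logic result passes through the adder unchanged in the model.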

C. Memory
Memory structures are needed for the IDI and RFX stages of MANA. We created an adiabatic latch by connecting AQFP logic gates in a short four-phase feedback loop similar to the academic latches composed of NAND or NOR gates [33]. This gate-level implementation of an AQFP latch has an enable input which is asserted on the same clock phase as the input data when we want to write into the latch.
Our logic synthesis and place and route framework currently do not support circuits with memory and feedback structures. Such structures had to be designed by hand. The 16 × 4 b RF unit of the RFX stage is the largest and most complicated component of the entire MANA processor. The design is based on a previously demonstrated 16 × 1 b RF unit [23]. For the development of MANA, efforts were made to establish a compact bit-slice design methodology by hand to systematically scale up the design to 4-b data words. A block diagram of RF is shown in Fig. 6 with the read and write decoders omitted. The D cells in the diagram are gate-implemented D latches serving as the memory cells whereas DS cells are modified D cells but with additional "serial ctrl" interfaces to enable separate serial read and write with external I/O. AQFP logic has limited fan-out with discrete active fan-out cells up to only FO3. We also cannot use shared lines, as CMOS does with tri-state buffer access of an output bus bit line. The "rd ctrl" components in Fig. 6 that control the enabling of a word line output are implemented as AND gates whose output is then merged into the OR-trees for each bit column inside the "bit merge tree" block. The output of the "bit merge tree" forms one bit of the 4-b data output. These one-to-one connections contribute to the large area of the RF unit (4.7 mm × 6.6 mm).
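The read path described above (AND-gated word lines merged through per-bit OR-trees, in place of a shared tri-state bus) can be sketched as follows. This is a behavioral illustration only: the one-hot decoder, the pairwise OR-tree shape, and the multiplexer-style latch equation are assumptions rather than the actual netlist.

```python
WORDS, WIDTH = 16, 4  # 16 x 4 b register file


def latch_next(q: int, d: int, enable: int) -> int:
    """Behavioral D latch with enable: take d when enabled, otherwise hold q."""
    return (d & enable) | (q & (enable ^ 1))


def or_tree(bits: list[int]) -> int:
    """Merge bits pairwise, as a tree of two-input OR gates would."""
    while len(bits) > 1:
        bits = [bits[i] | bits[i + 1] for i in range(0, len(bits), 2)]
    return bits[0]


def rf_read(regs: list[int], addr: int) -> int:
    """Read one word using AND gating ("rd ctrl") and a per-bit merge tree."""
    enables = [int(w == addr) for w in range(WORDS)]  # assumed one-hot read decoder
    out = 0
    for bit in range(WIDTH):
        gated = [((regs[w] >> bit) & 1) & enables[w] for w in range(WORDS)]
        out |= or_tree(gated) << bit
    return out


if __name__ == "__main__":
    regs = list(range(WORDS))  # each register holds its own number
    print([rf_read(regs, a) for a in range(WORDS)])
```

Every stored bit needs its own AND gate and its own leaf in the OR-tree, which gives a sense of why one-to-one connections inflate the area of the RF unit.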
Other memory structures in MANA, such as the IB and PC in the IDI stage, follow the same design concepts as the RF implementation but are actually much simpler. The IB has only four words, one read port, and is loaded serially. The PC is just a set of eight D latches updated by ID after all instructions in IB have finished execution.

D. Clock Distribution
As illustrated in the full-adder example in Fig. 3, we generally distribute the power-clock network of AQFP logic circuits in a meandering fashion from one logic row to the next. This is perfectly fine for small circuits, but if we increase the scale to that of the RF unit, we start to see the impact of clock skew because the microstriplines have an effective signal velocity. The AIST HSTP fabrication process used in this work has four Nb superconductor layers. The Nb layer and dielectric thicknesses are specified in [14]. We made a first-order extraction of the microstripline velocity to be 161 μm/ps or a delay of 6.21 ps/mm in [20]. Fig. 7(a) shows an illustration of round trip clock skew which is governed by the total length of the AC1 line traveling from the start of the phase 1 logic row to the end of the phase 3 logic row. Detailed timing analysis is beyond the scope of this article, but as a soft rule, we constrain this round trip length to be no more than 5 mm. This results in a maximum clock skew of about 30 ps, which we expect to be tolerable for AQFP circuits operating at 5 GHz given that we also cannot drive data interconnects longer than 1 mm [20].
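The skew figures quoted above follow directly from the extracted microstripline velocity. The short Python sketch below reproduces the arithmetic; the function names are illustrative, and the comparison with the 5-GHz clock period is added here only for scale.

```python
VELOCITY_UM_PER_PS = 161.0  # first-order extracted microstripline velocity


def delay_ps_per_mm() -> float:
    """Propagation delay per millimetre of clock line."""
    return 1000.0 / VELOCITY_UM_PER_PS


def round_trip_skew_ps(length_mm: float) -> float:
    """Clock skew accumulated over a meandering line of the given length."""
    return length_mm * delay_ps_per_mm()


if __name__ == "__main__":
    period_ps = 1e12 / 5e9  # 200 ps at 5 GHz
    for length in (3.0, 5.0):
        skew = round_trip_skew_ps(length)
        print(f"{length} mm -> {skew:.1f} ps ({100 * skew / period_ps:.1f}% of period)")
```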
To apply this constraint to large circuits, however, we need to divide the power-clock using microwave power dividers so that we can create multiple local power-clock networks in parallel, with each one abiding by the round trip constraint, as shown in Fig. 7(b). Wilkinson power dividers [34] would be a good choice for AQFP circuit designs deemed ready for operating at high clock rates. They are lossless when their output ports have properly matched impedances, but the narrow bandwidth of these dividers does not allow us to verify the circuit at lower frequencies, making experimentation more difficult. Because this is the first demonstration, we opted for a resistor-based power divider network which we can operate over a large range of frequencies. We applied it to the RF unit via a 1-to-4 power divider with a 50-Ω resistor placed in series with every output of each 1-to-2 dividing stage. This implies that we need 4× the nominal ac excitation current amplitude to operate the circuit (4 × 0.9 mA). The round trip length of each meandering local power-clock network is about 3 mm. Because the divider is resistor-based, additional power will be dissipated through the resistors. The equivalent resistance of the entire 1-to-4 divider network is 37.5 Ω. We apply an ac sinusoidal voltage with a peak amplitude of 135 mV so that ac power-clock currents are at the nominal peak amplitudes of 0.9 mA in each of the four branches of the 1-to-4 power divider. The resulting dissipated power from the divider is P = (135 mV/√2)²/37.5 Ω = 243 μW, which is about three orders of magnitude larger than the power dissipated from the AQFP circuits in MANA. When using a well-designed Wilkinson power divider, we expect very little to no power dissipation from the power-clock network on chip.
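The divider numbers can be cross-checked with a few lines of arithmetic. The Python sketch below takes the stated 37.5-Ω equivalent resistance and 135-mV peak drive as given and assumes the current splits equally among the four branches; it does not model the internal topology of the divider.

```python
import math

V_PEAK = 0.135   # applied ac peak voltage (V)
R_EQ = 37.5      # stated equivalent resistance of the 1-to-4 divider (ohm)
BRANCHES = 4     # number of local power-clock networks


def total_peak_current() -> float:
    """Peak current drawn from the source (A)."""
    return V_PEAK / R_EQ


def branch_peak_current() -> float:
    """Peak current per branch, assuming an equal four-way split (A)."""
    return total_peak_current() / BRANCHES


def dissipated_power() -> float:
    """Average power of a sinusoid: V_rms squared over R (W)."""
    return (V_PEAK / math.sqrt(2)) ** 2 / R_EQ


if __name__ == "__main__":
    print(f"total  : {total_peak_current() * 1e3:.2f} mA")
    print(f"branch : {branch_peak_current() * 1e3:.2f} mA")
    print(f"power  : {dissipated_power() * 1e6:.0f} uW")
```

The total of 3.6 mA is the 4 × 0.9 mA mentioned in the text, and the power evaluates to 243 μW.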
The area of the resistor-based power divider is almost negligible as they can be designed as a straightforward compact binary tree with resistors in series at each dividing stage. The Wilkinson power divider, on the other hand, requires a substantial area to accommodate microstripline meanders for achieving the desired impedances. A 1-to-4 Wilkinson power divider may take up an area of 1 mm × 2 mm [35] and an additional copy of the divider is needed to recombine the divided power-clocks, as shown in Fig. 7(b). A fabrication process with more superconductor layers can help alleviate this challenge.

E. Component Integration
While our design environment allows us to generate standalone combinational processor components, we do not have an automated way to integrate them together at the chip level. This was done by hand. All processor components were designed independently without any real consideration of how they will be connected to each other. Each component was designed with the assumption that its first stage of logic is clocked by the first phase of the ac power-clock with the corresponding ac lines physically coming from the top left of each block. When connecting components together, manual buffer insertion was necessary for two reasons: 1) to synchronize the output signals of the transmitting component such that they arrive at the starting phase of the receiving component and 2) to re-amplify signals traveling across long distances (>1 mm).
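The two buffer-insertion requirements can be combined into a single count. The Python sketch below is a hypothetical helper, not part of the design flow: it assumes each inserted buffer advances the signal by exactly one clock phase, that buffers are spaced evenly along the wire, and that the receiving component's first stage (phase 1) must be driven by a gate on phase 4.

```python
import math

NUM_PHASES = 4
MAX_DRIVE_MM = 1.0  # maximum interconnect length an AQFP cell can drive


def buffers_needed(out_phase: int, distance_mm: float) -> int:
    """Smallest number of buffers that satisfies both phase and distance rules.

    out_phase: clock phase (1-4) of the last gate in the transmitting component.
    """
    # k buffers create k + 1 wire segments, each of which must be <= MAX_DRIVE_MM.
    min_for_distance = max(0, math.ceil(distance_mm / MAX_DRIVE_MM) - 1)
    # The last buffer must sit on phase 4 so that phase 1 can sample it.
    k = (NUM_PHASES - out_phase) % NUM_PHASES
    while k < min_for_distance:
        k += NUM_PHASES  # adding a full cycle of buffers preserves the phase
    return k


if __name__ == "__main__":
    for phase, dist in [(4, 0.5), (1, 0.5), (4, 2.5), (2, 2.5)]:
        print(phase, dist, buffers_needed(phase, dist))
```

In this model a long wire whose source is already phase-aligned still costs a whole extra cycle of buffers, which hints at why manual integration was laborious.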
In addition, the meandering local power-clock network of the individual components was designed without consideration of additional buffering of signals adjacent to them, such as control signals that are simply passing through for the next stage. It was necessary to stretch or even completely redesign (if stretching resulted in too large of a round trip clock skew) the local power-clock network to accommodate those extra control signals.
Efforts to improve and automate this aspect of the design flow through electronic design automation (EDA) tools are underway (see Section V).

F. Taped-Out Chips
We manufactured our chips using the AIST HSTP 10-kA/cm² Nb/AlOx/Nb superconductor IC fabrication process [14]. In this work, we taped out two designs. The first chip is a low-frequency design of the MANA processor. We underestimated the overall size of the design, which resulted in having to use a larger 1 cm × 1 cm chip. At the time of this writing, we, unfortunately, did not have a high-speed experimental probe that can support this chip size; therefore, our experiments for the MANA chip are limited to 100 kHz using our low-frequency probes. We included a debug output port "RESDBG" which is tapped from the writeback data right after the EX stage so that we can observe the state of the microprocessor step-by-step. A microphotograph of the MANA chip is shown in Fig. 8(a). The active circuit core consists of 21460 JJs and has an area of 8.5 mm × 9.5 mm. Table II shows a stage-by-stage breakdown of the MANA processor.
To demonstrate that at least part of the MANA processor is capable of high-speed operation, we produced a second design which is a breakout chip of the EX stage of MANA. It includes the ALU and data shifter connected together along with the buffering of their respective control signals, but we omitted the generation of the ALU flags. This EX chip was designed primarily for high-speed demonstration (>1 GHz), so we made sure that the round trip clock length was less than 5 mm, and we interfaced the outputs of the EX stage with high-speed dc-SQUID stack voltage drivers so that we can observe the high-speed output [36]. The design is placed on a 7 mm × 7 mm chip, which is a compatible size for our high-speed experimental probe. The chip contains a frame of 48 pads to interface with the probe. A microphotograph of the EX chip is shown in Fig. 8(b). The active circuit core consists of 2076 JJs and has an area of 2 mm × 3.5 mm.
Both designs were taped out, and we experimentally demonstrated their operation across multiple wafers and chips as detailed in Section IV.

A. Experimental Setup
We measured each chip using an immersion probe lowered into a liquid helium Dewar to achieve a 4.2 K cryogenic temperature. We surrounded the chip carrier end of the probe with two layers of Mu-metal shielding. A low-speed probe suitable for testing 1 cm × 1 cm chips via wire-bonding was used for the MANA chip, whereas a high-speed probe that accepts 7 mm × 7 mm chips via a 48-pin pad frame was used for the EX chip. We used a function generator to create two ac sinusoidal sources (AC1 and AC2) in quadrature (90° out of phase). A dc offset is coupled to both ac excitation lines in the positive and negative direction on the chip to create the four-phase excitation clock. The excitation clock lines and dc offset are 50-Ω terminated at room temperature. The nominal amplitudes of AC1 and AC2 are both 0.9 mA. The nominal dc offset is 1.2 mA. Because the MANA chip uses a 1-to-4 power divider, we applied approximately 4× the nominal values of AC1, AC2, and the dc offset. We used a data pattern generator to create a +10 μA logic "1" and a −10 μA logic "0." An on-chip dc-SQUID stack voltage driver interface [36] with additional room temperature amplification provides output readout. The signal produced from the output interface is a unipolar return-to-zero (RTZ) signal in which a logic "1" is represented by a positive output voltage for a duration that is proportional to the experimental clock period before returning to zero. When the output is a logic "0," no activity is observed on the output trace.

B. MANA Chip Test
During the MANA chip test, we operated the circuit only at low speed (100 kHz), and 16-b instruction words were serially loaded into the instruction buffer at 1 b/cycle. We first zeroed all registers in the RF by ANDing each register with $0. Next, we used the load-immediate instruction to write each register with its register number (e.g., $1 := 0001₂; $2 := 0010₂; . . . $0 is always fixed to 0000₂), and then we read each register out to confirm their values starting from $15 down to $0 by observing the RESDBG port. This serves as a smoke test to check that at least the RF unit (the most complex component) is operating before proceeding to other tests. The setting and reading of the registers are shown in Fig. 9, where R3 is the most significant bit (MSB) of the output, and R0 is the least significant bit (LSB) of the output. This smoke test also shows that we can achieve an IPC of 1 for up to four-instruction bursts, which is limited by the size of IB. A 64-cycle stall between bursts is used to serially load the next three to four instructions into the IB unit.
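The 64-cycle gap between bursts is consistent with loading a full instruction block serially. The short Python sketch below works through that arithmetic; the variable names are illustrative, and the wall-clock figure simply divides the cycle count by the 100-kHz test clock.

```python
INSTR_BITS = 16      # instruction word width
IB_WORDS = 4         # instruction buffer holds four instruction words
BITS_PER_CYCLE = 1   # serial load rate
CLOCK_HZ = 100e3     # low-speed test clock


def ib_load_cycles(words: int = IB_WORDS) -> int:
    """Cycles needed to serially load the given number of instruction words."""
    return words * INSTR_BITS // BITS_PER_CYCLE


def ib_load_time_s(words: int = IB_WORDS) -> float:
    """Wall-clock time of the load at the test clock frequency."""
    return ib_load_cycles(words) / CLOCK_HZ


if __name__ == "__main__":
    print(ib_load_cycles(), "cycles")
    print(ib_load_time_s() * 1e6, "us")
```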
We demonstrated two short four-instruction programs for MANA. The first program loaded into the IB incremented $3 by the value of 2 stored in $2 until the contents of $3 equals 9. When $3 equals 9, we add the value of 6 stored in $6 to $3 so that it is now 15. We checked the contents of $3 by adding it with $0, and observed the resulting output on the RESDBG output port. The experimental waveform is shown in Fig. 10(a) with a zoomed-in view on the first and last results in Fig. 10(b). The test program and output sequence are shown in Fig. 10(c) and (d), respectively. Note that each instruction in the program has either a data or control hazard. To resolve these hazards, a 27-cycle hardware stall is performed by MANA after every instruction. This can be seen by the large gaps of idle activity between the outputs of the RESDBG port shown in Fig. 10(a). Some gaps might appear even larger because the bneq instruction produces no observable output on RESDBG. In addition, the RESDBG outputs show all ALU calculations even though some are not necessarily written to the RF unit as is the case for the subnw (compare) instructions.
After resetting the registers back to the initial state, we loaded the second program, which tests a few shifter operations. It starts by performing a shift right arithmetic (sra) by 1 on $8 until it is 1111₂. When it is 1111₂, it performs a shift left logical (sll) by 3 so that $8 holds the value of 8. Finally, we checked the contents of $8 by ORing it with $0. The corresponding waveforms, test program, and output sequence are all shown in Fig. 11.
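The expected values of the second program can be checked with a small 4-b shifter model. The Python sketch below is illustrative only; the helper names sra4 and sll4 are hypothetical, and the starting value of $8 (1000₂) follows from the smoke-test initialization.

```python
# Behavioral sketch of the 4-bit shift operations used in the second program.
MASK = 0xF


def sra4(x, n):
    """Shift right arithmetic on a 4-bit word: the MSB (sign bit) is replicated."""
    sign = x & 0x8
    for _ in range(n):
        x = (x >> 1) | sign
    return x & MASK


def sll4(x, n):
    """Shift left logical on a 4-bit word: bits shifted past the MSB are dropped."""
    return (x << n) & MASK


def run_shift_program():
    r8 = 0b1000                 # $8 holds its register number after the smoke test
    trace = []
    while r8 != 0b1111:
        r8 = sra4(r8, 1)        # sra by 1 until $8 is 1111
        trace.append(r8)
    r8 = sll4(r8, 3)            # sll by 3
    trace.append(r8)
    return r8 | 0, trace        # read-out by ORing with $0 (which is 0)


if __name__ == "__main__":
    value, trace = run_shift_program()
    print(value, [format(v, "04b") for v in trace])
```

The model steps through 1100₂, 1110₂, and 1111₂ before the left shift returns 1000₂, that is, 8.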
Both programs demonstrate successful RF R/W access, execution of a few arithmetic/logic operations, hardware stalling, and program branching. Furthermore, we were able to replicate these results on five of the six chips that we tested across two wafers, as shown in Table III, with one chip failing to produce any output, possibly due to a poor connection between the immersion probe and the chip. We also varied the amplitudes of the excitation clocks AC1 and AC2 until we induced bit errors. The operating margins of AC1 and AC2 shown in Fig. 12 are reasonably wide at 5.0 and 4.6 dB, respectively, when averaged across all fully functional chips.

C. EX Chip Test
For the EX chip, we first confirmed all ALU and shifter operations at 100 kHz, as shown in Fig. 13. The test started by setting operands A and B such that they covered all input combinations across the bits for the logic operators. We repeated the test with the bits reversed and swapped between A and B. Then, we tested a few random operands for addition and subtraction, including a set of test vectors that show the carry propagating from the LSB to the carry out (R4). Lastly, we fixed the B operand to 0111₂ and 1000₂ with the A operand fixed to 0000₂ to test the shift operations. All shift operations and shift amount values were tested.
Next, we performed the high-speed test. Our setup is limited to a single data input from an Agilent N4906B BERT with single-output monitoring. Our high-speed test pattern consists of the critical carry propagation test, in which the control signals are set for addition, operand A3:0 is fixed to 1111₂, B3:1 is fixed to 000₂, and B0 is toggled at high speed as a pseudorandom input from the BERT. The carry-out R4 and sum R3:0 outputs were observed on a high-speed oscilloscope (Fig. 14), with the carry out R4 also monitored by the BERT. When B0 is 1, R4 is 1 and R3:0 is 0000₂. When B0 is 0, R4 is 0 and R3:0 is 1111₂.
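The expected outputs of this carry propagation pattern can be verified with a simple behavioral adder model. The Python sketch below uses a ripple-carry structure with a majority function for the carry, chosen only because it makes the LSB-to-carry-out propagation explicit; it is an assumption for illustration and does not describe the adder topology actually implemented in the EX unit. The names maj and add4 are hypothetical.

```python
# Behavioral 4-bit adder model (illustrative; not the EX unit's actual topology).
def maj(a, b, c):
    """3-input majority of single bits; the carry of a full adder."""
    return (a & b) | (b & c) | (a & c)


def add4(a, b):
    """Return (carry_out R4, sum R3:0) for 4-bit operands a and b."""
    carry = 0
    total = 0
    for i in range(4):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i   # sum bit
        carry = maj(ai, bi, carry)        # carry ripples to the next bit
    return carry, total


if __name__ == "__main__":
    a = 0b1111                            # operand A fixed to 1111
    for b0 in (0, 1):                     # B3:1 fixed to 000, B0 toggled
        r4, r = add4(a, b0)
        print(f"B0={b0}: R4={r4}, R3:0={r:04b}")
```

With A fixed to 1111₂, toggling B0 forces the carry to traverse every bit position, which is why this pattern exercises the critical path.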
For our best sample, we confirmed functional operation at 1, 2, and 2.5 GHz and measured excitation margins of AC1 and AC2 to be 2.6 and 2.4 dB, respectively, at 2.5 GHz, as seen in Fig. 12. Beyond 2.5 GHz, we observed bit errors, and closer investigation revealed that 101₂ patterns did not fully return to zero. We believe this is not a problem with the AQFP circuit itself but perhaps with the voltage driver interface or our experimental setup. In total, we tested 12 EX chips (three chips from each of four different wafers) and reproduced correct functionality on seven chips, as shown in Table III. For the high-speed test, the maximum operating frequency ranged from 1.2 to 2.5 GHz across all working chips. Some chips had unstable or oscillating outputs, which we attribute to the superconducting phenomenon of flux trapping [37]. Such unwanted flux may be inadvertently trapped near sensitive circuits, where it may modulate the data signal currents along the interconnect striplines. Nonetheless, the results are promising, despite the fact that the critical current (Ic) of the fabricated JJs, which is governed by the effective physical area of the junction, differed by as much as 12% from the designed values.

V. OUTLOOK
The experimental results show that we can indeed perform computation using superconductor AQFP logic. A number of other adiabatic circuits have been reported, including an experimentally demonstrated 16-b carry look-ahead adder (CLA) [38], a demonstrated 8-b DLX-based microprocessor [39], and an in-progress physical design of a 16-b microprocessor [40]. We compare these circuits in Table IV. Our superconductor AQFP circuits surpass all of them in clock rate, by approximately three orders of magnitude compared with most adiabatic circuits and by 5× compared with [40]. These much higher clock rates are achieved while also exhibiting energy efficiency that is higher by two to three orders of magnitude, despite the extra power needed to cool the circuits to cryogenic temperatures. However, the other adiabatic circuits all have a major practical advantage in that they are highly suitable for mobile applications. Superconductor digital circuits, in general, must showcase their advantages in the supercomputing, accelerator, and data center markets, where much more complex circuits are needed, in contrast to the simple architecture of MANA.
To move forward with this technology, improvements in the following areas are necessary: area efficiency, latency and clock distribution, flux trapping, and development of advanced EDA tools.

A. Area Efficiency
The chip photographs in Fig. 8 show that there is a lot of inefficient use of space. This is partly because of the design methodology of using only a single-clock phase per row of logic and the large logic cell size due to the output transformer of the AQFP. We have already shown promising work on using a more advanced eight-layer superconductor IC fabrication process provided by MIT Lincoln Laboratory, Lexington, MA, USA [41] to create a more compact AQFP cell library [42]. We can also potentially remove the large output transformer completely by using directly coupled QFP (DQFP) [43].
But being able to place any logic cell in any physical row regardless of which phase it is clocked would improve row utilization substantially. More investigation is needed to intelligently achieve this through more routing layers in the process or through careful design of the multi-phase clock network such that clock lines are sufficiently isolated from each other within the same row so that they do not interfere with each other or inadvertently excite a logic cell on multiple phases. Another area of investigation to help improve the area is a novel memory design that is compatible with AQFP logic instead of using gate-level implementations of latches.

B. Latency and Clock Distribution
The four-phase power-clocking scheme limits data propagation to only four stages of logic per cycle. This introduced a number of difficulties in trying to operate a microprocessor at the native clock frequency, as discussed in Section II. The limited number of logic stages per cycle makes it extremely difficult, or perhaps impossible, to implement hazard resolution techniques that make the overall pipeline stall free. However, AQFP logic is an intrinsically fast logic family despite its adiabatic operation. If we consider other clocking schemes, such as delay line clocking [44] or N-phase power-dividing clocking [45], the latency can be reduced substantially by allowing more stages of logic to operate in a single clock cycle. Such approaches will allow circuit designers and architects more flexibility to implement advanced pipelining techniques.
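The effect of the number of logic stages per cycle on latency can be illustrated with simple arithmetic. In the Python sketch below, the four-stage-per-cycle limit and the 5-GHz clock come from the text, whereas the logic depth of 40 stages and the alternative of 16 stages per cycle are hypothetical values chosen purely for illustration; they do not describe MANA or the schemes in [44], [45].

```python
# Illustrative latency arithmetic; the logic depth and the alternative
# stages-per-cycle value used in the example are hypothetical.
import math


def stage_time_s(clock_hz, stages_per_cycle=4):
    """Time slot available to one logic stage within a clock cycle."""
    return 1.0 / (clock_hz * stages_per_cycle)


def latency_cycles(logic_depth, stages_per_cycle=4):
    """Clock cycles needed for data to traverse logic_depth stages."""
    return math.ceil(logic_depth / stages_per_cycle)


if __name__ == "__main__":
    print(stage_time_s(5e9) * 1e12, "ps per stage with four-phase clocking at 5 GHz")
    print(latency_cycles(40), "cycles for a 40-stage path at 4 stages/cycle")
    print(latency_cycles(40, 16), "cycles for the same path at 16 stages/cycle")
```

In this example, a path that takes 10 cycles under four-phase clocking would take 3 cycles if 16 stages could be evaluated per cycle, which is the kind of latency reduction the alternative clocking schemes aim for.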
Because these novel clocking schemes still operate at the nominal 5-GHz clock rate used in four-phase clocking, the switching energy remains unchanged. But these clocking schemes come with more complexity on how to distribute their networks at the chip-level and on how to close feedback loops which may require the use of non-adiabatic latches [46] as clock synchronizers.

C. Flux Trapping
As briefly touched upon in the experimental results, superconductor circuits are prone to being negatively affected by unwanted flux trapping [37]. One way to ameliorate this problem is to design moat structures that attract unwanted flux to areas relatively far from the sensitive circuits [47]. Our cell library does integrate some moat structures, but our interconnect currently does not. Development of a systematic strategy for moat design and placement would make AQFP logic, and superconductor electronics in general, more resilient and practical.

D. Advanced EDA Tools
All investigations into the aforementioned challenges can be accelerated with the right set of CAD and EDA tools. Of particular importance are tools to rapidly prototype and evaluate 3-D inductor structures, accurate and fast analog simulation, flux trapping analysis, and flexible chip-level integration tools. With the IARPA SuperTools program in progress for developing EDA and TCAD for VLSI superconductor electronics [48], [49], the outlook looks positive.

VI. TOWARD PRACTICAL CRYOSYSTEMS
AQFP logic is based on superconductivity, which requires cryogenic cooling for the devices to transition into the superconducting state. This implies that AQFP logic is impractical for mobile devices and is better suited for larger computing facilities, such as data centers and supercomputers, where cryocooling systems can be accommodated. Some estimations of the power budgets for large-scale superconductor-based computing systems already exist in [3]. To extend this estimation to AQFP logic, we considered the supercomputing system featuring two Linde LR280 helium reliquefier refrigeration systems from [3]. The details of this system are summarized as follows. The system requires a room-temperature power of 0.4 MW to cool 1020 W of cryoelectronics. An additional 1.6 MW is estimated for the non-refrigerated components, such as room-temperature interfaces, power supplies, and storage [3]. Thus, a total power of 2 MW is estimated for such a system.
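The power budget above can be restated as a short calculation. The Python sketch below uses only the figures quoted in this section; the derived ratio of wall power to cold load applies to this specific LR280-based configuration as quoted here, and the variable and function names are hypothetical.

```python
# Illustrative restatement of the quoted power budget (names are hypothetical).
REFRIGERATION_W = 0.4e6     # room-temperature power for the refrigeration
COLD_LOAD_W = 1020.0        # cryoelectronics power that can be cooled
NON_REFRIGERATED_W = 1.6e6  # interfaces, power supplies, storage


def total_power_w():
    """Total room-temperature power of the system."""
    return REFRIGERATION_W + NON_REFRIGERATED_W


def wall_watts_per_cold_watt():
    """Refrigeration wall power spent per watt dissipated at the cold stage."""
    return REFRIGERATION_W / COLD_LOAD_W


if __name__ == "__main__":
    print(total_power_w() / 1e6, "MW total")
    print(round(wall_watts_per_cold_watt()), "W of wall power per cold W")
```

In this configuration, refrigeration accounts for one fifth of the total 2-MW budget.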
Using this system, we made some estimations based on the NVIDIA Ampere GA100 GPU [50], [51], and the Intel Teraflops Research Chip (codenamed Polaris) [52]. Table V shows how many chips of the aforementioned architectures can be cooled within the refrigeration capacity of the LR280-cooled supercomputer. Note that we are only considering the cooling power budget and not the logistics of physically inserting chips into the cooling system and interfacing them to room temperature. To estimate the complexity of the AQFP-based implementations of the two architectures, we made two separate assumptions: (A) we assumed that the number of JJs is equivalent to the number of transistors reported in the design and (B) we assumed that the number of JJs is 4× the number of transistors. (B) is considered the more conservative estimation, as we expect more JJ devices to be used for data buffering (synchronization), repeaters along long interconnects, active fan-out, and memory when compared with the transistor-based implementation. On the other hand, AQFP logic is intrinsically a majority logic family, which can yield better quality-of-results in majority-logic-optimized synthesis when compared with conventional logic primitives [24], [53].
Noting approximately how many chips can be cooled in the LR280-based supercomputing system, we can estimate performance from the reported TFLOPS of each architecture for different floating-point (FP) operations [50]-[52]. The estimates are shown in Table VI. The goal of the U.S. Department of Energy (DoE) Exascale Computing Initiative, Washington, DC, USA, is to build a high-performance computing system that has a LINPACK performance of 1 EFLOPS within a power budget of 20 MW [54]. The scaled EFLOPS values in Table VI show the estimated exascale performance of the LR280-cooled supercomputing system when it is scaled from the original total power budget of 2 MW to the DoE power budget constraint of 20 MW. The results indicate that a conservative estimation (B) of an AQFP-based GA100 GPU would exceed the 1-EFLOPS goal for both single precision (3.3 EFLOPS) and double precision (1.6 EFLOPS). However, the GA100 uses many devices not just for raw computation but also for working with graphics and textures. The Intel Polaris chip was specifically designed for terascale computing; therefore, it has a higher TFLOPS-to-device ratio when compared with the GA100. In the conservative case (B) of an AQFP-based Intel Polaris supercomputer, the performance is nearly 6× the 1-EFLOPS goal of the DoE. If we can sufficiently address the challenges mentioned in Section V, an expected performance somewhere between our (A) and (B) estimations should be possible. Our implementation of MANA is nowhere near the complexity and performance of the aforementioned architectures, but its successful low-speed functional demonstration, along with the high-speed demonstration of its EX stage, indicates that AQFP logic can indeed perform practical adiabatic computation.
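The scaling step behind the "scaled EFLOPS" figures can be made explicit. The Python sketch below assumes, as the description of Table VI implies, that performance scales linearly with the total power budget; the function names are hypothetical, and the only input values used are the 2-MW and 20-MW budgets and the 3.3-EFLOPS single-precision figure quoted above.

```python
# Illustrative sketch of linear power scaling (assumed; names are hypothetical).
BASE_POWER_MW = 2.0     # total power of the LR280-cooled system
TARGET_POWER_MW = 20.0  # DoE exascale power budget


def scale_to_target(eflops_at_base):
    """Scale a performance estimate from the base budget to the target budget."""
    return eflops_at_base * TARGET_POWER_MW / BASE_POWER_MW


def unscale_from_target(eflops_at_target):
    """Recover the performance implied at the base budget from a scaled figure."""
    return eflops_at_target * BASE_POWER_MW / TARGET_POWER_MW


if __name__ == "__main__":
    # The 3.3-EFLOPS scaled figure corresponds to about 0.33 EFLOPS at 2 MW.
    print(unscale_from_target(3.3))
```

Under this assumption, every scaled figure is 10× the corresponding estimate for the original 2-MW system.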
The extremely low switching energy and relatively high-clock rates (2-5 GHz) of AQFP logic also enable it to interface closely to superconductor-based qubits in the form of a controller for the read-out of quantum states for quantum computing [55]. With further refinement in the design methodologies to improve area efficiency and latency and the emergence of superconductor EDA tools to help with flux trapping analysis, system-level integration, and clocking, more feature-rich performance-driven AQFP circuits may be feasible in the near future.

VII. CONCLUSION
We have designed and demonstrated an adiabatic microprocessor called MANA using unshunted superconductor JJ devices. We implemented this microprocessor using AQFP logic. The basic AQFP cell has a measured switching energy of only 1.4 zJ per JJ at 4.2 K. Even after considering cooling costs, the switching energy is only approximately 1.4 aJ per JJ. The microprocessor was manufactured using a four-layer 10-kA/cm² Nb/AlOx/Nb superconductor integrated circuit process. Our test programs show that our MANA chip is capable of performing all the basic operations of the processor, including R/W access to the RF unit, ALU execution, hardware stalling, and program branching, at a 100-kHz clock rate in experiments. In addition, we developed a high-speed breakout chip of the EX stage of MANA and showed correct operation up to 2.5 GHz. This validates that AQFP logic is not only energy efficient but also a very capable foundation for high-performance design. With these successful demonstrations, we see a potential future in applying this technology to the next generation of energy-efficient supercomputers and data centers.