A Design Framework for Invertible Logic

Invertible logic based on a probabilistic magnetoresistive device model has recently been presented; it computes functions bidirectionally and can quickly solve problems such as factorization and combinatorial optimization. In this article, we present a design framework for invertible-logic circuits. Our approach uses linear programming to create a Hamiltonian library with the minimum number of nodes for small invertible-logic functions. In addition, because the device model is approximated using stochastic computing in synthesizable SystemVerilog, simulation with the compiled SystemC binary is faster than conventional SPICE-level simulation, and the model is verified on a field-programmable gate array (FPGA) as a prototype. Using our design framework, several invertible-logic circuits are designed and emulated (verified) in SystemC, exhibiting five orders of magnitude faster simulation than conventional work.


I. INTRODUCTION
INVERTIBLE logic has recently been presented as providing both forward and backward operations [1], as opposed to typical binary logic, which supports only the forward operation. It is designed based on underlying Boltzmann machines [2] and probabilistic magnetoresistive device models (p-bits) [3], whose input and output signals are represented by random bit streams. The bidirectional computing capability is realized by reducing the network energy of the machine under noise control (e.g., a multiplier can be used as a factorizer in the backward mode). Hence, several challenging problems could be solved quickly, such as integer factorization (e.g., cryptography problems [4]), combinatorial optimization (e.g., wireless sensor networks [5]), and machine learning (e.g., training neural networks [6]).
However, there are several issues in designing large-scale invertible-logic circuits. The functions that operate in bidirectional modes are defined by the Boltzmann machine configurations (Hamiltonians). Existing methods of generating Hamiltonians are limited to small function blocks, such as Boolean logic [7], [8]. Another issue is the simulation speed, which is limited by the complicated device model described at the transistor level. In [9], the model is emulated in software on a microcontroller; however, the simulation is slow (e.g., 100-300 ms per sample).
In this article, a design framework for large-scale invertible logic is presented in order to tackle the two main issues: 1) network configurations (Hamiltonians) and 2) simulation speed. For small network configurations, a Hamiltonian library is created based on linear programming (LP), which provides the minimum number of nodes in Hamiltonians for basic functions, including adders and nonlinear functions. In addition, Hamiltonians of large functions (e.g., multiplication) can be constructed by adding those of small function blocks. For faster simulations, the probabilistic device model is approximated using stochastic computing [10]-[12] in synthesizable SystemVerilog. Stochastic computing, which uses random bit streams, realizes area-efficient computation blocks (e.g., multiplication and the tanh function) and has recently been used for several applications, such as low-density parity-check decoders [13]-[16], image processing [17], [18], and deep neural networks [19]. Because invertible logic can operate serially, stochastic computing approximates the device model efficiently. Therefore, invertible logic can be emulated (verified) in the compiled SystemC environment and verified on prototyping hardware (FPGA). Using our design framework, two noise-control methods are introduced and discussed in terms of convergence speed.
Our contributions are summarized as follows: 1) the first design framework for invertible logic from specification to simulation; 2) Hamiltonian design using LP with the minimum number of nodes; and 3) five orders of magnitude faster simulation than conventional works. The remainder of this article is organized as follows. Section II reviews invertible logic with related works and discusses the current issues of invertible logic design. Section III gives an overview of the proposed design framework for invertible logic. Section IV introduces the creation of the Hamiltonian library using LP and a method of designing large-scale invertible logic. Section V models the probabilistic devices (p-bits) using stochastic computing for fast simulation. Section VI introduces two noise-control optimization methods for fast convergence of invertible logic. Section VII evaluates the proposed design framework against the conventional work in several aspects, such as Hamiltonians and simulation speed. Section VIII concludes this article.

A. Invertible Logic
Fig. 1(a) shows the concept of invertible logic realized using a Boltzmann machine and probabilistic bits (p-bits) [1]. Invertible logic circuits operate in forward and/or backward modes, where functions are embedded using a Hamiltonian with inputs ($x_i \in \{0, 1\}$, $1 \le i \le p$) and outputs ($y_i \in \{0, 1\}$, $1 \le i \le q$). Note that the 2's complement format is used to represent data in invertible logic throughout this article. For example, an invertible multiplier performs multiplication with fixed inputs (forward mode) and factorization with fixed outputs (backward mode). If part of the inputs and the outputs are fixed, the invertible multiplier operates as a divider. Fig. 1(c) shows the Hamiltonian of a two-input AND corresponding to the gate shown in Fig. 1(b). There are three nodes, with the weight values ($J$) between nodes and the bias values ($h$) at nodes given by Eq. (1), where the first two rows correspond to A and B and the last row corresponds to Y. Hamiltonians of simple logic gates can be obtained using ground-state spin logic [7], [8]. With given $h$ and $J$, each node (p-bit) probabilistically generates an output $m_i \in \{-1, +1\}$ ($1 \le i \le l$), where $l$ is the number of nodes. $m_i$ is given by the following equations:

$$I_i = I_0 \left( h_i + \sum_{j} J_{ij} m_j \right), \qquad m_i = \operatorname{sgn}\left( \operatorname{rnd}(-1, +1) + \tanh(I_i) \right) \tag{2}$$

where rnd(−1, +1) is a uniformly distributed random (real) number between −1 and +1, sgn is the sign function (with binary +1 or −1 outputs), and $I_0$ is a scaling factor (an inverse pseudo-temperature). As $m_i$ is represented in bipolar format, "$m_i = +1$" and "$m_i = -1$" correspond to logic values of "1" and "0," respectively. The energy ($H$) of an invertible logic circuit is given by

$$H = -\sum_{i<j} J_{ij} m_i m_j - \sum_{i} h_i m_i \tag{3}$$

By controlling the noise level through parameters such as $I_0$, $H$ ideally decreases to the global minimum, leading to the desired inputs and/or outputs. Fig. 1(d) shows an example of the two-input invertible AND in the backward mode. With the output (Y) fixed to "0" ("$m_y = -1$"), there are three valid states ("ABY"): "000," "010," and "100." In this simulation, the three valid states are obtained with almost equal probabilities of about 33% each.
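The backward-mode behavior of the invertible AND can be reproduced with a few lines of software. The sketch below is only an illustration, not the article's simulator: the numerical values of `H_BIAS` and `J_WEIGHT` are an assumption (the AND weights commonly quoted in the p-bit literature, not copied from Eq. (1)), and the function names, the sequential update order, and $I_0 = 1$ are hypothetical choices.

```python
import math
import random
from collections import Counter

# AND Hamiltonian in node order (A, B, Y). ASSUMPTION: these values come from
# the p-bit literature and are not copied from this article's Eq. (1).
H_BIAS = [1, 1, -2]
J_WEIGHT = [
    [0, -1, 2],
    [-1, 0, 2],
    [2, 2, 0],
]


def energy(m, h=H_BIAS, J=J_WEIGHT):
    """Energy H = -sum_{i<j} J_ij m_i m_j - sum_i h_i m_i."""
    n = len(m)
    pair = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(h[i] * m[i] for i in range(n))
    return -pair - bias


def pbit_update(m, i, i0, rng, h=H_BIAS, J=J_WEIGHT):
    """One p-bit update: m_i = sgn(rnd(-1, +1) + tanh(I_i))."""
    current = i0 * (h[i] + sum(J[i][j] * m[j] for j in range(len(m))))
    return 1 if rng.uniform(-1.0, 1.0) + math.tanh(current) >= 0 else -1


def sample_backward(y_fixed=-1, i0=1.0, sweeps=20000, seed=0):
    """Clamp Y and sample A and B; returns a Counter of visited 'ABY' states."""
    rng = random.Random(seed)
    m = [rng.choice((-1, 1)), rng.choice((-1, 1)), y_fixed]
    counts = Counter()
    for _ in range(sweeps):
        for i in (0, 1):  # only the free nodes A and B are updated
            m[i] = pbit_update(m, i, i0, rng)
        counts["".join("1" if v == 1 else "0" for v in m)] += 1
    return counts


if __name__ == "__main__":
    counts = sample_backward()
    total = sum(counts.values())
    for state, n in sorted(counts.items()):
        print(state, round(n / total, 3))
```

With the output clamped to "0," the three valid states share the lowest energy under these assumed weights, so each should be visited in roughly one third of the samples, while the invalid state "110" appears only rarely.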

B. Related Works
Table I summarizes comparisons of logic family characteristics. Unlike conventional Boolean logic, which realizes only forward operations, invertible logic can realize bidirectional (forward/backward) operations. The numbers of inputs and outputs are flexible, while computation is deterministic in conventional logic and probabilistic in invertible logic. Invertible logic is designed using a probabilistic device model and can be implemented using a magnetoresistive device [1].
Reversible logic circuits are constructed from special gates (such as controlled NOT (CNOT) or Toffoli gates) that have a direct one-to-one mapping of inputs to outputs [20]. While reversible logic gates allow bidirectional circuits to be built, they do not include standard gates (such as AND or OR gates) and require different design methods, such as binary decision diagrams (BDDs) [21]. Although both reversible and invertible logic circuits reconstruct inputs from a given output value, they differ at a fundamental level. Unlike in invertible logic, the number of inputs in reversible logic must equal the number of outputs, which can require additional inputs/outputs even for a simple AND gate [22]. For physical realization, the reversible logic gates used in quantum circuits can be converted to standard binary logic, which can in turn be realized in standard CMOS.

C. Design Issues With Invertible Logic
There are two main issues for large-scale invertible-logic circuits. The first issue is the Hamiltonian design method. Unlike in reversible logic, a large-scale function (e.g., multiplication) is represented by a single corresponding Hamiltonian, which is designed by adding small Hamiltonians based on ground-state spin logic [7], [8]. However, the variety of available Hamiltonians is limited to small functions, such as AND. In addition, there is no specific design method for creating Hamiltonians corresponding to other functions. The second issue is simulation speed. Small invertible logic circuits have been designed and simulated at the transistor level [1] and in a microcontroller (software) [9], which takes 100-300 ms per cycle of operation. For designing large-scale invertible logic, slow simulation could be a critical issue. In particular, because the noise effect, controlled by $I_0$ in Eq. (2), must be tuned for the circuit to converge to a valid state (minimum energy), a parameter search over the noise effect is required. A fast simulator allows designers to find a good noise parameter quickly. In this article, these two issues are tackled using the proposed design framework for large-scale invertible circuits, which provides a design methodology for a large variety of Hamiltonians, a fast simulation environment, and noise-control optimization.

III. DESIGN FRAMEWORK
Fig. 2 shows the proposed design framework for invertible logic. The framework proceeds as follows.
1) A circuit design specification is defined, such as the desired functions and the input/output bit widths.
2) A whole-network Hamiltonian corresponding to the function is generated based on a Hamiltonian library. The Hamiltonian library is created in advance using LP as described in Section IV; it includes Hamiltonians of small invertible logic circuits, such as logic functions and adders. The whole Hamiltonian is obtained by adding the small Hamiltonians using our custom Python tool.
3) The whole Hamiltonian is converted to the corresponding SystemVerilog model using SystemVerilog primitive modules. The primitive modules are designed in advance using stochastic computing [10] as described in Section V, where stochastic computing approximates the probabilistic device model. The SystemVerilog model generated using our custom Python tool is synthesizable using commercial EDA tools, such as Synopsys Design Compiler.
4) A test bench is created with noise control of parameters such as $I_0$. In invertible logic, the convergence speed can change significantly depending on the noise control and its hyperparameters, and a selection of optimum parameters allows the circuit to reach the global minimum energy. Two noise-control methods are introduced in Section VI.
5) The invertible logic circuit described by the SystemVerilog model is verified (emulated) using Verilator [23], which is faster than SPICE simulations and interpreted Verilog simulations; Verilator compiles the SystemC test benches and the SystemVerilog models. Using the fast simulation environment, the hyperparameters for fast convergence to the global minimum can be optimized (noise-control optimization), as described in Section VI. In addition, for quite large invertible circuits, the SystemVerilog model can be verified as a prototype on field-programmable gate array (FPGA) boards through commercial FPGA design tools, such as Xilinx Vivado.

A. Hamiltonian Library of Small Invertible Logic Using Linear Programming
Hamiltonians of small function blocks, such as logic gates, are obtained using LP. Fig. 3 illustrates an example of the Hamiltonian design of an invertible AND (Y = A ∧ B). There are eight states in total, which are divided into valid and invalid states based on the AND function.
Let us explain the procedure of generating a Hamiltonian using the invertible AND. The inputs ($x_i \in \{0, 1\}$, $1 \le i \le p$) and the outputs ($y_i \in \{0, 1\}$, $1 \le i \le q$) are defined as shown in Fig. 1(a). First, logical values are converted to bipolar format as $m_i$. Second, the energy of each state, $E_k$ ($1 \le k \le l + l(l-1)/2$), is defined based on Eq. (3), where $l$ is the sum of the input and output bit widths. In this case, $l$ is 3 and the maximum $k$ is 6. In invertible logic, the energies of the valid states must be equal to the minimum ($E_{min}$), while those of the invalid states must be larger than $E_{min}$:

$$E_{valid} = E_{min}, \qquad E_{invalid} \ge E_{min} + d$$

where $d$ is the energy difference between $E_{min}$ and the second minimum energy. Third, LP maximizes $d$ as the objective function in order to obtain $h_i$ and $J_{ij}$:

$$\text{maximize } d \quad \text{subject to} \quad E = -\sum_{i<j} J_{ij} m_i m_j - \sum_{i} h_i m_i \ \text{for each state, with the energy constraints above}$$

where $m_i$ and $m_j$ are constants and $h_i$, $J_{ij}$, $E$, and $d$ are variables.
Hamiltonians are obtained using LP with PuLP [24]. Listing 1 shows part of the Python code of the LP for the invertible-AND Hamiltonian. Using this method, the Hamiltonians of small function blocks are obtained as listed in Table II. The number of nodes is the sum of the input and output bit widths. These numbers are the minimum values because there is no auxiliary bit (node). Note that the auxiliary bits are extra bits other than the input and output bits [see Fig. 5(a)].
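The formulation above can be checked on the invertible AND without an LP solver. The following sketch is a brute-force stand-in, not the PuLP listing: it enumerates integer $h_i$ and $J_{ij}$ over an assumed range of [−2, 2] (a hypothetical bound that plays the role of the variable bounds an LP would need) and keeps the candidate with the largest gap $d$. All function names are illustrative.

```python
from itertools import product

# Node order: A, B, Y. Logic values 0/1 map to bipolar m = -1/+1.
PAIRS = [(0, 1), (0, 2), (1, 2)]  # index pairs (i, j) with i < j


def energy(h, J, m):
    """Energy -sum_{i<j} J_ij m_i m_j - sum_i h_i m_i; J is indexed like PAIRS."""
    pair_term = sum(J[k] * m[i] * m[j] for k, (i, j) in enumerate(PAIRS))
    bias_term = sum(h[i] * m[i] for i in range(3))
    return -pair_term - bias_term


def is_valid_and(m):
    """True when the bipolar state (A, B, Y) satisfies Y = A AND B."""
    a, b, y = (v == 1 for v in m)
    return y == (a and b)


def search_and_hamiltonian(bound=2):
    """Exhaustively search integer (h, J) in [-bound, bound], maximizing d.

    ASSUMPTION: the bound only keeps the search finite; it is not a value
    taken from the article.
    """
    states = list(product((-1, 1), repeat=3))
    valid = [m for m in states if is_valid_and(m)]
    invalid = [m for m in states if not is_valid_and(m)]
    values = range(-bound, bound + 1)
    best = None
    for h in product(values, repeat=3):
        for J in product(values, repeat=3):
            e_valid = {energy(h, J, m) for m in valid}
            if len(e_valid) != 1:
                continue  # all valid states must share one energy E_min
            e_min = e_valid.pop()
            d = min(energy(h, J, m) for m in invalid) - e_min
            if d > 0 and (best is None or d > best[0]):
                best = (d, e_min, h, J)
    return best


if __name__ == "__main__":
    d, e_min, h, J = search_and_hamiltonian()
    print("gap d =", d, "E_min =", e_min, "h =", h, "J (AB, AY, BY) =", J)
```

An LP solver replaces the exhaustive loop with continuous variables and scales to larger blocks; the constraints being enforced are the same.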
In addition to the logic functions and the adders, Hamiltonians of several unique functions, such as the bitcount function and the rectified linear unit (ReLU) function, can be obtained using LP. These Hamiltonians are created because both functions are often used as building blocks for machine learning in neural networks [25]-[27]. Both are activation functions of neural networks, where the bitcount function is used in binary neural networks. By using these building blocks, invertible logic could be applied to machine learning, especially to training neural networks using the bidirectional operations of invertible logic [6]. Fig. 4(a) shows a Hamiltonian example of a 6-input bitcount function with 6 inputs $(x_1, x_2, \ldots, x_6)$ and a 3-b unsigned output $Y = (y_2, y_1, y_0)$ in 2's complement format. The invertible bitcount circuit can realize $Y = \sum_{i=1}^{6} x_i$ in forward and backward modes. Fig. 4(b) shows a Hamiltonian example of a 5-b ReLU function with a 5-b signed input $X = (x_4, x_3, x_2, x_1, x_0)$ and a 5-b signed output $Y = (y_4, y_3, y_2, y_1, y_0)$, where the ReLU function is defined by $Y = \max(0, X)$.
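To make the notion of valid states concrete for these two functions, the sketch below enumerates the input patterns that are consistent with a fixed output, which is what the backward mode is expected to return. It is a plain truth-table enumeration with hypothetical helper names and does not build the Hamiltonians themselves.

```python
from itertools import product


def bitcount_valid_states(n_inputs=6):
    """All (inputs, Y) pairs with Y equal to the number of input bits set."""
    return [(x, sum(x)) for x in product((0, 1), repeat=n_inputs)]


def bitcount_backward(y, n_inputs=6):
    """Input patterns consistent with a fixed output Y (backward mode)."""
    return [x for x, total in bitcount_valid_states(n_inputs) if total == y]


def to_signed(value, bits):
    """Interpret an unsigned integer as a two's-complement number."""
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def relu_backward(y, bits=5):
    """Signed inputs X with max(0, X) == y for a word of the given width."""
    xs = [to_signed(v, bits) for v in range(1 << bits)]
    return sorted(x for x in xs if max(0, x) == y)


if __name__ == "__main__":
    print(len(bitcount_backward(2)), "input patterns give Y = 2")
    print("ReLU inputs for Y = 0:", relu_backward(0))
```

A positive output fixes the ReLU input uniquely, whereas an output of 0 leaves every non-positive input as a valid answer, so the backward mode can settle on any of them.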

B. Hamiltonian Construction for Large Invertible Logic
Hamiltonians of large and/or complicated functions, such as multiplication, cannot be generated directly using LP because of linear separability problems. Hence, auxiliary bits are required to create such Hamiltonians. The whole Hamiltonian can be created by adding small Hamiltonians as follows:

$$h = \sum_{k} h_k, \qquad J = \sum_{k} J_k$$

where $h_k$ and $J_k$ represent the Hamiltonian of a small circuit, such as an AND, a half adder (HA), or a full adder (FA). Fig. 5 shows an example of the Hamiltonian of a three-input AND. It is obtained by adding two Hamiltonians of the two-input AND. In this case, there is an additional connection (c) that becomes an auxiliary bit. If the Hamiltonian is instead created directly from the three-input AND function, the auxiliary bit can be removed, leading to the minimum number of nodes. When the number of nodes increases because of auxiliary bits, the hardware of invertible logic could become larger and the convergence could become slower.
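The three-input AND example can be sketched as follows. Each two-input AND block is added into a five-node Hamiltonian at the indices of the nodes it touches, and the ground states of the sum are then enumerated. The AND weights are an assumption taken from the p-bit literature rather than from this article, and the node ordering and helper names are hypothetical.

```python
from itertools import product

# Two-input AND Hamiltonian in node order (in1, in2, out). ASSUMPTION: the
# numeric values come from the p-bit literature, not from this article.
AND_H = [1, 1, -2]
AND_J = [[0, -1, 2], [-1, 0, 2], [2, 2, 0]]


def add_block(h, J, block_h, block_J, nodes):
    """Add a small Hamiltonian into the whole one at the given node indices."""
    for a, i in enumerate(nodes):
        h[i] += block_h[a]
        for b, j in enumerate(nodes):
            J[i][j] += block_J[a][b]


def energy(h, J, m):
    """Energy -sum_{i<j} J_ij m_i m_j - sum_i h_i m_i."""
    n = len(m)
    pair = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(h[i] * m[i] for i in range(n))


def build_and3():
    """Three-input AND over nodes (A, B, C, c, Y); c = A AND B is auxiliary."""
    n = 5
    h = [0] * n
    J = [[0] * n for _ in range(n)]
    add_block(h, J, AND_H, AND_J, nodes=(0, 1, 3))  # c = A AND B
    add_block(h, J, AND_H, AND_J, nodes=(3, 2, 4))  # Y = c AND C
    return h, J


def ground_states(h, J):
    """Return the minimum energy and every state that attains it."""
    states = list(product((-1, 1), repeat=len(h)))
    e_min = min(energy(h, J, m) for m in states)
    return e_min, [m for m in states if energy(h, J, m) == e_min]


if __name__ == "__main__":
    h, J = build_and3()
    e_min, states = ground_states(h, J)
    print("E_min =", e_min, "number of ground states =", len(states))
```

Because the two blocks share only the auxiliary node, the total energy is the sum of the block energies, and the ground states are exactly the states in which both blocks are valid.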
For designing Hamiltonians of large invertible logic, the circuit architecture is an important factor that can affect the performance of invertible logic. Fig. 6(a) shows a 4×4-bit unsigned multiplier architecture based on a simple adder-based structure. This design includes (2×4) inputs, 8 outputs, and 32 internal connections. The Hamiltonian is obtained by adding those of the AND and FA blocks generated using LP. The number of nodes in the Hamiltonian is 48. The number of internal connections (auxiliary bits) increases exponentially as the input bit width increases because of the horizontal and vertical internal connections. Note that the well-known Wallace-tree structure for fast multiplier design in conventional logic [28] causes a larger number of internal signals (nodes) than the adder-based structure.
In order to obtain a smaller number of nodes, the proposed multiplier is designed using bitcount circuits, as shown in Fig. 6(b). As there is no internal connection in the bitcount circuit, the vertical internal connections can be eliminated, leading to a smaller number of nodes. In the case of the 4×4-bit multiplier, the number of internal connections (auxiliary bits) decreases to 26, and hence the total number of nodes decreases to 42. The reduction method is much more effective in larger multipliers. Fig. 7 compares the number of nodes in invertible multipliers (factorizers). The number of nodes increases exponentially in the conventional adder-based multiplier because horizontal and vertical internal connections (auxiliary bits) are required. As the proposed bitcount-based structure eliminates the vertical internal connections, the number of nodes increases almost linearly, leading to significant reductions in the number of nodes. As a result, the number of nodes in the 4×4-bit and the 12×12-bit multipliers are reduced by 80.6% and 89.1%, respectively. The detailed evaluation is described in Section VII-E.

A. Binary and Integral Stochastic Computing
In invertible logic, p-bits operate based on Eq. (2) with the bias values ($h$) and weight values ($J$) of Hamiltonians. In order to realize faster simulations than conventional works at the transistor level [1] and in a microcontroller (software) [9], a SystemVerilog model corresponding to Eq. (2) is created. The SystemVerilog model is designed based on binary and integral stochastic computing [10], [19].
In stochastic computing, data values are represented by the frequencies of "1" in bit streams. Let us denote by $S \in \{0, 1\}$ a random bit stream. A real number $s \in [-1, 1]$ is represented by $2E[S] - 1$ in binary stochastic computing in bipolar format, where $E[S]$ denotes the expected value of the random variable $S$. In contrast, in integral stochastic computing, one or more bit streams are used concurrently to represent data values over a larger range than that of binary stochastic computing. Let us denote by $X \in \{-r, -(r-1), \ldots, r\}$ a random stream, where $r \in \{1, 2, \ldots\}$. A real number $x \in [-r, r]$ is represented by $E[X]$ in signed format, where $E[X]$ denotes the expected value of the random variable $X$.
Stochastic computing realizes several functions, such as addition, multiplication, and nonlinear functions (see details in [12]). Fig. 8(a) shows a multiplier of an integral stochastic bit stream and a binary stochastic bit stream ($y = a \cdot s$) designed using a two-input multiplexer. Fig. 8(b) shows a tanh function block (Stanh) using a finite state machine (FSM) in stochastic computing. The tanh function is approximated using Stanh as follows:

$$\tanh(N_T \cdot s) \approx \mathrm{Stanh}(2 N_T, s)$$

where $2 \cdot N_T$ is the total number of states of the FSM. The Stanh block is designed using a saturating up-down counter in hardware.
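The bipolar encoding, the multiplexer-based multiplier, and the Stanh block can be illustrated in software. The sketch below is a behavioral model with hypothetical function names, not the SystemVerilog primitives; it assumes the usual FSM approximation in which a saturating counter with $2 N_T$ states approximates $\tanh(N_T \cdot s)$, and it uses Python's pseudo-random generator in place of a hardware random number source.

```python
import math
import random


def bipolar_stream(s, length, rng):
    """Bit stream S with P(S = 1) = (s + 1) / 2, so that s = 2 * E[S] - 1."""
    p_one = (s + 1.0) / 2.0
    return [1 if rng.random() < p_one else 0 for _ in range(length)]


def bipolar_value(bits):
    """Estimate the bipolar value 2 * E[S] - 1 from a bit stream."""
    return 2.0 * sum(bits) / len(bits) - 1.0


def mux_multiply(a, bits):
    """y = a * s: each bit selects +a or -a, as a two-input multiplexer would."""
    return [a if b == 1 else -a for b in bits]


def stanh(bits, n_t):
    """Saturating up/down counter with 2 * n_t states; output 1 in the upper half."""
    state = n_t  # start at the middle of the range 0 .. 2 * n_t - 1
    out = []
    for b in bits:
        state = min(state + 1, 2 * n_t - 1) if b == 1 else max(state - 1, 0)
        out.append(1 if state >= n_t else 0)
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    bits = bipolar_stream(0.5, 100000, rng)
    product_stream = mux_multiply(3, bits)
    print("s estimate:", bipolar_value(bits))
    print("3 * s estimate:", sum(product_stream) / len(product_stream))
    print("Stanh estimate:", bipolar_value(stanh(bits, 2)), "tanh:", math.tanh(2 * 0.5))
```

The counter output only approximates tanh, so a small systematic difference from the exact value is expected even for long streams.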

B. Spin-Gate Circuit for Modeling p-Bits
In the spin-gate model [Eq. (10)], $I_0$ of Eq. (2) corresponds to $N_T$. In addition, the weighted noise source, whose weight is denoted by $w_{rnd}$, is an additional parameter with respect to Eq. (2). The weighted noise source is generated using a random number generator [29]. The model is designed as an extended version of [30] and [31] and supports controlling $I_0$ through $N_T$. The inputs and the output of the spin-gate circuits ($m_i$) are represented in binary stochastic computing in bipolar format as stochastic bit streams, $s_i = (1 + m_i)/2$. In contrast, integral stochastic computing is exploited inside the spin-gate circuits in order to deal with integer values of $h$ and $J$. As the model is fully designed using stochastic computing, it is synthesizable for standard digital CMOS circuits.

VI. NOISE-CONTROL OPTIMIZATION
In invertible logic, it is important to control noise effects in order to reach the global minimum of the energy (Hamiltonian). To make the node states converge to those at the global minimum, $N_T$ and/or $w_{rnd}$ of Eq. (10) can be controlled as noise optimization. In this article, $w_{rnd}$ is the parameter that is controlled, under two scenarios.

A. Grid Search on Monotonous Noise Reduction
To find the optimum control of $w_{rnd}$, a grid search is used as shown in Fig. 10. In the grid-search method, $w_{rnd}$ is linearly decreased using four parameters: RND_WEIGHT, RND_STEP, $N_s$, and T. These four parameters are swept so that the energy converges to the global minimum. The monotonic noise reduction (MNR) method derives from [32], which monotonically increases $I_0$. This method is simple, but it requires a long simulation time to find good noise parameters. The detailed simulation results are summarized in Section VII-D.

B. Tuning Parameter Repeat With Pulsed Noise
In order to reduce the simulation time needed to find good noise parameters, tuning parameter repeat (TPR) is introduced. In this method, a short pulsed noise is applied repeatedly, as opposed to MNR based on grid search. Fig. 11 shows noise control based on TPR with an example of the factorization of 756 (in_A × in_B). In TPR, $w_{rnd}$ is decreased from large to small in one shot. There are three parameters in TPR as follows.
1) RND_WEIGHT: The maximum value of w rnd .
3) T: Cycles at large or small noise. Hence, a cycle of one shot is (2 * T + RND_STEP).
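The pulsed schedule can be sketched as below, using the example values RND_WEIGHT = 6, RND_STEP = 4, and T = 6 quoted for the factorization of 756. Several details are assumptions made only for illustration: RND_STEP is read as the number of cycles spent stepping down (inferred from the one-shot length of 2 * T + RND_STEP), the steps are linear, and the small-noise level `low` is taken to be 0. The function names are hypothetical.

```python
def tpr_one_shot(rnd_weight, rnd_step, t, low=0.0):
    """One TPR pulse: T cycles at large noise, RND_STEP cycles stepping down,
    then T cycles at small noise.

    ASSUMPTION: linear steps and the value of `low` are illustrative choices.
    """
    high = [float(rnd_weight)] * t
    ramp = [
        rnd_weight - (rnd_weight - low) * (k + 1) / (rnd_step + 1)
        for k in range(rnd_step)
    ]
    return high + ramp + [float(low)] * t


def tpr_schedule(rnd_weight, rnd_step, t, shots, low=0.0):
    """Repeat the same one-shot pulse `shots` times."""
    return tpr_one_shot(rnd_weight, rnd_step, t, low) * shots


if __name__ == "__main__":
    shot = tpr_one_shot(rnd_weight=6, rnd_step=4, t=6)
    print(len(shot), "cycles per shot:", shot)
```

Each entry of the returned list would be applied as $w_{rnd}$ for one cycle, and the same pulse is simply repeated until a trial reaches a valid state.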
As invertible logic is probabilistic, the results (energies) can differ even if the same noise parameters are applied. This example shows the factorization of 756 using TPR with RND_WEIGHT = 6, RND_STEP = 4, and T = 6. The tuning parameters were determined using simulations of a small invertible factorizer and can be applied to larger invertible factorizers. In this example, the correct input values of in_A and in_B are not obtained at the first and second trials. In contrast, at the third trial using the same noise parameters, the

A. Comparisons of Hamiltonian
The Hamiltonian library is created using LP with PuLP [24] on an AMD Opteron 6282 SE (2.6 GHz), which is used for all the simulations. Table III summarizes the number of nodes in the Hamiltonians in comparison with a conventional work [33]. The conventional method is based on [1], which uses auxiliary bits and a handle bit to create Hamiltonians, causing a larger number of nodes. In contrast, the proposed method using LP generates the minimum number of nodes for the Hamiltonians of the AND, the FA, and the 32-b adder. The numbers of nodes in the FA and the 32-b adder are reduced by 64.3% and 77.7%, respectively, in comparison with the conventional method.

B. Simulation of Invertible Logic Circuits
Invertible logic circuits are simulated using our SystemVerilog model with the compiled SystemC binary in Verilator [23] and SystemC-2.3.2. Verilator is a fast Verilog-HDL simulator running on C++ and SystemC, which accepts synthesizable Verilog-HDL and SystemVerilog. Fig. 12 shows simulated waveforms of a seven-bit bitcount function in the backward mode. The output of Y is fixed in order to obtain correct seven inputs of $(x_1, x_2, \ldots, x_7)$ at Y = 2 and Y = 6 with a noise control of MNR, where RND_WEIGHT = 16, RND_STEP = 2, $N_s$ = 7, and T = 10 are used. When the
Fig. 13 shows simulated waveforms of a 10-b ReLU function in the backward mode with the same noise control used in the previous simulation. When the output of Y is fixed to 121 in Fig. 13(a), the input of X reaches the correct value of 121 at the global minimum energy of −28. In contrast, when the output is fixed to 0 in Fig. 13(b), the input can be any negative values as the correct values because of the function: Y = ReLU(X) = max(0, X).
Fig. 14 shows simulated waveforms of the invertible factorizer (adder-based) based on the architecture of Fig. 6(a). For small invertible-logic circuits simulated in the previous paragraph, it is easy to reach the global minimum energy with many different noise parameters. In contrast, for large invertible-logic circuits, such as invertible multipliers, specific noise parameters are required for the convergence. In order to converge to the global minimum, first, the grid search on MNR is used to find the optimum control of $w_{rnd}$ of Eq. (10). The total number of cycles is $9.5 \times 10^{7}$ in the grid search to find the optimum noise parameters on MNR. In case of (A × B) = 182, $w_{rnd}$ is reduced with RND_WEIGHT = 8, RND_STEP = 1, $N_s$ = 7, and T = 128. In contrast, in case of (A × B) = 598, $w_{rnd}$ is reduced with RND_WEIGHT = 16, RND_STEP = 1, $N_s$ = 15, and T = 64.
Depending on the outputs (A × B), the optimum noise parameters are different, causing long convergence time, even though our fast simulation environment is used. The evaluation of noise control is described in Section VII-D.

C. Comparisons of Simulation Speed
Table IV summarizes the simulation time per sample (cycle) in a 2×2-bit invertible multiplier (factorizer). In the conventional work [9], the complicated device models of Eq. (2) are realized using software running on a microcontroller. The sample time is slow, at 100-300 ms. In such an environment, a noise-control optimization of $I_0$ for convergence to the global minimum requires a significantly long time.
In contrast, our SystemVerilog model using stochastic computing is simulated as an emulation of the device model in Verilator and SystemC-2.3.2. As a result, the cycle (sample) time is around 5.3 μs, a reduction of around five orders of magnitude. As opposed to the conventional work, larger invertible multipliers, such as 32-b ones, can also be designed and simulated.
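As a rough check of the scale of this speedup, the per-sample times quoted above can be divided directly. This uses only the 100-300 ms and 5.3 μs figures from the text and is a back-of-the-envelope estimate, not an additional measurement.

```latex
\frac{100\ \mathrm{ms}}{5.3\ \mu\mathrm{s}} \approx 1.9 \times 10^{4},
\qquad
\frac{300\ \mathrm{ms}}{5.3\ \mu\mathrm{s}} \approx 5.7 \times 10^{4}
```

The ratio therefore lies between roughly $2 \times 10^{4}$ and $6 \times 10^{4}$, approaching five orders of magnitude at the slow end of the conventional range.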

D. Comparisons of Noise Control in Invertible Multipliers
Table V summarizes the total number of cycles and the simulation time of invertible multipliers with the different noise controls described in Section VI. The total number of cycles is counted until good noise parameters for convergence are obtained. Using the grid search, both the total number of cycles and the simulation time increase significantly as the bit width increases. As a result, in larger invertible logic circuits, it is hard to converge to the global minimum and hence to obtain correct values.
In contrast, using TPR, both the total number of cycles and the simulation time increase only negligibly as the bit width increases. In the case of the 16-b invertible multiplier, the simulation time is five orders of magnitude faster than that of the

E. FPGA Implementation for Prototyping
As the SystemVerilog model is synthesizable, invertible logic circuits can be evaluated on an FPGA as a prototype. Table VI summarizes the synthesis results of invertible logic circuits in Xilinx Vivado 2016.4 for the Digilent Genesys 2 board with a clock frequency of 100 MHz. As the clock frequency is 100 MHz, the sample time is 10 ns, which significantly increases the simulation speed in comparison with the conventional work and the proposed SystemC environment summarized in Table IV. Note that generating bitstream files for the FPGA takes much longer than compiling to the SystemC binary files. Therefore, the SystemC-based environment is useful for small invertible circuits, while the FPGA environment is useful for large ones that require longer simulation time.
Considering the hardware resources, in general, the numbers of LUTs and FFs are large when the numbers of nodes and nonzero weights in the Hamiltonian are large. Note that the number of nonzero weights is obtained from $h$ and $J$ of the Hamiltonians. When the adder-based and bitcount-based invertible multipliers are compared, the number of nodes in the bitcount-based structure is significantly smaller than that of the adder-based structure, as shown in Fig. 6. In contrast, its number of nonzero weights is larger because of the denser matrix $J$. As a result, the bitcount-based invertible multiplier reduces the LUT count by 7% and the FF count by 38% in comparison with the adder-based one.

VIII. CONCLUSION
In this article, we have presented a design framework for large-scale invertible logic. The Hamiltonian library created using LP provides the minimum number of nodes in Hamiltonians for basic functions, where the library includes Boolean logic, adders, bitcount, and ReLU functions. As a design example of a large invertible logic circuit, the Hamiltonians for invertible multipliers (factorizers) are constructed using the library, resulting in a reduction of more than 80% in the number of nodes in comparison with that of the conventional structure. For fast simulations, the probabilistic device model used for invertible logic is approximated using stochastic computing in SystemVerilog running with the compiled SystemC binary, providing a reduction of almost five orders of magnitude in simulation time in comparison with the conventional environment. In our fast simulation environment, the tuning parameter repeat method is introduced as noise-control optimization, reducing the convergence time by five orders of magnitude in comparison with the grid-search method.
Invertible logic was presented in 2017 to demonstrate integer factorization [1] and has since been studied in several aspects, such as scalability, applications, and implementations. The scalability has been studied and discussed in [32]; however, optimization algorithms for specific problems (e.g., graph coloring) have not been studied and remain future work. One possible application is machine learning, especially training neural networks using the bidirectional operations of invertible logic [6]. In the future, our design framework would be useful as a design and test tool for implementing invertible logic with probabilistic magnetoresistive devices. In addition, larger invertible logic circuits using stochastic computing with standard CMOS devices could be designed for several applications.