Autonomous Probabilistic Coprocessing with Petaflips per Second

In this paper we present a concrete design for a probabilistic (p-) computer based on a network of p-bits, robust classical entities fluctuating between -1 and +1, with probabilities that are controlled through an input constructed from the outputs of other p-bits. The architecture of this probabilistic computer is similar to a stochastic neural network with the p-bit playing the role of a binary stochastic neuron, but with one key difference: there is no sequencer used to enforce an ordering of p-bit updates, as is typically required. Instead, we explore sequencerless designs where all p-bits are allowed to flip autonomously and demonstrate that such designs can allow ultrafast operation unconstrained by available clock speeds without compromising the solution's fidelity. Based on experimental results from a hardware benchmark of the autonomous design and benchmarked device models, we project that a nanomagnetic implementation can scale to achieve petaflips per second with millions of neurons.


INTRODUCTION
Stochastic neural networks (SNN) are widely used for machine learning, inference and many other emerging applications 1 . A common version of such algorithms is based on the concept of a binary stochastic neuron (BSN) 2,3 which fluctuates between -1 and +1 with probabilities that can be controlled through an input, I i , constructed from the outputs of other BSNs, m j . The synaptic function, I i ({m}), can have many different forms depending on the desired functionality, but we will restrict this discussion to linear functions defined by a set of weights W ij such that where β is a constant and τ S is the 'synapse time', that is the time it takes to recompute the inputs {I} everytime the outputs {m} change. In software implementations, each BSN is updated repeatedly according to where r [a,b] represents a random number in the range [a, b], and τ N is the 'neuron' time, that is the time it takes for a neuron to provide stochastic output m i with the correct statistics dictated by a new input I i . It is well-known 4 that to ensure fidelity of operation it is important to avoid simultaneous updates of two BSNs that are causally connected through a non-zero W ij . The standard approach is to update each BSN sequentially according to Eq. (2), recomputing the input from Eq. (1) after each update, a procedure known as Gibbs sampling 5 . By contrast, the objective of this paper is to explore the feasibility of ultrafast operation through an autonomous a) Electronic mail: bmsutton@purdue.edu b) Electronic mail: datta@purdue.edu architecture whereby each BSN continually fluctuates between -1 and +1 with probabilities that are controlled by the input I i . We refer to this autonomous BSN as a p-bit to highlight its role as the key element of an autonomous p-computer (ApC), similar to the role of a q-bit in a quantum computer.
As a quantitative measure of an ApC's speed of operation we use the number of flips per second (f ), a flip being defined as a p-bit update attempt. For purely sequential updating, the number of flips per second is ∼ 1/(τ S +τ N ). However, updating need not be purely sequential since unconnected BSNs can be simultaneously updated without loss of fidelity. If a number (N p ) out of the total number (N ) of BSNs can be updated in parallel, then the number of flips per second will be much larger ∼ N p /(τ S + τ N ). Note, however, that in order to achieve this enhanced flip rate, the number of neurons that are simultaneously updated, N p , have to be deliberately selected using a digital sequencer, so that the clock period, τ clock , limits the maximum number of flips per second: This clock speed will be limited by τ S , τ N , and the overhead associated with clock distribution 6 . The objective of this paper is to present a framework for clockless operation 4 whose speed is limited only by the neuron and synapse speeds where s ≡ τ S /τ N . We will show that this autonomous mode provides high fidelity results without supervision keeping the fraction of detrimental simultaneous updates down to an acceptably low. Detrimental updates are managed simply by choosing a small s so that τ N is much longer than τ S , without using a digital sequencer to enforce a deliberate update order. As a result, f is not limited by τ clock and can continue to operate faster as the synapse time τ S is lowered. For example if we have 2 nearest neighbor connections with weights of {−1, 0, +1}, then the synapse can be implemented with short wires which respond in times less than 10 to 100 ps 7 , much shorter than typical clock periods. Eqs. (3a) and (3b) suggest that an autonomous design will allow faster operation (that is, more flips per second) if The factor N p /N represents the fraction of neurons that can be updated simultaneously, which depends on the fan-in and the nature of interconnections and problem topology. For example, with nearest neighbor connections on a 2D square lattice as in Fig. 1(b), half the nodes can be updated simultaneously so that N p /N = 1/2. The factor s also depends on the interconnections, but it additionally depends on the nature of the problem and the degree of solution fidelity needed, as we will show in this paper. This paper demonstrates the feasibility of an ApC that performs the weight logic and p-bit functions defined by Eqs. (1) and (2) without the aid of sequencers as portrayed in Figs. 1(a) and 1(b). Our work is motivated by the compact, fast, energy efficient hardware that are currently being developed for the implementation of these functions 8 as shown in Fig. 1(c), which we emulate using existing CMOS devices on an easily reconfigurable, cloud accessible digital FPGA platform, Fig. 1(d). The detailed methodology for constructing this p-computing coprocessor is described in the Methods section. In the following sections we will describe the implementation approach and discuss results for two distinct applications, one involving combinatorial optimization and one involving emulated quantum annealing. These applications will help show that accurate results can be obtained with a sequencerless probabilistic computer of the type envisioned by Feynman 9 , implemented using modern devices to enable operation at ultrafast rates unconstrained by the available clock speed.

APC IMPLEMENTATION
The building block for our ApC has four components as shown in Fig. 1(a): • weight logic to implement Eq. (1), • p-bit to implement Eq. (5), • write unit to program the weights W ij and β • read unit to access the individual p-bit outputs Fig. 1(b) shows how multiple building blocks can be interconnected to form a p-computer. The tiling shown is based on nearest-neighbor connections, but the connections need not be limited to nearest-neighbor. We have also implemented all-to-all networks using the digital emulator shown in Fig. 1 Superficially Fig. 1 looks like other existing neural network architectures, like the one used for TrueNorth 10 . However, to our knowledge, earlier implementations have used time-multiplexing 10,11 to share the same resource among different neurons and synapses, while our objective is to eliminate sequencing and time-multiplexing altogether so that we are not constrained by available clock speeds. TrueNorth for example uses 4096 neuronsynaptic cores, each core having dedicated neuron and synaptic memory forming 256 logical neurons that are time multiplexed sequentially to implement 4096 × 256 ≈ 1M logical neurons 10 . For our sequencerless operation we would need 1M distinct building blocks for the same number of neurons. This would be impractical if we were relying on fully digital implementations, but the compact hardware implementations currently being developed makes such a design feasible. As an example, stochastic behavior of nanomagnets has recently attracted attention in the context of novel computing paradigms, and they show promise in probabilistic and neuromorphic applications [12][13][14][15][16][17][18] . For example, an MRAM-based pbit requires only 3 transistors and a magnetic tunnel junction 19 , while its digital emulation requires significantly more transistors. With such compact hardware, it is feasible to have one building block for every p-bit in order to support sequencerless, autonomous operation that is not limited by clock speeds.
Since digital platforms are inherently synchronous, we mimic autonomous operation by replacing Eq. (2) with a new hardware-inspired model, Eq. (5) (PPSL), that we benchmarked against established state-of-the-art physical models as described in the Methods section. These equations are based on SPICE simulations of Boltzmann networks where the update order of p-bits is irrelevant given symmetric coupling between connected pbits. However, for certain networks such as those with directed connections, the update ordering of p-bits may be important and other hardware models more appropriate for these systems are likely required. These models are not discussed in this paper, but the overall FPGA architecture was designed to support the exploration of different hardware models and network topologies, hence they can be included here with minor effort.
At each time step, all p-bits are free to flip and they do so with a probability ∼ s that is controlled by the input I i having a zero-input value s(I i = 0) = s 0 1.
In each time step the p-bit flips with a probability ∼ s, so that the average time taken for it to respond is 1/s. Since Here a black and white image is used to generate magnetic domains corresponding to each character. The problem is then solved leveraging an FPGA ApC co-processor (see Methods). b. Online annealing is used to transition the network from a high to low temperature through a reduction of T (n + 1) = 0.9 T (n) where T = β −1 (inset). During this process the network converges to a low energy state solution to the problem of interest. c. The network features begin to emerge as lower temperatures are reached. The probability of an individual p-bit flipping is controlled to explore how simultaneous updates effect the network or s = 1 c, s = 1/2 d, and s = 1/4 e. If s is too large, there is no convergence to the ideal solution. As s is lowered, the network is more effective at finding low energy solutions, even at higher temperatures.
In general, these systems map a given optimization problem onto a hardware whose operation is guided by a cost function 30 . A common architecture used for such problems is the nearest-neighbor Ising model as shown in Fig.  2

(a).
In the Ising model, each p-bit is connected to its neighbor through a coupling matrix, J ij , and is influenced by an on-site bias h i . Note that this is trivially mapped into W ij as in Eq. (1). By mapping a problem of interest to this system, an Ising computer will intrinsically search for the lowest energy solution to the problem.
Typical implementations of these systems leverage some form of simulated annealing in hardware to guide the system into a low energy solution, using careful control and sequencing of spin-flip updates to avoid non-ideal spin updates 21,31,32 . In the context of a nearest-neighbor topology, the network can be split into two groups in a checkerboard-like pattern such that all spins in a given group can be updated in parallel 33 . This results in the ability to update half of all spins within a given clock period. According to Eq. (3a) this results in An ApC can be used to solve the same class of problems, but to do so a value of s must be found that produces a solution with the desired fidelity. By identifying this limit for s, an upper bound is placed on the synapse speed that must be obtained to achieve a competitive f from Eq. (4).
Shown in Fig. 2 is a Max-Cut problem for which a black and white image was used to encode magnetic domains for the p-bits. The network is initialized to run at a high effective-temperature (low β) resulting in an effectively uncoupled network. The temperature is then gradually reduced according to an annealing scheduling until the network crystallizes at a low energy, in this case ideal, solution as shown in Fig. 2 , different values of s are used to convey how simultaneous updating affects the ability of the network to converge to the solution. As s is decreased from 1 to 1/2 and 1/4, the smaller values of s result in a more effective convergence. While these results are heuristic in nature given their visual display, a quantitative energy based analysis conveys the same relationship as discussed in the following section with a demonstration of simulated quantum annealing.
For this problem, a value of s = 1/4 resulted in effective convergence to the ideal solution and well-behaved network operation. Given this result, the ApC has Comparing (6) and (7), as long as the synapse is twice as fast as the best τ clock that could be obtained from a synchronous implementation, the network will perform the same number of flips per second without needing a sequencer.

QUANTUM EMULATION
Simulating the behavior of quantum systems using classical models has long attracted interest. Since the seminal work of Suzuki 34 , it has been known that thermodynamic features of quantum systems that avoid the "sign problem" 35 can be efficiently simulated by classical computers. This allows Quantum Annealing (QA) algorithms designed for quantum systems to be simulated on classical computers, an approach that is called Simulated Quantum Annealing (SQA). Computationally, SQA uses a finite number of "replicas" where each replica is sized to match the size of the original quantum system. This replication enables a mapping of the quantum system to a classical collection of p-bits. The number of replicas that are needed depends on the temperature of the quantum system 36,37 and the desired accuracy to emulate the quantum system. SQA algorithms are typically run on software 36-38 and on sequencer-based hardware designs 39,40 . Here, we demonstrate how quantum systems can be emulated using our ApC by solving a model quantum system, the Transverse Ising Hamiltonian, and establish how ApC exactly reproduces the thermodynamics of a many-body quantum system at finite temperatures and magnetic fields.
Recently, it was theoretically argued 41 that a network of p-bits described by Eq. (1) and interconnected according to Eq. (2) can be used to emulate a quantum system with a finite number of classical replicas, where a p-bit is represented in hardware by the stochastic MTJ-based implementation of Fig. 1(c).
We start from the nearest neighbor Transverse Ising Hamiltonian in 1D 42 : where J i,i+1 represents the interaction between neighboring spins, Γ x and Γ z are local magnetic fields in thê x andẑ directions. After a Suzuki-Trotter mapping, the 2D classical Hamiltonian that approximates this system with n replicas is given as: where (J ) i,j = J i,j /n, γ z = Γ z /n and the vertical coupling term is Eq. (9) can be used to find each diagonal element of the quantum density matrix and therefore, all diagonal operators, and their correlations can be calculated from it. The corresponding interaction matrix (J ij ) to perform a p-bit simulation can be calculated from the mapped classical Hamiltonian, such that I i = −∂H C /∂m i to yield the Γx at an inverse temperature of β = 20 using a network of 8 × 250 p-bits with periodic boundary conditions. The 250 p-bits serve as replicas (b.) generated using a Suzuki-Trotter decomposition. The exact quantum Boltzmann solution is compared against the average z magnetization for different values of s. As s is decreased, the system converges to the Boltzmann solution for a value of s = 1/12. As s decreases beyond s = 1/12, the system begins to diverge from the solution for larger Γx/J due to the chosen precision of the digital implementation. For each ΓX and s, 200 samples from a free-running FPGA implementation were collected spaced ≈ 30, 000 synapse delays apart. c. Boxplots are shown of the difference in computed mean at each Γx/J from the ideal Boltzmann solution, for each value of s. weight coefficients. Note that the Suzuki-Trotter decomposition adds another dimension to the classical system, therefore a 1D linear chain for a quantum system is emulated by a 2D classical system. In Fig. 3, we simulate a ferromagnetic linear chain (J i,j = +2) using M = 8 spins with periodic boundary conditions at different transverse magnetic fields and we compute the average magnetization, m z . We include a symmetry breaking field in the +z direction (Γ z = +1) to obtain a net magnetization at vanishing transverse fields (Γ x = 0). We obtain the exact density matrix by directly diagonalizing the quantum Hamiltonian (Eq. (9) shown as a black solid line in Fig. 3(a).
The corresponding average magnetization is obtained by running the classical system that is described by Eq. (9) with n = 250 replicas (250 × 8 = 2000 spins) using Eq. (5) in the FPGA emulator. Fig. 3(a) shows different s values that are used to obtain the exact result. Clearly, choosing s too high (s = 1/1) fails, but gradually decreasing s allows the result to approach the exact result. We observe that s = 1/12 seems to be an optimal choice, as decreasing s further does not yield more improvement due to the chosen precision in the digital implementation. Specifically, as s continues to reduce, the chosen precision of the arithmetic logic and look-up tables in the FPGA emulator design limits the accuracy of the solution (please see Methods). Fig. 3(b) quantitatively shows a boxplot of the error incurred at each s value across 200 trials. In the the Suzuki-Trotter decomposition, increasing Γ X /J systematically increases the error but a reasonably close agreement is observed for s ≤ 1/8.
Typically, SQA algorithms initialize the system at a high magnetic field (Γ X ) and slowly remove it to keep the system in its ground state and guide it to the desired ground state of the Ising Hamiltonian. In Fig. 3, however, we have not changed the magnetic field as a function of time, but rather sampled from 200 ensembles, each separated by ∼ 30, 000 synapse delays, to obtain the system statistics. This means that not only was the system guided to the expected ground state (Γ X → 0, m → 1), but it also followed the correct average magnetization at high magnetic fields. As such, this could be viewed as an example of sampling a probability distribution rather than finding the ground state of the system 2 , an important problem space where an ApC could be useful.
It is important to note that to solve optimization problems using SQA, an exact mapping of the quantum system may not be optimal and a finite number of replicas that only approximate the thermodynamics of the quantum system could lead to more efficient results. In the present context, optimizing the number of replicas for a given system can be arranged since our system is not a natural quantum system nor is it trying to faithfully reproduce the behavior of such a system. Therefore, optimizing the number of physical replicas or finding the replica with the lowest energy becomes possible in our engineered system 37 .

DISCUSSION
These examples demonstrate the feasibility of an ApC governed by Eq. (5) to follow Boltzmann statistics without the need for a controlling sequencer. But in order to make use of ultrafast flip rates, f, enabled by short τ S times, it is important that the building blocks be energy efficient to ensure that power levels are acceptable: a Calculations based on information extracted from 21 . Design used sequential implementation, therefore we assume ( †) s = 1 and an optimal updating scheme such that half of all neurons are updated every step such that f = (0.5)N/τ S . Synapse and neuron delays assumed to operate at interaction frequency of 100 MHz. b Calculations based on information extracted from 31 . Janus II supports a much larger number of neurons using external memory, multiple FPGAs, and data shuttling between devices. Here we limit the comparison to only a single spin processor assuming all spins are co-located on one FPGA, as would be required for an autonomous design. As with (a), we assume ( †) s = 1. With all spins on one chip, an optimal updating scheme updates half of all neurons every step for f = (0.5)N/τ S . Based on this, the spin update time calculated here is 4 instead of 2 as in 31 . Design assumed to operate at operating frequency of 250 MHz.  where P is the power budget and ε is the energy required per flip.
In Table I, we compare representative Ising computers based on sequenced designs 21,31 to the autonomous designs implemented in and projected by this work. The last row of the table shows ε for both the sequenced and autonomous designs. As shown in the table, the FPGA based 8K-spin ApC achieves an energy per flip that is similar to the Janus II sequenced FPGA implementation. However, this comparison applies some artificial constraints on the Janus II design: namely that all spins must be simultaneously resident in the device at one time. This is a requirement for unclocked autonomous designs that for the sake of comparison we applied to the sequenced design. A clocked, sequenced design can pause the system and leverage external memory for storage of neuron state, providing a large pool of logical neurons, though limited by available memory bandwidth. Additionally, the ability to pause the network enables timemultiplexing and can harness the ability to re-use logic resources.
The FPGA results naturally consume more power and have reduced density 45 . The 8K spin result of Table I has ∼ 18 W of static power dissipation due to the periphery included in the FPGA design, not all of which is used. Using the FPGA to ASIC power ratio 45 of 14, a naive migration of the design to an ASIC would result in an ε of ∼ 4.0 × 10 −3 . By translating these approaches to a tailored ASIC implementation, the energy efficiency per flip increases and the ability to increase the density of resident neurons within a device increases substantially. Extending this further, it should be feasible to obtain designs with ∼ 10 petaflips per second with a power budget of ∼ 10 W, but we need devices with ε ∼ 1 fJ that also support a density of 1M devices.
The CMOS based SRAM design of Hitachi 21 has an estimated energy per flip of ε ∼ 50 fJ, though the neuron density limits the ability of the approach to obtain an f of petaflips per second. Using a hybrid CMOS and MTJ design as would be encountered in modern commercial MRAMs 43 , it should be feasible to obtain ∼ 10 petaflips per sec as projected in the last column of Table I 8 . Modern MRAMs can achieve Gb densities; however, we limit the projection to a 1 Mb density based on a target power consumption of ∼ 20 W for the neurons and synapses in the design.
It should be noted that the 20 W power target was arbitrarily chosen. In principle the autonomous designs can leverage clocking to choose when global p-bit updates should occur, much like the FPGA emulator discussed in the methods section. While this approach can save power, it limits the utility of the MTJ based approach by constraining the flips per second (f ) to the same limits of Eq. (3a) due to the clocking scheme. Even with this limit in place, as the connectivity between neurons increases beyond nearest-neighbor, the sequencing logic becomes more complex while the ApC only requires a balance of s to ensure proper convergence. In the case of all-to-all connectivity, the sequencing logic reduces in complexity and technically can be implemented with a single time-multiplex weighted p-bit. In this situation, the benefits of an ApC begin to degrade, except for the elimination of memory bottlenecks with the use of distributed weights and the avoidance of time-multiplexing. A 500 node all-to-all network was implemented using the FPGA emulator, and the resulting s was directly proportional to the number of neurons, affirming the reduced benefit.
Based on these results, the removal of sequencers, ability to run at speeds limited only by synapse delays, and ability scale to millions of neurons, all within an accessible power budget, makes an ApC a compelling alternative to clocked, sequential designs for SNN coprocessing.

FPGA p-computing coprocessor
An all-digital framework based on Eqs. (5a) and (5b) was developed, Fig. 4, to facilitate architectural exploration of an ApC, study various trade-offs in p-bit, weight logic, and topology design, and to accelerate the combinatorial optimization and sampling problems used in this work. The digital framework leverages reconfigurable computing devices to support rapid exploration of different designs. A Xilinx Virtex Ultrascale+ xcvu9p-flgb2104-2-i provided via Amazon Web Services F1 cloudaccessible EC2 compute instances was used for the problems of Figs. 2 and 3. While a Xilinx FPGA was used in this work, the design is hardware agnostic and another device, e.g. an Intel Stratix FPGA, can be used.
As shown in Fig. 1(b) and Fig. 4(b), an ApC comprises multiple weighted p-bits arranged in various topologies, each supporting programmable problem instances. There are many options for the implementation of programmable control, weight logic, p-bits, and p-bit readout in a digital platform. The digital ApC of Fig. 4 comprises a modular weighted p-bit, Fig. 4(c), that can be organized into various topologies, Fig. 4(b), supporting programmed problem instances. An example weighted p-bit implementation is shown in Fig. 1(d) leveraging a memory-mapped weight-logic register bank supporting linear weight coupling. The output of the weight logic block is provided to a programmable, activation function look-up table that is used in conjunction with a Linear Feedback Shift Register (LFSR) based pseudorandom function to implement Eqs. (5a) and (5b). Multiple options were developed and explored for these building block elements, see Fig. 6 beyond what is shown in Fig. 1(d).
Interaction with the framework is provided through  4. Cloud Accessible p-computing Co-processor a. Client applications are used to specify a desired network topology, problem specific weights, and current pseudotemperature through a network accessible pool of dedicated servers. The servers provide general purpose processing, network connectivity, and PCIe accessible FPGA accelerators. After loading the desired topology into the FPGA, the system operates and provides a sampling interface for the current state of the network p-bits. b. The architecture supports multiple topologies depending on the problem of interest ranging from nearest-neighbor connectivity to all-to-all connectivity. c. Each of the modular p-bits in the network comprises a bus accessible programmable interface, synapse block, and neuron. The neuron is further modularized to support different approaches for flip-attempt logic, pseudo random number generation, and activation function implementations.
MATLAB MEX programs in a client-server command based model, Fig. 4(a). Clients issue commands to select which of the pre-built topologies to program into the cloud FPGA instance, the current pseudo-temperature for the network, problem specific weights, and options to pause or resume the network. Commands are also provided to support random sampling from the network for readout operation. Online annealing is directly supported through global update operations of the activation function look-up. All weights are dynamically programmable through memory-mapped operations as discussed in the following section. The server interfaces with a PCIe attached FPGA, interacting with the programmed design as commanded. Other clients such as Octave and Python are readily supported through a C++ abstraction layer leveraging networking and serialization libraries. design.

FPGA ApC Logical Organization
The global address space provides information on the ApC pertinent for client interaction with the system. This includes information such as the total number of p-bits in the design, the p-bit connectivity graph topology, weight precision, and other useful runtime information such as the number of elapsed synapse delays between client sampling requests. Additionally, global control functions include the ability to pause and resume the network so that a client can access the global p-bit readout array. This readout array holds the current output of all p-bits sampled on each system clock cycle. Reads from this interface are performed when the network is paused to ensure atomic readout of all p-bits given the limited bandwidth of the readout interface.
Each p-bit in the system has a local memory space supporting programmable control of its function. Each p-bit has localized programmable weights enabled through registers or internal memory (RAM), a programmable activation function lookup table supporting direct look-up or interpolation, support for seeding the chosen psuedo random number generation (PRNG) function, and finally a set of control registers for the p-bit. The output of the p-bit programmable elements directly interface with the weight logic, activation function, PRNG, and comparator operation of the weighted p-bits. Online annealing is accomplished using a global bus broadcast when programming the activation function lookup tables, so that all p-bits are updated simultaneously. Alternatively, each p-bit's activation lookup table can be independently controlled, allowing the exploration of non-uniform bias, local temperature effects, and other non-idealities, facili-tating future opportunities for exploration.

Digital p-bit Implementation Flexibility
A modular digital p-bit was designed to support different options for the p-bit building block elements. Each p-bit is logically partitioned into a unit for psuedorandomness or "entropy", a block for computing the activation function, and a portion that uses the results of a comparison between the activation function and PRNG to determine if a p-bit update attempt should occur. There are various ways to construct these elements using digital logic. In this design, a few select implementations were explored as shown in Fig. 6. Fig 6(a) and 6(b) show two methods for performing autonomous updates of each p-bit. Shown in Fig. 6(a) is the logic corresponding to Eq. (5a) where the comparator provides sgn e −s − r [0,1] . This approach emulates autonomy while preserving fully synchronous operation of the digital design. This p-bit update logic was used for the problems in this work. The activation look-up is programmed with e −s .
Alternative approaches were explored and implemented for the p-bit updates including the use of freerunning ring oscillators, Fig. 6(b), to mimic the naturally stochastic update frequency of an MTJ based p-bit as described by (2). In this implementation, each p-bit has a dedicated free running ring oscillator that generates asynchronous "attempt" edges that are synchronized into a system clock domain. Each asynchronous edge determines when the p-bit should attempt to up- Using output from a comparator, the state of a given p-bit is probabilistically flipped on any given system clock edge. b. Alternatively, a free-running ring oscillator can be used to generated edges centered around a oscillator characteristic frequency, mimicking the attempt rate of a stochastic nanomagnet. These edges control the enable of the p-bit which is then updated based on the output of a comparator. c. Pseudo-randomness is used to provide the rand() function. A linear feedback shift register is an area efficient means to generate pseudo random values, however the quality is limited. d. For improved quality of pseudo random output at the expense of more area, a Xoshiro128+ 46 generator can be used. e The p-bit activation function can be implemented using a straight-forward look-up table or through an interpolated look-up table output. g. By combining the source of "entropy" with the activation function, a tunable delay chain can be used to leverage controlled metastability at the input of a flip-flop to produce a probabilistic output. Outputs from three independent p-bits are shown overlaid in f.

Ring Oscillator
date based on the current output from a comparator according to Eq. (2), in which case the activation look-up is programmed with tanh. While the attempt logic is used to determine when an update attempt should be made, the logic relies on the output of a comparator to determine if a flip should occur. The comparator computes the sign of the difference between the output of a PRNG and the output of the programmed activation function. Shown in the second row of Fig. 6 are two PRNG implementations. A Linear Feedback Shift Register (LFSR) is a PRNG that provides pseudo randomness in a compact design at the expense of output quality, Fig. 6(c). While the LFSR may be sufficient for many problems, a higher-quality PRNG was implemented that requires minimal FPGA resources, but significantly improves the PRNG quality 46 .
The second input to the comparator is from an activation function output, shown on the third row if Fig. 6(e). As implemented, a straight-forward look-up table was sufficient for the designs in this paper. However, an improved interpolation based activation function 47 would improve the accuracy of the lookup results and may be necessary for certain problem classes.
Finally, an additional building block was created to leverage physical randomness and a built-in sigmoidal response from within a digital design as shown in Fig. 6(g). Leveraging a delay based building block 48 and flip-flop metastability, "true" entropy was used to construct a sigmoidal response as depicted in Fig. 6(f). However, over continued operation within the device, temperature and other variations caused the sigmoidal curves to drift, resulting in non-uniform bias and operation. As a result, this building block is not currently being used in the design; however, it does provide insight into non-idealities that may be encountered in a chip design.

Benchmarking Eq. (5) with stochastic LLG
A coupled stochastic Landau-Lifshitz-Gilbert (sLLG) equation is solved and benchmarked against the autonomous p-bit model of (5a) and (5b) (PPSL). Magnetization dynamics of a circular stochastic nanomagnet are captured by solving the sLLG equation in the macrospin assumption within a modular 49 SPICE framework , where α is the damping coefficient, γ is the electron gyromagnetic ratio, N s = M s Vol./µ B is the total number of Bohr magnetons in the magnet, M s is the saturation magnetization, H = H d + H n is the effective field including the out-of-plane (x directed) demagnetization field H d = −4πM s m xx , as well as the thermally fluctuating magnetic field due to the three dimensional uncorrelated thermal noise H n with zero mean H n = 0 and standard deviation H 2 n = 2αkT/|γ|M s Vol. along each direction, I S is the applied spin current to the nanomagnet.
Individual p-bits are coupled according to: where, I s0 is the tanh fitting parameter of the sigmoidal response (sgn(m z ) versus spin current I z s along z-direction). In the benchmark, a circular disk magnet with a vanishing anisotropy (H K ) is used with the parameters: diameter D = 150 nm and thickness t = 2 nm, α = 0.01, M s = 1100 emu/cc, H K = 1 Oe resulting in an autocorrelation time of τ corr = 1.372 ns and I s0 = 1 mA. A fitting parameter of 1.4 is used in the PPSL model for τ N , i.e. τ N = 1.4τ corr .
The simulated network is a Sherrington-Kirkpatrick 50 spin glass with a random coupling matrix and random bias between -1 and +1. The benchmarking of the proposed PPSL model with the coupled sLLG network, anal- ogous to the probabilistic circuit proposed in 51 , is accomplished by comparing two different quantities: (1) Euclidean distance and (2) Free energy.
Euclidean distance is defined by: where P i is the probability of occurrence of the i-th configuration computed out of 4000 ensembles at each time step of the simulation. P i,Boltzmann is computed from the joint probability distribution obtained from a Boltzmann law. The second benchmark approach is based on a comparison of the free energy of the system with what is expected from the principles of statistical mechanics. Free energy is defined by 52 the partition function Z: where, β is the pseudo-inverse temperature. Partition function Z is given by: where k represents different configurations of the network. Energy of a specific configuration is defined by: When numerically calculating free energy from the sLLG data, the following steps have been applied (similar to the importance sampling method described in 53 ): 1. The probability of different configurations, P i , are calculated out of 4000 ensembles for each time step 2. For each P i larger than a certain threshold value P th , the partition function Z i = exp(−I 0 E i )/P i is calculated, so that outliers are excluded 3. For each Z i , the free energy F E i = − ln(Z i )/I 0 is calculated.
4. Finally the mean of all F E i is computed.
The above method is suitable for small examples, but may not scale to large examples due to the difficulty in empirically calculating different probabilities P i as the network size grows. The striking agreement between the sLLG model and the behavioral model given by Eq. (5) shown in Figs. 7 and ref14 establishes the validity of Eq. 5 as a suitable model for the projected autonomous, stochastic MTJ-based computer.