High-Throughput Low Power Area Efficient 17-bit 2’s Complement Multilayer Perceptron Components and Architecture for on-Chip Machine Learning in Implantable Devices

In this manuscript the authors design new hardware-efficient combinational building blocks for a Multi Layer Perceptron (MLP) unit which eliminate the need for generic hardware Digital Signal Processing (DSP) units and for on-chip block RAMs (BRAMs). The components were designed to minimise power and area consumption without sacrificing throughput. All designs were validated in a Field Programmable Gate Array (FPGA) and compared against unrestricted <italic>CPU-MATLAB</italic> implementations. Furthermore, a (2,2,2,2) MLP with back propagation was implemented and tested in an FPGA, showing a total hardware utilisation of just 3782 LUTs and no DSPs or BRAMs. The MLP was also built as an Application Specific Integrated Circuit (ASIC) using a 130 nm technology by <italic>Skywater 130A</italic>. The results show that the area occupation was just <inline-formula> <tex-math notation="LaTeX">$0.12~mm^{2}$ </tex-math></inline-formula> and that the design consumed just 100 mW at a 100 MHz input stimulus.


I. INTRODUCTION
Machine learning is a branch of Artificial Intelligence which has been around since the 1980s and which intends to make software applications gain accuracy in their predictions without the information being specifically encoded, hence giving them a learning aspect. Machine Learning (ML) has come to play a crucial and almost transparent role in everyday life, whether it be image classification, real-time object detection in autonomous vehicles or pattern recognition for the detection of diseases. ML is being applied all over the world with great success; however, as ML techniques such as Neural Networks (NNs) become more and more complex, they lead to bloated generic hardware which makes use of parallelism, such as Graphical Processing Units (GPUs). Nonetheless, this hardware is power hungry and large, meaning it is not suitable for IoT-based device applications or implantable bio-medical devices [1]. Typical ML techniques include supervised learning, in which the system is given the correct answer and is penalised if it chooses the wrong answer, and unsupervised learning, which uses only input data and looks for specific patterns such as clusters. A typical ML structure is the Neural Network. The NN is a loose representation of the neurons in the human brain, which fire or not based on the sum of the inputs to the neuron. These neurons are typically stacked into various layers, providing fine tuning to the network which in turn increases the accuracy of its predictions and classifications. NNs have been around since the 1950s; however, it was not until MOSFET transistors were introduced in the 1980s that the idea of NNs really took off [2]. Now, many years on, ML techniques have shown great promise in the detection and classification of diseases such as Epilepsy (EP).
EP is a chronic disorder of the brain which results in uncontrollable seizures ranging from slight jerking or muscle twitches to complete loss of motor control and violent jerking lasting several minutes. The probability of premature death is three times higher in EP patients, and the disorder affects an estimated 50 million people worldwide. It has been identified that EP occurs due to hyperactive neural circuits which begin to fire in synchrony; this can be seen on an Electroencephalography (EEG) recording as high-amplitude local field potentials (LFPs) [3], [4]. Examples of ML in EP can be seen in [5], where the authors use ML techniques for the diagnosis and prognosis of patients via Convolutional Neural Networks (CNNs) using neuroimages, or in [6], where advanced ML techniques are used for the detection of Epilepsy via EEG signals captured via a scalp cap. Further examples can be found in [7] and [8]. However, due to the large size of the NNs and the non-optimised hardware, these types of NNs would produce devices with very high area and power consumption, which would not be useful for applications such as in-vivo implantable devices.
Implantable devices for medicine have been around for many years in the form of pacemakers, insulin pumps, hip joints and more, but it is not until recently that the idea of implantable microchips in the form of Application Specific Integrated Circuits (ASICs) has become a growing and very realistic theme. Firstly, this is partly due to the advances in semiconductor technologies, where the miniaturisation of complementary metal oxide semiconductors now reaches down to approximately 7-5 nm. This allows for more transistors per area, hence increasing the processing power per area. Area occupation is an important factor in implantable devices; for example, in cortical implants, where the ASIC sits below the cranium close to the delicate brain tissue, there is limited room for big processors [9]. Secondly, the operating voltages continue to decline, providing more processing power at lower power per area [10]. This aspect of implantable devices is possibly the most crucial, since the continuous processing of ASIC devices produces excess heat which can damage cortical tissue if not properly managed. In [11], the authors noted that Computer Brain Interfaces which surpass 39 °C can produce irreversible damage to the surrounding tissue. To this end it is important for the ASIC designer to ensure the lowest power consumption possible. Both of these aspects can be reduced by cleverly designing the hardware to reduce the total number of transistors switching per operation or to reduce the operating frequency. This can be achieved by avoiding generic hardware and controlling the maximum bus widths of the system.
Examples of cortical-based ASIC implementations can be seen in [12], where the authors introduce an optimised Very Large Scale Integration of a brain state classifier using a 130 nm technology. The design could correctly classify epileptic seizures with up to 97% accuracy and consumed only 169 µJ per classification, sitting on a total occupied area of just 2.55 mm by 1.3 mm. Another example of an epileptic detection system can be seen in [13], where the authors create a trade-off between accuracy and power consumption; in this case the design was implemented in a 180 nm technology, consumed just 15 nW at a 0.5 V supply and sat on an area of just 0.05 mm². Furthermore, the throughput and overall latency of the system are also important aspects when considering the detection and possible stimulation of diseases such as epilepsy, where seizures may spontaneously erupt. In such a case, the system should be capable of classifying this information within microseconds. This of course becomes a trade-off between accuracy, latency, power and area consumption, where high parallelism will provide higher throughput but will consume more area and power [14].
The miniaturisation of technologies is also giving rise to the possibility of placing ML classifiers on-chip, providing fast, re-configurable and sensitive possibilities for the detection of diseases. The main problem with ML and DL techniques comes from the total amount of complex operations which are needed, including both long multiplications and divisions. These are known as the most complex operations in digital systems, tend to increase latency, and can vastly increase the area and power consumption of a system which contains many multipliers or dividers [15]. This is especially true in NNs, since many of the common building blocks are based on tunable elements. Field Programmable Gate Arrays (FPGAs) and programmable logic (PL) devices, for example, incorporate DSP blocks: very specific reconfigurable hardware blocks containing multipliers, pattern detectors and accumulators. In the Virtex-7 FPGA, for example, a standard DSP element consists of a 25 × 18 two's-complement multiplier, a 48-bit accumulator, a pattern detector, a pre-adder and a Single-Instruction-Multiple-Data (SIMD) arithmetic unit. Furthermore, many designs rely on large BRAM memory units, where the Virtex-7 has a reconfigurable 16-32 KB RAM unit. BRAMs suffer from read latencies and hence are inferior in speed to Look Up Tables (LUTs) [16].
Therefore, it is clear that important aspects such as the area occupied by the designed hardware, power consumption, parallelism and latency are not prioritised in most systems, which instead leverage components from the standard libraries, making on-chip NNs unfeasible in real-world applications due to high resource allocation and non-optimisation. Due to these factors, there is a need to reduce hardware resource utilisation in order to allow on-chip NNs in implantable devices whilst maintaining high throughput at low power costs. These types of reductions can be key to unlocking on-the-fly NN computing for diseases such as EP.
The following contributions are made in this manuscript: • Firstly, the authors intend to alleviate the heavy power and area consumption of on-chip ML by designing and implementing low-cost digital hardware building blocks capable of being inserted into various other ML designs. These digital building blocks also give way to the possibility of building ASIC applications by modifying and reducing the cost of multiplications and divisions.
• The digital building blocks are explained and designed on a MiniZed 7Z007S and further simulated, comparing the results to unrestricted MATLAB-based CPU implementations; each component is compared against the state of the art, providing evidence of lower area and power consumption.
• As evidence of correct functionality, a (2,2,2,2) feed-forward topology is implemented in an FPGA and the results are shown for a simple application.
• Lastly, the individual components are implemented in a 130 nm ASIC technology for area and power analysis and compared against the state of the art to show the benefits of these small and power-efficient components. The paper is constructed in the following manner: Section II gives an introduction to some of the basic principles of NNs and deep layers, including common Activation Functions (AFs), propagation, weight/bias regularisation and loss functions. Section III introduces the state of the art in implantable ML devices. Section IV introduces the digital building blocks. Section V shows the results from the FPGA implementations. Section VI introduces the ASIC design, including power and area consumption results, finishing with the conclusions in Section VII.

A. THE NEURON (PERCEPTRON)
In the human body neurons number in the billions, interconnected and activated by the sum of their input voltages, leading (or not) to an action potential (activation) based on a specific voltage threshold. In a NN the neuron is a loose representation of its biological counterpart. Fig. 1 shows a representation of a neuron where input_j are the individual inputs to the neuron, w_j are the trainable adjustment weights (discussed further in Sec. II-E), b_1 is an initial bias offset and σ (explained in more detail in Sec. II-C) is an AF. The function of the neuron can be denoted by equation (1): as the neuron's potential increases or decreases, the sum is calculated and then passed through an AF, which represents the modelling of the action potential. However, in this case the AF can be more complex, allowing for more complex non-linear data modelling [17].

B. NETWORK TOPOLOGY
Fig. 2 shows the network used as a running example: a hidden layer of neurons with an activation function and an output layer denoted O_1, closely followed by a softmax activation for categorisation. The weights and biases for each neuron are denoted w_weight,layer. This basic example is what is known as a fully connected layer or Multi Layer Perceptron; such networks are mainly used for non-linear model fitting and require back propagation to update the weights and bias values at the input of each neuron.
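The behaviour described by equation (1) can be sketched at a high level as follows; this is a minimal illustrative model (the function and variable names are not taken from the manuscript's hardware):

```python
# Minimal sketch of equation (1): a neuron computes an activation over a
# weighted sum of its inputs plus a bias.

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(s)

relu = lambda x: max(0.0, x)

# Two-input example: (0.5*0.25 + (-0.5)*0.5) + 0.5 = 0.375
out = neuron([0.5, -0.5], [0.25, 0.5], 0.5, relu)
```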

C. ACTIVATION FUNCTIONS
Due to the non-linearity of most data and classification problems, AFs with a typically non-linear output are used such that from (1) we get σ(Output(x)), where σ represents the AF. Below we introduce two of the main low-area AFs and a categorisation AF.

1) RECTIFIED LINEAR UNIT
The ReLU AF is non-linear and is described by (2). This AF is by far the most commonly used due to its simplicity and is the most common AF in CNNs. Nonetheless, the ReLU function is less non-linear than the other functions, making it less sensitive to non-linear input data. Furthermore, the output of the function is bounded only by the maximum value of the neuron's sum, implying large bus widths at the hardware level, and the absence of outputs for x ≤ 0 can give rise to dead neurons accumulating in the system. However, the implementation remains easy and the derivative, as seen in (3), is equal to a step function about zero [18].

2) LEAKY RECTIFIED LINEAR UNIT
The Leaky ReLU AF is non-linear and is described by (4). This AF is very similar to the ReLU function; however, a small decay value C is placed on the values less than zero, producing a leaky effect. This solves the problem of dead neurons. (5) is the derivative, which is not differentiable at 0 [18].

3) SOFTMAX
The softmax AF is a normalised exponential function widely used as an output-layer AF. Its probabilistic characteristics allow the outputs to be mapped to a probability distribution. Softmax is incorporated whenever more than one category exists at the output. A simple example would be the categorisation of animals such as cats, dogs, horses, etc. The equation for a softmax function can be seen in (6) [18].
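The Leaky ReLU of (4) and the softmax of (6) can be sketched as reference implementations; the decay value C below is an illustrative default, not a value from the manuscript:

```python
import math

def leaky_relu(x, c=0.01):
    # Equation (4): pass positive values, decay negatives by a small C.
    return x if x > 0 else c * x

def softmax(xs):
    # Equation (6): normalised exponentials mapping the outputs to a
    # probability distribution over categories.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```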
These activation functions are just three of many AFs; other more non-linear functions exist, such as the logistic function, defined as σ(x) = 1/(1 + e^(−x)), or the tanh AF, tanh(x) = 2/(1 + e^(−2x)) − 1. However, these are much less hardware friendly and are used when high non-linearity is needed.

D. LOSS FUNCTIONS
Loss functions (LF) are used to estimate the error in a supervised learning system and quantify how close the predicted result from the NN is to the actual value. Some examples of loss functions include:

1) SQUARED ERROR
The squared error (SE) is a typical loss function, as seen in (7); the quality estimator is always strictly positive, decreasing as the error approaches zero. The derivative can be calculated as in (8), where ŷ_i is the expected value, y_i is the value predicted by the system and D is the total number of output neurons [19].

2) CROSS-ENTROPY
In binary classification, where the number of classes M equals 2, Binary Cross-Entropy (BCE) can be calculated as per equation (9). If M > 2 (i.e. multi-class classification), we calculate a separate loss for each class label per observation and sum the result as per equation (10) [19]. The derivative can be calculated as per (11), where, again, ŷ_i is the expected value and y_i is the value predicted by the system.
Other loss functions include the Huber loss and regression losses [19].
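The squared error of (7) and the multi-class cross-entropy of (10) can be sketched as follows; argument names follow the text's convention (ŷ expected, y predicted) and are otherwise illustrative:

```python
import math

def squared_error(y_expected, y_pred):
    # Equation (7): strictly positive, decreasing as the error
    # approaches zero; summed over the D output neurons.
    return sum((yh - yp) ** 2 for yh, yp in zip(y_expected, y_pred))

def cross_entropy(y_expected, y_pred):
    # Equation (10): a separate -t*log(p) loss per class, summed.
    # y_expected holds the (one-hot) targets, y_pred the predicted
    # probabilities.
    return -sum(t * math.log(p) for t, p in zip(y_expected, y_pred))
```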
Many other types of activation function exist, both linear and non-linear. Some examples include the linear function, GELU and the exponential linear unit [18].

E. BACK-PROPAGATION
Back propagation is the process through which the NN actually learns, based on past mistakes. In a supervised learning system this means providing the correct output to the softmax layer and using gradient descent to update the weights and biases based on the partial derivative of the error at each node in the network.
As an example, the partial derivatives (PDs) of the network error for weights w_1,2 and w_1,1 in Fig. 2 lead to the PD equations (as seen in Appendix VII) for each weight and bias, assuming that the network consists of an input layer, one hidden layer, an output layer and a softmax classification [20].

F. GRADIENT DESCENT
Gradient descent (GD) is an iterative optimiser used in NNs to find the local minimum or maximum of a system. As shown in (12), GD can be used to minimise the cost of the function by updating the weights with the partial derivative of the error, where α is an adjustable parameter called the learning rate, imposed on the system to speed up learning and/or to prevent over- and under-fitting of the system [21].
G. NORMALISATION
Normalisation is used in many cases due to the exploding gradient effect, in which the weights shoot off towards infinity. In order to counteract this, the weights should be penalised each time they get too big or too small. The two most common methods are L1 and L2 normalisation, denoted by equations (13) and (14) respectively.
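The update rule of (12) and the penalties of (13) and (14) can be sketched as below; the standard L1/L2 forms (λ times the sum of absolute, respectively squared, weights) are assumed, and the default learning rate is illustrative:

```python
def gd_update(w, grad, lr=0.125):
    # Equation (12): move the weight against the gradient, scaled by
    # the learning rate alpha.
    return w - lr * grad

def l1_penalty(weights, lam):
    # Equation (13), L1 (assumed standard form): grows with the
    # absolute weight values.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Equation (14), L2 (assumed standard form): grows with the
    # squared weight values.
    return lam * sum(w * w for w in weights)
```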

III. STATE OF THE ART
A. FPGA IMPLEMENTATIONS
Hardware implementations of NNs on FPGAs have become the forefront for high-speed, custom, application-specific NNs. FPGAs consist of LUTs, D-type flip-flops and multiplexers which are programmably selectable, allowing the designer to correctly select bus widths, create custom hardware and increase parallelism. Therefore, FPGAs provide more advanced processing per operational power than even the most common NN-based hardware platforms such as CPUs and Graphical Processing Units (GPUs). In fact, FPGAs can provide great increases in throughput: in [22], the authors indicated a speed-up of approximately 144× for a MLP when compared to that of a typical modern-day CPU with multiple cores and threads. Nonetheless, many FPGA implementations do not focus on minimisation techniques and indeed use many of the standard library implementations offered by the software. This in turn leads to a more generic system which consumes more power and area than necessary.
In [23], the authors build a MLP for various bit lengths, producing a 95% classification accuracy at 16-bit resolution with 6 perceptrons in the hidden layer and a (7,6,5) topology. The design was implemented in an Artix 7 FPGA and utilised only 3466 LUTs; however, the design also leveraged the FPGA's digital signal processing (DSP) units, utilising a total of 81 DSP units and 1069 sliced registers, which, as explained in Section I, use low-throughput generic multipliers and division units. The estimated power consumption of the MLP was 120 mW. Another example can be found in [24], where the authors try to alleviate the hardware burden of the FPGA's DSP units on NNs by implementing a MLP unit which incorporates a modified multiplier unit based on intelligent shift additions. The system occupied 1179 LUTs and 1385 sliced registers for 10 perceptrons, achieving a 50% resource reduction when compared to other systems. This advancement eliminates the DSP module, increasing scalability. In [25], the authors build an 18-bit, (15,20,20,1) topology.

B. MULTIPLIERS
As stated, multiplications are the bedrock of MLPs, requiring multiple multiplications per perceptron. To this end, work such as [28] shows how approximate multipliers can save hardware whilst maintaining accuracy. Some notable designs can be found in [29], where the authors use bit truncation: by truncating the LSBs of the data word a smaller multiplier can be used, however at the expense of a slight decrease in performance. In [30], an LSB search checks the bits for a 1; in the case that a 1 is detected, all remaining LSB bits are set to one and the multiplication is made on the MSBs. Further multiplier implementations can be seen in [31], where an analogue multiplier is implemented in the form of a digital-to-analogue converter (DAC), and in [32], where an N × N analogue mixed-signal vector multiplier is built for neural computing, with digital pulses as inputs and weights controlled by current sources.

C. ACTIVATION FUNCTIONS
Another bottleneck of MLPs is the non-linear activation functions. [33] gives an overview and some hardware reduction techniques for the implementation of these functions, including techniques such as COordinate Rotation DIgital Computer (CORDIC) implementations. CORDIC is a hardware-friendly method for the implementation of functions whose roots date back as far as 1956. The CORDIC can calculate a wide range of functions, including trigonometric functions, by taking a vector v_i and rotating it in small positive or negative increments in a circular, linear or hyperbolic coordinate system [34]. The CORDIC suffers from several key downfalls: firstly, it is an iterative process, leading to latency issues; furthermore, it requires large look-up tables in order to store relative phase angles. Other methods include the calculation of functions using the Taylor expansion (TE), which consists of calculating an infinite sum of terms derived from a function's derivatives at a given sample [35]; these, however, suffer from complex divisions and large exponential values, creating a heavy dependence on multipliers. In [36], the authors build an estimate of the softmax function by estimating e^x via LUT lookups and use the FPGA fabric DSP blocks to calculate multiplications and divisions in the system. The DSP units mean non-optimised hardware is used, occupying unnecessary area. Using a 16-bit system and a Virtex 6 FPGA, the design occupies 300 LUTs, 558 sliced registers and 5 DSP units as well as 8 BRAMs. Another approximation can be found in [37]. In [38], the authors estimate the sigmoid, tangent and Radial Basis Function (RBF) using a simplicity Canonical Piece-wise Linear model (SCPLm).
In [39], the authors implement a hardware-friendly softmax function which uses a base-splitting technique.

IV. PROPOSED ARCHITECTURE AND FPGA IMPLEMENTATION
The proposed architecture has been designed to minimise both area and power consumption, which are the two main drawbacks of NN implementations in implantable devices. This is achieved via the careful design of hardware-alleviating digital building blocks which do not compromise throughput. The architecture was designed as a 17-bit, signed, 2's complement system where the fractional precision is 8 bits, leading to a smallest decimal step of 0.00390625. The integer part is represented as 9 bits of data, including a sign bit. The total bus width, and hence precision, is adaptable based on application-specific tasks and should be adjusted to accommodate user-specific applications. In this system the input data must be normalised in the range [−1,1]; however, the full data width was chosen as it is a common data width, making comparisons more accessible.
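The 17-bit format above (9 integer bits including sign, 8 fractional bits, i.e. a Q9.8 fixed-point code) can be sketched as follows; the function names are illustrative:

```python
# Sketch of the 17-bit signed fixed-point format described above:
# 9 integer bits (including sign) and 8 fractional bits, so the
# smallest step is 2**-8 = 0.00390625.

FRAC_BITS = 8

def to_fixed(x):
    """Quantise a real value to the nearest Q9.8 code."""
    return round(x * (1 << FRAC_BITS))

def from_fixed(code):
    """Recover the real value a Q9.8 code represents."""
    return code / (1 << FRAC_BITS)
```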

A. SYSTEM MULTIPLIER UNIT
The system multiplier (SM) unit is one of the most important blocks in the system, since multiplier units are usually very expensive in terms of area and power cost. Moreover, NNs make extensive use of multipliers, meaning reductions in this component can vastly alleviate the main drawbacks of NNs. The block designed in this manuscript is a low-cost multiplier which maintains accuracy whilst increasing overall throughput. For reference, the SM can be seen in Fig. 3. The SM unit takes advantage of 2^N right and left arithmetic shift operations, which vastly increases operational throughput with very little cost in hardware. This 2^N-based, 2's complement, shift-only multiplier has the advantage of greatly reduced hardware due to the simplicity of the shifts, as well as high throughput, since all the shifts happen in parallel. This is contrary to most SM designs, which intend to increase throughput via parallelism at the cost of higher resource utilisation, or vice versa. The SM unit is constrained such that −1 < Input_j < 1 and works as follows. From Fig. 3, using the individual fractional bits of Input_j, the sum of the shifts is calculated, where the amount of shift corresponds to the fractional bit of Input_j. A logical AND array is used to control the sum: if the individual bit of the fractional part of Input_j is active, the logical AND array works as a pass-through key allowing the right-shifted value to be summed into the final output. In the case that the Input_j bit is deactivated, the logical AND array provides an array of zeros to the sum, not affecting the overall value. As a simple example let Input_j = 10100000_2 and w_i,j = 00000101.10100000_2. This equates to w_i,j · Input_j = 5.625_10 · 0.625_10 = 3.515625_10.
Since bits 7 and 5 are both active, the final result at the output would be ((w_i,j sra 1) + 0 + (w_i,j sra 3) + 0 + 0 + 0 + 0 + 0) = (5.625_10 · 0.5_10) + (5.625_10 · 0.125_10) = 3.515625_10, where "sra x" is a shift-right-arithmetic operation and x is the amount of shift to be applied. This has the advantage of maintaining high accuracy which, indeed, is only limited by the number of bits in the system. Furthermore, the system has high throughput, with a latency as low as t_AND + 3t_sum, where t_AND is the combinational time delay of a single logical AND gate and t_sum is the combinational time delay of a full adder.
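The shift-and-sum behaviour of the SM unit can be sketched behaviourally; here shifts act on Q9.8 integer codes, the AND array becomes a conditional accumulate, and the function names are illustrative:

```python
# Sketch of the SM idea: with -1 < input_j < 1, each active fractional
# bit k of input_j selects a shift-right-arithmetic of w by k, and the
# shifted copies are summed (in hardware, in parallel).

FRAC_BITS = 8

def sm_multiply(w_code, in_code):
    """w_code: Q9.8 weight; in_code: 8 fractional bits (0 <= in_code < 256)."""
    acc = 0
    for k in range(1, FRAC_BITS + 1):
        bit = (in_code >> (FRAC_BITS - k)) & 1  # fractional bit k of input_j
        if bit:                                  # AND array pass-through
            acc += w_code >> k                   # shift arithmetic right by k
    return acc

# Worked example from the text: w = 5.625 (Q9.8 code 1440),
# input = 0.625 (10100000_2, bits 7 and 5 active).
```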
From Fig. 3, we should also note that the inputs to the SM unit must be converted from 2's complement for values of input_j < 0 and converted back to 2's complement at the output. As an example, for the inputs input_j = −0.5 and w_i,j = −1.5, input_j would be converted to +0.5, hence the result would be a right arithmetic shift of w_i,j by 1 leading to −0.75, and a 2's complement conversion on the output would be necessary to bring the value back positive. To achieve this, a simple 2's complement converter system is placed on the input and output of the system, consisting of a simple Most Significant Bit (MSB) comparator to detect a negative or positive input, a 1's complement converter and a full adder. To increase the accuracy of this SM unit we can simply add more fractional bits; however, this should be application specific and chosen to minimise area and power.

B. LEAKY RELU UNIT
The Leaky ReLU unit is one of the least expensive blocks in terms of hardware cost. In this design the Leaky ReLU block has been modified according to (15), in which two decay variables C1 and C2 are introduced. C1 is a value which tries to constrain the output of the ReLU unit and should be user-adjusted such that the output does not increase above 1. In this way, the output is constrained similarly to that of a sigmoid AF, which allows the SM unit to be employed in deeper layers. Moreover, C2 is used to ensure small negative values and avoid dead neurons, in line with the traditional Leaky ReLU as in (4). This ReLU unit should utilise the same amount of hardware resources as a standard Leaky ReLU unit as long as the values of C1 and C2 are constrained to 2^N values. Nonetheless, the introduction of the C2 constraint further alleviates the system in future layers of the network. The hardware can be seen in Fig. 4, and incorporates a 1-bit MSB comparator which checks the sign bit of the data word: if 0, the value is above or equal to 0, otherwise the value is negative. This comparator can be implemented as a simple logical AND gate. A multiplexer on the output then pushes either C1 · x or C2 · x to the output. The C1 and C2 decay rates should be selected as 2^N values such that they can be implemented as shift-only values. The latency of the system is calculated simply as the latency of the multiplexer, t_mux, since the decay values are hard-wired connections.
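The modified Leaky ReLU of (15) can be sketched on Q9.8 codes as below; the particular shift amounts (i.e. the C1 and C2 values) are illustrative assumptions, since (15) leaves them user-adjustable:

```python
# Sketch of the modified Leaky ReLU: an MSB (sign) check routes the
# input through one of two power-of-two decays, C1 for x >= 0 (keeping
# the output below 1) and C2 for x < 0 (avoiding dead neurons). On
# Q9.8 codes, a 2**-n decay is a hard-wired right shift.

C1_SHIFT = 2   # C1 = 2**-2 = 0.25 (illustrative)
C2_SHIFT = 6   # C2 = 2**-6 = 0.015625 (illustrative)

def modified_leaky_relu(x_code):
    """x_code: signed Q9.8 value; returns C1*x or C2*x via shifts."""
    if x_code >= 0:                 # sign bit clear
        return x_code >> C1_SHIFT
    # Python's >> on ints is an arithmetic shift (rounds toward -inf),
    # matching a hardware shift-right-arithmetic.
    return x_code >> C2_SHIFT
```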

C. NEURON STRUCTURE
Now that we have introduced the AF and SM, we introduce the hardware for a simple 2-input perceptron. Fig. 5 shows an overview of the hardware used to construct the neuron N_1,1 in the hidden layer of the NN shown in Fig. 2. The hardware corresponds to (1), where two SM units from Section IV-A are used to multiply the weights with the corresponding inputs. Moreover, two adder units are used to sum the results of the SM units and also sum the bias value before finally passing the output to our modified ReLU unit, as seen in Section IV-B. Since the biases and weights will be updated during the training process, each weight and bias must pass through a multiplexer, where the symbol * represents the updated weight or bias. In total the latency of a single perceptron is t_SM + 2t_sum + t_ReLU, where t_ReLU is the combinational delay of our Leaky ReLU unit. The neuron structure in this case is greatly simplified by the use of the SM unit, which, as we will see in a later section, can be hard-wired. This means that per perceptron we use no multipliers at all, allowing for high throughput and low power consumption.
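The data path of this 2-input perceptron can be sketched behaviourally as below; plain floats stand in for the Q9.8 hardware, and the C1/C2 decay values are illustrative assumptions:

```python
# Behavioural sketch of the 2-input perceptron of Fig. 5: two
# multiplies (SM units in hardware), two sums, a bias, then the
# modified Leaky ReLU.

C1, C2 = 0.25, 0.015625   # illustrative 2**-2 and 2**-6 decays

def modified_relu(x):
    return C1 * x if x >= 0 else C2 * x

def perceptron(in1, in2, w1, w2, bias):
    # two SM multiplies, two adders, then the AF
    return modified_relu(in1 * w1 + in2 * w2 + bias)
```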

D. EXPONENTIAL UNIT
To estimate the exponential function we can use the Taylor expansion (TE) where, for a function f : R → R, k-times differentiable at the point a ∈ R, Taylor's theorem states that there exists a function h_k : R → R such that (16) holds. Taking the first three terms and rearranging the equation, we get the estimation for e^x as per (17). From Fig. 6a we can see the hardware configuration, where two full adders are used to sum each term. The x²/2 term is separated into two calculations: firstly x/2, which is handled via a shift right arithmetic similar to that in Sec. IV-A, and secondly the multiplication of this term with x. Note that since x comes from the adapted Leaky ReLU block (as in Sec. IV-B), the maximum output value is scaled to less than 1 and hence the multiplication can be made using the adapted system multiplier from Sec. IV-A. This exponential unit uses very little hardware and is further adopted into blocks such as the softmax AF.
The accuracy of the exponential unit can be improved by the addition of further terms; as an example, in (18) the expansion with an extra x³/6 term is shown. In hardware this is implemented as shown in Fig. 6b, where the new term is further broken down to allow for 2^N shifts. However, x³/6 cannot be fully broken down into perfect shifts, such as (x/3)·(x/2)·(x/1), and so is estimated as (x/2)·(x/2)·(x/1). In terms of hardware this incorporates an additional full adder and two SM units.
With respect to the latency the first exponential unit can update its output every 2t add + t SM , where t SM , is the combinational delay of the SM block in section IV-A. Since the second exponential incorporates more SM units the latency increases to 2t add + 2t SM .
Both of these implementations force the TE series into a 2 N shift, which again like the previous blocks can be handled via simple and fast right shift arithmetic operations. Furthermore, this method allows for the SM unit block to be further implemented which as explained has high throughput and low area usage greatly reducing the size of the exponential units. If another method was used it is likely that the exponential units would consist of large generic based multipliers and DSP units.
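The two estimates can be sketched numerically as follows; the four-term version assumes the shift-friendly (x/2)·(x/2)·x decomposition described above, which replaces the exact 1/6 factor by 1/4:

```python
# Sketch of the exponential estimates of (17)/(18):
#   three terms:  e**x ~ 1 + x + x*x/2
#   four terms:   the x**3/6 term forced into shifts as (x/2)*(x/2)*x,
#                 i.e. x**3/4 (an approximation, per the text).

def exp_three_terms(x):
    return 1.0 + x + (x * x) / 2.0

def exp_four_terms_shifted(x):
    return 1.0 + x + (x * x) / 2.0 + (x / 2.0) * (x / 2.0) * x
```

For the sub-1 inputs the adapted Leaky ReLU guarantees, the estimate stays close to the true exponential (e.g. within about 0.01 of e^0.5).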

E. SYSTEM DIVIDER UNIT
The system divider (SD) unit is a crucial block in the softmax AF described in Section IV-F. In this section the authors introduce two possible implementations. The first is implemented as (1/x)·y and is based on a piece-wise linear (PWL) approximation. The PWL approximations were calculated as 1/x ≈ mx + c, where the individual cases of m and c were calculated as per equation (19). Note that the maximum value of x in this case is x < 10, and the SD unit should be adapted as per application needs. The hardware implementation of the PWL approximation can be seen in Fig. 7. Here we can note that several comparator units are needed to calculate both the fractional and integer parts of the input value to conform with the specific conditions set out by (19), where x_f is the fractional part and x_i is the integer part. An N-to-4 decoder unit is then used to select the constant m, c values, which are stored in LUTs and routed via twin multiplexers. To finish, we multiply m by the input via a SM unit and sum the result with c via a full adder. This system maintains very high throughput and can be calculated in as little as t_comp + t_dec + t_sum + t_SM, where t_comp, t_dec, t_sum and t_SM are the combinational delays of a comparator, decoder, full adder and system multiplier respectively. It is important to note that in our application-specific system the ReLU maximum output should be constrained to 1, and hence any m, c values above 1 can be eliminated from the hardware, greatly reducing resources. In this case they have been left in to show the possible scalability of the PWL design; however, in the case where the value of x increases above 1, the SM unit would need to be replaced with a DSP core unit. This unit increases overall throughput by eliminating the repetitive subtractions which are normally needed in more generic systems.
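The PWL idea can be sketched as below. The segment boundaries are illustrative assumptions, not the manuscript's equation (19); each segment's (m, c) pair is the secant line through the segment endpoints, so m = −1/(x0·x1) and c = (x0 + x1)/(x0·x1), which makes the approximation exact at the endpoints:

```python
# Sketch of the PWL reciprocal: 1/x is approximated per segment as
# m*x + c, with (m, c) selected by comparators and stored in LUTs.

SEGMENTS = [(0.25, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 10.0)]  # assumed

def pwl_reciprocal(x):
    """Valid for 0.25 <= x < 10, matching the x < 10 bound in the text."""
    for x0, x1 in SEGMENTS:
        if x < x1:
            break                      # comparator/decoder segment select
    m = -1.0 / (x0 * x1)               # secant slope through endpoints
    c = (x0 + x1) / (x0 * x1)          # secant intercept
    return m * x + c                   # SM multiply plus full adder
```

In hardware the (m, c) pairs would be precomputed constants; the per-call computation here simply documents where they come from.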
For higher-precision applications, another possibility for implementing 1/x can be seen in Fig. 8 and is based on equation (20). Applying the first four terms, (20) can be re-arranged into (21), which can be further simplified to (22).
Implementing (22) in hardware makes extensive use of the SM, requiring a total of seven instances and hence increasing hardware utilisation.
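Since equations (20)–(22) are not reproduced in this excerpt, the sketch below assumes the familiar truncated geometric series 1/x = 1 + d + d² + d³ with d = 1 − x (valid for 0 < x < 2), Horner-factored in the spirit of the simplification to (22); the exact series used by the authors should be checked against (20):

```python
# Hedged sketch: equations (20)-(22) are not reproduced here, so we ASSUME
# the familiar geometric-series expansion
#     1/x = 1/(1 - d) = 1 + d + d^2 + d^3 + ...   with d = 1 - x, |d| < 1,
# truncated to its first four terms and Horner-factored so every operation
# maps onto a multiplier (SM unit) or an adder.

def series_reciprocal(x):
    """Four-term series estimate of 1/x, valid for 0 < x < 2."""
    d = 1.0 - x
    # Horner form 1 + d*(1 + d*(1 + d)) needs fewer multiplies than the
    # expanded polynomial 1 + d + d^2 + d^3.
    return 1.0 + d * (1.0 + d * (1.0 + d))

for x in (0.8, 1.0, 1.2):
    print(x, round(series_reciprocal(x), 4))
```

The approximation is exact at x = 1 and degrades towards the interval edges, consistent with the −2 < x < 2 operating range quoted in the results.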

F. SOFTMAX UNIT
The softmax unit can be seen in Fig. 9 and, as per (6), makes heavy use of the exponential function, the SM unit and the SD unit. The input stage of the softmax unit calculates the exponential values of sm_in1 and sm_in2 using twin exponential blocks as described in Section IV-D. Since a division is needed here, a SD unit and a SM unit are used to compute the normalised outputs. Due to the restriction placed by (20), a shift right by 3 is applied to the previously calculated exponential values, providing a scaling factor such that e^x < 1. This increases the number of neurons that can be actively accommodated in each layer. Note that, statistically, this scaling does not change the overall prediction; however, it does limit the number of neurons in the output layer if the bus width is not increased.
In this case it is clear that the latency is t_exp + t_sum + t_SD + t_SM. This makes the softmax one of the most computationally hungry and largest blocks; since a softmax block must be employed for each row of output neurons, the user should decide whether it is necessary.
The softmax block takes advantage of the previously designed exponential, SD and SM units, meaning their hardware reductions and high throughput carry over. This block removes the need for CORDIC-like iterative algorithms whilst maintaining the hardware reductions.
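A behavioural sketch of the Fig. 9 dataflow, with `math.exp` standing in for the TE exponential blocks and floating-point arithmetic for the fixed-point SD and SM units (only the dataflow, not the hardware precision, is modelled):

```python
import math

# Behavioural sketch of the softmax dataflow in Fig. 9. math.exp stands in
# for the hardware TE exponential blocks; a floating-point reciprocal and
# multiply stand in for the SD + SM pair.

def softmax2(sm_in1, sm_in2):
    e1 = math.exp(sm_in1)
    e2 = math.exp(sm_in2)
    # Shift right by 3 (divide by 8): scales the exponentials down so the
    # values stay inside the SD unit's valid input range. The scale factor
    # cancels in the ratio, so the prediction itself is unchanged.
    e1 /= 8.0
    e2 /= 8.0
    inv_sum = 1.0 / (e1 + e2)                 # SD unit
    return e1 * inv_sum, e2 * inv_sum         # two SM multiplies

p1, p2 = softmax2(0.5, 0.5)
print(p1, p2)   # equal inputs -> both probabilities 0.5
```

Because the common /8 factor divides out of the ratio, the scaled and unscaled softmax agree, matching the remark that the scaling does not alter the prediction.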

G. GRADIENT DESCENT
The GD unit is simple: as per (12), the only complication is the multiplication by the learning rate α. This hyper-parameter can easily be set to a power of two, meaning a simple arithmetic shift-right block can be applied. An example of the GD block can be seen in Fig. 10, where, as per (24), we employ the L1 normalisation technique of (13). A MSB comparator checks whether a weight is above or below zero and applies an addition or subtraction of λ accordingly via a multiplexer; the λ term pulls weights of either sign back towards zero. This has the benefit of allowing more headroom for the SM unit and the precision of the system to operate, and furthermore helps ensure over-training does not occur.
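A minimal sketch of the GD update described above, assuming hypothetical values for α (a power of two, applied as a right shift) and λ:

```python
# Sketch of the GD unit of Fig. 10. ALPHA_SHIFT and LAMBDA are hypothetical
# example values, not the manuscript's. The learning rate alpha = 2**-4, so
# the multiply by alpha reduces to an arithmetic right shift; the L1 term
# adds or subtracts LAMBDA depending on the weight's sign (the MSB
# comparator + multiplexer in hardware).
ALPHA_SHIFT = 4          # alpha = 2**-4 = 0.0625 (hypothetical)
LAMBDA = 0.001           # hypothetical L1 strength

def gd_update(weight, gradient):
    step = gradient / (1 << ALPHA_SHIFT)     # alpha * grad via right shift
    l1 = LAMBDA if weight > 0 else -LAMBDA   # MSB comparator + mux
    return weight - step - l1                # pull the weight towards zero

w = gd_update(0.5, 0.32)
print(round(w, 3))   # 0.5 - 0.32/16 - 0.001 = 0.479
```

Because the shift replaces a true multiply, α is restricted to powers of two; this is the trade the text makes for eliminating a multiplier from the update path.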

H. BACK-PROPAGATION
The back-propagation hardware is usually difficult, large and non-generic, since the back pass depends on the network structure. In this manuscript the authors show an example of how our hardware can be used to implement a back pass. The example is based on updating the weights w_1,2 and w_1,1 as seen in Fig. 2 and implemented as the PDs from (28) and (32). Fig. 11 shows an example of the hardware, where (25) is implemented simply using a subtracter unit and a SM. To calculate (26) and (27), duplicate hardware is needed, where 3 SM units calculate the multiplications before being summed together via a full adder to give the total error (32). Finally, a GD unit from Section IV-G is used to update the weights accordingly based on the error. Using our SM unit we can note that, as long as the input_j value of the SM unit remains in the range 0 < input_j < 1, we can employ the unit with no scaling. From (29),

∂E1/∂w_1,1 = (y_1 − ŷ_1) · w_1,2 · N_1,1,out · input_1   (26)
∂E2/∂w_1,1 = (y_2 − ŷ_2) · w_2,2 · N_1,1,out · input_1   (27)

An important note is that the latency of the back pass in this system is 4t_SM + t_sum + 2t_sub; this will increase as the number of hidden layers increases, but does not necessarily increase with the number of rows, due to the parallelism of the hardware.
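The two partial-derivative paths and their summation can be sketched as follows; the variable names mirror the manuscript's notation of (26) and (27), and all numeric values are hypothetical:

```python
# Sketch of the back pass of Fig. 11 for weight w_1,1, following equations
# (26) and (27): each partial derivative chains an output error through a
# second-layer weight, the hidden neuron output N_1,1,out and the network
# input, and the two contributions are summed (the full adder) to give the
# total error of eq. (32). Plain floats stand in for the SM units.

def dE_dw11(y, y_hat, w12, w22, n11_out, input1):
    dE1 = (y[0] - y_hat[0]) * w12 * n11_out * input1   # eq. (26)
    dE2 = (y[1] - y_hat[1]) * w22 * n11_out * input1   # eq. (27)
    return dE1 + dE2                                   # total error, eq. (32)

# Hypothetical values, all kept inside the 0 < input_j < 1 SM range.
grad = dE_dw11(y=(1.0, 0.0), y_hat=(0.7, 0.3),
               w12=0.5, w22=0.25, n11_out=0.8, input1=0.6)
print(round(grad, 4))
```

The four multiplies per path explain the 4t_SM term in the latency expression; since the two paths run in parallel in hardware, adding output neurons widens rather than lengthens the critical path.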

I. HARDWARE REDUCTION TECHNIQUES
1) SHIFTS
The authors' proposed architecture relies heavily on arithmetic shift-right and shift-left operations in order to provide a multiplication/division estimation. Reducing this block greatly reduces hardware utilisation whilst maintaining high throughput. An arithmetic right/left shift consists of basic bit-shifting, which can be achieved synchronously using memory elements or asynchronously as a simple wired connection. Fig. 12 shows an example: for a right shift by 1, we hard-wire the most significant bit (MSB) b(16) of the input to bit bo(15) of the output, while bo(16) also receives b(16); more generically, b(N) → bo(N − 1) and bo(N) → b(N). In the case of a right shift by 2, the input MSB b(N) connects to output bo(N − 2), and both bo(N) and bo(N − 1) receive b(N). Note that the least significant bits of the input are lost; however, this is also the case with logical shift registers. A left shift is analogous, working from the LSB side. Since this realisation is just wiring, its latency is zero and it can be ignored in the latency calculations.
As a hardware reduction example, this reduces our SM module to 8 adders and 136 AND gates.
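The hard-wired arithmetic right shift can be modelled at the bit level as follows, with the 17-bit word width matching the design's data path; this is a behavioural sketch of the wiring, not of any VHDL in the manuscript:

```python
# Bit-level model of the hard-wired arithmetic right shift of Fig. 12 for a
# 17-bit two's-complement word: output bit bo(n-k) is wired to input bit
# b(n), the sign bit b(16) is replicated into the vacated MSB positions,
# and the k least significant input bits are simply not connected.

WIDTH = 17

def asr(word, k):
    """Arithmetic shift right of a WIDTH-bit two's-complement value."""
    word &= (1 << WIDTH) - 1                   # keep 17 bits
    sign = word >> (WIDTH - 1)                 # msb b(16)
    shifted = word >> k                        # wiring: b(n) -> bo(n-k)
    if sign:                                   # replicate b(16) into the top k bits
        shifted |= ((1 << k) - 1) << (WIDTH - k)
    return shifted

def to_signed(word):
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return word - (1 << WIDTH) if word >> (WIDTH - 1) else word

# -20 shifted right by 2 gives -5 (divide by 4) with the sign preserved.
print(to_signed(asr(-20, 2)))   # -5
```

In hardware the whole function is pure routing, which is why the text assigns it zero latency.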

V. RESULTS
The FPGA implementations were written in VHDL in the VIVADO environment and uploaded to an Avnet Zynq 7Z007S development board using the default synthesis strategy. Note that the FPGA was chosen as the main testing platform because it is adaptable and allows for rapid application development; the FPGA is the first stepping stone to a full ASIC implementation.

A. SM UNIT
Fig. 13 shows an example of the FPGA results for the SM unit, assuming a constant input of positive 1.5 and a multiplier input ranging from 0 to 0.99. An unrestricted CPU MATLAB implementation has also been plotted for reference. Note that the SM FPGA implementation provides an almost perfect output response, which is to be expected due to the hardware implementation and scaling techniques, and produces a MSE of just 2.4661 × 10^-5. The breakdown of the SM hardware in Table 1 shows that this SM unit utilises only 184 mixed-sized LUTs in total. Furthermore, when compared to the other approximation multipliers in Table 2, our design uses fewer resources and incurs less delay, due to the single shift operation and logical AND gate key-pass.

B. RELU UNIT
Fig. 14 shows an example output of the modified leaky ReLU block taken from the FPGA implementation. In this scenario, C1 = 0.125 and C2 = 0.0625. For input values ranging from −8 to +8 the output is hence allowed to swing from approximately −0.5 to 1 respectively. Note that the negative leaky section of the ReLU unit approaches −1 at a slower rate due to its lower decay rate. For our system, C1 was chosen to allow the maximum input neuron sum to reach +8; however, this should be user-adjusted. Table 3 shows a comparison between our design and an implementation in an Ultrascale 9 FPGA. Note that our design uses only 12 LUTs whilst the other design leverages 2 DSP units. Furthermore, due to the fixed shift values for C1 and C2 (to be user defined), our design has very little latency, with a critical path of just 0.9 ns.

C. EXPONENTIAL UNIT
Fig. 16 shows the output results for negative input values ranging from −1 to 0. In this case the biggest errors occur as the input value becomes more negative. Nonetheless, the second FPGA implementation again clearly achieves higher accuracy, producing an MSE of 0.0012 whilst the first FPGA implementation produced an MSE of 0.0017. Table 1 shows the number of LUTs used by the FPGA in both cases, where the first and second exponential units use just 186 and 541 mixed-size LUTs respectively. Similarly to the SM unit, the precision depends on the number of TE terms incorporated into the design and can be expanded to gain higher precision based on application-specific needs. In Table 4 the resources of the two exponential units are compared to the state of the art. Once again, due to the lower dependency on generic multiplications, the TE exponential uses far fewer resources, leveraging no DSP units. Furthermore, the delays are smaller, again mainly due to the fixed 2^N shifts and hard-wired multiplications. In fact, the high-performance exponential unit has a delay 4 times smaller than in [43].

Fig. 17 shows the results from the FPGA implementation of the first SD unit. Here we can note that, although not perfect, the PWL FPGA implementation provides a sufficiently accurate estimation; this can be seen in the zoomed window, where the PWL follows the unconstrained MATLAB implementation. In fact, the MSE between the MATLAB implementation and the reduced FPGA hardware is just 0.002. Table 1 (SD unit 1) shows the resources for this first SD unit, which utilises just over 500 LUTs for an input range of x = [−10, 10]; for the range [−1, 1] the SD unit uses just 290 LUTs, and hence the range should be predefined by the designer.

D. SD UNIT
In contrast, from Table 1, SD unit 2 utilises just 635 mixed-sized LUTs and produces an MSE of 3.212 × 10^-4 for the range −2 < x < 2. Table 5 shows a comparison between this implementation and another design with similar characteristics. Note that the number of resources is greatly reduced in this case, whilst the critical delay remains almost the same. Fig. 18 shows an example of the output from the FPGA compared to that of an unrestricted CPU MATLAB implementation over the varying input range of both inputs. Note that at the very start, when both input values are equal, the prediction value is 0.5 as expected. As the distance between the inputs increases, the prediction for input1 heads towards 1, reaching a maximum value of 0.61 when input2 = 0.49 and input1 = 0.95.

E. SOFTMAX UNIT
With respect to the noise, Fig. 19 shows the absolute error between the two, calculated as |MATLAB − FPGA|; in this case the maximum absolute error is 0.011 and the mean absolute error is approximately 0.004. This error should be sufficient for most application-specific uses; however, as explained, increasing the bus widths, and the fractional bus width in particular, will help smooth this function. The hardware requirements for the softmax function can be seen in Table 1 (softmax). Furthermore, when compared to the other softmax implementations in Table 6, our design uses fewer resources. This is mainly attributed to the effective, low-area exponential function and SD unit implementations.

F. NN LAYER AND FEED BACK
The full feed-forward resources for the NN shown in Fig. 2 can be seen in Table 1. The topology is (2,2,2,2), where the output layer is made up of softmax units, and is used to validate the block connections, providing a simple functionality check by asking the NN to solve a simple validation input problem. The topology was chosen as a simple example of functionality. The full layer uses just over 3800 mixed-sized LUTs, and a single perceptron as shown in Fig. 1 is implemented using just 197 mixed-sized LUTs. For the back pass, the smoothed softmax outputs from the two output-layer neurons can be seen in Fig. 20, where output 2 approaches 1 and output 1 approaches 0. The loss function, calculated in MATLAB as the cross-entropy loss, can be seen to converge rapidly to below 0.1 after 50 iterations.
The final MLP design has a critical path of approximately 3.12 ns and a maximum frequency of approximately 320 MHz.

VI. ASIC IMPLEMENTATION
For comparison with other ASIC designs, the individual blocks and the full feed-forward network from Section II-B were implemented in a 130 nm technology operating at 1.8 V. The technology is the Skywater 130A technology developed by Google [47], which provides 5 V I/O pads and 5 metal layers; the ASIC was implemented from the VHDL code using the OpenLane software [48]. Fig. 21 shows the power estimations of the various blocks, including leakage power, switching power and internal power. The frequency refers to the input patterns generated to stimulate the circuits, ranging from 10 kHz through to 100 MHz. For almost all of the designs the leakage power is almost negligible, being most notable in the total network, which is consistent with its larger transistor count. Looking at the maximum and minimum input frequencies, at 100 MHz the ReLU unit consumes a negligible total power of approximately 0.0001 W, which is to be expected given its small hardware requirements; at the same frequency the first exponential unit consumes approximately 0.01 W, around the same as the SM unit. The second exponential unit consumes a surprisingly high amount of power, close to the total network's 0.1 W. The SD unit and softmax consume approximately 0.05 W each at 100 MHz. It is also worth noting that these power consumptions fall drastically with input frequency: at 10 kHz the entire FF network consumes just 0.00001 W, which is less than the ReLU unit alone at 1 MHz.
The total area breakdown of the implemented ASIC and its individual components can be seen in Fig. 22. A single perceptron, from Section IV-C, occupies a total area of approximately 0.036473 mm^2 which, when scaled arbitrarily to a 65 nm technology by a factor of 65/130, corresponds to 0.0182365 mm^2. The total network seen in Fig. 2, which includes 4 perceptron units, 2 softmax outputs and the feed-back training network, occupies a total area of 0.121 mm^2, or 0.0605 mm^2 when scaled to the 65 nm technology, and consumes 100 mW at a 100 MHz input stimulus and 0.01 mW at a 10 kHz input stimulus. Table 7 shows a more in-depth comparison of the main components discussed here with other similar implementations. From this we can deduce that the ASIC implementation of the multiplier uses less power per bit than the state of the art; nonetheless, operating frequency and supply voltage are important factors to take into account. Even so, our multiplier performs more than 10 times better than [49] when running at the same operating frequency. Furthermore, the area consumption of the design is well below that of the other designs despite the higher number of active bits. To reiterate, this area advantage is due to the hard-wired arithmetic shifts. A similar scenario is seen for the exponential units, SD units, softmax implementations and the overall size of a single perceptron. Note that the ReLU and GD units were not included, as no standalone ASIC implementations were found in the literature.
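The scaling arithmetic above can be checked directly; note that the reported 65 nm figures correspond to a linear 65/130 = 0.5 factor (a full dimensional scaling would square this factor), which is why the text calls the scaling arbitrary:

```python
# Quick check of the technology-scaling arithmetic in the text: the
# reported 65 nm figures are the 130 nm areas scaled LINEARLY by
# 65/130 = 0.5, not by the square of that ratio.
SCALE = 65 / 130

perceptron_130nm = 0.036473   # mm^2, single perceptron at 130 nm
network_130nm = 0.121         # mm^2, full network at 130 nm

print(round(perceptron_130nm * SCALE, 7))   # 0.0182365 mm^2
print(round(network_130nm * SCALE, 4))      # 0.0605 mm^2
```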

VII. CONCLUSION
In this manuscript the authors have designed power- and area-saving building blocks for ML applications without sacrificing system throughput. The building blocks were individually tested on an FPGA platform and compared against unrestricted CPU implementations, matching them very well with small MSEs, and they reduce overall FPGA resources when compared to the state of the art. The building blocks were also implemented in a 130 nm ASIC, showing that the average power consumption of a two-layer perceptron network with ReLU activations, softmax outputs and a feedback network was just 100 mW at a 100 MHz (10^8 Hz) input stimulus frequency, occupying an area of just 0.0605 mm^2 when normalised to a 65 nm technology. Furthermore, the design leverages no DSP blocks or BRAM modules, reducing latency and saving area by avoiding generic hardware implementations.

APPENDIX PD EQUATIONS
w_3,1 can be calculated in a similar manner. Following the same procedure of back-propagation, w_2,1 and w_4,1 can be calculated.