Accelerating Deep Learning Inference in Constrained Embedded Devices Using Hardware Loops and a Dot Product Unit

Deep learning algorithms have seen success in a wide variety of applications, such as machine translation, image and speech recognition, and self-driving cars. However, these algorithms have only recently gained a foothold in the embedded systems domain. Most embedded systems are based on cheap microcontrollers with limited memory capacity, and, thus, are typically seen as not capable of running deep learning algorithms. Nevertheless, we consider that advancements in compression of neural networks and neural network architecture, coupled with an optimized instruction set architecture, could make microcontroller-grade processors suitable for specific low-intensity deep learning applications. We propose a simple instruction set extension with two main components: hardware loops and dot product instructions. To evaluate the effectiveness of the extension, we developed optimized assembly functions for the fully connected and convolutional neural network layers. When using the extensions and the optimized assembly functions, we achieve an average clock cycle count decrease of 73% for a small-scale convolutional neural network. On a per-layer basis, our optimizations decrease the clock cycle count for fully connected layers and convolutional layers by 72% and 78%, respectively. The average energy consumption per inference decreases by 73%. We have shown that adding just hardware loops and dot product instructions has a significant positive effect on processor efficiency in computing neural network functions.


I. INTRODUCTION
Typically, deep learning algorithms are reserved for powerful general-purpose processors, because convolutional neural networks routinely have millions of parameters. AlexNet [1], for example, has around 60 million parameters. Such complexity is far beyond the reach of memory-constrained microcontrollers, whose memory sizes are specified in kilobytes. There are, however, many cases where deep learning algorithms could improve the functionality of embedded systems [2]. For example, in [3], an early seizure detection system is proposed, based on a convolutional neural network running on a microcontroller implanted in the body. The system measures electroencephalography (EEG) data and feeds it to the neural network, which detects seizure activity. The authors implemented the neural network on a low-power microcontroller from Texas Instruments.
Some embedded system designers work around the dilemma of limited resources by processing neural networks in the cloud [4]. However, this solution is limited to areas with access to the Internet. Cloud processing also has other disadvantages, such as privacy concerns, security, high latency, communication power consumption, and reliability. Embedded systems are mostly built around microcontrollers, because they are inexpensive and easy to use. Recent advancements in compression of neural networks [5], [6] and advanced neural network architectures [7], [8] have opened new possibilities. We believe that combining these advances with a limited instruction set extension could provide the ability to run low-intensity deep learning applications on low-cost microcontrollers. The extension must be a good compromise between performance and the hardware area increase of the microcontroller. Deep learning algorithms perform massive arithmetic computations. To speed up these algorithms at a reasonable price in hardware, we propose an instruction set extension comprised of two instruction types: hardware loops and dot product instructions. Hardware loops, also known as zero-overhead loops, lower the overhead of branch instructions in small-body loops, and dot product instructions accelerate arithmetic computation.
The main contributions of this article are as follows:
• we propose an approach for computing neural network functions that is optimized for the use of hardware loops and dot product instructions,
• we evaluate the effectiveness of hardware loops and dot product instructions for performing deep learning functions, and
• we achieve a reduction in the dynamic instruction count, the average clock cycle count, and the average energy consumption of 66%, 73%, and 73%, respectively.
Deep learning algorithms are used increasingly in smart applications. Some of them also run in Internet of Things (IoT) devices. IoT Analytics reports that, by 2025, the number of IoT devices will rise to 22 billion [9]. The motivation for our work stems from the fact that the rise of the IoT will increase the need for low-cost devices built around a single microcontroller capable of supporting deep learning algorithms. Accelerating deep learning inference in constrained embedded devices, presented in this article, is our attempt in this direction.
The rest of this article is organized as follows. Section II presents the related work in hardware and software improvements aimed at speeding up neural network computation. Section III introduces the RI5CY core briefly, and discusses hardware loops and the dot product extensions. Section IV shows our experimental setup. Section V first presents a simple neural network that we have developed and ported to our system. It then shows how we have optimized our software for the particular neural network layers. The empirically obtained results are presented and discussed in Section VI. Finally, Section VII contains the conclusion and plans for further work.

II. RELATED WORK
There have been various approaches to speed up deep learning functions. The approaches can be categorized into two groups. In the first group are approaches which try to optimize the size of the neural networks, or, in other words, optimize the software. Approaches in the second group try to optimize the hardware on which neural networks are running. As our approach deals mainly with hardware optimization, we will focus on the related approaches for hardware optimization, and only discuss briefly the advancements in software optimizations.
Because many neural networks, like AlexNet [1], VGG-16 [10], and GoogLeNet [11], have millions of parameters, they are out of the scope of constrained embedded devices, which have small memories and low clock speeds. However, there is much research aimed at developing new neural networks or optimizing existing ones, so that they retain about the same accuracy but take up less memory and require fewer clock cycles per inference. Significant contributions of this research include the use of pruning [12], quantization [13], and alternative number formats such as 8-bit floating-point numbers [14] or posit [15], [16]. Pruning of the neural network is based on the fact that many connections in a neural network have a very mild impact on the result, meaning that they can simply be omitted. On the other hand, the goal of using alternative number formats or quantization is to minimize the size of each weight. Therefore, if we do not store weights as 32-bit floating-point values, but instead as 16-bit half-precision floating-point values or in an alternative format that uses only 16 or 8 bits (e.g., fixed-point or integer), we reduce the memory requirements by a factor of 2 or 4. In order to make deep learning even more resource-efficient, we can resort to ternary neural networks (TNNs) with neuron weights constrained to {−1, 0, 1} instead of full-precision values. Furthermore, it is possible to produce binarized neural networks (BNNs) that work with binary values {−1, 1} [17]. The authors of [18] showed that neural networks using 8-bit posit numbers have similar accuracy as neural networks using 32-bit floating-point numbers. In [5], it is reported that, by using pruning, quantization, and Huffman coding, it is possible to reduce the storage requirements of neural networks by a factor of 35 to 49.
Much research on neural network hardware focuses on a completely new design of instruction set architectures (ISAs) built specifically for neural networks. The accelerators introduced in [19]-[23] have the potential to offer the best performance, as they are wholly specialized. The Eyeriss [19] and EIE [20] projects, for example, focus heavily on exploiting the pruning of the neural network, and storing weights in compressed form to minimize the cost of memory accesses. The authors of [21] also try to optimize memory accesses, but use a different strategy. They conclude that new neural networks are too large to hold all the parameters in a single chip; that is why they use a distributed multi-chip solution, where they try to store the weights as close as possible to the chip doing the computation, in order to minimize the movement of weights. Similarly, in [23], they developed a completely specialized processor that has custom hardware units called Layer Processing Units (LPUs). These LPUs can be thought of as artificial neurons. Before using them, their weights and biases must be programmed, and the activation functions selected. A certain LPU can compute the output of a particular neuron. This architecture is excellent for minimizing data movement for weights, but limits the size of the neural network significantly. The authors of [22] realized that many neural network accelerator designs lack flexibility. That is why they developed an ISA that is flexible enough to run any neural network efficiently. The proposed ISA has a total of 43 instructions, each 64 bits wide, which include instructions for data movement and arithmetic computation on vectors and matrices. A similar ISA was developed in [24]. However, because these ISAs are designed specifically for neural networks, they are unlikely to be deployable as a single-chip solution (e.g., a microcontroller is needed to drive the actuators).
To lower the cost of the system and save the area on the PCB, we sometimes do not want to use a separate chip to process the neural network.
Other works focus on improving the performance of using CPUs to process neural networks. For example, the authors of [25] show that adding a mixture of vector arithmetic instructions and vector data movement instructions to the instruction set can decrease the dynamic instruction count by 87.5% on standard deep learning functions. A similar instruction set extension was developed by ARM: their new Armv8.1-M [26] ISA for the Cortex-M based devices is extended with vector instructions, instructions for low overhead loops, and instructions for half-precision floating-point numbers. Unfortunately, as this extension is new, there are currently no results available on performance improvements. ARM Ltd. published a software library CMSIS-NN [27] that is not tied to the new Armv8.1-M ISA. When running neural networks, CMSIS-NN reduces the cycle count by 78.3%, and it reduces energy consumption by 79.6%. The software library CMSIS-NN achieves these results by using an SIMD unit and by quantizing the neural networks.
A mixed strategy is proposed in [28]. It presents an optimized software library for neural network inference called PULP-NN. This library runs in parallel on ultra-low-power tightly coupled clusters of RISC-V processors. PULP-NN uses parallelism, as well as DSP extensions, to achieve high performance at a minimal power budget. By using a neural network realized with PULP-NN on an 8-core cluster, the number of clock cycles is reduced by 96.6% and 94.9% compared with the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively.
Many embedded systems are highly price-sensitive, so the addition of an extra chip for processing neural networks might not be affordable. That is why we optimized our software for a very minimal addition of hardware, which is likely to be part of many embedded processors.

III. USED HARDWARE AND INSTRUCTION SET EXTENSIONS
For studying the benefit of hardware loops, loop unrolling, and dot product instructions, we used an open-source RISC-V core RI5CY [29], also known as CV32E40P. It is a small 32-bit 4-stage in-order RISC-V core, which implements the RV32IMFC instruction set fully. RV32 stands for the 32-bit base RISC-V instruction set, I for integer instructions, M for multiplication and division instructions, F for floating-point instructions, and C for compressed instructions. Additionally, RI5CY supports some custom instructions, like hardware loops. Because we used a floating-point network, we extended the core with our floating-point dot product unit. We call this core the modified RI5CY core (Fig. 1). It also has an integer dot product unit, but we did not use it. Therefore, it is not shown for the sake of simplicity. RI5CY is part of an open-source microcontroller project called PULPino, parts of which we will also use. We call the PULPino microcontroller with the modified RI5CY core the modified PULPino. Both the original and the modified RI5CY core have 31 general-purpose registers, 32 floating-point registers, and a small 128-bit instruction prefetch cache. Fig. 1 details the RI5CY architecture. The non-highlighted boxes show the original RI5CY architecture, while the highlighted fDotp box shows our addition. The two orange-bordered boxes are the general-purpose registers (GPR) and the control-status registers (CSR). The violet-bordered boxes represent registers between the pipeline stages. The red-bordered boxes show the control logic, including the Hardware-Loop Controller ''hwloop control'', which controls the program counter whenever a hardware loop is encountered (details are explained in Subsection III-A). The gray-bordered boxes interface with the outside world. One of them is the load-store-unit (LSU). The boxes bordered with the light blue color are the processing elements. 
The ALU/DIV unit contains all the classic arithmetic-logic functions, including a signed integer division. The MULT/MAC unit allows for signed integer multiplication, as well as multiply-accumulate operations. Finally, the fDotp is the unit we added. It is described in Subsection III-B. For more information on the RI5CY core and PULPino, we refer the reader to [29], [30], and [31].

A. HARDWARE LOOPS
Hardware loops are powerful instructions that allow executing loops without the overhead of branches. Hardware loops involve zero stall clock cycles for jumping from the end to the start of a loop [30], which is why they are often called zero-overhead loops. As our application contains many loops, we use this feature extensively. The core is also capable of nested hardware loops. However, due to hardware limitations, nesting is only permitted up to two levels.
Additionally, the instruction fetch unit of the RI5CY core is aware of the hardware loops. It makes sure that the appropriate instructions are stored in the cache. This solution minimizes unnecessary instruction fetches from the main memory.
A hardware loop is defined by a start address, an end address, and a counter. The latter is decremented with each iteration of the loop body [30]. Listing 1 shows an assembly code that calculates the factorial of 5 and stores it in the register x5. Please note that, in RISC-V, x0 is a special register hardwired to the constant 0.
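The C equivalent of the factorial computation in Listing 1 is sketched below. On RI5CY, the toolchain can map such a counted loop onto a hardware loop, so the backward branch at the end of each iteration costs zero cycles; the function name `factorial5` is ours, chosen for illustration.

```c
/* Counted loop of the kind the compiler maps onto a RI5CY hardware
 * loop: the start/end addresses and the iteration count are set up
 * once, and the loop body then repeats with no branch overhead. */
unsigned int factorial5(void) {
    unsigned int result = 1;              /* corresponds to register x5 */
    for (unsigned int i = 2; i <= 5; i++) {
        result *= i;                      /* loop body: one multiply */
    }
    return result;                        /* 5! = 120 */
}
```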

B. DOT PRODUCT UNIT
To speed up dense arithmetic computation, we added an instruction that calculates the dot product of two vectors with up to four elements, where each element is a single-precision floating-point number (32 bits). The output is a scalar single-precision floating-point number representing the dot product of the two vectors. The dot product unit is shown in Fig. 2. We did not implement any vector load instruction; instead, we used the standard load instruction for floating-point numbers. Consequently, we reused the floating-point register file, saving on the area increase of the processor.
The '×' and '+' marks in Fig. 2 represent a floating-point multiplier and a floating-point adder, respectively. The unit performs two instructions, which we added to the instruction set:
• p.fdotp4.s - dot product of two 4-element vectors,
• p.fdotp2.s - dot product of two 2-element vectors.
When executing the instruction p.fdotp2.s, the dot product unit automatically disconnects the terminals of switch S, and, similarly, connects them when executing the instruction p.fdotp4.s. The RI5CY core runs at a relatively low frequency to reduce energy consumption. Therefore, we can afford for the dot product unit not to be pipelined. The result is calculated in a single clock cycle.
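The arithmetic performed by the two instructions can be expressed in C as follows; these are software stand-ins for illustration only, since the hardware unit computes each result in a single clock cycle.

```c
/* Emulation of p.fdotp4.s: four multiplications and three additions
 * that the hardware unit performs in one cycle. */
float fdotp4(const float a[4], const float b[4]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* Emulation of p.fdotp2.s: switch S disconnects the upper half of
 * the unit, leaving a 2-element dot product. */
float fdotp2(const float a[2], const float b[2]) {
    return a[0]*b[0] + a[1]*b[1];
}
```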

IV. EXPERIMENTAL SETUP
To test the performance of various deep learning algorithms running on the modified RI5CY core, we developed a testing system. We decided to use a Zynq-7000 System-on-a-Chip (SoC) [32], which combines an ARM Cortex-A9 core and a field-programmable gate array (FPGA) on the same chip. The purpose of the ARM Cortex-A9 core is to program, control, and monitor the modified RI5CY core. Fig. 3 shows a block diagram of the system. The diagram is split into two parts: the processing subsystem (PS) and the programmable logic (PL). On the PL side there is the emulated modified PULPino, and, on the PS side, the ARM core and the other hard intellectual property cores. In between, various interfaces enable data transfer between both sides, including a universal asynchronous receiver/transmitter (UART) interface, a quad serial peripheral interface (QSPI), an advanced extensible interface accelerator coherency port (AXI ACP), and an interrupt line. Note that all blocks in Fig. 3, except the external DDR memory, are in the Zynq-7000 SoC chip.
On the PL side (FPGA) of the Zynq-7000 chip, we emulated not only the RI5CY core, but the entire PULPino microcontroller [33].
The AXI ACP bus enables high-speed data transfers to the microcontroller. In this configuration, we can get the data from the DDR memory, process them, send back the results, and again get new data from the DDR memory. We use the QSPI bus to do the initial programming of the PULPino memories and UART for some basic debugging. We designed the system in the Verilog hardware description language using the Vivado Design Suite 2018.2 integrated development environment provided by Xilinx Inc.

V. SOFTWARE
To test the efficiency of the designed instruction set optimization, we developed a simple optical character recognition (OCR) neural network to recognize handwritten decimal digits from the MNIST dataset [34], which contains 60,000 training images and 10,000 test images. The architecture of the neural network is given in Table 1 and Fig. 4. The network was trained in TensorFlow, an open-source software library for numerical computations. Using a mini-batch size of 100, the cross-entropy loss function, Adam optimization with a learning rate of 0.01, and 3 epochs of training, a recognition accuracy of 95% was achieved on the test data. For more information on the neural network, the reader may refer to the supplemental material. State-of-the-art neural networks achieve accuracy higher than 99.5% [35]. Compared to them, our neural network performs worse, as it has just one feature map in its only convolutional layer. However, for us, the accuracy of this neural network is not essential, as we only need it to test our hardware. The output of the network is a vector of ten floating-point numbers, which represent the probability that the corresponding index of the vector is the digit on the image. In total, the network contains 9,956 parameters, which consume roughly 39 kB of memory space if we use the single-precision floating-point data type. To compute one pass of the network, around 24 thousand multiply-accumulate (MAC) operations must be performed.

A. LOOP UNROLLING
Alongside hardware loops and the dot product unit, we also tested if loop unrolling could benefit the inference performance of our neural network. Loop unrolling is a compiler optimization that minimizes the overhead of loops by reducing or eliminating instructions that control the loop (e.g., branch instructions). This optimization has the side effect of increasing code size.
Algorithm 1 shows a simple for loop that adds up the elements of an array. Algorithm 2 shows the unrolled version of the for loop in Algorithm 1.
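The two variants can be sketched in C as follows; `sum_simple` corresponds to the loop of Algorithm 1 and `sum_unrolled4` to its unrolled counterpart, with the simplifying assumption that n is a multiple of four.

```c
/* Algorithm 1 sketch: one branch per element added. */
float sum_simple(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Algorithm 2 sketch, unrolled by a factor of four: one branch per
 * four elements, at the cost of a larger loop body. */
float sum_unrolled4(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;
}
```

On a core with hardware loops the branch overhead is already zero, which is why unrolling brings little extra benefit in our measurements.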

Algorithm 1 A Simple Standard Loop
In order to run our network on the modified RI5CY, we developed a software library of standard deep learning functions. We first implemented them by using naive algorithms and without using the dot product unit. We named this version of the library the reference version. Following that, we wrote the optimized assembly code that uses the dot product unit and hardware loops. We named this version of the library the optimized assembly version. The naive algorithms were written in C and could also use hardware loops, as the compiler is aware of them. Table 2 lists the functions we implemented in our library. The code is provided in the supplemental material.

C. FULLY CONNECTED LAYERS
A fully connected layer of a neural network is computed as a matrix-vector multiplication, followed by adding a bias vector and applying a nonlinearity on each element of the resulting vector. In our case, this nonlinearity is the ReLU function. Equation (1) details the ReLU function for scalar input; for a vector input, the function is applied to each element:

ReLU(x) = max(0, x).    (1)

Equation (2) shows the mathematical operation that is being performed to compute one fully connected layer:

y_i = ReLU( Σ_{j=1}^{k} M_{i,j} v_j + b_i ),  i = 1, …, m,    (2)

where m is the number of neurons in the next layer, and k is the number of neurons in the previous layer or the number of inputs to the neural network.
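A straightforward C implementation of this computation, in the spirit of the reference version of our library, is sketched below; the names `relu` and `dense_relu` are illustrative stand-ins, not the actual library functions.

```c
/* ReLU nonlinearity for a scalar input. */
float relu(float x) {
    return x > 0.0f ? x : 0.0f;
}

/* Fully connected layer: y = ReLU(M v + b), with M an m-by-k weight
 * matrix stored row-major, v the k-element input vector, and b the
 * m-element bias vector. */
void dense_relu(const float *M, const float *v, const float *b,
                float *y, int m, int k) {
    for (int i = 0; i < m; i++) {
        float acc = b[i];
        for (int j = 0; j < k; j++)
            acc += M[i * k + j] * v[j];   /* multiply-accumulate */
        y[i] = relu(acc);
    }
}
```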
In the reference version of the deep learning library, we simply used nested for loops to compute the matrix-vector product, and, following that, we applied the ReLU nonlinearity.
The optimized assembly version, however, tries to minimize the number of memory accesses. Because the dot product unit also operates on vectors, let us first clarify the terminology. The matrix and vector on which the matrix-vector multiplication is performed are named the input matrix M and the input vector v, to separate them from the vectors of the dot product unit. Also, the bias vector is denoted by b. We load in a chunk of the input vector v and calculate each product for that chunk. This means that we load the vector only once. The size of the chunk is determined by the input size of the dot product unit; in our case, it is 4. One problem with this approach is that the number of matrix columns must be a multiple of four. This problem can be solved by zero-padding the matrix and vector. Equation (3) illustrates this chunked computation. This way, we only have to load the chunks v0 and v1 of the input vector once, and, at the same time, we have good spatial locality of memory accesses.
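The chunked strategy can be sketched in C as follows. Here `fdotp4_emul` stands in for the p.fdotp4.s instruction, `dense_chunked` is an illustrative name rather than the actual assembly routine, and k is assumed to be a multiple of four (zero-padded otherwise); the bias initialization plays the role of the bias addition, and the ReLU step is omitted.

```c
/* Software stand-in for the p.fdotp4.s instruction. */
static float fdotp4_emul(const float *a, const float *b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* Chunked matrix-vector product: each 4-element chunk of v is loaded
 * once and its contribution accumulated into every row's result
 * before moving to the next chunk. M is m-by-k, row-major. */
void dense_chunked(const float *M, const float *v, const float *b,
                   float *y, int m, int k) {
    for (int i = 0; i < m; i++)
        y[i] = b[i];
    for (int j = 0; j < k; j += 4)       /* one chunk of v per pass  */
        for (int i = 0; i < m; i++)      /* reuse the chunk for rows */
            y[i] += fdotp4_emul(&M[i * k + j], &v[j]);
}
```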

D. CONVOLUTIONAL LAYER
Both the input to the convolutional layer and its output are two-dimensional. To compute a pass of a convolutional layer, one must compute what is known in signal processing as a 2D cross-correlation. Following that, a bias is added, and a nonlinearity is applied. Equation (4) details this computation:

res(i, j) = f( Σ_{m=0}^{fil_size−1} Σ_{n=0}^{fil_size−1} img(i + m, j + n) · fil(m, n) + b ),    (4)

where fil is a two-dimensional filter of size fil_size × fil_size, img is a two-dimensional input array of size img_size × img_size, b is a scalar bias term, res is the output of size out_size × out_size, and f is the nonlinear function. Note that more complicated convolutional neural networks typically have three-dimensional filters. We show the two-dimensional case for clarity of presentation; to handle three dimensions, we simply repeat the procedure for the two-dimensional case.
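A direct C implementation of (4), in the spirit of the reference version, is sketched below; `conv2d_relu` and `relu_s` are illustrative names, and a valid cross-correlation is assumed, so out_size = img_size − fil_size + 1.

```c
/* ReLU stand-in for the nonlinear function f in (4). */
static float relu_s(float x) { return x > 0.0f ? x : 0.0f; }

/* 2D cross-correlation followed by bias and nonlinearity. img and fil
 * are square arrays stored row-major; res receives the out_size-by-
 * out_size result. */
void conv2d_relu(const float *img, int img_size,
                 const float *fil, int fil_size,
                 float b, float *res) {
    int out_size = img_size - fil_size + 1;
    for (int i = 0; i < out_size; i++) {
        for (int j = 0; j < out_size; j++) {
            float acc = b;
            for (int m = 0; m < fil_size; m++)
                for (int n = 0; n < fil_size; n++)
                    acc += img[(i + m) * img_size + (j + n)]
                         * fil[m * fil_size + n];
            res[i * out_size + j] = relu_s(acc);
        }
    }
}
```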
The reference version of the function dlConv2nwReLU simply uses nested for loops to compute the convolution in the spatial domain. It then calls the ReLU function on the output matrix. Our optimized version minimizes the number of memory accesses. We do that by always keeping the filter in the register file. We first save the contents of the register file to the stack. Such an approach enables us to use the entire register file without breaking the calling convention used by the compiler. Next, we load the entire 5 × 5 filter and the bias term into the floating-point register file. RISC-V has 32 floating-point registers, so we have enough room in the register file to store the 5 × 5 filter, a chunk of the image, and still have two registers left. Note that we again use the word chunk to refer to four floating-point numbers. Fig. 6(a) and Fig. 6(b) show how we store the 5 × 5 filter and the bias term in memory, and load them into the floating-point register file. Registers f28-f31 are used to store chunks of the image, and registers f9 and f10 to store the result.
Having the filter and bias term in the register file, we load in one chunk of the image at a time and compute the dot product with the appropriate part of the filter already stored in the register file. After traversing the entire image, we restore the previously saved register state. We make use of hardware loops to implement the looping behavior. Computing the convolutional layer is shown in detail in Algorithm 4. The functions load_vec_f* load four consecutive floating-point numbers from the memory location given in the argument to the registers f* to f* + 3. The function load_f* loads a single floating-point number to register f*. The functions dot_product_f*_f$ compute the dot product between two chunks in the register file. The first one starts at f* and ends at f* + 3, and the second one starts at f$ and ends at f$ + 3. Fig. 7 shows Algorithm 4 in action at the moment after the first iteration of the inner for loop. The leftmost matrix represents the input image, the middle matrix is the filter, and the rightmost matrix is the result matrix. Note that, in Fig. 7, the input image and filter contain only ones and zeros, so that anyone can quickly calculate the dot product result (8 in our case) in the upper-left corner of the result matrix by mental arithmetic; in reality, each cell holds an arbitrary floating-point number.

VI. RESULTS
We provide a thorough set of results to give the reader a full picture of the costs and benefits of our proposed instruction set extension. All results of percentage decreases and increases according to the baseline values are rounded to whole numbers.

A. SYNTHESIS RESULTS
The synthesis was run using the Synopsys Design Compiler O-2018.06-SP5 and the 90 nm generic core cell library from the United Microelectronics Corporation. Table 3 shows the results of the synthesis. We see that the area of the modified RI5CY is 72% larger than the original RI5CY. The area increase is due only to the addition of the dot product unit, not the hardware loops. The area cost of adding hardware loops is minor, about 3 kGE [29]. Since the RI5CY core already has a floating-point unit, we could reduce the area increase by reusing one floating-point adder and one floating-point multiplier in the dot product unit.
Dynamic power consumption was reported by the Design Compiler using random equal-probability inputs, so these results are only approximate. The leakage power consumption has more than doubled, but the dynamic power consumption has increased only slightly. It is important to note that the dynamic power consumption is three orders of magnitude higher than the leakage power, so the total power is still about the same. However, the rise in leakage power might be concerning for some low-power embedded systems that spend most of the time in standby mode. This concern can be addressed by turning off the dot product unit while in standby mode.

B. METHODOLOGY
To gather data about the performance, we used the performance counters embedded in the RI5CY core. We compared and analyzed our implementations using the following metrics:
• Cycles - number of clock cycles the core was active,
• Instructions - number of instructions executed,
• Loads - number of data memory loads,
• Load Stalls - number of load data hazards,
• Stores - number of data memory stores,
• Jumps - number of unconditional jump instructions executed,
• Branch - number of branches (taken and not taken),
• Taken - number of taken branches.
We compared five different implementations, listed in Table 4. All of them computed the same result. The implementations Fp, FpHw, and FpHwU use the reference library, and the implementations FpDotHw and FpDotHwU the optimized assembly library. We chose the Fp implementation as the baseline. Our goal was to find out how much the hardware loops, loop unrolling, and the dot product unit aided in speeding up the computation. The dot product unit is not used in all functions of our library, but only in dlDensen, dlDensenwReLU, dlConv2n, and dlConv2nwReLU. However, these functions represent most of the computing effort.
For compiling our neural network, we used the modified GNU toolchain 5.2.0 (riscv32-unknown-elf-). The modified toolchain (ri5cy_gnu_toolchain) is provided as part of the PULP platform. It supports custom extensions such as hardware loops, and applies them automatically when compiling the neural network with optimizations enabled. Hardware loops can be disabled explicitly using the compiler flag -mnohwloop. The neural networks were compiled with the compiler flags that are listed in Table 4. Even though the RI5CY core supports compressed instructions, we did not make use of them.

C. CODE SIZE COMPARISON
We compared the size of the code for the functions listed in Table 2. The results are shown in Table 5. The first three implementations (Fp, FpHw, and FpHwU) use the reference version of the library, while the implementations FpDotHw and FpDotHwU use the optimized assembly version of the library. The assembly code for a particular (hardware) implementation is called a function implementation. Note that not all function implementations were written in inline assembly, but only the ones whose code sizes are highlighted in blue in Table 5; these function implementations are not affected by compiler optimizations, and, because of that, they are identical at the assembly level. For example, the implementations FpDotHw and FpDotHwU use exactly the same code for the first four functions. On the other hand, the implementations FpHwU and FpDotHw have the same code size for the function dlDensen, but the codes are not identical; it is just a coincidence. Each of the last three functions (dlMaxPool, dlReLU, and dlSoftmax) has an identical function implementation and code size in FpHw and FpDotHw, as well as in FpHwU and FpDotHwU, because the same compiler optimizations apply for implementations with or without a dot product unit. Different functions may be of the same size, as are dlConv2n and dlConv2nwReLU in the implementation Fp. Nevertheless, these function implementations are not identical. The reason for the same size is that the ReLU functionality in dlConv2nwReLU is implemented by calling the dlReLU function, which does not affect the size of the function dlConv2nwReLU. In the Fp implementation, where neither hardware loops nor loop unrolling are used, the code size is the smallest. Predictably, the code with loop unrolling (including our optimized inline assembly code) is the largest. Nevertheless, the code size is still quite small, and within the realm of what most microcontrollers can handle.

D. FULLY CONNECTED LAYERS
Let us first look at the results for fully connected layers. These layers compute the matrix-vector product, add the bias and apply the ReLU function. Matrix-vector multiplication takes most of the time. The reader should keep in mind that computing a matrix-vector product is also very memory intensive.
We measure the computation of the F3 layer of our example neural network shown in Table 1. The matrix-vector product involves a matrix of dimension 64 × 144 and a column vector of 144 elements.
We ran the same code twenty times with different inputs and averaged the runs. The averages are displayed in Fig. 8(a) and Fig. 8(b). Fig. 8(a) compares the number of clock cycles needed to compute the fully connected layer. Hardware loops alone contribute a 29% reduction in clock cycles compared to the baseline; with the dot product unit included, the reduction reaches 72%. The result makes sense, since we replaced seven instructions (four multiplications and three additions) with just one. It would be even better if we could feed the data to the RI5CY core faster. With our hardware, it takes only one clock cycle to calculate a single dot product of two vectors of size four, but at least eight clock cycles are needed to fetch the data for the dot product unit. Because the dot product is calculated over two vectors of size four and each access takes one clock cycle, our dot product unit is utilized only 11% of the time, as calculated in (5).

Utilization = (cycles the dot product unit is used) / (total clock cycles) = 1 / (1 + 8) ≈ 11%.   (5)

The actual utilization is slightly higher, because we reuse one vector (a chunk) several times, as seen in Algorithm 3. This means that the best possible utilization of our dot product unit is 1/(4 + 1) = 20%. Using loop unrolling does not provide any benefit in this layer, neither with nor without the dot product unit: it slows down FpHwU, and FpDotHwU needs about the same number of clock cycles as FpDotHw. The implementation using hardware loops and loop unrolling (FpHwU) achieved a 21% reduction in clock cycle count compared to the baseline, whereas hardware loops alone (FpHw) achieved 29%. The reason is that loop unrolling makes the caching characteristics of the code much worse (the cache is no longer big enough). This effect can be seen in the number of load stalls for the FpHwU implementation in Fig. 8(c).
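The two utilization figures follow from a simple cycle-count ratio; the helper below makes the arithmetic explicit (the cycle counts are the ones stated above: one compute cycle, eight load cycles without operand reuse, four when one vector stays in registers):

```c
/* Utilization of the dot product unit: the fraction of cycles
 * it spends computing rather than waiting for operands. */
static double dot_unit_utilization(int compute_cycles, int load_cycles)
{
    return (double)compute_cycles / (compute_cycles + load_cycles);
}
/* dot_unit_utilization(1, 8) -> 1/9  (~11%, no reuse)
 * dot_unit_utilization(1, 4) -> 1/5  (20%, one vector reused) */
```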
The dynamic instruction count comparison can be seen in Fig. 8(b). Hardware loops contribute a 13% reduction in dynamic instruction count compared to the baseline. The optimized inline assembly code for the dot product unit (FpDotHw) contributed a 63% reduction: a predictable consequence of having a single instruction that calculates the dot product of two 4-element vectors instead of seven scalar instructions. Loop unrolling also reduces the dynamic instruction count; hardware loops combined with loop unrolling (FpHwU) contributed a 22% reduction compared to the baseline. Unrolling a loop reduces the number of loop iterations, but this brings no benefit for hardware loops, because they have no iteration overhead. However, not all loops can be turned into hardware loops, due to the limitations of the RI5CY core.
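The seven-to-one replacement is easy to see when written out in scalar form. The function below spells out the four multiplications and three additions that the extension's single dot product instruction replaces; any intrinsic or mnemonic one might name for the extension itself would be hypothetical, since the encoding is part of our custom ISA extension:

```c
/* Scalar 4-element dot product: four multiplications and three
 * additions, i.e. seven arithmetic instructions per chunk.
 * The dot product instruction computes the same value in one. */
static int dot4_scalar(const int a[4], const int b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```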
The results of the auxiliary performance counters are shown in Fig. 8(c). Our inline assembly code for the dot product unit (FpDotHw) reduces the number of branches substantially. Our algorithm, combined with the instruction extensions, also reduced the number of loads and stores, as can be seen in Fig. 8(c): in the FpDotHw implementation, they dropped by 36% and 74%, respectively, compared to the baseline.

E. CONVOLUTIONAL LAYER
In this subsection, we look at the cost of computing the C1 layer from Table 1, a convolution of a 28 × 28 input image with a 5 × 5 filter. We also include the cost of computing the ReLU function on the resulting output, because we integrated the ReLU function into the convolution algorithm to optimize the code (dlConv2nwReLU). This layer differs from the fully connected layer in that it is not as constrained by memory bandwidth, which means that we can use our dot product unit more effectively. Still, as with the fully connected layer, we were not able to fetch data fast enough to utilize the unit fully. Fig. 9(a) shows that using our dot product unit (FpDotHw) contributed a 78% reduction of the clock cycle count compared to the baseline implementation (Fp), whereas adding just hardware loops to the instruction set (FpHw) contributed only a 23% reduction. As in the case of the fully connected layers, loop unrolling was not effective.
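A minimal sketch of the fused convolution-plus-ReLU, in the spirit of dlConv2nwReLU, is shown below. It assumes a valid convolution with a single input and output channel and unit stride; the function name and data layout are illustrative, not the library's actual interface:

```c
/* Valid 2-D convolution of an H x W input with a K x K filter,
 * bias added, ReLU fused into the store. The output is
 * (H-K+1) x (W-K+1), e.g. 24 x 24 for the 28 x 28 input and
 * 5 x 5 filter of layer C1. The filter is small enough to be
 * kept entirely in the register file. */
static void conv2d_relu(const float *in, int H, int W,
                        const float *filt, int K, float bias,
                        float *out)
{
    for (int y = 0; y <= H - K; y++) {
        for (int x = 0; x <= W - K; x++) {
            float acc = bias;
            for (int fy = 0; fy < K; fy++)
                for (int fx = 0; fx < K; fx++)
                    acc += in[(y + fy) * W + (x + fx)] * filt[fy * K + fx];
            out[y * (W - K + 1) + x] = acc > 0.0f ? acc : 0.0f;
        }
    }
}
```

The two inner loops reuse the same filter for every output position, which is why the filter can stay resident in registers and why this layer feeds the dot product unit more effectively than the fully connected layer.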
The dynamic instruction count was again substantially lower when using the dot product unit, a consequence of the dot product instruction replacing seven instructions with one. In Fig. 9(b) we see that the dynamic instruction count of the FpDotHw implementation is reduced by 72% compared to the baseline. Hardware loops alone (FpHw) have a modest impact, contributing only a 9% reduction. Compared to the baseline implementation (Fp), loop unrolling increases the dynamic instruction count by 7% when computing a convolutional layer without a dot product unit (FpHwU). The increase stems from a larger number of branches, a consequence of unrolling loops with branches inside them (e.g., the function dlReLU).
The results of the auxiliary performance counters are shown in Fig. 9(c). The FpDotHw and FpDotHwU implementations reduced the number of loads by 50% compared to the baseline, and the number of stores by 95%. Both results are a consequence of our strategy to keep the entire filter and bias term in the register file. Hardware loops again decreased the number of branch instructions: the FpHw implementation has 91% fewer branches than the baseline. Since the code consists of many small loops, hardware loops pay off here. The FpDotHw implementation reduced the number of branches even further, by 97% compared to the baseline; the only branches remaining in FpDotHw come from the ReLU function, while FpHw has additional branches for its non-hardware loops. As with the fully connected layers, the number of branches increases significantly if loop unrolling is used.

F. ENTIRE NEURAL NETWORK
Finally, let us look at the results of running the entire example neural network. Fig. 10(a) shows that the FpDotHw implementation reduces the clock cycle count by 73% compared to the baseline, while hardware loops alone (FpHw) reduce it by only 24%. If we ran our microcontroller at only 10 MHz, a single inference of the neural network would take 7.5 ms using the FpDotHw implementation.
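The 7.5 ms figure is a direct conversion from cycle count to time; the helper below makes it explicit. The 75,000-cycle count used in the example is implied by the numbers above (10 MHz × 7.5 ms), not reported separately:

```c
/* Inference latency in milliseconds for a given cycle count
 * and clock frequency in Hz. */
static double latency_ms(unsigned long cycles, double f_hz)
{
    return 1e3 * (double)cycles / f_hz;
}
/* latency_ms(75000, 10e6) -> 7.5 ms at a 10 MHz clock */
```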
Fig. 10(b) shows that the dynamic instruction count for the FpDotHw implementation is reduced by 66% compared to the baseline version; the FpHw version also reduces it, but only by 10%. Fig. 10(c) shows the average results of the auxiliary performance counters for running a single pass of the entire neural network. Overall, by using the dot product unit and the optimized inline assembly code (the FpDotHw implementation), we reduced the number of stores by 86% and the number of loads by 43% compared to the baseline (Fp). Since neural networks are very data-intensive, this result is promising.

G. ENERGY CONSUMPTION
We derived the energy consumption results from the ASIC synthesis power results and the number of executed clock cycles for the various implementations. Fig. 11 shows the energy results of running our entire neural network. Each result is the sum of the leakage and dynamic power consumption of the particular implementation, multiplied by the time needed to compute one inference of the entire neural network with our microcontroller running at a 10 MHz clock frequency. The energy consumption of the implementations Fp, FpHw, and FpHwU was calculated using the power consumption of the original RI5CY (see Table 3); for FpDotHw and FpDotHwU, we used the power consumption of the modified RI5CY. Note that the results do not include the energy consumption of data transfers to and from the main memory. Adding the dot product unit (FpDotHw) reduced energy consumption by 73% compared to the baseline. Hardware loops alone reduced it by 24%, but with loop unrolling included, the energy consumption dropped by only 10% compared to the baseline.
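The derivation can be written out explicitly. The power values in the example call are placeholders, since the actual figures come from the synthesis results in Table 3:

```c
/* Energy of one inference: (leakage + dynamic power) multiplied
 * by the execution time, with the core clocked at f_hz.
 * Powers are in watts; the result is in joules. */
static double energy_per_inference(double p_leak_w, double p_dyn_w,
                                   unsigned long cycles, double f_hz)
{
    double t_s = (double)cycles / f_hz;  /* time of one inference */
    return (p_leak_w + p_dyn_w) * t_s;
}
```

Because energy scales linearly with the cycle count at a fixed clock and near-constant power, the 73% cycle reduction of FpDotHw translates almost directly into the 73% energy reduction reported above.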

VII. CONCLUSION
The main aim of our research was to evaluate the effectiveness of hardware loop and dot product instructions for speeding up neural network computation. We showed that hardware loops alone contributed a 24% clock cycle count decrease, while the combination of hardware loops and dot product instructions reduced the clock cycle count by 73%.
Our reduction in cycle count is comparable with the 78.3% reduction achieved by CMSIS-NN [27]. Similarly, the combination of hardware loops and dot product instructions reduced the dynamic instruction count by 66%. Although this is less than the 87.5% reduction reported in [25], we achieved it with a considerably smaller ISA extension; as embedded systems are highly price-sensitive, this is an important consideration. Unfortunately, the hardware cost of the ISA extension in [25] is not discussed. A point to emphasize is that our ISA improvements for embedded systems should be considered together with other research on compressing neural networks. Reducing the size of neural networks is an essential step in expanding their possibilities in embedded systems. For example, in [5], it is shown that neural networks can be quantized to achieve a size reduction of more than 90%. Another interesting topic for further research is Posit, an alternative floating-point number format that may offer additional advantages, as it has an increased dynamic range at the same word size [15], [36]. Because of the improved dynamic range, weights could be stored in lower precision, again decreasing the memory requirements. Combining the reduced size requirements with low-cost ISA improvements could make neural networks more ubiquitous in the price-sensitive embedded systems market.

His research interests include embedded software development tools, multicore processor architectures, hardware security, and system-level electronic design automation. He is also engaged as an entrepreneur in turning research results into innovations. He holds several patents and has been a co-founder of LISATek (now Synopsys), Silexica GmbH, and Secure Elements. As the coordinator of the TETRACOM and TETRAMAX projects, he contributes to EU-wide academia-to-industry technology transfer.
He has received various scientific awards, including Best Paper Awards at DAC and twice at DATE, as well as several industrial awards. He has served on committees of the leading international EDA conferences.
ZMAGO BREZOČNIK (Member, IEEE) received the M.Sc. and Ph.D. degrees in electrical engineering from the Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia, in 1986 and 1992, respectively. He was the Vice-Dean of Education and the Head and Deputy Head of the Institute of Electronics and Telecommunications. In 1993, he founded the IEEE University of Maribor Student Branch and served as its Counselor until 2002. He is currently a Full Professor and the Head of the Laboratory for Microcomputer Systems. His main research interests include formal methods and tools for software and protocol verification, especially model checking, binary decision diagrams, and digital system design. He is the leading author of SpinRCP, a freely available integrated development environment for the Spin model checker. He was a member of the organizing and program committees of several international conferences and workshops.