An On-Chip Fully Connected Neural Network Training Hardware Accelerator Based on Brain Float Point and Sparsity Awareness

In recent years, deep neural networks (DNNs) have brought revolutionary progress to various fields. They are widely used in image pre-processing, image enhancement, face recognition, voice recognition, and other applications, gradually replacing traditional algorithms, and their rise has reshaped artificial intelligence. Since neural network algorithms are computationally intensive, they require GPUs or dedicated hardware accelerators for real-time computation. However, the high cost and high power consumption of GPUs result in low energy efficiency, which has recently motivated much research on digital hardware accelerators for deep neural networks. In this paper, we propose an efficient and flexible neural network training processor for fully connected layers. The proposed training processor features low power consumption, high throughput, and high energy efficiency. It exploits the sparsity of neuron activations to reduce the number of memory accesses and the required memory space. It uses a novel reconfigurable computing architecture to maintain high performance in both Forward Propagation and Backward Propagation. The processor is implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA, with an operating frequency of 200 MHz and a power consumption of 6.444 W, and achieves 102.43 GOPS.


I. INTRODUCTION
DEEP neural networks (DNNs) have been widely adopted in computer vision [1], [2], and fully connected (FC) layers have been employed extensively for image classification tasks. In the early stages, in pursuit of high accuracy for specific applications, researchers developed deeper networks such as VGG16 [3] and ResNet [4], which have massive numbers of parameters. To achieve faster computation, GPUs have become indispensable for the inference and training phases of neural networks. Although GPUs are highly flexible, they also suffer from high power consumption and high latency. Due to the hardware architecture and compiler of the GPU, low performance and low utilization problems occur during the inference stage. Therefore, many DNN accelerators for inference have been developed to solve this problem.
As shown in Fig. 1, the DNN model is trained on the server, and the trained network model is deployed on a dedicated processor for inference purposes. During the data transfer from the personal information server, there can be a data privacy problem, which may affect the user's privacy. In addition, when a large amount of data is transferred back and forth between servers and personal devices, it causes delays in transmission due to network speed factors, which affects the overall stability. Therefore, training on the edge devices is essential to reduce the risk of data breaches and data transfer costs.
Several papers proposed DNN processors that can perform training to solve the privacy and latency problems encountered in inference-specific DNN accelerators [5], [6], [7], [8], [9], [10], [11], [12]. Due to the activation function module, many zero values are generated during the inference and training. These zero values are useless for computing operations, which occupy colossal memory space and affect the overall performance. Therefore, some papers proposed the use of the sparsity feature of DNNs for inference and training. For example, in an inference-only accelerator, Eyeriss [13] and EIE [14] use run-length coding (RLC) and compressed sparse column (CSC) compression methods to reduce the number of memory accesses and storage requirements of SRAM. Reference [7] introduced techniques for implementing sparse computation in the training phase to improve hardware performance. The edge device's computing and memory space are limited, so learning can only be done using small batches. According to [15], training with small batches (between 2 and 32) improves stability and convergence.
In a DNN, Forward Propagation (FP) moves data from the input layer to the output layer of the network, and Backward Propagation (BP) adjusts the weights to minimize the loss function. Due to the edge device's limited computing resources and memory bandwidth, multiple memory accesses are required to calculate the error/delta values in the BP stage. As a result, most of the on-chip training time is spent calculating error/delta values, creating a mismatch between the performance of FP and BP. Therefore, the proposed processor uses the memory-optimized access method proposed by Hussain and Tsai in [16] to speed up the error/delta calculation step. In the BP stage of the FC layers, the memory optimization method proposed in [16] can save 0.13x-13.93x memory accesses. This demonstrates the benefit of the method for FC-layer inference and training.
This paper proposes an energy-efficient sparsity-aware on-chip training processor. The proposed processor can perform FC-layer-based inference and training on-chip. We use Brain Floating Point as the data format of our architecture. The Brain Floating Point format was developed by Google. It combines the advantages of IEEE 754 single-precision and half-precision floating-point computing: it retains the 8-bit exponent of single precision while using a short mantissa, as half precision does. In our experience, using the Brain Floating Point format slightly reduces accuracy but significantly saves chip area and power consumption. Previous works [5], [8], [10], [17], [18] use single-precision floating-point or 16-bit fixed-point formats for training or inference of neural networks. These choices either consume many hardware resources or incur an accuracy loss in training and inference that may not be negligible. To the best of our knowledge, we are the first to design a hardware architecture that uses the Brain Floating Point format for all operations. It can maintain a certain level of accuracy during training while using fewer hardware resources.
Another contribution of our paper is that we exploit the data sparsity of input feature maps. We use the sparsity map index (SMI) matrix to efficiently compress both input data and weight information. Compared to other compression methods, it significantly improves the compression rate. Although some papers such as SparTen [19] and SIGMA [20] have proposed similar methods, both extract the non-zero information from on-chip memory, whereas ours stores only the data that will be computed in on-chip memory, which makes memory storage more efficient. Overall, both techniques make our hardware efficiency higher and power consumption lower than those of other works.
The main contributions of this work are summarized as follows:
1) We use a 16-bit brain floating-point format to represent a wide dynamic range of numeric values by using a floating radix point.
2) We use a novel reconfigurable processing element (PE) architecture to complete the training and inference stages of the FC layer.
3) We use the sparsity of neurons and combine the optimized memory access method to reduce the memory space and the number of memory accesses required for FP and BP computations.
4) We implement the whole accelerator on ZCU104. It can achieve 102.4 GOPS with 256 Multiply-Accumulate (MAC) units working at 200 MHz. The design is scalable: adding PEs readily yields higher throughput.
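As a quick consistency check on the throughput figure in contribution 4), and assuming the usual convention that one MAC counts as two operations (a multiply and an add), which the text does not state explicitly, the peak rate of 256 MAC units at 200 MHz works out as:

```latex
\[
256~\text{MACs} \times 2~\frac{\text{ops}}{\text{MAC}\cdot\text{cycle}}
\times 200 \times 10^{6}~\frac{\text{cycles}}{\text{s}}
= 102.4 \times 10^{9}~\text{ops/s} = 102.4~\text{GOPS}
\]
```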
The organization of the rest of the paper is as follows. Section II reviews the literature related to training processors. In Section III, the implementation of the proposed hardware architecture is discussed in detail. Section IV includes the results and discussion of our design. In addition, this section also contains the results of comparing with other literature. Finally, the conclusions are provided in Section V.

II. BACKGROUND
This section introduces DNN hardware accelerators, fully connected layer networks, and well-known floating-point computing formats. Because the accelerator performs both inference and training of FC layers, the weights must be updated during the training stage so that the model converges to the desired result. The numeric precision format of the hardware is therefore important, and the trade-off between accuracy and hardware resources is significant.

A. RELATED WORKS ON HARDWARE ACCELERATORS
Deep neural network computation involves complex data access patterns and intensive arithmetic, and dedicated hardware is designed to process it more efficiently. Many researchers have designed ASICs that can perform neural network inference, including Eyeriss [13], EIE [14], Nullhop [21], etc. Researchers also target FPGAs for neural network inference [22], [23], [24]. A hardware accelerator for inference is mainly optimized for the forward-propagation computations of convolutional and fully connected layers. It mainly optimizes the scheduling of data in memory, reducing the number of memory reads as much as possible, and improves overall hardware performance through parallel processing.
Among the research on ASIC hardware accelerators for inference, the MIT team is best known for its Eyeriss [13] AI chip, which proposed a row-stationary dataflow and used the RLC compression format in its architecture. Reference [25] proposed the Angel-Eye hardware accelerator design. It constructs a programmable and flexible accelerator architecture through data quantization strategies and compilation tools and illustrates the entire hardware design process. It first obtains parameters from trained models and quantizes them, supports multiple neural network architectures through a compilable hardware architecture to increase flexibility, and finally maps them to the hardware for execution. Reference [26] proposed a CNN accelerator designed on a Xilinx Virtex-7 FPGA. However, the above-mentioned hardware accelerators are designed for inference only and cannot train DNN models. To fine-tune the DNN model for such accelerators, the data must be transferred to servers for training, and the updated model is then deployed on the edge device again. This data transfer raises data privacy issues and requires high bandwidth for high-speed transfer.
Multiple hardware accelerators have been proposed to perform both training and inference of different DNN layers at the edge to deal with the data transfer issues during the training stage. Reference [5] proposed the F-CNN configurable training framework. It covers the training tasks of each layer of the CNN by reconfiguring the data flow path at runtime. Reference [6] uses a particular memory management unit to reduce the number of memory accesses during training. Experiments at different network sizes demonstrate the great flexibility of the proposed framework. Sticker [7] has been proposed to achieve high throughput for sparse FC layers. Liu et al. [8] designed an FPGA-based training accelerator using a unified computational engine and a scalable framework. An online learning processor has been proposed in [9] by Han et al. to support the training of the fully connected layers. It can perform object tracking and update the DNN model according to the changes in the data. EILE [10] was proposed to achieve incremental learning at the edge, where only the fully connected layer needs to be trained in its incremental learning processor. In [11], an FPGA-based accelerator with a compressed training scheme and effective compression using both quantization and pruning methods is proposed, exploiting sparsity in both forward and backward propagation. FPDeep [12] proposed a design framework for DNN training on multiple FPGAs in a cluster. In their research, the energy efficiency of training 16-bit AlexNet can be improved by a factor of 3.4 when the computations are distributed in a pipelined fashion across 15 FPGAs. Shao et al. [27] designed a reconfigurable processing element with a unified architecture that can flexibly support various computational modes during training and introduced scaling and rounding schemes to reduce memory usage.

B. TRAINING ON FC LAYER
Fig. 2 shows a miniature model of the FC layer. It includes the input feature layer (IFL), hidden layer (HL), and output layer (OL). When the training process is performed in FC layers, there are three stages: FP, BP, and weight update (WU). FP starts from the IFL, multiplies each node by its weight, and accumulates the results to get the value of each node in the next layer. BP uses partial differentiation, based on the value of each node obtained in FP, to compute the gradient of the weights and the delta/error values. The WU stage uses the weight gradient obtained during BP to update the weights for training purposes. The following equations define these stages.
FP:
In (1), O_n is the output node of the next layer, H_i is the node of the input layer, and W_in is the weight connecting input node i to output node n of the next layer.
BP:
In (2) and (3), the weight gradient and delta values are calculated, where O_n is the output value at output node n and Y_n is the target value of output node n, also known as the ground truth value. H_i is the value of the node in the previous hidden layer to which the weight W_in is connected.
WU:
In (4), W_old is the previous weight value. L represents the FC layer's loss, which is calculated at the end of the FP stage during the training process. η is the learning rate. Depending on the optimizer function used during training, it can be a fixed number or a variable number.
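One way to write out the four relations referred to in this subsection is sketched below. It is consistent with the symbol definitions above but rests on assumptions that the text does not confirm: a softmax output with cross-entropy loss, so that the output-layer error reduces to $O_n - Y_n$ (with $O_n$ in (2) and (3) taken after the softmax); the symbol $\delta_i$ for the error propagated back to node $i$; the omission of the hidden-layer activation derivative; and the name $W_{new}$ for the updated weight.

```latex
\begin{align}
O_n &= \sum_{i} H_i \, W_{in} \tag{1} \\
\frac{\partial L}{\partial W_{in}} &= (O_n - Y_n)\, H_i \tag{2} \\
\delta_i &= \sum_{n} (O_n - Y_n)\, W_{in} \tag{3} \\
W_{new} &= W_{old} - \eta \, \frac{\partial L}{\partial W_{old}} \tag{4}
\end{align}
```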

C. FLOATING POINT FORMAT
In many DNN hardware accelerators, the fixed-point format is mostly used as the computational format for the entire architecture. Many recent works have replaced fixed-point numbers with floating-point formats. This section describes several well-known floating-point formats and describes the advantages and disadvantages of these floating-point formats.
Historically, floating-point representation has followed the IEEE 754 floating-point standard [28], defined by the Institute of Electrical and Electronics Engineers (IEEE). This standard describes two formats, single-precision and double-precision floating-point numbers. A normalized value in either format can be expressed by the following:
\[ x = (-1)^{s} \times 1.f \times 2^{\,e - \mathrm{bias}} \tag{5} \]
where s is the sign bit, f is the mantissa field, and e is the exponent field. The single-precision format bias is 127 and the double-precision format bias is 1023.
In recent years, many researchers have been working on accelerators for DNN training, and as a result, many papers have proposed floating-point computing formats of 16 bits or fewer. The IBM team invented a new floating-point format named DLFloat [29]. DLFloat comprises a 1-bit sign, a 6-bit exponent, and a 9-bit fraction. Its definition is based on the actual range of values encountered in deep learning. Compared to IEEE 754 half-precision, the format has one more exponent bit and one fewer fraction bit. DLFloat also simplifies the floating-point format: it merges the NaN and infinity representations into one, because when numerical operations are performed with a NaN or infinity input, the result is always NaN or infinity. Therefore, it uses e = 63 and m = 511 to represent both NaN and infinity, simplifying the logic of the FPU.
Flexpoint [30] is a numerical format invented by the Intel team. Flexpoint combines the advantages of fixed-point and floating-point computing by using an exponent that is managed automatically for each tensor and shared across its elements, which reduces computation and memory requirements. Flexpoint is tensor-based, using an N-bit mantissa to store two's complement integer values and an M-bit exponent e shared among all tensor elements. This format is denoted as flexN+M. In general, the multiplication of two independent tensors can be computed as a fixed-point operation, which turns most of the computations in deep neural networks into fixed-point operations. Flexpoint reduces memory and bandwidth requirements in hardware compared to single-precision floating point. However, Flexpoint also has some disadvantages. Its format conversion is more complex than that of single-precision floating point, and it has a small dynamic range, which makes vanishing-gradient problems likely when training neural networks and thus makes it difficult for the model to converge.
TensorFloat-32 (TF32) [31] is a floating-point format proposed by NVIDIA to replace the single-precision floating-point format (FP32). TF32 uses the same 10-bit mantissa as half-precision (FP16) math, which has been shown to have more than sufficient margin for the precision requirements of AI workloads, and adopts the same 8-bit exponent as FP32 to support the same dynamic range. An advantage of TF32 is that it keeps the same storage format as FP32. When computing inner products with TF32, the input operands have their mantissas rounded from 23 to 10 bits. The rounded operands are multiplied exactly and accumulated in normal FP32. TF32 requires a CUDA compiler to perform the format conversion effectively. Although its dynamic range is the same as that of FP32, TF32 hardware is more complex to design than that of 16-bit floating-point formats, resulting in higher power consumption and area.
Finally, we introduce a 16-bit floating-point format developed by Google, called "Brain Float Point" (bfloat16) [32], which consists of a 1-bit sign, an 8-bit exponent, and a 7-bit mantissa. This format combines the advantages of IEEE 754 single-precision and half-precision floating-point computing: it retains the 8-bit exponent of single precision while using a short mantissa, as half precision does. However, its accuracy is slightly lower than that of single-precision floating-point computing. In hardware design, Brain Float Point uses fewer mantissa bits, so it can significantly save chip area and power consumption. For example, a Brain Float Point multiplier reduces power consumption by about eight times compared with a single-precision floating-point multiplier. This is why Google and Intel use Brain Float Point as a floating-point format for their cloud servers. Brain Float Point has further advantages. It can be obtained by directly truncating FP32 to its upper 16 bits, so conversion between FP32 and Brain Float Point is straightforward. It also has a wider dynamic range than FP16, so it is less likely to overflow.
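The truncation-based conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's hardware: the function names are hypothetical, and plain truncation (round toward zero) is assumed, whereas a real design may round to nearest.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern obtained by truncating FP32."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the sign (1 bit), exponent (8 bits), and top 7 mantissa bits.
    return bits32 >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    """Expand a bfloat16 pattern to FP32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x

def split_bf16(bits16: int):
    """Split a bfloat16 pattern into (sign, exponent, mantissa) fields."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF
    mantissa = bits16 & 0x7F
    return sign, exponent, mantissa
```

For instance, 70000.0 exceeds the largest finite FP16 value (65504) but survives the round trip through this format with a relative error below 1%, which illustrates the wider dynamic range noted above.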
The Google team recommends that, in general, activations can safely be represented in bfloat16, while weights and gradients should be kept in FP32 format. However, there is potential to use bfloat16 for more values. We trained AlexNet and ResNet-50 on ImageNet with the computation precision of all data set to bfloat16. Fig. 3 shows that both the top-1 and top-5 validation accuracy drop by no more than 0.6%. Therefore, using bfloat16 for all data is suitable for training and inference and can significantly save hardware resources and power consumption.
Based on the above introduction, each floating-point computing format has its advantages and disadvantages. We prioritize the selection based on the ability to maintain a certain level of accuracy during training while using fewer hardware resources. In this paper, we use 16-bit Brain Float Point as the computational precision format for our hardware architecture.

III. PROPOSED WORK
In this section, we describe the proposed hardware architecture. In Section III-A, we present the overall architecture and processing flow. In Section III-B, we present the data flow. In Section III-C, the sparsity method in our design is introduced. The PE array architecture and the other modules of the Computational Core Unit (CCU) of the proposed design are explained in Section III-D. The Memory Bank (MB) of the proposed design is described in Section III-E.

A. OVERALL ARCHITECTURE AND PROCESSING FLOW
We use the SoC architecture which includes the PL (programmable logic) and PS (processing system) to design the hardware accelerator. The proposed accelerator is implemented on the PL side, and we use the AXI4 bus protocol to communicate with the PS. When PS needs to accelerate a fully connected neural network, it can control the accelerator on the PL side for inference and training of the neural network. PS can perform format conversion and data pre-processing of the feature maps and weight data. All feature maps and weights are stored in DDR4 on the PS side, waiting for PL to fetch these data. We use the DMA IP provided by Xilinx to implement high-performance burst transfers between PS DRAM and PL. Fig. 4 shows the overall architecture of the proposed processor, and there are five main blocks in the architecture. These include a control unit (CU), memory bank (MB), computational core unit (CCU), data sparsity encoder and decoder unit, and a configuration register module to set the parameters used for training and inference stages, such as epoch, batch size, ground truth, etc. The data sparsity encoder and decoder unit are responsible for encoding and decoding the feature data. The MB stores the weights, input features, and output data generated by the CCU module. The CCU is responsible for processing the FP and BP calculations, including calculations such as output feature generation, activation function, softmax function, loss function, and weights update. The CU controls the data transfer from the processor, including data transfer from external memory to the MB for local storage, MB to the CCU for computation, and between different CCU modules during different calculation operations. 'Module Gating' is responsible for activating different modules to reduce the switching power of the processor. 
This is because, while some modules are processing, other modules remain inactive; e.g., the PEs remain inactive while data is being transferred from off-chip memory to BRAM and from BRAM to the PEs. Fig. 5 shows the overview of the processor with two different kinds of data processing, i.e., inference mode and training mode. First, we read information from the external memory into the configuration register to customize the specification and to select the inference or training mode. In training mode, the configuration register contains, in addition to the information used in the inference stage, the number of epochs, the batch size, the number of images used for training, and the ground truth values of the training images.
In the inference mode, the non-zero input feature values are first stored in the input BRAM of the memory module by the data sparsity encoder module. The weight parameters are stored once all the input feature values have been stored or the memory space is full. The weight values corresponding to the non-zero input features are stored in the weight BRAM by the data sparsity decoder module. Once a sufficient number of weights have been stored, we transfer the input feature values and weight values to the computational core unit (CCU) to start the accumulation operation. In this stage, the data generated by the computation is transferred back and forth between the CCU and the memory module to generate the node results of each layer. After the CCU completes the calculation of all FC layers, the data from the output layer is sent to the Output Classifications module, which decides the final category.
In the training mode, the overall processing can be divided into the forward pass (FP) and backward pass (BP). The FP in the training mode is similar to the inference mode. The only difference is that the results in the output layer are sent to the softmax module for calculation, after which the BP stage starts. In the next step, we calculate the delta values and the weight gradients for each FC layer according to (2) and (3). The weight gradients of each layer are accumulated over the batch, so the weight update is performed only when the specified batch size is reached. When all the weights are updated, the newly obtained weights are used for the FP calculation again, and the above steps are repeated until the required number of updates is completed. The total number of weight updates (TWU) can be calculated as follows:
\[ TWU = E \times \frac{TI}{BS} \tag{6} \]
where TWU is the abbreviation of Total Weight Updates, E is the number of epochs, TI is the total number of images used for training, and BS is the batch size.
Fig. 6(a) shows the data flow used by the accelerator for forward propagation. The proposed processor uses an output stationary data flow to reduce the number of output data movements. If the result of each computation were repeatedly transferred to BRAM and then read back from BRAM for accumulation, the transfers would add delay and consume additional power. Therefore, we first store the results of each cycle in the partial sum registers in the PEs, and then transfer the results to the memory module after all the values of that cycle have been computed. When processing the forward propagation data, the input feature data is distributed to each PE Cluster synchronously. As a result, each PE in the PE Cluster uses the same input feature data for computation, while the weight data is distributed by unicast so that each PE receives different weights.
Fig. 6(b) shows the data flow for the weight gradient calculation during the BP in the proposed processor. During BP, an input stationary data flow reduces the number of input data movements. The input and weight data need to be reused for different computations, so keeping the inputs stationary reduces data movement and provides data to the PEs quickly to speed up BP. When calculating the weight gradient of BP, the FP result of each layer and the error/delta values of each layer are used to generate the weight gradient value. This calculation also uses a mixture of unicast and broadcast modes to process the values.
Fig. 6(c) shows the data flow used by the proposed processor to calculate the error/delta values. We adopt an output stationary data flow to reduce output data transfers, power consumption, and memory access time. When calculating the error/delta values, the delta values and weights of the previous layer are used, and the results are generated using the unicast and broadcast modes.

Fig. 6. Data flow of the proposed processor: (a) Forward Propagation, (b) Calculate Weight Gradient, and (c) Calculate Delta value.
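The training-mode flow described above (FP, BP with gradients accumulated over a batch, then one weight update per batch) can be sketched in plain Python for a single FC layer. This is an illustrative software model only: the function names are hypothetical, a softmax output with cross-entropy loss is assumed so that the output delta is the prediction minus the target, the accumulated gradient is averaged over the batch (an assumption), and TI is assumed to be a multiple of BS.

```python
import math

def total_weight_updates(epochs, total_images, batch_size):
    """TWU = E * TI / BS, assuming TI is a multiple of BS."""
    return epochs * (total_images // batch_size)

def forward(h, w):
    """FP: each output node accumulates input * weight over all inputs."""
    return [sum(h[i] * w[i][n] for i in range(len(h)))
            for n in range(len(w[0]))]

def softmax(o):
    c = max(o)  # subtract the maximum for numerical stability
    exps = [math.exp(v - c) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def train(samples, w, lr, epochs, batch_size):
    """Train w in place; return the number of weight updates performed."""
    updates = 0
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad = [[0.0] * len(w[0]) for _ in w]
            for h, y in batch:
                p = softmax(forward(h, w))                    # FP
                delta = [p[n] - y[n] for n in range(len(p))]  # BP: output delta
                for i, hi in enumerate(h):
                    if hi == 0.0:
                        continue  # zero activations add nothing to the gradient
                    for n, d in enumerate(delta):
                        grad[i][n] += d * hi                  # accumulate over batch
            for i in range(len(w)):                           # WU: once per batch
                for n in range(len(w[0])):
                    w[i][n] -= lr * grad[i][n] / len(batch)
            updates += 1
    return updates
```

With 8 training samples, a batch size of 4, and 3 epochs, the loop performs 3 × 8 / 4 = 6 updates, matching the TWU expression.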
The forward pass, delta computation, and weight update are all likely to become memory bound because they all use unicast mode to deliver data to the PEs. Unicast mode sends different data to each PE, which demands high memory bandwidth. Adding the sparsity feature reduces the overall latency since it eliminates unnecessary operations, but these passes can still be memory bound when unicast mode is used.
When the number of MACs increases, no congestion occurs if the number of memories is increased as well, since the memory bandwidth is then also higher. If the number of memories remains unchanged, the performance of the accelerator is limited because the values cannot be delivered to each PE in time in unicast mode.

C. DATA SPARSITY ENCODER AND DECODER UNIT
Our design uses a simple sparse matrix compression algorithm, as shown in Fig. 7. The algorithm is a modification of the CSC compression method. The original CSC algorithm requires three matrices to store the values. The first records where each column starts (the running count of non-zero values per column), the second records the row index of each non-zero value, and the third records the non-zero values themselves. Although CSC allows flexibility in processing data, it takes ample memory space to record the row and column coordinates of the values. Instead of using two matrices to record the row and column positions, we use only one matrix to save the sparsity information. The sparsity map index (SMI) matrix can be generated using (7).
The values are coded as 0 or 1 according to the conditions in (7) and recorded in the SMI. The non-zero input feature values are stored in the input BRAM of the MB. Using the sparsity table, only the weight values corresponding to the non-zero input features are stored in the weight BRAM, which saves processing time in the inference and training stages.
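A minimal software sketch of this bitmap scheme is given below. It assumes that (7) marks a non-zero value with 1 and a zero value with 0; the function names are hypothetical, and the weight layout (one row of weights per input feature) is chosen for illustration rather than taken from the hardware.

```python
def smi_encode(features):
    """Return (smi_bits, nonzero_values) for a flattened feature vector."""
    smi = [1 if v != 0 else 0 for v in features]
    values = [v for v in features if v != 0]
    return smi, values

def smi_decode(smi, values):
    """Rebuild the dense feature vector from the bitmap and packed values."""
    it = iter(values)
    return [next(it) if bit else 0 for bit in smi]

def smi_select_weights(smi, weight_rows):
    """Keep only the weight rows whose input feature is non-zero."""
    return [row for bit, row in zip(smi, weight_rows) if bit]

def sparse_fc(features, weight_rows):
    """FC layer output computed from the compressed inputs and weights only."""
    smi, values = smi_encode(features)
    rows = smi_select_weights(smi, weight_rows)
    n_out = len(weight_rows[0])
    return [sum(v * r[n] for v, r in zip(values, rows)) for n in range(n_out)]
```

Because the packed values and the selected weight rows stay in the same order, the multiply-accumulate loop needs no index lookups, which reflects the point that zero inputs and their weights are never stored or fetched.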
Based on the proposed compression method, we compare SMI, CSC, and the coordinate list (COO) format in terms of the relationship between sparsity and the number of input feature nodes, as shown in Fig. 8. When the sparsity of the input feature map is lower than 10%, the data transfer time is longer than with the original uncompressed format, and the reduction in PE computing time is small. The higher the sparsity, the better the compression performance, since both the data transfer time and the PE computing time are significantly reduced. Since the input features need to be flattened before the output nodes can be computed in the fully connected layer, the input feature nodes become one-dimensional matrices. CSC and COO need to record the positions of non-zero values. When the number of input feature nodes increases, more bits are required to record the coordinates of the non-zero values, which causes additional storage overhead. Therefore, the proposed SMI compression format achieves better compression results than CSC and COO.
According to the compression method proposed above, we can analyze the number of weight memory accesses for the forward propagation and backward propagation operations of the fully connected layer, as calculated in (8) to (10). Here, N_HL2 is the total number of nodes in HL2, N_OL is the total number of nodes in OL, and sparsity() is the sparsity of the network layer. From (8) to (10), we can conclude that this bitmap compression method eliminates many memory accesses when performing FP and BP. In particular, for the weight update, the number of memory accesses varies with the square of the sparsity term.
In our proposed architecture, we add the data sparsity encoder and decoder module before the memory bank to prevent the storage of zero values. These values are not worth further computation since the result will always be zero. They include the zero values in the input feature map and the corresponding weight values. The remaining values are stored in order. Hence, the memory banks are accessed much less frequently and fewer PE calculations are needed, which lowers computing and memory energy. Moreover, the PE architecture is not affected, since the data are already organized by the encoder and decoder modules.

D. COMPUTATIONAL CORE UNIT (CCU)
Fig. 9 shows the architecture of a PE Cluster, where each PE Cluster is composed of 16 PEs. There are 16 PE Clusters in the proposed processor, for a total of 256 PEs. In each PE Cluster, all PEs process data in parallel, and the switching power is reduced by signal gating modules. Fig. 10 shows the PE structure, which can support both FP and BP. As long as there are enough hardware resources, a PE array can contain more PE Clusters and a PE Cluster can contain more PEs.
The input and output of each PE are processed using the 16-bit brain floating-point precision format. Each PE comprises a multiplier, an adder, multiplexers, and a demultiplexer. A 4-stage pipeline is used to increase the operating frequency of the processor. The four pipeline stages are input data storage into registers, multiplication, addition, and output registers. Some registers are placed in the PE to reduce the memory accesses required for computing.
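To make the role of the per-PE registers concrete, the following sketch models an output stationary forward pass in software. It is an assumption-laden illustration rather than the actual hardware mapping: one output node is assigned to each PE, each input value is broadcast to all PEs in turn, each PE receives its own weight by unicast, and the function name and the `num_pes` parameter are hypothetical.

```python
def output_stationary_fc(inputs, weights, num_pes=256):
    """Compute FC outputs tile by tile, one output node per PE."""
    n_out = len(weights[0])
    outputs = [0.0] * n_out
    for base in range(0, n_out, num_pes):
        tile = list(range(base, min(base + num_pes, n_out)))
        psum = [0.0] * len(tile)           # partial-sum register in each PE
        for i, x in enumerate(inputs):     # one input at a time, broadcast to all PEs
            if x == 0:
                continue                   # zero inputs are never issued
            for pe, n in enumerate(tile):
                psum[pe] += x * weights[i][n]  # each PE gets its own weight (unicast)
        for pe, n in enumerate(tile):
            outputs[n] = psum[pe]          # single write-back per output node
    return outputs
```

In this model each output value is written to memory once, after its accumulation finishes, instead of being read and rewritten for every input.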
In image classification, the softmax function maps the values of the output layer to the range between 0 and 1 to obtain the probability of each category. The softmax function is defined in (11), where K is the total number of output classes in the output layer, x_i is the output value of node i, and C is the maximum value among the output layer nodes. As (11) shows, the softmax is composed of complex mathematical operations. In contrast to previous hardware designs that implement the softmax function using a RISC approach [33] or a modified softmax function [34], the proposed processor uses the fast exponential function proposed in [35] for the exponential computation and a low-resource divider module to complete the softmax module. The softmax architecture has three modules: the exponential function module, the accumulator module, and the divider module. To raise the operating frequency of the chip, the exponent calculation uses the fast exponential algorithm shown in (12).
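The displayed form of (11) is not reproduced in this text. A form consistent with the symbol definitions above (K classes, output value x_i, and maximum output value C), offered as a reconstruction rather than the authors' exact typesetting, is the max-shifted softmax:

```latex
\begin{equation}
\mathrm{softmax}(x_i) = \frac{e^{x_i - C}}{\sum_{k=1}^{K} e^{x_k - C}},
\qquad C = \max_{1 \le k \le K} x_k \tag{11}
\end{equation}
```

Subtracting C leaves the result unchanged, because the common factor e^{-C} cancels between numerator and denominator, while every exponent argument becomes non-positive.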
Here, α = 1.4426950409, β = 127, and γ = 0.0579848147. The main purpose of α, which equals log2 e, is to convert the natural exponential into a power of 2, while β and γ are used to correct this exponential function. In normal forward propagation, the values produced by the output layer would be used directly to calculate the softmax function. However, the exponential function is too expensive to implement exactly in hardware. Therefore, before the softmax function is performed, the largest output value is subtracted from every output value so that no value is above zero. From the characteristics of the exponential function, we observe that when x is below zero, the function can be approximated by a linear equation. Since the softmax function always redistributes all output values between 0 and 1, the approximation does not affect the prediction accuracy. The hardware design of the exponent unit is shown in Fig. 11. It consists of a single 16-bit multiplier, two adders, and one shifter. The result of the exponent module is stored in the output BRAM after the accumulation (the denominator of (11)). After the exponent calculations have been performed for all output nodes, the exponential values are read from the MB, and the division unit performs the final division to obtain the softmax result.
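The constants above resemble those of a Schraudolph-style exponential, in which the scaled argument is written directly into the exponent and mantissa fields of a floating-point word. Since (12) is not reproduced in this text, the following sketch is an assumption about its form: it computes 2^7 · (αx + β − γ), truncates the result to an integer, and reinterprets the 16 bits as a bfloat16 value. The helper names are illustrative.

```python
import math
import struct

ALPHA = 1.4426950409   # log2(e)
BETA = 127             # matches the bfloat16 exponent bias
GAMMA = 0.0579848147   # correction term from the text
MANTISSA_BITS = 7      # bfloat16 mantissa width


def bf16_bits_to_float(bits):
    """Interpret a 16-bit pattern as bfloat16 (upper half of an IEEE-754 float32)."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def fast_exp_bf16(x):
    """Approximate exp(x) by building the bfloat16 bit pattern arithmetically."""
    scaled = (ALPHA * x + BETA - GAMMA) * (1 << MANTISSA_BITS)
    # Clamp to [0, largest finite bfloat16] so out-of-range inputs stay valid.
    bits = max(0, min(int(scaled), 0x7F7F))
    return bf16_bits_to_float(bits)


def softmax_fast(outputs):
    """Max-shifted softmax using the approximate exponential."""
    c = max(outputs)
    exps = [fast_exp_bf16(x - c) for x in outputs]
    total = sum(exps)  # accumulator: denominator of the softmax
    return [e / total for e in exps]


probs = softmax_fast([2.0, 1.0, 0.1])
```

Because numerator and denominator carry similar relative errors, much of the approximation error cancels in the ratio, which is in line with the remark that the approximation does not affect the prediction.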
The division in (13) is performed after the sum of the exponent values of the output nodes has been calculated. Let O denote the exponent value of a given output node and P the sum of the exponent values of all output layer nodes. The final softmax value Q for that node is then given by (13): Q = O / P. To reduce hardware resources, the division uses only shift and subtraction operations in pipelined mode. For the exponent field, the divider subtracts the two exponents and then adds the bias. For the mantissa, the divider uses a shifter and a subtractor, similar in concept to long division. The disadvantage of this approach is that it takes 20 cycles to produce an output value; the advantage is that it saves hardware resources and raises the overall operating frequency.
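The exponent and mantissa handling described above can be modelled in software as follows. This is a minimal sketch that assumes positive, normal bfloat16 operands and truncation instead of rounding; it does not model the pipeline or the 20-cycle latency, and the function names are illustrative.

```python
import struct

BIAS = 127  # bfloat16 exponent bias


def bf16_bits_to_float(bits):
    """Interpret a 16-bit pattern as bfloat16 (upper half of an IEEE-754 float32)."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def bf16_divide(a_bits, b_bits):
    """Divide two positive, normal bfloat16 bit patterns with shift and subtract."""
    ea, ma = (a_bits >> 7) & 0xFF, (a_bits & 0x7F) | 0x80  # restore hidden 1
    eb, mb = (b_bits >> 7) & 0xFF, (b_bits & 0x7F) | 0x80
    exponent = ea - eb + BIAS      # subtract exponents, then add the bias
    if ma < mb:                    # quotient below 1.0: pre-shift to normalise
        ma <<= 1
        exponent -= 1
    quotient = 0
    remainder = ma
    for _ in range(8):             # 1 integer bit + 7 fraction bits
        quotient <<= 1
        if remainder >= mb:        # trial subtraction, as in long division
            remainder -= mb
            quotient |= 1
        remainder <<= 1
    return (exponent << 7) | (quotient & 0x7F)


quarter = bf16_divide(0x3F80, 0x4080)  # 1.0 / 4.0
```

Each loop iteration produces one quotient bit, so a hardware version needs only a comparator, a subtractor, and a shifter rather than a full divider array.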
Since the proposed processor is used for image classification tasks, the cross-entropy loss function used in the proposed processor can be defined as follows.
where t_i is the ground truth value and S_i is the softmax output for class i among the C classes of the DNN model. To obtain the loss values for the training stage, two values are calculated in this module: the loss per round and the loss per batch. The loss per round is the loss value generated when the DNN model is trained once. The loss per batch is the average loss value over a specified number of training rounds. For example, if the batch size (BS) is 4, four per-round losses are generated, and the per-batch loss is their average. When the DNN model is trained for C classes, only one training image and one ground truth are provided per training round. It follows from (14) that for every class with zero ground truth, i.e., t_i = 0, the loss term is 0. Therefore, the proposed processor calculates only the single loss term of the class with t_i = 1, which saves C-1 loss calculations in each training round. Fig. 13 shows the overall block-level hardware architecture of the proposed cross-entropy loss processing. The input data from the softmax module are in the 16-bit brain floating-point format. In the first stage, the logarithmic function is computed in the "log2" module. In the second stage, the output of the log2 unit is processed in the Divider module, which divides it by the parameter log2 e. In the final stage, the sign of the division result is inverted.
Algorithm 1 Fast Binary Logarithm
1. x ← Input Data
2. Initialize b = 16'h3f00 and log_2_result = 16'h0000
3. while (x < (16'h3f80))
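The three stages of the loss module (log2, division by log2 e, sign inversion) can be sketched in software. Algorithm 1 is only partially reproduced in this text, so the logarithm below uses the standard square-and-compare binary logarithm as a stand-in; the steps beyond the visible initialisation are an assumption, the arithmetic uses ordinary Python floats rather than bfloat16, and all function names are illustrative.

```python
import math

LOG2_E = math.log2(math.e)


def fast_log2(x, precision=7):
    """Binary logarithm by repeated squaring, with `precision` fractional bits."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 1.0:       # normalise x into [1, 2), tracking the integer part
        x *= 2.0
        result -= 1.0
    while x >= 2.0:
        x /= 2.0
        result += 1.0
    b = 0.5              # weight of the next fractional bit (0.5 is 16'h3f00)
    for _ in range(precision):
        x *= x           # squaring doubles the remaining fractional logarithm
        if x >= 2.0:
            x /= 2.0
            result += b
        b /= 2.0
    return result


def round_loss(softmax_out, target_index):
    """Cross-entropy for a one-hot target: only the t_i = 1 term is evaluated."""
    natural_log = fast_log2(softmax_out[target_index]) / LOG2_E  # ln = log2 / log2(e)
    return -natural_log                                          # sign inversion


def batch_loss(round_losses):
    """Average of the per-round losses collected over one batch."""
    return sum(round_losses) / len(round_losses)


loss = round_loss([0.1, 0.7, 0.2], target_index=1)
```

Evaluating only the target class is what saves the C-1 logarithm computations per round.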
The fast binary logarithm [36] is used for the high-speed design of the log function in the loss calculation. It combines high speed with a simple hardware architecture and can be implemented efficiently in a pipelined design. Algorithm 1 shows the pseudo-code of the fast binary logarithm [36]. In Algorithm 1, precision is the number of fractional bits to be computed. The proposed design supports 16-bit brain floating-point numbers, which have 1 sign bit, 8 exponent bits, and 7 mantissa bits, so the precision for the log calculation is 7, matching the mantissa width.
Fig. 14 illustrates the weight update process in the WU module, which is based on (4). First, the BS in the configuration register is converted to its reciprocal through a LUT register. For example, if BS = 4, the LUT returns one quarter (16'h3e80). Second, the reciprocal is multiplied by the learning rate. Third, the accumulated weight gradient is multiplied by this product to obtain the weight update value. Finally, the weight update value is subtracted from the old weight value to obtain the new weight value. All computing units in this module use sequential circuits to achieve a high operating frequency.
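The four steps of the weight update can be written out as a short software model. The sketch assumes a reciprocal LUT that holds only power-of-two batch sizes (the hardware supports batch sizes from 1 to 32, and its actual LUT contents are not given here), and it uses ordinary Python floats for the arithmetic; `RECIPROCAL_LUT` and `update_weights` are illustrative names.

```python
import struct


def bf16_bits_to_float(bits):
    """Interpret a 16-bit pattern as bfloat16 (upper half of an IEEE-754 float32)."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


# Hypothetical LUT: batch size -> bfloat16 bit pattern of 1 / batch size.
RECIPROCAL_LUT = {
    1: 0x3F80, 2: 0x3F00, 4: 0x3E80, 8: 0x3E00, 16: 0x3D80, 32: 0x3D00,
}


def update_weights(weights, grad_sums, learning_rate, batch_size):
    """Apply w_new = w_old - (learning_rate / batch_size) * accumulated gradient."""
    reciprocal = bf16_bits_to_float(RECIPROCAL_LUT[batch_size])  # step 1: LUT
    step = reciprocal * learning_rate                            # step 2
    updates = [step * g for g in grad_sums]                      # step 3
    return [w - u for w, u in zip(weights, updates)]             # step 4


new_weights = update_weights(
    [0.5, -0.25], [2.0, -4.0], learning_rate=0.125, batch_size=4
)
```

Replacing the division by BS with a table lookup and a multiplication avoids a second divider in the WU module.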

E. MEMORY BANK (MB)
The memory bank (MB) is responsible for storing data transferred from external memory and from the different modules of the proposed processor. The MB is composed of 768 KB of BRAM, and the memory capacity is divided into three main blocks that store different data types. The weight BRAM is 256 KB and stores the weight values required for FP and error/delta calculations. The input feature BRAM is 256 KB. It stores the input features from external memory and the node results of the FC layers, and it also stores the weight gradient when weight updates are performed. The output BRAM is 256 KB. It stores the results of the CCU calculation, including the partial sum of each node in FP, the weight gradient, and the error/delta value in BP.

IV. EXPERIMENTAL RESULTS
This section provides detailed experimental results and comparisons with other works.

A. COMPARISON OF FLOATING-POINT ARITHMETIC OPERATORS
Since several floating-point formats were investigated, this subsection evaluates their digital circuit implementations. The circuits are synthesized with Synopsys Design Compiler using TSMC 40 nm process technology at an operating frequency of 200 MHz. Table 1 shows the synthesis results for floating-point adders/subtractors with different bit widths. The floating-point format for each bit width is represented by N(S, E, M), where N denotes the total number of bits, S the number of sign bits, E the number of exponent bits, and M the number of mantissa bits. The results show that the 16-bit floating-point format achieves a smaller area and lower power consumption than the single- and half-precision formats, and that the brain floating-point format achieves the lowest area and power consumption. Table 2 shows the synthesis results for floating-point multipliers with different bit widths. As shown in Table 2, the fewer mantissa bits are used in floating-point multiplication, the smaller the area and energy consumption. Among the 16-bit formats, the brain floating-point format again has the smallest area and power consumption.

B. MEMORY ACCESS ANALYSIS
We adopt the technique proposed in [16] for reducing memory accesses in the backpropagation stage and combine it with a sparsity compression scheme. Together, these techniques also reduce the number of external memory accesses. Tables 3, 4, and 5 show the analysis of memory accesses for different layers in AlexNet [1], VGG16 [3], and MobileNet-V1 [37], respectively. DRAM accesses include the accesses to read input feature nodes, output feature nodes, weights, and weight gradients. We use the ImageNet [1] dataset to analyze DRAM accesses, with the batch size set to 1 and the number of epochs set to 1. Tables 3 to 5 show that the fully connected layer has many parameters that must be read from DRAM and sent to the PL side for numerical computation. Because the ReLU activation function turns negative values into 0, the number of DRAM accesses can be reduced by skipping the weights whose input values are 0. In forward propagation, the weights are used only to calculate the output of the hidden layer. In backward propagation, the weights are used both to calculate the delta value of each layer and to obtain the new weight values, so backward propagation needs more DRAM accesses. Compared with [16], the proposed method reduces DRAM accesses by 43.33% to 95.5% in forward and backward propagation. Higher network sparsity leads to a higher compression rate, which improves processor performance because less data needs to be read from DRAM. Because DRAM accesses for parameters that do not affect the result are skipped, computation is faster during both the inference and training stages.
Table 6 shows the attributes of the proposed chip design. The design uses 256 PEs and the brain floating-point format. It supports the inference and training modes for FC layers (FC1 and FC2), which are designed in a reconfigurable way: they support 1∼4096 nodes, the output layer supports up to 1000 classes, the batch size can be 1∼32, and the number of epochs is unrestricted. Softmax and cross-entropy loss functions are also supported, so image classification can be trained end to end.

C. OVERALL HARDWARE SYNTHESIS RESULTS
The proposed design has been implemented in Verilog HDL and synthesized for the Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA, with Vivado 2020.1 as the development environment. The hardware utilization of the proposed design is shown in Table 7. The design uses 73622 LUTs and 26832 flip-flops, and the implemented design achieves an operating frequency of 200 MHz. We also use Vivado to analyze the power consumption of the proposed hardware architecture. The total power consumption is 6.444 W, of which 5.729 W is dynamic and 0.715 W is static. The PS accounts for 45% of the dynamic power consumption, followed by the logic computing units at 19%. According to the data flow, 70.51% of the BRAM resources on the ZCU104 are used to complete this SoC architecture design. Tables 8 to 10 show the execution times of the proposed hardware architecture on the ZCU104 for AlexNet [1], VGG16 [3], and MobileNet-V1 [37]. Execution times are given for two designs, one with and one without the sparse compression method. In sparse mode, unlike dense mode, the data sparsity encoder and decoder module in front of the memory bank prevents zero values from being stored, since computation on them always results in zero. The remaining values are reordered in the encoder and decoder module and then stored in the memory bank. Thus, the PE array handles sparse data in the same way as dense data.

D. PERFORMANCE OF HARDWARE ARCHITECTURE
We used the ImageNet [1] dataset to train the FC layer network, with the batch size set to 1 and the number of epochs set to 1. Due to the nature of the fully connected layer, when the hidden layer is large, many weight parameters must be transferred from external memory to on-chip memory for computation. A sparse approach saves a significant amount of this transfer time. Tables 8 to 10 show that backpropagation takes more time than forward propagation, mainly because backpropagation must calculate the delta/error value for each layer and the weight gradient for each weight, and then update each weight using its gradient. Tables 8 to 10 also show that the proposed sparsity approach reduces the execution time by 32.45% to 95.43%.
We use the PYNQ framework to complete the overall design as an SoC. In the overall system, starting the DMA to transfer data from the PS to the PL, or results from the PL to the PS, takes considerable time; we found that starting the DMA and sending the data to the ZCU104 accounts for most of the execution time. If the sparsity in the network is exploited to reduce the number of weights transferred, fewer DMA transfers need to be initiated and the execution time of each layer decreases. In Tables 8 and 10, the first layer has the largest number of weights, so its forward and backward propagation take the most execution time. In Table 9, because MobileNet-V1 uses only one fully connected layer with only 1024 input features, the number of weight parameters is significantly smaller than in AlexNet and VGG16. This removes the need to start the DMA on the PS side multiple times to pass the weights, resulting in a shorter execution time.
Table 11 compares the proposed accelerator with existing FPGA accelerators in terms of resources, power consumption, and operational performance. The power data are obtained from the FPGA synthesis results, while the throughput is calculated by multiplying the clock frequency by the number of operations computed in one clock cycle. Because the accelerators use different FPGA boards, they show different performance; for comparison purposes, the evaluation metric is therefore energy efficiency (GOPS/W). Our design reaches 102.4 GOPS at an operating frequency of 200 MHz. Compared with the floating-point works in [5], [8], and [18], our design achieves higher energy efficiency. Although the proposed hardware architecture handles only the forward and backward propagation of fully connected layers, it supports softmax and cross-entropy loss functions for the complete training of image classification tasks.
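The throughput rule stated above (clock frequency times operations per clock cycle) can be checked with a few lines of arithmetic. The sketch assumes that each of the 256 PEs completes one multiplication and one addition per cycle, i.e. two operations; that per-PE count is an assumption made here, chosen because it reproduces the reported 102.4 GOPS. The energy efficiency is derived from the reported total power rather than quoted from Table 11.

```python
CLOCK_HZ = 200e6        # reported operating frequency
NUM_PES = 256           # 16 clusters x 16 PEs
OPS_PER_PE_CYCLE = 2    # assumption: one multiply + one add per PE per cycle
TOTAL_POWER_W = 6.444   # reported total power consumption


def throughput_gops(clock_hz, num_pes, ops_per_pe_cycle):
    """Peak throughput in GOPS: frequency x operations per clock cycle."""
    return clock_hz * num_pes * ops_per_pe_cycle / 1e9


def energy_efficiency(gops, power_w):
    """Energy efficiency in GOPS/W."""
    return gops / power_w


peak_gops = throughput_gops(CLOCK_HZ, NUM_PES, OPS_PER_PE_CYCLE)
gops_per_watt = energy_efficiency(peak_gops, TOTAL_POWER_W)
```

This is a peak figure: it does not account for DMA start-up time or cycles in which PEs are idle.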

E. COMPARISON WITH RELATED WORKS
The proposed hardware design achieves better performance than the work in [10], which supports training and inference of FC layers only. The proposed processor achieves 2.14x higher energy efficiency than [10]. The training accelerator in [17] uses a batch size of 10 to process more image data in parallel, thus reducing the energy consumption per image. However, in [17], they use a batch size of 1, which leads to higher energy consumption because of the additional latency imposed on DRAM by frequent weight updates. With its high energy efficiency, our proposed architecture is more suitable for deployment on mobile devices.

V. CONCLUSION
This paper presents a training processor that performs the training and inference phases of the FC layer. The processor uses the 16-bit brain floating-point format to achieve a high-performance hardware design while supporting sparse data. The proposed hardware architecture supports forward and backward propagation of fully connected layers with a complete training mechanism. The number of DMA reads can be reduced further by expanding the number of PEs and the BRAM usage according to the hardware resources of the development board. The final design is implemented on the Xilinx ZCU104. With 256 MACs, the synthesized design reaches 102.4 GOPS at an operating frequency of 200 MHz. The architecture is reconfigurable and supports different numbers of nodes and classes, different batch sizes, and different numbers of epochs. Once the parameters in the configuration registers are set at the beginning, end-to-end training is possible. The architecture can also be extended from fully connected layers to other layer types, because convolution layers can be computed by converting the convolution operation into a matrix multiplication with the im2col method. In the future, the processor will be extended to support 2D convolution and recurrent neural networks so that multiple types of training can run on a single chip.
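The im2col idea mentioned above can be shown on a tiny example. The sketch assumes a single-channel image, stride 1, no padding, and the cross-correlation convention commonly used for DNN convolution layers; it is a plain-Python illustration, not a description of how the processor would schedule the computation.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one row (stride 1, no padding)."""
    out_rows = len(image) - kh + 1
    out_cols = len(image[0]) - kw + 1
    return [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(out_rows)
        for c in range(out_cols)
    ]


def conv2d_as_matmul(image, kernel):
    """Compute a 2-D convolution as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # Each output value is a dot product, the same operation an FC node performs.
    flat = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_cols = len(image[0]) - kw + 1
    return [flat[i:i + out_cols] for i in range(0, len(flat), out_cols)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_as_matmul(image, kernel)
```

Each row of the patch matrix plays the role of an input vector to a fully connected node whose weights are the flattened kernel, which is why the same multiply-accumulate datapath could be reused.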