Energy-Efficient DNN Training Processors on Micro-AI Systems

Many edge/mobile devices are now able to run deep neural networks (DNNs) thanks to the development of mobile DNN accelerators. Mobile DNN accelerators overcame the problems of limited computing resources and battery capacity by realizing energy-efficient inference. However, their passive behavior makes it difficult for DNNs to provide active customization for individual users or their service environments. On-chip training is becoming increasingly important for providing active interaction between DNN processors and their ever-changing surroundings or conditions. Despite its advantages, DNN training has more constraints than inference, so it was long considered impractical to realize on mobile/edge devices. Recently, there have been many attempts to realize mobile DNN training, and this article summarizes a number of prior works. First, it presents the new challenges induced by adding training functionality to a DNN accelerator and discusses new hardware features related to those challenges. Second, it explains algorithm-hardware co-optimization methods and why they have become mainstream in mobile DNN training research. Third, it compares the main differences between conventional inference accelerators and recent training processors. Finally, it concludes by proposing future directions for DNN training processors in micro-AI systems.

training-based distributed learning can distribute the workload of the server. If we adopt federated learning [82], we not only protect individual privacy but also make DNN training easy to scale.
In this article, DNN training will be discussed in terms of its technical challenges. First, typical applications and requirements of on-device training will be discussed in Section II. New design challenges caused by the computational characteristics of training will be summarized in Section III. After that, new features suggested by recent research will be examined through Sections IV to VIII. Section IV surveys new hardware designs to support the transpose-read of weights required at the error-propagation (EP) stage. Sections V-VIII introduce: 1) sparsity-aware acceleration; 2) bit-precision optimization; 3) memory access optimization; and 4) backward unlocking (BU) methodologies. Section IX summarizes recently developed DNN training processors and introduces two design examples, HNPU-V1 [29] and HNPU-V2 [80], along with the design philosophies behind inference and training processor design. This article concludes with a discussion of future research directions and the new challenges that are expected to appear in upcoming DNN training processors.

II. THREE MAJOR SCENARIOS OF ON-DEVICE TRAINING
Training efficiency and speed are the key enablers of on-device training, but the detailed requirements vary according to the target application. As shown in Fig. 1, we summarize three major scenarios of on-device DNN training as follows: 1) evolution; 2) advancement; and 3) adaptation.

A. EVOLUTION: LONG-TERM DNN TRAINING
If we can collect data with proper labels, a new DNN application can be created. However, there is too much data in the real world, and labeling all of it is challenging. In addition to labeling, gathering training data raises privacy-preservation issues. Although these two problems are the main barriers to enlarging DNN applications, they can be addressed using long-term DNN training such as federated learning [82]. After devicewise local DNN training, individual weights are sent to the cloud, where they are aggregated for global DNN training, as sketched below. To participate in this process, each device should support on-device training for local training: it updates the local DNN with user data collected over some period and receives the global DNN after federated aggregation. This is an example of long-term DNN training because communication to share parameters occurs infrequently. The long-term training scenario, evolution, utilizes a large dataset such as ImageNet [81], unlike the other two training scenarios. AI processors for evolution applications require reconfigurable logic to provide training functionality but have no strict constraints on training latency.
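As a rough illustration of the aggregation step, the following sketch implements federated-averaging-style aggregation in the spirit of [82]; the function name, client weights, and sample counts are hypothetical, not taken from the cited work.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate locally trained weights into a global model (FedAvg-style sketch).

    client_weights: list of per-device weight arrays of identical shape.
    client_sizes: local training sample count per device, so devices with
    more data contribute proportionally more to the global weights.
    """
    total = sum(client_sizes)
    global_w = np.zeros_like(client_weights[0])
    for w, n in zip(client_weights, client_sizes):
        global_w += (n / total) * w   # weighted contribution of each device
    return global_w

# Example: three devices report locally updated weights of the same layer.
clients = [np.random.randn(64, 10) for _ in range(3)]
sizes = [1200, 300, 500]
global_weight = federated_average(clients, sizes)
```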

B. ADVANCEMENT: MEDIUM-TERM DNN TRAINING
On-device training is important when the application needs user-specific solutions. On-device training can create a small student network that shrinks the original large teacher network to focus only on user preferences. It usually utilizes quantization or weight pruning for fast and efficient acceleration of the DNN on the micro-AI device. In addition to network shrinking, it can add new classes that were not considered when the original pretrained network was trained. This requires fine-tuning a network that utilizes the original large teacher network as a backbone but modifies its weights on the user-specific dataset. These applications can be categorized as the advancement scenario, which requires medium-term DNN training. The advancement scenario uses a relatively small dataset, and the training should be finished overnight, when the device is generally charging and not in use. The training time constraint is stricter than in long-term DNN training, but it still does not need real-time or ultrafast performance like inference applications. Instead, these training scenarios are important for offering user-friendly applications. Moreover, they can optimize networks using quantization or pruning for fast and efficient inference on edge/mobile devices.

C. ADAPTATION: SHORT-TERM DNN TRAINING
Both long-term and medium-term training are important for preserving users' privacy while expanding DNN applications to broad areas. However, their timing constraints are not that strict compared with the DNN inference scenario. In contrast, real-time or ultrafast training is required when the DNN needs environmental adaptation. There are two typical examples. The first example is object tracking. Han et al. [34], [43] utilized a light object detection network but supported object tracking functionality by adopting online DNN training. Since the shape of the target object can be deformed by its movement, the network learns new shapes through online DNN training. Furthermore, it adapts to new illuminations or occlusions to be robust to environmental changes. Since object tracking generally targets real-time operation, the online training process should be completed within 30 ms so as not to degrade the performance of the conventional inference process. Adaptation after an unexpected situation, such as a camera malfunction or an abrupt domain change, is also important to prevent fatal operational errors. Han et al. [80] showed that online DNN tuning performed right after an unpredictable accident is one solution for recovering the original performance. As shown in both examples, on-device adaptation seems promising, but it must be accompanied by an energy-efficient and low-latency DNN training processor. Long training latency disturbs DNN inference and can cause other problems due to slow response. Although only a small amount of data can be collected at runtime, the adaptation scenario requires the fastest training speed of the three training scenarios. Unlike the evolution and advancement scenarios, it demands ultrafast DNN training, but it can lose the generality of the original network by using only the small amount of data collected every frame.
The design direction of the on-chip DNN training processor can vary according to the target DNN training scenario. In this article, we introduce various DNN training processors covering these three target application scenarios.

III. BACKGROUND: BACKPROPAGATION
Even though there are many DNN training methodologies, backpropagation (BP, [40]) is the most widely used DNN training method. Thus, the distinct computing characteristics of BP compared with inference will be briefly explained. We will also introduce the main challenges for realizing fast and energy-efficient training on micro-AI systems.

A. BP OF FULLY CONNECTED LAYER AND CONVOLUTION LAYER
BP is a loss minimization method based on gradient descent using the chain rule. It stores all intermediate activations and errors generated in every layer and loads them to calculate the new gradients. Unlike numerical gradient descent, BP-based training is an analytic gradient descent method, which can calculate accurate gradients of multiple parameters with just one inference. Therefore, BP can significantly reduce the computational requirement; instead, it needs large storage to remember all activations and errors needed to update the weights. As shown in Fig. 2, BP generally consists of three training stages: 1) feedforward (FF); 2) EP; and 3) weight-gradient update (WG). In the FF stage, the DNN obtains inference results with the predefined weights. After that, the loss function is calculated to generate the error. In the EP stage, this error is propagated from the last layer to the prior layers step by step. Finally, the gradient of the weight is calculated by utilizing both the activations and errors generated in the FF and EP stages. This training process is general for any type of DNN, but the required computing methods can vary according to the target layer type. In this section, we will explain the detailed computation of the three training stages in two commonly used layer types: 1) the fully connected layer (FCL) and 2) the convolution layer (CL).

1) BP OF FCL
When the input and output neurons of a layer are fully connected, we call it an FCL, and it is the most general layer type of the DNN. For example, multilayer perceptrons and recurrent neural networks use stacks of FCLs to complete the network architecture. Convolutional neural networks (CNNs) also utilize FCLs for classification or channel attention. The FF stage of the FCL is based on matrix multiplication of the input activation (IA) and the weight. In the EP stage, the computation method is similar, but the input operands are replaced with the error and the transposed weight. After both the FF and EP stages, the IA and output error (OE) are multiplied to create the gradient of the weight. The WG stage needs elementwise multiplication in single-batch gradient descent, as sketched below.
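A minimal single-sample sketch of the three stages for an FCL follows; the layer sizes and learning rate are illustrative assumptions, not values from any cited design.

```python
import numpy as np

ia = np.random.randn(1, 64)    # input activation (IA), 1 sample x 64 inputs
w = np.random.randn(64, 10)    # forward weight
oa = ia @ w                    # FF stage: matrix multiplication of IA and weight

oe = np.random.randn(1, 10)    # output error (OE) arriving from the next layer
ie = oe @ w.T                  # EP stage: same computation with the transposed weight
gw = ia.T @ oe                 # WG stage: each gradient entry is one IA x OE product
w -= 1e-3 * gw                 # gradient-descent weight update
```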

2) BP OF CL
Although the FCL needs matrix multiplication and elementwise multiplication for training, the CL only needs the convolution operation. Instead, the shape of the weight kernel varies according to the training stage. The EP stage needs a 180°-rotated kernel to propagate the errors, and the positions of the input and output channels are also switched. Similar to the FCL, the CL therefore needs the transpose-read due to this channel switching. After the EP stage, the gradient of the weight is calculated by the convolution of the IA and OE. Unlike the FF and EP stages, the WG stage needs a large-kernel convolution because the size of the OE is almost the same as that of the IA. As a result, the distance of spatial data reuse at the WG stage is not as long as in the other training stages.
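The EP-stage kernel transform can be written compactly; the tensor shapes below are illustrative assumptions.

```python
import numpy as np

# EP-stage kernel transform for a CL: rotate the kernel 180 degrees spatially
# and swap the input/output channel dimensions.
w = np.random.randn(16, 8, 3, 3)                       # (out_ch, in_ch, kh, kw)
w_ep = np.flip(w, axis=(2, 3)).transpose(1, 0, 2, 3)   # 180° rotation + channel swap
assert w_ep.shape == (8, 16, 3, 3)                     # channels are now switched
```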

B. CHALLENGES OF BP AT THE MICRO-AI SYSTEMS
DNN training requires not only the FF stage but also the EP and WG stages. Additionally, the optimization methods of inference processors can be useless for training because they sometimes degrade training quality. These differences induce new challenges in designing training processors, summarized as follows.

1) READING OF TRANSPOSED WEIGHT
The EP stage reads the transposed form of the forward weight (FW) before it computes the convolution. The CL requires a 180°-rotated kernel as well as a transposed kernel. Reading 180°-rotated data can be realized simply by reading the data in reversed order. However, reading the transposed weight is not that simple because it reads discontinuous data from the external memory where the weight is stored. The discontinuous access pattern caused by the transposed weight disturbs burst-mode reads of DRAM, resulting in slow and inefficient DNN computing.
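The access-pattern problem can be seen from the row-major addresses touched by a normal read versus a transposed read; the sketch below is purely illustrative.

```python
# Row-major addresses touched when reading an R x C weight matrix.
R, C = 4, 4
normal_read = [r * C + c for r in range(R) for c in range(C)]     # stride-1, burst-friendly
transpose_read = [r * C + c for c in range(C) for r in range(R)]  # stride-C, breaks DRAM bursts
print(normal_read[:6])     # [0, 1, 2, 3, 4, 5]
print(transpose_read[:6])  # [0, 4, 8, 12, 1, 5]
```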

2) MAXIMIZING THROUGHPUT EVEN WITH THE LIMITED RESOURCES
DNN training requires a much larger amount of computation than inference. It needs to perform two additional training stages and repeats this training loop until the loss converges. It also utilizes batch gradient descent, which uses multiple inference results to generate a single gradient.
The required batch size varies according to the target training application. Generally, the batch size becomes relatively small if the target network requires a large memory size, as in image-to-image translation networks. On the contrary, as shown in Table 1, deep reinforcement learning (DRL) networks utilize large batch sizes because their network size is relatively small. In object classification, training performance can degrade significantly with a batch size < 4, but beyond 4, there is no significant difference in training performance [Fig. 3(a)]. For this reason, conventional GP-GPUs utilized large-batch parallelism to reduce the overall training time [Fig. 3(b)]. However, this acceleration method is challenging in mobile training processors with limited on-chip memory size and external memory bandwidth. Instead, a novel throughput improvement method is necessary for training processors to accelerate training even with small-batch parallelism.

3) HIGH-BIT PRECISION REQUIREMENT
During inference, the final decision of the DNN is mainly affected by the large values of the IA [29]. For this reason, it is easy to decrease the bit-width during inference acceleration. Unlike inference, even the small values of the IA cannot be ignored in DNN training because they may include important information for loss minimization. Naïve quantization methods can destroy this information and slow down the training. Furthermore, the errors computed in the EP stage have a wider distribution than the activations in the FF stage; thus, they are difficult to represent with low-bit precision. As a result, training requires a high-bit-precision representation, and this requirement becomes the main obstacle to realizing DNN training on mobile devices because it substantially decreases the efficiency of the processor.

4) MEMORY-INTENSIVE OPERATIONS DURING WG
As described in Fig. 2, the intermediate IA and OE should be stored until the weight gradient is calculated. In addition, although data reuse is partially possible in the CL, the amount of reuse is much smaller than in the other training stages. Computing in the WG stage becomes similar to the matrix multiplication that appears in the FCL. After gradient calculation, it additionally loads the weight and momentum to finalize the weight update process. This results in low arithmetic intensity (the number of operations per byte), so the performance highly depends on the external and internal memory bandwidth. Consequently, reducing memory access during the WG stage is the most important optimization for accelerating its computation.

5) BACKWARD LOCKING PROBLEM
BP propagates errors sequentially from the last layer to the first layer, as depicted in Fig. 4. Therefore, the weight of the first layer can be updated only when the error propagation of all layers is completed. In other words, the FF, EP, and WG stages should be computed sequentially and cannot be processed in parallel. This is called the backward locking problem [34], and this sequential process has two disadvantages. First, the three training stages require the same weights, but the weights should be reloaded at every training stage. Second, it cannot hide the latency of the EP and WG stages, which significantly reduces the frame rate during online DNN training, such as object tracking [41] or temporal knowledge distillation (TKD, [42]).

IV. SOLUTION OF TRANSPOSE-READ DURING EP
As summarized in Table 2, there are three categories of transpose-read solutions: 1) software-level; 2) architecture-level; and 3) circuit-level solutions.

A. SOFTWARE-LEVEL SOLUTION
Han et al. [43] tried to avoid the transposed weight by using a different EP method called feedback alignment [44]. This algorithm is the same as BP, but the backward weight (BW) is substituted with a binary random matrix. However, since this algorithm modification may degrade the training accuracy significantly, it can only be applied to limited DNN applications.
Most DNN processors [22], [23], [29], instead, adopted software-level data prefetching. They rearranged the data stored in the weight SRAM before it was used for the main convolution operation. This method is straightforward but needs additional time to rearrange the weight memory. Therefore, it can be efficient only when the reordered weights are reused repeatedly.

B. ARCHITECTURE-LEVEL SOLUTION
The transposable PE array suggested by Kim et al. [20] relieves the burden on the software through a simple architectural modification of the PE array. The transposable PE array utilized both broadcasting and unicasting dataflows to perform matrix multiplication or convolution. It utilized input feature reuse during inference with a single image, but adopted weight reuse by aggregating multiple images in the EP stage. This method can convert its computing type with a simple instruction, but it needs a local buffer for each MAC unit, resulting in an efficiency drop in the PE array.
Another work [30] added a weight transpose-reading unit instead of modifying the main PE array. This additional unit generates the transposed weights before they are fetched to the PE array. Since the transposable PE array [20] assumed regular patterns of rectangular-shaped weights, it shows no throughput improvement even with pruned weights. The weight transposer suggested by Lee et al. [30] adopted hierarchical transpose-read and reduced memory access by excluding the fetch of pruned weights. Even though the weight transposer unit showed fast weight decoding speed, it can induce low area efficiency due to the additional large register file array.

C. CIRCUIT-LEVEL SOLUTION
Custom SRAM designs [5], [25] are also a good solution that can support both normal-read and transpose-read. They make both the main core architecture and the software much simpler, but the memory density of the SRAM is degraded because the SRAM cell architecture must be modified by adding more MOSFETs or bit/word-lines.

V. SPARSITY EXPLOITATION DURING TRAINING
The DNN training processor needs further optimization to realize high-speed DNN training in micro-AI systems. It needs additional hardware features to maximize its throughput even with limited resources. Sparsity exploitation is a key feature that increases throughput during on-chip training, and it is summarized in Table 3.

A. CONVENTIONAL INPUT/WEIGHT ZERO SKIPPING
Most DNNs utilize the ReLU activation function, which turns the negative values of the IA into zeros. The weight can also have many zero values when pruning is applied. Zero operands do not affect the final output, and many inference engines [33], [45], [46] sped up their operations by skipping zeros in the IA or weight. Input and weight skipping techniques are still effective for DNN training, but they show limited performance improvement during the EP and WG stages. At the EP stage, input sparsity disappears due to the normalization layer. On the other hand, weight sparsity disappears at the WG stage because the gradient is calculated by the convolution between the IA and OE. The DNN training processor needs a different sparsity exploitation method from the inference processor to utilize the zeros that appear during training.

B. RELU-AWARE OUTPUT ZERO SKIPPING DURING THE EP
Input sparsity cannot be utilized to accelerate the EP stage. Instead, the derivative of the ReLU activation, dReLU(x)/dx, passes errors to the prior layer only at positions where the forward activation was positive and thereby induces output sparsity. With this motivation, some DNN training processors [24], [27] exploit both input and output sparsity to improve their throughput in both the FF and EP stages.
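A minimal sketch of this gating, with illustrative shapes:

```python
import numpy as np

# dReLU gating in the EP stage: errors propagate only where the FF-stage
# pre-activation was positive, so the EP output becomes sparse.
oa = np.random.randn(1, 128)   # FF-stage pre-activation of this layer
oe = np.random.randn(1, 128)   # error arriving from the next layer
ie = oe * (oa > 0)             # zeros wherever ReLU was inactive
print("EP output sparsity:", float(np.mean(ie == 0)))
```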

C. PRUNING-AWARE OUTPUT ZERO SKIPPING DURING THE WG
The weight-sparsity exploitation core is still effective during the EP stage, but it cannot be applied during the WG stage because that stage calculates matrix multiplication or convolution between the IA and OE. Instead, the gradients of the pruned weights should be discarded, which makes it possible to skip those calculations. This results in output sparsity during the WG stage. Output sparsity exploitation during the WG stage has big benefits thanks to both useless-computation avoidance and memory access removal. For this reason, recent energy-efficient training processors [26], [31], [51], [78] supported triple sparsity exploitation by combining iterative pruning.

D. IN-AND OUT-SLICE SKIPPING
Dual or triple sparsity exploitation has recently been proposed for zero-skipping-based DNN training, but it induces a complex datapath, leading to low efficiency for dense data computing. In addition, the activation functions used in recent DNN architectures have diversified, including leaky-ReLU [52] and mish [53]. Much energy is wasted if there is no sparsity during DNN training. Han et al. [29], [80] focused on this drawback and utilized bit-slice (4-bit) level sparsity, as illustrated below. Moreover, Han et al. [29] skipped partial accumulation slices that would be truncated before being used for the next layer. Thanks to in- and out-slice skipping, the sparsity exploitation datapath is simplified and slice-level sparsity is utilized in all training stages.
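The following sketch shows the idea of slice-level sparsity on a 16-bit word; the value is arbitrary.

```python
import numpy as np

# A 16-bit word splits into four 4-bit slices; all-zero slices can be
# skipped even when the word itself is nonzero.
x = np.uint16(0x0F30)
slices = [(x >> s) & 0xF for s in (12, 8, 4, 0)]   # MSB-first 4-bit slices
print(slices)                                      # [0, 15, 3, 0] -> two skippable slices
```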

VI. BIT-PRECISION OPTIMIZATION
Even if the training processor adopts sparsity exploitation, its energy efficiency is still lower than that of conventional mobile inference processors because it requires high-bit-precision representation. Previously, it was considered that DNN training requires ≥ 32 bit, but the required bit precision is continuously decreasing thanks to research on low-bit-precision training. We will introduce recent research on reducing the required precision during training.

A. NEW NUMBER REPRESENTATION
The first attempt was the 16-bit IEEE 754 standard floating-point (FP) format. However, naïve FP16-based training sometimes degrades training accuracy because the exponent bit-width is too small to fully represent the weight gradient. Many substitutes have been proposed for better DNN training.

1) NEW FLOATING-POINT REPRESENTATION
As shown in Table 4, there are three new FP representations that have a wider data representation range thanks to their wider exponents. The exponent bit-width of Bfloat16 [14] and TensorFloat [54] is the same as that of the original FP32 for high compatibility with FP32. However, their energy efficiency is still low because they need more than 16 bit for data accumulation. DLFloat [17], [18], [19] was proposed to unify the data representation of both the input operands and the accumulation. Flexpoint [55] tried to substitute FP with a fixed-point (FXP) representation together with a shared-exponent management algorithm to simplify the MAC design, but it failed to reduce the required bit precision below 16 bit.
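As a concrete illustration of the Bfloat16 layout, the sketch below truncates an FP32 value to its upper 16 bits, keeping the full 8-bit exponent (and hence the FP32 range) with only 7 mantissa bits; the function name is illustrative, and real hardware typically adds rounding.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate an FP32 value to Bfloat16 (sign, 8-bit exponent, 7-bit mantissa)."""
    fp32_bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return (fp32_bits >> 16) & 0xFFFF   # keep the upper 16 bits

print(hex(to_bfloat16_bits(1.0)))       # 0x3f80: same exponent field as FP32
```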
Nowadays, FP-based training can be realized with less than 16 bit by utilizing the narrow distributions that appear in each tensor [28], chunk [56], or training stage [57], [58]. The required representation range can be reduced by dividing the computation through a shared exponent [28], chunkwise accumulation [56], or hybrid representation among the different training stages [57], [58]. Although successful bit-precision reduction has been observed for multiplication, accumulation still needs high bit precision, and additional hardware units are required to manage the hybrid representation [57] or two-phase rounding [58].

2) FXP REPRESENTATION
The FXP representation needs much higher bit precision than FP. The required bit precision can be reduced dramatically when dynamic FXP (DFXP) [3], [29], [34], [80] is adopted. It adjusts the integer length to fit the layerwise narrow distribution instead of considering all layers at once. Even though DFXP shows a remarkable performance improvement, it still needs higher bit precision (> 8 bit) than FP because low-precision FXP has a limited representation resolution. Stochastic rounding (SR, [59]) was proposed to increase its effective representation resolution. In addition to SR, stochastic thresholding (ST, [29]) was co-designed to further increase the representation resolution by loosening the overflow judgment and removing outlier data. The combination of SR and ST successfully demonstrated a 43% bit-precision reduction on the CIFAR-100 dataset [29].
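A minimal sketch of SR to a fixed-point grid follows; the fractional bit-width and function name are illustrative assumptions.

```python
import numpy as np

def stochastic_round(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits.

    The rounding direction is random, with probability proportional to the
    distance to each neighboring grid point, so small values survive in
    expectation instead of always collapsing to zero.
    """
    scale = 2.0 ** frac_bits
    scaled = x * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower                     # distance to the upper grid point
    rounded = lower + (np.random.rand(*np.shape(x)) < prob_up)
    return rounded / scale

x = np.full(10000, 0.0301)
print(stochastic_round(x, 4).mean())             # ~0.0301 on average (grid step 0.0625)
```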

B. LOW BIT-PRECISION TRAINING ALGORITHM
Not only new number representations but also new algorithms have been proposed for low-bit-precision training. One of them is fine-grained mixed precision (FGMP, [21]), which divides the IA into two different bit-width formats. The majority of the data is placed near zero and can be represented with low-bit-precision data. However, ignoring the precise representation of the outlier data degrades training performance significantly. To solve this problem, FGMP represents only the outlier data with high bit precision while maintaining low bit precision for the majority of the data.
Layerwise adaptive precision scaling (LAPS, [29]) further reduces the required bit precision by automatically tuning the bit precision during training. The required precision varies according to the training scenario, and it is hard to predict the optimal bit precision at the beginning of training. LAPS continuously monitors the similarity of the high-precision and low-precision convolution results and increases the bit-width if the difference becomes larger. The bit precision found by LAPS varies according to the dataset and network complexity.
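A conceptual sketch of such a precision monitor is given below; the threshold, step size, and function name are hypothetical and do not reproduce the exact LAPS rule in [29].

```python
import numpy as np

def next_bitwidth(hi_out, lo_out, cur_bits, thresh=0.05, step=2, max_bits=16):
    """Widen the layer's bit precision when the low-precision convolution
    result drifts too far from the high-precision reference."""
    rel_err = np.linalg.norm(hi_out - lo_out) / (np.linalg.norm(hi_out) + 1e-12)
    if rel_err > thresh:
        return min(cur_bits + step, max_bits)
    return cur_bits
```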

C. HARDWARE ARCHITECTURE FOR BIT-PRECISION OPTIMIZATION
Hardware architecture and PE circuits should be modified to support the new number representations and low-bit-precision optimization algorithms.

1) MULTIPLE-PRECISION CONFIGURABLE MULTIPLY-ADD UNIT
To accommodate the new FP representations, the design of the MAC unit should be modified. Since the required bit precision of training is usually higher than that of inference, training processors showed lower efficiency than inference processors. Multiple-precision configurable multiply-add units can support both high-precision and low-precision computing, and they have become a common way to reduce the efficiency gap between inference and training processors.

2) ACTIVE TRAINING SUPPORTING UNIT
Both FGMP [21] and DFXP [3], [29], [34], [35], [80] need streaming data analysis units that can calculate the mean/variance or overflow ratio of the OA. Based on the analyzed tensorwise statistics, FGMP converts more than 90% of the FP16 accumulation results to FP8 operands during ResNet-18 training. In DFXP-based computing, the integer length of the corresponding layer is increased if the overflow ratio exceeds a predetermined threshold [3], as sketched below. LAPS [29] performs this in a peripheral unit that calculates the differences and counts only the large ones. The new bit precision is determined by an internal finite-state machine and applied from the next training iteration. Bit-precision reduction algorithms usually require an additional unit to monitor output distributions and provide useful information for runtime bit-precision optimization.
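A sketch of the DFXP-style range update under these assumptions (the threshold and function name are illustrative):

```python
import numpy as np

def update_int_length(layer_output, int_len, overflow_thresh=0.01):
    """Grow a layer's integer length when too many values overflow the
    current dynamic fixed-point range."""
    limit = 2.0 ** int_len                          # largest representable magnitude
    overflow_ratio = np.mean(np.abs(layer_output) >= limit)
    return int_len + 1 if overflow_ratio > overflow_thresh else int_len
```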

VII. WEIGHT GRADIENT STAGE ACCELERATION
Sparsity-aware WG acceleration assumes that weight pruning is adopted. Accelerating the WG stage without weight pruning is still challenging because the bit precision of the gradient and the primal weight should be higher than the bit precision used during the FF and EP stages.
Further optimization of the WG stage is essential for a high-performance training processor; thus, we summarize the existing methods for the WG stage.

A. GRADIENT BIT-PRECISION REDUCTION
Data compression is the most widely used WG acceleration method because the WG stage suffers from memory-intensive computation. Seide et al. [60], Alistarh et al. [61], and Wen et al. [62] suggested ultralow-bit gradient quantization, but their main purpose was to reduce gradient communication among multiple edge devices during distributed learning. Since the primal weight still needs full precision (16 bit), the quantized gradient should be converted back to the original bit-width during the weight update operation. In addition, a large portion of memory access is still required due to the movement of the primal weight and momentum from memory.

B. SPARSE GRADIENT CALCULATION
Gradient compression can be realized not only by quantization but also by gradient sparsification. Lin et al. [63] and Strom et al. [64] accumulated weight gradients every iteration and updated a weight only when its accumulated gradient became larger than a predetermined threshold, as sketched below. These methodologies require a large internal buffer to store all weight gradients, which is not practical for training on micro-AI systems. Another sparsification method [65] updated the weight only when it had a large gradient, or only when the target confidence was low [66]. Large-gradient-only or low-confidence-only weight updates showed great compression ratios, but they can slow the training curve. Furthermore, they are not easy to apply to other complicated DNN applications, such as object detection and image-to-image generation.
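A minimal sketch of this accumulate-and-threshold scheme; the learning rate, threshold, and function name are illustrative, not taken from [63] or [64].

```python
import numpy as np

def sparse_update(w, grad, residual, lr=1e-3, thresh=1e-2):
    """Accumulate gradients into `residual`; update only the entries whose
    accumulated magnitude crosses the threshold, then clear them."""
    residual += grad
    mask = np.abs(residual) > thresh
    w[mask] -= lr * residual[mask]
    residual[mask] = 0.0
    return w, residual
```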

C. SELECTIVE DNN TRAINING
Although bit-precision reduction or sparsification during the WG stage shows higher efficiency than naïve gradient calculation, it still requires a large buffer to store all intermediate IAs and OEs that appear during the FF and EP stages. Cai et al. [83] suggested bias-only fine-tuning, which freezes the weights and updates only the biases to remove the storage of intermediate IAs. Another method is selective layer training [32], which focuses on a few layers and skips memorizing the IA and OE of the other layers. Both methods can reduce the required memory footprint significantly but degrade training quality due to aggressive gradient-calculation skipping. Lin et al. [84] suggested channelwise gradient skipping and also utilized both bias-only tuning and selective layer tuning to further optimize the training. To minimize the accuracy degradation, it predicts the accuracy loss and the hardware cost reduction at compile time. In summary, selective training is a key enabler of on-device training because it minimizes the memory access required for the IA and OE.

VIII. BACKWARD UNLOCKING
Handling backward locking is still challenging, and there are few proposals to solve this problem. Li et al. [68] proposed an out-of-order scheduling mechanism that performs inference of the CL before the training of the FCL is finished. However, balancing the workload between the CL and FCL is challenging, and the FCL still suffers from the backward locking problem. BU without algorithm modification is difficult because BP itself has serial computation characteristics.

A. BACKWARD UNLOCKING WITH DELAYED BACKPROPAGATION
There are several BU methods [69], [70] using delayed BP. The decoupled delayed gradient [69] divides the DNN into multiple groups and performs BP only within a group to loosen the backward locking problem. The gradient calculated by each group is transferred to the prior group at the next iteration. Since it adopts an intergroup pipeline during BP, multiple gradients can be computed at the same time, but each group utilizes FF results obtained at a different time step. Thus, it should copy the weights of multiple different time steps for parallel gradient computation.
Another method, the fully decoupled gradient [70], copies intermediate IAs instead of weights. Not only the EP but also the FF stage is divided into multiple groups, forming one big pipeline structure among them. Each group updates its weights with the delayed IAs and performs the FF operation as soon as it completes the weight update. It can further decrease processing time thanks to the pipeline structure of both the FF and EP stages. In spite of their success, both methods have been considered only for multi-GPU processing because they need a large buffer to store all the weights or IAs of the delayed batches. Therefore, a memory-efficient BU method is necessary for micro-AI systems.

B. DIRECT FEEDBACK ALIGNMENT-BASED APPROACHES
Direct feedback alignment (DFA, [71], [72], [73], [74]), described in Fig. 5, can propagate the errors of the last layer to all prior layers simultaneously. The backward locking problem is relieved in DFA-based training because the next inference can be started right after this direct error propagation. As shown in Table 5, DFA requires the minimum memory access compared with the other two BU methods. The only extra cost is that it needs an additional BW that is independent of the FW. It turns out that the overhead of the BW can be reduced by up to 96.9% by using binarization [72] or sparsification [73]. Thanks to the lowest memory requirement, the DFA algorithm was selected for recent on-chip training ASICs [34], [80], which successfully demonstrated DFA-based online training in object tracking and object detection applications. They constructed a pipeline structure among the three training stages and hid the latency caused by the EP and WG stages. Thanks to the pipelined DFA (PDFA), they showed at least 2× higher training speed compared with BP-based acceleration.
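A minimal sketch of the DFA error broadcast follows; the layer widths and matrix names are illustrative assumptions.

```python
import numpy as np

# DFA: fixed random feedback matrices B_i project the last-layer error to
# every hidden layer at once, so no layer waits for a deeper layer's error.
# (A full DFA step would additionally gate each projected error with the
# local activation derivative before computing the weight gradient.)
hidden_dims = [128, 64]                 # illustrative hidden-layer widths
out_dim = 10
B = [np.random.randn(out_dim, d) for d in hidden_dims]   # fixed, untrained BW

e_last = np.random.randn(1, out_dim)    # error at the output layer
layer_errors = [e_last @ b for b in B]  # all layers receive their errors in parallel
```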

IX. DNN TRAINING ACCELERATOR EXAMPLES
In this section, real chip implementation results of DNN training processors will be discussed. First, we will introduce examples of training processors and clarify their target applications. The reason why algorithm co-design is necessary for mobile-oriented DNN training processors will be explained with practical examples. Finally, a design example of DNN training will be introduced together with the key differences between inference and training accelerator design.
Super-fast training is required when online DNN training is necessary. One ASIC [34] showed an example that utilized online training to extend the role of an object classification network to object tracking. It learned the shape of the first bounding box and decided whether the surrounding boxes were similar to the target. Since it updates the network to learn the continuously moving target, it can achieve robust tracking under shape deformation and illumination changes. Another example was the TKD suggested by Farhadi and Yang [42] and Han et al. [80]. It realized energy-efficient but accurate object detection by applying online knowledge distillation to a lightweight DNN. The lightweight network by itself had been considered impractical due to its poor performance. This problem can be solved if the network is trained with labels generated by heavyweight teacher network inference. It needs the inference of a heavy teacher network, but this process is performed infrequently and consumes significantly less energy than the direct use of the teacher network. These examples show that DNN training can be used not only for long-term user customization but also for short-term domain adaptation to further improve functionality and performance.

B. ALGORITHM-HARDWARE CO-OPTIMIZATION IN TRAINING PROCESSOR
Most training processors adopted algorithm-hardware co-design, and it has become an inevitable trend for realizing DNN training in micro-AI systems. Since naïve BP-based training requires large hardware resources, algorithmic co-optimization is essential for high-speed and energy-efficient training.

1) NEW ALGORITHM AND ITS DEDICATED HARDWARE
Some algorithms are not that efficient on conventional DNN accelerators and require completely different hardware designs, as shown in Fig. 6. Nevertheless, when they are co-designed with their optimized hardware, they show much higher efficiency than a naïve implementation of DNN training.
The GANPU [24] of Fig. 6(a) sought co-optimization by proposing a hardware-friendly output-sparsity prediction algorithm for the FP number system to support fast training. The sign and exponent bits, which represent the scale of the data, are used to speculate whether the ReLU result will be zero. Given that multiplying FP numbers corresponds to an integer addition of their exponents, a dedicated speculation unit could be integrated on-chip with minimal resource overhead, and it helped the main convolution core exploit both input and output sparsity simultaneously in the FF stage.
The PNPU [26] of Fig. 6(b) proposed a new iterative pruning algorithm that utilized both coarse-level and fine-level pruning for efficient training acceleration. It approximated the similarity calculation among the output channels through random channel sampling. The PNPU designed a dedicated pruning core and constructed a pipeline structure between the main core and the pruning core to support iterative pruning without a PE utilization drop. In addition to the pruning unit, the main core was designed to support triple sparsity exploitation, and its hardware complexity could be reduced by utilizing coarse-zero skipping.
Pruning-aware training processors [26], [31] were efficient only when the pruning ratio became high enough. The low pruning ratio at the beginning of training disturbed pruning-aware acceleration, resulting in significant performance degradation. OmniDRL [30] of Fig. 6(c) solved this problem by utilizing block-circulant-based weight grouping together with the pruning of small weights in the initial training phase. The new algorithm, group-sparse training (GST), induced both repeated patterns and sparsity in the weights. The GST core (GSTC) suggested by Lee et al. [30] improved training performance by enabling both grouped weight reuse and zero-weight skipping through a weight router and prefetcher.

2) HARDWARE PERFORMANCE IMPROVEMENT THROUGH THE ALGORITHM COMBINATION
The algorithms introduced in Fig. 6 were newly proposed for energy-efficient DNN training acceleration, but they are effective only when their dedicated hardware exists. Unlike these examples, some ASICs obtained greater synergy by combining algorithms.
The first example was the LNPU [21] of Fig. 7(a), which combined an input-zero-skipping core with FGMP-based training. The required computation increases linearly when naïve FGMP is applied. However, the IA of each bit precision shows high input sparsity because of the zero padding. Computation involving zeros wastes many computing resources because zero padding does not affect the final computing results. The LNPU [21] adopted the input-sparsity exploitation core to skip zeros induced by ReLU or FGMP. It not only exploits data sparsity but also reduces the burden of high-bit-precision computing, obtaining a great synergy effect.
The second examples were ASICs [27], [29] that adopted bit-scalable core architectures. Bit precision can vary from layer to layer, and the bit-scalable core architecture can remove useless operations during low-bit-precision training, maximizing both training speed and efficiency. In addition, it enabled a ReLU prediction algorithm [27] by performing MSB-only computing. After the MSB-only prediction, it skipped the LSB computations whose results were expected to be zero. Output-slice skipping [29] was another algorithm that shows synergy with the bit-scalable architecture. It could compensate for the throughput degradation during high-precision computing by excluding LSB accumulations. Finally, it could be combined with the runtime bit-precision optimization algorithm, LAPS, as shown in Fig. 7(b). The HNPU-V1 [29] minimized the overhead of calculating the difference between the high-precision and low-precision convolution results by performing only LSB accumulations. The bit-scalable architecture used in [29] thus minimized the extra time required for bit-precision searching and, at the same time, optimized training with the discovered precision settings.
The last example is the DF-LNPU [34] of Fig. 7(c), which adopted a heterogeneous core design. A heterogeneous core design can optimize each training stage and minimize the hardware resources needed to support training, but it suffers from a core utilization drop. The DF-LNPU [34] adopted a heterogeneous core design but no longer suffered from low core utilization by utilizing PDFA-based BU. The PDFA-based optimization enabled parallel processing of the three different training stages. The combination with the BU algorithm compensated for the throughput degradation caused by backward locking while retaining the advantage of the heterogeneous core design.
The main challenges illustrated in Section III-B can be solved effectively when the algorithm is co-designed with the training hardware. These examples clarify that algorithm-hardware co-optimization plays an important role in realizing DNN training on edge devices, even with limited hardware resources.

C. RELATIONSHIP OF SPARSITY AND QUANTIZATION DURING DNN TRAINING
Sections V and VI introduced typical examples of sparsity exploitation and quantization methods for on-device DNN training. In this section, we discuss the relationship between these optimization techniques. We constructed an experiment training three different ResNets on the CIFAR-100 dataset. In this experiment, we pretrained the networks with FP32 and retrained them after weight pruning and DFXP-based quantization. During quantization, we modified the bit precision of the weight, IA, and OE but maintained a high accumulation bit-width. To analyze the relationship between pruning and quantization, we introduce a new graphical display, the pruning-quantization (PQ) plot, which visualizes the test accuracy under varying training conditions to help designers select the optimal pruning and quantization levels during on-device training. Fig. 8 shows three PQ plots obtained from our experiment. The green and blue boxes in Fig. 8 indicate acceptable conditions under two different accuracy thresholds. Within the accuracy boundary, the on-device training designer can choose the best condition based on the hardware characteristics of the training processor.
There are two key observations from the PQ plots, as summarized in Fig. 9. First, a DNN shows a low pruning ratio and high bit precision if the network capacity is small. Second, there is a tradeoff between pruning and quantization: if a designer selects a high pruning ratio, high bit precision must be used to maintain the original accuracy. The optimal design point can vary depending on whether the target processor supports zero-weight skipping or bit-reconfigurability.
Another minor observation concerns IA sparsity. Fig. 10 summarizes the layerwise average IA sparsity after pruning or quantization. As shown in this figure, IA sparsity increases with a higher pruning ratio or lower bit precision. However, the IA sparsity within the accuracy boundary (indicated by the green or blue box) shows small variance; thus, no additional throughput improvement through input-sparsity exploitation is expected when considering both pruning and quantization. Fig. 11 shows an example of a DNN training execution comparison. We compare the ideal performance of two DNN accelerators that support both weight skipping and bit-reconfigurability but have different bit granularity. The optimal design point of the coarse-grained bit-reconfigurable accelerator shows a higher pruning ratio than that of the fine-grained bit-reconfigurable accelerator (FGBRA). Nevertheless, the FGBRA shows faster training in this experiment even with the lower pruning ratio. To sum up, the PQ plot can help training processor designers select optimal design hyperparameters. Utilizing both pruning and quantization is the best way to improve efficiency, but lopsided optimization should be avoided.

D. DESIGN EXAMPLES: HNPU-V1 AND HNPU-V2
As illustrated in Section II, the design of the on-device training processor can vary according to the target training scenario. There were two typical processors: 1) HNPU-V1 [29] and 2) HNPU-V2 [80], which utilized the same amount of hardware resources but were optimized differently for their target applications. Their key characteristics are summarized, and the differences are highlighted, in Fig. 12. The HNPU-V1 mainly focused on long-term and medium-term DNN training; thus, it introduced automatic precision search during training to minimize the total training time and energy consumption. Moreover, it combined not only DFXP and SR but also ST to maintain a stable training curve even with the precision changes caused by LAPS. On the contrary, the HNPU-V2 targeted short-term DNN training, which performs online tuning for real-world environmental adaptation. It adopted a predetermined static bit precision during online DNN tuning and removed the ST-related circuits from its PE architecture. The higher bit-precision requirement of the HNPU-V2 due to input data uncertainty could disturb the realization of low-latency training. To resolve this problem, it adopted pruning-aware training and DFA-based BU to reduce training latency. Since it does not require ST logic, it further optimized the SR circuits by using a low-cost random number generator. In the same online tuning scenario, the HNPU-V2 can achieve 56.1% higher throughput and 66.5% higher energy efficiency than the HNPU-V1. Although the training method of the HNPU-V2 can be lossy for other training tasks, it successfully realized low-latency DNN tuning for fast environmental adaptation. In contrast, the HNPU-V1 can support any long-term or medium-term training without accuracy loss by changing the precision during training. Thanks to the precision search, it can minimize the required internal/external memory access compared with the HNPU-V2 while training most networks without loss. As shown in this example, the design methodology of a DNN training processor can vary according to the target application, which may require different optimization schemes.

E. DIFFERENCES BETWEEN INFERENCE AND TRAINING PROCESSORS
Table 6 compares the main features of inference and training processors. Conventional inference processors focused only on the fast and efficient execution of well-trained networks, without considering whether the training of the network becomes slow. In contrast, the training processor should achieve high performance in both inference and training without compromising training accuracy.

1) PE/CIRCUIT-LEVEL DIFFERENCES
The inference processor can take advantage of ultralow-bit quantization because the network can be retrained multiple times in advance to compensate for accuracy degradation. Ultralow-bit quantization enables look-up-table-based computing [3], [4], [10] for energy-efficient inference acceleration. Furthermore, an inference processor can utilize analog-domain computing [8] if retraining is repeated after adding noise components to a well-pretrained network. However, both ultralow-bit quantization and mixed-mode computing can cause accuracy degradation during DNN training and slow down the training curve. For this reason, the majority of training processors adopt precision-configurable MAC units that can also support high-bit-precision FP. BW binarization [34], [43] is sometimes used, but only in limited applications.

2) ARCHITECTURE-LEVEL DIFFERENCES
The training processor also differs in architecture-level features. Inference processors sometimes adopt fused-layer-based acceleration [12], [75] to remove the external memory access that appears during layer-by-layer computing. The fused layer is less effective in the training processor because all intermediate IAs and OEs should be stored and reloaded at the WG stage. Moreover, the training processor shows a more complex datapath architecture due to the transpose-read of weights and the WG operation. It utilizes additional features, such as the transposable PE array [20] and output-sparsity exploitation [24], [26], [27], [31], to compensate for the efficiency drop caused by the complex datapath.

3) ALGORITHM-LEVEL DIFFERENCES
Training processors also show different design methods from inference processors in algorithm-hardware co-optimization. Recent application-specific inference processors [75], [76] adopt DNN early stopping, which skips the computations of posterior layers if it judges them useless at the prior computation steps. Branch-net-based inference prediction [75] uses layerwise binary classifiers to determine whether the remaining layers should be computed. Similarity maps generated in a Siamese network indicate important spatial positions and can exclude the computation of useless tiles in the remaining layers [76]. Another feature is output ReLU prediction [24], [27], which predicts output zeros using only a minimal amount of precomputation. It helps the sparsity exploitation core skip both input and output sparsity during inference. However, these two features cannot be used during training because wrong predictions can disturb the training of the original backbone network. Instead, the training processor utilizes features that compensate for the drop in energy efficiency during inference. Runtime quantization [27], [29] and pruning automation [26] are unique features of training processors, even though they are useless to inference processors.

X. CONCLUSION
Current AI systems are smart only for the given applications, but what we expect from AI is not simply the predetermined, passive processing of routines; AI will cooperate with human beings through active interaction. The purpose of the DNN training processor is to provide an AI system with dynamic adaptability. However, designing a training processor is not easy because it has more challenges than an inference processor. Training processors should support the transpose-read of the weight during the EP stage. They need to exploit three different types of sparsity: 1) input; 2) weight; and 3) output sparsity. New bit-precision optimization methods are also required to realize training in micro-AI systems. Unlike inference, training needs to reduce the required storage for IAs and OEs for efficient execution on compact devices. Furthermore, the additional EP and WG stages should be finished as soon as possible to minimize the delay of the next inference. As analyzed in Section IX, the design of the training processor should be accompanied by algorithm-hardware co-optimization, and it has unique design methodologies compared with conventional inference processors.
In spite of the many realizations of DNN training, challenges remain. First, current training processors showed few optimization methods for the WG stage. The primal weight needs high bit precision to maintain training performance. Moreover, the WG stage involves memory-intensive computation because of the elementwise multiplication and large-kernel convolution. To relieve the memory burden of the WG stage, both new algorithmic approximations and new hardware architectures are needed. The second issue is backward locking. Even if the acceleration of each layer becomes much faster, backward locking will eventually be the major obstacle to fast training. BU should be studied with new algorithms to replace conventional BP-based training.
The majority of training processors adopted long-term personalization or customization as their target applications by training with personal datasets. However, in future AI research, online training will become mainstream. Han et al. [34], [80] already revealed that a DNN trained on general knowledge shows poor performance in unexpected situations. Moreover, in 6G wireless communication networks, online training will be essential because a pretrained network cannot respond to the nonstationary nature of real-world situations to find the optimal spectrum and base station [77]. Only online training can achieve such complicated real-time adaptation to realize a reliable 6G system. The application shift to online training will further highlight the necessity of ultrahigh-speed and ultralow-power DNN training processors and system-level optimization.
In conclusion, DNN training processors for micro-AI systems establish their own research area and show unique characteristics compared with conventional inference processors. To improve the intelligence of devices, research should be extended to the training processor. DNN training may not be the only answer for AI to reach human intelligence, but it will lead to the harmonious coexistence of AI and human beings.