A High-Accuracy Hardware-Efficient Multiply–Accumulate (MAC) Unit Based on Dual-Mode Truncation Error Compensation for CNNs

This paper presents a multiply–accumulate (MAC) unit that enables a dual-mode truncation error compensation (TEC) scheme based on a fixed-width Booth multiplier (FWBM) for convolutional neural network (CNN) inference operations. The proposed tailored TEC schemes of Modes 1 and 2 can achieve high MAC accuracy for a general or rectified linear unit-based CNN model with general (Mode 1) or positive/zero (Mode 2) input patterns. By pre-calculating the pre-known CNN model coefficients, the proposed dual-mode TEC scheme can be realized using minimal partial product operations with high hardware efficiency using a software–hardware codesign approach. Further, a reconfigurable architecture of the resultant MAC unit is presented to realize the proposed dual-mode TEC scheme. By evaluating the accuracy for 9-$N$ and 25-$N$ MAC operations ($N$ denotes the number of times MAC is performed), a MAC operation using the proposed TEC scheme can achieve the highest accuracy for Modes 1 and 2, relative to contrast samples that directly employ the FWBM with a conventional TEC function. The hardware performances of 9-$N$ and 25-$N$ MAC units are also evaluated using the TSMC 40-nm standard cell library. Compared with the contrast TEC-enabled designs, the proposed MAC unit exhibits higher hardware efficiency in terms of area, delay, and power consumption and achieves a minimum reduction of more than 40% in both area-delay-error and power-delay-error products. Moreover, the resultant 9-$N$ and 25-$N$ MAC units are verified using a system-on-chip field-programmable gate array platform to test a CNN model for handwritten digit classification.


I. INTRODUCTION
Convolutional neural networks (CNNs) are a popular class of deep learning models that have shown strong performance in many applications, such as image and signal processing, pattern recognition, and computer vision. With the development of edge artificial intelligence applications [1], many CNN inference functions are expected to be executed in client-side devices. Therefore, localized CNN processing schemes that can perform real-time inference at the edge end are in demand [1]-[3]. Generally, CNN operations can be conducted on a local platform using central processing units (CPUs) or graphics processing units (GPUs) [1], [2]. However, these approaches generally cause computational speed or power consumption issues. Thus, a CPU- or GPU-based platform may be unsuitable for timing-critical or computation-intensive edge CNN operations [3]. Consequently, considerable research has been conducted on hardware (HW) acceleration schemes using application-specific integrated circuits or field-programmable gate array (FPGA) units, which can perform efficient client-side CNN inference computations [3]-[8]. Two-dimensional (2D) convolution and inner product computations are the main operations of the convolution and fully connected layers in a CNN [1] and are performed using a series of multiply-accumulate (MAC) operations. Consequently, MAC units are essential components for constructing the processing elements (PEs) for CNN acceleration [9]-[14]. For fixed-point MAC operation, a suitable bit width can be determined based on the CNN classification accuracy [8], [15]. To avoid significant bit-width increases caused by accumulation in a MAC unit, one approach is to give the multiplication output the same bit width as the input in a CNN accelerator [4], [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.
Booth multipliers [16] are frequently used as the MAC multiplication schemes for CNNs [9]- [11]. For MAC units using Booth multipliers, the aforementioned approach can be employed using a fixed-width Booth multiplier (FWBM), which allows an L-bit × L-bit Booth multiplier to truncate the L least significant bits (LSBs) and preserve the L most significant bits (MSBs) at its output for a full numerical range. To reduce HW costs for the operations associated with the truncated L-bit LSBs of a full-width product, the concept of truncation error compensation (TEC) has been proposed for fixed-width multiplier designs [17]- [32]. TEC uses an estimated bias function to compensate for the truncation error associated with reduced HW costs.
In this study, we aim to design an FWBM-based MAC unit for CNN inference operations and propose the corresponding TEC schemes. The main features and contributions of the proposed design are as follows:
• A TEC scheme of Mode 1 provides an ensemble high-accuracy TEC function to improve the overall MAC operation accuracy with general-valued input patterns. Because CNN model coefficients are generally preknown, the proposed Mode 1 TEC scheme can be realized with minimal HW overheads.
• A TEC scheme of Mode 2 provides a high-accuracy TEC function adaptable to positive or zero-valued input patterns for MAC operations in CNNs using the rectified linear unit (ReLU) activation function. Based on the preknown CNN coefficients together with the MSB processing of partial products, the proposed Mode 2 scheme can be realized using HW resources similar to those used for the Mode 1 scheme.
• The efficient HW architecture of the MAC unit realizes the aforementioned TEC schemes of Modes 1 and 2. The proposed design can perform dual-mode TEC with high HW efficiency using a reconfigurable configuration of the partial product array.
• The design extension based on the proposed Mode 2 TEC scheme supports the MAC unit design for general applications that are not targeted only at CNN operations at the edge end.
The remainder of the paper is organized as follows. Section II introduces the background and literature review. Section III outlines the proposed TEC schemes of Modes 1 and 2 and their contributions. Section IV presents the architecture of the proposed MAC unit supporting the dual-mode TEC scheme. Section V evaluates the accuracy and HW performance of the proposed design. Finally, Section VI highlights the conclusions of this study.

II. BACKGROUND
A. MAC ACCELERATION IN CNNs
A general CNN model mainly comprises three types of layers, namely, convolution, pooling, and fully connected layers. A convolution layer primarily performs 2D convolution for feature extraction. For a set of input feature map ($F_{in}$) and kernel map ($K$), an output feature map ($F_{out}$) can be obtained using the 2D convolution operation, as described in (1), where $F_{in}$, $K$, and $F_{out}$ are 2D matrices. By rewriting (1) in a one-dimensional form, (1) can be transformed into a MAC operation on the $F_{in}$ and $K$ operands. Fully connected layers comprise several neurons that are linked to the layers. At each neuron node, input and coefficient vectors are set to perform the inner product calculation, which also corresponds to a MAC operation. Therefore, MAC operations are dominant in the CNN model.
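The mapping from 2D convolution to MAC operations described above can be illustrated with a minimal Python sketch. It is not the paper's equation (1): the function name `conv2d_as_mac`, the valid-mode sliding window, and the cross-correlation form (no kernel flip) are assumptions chosen for illustration. Each output pixel is obtained by flattening the kernel and the current window into two vectors and accumulating their products.

```python
def conv2d_as_mac(f_in, kernel):
    """Valid-mode 2D convolution written as one N-term MAC per output pixel.

    Assumption: cross-correlation form (kernel not flipped), as is common in CNNs.
    """
    kh, kw = len(kernel), len(kernel[0])
    # Flatten the kernel map into the coefficient vector W_k (N = kh * kw terms).
    w = [kernel[r][c] for r in range(kh) for c in range(kw)]
    f_out = []
    for i in range(len(f_in) - kh + 1):
        row = []
        for j in range(len(f_in[0]) - kw + 1):
            # Flatten the current window into the input vector A_k.
            a = [f_in[i + r][j + c] for r in range(kh) for c in range(kw)]
            row.append(sum(wk * ak for wk, ak in zip(w, a)))  # N-term MAC
        f_out.append(row)
    return f_out


f_in = [[1, 2, 3, 0],
        [4, 5, 6, 1],
        [7, 8, 9, 2],
        [0, 1, 2, 3]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]          # 3 x 3 kernel map, so N = 9
f_out = conv2d_as_mac(f_in, kernel)
print(f_out)                   # [[-6, 12], [-6, 8]]
```

With a 3 × 3 kernel map each output pixel costs nine multiply-accumulate steps, which is the 9-term case used as a running example in this paper.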
A training process based on a cost function and the backpropagation algorithm must be performed to determine the parameters of the specified CNN model [1]. Generally, the CNN is trained on a high-end CPU/GPU-based platform using floating-point operations [10], [11]. After training, the kernel and weight coefficients are determined for CNN inference applications. When an inference operation is executed using HW acceleration at the edge end, a device that uses a software (SW)-HW codesign scheme [33]-[35] is often employed, which is developed by evaluating the work division. For example, computation-demanding 2D convolution is often performed using an HW accelerator, while other low-effort CNN operations and system controls are processed at the SW end [3]-[5], [33]. In addition to 2D convolution, several studies have further explored HW acceleration for inner product operations in the fully connected layer [8], [36]. A system configuration for CNN inference acceleration based on the SW-HW codesign scheme is shown in Fig. 1, which comprises a host CPU, external memory, and accelerator. At the HW end, numerous PEs are used to perform CNN acceleration in fixed-point operations. These PEs contain MAC units that perform the operation in (2), where $W_k$ and $A_k$ are the coefficients (for kernel maps or weights) and feature map pixels, respectively, in fixed-point values (e.g., $L$ bits) and $N$ is the number of times MAC is performed. During acceleration, the predetermined $W_k$ values stored in the external memory are initially sent to the coefficient buffer (Fig. 1). Subsequently, the $A_k$ data are sent to the data buffer and operated on with $W_k$ using the MAC units. The associated data (including the MAC results) are transmitted through direct memory access (DMA).
Booth multipliers are popular because they reduce the number of operated partial products (PPs), thus yielding high HW efficiency [37], [38]. The Booth encoding method can be described as follows. In (2), the 2's complement of $A_k$ and $W_k$ in $L$ bits can be represented by (3). The superscript $(k)$ in (3) annotates the $k$th multiplication stage mapping to (2). In this study, the symbol $(k)$ has the same meaning for other formulas and is sometimes omitted in related descriptions for convenience. By applying the Booth encoding to $W_k$, $W_k$ can be further expressed in $d_j$ as (4), where $Q = L/2$ and the encoded $d_j$ values associated with $(w_{2j+1}, w_{2j}, w_{2j-1})$ are listed in Table 1. Each $d_j$ term can also be represented using three bits $(b_2, b_1, b_0)$ corresponding to (negative, two, one) information. Thus, a $2L$-bit full-width product (FWP) for $W_k \times A_k$ can be obtained using (5). Based on (5), $Q$ rows of PPs can be obtained, which are associated with each $d_j$. Using binary arithmetic for the multiplication of $d_j$ in (5), the PPs ($P_{i,j}$ and $n_j$; $i$ is from $L$ to 0) for each $d_j$ can be derived in terms of $a_i$, 0, or 1 as listed in Table 1, and $n_j$ is an LSB plus term for each row.
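As an illustration of the encoding step, the following Python sketch computes the radix-4 Booth digits of an $L$-bit two's-complement coefficient using the standard recoding rule $d_j = -2w_{2j+1} + w_{2j} + w_{2j-1}$ with $w_{-1} = 0$. The function name `booth_radix4_digits` is hypothetical, and the sketch only reproduces the digit values, not the $(b_2, b_1, b_0)$ bit assignment of Table 1.

```python
def booth_radix4_digits(w, L):
    """Return the Q = L/2 radix-4 Booth digits d_0..d_{Q-1} of an L-bit signed w."""
    assert L % 2 == 0 and -(1 << (L - 1)) <= w < (1 << (L - 1))
    digits, prev = [], 0                       # prev holds w_{2j-1}; w_{-1} = 0
    for j in range(L // 2):
        b0 = (w >> (2 * j)) & 1                # w_{2j}
        b1 = (w >> (2 * j + 1)) & 1            # w_{2j+1}
        digits.append(-2 * b1 + b0 + prev)     # each digit lies in {-2, -1, 0, 1, 2}
        prev = b1
    return digits


digits = booth_radix4_digits(-77, 8)
print(digits)                                          # [-1, 1, -1, -1]
print(sum(d * 4 ** j for j, d in enumerate(digits)))   # -77
```

The second printed value confirms that the digits reconstruct the coefficient as the weighted sum of $d_j \cdot 4^j$, so an 8-bit multiplier needs only four PP rows instead of eight.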
Based on (5) and the terms listed in Table 1, Fig. 2 shows the structure of the partial product (PP) array for the $L$-bit $W_k \times A_k$ FWBM. As shown in Fig. 2, the PP terms can be divided into two groups: main part (MP) and truncation part (TP). An $L$-bit FWBM preserves the $L$ MSBs and truncates the $L$ LSBs of the FWP. Corresponding to the PP array structure (Fig. 2), MP is operated to obtain the required $L$-bit outcome, while TP is related to the computation for the truncated $L$ LSBs of the FWP in the FWBM.

C. FWBM TEC SCHEMES (LITERATURE REVIEW)
A simple method for realizing an FWBM is to calculate all PPs in MP and TP and then approximate the full-width result to $L$ MSBs by rounding. Such an approach is called the post-truncation (PT) scheme, which can achieve high accuracy but leads to significant HW complexity. Another simple scheme is direct truncation (DT), which directly truncates the PP terms in TP; thus, only MP is calculated. The DT scheme reduces HW costs; however, its operation accuracy is very low. To address the PT and DT issues, numerous TEC-based methods have been proposed for FWBMs [21]-[32] or even fixed-width Baugh-Wooley array multipliers [17]-[20]. For a general FWBM design realizing the TEC scheme, Fig. 2 shows the general TEC operations. As shown in this figure, TP can be further divided into major ($TP_{major}$) and minor ($TP_{minor}$) sets. $TP_{major}$ has a more dominant influence on accuracy than $TP_{minor}$ with respect to the carry term from TP to MP. A variable $w$ (e.g., $w$ = 1, 2, or 3 shown in Fig. 2) indicates the column range of $TP_{major}$, which adjusts the effect on accuracy contributed by $TP_{major}$. For the derivation, the $w$th column of $TP_{major}$ can be mapped to the $2^{-w}$ digit; thus, the FWP result in (5) can be rewritten as (6). In TEC operations (ref. Fig. 2), $TP_{minor}$ in (6) is truncated and a bias term is estimated to compensate for the truncation error associated with $TP_{minor}$. Accordingly, an FWBM result (i.e., an $L$-bit quantized $FWP_q$ from FWP) can be obtained, as expressed in (7), where $B^{(k)}$ indicates the estimated bias value and $R\{\cdot\}$ denotes the rounding operation. When $N$ FWBMs with TEC are directly used to perform MAC computation in (2), a quantized MAC result in ($E_{acc} + L$) bits is obtained, as shown in (8). $E_{acc}$ is the number of extended bits owing to accumulation, which equals $\log_2(N)$ rounded up to the nearest integer.
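The difference between the PT and DT schemes and the accumulation growth $E_{acc}$ can be sketched in Python as follows. This is a simplified behavioural model under stated assumptions, not the bit-level PP array of Fig. 2: each Booth row is treated as the exact integer $d_j \cdot A \cdot 4^j$, DT is modelled by dropping the low $L$ bits of every row before the rows are added, and PT is modelled as round-half-up of the full product. All function names are hypothetical.

```python
def booth_rows(w, a, L):
    """Exact integer value d_j * a * 4**j of each radix-4 Booth row."""
    rows, prev = [], 0
    for j in range(L // 2):
        b0, b1 = (w >> (2 * j)) & 1, (w >> (2 * j + 1)) & 1
        rows.append((-2 * b1 + b0 + prev) * a * 4 ** j)
        prev = b1
    return rows


def fwbm_pt(w, a, L):
    """Post-truncation: form the full 2L-bit product, then round to the L MSBs."""
    return (w * a + (1 << (L - 1))) >> L


def fwbm_dt(w, a, L):
    """Direct truncation (simplified): drop the low L bits of every row, then add."""
    return sum(r >> L for r in booth_rows(w, a, L))


def e_acc(n_terms):
    """Extra accumulator bits for n_terms products: log2(n_terms) rounded up."""
    return (n_terms - 1).bit_length()


print(fwbm_pt(100, 100, 8), fwbm_dt(100, 100, 8))
print(e_acc(9), e_acc(25))      # 4 5
```

In this model the DT result never exceeds the floor of the exact product, which illustrates why truncating TP without compensation produces a one-sided error that a TEC bias is meant to cancel.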
Numerous FWBM designs with TEC schemes have been reported [21]-[32]. Generally, these studies obtained the bias value (i.e., $B^{(k)}$ in (7)) using computer simulation [21]-[25] and probability estimation [25]-[32] methods. Among the simulation-based methods [21]-[25], the bias terms in [21] were derived using simulation findings, followed by Karnaugh map simplification. Linear regression analysis and simulation were employed in [22] to generate the bias terms. Moreover, several studies ([23], [24]) have derived formulas for the TEC bias based on simulation results. In [24] and [25], the Booth-encoded results were utilized to determine the bias terms using a simulation to further improve accuracy. The simulation-based methods presented in [21]-[25] are practical but usually require exhaustive simulation time to determine the TEC bias. Instead of exhaustive simulation, the work in [26] used the expected PP value to derive the bias terms. The authors of [25] also presented a probabilistic analysis, together with their simulation-based works. Moreover, the probability estimation methods [27]-[32] derived closed forms of the bias function based on the expected value or conditional probability for PP terms. In [27], the expected values for two groups of $TP_{minor}$ (with or without the $n_0, \ldots, n_{Q-1}$ terms) are individually estimated and combined to obtain the probabilistic estimation bias when $w = 1$. Moreover, a generalized probabilistic estimation bias (GPEB) method [28] further enhanced the work in [27] for the cases of $w = 2$ and 3. Based on the GPEB methods ([27], [28]), a simple 1- or 2-bit constant bias function was derived. The author of [29] presented a bias estimation form using conditional probability based on nonzero Booth encoder outputs for each row of $TP_{minor}$. A more complex method based on [29] was presented in [30] using a conditional probability model for the $TP_{minor}$ rows that slightly improved the accuracy but increased HW overhead.
In [31], the authors combined two schemes based on conditional probability [29] and expected values [28] to develop an accuracy-area improved bias function using probability estimation and computer simulation (PACS). A Booth-encoded sign-digit-based conditional probability (BSCP) method was proposed in [32], which further introduced the sign of nonzero Booth encoder output to generate a relatively high-accuracy bias function; however, only the case of w = 1 was considered.

III. PROPOSED DUAL-MODE TEC SCHEME
Several studies have particularly considered the design of MAC units for CNN acceleration [9]-[14]; however, TEC functions were not considered. A convenient method is the direct use of conventional TEC-enabled FWBMs [21]-[32] to devise MAC units for CNNs. However, such an approach may not achieve optimized accuracy or HW efficiency because it does not consider the features of CNN inference operations. Therefore, this study proposes a tailored dual-mode TEC scheme to improve the accuracy with minimal HW costs for the targeted MAC unit design.
A. MODE 1 TEC SCHEME (GENERAL PATTERN)
Mode 1 operation concerns MAC computation for general values in a Gaussian or uniform distribution, which is common in many CNN models.

1) DERIVATION FOR MODE 1 TEC SCHEME
In Fig. 2, a closed form for the $TP_{major}$ computation associated with $w$ is derived in (9), wherein $\lfloor \cdot \rfloor$ is a floor operator. Apart from $TP_{major}$, the TP PP array contains the rows of $TP_{minor}$. For convenience, we denote the top and bottom rows of $TP_{minor}$ as the 0th and $\gamma$th rows, respectively (in the cases discussed later, $\gamma = Q-1$ for $w = 1$ and $\gamma = Q-2$ for $w = 2$ or 3). The value of the $j$th row of $TP_{minor}$ in the $k$th MAC multiplication stage can be calculated using (10). Based on the probabilistic analysis, an expected value can be assigned to each PP term (cf. [32]). However, for a specified $d_j$, several values of $P_{0,j}$ and $n_j$ can be exactly determined to be 0 or 1, as observed in Table 1. Therefore, we use a hybrid of probabilistic and deterministic values of $P_{i,j}$ and $n_j$ to calculate the expected value of $TP_{minor,j}$ using (10). By adopting this hybrid-value scheme, the PPs ($P_{0,j}$, $n_j$) and $E[TP_{minor,j}]$ (the expected value of $TP_{minor,j}$) according to the $d_j$ values are listed in Table 2.
The $E[TP_{minor,j}]$ values in Table 2 can be expressed using (11) and employed as a bias to compensate for the truncated PP terms in the $j$th row of $TP_{minor}$. By summing $E[TP_{minor,j}]$ over all $TP_{minor}$ rows, an overall $E[TP_{minor}]$ can be obtained in (12). For a MAC operation that uses TEC-enabled FWBMs, the formulation in (12) can generate a high-accuracy bias term for each FWBM stage. In terms of MP, $TP_{major}$, and $TP_{minor}$, the MAC operation in (2) can be rewritten as (13). The proposed TEC scheme (Mode 1) truncates the $N$ $TP_{minor}$ terms in (13) and uses their estimated value (expressed in (12)) as a substitute. Based on (13), the original ($E_{acc} + L$)-bit MAC result in (8) can be revised as (14), where a global bias (i.e., $B_{M1}$) is introduced in the TEC operation using (12). In practice, the $B_{M1}$ value in (14) requires a fractional precision of only $w$ digits using (15), in which a floor function $F_w\{\cdot\}$ is employed for truncating $B_{M1}$ to an integer part $X$ and a $w$-bit fractional part.

2) OPERATIONS WITH MODE 1 TEC
Compared with (7) and (8), which are based on the conventional TEC-enabled FWBM, $B_{M1}$ in (14) is a global bias that accumulates the TEC effect in (12) over all FWBMs, thus improving the MAC accuracy by considering the overall operation [39]. Moreover, $TP_{major}$ of all $N$ FWBMs is accumulated in (14); therefore, rounding noise arises only in the final stage.
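The idea of a presummed global bias with a single final rounding can be illustrated by the following Python sketch. It is an assumption-laden stand-in rather than the closed form of (11)-(15): the expected dropped value of each known coefficient is obtained here by exhaustively averaging over all $L$-bit inputs instead of using Table 2, each Booth row is modelled as an exact integer whose low $L - w$ bits are dropped, and all names (`expected_dropped`, `mac_q`, `mean_errors`) are hypothetical.

```python
import math
import random

L, W_COLS, N = 8, 1, 9                  # bit width, TP_major columns (w), MAC length
SHIFT = L - W_COLS                      # low bits dropped from every Booth row
LO, HI = -(1 << (L - 1)), 1 << (L - 1)  # signed L-bit input range


def booth_rows(w, a):
    rows, prev = [], 0
    for j in range(L // 2):
        b0, b1 = (w >> (2 * j)) & 1, (w >> (2 * j + 1)) & 1
        rows.append((-2 * b1 + b0 + prev) * a * 4 ** j)
        prev = b1
    return rows


def kept(w, a):
    """MP + TP_major part of one product, in units of 2**SHIFT."""
    return sum(r >> SHIFT for r in booth_rows(w, a))


def expected_dropped(w):
    """Average truncated value for a known coefficient w (exhaustive, offline)."""
    total = sum(w * a - (kept(w, a) << SHIFT) for a in range(LO, HI))
    return total / (HI - LO)


def mac_q(ws, xs, bias):
    acc = sum(kept(w, a) for w, a in zip(ws, xs)) << SHIFT
    return math.floor((acc + bias) / 2 ** L + 0.5)   # one rounding, final stage only


def mean_errors(trials=2000, seed=0):
    rng = random.Random(seed)
    ws = [rng.randrange(LO, HI) for _ in range(N)]   # stands in for preknown W_k
    bias = sum(expected_dropped(w) for w in ws)      # global bias, computed once
    plain = comp = 0.0
    for _ in range(trials):
        xs = [rng.randrange(LO, HI) for _ in range(N)]
        exact = sum(w * a for w, a in zip(ws, xs)) / 2 ** L
        plain += mac_q(ws, xs, 0.0) - exact
        comp += mac_q(ws, xs, bias) - exact
    return bias, plain / trials, comp / trials


bias, e_plain, e_comp = mean_errors()
print(bias, e_plain, e_comp)
```

Because the coefficients are fixed before inference, the bias is a constant that costs one addition per MAC result, which is the property the Mode 1 scheme exploits.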
Although $B_{M1}$ in (14) can provide effective global biasing,
The ReLU activation function rectifies its negative input to zero. Thus, MAC operations (shown in (2)) for a ReLU-based CNN model only have input patterns (i.e., $A_k$) that are either positive values or, in considerable numbers, zeros [4], [40]. Such a condition of zero-valued input patterns particularly occurs in image processing with many black pixels or in CNN applications with no pooling functions [41]. The operation of Mode 2 considers such MAC execution on positive/zero values.

1) DERIVATION FOR MODE 2 TEC SCHEME
In the Mode 2 condition, an FWBM with TEC in practice disables its bias function and outputs a zero product when its input pattern is zero to prevent unexpected accuracy degradation. This requirement makes the Mode 1 TEC scheme unsuitable for handling zero patterns in Mode 2 because $B_{M1}$ is a presummed global bias that assumes that all $N$ $E[TP_{minor}]$ values are employed in (14), which cannot be dynamically changed according to $A_k$ by the MAC unit HW. Accordingly, the Mode 2 TEC scheme is proposed to provide high-accuracy biasing for positive patterns while disabling biasing for zero patterns in each FWBM. Such a function can be directly offered using (12). The values in Table 2 can be approximated, as shown in (16), based on the probability distribution for nonzero $d_j$ in Table 1. Consequently, the original $E[TP_{minor}]$ value in (12) can be approximated using (17), where $\delta_j$ indicates whether $d_j$ is a nonzero term (i.e., $\delta_j = 1$ for nonzero $d_j$), which can be mapped to the exclusive-OR operation of the $(b_1, b_0)$ bits in Table 1. In (17), the TEC bias is only computed at the $2^{-w-1}$ digit for the $\delta_0$ to $\delta_\gamma$ terms.
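A minimal Python sketch of this dynamic bias is given below. It follows only the verbal description above, so several points are assumptions rather than the paper's equation (17): the bias is taken as the count of nonzero Booth digits in rows 0 to $\gamma$ weighted by $2^{-w-1}$, $\gamma$ is inferred from the cases $w$ = 1, 2, and 3 discussed in this section, the (two, one) bits are derived from $|d_j|$, and the function names are hypothetical.

```python
def booth_digits(w, L):
    digits, prev = [], 0
    for j in range(L // 2):
        b0, b1 = (w >> (2 * j)) & 1, (w >> (2 * j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def delta_flags(w, L):
    """delta_j = 1 for a nonzero Booth digit, formed as XOR of the (two, one) bits."""
    flags = []
    for d in booth_digits(w, L):
        one, two = int(abs(d) == 1), int(abs(d) == 2)   # assumed meaning of (b_0, b_1)
        flags.append(one ^ two)                         # never both set, so XOR == OR
    return flags


def mode2_bias(w_coef, a, L, w_cols):
    """Per-multiplier bias: disabled for a zero input, else a count at 2**(-w-1)."""
    if a == 0:
        return 0.0
    gamma = L // 2 - 1 - w_cols // 2        # inferred bottom row index of TP_minor
    return sum(delta_flags(w_coef, L)[: gamma + 1]) * 2.0 ** (-w_cols - 1)


print(delta_flags(-77, 8))          # [1, 1, 1, 1]
print(mode2_bias(-77, 5, 8, 1))     # 1.0
print(mode2_bias(-77, 0, 8, 1))     # 0.0
print(mode2_bias(16, 5, 8, 2))      # 0.125
```

Unlike a presummed constant, this value depends on the input of each multiplier, so it can be switched off whenever the input is zero.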
A more efficient method of bias generation is to use deterministic information instead of $\delta_\gamma$ for the bottom row ($j = \gamma$). As shown in Fig. 2 with $w = 1$, $P_{0,Q-1}$ and $n_{Q-1}$ can be employed to replace $\delta_{Q-1}$ (where $\gamma = Q-1$) at the $2^{-2}$ digit. For $w = 2$, $\delta_{Q-2}$ ($\gamma = Q-2$) at the $2^{-3}$ digit can be substituted by $P_{1,Q-2}$ together with a $C_{out}$ term that is carried from the $2^{-4}$-digit addition of $P_{0,Q-2}$ and $n_{Q-2}$. Similarly, for $w = 3$, $P_{0,Q-2}$ and $n_{Q-2}$ can substitute $\delta_{Q-2}$ at the $2^{-4}$ digit. Consequently, the bias function in (17) can be modified to (18), (19), and (20) for an FWBM with $w$ = 1, 2, and 3, respectively, where an additional subscript "M2" emphasizes the updated $E[TP_{minor}]$ value for Mode 2. Based on the derivations in (18)-(20), a general bias function for the FWBM can be obtained as (21) for odd or even $w$. Thus, the MAC calculation in (14) for Mode 1 is updated to (22) for Mode 2, where an accumulated bias (i.e., $B_{M2}$) in (21) is applied to the TEC function. As shown in (22), the individual $E[TP_{minor,M2}]$ values are accumulated across the $N$ FWBMs to enhance the overall TEC effect. Compared with Mode 1, the Mode 2 TEC scheme combines the probabilistic $\delta_0$ to $\delta_{\gamma-1}$ values with the deterministic ($P_{0,\gamma}$, $P_{1,\gamma}$, $n_\gamma$, and $C_{out}$) values in a different way to improve accuracy. Furthermore, $B_{M2}$ is a dynamic bias because the outcome in (21) can be adapted to the input data of each FWBM or optionally set to zero.

2) OPERATIONS WITH MODE 2 TEC
The MSB process for Mode 2 can be further improved using positive/zero-only input patterns. Because the sign bit of positive or zero patterns is always zero, all MSB PPs ($P_{L,j}$) in an $L$-bit FWBM can be exactly determined using the sign of $d_j$ (i.e., "1" for positive $d_j$; "0" for negative $d_j$) and are denoted as $S_j$ (Table 1). Similar to the principle of $B_{M1}$ generation (Mode 1), these $S_j$ terms can be obtained in advance because the $W_k$ coefficients are preknown. Therefore, the signed presummation for MSB "1" (Fig. 3) can be extended to be merged with the predecided $S_j$, which generates an upgraded $PS_{M2}$. Based on (18)
In addition to positive patterns, MAC operations in Mode 2 are also characterized by zero input data (i.e., $A_k = 0$). In this situation, the FWBM is expected to generate a zero product for the zero-valued input. For a TEC-enabled FWBM, a practical approach is to set its MP, $TP_{major}$, and bias terms to zero or to multiplex its product with zero. However, such an approach cannot be directly applied to the proposed design because the presummed MSB "1" and $S_j$ of the FWBM already affect the MAC result in our design while all other portions (including residual MP, $TP_{major}$, and bias) are set to zero. This issue can be resolved by optionally adding "1" to each row of PPs according to the sign of $d_j$ in the FWBM.
As an example, we use an 8-bit $W_k \times 0$ operation with $w = 1$, in which the Booth decoder result of $W_k$ is $(d_3, d_2, d_1, d_0)$ and the signs of $d_0$ to $d_3$ are (−, −, +, −). Fig. 6(a) shows the generated PP array based on Fig. 5(a) and Table 1 for this example. As shown in Fig. 6(a), if $TP_{minor}$ is truncated, the expected zero product can be obtained only when the "1" carried from each negative-$d_j$ row of $TP_{minor}$ is added to the MP, $TP_{major}$, and presummation portions. Using binary arithmetic, the PP array in Fig. 6(a) can be modified to Fig. 6(b) by adding "1" at the digit next to the presummed MSBs and setting the other PPs to zero. For all possible conditions of $d_0$ to $d_3$, a generally modified PP array using the same addition-of-1 scheme can be deduced by analogy, as shown in Fig. 6(c), where the variable $z_j$ is defined by $z_j = 1$ for $d_j < 0$ and $z_j = 0$ for $d_j \geq 0$.
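The arithmetic behind the addition-of-1 scheme can be checked with a short Python sketch. It is a simplified integer model and not the PP arrays of Fig. 6: for a zero input, a negative Booth digit is assumed to produce an all-ones row (the one's complement of zero, sign-extended) whose correcting $n_j$ term lies in the truncated columns. The coefficient −77 is only an illustrative choice whose digits happen to have signs (−, −, +, −) when listed as $(d_3, d_2, d_1, d_0)$, and the function name is hypothetical.

```python
def zero_input_rows(w_coef, L=8, w_cols=1):
    """Kept part of every PP row and the z_j flags for the input A_k = 0."""
    shift = L - w_cols                    # columns below this position are truncated
    kept, z, prev = [], [], 0
    for j in range(L // 2):
        b0, b1 = (w_coef >> (2 * j)) & 1, (w_coef >> (2 * j + 1)) & 1
        d = -2 * b1 + b0 + prev
        prev = b1
        # Negative digit: all-ones row starting at bit 2j, i.e. the value -(4**j).
        # Its n_j = 1 at the row LSB is truncated together with TP_minor.
        row_bits = -(4 ** j) if d < 0 else 0
        kept.append(row_bits >> shift)    # what survives in MP + TP_major
        z.append(1 if d < 0 else 0)       # z_j = 1 for d_j < 0, else 0
    return kept, z


kept, z = zero_input_rows(-77)
print(kept)                   # [-1, 0, -1, -1]
print(z)                      # [1, 0, 1, 1]
print(sum(kept) + sum(z))     # 0
```

Each negative row leaves a residue of −1 at the lowest kept digit once its low bits are dropped, and adding the matching $z_j$ restores the expected zero product.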

IV. PROPOSED MAC UNIT WITH DUAL-MODE TEC
As mentioned in the previous section (III/B/2), the required operations for Modes 1 and 2 ($A_k \neq 0$) are nearly the same. Moreover, the operations for Mode 2 ($A_k = 0$) can be performed in the original PP array for Mode 2 ($A_k \neq 0$) using the inserted $z_j$ and sufficient "0" terms (Fig. 6(c)). Therefore, this study aims to develop a MAC unit that employs a reconfigurable structure to enable multiple operations for the proposed dual-mode TEC scheme.

A. RECONFIGURABLE ARCHITECTURE
In CNN convolution layers, the dimension of kernel map $K$ in (1) is generally $k_d \times k_d$, where $k_d$ is an odd integer. Among them, 3 × 3 or 5 × 5 kernel maps are usually employed for many CNN applications. By considering the parameters in (1) and (2), the $N$ values in the MAC operation mapped to the 3 × 3 and 5 × 5 dimensions are 9 and 25, respectively. Taking the 9-$N$ 8-bit MAC operation for 2D convolution as an example, Fig. 7(a) presents the overall architecture of the corresponding MAC unit using the proposed dual-mode TEC scheme with $w = 1$. As shown in this figure, the preknown $W_k$ is first sent to the coefficient buffer and the precalculated $B_{M1} + PS_{M1}$ (for Mode 1) or $PS_{M2}$ (for Mode 2) is initially set for the calculation. $A_k$ data are prepared depending on Mode 1 or 2. For Mode 1, the 8-bit $A_k$ is the original $(a_7, a_6, \ldots, a_1, a_0)$ digits, while for Mode 2, the form $(a_6, a_5, \ldots, a_0, 0)$ is assigned by the 1-bit shift operation (i.e., the "≪1" symbol), as shown in Figs. 5(a)-5(b). For each iteration of MAC operations, three-term $A_k$ data are sequentially sent to the data buffer and processed by the P.P. generator (Fig. 7(a)), which generates the PP terms (including $pp_{i,j}$, $n_3$, $\delta_j$, and $z_j$) according to the Booth encoder result of $W_k$ and mode control. The generated PP terms are then sent to nine FWBMs to perform the 9-$N$ MAC operation with the selected TEC of Mode 1 or 2. The PP array for each FWBM is structured using a carry-save adder (CSA) tree, followed by a carry-propagation adder (CPA) summation. The final MAC result is obtained by accumulating the nine FWBM results and optionally adding $B_{M1}$, $PS_{M1}$, or $PS_{M2}$ based on Mode 1 or 2. Fig. 7(b) shows the actual operations performed in one of the nine ($k$th) FWBMs for Modes 1, 2 ($A_k \neq 0$), and 2 ($A_k = 0$), which correspond to the operations shown in Figs. 3, 5(a), and 6(c), respectively. As shown in Fig. 7(b), multimode operations are required for the proposed TEC scheme as follows. In Mode 1, all $pp_{8-1,0-3}$ terms from the P.P. generator (Fig. 7(a)) are directly mapped to the PP terms associated with $P_{8-1,0-3}$, as arrayed in Fig. 3. In the Mode 2 ($A_k \neq 0$) condition, because the 1-bit-shifted $A_k$ has been assigned (Fig. 7(a)), the $pp_{7-1,0-3}$ terms are alternatively mapped to $P_{6-0,0-3}$ at the corresponding digit in the PP array in Fig. 5(a). However, the $pp_{8,0-3}$ values should be adjusted in the P.P. generator to obtain the expected $P_{7,0-3}$ and $P_{8,0}$ terms in Fig. 5(a) using the mapped results in Table 1. Thus, we use a dashed line to frame $pp_{8,0-3}$ in Fig. 7(b) to highlight this process. Moreover, the operation terms of the last column in the PP array must be selectable ($pp_{7/5/3,0-2}$ or $\delta_{0-2}$/$n_3$) for the Mode 1 or 2 ($A_k \neq 0$) conditions, which are mapped to the contents in Figs. 3 and 5(a). In the Mode 2 ($A_k = 0$) condition, the P.P. generator is set to assign "0" to all $pp_{i,j}$, $n_3$, and $\delta_j$ terms (the region included in the red-line frame in Fig. 7(b)) and to generate $z_{0-3}$ for optional addition; $z_{0-3}$ is the previously defined parameter for Mode 2 ($A_k = 0$), which equals 1 (for $d_j < 0$) or 0 (for $d_j \geq 0$). Therefore, the contents of the PP array associated with $z_{0-3}$ operations must be optionally adjusted in the Mode 2 ($A_k = 0$) condition.
To enable the multimode operations using the aforementioned approach, Fig. 7(c) shows the resultant FWBM configuration, in which the CSA tree and CPA are implemented using full adders (FA) and half adders (HA). In Fig. 7(c), type-1 multiplexers (mux1) are used to select $pp_{7/5/3,0-2}$ or $\delta_{0-2}$/$n_3$ data for the Mode 1 or 2 ($A_k \neq 0$) conditions. Type-2 multiplexers (mux2) are responsible for the optional addition of $z_{0-3}$, which can be merged with the existing PP array, incurring no extra HW cost for the $z_j$ addition because sufficient PPs are set to "0" in the Mode 2 ($A_k = 0$) condition. In our design, PPs can be practically set to zero using control logic to set $(b_2, b_1, b_0)$ in Table 1 to (0, 0, 0) in the Booth encoder. Based on the 1-bit shift in the numerical range of Mode 1 or 2, the 9-bit FWBM output (including the $w$ digit) can be mapped to $o_{15}$-$o_7$ or $o_{14}$-$o_6$, as shown in Fig. 7(c). In Fig. 7(b), multimode operations cause variations in the number of PPs ($N_{pp}$) of the last column (i.e., $w = 1$) of a PP array, which is 4 or 5 for Mode 1 or 2, respectively. Thus, the associated PP array portion (Fig. 7(c)) is structured to address the $N_{pp}$ difference of 4 and 5 using mux1. For $w$ = 2 or 3, the FWBM configuration varies with the contents of multimode operations and $N_{pp}$ of the last $w$ columns. Multimode operations and the PP array configuration for the $k$th FWBM in a 9-$N$ 8-bit MAC unit with $w$ = 2 and 3 are illustrated in Figs. 8 and 9, respectively, where the addition of $n_2$ or $n_3$ at the correct digit is also considered. As shown in Fig. 8, the $N_{pp}$ counts of the last two columns (i.e., $w = 2$) (denoted as ($N_{pp,2}$ & $N_{pp,1}$)) are (4 & 5) and (5 & 4) for Modes 1 and 2, respectively. In Fig. 9, the corresponding $N_{pp}$ counts of the last three columns (i.e., $w = 3$) for Modes 1 and 2 are (4 & 5 & 3) and (5 & 3 & 4), respectively.
For the operations involving multiple $N_{pp}$ values, the proposed FWBM structure for $w$ = 2 and 3 effectively uses mux1 to enable data selection and resource sharing of FAs, as shown in Figs. 8 and 9. The FWBM design for an extended $w$ or bit width $L$ can be deduced by a similar process. Because the $w$ bits of each FWBM are included for accumulation in our design, the resultant MAC unit can potentially be devised beyond the FWBM limitations (i.e., employing ($L + w$) bits at the multiplier output) through the selection of $w$. Moreover, it supports a selectable bit width range in the ($E_{acc} + L + w$)-bit MAC output when a dynamic fixed-point scheme is applied to the CNN accelerator design [42], [43]. Using the presented FWBM design (e.g., Figs. 7(c), 8, and 9), a reconfigurable 9-$N$ 8-bit MAC unit can be devised, which can support the proposed dual-mode TEC scheme with few HW overheads using control logic and multiplexers in the PP array, thus yielding high HW efficiency.

B. DESIGN EXTENSION FOR GENERAL-PURPOSE MAC UNIT
In addition to the targeted CNN applications, we can apply the proposed Mode 2 TEC scheme to the design of a general-purpose MAC unit. For such a MAC unit design, the biasing method based on precalculation (e.g., $B_{M1}$ in Mode 1) is not feasible because both MAC input patterns (i.e., $W_k$ and $A_k$) are indeterminate. Moreover, the presummation method involving the zero MSB for positive/zero patterns (e.g., $PS_{M2}$ in Mode 2) is not applicable owing to the random values of the $W_k$ and $A_k$ data. Consequently, we can use (21)
Fig. 10 shows the operation and PP array for the $k$th FWBM. As shown in this figure, only the 3-row MSB "1" terms of the nine FWBMs are presummed to reduce the related HW costs of adding 1. In addition to the operation for $TP_{major}$, an additional calculation for $\delta_{0-2}$, $P_{0,3}$, and $n_3$ at the $2^{-2}$ digit is required for TEC biasing. Compared with Figs. 7(b) and 7(c), the operation in Fig. 10 has only one mode and the PP array is a direct operation-mapped configuration with extra FAs for bias terms; however, the corresponding configuration can avoid extra multiplexing for resource sharing.

V. PERFORMANCE COMPARISONS AND EXPERIMENTS
A. ACCURACY PERFORMANCE COMPARISONS
Considering 2D convolution with 3 × 3 or 5 × 5 kernel maps as two MAC operation samples (i.e., 9-$N$ or 25-$N$), the accuracy performance of the proposed dual-mode TEC scheme is evaluated for Modes 1 and 2 in the following subsections.

1) EVALUATION RELATED TO MODE 1
For accuracy evaluation, the signal-to-noise ratio (SNR) is the most important parameter. In the targeted MAC operation, SNR is defined by (23), where MAC is the original full-width MAC result obtained by (2) for given L-bit (W_k, A_k), and MAC_q is the quantized MAC outcome with TEC generated using (8), (14), or (22). Two other parameters, namely the mean error (ē) and the mean absolute error (|ē|), both defined in (24), are also employed. In (23) and (24), E[·] denotes the averaged value. MAC_q obtained using the DT and PT methods (specified in (25)) is also considered for a complete comparison covering the lowest (DT) and highest (PT) operation accuracy.
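The three metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the SNR form follows the MSE term E[(MAC − MAC_q)^2] that the paper associates with (23), the sign convention of the error (MAC − MAC_q) is an assumption, and all function names are hypothetical. The per-product floor and round-to-nearest quantizers are generic stand-ins used only to exercise the metrics; they are not the paper's DT, PT, or TEC schemes.

```python
import math
import random

def snr_db(exact, approx):
    """SNR in dB: 10*log10(E[MAC^2] / E[(MAC - MAC_q)^2])."""
    n = len(exact)
    signal = sum(m * m for m in exact) / n
    noise = sum((m - q) ** 2 for m, q in zip(exact, approx)) / n
    return 10.0 * math.log10(signal / noise)

def mean_error(exact, approx):
    """E[MAC - MAC_q]; the sign convention is an assumption."""
    return sum(m - q for m, q in zip(exact, approx)) / len(exact)

def mean_abs_error(exact, approx):
    """E[|MAC - MAC_q|]."""
    return sum(abs(m - q) for m, q in zip(exact, approx)) / len(exact)

def simulate(n_mac=9, bits=8, w=1, trials=2000, seed=0):
    """Monte Carlo over random signed operands (hypothetical test setup)."""
    rng = random.Random(seed)
    drop = bits - w  # low-order bits removed from each 2*bits-wide product
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    exact, floored, rounded = [], [], []
    for _ in range(trials):
        prods = [rng.randint(lo, hi) * rng.randint(lo, hi) for _ in range(n_mac)]
        exact.append(sum(prods))
        # Stand-in for a truncating multiplier: floor each product.
        floored.append(sum((p >> drop) << drop for p in prods))
        # Stand-in for a rounding multiplier: round each product to nearest.
        half = 1 << (drop - 1)
        rounded.append(sum(((p + half) >> drop) << drop for p in prods))
    return exact, floored, rounded

exact, floored, rounded = simulate()
for name, approx in (("floor", floored), ("round", rounded)):
    print(f"{name}: SNR = {snr_db(exact, approx):.1f} dB, "
          f"mean error = {mean_error(exact, approx):.1f}, "
          f"mean |error| = {mean_abs_error(exact, approx):.1f}")
```

Flooring every product leaves a positive mean error that adds up over the N products, whereas rounding keeps the mean error near zero, so the rounded variant reports the higher SNR.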
To evaluate the accuracy of our design, we compared the SNR of 9-N and 25-N MAC operations that either use the proposed Mode 1 TEC scheme or directly use conventional TEC-enabled FWBMs. As comparison samples, we selected FWBM designs whose TEC function is based on probability estimation methods ([25]-[32]), namely the GPEB [28], PACS [31], and BSCP [32] methods. The GPEB scheme [28] (an extension of [26] and [27]) generates the TEC bias as constant-valued 1-bit or 2-bit digits. The PACS work [31], an enhanced design of [29] and [30], accumulates the number of non-zero d_j terms at the 2^{−w−1} digit as the TEC bias. PACS-like bias functions were also derived in [25] and [24] (a simulation-based work), but only for w = 1, and their accuracy is lower than or close to that of PACS [31]. The BSCP scheme [32] derived a bias function ((19) in [32]) that improves the accuracy by considering the magnitude and sign of the non-zero d_j terms. However, [32] only included the case of w = 1, and the accuracy decreased when the modified bias function ((20) in [32]) was employed for practical bias generation. In this study, the evaluation data for the BSCP method (in Tables 3−9) were obtained using the high-accuracy bias formulation of [32] ((19) in [32]). Tables 3 and 4 present the SNR results for the 9-N and 25-N MAC operations, respectively, using the aforementioned TEC schemes. As shown in Tables 3 and 4, the proposed Mode 1 TEC scheme (i.e., the TEC-M1 item) achieves the best SNR among all methods except PT. The comparison results demonstrate that our Mode 1 TEC scheme further improves the operation accuracy by using hybrid probabilistic and deterministic PP values and by taking the overall MAC operation into account in the design.
Tables 5 and 6 list the mean error and mean absolute error values, respectively, for the 9-N and 25-N MAC operations using various TEC schemes. Different TEC methods generally produce different mean errors in an FWBM. In an N-MAC operation, however, the N mean errors accumulate and produce a significant overall impairment. As shown in Tables 3−5, this mean error usually degrades the SNR as N increases for most TEC methods, with DT being the worst case. In contrast, the proposed Mode 1 TEC scheme outperforms all listed TEC-based methods in minimizing the mean error. Moreover, the Mode 1 TEC scheme achieves a higher SNR with a larger N relative to other TEC works because the rounding noise at the final stage of our design generally has a stable magnitude, whereas the signal power increases with N. The leading performance of the Mode 1 TEC scheme with respect to the mean absolute error is also shown in Table 6, which is consistent with the SNR results listed in Tables 3 and 4.
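The dependence of SNR on N described above can be illustrated with a simple error model. This is a sketch under assumptions that the paper does not state: the per-FWBM errors e_k are independent with common mean \mu_e and variance \sigma_e^2, and the exact products are independent and zero-mean with variance \sigma_p^2. The symbols \mu_e, \sigma_e, \sigma_p, and \sigma_r are introduced here for illustration only.

```latex
\begin{align*}
E\Big[\Big(\sum_{k=1}^{N} e_k\Big)^{2}\Big] &= N\sigma_e^{2} + N^{2}\mu_e^{2}, \\
E\big[\mathrm{MAC}^{2}\big] &= N\sigma_p^{2}, \\
\mathrm{SNR} &= \frac{N\sigma_p^{2}}{N\sigma_e^{2} + N^{2}\mu_e^{2}}
             = \frac{\sigma_p^{2}}{\sigma_e^{2} + N\mu_e^{2}}.
\end{align*}
```

Under this model, any non-zero mean error makes the SNR fall as N grows, while \mu_e = 0 makes it independent of N. If the dominant noise is instead a single final rounding with fixed power \sigma_r^2, then SNR \approx N\sigma_p^2 / \sigma_r^2, which rises with N, in line with the behavior reported for the Mode 1 TEC scheme.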

2) EVALUATION RELATED TO MODE 2
As discussed in Section III-B, when an FWBM has zero-valued input data in Mode 2, directly using the Mode 1 TEC scheme can induce unwanted biasing in the FWBM. Table 7 presents the SNR degradation when the positive/zero patterns of Mode 2 are applied to the 9/25-N MAC operation using the Mode 1 TEC scheme with w = 1. In Table 7, TEC-M1 (original) corresponds to the original results of Mode 1 TEC with w = 1 given in Tables 3 and 4, and TEC-M1 (Mode 2) indicates the condition in which the original random input patterns of the Mode 1 TEC design are replaced by positive or zero data, each assumed with probability 1/2, to match the feature of Mode 2. As shown in Table 7, the SNR decreases in the TEC-M1 (Mode 2) condition, and the resulting SNR values are inferior to many of those of the PACS or BSCP methods listed in Tables 3 and 4. Therefore, we used the proposed Mode 2 TEC scheme instead to reevaluate the accuracy for the positive/zero input patterns of Mode 2. Tables 8 and 9 present the SNR values of the 9-N and 25-N MAC operations, respectively, for various TEC schemes when the inputs are positive/zero data with probability 1/2. In Tables 8 and 9, the contrast GPEB, PACS, and BSCP schemes are assumed to possess an additional function that detects zero input patterns and disables biasing when they occur. As shown in Tables 8 and 9, the proposed Mode 2 TEC scheme (i.e., TEC-M2) outperforms all listed methods except PT in the SNR results. Using the upgraded bias functions ((21) and (22)), which also consider hybrid probabilistic and deterministic digit values, the Mode 2 TEC scheme achieves an SNR similar to that of the Mode 1 TEC scheme shown in Tables 3 and 4.

B. HARDWARE PERFORMANCE COMPARISONS
Tables 10 and 11 compare the HW performance of the 9-N and 25-N MAC units, respectively, in terms of area, power consumption, and critical-path delay for various TEC schemes. The HW parameters were generated by logic synthesis using the Synopsys Design Compiler tool with the TSMC 40-nm typical standard cell library, and the TEC-Dual items represent the proposed work using reconfigurable structures. In our evaluation, data and coefficient buffers are excluded (see Fig. 7(a)); we redesigned the MAC units for the contrast TEC schemes directly from the published works and reran logic synthesis to obtain their HW performance data. To ensure a fair comparison, we added multiplexers and control logic to enable bias-disable functions in the contrast designs so that an FWBM produces a zero-valued output when zero input patterns are detected. Thus, the HW behavior of all designs in Tables 10 and 11 matches the accuracy results shown in Tables 8 and 9. Moreover, according to [24], [32], and [44], the bias function for [32] in Tables 10 and 11 was structured using a general sorting circuit based on [24] to avoid the addition of negative digit values. The comparison results show that the non-TEC-based PT and DT methods require the highest and lowest HW costs, respectively, which is consistent with their accuracy. The main HW feature of the proposed design is that it performs dual-mode TEC operations using the minimal CSA elements of a PP array in cooperation with additional accumulation resources (i.e., addition for B_{M1}, PS_{M1}/PS_{M2}, and cross-FWBM TP_{major}). Moreover, our MAC unit can exclude the individual output-rounding stage in each FWBM and merge the final rounding operation into the B_{M1} accumulation. However, to enable reconfigurable operations, extra data-selection multiplexers (for z_j and δ_j) and control logic are required in our design.
For the TEC-enabled schemes in Tables 10 and 11, the proposed MAC unit has a smaller area, delay, and power consumption than most PACS [31] and BSCP [32] conditions because one more column of CSA addition or an extra sorting circuit is required in each FWBM for the PACS or BSCP designs, respectively. In comparison, the GPEB [28] scheme achieves better HW performance owing to its simple constant 1- or 2-bit bias; however, the accuracy of GPEB (Tables 3 and 4) is lower than that of the proposed scheme. Two design metrics, namely the area-delay-error product (ADEP) and the power-delay-error product (PDEP), defined in (26), can be employed to evaluate the overall design efficiency [32], where MSE is the mean square error (i.e., the E[(MAC − MAC_q)^2] term in (23)). Small ADEP or PDEP values indicate good design efficiency considering both HW and accuracy performance. For the TEC-enabled designs (i.e., GPEB, PACS, and BSCP), Fig. 11 shows the percentage decrease in the ADEP and PDEP results of the 9-N MAC unit using the proposed dual-mode TEC scheme relative to other TEC methods for various L-bit operands. The same evaluation for the 25-N MAC unit is shown in Fig. 12. In Figs. 11 and 12, the Ar_(G), Ar_(P), and Ar_(B) items (shown in dark colors) represent the ADEP reductions achieved using our design associated with the GPEB(G), PACS(P), and BSCP(B) methods,

C. DESIGN IMPLEMENTATION AND EXPERIMENTS
To verify the proposed design, we implemented 9-N and 25-N MAC units for 3 × 3 and 5 × 5 2D convolution acceleration, respectively, on an SW-HW codesign platform using the Xilinx Zynq-7000 system-on-chip (SoC) FPGA device. An experiment was performed to demonstrate handwritten digit classification using a simplified LeNet-5 [45] CNN model. The experimental CNN comprises two convolution and maximum pooling layers; the model structure is summarized in Table 12. For verification, two scenarios were set for mode selection. In Scenario 1, the input feature map was the original handwritten digit pattern, which only had positive/zero pixel values, and an ReLU function was employed after performing 2D convolution. Accordingly, the 9-N and 25-N MAC units were configured for Mode 2 operations in this scenario. In Scenario 2, we deliberately added noise to the input data and used the "tanh" function instead of ReLU to generate nonzero patterns in the convolution layer. Therefore, Scenario 2 allowed the 9-N and 25-N MAC units to operate in Mode 1. The HW and SW responsibilities in our experiment were divided as follows. The HW side was responsible for 2D convolution, while the SW side performed the remaining operations, including the summation of the convolution results, addition of an offset, activation function, maximum pooling, FC execution, and system control. The development flow of our experiment includes the training and generation of CNN models, HW design of the MAC units, and SW-HW codesign and experiment on the FPGA, as described in Fig. 13. In a typical CNN inference execution, a low operation precision (e.g., a bit width of 8 or even fewer) is sufficient to preserve the classification accuracy [12], [15]. However, to cover the diverse precision required by different applications, a sufficiently high bit width (e.g., 16 bits) is considered in several CNN accelerator designs [3]-[5].
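The HW/SW partition described above can be sketched as follows. This is an illustrative software model only, assuming a valid (no-padding) convolution without kernel flipping, an ReLU activation as in Scenario 1, and 2 × 2 maximum pooling; the function names `mac_n`, `hw_conv2d`, and `sw_post` are hypothetical, and the arithmetic is exact rather than fixed-width with TEC.

```python
def mac_n(weights, window):
    """N-MAC: sum of N elementwise products (the part mapped to HW)."""
    return sum(w * a for w, a in zip(weights, window))

def hw_conv2d(fmap, kernel):
    """Valid 2D convolution (no kernel flip) built from N-MAC calls."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_h, out_w = len(fmap) - k + 1, len(fmap[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Gather the k*k window in the same row-major order as the kernel.
            window = [fmap[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(mac_n(flat_kernel, window))
        out.append(row)
    return out

def sw_post(conv_maps, offset):
    """SW side: sum per-channel results, add offset, ReLU, 2x2 max pooling."""
    h, w = len(conv_maps[0]), len(conv_maps[0][0])
    act = [[max(0, sum(m[i][j] for m in conv_maps) + offset) for j in range(w)]
           for i in range(h)]
    return [[max(act[i][j], act[i][j + 1], act[i + 1][j], act[i + 1][j + 1])
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)]

# Usage: one 6x6 input channel, a 3x3 kernel (a 9-N MAC per output pixel).
fmap = [[1] * 6 for _ in range(6)]
kernel = [[1] * 3 for _ in range(3)]
conv = hw_conv2d(fmap, kernel)      # 4x4 map, every entry equals 9
pooled = sw_post([conv], offset=-4) # 2x2 map, every entry equals 5
print(pooled)
```

Keeping the per-window MAC as a separate function mirrors the partition in the text: only `mac_n` would be replaced by the fixed-width HW unit, while the accumulation across channels, offset, activation, and pooling stay in software.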
In our design, the 9-N and 25-N MAC units were operated using 16-bit operands with w = 1, which can achieve a high recognition rate for handwritten digit classification. The Zynq-7000 SoC-FPGA device integrates an ARM CPU with the HW side, which includes FPGA logic and block RAMs. The ARM CPU communicates with the HW side through the AXI bus. Fig. 14 shows the setup of our design implementation. The MAC unit and the RAMs that store the coefficients (W_k) and data (A_k) were implemented at the HW side to perform the 2D convolution operation. The SW commands were executed using the ARM CPU and external memory (i.e., DDR). When the 2D convolution of a layer was started, the coefficients and precalculated TEC-related terms (i.e., PS_{M1}/B_{M1} for Mode 1 and PS_{M2} for Mode 2) stored in the external DDR were moved to a register-based coefficient RAM at the HW side using the DMA controller. Subsequently, the input feature maps of the layer were fetched from the DDR to the HW block RAM via a DMA transfer. The MAC unit then accessed the W_k and A_k values for MAC operations and sent the calculated results to the DDR through DMA for follow-up SW processing. Through two separate rounds of HW acceleration for the two layers of 2D convolution, the entire CNN operation (Table 12) was completed to obtain the final inference results based on the SW-HW codesign approach. As shown in Fig. 13, the FPGA-based inference outcomes were further compared with the results generated using the fixed-point CNN model for verification. Table 13 lists the main HW resource usage on a Xilinx Zynq-7000 (XC7Z020-CLG484) device for the FPGA design of the proposed 9-N and 25-N MAC units (including block RAMs for storing feature-map data); the items include lookup-table slices (LUT), flip-flop slices (FF), and block RAMs (BRAM).
Unlike prior FPGA designs for CNN acceleration (e.g., [7], [8]), our work did not use the existing FPGA DSP modules for the multiplier implementation because the TEC-based FWBMs in our MAC units were manually designed by structural modeling with FA and HA cells. The HW performance in giga operations per second (GOPs) and the inference (classification) result, given as the recognition rate for handwritten patterns, are also listed in Table 13. The GOPs values of our design were obtained at a 50-MHz clock rate by multiplying the GMACs by 2 because each MAC execution contains a multiplication and an addition [7]. In our evaluation (i.e., the experiments based on Table 12), the CNN execution using our FPGA-based design can be 5.7 times faster than the computation using only the SW-based ARM CPU. The CNN acceleration and GOPs achieved in our experiment are not high because the proposed FPGA-based design focuses on MAC unit verification; it was developed with no emphasis on achieving high GOPs or significant acceleration through many parallel PEs or data-reuse memory schemes, as presented in other FPGA-related studies [7], [8].
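The GMACs-to-GOPs conversion described above amounts to a one-line calculation. The sketch below assumes, purely for illustration, that the unit completes one full N-MAC result per clock cycle; the paper does not state this, so the values are idealized upper bounds under that assumption and are not the figures reported in Table 13. The function name `peak_gops` and its parameters are hypothetical.

```python
def peak_gops(n_mac, clock_hz, results_per_cycle=1.0):
    """Idealized peak throughput: GOPs = 2 * GMACs (one multiply + one add per MAC)."""
    gmacs = n_mac * results_per_cycle * clock_hz / 1e9
    return 2.0 * gmacs

# Assumption: one N-MAC result per cycle at the 50-MHz clock used in the experiment.
print(peak_gops(9, 50e6))   # 9-N unit:  0.45 GMACs -> 0.9 GOPs
print(peak_gops(25, 50e6))  # 25-N unit: 1.25 GMACs -> 2.5 GOPs
```

Any pipeline stalls or DMA transfer time would lower the sustained throughput below these bounds, which is why `results_per_cycle` is left as a parameter.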

VI. CONCLUSION
In this paper, we presented a MAC unit that supports dual-mode TEC using FWBMs for CNN inference operations. The proposed dual-mode TEC scheme was derived using hybrid probabilistic/deterministic digit values to provide tailored, high-accuracy TEC functions for two operation modes. For the general CNN model, in which the MAC units handle random input patterns, the TEC scheme of Mode 1 was proposed to optimize the overall MAC accuracy using a global bias value. For the MAC unit with positive/zero input patterns in an ReLU-based CNN model, the TEC scheme of Mode 2 was proposed to achieve high accuracy based on a combination of the input-adaptive bias terms of each FWBM. The proposed TEC method yields high MAC accuracy while preserving the HW efficiency. By exploiting the fact that the W_k coefficients are known in advance, the TEC-related terms (i.e., PS_{M1}, B_{M1}, or PS_{M2}) can be precalculated, and most multimode operations in an FWBM can be performed using existing PP terms. Thus, the resultant MAC unit requires only minimal CSA resources to enable the proposed TEC operations using a reconfigurable structure with high HW efficiency. Compared with the 9-N or 25-N MAC operations that directly employ state-of-the-art FWBMs with TEC, the 9-N or 25-N MAC operations using the proposed TEC scheme outperformed the contrast samples in operation accuracy in both Modes 1 and 2.
Competitive HW performance, including area, critical-path delay, and power consumption, of the proposed 9-N and 25-N MAC units was also shown and compared with that of the TEC-enabled contrast designs. Moreover, the proposed design yields a significant improvement in overall design efficiency in terms of ADEP and PDEP values. Furthermore, the resultant 9-N and 25-N MAC units were verified by the SW-HW codesign approach using the Xilinx Zynq-7000 SoC-FPGA device. The FPGA-based verification demonstrated the LeNet-5 CNN model for handwritten digit classification. Although the MAC unit based on the proposed design preserves the accuracy and reduces the HW costs, additional design content for multiple modes is required, and the contribution may be confined to CNN applications. However, an extended design based on the Mode 2 TEC scheme can be applied to the design of a general-purpose MAC unit. Moreover, a systematic design method for determining the TEC contents and HW configurations can be developed in our future work.