A Two-Stage Operand Trimming Approximate Logarithmic Multiplier

We present an approximate logarithmic multiplier with two-stage operand trimming, which prioritises area and energy consumption while retaining acceptable accuracy. The multiplier trims the least significant parts of input operands in the first stage and the mantissas of the obtained operands' approximations in the second stage. We evaluated the multiplier's efficiency in terms of error, energy, and area utilisation using the NanGate 45 nm library. The experimental results show that the proposed multiplier exhibits smaller area utilisation and energy consumption than the state-of-the-art designs and that it behaves well in image processing and image classification with convolutional neural networks.


I. INTRODUCTION
POWER consumption has become a considerable challenge and a serious obstacle when increasing computing performance. It is possible to obtain a significant performance gain through the shrinking of the CMOS transistor channel length. As the channel length is reduced, the power per switching event decreases, which allows for faster switching. At the same time, the number of transistors per chip increases, which in turn increases the total chip power. A possible way to overcome this obstacle is to design devices and algorithms that require fewer transistors, even at the price of accuracy.
Many applications, e.g., image processing, can produce an inaccurate output relying on the limited human perception. Some other applications, including adaptive filtering and machine learning, can refine their results iteratively, exhibiting inherent tolerance for a small computation error. Moreover, many signal processing and machine learning applications deal with input data distorted by the noise caused by sensors and quantization processes, setting a limit in the precision or accuracy. Pursuing meaningless precision in any of the above cases leads to excessive energy consumption. Therefore, we may introduce a small error in computation that would not impose noticeable degradation of the output results.
Hence, approximate computing has emerged as a new and promising paradigm for high-performance and energy-efficient systems. In approximate computing, a small and acceptable error may be induced in computing at all layers to achieve more energy-efficient processing. For example, various approximate arithmetic circuits have been designed to save chip area and energy by generating inaccurate but acceptable results [1]-[6].
This paper focuses on approximate multipliers, as multiplication represents a ubiquitous arithmetic operation found in various applications. For example, approximate unsigned multipliers have appeared in some image processing applications, like sharpening and smoothing, and video compression [7]-[10], while approximate signed multipliers have performed well in deep-learning accelerators [11]-[16].
We propose an energy-efficient approximate logarithmic multiplier with two-stage operand trimming that exhibits smaller energy consumption and area utilisation than the state-of-the-art designs. We show that we can deploy the proposed approximate multiplier in image processing and image classification applications without noticeable degradation in performance.
In the remainder of this paper, we first review the related work in the field of approximate multipliers. In section III, we describe the architecture of the proposed multiplier, which we further analyse in terms of the error characteristics and the synthesis results in section IV and section V, respectively. Section VI shows the applicability of the proposed design in image smoothing and convolutional neural networks applications. Lastly, we conclude the paper with the main findings.

II. RELATED WORK
Multipliers are complex circuits. To approximate multiplication, we usually convert it into some simpler operations. By using algorithmic simplifications, we can significantly improve multipliers' energy performance. Approximate multipliers aim to achieve the best possible trade-off between accuracy and design efficiency [1], [2], [17]-[19]. The work [2] provides a comprehensive evaluation of recently proposed approximate arithmetic circuits, which are compared under different design constraints and applied to image processing and deep learning applications. Most of these multipliers follow one of the two major approaches: approximate logarithmic and approximate non-logarithmic multiplication. As the name suggests, approximate logarithmic multipliers use the logarithmic product approximation, which replaces the multiplication with the addition of operands' logarithms [12], [13], [15], [20]-[26]. Approximate non-logarithmic multipliers focus on simplifications of partial product addition [27]-[29] and partial product generation [7], [9], [30]-[33]. Approximate logarithmic multipliers deliver a more straightforward design but exhibit significantly higher computational error. In contrast, approximate non-logarithmic multipliers have a lower computational error at the price of higher design complexity [13].
In this section, we briefly review the approximate logarithmic and the approximate non-logarithmic multipliers.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

A. Approximate Logarithmic Multipliers
Mitchell was the first to introduce unsigned logarithmic product approximation [20]. Mitchell's algorithm employs the approximation of a binary number logarithm. It replaces multiplication with addition in the logarithm domain and serves as the basis of all logarithmic multipliers. Mitchell's multiplier always underestimates the actual product, so its main drawback is a high computational error. Mahalingam et al. [21] improved Mitchell's multiplier accuracy through operand decomposition. The iterative logarithmic multiplier (ILM), developed by Babić et al. [22], achieves arbitrary accuracy through an iterative procedure but still underestimates the actual product.
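The logarithmic product approximation summarised above can be illustrated with a minimal Python sketch. It assumes unsigned integer operands and a fixed-point mantissa; the function name `mitchell_multiply` and the parameter `frac_bits` are illustrative choices and do not come from [20].

```python
def mitchell_multiply(a: int, b: int, frac_bits: int = 16) -> int:
    """Unsigned product approximation: add approximate logarithms, then take the antilogarithm."""
    if a == 0 or b == 0:
        return 0  # assumption: a zero operand gives a zero product

    def approx_log2(z: int) -> int:
        k = z.bit_length() - 1                      # position of the leading one
        mant = ((z - (1 << k)) << frac_bits) >> k   # m = z / 2^k - 1 in fixed point
        return (k << frac_bits) + mant              # log2(z) is approximated by k + m

    s = approx_log2(a) + approx_log2(b)
    k, m = s >> frac_bits, s & ((1 << frac_bits) - 1)
    # Antilogarithm: 2^k * (1 + m); a mantissa overflow has already carried into k.
    return (((1 << frac_bits) + m) << k) >> frac_bits
```

For example, `mitchell_multiply(3, 3)` yields 8 instead of 9, an underestimate of about 11.1%, while products of powers of two such as `mitchell_multiply(4, 8)` are exact.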
Liu et al. [23] proposed the unsigned ALM-SOA multiplier, which uses a truncated binary-logarithm converter and a set-one-adder (SOA) for the addition of logarithms. The set-one-adder with k approximation bits (SOAk) sets the k least significant bits to '1'. Hence, the actual sum of the logarithms is overestimated. As Mitchell's multiplier always underestimates the actual product, SOA is employed to compensate for the negative errors.
Kim et al. [12], [25] truncated the logarithm representation of operands to deliver more efficient logic and proposed an iterative error correction procedure. The authors used the one's complement as an approximation of the two's complement to handle negative numbers, as was previously proposed in [11]. Their signed Mitchell-trunck-C1 multiplier keeps only k upper bits of mantissa in the logarithmic representation of input operands. Due to mantissa truncation, Kim et al. achieved efficient logarithm and antilogarithm conversions, but at the same time, mantissa truncation causes a large error. Their two-stage design uses two truncated logarithmic multipliers for error correction.
Ansari et al. [15], [24] proposed two improved unsigned logarithmic multipliers (ILM), with the exact (ILM-EA) and approximate adder (ILM-AA) in the antilogarithm step. They introduced two novelties. Firstly, they use a near-one-detector (NOD) to round both operands to their nearest powers of two. As the output of the NOD uses a one-hot representation and some entries in the truth table of a conventional adder cannot occur, the authors proposed a compact adder for the reduced truth table. Secondly, they used the modified SOAk adder. In the modified SOAk, instead of setting all of the k least significant bits to '1', these bits are set alternately to '1' and '0', which leads to a double-sided error distribution, one of its major benefits. The proposed ILM-AA is more accurate and has the smallest error values compared to other unsigned logarithmic designs in the literature.
Yin et al. [26] proposed unsigned and signed designs of a dynamic range approximate logarithmic multiplier (DR-ALM). Because Mitchell's multiplier product is always smaller than its exact counterpart, the DR-ALM dynamically truncates input operands and sets the least significant bit of the truncated operand to '1' to compensate for negative errors. Owing to this truncation of operands, DR-ALM uses smaller bit-width logarithmic converters, adder, and antilogarithmic converter to generate the product.
Pilipović et al. [13] proposed the LOBO approximate multiplier that uses the radix-4 Booth encoding to compute the higher part of the product and the logarithmic approximation to generate the least significant part of the product.

B. Approximate Non-Logarithmic Multipliers
The Booth algorithm is commonly used to design approximate multipliers where various approaches have been proposed to simplify the partial product generation stage.
Jiang et al. [30] proposed an approximate radix-8 Booth encoding multiplier (ABM) using the approximate recoding adder with and without the truncation of several less significant bits in the partial product. The approximate recoding adder is used to calculate the triples of multiplicands in a Wallace tree. It adds seven upper bits exactly, whereas the nine lower bits are obtained approximately with 3-input XOR gates. Among all ABM multipliers, the one with 15-bit truncation achieves the best overall performance in terms of hardware and accuracy.
Liu et al. [7] designed approximate Booth multipliers based on approximate radix-4 modified Booth encoding (R4ABM-k) algorithms and a regular partial product array that employs an approximate Wallace tree. The main idea is to generate lower bits of partial products with approximate radix-4 encoding, while the upper bits are generated exactly. Radix-4 Booth approximation is performed through a modified Karnaugh logic table for exact radix-4 Booth encoding. Two approximate radix-4 Booth encoders are proposed to speed up the generation of the k least significant partial product bits. By changing the value of k, the R4ABM-k multiplier can achieve different trade-offs between accuracy and hardware efficiency.
Approximate hybrid high radix encoding multipliers (RAD$2^k$) are presented in [9]. The proposed approach aims to overcome the limitations of high radix Booth encoding through the omission of hard multiples. The multiplier employs a hybrid encoding technique, where the n-bit input operand is divided into two groups: the upper part of n − k bits and the lower part of k bits. The configuration parameter, k ≥ 4, is an even number. The upper part is exactly encoded using the radix-4 encoding, while the lower part is approximately encoded with the radix-$2^k$ encoding. The approximations are performed by rounding the radix-$2^k$ values to their nearest power of two. Although this design is highly accurate, it exhibits a large error for small numbers. The authors showed that RAD1024 delivers the best compromise between energy reduction and accuracy.
A hybrid low radix encoding-based approximate Booth multiplier (HLR-BM), proposed in [31], addresses the issue of generating odd multiples (i.e., multiples of the ±3 multiplicand) in the radix-8 Booth encoding. HLR-BM approximates the ±3 multiplicands to their nearest power of two, such that the errors complement each other. Similar to RAD$2^k$ [9], the authors employed hybrid encoding in which the least significant bits of the multiplicand are encoded with the approximate radix-8 encoding. Due to smaller radix approximate encoding, HLR-BM achieves higher accuracy than RAD$2^k$, but offers smaller energy and area gains.
In [32], the authors proposed hybrid partial product-based building blocks by considering the probability distribution of the input operands. While [31] concentrated on simplification of Booth encoding, here, the authors focused on decomposing an 8-bit multiplier into 4-bit approximate multipliers. An efficient hardware implementation of approximate 4-bit multipliers uses the high-performance approximate NOR-based half adder and full adder cells. The authors used the proposed recursive partitioning to implement 8-bit multipliers (Ax8). Among the three different strategies (Ax8-1/2/3), Ax8-3 exhibits the smallest error and energy.

III. THE PROPOSED MULTIPLIER
In fixed-point computation, we must carefully select the numbers' bit-width, as it directly determines the range of values we can represent. We can obtain a very rough estimate of a number's value from its leading-one bit. The more bits we consider after the leading-one bit, the more accurate is the estimated number's value and the more complex becomes the arithmetic circuitry. Many applications may deliver an acceptable result even if we use only the leading-one bit and a few bits that follow.
The two-stage operand trimming logarithmic multiplier (TL) exploits this fact and splits the operands into two parts. The multiplier then uses only one part of an operand in the following way: if the upper part contains at least one non-zero bit, then the upper part enters the multiplier; otherwise, the lower part enters the multiplier. Thus, after splitting the operands, the multiplier operates on the reduced number of bits, leading to a smaller design. However, this approach requires additional logic at the multiplier's input to select the operands' important parts and at the multiplier's output to form the product correctly.

A. Background
The decimal value of a signed z-bit number Z in two's complement is

$$Z = -b_{Z,z-1} 2^{z-1} + \sum_{i=0}^{z-2} b_{Z,i} 2^{i},$$

where $b_{Z,i}$ represents the i-th bit of Z. To convert the binary number into its logarithm, we separate the sign $s_Z = \mathrm{sign}(Z)$ and the absolute value

$$|Z| = 2^{k_Z} (1 + m_Z),$$

where the exponent $k_Z = \lfloor \log_2 |Z| \rfloor$ denotes the position of the leading-one bit and $m_Z \in [0, 1)$ represents the mantissa. According to [20], we approximate the logarithm of |Z| as

$$\log_2 |Z| \approx k_Z + m_Z.$$

B. Logarithm Conversion of Operands
To get the final approximation $\tilde{P}_0$ of the exact product $P_0 = X_0 \cdot Y_0$, we start by computing the approximate logarithms of the n-bit signed operands $X_0$ and $Y_0$. Conversion of both operands follows the same steps, so we limit the detailed description to operand $X_0$ only.

Fig. 1 illustrates the circuitry used to obtain the operand's sign from the most significant bit and to approximate its absolute value $|X_0|$ by one's complement. In the schemes, a thick line with designated bit-width represents a bus, while a thin line represents a single wire. To get $|X_0|$, we first take the most significant bit, i.e., the sign bit $s_{X_0} = b_{X_0,n-1}$, from the bus $X_0$. Then we drive n instances of the sign bit along with $X_0$ into XOR.

In the first operand trimming stage, we split the absolute value $|X_0|$ into two parts: the v-bit upper part and the (n − v)-bit lower part, where v ≥ n/2. We use the OR-reduction to detect if there is any non-zero bit in the upper part,

$$u_{X_0} = \bigvee_{i=n-v}^{n-1} b_{|X_0|,i}. \tag{5}$$

If at least one non-zero bit exists, we approximate the absolute value with its v-bit upper part; otherwise, we approximate it with its (n − v)-bit lower part. In the latter case, all upper-part bits are zero, so we can drive the v lower bits to the multiplexer. As the most significant bit of the absolute value is always '0' due to the previous XOR with the same bit, we use wire routing to shift the upper part left by one place and set its least significant bit, thus getting the mean value approximation of the neglected bits. We illustrate this operation by the fork and the join patterns: from the v upper bits, we take out the most significant bit $b_{|X_0|,n-1}$ at the fork and drive the v − 1 remaining bits to the join, where we attach '1' at the least significant position. The multiplexer in Fig. 1 outputs the v-bit approximation

$$\tilde{X} = \begin{cases} 2 \lfloor |X_0| / 2^{n-v} \rfloor + 1, & u_{X_0} = 1, \\ |X_0|, & u_{X_0} = 0. \end{cases} \tag{6}$$

We can interpret the expression $2 \lfloor |X_0| / 2^{n-v} \rfloor + 1$ as shifting the absolute value of an operand n − v − 1 places right and setting the least significant bit.

Fig. 2 illustrates the binary-to-logarithm circuitry. We use the v-bit leading-one detector to extract the leading-one bit in $\tilde{X}$. The leading-one bit then enters the v-bit priority encoder to obtain the $\log_2 v$-bit exponent $k_X$. The right side of the circuitry extracts the mantissa. In the second operand trimming stage, we propose a new mantissa extractor. To further reduce the overall complexity of the binary-to-logarithm conversion circuitry, we propose to keep only the w < v leftmost bits of the mantissa. In the following text, we refer to the obtained mantissa as the trimmed mantissa $m^t_X$. As w is a constant value, we can replace the costly barrel shifter with a simple AND-OR net.

Fig. 3 illustrates the extraction of the i-th bit. In the first step, we shift the leading-one bit w − i places right. For the i-th bit, the value w − i is constant; therefore, we can implement the shift as simple wire routing. In the second step, we perform a bit-wise AND between the shifted leading-one bit and $\tilde{X}$, $X^* = (2^{k_X} \gg (w - i)) \wedge \tilde{X}$. In the third step, we perform a bit-by-next-bit OR-reduction on the result to obtain the i-th bit of the trimmed mantissa, $m^t_{X,i} = \bigvee_{j=0}^{n-1} b_{X^*,j}$. We set the least significant bit of the trimmed mantissa, $m^t_{X,0}$, thus enabling the mean value approximation of the trimmed bits. Fig. 4 shows the circuitry for extraction of the i-th trimmed mantissa bit. Note that the bits from $\tilde{X}$ and $2^{k_X}$, which we drive to the AND, are mutually shifted by w − i.

Employing the mantissa extractor reduces delay in the logarithm conversion circuitry. The critical path (i.e., the path with the maximal delay) passes either through the mantissa extractor or through the priority encoder, while in previous designs (e.g., [12], [23]) the critical path includes both the priority encoder and the barrel shifter.
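The two trimming stages of this subsection can be mirrored in a short behavioural Python sketch. It is an illustration under stated assumptions rather than the authors' circuit description: the function names `trim_operand` and `trimmed_log` are hypothetical, the defaults correspond to a TL16-8/4 configuration, and a zero operand is not handled because the text does not describe that case.

```python
def trim_operand(x0: int, n: int = 16, v: int = 8):
    """First stage: return (sign, upper_used, v-bit approximation) of a signed n-bit x0."""
    sign = 1 if x0 < 0 else 0
    mag = ~x0 if sign else x0              # one's complement stands in for |x0|
    upper = mag >> (n - v)                 # v-bit upper part (its MSB is always 0)
    if upper:                              # OR-reduction of the upper part
        return sign, 1, (upper << 1) | 1   # drop the MSB, append '1' as the new LSB
    return sign, 0, mag                    # upper part is all zeros: pass the lower bits


def trimmed_log(x: int, w: int = 4):
    """Second stage: return (exponent, w-bit trimmed mantissa) of a non-zero x."""
    k = x.bit_length() - 1                 # priority encoder output
    leading_one = 1 << k                   # one-hot output of the leading-one detector
    mant = 1                               # bit 0 of the trimmed mantissa is forced to '1'
    for i in range(1, w):
        x_star = (leading_one >> (w - i)) & x   # AND with the shifted leading-one bit
        if x_star:                              # OR-reduction of the AND result
            mant |= 1 << i
    return k, mant
```

For instance, `trim_operand(1000)` selects the upper part and returns the approximation 7 (binary 111), for which `trimmed_log(7)` gives the exponent 2 and the trimmed mantissa 1101 in binary. Because `w - i` is fixed for each mantissa bit, every loop iteration corresponds to a constant shift, which is why the hardware needs only wire routing and an AND-OR net.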

C. Summation and Antilogarithm Conversion
The approximate product's logarithm is the sum of the approximate logarithms of both input operands,

$$\log_2 \tilde{P} \approx (k_X + m^t_X) + (k_Y + m^t_Y) = k_P + m_P,$$

where the $1 + \log_2 v$ most significant bits represent the integer exponent $k_P$, and the w least significant bits form the fractional mantissa $m_P$.
To obtain the approximate intermediate product

$$\tilde{P} = 2^{k_P} (1 + m_P),$$

we use the circuitry depicted in Fig. 5. Due to the trimming, the adder and the barrel shifter have reduced bit-width, which manifests in small size and delay. We first form each operand's approximate logarithm by concatenating '0' with its exponent and mantissa. The approximate logarithms enter the $(1 + \log_2 v + w)$-bit adder to get the approximate logarithm of the intermediate product with the $(1 + \log_2 v)$-bit exponent $k_P$ and the w-bit mantissa $m_P$. A number, formed by concatenating '1' and the mantissa $m_P$, enters the barrel shifter that shifts it left by $k_P$ bits. The output of the barrel shifter is a (2v + w)-bit number. As the approximations $\tilde{X}$ and $\tilde{Y}$ are v-bit numbers, the output's upper 2v bits represent the approximate intermediate product $\tilde{P}$.

Finally, we form the final approximation

$$|\tilde{P}_0| = \tilde{P} \cdot 2^{(n-v-1)(u_{X_0} + u_{Y_0})},$$

where we shift the approximate intermediate product $\tilde{P}$ left to compensate for the shift of an operand n − v − 1 places right in the first trimming phase (6). Recall that $u_{X_0}$ and $u_{Y_0}$ (5) are bits that identify whether the upper part of the input operands is used in further processing. The circuitry in Fig. 6 uses a multiplexer to select the appropriately shifted $\tilde{P}$ based on the bits $u_{X_0}$ and $u_{Y_0}$. If we take both lower parts of the input operands, then $\tilde{P}$ is already the absolute value of the final product approximation (multiplexer input 00). If we take one lower part and one upper part of the input operands, we shift the approximate intermediate product (n − v − 1) bits to the left (multiplexer input 01). And finally, if we take both upper parts of the input operands, we shift the approximate intermediate product 2(n − v − 1) bits to the left (multiplexer input 10). Note that these shifts are just simple wire routings. As the final approximate product is 2n bits wide, we also append the required number of leading zeros. We calculate the sign of the final product approximation as an XOR operation between the operands' sign bits. Again, we approximate the sign conversion using one's complement.

In the rest of the paper, we denote the two-stage operand trimming logarithmic multiplier as TLn-v/w. With minor modifications of the proposed design, we could also implement an unsigned multiplier. To do so, we have to remove the sign conversion circuitry and drive the upper v bits of $|X_0|$ directly to the multiplexer in the input stage (Fig. 1), and also remove the sign conversion circuitry from the output stage (Fig. 6).
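Putting the input stage, the logarithm conversion, the summation, and the output stage together, the whole data path can be modelled behaviourally in a few lines of Python. This is a sketch for experimentation and not the authors' Verilog: the name `tl_multiply` is hypothetical, the mantissa is obtained with an arithmetic shift that is assumed to be equivalent to the AND-OR net, and returning a zero magnitude for a zero trimmed operand is an assumption, since the text does not cover that case.

```python
def tl_multiply(x0: int, y0: int, n: int = 16, v: int = 8, w: int = 4) -> int:
    """Behavioural sketch of a signed TLn-v/w product approximation."""
    mask_w = (1 << w) - 1

    def trim(z0):
        sign = 1 if z0 < 0 else 0
        mag = ~z0 if sign else z0                 # one's complement magnitude
        upper = mag >> (n - v)                    # v-bit upper part
        return (sign, 1, (upper << 1) | 1) if upper else (sign, 0, mag)

    def approx_log(z):
        k = z.bit_length() - 1                    # leading-one position
        mant = (((z << w) >> k) & mask_w) | 1     # w mantissa bits with the LSB set
        return (k << w) | mant                    # exponent concatenated with mantissa

    sx, ux, x = trim(x0)
    sy, uy, y = trim(y0)
    if x == 0 or y == 0:
        mag = 0                                   # assumption: zero operand gives zero
    else:
        log_p = approx_log(x) + approx_log(y)     # summation of the logarithms
        k_p, m_p = log_p >> w, log_p & mask_w
        p = (((1 << w) | m_p) << k_p) >> w        # antilogarithm: (1.m_p) shifted by k_p
        mag = p << ((n - v - 1) * (ux + uy))      # undo the first-stage right shift
    return ~mag if sx ^ sy else mag               # one's complement sign conversion
```

With the default TL16-8/4 parameters, `tl_multiply(1000, 1000)` evaluates to 851968 against the exact 1000000, and `tl_multiply(1000, 5)` evaluates to 4608 against the exact 5000, which shows the kind of error the trimming introduces.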

IV. ERROR ANALYSIS
The error analysis of the TLn-v/w multiplier includes an error study and an empirical assessment of the error characteristics.

A. Error Study
The TLn-v/w multiplier introduces errors in the logarithm and antilogarithm conversion steps. Fig. 7 shows that the highest error in the logarithm conversion emerges when the input operand is $2^{n-v}$, from where it exponentially decreases. Mantissa trimming introduces small errors from $2^w$ onwards.
Firstly, we limit our analysis to the approximate products of unsigned operands $X_0 = 2^{k_X}$ and $Y_0 = 2^{k_Y}$, where $k_X, k_Y \in [0, n-1)$. These values are multiples of $2^{n-v}$, hence having the largest relative approximation errors (Fig. 7). The operand approximation of $X_0 = 2^{k_X}$ comes with two exponential terms, originating from the mean value approximation of the trimmed bits. The antilogarithm conversion then approximates the product $P_0(X_0, Y_0)$. The error distance $ED(X_0, Y_0) = |P_0(X_0, Y_0) - \tilde{P}_0(X_0, Y_0)|$ increases with $k_X$ and $k_Y$. Fig. 8 shows that the relative error distance $RED(X_0, Y_0) = ED(X_0, Y_0) / |P_0(X_0, Y_0)|$ is large for operands with non-zero $r_{k_X}$ or $r_{k_Y}$.
Assuming that operands with non-zero $r_{k_X}$ or $r_{k_Y}$ lead to products with a large relative error, we can estimate the pessimistic theoretical lower bound of the portion of the products with the relative error distance below $2^{1-w}$ as a ratio, where the numerator represents products with non-zero $r_{k_X}$ or $r_{k_Y}$, and the denominator represents the number of all possible products.

B. An Empirical Assessment of Error Characteristic
To empirically assess the error, we examine the influence of parameters v and w for all pairs of operands' values in the interval $[-2^{n-1}, 2^{n-1} - 1]$ in terms of
• NMED, the mean error distance, normalised to the maximal product [7], [23],
• MRED, the mean relative error distance,
• $P_{RED<11\%}$, the portion of products with relative error distance smaller than 11%, as it represents the maximal error of the Mitchell logarithm product approximation [20].

Table I shows error metrics for different instances of TL multipliers. In the 16-bit multipliers, the parameter v has a significant influence on MRED and $P_{RED<11\%}$ but does not affect NMED, which is mainly affected by the products of large operands. The instances with wider mantissa provide more accurate product approximations. A large drop in NMED from w = 3 to w = 4 indicates that we should prefer only multipliers with w ≥ 4. The smaller 8-bit multipliers have significantly larger NMED and MRED and lower $P_{RED<11\%}$.

Fig. 9 shows the distribution of relative error distance for the TL16-8/4 multiplier and the TL8-4/4 multiplier. The value $P_{RED<12.5\%}$ = 95% for the TL16-8/4 multiplier is above the calculated theoretical lower bound of 89% (13), which confirms that a large number of operand values with non-zero $r_{k_X}$ or $r_{k_Y}$ still leads to a good product approximation. A much lower value of $P_{RED<12.5\%}$ = 63% for the TL8-4/4 multiplier indicates its susceptibility to higher approximation errors.

Fig. 10 presents NMED and MRED of the TL multiplier when varying the mantissa's width w and the multiplier's bit-width n. With increasing w, both measures rapidly decrease at the beginning and finally stabilise for w ≥ 8, as wider mantissas cannot compensate for the error introduced by the operand trimming (parameter v). The TL multiplier's bit-width n influences the error measures in a similar way: after the initial rapid drop, both measures stabilise for n ≥ 20.
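The three error measures listed above can be computed exhaustively for any behavioural multiplier model. The following Python sketch is one possible formulation under stated assumptions: the function name `error_metrics` is hypothetical, the normalising constant for NMED is taken as the largest product magnitude of signed n-bit operands, and pairs with a zero exact product are skipped when computing relative errors.

```python
def error_metrics(approx_mul, n: int = 8, threshold: float = 0.11):
    """Return (NMED, MRED, portion of products with RED < threshold) for signed n-bit operands."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    max_product = (1 << (n - 1)) ** 2      # assumption: lo * lo normalises the error distance
    ed_sum = red_sum = 0.0
    pairs = nonzero = below = 0
    for a in range(lo, hi + 1):
        for b in range(lo, hi + 1):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # error distance of this operand pair
            ed_sum += ed
            pairs += 1
            if exact != 0:                 # assumption: RED is undefined for a zero product
                red = ed / abs(exact)
                red_sum += red
                nonzero += 1
                below += red < threshold
    nmed = ed_sum / pairs / max_product
    return nmed, red_sum / nonzero, below / nonzero
```

The argument `approx_mul` is any callable taking two signed integers, so the same routine can compare different multiplier models. The exhaustive loop grows with $4^n$, which is manageable for 8-bit operands but slow for 16-bit operands in pure Python.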

V. SYNTHESIS RESULTS
We analyse the hardware performance of the proposed TL multipliers [34] in terms of power, area, delay, and power-delay-product (PDP) and compare them with several state-of-the-art approximate multipliers. In addition to hardware metrics, we compare NMED for all evaluated multipliers. The selected multipliers were implemented in Verilog and synthesised to 45nm Nangate Open Cell Library. The timing constraints, used for all evaluated designs, specify clock-related parameters, which affect synthesis and timing analysis. We set a clock signal with a period of 5ns, hence not violating a critical path. To evaluate the power, we used timing with a 10MHz virtual clock, a 5% signal toggle rate and output load capacitance equal to 10fF. Table II shows the synthesis results for 16-bit multipliers. The multipliers in [15], [20], [23] are all unsigned multipliers, while the proposed multiplier, and the multipliers in [7], [9], [12], [13], [26], [31] are signed. To compare all multipliers fairly, we have extended the unsigned multipliers to signed in the same way as in the proposed design by adding the logic for operand absolute value computation and the logic for assigning a sign to the approximate product.

A. 16-Bit Multipliers
The parameters v and w affect the size of the TL multiplier circuitry. Smaller v and w lead to a simpler logarithm and antilogarithm conversion, resulting in smaller delay, area utilisation, and PDP, but increased NMED. The proposed multipliers utilise 40% to 50% of the area and consume only 25% to 40% of the energy required in the exact radix-4 multiplier. The TL16-8/3 and TL16-8/4 multipliers outperform the state-of-the-art multipliers in every hardware metric. Fig. 11 reveals a correlation between PDP and NMED for all 16-bit multipliers with PDP lower than 65 fJ. On the one hand, the RAD1024 [9] and LOBO12-12/8 [13] multipliers have the smallest NMED but have a large energy consumption. On the other hand, the TL multipliers belong to the approximate logarithmic multipliers, prioritising efficient design over more accurate product approximation.
Comparing the multipliers TL16-8/4 and TL16-9/4 in Table II indicates that the best choice for operand trimming is v = n/2. From Table I and Fig. 10 we can see that 16-bit multipliers with w ≥ 4 have NMED and MRED fairly close to the minimal achievable error values. From Table II we can see that the 16-bit multiplier with w = 5 has more than 25% larger PDP than the multiplier with w = 4, whereas NMED is 15% smaller. As our goal is energy efficiency, the TL16-8/4 multiplier becomes the design of our choice.

B. 8-Bit Multipliers
Table III shows the synthesis results for 8-bit multipliers. Again, we have extended the unsigned multipliers [12], [20], [23] to support signed operands. Although the TL multipliers TL8-4/2 and TL8-4/3 are the best in terms of hardware measures, all other multipliers fairly outperform them in terms of the error measure NMED. Considering the error and hardware measures, we selected the TL8-4/3 multiplier for the application studies.
Fig. 12 shows the dependency of the power-delay product and the circuit size on the mantissa's width w and the multiplier's bit-width n. The dependency is rather steep; for example, the 32-bit multiplier has about double the PDP and size of the 16-bit version. This suggests keeping w and especially n low.

C. Discussion
Approximating the logarithm from the trimmed operand instead of the whole operand simplifies the logarithmic conversion circuitry in Fig. 2. When extracting the most significant bit's position, the proposed multiplier uses smaller v-bit leading-one detector and priority encoder circuits instead of larger n-bit circuits. Similarly, it extracts the mantissa from a v-bit trimmed operand, resulting in a simpler mantissa extractor. Considerable hardware savings also arise in the logarithm summation and antilogarithm conversion in Fig. 5. The proposed multiplier employs a simpler adder and a simpler barrel shifter due to a shorter mantissa. All these improvements considerably reduce the overall area and delay. However, to benefit from them, we need the operand trimming stage in Fig. 1 and the output stage in Fig. 6, which increase the multiplier's overall complexity. As these stages mainly consist of simple multiplexers and wire routing, we can still profit from the design by adequately selecting the design parameters v and w.
The superiority of the proposed multiplier's hardware metrics emerges from its input stage. The operand's logarithm approximation is determined based on two fixed ranges of the operand's value, contrary to the dynamic range approach used in the DR-ALM multiplier [26]. Hence, the proposed method is more straightforward, but, with coarse-grain approximation, introduces larger errors.

VI. APPLICATION CASE STUDIES

A. Image Smoothing
Image smoothing reduces noise and details in the image [35] and is implemented by convolving the image with a smoothing kernel. To evaluate the influence of product approximation, we use the mean structural similarity index (MSSIM) [36] and the peak signal-to-noise ratio (PSNR) between the image smoothed with the exact multiplier and the image smoothed with an approximate multiplier. We perform the tests on five 16-bit grayscale images from the TESTIMAGES database [37]: building, cards, flowers, snail, and wood game. The pixels in an input image are uniformly shifted from $[0, 2^{16})$ to $[-2^{15}, 2^{15})$ to adapt for signed 16-bit multipliers. For the analysis of signed 8-bit multipliers, we additionally scale the images to the 8-bit range.

Table IV shows the results for image smoothing. The TL16-8/4 multiplier delivers MSSIM and PSNR similar to the Mitchell-trunc8-C1 [12] multiplier, and better than the Mitchell-trunc6-C1 [12] multiplier and the DR-ALM4 [26] multiplier. However, multipliers ALM-SOA11 [23] and RAD1024 [9] with very low NMED outperform the proposed TL16-8/4 design. Similarly, the TL8-4/3 multiplier outperforms the DR-ALM4 [26] multiplier and the Ax8-3 [32] multiplier in terms of PSNR, but lags behind the Mitchell-trunc5-C1 [12] multiplier and the ALM-SOA11 [23] multiplier. Nevertheless, high MSSIM and acceptable PSNR indicate that the TL multipliers can replace the exact multiplier without a significant image quality decrease.

Fig. 13 visualises the smoothing of the wood game image with 16-bit multipliers. The most obvious difference is in the continuous gradation of grey tone in the image background: more pronounced posterisation (banding) can be observed in the images with lower MSSIM and PSNR.
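The evaluation procedure described above can be sketched in plain Python as a convolution with a pluggable multiplication routine followed by a PSNR computation. The kernel below is a hypothetical 3×3 integer kernel chosen only for illustration, since the kernel used in the experiments is not reproduced in this text; the names `smooth` and `psnr` and the copying of border pixels are likewise assumptions.

```python
import math

# Hypothetical smoothing kernel (assumption); its weights sum to 16.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def smooth(image, mul=lambda a, b: a * b):
    """Convolve a 2-D list of integer pixels with KERNEL using the multiplier mul."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]            # border pixels are copied unchanged
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += mul(image[r + dr][c + dc], KERNEL[dr + 1][dc + 1])
            out[r][c] = acc >> 4               # divide by the kernel sum of 16
    return out


def psnr(ref, test, peak):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    diffs = [(a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

Calling `smooth(img)` gives the reference image, `smooth(img, mul=some_model)` gives the approximate one for any behavioural multiplier model `some_model`, and `psnr(reference, approximate, peak)` compares the two, where `peak` is the maximal pixel amplitude of the chosen image format.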

B. Image Classification With Convolutional Neural Networks
Convolutional neural networks (CNNs) are a class of deep neural networks, mainly used in image classification and analysis [38]. The neural network processing involves a vast number of multiplications. We anticipate that the adaptable nature of CNNs makes them resilient to errors introduced by approximate multiplication.
To assess the influence of approximate multiplication on the inference phase, we deploy approximate multipliers in a CNN for image classification. For experiments, we utilise the Caffe framework [39] and replace the exact multiplication with the approximate one in the inference phase. Notably, we replace the calls to the cuBLAS multiplication routines with our C/C++ routines that implement various approximate multiplier designs.
As a test case, we select the ResNet-20 [40] network and the CIFAR10 dataset [41]. We repeat the training ten times using fixed-point arithmetic, random weight initialisation, and the predetermined split to training and test set [41]. We initially train the network in floating-point arithmetic for 64,000 iterations using stochastic gradient descent with the learning rate decay. To adapt the network to approximate multiplication, we perform an additional 15,000 iterations of training, employing the approximate fixed-point multiplication in the inference phase and the exact floating-point multiplication in the learning phase. We quantise the floating-point weights and inputs to the signed fixed-point representation with q fractional bits as $\mathrm{int}(Z \cdot 2^q) / 2^q$, where Z is a floating-point value and $\mathrm{int}(\cdot)$ denotes the conversion to an integer. We set q = 12 for the 16-bit multipliers and q = 6 for the 8-bit multipliers, which gives the smallest accuracy degradation for the exact radix-4 multiplier.

Table V presents the classification accuracy on a test set in terms of top-1 and top-3 scores. The top-t score represents the rate at which the target label belongs to the t top predicted classes. For the top-1 score, the TL16-8/4 multiplier achieves lower classification accuracy than the multipliers Mitchell-trunc6-C1 [12], ALM-SOA11 [23], and RAD1024 [9] but comparable accuracy to the DR-ALM4 multiplier [26]. In terms of the more relaxed top-3 score, the proposed TL16-8/4 multiplier is in line with the more accurate multipliers and outperforms the DR-ALM4 multiplier [26].
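The fixed-point quantisation with q fractional bits mentioned above can be sketched as follows. This is an illustration under stated assumptions and not the routine used in the experiments: the name `quantize` is hypothetical, rounding to the nearest integer code is assumed, and out-of-range values are assumed to saturate.

```python
def quantize(z: float, q: int, bits: int = 16) -> float:
    """Map z to a signed fixed-point value with q fractional bits, returned as a float."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    code = round(z * (1 << q))        # assumption: round to the nearest integer code
    code = max(lo, min(hi, code))     # assumption: saturate instead of wrapping around
    return code / (1 << q)            # scale back by 2^q
```

With q = 12 and 16 bits, the representable range is [−8, 8) with a resolution of $2^{-12}$; with q = 6 and 8 bits, it is [−2, 2) with a resolution of $2^{-6}$, so `quantize(1/3, 6, bits=8)` gives 21/64.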
In general, the 8-bit multipliers are less accurate than the 16-bit multipliers in terms of the top-1 score, while the difference shrinks for the top-3 score. The performance of the TL8-4/3 multiplier follows the error analysis results: higher values of the error measures correlate with lower classification accuracy. Hence, among the tested 8-bit multipliers, the DR-ALM4 multiplier is probably a better choice than the TL8-4/3 multiplier, as it achieves better accuracy with only slightly worse hardware measures.

VII. CONCLUSION
This paper presents an energy-efficient approximate logarithmic multiplier with two-stage operand trimming, which aims to deliver improvements in the area and energy consumption of approximate logarithmic multipliers.
The multiplier splits the input operands into the upper and lower parts and uses only the parts that give better approximations to the exact operands' values. This way, the bit width is reduced, which results in simpler circuitry at the expense of a large error for some products. Additionally, the proposed multiplier trims the mantissas of the shortened operands. For this operation, we propose a new mantissa extractor based solely on a simple AND-OR net to lower delay and energy consumption. Moreover, a shorter operand and mantissa lead to a smaller adder and barrel shifter in the antilogarithm conversion circuitry.
Using the error analysis, we derive a lower bound on the proportion of products with an acceptable error and empirically show that the TL16-8/4 multiplier computes more than 90% of products with a relative error smaller than 12.5%. The TL16-8/4 multiplier exhibits smaller energy consumption and area utilisation than the state-of-the-art designs.
Although the proposed multiplier offers significant improvements in energy consumption and area utilisation, its relatively large normalised mean error distance may be a significant drawback for its deployment. Nevertheless, the proposed multiplier behaves well in image processing and in image classification with convolutional neural networks. The error analyses show that the 8-bit design has a significantly larger error than the 16-bit design. While the 16-bit design behaves well in both application case studies, the 8-bit design is more appropriate for less demanding applications, such as image smoothing in our case. The final choice of multiplier depends on whether the application requirements prioritise energy over accuracy or vice versa. The experimental results suggest that the proposed multiplier can be useful in error-resilient applications, in intensive computing, or in mobile devices, as it delivers more computational power per chip area and per power-delay product than the state-of-the-art logarithmic multipliers.
We have successfully deployed the proposed multiplier in the inference phase of CNNs. The remaining challenge is the training of the neural networks. In our previous work [11], we showed that it is possible to train fully connected layers efficiently with approximate multipliers. In future research, we should investigate the influence of approximate multipliers in the training phase of much larger convolutional neural networks. We anticipate that, with an improved training strategy, we can lower the accuracy requirements of the multipliers and provide even more area- and energy-efficient designs.