Low-Power and Low-Latency Hardware Implementation of Approximate Hyperbolic and Exponential Functions for Embedded System Applications

The hyperbolic and exponential functions are widely used in engineering applications such as machine learning, the Internet of Things (IoT), and signal processing. To fulfill the needs of future applications effectively, this paper proposes a low-latency, low-power, and low-cost architecture with acceptable accuracy for computing the approximate exponential function $e^{\pm z}$ and the hyperbolic functions sinh(z) and cosh(z) using a table-driven algorithm named Approximate Composited-Stair Function (ApproxCSF). The proposed design is realized on an FPGA and demonstrates improvements in latency, hardware cost, power consumption, and MSE of 91%, 96%, 74%, and 99%, respectively, compared to the state of the art. Xilinx Virtex-5/7 FPGAs have been employed throughout the functional verification and prototyping processes. Compared to related works, the proposed architectures are much better suited to low-cost and low-latency computation of exponential and hyperbolic functions than CORDIC, stochastic computation, and the look-up table approaches. The source code is publicly available online at https://github.com/AyadMDalloo/ApproxCSF.


I. INTRODUCTION
The current trend of research in the development of high-performance very large-scale integration (VLSI) designs is increasingly focused on real-time digital signal processing (DSP) and machine learning algorithms. This focus is essential for applications such as surveillance and wearable electronics, which require the analysis and evaluation of sensed data to recognize patterns [1], [2], [3]. IoT and edge processing require immediate action based on sensed data.
Some applications prioritize local processing over cloud computing due to latency and connection limits. Unfortunately, local processing critically demands low-power, high-accuracy, low-latency, and low-cost solutions. A significant number of these algorithms utilize elementary functions such as trigonometric, hyperbolic, exponential, logarithmic, and division functions. The calculation of these transcendental functions in software leads to significant delays. Hardware implementations have gained considerable prominence due to the performance improvements that they provide over software implementations. There is a substantial amount of published material that describes the hardware implementation of these functions (exponential and hyperbolic). In general, there are five common types of computing methods for implementing these functions: the look-up table (LUT) approach [4], [5], [6], [7], the polynomial approximation methodology [7], [8], the coordinate rotation digital computer (CORDIC) algorithm [9], [10], [11], [12], piecewise polynomial approximations [13], [14], [15], and hybrid (table-driven) approaches [16], [17]. The approximate and stochastic computing approaches [18], [19], [20], [21] have also garnered considerable interest in recent years. The benefits and drawbacks of each implementation approach are discussed in the next section.
The associate editor coordinating the review of this manuscript and approving it for publication was Nuno M. Garcia.
There are no known field-programmable gate array (FPGA) designs in the literature that combine low cost, wide range, low latency, and acceptably high accuracy all at the same time. In this paper, we propose simple architectures for exponential and hyperbolic functions based on the table-driven approach in [22]. The proposed architectures outperform existing state-of-the-art designs in terms of latency, cost, operational range, and power consumption. The experimental results given in this paper support this claim. The key contributions of the paper are as follows:
• Development of the Exponential Function: The paper introduces an innovative table-driven algorithm for computing the exponential function. This approach significantly improves performance and reduces cost while extending the input range. The use of Xilinx Virtex-5/7 FPGAs for verification and prototyping underscores the practical applicability of the design.
• Design of Hyperbolic Functions: The paper presents novel architectures for the hyperbolic functions sinh(x) and cosh(x), based on the table-driven approach for the exponential function. This design effectively balances hardware complexity, cost, performance, and accuracy, eliminating the need for data input scaling.
• Comprehensive Review of Elementary Function Implementations: An extensive review of key studies and various methods for implementing elementary functions (exponential and hyperbolic) is provided. This review offers valuable insights into the state of the art and highlights the advantages of the proposed approach.
• Comprehensive Architectural Advancements: The paper presents architectures that excel in accuracy and range, are energy-efficient, scalable, and flexible, and contribute to the open-source community, supporting diverse applications and technological evolution.
The rest of the paper is organized as follows: Section II surveys the most pivotal studies of different methods for implementing elementary functions. Section III explains the background of the computation of exponential and hyperbolic functions. Section IV describes the architecture of the exponential and hyperbolic functions. Section V uses the experimental results to quantify the benefits of our proposed architecture. Finally, Section VI concludes the paper.

II. LITERATURE REVIEW
Exponential and hyperbolic functions are essential for many computational tasks, requiring accurate and efficient implementation. Efficient and accurate approximations of these functions are important for FPGA-based computing, where hardware resources are limited and speed is critical. To this end, we have surveyed the most pivotal studies in the field published between 2011 and 2022, with a focus on those appearing in the last seven years between 2017 and 2023. Furthermore, this review includes some of the earliest studies documenting the concept's inception. In particular, articles published by highly regarded publishers like IEEE, Elsevier, MDPI, Nature, ACM, and Springer were prioritized. The ArXiv repository has provided the source for a few of the chosen papers. In this section, we highlight the benefits and drawbacks of each approach and review some of the recently published studies on approximating and implementing hyperbolic and exponential functions on FPGAs and ASICs.

A. LOOK-UP TABLE APPROACH
The Look-Up Table (LUT) method [4], [5], [6], [7] is considered the easiest, most effective, and quickest method for computing exponential and hyperbolic functions by interpolating information stored in memory blocks. In Figure 1, the LUT addresses a defined range of input values with 8-bit addressing. Values beyond this range might need extrapolation or error management. Interpolation techniques (e.g., linear interpolation) can be used to improve accuracy and to generate values for inputs that fall between LUT entries. This low-complexity method requires ample silicon area since the accuracy depends on the size of the memory. LUTs are particularly useful for functions that have a small input domain and a fixed output precision. They are therefore not suitable for highly oscillatory or rapidly changing functions. The core challenge with such functions is their need for an extremely dense set of points in the lookup table to accurately capture their behavior. This requirement for a high density of points leads to increased memory usage, which makes the lookup table approach inefficient or even impractical for these types of functions, particularly in real-time processing systems with restricted computing capabilities. Saint-Genies et al. [4] proposed a method for using error-free values by tabulating two or more terms per table row using Pythagorean triples. This technique reduces memory use by up to 29% and floating-point operations by up to 42%. Deng et al. [6] introduced a framework for automatically developing a look-up table (LUT) to generate and evaluate functions while optimizing hardware resources in FPGAs. The function is evaluated through a numerical approximation methodology employing Taylor polynomials. This approach is tailored to meet specific demands regarding precision and computational speed. For the exponential function, the reported maximum error is 1.69e-7. Magalhães et al. [5] provided an optimization tool for creating accurate and efficient LUTs, which are composed of layers with thicknesses that specify the distance between pre-calculated points.
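The trade-off between table size and accuracy described above can be illustrated with a small software model. The following is a minimal sketch, not any of the cited designs: the input range [0, 1), the function exp(-x), and all names are assumptions chosen for illustration, with the 8-bit addressing taken from the description of Figure 1.

```python
import math

ADDR_BITS = 8                       # 8-bit addressing, as in the text
ENTRIES = 1 << ADDR_BITS            # 256 table entries
X_MIN, X_MAX = 0.0, 1.0             # assumed input range (illustrative only)
STEP = (X_MAX - X_MIN) / ENTRIES

# One extra entry so that the last segment can also be interpolated.
TABLE = [math.exp(-(X_MIN + i * STEP)) for i in range(ENTRIES + 1)]


def lut_exp_neg(x, interpolate=True):
    """Approximate exp(-x) for X_MIN <= x < X_MAX from the table."""
    if not (X_MIN <= x < X_MAX):
        raise ValueError("input outside the tabulated range")
    pos = (x - X_MIN) / STEP
    idx = int(pos)                  # table address
    if not interpolate:
        return TABLE[idx]           # direct lookup: nearest entry below x
    frac = pos - idx                # position between two adjacent entries
    return TABLE[idx] + frac * (TABLE[idx + 1] - TABLE[idx])


def max_error(interpolate, samples=10000):
    """Largest absolute error over a uniform sweep of the input range."""
    return max(
        abs(lut_exp_neg(k / samples, interpolate) - math.exp(-k / samples))
        for k in range(samples)
    )
```

In this model, direct lookup is limited by the table step, while linear interpolation reduces the error substantially at the cost of one multiplication and two additions per evaluation.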

B. POLYNOMIAL APPROACH
Another family is the polynomial approaches, which approximate a mathematical function by a polynomial of a given degree. The Taylor series and polynomial approximation approaches are both used to approximate mathematical functions, but they differ in their underlying mathematical principles and implementation. The implementation of exponential and hyperbolic functions frequently employs the Taylor series, a well-known mathematical method for approximating a function through a sequence of derivatives at a specific point. Although these series may converge slowly for larger values, truncating them to a finite number of terms often yields a satisfactory level of accuracy. Nonetheless, this method may demand significant computational resources and is susceptible to numerical instability, especially with larger argument values.
To address these issues, various alternative methods have been proposed in recent years. Costa et al. [23] provided a Taylor series exponential function with a variable input range and an eight-byte LUT. The proposed architecture used a Newton-Raphson division and a radix-4 squarer unit for designing a Taylor series exponential function. To minimize error and obtain a reliable approximation, polynomial approximation is preferred [7], [8]. However, this method requires many multipliers, adders, and tables for storing coefficients. It is both slow and inefficient, as depicted in Figure 2(a). Wu et al. [8] presented an approximate exponential function unit (EFU) based on Taylor expansion and optimized using discrete gradient descent, with a power consumption of 3.73 pJ/exp. Polynomial approximation is a common approach for approximating exponential and logarithm functions on FPGAs. The idea is to represent the function as a polynomial that approximates it over a specific input range. The polynomial coefficients can be determined by fitting a polynomial to the function using least-squares regression or another curve-fitting algorithm. In their respective studies, Chen et al. [24], Nandagopal [25], and Ze [26] have each introduced a hardware accelerator specifically designed for elementary transcendental functions, demonstrating significant enhancements in computational throughput. Notably, the work of Chen et al. [24] achieves an approximate throughput of 2.5 GFLOPS using 65 nm CMOS technology, while maintaining an average error of 0.5 units in the last place (ulp) and a maximum error of 3 ulp.
FIGURE 2. Architecture of the exponential function using (a) a 3rd-order polynomial approach, (b) stochastic approach of a 5th-order Maclaurin polynomial [19].
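The behavior of a truncated Taylor (Maclaurin) series for larger arguments can be seen in a short sketch. This is an illustrative floating-point model with a hypothetical function name, not the architecture of [23] or [8]; Horner's rule is used so that each additional term costs one multiplication and one addition.

```python
import math


def taylor_exp(x, terms):
    """Truncated Maclaurin series of exp(x) with `terms` terms (Horner form)."""
    # exp(x) ~ 1 + x/1 * (1 + x/2 * (1 + x/3 * (...)))
    acc = 1.0
    for k in range(terms - 1, 0, -1):
        acc = 1.0 + x * acc / k
    return acc


def taylor_error(x, terms):
    """Absolute error of the truncated series against math.exp."""
    return abs(taylor_exp(x, terms) - math.exp(x))
```

With six terms the error is small near the expansion point (for example at x = 0.5) but grows quickly for larger arguments such as x = 4, which is why practical designs pair the series with range reduction or a table.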

C. CORDIC APPROACH
The third method is the CORDIC algorithm, which is a low-cost iterative algorithm designed by Volder [12] in 1959. It uses adders, wire shift operations, and a few registers, as depicted in Figure 3. Unfortunately, its latency is comparable to that of a serial multiplier and its input range is limited, making it less than ideal for exponential and hyperbolic function computation.
Recent enhancements to the CORDIC algorithm offer potential for affordable, high-performance real-time computing hardware solutions [8], [9], [10], [11]. Osta et al. [18] have conducted research with the objective of diminishing the energy usage of specialized circuits in real-time execution of CORDIC algorithms within the realm of machine learning. This is achieved through the application of approximate computing methodologies. The findings from their study indicate that the integration of the Lower-Part Or adder (LOA) leads to a significant reduction in power consumption, quantified at 21%. Based on adapting the current [27] architecture with an approximation, Chen et al. [9], [28] presented a new approximate approach for coordinate rotation digital computer (CORDIC) construction. A completely parallel approximation CORDIC (FPAX-CORDIC) technique is developed, which eliminates the memory register of Para-CORDIC and makes rotation direction generation totally parallel. Although approximate CORDIC and parallel CORDIC functions have their benefits, they nevertheless have limits in terms of input range and latency. When dealing with difficult mathematical processes or large input quantities, the performance of these functions may be hampered. Hence, researchers and developers continue to explore new solutions and enhancements to address these issues and produce more effective and adaptable algorithms for a wide range of areas.
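The iterative nature and limited input range mentioned above can be made concrete with a floating-point model of conventional rotation-mode hyperbolic CORDIC. This is a textbook-style sketch with hypothetical names, not the Xilinx IP core or the approximate variants of [9], [18]; in hardware, the multiplications by 2^(-k) are wire shifts and the atanh values come from a small table.

```python
import math


def hyperbolic_cordic(z, iterations=16):
    """Return (cosh z, sinh z) for |z| below roughly 1.1 (the basic range)."""
    # Shift indices start at 1; indices 4, 13, 40, ... are executed twice,
    # which the hyperbolic variant needs in order to converge.
    indices = []
    i, repeat = 1, 4
    while len(indices) < iterations:
        indices.append(i)
        if i == repeat and len(indices) < iterations:
            indices.append(i)
            repeat = 3 * repeat + 1
        i += 1

    # Each micro-rotation scales the vector by sqrt(1 - 2^(-2k));
    # starting from 1/gain compensates for the accumulated scaling.
    gain = 1.0
    for k in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * k))

    x, y = 1.0 / gain, 0.0
    for k in indices:
        d = 1.0 if z >= 0 else -1.0     # rotation direction
        t = 2.0 ** (-k)
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)          # residual angle
    return x, y


def cordic_exp(z, iterations=16):
    """exp(z) = cosh(z) + sinh(z) within the basic convergence range."""
    c, s = hyperbolic_cordic(z, iterations)
    return c + s
```

One result bit is gained per iteration, so the latency grows with the word length, and arguments outside the basic range require extra pre- and post-processing.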

D. PIECEWISE APPROXIMATIONS APPROACH
The fourth method is piecewise linear/nonlinear/polynomial approximation. Piecewise Linear Approximation (PLA) is a computationally efficient method for numerical approximation, particularly suitable for real-time systems with resource constraints. PLA is characterized by its simplicity, which makes it suitable only for implementing simple functions in constrained computational environments. PLA generates non-uniform segments based on the maximum error threshold. The number of segments affects the length of the input interval. It also impacts the steepness of the function. However, PLA has limited accuracy for complex, non-linear functions. It also introduces discontinuities at the boundaries between segments.
Moving beyond PLA's simplicity, Piecewise Nonlinear Approximation (PNA) offers a more advanced solution. PNA utilizes nonlinear functions such as exponentials, logarithms, and trigonometric functions. Its aim is to accurately represent complex functions. However, it comes at the cost of increased computational complexity and the challenge of selecting the most appropriate nonlinear function for each segment.
Lastly, Piecewise Polynomial Approximation (PPA) balances computational complexity and accuracy by using polynomials of different degrees over segmented intervals. PPA is widely used in signal processing and scientific computing. It can face boundary issues and needs domain expertise for best results. It is chosen for applications needing a good speed-accuracy trade-off. Figure 4 illustrates piecewise linear and quadratic approximations. For example, Chiluveru et al. [13] developed a novel iterative algorithm designed for the piecewise linear approximation of the sigmoid function, characterized by its controlled accuracy. Dong et al. [14] introduced an advanced, universally applicable, and error-minimized piecewise linear approximation (PLA) methodology. This approach is elaborated through a comprehensive piecewise linear approximation computation (PLAC) technique, which is applicable across a broad spectrum of nonlinear unary functions. The PLAC method is distinguished by its two primary components: an optimized segmentation mechanism and a refined quantization process. Lyu et al. [15] then built PLAC without using a multiplier. This architecture was optimized by Yu et al. [29] to find the minimum number of segments and reduce the maximum absolute error (MAE). All the authors concentrated on the [0,1) interval for their circuit designs. However, these designs necessitate employing the scaling property of the exponential function for input and output processing.
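The need for input and output scaling when the approximation only covers [0, 1) can be sketched as follows. This is an illustrative model with uniform segments and hypothetical names, not the PLAC method of [14], [15], [29] (which uses optimized, non-uniform segmentation): the fractional power 2^f is approximated piecewise linearly on [0, 1), and exp(x) is recovered through the identity exp(x) = 2^i * 2^f with x * log2(e) = i + f.

```python
import math

SEGMENTS = 16   # assumed number of uniform segments on [0, 1)

# Slope and intercept of the chord of 2**f over each segment.
COEFFS = []
for s in range(SEGMENTS):
    a, b = s / SEGMENTS, (s + 1) / SEGMENTS
    slope = (2.0 ** b - 2.0 ** a) / (b - a)
    COEFFS.append((slope, 2.0 ** a - slope * a))


def pla_pow2_frac(f):
    """Piecewise linear approximation of 2**f for 0 <= f < 1."""
    idx = min(int(f * SEGMENTS), SEGMENTS - 1)   # segment address
    slope, intercept = COEFFS[idx]
    return slope * f + intercept


def pla_exp(x):
    """exp(x) via input scaling (range reduction) and output scaling."""
    t = x * math.log2(math.e)        # input scaling: exp(x) = 2**t
    i = math.floor(t)                # integer part, applied as a shift
    f = t - i                        # fractional part in [0, 1)
    return math.ldexp(pla_pow2_frac(f), i)   # output scaling by 2**i
```

The multiplication by log2(e) and the final shift are the extra input and output processing that designs restricted to [0, 1) must add around the core approximator.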

E. HYBRID (TABLE-DRIVEN) APPROACH
Hybrid (table-driven) methods involve combining multiple approximation techniques to improve accuracy or reduce computational cost [16], [22], [30]. For example, a lookup table can be combined with a linear or polynomial approximation to improve accuracy over a wider input range. To address the need for an efficient implementation of the exponential function with variable-precision fixed-point negative input, Chandra [16] proposed a hybrid method combining LUTs and polynomial approximation to reduce the number of multipliers and adders. The space requirements and energy consumption decreased by over 30% and 50%, respectively.
Overall, hybrid methods are a powerful tool for approximating complex functions and are widely used in many areas of science and engineering. Exponential and hyperbolic functions are important in many areas of science and engineering, but they can be computationally expensive to evaluate directly, especially for large inputs or high precision. Hybrid methods can be used to overcome these challenges by combining different approximation techniques that are well-suited to different parts of the function's domain.

F. APPROXIMATE AND STOCHASTIC COMPUTING APPROACH
In recent years, there has also been a great deal of interest in the techniques of approximate computing and stochastic computing [18], [19], [20], [21], [31]. High clock speeds and fault tolerance are two distinguishing features of stochastic computing, together with exceptionally low hardware costs and power consumption. Frameworks for stochastic computation based on the fundamental building blocks of arithmetic and logic are shown in Figure 2(b). However, stochastic computing has drawbacks such as decreased precision and extended latency. Luong et al. [21] explored the implementation of stochastic logic in executing complex arithmetic functions, notably exponential, sigmoid, and hyperbolic tangent functions. This study utilized piecewise linear and polynomial approximations, specifically employing Lagrange interpolation, to achieve these computations. The findings show that power consumption and hardware complexity were reduced by 40%, while the critical path increased by 2.5% compared to earlier designs.
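The precision and latency drawbacks of stochastic computing follow from how values are encoded. The sketch below is a generic illustration with hypothetical names, not the design of [19] or [21]: in the unipolar format a value in [0, 1] is the probability of a 1 in a bitstream, and a single AND gate multiplies two independent streams.

```python
import random


def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a random bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stochastic_multiply(a, b, length=4096, seed=0):
    """Estimate a*b by AND-ing two independent unipolar bitstreams."""
    rng = random.Random(seed)          # fixed seed for repeatability
    stream_a = to_stream(a, length, rng)
    stream_b = to_stream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length               # fraction of 1s approximates a*b
```

The logic is tiny, but the estimate only sharpens with longer streams: the standard deviation of the result shrinks with the square root of the stream length, so each extra bit of precision costs roughly four times as many clock cycles.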
In addition, approximate computing is a technique for reducing the gap between CMOS scaling and future application needs by utilizing the trade-off between hardware cost and accuracy, which offers significant promise for enhancing the performance of integrated systems [31]. The parallel CORDIC proposed by [27] was approximated by [9] and [18], but these approximated CORDIC versions may still not meet many applications' needs due to their delay.
In this literature review, we have examined a wide range of methodologies for implementing low-power and low-latency hardware designs for approximate hyperbolic and exponential functions, which are crucial in embedded system applications. Our analysis covered various approaches, including Look-Up Table (LUT), polynomial approximation, CORDIC algorithms, and hybrid (table-driven) methods. Every method offers unique advantages and drawbacks. The selection depends on the needs of the particular application.
The LUT method is notable for its low latency and ability to replace complex calculations. However, it demands extensive memory for high precision, which can be a drawback, as seen in the approach of Saint-Genies et al. [4] for exact hyperbolic functions, where high memory access and floating-point operations become a disadvantage due to delays and high power consumption. Conversely, Magalhães et al. [5] and Deng et al. [6] have contributed tools for automating LUT development and optimizing hardware resources. From our experiments, LUTs are suitable for generating functions with lower accuracy or for functions like the exponential with negative inputs, which have outputs in the [0, 1] range. While LUTs offer lower latency and accuracy compared to our proposed method, our goal is to balance these aspects. Similarly, polynomial approximation needs a high polynomial order to satisfy high-precision requirements. One way to reduce the number of interpolation points is to use a polynomial approach.
Piecewise approximations (PA) present a middle ground between accuracy and computational load but may need significant hardware for coefficient storage and calculations. The main time consumption in PA lies in coefficient addressing, and compared to our method, PAs have higher delays, despite sharing similar characteristics.
The CORDIC algorithm and stochastic methods are less costly in terms of hardware and power but come with their own set of challenges. The CORDIC algorithm faces higher latency and lower accuracy, whereas the stochastic method trades accuracy for lower latency. Essentially, each method compromises one design aspect. However, hybrid approaches aim to merge the advantages of different methods to boost performance, albeit at the cost of increased design and implementation complexity. For instance, Chandra [16] combined LUT and polynomial approximation in designing an exponential function for negative inputs, reducing the need for multiple multipliers and adders.
Our review highlighted that despite the advancements in these methodologies, challenges remain in achieving an optimal balance between power consumption, latency, accuracy, and hardware cost. Notably, the current implementations exhibit limitations in scalability, flexibility, and efficiency when subjected to the demanding requirements of real-time DSP and machine learning applications. These gaps underscore the necessity for innovative approaches that can navigate the trade-offs inherent in hardware design for exponential and hyperbolic function computations. Our proposed architecture, leveraging the Approximate Composited-Stair Function (ApproxCSF) and table-driven algorithms, aims to address these shortcomings by offering a design that significantly improves upon latency, power consumption, and hardware efficiency, while maintaining acceptable accuracy levels, thus presenting a viable solution to the identified gaps in the literature.

III. BACKGROUND OF TABLE-DRIVEN ALGORITHM
A table-driven (hybrid) implementation algorithm was proposed by Tang et al. [22] to provide a software implementation of the exponential function in IEEE floating-point arithmetic. Fixed-point numbers are used to speed up the processing of exponential functions, and the table-driven technique is used to improve both speed and accuracy. In this article, we present a simplified approach to the table-driven technique for the exponential function, building upon the framework established in [22]. The algorithm, originally detailed in [22], is described in the following steps:
Step 1: The input argument X is reduced to the range [−log(2)/64, log(2)/64]. Obtain integers m and j and a remainder r such that $X = (32m + j)\,\log(2)/32 + r$, where $|r| \le \log(2)/64$.
Step 2: The function exp(r) − 1 is approximated by a polynomial p(r).
Step 3: Reconstruct exp(X) via $\exp(X) = 2^{m}\left(2^{j/32} + 2^{j/32}\,p(r)\right)$.
In this design, the values of $2^{j/32}$, corresponding to the fractional part of N/32, are pre-computed and stored in 32 memory locations, utilizing a lookup table. This setup is pivotal for the algorithm's efficiency. It allows for rapid retrieval of these values during computation, thereby reducing the computational complexity and underscoring the table-driven nature of our design.
As part of the VLSI design of support vector machines (SVM), Patankar et al. [32] used a table-driven algorithm to implement the exponential function $e^{-z}$ to realize the Gaussian function. The drawback of this implementation is its use of a divider unit, which has a high computation cost. The divider unit is employed for normal division between two numbers and for critical tasks such as input normalization and output scaling. Division operations, while necessary for these tasks, are inherently more computationally intensive than simpler arithmetic operations. They are notably time-consuming, often requiring a considerable number of clock cycles, varying from tens to hundreds. Furthermore, division operations require significant area due to their complexity. Even a minimal improvement in the divider circuit, such as 1%, can significantly boost the overall system performance by up to 20% [32]. In our design, we decided to eliminate the divider circuit due to these critical considerations. The following formula for calculating the exponential function was adopted in [33]:
Our development is divided into two phases to give a comprehensive description and analysis of our architectures. We provide a broad description of the design and a breakdown of how the exponential and hyperbolic functions are implemented in the circuit.
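The three steps of the table-driven algorithm of [22] can be sketched in floating-point Python. This is a minimal illustration under stated assumptions: the decomposition X = (32m + j) ln(2)/32 + r follows the steps above, but the degree-3 Taylor polynomial used for p(r) is a stand-in chosen here and not the coefficients of [22], and the function and constant names are hypothetical.

```python
import math

LN2_OVER_32 = math.log(2.0) / 32.0
# Table of 2^(j/32) for j = 0..31 (the 32 pre-computed memory locations).
TABLE = [2.0 ** (j / 32.0) for j in range(32)]


def table_driven_exp(x):
    """exp(x) by range reduction, a short polynomial, and reconstruction."""
    # Step 1: x = (32*m + j) * ln(2)/32 + r with |r| <= ln(2)/64.
    n = round(x / LN2_OVER_32)
    m, j = divmod(n, 32)             # floor division keeps 0 <= j < 32
    r = x - n * LN2_OVER_32

    # Step 2: p(r) approximates exp(r) - 1 (assumed degree-3 Taylor form).
    p = r + r * r / 2.0 + r ** 3 / 6.0

    # Step 3: exp(x) = 2^m * (2^(j/32) + 2^(j/32) * p(r)).
    return math.ldexp(TABLE[j] + TABLE[j] * p, m)
```

Because the reduced argument r is so small, even this short polynomial gives a relative error on the order of 1e-9 in this model, which shows why the table carries most of the work.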

IV. THE PROPOSED ARCHITECTURES

A. GENERAL DESCRIPTION OF ARCHITECTURE
This paper presents a systematic framework for exponential functions which segments the functions into two parts: the Stair-Step Function (SSF) and the Composited-Error Function (CEF). The SSF is essentially a piecewise function that approximates the exponential curve by breaking it down into a series of discrete 'steps' or segments. This stepwise approximation simplifies the complex nature of the exponential function, making it more manageable for computational purposes. On the other hand, the CEF is designed to capture and represent the error introduced by the stepwise approximation of the SSF. It models the deviation from the actual exponential curve as a sawtooth waveform, which helps in analyzing and compensating for the approximation error in subsequent processing stages. We have named our proposed architecture the Approximate Composited-Stair Function (ApproxCSF). It employs a table-driven approach to approximate the exponential function, as detailed in Algorithm 1.
This design converts the input parameter z into an integer number N by multiplying it by a constant C1 and feeds N into two segments, each with its own function. The SSF segment could generate the stepped exponential function in stair form directly by looking up $2^{N/32}$, but the cost of this process would be high because dividing N by 32 yields a fractional value. Consequently, it would require a large number of memory locations to achieve the highest precision. To reduce the cost of this design, the integer number N is divided by 32 into a quotient m and a remainder j. Figure 5 illustrates the design's block diagram. The remainder j is generated by extracting the five least significant bits of N, and the remaining bits are used as the quotient m (e.g., m has 6 bits if N has 11 bits). The fractional component of N/32 is represented by storing the $2^{\mp j/32}$ values in 32 memory locations. The integer quotient m is shifted by one position to generate $2^{\mp m}$ through the use of a decoder instead of a shift register. The last step of the SSF segment is to multiply $2^{\mp m}$ by $2^{\mp j/32}$ to yield $2^{\mp N/32}$, as described in Algorithm 1 and detailed below.
In the CEF segment, the error value ε is determined by subtracting the estimated input argument $z_n$ from the input argument z, where $z_n$ is calculated from the integer number N by multiplying it by a constant C2 (= 1/C1). The output of the segment (Y) is computed by adding the error ε to 1 or subtracting it from 1 to calculate $e^{x}$ or $e^{-x}$, respectively. The approximate exponential function is then computed by multiplying the CEF segment's output, Y, by the SSF segment's output, X. In contrast, Feng et al. [33] entered the error ε into a second-order polynomial p(ε) and then divided it by the output of the SSF segment, X. This is a drawback of Feng's implementation.
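The interaction of the two segments can be illustrated with a floating-point model. This is a hedged sketch of the scheme as described in this section, not the authors' VHDL: it assumes C1 = 32/ln(2), a non-negative magnitude z, truncation when forming N, and hypothetical function and variable names.

```python
import math

C1 = 32.0 / math.log(2.0)     # assumed scaling constant
C2 = 1.0 / C1                 # C2 = 1/C1, as stated in the text
# 32-entry tables of 2^(-j/32) and 2^(+j/32).
TABLE_NEG = [2.0 ** (-j / 32.0) for j in range(32)]
TABLE_POS = [2.0 ** (j / 32.0) for j in range(32)]


def approx_csf_exp(z, negative=True):
    """Approximate exp(-z) (negative=True) or exp(z) for z >= 0."""
    n = int(z * C1)                   # integer N
    m, j = n >> 5, n & 0x1F           # quotient and 5-bit remainder of N/32

    # SSF segment: 2^(-/+ m) * 2^(-/+ j/32) = 2^(-/+ N/32).
    if negative:
        ssf = TABLE_NEG[j] * 2.0 ** (-m)
    else:
        ssf = TABLE_POS[j] * 2.0 ** m

    # CEF segment: eps = z - z_n with z_n = N * C2, then 1 -/+ eps.
    eps = z - n * C2
    cef = 1.0 - eps if negative else 1.0 + eps

    return ssf * cef
```

Since ε stays below ln(2)/32 (about 0.0217), replacing the exponential of ε by its first-order term costs a relative error of roughly ε²/2, at most about 2.4e-4 in this model, and no divider is needed.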
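The hyperbolic functions sinh(x) and cosh(x) can be easily implemented by adding and subtracting the two exponential functions, $e^{-x}$ and $e^{x}$, and then wire shifting the result (a division by two), as follows:
$$\sinh(x) = \frac{e^{x} - e^{-x}}{2}, \qquad \cosh(x) = \frac{e^{x} + e^{-x}}{2}$$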

B. THE CIRCUIT DESIGN
The exponential function is a fundamental part of activation functions for neural networks and various algorithms. For instance, the exponential function is used by the Gaussian function in the construction of a support vector machine (SVM). We are conscious of the fact that the exponential function over a negative domain is more widely used than over the positive domain [16]. As a result, we proposed and constructed a variety of range structures for the exponential and hyperbolic functions to reduce hardware costs and properly compare them to existing architectures. The first architecture of the exponential function, shown in the accompanying figure, is limited to negative inputs. The design is pipelined to achieve a four-clock latency. We use the signed fixed-point s4.11 format (16 bits). The integer number N (s10.0) is extracted after multiplying the input argument z (s4.11) by the constant 32/ln(2) (s6.13). It is then fed to the stair-step function (SSF) and composited-error function (CEF) segments. In the CEF section, the error ε is computed as the difference between the exact and estimated input parameters, which is always less than one. The final output of the CEF segment (represented by X) may be less than or more than one when used to calculate the exponential function $e^{-z}$ or $e^{z}$, respectively, as explained below.
$$X = \begin{cases} 1 - \varepsilon, & \text{for } e^{-z} \\ 1 + \varepsilon, & \text{for } e^{z} \end{cases}$$
where $\varepsilon = z - z_n$, z represents the exact input argument, and $z_n$ is the estimated input argument. Since the outcome X is always around one, it is always represented in the s1.16 format.
In the stair-step function segment, the integer number N is split into two numbers (denoted j and m), and the remainder j is used to access the lookup table in order to acquire $2^{-j/32}$ for $e^{-z}$ or $2^{j/32}$ for $e^{z}$. Because the output of the decoder, $2^{-m}$ for $e^{-z}$ or $2^{m}$ for $e^{z}$, is 16 bits, the quotient m is represented by just 4 bits here. The output of the stair-step function (SSF) segment is explained below.
For the exponential of a negative input argument: $Y = 2^{-m} \cdot 2^{-j/32} = 2^{-N/32}$.
For the exponential of a positive input argument: $Y = 2^{m} \cdot 2^{j/32} = 2^{N/32}$.
Multiplying the outputs of the two segments (X and Y) yields the final circuit output, which is the exponential function of the input argument. The final output is represented by 32 bits, in the s1.30 format for $e^{-z}$ and the s17.14 format for $e^{z}$. Figures 5a and 5b depict the exponential function circuits, and the section on the results and assessment describes the characteristics of these circuits.
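The fixed-point data path for the negative-input case can be modeled with integer arithmetic. The following is a rough sketch under explicit assumptions rather than a bit-accurate model of the circuit: the s4.11 input and s6.13 constant follow the text, while the 24-bit fractional precision of C2, the 16-bit fractional table entries, the application of $2^{-m}$ as a right shift after the multiplication, and all names are choices made here for illustration.

```python
import math

FRAC_Z = 11                        # input z in s4.11 (from the text)
FRAC_C1 = 13                       # constant 32/ln(2) in s6.13 (from the text)
C1_RAW = round(32.0 / math.log(2.0) * (1 << FRAC_C1))
FRAC_C2 = 24                       # assumed precision for C2 = ln(2)/32
C2_RAW = round(math.log(2.0) / 32.0 * (1 << FRAC_C2))
FRAC_T = 16                        # assumed table / CEF fractional bits
LUT = [round(2.0 ** (-j / 32.0) * (1 << FRAC_T)) for j in range(32)]


def exp_neg_fixed(z_raw):
    """Model exp(-z) for z = z_raw / 2**11 with 0 <= z_raw < 2**15.

    Returns a float so the result can be compared with math.exp.
    """
    # Integer N: drop all fractional bits of the product z * C1.
    n = (z_raw * C1_RAW) >> (FRAC_Z + FRAC_C1)
    j = n & 0x1F                   # remainder: five least significant bits
    m = n >> 5                     # quotient: remaining bits

    # CEF: eps = z - N*C2 at FRAC_C2 fractional bits, then X = 1 - eps.
    eps_raw = (z_raw << (FRAC_C2 - FRAC_Z)) - n * C2_RAW
    cef_raw = (1 << FRAC_T) - (eps_raw >> (FRAC_C2 - FRAC_T))

    # SSF: table value 2^(-j/32); the factor 2^(-m) is a right shift.
    product = (LUT[j] * cef_raw) >> m
    return product / float(1 << (2 * FRAC_T))
```

For example, `exp_neg_fixed(2048)` models $e^{-1}$, since 2048 / 2^11 = 1. In this model the absolute error stays below about 5e-4 over the tested inputs, dominated by the first-order treatment of ε rather than by quantization.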
As demonstrated in Figure 6, the aforementioned architectures may be integrated, modified further, and simplified to encompass the exponential function of the positive and negative input ranges. The design selects between the exponential functions of negative and positive input arguments based on the sign of the input argument. In this design, we utilize a 64-location lookup table to find $2^{-j/32}$ for $e^{-z}$ and $2^{j/32}$ for $e^{z}$, which is addressed by combining the remainder j with the sign of the input argument z. In the SSF segment, the quotient m must be inverted, and the decoder's output must shift only when the parameter z is negative. The sign of the input parameter determines how the operation is performed in the CEF segment. The error ε is directly inserted into the adder or subtractor to obtain 1±ε, controlled by the sign of the input argument. The output of the CEF segment has a distinct numerical format when the exponential functions $e^{-z}$ and $e^{z}$ are calculated. This distinction necessitates an increase in the number of output bits from 32 bits to 41 bits to avoid accuracy loss.
When hyperbolic functions are required, it is necessary to build both exponential functions in parallel. Thus, the circuits depicted in Figure 5 operate in parallel, with their outputs fed into the adder/subtractor unit as per Equation 6. Subsequently, a wire shift is applied to generate the hyperbolic functions. The control signal is used to choose between cosh(z) and sinh(z). Figure 7 shows the architecture of the sinh(z) and cosh(z) functions.
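The composition of the hyperbolic functions from two parallel exponentials can be sketched in a few lines. This is a floating-point illustration with hypothetical names, assuming C1 = 32/ln(2) and z >= 0; the compact `approx_exp` helper stands in for the two exponential circuits, and the final division by two corresponds to the wire shift.

```python
import math

C1 = 32.0 / math.log(2.0)      # assumed scaling constant


def approx_exp(z, sign):
    """Approximate exp(sign * z) for z >= 0, sign = +1 or -1."""
    n = int(z * C1)                            # integer N
    eps = z - n / C1                           # residual error
    return 2.0 ** (sign * n / 32.0) * (1.0 + sign * eps)


def approx_sinh_cosh(z, select_cosh):
    """Control signal select_cosh picks cosh(z) (True) or sinh(z) (False)."""
    e_pos = approx_exp(z, +1)                  # e^z path
    e_neg = approx_exp(z, -1)                  # e^-z path
    total = e_pos + e_neg if select_cosh else e_pos - e_neg
    return total / 2.0                         # one-bit wire shift in hardware
```

Both exponentials are evaluated from the same N and ε, so the only extra hardware over the two exponential paths is the adder/subtractor and the selection logic.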

V. EXPERIMENTAL RESULTS
The hyperbolic and exponential functions have been developed in VHDL to evaluate the proposed architectures presented in the preceding section. These designs are then synthesized in the Xilinx ISE Design Suite and evaluated on the Xilinx Virtex-5 (XC5VLX110T) and Virtex-7 (XC7VX485T) FPGAs. The designs support 16-bit operands in the fixed-point format s4.11.
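For readers unfamiliar with the notation, s4.11 denotes one sign bit, 4 integer bits, and 11 fraction bits, giving a range of [−16, 16 − 2^-11] with a resolution of 2^-11. The sketch below shows one way to model such operands in software. The round-to-nearest and saturation behaviour, and the names `to_s4_11` and `from_s4_11`, are assumptions for illustration, since the rounding mode is not stated here.

```python
import math

FRAC_BITS = 11               # fraction bits in s4.11
WORD_MIN = -(1 << 15)        # most negative 16-bit two's-complement value
WORD_MAX = (1 << 15) - 1     # most positive 16-bit two's-complement value


def to_s4_11(x: float) -> int:
    """Quantize x to a 16-bit s4.11 word (assumed: round to nearest, saturate)."""
    raw = math.floor(x * (1 << FRAC_BITS) + 0.5)
    return max(WORD_MIN, min(WORD_MAX, raw))


def from_s4_11(raw: int) -> float:
    """Convert an s4.11 word back to a real value."""
    return raw / (1 << FRAC_BITS)
```

With this model, an input such as −10.397 is represented with an error of at most half a least significant bit, 2^-12.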
In this work, we analyze the architectures based on a variety of important metrics, such as delay, area (number of LUTs, FFs, and slices), maximum frequency, latency, throughput, and error. Each proposed architecture has been functionally verified with $10^6$ different uniformly distributed random input patterns. The error metrics are then calculated by comparing the hardware simulation results of the proposed designs with the floating-point results of a Matlab program. Following this, we discuss the timing analysis and hardware utilization to conclude the FPGA implementation study. We conduct three primary tests on the proposed exponential and hyperbolic function designs.
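The verification flow described above can be approximated in software as a sketch. The metric definitions used here (mean absolute error, mean squared error, population standard deviation of the signed error, and maximum absolute error) are assumed, since this section does not define them, and `error_metrics` and `random_inputs` are hypothetical helper names.

```python
import math
import random


def random_inputs(count, low, high, seed=0):
    """Draw uniformly distributed test inputs (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def error_metrics(approx, reference, inputs):
    """Compare an approximate function against a floating-point reference."""
    errors = [approx(x) - reference(x) for x in inputs]
    n = len(errors)
    mean = sum(errors) / n
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "STD": math.sqrt(sum((e - mean) ** 2 for e in errors) / n),
        "MAX": max(abs(e) for e in errors),
    }
```

A run matching the test size in the text would call `random_inputs(10**6, -10.397, 0.0)` and pass a bit-accurate model of the design as `approx`, with `math.exp` as the reference.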

A. PERFORMANCE OF APPROXIMATE EXPONENTIAL COMPOSITED STAIR FUNCTION
To highlight the advantages of our architectures (ApproxCSF) for generating the exponential function, we compare them with the best existing FPGA implementations. The first architecture of the exponential function is designed only for negative inputs. It is compared with the architectures developed in [33] and [34], and with the standard CORDIC algorithm in the form of the Xilinx CORDIC IP Core V4. All designs were simulated with a 5 ns clock period, except the non-pipelined proposed architecture, which was simulated with a 10 ns clock period.
According to Table 1, our proposed method reduces the latency from 44 to 4 clock cycles at the same maximum frequency as the method in [33]. In other words, we reduce the latency by 91% and 83% compared with the method in [33] and the Xilinx CORDIC IP core, respectively, as illustrated in Figure 8. The high latency in [33] is caused by its costly and latency-intensive divider unit. The proportion of slices used by the proposed architecture is just 8% and 88% in [33] and [34], compared to the CORDIC IP core, as shown in Figure 8. In addition, the proposed architecture consumes 74% less power than the approaches described in [33] and [34] because it uses fewer hardware resources. However, it has the same delay as the other approaches. Figure 8 gives a visual comparison of the proposed architecture against the alternative architectures.
According to Table 1, the CORDIC IP core has higher error metrics and a smaller input range than the other approaches. Therefore, the error metrics of the IP core are excluded from our comparison. Figure 9 illustrates the distribution of errors across the input domain, spanning from −10.397 to 0, for both the reference architecture and our proposed architecture. Our findings indicate that excluding the quadratic term reduces the maximum error value by 85%. However, this modification leads to a marginal 2% increase in the overall error rate. Moreover, excluding this term reduces hardware complexity and cost by removing the need for an additional multiplier and adder. This simplification thereby improves the system's accuracy and its efficiency in terms of power consumption, hardware complexity, and resource allocation.
Based on the results presented in Table 1, our design outperforms the method described in [33], improving MAE, MSE, and STD by 94%, 99%, and 93%, respectively. However, it also results in a 2% higher error probability.
For negative inputs to the exponential function, the proposed architecture outperforms the current state-of-the-art approaches. The same architecture is also used to develop hardware implementations of the exponential function for positive inputs and for the full input range. The results for negative and positive inputs exhibit similar performance characteristics, which makes them straightforward to compare. Figure 10 illustrates the error distribution across the range [0, 10.397].
Given the distinct qualities of the two aforementioned designs, we created a single architecture capable of handling both the positive and negative ranges of the exponential computation (as shown in Figure 6). Table 2 compares our design against various approaches in the literature. Our architecture achieves better latency and hardware efficiency. However, the method in [35] surpasses our design in terms of error metrics and range coverage due to its 55-bit floating-point precision, 16Kb BRAM lookup table, and high-end hardware. Our proposed architecture employs a 16-bit fixed-point format, whereas the approach in [35] uses a 55-bit floating-point format. Consequently, the design presented in [35] can process a significantly broader input range than our proposed design. Despite identical maximum frequencies, our design lowers latency, LUT usage, and FF usage by 82%, 91%, and 92%, respectively, compared to [35]. Our design prioritizes power efficiency with a trade-off in accuracy. Enhancing precision would require more bits and potentially separate storage for integer functions in the lookup table.
This study aims to balance different metrics, finding an equilibrium that optimizes overall performance without disproportionately favoring any single aspect.

B. PERFORMANCE OF APPROXIMATE HYPERBOLIC COMPOSITED STAIR FUNCTION
In this section, we design and implement the hyperbolic functions on the Virtex-7 XC7VX485T FPGA in order to perform a fair comparison with the approach described in [11]. Our proposed circuit demonstrates significant improvements in cost, power, and performance, accompanied by a minimal 0.074% decrease in accuracy, which remains satisfactory. Table 3 compares the proposed architecture with the modified CORDIC algorithms reported in [11] in terms of timing analysis, hardware utilization, and error metrics. The proposed architecture adopts 16-bit fixed-point precision, accepting a minor decrease in accuracy to avoid the complexities and delays inherent in floating-point computations. Fu et al. [11] introduced a 128-bit floating-point version of the exponential function, emphasizing precision over other considerations. As a result, that design suffers from reduced performance and high implementation costs. The proposed architecture significantly outperforms the method described in [11], reducing latency and delay times by 81.25% and 98.45%, respectively. This leads to a remarkable 99% improvement in throughput at an operational frequency of 200 MHz. Furthermore, the architecture is characterized by minimal hardware utilization, which contributes to lower power consumption (128 mW at 200 MHz): it uses only 61 slices compared to the 9430 slices required by [11].
Table 3 demonstrates that the input ranges of the two approaches for hyperbolic functions are comparable. The findings indicate that the highest output error recorded is 7.042 at an output value of 9517.2; normalized to the output value, this maximum error is 7.4 × 10^-4, or 0.074%. In addition, the maximum output error observed is 1.5225, representing 0.0138% of the highest output value, which is 11.013 × 10^3. As previously mentioned, the work of Fu et al. [11] emphasizes precision, achieving high accuracy with an error probability of only 0.4%. In contrast, our architecture for hyperbolic functions prioritizes minimizing power consumption, reducing footprint and latency, and broadening the input range, albeit at the cost of some precision.
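The two normalized error figures quoted above follow directly from dividing each absolute error by its corresponding output value. A quick check, using only the numbers given in the text (the helper name `normalized_error` is hypothetical):

```python
def normalized_error(abs_error: float, output_value: float) -> float:
    """Absolute error expressed as a fraction of the output value."""
    return abs_error / output_value


# Largest recorded error, relative to the output at which it occurs.
largest = normalized_error(7.042, 9517.2)       # about 7.4e-4, i.e. 0.074%

# Error relative to the highest output value, 11.013e3.
at_peak = normalized_error(1.5225, 11.013e3)    # about 1.38e-4, i.e. 0.0138%
```

Both ratios are small because the hyperbolic outputs are in the thousands at the upper end of the input range, so an absolute error of a few units is a small fraction of the result.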
Table 4 compares the LUT, stochastic computing, and modified CORDIC methods with ApproxCSF. Ref. [4] demonstrated that it is possible to compute trigonometric and hyperbolic functions with an accuracy of 4 bits using 77-bit look-up tables. This method requires an exponentially increasing number of look-up tables to achieve higher precision, while a larger look-up table decreases performance. Stochastic computing operates on stochastic bitstreams and is characterized as both energy-efficient and cost-effective [21]. In stochastic computing, accuracy is directly proportional to the number of random numbers used [20], [21]. According to Table 4, the stochastic computing architecture [21] has the highest latency, area, and error rate. In contrast to the others, it has a significant latency and a small input range of [0, 1]. Consequently, the latency of the methods in [11] and [20], and this paper is 10240 ns, 297.472 ns, and 19.78 ns, respectively. Significant delays, area requirements, and energy consumption are inevitable consequences of high-precision computations because of the massive amount of stochastically produced data.
To sum up, the choice of architecture depends on the application and its cost constraints. We recommend stochastic computing for applications that need low power and cost, can tolerate low accuracy and a narrow input range, and are insensitive to latency. We recommend the method described in [4] for applications that are less concerned about power consumption or cost. We recommend our proposed solution for cases where all metrics are crucial.

VI. CONCLUSION AND FUTURE WORK
In this study, we introduced efficient architectures for computing the exponential and hyperbolic functions using a table-driven algorithm. The proposed designs are characterized by their acceptable accuracy, low cost, low power consumption, and low latency. The experimental findings show considerable improvements over the best previously reported designs in terms of performance, error, and hardware cost metrics. Moreover, the proposed architecture offers versatility and scalability, facilitating enhanced accuracy at larger scales.
In future developments, we aim to improve both the delay and accuracy of our work through the following four strategies.
1. The output range of the exponential and hyperbolic functions can be extended by splitting the input argument into two or more parts, namely the integer and fraction parts, which are handled by a combination of a look-up table and this table-driven method as follows:
$$x = n_1 + n_2 + f \;\rightarrow\; y = \exp(x) = \exp(n_1) \cdot \exp(n_2) \cdot \exp(f) \qquad (10)$$
where $n_1$ and $n_2$ are integers, and $f$ is the fractional part of the input argument $x$. Furthermore, by splitting the integer lookup table into multiple sections, we can significantly reduce the total storage requirement.
2. We aim to enhance computational efficiency by minimizing multiplier use, especially for constant-factor operations. We plan to replace the first two multipliers with shift operations and an adder tree, optimizing input scaling and resource usage. Furthermore, we intend to substitute the third multiplier with a barrel shifter, driven by specific "m" values, to easily adjust the outputs of the LUT in our design. This modification makes the computation less complex and faster, which is valuable for real-time signal processing tasks.
3. We may employ approximate computing, such as approximate adders and multipliers, with a tolerable loss of precision to reduce the latency. In this case, increasing the precision (bits) of the input arguments does not result in a corresponding increase in delay or reduction in maximum frequency.
4. The exponential function exhibits a scaling property whereby input values can be scaled, so that calculations within a limited range can be extended through appropriate scaling. This allows efficient computation over a wide range of input values. It rests on the mathematical principle that the exponential function can be scaled and shifted in a way that preserves its essential characteristics. This is particularly useful in hardware implementations, where computing the exponential function directly over a large input range might be computationally expensive or impractical.
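The storage saving suggested by Equation 10 can be illustrated with a small sketch. The particular split used here (integer part in 0..15, with n1 a multiple of 4 and n2 in 0..3) and the names `HIGH_TABLE`, `LOW_TABLE`, and `exp_split` are assumptions chosen for illustration. `math.exp` stands in for the table-driven evaluation of the fractional part.

```python
import math

# Integer part n in 0..15 is split as n = n1 + n2,
# with n1 in {0, 4, 8, 12} and n2 in {0, 1, 2, 3}.
HIGH_TABLE = [math.exp(4 * i) for i in range(4)]  # exp(n1)
LOW_TABLE = [math.exp(i) for i in range(4)]       # exp(n2)


def exp_split(x: float, exp_fraction=math.exp) -> float:
    """Evaluate exp(x) as exp(n1) * exp(n2) * exp(f) for 0 <= x < 16."""
    n = math.floor(x)
    if not 0 <= n <= 15:
        raise ValueError("x must lie in [0, 16) for this sketch")
    f = x - n                      # fractional part, 0 <= f < 1
    n1_index, n2_index = n >> 2, n & 3
    return HIGH_TABLE[n1_index] * LOW_TABLE[n2_index] * exp_fraction(f)
```

Two 4-entry tables replace a single 16-entry table, at the cost of one extra multiplication. The same idea scales to wider integer parts, where the saving grows.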

FIGURE 1. (a) Hardware implementation of the exponential function using one LUT, (b) illustration of the linear interpolation method [7].

FIGURE 3. Hardware implementation of the conventional CORDIC algorithm.

FIGURE 5. The circuit diagram of the implementation of the exponential function (a) $e^{-z}$, (b) $e^{+z}$.

FIGURE 6. The circuit diagram of the implementation of the exponential function $e^{\pm z}$.

FIGURE 7. The circuit diagram of the implementation of the hyperbolic functions.

FIGURE 8. Comparative performance analysis of the proposed design against the design in [33] and the IP Core in terms of Latency, Slices, Dynamic Power, and Mean Squared Error (MSE). The MSE is excluded from the IP Core comparison due to its significantly higher value.

FIGURE 9. Distribution of errors in the implementation of the exponential function $e^{-z}$ for (a) our proposed architecture and (b) the architecture referenced in [32].

FIGURE 10. The normalized error distribution of the implementation of the exponential function $e^{+z}$.

TABLE 1. Comparison of different architectures of the 16-bit approximate exponential function $e^{-z}$.

TABLE 2. Comparison of the different architectures of the approximate exponential function $e^{\pm z}$.

TABLE 3. Comparison of the different architectures of approximate hyperbolic functions.

TABLE 4. Comparison of the architectures of hyperbolic functions for the proposed, LUT, stochastic computing, and CORDIC methods.