Introduction
The instruction set architecture of most modern processors provides Single Instruction Multiple Data (SIMD) instructions that process multiple instances of data concurrently [1]. The programming model that utilizes these instructions is a key technique for many computing systems to reach their peak performance. Most software SIMD optimizations are introduced manually by programmers. However, this approach introduces a portability problem, because the code needs to be rewritten when targeting a new vector extension. In order to improve the portability of code with SIMD optimizations, recent compilers have introduced auto-vectorizing capabilities [2]. To fully exploit the SIMD capabilities of a system, the auto-vectorization transformation of a compiler must be able to invoke a version of each function that operates on concurrent iterations, or a vector function. This applies particularly to the C mathematical functions defined in math.h.
In this paper, we describe our implementation of a vectorized library of C standard math functions, called the SLEEF library. SLEEF stands for SIMD Library for Evaluating Elementary Functions, and it implements vectorized versions of all C99 real floating-point math functions. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version for most of the functions. We confirmed that our library satisfies these accuracy requirements on an empirical basis. Our library achieves both good performance and portability by abstracting intrinsic functions. This abstraction enables sub-features of vector extensions, such as mask registers, to be utilized while the source code of our library is shared among different vector extensions. We also implemented a version of the functions that returns bit-wise consistent results across all platforms. Our library is designed to be used in conjunction with vectorizing compilers. In order to help the development of vectorizing compilers, we collaborated with compiler developers in designing a Vector Function Application Binary Interface (ABI). The main difficulty in vectorizing math functions is that conditional branches are expensive. We implemented many of the functions in our library without conditional branches. We devised reduction methods and adjusted the domains of polynomials so that a single polynomial covers the entire input domain. As the vector size increases, a value requiring a slow path is more likely to be contained in a vector. Therefore, we vectorized all the code paths in order to speed up the computation in such cases. We devised a variation of the Payne-Hanek range reduction and a remainder calculation method that are both suitable for vectorized implementation.
We compare the implementations of several selected functions in our library to those in other open-source libraries. We also compare the reciprocal throughput of functions in our library, Intel SVML [3], FDLIBM [4], and Vector-libm [5]. We show that the performance of our library is comparable to that of Intel SVML.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 discusses how portability is improved by abstracting vector extensions. Section 4 explains the development of a Vector ABI and a vectorized mathematical library. Section 5 gives an overview of the implementation of SLEEF, comparing our library with FDLIBM and Vector-libm. Section 6 explains how our library is tested. Section 7 compares the performance of our library with that of prior art. Section 8 presents our conclusions.
Related Work
2.1 C Standard Math Library
The C standard library (libc) includes the standard mathematical library (libm) [6]. There have been many implementations of libm. Among them, FDLIBM [4] and the libm included in the GNU C Library [7] are the most widely used libraries. FDLIBM is a freely distributable libm developed by Sun Microsystems, Inc., and there are many derivatives of this library. Gal et al. described the algorithms used in the elementary mathematical library of the IBM Israel Scientific Center [8]. Their algorithms are based on the accurate tables method developed by Gal; the library achieves high performance and produces very accurate results. Crlibm is a project to build a correctly rounded mathematical library [9].
There are several existing vectorized implementations of libm. The Intel Short Vector Math Library (SVML) is a highly regarded commercial library [3]. This library provides highly optimized subroutines for evaluating elementary functions that can use several kinds of vector extensions available in Intel's processors. However, this library is proprietary and optimized only for Intel's processors. There are a few other commercial and open-source implementations of vectorized libm. AMD provides a vectorized libm called the AMD Core Math Library (ACML) [10].
Some of the code from SVML has been published under a free software license and is now available as Libmvec [11], which is a part of Glibc. This library provides functions with a 4-ULP error bound. It is coded in assembly language, and therefore it does not have good portability. C. K. Anand et al. reported their C implementation of 32 single-precision libm functions tuned for the Cell BE SPU compute engine [12]. To develop their library, they used an environment called Coconut that enables rapid prototyping of patterns and rapid unit testing of assembly language fragments and patterns. M. Dukhan published an open-source and portable SIMD vector libm library named Yeppp! [13], [14]. Most vectorized implementations of libm use assembly coding or intrinsic functions to specify which vector instruction is used for each operator. On the other hand, there are also implementations of vector versions of libm that are written in a scalar fashion and rely on a vectorizing compiler to generate a vectorized binary. Christoph Lauter published an open-source Vector-libm library implemented in plain C [5]. The VDT Mathematical Library [15] is a math library written for the compiler's auto-vectorization feature.
2.2 Translation of SIMD Instructions
Manilov et al. propose a C source-code translator that substitutes calls to platform-specific intrinsic functions in a source code with those available on the target machine [16]. This technique utilizes graph-based pattern matching to substitute intrinsics. It can translate SIMD intrinsics between extensions with different vector lengths; this rewriting is carried out through loop unrolling.
N. Gross proposes specialized C++ templates for making the source code easily portable among different vector extensions without sacrificing performance [17]. With these templates, some parts of the source code can be written in a way that resembles scalar code. In order to vectorize algorithms that have a lot of control flow, this scheme requires the bucketing technique to be applied: compute all the paths and choose the relevant results at the end.
Clark et al. propose a method that combines static analysis at compile time with binary translation by a JIT compiler in order to translate SIMD instructions into those available on the target machine [18]. In this method, SIMD instructions in the code are first converted into an equivalent scalar representation. Then, a dynamic translation phase turns the scalar representation back into architecture-specific SIMD equivalents.
Leißa et al. propose a C-like language for portable and efficient SIMD programming [19]. With their extension, writing vectorized code is almost as easy as writing traditional scalar code. There is no strict separation between host code and kernels, and scalar and vector programming can be mixed; switching between them is triggered by the type system. The authors present a formal semantics of their extension and prove the soundness of the type system.
Most of the existing methods aim at translating SIMD intrinsics or instructions into those provided by a different vector extension in order to port code. Intrinsics that are unique to a specific extension are not easy to handle, and translation works only if the source and the target architectures have equivalent SIMD instructions. Automatic vectorizers in compilers have a similar weakness. Whenever possible, we have specialized the implementation of the math functions to exploit the SIMD instructions that are specific to a target vector extension. We also need special handling of FMA, rounding, and a few other kinds of instructions, because these are critical for both execution speed and accuracy. We want to implement a library that is statically optimized and usable with Link Time Optimization (LTO); the users of our library do not appreciate the use of a JIT compiler. In order to minimize dependency on external libraries, we want to write our library in C. To fulfill these requirements, we take a cross-layer approach: we have been developing our abstraction layer of intrinsics, the library implementation, and the algorithms together in order to make our library run fast with any vector extension.
Abstraction of Vector Extensions
Modern processors supporting SIMD instructions have SIMD registers that can contain multiple data [1]. For example, a 128-bit wide SIMD register may contain four 32-bit single-precision FP numbers. A SIMD add instruction might take two of these registers as operands, add the four pairs of numbers, and overwrite one of these registers with the resulting four numbers. We call an array of FP numbers contained in a SIMD register a vector.
SIMD registers and instructions can be exposed in a C program with intrinsic functions and types [20]. An intrinsic function is a kind of inline function that exposes the architectural features of an instruction set at the C level. By calling an intrinsic function, a programmer can make a compiler generate a specific instruction without hand-coded assembly. Nevertheless, the compiler can still reorder instructions and allocate registers, and therefore optimize the code. When intrinsic functions corresponding to SIMD instructions are defined inside a compiler, C data types for representing vectors are also defined.
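As a concrete illustration (not taken from SLEEF), the following fragment uses SSE intrinsics to add two vectors of four single-precision numbers; the compiler emits an addps instruction but remains free to schedule it and to allocate registers.

```c
#include <immintrin.h>

/* Add four pairs of single-precision numbers with one SSE instruction. */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* 4 additions at once */
    _mm_storeu_ps(c, vc);             /* store 4 floats */
}
```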
In SLEEF, we use intrinsic functions to specify which assembly instruction to use for each operator. We abstract the intrinsic functions for each vector extension with a set of inline functions or preprocessor macros. The functions exported from the library call these abstract intrinsic functions instead of calling intrinsic functions directly. In this way, it is easy to swap the vector extension to target. We call our set of inline functions for abstracting architecture-specific intrinsics the Vector Extension Abstraction Layer (VEAL).
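The following is a minimal sketch of the VEAL idea. The names follow SLEEF's convention of encoding the operand types into the function name, but the code is simplified and illustrative rather than an excerpt from the library.

```c
#include <immintrin.h>

/* One abstract type and one abstract operator, mapped to whichever
   extension is available at compile time. */
#if defined(__AVX2__)
typedef __m256d vdouble;
static inline vdouble vadd_vd_vd_vd(vdouble x, vdouble y) {
    return _mm256_add_pd(x, y);
}
#else   /* fall back to SSE2 */
typedef __m128d vdouble;
static inline vdouble vadd_vd_vd_vd(vdouble x, vdouble y) {
    return _mm_add_pd(x, y);
}
#endif

/* Library code is then written once against the abstract operators. */
static inline vdouble vtwice(vdouble x) { return vadd_vd_vd_vd(x, x); }
```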
In some of the existing vector math libraries, functions are implemented with hand-coded assembly [11]. This approach improves the absolute performance because it is possible to provide the optimal implementation for each microarchitecture. However, processors with a new microarchitecture are released every few years, and the library needs revision accordingly in order to maintain the optimal performance.
In other vector math libraries, the source code is written in a scalar fashion that is easy for compilers to auto-vectorize [5], [15]. Although such libraries have good portability, it is not easy for compilers to generate well-optimized code from them. In order for each transformation rule in an optimizer to kick in, the source code must satisfy many conditions that guarantee the optimized code runs correctly and faster. In order to control the level of optimization, a programmer must specify special attributes and compiler options.
3.1 Using Sub-Features of the Vector Extensions
There are differences in the features provided by different vector extensions, and we must change the function implementation according to the available features. Thanks to the level of abstraction provided by the VEALs, we implemented the functions so that all the different versions can be built from the same source files with different macros enabled. For example, the availability of FMA instructions is important when implementing double-double (DD) operators [21]. Utilizing VEALs, we implemented DD operators both with and without FMA by manually specifying whether the compiler may convert each combination of multiplication and addition instructions into an FMA instruction.
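For illustration, the FMA-based half of such a DD building block can be as simple as the classic error-free product below; this is the well-known 2MultFMA step, not an excerpt from SLEEF. Without FMA, computing the same residual requires Dekker's splitting, which takes several more operations.

```c
#include <math.h>

/* Error-free product with FMA: hi + lo equals a*b exactly. */
static inline void twoprod_fma(double a, double b, double *hi, double *lo) {
    *hi = a * b;
    *lo = fma(a, b, -*hi);   /* exact residual of the product */
}
```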
Generally, bit masks are used in vectorized code in order to conditionally choose elements from two vectors. In some vector extensions, a vector register whose width matches that of the vector registers for storing FP values is used to store a bit mask. Other vector extensions provide narrower registers dedicated to this purpose, called opmask registers. SLEEF makes use of these opmask registers by providing a dedicated data type in the VEALs. If a vector extension does not support an opmask, the usual bit mask is used instead. It is also better to have an opmask as an argument of a whole math function and make that function compute only the elements specified by the opmask. By utilizing a VEAL, it is also easy to implement such functionality.
3.2 Details of VEALs
Fig. 1 shows some definitions in the VEAL for SVE [22]. We abstract vector data types and intrinsic functions with typedef statements and inline functions, respectively.
The vdouble data type in this VEAL abstracts a vector of DP FP values and is mapped to the svfloat64_t type of SVE. The function corresponding to each abstract operator is implemented as an inline function that calls the matching SVE intrinsic.
3.3 Making Results Bit-Wise Consistent Across All Platforms
The method of implementing math functions described so far can deliver computation results that slightly differ depending on the architecture and other conditions, although they all satisfy the accuracy requirements and other specifications. However, some applications require bit-wise consistent results.
To this end, the SLEEF project has been working closely with Unity Technologies, which specializes in developing frameworks for video gaming, and we discovered that they have unique requirements for the functionality of math libraries. Networked video games run on many gaming consoles with different architectures, and they share the same virtual environment. Consistent simulation results at each terminal and server are required to ensure fairness among all players. For this purpose, fast computation is more important than accurate computation, while the results of computation have to agree perfectly between many computing nodes, which are not guaranteed to rely on the same architecture. Usually, fixed-point arithmetic is used for such a purpose; however, there is a demand for modifying existing code with FP computation to support networking.
There are also other kinds of simulation in which bit-wise identical reproducibility is important. In [23], the authors show that modeled mean climate states, variability and trends at different scales may be significantly changed or even lead to opposing results due to the round-off errors in climate system simulations. Since reproducibility is a fundamental principle of scientific research, they propose to promote bit-wise identical reproducibility as a worldwide standard.
One way to obtain bit-wise consistent values from math functions is to compute correctly rounded values. However, for applications like networked video games, this might be too expensive. SLEEF provides vectorized math functions that return bit-wise consistent results across all platforms and other settings, and this is also achieved by utilizing VEALs. The basic idea is to always apply the same sequence of operations to the arguments. The IEEE 754 standard guarantees that the basic arithmetic operators give correctly rounded results [24], and therefore the results from these operators are bit-wise consistent. Because most of the functions in our library, except the trigonometric functions, do not have a conditional branch, producing bit-wise consistent results is fairly straightforward with VEALs. The availability of FMA instructions is another key to making results bit-wise consistent. Since FMA instructions are critical for performance, we cannot simply give up using them. In SLEEF, the bit-wise consistent functions therefore come in two versions, with and without FMA instructions. We provide the non-FMA version to guarantee bit-wise consistency with extensions such as Intel SSE2 that do not have FMA instructions. Another issue is that the compiler might introduce inconsistency by FP contraction, which is the result of combining a pair of multiplication and addition operations into an FMA. By disabling FP contraction, as illustrated below, the compiler strictly preserves the order and the type of FP operations during optimization. It is also important to make the values returned by the scalar functions bit-wise consistent with those of the vector functions. In order to achieve this, we also made a VEAL that only uses scalar operators and data types. The bit-wise consistent and non-consistent versions of the vector and scalar functions are all built from the same source files, with different VEALs and macros enabled. As described in Section 5, the trigonometric functions in SLEEF choose a reduction method according to the maximum argument among all elements in the argument vector. In order to make the returned values bit-wise consistent, the bit-wise consistent version of these functions first applies the reduction method for small arguments to the elements covered by this method. It then applies the second method only to the elements with larger arguments, which the first method does not cover.
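For example, FP contraction can be disabled with the standard C99 pragma, where the compiler supports it, or with the corresponding compiler flag (e.g., -ffp-contract=off for GCC and Clang):

```c
/* Keep the compiler from fusing a*b + c into an FMA behind our back. */
#pragma STDC FP_CONTRACT OFF

double axpy(double a, double x, double y) {
    return a * x + y;   /* stays a separate multiply and add */
}
```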
The Development of a Vector Function ABI and SLEEF
Recent compilers implement new optimization techniques to automatically vectorize code written in standard programming languages that do not support parallelization [25], [26]. Although the first SIMD and vector computing systems [27] appeared a few decades ago, compilers with auto-vectorization capabilities have not been widely used until recently, because of several difficulties in implementing such functionality for modern SIMD architectures. Such difficulties include verifying whether the compiler can vectorize a loop or not, by determining the data access patterns of the operations in the loop [2], [28]. For languages like C and C++, it is also difficult to determine the data dependencies across the iteration space of the loop, because it is hard to determine the aliasing conditions of the arrays processed in the loop.
4.1 Vector Function Application Binary Interface
Vectorizing compilers convert calls to scalar versions of math functions such as sine and exponential into calls to SIMD versions of the math functions. The most recent versions of the Intel Compiler [29], the GNU Compiler [30], and the Arm Compiler for HPC [31], which is based on Clang/LLVM [32], [33], are capable of this transformation, and they rely on the availability of vector math libraries such as SVML [3], Libmvec [11] and SLEEF, respectively, to provide an implementation of the vector function calls that they generate. In order to develop this kind of transformation, a target-dependent Application Binary Interface for calling vectorized functions had to be designed.
The Vector Function ABI for AArch64 architecture [34] was designed in close relationship with the development of SLEEF. This type of ABI must standardize the mapping between scalar functions and vector functions. The existence of a standard enables interoperability across different compilers, linkers and libraries, thanks to the use of standard names defined by the specification.
The ABI includes a name-mangling function, a map that converts the scalar signature to the vector one, and the calling conventions that the vector functions must obey. In particular, the name-mangling function that maps the name of the scalar function to that of the vector function must encode all the information necessary to reverse the transformation back to the original scalar function. A linker can use this reverse mapping to enable more optimizations (Link Time Optimization) that operate on object files without access to the scalar and vector function prototypes. There is also user demand for choosing a different vector math library according to the usage, and the reverse mapping is handy for this purpose. A vector math library implements a function for each combination of a vector extension, a vector length and a math function to evaluate; as a result, the library exports a large number of functions. Some vector math libraries implement only part of all the combinations. By using the reverse-mapping mechanism, the compiler can check the availability of the functions by scanning the symbols exported by a library.
The Vector Function ABI is also used with OpenMP [35]. From version 4.0 onwards, OpenMP provides the directive declare simd. A user can decorate a function with this directive to inform the compiler that the function can be safely invoked concurrently on multiple instances of its arguments [36]. This means that the compiler can vectorize the function safely. This is particularly useful when the function is provided via a separate module or an external library, for example in situations where the compiler is not able to examine the behavior of the function at the call site. The scalar-to-vector function mapping rules stipulated in the Vector Function ABI are based on the classification of vector functions associated with the declare simd directive.
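A minimal illustration of the directive follows; the function is hypothetical, and the exact mangled symbol emitted for its vector version depends on the target architecture and the compiler flags (e.g., -fopenmp-simd).

```c
/* Tell the compiler this function is safe to invoke on multiple
   instances of its argument concurrently.  The compiler may emit a
   vector version whose symbol name follows the Vector Function ABI. */
#pragma omp declare simd
double my_kernel(double x) {
    return x * x + 1.0;
}
```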
The Vector Function ABI specifications are provided for the Intel x86 and the Armv8 (AArch64) families of vector extensions [34], [37]. The compiler generates SIMD function calls according to the compiler flags. For example, when targeting AArch64 SVE auto-vectorization, the compiler transforms a call to the standard sin function into a call to the mangled symbol _ZGVsMxv_sin provided by the library.
4.2 SLEEF and the Vector Function ABI
SLEEF is provided as two separate libraries. The first library exposes the functions of SLEEF to programmers for inclusion in their C/C++ code. The second library exposes the functions with names mangled according to the Vector Function ABI. This makes SLEEF a viable alternative to libm and its SIMD counterpart in Glibc, Libmvec.
Overview of Library Implementation
One of the objectives of the SLEEF project is to provide a library of vectorized math functions that can be used in conjunction with vectorizing compilers. When non-vectorized code is automatically vectorized, the compiler converts calls to scalar math functions into calls to a SIMD version of those functions. In order to make this conversion safe and applicable to a wide variety of codes, we need functions with a 1-ULP error bound that conform to the ANSI C standard. On the other hand, there are users who need better performance. Our library therefore provides a 1-ULP accuracy version and a 3.5-ULP accuracy version of most of the functions. We confirmed that our library satisfies these accuracy requirements on an empirical basis. For non-finite inputs and outputs, we implemented the functions to return the same results as libm, as specified in the ANSI C standard. They do not set errno nor raise an exception.
In order to optimize a program with SIMD instructions, it is important to eliminate conditional branches as much as possible and to execute the same sequence of instructions regardless of the argument. If the algorithm requires conditional branches according to the argument, it must prepare for the case in which the elements of the input vector contain both values that would take a branch and values that would not. Recent processors have long pipelines, and the branch misprediction penalty can exceed 10 cycles [39]. Making the decision for a conditional branch also requires a non-negligible amount of computation, within the scope of our tests. A conditional move is an operator that chooses one value from two given values according to a condition. It is equivalent to a ternary operator and can be used in vectorized code to replace a conditional branch, as in the sketch below. Some other operations are also expensive in a vectorized implementation. A table lookup is expensive: although in-register table lookup is reported to be fast on the Cell BE SPU [12], it is substantially slower than polynomial evaluation without any table lookup, within the scope of our tests. Most vector extensions do not provide 64-bit integer multiplication or a vector shift operator in which each element of a vector can be shifted by a different number of bits. On the other hand, FMA and round-to-integer instructions are supported by most vector extensions. Due to the nature of the evaluation methods, dependencies between operations cannot be completely eliminated. The latencies of operations become an issue when a series of dependent operations is executed. FP division and square root are not too expensive from this point of view.
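The following sketch shows branch-free selection with SSE2 intrinsics: an elementwise minimum written with a comparison mask and bitwise operations instead of a branch. It is an illustration of the technique, not code from SLEEF.

```c
#include <immintrin.h>

/* For each lane, pick a[i] if a[i] < b[i], otherwise b[i]. */
__m128d select_min(__m128d a, __m128d b) {
    __m128d mask = _mm_cmplt_pd(a, b);           /* all-ones where a < b */
    return _mm_or_pd(_mm_and_pd(mask, a),        /* take a where mask set */
                     _mm_andnot_pd(mask, b));    /* take b elsewhere */
}
```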
The actual structure of the pipeline in a processor is complex, and such details are not well documented for most CPUs. Therefore, it is not easy to optimize code for a particular hardware implementation. In this paper, we define the latency and throughput of an instruction or a subroutine as follows [41]. The latency of an instruction or a subroutine is the delay that it generates in a dependency chain. The throughput is the maximum number of instructions or subroutines of the same kind that can be executed per unit time when the inputs are independent of the preceding instructions or subroutines. Several tools and methods have been proposed for automatically constructing models of the latency, throughput, and port usage of instructions [42], [43]. Within the scope of our tests, most of the instruction latency in the critical path of evaluating a vector math function tends to be dominated by FMA operations. In many processors, FMA units are pipelined. Some powerful processors have multiple FMA units with out-of-order execution, so the throughput of FMA instructions is high while their latency is long. In SLEEF, we try to maximize the throughput of computation in a versatile way by taking account only of the dependencies among FMA operations. We regard each FMA operation as a job that can be executed in parallel and try to reduce the length of the critical path.
In order to evaluate a double-precision (DP) function to 1-ULP accuracy, internal computation with accuracy better than 1 ULP is sometimes required. Double-double (DD) arithmetic, in which a single value is expressed as the unevaluated sum of two double-precision FP values [44], [45], is used for this purpose. All the basic operators for DD arithmetic can be implemented without conditional branches, and therefore it is suitable for vectorized implementation. Because we only need 1-ULP overall accuracy for DP functions, we use simplified DD operators with less than full DD accuracy. In SLEEF, we omit re-normalization of DD values by default, allowing overlap between the two numbers, and carry out re-normalization only when necessary.
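As an illustration of why DD arithmetic vectorizes well, the error-free addition at its core, Knuth's TwoSum, shown below in a generic form, contains no branches at all.

```c
/* Knuth's TwoSum: s + e equals a + b exactly, with no branches,
   which makes it directly vectorizable. */
static inline void twosum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}
```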
Evaluation of an elementary function often consists of three steps: range reduction, approximation, and reconstruction [21]. An approximation step computes the elementary function using a polynomial. Since this approximation is only valid for a small domain, a number within that range is computed from the argument in a range reduction step. The reconstruction step combines the results of the first two steps to obtain the resulting number.
An argument reduction method that finds the FP remainder of dividing the argument by π/2 is used in implementing the trigonometric functions.
There are tools available for generating the coefficients of the polynomials, such as Maple [49] and Sollya [50]. In order to fine-tune the generated coefficients, we created a tool that generates coefficients minimizing the maximum relative error. When a SLEEF function evaluates a polynomial, it evaluates a few of the lowest-degree terms in DD precision while the other terms are computed in double precision, in order to achieve 1-ULP overall accuracy. Accordingly, the last few terms use coefficients in DD precision, or coefficients representable by FP numbers with only a few of the most significant bits set in the mantissa; we designed our tool to generate such coefficients. We use Estrin's scheme [51] to evaluate polynomials, in order to reduce the dependencies between FMA operations. This scheme reduces bubbles in the pipeline and allows more FMA operations to be executed in parallel. Reducing latency can improve the throughput of evaluating a function, because the latency and the reciprocal throughput of the entire function are close to each other.
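The sketch below shows Estrin's scheme for a generic degree-7 polynomial built from FMA operations; it illustrates the scheme and is not SLEEF's code. The dependency chain is O(log n) deep instead of the O(n) chain of Horner's method.

```c
#include <math.h>

/* Estrin's scheme for c0 + c1*x + ... + c7*x^7.  The four leaf FMAs
   are independent and can execute in parallel. */
double poly7(double x, const double c[8]) {
    double x2 = x * x, x4 = x2 * x2;
    double p01 = fma(c[1], x, c[0]);
    double p23 = fma(c[3], x, c[2]);
    double p45 = fma(c[5], x, c[4]);
    double p67 = fma(c[7], x, c[6]);
    double p03 = fma(p23, x2, p01);
    double p47 = fma(p67, x2, p45);
    return fma(p47, x4, p03);
}
```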
Below, we describe and compare the implementations of selected functions in SLEEF, FDLIBM [4] and Christoph Lauter's Vector-libm [5]. We describe the 1-ULP accuracy versions of the functions in SLEEF. The error bound specification of FDLIBM is 1 ULP.
5.1 Implementation of sin and cos
FDLIBM uses the Cody-Waite range reduction for small arguments, and the Payne-Hanek reduction otherwise.
SLEEF switches among two Cody-Waite range reduction methods with different sets of constants, and the Payne-Hanek reduction. The first version of the algorithm operates on arguments within [-15, 15].
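The following is a minimal two-part Cody-Waite reduction by π/2, assuming FMA and round-to-nearest. The constants here are simply the double nearest to π/2 and its residual; SLEEF's actual constants and path structure differ, and a reduction this short is only accurate for small arguments.

```c
#include <math.h>

static const double PI2_HI = 1.5707963267948966;    /* pi/2 rounded to double */
static const double PI2_LO = 6.123233995736766e-17; /* pi/2 - PI2_HI */

/* Returns x - q*(pi/2) with q = round(x / (pi/2)); *quadrant gets q mod 4. */
double reduce_pio2(double x, int *quadrant) {
    double q = nearbyint(x * 0.6366197723675814);   /* x * (2/pi) */
    double r = fma(-q, PI2_HI, x);                  /* subtract high part */
    r = fma(-q, PI2_LO, r);                         /* subtract low part */
    *quadrant = (int)q & 3;
    return r;
}
```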
5.2 Implementation of tan
After the Cody-Waite or Payne-Hanek reduction, FDLIBM reduces the argument to [0, 0.67434] and uses a polynomial approximation with 13 non-zero terms. It contains 10 conditional branches.
In SLEEF, the argument is reduced in three levels; the first level applies the same Cody-Waite or Payne-Hanek reduction used for sin and cos.
5.3 Implementation of asin and acos
FDLIBM and SLEEF first reduce the argument to [0, 1/2], using the identity asin(x) = π/2 − 2 asin(√((1−x)/2)) for larger arguments.
Then, SLEEF uses a polynomial approximation of arcsine on this reduced domain.
FDLIBM uses a rational approximation with 11 terms (plus one division). For computing arcsine, FDLIBM switches the approximation method if the original argument is over 0.975. For computing arccosine, it has three paths, taken when |x| < 0.5, x ≤ −0.5, and x ≥ 0.5, respectively.
5.4 Implementation of atan
FDLIBM reduces the argument using a table of precomputed arctangent values and evaluates a polynomial approximation on the reduced domain.
SLEEF reduces the argument x to [0, 1] using the identity arctan x = π/2 − arctan(1/x), and evaluates a single polynomial on this domain.
5.5 Implementation of log
FDLIBM reduces the argument to the interval [√2/2, √2] by extracting the exponent, and approximates the logarithm on this domain with one division and a polynomial.
SLEEF multiplies the argument by 2^64 if it is a denormal number, so that the rest of the computation needs no special path for denormals. It then extracts the exponent and evaluates a single polynomial on the reduced mantissa.
5.6 Implementation of exp
All libraries reduce the argument range to [−(log 2)/2, (log 2)/2], finding r and an integer k such that x = k log 2 + r.
SLEEF then uses a polynomial approximation with 13 non-zero terms to directly approximate the exponential function on this domain. It achieves a 1-ULP error bound without using DD operations.
FDLIBM uses a polynomial with 5 non-zero terms to approximate the exponential function on the reduced domain, combining the result with one division in the reconstruction.
The reconstruction step adds the integer k to the exponent of the resulting FP number.
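The three steps can be seen in the following miniature sketch of exp, which is illustrative only and nowhere near the accuracy of any of the libraries discussed here; the split log 2 constants are the ones used in FDLIBM.

```c
#include <math.h>

/* Reduction: x = k*log(2) + r, r in [-log(2)/2, log(2)/2].
   Approximation: a short Taylor polynomial for e^r.
   Reconstruction: scale by 2^k. */
double exp_sketch(double x) {
    const double LN2_HI = 6.93147180369123816490e-01;
    const double LN2_LO = 1.90821492927058770002e-10;
    double k = nearbyint(x * 1.4426950408889634);   /* round(x / log 2) */
    double r = fma(-k, LN2_HI, x);
    r = fma(-k, LN2_LO, r);
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24 + r/120))));
    return ldexp(p, (int)k);                        /* multiply by 2^k */
}
```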
5.7 Implementation of pow
FDLIBM computes x^y as 2^(y log₂ x), carrying out the internal computation with extended precision.
Vector-libm does not implement pow.
SLEEF computes x^y as e^(y log x), evaluating both the logarithm and the exponential in DD precision.
5.8 The Payne-Hanek Range Reduction
Our method computes rfrac(2x/π), where rfrac(a) := a − round(a). Let M and E be the integral mantissa and the exponent of the positive DP argument x, such that x = M · 2^E, and let I(E) and F(E) be the integral and fractional parts of 2^(E+1)/π. Since M · I(E) is an integer, it does not affect the result:
\begin{align*}
\mathrm{rfrac}(2x/\pi) &= \mathrm{rfrac}(M \cdot 2^E \cdot 2/\pi)\\
&= \mathrm{rfrac}(M \cdot (I(E) + F(E)))\\
&= \mathrm{rfrac}(M \cdot F(E)).
\end{align*}
The value F(E) is precomputed for each exponent E and stored in a table, split into four FP numbers F₀(E), F₁(E), F₂(E), F₃(E), so that each product M · Fᵢ(E) can be computed accurately and the integral part can be discarded at each step:
\begin{align*}
& \mathrm{rfrac}(M \cdot F(E))\\
= & \mathrm{rfrac}(M \cdot F_0(E) + M \cdot F_1(E)\\
& + M \cdot F_2(E) + M \cdot F_3(E))\\
= & \mathrm{rfrac}(\mathrm{rfrac}(\mathrm{rfrac}(\mathrm{rfrac}(M \cdot F_0(E)) + M \cdot F_1(E))\\
& + M \cdot F_2(E)) + M \cdot F_3(E)). \tag{1}
\end{align*}
FDLIBM seems to implement the original Payne-Hanek algorithm in more than 100 lines of C code, including 13 conditional branches.
Vector-libm implements a non-vectorized variation of the Payne-Hanek algorithm that has some similarity to our method. In order to reduce the argument, it accumulates products of the argument with precomputed parts of 2/π, discarding the integral parts along the way.
5.9 FP Remainder
We devised an exact remainder calculation method suitable for vectorized implementation. The method is based on the long division method, in which an FP number is regarded as a digit. Fig. 3 shows an example process for calculating the FP remainder of 1e+40 / 0.75. Like a typical long division, we first find an integer quotient digit 1.333e+40 such that 1.333e+40 · 0.75 does not exceed 1e+40. We then subtract 1.333e+40 · 0.75 from 1e+40 to obtain a partial remainder, and repeat these steps until the remainder becomes smaller than the divisor.
Our basic algorithm is shown in Algorithm 1. The number of iterations of the loop grows with the ratio between the dividend and the divisor.
FDLIBM uses a shift-and-subtract method. It first converts the mantissas of the two given arguments into 64-bit integers, and then calculates the remainder bit by bit. The main loop iterates once for each bit of difference between the exponents of the two arguments.
Vector-libm does not implement FP remainder.
Algorithm 1. Exact Remainder Calculation
Input: Finite positive numbers n and d
Output: Returns n mod d
r ← n
while r ≥ d do
  find a quotient digit q, representable as an FP number, such that q · d ≤ r and r − q · d is computed exactly
  r ← r − q · d
end while
return r
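The following C sketch shows the iteration structure of this long-division approach. It is illustrative only: SLEEF's actual algorithm additionally chooses each quotient digit so that the multiplication and subtraction are exact, which this simplified version does not guarantee.

```c
#include <math.h>

/* Each iteration removes one FP "digit" of the quotient. */
double rem_sketch(double n, double d) {
    double r = n;
    while (r >= d) {
        double q = trunc(r / d);     /* quotient digit, possibly 1 ulp off */
        double r2 = fma(-q, d, r);   /* r - q*d with a single rounding */
        if (r2 < 0) r2 += d;         /* the digit was one too large */
        if (r2 >= r) break;          /* no progress; guard against stalls */
        r = r2;
    }
    return r;
}
```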
5.10 Handling of Special Numbers, Exception and Flags
Our implementation gives a value within the specified error bound without special handling of denormal numbers, unless otherwise noted.
When a function has to return a specific value when a specific argument value (such as a NaN or a negative zero) is given, such a condition is checked at the end of each function, and the return value is substituted with the special value if the condition is met. This process is complicated in functions like pow that have many special cases.
SLEEF functions do not give correct results if the rounding mode is different from round-to-nearest. They do not set errno nor raise an exception. This is common behavior among vectorized math libraries, including Libmvec [11] and SVML [3]. Because the whole vector is processed at once, a function could raise spurious exceptions on elements that do not call for them if it tried to raise an exception.
5.11 Summary
FDLIBM extensively uses conditional branches in order to switch the polynomial according to the argument(s).
Vector-libm switches between a few polynomials in most of the functions. It does not provide functions with a 1-ULP error bound; nevertheless, the number of non-zero terms in its polynomials is larger than in the other two libraries for some of the functions. A vectorized path is used in the trigonometric functions only if the argument is smaller than 2.574.
SLEEF uses the fastest paths if all the arguments are under 15 for trigonometric functions, and the same vectorized path is used regardless of the argument in most of the non-trigonometric functions. SLEEF always uses the same polynomial regardless of the argument in all functions.
Although reducing the number of conditional branches has several advantages when implementing vector math libraries, it does not seem to be given a high priority in other libraries.
Testing
SLEEF includes three kinds of testers. The first two kinds test the accuracy of all functions against high-precision evaluation with the MPFR library. In these tests, the computation error in ULPs is calculated by comparing the values output by each SLEEF function with the values output by the corresponding function in the MPFR library, and it is checked whether the error is within the specified bounds.
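The following simplified sketch shows how an error in ULPs can be measured against a higher-precision reference. The real testers use MPFR; here a long double stands in for the high-precision value.

```c
#include <math.h>

/* Error of approx in ULPs, relative to a higher-precision reference.
   ulp is the gap between the reference (rounded to double) and the
   next representable double. */
double ulp_error(double approx, long double reference) {
    double r = (double)reference;
    double ulp = nextafter(fabs(r), INFINITY) - fabs(r);
    return (double)fabsl((long double)approx - reference) / ulp;
}
```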
6.1 Perfunctory Test
The first kind of tester carries out a perfunctory set of tests to check if the build is correct. These tests include standards compliance tests, accuracy tests and regression tests.
In the standards compliance tests, we test whether the functions return the correct values when values that require special handling are given as arguments. These argument values include ±0, ±∞ and NaN.
In the accuracy test, we test whether the error of the values returned by the functions is within the specified range when a predefined set of argument values is given. These argument values are basically chosen at regular intervals between a few combinations of two endpoints. The trigonometric functions are also tested against argument values close to integral multiples of π/2.
In the regression test, the functions are tested with argument values that triggered bugs in the previous library release, in order to prevent re-emergence of the same bug.
The executables are separated into a tester and IUTs (Implementations Under Test). The tests are carried out by making these two executables communicate via an input/output pipeline, in order to enable the testing of libraries for architectures that the MPFR library does not support.
6.2 Randomized Test
The second kind of tester is designed to run continuously. It generates random arguments and compares the output of each function with the output of the corresponding function in the MPFR library. This tester is expected to find bugs if it is run for a sufficiently long time.
In order to generate an argument randomly, the tester generates random bits of the size of an FP value and reinterprets those bits as an FP value, as sketched below. The tester exercises all the functions in the library at several thousand arguments per second per function on a computer with a Core i7-6700 CPU.
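A minimal version of this bit-reinterpretation could look as follows; the xorshift64 generator is a stand-in for whatever PRNG the tester actually uses.

```c
#include <stdint.h>
#include <string.h>

/* xorshift64 PRNG (state must be nonzero); a stand-in for a real PRNG. */
static uint64_t xorshift64(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* Reinterpret 64 random bits as a double, so that every representable
   value (including NaNs, infinities and denormals) can appear. */
double random_argument(uint64_t *state) {
    uint64_t bits = xorshift64(state);
    double d;
    memcpy(&d, &bits, sizeof d);   /* type-pun without undefined behavior */
    return d;
}
```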
In the SLEEF project, we use randomized testing to check the correctness of the functions, rather than formal verification. Proving the correctness of an implementation certainly contributes to its reliability. However, there is a performance overhead, because the implementation would be restricted to forms whose correctness is easy to prove. There would also be an increased maintenance cost, because the proof would need updating each time the implementation is modified.
6.3 Bit-Identity Test
The third kind of tester checks whether bit-identical results are returned by the functions that are supposed to return such results. This test compares the results among binaries compiled with different vector extensions. For each predetermined list of arguments, we calculate an MD5 hash value of all the outputs of each function, and then check whether the hash values match across functions for different architectures.
Performance Comparison
In this section, we present results of a performance comparison between FDLIBM Version 5.3 [4], Vector-libm [5], SLEEF 3.4, and Intel SVML [3] included in Intel C Compiler 19.
We measured the reciprocal throughput of each function by measuring the execution time of a tight loop that repeatedly calls the function in a single-threaded process. In order to obtain useful results, we turned off the optimization flags when compiling the source code of this tight loop, while the libraries were compiled with their default optimization options. We did not use LTO. We confirmed that the calls to the function are not compiled out or inlined by checking the assembly output of the compiler. Each measurement loop calls the function a fixed, large number of times.
We compiled SLEEF and FDLIBM using gcc-7.3.0 with optimization enabled.
We carried out all the measurements on a physical PC with an Intel Core i7-6700 CPU @ 3.40 GHz, without any virtual machine. In order to make sure that the CPU always ran at the same 3.4 GHz clock speed during the measurements, we turned off Turbo Boost. With this setting, 10 nanoseconds correspond to 34 clock cycles.
The following results compare the reciprocal throughput of each function. If the implementation is vectorized and each vector has n elements, a single call evaluates the function for n arguments at once.
7.1 Execution Time of Floating Point Remainder
We compared the reciprocal throughput of the double-precision FP remainder functions in SLEEF and FDLIBM, varying the ratio between the two arguments.
The graph of the reciprocal throughput looks like a step function, because the number of iterations of the remainder calculation increases stepwise with the ratio of the arguments.
7.2 Comparison of Overall Execution Time
We compared the reciprocal throughput of the 256-bit wide vectorized double-precision functions in Vector-libm, SLEEF and SVML, and of the scalar functions in FDLIBM. We generated random arguments that were uniformly distributed within the indicated intervals for each function. In order to check the execution speed of the fast paths in the trigonometric functions, we measured the reciprocal throughput both with arguments confined to the domain covered by the fast paths and with larger arguments.
The reciprocal throughput of the functions in SLEEF is comparable to that of SVML in all cases. This is because the latency of FP operations is generally dominant in the execution time of math functions. Because there are two levels of scheduling mechanisms, the optimizer in the compiler and the out-of-order execution hardware, there is little room for making a difference in throughput or latency.
The execution speed of FDLIBM is not very slow despite its many conditional branches. This seems to be due to a smaller number of FP operations, and to the faster execution of scalar instructions compared to equivalent SIMD instructions.
Vector-libm is slow even when only the vectorized path is used. This seems to be because Vector-libm evaluates polynomials with a large number of terms. Auto-vectorizers are still developing, and the compiled binary code might not be well optimized. When a slow path has to be used, Vector-libm is even slower, since a scalar evaluation has to be carried out for each element of the vector.
Vector-libm uses Horner's method to evaluate polynomials, which involves a long chain of dependent FP operations. In FDLIBM, this latency is reduced by splitting polynomials into even and odd terms, which can be evaluated in parallel. SLEEF uses Estrin's scheme. In our experiments, there was only a small difference in execution speed between Estrin's scheme and splitting polynomials into even and odd terms.
Conclusion
In this paper, we showed that the SLEEF library achieves performance comparable to commercial libraries while maintaining good portability. We have been continuously developing SLEEF since 2010 [52]. We distribute SLEEF under the Boost Software License [53], which is a permissive open-source license. We actively communicate with the developers of compilers and members of other projects in order to understand the needs of real-world users. The Vector Function ABI is important in developing vectorizing compilers. The functions that return bit-identical results were added to our library in response to requests from multiple partners. We thoroughly tested these functionalities, and SLEEF has already been adopted in multiple commercial products.
ACKNOWLEDGMENTS
The authors would like to acknowledge Will Lovett and Srinath Vadlamani for their valuable input and suggestions. They are particularly grateful to Robert Werner, who reviewed the final manuscript. The authors would like to thank Prof. Leigh McDowell for his suggestions. The authors would like to thank all the developers that contributed to the SLEEF project, in particular Diana Bite and Alexandre Mutel.