
SLEEF: A Portable Vectorized Library of C Standard Mathematical Functions



Abstract:

In this article, we present techniques used to implement our portable vectorized library of C standard mathematical functions written entirely in C. In order to make the library portable while maintaining good performance, intrinsic functions of vector extensions are abstracted by inline functions or preprocessor macros. We implemented the functions so that they can use sub-features of vector extensions such as fused multiply-add, mask registers, and extraction of the mantissa. In order to make computation with SIMD instructions efficient, the library uses only a small number of conditional branches, and all the computation paths are vectorized. We devised a variation of the Payne-Hanek argument reduction for trigonometric functions and a floating point remainder calculation method, both of which are suitable for vector computation. We compare the performance of our library with that of Intel SVML.
Published in: IEEE Transactions on Parallel and Distributed Systems ( Volume: 31, Issue: 6, 01 June 2020)
Page(s): 1316 - 1327
Date of Publication: 18 December 2019

SECTION 1

Introduction

The instruction set architecture of most modern processors provides Single Instruction Multiple Data (SIMD) instructions that process multiple instances of data concurrently [1]. The programming model that utilizes these instructions is a key technique for many computing systems to reach their peak performance. Most software SIMD optimizations are introduced manually by programmers. However, this approach introduces a portability problem, because the code needs to be re-written when targeting a new vector extension. In order to improve the portability of code with SIMD optimizations, recent compilers have introduced auto-vectorizing capabilities [2]. To fully exploit the SIMD capabilities of a system, the auto-vectorization transformation of a compiler must be able to invoke a version of each function that operates on concurrent iterations, that is, a vector function. This applies particularly to the C mathematical functions defined in math.h, which are frequently called in hot loops.

In this paper, we describe our implementation of a vectorized library of C standard math functions, called the SLEEF library. SLEEF stands for SIMD Library for Evaluating Elementary Functions, and it implements a vectorized version of all C99 real floating-point math functions. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version of most functions. We confirmed that our library satisfies these accuracy requirements on an empirical basis. Our library achieves both good performance and portability by abstracting intrinsic functions. This abstraction enables sub-features of vector extensions such as mask registers to be utilized while the source code of our library is shared among different vector extensions. We also implemented a version of the functions that returns bit-wise consistent results across all platforms. Our library is designed to be used in conjunction with vectorizing compilers; in order to help the development of vectorizing compilers, we collaborated with compiler developers in designing a Vector Function Application Binary Interface (ABI). The main difficulty in vectorizing math functions is that conditional branches are expensive. We implemented many of the functions in our library without conditional branches. We devised reduction methods and adjusted the domains of polynomials so that a single polynomial covers the entire input domain. As the vector size increases, a value requiring a slow path is more likely to be contained in a vector; therefore, we vectorized all the code paths in order to speed up the computation in such cases. We devised a variation of the Payne-Hanek range reduction and a remainder calculation method that are both suitable for vectorized implementation.

We compare the implementation of several selected functions in our library to those in other open-source libraries. We also compare the reciprocal throughput of functions in our library, Intel SVML [3], FDLIBM [4], and Vector-libm [5]. We show that the performance of our library is comparable to that of Intel SVML.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 discusses how portability is improved by abstracting vector extensions. Section 4 explains the development of a Vector ABI and a vectorized mathematical library. Section 5 shows an overview of the implementation of SLEEF, while comparing our library with FDLIBM and Vector-libm. Section 6 explains how our library is tested. Section 7 compares our work with prior art. In Section 8, the conclusions are presented.

SECTION 2

Related Work

2.1 C Standard Math Library

The C standard library (libc) includes the standard mathematical library (libm) [6]. There have been many implementations of libm. Among them, FDLIBM [4] and the libm included in the GNU C Library [7] are the most widely used libraries. FDLIBM is a freely distributable libm developed by Sun Microsystems, Inc., and there are many derivations of this library. Gal et al. described the algorithms used in the elementary mathematical library of the IBM Israel Scientific Center [8]. Their algorithms are based on the accurate tables method developed by Gal. It achieves high performance and produces very accurate results. Crlibm is a project to build a correctly rounded mathematical library [9].

There are several existing vectorized implementations of libm. The Intel Short Vector Math Library (SVML) is a highly regarded commercial library [3]. It provides highly optimized subroutines for evaluating elementary functions, which can use several kinds of vector extensions available in Intel's processors. However, this library is proprietary and optimized only for Intel's processors. There are also a few other commercial and open-source implementations of vectorized libm. AMD provides a vectorized libm called the AMD Core Math Library (ACML) [10].

Some of the code from SVML has been published under a free software license, and it is now available as Libmvec [11], which is a part of Glibc. This library provides functions with a 4-ULP error bound. It is coded in assembly language, and therefore it does not have good portability. C. K. Anand et al. reported their C implementation of 32 single-precision libm functions tuned for the Cell BE SPU compute engine [12]. To develop their library, they used an environment called Coconut, which enables rapid prototyping of patterns and rapid unit testing of assembly language fragments and patterns. M. Dukhan published an open-source and portable SIMD vector libm library named Yeppp! [13], [14]. Most vectorized implementations of libm utilize assembly coding or intrinsic functions to specify which vector instruction is used for each operator. On the other hand, there are implementations of vector libm that are written in a scalar fashion and rely on a vectorizing compiler to generate a vectorized binary. Christoph Lauter published an open-source Vector-libm library implemented in plain C [5]. The VDT Mathematical Library [15] is a math library written to exploit the auto-vectorization feature of compilers.

2.2 Translation of SIMD Instructions

Manilov et al. propose a C source code translator for substituting calls to platform-specific intrinsic functions in source code with those available on the target machine [16]. This technique utilizes graph-based pattern matching to substitute intrinsics. It can translate SIMD intrinsics between extensions with different vector lengths; this rewriting is carried out through loop unrolling.

N. Gross proposes specialized C++ templates for making source code easily portable among different vector extensions without sacrificing performance [17]. With these templates, some parts of the source code can be written in a way that resembles scalar code. In order to vectorize algorithms that have a lot of control flow, this scheme requires that the bucketing technique be applied: all the paths are computed, and the relevant results are chosen at the end.

Clark et al. propose a method that combines static analysis at compile time with binary translation by a JIT compiler in order to translate SIMD instructions into those available on the target machine [18]. In this method, SIMD instructions in the code are first converted into an equivalent scalar representation. Then, a dynamic translation phase turns the scalar representation back into architecture-specific SIMD equivalents.

Leißa et al. propose a C-like language for portable and efficient SIMD programming [19]. With their extension, writing vectorized code is almost as easy as writing traditional scalar code. There is no strict separation in host code and kernels, and scalar and vector programming can be mixed. Switching between them is triggered by the type system. The authors present a formal semantics of their extension and prove the soundness of the type system.

Most of the existing methods aim at translating SIMD intrinsics or instructions into those provided by a different vector extension in order to port code. Intrinsics that are unique to a specific extension are not easy to handle, and translation works only if the source and the target architectures have equivalent SIMD instructions. Automatic vectorizers in compilers have a similar weakness. Whenever possible, we have specialized the implementation of the math functions to exploit the SIMD instructions that are specific to a target vector extension. We also need special handling of FMA, rounding, and a few other kinds of instructions, because these are critical for both execution speed and accuracy. We want to implement a library that is statically optimized and usable with Link Time Optimization (LTO), since the users of our library do not appreciate the use of a JIT compiler. In order to minimize dependency on external libraries, we want to write our library in C. In order to fulfill these requirements, we take a cross-layer approach: we have been co-developing our abstraction layer of intrinsics, the library implementation, and the algorithms in order to make our library run fast with any vector extension.

SECTION 3

Abstraction of Vector Extensions

Modern processors supporting SIMD instructions have SIMD registers that can contain multiple data [1]. For example, a 128-bit wide SIMD register may contain four 32-bit single-precision FP numbers. A SIMD add instruction might take two of these registers as operands, add the four pairs of numbers, and overwrite one of these registers with the resulting four numbers. We call an array of FP numbers contained in a SIMD register a vector.

SIMD registers and instructions can be exposed in a C program with intrinsic functions and types [20]. An intrinsic function is a kind of inline function that exposes the architectural features of an instruction set at the C level. By calling an intrinsic function, a programmer can make a compiler generate a specific instruction without hand-coded assembly; nevertheless, the compiler can still reorder instructions, allocate registers, and therefore optimize the code. When intrinsic functions corresponding to SIMD instructions are defined inside a compiler, C data types for representing vectors are also defined.

In SLEEF, we use intrinsic functions to specify which assembly instruction to use for each operator. We abstract intrinsic functions for each vector extension by a set of inline functions or preprocessor macros. We implement the functions exported from the library to call abstract intrinsic functions instead of directly calling intrinsic functions. In this way, it is easy to swap the vector extension to use. We call our set of inline functions for abstracting architecture-specific intrinsics Vector Extension Abstraction Layer (VEAL).

In some of the existing vector math libraries, functions are implemented with hand-coded assembly [11]. This approach improves the absolute performance because it is possible to provide the optimal implementation for each microarchitecture. However, processors with a new microarchitecture are released every few years, and the library needs revision accordingly in order to maintain the optimal performance.

In other vector math libraries, the source code is written in a scalar fashion that is easy for compilers to auto-vectorize [5], [15]. Although such libraries have good portability, it is not easy for compilers to generate well-optimized code from them. In order for each transformation rule in an optimizer to kick in, the source code must satisfy many conditions that guarantee the optimized code runs correctly and faster. In order to control the level of optimization, a programmer must specify special attributes and compiler options.

3.1 Using Sub-Features of the Vector Extensions

There are differences in the features provided by different vector extensions, and we must change the function implementation according to the available features. Thanks to the level of abstraction provided by the VEALs, we implemented the functions so that all the different versions can be built from the same source files with different macros enabled. For example, the availability of FMA instructions is important when implementing double-double (DD) operators [21]. We implemented DD operators both with and without FMA, utilizing VEALs to manually specify whether the compiler may convert each combination of multiplication and addition instructions into an FMA instruction.
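To illustrate the difference that FMA availability makes, below is a minimal scalar sketch of the error-free product underlying DD multiplication, in both variants: with FMA, and without FMA using Veltkamp splitting and Dekker's product. This is only a sketch; SLEEF's actual DD operators work on vectors through VEALs, and the FMA/non-FMA choice is made with VEAL macros. Note that the non-FMA variant must be compiled with FP contraction disabled (see Section 3.3), or the compiler may fuse the very operations the algorithm relies on.

    #include <math.h>

    typedef struct { double x, y; } double2;  /* unevaluated sum x + y */

    /* With FMA: the rounding error of a*b is recovered with a single FMA. */
    static double2 dd_mul_fma(double a, double b) {
      double2 r;
      r.x = a * b;
      r.y = fma(a, b, -r.x);  /* exact residual: a*b - round(a*b) */
      return r;
    }

    /* Without FMA: split each operand into high and low halves (Veltkamp
       splitting), then accumulate the partial products (Dekker's product). */
    static double2 dd_mul_nofma(double a, double b) {
      const double C = 134217729.0;  /* 2^27 + 1 */
      double ah = (a * C) - ((a * C) - a), al = a - ah;
      double bh = (b * C) - ((b * C) - b), bl = b - bh;
      double2 r;
      r.x = a * b;
      r.y = (((ah * bh - r.x) + ah * bl) + al * bh) + al * bl;
      return r;
    }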

Generally, bit masks are used in vectorized code in order to conditionally choose elements from two vectors. In some vector extensions, a bit mask is stored in a vector register whose width matches that of the vector registers storing FP values. Some vector extensions instead provide narrower vector registers that are dedicated to this purpose; SLEEF makes use of these opmask registers by providing a dedicated data type in the VEALs. If a vector extension does not support an opmask, the usual bit mask is used instead. It is also useful to pass an opmask as an argument to a whole math function and make that function compute only the elements specified by the opmask. By utilizing a VEAL, it is easy to implement such functionality as well.

3.2 Details of VEALs

Fig. 1 shows some definitions in the VEAL for SVE [22]. We abstract vector data types and intrinsic functions with typedef statements and inline functions, respectively.

Fig. 1. A part of the definitions in the VEAL for SVE.
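As a rough sketch of the kind of definitions shown in Fig. 1, the following wrappers are written with ACLE SVE intrinsics. The wrapper names are those discussed below, but the exact intrinsic choices are illustrative rather than SLEEF's actual source.

    #include <arm_sve.h>

    typedef svfloat64_t vdouble;   /* vector of double-precision values */
    typedef svbool_t    vopmask;   /* opmask (predicate) type           */

    /* Broadcast a scalar to all elements of a vector. */
    static inline vdouble vcast_vd_d(double d) { return svdup_n_f64(d); }

    /* Element-wise subtraction. */
    static inline vdouble vsub_vd_vd_vd(vdouble x, vdouble y) {
      return svsub_f64_x(svptrue_b8(), x, y);
    }

    /* Element-wise compare-equal, producing an opmask. */
    static inline vopmask veq_vo_vd_vd(vdouble x, vdouble y) {
      return svcmpeq_f64(svptrue_b8(), x, y);
    }

    /* Per-element select between two vectors, controlled by an opmask. */
    static inline vdouble vsel_vd_vo_vd_vd(vopmask o, vdouble x, vdouble y) {
      return svsel_f64(o, x, y);
    }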

The vdouble data type is for storing vectors of double precision FP numbers. vopmask is the data type for the opmask described in Section 3.1.

The function vcast_vd_d returns a vector in which the given scalar value is copied to all elements. vsub_vd_vd_vd performs vector subtraction between two vdouble data. veq_vo_vd_vd compares the elements of two vectors of vdouble type. The result of the comparison can be used, for example, by vsel_vd_vo_vd_vd to choose a value for each element between two vector registers. Fig. 2 shows an implementation of a vectorized positive difference function using a VEAL. This function is a vectorized implementation of the fdim function in the C standard math library.

Fig. 2. Implementation of the vectorized fdim (positive difference) function with a VEAL.
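A sketch of what such an implementation looks like on top of the VEAL wrappers sketched above follows; vlt_vo_vd_vd and vor_vo_vo_vo are additional assumed wrappers following the same naming convention, and the body mirrors the branch-free structure described in this section rather than reproducing Fig. 2 verbatim.

    /* Additional assumed wrappers in the same style as the VEAL sketch above. */
    static inline vopmask vlt_vo_vd_vd(vdouble x, vdouble y) {
      return svcmplt_f64(svptrue_b8(), x, y);
    }
    static inline vopmask vor_vo_vo_vo(vopmask x, vopmask y) {
      return svorr_b_z(svptrue_b8(), x, y);
    }

    /* Vectorized fdim: x - y where x > y, +0 otherwise, NaN propagated. */
    vdouble xfdim(vdouble x, vdouble y) {
      vdouble d = vsub_vd_vd_vd(x, y);
      /* Return +0 where x - y < 0 or x == y; the x == y test also covers
         x = y = infinity, where the subtraction yields NaN. Elements where
         x or y is NaN keep d = NaN, since both comparisons are false. */
      vopmask zero = vor_vo_vo_vo(vlt_vo_vd_vd(d, vcast_vd_d(0.0)),
                                  veq_vo_vd_vd(x, y));
      return vsel_vd_vo_vd_vd(zero, vcast_vd_d(0.0), d);
    }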

3.3 Making Results Bit-Wise Consistent Across All Platforms

The method of implementing math functions described so far can deliver computation results that differ slightly depending on the architecture and other conditions, even though they all satisfy the accuracy requirements and other specifications. However, some applications require bit-wise consistent results.

To this end, the SLEEF project has been working closely with Unity Technologies, which specializes in developing frameworks for video gaming, and we discovered that they have unique requirements for the functionality of math libraries. Networked video games run on many gaming consoles with different architectures, and they share the same virtual environment. Consistent simulation results at each terminal and server are required to ensure fairness among all players. For this purpose, fast computation is more important than accurate computation, but the results of computation have to agree perfectly between many computing nodes, which are not guaranteed to rely on the same architecture. Usually, fixed-point arithmetic is used for such purposes; however, there is demand for modifying existing code with FP computation to support networking.

There are also other kinds of simulation in which bit-wise identical reproducibility is important. In [23], the authors show that modeled mean climate states, variability and trends at different scales may be significantly changed or even lead to opposing results due to the round-off errors in climate system simulations. Since reproducibility is a fundamental principle of scientific research, they propose to promote bit-wise identical reproducibility as a worldwide standard.

One way to obtain bit-wise consistent values from math functions is to compute correctly rounded values. However, for applications like networked video games, this might be too expensive. SLEEF instead provides vectorized math functions that return bit-wise consistent results across all platforms and other settings, and this is also achieved by utilizing VEALs. The basic idea is to always apply the same sequence of operations to the arguments. The IEEE 754 standard guarantees that the basic arithmetic operators give correctly rounded results [24], and therefore the results from these operators are bit-wise consistent. Because most of the functions in our library, except the trigonometric functions, do not have a conditional branch, producing bit-wise consistent results is fairly straightforward with VEALs.

Availability of FMA instructions is another key to making results bit-wise consistent. Since FMA instructions are critical for performance, we cannot simply give up using them. In SLEEF, the bit-wise consistent functions therefore come in two versions, with and without FMA instructions; the non-FMA version guarantees bit-wise consistency with extensions such as Intel SSE2 that do not have FMA instructions. A further issue is that the compiler might introduce inconsistency by FP contraction, that is, by combining a pair of multiplication and addition operations into an FMA. With FP contraction disabled, the compiler strictly preserves the order and the type of FP operations during optimization.

It is also important to make the values returned by the scalar functions bit-wise consistent with those of the vector functions. In order to achieve this, we also made a VEAL that uses only scalar operators and data types. The bit-wise consistent and non-consistent versions of the vector and scalar functions are all built from the same source files, with different VEALs and macros enabled. As described in Section 5, the trigonometric functions in SLEEF choose a reduction method according to the maximum argument over all elements of the argument vector. In order to make the returned values bit-wise consistent, the bit-wise consistent version of these functions first applies the reduction method for small arguments to the elements covered by that method, and then applies the second method only to the elements with larger arguments that the first method does not cover.
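The effect of FP contraction can be demonstrated with a small self-contained example. The program below is an illustration added here, not SLEEF code; -ffp-contract=off is the GCC/Clang flag for disabling contraction.

    /* Demonstrates why FP contraction breaks bit-wise consistency. Compile
       with -ffp-contract=off (GCC/Clang) so that the first expression really
       performs a rounded multiply followed by a rounded add; with contraction
       enabled, the compiler may silently turn it into the FMA form below. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
      double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
      double twostep = a * b + c;    /* two roundings: prints 0x0p+0    */
      double fused   = fma(a, b, c); /* one rounding:  prints -0x1p-54  */
      printf("%a\n%a\n", twostep, fused);
      return 0;
    }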

SECTION 4

The Development of a Vector Function ABI and SLEEF

Recent compilers implement new optimization techniques to automatically vectorize code written in standard programming languages that have no support for parallelization [25], [26]. Although the first SIMD and vector computing systems [27] appeared a few decades ago, compilers with auto-vectorization capabilities were not widely used until recently because of several difficulties in implementing such functionality for modern SIMD architectures. Such difficulties include verifying whether the compiler can vectorize a loop, by determining the data access patterns of the operations in the loop [2], [28]. For languages like C and C++, it is also difficult to determine the data dependencies across the iteration space of the loop, because it is hard to determine the aliasing conditions of the arrays processed in the loop.

4.1 Vector Function Application Binary Interface

Vectorizing compilers convert calls to scalar versions of math functions such as sine and exponential into calls to SIMD versions of those functions. The most recent versions of the Intel Compiler [29], the GNU Compiler [30], and Arm Compiler for HPC [31], which is based on Clang/LLVM [32], [33], are capable of this transformation, and they rely on the availability of vector math libraries such as SVML [3], Libmvec [11] and SLEEF, respectively, to provide an implementation of the vector function calls that they generate. In order to develop this kind of transformation, a target-dependent Application Binary Interface for calling vectorized functions had to be designed.

The Vector Function ABI for AArch64 architecture [34] was designed in close relationship with the development of SLEEF. This type of ABI must standardize the mapping between scalar functions and vector functions. The existence of a standard enables interoperability across different compilers, linkers and libraries, thanks to the use of standard names defined by the specification.

The ABI includes a name mangling function, a map that converts the scalar signature to the vector one, and the calling conventions that the vector functions must obey. In particular, the name mangling function that takes the name of the scalar function to the vector function must encode all the information necessary to reverse the transformation back to the original scalar function. A linker can use this reverse mapping to enable further optimizations (Link Time Optimization) that operate on object files and do not have access to the scalar and vector function prototypes. There is also demand from users for choosing a different vector math library according to the usage, and reverse mapping is handy for this purpose as well. A vector math library implements a function for each combination of a vector extension, a vector length, and a math function to evaluate; as a result, the library exports a large number of functions. Some vector math libraries implement only part of all the combinations. By using the reverse mapping mechanism, the compiler can check the availability of the functions by scanning the symbols exported by a library.

The Vector Function ABI is also used with OpenMP [35]. From version 4.0 onwards, OpenMP provides the declare simd directive. A user can decorate a function with this directive to inform the compiler that the function can be safely invoked concurrently on multiple instances of its arguments [36]. This means that the compiler can vectorize the function safely. This is particularly useful when the function is provided via a separate module or an external library, for example in situations where the compiler is not able to examine the behavior of the function at the call site. The scalar-to-vector function mapping rules stipulated in the Vector Function ABI are based on the classification of vector functions associated with the declare simd directive of OpenMP. Currently, work on implementing these OpenMP directives in LLVM is ongoing.
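As a minimal usage sketch (my_kernel is a hypothetical user function, not part of SLEEF or OpenMP):

    /* The declare simd directive promises that my_kernel may safely be
       invoked concurrently on multiple instances of its arguments, so the
       loop below can be vectorized even though its body is an external call. */
    #pragma omp declare simd notinbranch
    double my_kernel(double x);

    void apply(double *restrict out, const double *restrict in, int n) {
      #pragma omp simd
      for (int i = 0; i < n; i++)
        out[i] = my_kernel(in[i]);
    }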

The Vector Function ABI specifications are provided for the Intel x86 and the Armv8 (AArch64) families of vector extensions [34], [37]. The compiler generates SIMD function calls according to the compiler flags. For example, when targeting AArch64 SVE auto-vectorization, the compiler transforms a call to the standard sin function into a call to the symbol _ZGVsMxv_sin. When targeting Intel AVX-512 [38] auto-vectorization, the compiler generates a call to the symbol _ZGVeN8v_sin.
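To make this concrete, here is a sketch of the kind of loop that triggers such calls; the loop is ordinary scalar C, and the mangled symbols are chosen by the compiler, not by the programmer.

    #include <math.h>

    /* The programmer writes an ordinary scalar loop... */
    void sin_all(double *restrict out, const double *restrict in, int n) {
      for (int i = 0; i < n; i++)
        out[i] = sin(in[i]);
      /* ...and an auto-vectorizing compiler may replace the scalar sin call
         with, e.g., _ZGVsMxv_sin (AArch64 SVE) or _ZGVeN8v_sin (AVX-512,
         eight doubles per call). */
    }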

4.2 SLEEF and the Vector Function ABI

SLEEF is provided as two separate libraries. The first library exposes the functions of SLEEF to programmers for inclusion in their C/C++ code. The second library exposes the functions with names mangled according to the Vector Function ABI. This makes SLEEF a viable alternative to libm and its SIMD counterpart in glibc, libmvec. It also enables a user work-flow that relies on the auto-vectorization capabilities of a compiler. The compatibility with libmvec enables users to swap from libmvec to libsleef by simply changing compiler options, without changing the code that generated the vector call. The two SLEEF libraries are built from the same source code, which is configured to target the different versions via auto-generative programs that transparently rename the functions according to the rules of the target library.

SECTION 5

Overview of Library Implementation

One of the objectives of the SLEEF project is to provide a library of vectorized math functions that can be used in conjunction with vectorizing compilers. When a non-vectorized code is automatically vectorized, the compiler converts calls to scalar math functions into calls to SIMD versions of those functions. In order to make this conversion safe and applicable to a wide variety of codes, we need functions with a 1-ULP error bound that conform to the ANSI C standard. On the other hand, there are users who need better performance. Our library therefore provides a 1-ULP accuracy version and a 3.5-ULP accuracy version of most functions. We confirmed that our library satisfies these accuracy requirements on an empirical basis. For non-finite inputs and outputs, we implemented the functions to return the same results as libm, as specified in the ANSI C standard. They do not set errno nor raise an exception.

In order to optimize a program with SIMD instructions, it is important to eliminate conditional branches as much as possible and to execute the same sequence of instructions regardless of the argument. If the algorithm requires conditional branches according to the argument, it must handle the case where the input vector contains elements that would take the branch alongside elements that would not. Recent processors have a long pipeline, and the branch misprediction penalty can reach more than 10 cycles [39]. Making the decision for a conditional branch also requires a non-negligible amount of computation, within the scope of our tests. A conditional move is an operator for choosing one value from two given values according to a condition. It is equivalent to a ternary operator and can be used in vectorized code to replace a conditional branch.

Some other operations are also expensive in a vectorized implementation. A table lookup is expensive: although in-register table lookup is reported to be fast on the Cell BE SPU [12], it is substantially slower than polynomial evaluation without any table lookup, within the scope of our tests. Most vector extensions do not provide 64-bit integer multiplication or a vector shift operator that can shift each element of a vector by a different number of bits. On the other hand, FMA and round-to-integer instructions are supported by most vector extensions. Due to the nature of the evaluation methods, dependencies between operations cannot be completely eliminated, and the latencies of operations become an issue when a series of dependent operations is executed. FP division and square root are not too expensive from this aspect.

The actual structure of the pipeline in a processor is complex, and such details are not well documented for most CPUs; therefore, it is not easy to optimize the code against the hardware implementation. In this paper, we define the latency and throughput of an instruction or a subroutine as follows [41]. The latency of an instruction or a subroutine is the delay that it generates in a dependency chain. The throughput is the maximum number of instructions or subroutines of the same kind that can be executed per unit time when the inputs are independent of the preceding instructions or subroutines. Several tools and methods have been proposed for automatically constructing models of the latency, throughput, and port usage of instructions [42], [43]. Within the scope of our tests, the instruction latency in the critical path of evaluating a vector math function tends to be dominated by FMA operations. In many processors, FMA units are implemented in a pipelined manner. Some powerful processors have multiple FMA units with out-of-order execution, so the throughput of FMA instructions is high while the latency is long. In SLEEF, we try to maximize the throughput of computation in a versatile way by taking account only of the dependencies among FMA operations. We regard each FMA operation as a job that can be executed in parallel and try to reduce the length of the critical path.

In order to evaluate a double-precision (DP) function to 1-ULP accuracy, internal computation with accuracy better than 1 ULP is sometimes required. Double-double (DD) arithmetic, in which a single value is expressed as the sum of two double-precision FP values [44], [45], is used for this purpose. All the basic operators for DD arithmetic can be implemented without a conditional branch, and therefore DD arithmetic is suitable for vectorized implementation. Because we only need 1-ULP overall accuracy for DP functions, we use simplified DD operators with less than full DD accuracy. In SLEEF, we omit re-normalization of DD values by default, allowing overlap between the two numbers, and carry out re-normalization only when necessary.

Evaluation of an elementary function often consists of three steps: range reduction, approximation, and reconstruction [21]. An approximation step computes the elementary function using a polynomial. Since this approximation is only valid for a small domain, a number within that range is computed from the argument in a range reduction step. The reconstruction step combines the results of the first two steps to obtain the resulting number.

An argument reduction method that finds the FP remainder of dividing the argument x by \pi is used in the evaluation of trigonometric functions. The range reduction method suggested by Cody and Waite [46], [47] is used for small arguments. The method by Payne and Hanek [48] provides accurate range reduction for large arguments of trigonometric functions, but it is expensive in terms of operations.

There are tools available for generating the coefficients of the polynomials, such as Maple [49] and Sollya [50]. In order to fine-tune the generated coefficients, we created a tool for generating coefficients that minimizes the maximum relative error. When a SLEEF function evaluates a polynomial, it evaluates the few lowest-degree terms in DD precision while the other terms are computed in double precision, in order to achieve 1-ULP overall accuracy. Accordingly, the last few terms use coefficients in DD precision, or coefficients that can be represented by FP numbers with only a few most significant bits set in the mantissa; we designed our tool to generate such coefficients. We use Estrin's scheme [51] to evaluate a polynomial in order to reduce the dependencies between FMA operations. This scheme reduces bubbles in the pipeline and allows more FMA operations to be executed in parallel. Reducing latency can improve the throughput of evaluating a function, because the latency and the reciprocal throughput of the entire function are close to each other.
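The contrast between the two evaluation orders can be sketched as follows for a generic degree-4 polynomial; the coefficients here are placeholders, not SLEEF's minimax coefficients.

    #include <math.h>

    /* Horner: each FMA depends on the previous one, forming one long chain. */
    static double horner4(double x, const double c[5]) {
      double r = c[4];
      r = fma(r, x, c[3]);
      r = fma(r, x, c[2]);
      r = fma(r, x, c[1]);
      return fma(r, x, c[0]);
    }

    /* Estrin: pairs of terms are combined independently, shortening the
       critical path so that independent FMAs can issue in parallel. */
    static double estrin4(double x, const double c[5]) {
      double x2 = x * x;
      double e01 = fma(c[1], x, c[0]);          /* c0 + c1*x, independent...  */
      double e23 = fma(c[3], x, c[2]);          /* ...of c2 + c3*x            */
      return fma(fma(c[4], x2, e23), x2, e01);  /* e01 + e23*x^2 + c4*x^4     */
    }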

Below, we describe and compare the implementations of selected functions in SLEEF, FDLIBM [4] and Christoph Lauter's Vector-libm [5]. We describe 1-ULP accuracy version of functions in SLEEF. The error bound specification of FDLIBM is 1 ULP.

5.1 Implementation of sin and cos

FDLIBM uses Cody-Waite range reduction if the argument is under 2^{18}\pi. Otherwise, it uses the Payne-Hanek range reduction. Then, it switches between polynomial approximations of the sine and cosine functions on [-\pi /4, \pi /4]. Each polynomial has 6 non-zero terms.

sin and cos in Vector-libm have a 4-ULP error bound. They use a vectorized path if all arguments are greater than 3.05e-151 and less than 5.147 for sine, or 2.574 for cosine. In the vectorized paths, polynomials with 8 and 9 non-zero terms are used to approximate the sine function on [-\pi /2, \pi /2], following Cody-Waite range reduction. In the scalar paths, Vector-libm uses a polynomial with 10 non-zero terms.

SLEEF switches among two Cody-Waite range reduction methods with different sets of constants and the Payne-Hanek reduction. The first version of the reduction operates for arguments within [-15, 15], and the second version for arguments within [-10^{14}, 10^{14}]. Otherwise, SLEEF uses a vectorized Payne-Hanek reduction, which is described in Section 5.8. SLEEF only uses conditional branches for choosing a reduction method from Cody-Waite and Payne-Hanek. SLEEF uses a polynomial approximation of the sine function on [-\pi /2, \pi /2] with 9 non-zero terms; the sign is set in the reconstruction step.
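The idea behind Cody-Waite reduction with FMA can be sketched in scalar code as follows. This uses the classic two-term split of \pi, whereas SLEEF's actual reduction uses more split constants, and different sets of them for the two argument ranges above.

    #include <math.h>

    /* Subtract q*pi with pi split into a head and a tail, so most of the
       cancellation happens exactly and the reduced argument stays accurate. */
    static double cody_waite_reduce(double x, double *q_out) {
      const double PI_HI = 3.141592653589793116;    /* pi rounded to double  */
      const double PI_LO = 1.2246467991473532e-16;  /* pi - PI_HI (residual) */
      double q = nearbyint(x / PI_HI);              /* nearest multiple      */
      double r = fma(-q, PI_HI, x);                 /* head subtraction      */
      r = fma(-q, PI_LO, r);                        /* fold in the tail      */
      *q_out = q;
      return r;  /* approximately x - q*pi, far better than a single fma */
    }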

5.2 Implementation of tan

After Cody-Waite or Payne-Hanek reduction, FDLIBM reduces the argument to [0,0.67434], and uses a polynomial approximation with 13 non-zero terms. It has 10 if statements after Cody-Waite reduction.

tan in Vector-libm has an 8-ULP error bound. A vectorized path is used if all arguments are less than 2.574 and greater than 3.05e-151. After Cody-Waite range reduction, a polynomial with 9 non-zero terms approximating the sine function on [-\pi /2, \pi /2] is used twice to approximate the sine and cosine of the reduced argument, and the result is obtained by dividing these values. In the scalar path, Vector-libm evaluates a polynomial with 10 non-zero terms twice.

In SLEEF, the argument is reduced in 3 levels. It first reduces the argument to [-\pi /2, \pi /2] with Cody-Waite or Payne-Hanek range reduction. Then, it reduces the argument to [-\pi /4, \pi /4] with \tan a_1 = 1 / \tan (\pi /2 - a_0). At the third level, it reduces the argument to [-\pi /8, \pi /8] with the double-angle formula. Let a_0 be the reduced argument with Cody-Waite or Payne-Hanek. a_1 = \pi /2 - a_0 if |a_0| > \pi /4. Otherwise, a_1 = a_0. Then, SLEEF uses a polynomial approximation of the tangent function on [-\pi /8, \pi /8], which has 9 non-zero terms, to approximate \tan (a_1/2). Let t be the obtained value with this approximation. Then, \tan a_0 \approx 2t / (1 - t^2) if |a_0| \leq \pi /4. Otherwise, \tan a_0 \approx (1 - t^2) / (2t). SLEEF only uses conditional branches for choosing a reduction method from Cody-Waite and Payne-Hanek. Annotated source code of tan is shown in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2019.2960333.

5.3 Implementation of asin and acos

FDLIBM and SLEEF first reduce the argument to [0, 0.5] using \arcsin x = \pi /2-2\arcsin \sqrt{(1-x)/2} and \arccos x = 2\arcsin \sqrt{(1-x)/2}.

Then, SLEEF uses a polynomial approximation of arcsine on [0, 0.5] with 12 non-zero terms.

FDLIBM uses a rational approximation with 11 terms (plus one division). For computing arcsine, FDLIBM switches the approximation method if the original argument is over 0.975. For computing arccosine, it has three paths that are taken when |x|<0.5, x \leq -0.5 and x \geq 0.5, respectively. It has 7 and 6 if statements in asin and acos, respectively.

asin and acos in Vector-libm have a 6-ULP error bound. They use vectorized paths if the arguments are all greater than 3.05e-151 and 2.77e-17, respectively. Vector-libm evaluates polynomials with 3, 8, 8, and 5 terms to compute arcsine, and a polynomial with 21 terms for arccosine.

5.4 Implementation of atan

FDLIBM reduces the argument to [0, 7/16]. It uses a polynomial approximation of the arctangent function with 11 non-zero terms. It has 9 if statements.

atan in Vector-libm has a 6-ULP error bound. Vector-libm uses vectorized paths if the arguments are all greater than 1.86e-151 and less than 2853. It evaluates four polynomials with 7, 9, 9 and 4 terms in the vectorized path.

SLEEF reduces argument a to [0, 1] using \arctan x = \pi /2 - \arctan (1/x). Let a^{\prime } = 1/a if |a| \geq 1. Otherwise, a^{\prime } = a. It then uses a polynomial approximation of arctangent function with 20 non-zero terms to approximate r \approx \arctan a^{\prime }. As a reconstruction, it computes \arctan a \approx \pi /2 - r if |a| \geq 1. Otherwise, \arctan a \approx r.

5.5 Implementation of log

FDLIBM reduces the argument to [\sqrt{2}/2, \sqrt{2}]. It then approximates the logarithm of the reduced argument with a polynomial that contains 7 non-zero terms, in a way similar to SLEEF. It has 9 if statements.

log in Vector-libm has a 4-ULP error bound. It uses a vectorized path if the input is a normalized number. It uses a polynomial with 20 non-zero terms to approximate the logarithm function on [0.75, 1.5]. It does not use division.

SLEEF multiplies the argument a by 2^{64} if the argument is a denormal number. Let a^{\prime } be the resulting argument, e = \lfloor \log\! _2 (4a^{\prime }/3) \rfloor and m = a^{\prime } \cdot 2^{-e}. If a is a denormal number, 64 is subtracted from e. SLEEF uses a polynomial with 7 non-zero terms to evaluate \log m \approx \sum _{n=0}^{6} C_n \left(\frac{m-1}{m+1} \right)^{2n+1}, where C_0 \ldots C_6 are constants. As a reconstruction, it computes \log a = e\;\log 2 + \log m.
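The structure of this scheme can be sketched in scalar code as follows. The real polynomial has 7 non-zero minimax terms; the sketch keeps only the first two Taylor terms of 2\,\mathrm{atanh}\,t to show the shape of the computation, and it omits the denormal pre-scaling by 2^{64}.

    #include <math.h>

    static double log_sketch(double a) {
      int e = ilogb(a * (4.0 / 3.0));  /* e = floor(log2(4a/3))       */
      double m = ldexp(a, -e);         /* m = a * 2^-e, in [3/4, 3/2) */
      double t = (m - 1) / (m + 1);
      /* log m = 2*atanh(t) = 2t + (2/3)t^3 + (2/5)t^5 + ... */
      double logm = fma(2.0 / 3.0, t * t * t, 2 * t);
      return fma((double)e, 0.6931471805599453, logm);  /* e*log2 + log m */
    }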

5.6 Implementation of exp

All libraries reduce the argument range to [-(\log 2) / 2, (\log 2) / 2] by finding r and integer k such that x = k\; \log 2 + r, |r| \leq (\log 2) /2.

SLEEF then uses a polynomial approximation with 13 non-zero terms to directly approximate the exponential function on this domain. It achieves a 1-ULP error bound without using a DD operation.

FDLIBM uses a polynomial with 5 non-zero terms to approximate f(r) = r(e^r+1)/(e^r-1). It then computes exp(r) = 1 + 2r/(f(r)-r). It has 11 if statements.

The reconstruction step is to add integer k to the exponent of the resulting FP number of the above computation.

exp in Vector-libm has a 4-ULP error bound. A vectorized path covers almost all input domains. It uses a polynomial with 11 terms to approximate the exponential function.

5.7 Implementation of pow

FDLIBM computes y\; \log\! _2\;x in DD precision. Then, it computes pow(x, y) = e^{\log 2 \cdot y \;\log\! _2\,x}. It has 44 if statements.

Vector-libm does not implement pow.

SLEEF computes e^{y \log x}. The internal computation is carried out in DD precision. In order to compute logarithm internally, it uses a polynomial with 11 non-zero terms. The accuracy of the internal logarithm function is around 0.008 ULP. The internal exponential function in pow uses a polynomial with 13 non-zero terms.

5.8 The Payne-Hanek Range Reduction

Our method computes \mathrm{rfrac}(2x/\pi) \cdot \pi /2, where \mathrm{rfrac}(a) := a - \mathrm{round}(a). The argument x is an FP number, and therefore it can be represented as M \cdot 2^E, where M is an integer mantissa and E is an integer exponent. We now denote the integral part and the fractional part of 2^E \cdot 2/\pi as I(E) and F(E), respectively. Then,

\begin{align*} \mathrm{rfrac}(2x/\pi) &= \mathrm{rfrac}(M \cdot 2^E \cdot 2/\pi)\\ &= \mathrm{rfrac}(M \cdot (I(E) + F(E)))\\ &= \mathrm{rfrac}(M \cdot F(E)). \end{align*}

The value F(E) only depends on the exponent of the argument, and therefore it can be calculated and stored in a table in advance. In order to compute \mathrm{rfrac}(M \cdot F(E)) in DD precision, F(E) must be in quad-double precision. We now denote F(E) = F_0(E) + F_1(E) + F_2(E) + F_3(E), where F_0(E) \ldots F_3(E) are DP numbers and |F_0(E)| \geq |F_1(E)| \geq |F_2(E)| \geq |F_3(E)|. Then,

\begin{align*} &\mathrm{rfrac}(M \cdot F(E))\\ =\; &\mathrm{rfrac}(M \cdot F_0(E) + M \cdot F_1(E) + M \cdot F_2(E) + M \cdot F_3(E))\\ =\; &\mathrm{rfrac}(\mathrm{rfrac}(\mathrm{rfrac}(\mathrm{rfrac}(M \cdot F_0(E)) + M \cdot F_1(E)) + M \cdot F_2(E)) + M \cdot F_3(E)), \tag{1} \end{align*}

because \mathrm{rfrac}(a + b) = \mathrm{rfrac}(\mathrm{rfrac}(a) + b). In the method, we compute (1) in DD precision in order to avoid overflow. The size of the table retaining F_0(E) \ldots F_3(E) is 32K bytes. Our method is included in the source code of tan shown in Appendix A, available in the online supplemental material.

FDLIBM seems to implement the original Payne-Hanek algorithm with more than 100 lines of C code, which includes 13 if statements, 18 for loops, 1 switch statement and 1 goto statement. The numbers of iterations of most of the for loops depend on the argument.

Vector-libm implements a non-vectorized variation of the Payne-Hanek algorithm that has some similarity to our method. In order to reduce argument x, it first decomposes |x| into E and n such that 2^E \cdot n = |x|. A triple-double (TD) approximation to t(E) = 2^E / \pi - 2 \cdot \lfloor 2^{E - 1} / \pi \rfloor is looked up from a table. It then calculates m = n \cdot t(E) in TD. The reduced argument is obtained as the product of \pi and the fractional part of m. In Table 1, we compare the numbers of FP operators in the implementations. Note that the method in Vector-libm is used for trigonometric functions with a 4-ULP error bound, while our method is used for functions with a 1-ULP error bound.

TABLE 1. Number of FP Operators in the Payne-Hanek Implementations.

5.9 FP Remainder

We devised an exact remainder calculation method suitable for vectorized implementation. The method is based on the long division method, where an FP number is regarded as a digit. Fig. 3 shows an example process for calculating the FP remainder of 1e+40 / 0.75. Like a typical long division, we first find integer quotient 1.333e+40 so that 1.333e+40 \cdot 0.75 does not exceed 1e+40. We multiply the found quotient with 0.75, and then subtract it from 1e+40 to find the dividend 4.836e+24 for the second iteration.

Fig. 3. Example computation of FP remainder.

Our basic algorithm is shown in Algorithm 1. If n, d and q_k are FP numbers of the same precision p, then r_k is representable with an FP number of precision 2p. In this case, the number of iterations can be minimized by substituting q_k with the largest FP number of precision p within the range specified at line 3. However, the algorithm still works if q_k is any FP number of precision p within the range. By utilizing this property, an implementer can use a division operator that does not return a correctly rounded result. The source code of an implementation of this algorithm is shown in Fig. 6 in Appendix B, available in the online supplemental material. A part of the proof of correctness is shown in Appendix C, available in the online supplemental material.

FDLIBM uses a shift-and-subtract method. It first converts the mantissas of the two given arguments into 64-bit integers, and then calculates the remainder on a bit-by-bit basis. The main loop iterates ix-iy times, where ix and iy are the exponents of the arguments of fmod. This loop includes 10 integer additions and 3 if statements, and its number of iterations can reach more than 1,000.

Vector-libm does not implement FP remainder.

Algorithm 1. Exact Remainder Calculation

Input: Finite positive numbers n and d
Output: Returns n - d\lfloor n/d \rfloor

1: r_0 := n, k := 0
2: while d \leq r_k do
3:   q_k is an arbitrary integer satisfying (r_k/d)/2 \leq q_k \leq r_k/d
4:   r_{k+1} := r_k - q_k d
5:   k := k + 1
6: end while
7: return r_k
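The structure of the iteration can be sketched in scalar code as follows. This is an illustration only: SLEEF's actual implementation (Fig. 6 in Appendix B) carries r_k with enough precision that r_k - q_k d stays exact, whereas this sketch rounds each step and adds a guard for the case where the rounded quotient overshoots.

    #include <math.h>

    static double remainder_sketch(double n, double d) {
      double r = n;                /* r_0 := n */
      while (r >= d) {
        double q = trunc(r / d);   /* any FP integer in [(r/d)/2, r/d] works */
        r = fma(-q, d, r);         /* r_{k+1} := r_k - q_k*d                 */
        if (r < 0) r += d;         /* guard: rounding of r/d may overshoot   */
      }
      return r;
    }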

5.10 Handling of Special Numbers, Exception and Flags

Our implementation gives a value within the specified error bound without special handling of denormal numbers, unless otherwise noted.

When a function has to return a specific value when a specific argument value (such as a NaN or a negative zero) is given, such a condition is checked at the end of the function, and the return value is substituted with the special value if the condition is met. This process is complicated in functions like pow, because they have many conditions for returning special values.

SLEEF functions do not give correct results if the rounding mode is different from round-to-nearest. They do not set errno nor raise an exception; this is a common behavior among vectorized math libraries, including Libmvec [11] and SVML [3]. Because of SIMD processing, functions could raise spurious exceptions if they tried to raise an exception.

5.11 Summary

FDLIBM extensively uses conditional branches in order to switch the polynomial according to the argument (sin, cos, tan, log, etc.), to return a special value if the arguments are special values (pow, etc.), and to control the number of iterations (the Payne-Hanek reduction).

Vector-libm switches between a few polynomials in most of the functions. It does not provide functions with a 1-ULP error bound; nevertheless, the numbers of non-zero terms in its polynomials are larger than in the other two libraries for some of the functions. A vectorized path is used only if the argument is smaller than 2.574 in cos and tan, although these functions are frequently evaluated with arguments up to 2\pi. In most of the functions, Vector-libm uses a non-vectorized path if the argument is very small or a non-finite number. For example, it processes 0 with non-vectorized paths in many functions, although 0 is a frequently evaluated argument in normal situations. Once non-finite numbers are contained in the data being processed, the whole processing can become significantly slower from that point on. Such variation in execution time can be exploited for a side-channel attack in cryptographic applications.

SLEEF uses the fastest paths if all the arguments are under 15 for trigonometric functions, and the same vectorized path is used regardless of the argument in most of the non-trigonometric functions. SLEEF always uses the same polynomial regardless of the argument in all functions.

Although reducing the number of conditional branches has several advantages when implementing vector math libraries, it does not seem to be given a high priority in other libraries.

SECTION 6

Testing

SLEEF includes three kinds of testers. The first two kinds test the accuracy of all functions against high-precision evaluation using the MPFR library. In these tests, the computation error in ULP is calculated by comparing the values output by each SLEEF function with the values output by the corresponding function in the MPFR library, and the tester checks whether the error is within the specified bounds.

6.1 Perfunctory Test

The first kind of tester carries out a perfunctory set of tests to check if the build is correct. These tests include standards compliance tests, accuracy tests and regression tests.

In the standards compliance tests, we test whether the functions return the correct values when values that require special handling are given as arguments. These argument values include \pm Inf, NaN and \pm 0. atan2 and pow are binary functions and have many combinations of these special argument values; all of these combinations are also tested.

In the accuracy test, we test if the error of the returned values from the functions is within the specified range, when a predefined set of argument values are given. These argument values are basically chosen between a few combinations of two values at regular intervals. The trigonometric functions are also tested against argument values close to integral multiples of \pi /2. Each function is tested against tens of thousands of argument values in total.

In the regression test, the functions are tested with argument values that triggered bugs in the previous library release, in order to prevent re-emergence of the same bug.

The executables are separated into a tester and IUTs (Implementations Under Test). The tests are carried out by making these two executables communicate via an input/output pipeline, in order to enable testing of libraries for architectures that the MPFR library does not support.

6.2 Randomized Test

The second kind of tester is designed to run continuously. It generates random arguments and compares the output of each function with the output calculated with the corresponding function in the MPFR library. This tester is expected to find bugs if it is run for a sufficiently long time.

In order to randomly generate an argument, the tester generates random bits of the size of an FP value, and reinterprets the bits as an FP value. The tester executes the randomized test for all the functions in the library at several thousand arguments per second for each function on a computer with a Core i7-6700 CPU.
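Drawing an argument this way can be sketched as follows; rand64 stands in for whatever pseudo-random number generator the tester actually uses.

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret 64 random bits as a double, so that every bit pattern,
       including denormals, infinities and NaNs, can appear as an argument. */
    static double random_argument(uint64_t (*rand64)(void)) {
      uint64_t bits = rand64();
      double d;
      memcpy(&d, &bits, sizeof d);  /* bit-level reinterpretation, aliasing-safe */
      return d;
    }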

In the SLEEF project, we use randomized testing rather than formal verification in order to check the correctness of the functions. It is true that proving the correctness of the implementation contributes to its reliability. However, proofs come with a performance overhead, because the implementation would be limited to forms whose correctness is easy to prove, and with an increased maintenance cost, because the proof would need to be updated each time the implementation is modified.

6.3 Bit-Identity Test

The third kind of tester is for testing if bit-identical results are returned from the functions that are supposed to return such results. This test is designed to compare the results among the binaries compiled with different vector extensions. For each predetermined list of arguments, we calculate an MD5 hash value of all the outputs from each function. Then, we check if the hash values match among functions for different architectures.

SECTION 7

Performance Comparison

In this section, we present results of a performance comparison between FDLIBM Version 5.3 [4], Vector-libm [5], SLEEF 3.4, and Intel SVML [3] included in Intel C Compiler 19.

We measured the reciprocal throughput of each function by measuring the execution time of a tight loop that repeatedly calls the function in a single-threaded process. In order to obtain useful results, we turned off optimization flags when compiling the source code of this tight loop, while the libraries were compiled with their default optimization options. We did not use LTO. We confirmed that the calls to the function are not compiled out or inlined by checking the assembly output from the compiler. The number of function calls in each loop is 10^{10}, and the execution time of this loop is measured with the clock_gettime function.

We compiled SLEEF and FDLIBM using gcc-7.3.0 with the “-O3 -mavx2 -mfma” optimization options. We compiled Vector-libm using gcc-7.3.0 with the default “-O3 -march=native -ftree-vectorize -ftree-vectorizer-verbose=1 -fno-math-errno” options. We changed VECTOR_LENGTH in vector.h to 4 and compiled the source code on a computer with an Intel Core i7-6700 CPU. The accuracy of the functions in SVML can be chosen by compiler options; we specified the “-fimf-max-error=1.0” and “-fimf-max-error=4.0” options for icc to obtain the 1-ULP and 4-ULP accuracy results, respectively.

We carried out all the measurements on a physical PC with an Intel Core i7-6700 CPU @ 3.40 GHz without any virtual machine. In order to make sure that the CPU was always running at the same 3.4 GHz clock speed during the measurements, we turned off Turbo Boost. With this setting, 10 nanoseconds correspond to 34 clock cycles.

The following results compare the reciprocal throughput of each function. If the implementation is vectorized and each vector has N elements of FP numbers, then a single execution evaluates the corresponding mathematical function N times. We generated the arguments in advance and stored them in arrays. Each time a function is executed, we set a randomly generated argument in each element of the argument vector (each element is set to a different value). The measurement results do not include the delay for generating random numbers.

7.1 Execution Time of Floating Point Remainder

We compared the reciprocal throughput of double-precision fmod functions in the libm included in Intel C Compiler 19, FDLIBM and SLEEF. All the FP remainder functions always return a correctly-rounded result. We generated a random denominator d and a numerator uniformly distributed within [1, 100] and [0.95r\cdot d, 1.05r\cdot d], respectively, where r is varied from 1 to 10^{25}. Fig. 4 shows the reciprocal throughput of the fmod function in each library. Please note that SVML does not contain a vectorized fmod function.

Fig. 4. Reciprocal throughput of double-precision fmod functions.

The graph of reciprocal throughput looks like a step function, because the number of iterations increases in this way.

7.2 Comparison of Overall Execution Time

We compared the reciprocal throughput of 256-bit wide vectorized double-precision functions in Vector-libm, SLEEF and SVML, and scalar functions in FDLIBM. We generated random arguments that were uniformly distributed within the indicated intervals for each function. In order to check execution speed of fast paths in trigonometric functions, we measured the reciprocal throughput with arguments within [0.4, 0.5]. The result is shown in Table 2.

TABLE 2. Reciprocal Throughput in Nanoseconds.

The reciprocal throughput of the functions in SLEEF is comparable to that of SVML in all cases. This is because the latency of FP operations generally dominates the execution time of math functions. Because there are two levels of scheduling mechanisms, the optimizer in a compiler and the out-of-order execution hardware, there is little room for making a difference in throughput or latency.

The execution speed of FDLIBM is not very slow despite its many conditional branches. This seems to be due to a smaller number of FP operations and the faster execution speed of scalar instructions compared to equivalent SIMD instructions.

Vector-libm is slow even when only the vectorized path is used. This seems to be because Vector-libm evaluates polynomials with a large number of terms; auto-vectorizers are also still maturing, and the compiled binary code might not be well optimized. When a slow path has to be used, Vector-libm is even slower, since a scalar evaluation has to be carried out for each of the elements in the vector.

Vector-libm uses Horner's method to evaluate polynomials, which involves long latency of chained FP operations. In FDLIBM, this latency is reduced by splitting polynomials into even and odd terms, which can be evaluated in parallel. SLEEF uses Estrin's scheme. In our experiments, there was only a small difference between Estrin's scheme and splitting polynomials into even and odd terms with respect to execution speed.

SECTION 8

Conclusion

In this paper, we showed that our SLEEF library achieves performance comparable to commercial libraries while maintaining good portability. We have been continuously developing SLEEF since 2010 [52]. We distribute SLEEF under the Boost Software License [53], which is a permissive open source license. We actively communicate with developers of compilers and members of other projects in order to understand the needs of real-world users. The Vector Function ABI is important in developing vectorizing compilers. The functions that return bit-identical results were added to our library to reflect requests from our multiple partners. We thoroughly tested these functionalities, and SLEEF has already been adopted in multiple commercial products.

ACKNOWLEDGMENTS

The authors would like to acknowledge Will Lovett and Srinath Vadlamani for their valuable input and suggestions. They are particularly grateful to Robert Werner, who reviewed the final manuscript. The authors would like to thank Prof. Leigh McDowell for his suggestions. The authors would like to thank all the developers that contributed to the SLEEF project, in particular Diana Bite and Alexandre Mutel.
