This paper presents vectorization techniques tailored to the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. It focuses on the general-purpose basic-block vectorization and optimization methods incorporated in the Vienna MAP vectorizer and optimizer. The technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and combined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). Working with the compiler technologies presented in this paper, the FFTW performance-optimization facilities produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.