An implementation of the nonlinear iterative partial least squares (NIPALS) algorithm was used as a test case for OpenCL computation on a general-purpose graphics processing unit (GPGPU) cluster using MPI. Timing results are shown, along with results from a model of the time required per iteration for defined problem sizes. The steps taken to optimize the code are discussed, moving from a single GPU, to multiple GPUs on a single node, to multiple GPUs on multiple nodes. Performance comparisons are given among OpenCL and BLAS implementations, modern CPU architectures, and NVIDIA Tesla- and Fermi-class GPU systems.
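For orientation, the following is a minimal CPU reference sketch of one common form of NIPALS, the extraction of a single principal component. It is not the OpenCL/MPI implementation described above. The function names `nipals_component` and `deflate`, the tolerance, and the choice of starting column are assumptions made for illustration. Each iteration is dominated by two matrix-vector products (X^T t and X p), which would be the natural candidates for offloading to a GPU.

```python
import math

def nipals_component(X, tol=1e-10, max_iter=500):
    """Extract one component (scores t, loadings p) from X by NIPALS.

    X is a list of rows, assumed column-centered and not all zero.
    Illustrative sketch only; names and defaults are hypothetical.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    col = max(range(n_cols), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[col] for row in X]
    for iteration in range(1, max_iter + 1):
        tt = sum(v * v for v in t)
        # Loadings: p = X^T t / (t^T t), then normalize to unit length.
        p = [sum(X[i][j] * t[i] for i in range(n_rows)) / tt
             for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: t_new = X p.
        t_new = [sum(X[i][j] * p[j] for j in range(n_cols))
                 for i in range(n_rows)]
        # Converged when the scores stop changing (relative squared change).
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff <= tol * sum(v * v for v in t):
            break
    return t, p, iteration

def deflate(X, t, p):
    """Remove the extracted component: X - t p^T."""
    return [[X[i][j] - t[i] * p[j] for j in range(len(p))]
            for i in range(len(t))]

# Small rank-one, column-centered example.
u = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[3.0 * a, 4.0 * a] for a in u]
t, p, n_iter = nipals_component(X)
residual = deflate(X, t, p)
```

Further components would be obtained by calling `nipals_component` again on the deflated matrix. On this rank-one example the residual after one component is zero up to rounding.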