Using GPUs as computational accelerators has been a growing area of research in recent years, and dense linear algebra is one area particularly amenable to exploiting graphics hardware. We continue this trend by generalizing the MAGMA xGEMM kernels, porting them to OpenCL, and tuning them for the AMD Radeon 7970, achieving up to 1.7 TFlops in SGEMM and 650 GFlops in DGEMM. We then extend this performance to multiple GPUs using a parallel-for algorithm designed to run on multiple heterogeneous devices. Using three Radeon 7970s, our large GEMM algorithm obtains 4.37 TFlops in single precision and 1.64 TFlops in double.
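The multi-GPU scheme described above distributes independent pieces of one large GEMM across several devices. As a minimal sketch (not the paper's OpenCL implementation), the decomposition can be illustrated in NumPy: the output matrix C is split into row blocks, each block is assigned round-robin to a device, and each device computes its block independently. The function name, tile size, and round-robin policy here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def multi_device_gemm(A, B, num_devices=3, tile=64):
    """Illustrative sketch: split C = A @ B into row blocks and
    assign each block round-robin to a (simulated) device.
    In a real multi-GPU setup, each block would be an independent
    kernel launch on its own OpenCL device."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    row_blocks = [slice(i, min(i + tile, m)) for i in range(0, m, tile)]
    for idx, rb in enumerate(row_blocks):
        dev = idx % num_devices  # device that would own this block
        # Each device needs its row panel of A and all of B;
        # the blocks are independent, so they run in parallel.
        C[rb, :] = A[rb, :] @ B
    return C
```

Because the row blocks share no output elements, no inter-device synchronization is needed until the results are gathered, which is what makes a parallel-for formulation a natural fit.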