Skip to Main Content
Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications would be improved by parallel processing threads of codes using multi-threading techniques. This paper evaluates the performance of the multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with hyper-threading (HT) technology, our results show that a performance of around 20 GFLOPS is achieved on Level-3 (matrix-matrix operations) BLAS using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. However, on a small size of Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS, the use of multi-threading technique speeds down the execution because of the thread creation overheads. Thus the use of Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and level-1 BLAS (3 GFLOPS). When the problem size becomes large (cannot fit in L2 cache), the performance of the four Xeon cores with HT is less than 2 and 1 GFLOPS on Level-2 and level-1 BLAS, respectively, even though eight threads are executed in parallel.
Date of Conference: 25-27 Nov. 2008