The ratio between processor and main memory performance has been increasing for quite some time, and can safely be expected to keep doing so in the coming years. In the era of single-core processors, this was mainly observable as increased latency, for example when measured in the number of (possibly stalled) CPU clock cycles. Nowadays, with multicore chips, multiple cores share the same connection to off-chip main memory, which effectively reduces the available bandwidth as well. Caches help in both cases: being located on-chip, they provide both much lower latency and much higher bandwidth. By holding copies of recently used memory blocks, caches exploit the fact that programs, on average, access memory in ways that repeatedly touch the same memory cell (temporal locality) or nearby memory cells (spatial locality). However, this natural locality is not sufficient for scientific computing in HPC; further improving the access locality of existing algorithms is therefore highly desirable. In this talk, we present strategies to improve the locality of memory accesses for linear algebra problems occurring in different kinds of applications: (1) an algorithmic approach based on Peano space-filling curves that leads to inherently cache-efficient (cache-oblivious) matrix algorithms, such as matrix multiplication or LU decomposition for dense and sparse matrices, on single-core CPUs as well as on shared-memory multicore platforms; (2) cache optimization strategies for matrix-vector multiplications with very large, sparse matrices, as they occur in the iterative MLEM algorithm used for image reconstruction in nuclear medicine. Here, different cache-aware optimization strategies are combined to better exploit large caches, small caches, and single cache lines.
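To illustrate the cache-oblivious idea mentioned in strategy (1), the following is a minimal sketch of a recursive block matrix multiplication. It uses plain 2x2 divide-and-conquer rather than the Peano-curve block ordering the talk presents; the function name and the `threshold` parameter are illustrative choices, not part of the original work. The key property is shared, though: the recursion needs no cache-size parameter, because at some recursion depth the sub-blocks fit into each cache level and are fully reused before the algorithm moves on.

```python
import numpy as np

def multiply_recursive(A, B, C, threshold=64):
    """Cache-oblivious matrix multiply: accumulates C += A @ B by
    recursively halving the matrices into quadrants.  Once sub-blocks
    fit into a given cache level, they are reused completely before
    the recursion moves on -- without any explicit cache parameter."""
    n = A.shape[0]
    if n <= threshold:
        C += A @ B          # base case: block is small, multiply directly
        return
    h = n // 2
    # quadrant views (no copies are made)
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C11, C12, C21, C22 = C[:h, :h], C[:h, h:], C[h:, :h], C[h:, h:]
    # the eight block products of the 2x2 block scheme
    multiply_recursive(A11, B11, C11, threshold)
    multiply_recursive(A12, B21, C11, threshold)
    multiply_recursive(A11, B12, C12, threshold)
    multiply_recursive(A12, B22, C12, threshold)
    multiply_recursive(A21, B11, C21, threshold)
    multiply_recursive(A22, B21, C21, threshold)
    multiply_recursive(A21, B12, C22, threshold)
    multiply_recursive(A22, B22, C22, threshold)
```

Because the quadrants are NumPy views, the base-case `C += A @ B` updates the result matrix in place; the traversal order of the eight recursive calls is exactly what a Peano-curve scheme would reorganize to keep consecutive block operands adjacent in memory.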
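For strategy (2), the kernel being optimized is a sparse matrix-vector product. The sketch below shows a plain CSR (compressed sparse row) kernel, assuming the standard CSR arrays; it is not the talk's optimized MLEM code. It makes the locality problem visible: `values` and `col_idx` are streamed sequentially (good spatial locality, full cache lines consumed), while the accesses to `x` jump around according to the column indices, which is precisely what cache-aware blocking strategies reorganize.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x, with A stored in CSR
    format: values[k] is the k-th nonzero, col_idx[k] its column, and
    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # values/col_idx are read sequentially (spatial locality);
        # x[col_idx[k]] is the irregular, cache-unfriendly access
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

In an iterative scheme like MLEM, this kernel (and its transpose counterpart) runs once per iteration over a very large matrix, so improving the reuse of `x` across nearby rows pays off directly.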