Level-based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication

The multiplication of a sparse matrix with a dense vector (SpMV) is a key component in many numerical schemes and its performance is known to be severely limited by main memory access. Several numerical schemes require the multiplication of a sparse matrix polynomial with a dense vector, which is typically implemented as a sequence of SpMVs. This results in low performance and ignores the potential to increase the arithmetic intensity by reusing the matrix data from cache. In this work we use the recursive algebraic coloring engine (RACE) to enable blocking of sparse matrix data across the polynomial computations. In the graph representing the sparse matrix we form levels using a breadth-first search. Locality relations of these levels are then used to improve spatial and temporal locality when accessing the matrix data and to implement an efficient multithreaded parallelization. Our approach is independent of the matrix structure and avoids shortcomings of existing"blocking"strategies in terms of hardware efficiency and parallelization overhead. We quantify the quality of our implementation using performance modelling and demonstrate speedups of up to 3$\times$ and 5$\times$ compared to an optimal SpMV-based baseline on a single multicore chip of recent Intel and AMD architectures. As a potential application, we demonstrate the benefit of our implementation for a Chebyshev time propagation scheme, representing the class of polynomial approximations to exponential integrators. Further numerical schemes which may benefit from our developments include $s$-step Krylov solvers and power clustering algorithms.


INTRODUCTION AND RELATED WORK
Sparse matrix-vector multiplication (SpMV) is a critical building block for a wide variety of computational algorithms used in science, engineering, and data analytics.The SpMV kernel is known to perform poorly on modern compute devices due to its low arithmetic intensity and often irregular memory access pattern.Most performance optimization efforts target a single SpMV invocation.To minimize the data access costs to the matrix entries, a plethora of data layout choices have been proposed for GPGPUs [1] and CPUs [2], [3], [4], including hardwareagnostic formats [5].These formats typically ensure linear access to matrix data, but the input vector is always accessed indirectly and therefore potentially in an irregular way.Optimization strategies like matrix reordering or partitioning techniques [6] aim to reduce the reuse distances in the vector accesses and thus improve the performance.Finally, at the kernel implementation level, automatic performance optimization for SpMV has been a subject of research for decades.These approaches mainly account for the complexity of cache-based microprocessors, where SpMV performance maybe extremely sensitive to the spatial/temporal data access locality, out-of-order instruction capability, register scheduling, and SIMD vectorization.Choosing parameters for these code optimizations and choosing among alternative implementations is critical for efficient hardware utilization.It has been demonstrated [3], [7], [8], [9] that it is possible • C. Alappat, G. Hager and G. Wellein are with Erlangen National High Performance Computing Center at Friedrich-Alexander-Universität Erlangen-N ürnberg.E-mail: {christie.alappat,georg.hager,gerhard.wellein}@fau.de• G. Wellein is also with the Department of Computer Science, Friedrich-Alexander-Universität Erlangen-N ürnberg.• O. Schenk is with the Institute of Computing at Faculty of Informatics, Università della Svizzera italiana.E-mail: olaf.schenk@usi.chto build an automatic tuning system capable of generating implementations that are on par with or even outperform the best manually tuned code.
In this work, we extend SpMV performance tuning research towards automatic data reuse optimization across several SpMV invocations in the sparse matrix-power-vector kernel (MPK), which computes Ax, A 2 x, A 3 x, • • • , A k x for matrix A, vector x, and a small constant k.Our focus is on thread-level parallel and efficient CPU implementation of MPK using the popular compressed row storage (CRS) sparse matrix format.To this end we extend the recursive algebraic coloring engine (RACE) framework [10] to tackle the dependencies between several SpMV invocations in the MPK.The algebraic formulation used in RACE is general in the sense that it does not assume any special structure in the underlying matrix.
The need for software implementations and structures for MPK is exemplified by communication-avoiding algorithms [11], [12], [13], [14], which have been proposed to improve performance by trading redundant computation for memory traffic.In these algorithms, independent SpMV invocations are replaced by the MPK to compute A k x.Once the computation has been performed, the next k steps of the solver can proceed without further memory accesses to A by combining vectors from this set.
There has been some research in exploiting data locality in MPK, mostly motivated by classic blocking strategies well established in stencil computations.In particular, in [15] blocking schemes for MPK have been developed that first partition the graph of a matrix A into p blocks of almost equal size, where p is the number of cores.For cache reuse, the blocks assigned to each core are further partitioned.Within each block, an orthotrope-style [16] temporal blocking is used to perform MPK computation locally for the block.This requires to find neighbors of each block that are involved in an MPK computation with power k.However, these neighbors end up in nonconsecutive spots, resulting in a performance bottleneck.In [17], MPK kernels were studied on modern multicore architectures for banded sparse matrices that arise from stencil discretization.Following classic stencil blocking approaches, a geometrical blocking method was proposed.For matrices arising from two-dimensional discretization the method achieved decent speedup.However, for matrices from three-dimensional discretization it yielded very limited performance gains due to high matrix bandwidth.Most of the other works [11], [18], [19] on MPK schemes focused on reducing the MPI communication overhead.A recent work [20] in this direction presents a theoretical study on the benefit of diamond tiling for reducing communication.

Contribution and Outline
Our work bridges the gap between temporal blocking of stencil algorithms [21], [22], [23], which can be considered as an MPK on structured grids, and recursive spatial blocking strategies for SpMV [24].In addition we reduce the need to manually set up the blocks.We cover full thread-level parallelization and focus on a single multicore processor.Our contributions are as follows: • We generalize temporal tiling strategies known from stencil computations on structured grids to MPK computations on structured and unstructured sparse matrices using the levels of the graph of the matrix.

•
We present an efficient, multi-threaded implementation of our level-based blocking method for sparse MPK on modern multicore processors.Our solution aims to reduce the main memory traffic and to avoid scalability bottlenecks such as synchronization overhead or load imbalance.

•
We conduct a detailed performance analysis of our approach as implemented in RACE on various CPU architectures.

•
For a broad set of sparse matrices we demonstrate full threading functionality and excellent multicore performance achieving speedups of 3× to 5× compared to a standard baseline implementation.
• We validate the performance improvements using the roofline model and the phenomenological Execution-Cache-Memory (ECM) model.These models corroborate the optimality of both our level-blocking approach and the baseline implementation to which we compare.

•
We finally discuss potential applications of MPK to problems from physics and chemistry and demonstrate how such applications can be optimized using the MPK model.
The remainder of the paper is structured as follows.Section 2 reviews our experimental setup, in particular hardware and software characteristics of the next generation of scalable processor, namely the Intel Cascade Lake and Intel Ice Lake, and the AMD EPYC architectures, and, additionally, the set of benchmark matrices.In Section 3 we review the computational workload of matrix-vector multiplications for sparse matrices.Section 4 is dedicated to the main contribution of the paper and describes in detail the algorithmic components of level-based blocking of MPK.Section 5 includes an assessment of performance parameters within our recursive level-based blocking engine (RACE MPK) method.In Section 6 we conduct a detailed performance analysis of our cache-aware implementation for matrix-power kernels and compare it to a state-of-theart implementation.Section 7 presents the application of the matrix-power-vector multiplication in Chebyshev time propagation of quantum wave functions and finally Section 8 concludes the paper.

Hardware
The measurements in this paper were conducted on a single socket of Intel Cascade Lake (CLX), Intel Ice Lake (ICL), and AMD Epyc Zen2 (ROME), respectively.Key specifications of the three systems are summarized in Table 1.
These state-of-the-art processors power more than 50% of the top 100 ranking supercomputers [25].The Intel CPUs support the AVX-512 instruction set, while the AMD CPU supports only AVX-2.Turbo mode was active for all the runs, and the systems were configured with one ccNUMA domain per socket, i.e., on Intel systems the Sub-NUMA Clustering (SNC) was disabled and on AMD the NPS1 mode was used.
All CPUs have three levels of cache: private, inclusive L1 and L2, and a victim-type L3.The L3 cache on the Intel systems is shared by all cores of a socket, while on ROME it is shared only within a core complex (CCX), which comprises four cores.The aggregate L3 cache on ROME is 2.5× larger than on ICL and 5× larger than on CLX.This can be observed in the full-socket load-only bandwidth measurements in Fig. 1, where the combined L2 and L3 cache sizes are marked with dashed lines.This data also shows the L3 and main memory bandwidths of the three CPUs.CLX and ICL have a moderate L3 bandwidth of 300 Gbyte/s and 400 Gbyte/s, respectively, while ROME has a very high L3 bandwidth of more than 2500 Gbyte/s.It is worth noting that the transition from L3 to main memory is very sharp on ROME and occurs exactly where the data-set size exceeds the total cache size, while on the Intel systems the drop is gradual and there is a noticeable cache effect even when the working set exceeds the cache size by 2× or more, due to its dynamic cache replacement policy [26].The main memory bandwidth (b Mem ) of CLX, ICL and ROME is about 116 Gbyte/s, 170 Gbyte/s, and 146 Gbyte/s, respectively.

Software
For compilation, Intel compilers (see Table 1 for version info) were used on Ubuntu 18.04.5 (CLX and ROME) and Red Hat Enterprise Linux 8.2 (ICL), respectively, with compiler flags -O3 -xHOST.All floating-point computations were done in double precision, while integers were 32 bits wide.Threads were bound to cores in a closed (fill-type pinning) manner.To reduce fluctuations, each kernel was executed multiple times such that the overall runtime is greater than one second.The average performance of these runs was then reported.As the variation among multiple measurements was less than 5%, we do not show error bars.For pinning, bandwidth benchmarks (see Fig. 1), and for counting hardware events we use the likwid-pin, likwid-bench, and likwid-perfctr tools from the LIKWID tool suite version 5.1.

Benchmark matrices
Table 2 shows the sparse matrices used for the benchmarks and some of their properties: N r is the total number of rows, N nz is the total number of nonzero entries, and N nzr is the average number of nonzero entries per row (i.e.,N nz /N r ).The matrices are ordered (top to bottom) according to increasing N nz , and all are square since this is a requirement for the matrix power kernel (MPK).All matrices except one were taken from the SuiteSparse Matrix Collection [27].HPCG-128-128-128 is the matrix found in the HPCG benchmark [28], with a problem size of 128 3 .

MATRIX POWER KERNEL
The basic algorithmic workload addressed in this paper is the computation of powers of a sparse matrix applied to a dense vector.The matrix power kernel (MPK) is defined as follows: For a given square, sparse matrix A and a dense vector x calculate all matrix powers A p x up to a maximum p m (p = 1, . . ., p m ) and store all p m resulting vectors (y p = A p x) for subsequent calculations.We further define y 0 := x.

Baseline MPK implementation
The standard approach to implement the MPK is to perform a sequence of p m SpMV operations, i.e., y i = Ay i−1 with i = 1, . . ., p m , using standard SpMV implementations or library calls.We refer to this strategy as baseline MPK.
Figure 2 shows a high-level representation of our baseline MPK together with an SpMV implementation that is known to provide good performance on CPUs for a wide variety of sparse matrix structures.The sparse matrix A is stored in the well-known CRS format, using the three arrays rowP tr, val, and col, which hold the row pointer information, values, and column indices of nonzero entries, respectively (see [29] for details).This information is passed (as global data) to the SpMV function along with the function parameters (line 9) representing the right-hand side (RHS) vector and the range of row indices for which the SpMV is to be computed. 1The function then performs the SpMV operation (lines 12-20) and returns the resulting left-hand side (LHS) vector.Note that most SpMV implementations in libraries are unsuitable for the optimized MPK discussed later as they do not support SpMV on a subset of rows.Therefore we use our own version of SpMV, which serves as the main kernel for both the baseline and the optimized version.We have ensured that our SpMV performs at least as good as Intel MKL with the standard CRS format.
The baseline MPK stores the p m + 1 vectors {y p } in the matrix y[:, 0 : p m ] (column-major order) and performs p m back-to-back calls to the SpMV function (see lines 5-7 of Fig. 2).If the caches are too small to hold the entire matrix, it must be read p m times from main memory.Consequently, the optimum (minimum) main memory balance for the CRSbased baseline MPK is B C = 6 byte/flop [10], [30], which is equivalent to 12 bytes of memory traffic per nonzero matrix entry.The baseline MPK thus reflects the strongly memorybound performance characteristic of the underlying SpMV operation.
In order to evaluate the quality of optimized MPK implementations, we will measure the actual code balance B C,m and compare it with the theoretical baseline minimum (6 byte/flop) discussed above.The B C,m is obtained by measuring the actual data traffic (using likwid-perfctr) and dividing it by the minimum amount of floating-point operations to be performed, i.e., 2 × N nz × p max .Where appropriate, measured code balance from within the cache hierarchy will also be reported.
1.For the baseline implementation, the entire row range is specified.#pragma omp parallel for schedule(static) 13: for row = row s : row e do

Blocking strategy for the MPK implementation
As the same sparse matrix is repeatedly applied, there is substantial performance optimization potential via data transfer reduction by reusing matrix entries from the cache for the successive computation of multiple powers.The basic idea is to compute the SpMV partially for a block of A that fits into cache and reuse these matrix entries for the next SpMV, i.e., calculate another power on a smaller subset of the data.This approach is equivalent to temporal blocking for iterative stencil update schemes, where multiple updates on the same stencil data are computed in cache.Here the spatial stencil structure determines the dependencies between successive updates and geometric schemes for handling the spatialtemporal dependencies such as trapezoidal [31] or diamond blocking [32] are well established.To demonstrate the equivalent challenge in MPK, we show in Fig. 3 a simple banded sparse matrix, which arises from a discretization of a toy stencil in one spatial dimension.In the first step (Fig. 3a), an SpMV operation is performed applying a block of the matrix (yellow rows), which fits into cache, to the input (RHS) vector x to calculate a part of Ax (yellow elements of LHS vector).In the next step (Fig. 3b), the updated vector elements serve as input and are used to calculate A 2 x (blue elements of the LHS vector) by applying SpMV with a subset of the matrix block (blue rows).To fulfill the dependencies between these successive SpMV steps, the column indices of the subset of the matrix block (blue rows in Fig. 3b) must be in the range (indicated with red line) of the row indices of the original matrix block (yellow rows).It is obvious that the overhead of this approach, which is quantified by the ratio of yellow to blue rows in Fig. 3b, increases with the bandwidth of the matrix (i.e., with longer-range stencils).
The outlined MPK blocking approach can be generalized for sparse matrices with irregular structures.We define I to be a set of row indices of the matrix A. The corresponding set C(I) contains the column indices of all nonzero entries in the rows of I, i.e., if i ∈ I then j ∈ C(I) ⇐⇒ A i,j = 0. Based on this notation, the SpMV operation (y = Ax) for a given row index i ∈ I can be written as: If we apply the SpMV for all rows in I to a RHS y p−1 , then all corresponding row entries of the LHS vector are updated to power p.We can then apply to this vector another SpMV on a set of rows K for which C(K) ⊆ I.The choice of the set of row indices I for a given sparse matrix A is decisive to the performance of such a method: (i) The matrix elements associated with I and C(I) have to fit into cache and (ii) should be stored to enable high spatial and temporal locality without indirect access.Furthermore, (iii) the bandwidth of the matrix involved in the MPK should be as small as possible, i.e., the indices of C(I) have to be close to the set I. A potential approach to address these challenges is to consider the SpMV operation as a graph traversal problem as done in the RACE coloring scheme [10].Here, breadthfirst search (BFS) [33] is applied to the graph underlying A; the BFS levels of A are stored consecutively.These levels allow us to identify appropriate parts of the matrix (I and K) for blocking and how to traverse the full matrix (graph) systematically to update vector elements corresponding to all matrix powers while maintaining locality in accessing matrix and vector data.As an added benefit, the BFS reordering of the matrix reduces its bandwidth.The next update is performed on the blue block of A to compute A 2 x on the blue elements of the LHS vector.
Yellow matrix elements can be reused when computing A 2 x on blue blocks.

LEVEL-BLOCKED MPK
The RACE coloring scheme has been developed to generate hardware efficient distance-k colorings of graphs [10].It has been successfully applied to the shared-memory parallelization of symmetric SpMV providing unprecedented performance levels.Further it has been shown that the levelbased approach allows to control dependencies in SpMV operations and at the same time provides flexibility to ensure data locality and to adjust to the degree of parallelism required by modern multicore processors.The level-based blocking strategy introduced in the following adapts these RACE properties to the MPK.We thus first recapitulate the basic terminology and the level-based approach of RACE.
Next we demonstrate how it is basically applied to the MPK problem and then show how data locality and efficient shared-memory parallelization can be achieved.
In this section, we restrict ourselves to symmetric matrices, i.e., undirected graphs.However, the proposed MPK blocking method is also applicable to non-symmetric square matrices.The following definitions from graph theory are used throughout the paper: Graph: G = (V, E) represents a graph, with V(G) denoting a set of vertices and E(G) denoting its edges.For sparse matrices, V(G) consists of all row indices of the matrix and E(G) consists of edges between two vertices corresponding to the row (u) and the column indices (v) of the nonzero entries, i.e., {u, In the graph terminology, an SpMV operation (y = Ax) can be formulated as follows: If G = (V, E) is the graph representation of the sparse matrix A then for every vertex u ∈ V(G) calculate Comparing ( 2) with ( 1), we can observe the equivalence between index-based (row index i and its related column indices C(i)) and graph-based (vertex u and its neighborhood N (u)) notations.
To illustrate our method, a simple graph generated by applying a two-dimensional seven-point (2d-7pt) stencil to a square grid of size 8×8 will serve as an example.Figure 4a shows the graph with each vertex numbered in lexicographic ordering.The associated stencil at a single grid point (vertex 54 and its neighborhood) is highlighted.The sparsity pattern of the corresponding matrix is shown in Fig. 4b.

Levels
The level formation in RACE is based on a BFS which assigns each vertex (row) of the graph (matrix) to a level.First, a root vertex v root is chosen and assigned to the first level, L(0).The rest of the levels, L(i) ∀ i > 0, are defined to contain vertices that are in the combined neighborhood of the vertices in the previous level L(i − 1) but have no level numbers assigned yet, i.e., Figure 4c shows the 15 levels (indicated by different colors) generated by this procedure for the stencil graph if v root = 0 is chosen.After level formation, the vertices are renumbered (compare vertex indices in Fig. 4a and Fig. 4c) such that those in the same level are numbered consecutively and the vertices in level L(i − 1) appear before those in L(i).This permutation 2 increases data locality between neighboring vertices and results in a lens-shaped matrix with typically reduced bandwidth (see Fig. 4d).Since this improves the data locality of sparse matrix computations, such permutations are widely employed as preprocessing steps for SpMV-based algorithms [34].
As a consequence of the definition of levels, the neighborhood of all vertices in a given level L(i) is clearly confined to the vertices within the previous, current, and next levels, i.e.: 2. Note that a symmetric permutation is employed on the matrix, i.e., both rows and columns are permuted    This property is crucial for the design of our level-based MPK blocking scheme as it defines the dependency between the computation of SpMVs for different levels at different matrix powers: To advance all vertices of L(i) to A p x, the calculation of A p−1 x has to be completed on the levels L(i − 1), L(i), and L(i + 1).

Level-based blocking of MPK
In Sec. 3, we discussed the baseline MPK and the potential matrix data reuse by blocking across the SpMV operations involved in the MPK.Further it has been shown that the graph formulation of the SpMV (2) together with the neighborhood relation ( 4) of the levels (3) provide a natural framework for the structured computation of the MPK.This includes the dependency between a level and its neighborhood; e.g., in Fig. 4c one can calculate the next matrix power for level L(6) (with vertices 21, . . ., 27) only after the computation of the previous matrix power is complete on levels L(5), L(6), and L(7) (containing vertices 15, . . ., 35).
We next introduce the Lp diagram to visualize the dependencies between levels in MPK calculations.In the Lp diagram, the indices of the levels L(i) are on the x-axis and the matrix power stages (1 ≤ p ≤ p max ) are on the y-axis.Hence, each node (i, p) in the diagram represents an SpMV on the vertices in level i to compute part of the power p. Figure 5 shows the Lp diagram for 15 levels and p max = 5.To satisfy the dependencies in the level-based MPK blocking scheme, the nodes (i−1, p−1), (i, p−1), and (i+1, p−1) need to be computed before SpMV can be applied to compute the node (i, p).The red arrows in Fig. 5 denote the dependency for the computation of L(6) at p = 4, i.e., for the node (6,4).The order of traversal in the Lp diagram is as follows: • Each diagonal, defined by i + p = const, is traversed from bottom to top (starting at p = 1).
• Diagonals are traversed from left to right, i.e., starting with p = 1 for L(0).This execution order, which is independent of the actual graph structure, ensures that the levels L(i − 1), L(i), and L(i + 1) are updated to power stage p − 1 before level L(i) is advanced to power stage p.In Fig. 5, the order of all execution steps of this scheme is shown via the node numbers in the Lp diagram with p m = 5.  14)) and a maximum power stage of pmax = 5.Level colors are the same as in Fig. 4c.Each node in the Lp diagram is numbered according to the execution order.For p = 4 and level L(6), the explicit dependencies with levels at p = 3 are indicated with red arrows.The nodes highlighted in orange fulfill i + p = 13 ("diagonal").
Visualizations similar to Fig. 5 are often shown for one-dimensional (1D) radius-one stencils, where the x-axis represents the grid points and the y-axis shows iterations or time steps [16], [31], [32].As we have shown above, our level-based MPK algorithm shows the same dependencies, with levels substituting grid points on the x-axis.This opens up a host of options, since we could draw from the large variety of temporal blocking optimizations developed for 1D stencils.Our approach is analogous to parallelogram-style temporal blocking; see [16] for a classification.
The reuse distance of a given level is a central quantity to characterize the cache locality of the level-blocked (LB) MPK.Within the Lp diagram, this quantity can be determined by the number of execution steps between two computations on the same level, i.e., one step in vertical direction.As the scheme traverses the Lp space in consecutive diagonals, a level computed at power p will be reused after d+1 execution stages for the computation of the next power p + 1, where d is the number of execution steps in the current diagonal.After the wind-up and before the wind-down phases at the left and right ends of the Lp diagram, we have d = p m ; hence, levels are reused after p m + 1 execution steps.This can be observed from Fig. 5 if we concentrate on a single level, e.g., the vertices of L(10) used in the 40th execution step to compute p = 1 are reused in the 46th step to compute p = 2.As the number of levels is typically much larger than the maximum power stage, we can assume a maximum reuse end for 10: end for Fig. 6.Basic implementation of the level-blocked (LB) MPK algorithm.Lm is the total number of levels and pm is the maximum matrix power.The SpMV function implementation from Fig. 2 is used.
distance of p m + 1 execution stages.This means if all the matrix entries associated with the p m + 1 successive levels touched between two computations of a given L(i) can be held in a cache, all accesses to this L(i) can be served from the cache with the exception of the first one (p = 1), which requires main memory access.Assuming that cache accesses are much faster than memory accesses, the performance of the LB MPK implementation can improve by a factor of at most p m as compared to the baseline MPK.

Implementation
Two basic implementation decisions for our LB MPK are guided by RACE.First, the complete algorithm operates on the permuted graph.Second, only two lean data structures are required to store the information on the permutation and the levels: The permutation vector (N r entries) is required to recover the original ordering.The storage location of the first vertex (row) of each level are stored in the level_ptr array (one entry per level).Figure 4e shows the level_ptr of our stencil example matrix (see Fig. 4d).
A straightforward implementation of our LB MPK is presented in Figure 6.The algorithm first iterates over all diagonals of the Lp diagram in ascending order (line 2).Within a diagonal d = i + p = const, the computations are processed in increasing order of power p (line 6).Note that to account for the wind-up and the wind-down phase of the parallelogram, the starting and ending power stages are adjusted in lines 3 and 4 of the algorithm.Depending on the power p and the diagonal counter d, the actual level index i to use in the current iteration is calculated in line 7. Finally, in line 8 the vector (y[p − 1]) containing the required information at power level p − 1 and the indices of the first and last row of L(i) are passed to the SpMV function (shown in Fig. 2) to compute A p x on level L(i).Note that OpenMP parallelization is done within the SpMV function using static scheduling (line 12 in Fig. 2).As there is an implicit barrier after the parallel workshare construct, all threads finish the computations on a given execution stage before proceeding to the next one.In order to reduce the start-up overhead at the parallel region encountered in each SpMV call, the parallel region is opened outside the SpMV routine in our implementation.
Note that the storage of each level is consecutive and the levels are stored in ascending index order.Therefore, the proposed method neither has irregular accesses to matrix entries nor does it have to store extra copies of matrix elements and perform redundant computations, which were required in previous work [15].Moreover,the parallelization within the levels avoids load imbalance and redundant thread-local copies, which may add significant overhead for irregular matrices and high thread (or core) counts.

Performance analysis of naive version
The naive implementation of the LB MPK already results in a decent performance improvement for some of the matrices presented in Table 2.However, it often falls short of the predicted maximum p m -fold speedup.For example, with p m = 4 on one socket of CLX, 50% of the matrices in the table showed speedup of less than 10% and almost 10 matrices had a performance degradation compared to the baseline MPK.We choose two representative matrices, pwtk and Flan_1565, which are exemplary for the major performance shortcomings of the basic LB MPK and we will identify those in the following.
Figure 7 shows the multithreaded performance and main memory code balance of the LB MPK (Fig. 6) with p m = 1 and p m = 4 along with the baseline MPK (Fig. 2) with p m = 4 on one socket of CLX (20 cores) for both matrices.One may expect that LB MPK with p m = 1 and the baseline MPK should deliver the same performance, independent of p m .They both perform the memory-bound SpMV operations successively but with different execution order within each SpMV function, and their minimum code balance from main memory is B C = 6 byte/flop (see Sec. 3.1).Hence, a data traffic (i.e.B C ) reduction and performance speedup of at most 4× may be achieved when using LB MPK for p m = 4.
For pwtk, the typical memory bandwidth saturation pattern is observed for LB MPK (p m = 1, triangles) and baseline MPK (circles) in Fig. 7a.The level-based implementation saturates at a lower level, although both variants attain the same minimum code balance of B C = 6 byte/flop (Fig. 7b).The characteristic behavior is the same for the LB MPK with p m = 4 (squares): In line with the expectation, our method reduces the data traffic by a factor of approximately four (B C,m ≈ 1.5 byte/flop) but it fails to improve performance at the full socket level.It even falls behind the baseline MPK for larger core counts.Further analysis reveals a 1.6× increase in retired instructions 3 for LB MPK (p m = 4) compared to the baseline approach.These instructions are executed in the spin-waiting loop of OpenMP barriers [36], indicating that the synchronization between threads (performed after each computation of a level) is a potential bottleneck.An analysis of the level structure of the pwtk matrix confirms the relevance of synchronization cost as the average level size is approximately 850 rows only.At an average of 53 nonzeros per row, the workload of a level is just too low to ignore the synchronization cost, which increases with thread count and may reach a few thousand cycles at a full socket. 4 The Flan_1565 matrix shows an opposite characteristic.The performance of LB MPK with p m = 1 is in line with the baseline approach, and the level blocking with p m = 4 achieves a performance improvement of 1.2× (see Fig. 7c).The moderate speedup of LB MPK is reflected in Fig. 7d by its rather high (measured) code balance of approximately   4 byte/flop, indicating that level-blocking is not very cache efficient in this case.The matrix level structure plays a decisive role here as there is a rather small number of levels, some of them being large.Already one of these large levels, which may contain up to 20, 000 rows (with about 75 nonzeros per row) has a size of roughly 18 MB, which is more than half of the L3 cache size of the CPU.Moreover, the small number of levels in combination with imbalanced level sizes may cause the irregular performance scaling of LB MPK (p m = 4) in Fig. 7c.
In the following three sections we describe three optimizations of the LB MPK, which are motivated by the performance shortcomings identified above.The first two are targeted at reducing the synchronization cost by forming larger levels ("level groups") and substituting the expensive barrier by point-to-point synchronization.The third optimization improves performance on matrices with dominant, bulky levels by recursively splitting these up ("recursion") to improve cache efficiency.

Level groups (LG)
The formation of larger levels follows the idea presented in [10]: Successive levels are aggregated into so-called level groups.This allows our LB MPK to operate on these level groups instead of the original levels.Figure 8a shows the fifteen levels of Fig. 4c being clustered into five level groups T (0)-T (4) (T (i) denotes i-th level group).The Lp diagram can easily be adapted by replacing the levels by the level groups on the x-axis (see Fig. 8b). 5 Still, the same parallelogram-style blocking can be applied by traversing the level groups using the same rules as for the levels.Parallel execution is performed within a level group, and all threads synchronize after the computation of each group.This strategy satisfies the neighborhood dependencies between levels as required by the LB MPK.
The cache reuse requirements of the LB MPK impose strict limits on the size of the level groups.As discussed in Sec.4.2, p m + 1 neighboring level groups have to be kept in cache.Therefore, if we assume neighboring level groups to be of similar size, the following criterion has to be satisfied by the i-th level group T (i): where N nz (T (i)) is the number of nonzeros in T (i), C is a parameter representing the available cache size (in bytes), and f is a safety factor.The cache size parameter is typically chosen to be less than or equal to the physical size of the cache(s) targeted for level blocking.The safety factor (f = 0.5 in this work) accounts for extra traffic from other data structures and inefficiencies of the cache replacement policies.
The left part of inequality ( 5) is the total memory traffic generated by accessing p m +1 level groups (assuming 12 bytes per nonzero entry of the matrix, see Sec. 3.1), and the right part is the effective cache size.If ( 5) is satisfied then level group T (i) can be reused from cache for p m > 1; otherwise, at least parts of it must be loaded from main memory.Inequality ( 5) is crucial to the construction process of the level groups.We form the first level group T (0) by accumulating levels L(0) . . .L(j) up to the largest j for which The same procedure is repeated starting from level L(j + 1) to find T (1), and successively forming the other level groups.It can be seen from Fig. 8a that this procedure creates level groups with almost equal numbers of nonzero elements.In regions where levels contain fewer nonzeros per level, more levels are aggregated (see T (0) in Fig. 8a) while in regions with bulkier levels, even a single level can form a level group (see T (2) in Fig. 8a).As a result, the number of level groups is typically much smaller than the number of levels.
In the level-group-based scheme, synchronization only happens after the computation of each level group, which greatly diminishes the impact of barriers in case of LB MPK for the pwtk matrix: The performance of the LB MPK for p m = 4 (LB+LG; triangles in Fig. 9a) improves on a full socket by 1.6× compared to the baseline MPK approach.At the same time, we only encounter a minor increase in the measured code balance (see Fig. 9b) since the condition ( 5) (a) Graph of the matrix.-

Point-to-point (p2p) synchronization
The concept of level groups allows us to relax the locksteplike synchronization by eliminating the OpenMP barrier after computation of a level group.The parallel LB MPK must ensure that the computations on the following levels and level groups are completed before the computation of power p for a given level group T (i): Note that the most stringent condition (A) can be enforced without a global barrier synchronization since it is only relevant when a level group T (i) is visited again for computing the next power on it, which happens after a full diagonal traversal.We thus implemented a customized locking mechanism (see below for details), which allows threads to spread out over a full diagonal of the Lp diagram.They only need to check if all threads have finished computing the previous power of the current level group.Due to the diagonal traversal scheme of the Lp diagram, condition (A) implies condition (B) as the southwest neighbor of a level group is always visited before its bottom neighbor (see numbering of execution order in 8b).Finally, a similar mechanism is required to ensure condition (C).Here, only the completion of the relevant boundary level of the southeast neighbor has to be ensured (see Fig. 8c).As vertices are statically assigned to threads, this boundary level is typically calculated by a single or only a few threads, further relaxing the synchronization demands between all threads.This can be observed in Fig. 8d, where only the first thread (tid=0) is involved in the computation of the relevant boundary level of T (4).
The locking mechanism is implemented as follows: An array of locks is defined in the initialization phase, which allows us to control the above dependencies.The threads use omp atomic and spin-waiting loops to set and test the locks.Figure 9a shows the performance scaling of this implementation (LB+LG+p2p; diamonds) in comparison to the other variants; it yields a performance boost of 1.2× over the version with level groups and barrier synchronization (LB+LG).A part of this speedup comes from the reduced synchronization cost.The rest is due to the relaxation of lock step synchronization that allows for overlap between memory and cache transfers, i.e., some threads can work on the memory-bound phase (p = 1) while the rest work on a cache-bound phase (p > 1).The optimization thus brings us close to our phenomenological ECM model (stars in Fig. 9a) and results in a 2× speedup over the baseline approach.Note that as the sizes of level groups change, traffic within inner cache levels will also change.Since the ECM model uses this data traffic as input, it results in slightly different models when sizes of level groups change.This can be observed for example by comparing Fig. 7a and Fig. 9a.

Recursion
The negative impact of bulky levels (which do not satisfy ( 5)) on main memory traffic for the LB MPK approach (see Fig. 7d) has been identified and discussed for the Flan_1565 matrix in Sec.4.2.In the RACE coloring scheme [10], a recursive approach has been presented to generate higher levels of parallelism within bulky levels.
The same method can be used in our context to successively generate new levels or level groups of reduced size until they fit into cache.The idea is to apply the LB MPK presented so far to the subgraph defined by a single level or a set of consecutive levels.As a result, a new set of smaller levels is generated for this subgraph.If some of the new levels still violate ( 5), the procedure is applied again to the new subgraph defined by these levels.This procedure can be continued until all levels fit into a cache.We start by locating (consecutive) levels that do not fit in a cache and isolate the subgraph formed by these levels.BFS is applied first to this subgraph, and then a set of level groups is formed from these BFS levels.The resulting level groups are typically smaller than the previous ones as neighboring vertices outside the subgraph do not need to be considered.Figure 10 illustrates this procedure for our stencil example and a hypothetical cache which satisfies (5) for level groups T (i) containing no more than six vertices.We find that the three bulkier level groups (containing one level each) T (4) -T (6) do not satisfy the condition.The subgraph induced by these three levels is formed (shaded with gray background in Fig. 10a), and we identify the eight BFS levels of this subgraph (Fig. 10b).Following the discussion in Sec.4.3, the level groups of the subgraph are constructed (Fig. 10c).They are now small enough to satisfy (5) and the process stops.
In general, the procedure can be applied recursively until the level groups satisfy (5) or a user-specified maximum recursion stage s m is reached, where s m = 0 is the case without any recursion.In the following, s (≤ s m ) denotes the current recursion stage.The maximum recursion stage should, however, be limited as applying the recursion step leads to loss of data locality at the boundaries of the subgraph.This happens because the subgraphs are permuted (BFS) without taking into account the neighbors outside the subgraph.Figure 11 demonstrates this effect by comparing the matrix structure of our stencil example without recursion (s m = 0) and with one recursion step (s m = 1) applied to the inner levels.The matrix bandwidth increases for the boundary elements of the subgraph because of the mismatch of the vertex numberings outside and inside the subgraph.While access to the matrix elements remains linear, the more irregular accesses to the right-hand side vector may impact the overall MPK performance.Note that the graphical representation in Fig. 11b exaggerates this effect, since in our toy problem the subgraph represents a substantial fraction of the full problem.The performance influence of the maximum recursion stage s m is discussed later in Sec.5.3.
As each subgraph (formed from consecutive levels) of a recursion stage creates its own level groups, we construct Lp diagrams for each subgraph, i.e., Lp s represents the Lp diagrams of recursion stage s. Figure 12 shows the two Lp diagrams of the stencil example for p m = 2: Lp 0 representing s = 0 on the full graph (Fig. 10a), and Lp 1 after the first recursion stage of the subgraph corresponding to level groups in Fig. 10c.Note that the numbering of the execution order is local to each Lp s diagram.All level groups of a subgraph of Lp s to which recursion is applied have the same execution order in Lp s (e.g., the subgraph related to T (4)-T ( 6) in Lp 0 is executed in step 8 of Lp 0 in Fig. 12).The actual execution order of the vertices in this subgraph is determined by Lp s+1 (see Lp 1 in Fig. 12).In general, the actual execution of a given vertex is determined by the Lp diagram associated with the highest recursion stage of the vertex.Of course the actual execution order in the Lp s diagrams still needs to maintain the data dependencies of the LB MPK.With p m = 2 as used in Fig. 12 we can still maintain our diagonal-type execution order within the graphs: T (7) of Lp 0 is updated to p = 1 at step 7. Lp 1 is calculated as step 8 of Lp 0 .In step 9 of Lp 0 , T (3) is updated to p = 2.
For p m > 2, the dependency relations between execution order of Lp s and Lp s+1 are more complicated.This is depicted in Figure 13, where Lp 0 with p m = 3, 5 is shown for 15 level groups and T (6)-T (8) form the subgraph on which Lp 1 is built.Actually, all nodes in the parallelogram formed by the diagonals in Lp s (Lp 0 in our example) that cross the subgraph to be refined have dependency relations to the vertices in this subgraph.Within the parallelogram, there are three different types of dependencies for the nodes to be computed at Lp s (Lp 0 in Figure 13) and which are not in the subgraph to be refined: (i) Nodes which provide input only to Lp s+1 and which need to be calculated before Lp s+1 (orange color in Figure 13), (ii) nodes which have only an output dependency on Lp s+1 and need to be calculated after Lp s+1 (blue color in Figure 13), (iii) nodes within the "diamond" embedded in the parallelogram, which have input and output dependencies related to the computations in Lp s+1 and need to be calculated in coordination with Lp s+1 .All nodes within the "diamond" thus have the same execution order in Lp s , and the calculation of Lp s+1 also involves computation of level groups outside the subgraph to be refined.This diamond-type execution structure is well known from diamond tiling [32] applied to stencils.
Note that this recursive refinement approach is not limited to a single subgraph of a given Lp s .However, if multiple subgraphs need to be refined, the parallelograms formed by these subgraphs must not overlap.
The impact of the presented recursion scheme on the performance of the LB MPK method for the Flan_1565 matrix with p m = 4 is shown in Fig. 14a.We used a cache size parameter C = 45 MB for LB MPK methods and set s m = 4 for the case with recursion (squares).In this setting, the Lp 0 diagram has three subgraphs to which recursive treatment is applied.Via improved cache reuse, the recursion improves the full-socket performance by a factor of almost 1.4× compared to the version without recursion.This comes with a corresponding reduction of almost 2× in main memory data traffic (Fig. 14b).Compared to the baseline MPK approach, we achieve an overall reduction of main memory traffic by 3.2× and an increase in performance by 1.8× on a full socket of CLX.These numbers and the (close to) linear scaling of our method indicate that main memory access is no longer the performance bottleneck.

RACE
The LB MPK algorithm including all optimizations discussed above has been implemented in the RACE library (code available at [37]).In the following we therefore refer to our LB MPK implementation as "RACE MPK."The library supports both preprocessing and execution phases of the LB MPK.
For preprocessing, RACE requires the matrix, highest power p m , cache size C, and maximum recursion stage s m as input and returns the permutation vector as output.The user then has to pass the permuted matrix and a call-back function to RACE for execution.RACE will execute the call-back function in parallel (using OpenMP threading) according to the internally created level_ptr and Lp diagrams.

PARAMETER STUDY
Our RACE MPK as introduced in the previous section has three input parameters: the maximum power p m , the cache size C, and the maximum recursion stage s m .In this section we discuss the qualitative impact of these parameters on the performance of RACE MPK.

Influence of p m
Ideally, RACE MPK requires to access main memory for each level group exactly once at p = 1.The remaining p m − 1 accesses can potentially be served from the cache(s) (see Figs. 9b and 14b).As a consequence, cache utilization and performance should increase with p m .However, as p m gets larger, the number of level groups grows and their size must reduce as condition (5) has to be fulfilled, which results in higher synchronization cost.These opposing effects result in a typical performance pattern as shown in Fig. 15a for the pwtk matrix on CLX.Initially the performance increases almost linearly with p m but starts to drop gradually at larger p m (≈ 6-8 in our example).For matrices that require recursion, the performance drop is more prominent and occurs at a lower p m as shown in Fig. 15b for the Flan_1565        matrix on CLX.The additional overhead at the boundaries of the recursively refined level groups (see discussion in Sec.4.5) add another performance penalty.Of course, the p m value at which performance starts to decrease depends on the matrix and the cache size.This can be observed by comparing the performance of Flan_1565 on the three architectures (Figs.15b-15d).On ROME (Fig. 15d) with its large last-level cache, the matrix does not require recursion at all and the performance increases up to p m = 10, where the RACE MPK achieves a speedup of 4× compared to the baseline MPK.
The ICL (Fig. 15c) and CLX (Fig. 15b) CPUs need recursion to achieve best performance.The maximum performance is attained at p m values of 5 and 4, resulting in speedups of 2.3× and 1.8× with respect to the MPK baseline on these two architectures.Note that performance improvements decrease with decreasing cache sizes.
For applications computing A k x using RACE MPK, the best strategy is to identify the optimal p m value p opt m and perform the A p opt m x computations multiple times (if k>p opt m ) until the power k is reached.If k is not a multiple of p opt m , the remainder computations can be done using MPK kernels with p m < p opt m .

Influence of C
The interaction of cache size C and highest power p m is shown as a heatmap in Fig. 16 for the pwtk and Flan_1565 matrices on CLX.The optimal C value is between 25 and  45 MB irrespective of p m and the matrix.This is in good qualitative agreement with the aggregate size of the L3 (27.5 MiB,victim) and L2 cache (20 MiB) of CLX.Of course, the RACE MPK method works best when blocking for the biggest available cache.Smaller C values lead to smaller level groups (see (5)) and therefore higher synchronization and recursion overheads.On the other hand, C values bigger than the total cache size will obviously provoke cache misses.

Influence of s m
For matrices that require recursion to fulfill (5), the maximum recursion depth s m may stop the recursion procedure even if the condition is still violated for some level groups.Figure 17a depicts the performance behavior of the Flan_1565 matrix with p m = 4 on CLX as a function of s m .Initially, the performance increases with s m as the level groups become smaller.When ( 5) is fulfilled at s m = 4 for all level groups, performance saturates.Note that increasing s m does not always have the positive performance effect as observed for Flan_1565.The overhead at the boundaries of the refined subgraphs may overcompensate the gains of increased cache efficiency.For example, in case of the RM07R matrix on ICL (not shown in plots) with p m = 3 (= p opt m ) it was found that s m = 0 (no recursion) achieves 1.2× better performance than s m = 13, where all the level groups fit in cache.Of course, the optimal value of s m is determined by an intricate interplay of cache properties and matrix properties and thus cannot be found analytically.Typically, recursion should only be applied if condition (5) cannot be fulfilled with s m = 0.In this scenario, recursion depths up to s m = 15, . . ., 20 should be scanned for best performance.
The preprocessing cost increases with s m as levels have to be found for recursive subgraphs.This can be seen in Fig. 17b for the Flan_1565 matrix, where the preprocessing cost (shown in equivalent SpMVs) increases with s m up to s m = 4.The construction of levels (BFS) dominates the preprocessing time.The other parameters p m and C do not have a considerable impact on preprocessing time as changing them does not require to generate new levels.

PERFORMANCE EVALUATION
In this section we investigate the performance of RACE MPK and compare it against the baseline MPK for 36 different sparse matrices commonly seen in literature.The details of these matrices can be found in Table 2.

Experimental setup
All matrices were stored in the CRS data storage format (see Sec. 3.1).Unless specified otherwise we used all the cores on one CPU socket and one thread per core.To ensure vectorization of the kernels we used #pragma simd vectorlength(VECLEN) reduction(+:tmp) on the innermost loop of the SpMV (see Fig. 2).The vector length (VECLEN) was specified explicitly and was chosen to be the maximum SIMD width of the hardware, i.e., four on ROME and eight on ICL and CLX.
For both baseline and RACE, the matrices were preprocessed with RCM reordering using the Intel SpMP [38] library if it improved the performance.The baseline method was parallelized using the #pragma omp parallel for schedule(static) workshare construct along the outermost loop (over matrix rows). 6RACE is parallelized using OpenMP pragmas by manually assigning the vertices in each level group to the threads and implementing the pointto-point synchronization mechanisms discussed in Sec.4.4.The parameter space of RACE (see Sec.

Results
Figures 18a, 18d, and 18g show the performance of baseline and RACE MPK on CLX, ICL, and ROME, respectively.The matrices are ordered (left to right) according to increasing data-set size (number of nonzeros).The vertical lines represent the total cache size of the respective hardware and thus categorize matrices into memory-resident (right of line) and cache-resident (left of line) scenarios.
6.Note that static scheduling was chosen as the benchmark matrices (see Table 2) did not have highly imbalanced row lengths.
7. in the format [start value : increment : end value] For the smallest matrices, RACE does not usually show significant speedup over the baseline method as these matrices comfortably fit in cache.However, as the working set approaches the cache size, RACE starts to develop clear performance advantages.On CLX and ICL, this effect is pronounced already for larger "in-cache" matrices, while for ROME the benefit of RACE MPK starts exactly at the boundary between cache-and memory-resident matrices.There are two main reasons for this: (i) The transition between L3 and main memory bandwidth on Intel architectures is gradual compared to AMD ROME (see Fig. 1), and (ii) the L3 and L2 caches have almost similar sizes on both Intel architectures, and the blocking in RACE targets the combined L3 and L2 caches.Therefore, for smaller matrices that fit into the L3 cache, RACE can reduce the L2 traffic compared to the baseline method.On the other hand, for ROME the L3 cache is considerably bigger than the L2 and hence the blocking is performed only in the L3 cache, thereby bearing no benefit for matrices fitting in the L3 cache.
For all memory-resident matrices RACE has a clear performance advantage on all architectures, achieving typical speedups of 2× to 5× compared to the baseline MPK.This is correlated with the measurements of the average L2, L3, and main memory traffic shown in Figs.18b, 18e, and 18h.Here the baseline MPK approach is close to the SpMV's minimum traffic limit of 6 byte/flop 8 , indicating the absence of caching of matrix elements.In most cases the baseline approach is also strongly memory bandwidth bound and thus performs close to the optimistic (memory-bound) roofline limit (i.e., b Mem /B C ) of 19, 28, and 24 Gflop/s on CLX, ICL, and ROME, respectively.For RACE we find a memory traffic less than the minimum SpMV limit on all the three architectures due to caching of the matrix elements.On CLX and ICL, even the L3 traffic reduces substantially as the large (aggregate) L2 cache contributes substantially to the blocking.The reduced data traffic of RACE results in a performance higher than the SpMV in-memory roofline limit and the baseline approach.Correlated with the reduction of main memory traffic, RACE achieves the highest speedups on ROME where we observe an average (maximum) performance gain of 3.5× (5.4×).On ICL and CLX, we observe an average speedup of almost 2× and 1.6×, respectively, and a maximum speedup of 3× and 2.3×.
The significantly higher performance (as well as speedup) of RACE on ROME compared to the Intel systems can be attributed to its larger L3 cache and higher L3 bandwidth (see Fig. 1).A larger L3 allows to cache level groups for higher p m values (see (5)).This can be observed in the tuned p m values annotated with numbers on top of the RACE performance bars.We see that for the same matrices the p m values on ROME are higher than that of ICL and CLX.This allows for matrix elements to be cached longer on ROME and results in an average memory traffic reduction of 4.5× (see Fig. 18h) compared to the baseline, while on ICL and CLX the reduction is 2.7× and 2.2×, respectively.
8. The L3 traffic measurements using likwid-perfctr is double on CLX and ICL as the current version of likwid-perfctr cannot distinguish traffic between main memory and L2 cache with L3 and L2 caches; see [39] for details.

Preprocessing cost
Now that the performance behavior of RACE is understood, we need to investigate its preprocessing overhead.The box plots in Figs.18c, 18f, and 18i show statistics of RACE's preprocessing cost for memory-resident matrices.These cost is shown in equivalent number of SpMVs that can be executed during the time required for preprocessing.In general, the cost reduces as the cache size of the architecture increases, i.e., on ROME the preprocessing time is well under the time of 30 SpMVs for most matrices while on ICL and CLX the equivalent SpMV invocations are 40 and 60, respectively.This is due to larger cache sizes requiring fewer recursion stages (s m ), since the preprocessing cost increases with s m (see Fig. 17b).
Most of the preprocessing time (>95%) is spent on determining the levels using BFS.In RACE we use a parallel BFS implementation similar to the top-down approach from [40], where the parallelization is accomplished by distributing the vertices in a level (frontier) to different threads.However, this method lacks sufficient parallelism if the number of vertices in a level is too small.This is the case with the RM07R matrix, which is the outlier in the preprocessing cost on all three architectures.Here, a lot of levels contain only one vertex and preprocessing is sequential.

APPLICATION: CHEBYSHEV TIME PROPAGATION
There is a wide range of numerical methods which basically allow to map a sequence of SpMV steps to an MPK, such as s-step Krylov methods, exponential time differencing, polynomial preconditioning, or eigenvalue computations.Here we focus on the solution of time-dependent (dynamic) partial differential equations (PDEs) using an exponential time evolution operator U (∆t) [41] .We choose Chebyshev polynomials to approximate the exponential, i.e., U (∆t) = M k=0 c k (∆t)T k (A), where c k are the coefficients (which depend on the time step ∆t), M is the number of Chebyshev moments, A is a sparse matrix derived from the underlying PDE, and T k (A) are Chebyshev polynomials of order k.In our applications, Chebyshev polynomials of the first kind are used; therefore, T k (A) is defined using the following recurrence relation: The time evolution can then be computed by applying the operator U (∆t) to the current state vector x t to obtain the next state vector x t+∆t , i.e., where v k = T k (A)x t .The polynomial matrices T k (A) need not be stored explicitly as the v k+1 can be determined from previous v k exploiting the recurrence relation (6), i.e., Thus, the computation of U (∆t)x t can be implemented as a sequence of M SpMVs (Av k ) with appropriate scaling factors.This step, which propagates the system by one time step, is the compute time hotspot of Cheb-TP applications.Current implementations of Cheb-TP (see, e.g., [42], [41], [43], [44]) perform successive back-to-back SpMVs similar to our baseline MPK.However, the relation (8) and the Schr ödinger equation ih ∂ψ(x, t) ∂t = Hψ(x, t) .
choosing three application scenarios (i.e., representative sparse matrices) for both.These scenarios have been selected such that the system matrices A are sparse and real.For the heat equation ( 9), the coefficients c k depend on the modified Bessel functions and are similar to the ones found in [45].This means the code balance of the entire algorithm implemented using the baseline MPK approach would be similar to that of SpMV as discussed in Sec.3.1.For the Schr ödinger equation (10), the coefficients are complex numbers and depend on the Bessel functions [42].The complex coefficients result in complex vectors v k , and thus SpMVs between a real matrix and complex vectors have to be performed.This results in nearly twice as many flops due to complex arithmetic, but the data traffic remains almost the same as the matrix has real entries.Therefore, the code balance is reduced by a factor of roughly two in case of the Schr ödinger equation.
The value of M in Cheb-TP method has been chosen such that the Bessel coefficients have a value lower than 10 −4 (see the cut-off strategy in [42]).For RACE MPK, the tuning space is similar to the one used in Sec.6.1 above.Figure 19 shows the performance in Gflop/s of Cheb-TP using the baseline and RACE MPK approaches.Note that the baseline performance of Cheb-TP on the Schr ödinger equation is almost twice that of the heat equation due to 2× lower code balance.In general it can be seen that levelblocked approach of RACE outperforms the baseline method of back-to-back SpMVs.In line with our results in Sec.6, the highest speedup is attained on ROME followed by ICL and CLX.For the heat equation we use a three-dimensional grid of size 160 3 and discretize the spatial derivatives (i.e., ∆ψ in ( 9)) using finite differences of second, fourth, and sixth order (order-2 . . .order-6 in Fig. 19).Of course, higher-order discretization leads to an increase in the number of neighbors, i.e., N nzr of the sparse matrix (see Table 3 for details).This in turn results in bulkier levels and therefore should lead to a lower optimal p m value due to (5) and the recursion overhead.This results in lower performance of RACE MPK with increasing discretization order, as can be seen in Fig. 19a and Fig. 19b for CLX and ICL.However, the large cache size of ROME allows to maintain high p m values even for large orders and thus performance increases with discretization order as seen in Fig. 19c.This is due to the increase in N nzr causing a relative reduction in vector traffic contributions and allowing for efficient SIMD vectorization along the inner loop over N nzr .Overall, for the heat equation the RACE MPK attains an average speedup of 2.8×, 2.1×, and 1.6× over the baseline approach on ROME, ICL, and CLX, respectively.
For the Schr ödinger equation, we test our approach using three Hamiltonian matrices (H in (10)) from quantum physics applications.The Fermion and Anderson matrices model noninteracting many-particle quantum systems and the motion of a quantum-mechanical particle in a disordered solid [46], respectively.Both matrices were generated using the ScaMaC [46] library.The Graphene matrix arises from modeling graphene tubes and ribbons [47] and can be generated using the ESSEX-Physics library [48].The details of these matrices can be found in Table 3.They have varying sparsity patterns resulting in widely different level structures and indirect access patterns, which results in a wide performance range (compared to the heat equation) of the baseline method for the three matrices (see Figs. 19d,19e and 19f).Similar to the heat equation, when compared to the baseline approach the RACE MPK achieves an average speedup of 2.3×, 1.9×, and 1.6× on ROME, ICL, and CLX, respectively.

CONCLUSION AND OUTLOOK
In this paper we have developed a level-based blocking algorithm (RACE MPK) to increase the performance of sparse matrix-power-vector kernels (MPK).The RACE algorithm uses levels, generated by breadth-first search, to increase temporal access locality for the matrix entries by reusing them for successive power computations.Various hardware-oriented algorithmic optimization strategies such as level grouping, point-to-point synchronization, and recursive application of the level-blocking scheme are introduced to further improve the performance of RACE MPK.A thorough performance analysis on a representative set of 36 matrices shows that RACE MPK outperforms a standard MPK implementation by an average factor of 2× and 3.5× on modern Intel and AMD CPUs.Finally, applying RACE MPK to Chebyshev time propagation we demonstrate that similar speedups are achievable for real-world applications.
The MPK finds its use in a large variety of applications, especially in the field of communication-avoiding algorithms [12], polynomial preconditioning [49], and exponential time integration [41].Future work includes integrating RACE MPK into Krylov solvers and preconditioners from the Trilinos [50] framework.

Fig. 1 .
Fig. 1.Single socket LLC and memory bandwidth (load-only) of the three architectures under consideration.The dashed line represents the total available cache size (CS).Note the different scaling on the y-axis.

Fig. 2 .
Fig. 2. CRS-based MPK computing A pm x.The arrays val, col, and rowP tr hold the CRS data structure of A. The input and output vectors are stored in the y matrix.

Fig. 3 .
Fig.3.Blocking successive matrix applications for a simple banded sparse matrix: (a) The RHS vector is the input vector x.Yellow elements of the LHS vector are updated to Ax.(b) The next update is performed on the blue block of A to compute A 2 x on the blue elements of the LHS vector.Yellow matrix elements can be reused when computing A 2 x on blue blocks.

Fig. 4 .
Fig.4.Graph (a) and sparsity pattern (b) of the matrix associated with a 2d-7pt stencil on an 8×8 grid.In (a), the associated stencil is highlighted in red for an arbitrary vertex (54).(c) shows the permuted graph and (d) the sparsity pattern of the matrix after applying BFS reordering.The vertices (rows) of the graph (matrix) that belong to a level are represented with the same color.The level_ptr associated with the permuted graph/matrix is shown in (e)

Fig. 5 .
Fig.5.Lp diagram with 15 levels (L(0), . . ., L(14)) and a maximum power stage of pmax = 5.Level colors are the same as in Fig.4c.Each node in the Lp diagram is numbered according to the execution order.For p = 4 and level L(6), the explicit dependencies with levels at p = 3 are indicated with red arrows.The nodes highlighted in orange fulfill i + p = 13 ("diagonal").

3 .
using the event INSTR_RETIRED_ANY in likwid-perfctr 4. For the full CLX socket (ignoring hyper-threading) and the software environment used, a minimum barrier cost of 2,900 cycles was measured by direct barrier benchmarking.

Fig. 7 .
Fig.7.Scaling performance and main memory traffic of our LB MPK implementation for Ax (pm = 1) and A 4 x (pm = 4) in comparison to the baseline MPK on one socket of CLX for the pwtk and Flan_1565 matrices.The stars show the phenomenological ECM performance model[35] (in gray) for the pm = 4 case.The model assumes that the computation of first power of a level (p = 1) does not overlap for subsequent powers (p > 1).
Lp diagram zoomed view.

Fig. 8 .
Fig. 8. (a) Levels as in Fig.4cbeing consolidated to five level groups (T (0) -T (4)).(b) Lp diagram corresponding to the level groups and the execution order of each level group at different power stages.The bold red arrow (vertical) corresponds to the dependency with all the levels of the same level group T (i) at the previous power stage p − 1, and the slanted red arrow corresponds to the dependency with the lowest-indexed level of next level group T (i + 1) at the previous power stage.The blue arrow corresponds to a dependency that is automatically fulfilled by the execution order.(c) Zoomed-in view of the T (3) and T (4) level groups in the Lp diagram.The levels within the level group are seen as square nodes and the dependency between levels in T (i) and T (i + 1) are clearly visible.The subgraph corresponding to the zoomed region is shown in (d).The vertices drawn with red circles correspond to the two boundary levels between which synchronization in southeast direction has to be established.The numbers on the vertices represent the id of the thread (tid) working on that vertex.

Fig. 9 .
Fig. 9. (a) Performance improvement of LB MPK using level group (LG) optimizations and point-to-point synchronization (p2p) for the pwtk matrix with pm = 4 on CLX.(b) Memory traffic of the four variants shown in (a).
(A) the same level group T (i) with previous power p − 1 (bottom neighbor in Lp diagram), (B) the highest-indexed (rightmost) level of T (i − 1) with power p − 1 (southwest neighbor in Lp diagram), and (C) the lowest-indexed (leftmost) level of T (i + 1) with power p − 1 (southeast neighbor in Lp diagram).

Fig. 10 .
Fig. 10.(a) Level groups in the graph.The shaded subgraph shows the level groups with more than six rows, where recursive treatment is applied.(b) BFS levels within the subgraph.(c) Level groups formed from the levels within the subgraph.

Fig. 11 .
Fig. 11.Sparsity pattern of the stencil example matrix without (a) and with (b) recursion.The entries of submatrix where recursion is applied is shown with orange color in (b).

Fig. 12 .Fig. 13 .
Fig. 12.The Lp diagram for pm = 2. Left: Lp diagram of the s = 0 recursion stage (Lp 0 ), which contains level groups of the entire graph seen in Fig. 10a.The level groups selected for recursion are highlighted.Right: Lp diagram at s = 1 (Lp 1 ), which consists of the level groups shown in Fig. 10c.The execution order of the Lp graph is shown with numbers.

Fig. 14 .
Fig. 14.(a) Performance improvement of LB MPK using recursion (squares) compared to the one without recursion (diamonds) for the Flan_1565 matrix with pm = 4 on one socket of CLX.Both versions use level groups and p2p optimizations.The performance of the baseline approach as well as the ECM model is also shown for reference.(b) Measured memory traffic of the three variants on the left.

Fig. 15 .
Fig.15.Performance as a function of maximum power pm for RACE and the baseline implementation of MPK.For cases where recursion yields a speedup, we also plot the performance of RACE without recursion (in green) for comparison.

Fig. 16 .
Fig. 16.Influence of cache size C and power pm on performance (in Gflop/s) of the RACE MPK using all cores of CLX.
s m ) Pre-processing time (b) Preprocessing cost.

Fig. 17 .
Fig. 17.(a) Performance influence of maximum recursive stage sm on the performance of the Flan_1565 matrix with pm = 4 and C = 35 MB on one socket of CLX.(b) Corresponding preprocessing cost of RACE in equivalent number of SpMVs.

Fig. 18 .
Fig. 18. (a), (d), (g): Performance comparison between baseline and RACE MPK on CLX, ICL, and ROME, respectively.The dashed line represents the total available cache size and the numbers show the tuned pm values corresponding to the RACE performance.(b),(e), (h): L2, L3, and memory code balance of RACE MPK and baseline approach on the three architectures.The memory and cache data traffic shown is the average across all the in-memory matrices (i.e., to right of dashed line in the respective performance plot).(c), (f), (i): Statistics of the preprocessing cost of RACE MPK for all in-memory matrices.The cost is shown as the number of SpMVs that can be executed in the given time.

Fig. 19 .
Fig. 19.Cheb-TP performance using the baseline and the RACE MPK on the three architectures.(a-c) Three cases of 3D heat equation with various spatial discretization order.(d-f) Schr ödinger equation with three different Hamiltonian matrices H (see (10)) derived from various physics applications.In all cases the tuned pm values of RACE are annotated in the plots.Note the difference in y-axis scaling for CLX data compared to ICL and ROME.

TABLE 1
Key specification of test bed machines.

TABLE 2
Details of the benchmark matrices.Nr is the number of rows, Nnz is the number of nonzeros, and Nnzr is the average number of nonzeros per row.

1 :
double :: val[N nz ] //store values of nonzeros in A 2: int :: col[N nz ], rowP tr[N r +1] //column index and row pointer of A 3: double :: y[N r , 0:p m ] //to store input and output vectors 4: //Perform p m SpMVs 5: for p = 1 : p m do Perform SpMV between A and in vector and store result in out.//row s and row e arguments are used to specify the start and end row //to which SpMV is applied.9: function SpMV(double :: in rhs[N r ], int :: row s, int :: row e)

TABLE 3
Details of the matrices used for the Cheb-TP application.
allows to apply the RACE MPK in the computation of U (∆t)x t and to perform level-based cache blocking across p m successive SpMVs.As Cheb-TP typically uses high values of M (few 100s-1000s), choosing p m = M is not advisable (see Sec. 5.1).Hence, we split the M SpMVs into M/p m batches and perform p m successive SpMVs via the RACE MPK within each batch.To demonstrate the performance potential of RACE MPK, we perform time propagation for PDEs underlying the parabolic heat equation