Accelerating CPU-Based Sparse General Matrix Multiplication With Binary Row Merging

Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bounded application with vast and irregular memory accesses, considering the memory access efficiency is of critical importance for SpGEMM's performance. Yet, the existing methods put less consideration into the memory subsystem and achieved suboptimal performance. In this paper, we thoroughly analyze the memory access patterns of SpGEMM and their influences on the memory subsystem. Based on the analysis, we propose a novel and more efficient accumulation method named BRMerge for the multi-core CPU architectures. The BRMerge accumulation method follows the row-wise dataflow. It first accesses the $B$ matrix, generates the intermediate lists for one output row, and stores these intermediate lists in a consecutive memory space, which is implemented by a ping-pong buffer. It then immediately merges these intermediate lists generated in the previous phase two by two in a tree-like hierarchy between two ping-pong buffers. The architectural benefits of BRMerge are 1) streaming access patterns, 2) minimized TLB cache miss rate, and 3) reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM. Based on the BRMerge accumulation method, we propose two SpGEMM libraries named BRMerge-Upper and BRMerge-Precise, which use different allocation methods. Performance evaluations with 26 commonly used benchmarks on two CPU servers show that the proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries.

The extensive use of SpGEMM in real-world applications has led to the development of several SpGEMM libraries on CPUs [7]- [10], GPUs [11]- [16], and accelerators [17]- [20], targeting at high-performance computing. In this paper, we focus on optimizing SpGEMM libraries on multi-core CPU architectures.
As a memory-bounded application, the memory access efficiency of both the main memory and multi-level caches is critical for SpGEMM's performance [10], [23]. There are two critical memory access patterns in performing SpGEMM. The first pattern is the vast and irregular memory accesses to the B matrix for the row-wise SpGEMM [9], or theĈ matrix for the outer-product SpGEMM [10]. These memory accesses usually span large memory spaces, which is not friendly for the memory system (especially the translation lookaside buffer (TLB) [24]) of modern multi-core CPU architectures. The second pattern is the memory accesses when accumulating multiple intermediate lists with varying lengths into one result row. These intermediate lists should be accessed multiple times with irregular access patterns due to the inherent sorting property of the accumulating operation. Hence, the second memory access pattern is also not friendly for the memory system (especially the L1/L2 caches) of modern multi-core CPU architectures.
The existing Heap-SpGEMM [9] and Hash-SpGEMM [9] only consider the computation complexity while putting no consideration to the memory access efficiency mentioned above. Specifically, the Heap-SpGEMM may have high TLB cache miss rate when accessing the B matrix, whereas the Hash-SpGEMM may waste memory bandwidth utilization when accumulating the intermediate lists due to the random access pattern caused by hashing operations. As a result, the two libraries suffers suboptimal performance on the CPU architectures (Section IV-B).
Although the PB-SpGEMM [10] considers the memory access efficiency, they fail to fully assess the gains and overheads in their method. They focus on solving the vast and irregular memory accesses to the B matrix in the rowwise dataflow by adopting the outer-product dataflow [9], [10]. However, their memory accesses to theĈ matrix are still irregular with an even larger amount [10]. Therefore, the PB-SpGEMM may suffer worse TLB cache hit rate than the Heap-SpGEMM. As a result, PB-SpGEMM suffers exceptionally lower performance than its precedent Heap-SpGEMM and Hash-SpGEMM (Section IV-B).
To avoid or minimize the inefficient memory access issues in the existing SpGEMM libraries, we propose a novel and more efficient accumulation method, named BRMerge, for SpGEMM on the CPU architectures. To better understand the proposed work and its architectural benefits, we first describe several backgrounds and the BRMerge accumulation method. And then, we discuss the architectural benefits of the BRMerge accumulation method and the inefficient memory access issues of the existing methods in Section III-C. We denote the proposed accumulation method as BRMerge since it is based on the binary-row-merging algorithm described in Section III-A.
The BRMerge accumulation method adopts the row-wise dataflow to perform SpGEMM, which multiplies each row of A with the whole B matrix for the corresponding row of C [7], [9], [15], [16]. It consists of a multiplying phase and an accumulating phase to compute each output row. In the first multiplying phase, the multiple required rows of B are accessed, multiplied with the nonzeros in the corresponding row of A, and the generated intermediate lists are stored in a consecutive memory space, which is implemented with a ping-ping buffer. In the second accumulating phase, the multiple intermediate lists generated in the previous phase are merged two by two in a tree-like hierarchy between two ping-pong buffers. The architectural benefits of BRMerge are the streaming access patterns, minimized TLB misses, and reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM on the multi-core CPU architectures (Section III-C).
Based on the BRMerge accumulation method, we further propose two SpGEMM libraries named BRMerge-Upper and BRMerge-Precise. The BRMerge-Upper and BRMerge-Precise use the upper-bound [12] and precise allocation methods [15], respectively. Comprehensive evaluations with 26 commonly used benchmarks on two Intel Xeon CPUs show that the proposed SpGEMM libraries outperform the state-of-the-art SpGEMM libraries significantly. Specifically, BRMerge-Precise achieves on average 1.42× and 1.39× performance speedups compared to the state-of-the-art bestperforming SpGEMM library (i.e., Hash-SpGEMM [9]) on the Intel Xeon Platinum 8163 CPU and the Intel Xeon Gold 6254 CPU, respectively.
The key technical contributions of this work are as follows: The rest of this paper is organized as follows. Section II provides backgrounds and related work on the SpGEMM algorithm. Section III describes the binary-row-merging algorithms, the proposed BRMerge accumulation method, the performance analysis of the BRMerge accumulation method, and two proposed SpGEMM libraries based on the BRMerge accumulation method and other methods. Section IV first describes the performance evaluation environments, then provides the results and discussions. Section V concludes this paper.
The source code of this paper is provided in https://github. com/lorentzbf/BRMerge.git

II. BACKGROUNDS AND RELATED WORK
In this section, we provide the backgrounds and related work of SpGEMM algorithms mainly on CPUs.

1) Notations
We define several notations for the rest of this paper. Capital letters A, B, and C denote matrices. The two input matrices are A and B with sizes of M × K and K × N , respectively. The result matrix is denoted as C with a size of M × N .
A ij represents an element located at the i th row and j th column of A. A i * represents all the nonzero elements in the i th row of A. Similarly, B * j represents all the nonzero elements in the j th column of B.

2) Dataflows to Perform SpGEMM
Given two input matrices A and B, the text-book definition of matrix multiplication is computed as: Equation (1) describes the inner-product dataflow to perform SpGEMM. One variation of Equation (1) is: where k belongs to the set of column indices of the nonzero elements in each row of A. Equation (2) describes the rowwise dataflow to perform SpGEMM. A similar variation of the row-wise dataflow is the column-wise dataflow described in Equation (3): where k is the set of row indices in each column of B.
The SpGEMM can also be computed as: where the operation computes the sum of multiple sparse partial matrices. Equation (4) describes the outer-product dataflow to perform SpGEMM.
All the four dataflows mentioned above can generate the correct result matrix C but with different performance issues on modern hardware platforms. Generally, the rowwise dataflow is reported to have the best performance than other dataflows by recent SpGEMM work on CPUs [9], GPUs [13]- [16] and accelerators [19], [20]. Note that, based on Equation (2) and Equation (3), the implementation of the row-wise and column-wise dataflows are dual of each other. Therefore, we only discuss the row-wise dataflow for simplicity.
There are three key benefits of the row-wise dataflow compared to the inner-product dataflow: 1) zero elements are entirely skipped in both the computation and memory access, 2) the computation of each output row is independent of each other; therefore, the row-wise SpGEMM can be easily parallelized, and 3) accumulating the intermediate lists of each output row has a good temporal locality. As a result, we adopt the row-wise dataflow in our proposed SpGEMM algorithm.

3) CSR Storage Format
The compressed sparse row (CSR) storage format is one of the most commonly used sparse storage formats for SpGEMM, which is commonly used in the row-wise SpGEMM [9], [10], [12]- [16]. Fig. 1 illustrates the CSR storage format. The CSR consists of three arrays to record the nonzero elements and their corresponding indices. The val and col arrays record the values of the nonzero elements and their corresponding column indices in a sorted row-major and column-major order. The lengths of val and col arrays are both the number of nonzero elements (n nz ) of the sparse matrix. The rpt array records the start and end offsets for each row's values and column indices in the val and col arrays. Since the i th row's end offset can be encoded the same as the (i + 1) th row's start offset in the rpt array, the rpt array is further compressed to M + 1 entries, where M is the number of rows. One of the key performance benefits of using CSR is that it is easy to access the elements of an entire row. In this paper, we also use CSR as the storage format for the input and output matrices in the proposed SpGEMM libraries. In performing SpGEMM, usually, multiple intermediate products are combined to generate one result element due to the duplicate indices. The compression ratio is defined by dividing the total number of intermediate products (n prod ) in performing SpGEMM by the result matrix's total number of nonzero elements (n nz ) (Equation (5)). In other words, the compression ratio shows the average number of intermediate products that generate one nonzero element in the result matrix.
Compression ratio = T otal n prod to compute C T otal n nz of C . (5)

B. RELATED SpGEMM DESIGNS
In this section we describe the existing methods to address two important design issues: the accumulation method and the allocation method. The accumulation method generates multiple intermediate lists and merges these lists to one result row. Since the structure of the result matrix cannot be known in advance, where to store the result row in the accumulation method remains a challenging task. The allocation method tackles the unknown memory allocation challenge mentioned above.

1) Accumulation Methods
The heap-based accumulation method [9]  lists' front elements to sort the result row. In each iteration, the heap pops the smallest element to the result row. The popped element will be added to the current position in the result row or be stored in a new position in the result row, based on the indices of the popped element and the current element in the result row. After this operation, the heap reads the next element from the intermediate list which provides the smallest element last time. The read element should be pushed to the heap for the next iteration. The most two time-consuming operations in the heapbased accumulation method are the pop and push operations, which are both with O(log(k)) complexity, where k is the heap size. In this paper, we use the binary-row-mergingbased accumulation method, which can be seen as a heapbased method when the heap size is two. As a result, our accumulation method's pop and push operations are simply one comparison operation and one pointer addition operation for the two input intermediate lists.
The hash-based accumulation method [9] merges the intermediate lists by inserting all the intermediate elements into a hash table. After the insertion, the valid nonzero elements are extracted and sorted to generate one result row [9], [15]. The hashvec-based accumulation method [9] follows the same computation flow as the hash-based accumulation method. The difference is that the hashvec-based accumulation method uses the Intel's SIMD instructions [26] to implement the hashing operation. The hashvec-based accumulation method is reported to have better performance than the hashbased counterpart when the hash collision rate is high [9].
The ESC-based accumulation method is first proposed in the GPU work [11] and is recently used by the CPU work [10] with the outer-product dataflow. The ESC represents Expand, Sort, and Compress. It first expands the intermediate lists by storing them consecutively without order. Then it sorts the unordered intermediate lists with duplicate column indices. At last, it compresses the sorted list by merging the duplicate elements which are consecutive to each other.
The row-merge-based accumulation method is proposed on the GPU architectures [13], [14]. It merges a pre-defined power-of-two number (e.g., 2, 4, 8, 16 et al.) of intermediate lists simultaneously using subwarp [27]. It is similar to the heap-based accumulation method in that it merges multiple intermediate lists simultaneously. Since it uses one thread to maintain one or more intermediate lists and accesses each of these intermediate lists consecutively within the thread, the warp-level accesses to these intermediate lists are completely non-coalesced, which is detrimental to its performance on GPU architectures.

2) Allocation Methods
The structure of the result matrix (i.e., the number of nonzero elements per output row) cannot be determined before performing SpGEMM. Hence, allocating the memory space to store the result matrix, which requires information of the output structure, is a challenging task. We describe two commonly used allocation methods to tackle this challenge.
The two allocation methods are also implemented in the two proposed SpGEMM libraries.
The upper-bound allocation method [9], [12] first computes the number of intermediate products (n prod ) per output row as the upper-bound output structure, which is used for the memory allocation for the result matrix in an temporary memory space. The computation of the n prod is a very lowcost operation compared to SpGEMM. And then, the result matrix is computed and stored in the temporary memory space. Since the result matrix should be stored in a standard storage format (e.g., CSR format), an additional memory copy operation is required to move the result matrix from the temporary memory space to the memory space conforming to the standard storage format.
The precise allocation method [9], [15], [16] first computes the number of nonzero elements (n nz ) of each result row. This process only computes the indices information of the input matrices without floating-point multiplication, which is commonly named as the symbolic computation [15]. Then the precise memory space can be allocated based on a lowcost prefix-sum operation on the computed n nz information. At last, the result matrix is computed and directly stored in the standard memory space without the need for the additional copy operation. However, one additional cost when using the precise allocation method is the nontrivial symbolic computation, which has a similar complexity to the SpGEMM itself.

III. METHOD
In this section, we first describe the binary-row-merging algorithm and the BRMerge accumulation method. And then we analyze the architectural benefits of BRMerge and the inefficient memory access issues of the existing methods. At last, we describe two proposed SpGEMM libraries, named BRMerge-Upper and BRMerge-Precise, based on the BRMerge accumulation method.

A. BINARY-ROW-MERGING ALGORITHM
The binary-row-merging algorithm merges multiple sorted lists into one list. Fig. 2 illustrates how the binary-rowmerging algorithm merges six sorted lists into one list.

B. BRMerge ACCUMULATION METHOD
Algorithm 1 illustrates the proposed BRMerge accumulation method in a single thread version. BRMerge mainly consists of two phases: 1) generating the intermediate lists, named multiplying phase, and 2) merging the generated lists into one result row, named accumulating phase.
Since the accumulating phase in BRMerge consists of multiple rounds of merging, we use ping-pong buffers to store the input and output lists in each round of merging. For simplicity in Algorithm 1, the ping_buffer and the pong_buffer represent both the col's and val's pair of the ping-pong buffers. We use an additional pair of ping-pong buffers named ping_list_offset and pong_list_offset to record Line 1 in Algorithm 1 computes the sizes of the two pairs of ping-pong buffers, which are the maximum n prod of the output rows and the maximum n nz of the rows in A. Line 2-5 allocates the ping-pong buffers. The multiplying phase (Line 10-15) follows the row-wise dataflow (Equation (2)). It multiplies the nonzero elements in each row of A with the corresponding rows in B to generate the intermediate lists. The multiple generated intermediate lists are stored consecutively in one of the ping-pong buffers (pointed by the dst_buffer). The start offset of each list is recorded in one of the list_offset ping-pong buffers (pointed by the dst_list_offset). Note that each generated list is naturally sorted since each row of B is sorted due to the CSR format.
The accumulating phase (Line 21-35) implements the binary-row-merging algorithm, which is further illustrated in Fig. 3. The accumulating phase is implemented with two while-loops. The outer while-loop determines how many rounds of merging are required. It monitors the num_list information, which is reduced by half after each round of merging. The exit condition of the outer while-loop is when num_list equals 1 (Line 21). The inner while-loop consumes the num_list intermediate lists from the src_buffer two by two and stores the merged result list to the dst_buffer (Line 25). The consume operation of two input lists is similar to the merge sort algorithm [21], except that the elements with duplicate indices are added together. When only one intermediate list is left in the inner while-loop, it is directly copied to the dst_buffer (Line 28). After the completion of each round of merging (i.e., after the completion of one inner while-loop), the pointers of the ping-pong buffers are swapped (Line 33-34) without data movement. As a result, Store the i th result row in the src_buffer into the appropriate memory space 37: Release the allocated ping-pong buffers 38: end for in each inner while-loop, we always read input lists from the src_buffer and write the result lists to the dst_buffer. After the two while-loops, the result row is stored in the src_buffer due to the last swap operation. Line 36 writes the result row in the src_buffer to the appropriate memory space determined by the allocation method.

C. PERFORMANCE ANALYSIS OF THE BRMerge ACCUMULATION METHOD AND OTHER METHODS
Two of the most critical memory access patterns in performing SpGEMM are 1) the vast and irregular memory accesses to the B matrix for the row-wise SpGEMM [9], or to theĈ matrix for the outer-product SpGEMM [10], and VOLUME 4, 2016 1 4 5 2 5 1 2 4 2 3 4 3 5 2 3 7   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  The illustrated implementation uses the same input data as used in Fig. 2.
2) the memory accesses in the accumulating phase, which accumulates multiple intermediate lists with varying lengths.
In this section, we analyze the architectural benefits of the BRMerge accumulation method for the two critical memory access patterns on modern multi-core CPU architectures. We also analyze the inefficient memory access issues of the existing methods compared to BRMerge. In the following, we present the analysis mentioned above when we show the three architectural benefits of the proposed BRMerge, which are the streaming access patterns, minimized TLB cache miss rate, and reasonably high L1/L2 cache hit rates.

1) Streaming Access Patterns
The streaming access patterns is mainly in the accumulating phase in BRMerge. The accumulating phase uses one CPU thread to merge all the intermediate lists two by two. Therefore, at any given time, it only merges two lists into one result list. The memory read and write patterns are always consecutive when accessing the input and output lists. Furthermore, the duration of the read and write operation is short since it only performs one simple comparison with one potential addition operation before the next data read and write operations. Therefore, the memory access pattern of the input and output lists is a typical streaming access pattern.
There are mainly two benefits of the streaming access patterns in modern CPU architectures. The first benefit is to maximize the memory bandwidth utilization between any two high-level memory systems (e.g., between L1 and L2 caches, or between L2 and L3 caches). In modern CPUs, the memory access granularity between the CPU and the L1 cache can be as small as 4 or 8 bytes. However, the minimum memory access granularity between high-level memory systems is usually at least a cache line (e.g., 64 bytes). When a program randomly accesses a 4-bytes memory space, which loads a cache line from the L2 to the L1, the other 60 bytes of the loaded cache line may not be used after a long duration. And after the long duration, the above-loaded cache line may have been evicted from the L1 cache, which means the bandwidth utilization between the L1 and L2 caches is only 4/64 = 6.25%. In contrast, when a program has a streaming access pattern and accesses a 4-bytes memory space, which loads a cache line from the L2 to the L1, the other 60 bytes of the loaded cache line will be used immediately. Therefore, a program with the streaming access patterns can achieve up to 64/64 = 100% bandwidth utilization between the high-level memory systems (e.g., between L1 and L2 caches).
The second benefit of the streaming access pattern is to take advantage of the hardware pre-fetching mechanism, which can increase the L1 cache hit rate. Modern CPUs usually have a hardware pre-fetching mechanism, which automatically pre-fetches the following cache line when a few consecutive memory accesses are detected [24]. Consider a typical case where a CPU program consecutively accesses and processes data in a memory space with a size between the L1 and L2 cache sizes. Due to the pre-fetching mechanism, most to-be-accessed data may have been pre-fetched from L2 to L1, improving the L1 cache hit rate. This benefit is provided that the CPU also takes time to process the accessed data; therefore, part or all of the pre-fetching latency from the L2 cache is hidden, guaranteeing an improved L1 cache hit rate.
Since the accumulating phase of the BRMerge has a typically streaming access pattern when merging the intermediate lists, BRMerge achieves the maximized memory bandwidth utilization and takes advantage of the hardware pre-fetching mechanism. In contrast, the hash-based and hashvec-based accumulation methods do not have the streaming access pattern since the hashing operation can direct the insertion to potentially any place in a hash table. Therefore, when the random memory accesses miss the L1 cache and load a cache line from the L2 cache, most memory bandwidth between L1 and L2 may be wasted. Moreover, the hash-based and hashvec-based accumulation methods cannot take advantage of the pre-fetching mechanism.

2) Minimized Translation Lookaside Buffer (TLB) Cache Misses
We discuss the TLB hit rate for two reasons: 1) both the row-wise and outer-product SpGEMM algorithms have to randomly access a large matrix (either the B matrix for the row-wise SpGEMM or theĈ matrix for the outer-product SpGEMM [10]), which may cause high pressure for the TLB cache, and 2) the TLB cache hit rate is also critical for the per-formance of CPU programs. Some may argue that the outerproduct SpGEMM can merge the partial matrices on the fly while generating them. However, this will require too many merging operations and data movement operations, which quickly cause more overheads than the gains, especially for the medium-to-large-size matrices which cannot fit in the cache [10]. Therefore, we do not consider that situation.
We first explain the importance of the hit rate of L1-d TLB (i.e., the first level data TLB). Each time the CPU accesses a memory space, it generates a virtual address. The virtual address should be translated to a physical address (with page number) by consulting the L1-d TLB before finishing the querying to the L1 cache. If accessing the L1-d TLB misses, the querying to the L1 cache is hanged, and the CPU may consult the shared L2 TLB, which has a similar cost to that of accessing the L2 cache [24], [25]. Therefore, the L1-d TLB miss also causes a similar cost to L2 cache access. Also, note that if accessing the L2 TLB misses again, the CPU should start a page table walk process, which involves much more expensive main memory accesses [25].
In most cases, one TLB miss penalty can be alleviated by multiple following accesses to the same 4KiB page, which always hits the L1-d TLB. However, if each TLB miss is not followed with many accesses to the same 4KiB page, the penalty will of course not be alleviated, which is the case for the Heap-SpGEMM. We then shall discuss the possible TLB misses for the interested SpGEMM algorithm, how the proposed BRMerge accumulation method minimizes the TLB misses, and the inefficient TLB access issue in the Heap-SpGEMM.
We provide an upper-bound estimation of the possible TLB misses for the row-wise SpGEMM algorithm on Intel's recent Skylake server-class CPU. The Skylake CPU has a 64entry private L1-d TLB per physical core backed by a 1536entry shared L2 TLB per CPU [24]. The hyper-threading technology is often enabled to improve the performance of the multi-threaded programs [24]. When the hyper-threading technology is enabled, there will be two logical cores in each physical core, which share the L1-d TLB resources [24]. As a result, each logical core only has 32 entries of the L1-d TLB.
The upper-bound estimation assumes that the memory distance of different accessed rows in B to compute one result row is more than the 4KiB page size. In other words, accessing each different row in B occupies two TLB entries, where the "two" entries are occupied by the col and val arrays of a row. Therefore, when a logical core computes a result row that needs to access 32/2 = 16 different rows in B, there may be capacity-caused L1-d TLB misses. Note that the number of the accessed rows of B is the same as the n nz of the corresponding row in A. In the real situation, the n nz of rows in A can be much larger than 16, which can cause many TLB misses in the SpGEMM algorithm.
Yet, the capacity-caused TLB misses are hard to be avoided in performing SpGEMM. However, minimizing the TLB misses can be easily achieved by ensuring that the TLB is only updated once when loading each row from the B matrix. To achieve this, in many cases, we only need to make sure each row of the B matrix is loaded consecutively in a short duration. The proposed BRMerge accumulation method meets this requirement since it accesses all the required rows in B consecutively and stores the multiplied intermediate lists in a local ping-pong buffer in the multiplying phase.
In contrast, the Heap-SpGEMM does not meet the above requirement. It maintains the accesses to all the required rows of B during the its accumulating phase. Therefore, the corresponding TLB entries for accessing each row of B may be inserted into and evicted out of the TLB multiple times, suffering not-alleviated TLB misses penalties when there are capacity-caused TLB misses.

3) Reasonably High L1/L2 Cache Hit Rates
When accumulating the multiple intermediate lists in the accumulation method, the data may be processed multiple times in the L1/L2 cache. Therefore, the L1/L2 cache hit rates are important for the performance of the accumulation method.
Since BRMerge is a row-wise accumulation method, it only processes the intermediate lists for one result row in each iteration, which is usually small and comparable to the L1/L2 cache size. Moreover, after each round of merging, the required memory size of the intermediate lists is reduced due to the merging of duplicate indices. Therefore, the expected cache hit rates of the L1/L2 caches increases after each round of merging. Furthermore, since BRMerge has a typical streaming access pattern during the accumulating phase, the hardware pre-fetching mechanism further increases the L1/L2 cache hit rates. As a result, the L1/L2 cache hit rates of BRMerge during the accumulating phase is reasonably high.
In summary, the architectural benefits of the proposed BRMerge accumulation method are the streaming access patterns, minimized TLB misses, and reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM on the multi-core CPU architectures.

D. BRMerge-UPPER AND BRMerge-PRECISE
To implement a high-performance SpGEMM algorithm, the allocation and the load balance methods are also important [9], [15]. Since the upper-bound and precise allocation methods are commonly used and yields high performance [9], [12], we use the two allocation methods and implements two SpGEGMM libraries named BRMerge-Upper and BRMerge-Precise. The two SpGEMM libraries are based on the BRMerge accumulation method.
The load balance of the two SpGEMM algorithms is the same as the previous work [9], which is reported to have high performance. First, the n prod of computing each output row is counted. Then the work is statically divided into p groups in a per-row granularity, where p is the number of CPU threads. The division policy is to keep the same total n prod in each group of work. Each group of work is computed by one CPU thread. VOLUME 4, 2016 Fig. 4 (a) shows the computation steps of the BRMerge-Upper SpGEMM algorithm.
Step 1 computes the row_nprod of each output row and performs prefix sum on the row_nprod for two purposes: 1) load balance (step 2) and 2) allocating the temporary memory space for the C matrix (step 3). The temporary memory space for the C matrix is denoted as C_bar.
Step 4 allocates the ping-pong buffers and computes the row_size, col, and val arrays of the C matrix by the BRMerge accumulation method. The row_size represents each row's number of nonzero elements.
Step 5 performs prefix sum on the row_size to obtain the rpt array and total n nz of the C matrix at the same time. The col and val arrays of C are allocated according to the total n nz of C.
Step 6 copies the C_bar matrix to the standard C matrix which conforms to the CSR storage format.   Step 3 allocates the C_bar matrix parallelly, which means each CPU thread only allocates the part of C_bar for its private use. The load balance of step 4 and step 6 is based on the n prod information, which means each CPU thread computes approximately the same number of intermediate products. Whereas, the load balance of step 1 and step 5 is based on the number of rows of C, which means each CPU thread computes approximately the same number of rows.
Step 3 is computed in a single CPU thread since its computation complexity is very low (i.e., O(p), where p is the number of CPU threads). Fig. 4 (b) shows the computation steps of the BRMerge-Precise SpGEMM algorithm.

2) BRMerge-Precise
Step 1 computes the row_nprod of each output row and performs prefix sum on the row_nprod for the load balance in step 2.
Step 3 computes the row_size of the C matrix by the hashing method described in the previous work [9]. Step 4 performs the prefix sum on the computed row_size to obtain the rpt array and the total n nz of C. Then the col and val arrays of C are allocated according to the total n nz of C.
Step 5 allocates the ping-pong buffers and computes the col and val arrays by the BRMerge accumulation method. Each computed row is directly written into the col and val arrays, which conform to the CSR storage format.
The load balance of step 3 and step 5 is based on the n prod information. Whereas the load balance of step 1 and step 4 is based on the number of rows of the C matrix.

IV. PERFORMANCE EVALUATION A. EVALUATION SETUP
We compare the proposed SpGEMM libraries BRMerge-Upper and BRMerge-Precise to Heap-SpGEMM [9], Hash-SpGEMM [9], Hashvec-SpGEMM [9], PB-SpGEMM [10], and MKL-SpGEMM [28]. The evaluation is based on the FLOPS performance of the matrix square benchmarks [12], [15], [16], which is twice the number of the intermediate products divided by the execution time. The execution time is obtained by averaging the execution time of ten runs after one warm-up run. In all benchmarks, we measure the execution time of the approaches in double precision. We evaluate all the approaches on two CPU servers with different CPUs. All the evaluated approaches are compiled and executed with the same hardware and software environ- ment on each CPU server. The detailed information about the two CPU servers and the software environment is summarized in Table 1. We select 26 square matrices from the SuiteSparse Matrix Collection [22] for the evaluation. The selected matrices are commonly used for the performance evaluation of SpGEMM on CPUs [9], [10] and GPUs [12], [15], [16]. The detailed information of these matrices is summarized in Table 2. Note that the 26 matrices in Table 2 are sorted in an ascending order by their compression ratio.
The MKL-SpGEMM is implemented based on the C version of Intel's oneAPI Math Kernel Library (oneMKL) [28]. The key routine in the MKL-SpGEMM is the mkl_sparse_spmm routine. The parallelism in all the approaches including MKL-SpGEMM is based on the OpenMP framework [30]. Therefore, we use omp_set_num_threads to set the used number of threads (#threads). All the evaluations are set with the same #threads on each CPU server (Table 1).
During the performance evaluation, we observe that when the used #threads is set as the maximum #threads on a CPU server, which is 96 on the Platinum server and 72 on the Gold server, even the average execution time can vary widely in different runs. The possible reasons for this might be that the occasional operating system services may seize several CPU threads for execution and evict the executing SpGEMM tasks. Based on our experiments, finally, we set the #threads as 80 on the Platinum server and 58 on the Gold server for all the evaluations and observed a relatively stable average execution time among different runs.
Another performance note is that the scalable_malloc and scalable_free in Intel's oneAPI Thread Building Block (oneTBB) [29] are used in the state-of-theart SpGEMM libraries including Heap-SpGEMM, Hash-SpGEMM, Hashvec-SpGEMM, and PB-SpGEMM. And the scalable_malloc and scalable_free yield better performance for the parallel memory allocation and deallocation [9]. Therefore, we also use the scalable_malloc and scalable_free in our SpGEMM implementations. The PB-SpGEMM suffers exceptionally lower performance on all the tested benchmarks, including benchmarks with low compression ratios. We attribute the exceptionally low performance of PB-SpGEMM to the outer-product dataflow, which suffers worse memory access efficiency to theĈ matrix compared to the memory access efficiency to the B matrix for the row-wise dataflow. The previous work  also reported that the outer-product dataflow has significant drawbacks compared to the row-wise dataflow [19], [20].

1) Performance of SpGEMM on the Platinum Server
We compare the performance of the BRMerge-Upper with the Heap-SpGEMM since the two SpGEMM libraries implement the same allocation and load balance methods. For the benchmarks with low compression ratios and low n nz per row of the A matrix (i.e., the matrix with ids ranges from 1 to 8, except for the webbase-1M matrix), the performance of the Heap-SpGEMM is comparable to the proposed BRMerge-Upper SpGEMM library. The reason is that the two main drawbacks of the Heap-SpGEMM are avoided in these benchmarks, which are 1) the high computation complexity on benchmarks with high compression ratios [9], and 2) the high TLB cache miss rate due to accessing the rows of the B matrix in a longer duration (Section III-C). For other evaluated benchmarks, BRMerge-Upper significantly outperforms Heap-SpGEMM despite the two SpGEMM li-braries implementing the same allocation and load balance methods. For the evaluated benchmarks with the matrix ids above 20, BRMerge-Upper even achieves more than 3× better peroformance than the Heap-SpGEMM. We attribute this large performance gap to the architectural benefits of the proposed accumulation method and the high TLB cache miss rate of the Heap-SpGEMM (Section III-C).
We compare the performance of the BRMerge-Precise with the Hash-SpGEMM since the two SpGEMM libraries implement the same allocation and load balance methods. Moreover, the symbolic phase of the BRMerge-Precise also uses the hash-based method to count the n nz per row [9] which is the same as the Hash-SpGEMM. The Hash-SpGEMM is reported to have better performance than other SpGEMM libraries on benchmarks with high compression ratios [9]. The reason is that, when combined with the precise allocation method, the n nz per output row is known before the numeric phase; therefore, the accessed hash table size can be as small as the n nz per output row.
Nevertheless, for benchmarks with relatively high compression ratios (i.e., the matrices with ids ranging from 15 to 26), the BRMerge-Precise outperforms the Hash-SpGEMM. We attribute this to three reasons. First, the BRMerge accumulation method has streaming memory access patterns, which can achieve maximized bandwidth utilization and increase the L1 cache hit rate by taking advantage of the hardware pre-fetching mechanism. Whereas, the hash-based accumulation method does not have the architectural benefits mentioned above. Second, the BRMerge accumulation method has reasonably high L1/L2 cache hit rates as a row-wise accumulation method (section III-C). Third, the result list in the BRMerge accumulation method is natively sorted. However, the hash-based accumulation method has to perform the additional extracting and sorting operations after the hashing operation.
As for the benchmarks with low compression ratios, the BRMerge-Precise significantly outperforms the Hash-SpGEMM. Note that in our evaluation, the Hashvec-SpGEMM seldom outperforms the Hash-SpGEMM in the evaluated benchmarks.
2) Performance of SpGEMM on the Gold Server The performance trends of all the evaluated SpGEMM libraries except for the MKL-SpGEMM on the Gold server are similar to those on the Platinum Server. The MKL-SpGEMM outperforms the proposed BRMerge-Precise on benchmarks with high compression ratios (i.e., the matrices with ids ranging from 18 to 26). In contrast, the proposed BRMerge-Precise outperforms the MKL-SpGEMM on benchmarks with compression ratios lower than 4. Note that on the highly irregular webbase-1M benchmark, the BRMerge-Precise significantly outperforms the MKL-SpGEMM on both the Platinum server and the Gold server. On average, the BRMerge-Precise outperforms the MKL-SpGEMM by 1.73×. However, we could not explain the reasons for this performance difference since MKL-SpGEMM's detailed implementation is unavailable.

V. CONCLUSION
In this paper, we present a novel binary-row-merging-based accumulation method, named BRMerge, to optimize the performance of SpGEMM on multi-core CPU architectures. We also analyze the architectural benefits of the proposed accumulation method, which are the streaming access patterns, minimized TLB cache miss rate, and reasonably high L1/L2 cache hit rates. As a result, the proposed accumulation method can achieve both low access latency and high bandwidth utilization when performing SpGEMM on multicore CPU architectures. Based on the proposed accumulation method, we further propose two parallel SpGEMM libraries named BRMerge-Precise and BRMerge-Upper based on different allocation methods. The proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries on 26 commonly used matrices with two CPU servers. Specifically, our best-performing SpGEMM library (i.e., BRMerge-Precise) achieves on average 1.42× and 1.39× performance speedups compared to the state-of-the-art bestperforming SpGEMM library (i.e., Hash-SpGEMM [9]) on the Intel Xeon Platinum 8163 CPU and the Intel Xeon Gold 6254 CPU, respectively.