Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing data-intensive applications with low locality. In-DRAM PIMs can be categorized by how many banks perform the PIM computation per DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks, achieving high performance but introducing design issues such as power consumption and thermal dissipation. We introduce memory-computation decoupling execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., using only per-bank execution, so that it can be easily adopted on commercial platforms. We divide the PIM execution into two phases: a memory phase and a computation phase. At the memory phase, we read the bank-private operands from a bank and store them in the PIM engines' registers bank-by-bank. At the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command so that all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the all-bank execution opportunity, we introduce a compiler analysis and code generation technique that identifies the bank-private and bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM against commercial computing platforms. In Level-3 BLAS, we achieved speedups of $75.8\times$, $1.2\times$, and $4.7\times$ over CPU, GPU, and the per-bank PIM, respectively, and reached up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than GPU and the per-bank PIM, respectively, while consuming only 7.4% more energy than the ideal all-bank PIM.


I. INTRODUCTION
Emerging data-intensive applications such as natural language processing deploy Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) [1] and Transformer-based models [2]. Their primary characteristic is to process a large amount of data with very low locality [3], [4], [5], [6]. For example, Bidirectional Encoder Representations from Transformers (BERT) [7] and Generative Pre-trained Transformer (GPT) [8], [9], transformer-based models for understanding and generating human-like texts, process 110∼335 million and up to billions of parameters, respectively, each of which is used only within one layer. Therefore, the von Neumann architecture suffers from a severe data transfer bottleneck from main memory in these workloads, limiting the system performance [10], [11].

The recent in-bank PIMs can be categorized by whether they target per-bank execution [4], [21] or all-bank execution [3], [5], [6], [12], depending on how many banks one DRAM command activates for the PIM execution. The per-bank execution triggers only one bank at a time, while the all-bank execution invokes all or multiple banks. The number of commands required for a PIM kernel directly determines the PIM performance. The number of memory requests for the per-bank PIM execution is much higher than for the all-bank one, thus delivering much lower performance [4], [21]. The all-bank PIM achieves high computation throughput by exploiting the bank-level computation parallelism and using the full internal bandwidth. However, it brings the following design difficulties and potential performance degradation: 1) data alignment issues from its computation granularity, 2) synchronization overhead for syncing the bank states before starting the all-bank execution, 3) modification of the memory interface/controller, 4) implementation overhead such as power consumption and thermal dissipation, and 5) PIM mode switching, which prevents non-PIM request service during PIM operations [3], [5], [6], [12], [13]. Even though the per-bank PIM suffers from lower performance, it preserves the standard DRAM interface and supports non-PIM request service during the PIM execution as a standard DRAM does. Both per-bank and all-bank PIMs suffer from replicating bank-shared operands to all banks, resulting in multiple copies of the same data stored at different memory addresses. Several PIMs provide a global/local buffer to avoid the replication but incur significant size overhead, which is even worse in DRAM fabrication [3], [5], [6], [12].

To take only the advantages of both PIMs, this paper introduces the memory-computation decoupling architecture to achieve the ideal all-bank performance of in-DRAM PIM with the standard memory interface, i.e., the per-bank execution. Also, to maximize the decoupled execution performance, we introduce a compiler analysis and code generation technique.

The decoupled architecture uses two PIM execution phases: a memory phase for the per-bank memory operation and a computation phase for the all-bank computation. At the memory phase, we read bank-private operands from a memory array and store them to the bank's PIM registers bank-by-bank.
At the computation phase, we decouple each PIM engine datapath from its bank and broadcast bank-shared operands, read from one bank or written from a host, to all banks' engines without operating all banks' memory arrays. This allows the PIM engines in all banks to compute in parallel while operating only one bank of the memory array, achieving the ideal performance of the all-bank execution while preserving the standard DRAM interface and satisfying the standard power budget. In reality, only half of all banks could operate concurrently due to power and thermal issues [5], [6]; thus, our performance would be more attractive to users than the all-bank execution in the real world. Also, since we preserve the standard interface, we can serve non-PIM requests during our computation phase, i.e., while performing the all-bank computation. As we operate only one bank array in both phases and conform to the standard interface, we implement our architecture based on the per-bank PIM [4].

FIGURE 2. Ratio of the execution time for the matrix multiplication and element-wise operations to the total execution time in the DNN applications, LSTM [1] and BERT [7].
Xp GPU with cuBLAS [28]. Compared to CPU and GPU, we achieved speedups of 75.8×

The rest of the paper is organized as follows: Section II introduces existing in-bank PIMs and our baseline per-bank PIM [4] for this work. Section III proposes the memory-computation decoupling architecture and its compiler techniques by applying our work to Level-3 BLAS as an example. Section IV evaluates the performance, and Section V concludes the paper.

The PIM architecture places the computing unit inside memory devices to use the full internal memory bandwidth. The architecture can be classified by where the computation is performed: in-cell, i.e., computing on memory cells, and in-bank, i.e., on bank peripherals.

The in-cell PIM architecture utilizes the analog properties of Non-Volatile Memory (NVM) cells such as ReRAM and MRAM to use them as both storage and computing devices [16], [17], [18], [20], [31]. However, the analog-based computation drops accuracy due to the limited precision and the error vulnerability in analog-to-digital conversion. Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs) also result in a large area overhead [3], [32]. To alleviate such problems, some in-cell PIM architectures proposed arithmetic computation in NVM and DRAM, based on logic operations implemented upon their analog characteristics such as the resistance of ReRAM and the charge sharing of DRAM [14], [15].

In [5], the PIM unit is shared between two banks, thus allowing only half of the banks to operate at once. For supporting the lockstep-style behavior of all banks, [3], [6], [12], [13] require a customized memory controller violating the JEDEC standard, and [5], [12] require mode switching before and after the PIM operations. Besides, memory requests of non-PIM applications cannot be serviced during PIM operations [3], [5], [6], [12], [30].

Each bank has its memory cell array and a PIM engine. The memory banks receive Command, Address, and Data signals from standard memory requests as conventional DRAM devices do and accept the PIM signals generated from the PIM Interface Unit (IU). Before executing a PIM kernel, a programmer stores the PIM operands' start addresses of the uncacheable physical pages and the engine configuration information in the control registers of the PIM IU. The PIM Request Identification Unit (RIU) compares the input address with the operand addresses stored in the control registers to determine whether the incoming memory request is a PIM command. If the addresses match, the PIM valid signal is generated and delivered to the target bank along with the other signals to provide data from the bank to its PIM engine.
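To make the request-identification step concrete, the following is a minimal behavioral sketch in Python, assuming a simple start-address/size layout for the control registers; the class and field names are illustrative and not the actual RTL of the PIM IU.

```python
# Behavioral sketch of PIM request identification (illustrative only).
# Control-register layout and field names are assumptions, not the actual design.

from dataclasses import dataclass

@dataclass
class OperandRegister:
    start_addr: int      # start address of an uncacheable PIM operand region
    size: int            # region size in bytes

class RequestIdentificationUnit:
    def __init__(self, operand_regs):
        self.operand_regs = operand_regs   # programmed before launching a PIM kernel

    def is_pim_request(self, req_addr: int) -> bool:
        """Assert PIM valid when the request address falls in any operand region."""
        return any(r.start_addr <= req_addr < r.start_addr + r.size
                   for r in self.operand_regs)

# Example: a read that hits a registered operand region is treated as a PIM command.
riu = RequestIdentificationUnit([OperandRegister(0x8000_0000, 64 * 1024)])
assert riu.is_pim_request(0x8000_0040)      # PIM valid asserted for this request
assert not riu.is_pim_request(0x9000_0000)  # ordinary (non-PIM) memory request
```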

Silent-PIM has a 4-stage pipelined datapath for bfloat16, as shown in Figure 4, performing all the computations without violating the DRAM timing of the data burst, i.e., in 4 cycles. The first stage fetches the source operands into two vector registers, vecA and vecB. A switch (PimS) lies between the global data bus and the register file bus interface of the engine. The PIM valid signal enables the switch to connect the global data bus to the source or destination register for a PIM command, or disconnects it for typical memory requests. The operation is performed whenever data is stored to vecA.

We perform the PIM execution using two phases: a memory phase and a computation phase. The decoupled PIM preserves the standard DRAM interface, i.e., the standard DRAM power budget, commands, timings, and so on.

The decoupled execution outperforms the per-bank PIM in speedup and energy consumption while slightly increasing the power consumption since the PIM engines of all banks operate simultaneously in the computation phase. On the other hand, the proposed decoupled PIM shows lower performance than the all-bank PIM in the memory phase. However, the total power of the decoupled PIM remains within the standard power budget, unlike the all-bank PIM. Consequently, the proposed decoupled PIM architecture becomes more attractive and acceptable than the prior all-bank PIMs when there is a higher opportunity for broadcast in applications.

The memory phase execution is the same as the per-bank execution of Silent-PIM, whose memory request turns on/off the PimS and DataS switches of only the target bank. However, to support the computation phase, i.e., broadcasting bank-shared data to all banks' engines, we added one attribute to the source operands and modified the decoder for the PimS switch of Silent-PIM, marked in grey in Figs. 3 and 4.

For the PIM engine to identify bank-shared data during the PIM execution, we added the broadcast attribute (BC) to the source operands in the control registers of the PIM IU. A programmer provides the address of the broadcast target (i.e., the bank-shared operand) and sets the associated BC. Matching the incoming request's address with the broadcast target while ignoring their bank addresses generates the BC match signal to notify all banks' engines of the broadcast. Therefore, we modified the decoder for the PimS switch of each bank by ORing the PIM valid and BC match signals. We turn on all the PimS switches so that all banks' engines receive the broadcast (bank-shared) data from the global data bus and store them in registers. The broadcast data can either be read from a bank or provided from outside the DRAM, such as by a host's write request. A standard memory request performs the data broadcast.
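A behavioral sketch of this modified decode is shown below, assuming a fixed position for the bank-address bits; the mask value and function names are illustrative only, since the real decoder operates on DRAM address fields rather than Python integers.

```python
# Behavioral sketch of the PimS decode with the broadcast (BC) attribute.
# The bank-bit mask and helper names are illustrative assumptions.

BANK_BITS_MASK = 0x0003_0000   # assumed position of the bank-address bits

def strip_bank_bits(addr: int) -> int:
    """Compare addresses while ignoring the bank-address field."""
    return addr & ~BANK_BITS_MASK

def pims_switch_on(req_addr: int, req_bank: int, this_bank: int,
                   operand_addr: int, bc: bool) -> bool:
    # Per-bank case: PIM valid is raised only when the full address matches
    # and the request targets this bank.
    pim_valid = (req_addr == operand_addr) and (req_bank == this_bank)
    # Broadcast case: the BC match ignores the bank-address bits, so the engines
    # of every bank listen to the same data burst.
    bc_match = bc and (strip_bank_bits(req_addr) == strip_bank_bits(operand_addr))
    # The modified decoder ORs the two signals to drive the PimS switch.
    return pim_valid or bc_match
```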

Suppose the BC attribute is unset and the address of the incoming request matches a source operand address while considering their bank addresses. In that case, the PIM valid signal is generated and delivered to only the target bank, i.e., performing as per-bank. All memory requests turn on their target bank's DataS switch for accessing the data array as usual, and a PIM memory request among them also controls the PimS switches.

Figure 5 shows an example of how to control the switches in the decoupled execution phases. We use vecB for bank-private operands and vecA for bank-shared operands. At the memory phase, we turn on the DataS and PimS switches of the target bank to read the bank-private operand from its data array and store the operand to vecB bank-by-bank. On the other hand, we turn on the DataS of only one bank (Bank 0) and all the PimS switches at the computation phase.

For example, MatA and MatB are independent of the j and the i dimensions, respectively. Each matrix is independent of one of the three dimensions and reused in its lower iteration space. That is, MatA is reused (i.e., shared) within the j dimension, and MatB is shared within the i dimension. We select MatB, which is reused in a larger iteration space, as the bank-private operand since we can reuse the matrix maximally within the i dimension, i.e., repeatedly using K × J elements. Also, we choose MatA, which is reused in a smaller iteration space, as the bank-shared operand by considering the operand reuse across the banks for the j dimension. If the i and j dimensions are interchanged, MatB becomes bank-shared since the i dimension will be the lower dimension, and MatA turns into the bank-private operand.

We can consider the convolution algorithm in Figure 6(b) in the same way and determine inp as the bank-shared and wgt as the bank-private operand.

We fetch the bank-private operands and store them into the PIM engine registers at the memory phase in a per-bank manner. The higher the reuse of bank-private operands in the registers, the longer the computation phase, and the higher the performance due to the decoupled all-bank execution. To maximize the reusability, we apply loop tiling and develop a cost model to derive a tiling factor for the code generation. Our compiler technique differs from the conventional compiler's approach of employing tiling for the cache hierarchy. Our PIM does not include a cache, so we match the tiling to two PIM resources: registers for maximizing the bank-private operand reuse and the ALU width for maximizing the computation utilization.
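Returning to the matrix-multiplication example, the sketch below annotates a plain (i, j, k) GEMM loop nest with the reuse argument used to classify the operands; the nest and sizes are illustrative, with MatA, MatB, and Acc following the paper's naming.

```python
# GEMM loop nest used for the operand-reuse analysis (illustrative sketch).
# MatA is I x K, MatB is K x J, and Acc is I x J, following the paper's naming.
I, J, K = 4, 4, 4
MatA = [[1.0] * K for _ in range(I)]
MatB = [[2.0] * J for _ in range(K)]
Acc  = [[0.0] * J for _ in range(I)]

for i in range(I):
    for j in range(J):
        for k in range(K):
            # MatA[i][k] is independent of j -> reused over the j loop, so it is
            #   broadcast (bank-shared) across banks that each handle a different j.
            # MatB[k][j] is independent of i -> reused over the i loop, so it stays
            #   resident (bank-private) in one bank's PIM registers.
            Acc[i][j] += MatA[i][k] * MatB[k][j]
```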
From the decision that MatA is to be bank-shared and MatB is to be bank-private, the load/store cost of the tiled matrices is derived as $Cost_{MatA} = \frac{\#\text{ of LD}}{\#\text{ of banks}}$, since one broadcast load of MatA is shared by all banks. When the i and j dimensions are interchanged, MatA and MatB become the bank-private and the bank-shared operands, respectively. In that case, their costs become $Cost_{MatA} = \frac{I \times J \times K}{r}$ and $Cost_{MatB} = \frac{I \times J \times K}{\#\text{ of banks}}$, resulting in the optimal tiling factor of $(p, q, r) = (1, 1, 32)$.

2) ALL-BANK EXECUTION TILE BY TILE

We found an optimal tiling factor of $(p, q, r) = (32, 1, 1)$ to maximize the all-bank execution opportunity by reusing a bank-private operand as much as possible, i.e., avoiding frequent reloading of the bank-private operand. Therefore, the optimal tiling factor prefers to perform $(p \times q) \times (q \times r) = (p \times r)$, i.e., $(32 \times 1) \times (1 \times 1) = (32 \times 1)$.
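A minimal sketch of this load-cost reasoning is given below, assuming the per-operand costs stated above (broadcast cost divided by the number of banks, bank-private cost divided by its register-tile reuse factor); it is an illustration of the trend, not the paper's actual code generator.

```python
# Minimal sketch of the tiling cost model (illustrative; not the actual code generator).
# Assumption: MatA is bank-shared (one broadcast load serves every bank) and MatB is
# bank-private, reused p times once it is resident in the PIM registers.

NUM_BANKS = 16

def tiled_load_cost(I: int, J: int, K: int, p: int) -> float:
    """Total load cost when MatA is bank-shared and MatB is bank-private with tile factor p."""
    cost_mat_a = (I * J * K) / NUM_BANKS   # broadcast: shared by all banks
    cost_mat_b = (I * J * K) / p           # bank-private: reused p times in registers
    return cost_mat_a + cost_mat_b

# Sweeping the register-tile factor p shows the cost keeps falling up to the register
# capacity (32 bfloat16 elements per 64B burst), which is why (p, q, r) = (32, 1, 1)
# is preferred in the text.
for p in (1, 8, 16, 32):
    print(p, tiled_load_cost(I=128, J=2048, K=512, p=p))
```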

Both tile sizes of MatA and Acc are 32 × 1, and that of MatB is 1 × 1. However, since the DRAM access granularity is 64B (i.e., 32 elements), we regard 32 elements as a tile of MatB for one bank; thus, the optimal tiling factor becomes (32, 32, 1). Also, we concurrently execute the 16 banks by the broadcast in the j dimension; thus, the optimal tiling factor finally becomes (32, 32, 16). Therefore, we store the interleaved 32 (= q) columns of MatB and Acc across 16 (= r) banks and broadcast 32 (= p) elements of MatA to all banks 32 (= q) times. We call this a register-sized window in the rest of the paper.

Figure 8(a) illustrates the register-sized window matrix multiplication at bank 0, i.e., (32 × 32) × (32 × 16), and Figure 8(b) shows the timeline of DRAM commands executing the multiplication. For the timeline, we assume that all matrices are stored in the same row of a DRAM so that only one activation is required per bank. Also, standard memory requests perform all reads, writes, and broadcasts with the PIM computation. Each bank $i$ multiplies the pairs $(a_{0:31,0}, b_{0,i}), (a_{0:31,1}, b_{1,i}), \cdots, (a_{0:31,31}, b_{31,i})$ and accumulates the multiplication results one-by-one to calculate $c_{0:31,i}$.
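The per-bank arithmetic of one register-sized window can be sketched as follows, with plain Python lists standing in for the broadcast vecA data, the resident vecB column, and the accumulator registers; this is an illustrative model of the dataflow, not the engine's bfloat16 pipeline.

```python
# Register-sized window multiplication as seen by one bank (illustrative sketch).
# A is the broadcast 32x32 window of MatA, B_col is the bank-private column of MatB
# held in vecB, and acc stands in for the accumulator registers of that bank.

def bank_window_matmul(A, B_col):
    """Compute c_{0:31,i} for one bank: accumulate A[:, k] * B_col[k] over k."""
    acc = [0.0] * 32
    for k in range(32):                       # one broadcast of a_{0:31,k} per step
        a_col = [A[row][k] for row in range(32)]
        for row in range(32):
            acc[row] += a_col[row] * B_col[k]  # MAC against the resident vecB element
    return acc

# Example with dummy data: every bank i would run this against its own column of MatB.
A = [[1.0] * 32 for _ in range(32)]
B_col = [0.5] * 32
c = bank_window_matmul(A, B_col)
```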

To guarantee the correctness of the PIM computation regarding Figure 8, the following conditions should be satisfied: 1) each phase should start after the previous phase is finished, and 2) at the computation phase, a PIM engine should correctly determine which element of vecB is multiplied with the in-flight broadcast data.

When the phases are not separated, e.g., the computation phase starts before the memory phase finishes, some banks start the PIM computation with their vecB filled with garbage or left empty, thus violating the correctness. We prevent such situations by offloading the PIM requests using Direct Memory Access (DMA). Since our PIM architecture conforms to the standard DRAM interface, we offload the PIM requests using a conventional DMA engine. Each DMA transaction invokes one phase, and the DMA engine requests the next transaction only after the previous transaction is finished. Therefore, we ensure that each phase starts after the earlier phase is completed.

However, the memory requests within a DMA transaction

Since we used the FPGA DDR4 for PIM, we configured it uncacheable. The memory controller (MC) of the FPGA was regarded as equivalent to the host memory controller. We verified all operations at the system level.

Since our PIM architecture complies with the standard DRAM interface, we did not modify the Xilinx DDR4 memory controller IP, and a conventional DMA engine issued the PIM requests to the FPGA-based PIM [4]. The number of banks in a PIM device is 16, and each bank has 8-way MAC units. As discussed in Section II, the data is fetched as 128 bits × 4 cycles (a 64B burst). The proposed architecture can be considered as one die of 3D-stacked memory. When the memory controller captures a PIM request by scanning the requested address, the PIM device module placed between the memory controller and the DRAM emulates the DRAM access, broadcast, and PIM operations while obeying the DDR4 timings of the DRAM mounted on the FPGA.

Also, to estimate the performance of the ideal all-bank PIM, we modified the Xilinx memory controller to simulate the all-bank execution behavior (i.e., one command operates all banks at once). We assumed that all banks operate by one DRAM command; thus, we call it "ideal".

We ran microbenchmarks of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT.

Although we applied the ALU-width tiling, we underutilize the 8-way ALUs for I < 8; thus, the number of memory requests on our PIM for I < 8 is the same as for I = 8. Therefore, the memory requests increase at every multiple of 8 in I. When I is a multiple of 32, the proposed PIM requires only 9.3% of the memory requests of PIM_PB due to maximizing the reuse of the bank-private operands and sharing the bank-shared operands by broadcast.

The number of RD A requests in our approach is the same as RD B in PIM_AB when I is a multiple of 32, since both requests trigger the all-bank computations and their numbers of computations are the same. However, as our decoupled PIM uses the memory phase for the bank-private operands (i.e., RD B) in a per-bank manner, it needs 48% more total memory requests than PIM_AB.

2) EXECUTION TIME AND SPEEDUP

Figure 12 illustrates the execution time in a log scale and the speedup normalized to CPU_S running the Level-2/3 BLAS algorithms on each platform by varying I. The performance was measured assuming that all the matrices are stored in main memory, i.e., not yet brought into any cache at the start. The execution times of CPU_S and CPU_P grew slowly when I was small because of the data reuse in a cache, but became proportional to I as the data size increased. The speedup of CPU_P using 16 logical cores increased slightly from 4.5× at I = 1 to 5.8× at I = 32 and degraded slightly as I increased further due to cache misses. GPU spent over 90% of the time copying the input/output data to/from the device; therefore, its execution time was longer than CPU_P when I < 8. However, its execution time hardly increased thanks to the massive parallelism supported by its numerous streaming multiprocessors, and its speedup continuously grew as I increased, up to 65.9×.

Since it was observed in [4] that using DMA as the PIM offloading engine and applying a DMA- and DRAM-friendly data layout for PIM operands improve performance, we adopted the same approach for PIM_PB, PIM_AB, and our decoupled PIM. Such an approach allowed all PIM platforms to outperform CPU in all cases despite the relatively high data reuse in batching. PIM_PB and PIM_AB repeat the VM multiplication I times without exploiting the reuse opportunities, and their number of memory requests determines the execution time, as discussed in the previous section. They demonstrated their highest speedups of 37.1× and 169.1× at I = 1 and almost constant speedups of 16× and 86× due to CPU_S's cache effect at larger I's. Although PIM_AB demonstrates

3) DRAM BEHAVIOR
Figure 13 illustrates the breakdown of the row buffer hits/misses/conflicts and DRAM commands of PIM_PB, PIM_AB, and our decoupled PIM. We implemented performance counters inside the Xilinx memory controller for profiling the DRAM behaviors.

We compared the PIM power consumption with the conventional DDR4 peak power consumed by back-to-back RDs, 5.95W [12]. PIM_PB obeys the standard DRAM constraints as it operates at most one bank. Our PIM also adheres to the standard memory power as it operates in a per-bank manner in the memory phase, and also operates at most one bank of the memory array to broadcast the bank-shared operand to the PIM engines of all banks in the computation phase. Therefore, PIM_PB's and our PIM's worst-case peak powers remain close to 5.95W considering the 0.03W of engine power, i.e., 5.98W in the worst case. PIM_AB consumed a peak power of 21.58W when performing RDs for all 16 banks, far exceeding the conventional peak power. Only four banks could perform the computations simultaneously within the conventional peak power [6], [12]. In [5], the authors reduced the power consumption by limiting the concurrently operating banks to half and avoiding the data transfer to external I/O in the all-bank PIM mode. Their back-to-back RDs consumed 105.4% of the normal HBM2 power [54].

Figure 14 illustrates the energy consumption normalized to CPU_S in a log scale. The normalized energy consumption of CPU_P and PIM_PB did not vary much with the I size. The normalized energy consumption of GPU became lower as I increased but was always worse than our decoupled PIM. At I = 128, the energy consumption of GPU was 94.6% and 33.0% less than CPU_S and PIM_PB, respectively. Our PIM consumed 98.5%, 72.0%, and 78.4% less energy than CPU, GPU, and PIM_PB, respectively. PIM_AB showed the lowest energy consumption among all platforms due to the fastest execution time, and our decoupled PIM consumed only 7.4% more than PIM_AB.

The execution time and the speedup of the multi-batch LSTM-based Seq2Seq model processing 1000 input data on each platform are depicted in Figure 15. A larger batch size implies a lower framework overhead for the Python-to-C++ interface and a higher opportunity for weight reuse since the

However, because of the highest speedup of PIM_AB, its energy consumption was the lowest in all cases. At the batch size of 128, the normalized energy consumption of GPU was 28.3% less than CPU_S and 20.2% higher than PIM_PB, respectively. The normalized energy consumption of our PIM was 86.3%, 80.8%, and 77.0% less than CPU_S, GPU, and PIM_PB. Compared to PIM_AB, our PIM consumed only 0.3% and 2.7% more energy at the batch sizes of 64 and 128, respectively.
Figure 17(b) shows that the normalized energy consumption of BERT on all the platforms became lower as the sequence length increased since their speedup increased, as shown in Figure 16. The normalized energy of GPU was significantly higher than all other platforms because of the excessive initialization overhead. At the sequence length of 32, the energy consumption of CPU_S (MLAS) was 30.3% less than CPU_S (MKL), and both CPU_P versions showed similar energy consumption, which was 43.8% and 48.6% lower than CPU_S (MKL), respectively. The normalized energy consumption of PIM_PB was 50.6% less than CPU_S (MKL), and our PIM consumed 40.3%, 98.3%, and 33.5% less energy than CPU_P (MKL), GPU, and PIM_PB, respectively. PIM_AB consumed the lowest energy in all cases, and the energy consumption of our decoupled PIM was 3.1% higher than PIM_AB.

We also analyzed the performance of our decoupled PIM using one more tile size, (32 × 1), for MatA, which underutilizes the ALUs. Figure 18 compares the number of memory requests, execution time, speedup, and DRAM behavior of Level-3 BLAS, (I × 512) × (512 × 2048), on our decoupled PIM for the two tile sizes (i.e., 32 × 1 and 8 × 4). At I = 32, both tile sizes fully utilized the ALUs and exploited the maximum opportunity for reusing the bank-private operands. Therefore, the performance of both tile sizes was the same.

The (32 × 1) tile operates in the same way for all the cases I ≤ 32, i.e., it underutilizes the ALUs for I = 8 and I = 16. Therefore, the execution time for all those cases was the same. On the other hand, the (8 × 4) tile fully utilized the ALUs and reduced the number of RD A requests by 75% and 50%, respectively; thus, the (8 × 4) tile showed 18% and 13% speedups over the (32 × 1) tile execution.