BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM

As the demand for transformer applications increases rapidly, technologies to solve memory bottlenecks are attracting attention. One of them is the in-DRAM Processing-In-Memory (PIM) architecture, which performs computation inside DRAM. Major DRAM makers have introduced PIM samples that execute computations in all banks simultaneously to maximize the internal DRAM bandwidth and achieve the highest performance. However, realizing them as commercial products is problematic since the all-bank execution cannot serve non-PIM applications from PIM memory during the PIM execution, thus forcing the PIM and non-PIM memory spaces to be separated. This paper proposes the BL-PIM architecture, which increases the burst length (BL) of memory requests inside a bank to maximize internal bandwidth and overlap the computation across banks, thus achieving all-bank performance. Outside a bank, the BL appears unchanged, which preserves data consistency in the memory hierarchy and allows non-PIM and PIM applications to execute together with PIM memory. Also, the memory-intensive PIM computation with a larger BL issues far fewer outstanding memory requests, thus minimizing the performance interference with other applications. We carefully extend the DRAM timing diagram and develop a cooperation mechanism between the memory controller and the PIM device. We implemented the BL-PIM architecture on an FPGA and compared its performance with real machines using four transformer models and eight compute- and memory-bound SPEC benchmarks. BL-PIM performed up to 28.9x and 12.0x faster than the CPU single-thread and multi-threaded executions in the transformer models. Also, when we increased the burst length by the maximum of 16 times, BL-PIM was 1.2x faster than the ideal all-bank PIM execution. We also experimented with multi-workload execution using the SPEC benchmarks, showing that our architecture can minimize performance interference. To our knowledge, this is the first public study of PIM multi-workload execution.


I. INTRODUCTION
Deep neural networks (DNNs) are machine learning (ML) algorithms widely used in applications across various fields; convolutional neural networks (CNNs) [1] mainly capture spatial or temporal correlation in data, exploiting high data locality [2], [3], [4], and recurrent neural networks (RNNs) [5] primarily process sequential data [6], [7], [8], [9], exploiting low locality. Most transformer models, which evolved from RNNs, repeatedly perform vector-matrix (VM) multiplications with input vectors and pre-trained constant weight matrices, and the weight size of popular models such as the T5-encoder [10] (520MB∼) and RoBERTa [11] (470MB∼) generally exceeds the LLC size [12] (∼30MB). These characteristics degrade their performance due to the limited off-chip memory bandwidth and the high data transfer energy consumption [13], [14], [15], [16], [17]. To resolve the issue, high bandwidth memory (HBM) [18] and GDDR6 have been developed; however, the required data size and computation throughput of recently proposed models increase rapidly, far beyond their memory size and throughput, and are thus still not enough to handle large models. As a challenging approach, many in-DRAM processing-in-memory (PIM) architectures [19], [20], [21], [22], [23] have been actively proposed, placing a processing unit inside a DRAM device to perform the computation near memory banks, exploiting the internal bandwidth and eliminating the data transfer to a memory controller.
The in-DRAM PIM architecture generally uses a memory request command to fetch an operand from a bank and perform the computation on it. Therefore, its performance is directly related to, i.e., linearly proportional to, the number of memory requests, and there are two ways to reduce them. One is to increase the memory request granularity; however, the CPU's cache block size is fixed, so there are few opportunities. The other is to adopt all-bank execution for PIM, operating all banks with only one memory request, thus reducing the number of memory requests, maximizing the internal bandwidth, and achieving high performance [21], [22], [24]. However, the all-bank execution incurs design issues that are unacceptable to the market, i.e., blocking non-PIM applications during the PIM application execution and separating the PIM from the non-PIM memory area, since all banks must be in the same DRAM state during the execution. It would also consume high power, generate high temperatures, and require a memory controller modification.
In this study, to remove the all-bank PIM's design issues while still providing its ideal performance, we propose a variable burst length (BL) for the per-bank execution of in-DRAM PIM, called BL-PIM, by 1) increasing the BL inside a bank to maximize internal bandwidth and overlap the computation across banks, thus achieving all-bank performance for the PIM execution, and 2) keeping the original 64B BL outside a bank to execute non-PIM and PIM applications together and preserve data consistency in the memory hierarchy. Fig. 1 shows the execution of the BL-PIM architecture with a four times larger BL (64Bx4). Each bank has a switch (pimS) on the local bus to provide data to the ALU and a switch (dataS) on the global bus to deliver data to the memory controller, both already proposed in [19] and [20]. The following are the significant contributions of our work.
First, our larger burst length allows the PIM execution to overlap across banks like all-bank PIMs, even with per-bank execution, thus exploiting higher bank-level parallelism in computation. When a PIM request is delivered to a bank, the pimS switch of the bank is turned on during tCCDxN (for example, T0 to T3 for bank 0, T1 to T4 for bank 1, T3 to T6 for bank 3, and T4 to T7 for bank 2), thus allowing the banks' overlapping execution (BLP at T1 to T6). Only the first 64-byte burst is delivered to the global bus by turning on the dataS switch (T0 for bank 0, T1 for bank 1, T3 for bank 3, and T4 for bank 2). Beyond the first 64 bytes, the PIM execution does not need the global bus but continuously transfers data through the local bus to maximize internal bandwidth. The OS page layout fundamentally offers ample opportunity to increase the BL. With 16 banks, at minimum four cache blocks (64Bx4) per bank are laid out physically contiguously on one page (4KB). Further, if we allocate four contiguous pages to match the DRAM row size, all column data in one row can be fetched by a single memory request, thus maximizing the burst length.
Second, more importantly, unlike all-bank PIM executions, BL-PIM can serve non-PIM requests during the PIM execution (T2 for bank 2) and thus provides attractive performance in a multi-workload environment. While serving the PIM memory requests through the local bus of each bank, we can use the idle global bus for incoming non-PIM requests. The all-bank PIM execution processes PIM requests in a lockstep manner [21], [22], [24]; our solution removes the resulting disadvantage of blocking non-PIM requests. Furthermore, the large burst length significantly reduces the number of memory requests from memory-intensive PIM applications, requiring only one request instead of four for bank 2's PIM execution during T4∼T7, thus leaving the global data bus idle at T5 to T7. Therefore, it reduces interference with simultaneously executing non-PIM applications and maximizes their performance opportunity during the PIM execution.
We slightly modified the per-bank PIM architecture Silent-PIM [19] to build BL-PIM. The main modification was to redefine the tCCD timing constraint in JEDEC so that the PIM memory requests can be serviced simultaneously with the standard memory requests of non-PIM applications. For this purpose, we implemented logic in the memory controller and the DRAM device to distinguish PIM from non-PIM memory requests and to support various burst length sizes depending on hardware (DRAM row size) or software (OS page management) conditions. In more detail, in the memory controller, we added a PIM Interface Unit (PIU), identical to the interface unit in the PIM device, to recognize the BL-PIM memory requests and produce only a 1-bit BL-PIM valid signal, plus a shift register to handle the tCCD timing constraint. In the PIM device, we extended the 3-bit column counter to 7 bits to support the largest burst, i.e., 128 cycles.
The remainder of this paper is organized as follows: In Section II, we explain the background and introduce the baseline per-bank PIM for this work. In Section III, we discuss the PIM performance opportunity from a larger BL, and in Section IV, we describe the proposed BL-PIM architecture. In Section V, we evaluate our PIM design's performance. Section VI introduces previous work on PIM, and Section VII discusses the limitations of our work and suggests future work. Finally, Section VIII concludes the paper.

II. BACKGROUND
A. DRAM INTERNAL FOR RD COMMANDS
DRAM is a device operated by commands whose timings are described in the JEDEC standard, and a memory controller must comply with these timings when issuing commands to the DRAM [25]. For example, since the ACT command takes several ns to activate a DRAM row, a timing margin called tRCD (Row address to Column address Delay) is required before issuing the RD command after ACT. Fig. 2(a) shows the data flow for the RD command in order: CSL (Column Select Line), IOSA (IO Sense Amplifier), and DQ (data pin). The local I/O (LIO) line is electrically isolated from the global I/O (GIO) line in DRAM, as shown in Fig. 2(b). This isolation allows DRAM to internally pipeline more than one consecutive command without waiting for the long fetch time on the local bus, i.e., to exploit bank-level parallelism.
The amount of data transferred per DRAM access, i.e., the DRAM access granularity, is the same as the cache-line size (64 bytes). Since the granularity of an RD/WR command is larger than the DQ size of the DRAM interface, the data is transferred continuously over several cycles, and the number of cycles is defined as the burst length (BL). Each DRAM device has specific LIO and DQ sizes, which determine the BL. For example, a 64-byte access granularity and an 8-byte DQ width (our design assumption in this paper) result in a BL of 8, thus transferring DQ data for 8 cycles.
The burst length is usually the same as tCCD (Column Command Delay), which represents the number of cycles for which the DRAM interface is continuously occupied. The data line from the IOSA to DQ is a global line shared by all banks and cannot be used by multiple banks simultaneously. Therefore, to prevent conflicts on the global bus, the memory controller applies the tCCD timing constraint when scheduling RD commands from other banks. If RD commands are issued back to back while preserving the tCCD timing constraint, the utilization of the global bus reaches 100%. For DRAM devices supporting bank groups, such as DDR4, the tCCD timing constraint is divided into tCCD_s (different bank group) and tCCD_l (same bank group). Therefore, in this paper, we interpret the tCCD timing constraint as either tCCD_s or tCCD_l, depending on the activated banks.
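To make the scheduling rule concrete, the following minimal Python sketch spaces RD commands by tCCD so that bursts from different banks never overlap on the global bus. The timing value, bank count, and request format are illustrative assumptions, not datasheet parameters or the controller logic used in our design.

```python
# A minimal sketch (assumed values): one 64B burst occupies the global bus
# for tCCD cycles, so a new RD may start only tCCD cycles after the
# previous one, regardless of which bank it targets.

TCCD = 8  # cycles per 64B burst with an 8-byte DQ (BL = 8)

def schedule_reads(requests):
    """requests: list of (bank, ready_cycle); returns (bank, issue_cycle)."""
    issue_cycles = []
    last_issue = -TCCD
    for bank, ready in requests:
        issue = max(ready, last_issue + TCCD)  # respect the tCCD constraint
        issue_cycles.append((bank, issue))
        last_issue = issue
    return issue_cycles

# Back-to-back reads from different banks keep the global bus 100% utilized:
print(schedule_reads([(0, 0), (1, 0), (2, 0), (3, 0)]))
# -> [(0, 0), (1, 8), (2, 16), (3, 24)]
```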
B. SILENT-PIM: OUR BASELINE ARCHITECTURE
Fig. 3 illustrates the Silent-PIM architecture [19], on which we base our work. Silent-PIM includes an engine per bank for bfloat16 8-way vector computation and an interface unit to identify the PIM memory requests. Silent-PIM satisfies the JEDEC memory standards while performing per-bank execution, thus requiring no memory controller modifications and allowing the PIM memory device to perform the computation while servicing non-PIM applications' memory requests.
For the PIM computation, before starting the kernel, a programmer stores the physical addresses of the two source operands and one destination operand into control registers A, B, and C, respectively, and the opcode in control register D in the interface unit of Fig. 4(a). The PIM Request Identification Unit (RIU) distinguishes the PIM requests from the standard memory requests and generates the PIM valid signal to the PIM switch between the local bus of each bank and the PIM engine in Fig. 2(b). The RIU compares the stored PIM operand addresses with the incoming DRAM bank/column/row address, as shown in Fig. 4(b). The bank address enables the valid bank switch between the local and the global buses. By controlling the switches, Silent-PIM can supply the data to each bank and, at the same time, transfer it to the PIM engine for the PIM execution using the standard memory requests.

III. PIM PERFORMANCE OPPORTUNITY WITH LARGER BL
A. tCCD TIMING CONSTRAINT
The memory controller schedules commands to maximize the utilization of the global bus while applying the tCCD timing constraint to prevent data conflicts on the global bus. However, this wastes internal bandwidth because the local buses stay idle. Therefore, an excellent opportunity exists to fully utilize the internal bandwidth for the PIM execution. Since the PIM execution provides data to the engine over the internal bus rather than the external bus, we can increase the memory access granularity. It should be noted, however, that the first 64 bytes of a burst longer than 64 bytes must still be delivered to the memory controller to preserve data consistency between memory and caches [26], so they are handled the same way as a standard memory request. The increment of the memory access granularity, i.e., the increment of the burst length, lets the data fetching overlap across banks through their local buses. For example, a 4x BL increment can overlap the fetching, i.e., the PIM execution, of up to 4 banks. As a result, by increasing the BL, we can exploit bank-level parallelism in computation even with per-bank execution.
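The overlap argument can be illustrated with a minimal sketch. Assuming one command slot per tCCD and an N-times longer burst keeping each bank's local bus busy for tCCD×N cycles (both simplifications of the real timing), the number of banks fetching concurrently grows with the BL multiplier:

```python
# A minimal sketch of bank-level overlap under a larger BL; all numbers
# are illustrative assumptions, not measured values.

TCCD = 8  # cycles per 64B burst

def local_bus_busy_intervals(num_banks, bl_multiplier):
    """One BL-PIM RD per bank, issued one command slot (tCCD) apart."""
    intervals = []
    for bank in range(num_banks):
        start = bank * TCCD                 # command issue slot
        end = start + TCCD * bl_multiplier  # local bus busy for tCCD*N
        intervals.append((bank, start, end))
    return intervals

def max_overlap(intervals):
    """How many banks fetch concurrently at the busiest cycle."""
    events = []
    for _, start, end in intervals:
        events += [(start, 1), (end, -1)]
    best = cur = 0
    for _, delta in sorted(events):
        cur += delta
        best = max(best, cur)
    return best

print(max_overlap(local_bus_busy_intervals(16, 1)))   # -> 1  (no overlap)
print(max_overlap(local_bus_busy_intervals(16, 4)))   # -> 4  banks overlap
print(max_overlap(local_bus_busy_intervals(16, 16)))  # -> 16 banks overlap
```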

B. OS PAGE LAYOUT
Since the DRAM access granularity is limited to the cache-line size (64 bytes), the PIM execution targeting memory-intensive applications generates a significant number of memory requests, degrading the PIM performance and interfering with non-PIM requests. For example, fetching one page of data (4,096 bytes) as operands requires 4 memory requests for each bank (Silent-PIM has 16 banks), totaling 64. The weight matrices in the transformer models are significantly large, for example, 120,320 pages in RoBERTa and 133,120 pages in the T5-encoder [10], [11]. However, the OS page layout allows us to increase the DRAM access granularity and thus reduce the memory requests.
As shown in Fig. 5, the page data is stored in 4 consecutive columns in each bank. For computing one page of data (4,096 bytes), it is therefore possible to send only one memory request to each bank with a 256-byte BL, i.e., by increasing the DRAM read granularity from 64 to 256 bytes. This increment improves the PIM performance by 4x and reduces the interference with other applications. We can maximize the burst length, i.e., a 16x BL, if the OS allocates four contiguous physical pages for an operand, matching the DRAM row size.
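The request-count arithmetic above can be summarized in a small sketch, assuming the page layout of Fig. 5 (a 4KB page striped over 16 banks as four consecutive 64-byte columns per bank); the RoBERTa page count is the one quoted above, and the sketch counts operand-fetch requests only.

```python
# A minimal sketch of how a larger BL cuts the number of PIM memory
# requests per 4KB page under the assumed Fig. 5 layout.

BANKS = 16
COLS_PER_PAGE_PER_BANK = 4  # 4KB page / (16 banks * 64B) = 4 columns per bank

def requests_per_page(bl_multiplier):
    # One BL-PIM RD fetches bl_multiplier consecutive 64B columns in a bank.
    per_bank = -(-COLS_PER_PAGE_PER_BANK // bl_multiplier)  # ceiling division
    return per_bank * BANKS

for bl in (1, 4):
    print(f"BLx{bl}: {requests_per_page(bl)} requests per 4KB page")
# BLx1: 64 requests per page, BLx4: 16 requests per page (a 4x reduction)

# Scaled to RoBERTa's 120,320 weight pages:
print(120_320 * requests_per_page(1), "vs", 120_320 * requests_per_page(4))
```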

C. DRAM INTERNAL BANDWIDTH
The PIM execution does not transfer data from a bank to the memory controller and thus does not need the global bus. Therefore, by simply turning off the bank switch between the local and the global buses and turning on the PIM switch between the local bus and the PIM engine, we can fully utilize the internal bandwidth and overlap the PIM execution across banks. As a result, we can realize the all-bank PIM performance with per-bank scheduling. The switch logic is already available in commercial DRAMs.

IV. THE BL-PIM ARCHITECTURE
In this section, we describe the design of the proposed BL-PIM architecture: 1) developing the BL-PIM state diagram and its timings, 2) configuring the BL parameter in the PIM opcode, and 3) adding the BL components in the memory controller and the PIM device.

A. DEVELOPING STATE DIAGRAM AND TIMINGS
The BL-PIM architecture considers only the read access granularity of the DRAM column command; Silent-PIM stores results in 64-byte ACC registers per bank, so there is no opportunity to increase the BL for writes. Fig. 6 shows the extended BL-PIM state diagram: an N-times longer burst simply incurs N transitions of the reading state back to itself, implying a total transition time of tCCDxN. Therefore, the memory controller and the BL-PIM device can interact without adding any new state, as long as both know the BL size before the execution. Modifying only the timings while preserving the standard state diagram allows us to schedule the PIM and non-PIM requests simultaneously; increasing the memory controller's tCCD timing constraint by a power-of-two multiple can be implemented as a simple shift operation. Note that a non-PIM request that follows a BL-PIM request to the same bank cannot be serviced until the PIM execution at that bank completes, due to a resource conflict on the internal bus.
The timing constraints change depending on which bank or type of request follows the BL-PIM RD command. The extended tCCD timing constraint, i.e., tCCDxN, applies only to the command following a BL-PIM RD command instead of tCCD, whereas the standard timing constraints apply to a BL-PIM RD request that follows non-PIM requests.
For example, a following BL-PIM RD or RD command to the same bank must obey tCCDxN, while commands to other banks are unaffected, i.e., incur no additional constraint. For a following BL-PIM WR or WR command, the timing constraint is tCCDxN + (tRL + tBL + 2 - tWL) due to the I/O turnaround between RD and WR commands, and for a following PRE command, the timing constraint is tCCDxN + tRTP.
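The following sketch encodes these rules for the command that follows a BL-PIM RD to the same bank; the timing parameter values are placeholders chosen only for illustration, while the structure of each constraint follows the expressions above.

```python
# A minimal sketch of the extended timing constraints after a BL-PIM RD.
# TIMING holds assumed cycle counts, not DDR4 datasheet values.

TIMING = {"tCCD": 8, "tRL": 17, "tWL": 14, "tBL": 8, "tRTP": 9}

def min_gap_after_blpim_rd(next_cmd, bl_multiplier, t=TIMING):
    """Minimum cycles from a BL-PIM RD to the next command at the same bank."""
    tccd_n = t["tCCD"] * bl_multiplier           # extended tCCDxN
    if next_cmd in ("RD", "BLPIM_RD"):
        return tccd_n
    if next_cmd in ("WR", "BLPIM_WR"):
        # I/O turnaround between read and write bursts.
        return tccd_n + (t["tRL"] + t["tBL"] + 2 - t["tWL"])
    if next_cmd == "PRE":
        return tccd_n + t["tRTP"]
    raise ValueError(f"unknown command {next_cmd}")

print(min_gap_after_blpim_rd("RD", 4))   # tCCDx4
print(min_gap_after_blpim_rd("WR", 4))   # tCCDx4 + turnaround
print(min_gap_after_blpim_rd("PRE", 4))  # tCCDx4 + tRTP
```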

B. SPECIFYING THE BL SIZE IN THE PIM OPCODE
We extended the PIM opcode with a 3-bit BL field to specify a 64, 128, 256, 512, or 1024-byte BL, as shown in Fig. 8. Since DDR4's row buffer size is 1KB [27], we set the maximum size to 1024 bytes. There is no configuration overhead compared to Silent-PIM because the field reuses a reserved 3-bit field of the 64-bit configuration register already present in the PIM interface unit. Suppose the number of banks is 16 [28], one operand's size is 16KB, and its allocated frames are contiguous: we can then fetch each bank's 1KB share of the operand using only one BL-PIM RD request per bank. Therefore, we can significantly reduce the number of PIM memory requests and overlap the PIM execution across banks, thus achieving all-bank PIM performance.
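A minimal sketch of the BL field handling is shown below. The bit position of the reserved field and the exact code-to-size mapping are our own assumptions for illustration; the paper only fixes the field width (3 bits) and the supported sizes (64B to 1024B).

```python
# A minimal sketch of packing the 3-bit BL field into the 64-bit PIM
# configuration register (control register D). Field position is assumed.

BL_FIELD_SHIFT = 48                     # hypothetical position of the field
BL_FIELD_MASK = 0b111 << BL_FIELD_SHIFT

def encode_bl(reg_d, bl_bytes):
    """Assumed encoding: 0 -> 64B (BLx1), 1 -> 128B, ..., 4 -> 1024B (BLx16)."""
    code = {64: 0, 128: 1, 256: 2, 512: 3, 1024: 4}[bl_bytes]
    return (reg_d & ~BL_FIELD_MASK) | (code << BL_FIELD_SHIFT)

def decode_bl(reg_d):
    code = (reg_d >> BL_FIELD_SHIFT) & 0b111
    return 64 << code                   # 64, 128, 256, 512, or 1024 bytes

reg_d = encode_bl(0, 1024)
print(decode_bl(reg_d))                 # -> 1024 (one request fetches a 1KB row)
```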

C. COOPERATING BL-PIM DEVICE AND MEMORY CONTROLLER
Both the BL-PIM device and the memory controller should recognize the BL-PIM memory requests and their BL sizes to support the BL-PIM state diagram and timing constraint for the correct execution.
For the BL-PIM memory device, we store the BL opcode in control register D as in Silent-PIM and design the BL-PIM interface to repeat the column decoding mechanism according to the BL value. All the other mechanisms, for example, detecting the PIM requests and controlling the PIM datapath, are the same as in Silent-PIM. Modern DRAM chips fetch data as much as the row buffer size in advance and select the target address through the column decoder [28]. A standard burst operation uses a 3-bit column address counter to fetch data over 8 cycles [29]. For the BL-PIM execution, the 3-bit counter is extended to 7 bits, supporting 128 cycles to fetch all activated data in the row buffer, as shown in Fig. 9. The column decoder operates according to the BL value stored in the RIU of the PIM device, turns off the global bus switch so that the data of the increased BL (beyond the first 64 bytes) is not delivered to DQ, and sends the data to the PIM engine through the local bus.

For the memory controller, we added the PIM Interface Unit (PIU) to recognize the BL-PIM memory requests and produce only a 1-bit BL-PIM valid signal for supporting the BL timing constraint, gray-colored in Fig. 10. Silent-PIM does not require modifying the memory controller since it schedules the PIM requests in the same way as the non-PIM requests. The PIU in the memory controller is the same as the one in the BL-PIM device except for the address comparison logic. The BL-PIM device's request identification unit (RIU) compares the stored addresses in the operand control registers with separately incoming addresses, such as row, bank, and column addresses, whereas the RIU in the memory controller compares them with only one incoming address, as shown in the figure. The BL-PIM valid bit is a control signal applied to the memory scheduler for handling the BL timing constraint.
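The device-side behavior can be sketched as follows, assuming an 8-byte DQ (so a 64-byte burst takes 8 cycles) and the switch behavior described earlier; the function and signal names are illustrative, not the RTL of our design.

```python
# A minimal sketch of the widened burst counter and switch routing for one
# BL-PIM RD: the 7-bit counter streams up to 128 burst cycles from the row
# buffer, pimS feeds the PIM engine every cycle, and dataS forwards only
# the first 64B burst to the global bus / DQ.

CYCLES_PER_64B = 8   # standard BL8 on an assumed 8-byte DQ

def burst_cycles(bl_bytes):
    """7-bit counter value needed for a BL-PIM read of bl_bytes."""
    cycles = bl_bytes // 8                      # 8 bytes per DQ cycle
    assert cycles <= 128, "exceeds the 7-bit counter / 1KB row buffer"
    return cycles

def switch_schedule(bl_bytes):
    """Per burst cycle: (pimS, dataS) switch settings for one BL-PIM RD."""
    schedule = []
    for cycle in range(burst_cycles(bl_bytes)):
        pim_s = True                            # data always feeds the PIM engine
        data_s = cycle < CYCLES_PER_64B         # only the first 64B goes off-chip
        schedule.append((pim_s, data_s))
    return schedule

sched = switch_schedule(256)                    # a BLx4 request
print(len(sched), sched[0], sched[-1])          # 32 (True, True) (True, False)
```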

D. REPRESENTING THE BL IN DMA'S PIM OFFLOADING
We use DMA [30] as the offloading engine for the PIM execution, specifying at least the following three fields: a source start address, a destination start address, and a transfer size. The DMA engine generates memory requests consecutively from the source and destination start addresses for the transfer size. The PIM performance depends on the number of memory requests, i.e., the number of DMA transactions and their data transfer size. Fig. 11 shows how BLx1, BLx4, and BLx16 memory requests map to the DRAM row (1KB) in each bank and the OS page, and how the DMA transactions for the 4-page PIM offloading are represented. For BLx1, we set the DMA start address to a physical page number with a transaction size of 4K bytes (4 columns × 16 banks × 64 bytes). We must repeat the transaction four times since allocating four contiguous physical pages is not guaranteed. Each transaction generates four memory requests per bank, for a total of 16 memory requests per bank. For BLx4, we set the DMA start address to a physical page number with a transaction size of 1K bytes (16 banks × 64 bytes) since the BL covers col[1:0]. It is also repeated four times, but each transaction generates only one memory request per bank, for a total of 4 memory requests per bank.
For BLx16, one DRAM row can store four pages. Suppose we contiguously allocate four physical pages to fit into one DRAM row. In that case, we can set the DMA start address to a physical page number with LSBs = {00} and a transaction size of 1K bytes (16 banks × 64 bytes) since the BL covers col[3:0]. We use only one DMA transaction and generate only one memory request per bank. Therefore, we minimize the number of memory requests and maximize the overlapping execution across banks. However, we cannot represent BLx2 in the DMA transaction without increasing the 64-byte granularity to 128 bytes.
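The DMA bookkeeping for the three cases can be summarized in the following sketch; the descriptor contents are reduced to transaction counts and sizes, and the per-case numbers mirror the text above.

```python
# A minimal sketch of the DMA transaction plan for offloading 4 OS pages
# (16KB) of one operand, following the BLx1/BLx4/BLx16 cases in the text.

BANKS, CACHELINE = 16, 64

def dma_plan(bl_multiplier, pages=4, contiguous=False):
    """Returns (num_transactions, bytes_per_transaction, requests_per_bank)."""
    if bl_multiplier == 1:
        # One transaction per page; each covers 4 cols x 16 banks x 64B.
        return pages, 4 * BANKS * CACHELINE, pages * 4
    if bl_multiplier == 4:
        # One transaction per page; the BL covers col[1:0] inside a bank.
        return pages, BANKS * CACHELINE, pages * 1
    if bl_multiplier == 16 and contiguous:
        # Four contiguous pages fill one DRAM row; the BL covers col[3:0].
        return 1, BANKS * CACHELINE, 1
    raise ValueError("unsupported configuration in this sketch")

for bl, contig in ((1, False), (4, False), (16, True)):
    txns, size, reqs = dma_plan(bl, contiguous=contig)
    print(f"BLx{bl}: {txns} transaction(s) of {size}B, {reqs} RD(s) per bank")
# BLx1:  4 transactions of 4096B, 16 RDs per bank
# BLx4:  4 transactions of 1024B,  4 RDs per bank
# BLx16: 1 transaction  of 1024B,  1 RD  per bank
```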

V. PERFORMANCE ANALYSIS
A. EXPERIMENTAL METHODOLOGY
We designed the BL-PIM architecture based on Silent-PIM, emulated it on an HTG-Z920 board (Xilinx Virtex UltraScale/XCVU190 including an ARM Cortex-A53), and measured the performance by running four transformer models (BERT, RoBERTa, T5-encoder, and GPT-2 with ONNX [31], [32]) and the SPEC2006 benchmark [33]. Also, we used MLAS [34] for the CPU execution. Tables 1 and 2 list the detailed specification of the evaluation board and the parameters of the pre-trained transformer models used in the experiment, respectively.
We did not use an x86 platform due to the slow PCIe interface to the PIM-emulated FPGA board. Instead, we used the ARM SoC environment, allocating all the applications to the programmable logic (PL) DDR4 memory area, to study the performance effect of servicing non-PIM and BL-PIM memory requests simultaneously. To the best of our knowledge, this multi-workload execution is the first shown in public. Since we emulated the BL-PIM architecture in the PL area, the experimented transformer models must also fit in the PL DDR4 memory of the evaluation board, 4GB. Fig. 12 shows the ARM SoC environment, with gray-colored boxes indicating modified or added components. The SoC environment has two separate regions: the ARM processors (PS) and the programmable logic (PL) that models the BL-PIM architecture. We slightly modified the memory controller IP provided by Xilinx [35] and the Silent-PIM device to add the BL-PIM interface unit and the BL-PIM configuration register. Also, we added scheduling for the tCCDxN requests inside the memory controller. Although our BL-PIM design is based on DDR4, our design concept can surely be applied to any commodity DRAM device.
We conducted three kinds of executions for the BL-PIM performance study: 1) matrix multiplication, 2) a single transformer workload, and 3) multi-workload execution using PIM applications and SPEC benchmark non-PIM applications. As described in the introduction, the transformer model's performance is limited by off-chip memory bandwidth due to the high proportion of low-locality vector-matrix multiplication operations in its execution. Therefore, it can be efficiently executed in the PIM architecture and is an application that can emphasize the effectiveness of BL-PIM in the single-workload experiment. The SPEC benchmark is a widely used set of applications with compute- or memory-bound characteristics. Therefore, when it is used as the non-PIM application in the multi-workload experiment, the effect on the PIM requests can be classified according to the application characteristics.
We compared the performance of the CPU multi-thread (CPU_M), BLx1/4/8/16-PIM, and two kinds of all-bank PIM executions with respect to the CPU single-thread (CPU_S) execution. The all-bank execution is unsuitable for the DMA operation since it issues only one command, to bank 0 only, to enable all banks; therefore, its PIM offloading overhead is significant. To completely ignore the offloading overhead for the all-bank PIM execution, we changed the address mapping to row-bank-col (called the ideal all-bank PIM). We did not change the mapping for any BL-PIM execution. We assumed that the OS allocated two and four contiguous physical pages for implementing BLx8 and BLx16, respectively.
The BL-PIM with large BL sizes consumes significantly less global bus and off-chip interface power than the standard 64B BL by reducing the number of memory requests, while providing the all-bank performance that is our research goal. It is well known that the all-bank PIM's power consumption exceeds the standard power budget, so it cannot activate all banks but only half of them at a time [22], [36]. Our BL-PIM, however, simultaneously consumes more local bus power and computation engine power. They consume only 42.1mW and 1.6mW per bank [19], [37], [38], for a total of 362.2mW with 16 banks, which is less than the power of activating one bank and thus within a sufficient power margin [22], [36]. Thus, the all-bank execution activating 16 banks in our experiment does not satisfy the standard power budget, but our BL-PIM does. Therefore, this paper does not compare energy consumption with the all-bank execution.

B. MATRIX MULTIPLICATION
Before studying the application-level performance, we provide a detailed performance analysis using the matrix multiplication of (p × 512) × (512 × 512) while increasing p, since this multiplication occupies more than 90% of the total execution time in our transformer models of BERT, RoBERTa, T5-encoder, and GPT-2; Fig. 13 shows the execution times.
When p is 1, the BLx1 PIM execution is 10.6x faster than the CPU single-thread. The longer the BL, the higher the BL-PIM performance; the BLx4, BLx8, and BLx16 PIM executions are 14.2x, 22.7x, and 31.4x faster than the CPU single-thread, respectively. Our BLx16 PIM execution is 2.0x faster than the all-bank PIM, which incurs significant offloading overhead, and achieves 91.0% of the ideal all-bank PIM performance. It should be noted that we moved the bank bits of the address mapping for the ideal all-bank execution, whereas our BL-PIM overlaps the PIM execution across banks without changing the address mapping.
As p increases, all the BL-PIM speedups decrease because the vector-matrix computation is repeated p times; the execution time of the multiplication increases linearly with p. However, our BLx16 PIM still achieves 26.8x, 10.1x, and 5.5x speedups over the serial CPU execution with p = 4, 16, and 64, respectively. Also, the BLx16 PIM execution is 2.5x, 2.6x, and 2.7x faster than the all-bank execution and reaches 92.1%, 91.8%, and 87.3% of the ideal all-bank execution.
The CPU performs the matrix multiplication using cache locality, and the CPU multi-thread execution's speedup saturates at about 3.8x.

Fig. 14 shows the number of DRAM commands profiled by the hardware performance counter during the PIM execution; the number of RD commands decreases by almost half with each doubling of the BL size. The PIM offloading information, i.e., the DMA descriptors, is stored in DRAM. Every PIM execution must read the descriptors before the execution, incurring DRAM row misses. Therefore, the longer the contiguous physical pages, the larger the DMA PIM transactions, the fewer the DMA descriptor accesses, and the lower the DRAM row buffer misses. The BLx1 and BLx4 use only one page, thus incurring the same row misses, i.e., the same number of (ACT+PRE) commands.
The all-bank PIM's offloading overhead generates 1.3x more RD commands than BLx16. On the other hand, the ideal all-bank PIM has the same number of RD commands as BLx16 by taking advantage of the changed address mapping. It also incurs almost no ACT+PRE commands because it operates with only one command. Although the number of RD commands is the same, the ideal all-bank PIM is slightly faster than BLx16 due to its complete overlapping execution across banks and fewer DRAM commands resulting from the identical bank states.

Fig. 15 shows the execution time ratio by the number of banks overlapped in execution with varying BL sizes. We cannot overlap all banks' execution entirely for two reasons: 1) a memory controller issues a command bank by bank, thus enabling only one bank at the beginning and the end of the execution, like pipelining, and 2) the next command to the first bank can be issued only after the RD command to the last bank. The largest BL, i.e., BLx16, can overlap only 85.9% of the total execution; however, the BLx16 PIM achieves comparable performance to the ideal all-bank PIM by issuing the same number of RD commands. A larger DRAM row buffer would allow a larger BL, thus increasing the overlapping and improving the performance further.

Fig. 16 compares the execution time of the four models on different platforms. Since the matrix-matrix multiplication consumes 91.5%, 96.1%, 98.5%, and 96.2% of the total execution time in BERT, RoBERTa, T5-encoder, and GPT-2, respectively, their overall performance is very similar to that of the matrix-matrix multiplication. In all the models, the larger the BL size becomes, the faster the BL-PIM executes at the same sequence length.

C. SINGLE WORKLOAD: TRANSFORMER MODELS
In BERT, when the sequence length is 1, the BLx1 PIM execution is 9.6x faster than the CPU single-thread and 3.6x faster than the CPU multi-threaded execution. As the burst length increases, the BLx4, BLx8, and BLx16 PIM executions are 10.9x, 17.1x, and 15.2x faster than the CPU single-thread and 4.1x, 4.9x, and 5.7x faster than the CPU multi-thread, respectively. The BLx16 execution is always faster than the all-bank PIMs: 1.3x faster than the all-bank PIM because of its offloading overhead and 1.1x faster than the ideal all-bank PIM because the changed address mapping lowers bank-level parallelism in the non-PIM code sections. As shown in Section V-B, the BL-PIM speedup decreases as the sequence length increases: the BLx1 execution is 5.7x, 2.3x, and 1.3x faster than the CPU single-thread and 1.7x, 0.7x, and 0.4x the speed of the CPU multi-threaded execution. However, our BLx16 PIM is always faster than all the other executions in the experiments.
The RoBERTa, T5 encoder, and GPT-2 models have similar performance trends as the BERT model, as shown in Fig. 16(b), Fig. 16(c), and Fig. 16(d). The BLx16 execution is the fastest in all the cases.
In summary, the all-bank execution is unsuitable for DMA-based offloading: it incurs overhead to handle the DMA descriptors and requires an address mapping change, and the change degrades performance in non-PIM code sections due to lower bank-level parallelism. Our BL-PIM architecture fits the DMA offloading, i.e., an existing memory component in the system, and the BLx16 outperforms all the other executions by overlapping the execution across banks.

D. MULTI-WORKLOAD WITH NON-PIM APPLICATIONS
Through the single-workload performance analysis, we confirmed that a larger BL maximizes internal bus utilization, resulting in higher performance than the all-bank execution. A larger BL also reduces the number of memory requests, minimizing the interference caused by PIM applications and providing performance opportunities to non-PIM applications. To verify this effect, we selected the SPEC benchmark [33] as non-PIM applications for the multi-workload execution with the transformer models. We did not experiment with the benchmarks that ran out of memory in our experimental environment. We categorized the benchmarks into two groups: memory-bound applications (lbm, libquantum, mcf, and soplex), which incur large LLC misses, and compute-bound applications (namd, perlbench, gobmk, and sjeng) [39]. Also, since the four transformer models showed a similar performance trend, we experimented with only the BERT model as a representative.

Fig. 17 and Fig. 18 show the performance degradation of the SPEC benchmarks and BERT with CPU_M and BLx1/4/8/16-PIM at the sequence lengths of 1 and 64 with respect to the single-workload execution. A larger sequence length requires more execution time, and a longer execution incurs more interference with others in a multi-workload environment, thus increasing performance degradation. Fig. 17 shows that the larger BL-PIM performs better in the memory-bound benchmarks; for example, when running BERT and the soplex benchmark together, the performance degradation of BERT and soplex with BLx1/4/8/16 gradually decreases as 18%/8%/6%/4% and 21%/13%/10%/9%, respectively, at the sequence length of 1. When the sequence length increases to 64, the performance degradation also gradually decreases, 35%/30%/28%/25% and 31%/24%/22%/15% in BERT and soplex, respectively, but is higher than at the sequence length of 1. In the case of the compute-bound benchmarks incurring low LLC misses, as shown in Fig. 18, a change in the burst length does not cause a change in performance. For example, the performance degradation of BERT with BLx1/4/8/16 and sjeng was almost constant, i.e., 6%/5%/5%/4% and 7%/6%/6%/5%, respectively, at the sequence length of 1. At the larger sequence length, as with the memory-bound benchmarks, the performance degradation of BERT with BLx1/4/8/16 and sjeng was 9%/8%/8%/8% and 14%/14%/13%/13%, respectively, about two times higher than at the sequence length of 1.

VI. RELATED WORK
In recent PIM research, the in-DRAM PIM architecture has received significant attention in industry. In particular, the primary memory makers, Samsung and SK hynix, are preparing commercial products based on the architecture. Samsung proposed HBM-PIM [22] based on the commercial HBM2 DRAM die design, where each bank is coupled with a PIM execution unit consisting of a SIMD FPU and general & scalar register files. SK hynix proposed a PIM architecture called AiM [21], which places minimal compute, only MAC tree units and buffers, near banks. Both are all-bank PIMs that maximize the internal bandwidth in computation. However, the all-bank execution interferes with non-PIM applications and thus does not support multi-workload execution. Our BL-PIM supports the JEDEC memory standards, thus executing any PIM and non-PIM applications concurrently.
UPMEM [23] is a commercial product in the form of a standard DDR4 DRAM package & DIMM. It integrates several simple general-purpose processing cores inside a DRAM chip. The DIMM module consists of a chip for PIM memory and a chip for main memory. The PIM and non-PIM memory areas are explicitly divided, reducing the memory size available to PIM and non-PIM applications, and the host CPU explicitly performs the data transfer between the two memories. Our BL-PIM can use the whole memory space for all applications.
As shown in Table 3, we divided existing PIM works into all-bank and per-bank methods and then organized each work by several metrics. All the commercial products, such as HBM-PIM [22], AiM [21], and UPMEM [23], use all-bank execution, fully utilizing the internal bandwidth and providing high performance. However, there is a critical limitation in use: they cannot service non-PIM requests during the PIM execution and thus cannot support multi-workload execution. On the other hand, per-bank executions, such as Silent-PIM [19] and Decoupled-PIM [20], can service non-PIM requests during the PIM execution but deliver lower performance than the all-bank execution and provide an unfriendly multi-workload environment due to their large number of memory requests. Our BL-PIM combines the advantages of both, i.e., the high performance of the all-bank execution and the concurrent non-PIM application execution of the per-bank execution, and it also provides a multi-workload-friendly environment that none of the previous works can support.

VII. LIMITATION AND FUTURE WORK
A small BL size limits the performance because of the preamble and postamble overhead of per-bank scheduling. As shown in Fig. 15, the smaller the BL size, the smaller the number of overlapping banks, and the lower the bank-level parallelism in computation. However, this case would rarely occur due to the large size of recent DNN models [40], [41].
One of the great advantages of the per-bank PIM solution is that it satisfies the JEDEC standard and does not modify the memory controller. The proposed BL-PIM method requires modifying the controller to add the PIU that recognizes PIM memory requests, making it incompatible with current commercial memory controllers. To overcome this disadvantage, we suggest future work that implements BL-PIM without the controller modification and adopts approximate computing while maintaining accuracy in DNN models.

VIII. CONCLUSION
In this paper, we proposed the BL-PIM architecture, which uses a longer burst length to outperform the ideal all-bank PIM and to remove interference in multi-workload execution with non-PIM applications. We carefully observed the OS page layout and the tCCD timing constraint and confirmed the opportunity to improve the PIM performance with a large burst length. For this purpose, we extended the state diagram and timing constraints and developed a cooperation method between the BL-PIM device and the memory controller for the BL-PIM memory requests.
We analyzed the performance through real execution on the BL-PIM-modeled FPGA platform with an ARM SoC using four transformer models while varying the sequence length. The BLx16-PIM achieved the best performance in all the experiments, i.e., it was consistently faster than the CPU single-thread, CPU multi-thread, and all-bank PIM executions. In the multi-workload execution, non-PIM applications were given the maximum opportunity to be serviced alongside BL-PIM, which reduces the number of memory requests.
To the best of our knowledge, our work is the first to show multi-workload execution in PIM architecture research. Also, we believe that our architecture provides the fresh insight that PIM memory can serve as main memory rather than as an accelerator.