Introduction
The memory wall remains one of the major bottlenecks to system-level performance improvements in modern computing systems [1]. The mismatch between the scaling of compute performance and main memory speed is further exacerbated for AI workloads that rely on intensive data movement [2]. Recent architectural efforts toward overcoming the memory wall include processing-in/near-memory (PIM) and memory disaggregation [3], [4]. While memory disaggregation targets the memory capacity issue and applies primarily to data centers, PIM architectures, targeting the memory bandwidth issue, are being actively explored across edge and high-performance computing. In addition, increasing the low-level cache capacity remains one of the primary methods for improving performance across different workloads and processors.
For instance, Fig. 1 shows the performance improvement due to the scaling of first-level cache capacity, based on architecture-level simulations. Fig. 1(a) shows the reduction in latency due to increased L1 cache capacity in a low-power ARM core executing a memory-bound workload (LU decomposition), while Fig. 1(b) shows the latency with increasing buffer capacity in a systolic-array-based accelerator.
Fig. 1. Workload performance gains with low-level cache capacity scaling using architecture-level simulations. (a) Latency and L1 miss rate in a commercial mobile core. (b) Latency and overall compute utilization in a systolic-array-based accelerator core.
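To make the trend in Fig. 1 concrete, the following Python sketch evaluates the standard average memory access time (AMAT) model for a few hypothetical L1 capacities; the hit time, miss rates, and miss penalty are illustrative assumptions and not the simulation parameters behind Fig. 1.

# Illustrative AMAT model: a larger L1 lowers the miss rate and hence the
# average memory access time. All numbers below are hypothetical placeholders.

def amat(hit_time_cyc, miss_rate, miss_penalty_cyc):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time_cyc + miss_rate * miss_penalty_cyc

# Hypothetical L1 sweep: (capacity in KiB, assumed miss rate for a memory-bound kernel).
l1_sweep = [(32, 0.10), (64, 0.06), (128, 0.04)]
MISS_PENALTY_CYC = 100  # assumed cycles to reach the next level of the hierarchy

for cap_kib, miss_rate in l1_sweep:
    t = amat(hit_time_cyc=4, miss_rate=miss_rate, miss_penalty_cyc=MISS_PENALTY_CYC)
    print(f"L1 = {cap_kib:4d} KiB, miss rate = {miss_rate:.2f}, AMAT = {t:.1f} cycles")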
Consequently, various novel integration schemes have been explored to improve the performance, power, area, cost, and thermal (PPACT) of edge computing systems. Specifically, vertical integration methods such as monolithic and sequential 3-D integrated circuits (3D-ICs) are being actively explored. From a technology perspective, novel through-silicon-via (TSV) and 3D-IC bonding approaches are being used for enabling high-capacity, fast, local memory. Similarly, partitioning schemes such as memory-on-logic (MoL), logic-on-memory (LoM), and logic-on-logic (LoL) are being actively explored for system-level design optimization in 3D-ICs. The partitioning schemes focus on the separation of the memory and logic components onto different stacks (for monolithic 3-D) or tiers (for sequential 3-D).
However, in most related works, the exploration with respect to partitioning does not extend to the memory macros. As a result, using homogeneous memory macros does not fully exploit the benefits of 3-D integration. To this end, we present a comparison of 3D-IC design methods, focusing primarily on the partitioning of low-level caches.
Our novel contributions include:
We propose a novel partitioning scheme of memory macros in array under CMOS (AuC) technology that enables heterogeneous 3-D integration. Specifically, it involves placing all the peripheral circuits in the logic tier alongside other standard cells while the other tier comprises only static RAM (SRAM) bit-cells. With the proposed schemes, we report up to 12% improved performance using homogeneous AuC technology macros compared with 2-D SRAM macros. Furthermore, using heterogeneous AuC macros, we report up to 50% lower leakage power.
We present a comparative study of different integration schemes for a commercial mobile core. Specifically, we use the proposed memory macro schemes for the L1 cache of the mobile core to compare the power, performance, and area metrics for 2-D and various 3-D integration schemes.
The proposed 3-D partitioning scheme allows SRAM bit-cells and periphery logic to be optimized separately in the back-end-of-line (BEOL). Furthermore, it also enables the SRAM periphery logic to be placed near system-level core logic in the same advanced node. The study shows this memory-level optimization results in up to 60% higher energy efficiency compared with 2-D and at least 14% higher operating frequency compared with standard 3-D MoL partitioning in the commercial mobile core.
Background and Related Works
A. Memory Macro Design
The low-level caches, tightly coupled with the core, comprise architectural elements designed to optimize performance by reducing the miss rate. Depending on the capacity and other architecture-level specifications, such as data/instruction cache, cache-line size, and associativity, the memory component of the cache is physically realized with multiple memory macros. Technology-specific metrics are also considered during the design of the appropriate macros, usually implemented as single/multiple subarrays (SAs). An SRAM SA comprises a bit-cell array and its corresponding peripheries. The peripheries can be categorized as row peripheries (word-line drivers, row decoders) and column peripheries (sense amplifiers, write drivers, etc.). These SAs can be combined to form a macro of the required memory capacity. The major factors limiting the performance and power of a macro are the word-line and bitline metal resistances and capacitances, which worsen with scaling. The SRAM bit-cell area is reaching its limits in advanced CMOS technology nodes due to the restricted scaling of poly pitch (PP) and metal pitch (MP) [5].
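As a rough illustration of how these architecture-level specifications translate into physical memory macros, the sketch below counts the SAs needed for a hypothetical data cache; the SA dimensions and the per-way partitioning rule are assumptions for illustration, not the macro design used in this work.

# Illustrative mapping from cache parameters to SRAM subarray (SA) count.
# SA dimensions are assumed placeholders, not actual macro specifications.

SA_ROWS = 256   # word-lines per SA (assumed)
SA_COLS = 128   # bitline columns per SA (assumed)

def subarrays_per_way(capacity_bytes, associativity):
    """Ceiling count of SAs needed to hold one way of the data array."""
    bits_per_way = (capacity_bytes // associativity) * 8
    bits_per_sa = SA_ROWS * SA_COLS
    return -(-bits_per_way // bits_per_sa)  # ceiling division

def total_subarrays(capacity_bytes, associativity):
    # Assume each way maps to its own set of SAs so ways can be accessed in parallel.
    return associativity * subarrays_per_way(capacity_bytes, associativity)

# Example: a hypothetical 64-KiB, 4-way data cache (the 64-B line size would set
# the column-mux width per access, which is omitted here for brevity).
print(total_subarrays(capacity_bytes=64 * 1024, associativity=4))  # -> 16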
New device architectures such as forksheet [6], nanosheet [7], and CFET [8] have been able to improve the SRAM density at scaled technology nodes. However, cache performance deteriorates with scaling as metal resistance increases [9], [10]. In modern mobile and high-performance processors, larger caches (such as L2 and above) are built from smaller memory macros to reduce delays within each macro. In a 2-D layout, this leads to long wires for routing all the macros, which increases interconnect delays and worsens the memory wall problem. To tackle this issue, several 3-D integration technologies have been proposed, showing that 3-D integration reduces overall wirelength by shortening the connections between logic and memory [11], [12], [13], [14].
B. Emerging Integration
Various 3-D integration approaches have recently been proposed to address device and memory scaling challenges in modern electronics. 3D-IC technologies such as micro-bumping, hybrid bonding, and sequential 3-D have gained popularity. In micro-bumping 3D-ICs, two dies are vertically stacked using a dense array of micro-bumps in a face-to-face (F2F) configuration, ensuring high yield and reliability.
Sequential integration is an emerging technology that integrates device layers sequentially in the vertical direction, using nanoscale inter-tier vias (ITVs) for fine-grained integration. Vandooren et al. [18] demonstrated sequentially stacked FinFETs with high alignment accuracy, showing a footprint reduction of up to 50%. AuC is a unique partitioning scheme proposed by Salahuddin et al. [19], where memory and logic are partitioned vertically to achieve system-level performance and cost benefits.
How different tiers are partitioned plays a major role in optimizing the overall PPACT aspects of the chip. Partitioning schemes such as MoL/LoM and LoL have shown significant potential for future high-density designs, facilitated by aggressive TSV and bump scaling [20]. Among these, MoL/LoM appears to offer the most feasible partitioning, considering existing designs, providing gains from 3-D integration without significantly compromising the system's thermal properties [21]. This is a critical consideration for emerging 3D-ICs, making MoL a widely accepted scheme across various designs. Therefore, it is essential to explore fine-grained, optimized partitioning strategies for splitting memory and logic elements in 3-D implementations.
C. Summary
Sequential 3-D technology appears to be the most promising candidate for logic and high-speed cache partitioning due to its low-parasitic ITVs. This article evaluates the potential of memory-on-logic partitioning schemes (with hybrid bonding and sequential integration methods) through design-technology co-optimization (DTCO) and system technology co-optimization (STCO). The partitioning scheme proposed in previous literature [19], where some of the periphery circuits are placed in the SRAM bit-cell tier, prevents the array and periphery circuits from being decoupled efficiently. The memory-element and logic partitioning proposed in this article, which decouples the SRAM bit-cell arrays from all the peripheries, allows the BEOL of the SRAM array tier to be optimized independently. The metal aspect ratio of the word-lines and bitlines is increased to twice that of 2-D to reduce parasitics. This partitioning scheme also gives the freedom of heterogeneous integration, where the periphery circuits use nanosheet technology (same as the logic standard cells) while the SRAM bit-cells use FinFET technology.
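The benefit of relaxing the array-tier BEOL can be seen with a first-order estimate: doubling the metal aspect ratio roughly halves the line resistance, at the cost of some additional sidewall capacitance. The sketch below applies the Elmore approximation for a distributed RC line with placeholder per-micron parasitics; it illustrates the trend only and does not reproduce the extracted A10 values.

# First-order word-line/bitline delay with a taller (2x aspect-ratio) wire.
# Per-unit-length parasitics are placeholders, not extracted technology values.

R_PER_UM = 50.0       # ohm/um for the baseline 2-D wire (assumed)
C_PER_UM = 0.20e-15   # F/um for the baseline 2-D wire (assumed)
LINE_LEN_UM = 100.0   # word-line length across a subarray (assumed)

def elmore_delay(r_per_um, c_per_um, length_um):
    """0.38*R*C delay of a distributed RC line (Elmore approximation)."""
    r_total = r_per_um * length_um
    c_total = c_per_um * length_um
    return 0.38 * r_total * c_total

baseline = elmore_delay(R_PER_UM, C_PER_UM, LINE_LEN_UM)
# Doubled aspect ratio: resistance roughly halves; assume ~20% extra sidewall
# capacitance as a pessimistic placeholder.
tall_wire = elmore_delay(R_PER_UM / 2, C_PER_UM * 1.2, LINE_LEN_UM)
print(f"baseline ~{baseline * 1e12:.1f} ps, 2x aspect ratio ~{tall_wire * 1e12:.1f} ps")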
Memory Macro Design
The partitioning of CMOS logic and SRAM arrays in the 3-D integration scheme is illustrated in Fig. 2, highlighting the placement of the SRAM SAs and control peripheries. The SRAM memory macros are placed in the bottom tier, while the periphery circuits sit in the top tier along with the other logic circuits of the core/processor, as shown in Fig. 2. The schematic differentiation between the conventional 2-D and AuC SoC is described in Fig. 2(a). In the 2-D IC case, the logic core and the SRAM SA along with its control peripheries (SA + P) are placed side by side on the same plane. In AuC, by contrast, the core logic and SRAM periphery (P) are on the top tier, while the SRAM memory SAs are in the bottom tier. Supervias are used to propagate signals between the SRAM SAs and their peripheries (shown by the red via connections).
Fig. 2. (a) Schematic illustration of the conventional 2-D IC and the novel AuC integration scheme. (b) SRAM macro SA arrangements in the case of 2-D and AuC.
Fig. 2(b) shows the design considerations of 2-D and AuC SRAM macros. The 2-D SRAM macros are designed in two ways, i.e., macros consisting of one SA with row peripheries at the edge and macros with two SAs with a shared row periphery in the middle. The memory macro in AuC is arranged such that the two SAs are merged into a single SA in the bottom tier, and the shared row periphery is pulled up to the top tier along with the column peripheries and other core logic. The supervia connections are made between the word-line drivers (top tier) and word-lines (bottom tier), and between the sense amplifiers (top tier) and the column select (bottom tier), as shown in the figure.
Table 1 provides information on the different SA configurations used in the 2-D 1S, 2-D 2S, and AuC macros. Iso-capacity memory macros are considered in all the scenarios for a fair comparison. One macro instance corresponds to two SA configurations (i.e., 1S and 2S).
This architecture presents a unique advantage: the independent optimization of SRAM (bottom tier) and core logic (top tier) transistors and BEOL processes. By decoupling the SRAM array in the bottom tier from that of the logic tier, we unlock the flexibility to optimize each component separately. Notably, the degradation of SRAM performance in scaled technology nodes, attributed to word-line and bitline resistance, necessitates innovative approaches. In this study, we leverage this flexibility to optimize the BEOL of the SRAM array, aiming to enhance its performance.
Physical Design
A. PDK, 2-D, AuC, and 3-D Integration
The physical implementation uses a 2-D process design kit (PDK) based on an A10 nanosheet technology. Furthermore, we consider an N3 FinFET device [22] for the bit-cells in the heterogeneous AuC integration case. We use a five-track standard cell library characterized at 0.7 V and 25 °C. The front-side BEOL stack includes 13 routing metal layers (M0–M12). We generate timing and geometry views of the memories for the 2-D and 3-D implementations using an in-house memory compiler, simulating the complete SRAM SA operation for different operating modes. The same memory compiler is used to generate the AuC memories, but with the appropriate resistance (R), capacitance (C), and physical shape, as discussed in Section III. For the AuC block-level integration, we assume that the device growth is monolithic for each plane, separated by an insulator and a thin silicon layer, as shown in Fig. 3.
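For clarity, the snippet below sketches the kind of knobs such a compiler run might expose when generating the 2-D versus AuC views; the parameter names and values are hypothetical and do not reflect the actual in-house compiler interface.

# Hypothetical description of 2-D and AuC memory-compiler runs. The field
# names and values are illustrative only.

macro_views = {
    "2d_2s": {
        "capacity_kib": 16,         # assumed macro capacity
        "subarrays": 2,
        "tier": "single",           # bit-cells and peripheries on the same plane
        "bl_wl_aspect_ratio": 1.0,  # baseline BEOL aspect ratio
    },
    "auc": {
        "capacity_kib": 16,
        "subarrays": 1,             # merged SA in the bottom tier
        "tier": "split",            # bit-cells bottom tier, peripheries top tier
        "bl_wl_aspect_ratio": 2.0,  # relaxed array-tier BEOL (Section III)
    },
}

for name, cfg in macro_views.items():
    print(name, cfg)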
B. Physical Implementations and Architecture
We use a low-power ARM core for the implementation and normalize the extracted data to avoid revealing proprietary information about the commercial processor. An almost equal split between memory blocks and logic modules in A10 technology makes this IP an ideal choice for such explorations. Moreover, it demonstrates that AuC-like memory optimization is feasible in current industrial systems, extracting gains even without any specific system-level modifications.
Five different implementations have been selected for this exploration: (a) 2-D baseline, (b) 3-D MoL, (c) AuC two-tier, (d) AuC three-tier, and (e) AuC heterogeneous. The 2-D and AuC implementations use the traditional 2-D place-and-route (PnR) implementation flow from Cadence using Genus.
In our study, we use a buried power rail (BPR) and a back-side power delivery network (BS-PDN) with three metal layers. For AuC, we implement a traditional 2-D power grid across the logic plane and consider a BS-PDN for the bit-cell plane. Furthermore, for 3-D MoL stacking, we again implement a standard 2-D power grid across all the dies, excluding power exchange between the dies (3-D power structures are not included). IR-drop analysis is omitted as it falls beyond the scope of this article. However, it is important to note that the choice of BPR and BS-PDN significantly impacts the thermal properties of the stack; hence, placing the logic die on top (near the heat sink) is generally preferable.
C. Partitioning and Floorplan Considerations
We investigated two partitioning scenarios for the AuC: two-tier and three-tier, as depicted in Fig. 4. In addition, we included a transparent version of the AuC in Fig. 4(a*), although it does not represent an actual implementation. This transparency aids readers in understanding the internal structure of the memory, including how periphery and bit-cells are partitioned. In the two-tier case, only block logic is allowed to reside on top of the bit-cells. In the three-tier scenario, both periphery and block logic can be placed either above or below the bit-cells. Specifically, some bit-cells are positioned in the top tier, while others are in the bottom tier. However, the memory peripheries of the bottom tier cannot overlap with the bottom tier bit-cells, and the same restriction applies to the top tier memory periphery, as illustrated in Fig. 5.
Fig. 4. Partitioning scenarios: (a) 2-D, (a*) transparent 2-D, (b) 3-D F2F-HB MoL, (c) AuC two-tier, and (d) AuC three-tier.
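A minimal sketch of the three-tier placement rule described above: the memory periphery on a tier must not overlap the bit-cell region of that same tier. The rectangle representation and the helper below are illustrative, not part of the actual floorplanning flow.

# Illustrative legality check for the three-tier AuC floorplan rule:
# a tier's memory periphery may not overlap that tier's bit-cell region.

from typing import NamedTuple

class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles share any area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def tier_is_legal(periphery: Rect, bitcell_regions: list[Rect]) -> bool:
    """Periphery placement is legal only if it avoids every bit-cell region on its tier."""
    return not any(overlaps(periphery, bc) for bc in bitcell_regions)

# Hypothetical bottom-tier example: periphery placed beside, not over, the bit-cells.
bitcells = [Rect(0, 0, 100, 80)]
print(tier_is_legal(Rect(105, 0, 130, 80), bitcells))  # -> True (no overlap)
print(tier_is_legal(Rect(90, 0, 120, 40), bitcells))   # -> False (overlap)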
We have examined only one 3-D reference partitioning scheme, denoted as MoL. This choice aligns with related research methodologies in the field of 3-D stacked caches [17], [25] and allows a direct comparison with AuC. Our primary focus remains on AuC, a fine-grained 3-D partitioning approach. All the implementations (2-D, 3-D, and AuC) are developed with approximately equal design utilization of 60%. To maintain consistency across design explorations, we deducted specific empty silicon area from both the 3-D and AuC two-tier floorplans [as shown in Fig. 5(b) and (c)]. The two-tier footprints of 3-D and AuC are comparable, representing approximately 50% of the 2-D footprint. Notably, the AuC three-tier configuration achieves an additional 35% footprint reduction compared with the two-tier setup due to more stacking options. Further optimization in terms of area and utilization was constrained by the memory peripheral shape and the physical size of the ARM core in the considered technology node. The SRAM peripheral area, being only a small fraction (9%) of the total logic die, minimally impacts the overall utilization and congestion of the top die. The AuC integration method confines routing blockages to metal 1, allowing the top die to use all the lower metal layers from M0, thus reducing congestion and avoiding disruption to the top logic die routing.
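A short worked example of the footprint figures quoted above (two-tier stacks at roughly 50% of the 2-D footprint, and the three-tier AuC a further 35% smaller), using a normalized 2-D area of 1.0:

# Normalized footprint arithmetic for the numbers quoted in the text.
footprint_2d = 1.0
footprint_two_tier = 0.5 * footprint_2d           # ~50% of the 2-D footprint
footprint_three_tier = 0.65 * footprint_two_tier  # an additional ~35% reduction

print(f"2-D: {footprint_2d:.2f}, two-tier: {footprint_two_tier:.2f}, "
      f"three-tier: {footprint_three_tier:.3f}")  # -> 1.00, 0.50, 0.325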
Experiments and Results
A. Macro Design
Using circuit simulations, the read access delay and leakage power of the AuC macro design are evaluated and compared with the conventional 2-D SRAM macro counterpart. All the 2-D and homogeneous AuC SRAM macros are implemented in an advanced A10 nanosheet technology node. The heterogeneous AuC SRAM macros are implemented using a custom mix of FinFET (bit-cells) and nanosheet (peripheries) technologies in the N3 and A10 nodes, respectively. Circuit simulations for both the 2-D and AuC SRAM macros are performed in the Cadence Spectre circuit simulator [26] at the typical corner, using device compact models for nanosheet and FinFET [7], [27]. The 2-D macro configuration with two SAs is considered the 2-D baseline in this study for a fair comparison with the AuC macros. For SA performance analysis, we explore two distinct SRAM SA configurations.
The read access delay comparison between the 2-D and AuC macros for different SA configurations is shown in Fig. 6. The AuC macro with the two-SA arrangement performs ~4% and ~20% faster than the 2-D one-SA macro, depending on the SA configuration.
Fig. 6. Read delay comparison of 2-D and AuC technologies for different SA configurations.
Fig. 7 shows the comparison of leakage power between the 2-D and AuC macros for different SA configurations. The heterogeneous AuC technology dissipates less leakage power than the 2-D baseline in both SA configurations.
Fig. 7. Leakage power comparison of 2-D and AuC technologies for different SA configurations.
B. System-Level Integration
Multiple configurations of 3-D and AuC integration have been assessed against the 2-D single- and double-SA (Section III) SRAM memory instances at the block level, using an ARM core with several target frequencies. The implementations represent the best in class in each category, obtained after rigorous optimization of timing constraints and floorplan iterations. The block-level performance in terms of achieved frequency and power consumption is depicted in Fig. 8. A PnR run is considered valid if the worst negative slack (WNS) is negative and its absolute value is less than 10% of the target period, and if the count of failing paths (FPs) is lower than 1000. This design methodology ensures realistic area and power estimates. For the current work, the delay and power estimation has been performed with the standard cell library settings mentioned in Section IV-A. The power numbers reported are based on the activity annotation from the Dhrystone workload. On the X-axis, we use the notation I_C_F, where I indicates the integration option, C denotes the configuration used, and F represents the PnR target period expressed in nanoseconds. The Y-axis data are normalized with respect to the 2-D single SA, which serves as the baseline at the 0% level. In addition, we present the single-core efficiency [Fig. 8(c)], which is a function of the power-delay product:
\begin{equation*} \text{CPU efficiency} = \frac{1}{\text{Total Power} \times \text{Eff. Period}}. \tag{1} \end{equation*}
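A minimal sketch of the validity rule and the efficiency metric in (1); the effective-period handling (degrading the target period by |WNS|) is an assumption for illustration and may differ from the exact definition used in the flow.

# Illustrative PnR validity check and CPU-efficiency metric from (1).

def pnr_run_is_valid(wns_ns, target_period_ns, failing_paths):
    """Accept a run whose |WNS| stays within 10% of the target period and FPs < 1000.
    (A run with non-negative slack is assumed trivially valid.)"""
    slack_ok = wns_ns >= 0 or abs(wns_ns) < 0.1 * target_period_ns
    return slack_ok and failing_paths < 1000

def cpu_efficiency(total_power_w, eff_period_ns):
    """CPU efficiency = 1 / (Total Power x Eff. Period), as in (1)."""
    return 1.0 / (total_power_w * eff_period_ns)

# Hypothetical run: 1.0-ns target period, WNS of -0.05 ns, 120 failing paths.
target, wns, fps = 1.0, -0.05, 120
if pnr_run_is_valid(wns, target, fps):
    eff_period = target - wns  # assumed: degrade the period by |WNS|
    print(f"efficiency = {cpu_efficiency(total_power_w=0.5, eff_period_ns=eff_period):.3f}")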
Fig. 8. Performance analysis: (a) frequency, (b) power versus repeater area, and (c) CPU efficiency.
Our analysis shows that all the implementations using AuC memories can operate at a higher frequency than the 2-D baseline and further exhibit a frequency gain of at least 14% compared with the 3-D implementation. This is because the faster memory access, along with the area and wirelength savings, significantly reduces the critical path delay at the block level. The frequency gain is less pronounced only in the AuC heterogeneous case because of the much slower SRAM bit-cells in FinFET technology. However, the delay penalty is minor (only 7%), thanks to the area and resulting wirelength savings of AuC integration. This frequency reduction is further offset by a 40% power reduction and 10% less repeater area, as shown in Fig. 8(b). AuC heterogeneous can bring down the power consumption by as much as 40%. This is possible because of the heterogeneous technology itself: compared with the advanced-node technology, SRAM blocks with older-node (FinFET) bit-cells consume significantly less energy, albeit at a slower speed. Moreover, the other AuC cases also show up to 10% more power optimization than 3-D, for both memory and block-level compact optimization. Looking at the metric of the power-delay product, namely, the CPU efficiency in Fig. 8(c), we find that the AuC two-tier and heterogeneous integrations are highly efficient, showing lower per-unit power consumption for a given target frequency.
Further investigating the wirelength statistics in Fig. 9, it is unsurprising that both 3-D MoL and AuC use approximately 18% fewer routing resources due to their vertical integration capability. This reduction in wirelength brings the core logic closer together, contributing to the delay reduction. Notably, our benchmark design has only 500 K cells; this wirelength gain (~18%) is expected to be even more significant for larger designs.
Fig. 9. Wirelength statistics: (a) total wirelength versus cell area, (b) metal distribution, and (c) memory-to-flop wirelength.
One interesting observation arises when examining the metal distribution and memory-flop wirelength, as depicted in Fig. 9. The use of AuC demonstrates greater efficiency in saving lower metal layers compared with 3-D MoL. This advantage stems from the reduced utilization of metal layers (up to M3) within the SRAM memory block itself (as discussed in Section III). Such an advantage is not feasible in 2-D or even 3-D designs that rely on traditional memory blocks. Notably, highly resistive lower metal layers are costlier to fabricate. However, AuC presents significant potential for savings in this regard. The abrupt increase in M5 [as indicated in Fig. 9(b)] cannot be avoided in 3-D and AuC designs due to the placement of block-level I/Os on this metal layer. Unlike 2-D designs, where blockages within the memory block restrict the use of M5, the PnR tool tends to favor M5 (highest Mx layer in the used technology) in 3-D and AuC layouts, especially when minimal blockages exist at this layer. Finally, another noteworthy observation suggests that AuC outperforms 3-D MoL in terms of metal savings [Fig. 9(c)]. This advantage results from optimizations in both block area and memory metal layers. As design spaces grow larger, this phenomenon is expected to be even more pronounced.
Conclusion
This study introduces a novel partitioning scheme for low-level SRAM caches using AuC technology, demonstrating enhancements in performance, overall CPU efficiency, and leakage power reduction. Preliminary investigations on a single-core mobile CPU highlight the significant impact of this next-generation advanced integration technique on small memory banks, with potential amplification at the system level. These findings suggest substantial gains in terms of power, performance, area, and cost benefits for larger SoCs with more integrated functionalities. This also underscores the promise of AuC technology for broader applications and emphasizes the importance of fine-grained functional partitioning in 3D-IC designs.