An Effective Block Pin Assignment Approach for Block-Level Monolithic 3-D ICs

In a 2-D design, block pins are best located at the periphery of a block because blocks are placed side by side in a single placement layer. However, monolithic 3-D (M3D) integration relieves this boundary constraint by allowing vertical block communication between different tiers based on an nm-scale pitch of 3-D interconnection. In this article, we present a design methodology named pin-in-the-area that assigns block pins at any position inside the boundary of a block using commercial 2-D place-and-route (P&R) tools and enables efficient block implementation and integration for block-level M3D integrated chips (ICs). Our pin-in-the-area flow starts from netlist restructuring and connectivity-aware tier-by-tier chip planning, which defines blocks and decides their sizes and (X, Y, Z) locations for a two-tier M3D design. Next, we perform wirelength-driven 3-D placement to minimize the 3-D half-perimeter wirelength (HPWL) and find optimal pin locations inside the boundary of a block. Once the block designs are done, we apply a unique macro handling scheme to the top-level timing closure. Based on a 28-nm two-tier M3D hierarchical design result, we show that our solution offers 13.6% and 24.7% energy-delay-product reduction compared to the M3D design with pins assigned at the block boundaries and its 2-D counterpart, respectively.


I. INTRODUCTION
The functional density of integrated circuits has continued to increase thanks to device scaling. However, moving toward the 3-nm technology era is not predicted to be the same as before because, on top of the geometric scaling, new device architectures, such as the gate-all-around field-effect transistor (FET) and the complementary FET, increase the process complexity and introduce additional defect mechanisms that must be reduced. Therefore, the traditional geometric scaling continues in combination with various vertical stacking approaches at the integration level.
Stacking multiple dies in 3-D fashion has evolved in many different ways, including stacking of either packaged dies [package-on-package (PoP)] or bare dies [stacked-integrated-circuits (SiC)], to realize a smaller 3-D interconnection pitch. In the packaging-level 3-D integration, 3-D interconnects have been made by ball-grid-array and wire-bonding technologies, which offer a 100-µm-scale pitch.
In the die-level 3-D integration, microbump array and through-silicon-via (TSV) technologies have been used for 3-D interconnects. While a 6-µm [1] pitch has been demonstrated in the advanced TSV technology, a 20-µm [2] pitch of microbump array has been the main bottleneck. However, the wafer-level 3-D integration opens up a new era of 3-D integration by enabling bumpless sub-µm 3-D interconnects [3].
The sequential face-to-back wafer-level 3-D integration, which is also known as M3D, is an emerging 3-D integration technology that enables monolithic fabrication by stacking multiple active layers vertically [4]. In block-level M3D integrated chips (ICs), blocks are placed on different tiers and routed using M3D technology. Existing works on block-level M3D ICs [5], [6] have developed simulated annealing-based 3-D floorplanning engines and presented promising power, performance, and area (PPA) savings. However, these works have assumed that the block pins are assigned along the periphery of each block. As shown in Fig. 1, the periphery of a block is not always an optimal location for the pin assignment when functionally partitioned blocks are vertically stacked. This is because interblock connections are unnecessarily lengthened to touch the boundary pins. Therefore, we develop an efficient block-level M3D IC design methodology named pin-in-the-area that tackles this type of pin assignment problem. The tight 3-D interconnection pitch achieved by M3D technology allows accommodating optimal pin locations inside the boundary of blocks.
In this article, we present an M3D hierarchical design approach named pin-in-the-area, based on [7], that gives freedom to the block pin locations assuming soft IP blocks. In traditional 2-D hierarchical designs, block pins are assigned at the boundary of each block because connections among blocks are always made at their boundaries. However, the tight 3-D interconnection pitch in M3D technology allows direct vertical connections between the module blocks. By realizing block pin assignment at any location inside the boundary of each block, we minimize the redundant routing detours for the interblock connections and improve the top-level timing closure.

II. RELATED WORK
Pin assignment is one of the most important problems in reducing wirelength while optimizing both critical delay and power consumption in M3D integration [5], [6]. Previously, pin assignment in M3D has been addressed by the simultaneous optimization of pin assignment and TSV placement [8] and by a genetic and simulated annealing algorithm that co-optimizes area, wirelength, and temperature [9].
Zhong et al. [8] have proposed a heuristic algorithm based on Lagrangian relaxation to solve the pin assignment and TSV planning problem. By formulating the problem as a min-cost multicommodity flow model, they have improved the total wirelength of the target design by 38%. Hu et al. [9] proposed genetic and simulated annealing (GA-SA) algorithm to find the optimal 3-D IC floorplanning and assignments of on-chip and input/output (I/O) package pins considering the area, interconnection length, maximum temperature, and inductance of pins of the design.
However, these previous studies do not escape the wire lengthening and wire congestion problems since the pins are placed at the boundary of blocks. To the best of our knowledge, pin-in-the-area is the first work to improve the overall wirelength of block-level M3D ICs by providing more degrees of freedom for pin placement, that is, by placing block pins at any location inside the boundary. Fig. 2 shows the previous block-level design flow [5], which places the block pins at the boundaries of blocks. In this flow, the outlines of all blocks are given in the 3-D space. Then, the locations of 3-D pins and block pins are determined. In this step, the initial 3-D pin locations are given, and block pin assignment is performed. Next, the 3-D pin locations are updated based on the determined block pin locations. As this is a chicken-and-egg problem, the pin planning flow is iterated until the 3-D pin locations become stable.

A. PREVIOUS BLOCK-LEVEL DESIGN FLOW
Even though the final 3-D pin locations are optimized, 3-D pins are placed outside the block boundaries. With these results, 3-D nets should have detour paths, which can be further optimized. Therefore, we present pin-in-the-area, which provides the optimal block pin locations inside the boundary of a block using commercial 2-D place-and-route (P&R) engines to build high-quality two-tier block-level M3D ICs.

B. OVERVIEW OF PROPOSED FLOW
First, the pin-in-the-area flow begins with netlist restructuring to divide the overall design into functional blocks suited to a two-tier M3D design, considering the hierarchy. Next, we perform tier-by-tier chip planning, where we decide the shape and the 3-D location of each block. Once the floorplanning is done, a wirelength-driven 3-D placement is performed iteratively to co-optimize the block-level placement and the pin locations, which may be at any location inside the boundary of a block, to improve the wirelength of interblock and intrablock nets.
We then proceed with timing budgeting, where the wirelength saving turns into additional block-level timing margin. This step allows P&R tools to remove or downsize the redundant buffers during block implementation, which leads to reduced total power consumption. After block implementation and top-level timing closure, we finally perform sign-off 3-D timing and power analysis using the assembled tier-by-tier layouts. The overview of the pin-in-the-area flow is shown in Fig. 3.

C. BENCHMARK: RISC-V DUAL-CORE ROCKET PROCESSOR
RISC-V [10] is an open-source general-purpose instruction set architecture (ISA) based on reduced instruction set computing (RISC) principles. In this work, we implement a 28-nm dual-core rocket processor [11], [12], an open-source microarchitecture that executes scalar RISC-V ISA with a six-stage single-issue in-order pipeline. Rocket processor implements a memory management unit (MMU) that supports page-based virtual memory and is able to boot modern operating systems, including Linux.
Both caches are virtually indexed physically tagged with parallel translation lookaside buffer (TLB) look-ups. The data cache is nonblocking, allowing the core to exploit memory-level parallelism. A 64-entry branch target buffer, 256-entry two-level branch predictor, and return address stack together mitigate the performance impact of control hazards. Rocket has an optional IEEE 754-2008-compliant FPU, which can execute single- and double-precision floating-point operations, including fused multiply-add (FMA), with hardware support for denormals and other exceptional values. The fully pipelined double-precision FMA unit has a latency of three clock cycles.

D. NETLIST RESTRUCTURING
In this work, we use a commercial-grade 28-nm process design kit (PDK) and build a dual-core rocket processor whose block diagram is shown in Fig. 4. The original register-transfer level (RTL) netlists of rocket processors contain a multidepth block hierarchy. However, a block-level M3D IC makes 3-D connections only at the top level, while individual blocks remain 2-D. Therefore, we transform the netlist into a two-level hierarchy, which includes the top level and blocks.
We first flatten the netlist below the fourth hierarchy and ungroup tiny blocks whose standard cell area is under 10 µm². We then create a module by merging blocks if the sum of the macro area from those modules is over 10 µm². This netlist restructuring process is iterated until the second hierarchy, resulting in an area-balanced module definition. The netlist restructuring has been done semimanually by considering the functionality of each module.
Note that the area threshold for ungrouping depends on the technology node and the design benchmark because the overall design area varies with both. Therefore, the area threshold should be set based on those variables. Table 1 shows the definitions of blocks as the result of the netlist restructuring process. Each core is divided into pipeline logic, floating-point unit, instruction/data (I/D) caches, and interface buffer blocks. Including bus units and memory controller units, a total of 17 blocks have been defined in the end.
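The ungrouping step can be illustrated with a minimal Python sketch. The `Block` class, its field names, and the bottom-up traversal are assumptions made for illustration only; the restructuring described above was done semimanually and also considers the functionality of each module, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical hierarchy node: a name, the standard cell area it owns
    directly (in um^2), and its child blocks."""
    name: str
    cell_area: float = 0.0
    children: List["Block"] = field(default_factory=list)


def total_area(block: Block) -> float:
    """Standard cell area of a block including all of its descendants."""
    return block.cell_area + sum(total_area(c) for c in block.children)


def ungroup_tiny(block: Block, threshold: float) -> Block:
    """Dissolve every child whose total area is below the threshold into its
    parent, working bottom-up so deeper levels are handled first."""
    kept = []
    for child in block.children:
        ungroup_tiny(child, threshold)
        if total_area(child) < threshold:
            # The child's cells now belong directly to the parent.
            block.cell_area += total_area(child)
        else:
            kept.append(child)
    block.children = kept
    return block


# Illustrative hierarchy (names and areas are made up).
core = Block("core", 5.0, [
    Block("tiny_glue", 3.0),
    Block("fpu", 50.0, [Block("tiny_mux", 2.0)]),
])
ungroup_tiny(core, threshold=10.0)
```

After the call, only `fpu` remains as a child of `core`, and the total area of the design is unchanged because ungrouping only moves cells up the hierarchy.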

E. TIER-BY-TIER CHIP PLANNING
When blocks are defined by netlist restructuring, we synthesize the netlists and load them into a floorplanning engine. Tier-by-tier chip planning includes the decision of the tier location of blocks and the floorplan of the separate tiers. This block-level tier partitioning stage is critical to the final design quality because it decides the number of 3-D block interconnections. Given that we build a two-tier M3D IC, the form factor (footprint) of our M3D IC is 50% smaller than that of its 2-D IC counterpart, assuming no silicon area overhead. When we decide the size and (X, Y, Z) location of each block, any two blocks with dense connections need to be stacked on each other as much as possible to maximize the benefits of direct vertical connections. In our M3D designs, we place I/D-cache memories on the top tier and the core blocks on the bottom tier to realize the memory-on-logic stacking principle, which targets efficient 3-D logic-memory interconnects. Next, we decide the tier locations of other logic blocks to balance the area skew on both tiers while optimizing the block interconnection.
After that, we insert an additional hierarchy level, which represents a tier between the top level and blocks in the synthesized netlist, to organize it into a three-level hierarchy (top level, tiers, and blocks). Then, we generate two netlists at the second hierarchy for the top and bottom tiers. Each tier netlist is loaded into the floorplanning engine again, and we perform tier-by-tier chip planning. Note that this netlist modification defines a 3-D interblock net as a top-level net connected to the I/O port of each tier. Fig. 5 shows the floorplans of the 2-D and M3D RISC-V dual-core rocket processor designs, respectively. Out of 6903 interblock nets, we achieve 3692 (53%) 3-D interblock nets as a result. Here, die-to-core and block-to-block spacings are 20 µm, and the hard macro placement halo is 10 µm. The initial utilization of each block is assumed to be 65% ± 5% in both 2-D and M3D designs.
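The notion of a 3-D interblock net can be made concrete with a short sketch. It assumes a net is 3-D whenever the blocks it connects sit on more than one tier; the block names and the four example nets are hypothetical, and only the counts 3692 and 6903 come from the text.

```python
def is_3d_net(net_blocks, tier_of):
    """A net is treated as 3-D if its blocks sit on more than one tier."""
    return len({tier_of[b] for b in net_blocks}) > 1


def ratio_3d(nets, tier_of):
    """Fraction of interblock nets that cross tiers."""
    return sum(is_3d_net(n, tier_of) for n in nets) / len(nets)


# Hypothetical tier assignment following the memory-on-logic principle.
tier_of = {"core0": "bottom", "fpu0": "bottom",
           "icache0": "top", "dcache0": "top"}
nets = [("core0", "icache0"), ("core0", "fpu0"),
        ("core0", "dcache0"), ("icache0", "dcache0")]
share = ratio_3d(nets, tier_of)  # 2 of the 4 example nets cross tiers

# Counts reported in the text: 3692 of 6903 interblock nets are 3-D.
reported_share = 3692 / 6903
```

Rounding `reported_share` to a whole percentage reproduces the 53% quoted above.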

F. WIRELENGTH-DRIVEN 3-D PLACEMENT
A 3-D-aware placement solution for the individual blocks in the early design stage is necessary for extracting the best timing budget in M3D hierarchical designs. However, 2-D placement engines fail to produce the placement solution when block fences on separate tiers are flattened on the single placement layer. Therefore, we present an iterative tier-by-tier placement approach to produce an optimal 3-D-aware placement solution. Fig. 6 shows this iterative wirelength-driven tier-by-tier global placement method. To make tier-by-tier global placement 3-D-aware, both tiers need to keep synchronizing the target locations of 3-D ports inside the boundary of each tier die. Therefore, the basic idea of our wirelength-driven 3-D placement solution is to iterate the exchange of 3-D port information and tier-by-tier global placement.
In the beginning, we first modify the tier netlist by eliminating all the top-level primary I/O ports and interblock connections (disconnected tier netlist) and run the global placement of each tier. In this way, the initial tier placement does not have any bias against an external block or I/O connections. After the global placement, we extract the output pin location of the driver cell of a 3-D port (defined at the original tier netlist) and decide the location of the 3-D port. Next, both tiers share the information of 3-D port locations. Then, we perform the tier-by-tier placement from scratch with the original tier netlist and extracted 3-D port locations. Those 3-D ports serve as anchors for cells connected to 3-D interblock nets and restrain the placement engine from overoptimizing 2-D interblock nets. Therefore, this anchoring process optimizes both 3-D/2-D interblock connections.
With the RISC-V dual-core rocket processor design, Fig. 7 shows that the total half perimeter wirelength (HPWL) for all interblock nets at the first iteration is reduced by 30.4% compared to the initial tier-by-tier placement solution. Moreover, it monotonically decreases as the iteration proceeds. On the other hand, we observe 5% HPWL overhead with small disturbance for intrablock nets. At the end of each iteration, we compare the total HPWL of nets with the threshold value to meet the exit condition. The threshold value is defined by where Min.HPWL is the minimum HPWL value during the whole iterations.
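For reference, the HPWL of a net is the half-perimeter of the bounding box of its pins. The sketch below assumes that both tiers share one (X, Y) coordinate frame and that the vertical length of a 3-D connection is negligible, so pins from both tiers are simply projected onto the same plane; the coordinates are made up to contrast a pin inside the block with a pin forced to the block boundary.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net given its (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over a dict that maps net name to its pin list."""
    return sum(hpwl(pins) for pins in nets.values())


# Hypothetical 3-D net: driver on the bottom tier, sink on the top tier.
driver, sink = (10.0, 10.0), (12.0, 11.0)

# Pin inside the block area: the 3-D port sits at the driver location.
area_pin_net = [driver, sink]

# Pin at the block boundary: the net must also touch (0.0, 10.0).
boundary_pin_net = [driver, (0.0, 10.0), sink]

area_len = hpwl(area_pin_net)
boundary_len = hpwl(boundary_pin_net)
```

In this toy case the boundary pin stretches the bounding box along X, which is the kind of detour that pin-in-the-area is meant to avoid.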
VOLUME 7, NO. 1, JUNE 2021
FIGURE 7. HPWL changes as the wirelength-driven 3-D placement proceeds. We observe a 30.4% HPWL reduction for interblock nets at the very first iteration, and it decreases monotonically as the iteration proceeds. For intrablock nets, a small disturbance of around 5% overhead is observed.
As the iteration proceeds, we update the 3-D port locations based on the latest tier-by-tier placement solution. When the exit condition is met, we use the resulting tier-by-tier placement solution for 3-D timing budgeting. In the Rocket processor design, we meet the exit condition at the seventh iteration, as shown in Fig. 8, and find the optimal tier-by-tier placement solution at the sixth iteration. Fig. 9 shows the movement of cells directly connected to the interblock nets. The standard cells in the bottom tier move toward the optimal 3-D connection points from iterations 0 to 6.
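The control flow of the iteration can be sketched as follows. The callables `place_tier`, `extract_ports`, and `total_hpwl` are hypothetical stand-ins for the commercial placement engine and the port extraction step, and the exit rule used here (stop at the first iteration that fails to improve on the best HPWL so far) is an assumption that replaces the threshold-based exit condition of the actual flow.

```python
def iterate_3d_placement(place_tier, extract_ports, total_hpwl, tiers,
                         max_iters=20):
    """Alternate tier-by-tier placement and 3-D port exchange.

    Returns (best_iteration, best_hpwl, best_placements).
    """
    # Iteration 0: place each tier without any 3-D port anchors
    # (the disconnected tier netlist).
    placements = {t: place_tier(t, None) for t in tiers}
    best_iter, best_hpwl, best_placements = 0, total_hpwl(placements), placements

    for it in range(1, max_iters + 1):
        # Both tiers share the 3-D port locations of the latest placement.
        ports = extract_ports(placements)
        # Every tier is placed again from scratch, anchored by those ports.
        placements = {t: place_tier(t, ports) for t in tiers}
        hpwl = total_hpwl(placements)
        if hpwl < best_hpwl:
            best_iter, best_hpwl, best_placements = it, hpwl, placements
        else:
            break  # assumed exit rule: no improvement over the best so far
    return best_iter, best_hpwl, best_placements
```

Under this assumed rule, the loop stops one iteration after the best solution is found and returns the earlier placement, which has the same shape as the reported behavior of exiting at the seventh iteration and keeping the solution of the sixth.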

G. PIN ASSIGNMENT AND TIMING BUDGETING
Based on the optimal tier-by-tier placement solution and the result of wirelength-driven 3-D placement, we assign the block pins at the final location of the 3-D port inside the boundary of each block. For these pin assignments, a special 3-D P&R environment using full 3-D metal stack information is required. The 3-D technology library exchange format (LEF) file contains the full layer definition used for both the top and bottom tiers. The 3-D macro LEF defines the pin locations of standard cells based on their tier. The RC database for the 3-D metal stack (3-D TCH) is also needed for the timing budgeting later. From these input files, we instantiate standard cells in a top-tier block with the top-tier macro LEF and cells in a bottom-tier block with the bottom-tier macro LEF. Since we know the tier location of cells, we update the macro of each cell in the original netlist based on its tier location. Finally, we load both tier-by-tier placement solutions on a single placement layer to accommodate all the synthesized cells of the design in the P&R tool. Although the top-tier and bottom-tier cells overlap heavily in this synthetic placement, their cell pin locations remain separate thanks to the 3-D macro LEFs, as shown in Fig. 10. As there are unplaced top-level cells, additional top-level placement can proceed at this point to implement a full 3-D placement solution.
In our designs, we utilize six metal layers for the 2-D IC and for both top and bottom tiers in the M3D IC. We reserve In addition, we create an accessing via at a 3-D pin location inside the boundary of a blockage layer. This prevents the routing engine from utilizing metal layers reserved for blocks and accessing the blocks through the block regions unless the block pins are assigned as 3-D ports. Once detail routing is done honoring the routing blockage, remaining block pins (2-D pin) are assigned at the periphery of a block automatically based on the interblock routing information. Fig. 11 shows the pin assignment result. Finally, based on the extracted parasitics, the timing constraints for individual blocks and top-level design are generated, and we use them for the block-/top-level implementations.

H. TOP-LEVEL TIMING CLOSURE
For top-level timing closure, we apply the state-of-the-art gate-level M3D flow [13] to our block-level M3D IC environment. As shown in Fig. 12, we first treat the individual blocks as hard macros and flatten both tiers on the single placement layer. Then, we expand the flattened top-level floorplan up to the 2-D IC footprint and replace hard macros with placement blockages. When macros are fully overlapped, full placement blockages are created. Otherwise, partial placement blockages with a 50% allowance are created to reflect the empty spaces in the nonoverlapped region. Note that the placement blockage regions also should be expanded by the same expansion factor. Based on those blockage regions, 2-D P&R engines identify the legal buffer placement locations.
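The blockage rule can be illustrated with a small sketch on the flattened view. It assumes axis-aligned block rectangles given as `(x0, y0, x1, y1)` and reads the 50% allowance as a 50% blocked region wherever only one tier is occupied; the function names and the example coordinates are hypothetical.

```python
def covers(rect, x, y):
    """True if point (x, y) lies inside the rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1


def blockage_percent(x, y, top_blocks, bottom_blocks):
    """Blocked share of the flattened placement area at point (x, y)."""
    on_top = any(covers(r, x, y) for r in top_blocks)
    on_bottom = any(covers(r, x, y) for r in bottom_blocks)
    if on_top and on_bottom:
        return 100  # macros fully overlapped: full placement blockage
    if on_top or on_bottom:
        return 50   # one tier is free: partial placement blockage
    return 0        # white space on both tiers


# Illustrative blocks: one per tier, overlapping for 5 <= x < 10.
top_blocks = [(0.0, 0.0, 10.0, 10.0)]
bottom_blocks = [(5.0, 0.0, 15.0, 10.0)]
```

Buffers may then be placed at points where the returned value is 0 or 50, which mirrors the white spaces and partial placement blockages used during top-level timing closure.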
Next, we perform the conventional 2-D P&R steps with block timing models and scaled RC parasitics to close the top-level timing. Timing buffers are inserted at either white spaces or partial placement blockages during top-level timing closure. Note that parasitic scaling is required to reflect the 3-D wirelength savings in advance. Then, we linearly map the placement result onto the original M3D footprint to determine the final (X, Y) location of top-level buffers. Assuming two-tier M3D ICs without silicon area overhead, $1/\sqrt{2} \approx 0.707$ is used for the parasitic scaling and placement contraction factor.
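The origin of the 0.707 factor can be checked with a few lines of Python. The sketch assumes a uniform linear scaling about the origin; the function names are hypothetical, and the 1.82 mm² footprint is the 2-D value reported in this article.

```python
import math

TIERS = 2
EXPANSION = math.sqrt(TIERS)          # M3D footprint -> 2-D footprint
CONTRACTION = 1.0 / math.sqrt(TIERS)  # 2-D footprint -> M3D footprint


def expand_point(x, y):
    """Map a location on the M3D footprint to the expanded 2-D footprint."""
    return (x * EXPANSION, y * EXPANSION)


def contract_point(x, y):
    """Map a buffer location found on the 2-D footprint back to the M3D one."""
    return (x * CONTRACTION, y * CONTRACTION)


# Halving the area scales each side by 1/sqrt(2).
area_2d_mm2 = 1.82
area_m3d_mm2 = area_2d_mm2 * CONTRACTION ** 2
```

Because area scales with the square of the linear factor, a 50% footprint reduction corresponds to contracting every coordinate, and hence every wire length, by about 0.707.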
To determine the Z location of top-level timing buffers, we use a bin-based placement-driven Fiduccia-Mattheyses (FM) min-cut partitioning algorithm [14]. For the top-level 3-D connection, 3-D routing should proceed first, and the monolithic intertier via (MIV) locations are decided based on the routing result of 3-D interblock nets. Then, our in-house tool generates the new RTL netlist for each tier that contains the MIVs as primary in/out ports. The RC model of MIVs is also generated as a standard parasitic exchange format (SPEF) file at this stage. Assembling block layouts at the top level finalizes the full-chip block-level M3D IC implementation.
Fig. 13 shows the assembled design layouts of 28-nm RISC-V dual-core rocket processors for the 2-D IC, for the M3D IC with block pins at the boundary of blocks (M3D boundary), and for the M3D IC with block pins inside the boundaries of blocks (M3D area), respectively. The footprint of the 2-D IC is 1.82 mm², and that of the M3D ICs is 0.91 mm² (−50%); therefore, there is no silicon area overhead in the two-tier M3D designs. Table 2 summarizes the wirelength of nets based on their connection types. Because the 50% footprint reduction in both M3D designs yields wirelength savings at the timing budgeting step, the total wirelength savings of the M3D boundary and M3D area designs are 2.8% and 5.2%, respectively. It is worth noting that the wirelength saving on intrablock nets is the major source of the total wirelength saving. Given that the wirelength of intrablock nets contributes 75.3% of the total wirelength in the 2-D baseline, this implies that the additional timing margin offered by M3D integration affects the design quality more critically than the direct reduction of interblock wirelength. Table 2 also shows that the pin-in-the-area flow enables further optimized block implementation based on a better timing margin than that of a block with boundary pins.

B. CELL COUNT AND AREA SAVINGS
As the area of memory modules is kept the same at 0.129 mm² in all designs, we tabulate the number and area of standard cells for the 2-D, M3D boundary, and M3D area designs, respectively, in Table 3. Top-level standard cells are top-level timing buffers, and block-level standard cells are the total sum of cells from individual blocks. Compared to 2-D, we observe that the M3D boundary and M3D area designs reduce the total cell count by 4.2% and 9.6%, respectively. The 2-D baseline design contains 5.6% of the total standard cells as top-level timing buffers. While both M3D designs reduce the top-level buffers by around 50% due to the reduced top-level interblock interconnections, we observe that the M3D area design further decreases the buffer count in the individual block implementations due to the increased timing margin at block boundaries. Area saving is a good metric that helps us understand the combined impact of cell count and drive strength reduction. It indicates that, for block-level cells, buffers are reduced in number but replaced by larger ones, whereas both the number and drive strength of top-level cells are reduced.

C. ROUTING CONGESTION
Before block/top-level timing closure, we observe that the M3D boundary and M3D area designs achieve 10.4% and 12.1% wirelength savings compared to 2-D, respectively. Because individual blocks are not implemented yet, the impact of intrablock net wirelength can be excluded, which allows us to capture the clear benefits of M3D integration for the interblock nets. However, all these wirelength savings become small after we actually implement blocks and perform top-level timing closure. To analyze the loss of wirelength savings, we check the routing congestion added by block and top implementation. Table 4 shows the change of wirelength per metal layer caused by the block implementation and the top-level timing closure. For the 2-D IC, a huge wirelength increase is observed at the M3 layer. Since intrablock nets reserve up to the M4 layer inside each block, this increase mainly originates from individual block optimization. However, in M3D IC designs, the top-level timing buffer insertion is the major cause of routing congestion. This is because timing buffers that are placed on the top tier occupy the empty spaces between top-tier blocks. As a result, the 3-D interblock nets must detour around them and become longer, which results in significant wirelength increases from the M5B to M3T layers. In the case of the M3D boundary design, this congestion becomes worse at the M4B and M5B layers since every 3-D interblock net has to detour around the top-level cells. On the other hand, the M3D area design shows a lower 3-D routing overhead by enabling direct pin access to the top-tier blocks at the M1T layer.

D. POWER SAVING AT ISO-PERFORMANCE
As shown in Table 5, the major factor in decreasing the total capacitance turns out to be the pin capacitance reduction from cell count and drive strength savings in the individual blocks. Although we observe a wire capacitance reduction, the top-level routing congestion is found to be a bottleneck in preserving the wirelength savings in M3D ICs. Table 6 shows the static power metrics of the 2-D, M3D boundary, and M3D area designs at 538 MHz, which is the maximum frequency of the 2-D design; a switching activity of 0.1 is used for primary inputs and sequential elements, and 2.0 for clock ports. Note that cell count reduction leads to a large combinational power decrease. The M3D boundary design shows 5.0% total power saving with 3.9% and 20.6% savings in terms of internal and leakage powers, respectively. In the case of the M3D area design, 10.2% and 6.3% pin and wire capacitance reductions give us 8.3% total power saving compared to the 2-D counterpart based on the large switching and leakage power decreases.

E. ENERGY-DELAY-PRODUCT SAVING AT MAXIMUM PERFORMANCE
For the cross comparison of the 2-D and M3D designs at their own maximum frequencies, we tabulate the energy-delay-product metric in Table 7. We first observe that the M3D boundary and M3D area designs achieve 8.7% and 19.0% faster clock frequencies compared to 2-D, respectively. Total power and energy values are calculated at these maximum frequencies, and finally, we observe that the M3D area design improves the energy-delay product by 13.6% and 24.7% compared to the M3D boundary and 2-D designs, respectively. These improvements represent the benefit of the pin-in-the-area flow.
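As a reminder of how such a metric is commonly formed, the sketch below assumes that energy is the energy per clock cycle (power times clock period) and that delay is the clock period, which gives EDP = P / f². These definitions and the function names are assumptions, since the exact definitions behind Table 7 are not restated here, and the numbers in the example are purely illustrative.

```python
def energy_delay_product(power_w, freq_hz):
    """EDP under the assumed definitions: (P / f) * (1 / f) = P / f**2."""
    period_s = 1.0 / freq_hz
    energy_per_cycle_j = power_w * period_s
    return energy_per_cycle_j * period_s


def relative_saving(new, reference):
    """Fractional reduction of `new` with respect to `reference`."""
    return 1.0 - new / reference


# Purely illustrative values, not taken from Table 7.
edp_ref = energy_delay_product(power_w=1.0, freq_hz=1.0)
edp_new = energy_delay_product(power_w=1.0, freq_hz=2.0)
saving = relative_saving(edp_new, edp_ref)
```

Because frequency enters squared, a design that runs faster at similar power gains more in EDP than in clock frequency alone.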

V. CONCLUSION
In this article, we presented a physical design solution named pin-in-the-area for efficient block implementation and integration in M3D hierarchical designs. Our pin-in-the-area flow enables block pin assignment at any location inside the boundary of block regions for direct vertical block communication in M3D ICs. We also proposed an iterative wirelength-driven 3-D placement to co-optimize the block-level placement and pin locations. With 28-nm RISC-V dual-core rocket processor designs, we observed that the direct vertical block communication offers a larger timing margin for individual blocks and allows efficient block implementation, resulting in a 9.6% total cell count and 10.2% pin capacitance reduction. Finally, we achieved 19.7% faster maximum clock frequency and 24.7% better energy-delay efficiency in the block-level M3D design using the pin-in-the-area flow than those of the conventional 2-D hierarchical design.