## I. INTRODUCTION

CONVENTIONAL NAND flash memory employs a parallel multi-drop bus architecture to connect multiple devices to a controller. As bus speeds have increased with the emergence of high-speed Open NAND Flash Interface (ONFI) [1] and toggle-mode [2] NAND flash devices, the limitations of the multi-drop bus have become apparent. As speed increases, the number of parallel loads that can be supported decreases. On-die termination is used to mitigate the loading effects at the cost of increased power consumption. To support high-capacity, high-performance solid-state drives (SSDs), a large number of memory channels, each supporting only 4–8 NAND devices, will be required.

A ring topology was originally proposed for DRAM main memory in the IEEE RamLink standard [3]. Higher-speed operation is possible because each device drives only a single load, rather than multiple loads as in a parallel bus topology. A drawback of the ring topology in DRAM applications is that each stage in the ring adds latency, which is critical for processor main-memory performance.

HyperLink NAND (HLNAND) [4] was introduced to overcome the performance and scalability issues of conventional NAND flash parallel bus architectures. Since internal NAND read and program operations take on the order of 100 $\mu{\rm s}$ and 1 ms respectively, the additional nanosecond-scale latency of the ring topology is negligible. HLNAND also introduced a fully packetized command and address format with DDR data transfers. Multiple chip-enable signals (CE#) were eliminated by a device ID byte in the command packet. While each device in the HyperLink ring requires a separate 8-bit input and 8-bit output, as opposed to a single 8-bit bidirectional input/output bus in the case of conventional NAND, the number of controller pins required per channel for high-capacity configurations remains about the same due to the reduction in CE# pins.
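The idea of addressing devices with an ID byte instead of per-device CE# pins can be sketched as follows. The field order and opcode value here are hypothetical illustrations, not the actual HLNAND packet layout defined in [4].

```python
def build_command_packet(device_id: int, opcode: int, address: bytes = b"") -> bytes:
    """Assemble a hypothetical HLNAND-style command packet.

    Illustrative only: the real packet layout is defined in [4]. The
    point is that the first byte selects the target device on the ring,
    so no dedicated CE# pin per device is needed.
    """
    if not 0 <= device_id <= 0xFF:
        raise ValueError("device ID must fit in one byte")
    return bytes([device_id, opcode]) + address

# A packet addressed to device 3 with an illustrative opcode 0x10
# and a 3-byte address field:
pkt = build_command_packet(3, 0x10, b"\x00\x01\x02")
```

A controller can thus reach any number of ring devices over the same pins; only the ID byte changes.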

An HLNAND Multi-Chip Package (MCP) employing a stack of 8 commercially available parallel-bus NAND devices and a custom bridge chip interfacing to the external ring was developed [5]. The bridge chip allows slower parallel-bus NAND devices to communicate with the high-speed 300 MB/s ring, providing concurrent operations over 4 internal interfaces within a single package. This MCP approach isolates individual memory die from the ring so that they do not contribute to power dissipation. Although the data circulates around the ring through point-to-point connections, the 150 MHz DDR clock is delivered in a parallel multi-drop fashion. Only 4–8 loads can be driven by a single clock, and larger configurations would require multiple parallel clocks. Also, DDR operation beyond 300 MB/s becomes challenging when clock and data do not have identical signal paths and loads. To achieve higher-speed operation with a single clock, we introduce in this paper a source synchronous clocking scheme for the HyperLink ring architecture. A new 90 nm bridge chip was developed and fabricated for this work.

## II. SOURCE SYNCHRONOUS ARCHITECTURE

Fig. 1 shows an HLNAND source synchronous ring topology configuration with 8 flash MCPs. Clock and data originate from the same source and terminate at the same destination with matched drivers and receivers, so that their phase relationship is maintained over the full range of operating conditions and system configurations. Signals originate from the controller and are regenerated in each MCP as they circulate around the ring back to the controller. A single-ended 8-bit bus for control, address and data information, D[7:0]—Q[7:0], is synchronized with a differential source synchronous clock, CKI/CKI#—CKO/CKO#. The differential clock allows better control of clock duty cycle. An active CSI—CSO strobe signal indicates the presence of a command packet on the D[7:0]—Q[7:0] bus, while an active DSI—DSO strobe signal indicates the presence of a data packet. Two low-speed signals are provided from the controller to the MCPs in parallel: chip enable CE# allows the ring to be powered down and clocks to be suspended for low-power standby, while reset R# initializes the devices on power-up.

Fig. 1. HLNAND source synchronous ring topology.

A simplified schematic of the source synchronous bridge chip clocking and input/output circuitry is shown in Fig. 2. The actual circuit contains delay compensation in the PLL feedback path and dummy circuits for clock and data delay paths to provide the matching essential for high-speed operation. Input strobe signals DSI and CSI pass through to DSO and CSO via circuits similar to the D[7:0] to Q[7:0] path, except that the multiplexers that overwrite incoming signals with read data are not required.

Fig. 2. Source synchronous bridge chip clocking and I/O circuits.

Fig. 3 shows the inputs and outputs of a single MCP after power-up. Edges of the DDR data inputs D[7:0] and outputs Q[7:0] are aligned with edges of the clock inputs CKI/CKI# and clock outputs CKO/CKO#. A differential clock input buffer provides the internal clock ckin with a delay tdi. Input data passes through a matched input buffer employing reference voltage vref to provide internal input data din with matched delay tdi. A PLL locked to $2\times$ the input frequency is used to regenerate ckin for duty cycle correction and provide a $90^{\circ}$ delay to create a sampling clock ckint centered in the din data valid window. A PLL rather than a DLL is used because the PLL filters jitter outside the loop bandwidth; with a DLL, the jitter would accumulate around the ring. Output data and clocks are generated from edges of ckint with a delay of tdo. The total delay from input to output varies with operating conditions but is matched for data and clocks at ${\rm tdi}+{\rm td}90+{\rm tdo}+{\rm td}180$.
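The centering effect of the 90° shift can be checked with simple arithmetic: at DDR, each bit occupies half a clock period, so a quarter-period (90°) delay places the sampling edge exactly half a bit period into the data valid window. The sketch below uses the 417 MHz clock rate reported in Section IV; everything else follows from that number.

```python
f_clk = 417e6                # clock frequency (Hz), DDR834 operation
t_clk = 1 / f_clk            # clock period, ~2.4 ns
t_bit = t_clk / 2            # DDR: one bit per half clock period
td90 = t_clk / 4             # 90 degree phase shift from the PLL

# The sampling edge falls td90 after the data edge, i.e. exactly
# half a bit period into the data valid window:
offset_in_bit = td90 / t_bit  # -> 0.5 (center of the eye)
```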

Fig. 3. HLNAND MCP signals after power-up.

To save power, the PLL can be shut down in alternate MCPs by employing edge-aligned clock and data between odd and even numbered MCPs and center-aligned clock and data between even and odd numbered MCPs. The odd numbered MCPs have their PLLs shut down by a command from the controller, which also reconfigures the inputs to be sampled directly with the received center-aligned input clock. The even numbered MCPs receive a command to reconfigure their outputs to provide center-aligned clock and data by inserting an additional 90° delay on the outbound clock. The center-aligned clock on the output of the even devices compensates for the disabled PLL in the odd devices. The signals at odd and even MCPs in PLL power saving mode are shown in Fig. 4.

Fig. 4. HLNAND MCP signals with alternating PLLs disabled for reduced power.

In the odd devices the input sampling clock ckint is identical to the received input clock ckin. Although the delay through even and odd devices is different, the average delay per stage remains ${\rm tdi}+{\rm td}90+{\rm tdo}+{\rm td}180$. After power-up and synchronization, the controller can measure the total delay around the ring by observing the delay of the command strobe CSI. If there is a single device or an odd number of devices in the ring, the controller does not require a DLL or PLL to create a 90° phase-shifted sampling clock, because the return clock is already centered in the data valid window.
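As a back-of-the-envelope check, the round-trip delay the controller would observe on CSI can be estimated from the per-stage delay expression above. The td90 and td180 terms follow from the 417 MHz clock reported in Section IV; the buffer delays tdi and tdo below are assumed values for illustration only, not measured figures.

```python
f_clk = 417e6                         # clock frequency (Hz), DDR834 operation
t_clk = 1 / f_clk                     # clock period, ~2.4 ns
td90, td180 = t_clk / 4, t_clk / 2    # 90 and 180 degree phase delays

tdi = tdo = 0.5e-9                    # ASSUMED input/output buffer delays

stage_delay = tdi + td90 + tdo + td180  # average delay per MCP stage
n_mcps = 8
ring_delay = n_mcps * stage_delay       # total delay the controller sees on CSI
```

With these assumed buffer delays the 8-MCP ring adds roughly 22 ns end to end, illustrating why this latency is negligible next to the ~100 µs tR and ~1 ms tPROG intervals.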

## III. MULTI-CHIP PACKAGE

The internal architecture of the 32 GB MCP is shown in Fig. 5. The bridge chip implemented in a 90 nm CMOS logic process interfaces to eight 27 nm 32 Gb 133 Mb/s/pin toggle-mode MLC NAND flash devices. The bridge chip can accommodate standard asynchronous NAND and ONFI NAND in addition to toggle-mode by programming the appropriate bond option. Four separate internal channels allow independent bank data transfer operations to the bridge chip to better exploit the bandwidth capabilities of multiple flash devices. Each internal channel supports 2 NAND die within the 8 die stack, although a single die per internal channel with a 4 die stack is also an option.
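The 4-channel, 2-die-per-channel organization amounts to a simple die-to-channel mapping, sketched below. The die indexing and assignment are illustrative assumptions; the actual mapping is a bonding choice not specified in the text.

```python
def die_to_channel(die: int, dies_per_channel: int = 2) -> int:
    """Map a die index (0-7) to one of 4 internal bridge-chip channels.

    Illustrative only: the real die-to-channel assignment in the MCP
    is a packaging/bonding decision.
    """
    if not 0 <= die < 8:
        raise ValueError("8-die stack: die index must be 0-7")
    return die // dies_per_channel

# All 8 dies: 4 channels with 2 dies each, so up to 4 independent
# bank data transfers can proceed concurrently.
channels = [die_to_channel(d) for d in range(8)]
```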

Fig. 5. HLNAND MCP internal architecture.

The bridge chip includes page buffers for each channel to mirror the data in the local NAND page buffers. During an HLNAND page read operation a local page read command is issued to the selected NAND device followed by a burst data read command to load data into the bridge chip page buffer. The data is then available for a subsequent HLNAND burst data read command. Similarly, an HLNAND program command is translated to a local burst data load command to transfer page buffer contents to the selected NAND device followed by a local program command. In this way the operation of the bridge chip page buffer is transparent to the system and hidden within tR and tPROG intervals.
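The command translation performed by the bridge chip can be sketched as an expansion of one ring-level request into a local command sequence. The operation names are hypothetical labels for the steps described above, not actual opcodes.

```python
# Hypothetical operation names; the actual HLNAND and toggle-NAND
# command encodings are not given in the text.
def translate_page_read(target_die: int):
    """Expand one HLNAND page-read request into the local NAND command
    sequence the bridge chip issues, per the flow described above."""
    return [
        ("LOCAL_PAGE_READ", target_die),        # array -> NAND page buffer (tR)
        ("LOCAL_BURST_DATA_READ", target_die),  # NAND buffer -> bridge page buffer
        # data is now ready for an HLNAND burst data read on the ring
    ]

def translate_program(target_die: int):
    """Expand one HLNAND program request into the local sequence."""
    return [
        ("LOCAL_BURST_DATA_LOAD", target_die),  # bridge buffer -> NAND page buffer
        ("LOCAL_PROGRAM", target_die),          # NAND program operation (tPROG)
    ]
```

Because both local sequences complete inside the tR and tPROG windows, the bridge chip's page buffers stay invisible to the host, as the text notes.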

The external 800 Mb/s/pin HyperLink interface employs JEDEC-standard un-terminated 1.2 V HSUL signaling with drivers calibrated within a range of 30–50 $\Omega$ using an external ZQ reference resistor, similar to DDR3 DRAM. The MCP requires 3 different supply voltages: 3.3 V for the NAND core, 1.8 V for the toggle-mode NAND interface, and 1.2 V for the HyperLink HSUL interface.
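Driver calibration against an external ZQ resistor is typically implemented as a successive-approximation search over a binary-weighted leg-enable code, as in DDR3 DRAM. The sketch below illustrates that generic technique with a toy driver model; it is not the bridge chip's actual calibration circuit, and the per-leg impedance is an assumed value.

```python
def calibrate_driver(target_ohms: float, impedance_of_code, bits: int = 5) -> int:
    """Generic successive-approximation ZQ calibration sketch.

    `impedance_of_code` models the driver: it returns the output
    impedance for a given leg-enable code (more legs enabled in
    parallel -> lower impedance). The loop tests each code bit from
    MSB to LSB, keeping a leg enabled while impedance stays at or
    above the target.
    """
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if impedance_of_code(trial) >= target_ohms:
            code = trial
    return code

# Toy driver model: each leg is an ASSUMED 1200 ohms, legs in parallel.
model = lambda code: 1200 / max(code, 1)
code = calibrate_driver(40.0, model)   # target inside the 30-50 ohm range
```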

Fig. 6 shows a de-encapsulated 100-pin 14 mm × 18 mm BGA package with 8 stacked 32 Gb MLC NAND flash devices and the 800 MB/s bridge chip, while Fig. 7 shows the bridge chip measuring 5.07 mm × 2.28 mm. Since the flash devices are relatively narrow, the bridge chip is placed directly on the package substrate. With larger NAND die the bridge chip can be placed on top of the NAND stack. NAND devices have pads along a single short side of the die. The devices are staggered to allow bond wires to connect to 4 die on the left and 4 die on the right. The outer bond pads of the bridge chip connect to two internal NAND channels on each side through the package substrate, while the inner bond wires connect to package balls for the external HyperLink interface.

Fig. 6. De-encapsulated 100-pin BGA package with 8 stacked 32 Gb MLC NAND flash devices and 800 MB/s bridge chip.
Fig. 7. 800 MB/s HLNAND bridge chip.
## IV. MEASURED RESULTS

A test board shown in Fig. 8 was developed to characterize the performance of a ring of 8 MCPs. A pseudo-random data source is connected through SMA connectors to fully exercise the channel and provide crosstalk. The 800 Mb/s data eye shown in Fig. 9 was measured at the output of the last device in the ring and shows good vertical opening and low timing jitter. The shmoo plot in Fig. 10 shows error-free DDR834 operation with a 417 MHz clock, 1.03 V supply voltage, and room temperature.

Fig. 8. 800 MB/s HLNAND ring test board.
Fig. 9. 800 Mb/s eye diagram from last device in the ring.
Fig. 10. Shmoo plot showing error-free DDR834 operation at 417 MHz, 1.03 V, room temperature.
## V. CONCLUSION

A custom 90 nm HLNAND bridge chip with PLL-based DDR timing generation and self-calibrated driver impedance was developed to support 800 MB/s data transactions. The bridge chip interfaces to 8 stacked conventional NAND die within a single MCP to deliver higher performance and scalability than un-buffered NAND devices could provide. Table I summarizes the key features of the 256 Gb NAND flash MCP. HLNAND provides a higher bandwidth channel than standalone NAND devices due to the point-to-point unidirectional ring architecture. Lower power is achieved through the use of 1.2 V un-terminated I/O, single-point loads, and a hierarchical architecture in which the NAND devices are physically isolated from the main channel. Using the conventional NAND interface, a maximum of 8 die can be connected to a single controller channel. The HLNAND MCP allows 64 or more die to be supported by a single channel, enabling cost-effective multi-TB SSDs.

TABLE I 256 Gb NAND FLASH MCP KEY FEATURES

## ACKNOWLEDGMENT

The authors thank Silicon Creations for PLL design, TSMC for chip fabrication, Winpac for package substrate design and assembly, Fidus for board design and assembly, and DA-Integrated for testing.

## Footnotes

P. Gillingham, D. Chinn, E. Choi, J.-K. Kim, D. Macdonald, H. Oh, and H.-B. Pyeon are with Conversant Intellectual Property Management, Inc., Ottawa, ON K2K 2X1, Canada.

R. Schuetz is the founder of a software startup focused on the development of equity trading algorithms and financial data analysis.

Corresponding author: P. Gillingham (gillingham@conversantip.com)

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
