Skip to Main Content
Shared tightly coupled data memories are key architectural elements for building multi-core clusters in programmable accelerators and embedded systems, as they provide a convenient shared memory abstraction while avoiding cache coherence overheads. The performance of these memories largely depends on the architecture of the interconnect used between processing elements (PEs) and memory banks. The advent of three-dimensional (3D) technology has provided new opportunities to increase design modularity and reduce latency and manufacturing cost. In this study, the authors propose two 3D network architectures: C-logarithmic interconnect (LIN) and Distributed logarithmic interconnect (D-LIN) (designed in synthesisable RTL), which allow modular stacking of multiple L1 memory dies over a multi-core cluster with a limited number of PEs. The authors have used two through-silicon-via technologies: the state-of-the-art micro-bumps and the promising and dense Cu-Cu direct bonding. The overhead of electrostatic discharge protection circuits has been considered, as well. Architectural simulation results demonstrate that, in processor-to-L1-memory context, C-LIN and D-LIN perform significantly better than traditional network-on-chips and simple time-division multiplexing buses. Furthermore, post-layout results show that the proposed 3D architectures achieve comparable speed against their 2D counterparts, whereas enabling modularity: from 256 kB to 2 MB L1 memory configurations with a single mask set.