Abstract:
The 822 mm² Colossus Mk2x is a chip made by stacking and fusing two separately processed 12-inch wafers prior to probe test, singulation and packaging. The two wafers are manufactured using TSMC's N7 CMOS technology and TSMC's 0.13 µm Deep Trench Capacitor (DTC) technology, respectively. Each N7 die has over 60 billion transistors (typically 1 to 3 fins each), 82% of which constitute >8 Gb of SRAM. The circa-750 µF stacked capacitor die significantly reduces worst-case supply voltage undershoot and overshoot, allowing headroom for the supply voltage to be elevated and the clock speed to be increased by 40%. The resulting peak performance is 700 TFLOPS of FP8 arithmetic (350 TFLOPS FP16), reached when executing large, distributed matrix multiplications using all threads on all tile CPUs. This scenario arises, for example, when executing the FFN (Feed Forward Network) in each transformer stage of an NLP model such as BERT [4]. The chip is pictured in Fig. 29.4.1: the 1472 identical tile CPUs, each with 624 kB of SRAM, are arranged in vertical columns adjacent to the exchange, a crossbar interconnect that provides contention-free communication between all of the tiles. Each CPU tile (Fig. 29.4.2) supports six round-robin scheduled hardware threads to minimize instruction latency and includes a fully pipelined FPU capable of accumulating 16 sets of either 4 FP16 products or 8 FP8 products per clock cycle.
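The per-tile FPU figures and the chip-level peak figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not a statement of the chip's specification: it assumes each accumulated product counts as two floating-point operations (one multiply, one add), and the clock frequency it derives is implied by that assumption rather than quoted in the abstract. All function and constant names are illustrative.

```python
# Back-of-envelope check of the peak-throughput figures in the abstract.
# Figures taken from the abstract:
TILES = 1472                              # identical tile CPUs
ACC_SETS = 16                             # accumulation sets per FPU per cycle
PRODUCTS_PER_SET = {"fp16": 4, "fp8": 8}  # products per set per cycle
PEAK_TFLOPS = {"fp16": 350, "fp8": 700}   # quoted peak performance
SRAM_PER_TILE_KB = 624                    # SRAM per tile

# ASSUMPTION (not stated in the abstract): one accumulated product is
# counted as 2 FLOPs, i.e. one multiply plus one add.
FLOPS_PER_PRODUCT = 2


def flops_per_cycle(fmt: str) -> int:
    """Chip-wide FLOPs per clock cycle for the given number format."""
    return TILES * ACC_SETS * PRODUCTS_PER_SET[fmt] * FLOPS_PER_PRODUCT


def implied_clock_ghz(fmt: str) -> float:
    """Clock frequency (GHz) implied by the quoted peak, under the assumption above."""
    return PEAK_TFLOPS[fmt] * 1e12 / flops_per_cycle(fmt) / 1e9


def total_tile_sram_kb() -> int:
    """Sum of per-tile SRAM across all tiles, in kB."""
    return TILES * SRAM_PER_TILE_KB


if __name__ == "__main__":
    for fmt in ("fp16", "fp8"):
        print(fmt, flops_per_cycle(fmt), round(implied_clock_ghz(fmt), 3))
    print("tile SRAM (kB):", total_tile_sram_kb())
```

Under the two-FLOPs-per-product assumption, the FP8 and FP16 peaks imply the same clock of roughly 1.86 GHz, which is what one would expect if both formats run on the same pipeline and FP8 simply doubles the products per set.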
Date of Conference: 19-23 February 2023
Date Added to IEEE Xplore: 23 March 2023