We discuss the microarchitecture, design, and testing of the first 8 × 8-bit (modulo 256) parallel carry-save RSFQ multiplier implemented using the ISTEC 10-kA/cm² 1.0-μm fabrication technology. Partial products are generated asynchronously and sent to the reduction stage at the internal “hardwired” rate of 80 GHz. The 8 × 8-bit RSFQ multiplier uses a two-level parallel carry-save reduction tree that significantly reduces the multiplier latency. The 80-GHz carry-save reduction is implemented with asynchronous, data-driven, wave-pipelined [4:2] compressors built with toggle flip-flop cells. The design has a mostly regular layout with both local and global connections between modules. The multiplier core (without SFQ-to-DC and DC-to-SFQ converters) has 5948 Josephson junctions occupying an area of 3.5 mm². The multiplier is designed for a target operating frequency of 20 GHz and has a latency of 447 ps at a bias voltage of 2.5 mV. Despite some challenges due to fabrication process parameter variations and flux trapping, the multiplier chip was fabricated and successfully tested for the vast majority of test vectors by the Stony Brook designers with the assistance of colleagues from Yokohama National University in February 2012. Although the multiplier test operations were issued at low frequency, each operation was executed at the “hardwired” rate of 80 GHz. The fabricated chip operated with measured DC bias margins of ±5%.
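As an illustration of the arithmetic only, the following Python sketch models a modulo-256 product computed with two levels of [4:2] compression followed by a final addition. It is a bit-level behavioral model, not the RSFQ circuit: the grouping of the eight partial products into two first-level compressors and one second-level compressor is an assumption, each [4:2] compressor is modeled as two cascaded 3:2 carry-save adders rather than with toggle flip-flop cells, and all function names are hypothetical.

```python
MASK = 0xFF  # keep only the low 8 bits (arithmetic modulo 256)


def csa(x, y, z):
    """3:2 carry-save adder on 8-bit words: x + y + z == s + c (mod 256)."""
    s = x ^ y ^ z                            # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, shifted to the next weight
    return s & MASK, c & MASK


def compress_4_2(w, x, y, z):
    """[4:2] compressor modeled as two cascaded 3:2 adders (an assumption)."""
    s1, c1 = csa(w, x, y)
    return csa(s1, c1, z)


def partial_products(a, b):
    """Eight partial products of a * b, each truncated to 8 bits."""
    return [((a << i) & MASK) if (b >> i) & 1 else 0 for i in range(8)]


def multiply_mod256(a, b):
    """8 x 8-bit product modulo 256 via a two-level carry-save tree."""
    pp = partial_products(a, b)
    # Level 1: two compressors reduce eight rows to four.
    s0, c0 = compress_4_2(*pp[0:4])
    s1, c1 = compress_4_2(*pp[4:8])
    # Level 2: one compressor reduces four rows to two.
    s, c = compress_4_2(s0, c0, s1, c1)
    # Final carry-propagate addition of the remaining two rows.
    return (s + c) & MASK
```

For example, `multiply_mod256(200, 3)` returns 88, since 600 mod 256 = 88. Carries are deferred until the last addition, so each level of the tree has a delay that does not depend on the operand width.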