Turbo-codes have become an attractive forward error correction scheme for both wired and wireless communication systems. They achieve near-optimal coding gain, enabling reliable transmission over very noisy channels. Among the existing turbo-coding schemes, Block Turbo-Codes (BTC) perform particularly well when their constituent codes are decoded with the Fang-Buda Algorithm (FBA): their bit error rate (BER) shows no significant error floor down to 10⁻⁹. However, the FBA is so complex that currently available block turbo-decoders, even though they achieve throughputs of up to 155 Mbps, can only handle very simple codes, which do not fully reveal the power of BTC. To enable high-performance BTC without sacrificing throughput or energy, we have systematically analyzed and optimized the FBA by applying a Data Transfer and Storage Exploration methodology. In this paper, we describe the algorithm transformation steps and detail the resulting architecture. When mapped onto a typical 0.18 μm CMOS technology and clocked at 200 MHz, this architecture enables high-performance block turbo-decoding with throughput ranging from 30 Mbps to 120 Mbps. The memory power consumption, which dominates in such a data-dominated application, is estimated at 480 mW, and the memory area at 3.5 mm² per FBA module in the BTC pipeline.