This paper discusses the balance between loop-level parallelism and clock rate in improving the performance of DSP applications implemented entirely on FPGAs. Loop-level parallelism reduces the total number of cycles an application needs, but at the cost of increased routing complexity, which often lowers the achievable clock rate. We analyze loops that can be fully parallelized and show that controlling the number of loop iterations executed in parallel can yield better performance than fully parallelizing the loops. We have implemented loop parallelism in our compilation framework and fine-tuned it to improve the performance of DSP applications targeting a Xilinx Virtex-II FPGA. Our experimental results show that it is possible to reach a performance equilibrium point, where the total number of cycles and the overall clock frequency can be adjusted to maximize the overall performance of an application.
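The trade-off can be made concrete by treating execution time as T(p) = C(p) / f(p), where p is the number of loop iterations executed in parallel, C(p) is the total cycle count, and f(p) is the achieved clock frequency. The sketch below is illustrative only. It assumes a simple cycle model, C(p) = ceil(N / p) multiplied by a fixed cycles-per-step value, and uses a hypothetical table of clock rates that degrade as p grows. The functions `execution_time_ns` and `best_parallelism`, along with all frequency values, are invented for this example and are not the paper's measurements or its compiler's method.

```python
import math


def execution_time_ns(n_iters: int, cycles_per_step: int, p: int, clock_mhz: float) -> float:
    """Execution time (ns) of a loop with n_iters iterations, p of them in parallel.

    Hypothetical cycle model: each group of p parallel iterations takes
    cycles_per_step cycles, so total cycles = ceil(n_iters / p) * cycles_per_step.
    """
    cycles = math.ceil(n_iters / p) * cycles_per_step
    return cycles * 1000.0 / clock_mhz  # clock period in ns is 1000 / MHz


def best_parallelism(n_iters: int, cycles_per_step: int, clock_by_p: dict) -> tuple:
    """Return (p, time_ns) for the parallelism degree with the lowest execution time."""
    times = {
        p: execution_time_ns(n_iters, cycles_per_step, p, mhz)
        for p, mhz in clock_by_p.items()
    }
    best_p = min(times, key=times.get)
    return best_p, times[best_p]


# Hypothetical clock rates (MHz) that fall as routing complexity grows with p.
# These numbers are made up for illustration; they are not measured results.
HYPOTHETICAL_CLOCK_MHZ = {1: 150.0, 2: 145.0, 4: 135.0, 8: 110.0, 16: 60.0, 64: 12.0}

if __name__ == "__main__":
    p, t = best_parallelism(64, 1, HYPOTHETICAL_CLOCK_MHZ)
    print(p, round(t, 2))  # 16 66.67
```

With these assumed numbers, executing 16 iterations in parallel (4 cycles at 60 MHz) beats the fully parallel version (1 cycle at 12 MHz). This is the kind of equilibrium point the abstract describes, although the real optimum depends on the actual synthesis results.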