This paper describes performance optimizations of a transfer controller for an FPGA-based blocked parallel matrix multiplication accelerator. A key challenge for the controller is generating the sequence of host memory addresses needed to transfer matrix blocks between host memory and on-chip memory. These addresses are not contiguous, which complicates the controller design. The paper first outlines the intended system architecture for the controller, then presents detailed controller specifications for generating host memory addresses. Various pipeline configurations that yield incremental performance improvements are then described. Finally, experimental results are presented; the best configuration achieves an operating frequency exceeding 470 MHz on an Altera Stratix III chip. This level of performance is comparable to that of the pipelined floating-point arithmetic units in the complete system.
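To illustrate why the block addresses are not contiguous, the following is a minimal software sketch, not the controller design from the paper. It assumes a row-major matrix in host memory, square blocks, and a fixed element size in bytes; the function name `block_addresses` and all of its parameters are hypothetical and chosen only for illustration.

```python
def block_addresses(base, n_cols, elem_size, block_row, block_col, block_size):
    """Yield (start_address, length_in_bytes) for each row of one matrix block.

    Assumptions (not taken from the paper): the matrix is stored row-major
    starting at byte address `base`, has `n_cols` columns, and is divided
    into square blocks of `block_size` x `block_size` elements.
    """
    first_col = block_col * block_size
    for r in range(block_size):
        row = block_row * block_size + r
        # Each block row is contiguous, but consecutive block rows are
        # separated by a full matrix row (n_cols * elem_size bytes).
        start = base + (row * n_cols + first_col) * elem_size
        yield start, block_size * elem_size


# Example: 8-column matrix of 8-byte elements, block (1, 1) of size 4.
bursts = list(block_addresses(0, 8, 8, 1, 1, 4))
```

In this example each burst covers 32 contiguous bytes, while successive bursts start 64 bytes apart, so a transfer of one block needs a separate start address per block row. A hardware controller would have to produce that same strided sequence with counters and adders rather than a software loop.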