By Topic

Combining Data Reuse With Data-Level Parallelization for FPGA-Targeted Hardware Compilation: A Geometric Programming Framework

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

4 Author(s)
Qiang Liu ; Dept. of Electr. & Electron. Eng., Imperial Coll. London, London ; George A. Constantinides ; Konstantinos Masselos ; Peter Y. K. Cheung

A nonlinear optimization framework is proposed in this paper to automate exploration of the design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the memory resources available on-chip. As a result, considering these two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization, based on the data-reuse options determined in the first step, may not yield the performance-optimal designs for limited on-chip memory resources. We consider both problems at the same time, exposing the dependence between the two. We show that this combined problem can be formulated as a nonlinear program and further show that efficient solution techniques exist for this problem, based on recent advances in optimization of so-called geometric programming problems. The results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that given different constraints on on-chip memory utilization, the corresponding performance-optimal designs are automatically determined by the framework. We have also implemented designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization on the same platform, and by comparison, the performance-optimal designs proposed by our framework are faster than the designs determined by the two-stage method by up to 5.7 times.

Published in:

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  (Volume:28 ,  Issue: 3 )