By Topic

An operation placement and scheduling scheme for cache and communication localities in fine-grain parallel architectures

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Yen-Kuang Chen ; Dept. of Electr. Eng., Princeton Univ., NJ, USA ; Kung, S.Y.

With increasing on-chip hardware, concurrency is a way to bridge the gap between the computational power demanded by the applications and that afforded by the computer platforms. Although parallel systems are increasingly popular they remain very difficult to program. In fact, most compilers require the programmer to specify how to partition data or map program code to the system's processors. To ensure an effective program, cache locality is important because of the large speed gap between microprocessors and memory systems. It is also important to make use of local communication whenever possible, since it is cheaper faster and less power hungry than global communication. In order to exploit these locality properties, we present a systematic operation placement and scheduling scheme for fine-grain parallel architectures. The key advantages are twofolds: (1) This multiprojection method, which deals with multidimensional parallelism systematically, can alleviate the burden of the programmer in coding and data partitioning. (2) it addresses the memory/communication bandwidth bottleneck, and can lend to faster program execution. On a special design example of the motion estimation block-matching algorithm, which requires the most intensive computation and memory accesses in video coding, our method lends to a reduction of external memory accesses by two to three orders of magnitude

Published in:

Parallel Architectures, Algorithms, and Networks, 1997. (I-SPAN '97) Proceedings., Third International Symposium on

Date of Conference:

18-20 Dec 1997