High performance cache block replication using re-reference probability in CMPs

4 Author(s)
Jinglei Wang (Tsinghua Nat. Lab. for Inf. Sci. & Technol., Tsinghua Univ., Beijing, China); Dongsheng Wang; Haixia Wang; Yibo Xue

In a chip multiprocessor (CMP) with shared caches, the last-level cache (LLC) is distributed across all the cores. This increases on-chip communication delay and thus degrades processor performance. The LLC is also quite inefficient because it holds many dead blocks, that is, blocks that will not be referenced again before they are evicted. Replication can reduce access latency in shared caches by copying cache blocks evicted from the cores into their local LLC slices, using the cache space occupied by dead blocks. However, naively allowing all evicted blocks to be replicated has limited performance benefit, because such replication does not take into account the reuse probability of the replicated blocks. This paper proposes Adaptive Probability Replication (APR), a mechanism that counts each block's accesses in the L2 cache slices and monitors the number of evicted blocks at each access count in order to estimate, at runtime, the re-reference probability of blocks over their lifetime. Using the predicted re-reference probability, APR adopts a probability replication policy and a probability insertion policy: blocks are replicated with a probability that corresponds to their re-reference probability and are inserted at a position chosen according to it. We evaluate APR on a 16-core tiled CMP using the SPLASH-2 and PARSEC benchmarks. APR improves performance by 21% on average over a conventional shared cache design, by 17% over Victim Replication (VR), by 10% over Adaptive Selective Replication (ASR), and by 15% over Reactive NUCA (R-NUCA). The additional hardware cost of APR is well under 1% of an L2 cache slice.
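The abstract does not give APR's formulas, so the following is only a minimal sketch of how a re-reference probability could be derived from per-access-count eviction counters and then used for probabilistic replication and insertion. The estimator (the fraction of blocks that reached k accesses and went on to receive at least one more), the function names, and the linear mapping from probability to LRU-stack position are all assumptions made for illustration, not the authors' design.

```python
import random


def rereference_probability(evicted_hist):
    """Estimate re-reference probabilities from an eviction histogram.

    evicted_hist[k] is the number of evicted blocks that had exactly k
    accesses during their lifetime (hypothetical counter layout).
    Returns a list p where p[k] is the fraction of blocks that reached
    k accesses and were then accessed at least once more.
    """
    probs = []
    at_least = sum(evicted_hist)      # blocks with >= 0 accesses
    for count in evicted_hist:
        more = at_least - count       # blocks with more than k accesses
        probs.append(more / at_least if at_least else 0.0)
        at_least = more               # becomes "blocks with >= k+1 accesses"
    return probs


def should_replicate(p, rng=random.random):
    """Replicate an evicted block with probability p (assumed policy)."""
    return rng() < p


def insertion_position(p, ways):
    """Map probability p to an LRU-stack position (0 = MRU, ways-1 = LRU).

    Assumed linear mapping: a higher re-reference probability places the
    replica closer to the MRU end.
    """
    return min(ways - 1, int((1.0 - p) * ways))


if __name__ == "__main__":
    hist = [0, 60, 30, 10]            # made-up counters, for illustration only
    probs = rereference_probability(hist)
    for k, p in enumerate(probs):
        print(k, p, insertion_position(p, ways=8))
```

Under this sketch, a block whose access count maps to a low probability is rarely replicated and, when it is, sits near the LRU end, so it displaces little useful data if the prediction is wrong.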

Published in:
High Performance Computing (HiPC), 2011 18th International Conference on

Date of Conference: 18-21 Dec. 2011
