By Topic

POPS: Coherence Protocol Optimization for Both Private and Shared Data

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Hossain, H. ; Univ. of Rochester, Rochester, NY, USA ; Dwarkadas, S. ; Huang, M.C.

As the number of cores in a chip multiprocessor (CMP) increases, the need for larger on-chip caches also increases in order to avoid creating a bottleneck at the off-chip interconnect. Utilization of these CMPs include combinations of multithreading and multiprogramming, showing a range of sharing behavior, from frequent inter-thread communication to no communication. The goal of the CMP cache design is to maximize capacity for a given size while providing as low a latency as possible for the entire range of sharing behavior. In a typical CMP design, the last level cache (LLC) is shared across the cores and incurs a latency of access that is a function of distance on the chip. Sharing helps avoid the need for replicas at the LLC and allows access to the entire on-chip cache space by any core. However, the cost is the increased latency of communication based on where data is mapped on the chip. In this paper, we propose a cache coherence design we call POPS that provides localized data and metadata access for both shared data (in multithreaded workloads) and private data (predominant in multiprogrammed workloads). POPS achieves its goal by (1) decoupling data and metadata, allowing both to be delegated to local LLC slices for private data and between sharers for shared data, (2) freeing delegated data storage in the LLC for larger effective capacity, and (3) changing the delegation and/or coherence protocol action based on the observed sharing pattern. Our analysis on an execution-driven full system simulator using multithreaded and multiprogrammed workloads shows that POPS performs 42% (28% without micro benchmarks) better for multithreaded workloads, 16% better for multiprogrammed workloads, and 8% better when one single-threaded application is the only running process, compared to the base non-uniform shared L2 protocol. POPS has the added benefits of reduced on-chip and off-chip traffic and reduced dynamic energy consumption.

Published in:

Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on

Date of Conference:

10-14 Oct. 2011