Abstract:
Many modern programmable embedded devices contain CPUs and a GPU that share the same system memory on a single die. Such a unified memory architecture allows explicit data copying between the CPU and the integrated GPU (iGPU) to be eliminated, significantly improving performance and energy efficiency. However, to enable such a “zero-copy” communication model, many devices either implement intricate cache coherence protocols or disable the last-level caches. This often leads to severe performance degradation for cache-dependent applications, for which CPU-iGPU data transfer based on standard copies remains the best solution. This paper presents a framework based on a performance model, a set of micro-benchmarks, and a novel zero-copy communication pattern to accurately estimate the potential speedup a CPU-iGPU application may gain under different communication models (i.e., standard copy, unified memory, or pinned “zero-copy”). It shows how the framework can be combined with standard profiler information to efficiently drive application tuning for a given programmable embedded device.
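For context, the three communication models named in the abstract correspond to well-known CUDA allocation and transfer patterns. The sketch below is not taken from the paper's framework; it is a minimal illustration, assuming a CUDA-capable device with mapped pinned memory support, of how each model is typically expressed (explicit copies, managed/unified allocations, and device-mapped pinned host memory).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel used to exercise each communication model.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // 1) Standard copy: explicit staging through a separate device allocation.
    {
        float *h = (float *)malloc(bytes);
        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        scale<<<blocks, threads>>>(d, n, 2.0f);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d);
        free(h);
    }

    // 2) Unified memory: a single allocation kept coherent by the runtime.
    {
        float *u;
        cudaMallocManaged(&u, bytes);
        scale<<<blocks, threads>>>(u, n, 2.0f);
        cudaDeviceSynchronize();  // required before the CPU reads u again
        cudaFree(u);
    }

    // 3) Pinned "zero-copy": page-locked, device-mapped host memory that the
    //    iGPU accesses directly through the shared system memory.
    {
        float *h, *d_alias;
        cudaHostAlloc(&h, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_alias, h, 0);
        scale<<<blocks, threads>>>(d_alias, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFreeHost(h);
    }

    printf("done\n");
    return 0;
}
```

Which pattern wins depends on the device's coherence support and the application's cache behavior, which is precisely the trade-off the paper's performance model and micro-benchmarks aim to quantify.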
Published in: 2021 58th ACM/IEEE Design Automation Conference (DAC)
Date of Conference: 05-09 December 2021
Date Added to IEEE Xplore: 08 November 2021
Print on Demand (PoD) ISSN: 0738-100X