Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads | IEEE Conference Publication | IEEE Xplore

Abstract:

Shared memory heterogeneous systems are now mainstream, with nearly every mobile phone and tablet containing integrated processing units. However, developing applications for such devices is difficult as workloads must be decomposed across different processing units, and the decomposition must be flexible to account for the growing diversity of devices, each with different relative processing unit throughput. Furthermore, many devices require distinct programming front ends, requiring significant effort to write cross-platform applications. In this work, we identify a pragmatic class of applications, which we call traverse-compute applications, that are ideal for shared memory heterogeneous systems. These applications have a flexible heterogeneous decomposition where CPUs excel at traversing a tree structure, while accelerators excel at node computations. Leveraging this insight, we present Redwood: a framework for writing heterogeneous traverse-compute workloads. Redwood provides a simple processing unit abstraction and a tree traversal library that enables heterogeneous optimizations. Using Redwood, we implement Grove, a benchmark suite containing nine pragmatic tree traversal applications, e.g., k-nearest neighbors. We instantiate Redwood for three different heterogeneous programming platforms: CUDA, SYCL, and High-Level Synthesis; we use Grove to evaluate five shared memory heterogeneous systems. Our evaluation highlights the importance of flexible heterogeneous decomposition as the optimal parameters differ widely across platforms and applications. However, once optimally configured, heterogeneous implementations can provide up to 13.53× speedups (geomean of 3.01×) over homogeneous implementations, showcasing the potential of heterogeneous computing for these workloads.
Notes: Acknowledgement: "This research was supported in part by the DARPA SDH Program under agreement No. FA8650-18-2-7862 and the U.S. Government. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government."
Date of Conference: 23-25 April 2023
Date Added to IEEE Xplore: 23 June 2023
Conference Location: Raleigh, NC, USA
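
The traverse-compute decomposition described in the abstract can be illustrated with a minimal sketch. This is not Redwood's API: the names `Node`, `build`, `leaf_compute`, and `nearest_sq_dist` are hypothetical, the example runs entirely on the CPU, and `leaf_compute` only stands in for the node computation that an accelerator would perform. The `leaf_size` parameter is an assumed tuning knob showing how the split between traversal work and node computation could be shifted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    # Internal nodes carry a split plane and two children; leaves carry a bucket.
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    points: Optional[List[Point]] = None  # set only for leaf nodes


def build(points: List[Point], leaf_size: int, depth: int = 0) -> Node:
    """Build a 2-D k-d tree; leaf_size (>= 1) is the hypothetical decomposition knob."""
    if len(points) <= leaf_size:
        return Node(points=list(points))
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(axis=axis, split=pts[mid][axis],
                left=build(pts[:mid], leaf_size, depth + 1),
                right=build(pts[mid:], leaf_size, depth + 1))


def leaf_compute(query: Point, bucket: List[Point]) -> float:
    """Node computation (accelerator stand-in): min squared distance in one bucket."""
    return min(((p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2 for p in bucket),
               default=float("inf"))


def nearest_sq_dist(root: Node, query: Point) -> Tuple[float, int]:
    """Traversal (CPU role): walk the tree, prune subtrees, hand leaves to leaf_compute.

    Returns the nearest squared distance and the number of leaves visited.
    """
    best = float("inf")
    leaves_visited = 0
    stack = [(root, 0.0)]  # (node, lower bound on squared distance to its points)
    while stack:
        node, bound = stack.pop()
        if bound >= best:
            continue  # nothing in this subtree can improve the current best
        if node.points is not None:
            best = min(best, leaf_compute(query, node.points))
            leaves_visited += 1
            continue
        diff = query[node.axis] - node.split
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        stack.append((far, max(bound, diff * diff)))
        stack.append((near, bound))  # pushed last, so it is explored first
    return best, leaves_visited


if __name__ == "__main__":
    data = [((i * 37) % 101 / 10.0, (i * 53) % 97 / 10.0) for i in range(200)]
    for size in (1, 8, 64):
        print(size, nearest_sq_dist(build(data, size), (3.3, 4.4)))
```

In this sketch, a larger `leaf_size` produces a shallower tree with more work per leaf computation, while a smaller one puts more of the work into traversal. That is one plausible way to read the flexible decomposition the abstract describes, whose optimal setting is reported to vary across platforms and applications.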

I. Introduction

As Moore’s Law and Dennard scaling come to an end, the demand for ever-increasing performance and energy efficiency has driven the development of Shared-Memory Heterogeneous Systems (SMHSs), particularly in mobile Systems-on-Chip (SoCs); for example, accelerators occupy over 80% of the die area of an Apple A12 SoC [45]. SMHSs incorporate diverse specialized processing units (PUs), including traditional CPUs and Programmable Accelerating PUs (PAPUs), such as integrated GPUs and embedded FPGAs, all interconnected through a shared-memory hierarchy on the same chip. Unlike conventional accelerator-oriented heterogeneous systems (e.g., [23], [41]), the SMHS architecture enables efficient communication and data sharing between different PUs, whereas discrete heterogeneous systems typically transfer data via PCIe, as studied in [12], [19], [33].
