TLP Balancer: Predictive Thread Allocation for Multitenant Inference in Embedded GPUs



Abstract:

This letter introduces a novel software technique to optimize thread allocation for merged and fused kernels in multitenant inference systems on embedded graphics processing units (GPUs). Embedded systems equipped with GPUs face challenges in managing diverse deep learning workloads while adhering to quality-of-service (QoS) standards, primarily due to limited hardware resources and the varied nature of deep learning models. Prior work has relied on static thread allocation strategies, often leading to suboptimal hardware utilization. To address these challenges, we propose a new software technique called thread-level parallelism (TLP) Balancer. TLP Balancer automatically identifies the best-performing number of threads based on performance modeling. This approach significantly enhances hardware utilization and ensures QoS compliance, outperforming traditional fixed-thread allocation methods. Our evaluation shows that TLP Balancer improves throughput by 40% compared to the state-of-the-art automated kernel merge and fusion techniques.
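The core idea of TLP Balancer, as the abstract describes it, is to use a performance model to pick the thread count for merged and fused kernels rather than a fixed allocation. The sketch below illustrates that selection loop in Python with a toy latency model; the model's cost terms (`per_thread_cost_ms`, `contention_ms`), the candidate thread counts, and the function names are all illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch of predictive thread allocation: evaluate candidate
# thread counts against a performance model and pick the one that minimizes
# modeled latency while meeting a QoS deadline. The latency model below is
# a toy assumption, not the model from the letter.

def modeled_latency_ms(threads: int, work_items: int = 4096,
                       per_thread_cost_ms: float = 0.002,
                       contention_ms: float = 0.00001) -> float:
    """Toy model: work divides across threads, while contention overhead
    grows linearly with thread count (illustrative assumption)."""
    parallel = (work_items / threads) * per_thread_cost_ms
    overhead = contention_ms * threads
    return parallel + overhead

def best_thread_count(candidates, qos_deadline_ms: float):
    """Return the candidate with the lowest modeled latency among those
    meeting the QoS deadline, or None if no candidate qualifies."""
    feasible = [(modeled_latency_ms(t), t) for t in candidates
                if modeled_latency_ms(t) <= qos_deadline_ms]
    return min(feasible)[1] if feasible else None

if __name__ == "__main__":
    # Typical embedded-GPU block sizes (powers of two up to 1024 threads).
    print(best_thread_count([64, 128, 256, 512, 1024],
                            qos_deadline_ms=0.1))
```

The same structure applies regardless of the model used: a static allocation picks one entry of `candidates` up front, whereas a predictive balancer re-evaluates the model as the workload mix changes.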
Published in: IEEE Embedded Systems Letters (Volume 17, Issue 3, June 2025)
Page(s): 180 - 183
Date of Publication: 14 November 2024


I. Introduction

Embedded systems equipped with graphics processing units (GPUs) increasingly run a wide variety of deep learning workloads [1], [2], and maintaining stringent quality-of-service (QoS) standards has become crucial for these systems. However, the limited hardware resources of embedded GPUs and the diverse nature of deep learning models pose significant challenges for workload management [3]: hardware utilization must be optimized without compromising service latency or queuing delay.

