TLP Balancer: Predictive Thread Allocation for Multitenant Inference in Embedded GPUs



Abstract:

This letter introduces a novel software technique to optimize thread allocation for merged and fused kernels in multitenant inference systems on embedded graphics processing units (GPUs). Embedded systems equipped with GPUs face challenges in managing diverse deep learning workloads while adhering to quality-of-service (QoS) standards, primarily due to limited hardware resources and the varied nature of deep learning models. Prior work has relied on static thread allocation strategies, often leading to suboptimal hardware utilization. To address these challenges, we propose a new software technique called thread-level parallelism (TLP) Balancer. TLP Balancer automatically identifies the best-performing number of threads based on performance modeling. This approach significantly enhances hardware utilization and ensures QoS compliance, outperforming traditional fixed-thread allocation methods. Our evaluation shows that TLP Balancer improves throughput by 40% compared to the state-of-the-art automated kernel merge and fusion techniques.
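The core idea of TLP Balancer, as the abstract describes it, is to use a performance model to pick the thread count for merged and fused kernels rather than a fixed allocation. The sketch below illustrates that selection loop in Python with a toy latency model; the model's cost terms (`per_thread_cost_ms`, `contention_ms`), the candidate thread counts, and the function names are all illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch of predictive thread allocation: evaluate candidate
# thread counts against a performance model and pick the one that minimizes
# modeled latency while meeting a QoS deadline. The latency model below is
# a toy assumption, not the model from the letter.

def modeled_latency_ms(threads: int, work_items: int = 4096,
                       per_thread_cost_ms: float = 0.002,
                       contention_ms: float = 0.00001) -> float:
    """Toy model: work divides across threads, while contention overhead
    grows linearly with thread count (illustrative assumption)."""
    parallel = (work_items / threads) * per_thread_cost_ms
    overhead = contention_ms * threads
    return parallel + overhead

def best_thread_count(candidates, qos_deadline_ms: float):
    """Return the candidate with the lowest modeled latency among those
    meeting the QoS deadline, or None if no candidate qualifies."""
    feasible = [(modeled_latency_ms(t), t) for t in candidates
                if modeled_latency_ms(t) <= qos_deadline_ms]
    return min(feasible)[1] if feasible else None

if __name__ == "__main__":
    # Typical embedded-GPU block sizes (powers of two up to 1024 threads).
    print(best_thread_count([64, 128, 256, 512, 1024],
                            qos_deadline_ms=0.1))
```

The same structure applies regardless of the model used: a static allocation picks one entry of `candidates` up front, whereas a predictive balancer re-evaluates the model as the workload mix changes.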
Published in: IEEE Embedded Systems Letters (Volume 17, Issue 3, June 2025)
Page(s): 180 - 183
Date of Publication: 14 November 2024


I. Introduction

Embedded systems equipped with graphics processing units (GPUs) increasingly run a wide variety of deep learning workloads [1], [2], and maintaining stringent quality-of-service (QoS) standards has become crucial for these systems. However, the limited hardware resources of embedded GPUs and the diverse nature of deep learning models pose significant challenges for workload management [3]: hardware utilization must be optimized without compromising service latency or queuing delay.

