I. Introduction
Embedded systems equipped with graphics processing units (GPUs) increasingly run diverse deep learning workloads [1], [2], and such systems have become crucial for meeting stringent quality-of-service (QoS) requirements. However, the limited hardware resources of embedded GPUs, combined with the diversity of deep learning models, pose significant challenges for workload management [3]. In particular, hardware utilization must be optimized without increasing service latency or queuing delay.