I. Introduction
Modern autonomous systems employ deep neural networks (DNNs) for various tasks and all-in-one systems-on-chip (SoCs) to enable decision making on the go without human intervention. Typically, such systems are equipped with SoCs featuring graphics processing units (GPUs) for executing DNNs. A common and critical autonomous task is object detection (OD), which identifies objects of interest in the environment captured by the stream of images obtained from the camera. A common practice among system developers is to select and configure a single DNN, such as YoloV7 [1], and map it to the fastest processor in the SoC, which is typically the GPU. In this conventional setup, there is limited room for improving the latency and/or energy usage of the autonomous system, since both the model and the target processing unit are fixed. In response, several studies [2], [3] propose offloading the computation to a remote server, while others [4]–[6] attempt to reduce the computational demand by modifying the underlying model or processing only a subset of the data stream. However, offloading is often not viable due to the latency overhead of remote processing, while modifying models or selectively skipping data often significantly compromises accuracy. Instead, in this work, we explore optimizing system performance through context-aware multi-model execution that leverages the different types of accelerators available in SoCs.
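To make the idea concrete, the sketch below shows one possible form a context-aware multi-model dispatcher could take: given a context signal and a latency budget, it selects among profiled model/accelerator configurations. The model names, accelerator labels, latency/energy numbers, and thresholds are hypothetical placeholders for illustration, not the system proposed in this work.

```python
# Minimal sketch of context-aware multi-model dispatch, assuming
# per-configuration latency/energy profiles are available offline.
# All names and numbers below are hypothetical.
from dataclasses import dataclass


@dataclass
class ModelConfig:
    name: str          # detector variant (e.g., small vs. large)
    accelerator: str   # processing unit it is mapped to (e.g., "GPU", "DLA")
    latency_ms: float  # profiled per-frame latency on that accelerator
    energy_mj: float   # profiled per-frame energy on that accelerator


# Hypothetical profiled configurations: a lightweight and a heavyweight
# detector, each mapped to a different accelerator on the SoC.
CONFIGS = [
    ModelConfig("detector-small", "DLA", latency_ms=12.0, energy_mj=35.0),
    ModelConfig("detector-large", "GPU", latency_ms=28.0, energy_mj=90.0),
]


def select_config(scene_complexity: float, latency_budget_ms: float) -> ModelConfig:
    """Pick a configuration that fits the latency budget.

    `scene_complexity` in [0, 1] stands in for any context signal
    (object count, motion, etc.); the 0.5 threshold is arbitrary.
    """
    feasible = [c for c in CONFIGS if c.latency_ms <= latency_budget_ms]
    if not feasible:
        # Nothing fits the budget: degrade gracefully to the fastest option.
        return min(CONFIGS, key=lambda c: c.latency_ms)
    if scene_complexity > 0.5:
        # Complex scene: prefer the most capable feasible model
        # (approximated here by the largest, i.e., slowest one).
        return max(feasible, key=lambda c: c.latency_ms)
    # Simple scene: prefer the most energy-efficient feasible model.
    return min(feasible, key=lambda c: c.energy_mj)


if __name__ == "__main__":
    print(select_config(scene_complexity=0.8, latency_budget_ms=30.0))
    print(select_config(scene_complexity=0.2, latency_budget_ms=30.0))
```

In this toy policy, the context signal drives a per-frame trade-off between accuracy and energy, which is the basic degree of freedom a fixed single-model, single-accelerator deployment lacks.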
Fig. 1. Comparison of (a) single-model object detection with multiple parameter sizes (left) and (b) multi-model object detection architectures (right). Larger values along each axis are better; a perfect model would form the largest triangle across all axes.