
A Performance Study Depending on Execution Times of Various Frameworks in Machine Learning Inference


Abstract:

This work compares the latency of various frameworks in machine learning inference using an average power calculation model. The model is implemented as a 2-layer neural network in PyTorch, in Python. It is then converted both to a traced TorchScript module and to the ONNX file format. Afterwards, the C++ front-end is used for the inference process. The traced model is run with LibTorch on CPU and GPU, while the ONNX file is run with ONNX Runtime on both CPU and GPU and with TensorRT on GPU. Inference execution times over 100 trials are averaged for each case, and TensorRT with the ONNX file format significantly outperforms its counterparts, as expected. Hence, this work highlights the performance of TensorRT in machine learning inference and points toward future work by proposing several extensions.
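
As a minimal sketch of the conversion pipeline described above, the following Python snippet traces a 2-layer PyTorch model and exports it to ONNX. The layer sizes, input shape, and file names are illustrative assumptions and are not taken from the paper.

    import torch
    import torch.nn as nn

    # Hypothetical 2-layer network standing in for the average power
    # calculation model; the actual layer sizes are not specified here.
    class PowerNet(nn.Module):
        def __init__(self, in_features=2, hidden=16, out_features=1):
            super().__init__()
            self.fc1 = nn.Linear(in_features, hidden)
            self.fc2 = nn.Linear(hidden, out_features)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    model = PowerNet().eval()
    example = torch.randn(1, 2)  # dummy input with an assumed shape

    # Traced TorchScript module, loadable from the C++ front-end with LibTorch
    traced = torch.jit.trace(model, example)
    traced.save("power_model.pt")

    # ONNX export, consumable by ONNX Runtime and TensorRT
    torch.onnx.export(model, example, "power_model.onnx",
                      input_names=["input"], output_names=["output"])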
Date of Conference: 17-19 November 2021
Date Added to IEEE Xplore: 05 January 2022
Conference Location: Izmir, Turkey

I. Introduction

In machine learning (ML) inference, performance is usually measured by two metrics: latency and throughput. Latency is the amount of time spent in the inference process, and throughput is the number of inferences that can be completed per unit time [1]. Hence, it is desirable to have low latency and high throughput. In this work, we focused on latency, as speeding up ML inference has been at the forefront of research in recent years. Latency is especially important to examine when it is critical for the system at hand. One such system is average instantaneous power calculation. We deployed a simple power calculation neural network model on various frameworks and computed the execution times of the inference processes. We considered only the inference itself and excluded data pre- and post-processing from our measurements. We expected TensorRT to outperform the other options because it was developed for high-performance inference and is considered the leading framework [2]. Demonstrating its superiority was our main motivation. We believe our work contributes to the state-of-the-art literature as a quantitative study comparing various frameworks in terms of their inference performance.
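
To make the measurement procedure concrete, the sketch below times only the inference call and averages over 100 trials, in the spirit of the setup described above. It uses the ONNX Runtime Python API for brevity, whereas the paper performs inference through the C++ front-end; the file name, input tensor name, and shape follow the export sketch earlier and are assumptions.

    import time
    import numpy as np
    import onnxruntime as ort

    # Load the exported model on CPU (a GPU run would use CUDAExecutionProvider)
    session = ort.InferenceSession("power_model.onnx",
                                   providers=["CPUExecutionProvider"])

    # Input prepared outside the timed region: pre-processing is excluded
    x = np.random.randn(1, 2).astype(np.float32)

    # Warm-up run so one-time initialization does not skew the average
    session.run(None, {"input": x})

    trials = 100
    start = time.perf_counter()
    for _ in range(trials):
        session.run(None, {"input": x})  # only the inference call is timed
    elapsed = time.perf_counter() - start
    print(f"average latency: {elapsed / trials * 1e3:.3f} ms")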

