TensorRT Implementations of Model Quantization on Edge SoC

2023 IEEE 16TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP, MCSOC, pp. 486–493.

By: Y. Zhou, Z. Guo, Z. Dong & K. Yang

author keywords: deep neural networks; network quantization; SoC; TensorRT; PyTorch; ONNX; edge device
Source: Web Of Science

Deep neural networks have shown remarkable capabilities in computer vision applications. However, their complex architectures can pose challenges for efficient real-time deployment on edge devices, as they demand significant computational resources and energy. To address these challenges, TensorRT was developed to optimize neural network models trained in major frameworks, speeding up inference and minimizing latency. It enables inference optimization through techniques such as model quantization, which reduces computation by lowering the precision of the data types. The focus of our paper is to evaluate the effectiveness of TensorRT for model quantization. We conduct a comprehensive assessment of the accuracy, inference time, and throughput of TensorRT-quantized models on an edge device. Our findings indicate that quantization in TensorRT significantly improves inference efficiency while maintaining a high level of inference accuracy. Additionally, we explore various workflows for implementing quantization with TensorRT and discuss their advantages and disadvantages. Based on our analysis of these workflows, we provide recommendations for selecting an appropriate workflow for different application scenarios.
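The quantization idea the abstract refers to — trading numerical precision for cheaper arithmetic — can be sketched in a few lines. This is a minimal, illustrative example of symmetric per-tensor INT8 quantization in plain Python; the function names are hypothetical and are not part of the TensorRT API, which performs this mapping (plus calibration) internally.

```python
def int8_scale(values):
    """Per-tensor scale so the largest magnitude maps to 127 (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs else 1.0

def quantize(values, scale):
    """FP32 -> INT8: divide by the scale, round, and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """INT8 -> approximate FP32: multiply the integer codes back by the scale."""
    return [q * scale for q in qvalues]

# Toy "weights": each FP32 value is replaced by one signed byte plus a shared scale.
weights = [0.02, -1.27, 0.635, 0.9]
scale = int8_scale(weights)
q = quantize(weights, scale)
recovered = dequantize(q, scale)
```

The round trip loses at most half a quantization step per element, which is why, as the paper reports, INT8 inference can retain high accuracy while the narrower data type cuts memory traffic and compute cost.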