In any machine learning problem, the goal of our neural network is to perform well on new, unseen data; training a deep learning model is only the means to that end. We also have to focus on running inference sessions and ensuring the model works reliably when deployed in a specific environment.
You can train your model in your preferred framework, but the environment where you need to run the inference session may not support that particular framework. For example, some time ago, applications preferred the Caffe model format for deployment. Does this mean we have to use the Caffe framework for training as well? No, and this is where the Open Neural Network Exchange (ONNX) model format comes into the picture.
With ONNX, one can switch between deep learning frameworks such as PyTorch and Caffe2.
Similarly, we may want to deploy our model on GCP, in a mobile app, or somewhere else. During the inference session, the framework we used to train the model matters little. In fact, we don't need a training framework at all to run inference; all we need is the trained core neural network in a format that produces accurate and fast results. This is where inference engines such as ONNX Runtime and TensorRT can help us.
Let’s dive deep into ONNX and ONNX Runtime before using them in our task: training a neural network with PyTorch on the MNIST dataset to predict handwritten digits in real time, then deploying the model on ONNX Runtime. I’m not going to discuss how to train a model for digit classification; we’ll focus on ONNX and ONNX Runtime.
What is ONNX?
ONNX is an intermediate representation of our trained model that can be used for switching between frameworks and also for running inference sessions either on-prem or at the edge.
From the above image, it is clear that once you train your deep neural network in whichever framework you prefer, ONNX takes that model (for PyTorch, usually saved as a .pth file) and converts it into a format suitable for deployment.
To understand why it is suitable, let’s start from the basics: the neural networks we build, in whichever framework, are nothing but computations expressed through a dataflow graph, in other words, a computational graph. Some of these graphs are static and some are dynamic. These graphs are an intermediate representation of the neural network we build, and they are translated to run on a specific device such as a GPU or CPU.
Different frameworks have their own intermediate representations of this graph. The available frameworks are also optimized for particular tasks in the ML pipeline; for example, although PyTorch is used for deploying models nowadays, in the recent past the majority of its usage came from research. Converting models between frameworks used to delay the deployment process. The ONNX format provides a common intermediate representation, so one can train in the framework of their choice and still expedite model deployment. Currently, ONNX is focused more on inferencing.
Did you know?
ONNX comes built-in with frameworks like CNTK and ML.NET. It is also a part of the PyTorch package. For other frameworks, manual installation is required.
Let’s see how we can convert our model to the ONNX format. I have already trained a neural network for MNIST digit classification with PyTorch and saved the .pth file.
Did you know?
A trained neural network will have a series of different operators applied to the input data.
Steps to export a model from PyTorch to ONNX:
- The first step is to ensure that we are running our model in evaluation mode. This is essential since operators like dropout or batch-norm behave a bit differently during inference when compared to training.
- A model is exported by scripting or tracing. Let’s consider a simple convolutional neural network trained on the MNIST dataset; we load our model from the saved .pth file as usual.
- We then pass an input tensor to the network; its values can be random as long as it has the right type and shape. ONNX traces the operations applied to this dummy input and, using its built-in operators, generates a model file in the .onnx format.
- The created file is a binary protobuf file that contains both the structure of our neural network and its trained parameters.
Did you know?
ONNX built-in operators include typical operations associated with deep learning like convolution, linear operation, activation function, and more.
Once we have our .onnx file, we can run it in ONNX Runtime.
ONNX Runtime is an inference engine that takes in models in the .onnx format and optimizes them for deployment on cloud platforms, edge devices, and IoT devices.
Here are the benefits of using ONNX Runtime:
- Improve inference performance for a wide variety of ML models
- Reduce the time and cost of training large models
- Train in Python but deploy into a C#/C++/Java app
- Run on different hardware and operating systems
- Support models created in several different frameworks
Our task is to deploy the model on ONNX Runtime on CPU, running the model in the .onnx format that we created in the previous step. Let’s see how to do that.
To run the model in ONNX Runtime:
- Create an inference session for the model with the chosen configuration parameters.
- Evaluate the model using the run API. The output of this call is a list containing the outputs of the model computed by the ONNX Runtime.
One way to verify the model’s output with ONNX Runtime is to compare it with the PyTorch model’s output for the same input.
Conclusion: The time taken by our model to produce predictions on ONNX Runtime is considerably less than the time taken when evaluating directly with PyTorch on the CPU; for this task, ONNX Runtime is about 35% faster. This depends on CPU performance and can vary across tasks. The image below shows the time taken by our model to evaluate a single image on the CPU and on ONNX Runtime.