Concurrent Processing¶
Achieving full utilization of an accelerator during deep learning inference is not a trivial task. Each inference operation inherently involves some waiting time. To see why, it helps to briefly outline the process. Obtaining inference results from the accelerator involves the following steps:
1. Loading (and preprocessing) inputs - host (CPU)
2. Feeding inputs into the accelerator - host-to-accelerator IO (DMA)
3. Running workloads - accelerator
4. Retrieving outputs - accelerator-to-host IO (DMA)
The host CPU is actively involved in step 1, while the accelerator takes over in step 3. Additional DMA transfers may occur during step 3 when part of the computation is better handled on the host.
In a multiple-inference scenario where each inference is data-independent, RBLN SDK can improve utilization across the steps above through concurrency: multiple samples are processed simultaneously, so the idle gaps of one inference are filled by work from others, keeping the device maximally utilized.
RBLN SDK offers a straightforward and user-friendly set of asynchronous APIs for achieving concurrency. This document provides an overview of our asynchronous APIs with two examples, TensorFlow and PyTorch, demonstrating how to effectively utilize them.
Prerequisite¶
Before getting started, please make sure you have installed the following pip packages in your system:
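The exact package set depends on which examples you plan to run. As a sketch (the RBLN package name is an assumption; follow your RBLN SDK installation guide for the authoritative instructions):

```bash
# RBLN SDK compiler/runtime (package name assumed; see your install guide)
pip install rebel-compiler
# Frameworks used in the examples below
pip install tensorflow
pip install torch torchvision ultralytics   # ultralytics provides YOLOv8
# Common utilities used in the example scripts
pip install numpy opencv-python
```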
How to use¶
Once the DL model has been compiled, you will obtain an RBLNCompiledModel
object. To utilize an asynchronous runtime, create a runtime in async mode:
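A minimal sketch is shown below; compile_from_torchscript, its input_info format, and the create_async_runtime method name are assumptions about the API shape, so consult the API reference for the exact calls:

```python
import torch
import torchvision
import rebel  # RBLN SDK module (import name assumed)

# Compile a model to obtain an RBLNCompiledModel; compile_from_torchscript
# and its input_info format are assumptions about the API.
model = torchvision.models.resnet18(weights=None).eval()
scripted = torch.jit.trace(model, torch.zeros(1, 3, 224, 224))
compiled_model = rebel.compile_from_torchscript(
    scripted, input_info=[("x", [1, 3, 224, 224], "float32")])

# Create a runtime in async mode; the method name is an assumption.
async_runtime = compiled_model.create_async_runtime()
```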
When a pre-compiled model saved as a *.rbln file is available in local storage, an asynchronous runtime can be created from it directly, as below:
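For example (the constructor arguments shown are assumptions):

```python
import rebel  # RBLN SDK module (import name assumed)

# Build an AsyncRuntime straight from a saved *.rbln artifact,
# skipping the in-process compilation step.
async_runtime = rebel.AsyncRuntime("model.rbln")
```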
If your program utilizes native asyncio as the event loop, using AsyncRuntime.async_run
is an excellent choice for Pythonic programming, as it adheres to the PEP-492 async and await syntax:
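A minimal asyncio sketch, assuming the runtime is constructed from a saved model.rbln file:

```python
import asyncio
import numpy as np
import rebel  # RBLN SDK module (import name assumed)

async def main():
    runtime = rebel.AsyncRuntime("model.rbln")  # constructor form assumed
    inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)
              for _ in range(8)]
    # Each await point yields control to the event loop, so the eight
    # data-independent inferences overlap on the device.
    outputs = await asyncio.gather(*(runtime.async_run(x) for x in inputs))
    print(f"collected {len(outputs)} results")

asyncio.run(main())
```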
If the caller program runs on a different type of event loop, such as PyQt or gevent, the invocations and joins need to be managed manually according to the application's logic. To facilitate this, RBLN SDK also offers AsyncRuntime.run, which is a straightforward asynchronous version of the run function.
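A sketch of this manual pattern, assuming run returns a handle object whose wait() method blocks until that inference completes (both are assumptions about the handle's shape):

```python
import numpy as np
import rebel  # RBLN SDK module (import name assumed)

runtime = rebel.AsyncRuntime("model.rbln")  # constructor form assumed

# Kick off all inferences without blocking; `run` is assumed to return
# a handle that can be joined later.
handles = [runtime.run(np.random.rand(1, 3, 224, 224).astype(np.float32))
           for _ in range(8)]

# Join manually, wherever the host event loop (PyQt, gevent, ...) allows.
outputs = [handle.wait() for handle in handles]
```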
Performance Optimization Considerations¶
When implementing concurrent processing, RBLN SDK provides flexibility in managing input preparation through the parallel
parameter in AsyncRuntime. While single-threaded input preparation (parallel=1
) is the default behavior, users can enable double buffering with two threads (parallel=2
) for potential performance improvements in specific scenarios.
Double buffering allows one thread to prepare the next input while the NPU processes the current one, which can be beneficial when:
- Input preprocessing is computationally intensive
- The model's inference time is comparable to input preparation time
- DMA transfer times are significant due to:
    - Large input/output tensors that require substantial data transfer between host and NPU
    - Models with multiple inputs/outputs requiring multiple DMA transfers
However, users should benchmark their specific use cases, as double buffering may not always provide performance benefits and could potentially impact stability with certain models.
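As a sketch, assuming the parallel parameter is passed when the runtime is created:

```python
import rebel  # RBLN SDK module (import name assumed)

# Enable double buffering: one host thread prepares the next input while
# the NPU processes the current one. Passing `parallel` to the constructor
# like this is an assumption about where the parameter is accepted.
runtime = rebel.AsyncRuntime("model.rbln", parallel=2)
```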
Examples¶
TensorFlow - UNet¶
UNet is renowned for its ability to provide pixel-level predictions and is widely applied in tasks such as medical image biomarker detection, instance segmentation, depth estimation, and more. Unlike simple classification models, UNet
takes an input image and generates output prediction images, necessitating higher transaction bandwidth. This, in turn, increases the communication time between the host and the accelerator.
In this example, we will demonstrate a depth estimation task following the official Keras Tutorial. The model has been trained on the NYU Depth V2 dataset.
Below are step-by-step examples for efficiently inferring UNet
using the RBLN NPU. To begin, run the following bash script to obtain the dataset:
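For example, a script along these lines (the URL is a placeholder; substitute the download link from the Keras tutorial, and note the archive is assumed to contain a nyu2_test directory with *_colors.png / *_depth.png pairs):

```bash
#!/bin/bash
# Fetch and extract the dataset used by the Keras depth estimation tutorial.
wget -O nyu_data.zip "<DATASET_URL_FROM_KERAS_TUTORIAL>"
unzip -q nyu_data.zip
```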
In this tutorial, we used 100 samples for the test:
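A sketch of selecting the 100 test samples, assuming the nyu2_test layout from the extraction step above:

```python
import glob

# Take the first 100 image/depth pairs from the test split
# (directory layout assumed from the extracted archive).
image_paths = sorted(glob.glob("./data/nyu2_test/*_colors.png"))[:100]
depth_paths = sorted(glob.glob("./data/nyu2_test/*_depth.png"))[:100]
print(f"{len(image_paths)} test samples")
```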
Once the sample data is prepared, we can simply compile the model as below:
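A sketch of the compilation step; compile_from_tf_function, its input_info format, and the save method are assumptions about the RBLN API, and build_unet stands in for the tutorial's model definition:

```python
import tensorflow as tf
import rebel  # RBLN SDK module (import name assumed)

HEIGHT, WIDTH = 256, 256

model = build_unet()  # trained UNet from the Keras tutorial (helper assumed)

# Wrap the Keras model in a tf.function and compile it for the NPU.
func = tf.function(lambda x: model(x))
compiled_model = rebel.compile_from_tf_function(
    func, input_info=[("x", [1, HEIGHT, WIDTH, 3], "float32")])
compiled_model.save("unet.rbln")  # save method assumed
```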
After creating the async runtime, we can run inferences concurrently as below:
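For example, a sketch reusing the compiled unet.rbln and the image_paths list from the previous steps (the AsyncRuntime constructor form is assumed):

```python
import asyncio
import numpy as np
import tensorflow as tf
import rebel  # RBLN SDK module (import name assumed)

def load_image(path, height=256, width=256):
    # Decode one RGB test image and normalize it to the model input shape.
    img = tf.io.decode_png(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (height, width)) / 255.0
    return np.expand_dims(img.numpy().astype(np.float32), axis=0)

async def infer_all(paths):
    runtime = rebel.AsyncRuntime("unet.rbln")  # constructor form assumed
    images = [load_image(p) for p in paths]
    # All samples are data-independent, so issue them concurrently and let
    # host preprocessing, DMA, and NPU compute overlap across requests.
    return await asyncio.gather(*(runtime.async_run(img) for img in images))

depth_maps = asyncio.run(infer_all(image_paths))  # image_paths from above
```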
Below is the complete code that includes all the steps above to execute UNet
using our asynchronous runtime APIs:
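A condensed sketch of the full flow is given below; as before, the rebel API names marked as assumed are illustrative, and build_unet stands in for the tutorial's model definition:

```python
"""End-to-end sketch: compile the Keras UNet and run 100 depth estimations
concurrently on the RBLN NPU. Names marked 'assumed' are illustrative."""
import asyncio
import glob

import numpy as np
import tensorflow as tf
import rebel  # RBLN SDK module (import name assumed)

HEIGHT, WIDTH = 256, 256

def load_image(path):
    # Decode and normalize one RGB test image to the model's input shape.
    img = tf.io.decode_png(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (HEIGHT, WIDTH)) / 255.0
    return np.expand_dims(img.numpy().astype(np.float32), axis=0)

def main():
    # 1. Prepare 100 test samples (paths follow the dataset layout above).
    paths = sorted(glob.glob("./data/nyu2_test/*_colors.png"))[:100]
    images = [load_image(p) for p in paths]

    # 2. Compile the trained UNet (model construction and weights come from
    #    the Keras tutorial; compile_from_tf_function is an assumed API).
    model = build_unet()  # helper assumed
    func = tf.function(lambda x: model(x))
    compiled = rebel.compile_from_tf_function(
        func, input_info=[("x", [1, HEIGHT, WIDTH, 3], "float32")])
    compiled.save("unet.rbln")  # save method assumed

    # 3. Run all samples concurrently through the async runtime.
    async def infer_all():
        runtime = rebel.AsyncRuntime("unet.rbln")  # constructor assumed
        return await asyncio.gather(*(runtime.async_run(x) for x in images))

    depth_maps = asyncio.run(infer_all())
    print(f"finished {len(depth_maps)} inferences")

if __name__ == "__main__":
    main()
```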
PyTorch - YOLOv8¶
YOLOv8 is a popular object detection model. In this example, we will use YOLOv8
with the COCO dataset. To access the dataset, please refer to the link.
Let's assume that the images and the annotation file for the 2017/validation set
are stored in the current directory:
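For instance, the standard COCO 2017 validation layout:

```
.
├── val2017/                          # 5,000 validation images (*.jpg)
└── annotations/
    └── instances_val2017.json        # object detection annotations
```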
Below is the complete code to execute YOLOv8
using our asynchronous runtime APIs:
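A condensed sketch of the full flow; the rebel API names marked as assumed are illustrative, and YOLOv8's forward pass returns multiple tensors, so the tracing and post-processing details may need adjustment for your setup:

```python
"""End-to-end sketch: compile YOLOv8 and run COCO val images concurrently
on the RBLN NPU. Names marked 'assumed' are illustrative."""
import asyncio
import glob

import cv2
import numpy as np
import torch
import rebel  # RBLN SDK module (import name assumed)
from ultralytics import YOLO

INPUT_SIZE = 640

def preprocess(path):
    # Resize to the model input, convert BGR->RGB, HWC->NCHW, scale to [0, 1].
    img = cv2.imread(path)
    img = cv2.resize(img, (INPUT_SIZE, INPUT_SIZE))
    img = img[:, :, ::-1].transpose(2, 0, 1).astype(np.float32) / 255.0
    return np.ascontiguousarray(img[None])

def main():
    # 1. Export YOLOv8 to TorchScript and compile it for the NPU
    #    (compile_from_torchscript and its input_info format are assumed).
    model = YOLO("yolov8n.pt").model.eval()
    scripted = torch.jit.trace(
        model, torch.zeros(1, 3, INPUT_SIZE, INPUT_SIZE))
    compiled = rebel.compile_from_torchscript(
        scripted, input_info=[("x", [1, 3, INPUT_SIZE, INPUT_SIZE], "float32")])
    compiled.save("yolov8n.rbln")  # save method assumed

    # 2. Run the validation images concurrently through the async runtime.
    paths = sorted(glob.glob("./val2017/*.jpg"))[:100]
    inputs = [preprocess(p) for p in paths]

    async def infer_all():
        runtime = rebel.AsyncRuntime("yolov8n.rbln")  # constructor assumed
        return await asyncio.gather(*(runtime.async_run(x) for x in inputs))

    outputs = asyncio.run(infer_all())
    # Decode boxes with confidence filtering and NMS afterwards.
    print(f"finished {len(outputs)} inferences")

if __name__ == "__main__":
    main()
```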