Quick Start¶
The RBLN SDK enables developers to run deep learning models efficiently on the RBLN Neural Processing Unit (NPU). This guide walks you through the complete workflow—from setup to inference—using practical examples with PyTorch and HuggingFace models. Follow these steps to get started:
- Setup & Installation
- Construct or Import a Model
- Compile a Model
- Model Inference
- Using the Model Serving Framework
1. Setup & Installation¶
System Requirements¶
- Ubuntu 22.04 LTS or later
- Python 3.9 - 3.12
- A system equipped with an RBLN NPU
- RBLN Driver
This tutorial assumes that the above system requirements have been met. You can check the RBLN Driver installation and RBLN NPU presence using the rbln-stat CLI as follows:
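For example, running rbln-stat with no arguments lists the NPU devices detected by the installed driver (the exact output columns depend on your RBLN Driver release):

```bash
# Check that the RBLN Driver is installed and at least one RBLN NPU is visible.
# Output format varies by driver version.
rbln-stat
```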
Install the RBLN SDK¶
The RBLN SDK is distributed as .whl packages. Please note that rebel-compiler and vllm-rbln require an RBLN Portal account.
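A minimal installation sketch is shown below. The package names are the ones referenced in this guide; the package index URL and credentials are provided through your RBLN Portal account, so treat this as a template rather than a copy-paste command:

```bash
# Install the compiler and the HuggingFace extension.
# rebel-compiler (and vllm-rbln) require RBLN Portal credentials; configure the
# package index and authentication as described in your Portal account.
pip install rebel-compiler optimum-rbln
```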
2. Construct or Import a Model¶
Before compiling a model for the RBLN NPU, you need to construct or import it using a supported deep learning framework. The RBLN SDK supports models from frameworks and libraries such as tensorflow, torch, transformers, and diffusers. This section provides examples for a non-HuggingFace PyTorch model (Option 1) and a HuggingFace Diffusers model (Option 2).
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, such as custom PyTorch or TensorFlow models, are compiled using the rebel-compiler.
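As a minimal sketch, the SimpleConvBNRelu module used in the compilation example of Section 3 might be defined as follows (the layer sizes and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SimpleConvBNRelu(nn.Module):
    """A small Conv -> BatchNorm -> ReLU block used as a compilation example."""
    def __init__(self, in_channels=3, out_channels=16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Switch to inference mode before compilation.
model = SimpleConvBNRelu().eval()
```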
Option 2: HuggingFace Model¶
HuggingFace models, built with libraries like transformers or diffusers, are compiled using optimum-rbln, an RBLN-optimized extension of HuggingFace APIs.
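A Diffusers checkpoint can be loaded with the plain diffusers API, as sketched below (the model ID is an illustrative assumption); in practice, the optimum-rbln compilation step in Section 3 can load the same checkpoint directly.

```python
from diffusers import StableDiffusionXLPipeline

# Load the original (uncompiled) pipeline from the HuggingFace Hub.
# The model ID is illustrative; use the checkpoint you intend to deploy.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)
```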
3. Compile a Model¶
To run a model on the RBLN NPU, it must first be compiled into a format optimized for the hardware. The compilation function you use depends on whether your model integrates with the HuggingFace ecosystem (e.g., transformers or diffusers libraries). Use rebel-compiler for non-HuggingFace models, such as custom PyTorch or TensorFlow models, and optimum-rbln for models leveraging HuggingFace APIs. Refer to Option 1 for non-HuggingFace models or Option 2 for HuggingFace-compatible models.
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, such as custom PyTorch or TensorFlow models, are compiled using the rebel-compiler API. This tool converts the model into an RBLN NPU-compatible format. For PyTorch models, like the SimpleConvBNRelu in the example below, use the rebel.compile_from_torch() function:
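A hedged sketch of this step is shown below. The input_info argument (input name, shape, dtype) and the save() call reflect typical rebel-compiler usage, but check the API reference for the exact signature in your SDK version:

```python
import torch
import rebel  # rebel-compiler

# Model defined in Section 2, Option 1.
model = SimpleConvBNRelu().eval()

# Compile the PyTorch module for the RBLN NPU. The input name, shape, and
# dtype below are illustrative assumptions for a single 224x224 RGB image.
compiled_model = rebel.compile_from_torch(
    model,
    input_info=[("x", [1, 3, 224, 224], torch.float32)],
)

# Persist the compiled artifact so it can be loaded later with rebel.Runtime().
compiled_model.save("simple_conv_bn_relu.rbln")
```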
Option 2: HuggingFace Model¶
HuggingFace models are compiled with optimum-rbln. The example below compiles a Stable Diffusion XL model using RBLNStableDiffusionXLPipeline(), which adapts the diffusers StableDiffusionXLPipeline class for the RBLN NPU.
Set the export argument to True to enable compilation:
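A sketch of that call, assuming the standard optimum-style from_pretrained()/save_pretrained() pattern and an illustrative model ID:

```python
from optimum.rbln import RBLNStableDiffusionXLPipeline

# export=True tells optimum-rbln to compile the checkpoint for the RBLN NPU.
# The model ID is illustrative; substitute the checkpoint you want to deploy.
pipe = RBLNStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    export=True,
)

# Save the compiled pipeline to a local directory for later inference.
pipe.save_pretrained("sdxl_rbln")
```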
4. Model Inference¶
Run inference on the RBLN NPU using a compiled model. The process depends on whether the model was compiled with rebel-compiler (non-HuggingFace) or optimum-rbln (HuggingFace), as outlined below.
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, compiled with rebel-compiler, use the rebel.Runtime() API to load and execute the model on the RBLN NPU.
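A minimal sketch, assuming the compiled artifact from Section 3 and a run() method on the runtime object (check the API reference for the exact call):

```python
import numpy as np
import rebel

# Load the compiled model onto the RBLN NPU.
module = rebel.Runtime("simple_conv_bn_relu.rbln")

# Run inference with an input matching the shape used at compile time.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = module.run(x)
print(output.shape)
```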
Option 2: HuggingFace Model¶
HuggingFace models, compiled with optimum-rbln, perform inference using the same optimum-rbln API. The example below runs inference on a Stable Diffusion XL model using RBLNStableDiffusionXLPipeline(), which adapts the diffusers StableDiffusionXLPipeline class for the RBLN NPU. Unlike compilation (see Option 2 of Section 3), set export=False and use the local path of the compiled model:
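A sketch of the inference step, reusing the directory saved in Section 3; the prompt and output handling follow the usual diffusers pipeline interface:

```python
from optimum.rbln import RBLNStableDiffusionXLPipeline

# export=False loads the already-compiled model from the local path.
pipe = RBLNStableDiffusionXLPipeline.from_pretrained("sdxl_rbln", export=False)

# Standard diffusers-style call: the pipeline returns generated images.
prompt = "A photo of an astronaut riding a horse on the moon"
image = pipe(prompt).images[0]
image.save("result.png")
```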
5. Using the Model Serving Framework¶
The RBLN SDK supports the following model serving frameworks to deploy compiled models in production environments, leveraging the RBLN NPU for efficient inference. Consult the linked documentation for detailed setup and configuration:
- NVIDIA Triton Inference Server – a multi-model, multi-framework inference engine (refer to NVIDIA Triton Inference Server Documentation)
- vllm-rbln – a high-performance inference engine for large language models (refer to vllm-rbln Documentation)
- TorchServe – a framework for serving PyTorch models in production (refer to TorchServe Documentation)