Quick Start¶
The RBLN SDK enables developers to run deep learning models efficiently on the RBLN Neural Processing Unit (NPU). This guide walks you through the complete workflow—from setup to inference—using practical examples with PyTorch and HuggingFace models. Follow these steps to get started:
- Setup & Installation
- Construct or Import a Model
- Compile a Model
- Model Inference
- Using the Model Serving Framework
1. Setup & Installation¶
System Requirements¶
- Ubuntu 22.04 LTS (Debian bullseye) or higher
- Python 3.9 - 3.12
- A system equipped with an RBLN NPU
- RBLN Driver
This tutorial assumes the above system requirements are met. You can check that the RBLN Driver is installed and that an RBLN NPU is present using the `rbln-stat` CLI.
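A minimal check, assuming `rbln-stat` was installed along with the RBLN Driver and is on your `PATH` (the exact output format may vary by release):

```bash
# Shows the RBLN NPU devices detected on this host and their current status.
rbln-stat
```

If the command lists your NPU device(s) without errors, the driver installation is working.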
Install the RBLN SDK¶
The RBLN SDK is distributed as `.whl` packages. Note that `rebel-compiler` and `vllm-rbln` require an RBLN Portal account.
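As an illustration only (the actual package sources and file names depend on your RBLN Portal account, SDK release, and Python version):

```bash
# Illustrative file/package names; replace them with the wheels provided
# for your environment through the RBLN Portal.
pip3 install ./rebel_compiler-<version>.whl
pip3 install optimum-rbln   # assumed to be installable from the public index
```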
2. Construct or Import a Model¶
Before compiling a model for the RBLN NPU, you need to construct or import it using a supported deep learning framework. The RBLN SDK supports models from frameworks and libraries such as `tensorflow`, `torch`, `transformers`, and `diffusers`. This section provides examples for a non-HuggingFace PyTorch model (Option 1) and a HuggingFace Diffusers model (Option 2).
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, such as custom PyTorch or TensorFlow models, are compiled using `rebel-compiler`.
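As the non-HuggingFace example for this guide, the sketch below defines a small custom PyTorch module; the `SimpleConvBNRelu` layer sizes and input channels are illustrative assumptions, and any `torch.nn.Module` with static input shapes can be used the same way. This model is compiled in Section 3.

```python
import torch
import torch.nn as nn

# Illustrative toy model (conv -> batch norm -> ReLU); the exact layer
# configuration is an assumption made for this guide.
class SimpleConvBNRelu(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Switch to eval mode so BatchNorm uses its running statistics during compilation.
model = SimpleConvBNRelu().eval()
```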
Option 2: HuggingFace Model¶
HuggingFace models, built with libraries like `transformers` or `diffusers`, are compiled using `optimum-rbln`, an RBLN-optimized extension of the HuggingFace APIs.
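The rest of this guide uses a Stable Diffusion XL checkpoint as the HuggingFace example. For reference only, this is how the original pipeline would be loaded with plain `diffusers`; the checkpoint ID is an assumed example, and with `optimum-rbln` this loading step is folded into compilation in Section 3, so it does not need to be run separately.

```python
# Reference only: plain HuggingFace/diffusers usage without the RBLN NPU.
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"  # assumed example checkpoint
)
```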
3. Compile a Model¶
To run a model on the RBLN NPU, it must first be compiled into a format optimized for the hardware. The compilation function you use depends on whether your model integrates with the HuggingFace ecosystem (e.g., the `transformers` or `diffusers` libraries). Use `rebel-compiler` for non-HuggingFace models, such as custom PyTorch or TensorFlow models, and `optimum-rbln` for models leveraging HuggingFace APIs. Refer to Option 1 for non-HuggingFace models or Option 2 for HuggingFace-compatible models.
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, such as custom PyTorch or TensorFlow models, are compiled using the `rebel-compiler` API, which converts the model into an RBLN NPU-compatible format. For PyTorch models, like the `SimpleConvBNRelu` model in the example below, use the `rebel.compile_from_torch()` function.
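A minimal sketch, assuming the `SimpleConvBNRelu` model defined in Section 2 and a 1x3x224x224 input; the input name, shape, output file name, and the exact form of the input specification are assumptions and may differ between SDK versions.

```python
import rebel  # provided by the rebel-compiler package

# Compile the PyTorch model for the RBLN NPU.
# The input specification (name, shape, dtype) is illustrative.
compiled_model = rebel.compile_from_torch(
    model,
    [("x", [1, 3, 224, 224], "float32")],
)

# Save the compiled artifact so it can be loaded later for inference.
compiled_model.save("simple_conv_bn_relu.rbln")
```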
Option 2: HuggingFace Model¶
HuggingFace models are compiled with `optimum-rbln`. The example below compiles a Stable Diffusion XL model using `RBLNStableDiffusionXLPipeline`, which adapts the `diffusers` `StableDiffusionXLPipeline` class for the RBLN NPU. Set the `export` argument to `True` to enable compilation.
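A minimal sketch, assuming the usual HuggingFace `from_pretrained()` / `save_pretrained()` pattern; the checkpoint ID and output directory name are illustrative assumptions.

```python
from optimum.rbln import RBLNStableDiffusionXLPipeline

# Load the original checkpoint and compile it for the RBLN NPU.
pipe = RBLNStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed example checkpoint
    export=True,  # export=True triggers compilation
)

# Save the compiled pipeline locally so it can be reloaded for inference.
pipe.save_pretrained("sdxl_compiled")
```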
4. Model Inference¶
Run inference on the RBLN NPU using a compiled model. The process depends on whether the model was compiled with `rebel-compiler` (non-HuggingFace) or `optimum-rbln` (HuggingFace), as outlined below.
Option 1: Non-HuggingFace Model¶
Non-HuggingFace models, compiled with `rebel-compiler`, use the `rebel.Runtime()` API to load and execute the model on the RBLN NPU.
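A minimal sketch, assuming the `simple_conv_bn_relu.rbln` artifact saved in Section 3; the file name, the dummy input, and the `tensor_type="pt"` / `run()` usage are assumptions to be checked against your SDK version.

```python
import torch
import rebel

# Load the compiled model onto the RBLN NPU.
# tensor_type="pt" (PyTorch tensors for inputs/outputs) is an assumption here.
module = rebel.Runtime("simple_conv_bn_relu.rbln", tensor_type="pt")

# Run inference with a dummy input matching the compiled shape.
x = torch.rand(1, 3, 224, 224)
output = module.run(x)
print(output.shape)
```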
Option 2: HuggingFace Model¶
HuggingFace models, compiled with `optimum-rbln`, also run inference through the `optimum-rbln` API. The example below runs inference on a Stable Diffusion XL model using `RBLNStableDiffusionXLPipeline`, which adapts the `diffusers` `StableDiffusionXLPipeline` class for the RBLN NPU. Unlike compilation (see Option 2 of Section 3), set `export=False` and pass the local path of the compiled model.
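A minimal sketch, assuming the pipeline was saved to `sdxl_compiled` in Section 3; the prompt and output file name are illustrative.

```python
from optimum.rbln import RBLNStableDiffusionXLPipeline

# Load the already-compiled pipeline from its local path.
pipe = RBLNStableDiffusionXLPipeline.from_pretrained(
    "sdxl_compiled",
    export=False,  # the model is already compiled, so skip compilation
)

# Generate an image on the RBLN NPU.
prompt = "A photo of an astronaut riding a horse on the moon"  # illustrative prompt
image = pipe(prompt).images[0]
image.save("generated_image.png")
```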
5. Using the Model Serving Framework¶
The RBLN SDK supports the following model serving frameworks to deploy compiled models in production environments, leveraging the RBLN NPU for efficient inference. Consult the linked documentation for detailed setup and configuration:
- NVIDIA Triton Inference Server – a multi-model, multi-framework inference engine (refer to NVIDIA Triton Inference Server Documentation)
- vllm-rbln – a high-performance inference engine for large language models (refer to vllm-rbln Documentation)
- TorchServe – a framework for building, shipping, and running production-ready PyTorch models (refer to TorchServe Documentation)