Model APIs Overview

The Optimum RBLN library provides APIs for running HuggingFace models on RBLN NPUs. This overview focuses on the API design and usage patterns.

API Design

Naming Convention

Optimum RBLN follows a simple naming pattern:

  • Model Classes: Add "RBLN" prefix to original HuggingFace class names

    # Original HuggingFace
    from transformers import LlamaForCausalLM
    
    # Optimum RBLN equivalent
    from optimum.rbln import RBLNLlamaForCausalLM
    

  • Configuration Classes: Add "Config" suffix to the RBLN model class

    # Configuration for RBLN model
    from optimum.rbln import RBLNLlamaForCausalLMConfig
    

Static Graph Compilation

The RBLN SDK compiles models into static computational graphs, which means certain parameters must be fixed at compile time. These include:

  • Batch size
  • Sequence length (for language models)
  • Tensor parallel size (for distributed execution)
  • Other model-specific parameters

The configuration classes (RBLN<ModelName>Config, e.g. RBLNLlamaForCausalLMConfig) allow you to specify these static values.
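
For illustration, the compile-time values shown in Method 2 below can be collected in a configuration object before compilation. This is a minimal sketch using RBLNLlamaForCausalLMConfig; the exact set of accepted fields depends on the model class.

from optimum.rbln import RBLNLlamaForCausalLMConfig

# These values are baked into the compiled graph; changing any of them
# requires recompiling the model.
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,             # fixed batch size
    max_seq_len=4096,         # maximum sequence length for the language model
    tensor_parallel_size=4,   # number of devices for tensor-parallel execution
)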

Supported Methods

Optimum RBLN preserves the same API interface as the original HuggingFace models:

  • Language models support .generate() method
  • Vision and other models support both direct call syntax (model(inputs)) and the .forward() method
  • Diffusers pipelines support the __call__() method (used with direct call syntax), as illustrated in the sketch below
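
For instance, a Diffusers pipeline compiled with Optimum RBLN is invoked the same way as its original counterpart. The sketch below assumes that RBLNStableDiffusionPipeline is available and that the referenced checkpoint can be downloaded; adjust the class and checkpoint to your use case.

from optimum.rbln import RBLNStableDiffusionPipeline

# Compile the Stable Diffusion pipeline for the NPU
# (class name assumed for illustration)
pipe = RBLNStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,
)

# Same __call__() interface as the original Diffusers pipeline
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")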

Usage Patterns

There are two ways to configure and use RBLN models:

Method 1: Using rbln_* parameters

from optimum.rbln import RBLNLlamaForCausalLM
from transformers import AutoTokenizer

# Load and compile the model with inline parameters
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,  # Set to True for compilation
    rbln_batch_size=1,
    rbln_max_seq_len=512,  # Optional, inferred from model config if not specified
    rbln_tensor_parallel_size=4  # For multi-device execution
)

# Load tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)

Method 2: Using an RBLNConfig object

from optimum.rbln import RBLNLlamaForCausalLM, RBLNLlamaForCausalLMConfig
from transformers import AutoTokenizer

# Define configuration through RBLNConfig object
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,
    max_seq_len=4096,
    tensor_parallel_size=4
)

# Load and compile the model with config object
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    rbln_config=config
)

# Load tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)
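
After compilation, the model can usually be saved and reloaded without recompiling, following the standard Optimum save_pretrained / from_pretrained workflow. The snippet below is a sketch under that assumption; consult the Optimum RBLN documentation for the exact flags.

# Save the compiled artifacts to a local directory (sketch; assumes the
# standard Optimum save/load workflow applies)
model.save_pretrained("llama-2-7b-chat-rbln")

# Later, reload the precompiled model without triggering recompilation
model = RBLNLlamaForCausalLM.from_pretrained(
    "llama-2-7b-chat-rbln",
    export=False,  # load the already-compiled model instead of recompiling
)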

Benefits

  • Drop-in Replacement: Replace original HuggingFace imports with RBLN equivalents
  • Same Familiar API: Use the same methods you're already familiar with
  • Fine-grained Control: Configure static parameters for maximum performance

For detailed model support information and hardware compatibility, please refer to the Optimum RBLN Overview.