Model APIs Overview

The Optimum RBLN library provides APIs for running HuggingFace models on RBLN NPUs. This overview focuses on the API design and usage patterns.

API Design

Naming Convention

Optimum RBLN follows a simple naming pattern:

  • Model Classes: Add "RBLN" prefix to original HuggingFace class names

    # Original HuggingFace
    from transformers import LlamaForCausalLM
    
    # Optimum RBLN equivalent
    from optimum.rbln import RBLNLlamaForCausalLM
    

  • Configuration Classes: Add "Config" suffix to the RBLN model class

    # Configuration for RBLN model
    from optimum.rbln import RBLNLlamaForCausalLMConfig
    

Static Graph Compilation

The RBLN SDK compiles models into static computational graphs, which means certain parameters must be fixed at compile time. These include:

  • Batch size
  • Sequence length (for language models)
  • Tensor parallel size (for distributed execution)
  • Other model-specific parameters

The configuration classes (RBLN<ModelName>Config, e.g. RBLNLlamaForCausalLMConfig) allow you to specify these static values.
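
For illustration, the compile-time values shown in Method 2 below can be collected in a configuration object before compilation. This is a minimal sketch using RBLNLlamaForCausalLMConfig; the exact set of accepted fields depends on the model class.

from optimum.rbln import RBLNLlamaForCausalLMConfig

# These values are baked into the compiled graph; changing any of them
# requires recompiling the model.
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,             # fixed batch size
    max_seq_len=4096,         # maximum sequence length for the language model
    tensor_parallel_size=4,   # number of devices for tensor-parallel execution
)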

Supported Methods

Optimum RBLN preserves the same API interface as the original HuggingFace models:

  • Language models support .generate() method
  • Vision and other models support both direct call syntax (model(inputs)) and the .forward() method
  • Diffusers pipelines support the __call__() method (used with direct call syntax), as illustrated in the sketch below
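
For instance, a Diffusers pipeline compiled with Optimum RBLN is invoked the same way as its original counterpart. The sketch below assumes that RBLNStableDiffusionPipeline is available and that the referenced checkpoint can be downloaded; adjust the class and checkpoint to your use case.

from optimum.rbln import RBLNStableDiffusionPipeline

# Compile the Stable Diffusion pipeline for the NPU
# (class name assumed for illustration)
pipe = RBLNStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,
)

# Same __call__() interface as the original Diffusers pipeline
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")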

Usage Patterns

There are two ways to configure and use RBLN models:

Method 1: Using rbln_* parameters

from optimum.rbln import RBLNLlamaForCausalLM
from transformers import AutoTokenizer

# Load and compile the model with inline parameters
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,  # Set to True for compilation
    rbln_batch_size=1,
    rbln_max_seq_len=512,  # Optional, inferred from model config if not specified
    rbln_tensor_parallel_size=4  # For multi-device execution
)

# Load tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)

Method 2: Using an RBLNConfig object

from optimum.rbln import RBLNLlamaForCausalLM, RBLNLlamaForCausalLMConfig
from transformers import AutoTokenizer

# Define configuration through RBLNConfig object
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,
    max_seq_len=4096,
    tensor_parallel_size=4
)

# Load and compile the model with config object
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    rbln_config=config
)

# Load tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)
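
After compilation, the model can usually be saved and reloaded without recompiling, following the standard Optimum save_pretrained / from_pretrained workflow. The snippet below is a sketch under that assumption; consult the Optimum RBLN documentation for the exact flags.

# Save the compiled artifacts to a local directory (sketch; assumes the
# standard Optimum save/load workflow applies)
model.save_pretrained("llama-2-7b-chat-rbln")

# Later, reload the precompiled model without triggering recompilation
model = RBLNLlamaForCausalLM.from_pretrained(
    "llama-2-7b-chat-rbln",
    export=False,  # load the already-compiled model instead of recompiling
)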

Benefits

  • Drop-in Replacement: Replace original HuggingFace imports with RBLN equivalents
  • Same Familiar API: Use the same methods you're already familiar with
  • Fine-grained Control: Configure static parameters for maximum performance

For detailed model support information and hardware compatibility, please refer to the Optimum RBLN Overview.