Model APIs Overview
The Optimum RBLN library provides APIs for running HuggingFace models on RBLN NPUs. This overview focuses on the API design and common usage patterns.
API Design
Naming Convention
Optimum RBLN follows a simple naming pattern:
- **Model Classes**: Add the "RBLN" prefix to the original HuggingFace class name

  ```python
  # Original HuggingFace
  from transformers import LlamaForCausalLM

  # Optimum RBLN equivalent
  from optimum.rbln import RBLNLlamaForCausalLM
  ```
- **Configuration Classes**: Add the "Config" suffix to the RBLN model class name

  ```python
  # Configuration class for the RBLN model
  from optimum.rbln import RBLNLlamaForCausalLMConfig
  ```
- **Auto Classes**: Add the "RBLN" prefix to the original Auto Model/Pipeline class names
  - AutoModel

    ```python
    # Original HuggingFace AutoModel
    from transformers import AutoModelForCausalLM

    # Optimum RBLN equivalent
    from optimum.rbln import RBLNAutoModelForCausalLM
    ```

  - AutoPipeline

    ```python
    # Original Diffusers AutoPipeline
    from diffusers import AutoPipelineForText2Image

    # Optimum RBLN equivalent
    from optimum.rbln import RBLNAutoPipelineForText2Image
    ```
Static Graph Compilation
The RBLN SDK works with static computational graphs, which means certain parameters must be fixed at compile time. These include:
- Batch size
- Sequence length (for language models)
- Tensor parallel size (for distributed execution)
- Other model-specific parameters
The configuration classes (`RBLN<ModelName>Config`, e.g. `RBLNLlamaForCausalLMConfig`) allow you to specify these static values.
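
As a concrete illustration, the minimal sketch below (using the Llama configuration class that appears later in this document) fixes all three common static values at compile time; running with a different batch size or a longer sequence would require recompiling:

```python
from optimum.rbln import RBLNLlamaForCausalLMConfig

# These values are baked into the compiled graph; the compiled
# model only accepts inputs that match them.
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,            # inputs must use exactly this batch size
    max_seq_len=4096,        # upper bound on prompt + generated tokens
    tensor_parallel_size=4,  # number of NPUs the graph is sharded across
)
```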
Supported Methods
Optimum RBLN preserves the same API interface as the original HuggingFace models:
- Language models support the `.generate()` method
- Vision and other models support both direct call syntax (`model(inputs)`) and the `.forward()` method
- Diffusers pipelines support the `__call__()` method (used with direct call syntax)
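
For instance, a compiled diffusers pipeline is invoked with direct call syntax just like the original. The sketch below assumes an illustrative Stable Diffusion checkpoint; depending on the pipeline, additional `rbln_*` compile-time parameters may be needed:

```python
from optimum.rbln import RBLNAutoPipelineForText2Image

# Compile the pipeline for the NPU (checkpoint ID is illustrative)
pipe = RBLNAutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,
)

# Direct call syntax invokes the pipeline's __call__() method
image = pipe("a watercolor painting of a lighthouse").images[0]
image.save("lighthouse.png")
```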
Usage Patterns
There are multiple ways to configure and use RBLN models:
Method 1: Using `rbln_*` parameters
```python
from optimum.rbln import RBLNLlamaForCausalLM
from transformers import AutoTokenizer

# Load and compile the model with inline parameters
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,                  # Set to True for compilation
    rbln_batch_size=1,
    rbln_max_seq_len=512,         # Optional, inferred from the model config if not specified
    rbln_tensor_parallel_size=4,  # For multi-device execution
)

# Load the tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Method 2: Using an `RBLNConfig` object (recommended)
```python
from optimum.rbln import RBLNLlamaForCausalLM, RBLNLlamaForCausalLMConfig
from transformers import AutoTokenizer

# Define the compile-time configuration through a config object
config = RBLNLlamaForCausalLMConfig(
    batch_size=1,
    max_seq_len=4096,
    tensor_parallel_size=4,
)

# Load and compile the model with the config object
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    rbln_config=config,
)

# Load the tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Method 3: Using Auto Classes
```python
from optimum.rbln import RBLNAutoModelForCausalLM
from transformers import AutoTokenizer

# Let the Auto class resolve the concrete RBLN model class,
# then load and compile with inline parameters
model = RBLNAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,                  # Set to True for compilation
    rbln_batch_size=1,
    rbln_max_seq_len=512,         # Optional, inferred from the model config if not specified
    rbln_tensor_parallel_size=4,  # For multi-device execution
)

# Load the tokenizer and prepare inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Benefits
- **Drop-in Replacement**: Replace original HuggingFace imports with their RBLN equivalents
- **Same Familiar API**: Use the same methods you're already familiar with
- **Fine-grained Control**: Configure static parameters for maximum performance
For detailed model support information and hardware compatibility, please refer to the Optimum RBLN Overview.