vLLM RBLN: A vLLM Plugin for RBLN NPU¶
vLLM RBLN (vllm-rbln) is a hardware plugin for the vLLM library that delivers high-performance large language model inference and serving on RBLN NPUs.
How to install¶
Before installing vLLM RBLN, ensure that you have installed the latest version of rebel-compiler.
You can install vLLM RBLN either directly from PyPI or by building it from source.
Install using PyPI¶
To install the latest release, install the vllm-rbln package from PyPI with pip.
Install from source¶
1. Clone the vllm and vllm-rbln repositories¶
Note that the vllm version you clone does not necessarily match the version that vllm-rbln requires, so check the required version before installing.
2. Install vllm¶
Setting VLLM_TARGET_DEVICE=empty allows you to build vLLM without specifying a target device during installation.
3. Install vllm-rbln¶
For details on the latest version and changes, refer to the Release Notes.
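Because the vllm version has to be the one vllm-rbln expects, it can help to confirm what actually ended up in the environment after installation. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumed to match the project names used on this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumption: these distribution names match the project names above.
    report = installed_versions(["rebel-compiler", "vllm", "vllm-rbln"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

Compare the reported versions against the Release Notes to check that the combination is a supported one.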
Design Overview¶
The initial design of vLLM RBLN integrates with optimum-rbln. In this setup, the user compiles the model with optimum-rbln ahead of time and passes the resulting directory to vLLM via the model parameter. This remains the default, stable workflow, and all tutorials and examples are based on it.
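The two steps of this workflow can be sketched as follows. This is an illustration under stated assumptions, not the canonical tutorial code: the helper names (`compile_llama`, `engine_kwargs`, `generate`) are hypothetical, the default batch size and sequence length are placeholders, and setting the engine limits to mirror the compile-time values is an assumption of this sketch. The third-party imports are done lazily, so the file can be loaded on a machine without the RBLN SDK.

```python
from pathlib import Path


def compile_llama(model_id, out_dir, batch_size=1, max_seq_len=8192,
                  tensor_parallel_size=1):
    """Step 1: compile a Llama checkpoint with optimum-rbln and save it."""
    # Lazy import: requires optimum-rbln and the RBLN compiler at call time.
    from optimum.rbln import RBLNLlamaForCausalLM

    model = RBLNLlamaForCausalLM.from_pretrained(
        model_id=model_id,
        export=True,  # compile for the NPU instead of loading a compiled model
        rbln_batch_size=batch_size,
        rbln_max_seq_len=max_seq_len,
        rbln_tensor_parallel_size=tensor_parallel_size,
    )
    model.save_pretrained(out_dir)
    return Path(out_dir)


def engine_kwargs(compiled_dir, batch_size=1, max_seq_len=8192):
    """Build the LLM() arguments; `model` points at the compiled directory."""
    return {
        "model": str(compiled_dir),
        # Assumption: engine limits mirror the values used at compile time.
        "max_num_seqs": batch_size,
        "max_model_len": max_seq_len,
    }


def generate(compiled_dir, prompts, **limits):
    """Step 2: load the compiled directory in vLLM and generate text."""
    # Lazy import: requires vllm and vllm-rbln at call time.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(compiled_dir, **limits))
    sampling = SamplingParams(temperature=0.0, max_tokens=64)
    return [out.outputs[0].text for out in llm.generate(prompts, sampling)]
```

On a host with an RBLN NPU, calling `compile_llama(...)` once and then `generate(...)` with the returned directory follows the order described above: compilation happens ahead of time, and vLLM only ever sees the compiled output.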
Starting with v0.10.4, vLLM RBLN can run the optimum-rbln compile step automatically as a beta feature: pass the engine parameters directly to LLM(), and compilation runs at engine startup, which removes the separate pre-compilation step.
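A rough sketch of what the beta path could look like from the caller's side is shown below. This page does not state which engine parameters the automatic compile step reads, so the ones used here (`max_num_seqs`, `max_model_len`, `tensor_parallel_size`) are standard vLLM `LLM()` arguments chosen as an assumption, and both helper names are hypothetical.

```python
def auto_compile_engine_kwargs(hub_model_id, batch_size=1, max_seq_len=8192,
                               tensor_parallel_size=1):
    """Build LLM() arguments for an uncompiled checkpoint (beta path)."""
    return {
        # An uncompiled checkpoint, not an optimum-rbln output directory.
        "model": hub_model_id,
        # Assumption: these standard vLLM arguments drive the compile step.
        "max_num_seqs": batch_size,
        "max_model_len": max_seq_len,
        "tensor_parallel_size": tensor_parallel_size,
    }


def start_engine(hub_model_id, **params):
    """Create the engine; compilation is expected to run during startup."""
    # Lazy import: requires vllm and vllm-rbln (v0.10.4 or later) at call time.
    from vllm import LLM

    return LLM(**auto_compile_engine_kwargs(hub_model_id, **params))
```

Since compilation happens inside `LLM()` in this mode, expect engine startup to take noticeably longer than when loading a pre-compiled directory.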
Migration to Torch Compile-Based Integration¶
We are actively migrating toward a new architecture that leverages torch.compile() and natively integrates with vLLM's APIs and model zoo. While the optimum-rbln path provides RBLN-specific model classes (e.g., RBLNLlamaForCausalLM) for each supported architecture, this new design slots into PyTorch's standard torch.compile backend mechanism, fitting more naturally into the upstream vLLM/PyTorch ecosystem.
With torch.compile(), the first run is a cold start, during which the model is compiled. The compiled result is then cached, so subsequent runs are warm starts: they skip compilation and reuse the optimized artifacts, which makes them faster.
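To make the backend mechanism and the cold/warm distinction concrete, here is a toy sketch that is not the vLLM RBLN backend. It registers a custom callable as a `torch.compile` backend and counts how often PyTorch invokes it. The names `counting_backend`, `compile_calls`, and `cold_then_warm` are hypothetical, and the sketch only shows reuse within a single process; the cache described above, which carries compiled artifacts across runs, is not modeled here.

```python
compile_calls = []


def counting_backend(gm, example_inputs):
    """A torch.compile backend: takes a captured FX graph, returns a callable."""
    # Record one entry per compiled graph (here: its number of inputs).
    compile_calls.append(len(example_inputs))
    # Run the captured graph unchanged; a real backend would lower it
    # to device-specific code at this point.
    return gm.forward


def cold_then_warm():
    """Call a compiled function twice; report backend calls after each call."""
    import torch  # lazy import so the backend above stays importable alone

    @torch.compile(backend=counting_backend)
    def scaled_sum(x):
        return (x * 2.0).sum()

    x = torch.ones(4)
    scaled_sum(x)  # cold: the graph is captured and handed to the backend
    after_first = len(compile_calls)
    scaled_sum(x)  # warm: same input shape and dtype, compiled graph is reused
    return after_first, len(compile_calls)
```

If the second call reuses the compiled graph, the two counts returned by `cold_then_warm()` are equal, which is the in-process analogue of a warm start.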
Tutorials & Features¶
To help you get started with vLLM RBLN, we provide tutorials that demonstrate its capabilities and deployment options:
- Model Tutorials: Step-by-step tutorials demonstrating how to run vLLM RBLN with representative models
- Feature Descriptions: Explanations of key vLLM RBLN capabilities, covering various execution modes and server functionalities
Supported Models¶
The following tables list the models currently supported by vLLM RBLN.
Decoder-only Models¶
| Architecture | Example Model Code |
|---|---|
| RBLNLlamaForCausalLM | Llama-2/3 |
| RBLNGemmaForCausalLM | Gemma |
| RBLNGemma2ForCausalLM | Gemma2 |
| RBLNPhiForCausalLM | Phi-2 |
| RBLNOPTForCausalLM | OPT |
| RBLNGPT2LMHeadModel | GPT2 |
| RBLNMistralForCausalLM | Mistral |
| RBLNExaoneForCausalLM | EXAONE-3/3.5 |
| RBLNQwen2ForCausalLM | Qwen2/2.5 |
| RBLNQwen3ForCausalLM | Qwen3 |
| RBLNGptOssForConditionalGeneration | gpt-oss |
Encoder-Decoder Models¶
| Architecture | Example Model Code |
|---|---|
| RBLNWhisperForConditionalGeneration | Whisper |
Change
As of vLLM RBLN v0.10.1, the vLLM V0 engine has been deprecated.
Consequently, Whisper is now the only supported encoder–decoder model, and support for all other encoder–decoder models has been removed.
For more information, see the vLLM V1 User Guide.
Multimodal Language Models¶
| Architecture | Example Model Code |
|---|---|
| RBLNLlavaNextForConditionalGeneration | LLaVA-NeXT |
| RBLNQwen2VLForConditionalGeneration | Qwen2-VL |
| RBLNQwen2_5_VLForConditionalGeneration | Qwen2.5-VL |
| RBLNQwen3VLForConditionalGeneration | Qwen3-VL |
| RBLNIdefics3ForConditionalGeneration | Idefics3 |
| RBLNGemma3ForConditionalGeneration | Gemma3 |
| RBLNLlavaForConditionalGeneration | LLaVA |
| RBLNBlip2ForConditionalGeneration | BLIP2 |
| RBLNPaliGemmaForConditionalGeneration | PaliGemma |
| RBLNPaliGemmaForConditionalGeneration | PaliGemma2 |
Pooling Models¶
| Architecture | Example Model Code |
|---|---|
| RBLNT5EncoderModel | T5Encoder-based |
| RBLNBertModel | BERT-based |
| RBLNRobertaModel | RoBERTa-based |
| RBLNXLMRobertaModel | XLM-RoBERTa-based |
| RBLNXLMRobertaForSequenceClassification | XLM-RoBERTa-based |
| RBLNRobertaForSequenceClassification | RoBERTa-based |
| RBLNQwen3ForCausalLM | Qwen3-based |
| RBLNQwen3Model | Qwen3-based |
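The tables above can also be used programmatically, for example to check whether a checkpoint's architecture is covered before attempting to compile it. The sketch below copies the architecture names from the tables. It assumes, based on the naming pattern visible in the tables, that the RBLN class name is the Hugging Face architecture name (the `architectures` entry in a model's `config.json`) prefixed with `RBLN`; the helper names are hypothetical.

```python
SUPPORTED = {
    "decoder-only": (
        "RBLNLlamaForCausalLM", "RBLNGemmaForCausalLM", "RBLNGemma2ForCausalLM",
        "RBLNPhiForCausalLM", "RBLNOPTForCausalLM", "RBLNGPT2LMHeadModel",
        "RBLNMistralForCausalLM", "RBLNExaoneForCausalLM", "RBLNQwen2ForCausalLM",
        "RBLNQwen3ForCausalLM", "RBLNGptOssForConditionalGeneration",
    ),
    "encoder-decoder": ("RBLNWhisperForConditionalGeneration",),
    "multimodal": (
        "RBLNLlavaNextForConditionalGeneration",
        "RBLNQwen2VLForConditionalGeneration",
        "RBLNQwen2_5_VLForConditionalGeneration",
        "RBLNQwen3VLForConditionalGeneration",
        "RBLNIdefics3ForConditionalGeneration",
        "RBLNGemma3ForConditionalGeneration",
        "RBLNLlavaForConditionalGeneration",
        "RBLNBlip2ForConditionalGeneration",
        "RBLNPaliGemmaForConditionalGeneration",
    ),
    "pooling": (
        "RBLNT5EncoderModel", "RBLNBertModel", "RBLNRobertaModel",
        "RBLNXLMRobertaModel", "RBLNXLMRobertaForSequenceClassification",
        "RBLNRobertaForSequenceClassification", "RBLNQwen3ForCausalLM",
        "RBLNQwen3Model",
    ),
}


def rbln_class_name(hf_architecture):
    """Assumed convention: prefix the Hugging Face architecture with 'RBLN'."""
    return "RBLN" + hf_architecture


def categories_for(hf_architecture):
    """Return the table categories that list this architecture, if any."""
    name = rbln_class_name(hf_architecture)
    return [category for category, names in SUPPORTED.items() if name in names]
```

An empty list means the architecture does not appear in any of the tables; an architecture such as `Qwen3ForCausalLM` is listed under more than one category.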


