vllm-rbln: A vLLM Plugin for RBLN NPU

vllm-rbln is a hardware plugin for the vLLM library that delivers high-performance large language model inference and serving on RBLN NPUs.
How to install

Before installing vllm-rbln, ensure that you have installed the latest versions of the required dependencies, including rebel-compiler and optimum-rbln.
You can install vllm-rbln directly from PyPI, or access the source code via the vllm-rbln GitHub repository, as it is open source.
To install the latest release via pip:
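```bash
pip install vllm-rbln
```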
For details on the latest version and changes, refer to the Release Notes.
Note

As of version v0.8.1, vllm-rbln has migrated to the new plugin system. From this version onward, installing vllm-rbln will automatically pull in vllm as a dependency.
For earlier versions (e.g., v0.8.0), the vllm-rbln package does not depend on the vllm package, and duplicate installations may lead to operational issues. If you installed vllm after vllm-rbln, please reinstall vllm-rbln to ensure proper functionality.
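For example, one way to perform a clean reinstall with pip:

```bash
pip uninstall -y vllm-rbln
pip install vllm-rbln
```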
Tutorials

To help users get started with vllm-rbln, we have created multiple comprehensive tutorials demonstrating its capabilities and diverse deployment options:

- How to use the vllm-rbln plugin with vLLM
- How to use the vllm-rbln plugin to build an OpenAI-compatible API server
Design Overview

The initial design of vllm-rbln integrates with optimum-rbln. In this setup, models are first compiled using optimum-rbln, and the resulting compiled model directory is then used by vLLM via the model parameter. This remains the default implementation for now, offering a stable and proven workflow while a new architecture is under active development.
All tutorials are currently based on this design.
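As a rough illustration of this two-step workflow, the sketch below compiles a Llama model with optimum-rbln and then hands the compiled directory to vLLM. The model ID, the compile options (rbln_batch_size, rbln_max_seq_len), and the engine arguments are illustrative assumptions, not prescribed values; refer to the tutorials above for the exact, version-specific configuration.

```python
# Step 1: compile the model with optimum-rbln and save the compiled artifacts.
from optimum.rbln import RBLNLlamaForCausalLM

compiled_dir = "rbln-llama-3-8b"  # directory that will hold the compiled model
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder Hugging Face model ID
    export=True,                            # compile for the RBLN NPU
    rbln_batch_size=4,                      # illustrative compile-time options
    rbln_max_seq_len=8192,
)
model.save_pretrained(compiled_dir)

# Step 2: point vLLM's `model` parameter at the compiled model directory.
from vllm import LLM, SamplingParams

llm = LLM(model=compiled_dir, max_num_seqs=4, max_model_len=8192)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```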
Migration to Torch Compile-Based Integration

We are actively migrating toward a new architecture that leverages torch.compile() and natively integrates with vLLM's APIs and model zoo. This new design eliminates the need for a separate compilation step, offering a more seamless and intuitive user experience through standard vLLM workflows.
With torch.compile(), the first run is a cold start, during which the model is compiled. Once compiled, the result is cached, so subsequent runs are warm starts that are faster and benefit from the optimized compiled artifacts.
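As a rough sketch of the workflow this new design targets, the example below uses only the standard vLLM API, with a placeholder Hugging Face model ID and no separate compilation step; the options actually supported on RBLN NPUs will depend on the plugin version.

```python
from vllm import LLM, SamplingParams

# Standard vLLM workflow: the first run is a cold start in which the model is
# compiled via torch.compile(); later runs reuse the cached artifacts (warm start).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model ID
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```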
Supported Models

The following tables present the comprehensive lineup of models currently supported by vllm-rbln.
Generative Models

| Architecture | Example Models |
|---|---|
| RBLNLlamaForCausalLM | Llama-2/3 |
| RBLNGemmaForCausalLM | Gemma |
| RBLNPhiForCausalLM | Phi-2 |
| RBLNOPTForCausalLM | OPT |
| RBLNGPT2LMHeadModel | GPT2 |
| RBLNMidmLMHeadModel | Mi:dm |
| RBLNMistralForCausalLM | Mistral |
| RBLNExaoneForCausalLM | EXAONE-3/3.5 |
| RBLNQwen2ForCausalLM | Qwen2/2.5 |
| RBLNBartForConditionalGeneration | BART |
| RBLNT5ForConditionalGeneration | T5 |
| RBLNBlip2ForConditionalGeneration | BLIP2 |
Multimodal Language Models

| Architecture | Example Models |
|---|---|
| RBLNLlavaNextForConditionalGeneration | LLaVA-NeXT |
| RBLNQwen2_5_VLForConditionalGeneration | Qwen2.5-VL |
| RBLNIdefics3ForConditionalGeneration | Idefics3 |
| RBLNGemma3ForConditionalGeneration | Gemma3 |
Pooling Models

| Architecture | Example Models |
|---|---|
| RBLNT5EncoderModel | T5Encoder-based |
| RBLNBertModel | BERT-based |
| RBLNRobertaModel | RoBERTa-based |
| RBLNXLMRobertaModel | XLM-RoBERTa-based |
| RBLNXLMRobertaForSequenceClassification | XLM-RoBERTa-based |
| RBLNRobertaForSequenceClassification | RoBERTa-based |