Configuration Guide

Overview

vLLM RBLN uses the same entry points as upstream vLLM: LLM(...), vllm serve, and EngineArgs. Most arguments behave the same way, with a few RBLN-specific constraints.

Additionally, RBLN-specific concepts that have no upstream counterpart are exposed through two mechanisms:

  • additional_config parameter
  • VLLM_RBLN_* environment variables

Engine Arguments

Engine arguments are configured the same way as in upstream vLLM. The arguments below carry RBLN-specific constraints.

block_size (Required)

Together with max_model_len, selects the attention mode:

  • block_size == max_model_len: Naive Attention.
  • max_model_len is a multiple of block_size (with block_size smaller than max_model_len): Flash Attention.

Exception: for pooling and encoder-decoder models (e.g. Whisper), block_size is auto-derived to equal max_model_len and need not be set. Qwen3-backbone pooling models can override with an explicit smaller value for Flash Attention. See Attention Modes for details.
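
The selection rule above can be sketched as a small check. This is only an illustration: attention_mode is a hypothetical helper, not part of vLLM RBLN, and the error it raises for non-matching values is an assumption about how an invalid combination might be rejected.

```python
def attention_mode(block_size: int, max_model_len: int) -> str:
    """Return the attention mode implied by block_size and max_model_len.

    Hypothetical helper for illustration only; not a vLLM RBLN API.
    """
    if block_size <= 0 or max_model_len <= 0:
        raise ValueError("block_size and max_model_len must be positive")
    # Equal values select Naive Attention. Check this first, because an
    # equal value is also trivially a multiple.
    if block_size == max_model_len:
        return "naive"
    # A larger max_model_len that divides evenly selects Flash Attention.
    if max_model_len % block_size == 0:
        return "flash"
    # Assumption: any other combination is treated as invalid.
    raise ValueError(
        f"max_model_len={max_model_len} is not a multiple of "
        f"block_size={block_size}"
    )


print(attention_mode(4096, 4096))
print(attention_mode(4096, 32768))
```

With block_size=4096 and max_model_len=32768, as in the complete example further down, the helper reports Flash Attention.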

max_model_len

Default: Model's max sequence length.

Combined with block_size, selects the attention mode (see above).

max_num_seqs

Default: 1.

tensor_parallel_size (Fixed at 1)

Configure tensor parallelism on RBLN NPUs through the VLLM_RBLN_TP_SIZE environment variable. This engine argument accepts only 1, and any other value raises an error. vLLM-level tensor parallelism through tensor_parallel_size is planned for a future release, once RBLN-CCL integration lands.

A complete example:

from vllm import LLM

llm = LLM(
    model="<huggingface-model-id>",
    block_size=4096,
    max_model_len=32768,
    max_num_seqs=1,
    tensor_parallel_size=1,
)

Additional Configuration Keys

These RBLN-specific keys are passed via the additional_config parameter of LLM(...). Some keys live at the top level of additional_config, while others are nested under additional_config["rbln_config"]. See RblnModelConfig in optimum-rbln for the complete rbln_config schema.

prefix_block_size

Default: 128.

Set via additional_config["prefix_block_size"]. Controls the prefix-cache hit granularity when Automatic Prefix Caching is enabled. The value must be a multiple of prefill_chunk_size and a divisor of kvcache_block_size. See Automatic Prefix Caching for details.
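
The two divisibility constraints can be checked up front, before starting the engine. The sketch below is illustrative: check_prefix_block_size is a hypothetical helper, and the prefill_chunk_size and kvcache_block_size values used in the call are assumed example numbers, not documented defaults.

```python
def check_prefix_block_size(
    prefix_block_size: int,
    prefill_chunk_size: int,
    kvcache_block_size: int,
) -> None:
    """Raise ValueError if prefix_block_size breaks either constraint.

    Hypothetical helper for illustration only; not a vLLM RBLN API.
    """
    # Constraint 1: prefix_block_size is a multiple of prefill_chunk_size.
    if prefix_block_size % prefill_chunk_size != 0:
        raise ValueError(
            f"prefix_block_size={prefix_block_size} must be a multiple of "
            f"prefill_chunk_size={prefill_chunk_size}"
        )
    # Constraint 2: prefix_block_size is a divisor of kvcache_block_size.
    if kvcache_block_size % prefix_block_size != 0:
        raise ValueError(
            f"prefix_block_size={prefix_block_size} must divide "
            f"kvcache_block_size={kvcache_block_size}"
        )


# Assumed example values: chunk size 128, KV cache block size 4096.
check_prefix_block_size(256, prefill_chunk_size=128, kvcache_block_size=4096)
```

Under those assumed values, 256 passes both checks, while 192 fails the first and 384 fails the second.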

device

Default: Mapped automatically to devices 0 through tensor_parallel_size - 1.

Set via additional_config["rbln_config"][...]["device"] under each model or submodule it configures, so a single compiled model can route different submodules to different device sets (see example below). The placement is decided at runtime, so the same compiled artifact can be redeployed to different hardware without recompiling.

decoder_batch_sizes

Default: A single decoder compiled at max_num_seqs.

Set via additional_config["rbln_config"]["decoder_batch_sizes"]. Compiles one decoder graph per batch size in the list, so the engine can serve a request by picking the smallest decoder that still fits the in-flight request count at runtime. All values must be ≤ max_num_seqs. The list is sorted in descending order, and max_num_seqs is added automatically if missing. See Inference with Dynamic Batch Sizes for details.

from vllm import LLM

llm = LLM(
    ...,
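    additional_config={
        # Top-level key of additional_config.
        "prefix_block_size": 256,
        # Nested keys live under "rbln_config", one entry per submodule.
        "rbln_config": {
            "visual": {
                "device": [0],  # this submodule runs on device 0
            },
            "language_model": {
                "device": [1, 2, 3, 4],  # this submodule spans four devices
            },
        },
    },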
)
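
The decoder_batch_sizes behavior described above (validation against max_num_seqs, descending sort, automatic addition of max_num_seqs, and picking the smallest decoder that fits) can be sketched as follows. Both functions are hypothetical illustrations rather than the engine's actual code. Removing duplicates and rejecting values below 1 are assumptions added here.

```python
def normalize_decoder_batch_sizes(
    batch_sizes: list[int], max_num_seqs: int
) -> list[int]:
    """Validate the list, add max_num_seqs if missing, sort descending.

    Hypothetical helper for illustration only.
    """
    for size in batch_sizes:
        # Every value must be <= max_num_seqs; the lower bound of 1 is
        # an assumption of this sketch.
        if size < 1 or size > max_num_seqs:
            raise ValueError(
                f"batch size {size} must be in [1, {max_num_seqs}]"
            )
    # The set union adds max_num_seqs when absent and drops duplicates.
    return sorted(set(batch_sizes) | {max_num_seqs}, reverse=True)


def pick_decoder(batch_sizes: list[int], in_flight: int) -> int:
    """Pick the smallest compiled batch size that fits the in-flight count."""
    fitting = [size for size in batch_sizes if size >= in_flight]
    if not fitting:
        raise ValueError(f"no decoder fits {in_flight} in-flight requests")
    return min(fitting)


sizes = normalize_decoder_batch_sizes([1, 4], max_num_seqs=8)
print(sizes)                   # [8, 4, 1]
print(pick_decoder(sizes, 3))  # 4
```

In an actual configuration the list is supplied as additional_config["rbln_config"]["decoder_batch_sizes"], alongside the keys shown in the example above.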

Environment Variables

RBLN-specific environment variables are prefixed with VLLM_RBLN_. Set them on the process that launches the engine. For upstream vLLM environment variables, see the vLLM Environment Variables Guide.

VLLM_RBLN_TP_SIZE

Default: 1.

Sets the tensor-parallel size for RSD (Rebellions Scalable Design), the compiler-side parallelism that reduces runtime overhead. See Model Parallelism for context.

VLLM_RBLN_SAMPLER

Default: True.

Runs sampling on the RBLN NPU via the RBLNSampler. Set to False to fall back to the upstream vLLM sampler, which runs on CPU.

VLLM_RBLN_ENABLE_WARM_UP

Default: True.

Runs a model warm-up pass at engine startup so the first request is not delayed by one-time initialization. Set to False to skip warm-up.
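
As a sketch of how the three variables might be set from Python, the snippet below writes them into os.environ before the engine is created. The values are illustrative. The string form "False" mirrors the wording above, but the exact accepted spellings are an assumption, as is the cautious choice to set the variables before vllm is imported. build_engine is a hypothetical wrapper that reuses the arguments from the complete example earlier in this guide.

```python
import os

# Illustrative values; set them in the process that launches the engine.
os.environ["VLLM_RBLN_TP_SIZE"] = "4"             # RSD tensor-parallel size
os.environ["VLLM_RBLN_SAMPLER"] = "False"         # fall back to the CPU sampler
os.environ["VLLM_RBLN_ENABLE_WARM_UP"] = "False"  # skip the warm-up pass


def build_engine():
    """Create the engine after the environment is prepared (hypothetical)."""
    # Imported lazily so the variables above are already set (assumption:
    # setting them before import is the safest order).
    from vllm import LLM

    return LLM(
        model="<huggingface-model-id>",
        block_size=4096,
        max_model_len=32768,
        max_num_seqs=1,
        tensor_parallel_size=1,  # must stay 1; use VLLM_RBLN_TP_SIZE instead
    )
```

The same variables can equally be exported in the shell that runs vllm serve.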