Configuration Guide

Overview

vLLM RBLN uses the same entry points as upstream vLLM: LLM(...), vllm serve, and EngineArgs. Most arguments behave the same way; a few carry RBLN-specific constraints.

Additionally, RBLN-specific concepts that have no upstream counterpart are exposed through two mechanisms:

additional_config parameter
VLLM_RBLN_* environment variables

Engine Arguments

Engine arguments are configured the same way as in upstream vLLM. The arguments below carry RBLN-specific constraints.

block_size (Required)

Together with max_model_len, selects the attention mode:

block_size == max_model_len: Naive Attention.
max_model_len is a multiple of block_size and larger than it: Flash Attention.

Exception: for pooling and encoder-decoder models (e.g. Whisper), block_size is auto-derived to equal max_model_len and need not be set. Qwen3-backbone pooling models can override with an explicit smaller value for Flash Attention. See Attention Modes for details.
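The selection rule above can be sketched as a small helper. This is only an illustration of the documented rule, not the plugin's code: the function name select_attention_mode is hypothetical, and raising ValueError for a max_model_len that is not a multiple of block_size is an assumption, since this guide does not say how such a combination is handled.

```python
def select_attention_mode(block_size: int, max_model_len: int) -> str:
    """Return the attention mode implied by block_size and max_model_len.

    Hypothetical helper illustrating the rule in this guide.
    """
    if block_size <= 0 or max_model_len <= 0:
        raise ValueError("block_size and max_model_len must be positive")
    if block_size == max_model_len:
        return "naive"  # one block spans the whole context
    if max_model_len % block_size == 0:
        return "flash"  # the context is split into several equal blocks
    # Assumption: other combinations are rejected; the guide does not specify this.
    raise ValueError("max_model_len must be a multiple of block_size")


print(select_attention_mode(4096, 32768))
print(select_attention_mode(4096, 4096))
```

With the values used in the example later in this section (block_size=4096, max_model_len=32768), the helper reports Flash Attention.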

max_model_len

Default: Model's max sequence length.

Combined with block_size, selects the attention mode (see above).

max_num_seqs

Default: 1.

tensor_parallel_size (Fixed at 1)

This vLLM engine argument selects vLLM-level tensor parallelism, which splits the model across multiple worker processes that communicate over Rebellions Collective Communications Library (RCCL). On RBLN NPUs it accepts only 1, and any other value raises an error. Multi-process tensor parallelism is planned for a future release, once RCCL integration lands.

To run a model across multiple NPU devices within a single process, set the VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK environment variable instead. The RBLN compiler partitions the model across those devices and runs them as one process. This mechanism, Rebellions Scalable Design (RSD), is separate from vLLM-level tensor_parallel_size.

A complete example:

from vllm import LLM

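llm = LLM(
    model="<huggingface-model-id>",
    block_size=4096,         # max_model_len is a multiple of block_size: Flash Attention
    max_model_len=32768,
    max_num_seqs=1,          # same as the default
    tensor_parallel_size=1,  # the only value accepted on RBLN NPUs
)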

Additional Configuration Keys

RBLN-specific keys passed via the additional_config parameter of LLM(...). Some keys live at the top level of additional_config, while others are nested under additional_config["rbln_config"]. See RblnModelConfig in optimum-rbln for the complete rbln_config schema.

prefix_block_size

Default: 128.

Set via additional_config["prefix_block_size"]. Controls the prefix-cache hit granularity when Automatic Prefix Caching is enabled. The value must be a multiple of prefill_chunk_size and a divisor of kvcache_block_size. See Automatic Prefix Caching for details.
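The divisibility constraint on prefix_block_size can be checked before launching the engine. The helper below is a minimal sketch under stated assumptions: the function name is hypothetical, the numeric values are illustrative only, and how prefill_chunk_size and kvcache_block_size are obtained for a given deployment is not covered here.

```python
def is_valid_prefix_block_size(
    prefix_block_size: int,
    prefill_chunk_size: int,
    kvcache_block_size: int,
) -> bool:
    """Check the documented constraint on prefix_block_size.

    Hypothetical helper: the value must be a multiple of prefill_chunk_size
    and a divisor of kvcache_block_size.
    """
    if min(prefix_block_size, prefill_chunk_size, kvcache_block_size) <= 0:
        return False
    is_multiple_of_chunk = prefix_block_size % prefill_chunk_size == 0
    divides_kvcache_block = kvcache_block_size % prefix_block_size == 0
    return is_multiple_of_chunk and divides_kvcache_block


# Illustrative values only; they are not defaults taken from this guide.
print(is_valid_prefix_block_size(256, 128, 4096))  # multiple of 128, divides 4096
print(is_valid_prefix_block_size(192, 128, 4096))  # not a multiple of 128
```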

device

Default: Mapped automatically to devices 0 through VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK - 1.

Set via additional_config["rbln_config"][...]["device"] under each model or submodule it configures, so a single compiled model can route different submodules to different device sets (see example below). The placement is decided at runtime, so the same compiled artifact can be redeployed to different hardware without recompiling.

decoder_batch_sizes

Default: A single decoder compiled at max_num_seqs.

Set via additional_config["rbln_config"]["decoder_batch_sizes"]. Compiles one decoder graph per batch size in the list, so the engine can serve a request by picking the smallest decoder that still fits the in-flight request count at runtime. All values must be ≤ max_num_seqs. The list is sorted in descending order, and max_num_seqs is added automatically if missing. See Inference with Dynamic Batch Sizes for details.
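The normalization and runtime selection described for decoder_batch_sizes can be sketched as follows. Both function names are hypothetical and the code only mirrors the behavior stated above; dropping duplicate entries and rejecting values below 1 are assumptions.

```python
from typing import List


def normalize_decoder_batch_sizes(sizes: List[int], max_num_seqs: int) -> List[int]:
    """Validate the list, add max_num_seqs if missing, sort in descending order."""
    for size in sizes:
        # Assumption: sizes below 1 are invalid; the guide only states the upper bound.
        if size < 1 or size > max_num_seqs:
            raise ValueError(f"batch size {size} must be in [1, {max_num_seqs}]")
    # Assumption: duplicates are dropped.
    return sorted(set(sizes) | {max_num_seqs}, reverse=True)


def pick_decoder_batch_size(sizes: List[int], in_flight: int) -> int:
    """Pick the smallest compiled batch size that still fits the in-flight requests."""
    candidates = [size for size in sizes if size >= in_flight]
    if not candidates:
        raise ValueError("in-flight request count exceeds the largest decoder")
    return min(candidates)


sizes = normalize_decoder_batch_sizes([1, 4, 2], max_num_seqs=8)
print(sizes)
print(pick_decoder_batch_size(sizes, in_flight=3))
```

Here max_num_seqs=8 is appended because it was missing from the list, and three in-flight requests are served by the batch-size-4 decoder rather than the batch-size-8 one.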

from vllm import LLM

llm = LLM(
    ...,
    additional_config={
        "prefix_block_size": 256,
        "rbln_config": {
            "visual": {
                "device": [0],
            },
            "language_model": {
                "device": [1, 2, 3, 4],
            },
        },
    },
)

Environment Variables

RBLN-specific environment variables prefixed with VLLM_RBLN_. Set them on the process that launches the engine. For upstream vLLM environment variables, see the vLLM Environment Variables Guide.

VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK

Default: 1.

Sets the number of RBLN NPU devices grouped into a single local rank, that is, one engine process. The RBLN compiler partitions the model across these devices and runs them as one process. RBLN calls this mechanism RSD, and it reduces runtime overhead. This differs from vLLM-level tensor_parallel_size, which splits the model across separate processes over RCCL. See Model Parallelism for context.

Note

VLLM_RBLN_TP_SIZE is the former name of this variable and is deprecated. It is still honored as a fallback when VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK is unset, but will be removed in a future release.
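The fallback order in the note can be sketched as a lookup. This is an illustration of the documented precedence, not the plugin's actual parsing code: both function names are hypothetical, and default_device_ids simply restates the automatic device mapping (0 through N - 1) described under the device key.

```python
import os
from typing import List, Mapping, Optional


def resolve_num_devices(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the device count per local rank using the documented precedence."""
    env = os.environ if env is None else env
    value = env.get("VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK")
    if value is None:
        # Deprecated name, honored only when the new variable is unset.
        value = env.get("VLLM_RBLN_TP_SIZE")
    return int(value) if value is not None else 1  # documented default: 1


def default_device_ids(num_devices: int) -> List[int]:
    """Default mapping: devices 0 through num_devices - 1."""
    return list(range(num_devices))


print(resolve_num_devices({"VLLM_RBLN_TP_SIZE": "4"}))
print(default_device_ids(4))
```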

VLLM_RBLN_SAMPLER

Default: True.

Runs sampling on the RBLN NPU via the RBLNSampler. Set to False to fall back to the upstream vLLM sampler, which runs on CPU.

VLLM_RBLN_ENABLE_WARM_UP

Default: True.

Runs a model warm-up pass at engine startup so the first request is not delayed by one-time initialization. Set to False to skip warm-up.
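One way to apply these variables is to build an environment mapping and pass it to the process that launches the engine. The sketch below is hedged: rbln_env_overrides is a hypothetical helper, and writing the boolean values as the strings "True" and "False" follows the spelling used in this guide but is otherwise an assumption about what the plugin accepts.

```python
import os
from typing import Dict


def rbln_env_overrides(
    *, num_devices: int = 1, sampler: bool = True, warm_up: bool = True
) -> Dict[str, str]:
    """Build VLLM_RBLN_* overrides; the defaults mirror the documented defaults."""
    return {
        "VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK": str(num_devices),
        "VLLM_RBLN_SAMPLER": str(sampler),         # "True" or "False"
        "VLLM_RBLN_ENABLE_WARM_UP": str(warm_up),  # "True" or "False"
    }


# Merge the overrides into a copy of the current environment for the launcher.
launch_env = {**os.environ, **rbln_env_overrides(num_devices=4, warm_up=False)}
print(launch_env["VLLM_RBLN_ENABLE_WARM_UP"])
```

The resulting launch_env can be handed to the launching process, for example through the env argument of subprocess.run when starting vllm serve.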