Configuration Guide
Overview
vLLM RBLN uses the same entry points as upstream vLLM — LLM(...), vllm serve, and EngineArgs. Most arguments work the same way, with a few RBLN-specific constraints.
Additionally, RBLN-specific concepts that have no upstream counterpart are exposed through two mechanisms:
- additional_config parameter
- VLLM_RBLN_* environment variables
Engine Arguments
Engine arguments are configured the same way as in upstream vLLM. The arguments below carry RBLN-specific constraints.
block_size (Required)
Together with max_model_len, selects the attention mode:
- block_size == max_model_len: Naive Attention.
- max_model_len is a larger multiple of block_size: Flash Attention.
Exception: for pooling and encoder-decoder models (e.g. Whisper), block_size is auto-derived to equal max_model_len and need not be set. Qwen3-backbone pooling models can override this with an explicit smaller value to use Flash Attention. See Attention Modes for details.

max_model_len
Default: the model's maximum sequence length.
Combined with block_size, selects the attention mode (see above).

max_num_seqs
Default: 1.

tensor_parallel_size (Fixed at 1)
This vLLM engine argument selects vLLM-level tensor parallelism, which splits the model across multiple worker processes that communicate over the Rebellions Collective Communications Library (RCCL). On RBLN NPUs it accepts only 1, and any other value raises an error. Multi-process tensor parallelism is planned for a future release, once RCCL integration lands.
To run a model across multiple NPU devices within a single process, set the VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK environment variable instead. The RBLN compiler partitions the model across those devices and runs them as one process. This mechanism, Rebellions Scalable Design (RSD), is separate from vLLM-level tensor_parallel_size.
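The way block_size and max_model_len jointly select the attention mode can be sketched as a small check. This is an illustrative helper, not part of vLLM RBLN: the name `attention_mode` is hypothetical, and raising `ValueError` for other combinations is an assumption, since this guide does not say how an invalid pair is reported.

```python
def attention_mode(block_size: int, max_model_len: int) -> str:
    """Return the attention mode implied by the two engine arguments."""
    if block_size <= 0 or max_model_len <= 0:
        raise ValueError("block_size and max_model_len must be positive")
    if block_size == max_model_len:
        # One block spans the whole sequence.
        return "naive"
    if max_model_len % block_size == 0:
        # The sequence is split into several equally sized blocks.
        return "flash"
    # Assumption: any other combination is treated as invalid.
    raise ValueError(
        f"max_model_len={max_model_len} is not a multiple of "
        f"block_size={block_size}"
    )
```

For instance, `attention_mode(4096, 4096)` yields `"naive"`, while `attention_mode(4096, 32768)` yields `"flash"`.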
A complete example:
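A minimal sketch of such an example, assuming a hypothetical model ID and illustrative sizes. The keyword names mirror the engine arguments described above; `build_engine_kwargs` and `launch` are hypothetical helpers, and `launch` is only meaningful on a host with vLLM RBLN and RBLN NPUs available.

```python
import os


def build_engine_kwargs() -> dict:
    """Collect the engine arguments discussed above into LLM(...) kwargs."""
    return {
        "model": "my-org/my-model",  # hypothetical model ID
        "max_model_len": 32768,      # illustrative value
        "block_size": 4096,          # a smaller divisor of max_model_len: Flash Attention
        "max_num_seqs": 4,           # illustrative value (the default is 1)
        "tensor_parallel_size": 1,   # the only value accepted on RBLN NPUs
    }


def launch():
    """Create the engine; requires vLLM RBLN to be installed."""
    # Span several NPUs within one process (RSD). Assumption: setting the
    # variable here, before the engine is created, is early enough; exporting
    # it in the launching shell is the more conservative option.
    os.environ.setdefault("VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK", "4")

    from vllm import LLM  # imported lazily so this sketch loads without vLLM

    return LLM(**build_engine_kwargs())
```

Keeping the arguments in a plain dictionary makes it easy to check the block_size and max_model_len relationship before any compilation or device work starts.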
Additional Configuration Keys
The following RBLN-specific keys are passed via the additional_config parameter of LLM(...). Some keys live at the top level of additional_config, while others are nested under additional_config["rbln_config"]. See RblnModelConfig in optimum-rbln for the complete rbln_config schema.
prefix_block_size
Default: 128.
Set via additional_config["prefix_block_size"]. Controls the prefix-cache hit granularity when Automatic Prefix Caching is enabled. The value must be a multiple of prefill_chunk_size and a divisor of kvcache_block_size. See Automatic Prefix Caching for details.

device
Default: Mapped automatically to devices 0 through VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK - 1.
Set via additional_config["rbln_config"][...]["device"] under each model or submodule it configures, so a single compiled model can route different submodules to different device sets (see example below). The placement is decided at runtime, so the same compiled artifact can be redeployed to different hardware without recompiling.

decoder_batch_sizes
Default: A single decoder compiled at max_num_seqs.
Set via additional_config["rbln_config"]["decoder_batch_sizes"]. Compiles one decoder graph per batch size in the list, so the engine can serve a request by picking the smallest decoder that still fits the in-flight request count at runtime. All values must be ≤ max_num_seqs. The list is sorted in descending order, and max_num_seqs is added automatically if missing. See Inference with Dynamic Batch Sizes for details.
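The constraints on prefix_block_size and decoder_batch_sizes described above can be sketched in plain Python. The helper names are hypothetical, and the code illustrates the documented rules rather than the plugin's own implementation; the `ValueError` choices and the de-duplication of batch sizes are assumptions, and all numeric values are illustrative.

```python
def check_prefix_block_size(prefix_block_size: int,
                            prefill_chunk_size: int,
                            kvcache_block_size: int) -> None:
    """Check: multiple of prefill_chunk_size, divisor of kvcache_block_size."""
    if prefix_block_size % prefill_chunk_size != 0:
        raise ValueError("prefix_block_size must be a multiple of prefill_chunk_size")
    if kvcache_block_size % prefix_block_size != 0:
        raise ValueError("prefix_block_size must divide kvcache_block_size")


def normalize_decoder_batch_sizes(sizes: list[int], max_num_seqs: int) -> list[int]:
    """Add max_num_seqs if missing and sort in descending order."""
    if any(s < 1 or s > max_num_seqs for s in sizes):
        raise ValueError("every batch size must be between 1 and max_num_seqs")
    # Assumption: duplicate entries are collapsed.
    return sorted(set(sizes) | {max_num_seqs}, reverse=True)


def pick_decoder_batch_size(sizes: list[int], in_flight: int) -> int:
    """Pick the smallest compiled batch size that still fits the requests."""
    fitting = [s for s in sizes if s >= in_flight]
    if not fitting:
        raise ValueError("no compiled decoder fits the in-flight request count")
    return min(fitting)


# How the two keys sit inside additional_config (illustrative values).
additional_config = {
    "prefix_block_size": 128,                      # top-level key
    "rbln_config": {"decoder_batch_sizes": [4, 1]},  # nested key
}
```

With `max_num_seqs=8`, the list `[1, 4]` normalizes to `[8, 4, 1]`, and three in-flight requests would be served by the decoder compiled for batch size 4.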
Environment Variables
The RBLN-specific environment variables below are prefixed with VLLM_RBLN_. Set them on the process that launches the engine. For upstream vLLM environment variables, see the vLLM Environment Variables Guide.
VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK
Default: 1.
Sets the number of RBLN NPU devices grouped into a single local rank, that is, one engine process. The RBLN compiler partitions the model across these devices and runs them as one process. RBLN calls this mechanism RSD, and it reduces runtime overhead. This differs from vLLM-level tensor_parallel_size, which splits the model across separate processes over RCCL. See Model Parallelism for context.
Note: VLLM_RBLN_TP_SIZE is the former name of this variable and is deprecated. It is still honored as a fallback when VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK is unset, but will be removed in a future release.

VLLM_RBLN_SAMPLER
Default: True.
Runs sampling on the RBLN NPU via the RBLNSampler. Set to False to fall back to the upstream vLLM sampler, which runs on CPU.

VLLM_RBLN_ENABLE_WARM_UP
Default: True.
Runs a model warm-up pass at engine startup so the first request is not delayed by one-time initialization. Set to False to skip warm-up.
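The precedence between VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK and its deprecated predecessor, together with the default device mapping, can be sketched as follows. Both helper names are hypothetical, and the code mirrors the documented behavior rather than the plugin's implementation; treating only a missing variable (not an empty string) as "unset" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_num_devices(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the device count per local rank from the environment."""
    env = os.environ if env is None else env
    value = env.get("VLLM_RBLN_NUM_DEVICES_PER_LOCAL_RANK")
    if value is None:
        # Deprecated name, honored only when the new one is unset.
        value = env.get("VLLM_RBLN_TP_SIZE")
    # Documented default when neither variable is set.
    return int(value) if value is not None else 1


def default_device_ids(num_devices: int) -> list[int]:
    """Devices 0 through num_devices - 1, the documented default mapping."""
    return list(range(num_devices))
```

Passing a plain dictionary instead of `os.environ` keeps the precedence rule easy to check in isolation, for example `resolve_num_devices({"VLLM_RBLN_TP_SIZE": "2"})`.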