Configuration Guide¶
Overview¶
vLLM RBLN uses the same entry points as upstream vLLM — LLM(...), vllm serve, and EngineArgs. Most arguments work the same way, with a few RBLN-specific constraints.
Additionally, RBLN-specific concepts that have no upstream counterpart are exposed through two mechanisms:
- the additional_config parameter
- VLLM_RBLN_* environment variables
Engine Arguments¶
Engine arguments are configured the same way as in upstream vLLM. The arguments below carry RBLN-specific constraints.
block_size (Required)
Together with max_model_len, selects the attention mode:
- block_size == max_model_len: Naive Attention.
- max_model_len is a multiple of block_size (and larger than it): Flash Attention.
Exception: for pooling and encoder-decoder models (e.g. Whisper), block_size is auto-derived to equal max_model_len and need not be set. Qwen3-backbone pooling models can override this with an explicit smaller value to use Flash Attention. See Attention Modes for details.

max_model_len
Default: Model's max sequence length.
Combined with block_size, selects the attention mode (see above).

max_num_seqs
Default: 1.

tensor_parallel_size (Fixed at 1)
Configure tensor parallelism on RBLN NPUs through the VLLM_RBLN_TP_SIZE environment variable. This engine argument accepts only 1, and any other value raises an error. vLLM-level tensor parallelism through tensor_parallel_size is planned for a future release, once RBLN-CCL integration lands.
A complete example:
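As a minimal sketch of how these arguments might be combined: the model ID is a placeholder, the numeric values are illustrative, and `attention_mode` is a hypothetical helper that only mirrors the block_size rule stated above. Treating any other combination as an error is an assumption, not documented behavior. The final `LLM(...)` call is left as a comment because it assumes vLLM RBLN is installed.

```python
def attention_mode(block_size: int, max_model_len: int) -> str:
    """Hypothetical helper mirroring the documented block_size rule."""
    if block_size <= 0 or max_model_len <= 0:
        raise ValueError("block_size and max_model_len must be positive")
    if block_size == max_model_len:
        return "naive"
    if max_model_len % block_size == 0:
        return "flash"
    # Assumption: any other combination is treated as invalid here.
    raise ValueError("max_model_len must equal block_size or be a multiple of it")


# Engine arguments with the RBLN-specific constraints applied.
engine_args = {
    "model": "<your-model-id>",   # placeholder, not a real model ID
    "block_size": 4096,           # illustrative value
    "max_model_len": 32768,       # 8 x block_size, so Flash Attention
    "max_num_seqs": 1,            # the documented default
    "tensor_parallel_size": 1,    # the only accepted value
}

mode = attention_mode(engine_args["block_size"], engine_args["max_model_len"])
print(mode)

# With vLLM RBLN installed, the same dictionary would be passed on as:
#   from vllm import LLM
#   llm = LLM(**engine_args)
```

Checking the mode before constructing the engine is only a convenience for catching a mismatched block_size early.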
Additional Configuration Keys¶
RBLN-specific keys passed via the additional_config parameter of LLM(...). Some keys live at the top level of additional_config, while others are nested under additional_config["rbln_config"]. See RblnModelConfig in optimum-rbln for the complete rbln_config schema.
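The split between top-level and nested keys can be sketched as a plain dictionary. The values are illustrative only, and the commented `LLM(...)` call assumes vLLM RBLN is installed and uses a placeholder model ID.

```python
# Illustrative values only; the key names are the ones this section documents.
additional_config = {
    # Top-level key
    "prefix_block_size": 128,
    # Keys nested under rbln_config
    "rbln_config": {
        "decoder_batch_sizes": [4, 2, 1],
    },
}

print(additional_config["rbln_config"]["decoder_batch_sizes"])

# With vLLM RBLN installed, this would be passed along with the engine arguments:
#   from vllm import LLM
#   llm = LLM(model="<your-model-id>", max_num_seqs=4,
#             additional_config=additional_config, ...)
```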
prefix_block_size
Default: 128.
Set via additional_config["prefix_block_size"]. Controls the prefix-cache hit granularity when Automatic Prefix Caching is enabled. The value must be a multiple of prefill_chunk_size and a divisor of kvcache_block_size. See Automatic Prefix Caching for details.

device
Default: Mapped automatically to devices 0 through tensor_parallel_size - 1.
Set via additional_config["rbln_config"][...]["device"] under each model or submodule it configures, so a single compiled model can route different submodules to different device sets (see example below). The placement is decided at runtime, so the same compiled artifact can be redeployed to different hardware without recompiling.

decoder_batch_sizes
Default: A single decoder compiled at max_num_seqs.
Set via additional_config["rbln_config"]["decoder_batch_sizes"]. Compiles one decoder graph per batch size in the list, so the engine can serve a request by picking the smallest decoder that still fits the in-flight request count at runtime. All values must be ≤ max_num_seqs. The list is sorted in descending order, and max_num_seqs is added automatically if missing. See Inference with Dynamic Batch Sizes for details.
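The constraints on prefix_block_size and decoder_batch_sizes can be sketched in plain Python. The three functions below are hypothetical helpers written for illustration, not part of vLLM RBLN. Removing duplicates during normalization is an assumption, and the numeric values are illustrative.

```python
def check_prefix_block_size(prefix_block_size, prefill_chunk_size, kvcache_block_size):
    """Return True when prefix_block_size meets both documented constraints."""
    is_multiple = prefix_block_size % prefill_chunk_size == 0
    is_divisor = kvcache_block_size % prefix_block_size == 0
    return is_multiple and is_divisor


def normalize_decoder_batch_sizes(batch_sizes, max_num_seqs):
    """Sort descending and add max_num_seqs if it is missing."""
    if any(b > max_num_seqs for b in batch_sizes):
        raise ValueError("every decoder batch size must be <= max_num_seqs")
    # Assumption: duplicates are dropped.
    return sorted(set(batch_sizes) | {max_num_seqs}, reverse=True)


def pick_decoder(batch_sizes, in_flight):
    """Pick the smallest compiled batch size that still fits the in-flight requests."""
    return min(b for b in batch_sizes if b >= in_flight)


sizes = normalize_decoder_batch_sizes([1, 2], max_num_seqs=4)
print(sizes)                                    # [4, 2, 1]
print(pick_decoder(sizes, 3))                   # 4
print(check_prefix_block_size(256, 128, 1024))  # True
```

With three in-flight requests and decoders compiled at 4, 2, and 1, only the size-4 decoder fits, so that is the one selected.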
Environment Variables¶
RBLN-specific environment variables prefixed with VLLM_RBLN_. Set them on the process that launches the engine. For upstream vLLM environment variables, see the vLLM Environment Variables Guide.
VLLM_RBLN_TP_SIZE
Default: 1.
Sets the tensor-parallel size for RSD (Rebellions Scalable Design), the compiler-side parallelism that reduces runtime overhead. See Model Parallelism for context.

VLLM_RBLN_SAMPLER
Default: True.
Runs sampling on the RBLN NPU via the RBLNSampler. Set to False to fall back to the upstream vLLM sampler, which runs on CPU.

VLLM_RBLN_ENABLE_WARM_UP
Default: True.
Runs a model warm-up pass at engine startup so the first request is not delayed by one-time initialization. Set to False to skip warm-up.
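A minimal sketch of setting these variables from Python before the engine is created. The tensor-parallel size of 4 is illustrative, and writing the boolean values as the strings "True" and "False" is an assumption about how they are parsed. The commented `LLM(...)` call assumes vLLM RBLN is installed and uses a placeholder model ID.

```python
import os

# Set on the launching process before the engine is created.
os.environ["VLLM_RBLN_TP_SIZE"] = "4"             # illustrative value
os.environ["VLLM_RBLN_SAMPLER"] = "False"         # fall back to the upstream CPU sampler
os.environ["VLLM_RBLN_ENABLE_WARM_UP"] = "False"  # skip the startup warm-up pass

# Collect the RBLN-specific variables currently visible to this process.
rbln_env = {k: v for k, v in os.environ.items() if k.startswith("VLLM_RBLN_")}
print(rbln_env)

# The engine argument stays at 1; parallelism comes from VLLM_RBLN_TP_SIZE:
#   from vllm import LLM
#   llm = LLM(model="<your-model-id>", tensor_parallel_size=1, ...)
```

Exporting the same variables in the shell that runs `vllm serve` has the same effect, since they are read from the launching process's environment.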