# Qwen3-VL-2B (VLM)

## Overview
This guide is intended for users who are already familiar with the basics of optimum-rbln, and introduces two key techniques for using a Vision-Language Model (VLM) effectively on RBLN NPUs.
- Use bucketing to efficiently handle variable input resolutions and decoder batch sizes within a single compilation.
- Configure rbln_config per submodule to control behavior at each level.
The running example is Qwen/Qwen3-VL-2B-Instruct. All default values used in the snippets below match the Model Zoo's compile.py and inference.py.
Note
Debugging error conditions (for example Failed to create RBLN runtime or No memory blocks are available) is out of scope for this guide. For memory- or runtime-creation-related errors, refer to Troubleshooting Multi-Module Models.
## Model structure with submodules
A VLM is a model made up of several neural-network components such as a vision encoder and a language model. optimum-rbln places the principal component as the top-level model and declares the rest as submodules, compiling each into its own graph and serving each with its own runtime. With this structure, the desired settings can be specified independently for each submodule.
In Qwen3-VL (Qwen3VLForConditionalGeneration), the top-level model is a causal language model (LM), and it contains a Vision Transformer that encodes images and video frames as a single submodule named visual. The roles of each component are as follows.
- Top-level LM - consumes the image embeddings produced by visual and generates text tokens auto-regressively.
- visual submodule - a Vision Transformer, executed per image or per video frame.
This tree structure maps directly onto the key structure of the rbln_config passed at compile time: top-level keys apply to the LM, while keys nested under "visual" apply to the visual submodule.
Tip
Each field is defined in the RBLN config file of the corresponding model. For example, the fields available on Qwen3-VL's visual submodule are defined in RBLNQwen3VLVisionModelConfig.
Note
Which component is the top-level model and which are submodules varies across models. For example, in LLaVA and Gemma3, the top-level is a Conditional Generation wrapper and both vision_tower and language_model are submodules; in Idefics3, vision_model and text_model are declared as submodules together. To check the structure of a specific model, look at the submodules = [...] declaration in its RBLN config file (configuration_<model>.py).
## Using a submodule's rbln_config
This section collects the most common patterns for configuring optimum-rbln's VLMs with rbln_config - starting with a simple device-distribution example, then bucketing along two axes, and finally guidance for other memory and workflow scenarios.
The rbln_config fields most commonly used with Qwen3-VL are summarized below.
| Field | Level | Purpose |
|---|---|---|
| batch_size | shared | Number of concurrent sequences for the whole model - cannot be set per submodule |
| visual.max_seq_lens | submodule (visual) | Maximum number of patches per image or frame accepted by the ViT graph. Accepts either a single int or a List[int] (bucketing) |
| visual.tensor_parallel_size | submodule (visual) | Tensor parallel size for visual |
| visual.device | submodule (visual) | Device(s) on which the visual runtime is placed |
| visual.create_runtimes | submodule (visual) | Whether to create the visual runtime at compile time |
| tensor_parallel_size | top (LM) | Tensor parallel size for the LM |
| kvcache_partition_len | top (LM) | Flash attention partition size - must divide max_seq_len evenly |
| max_seq_len | top (LM) | Maximum position embedding for the LM - must be a multiple of kvcache_partition_len |
| decoder_batch_sizes | top (LM) | List of decoder batch sizes for batch-size bucketing - every value must be ≤ batch_size |
| device | top (LM) | Devices on which the LM runtime is placed |
| create_runtimes | top (LM) | Whether to create the LM runtime at compile time |
Caution
Notes for keys that may appear at both levels:
- tensor_parallel_size can be set per submodule, but it must always match the length of that submodule's device list.
- batch_size is a single value that applies to the whole model and cannot be set per submodule. visual and the LM process batches in different ways internally, but optimum-rbln reconciles this difference.
### Distributing devices between the submodule and the top-level model
The 2-level structure of rbln_config allows splitting devices between the submodule and the top-level model. For example, on a 16-device server you can use the device field in rbln_config to place visual on the first 8 devices and the LM on the last 8, so that no single device has to hold the memory of both submodules at once - avoiding running out of device memory even with large batches and long contexts.
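As a sketch, the 8 + 8 split could look like the following. The field names follow the table above; the device indices and tensor-parallel sizes are illustrative, not prescriptive.

```python
# Hypothetical 16-device split; device indices are illustrative.
# Field names follow the rbln_config table in this guide.
rbln_config = {
    # Top-level LM on the last 8 devices
    "tensor_parallel_size": 8,          # must match len(device)
    "device": [8, 9, 10, 11, 12, 13, 14, 15],
    "visual": {
        # ViT submodule on the first 8 devices
        "tensor_parallel_size": 8,      # must match len(visual.device)
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
    },
}
```

Because the two device lists are disjoint, no single device holds the weights and KV cache of both components at once.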
The 8-device baseline (where both models share the same pool) and the diagnostic flow when device or memory pressure arises are covered in Other scenarios below, which links to the troubleshooting guide.
### Bucketing the ViT input: visual.max_seq_lens
Qwen3-VL's ViT runs on one image (or video frame) at a time, and its graph shape is fixed at compile time. A ViT compiled with max_seq_lens=16384 always performs work proportional to 16,384 patches, regardless of whether the actual input contains 1,024 patches or 16,384 patches. In real serving workloads, a single request often contains images of different sizes or videos with different numbers of frames, so selecting a single max_seq_lens sized for the largest expected input means smaller inputs still consume that full capacity - wasting latency and memory.
optimum-rbln addresses this with a multi-size bucketing strategy. visual.max_seq_lens accepts not only a single int but also a List[int], and the behavior is as follows.
- When a list is provided, optimum-rbln compiles a separate ViT graph for each length in a single compilation.
- At inference time, the smallest bucket that can accommodate the actual patch count is selected automatically.
- If an input exceeds every bucket in the list, an error is raised. Choose the largest value so that it covers the upper bound of your workload.
The example below is a modified version of the Model Zoo compile.py, where the single value (max_seq_lens: 16384) is replaced with a list of three buckets compiled together.
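A minimal sketch of that change (only the field discussed here is shown; the rest of the Model Zoo compile.py settings are omitted):

```python
# Sketch: replace the single ViT bucket with three buckets compiled together.
rbln_config = {
    "visual": {
        "max_seq_lens": [1024, 3136, 16384],  # was: 16384
    },
}
```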
At runtime, the compiled model behaves as follows.
| Input patch count | Selected bucket |
|---|---|
| 800 | 1024 |
| 2000 | 3136 |
| 5000 | 16384 |
| 20000 | error: exceeds all buckets |
Note
Each value in max_seq_lens is the upper bound on the number of patches the ViT processes for a single image (or video frame). With Qwen3-VL's patch_size=16 and spatial_merge_size=2, the patch count for an H × W image is (H / (patch_size × spatial_merge_size)) × (W / (patch_size × spatial_merge_size)) = (H / 32) × (W / 32).
For example, 1024×1024 yields 1,024 patches, 1792×1792 yields 3,136, and 4096×4096 yields 16,384. The bucket values in the example above (1024, 3136, 16384) cover up to those three resolutions, respectively.
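The patch-count formula and the smallest-fitting-bucket rule can be sketched as a small helper. The function names are hypothetical; the actual selection happens inside optimum-rbln.

```python
PATCH_SIZE = 16          # Qwen3-VL ViT patch size
SPATIAL_MERGE_SIZE = 2   # adjacent patches merged 2x2

def patch_count(height: int, width: int) -> int:
    """Patches the ViT processes for an H x W image (after spatial merging)."""
    cell = PATCH_SIZE * SPATIAL_MERGE_SIZE  # 32 pixels per merged patch
    return (height // cell) * (width // cell)

def select_bucket(patches: int, buckets: list) -> int:
    """Smallest compiled bucket that can hold `patches`; error if none fits."""
    for bucket in sorted(buckets):
        if patches <= bucket:
            return bucket
    raise ValueError(f"{patches} patches exceeds all buckets {buckets}")
```

With buckets [1024, 3136, 16384], the inputs from the table above (800, 2000, 5000 patches) map to 1024, 3136, and 16384 respectively, and 20000 raises an error.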
Caution
The more buckets you add, the longer compilation takes and the more device memory the compiled model uses. Selecting too many can exhaust device memory, so 2–4 buckets chosen from the actual resolution distribution of your traffic is a reasonable default.
Tip
The general mechanics, trade-offs, and standalone examples of bucketing are covered in the Bucketing guide. ViT input-length bucketing applies only to models that explicitly support it; check the model's RBLN config file for a field named max_seq_lens (note the plural) declared as Union[int, List[int]] - rather than the singular max_seq_len.
### Bucketing the decoder batch size: decoder_batch_sizes
Qwen3-VL's language-model decoder is also a graph whose shape is fixed at compile time. A decoder compiled with batch_size=8 always performs work proportional to 8 slots - whether the actual batch is 3 or 8. In real serving (continuous batching, in-flight batching), requests finish at different times and the active batch frequently shrinks below the compiled maximum, leaving compute on the unused slots wasted.
decoder_batch_sizes addresses this with bucketing at the decoder graph level. It is a top-level (LM-side) field defined on RBLNDecoderOnlyModelConfig and accepts a List[int]. The behavior is as follows.
- When a list is provided, optimum-rbln compiles a separate decoder graph (decoder_batch_{N}) for each batch size.
- Every value must be ≤ batch_size, and the maximum value should equal batch_size. If it is smaller, optimum-rbln automatically appends batch_size and emits a warning.
- At inference time, the engine selects the most appropriate decoder graph for the active batch size during the decoding phase.
The configuration below compiles four decoder graphs (1, 2, 4, 8) alongside batch_size=8.
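A minimal sketch of that configuration (only the fields discussed here are shown; other compile-time settings are omitted):

```python
# Sketch: four decoder graphs compiled alongside batch_size=8.
rbln_config = {
    "batch_size": 8,
    "decoder_batch_sizes": [1, 2, 4, 8],  # max value must equal batch_size
}
```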
At runtime, the compiled model selects a decoder graph based on the active batch size.
| Active batch | Selected decoder graph |
|---|---|
| 1 | decoder_batch_1 |
| 2 | decoder_batch_2 |
| 3–4 | decoder_batch_4 |
| 5–8 | decoder_batch_8 |
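The mapping in the table follows the same smallest-fitting-bucket rule as ViT input bucketing. It can be sketched as (hypothetical helper; the real dispatch is internal to optimum-rbln):

```python
def select_decoder_graph(active_batch: int, compiled_sizes: list) -> str:
    """Name of the smallest compiled decoder graph covering `active_batch`."""
    for size in sorted(compiled_sizes):
        if active_batch <= size:
            return f"decoder_batch_{size}"
    raise ValueError(f"active batch {active_batch} exceeds all compiled sizes")
```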
Caution
More decoder graphs mean longer compilation time and higher device memory usage. A common pattern is 1, batch_size//2, batch_size - i.e., 2–4 buckets - adjusted to your traffic profile.
Tip
ViT input-length bucketing (visual.max_seq_lens) and decoder batch-size bucketing (decoder_batch_sizes) are independent and can be combined within a single compilation.
### Other scenarios
Specific memory and device configurations, as well as compile/inference workflows, are covered step-by-step in Troubleshooting Multi-Module Models. Frequently encountered scenarios and their corresponding guides:
| Scenario | Guide |
|---|---|
| Separate compilation from runtime creation (create_runtimes: False) | Step 1 |
| Split visual and the LM into disjoint device pools (16 ATOM™) | Step 2 |
| KV cache block pool exhaustion at large batches (tune kvcache_num_blocks) | Step 3 |
| Cap the LM context length to the actual workload | Step 4 |
| Right-size the ViT input for the target resolution | Step 5 |