Qwen3-VL-2B (VLM)

Overview

This guide is intended for users who are already familiar with the basics of optimum-rbln, and introduces two key techniques for using a Vision-Language Model (VLM) effectively on RBLN NPUs.

  • Use bucketing to efficiently handle variable input resolutions and decoder batch sizes within a single compilation.
  • Configure rbln_config per submodule to control behavior at each level.

The running example is Qwen/Qwen3-VL-2B-Instruct. All default values used in the snippets below match the Model Zoo's compile.py and inference.py.

Note

Debugging error conditions (for example Failed to create RBLN runtime or No memory blocks are available) is out of scope for this guide. For memory- or runtime-creation-related errors, refer to Troubleshooting Multi-Module Models.


Model structure with submodules

A VLM is made up of several neural-network components, such as a vision encoder and a language model. optimum-rbln places the principal component as the top-level model and declares the rest as submodules, compiling each into its own graph and serving each with its own runtime. With this structure, the desired settings can be specified independently for each submodule.

The structure of Qwen3-VL (Qwen3VLForConditionalGeneration) looks like this.

RBLNQwen3VLForConditionalGeneration   ← top-level (causal LM)
└── visual: RBLNQwen3VLVisionModel    ← submodule (Vision Transformer)

The top-level model is a causal language model (LM), and it contains a Vision Transformer that encodes images and video frames as a single submodule named visual. The roles of each component are as follows.

  • Top-level LM - consumes the image embeddings produced by visual and generates text tokens auto-regressively.
  • visual submodule - a Vision Transformer, executed per image or per video frame.

This tree structure maps directly onto the key structure of the rbln_config passed at compile time: top-level keys apply to the LM, while keys nested under "visual" apply to the visual submodule.

rbln_config={
    "visual": {                          # ← submodule level (Vision Transformer)
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
    },
    "tensor_parallel_size": 8,           # ← top level (LM)
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "device": [0, 1, 2, 3, 4, 5, 6, 7],
}

Tip

Each field is defined in the RBLN config file of the corresponding model. For example, the fields available on Qwen3-VL's visual submodule are defined in RBLNQwen3VLVisionModelConfig.

Note

Which component is the top-level model and which are submodules varies across models. For example, in LLaVA and Gemma3, the top-level is a Conditional Generation wrapper and both vision_tower and language_model are submodules; in Idefics3, vision_model and text_model are declared as submodules together. To check the structure of a specific model, look at the submodules = [...] declaration in its RBLN config file (configuration_<model>.py).
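
If you would rather check this from Python than read the source, the submodule list is typically also exposed on the model's RBLN config class. The snippet below is a minimal sketch: it assumes the top-level config class is named RBLNQwen3VLForConditionalGenerationConfig (following optimum-rbln's naming convention) and that the submodules declaration is readable as a class attribute.

from optimum.rbln import RBLNQwen3VLForConditionalGenerationConfig

# Assumption: the `submodules = [...]` declaration from the model's RBLN config file
# is accessible as a class attribute. For Qwen3-VL this is expected to print ['visual'].
print(RBLNQwen3VLForConditionalGenerationConfig.submodules)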


Using a submodule's rbln_config

This section collects the most common patterns for configuring optimum-rbln's VLMs with rbln_config - starting with a simple device-distribution example, then bucketing along two axes, and finally guidance for other memory and workflow scenarios.

The rbln_config fields most commonly used with Qwen3-VL are summarized below.

| Field | Level | Purpose |
|---|---|---|
| batch_size | shared | Number of concurrent sequences for the whole model; cannot be set per submodule |
| visual.max_seq_lens | submodule (visual) | Maximum number of patches per image or frame accepted by the ViT graph; accepts a single int or a List[int] (bucketing) |
| visual.tensor_parallel_size | submodule (visual) | Tensor parallel size for visual |
| visual.device | submodule (visual) | Device(s) on which the visual runtime is placed |
| visual.create_runtimes | submodule (visual) | Whether to create the visual runtime at compile time |
| tensor_parallel_size | top (LM) | Tensor parallel size for the LM |
| kvcache_partition_len | top (LM) | Flash attention partition size; must divide max_seq_len evenly |
| max_seq_len | top (LM) | Maximum position embedding for the LM; must be a multiple of kvcache_partition_len |
| decoder_batch_sizes | top (LM) | List of decoder batch sizes for batch-size bucketing; every value must be ≤ batch_size |
| device | top (LM) | Devices on which the LM runtime is placed |
| create_runtimes | top (LM) | Whether to create the LM runtime at compile time |

Caution

Notes for keys that may appear at both levels:

  • tensor_parallel_size can be set per submodule, but it must always match the length of that submodule's device list.
  • batch_size is a single value that applies to the whole model and cannot be set per submodule. visual and the LM process batches in different ways internally, but optimum-rbln reconciles this difference.

Distributing devices between the submodule and the top-level model

The two-level structure of rbln_config allows splitting devices between the submodule and the top-level model. For example, on a 16-device server you can use the device field to place visual on the first 8 devices and the LM on the last 8, so that no single device has to hold the memory of both components at once. This avoids running out of device memory even with large batches and long contexts.

rbln_config={
    "visual": {
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],     # visual on devices 0–7
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "batch_size": 8,
    "device": [8, 9, 10, 11, 12, 13, 14, 15],   # LM on devices 8–15
}

The 8-device baseline (where both models share the same pool) and the diagnostic flow when device or memory pressure arises are covered in Other scenarios below, which links to the troubleshooting guide.

Bucketing the ViT input: visual.max_seq_lens

Qwen3-VL's ViT runs on one image (or video frame) at a time, and its graph shape is fixed at compile time. A ViT compiled with max_seq_lens=16384 always performs work proportional to 16,384 patches, regardless of whether the actual input contains 1,024 patches or 16,384 patches. In real serving workloads, a single request often contains images of different sizes or videos with different numbers of frames, so selecting a single max_seq_lens sized for the largest expected input means smaller inputs still consume that full capacity - wasting latency and memory.

optimum-rbln addresses this with a multi-size bucketing strategy. visual.max_seq_lens accepts not only a single int but also a List[int], and the behavior is as follows.

  • When a list is provided, optimum-rbln compiles a separate ViT graph for each length in a single compilation.
  • At inference time, the smallest bucket that can accommodate the actual patch count is selected automatically.
  • If an input exceeds every bucket in the list, an error is raised. Choose the largest value so that it covers the upper bound of your workload.

The example below is a modified version of the Model Zoo compile.py, where the single value (max_seq_lens: 16384) is replaced with a list of three buckets compiled together.

import os

from optimum.rbln import RBLNQwen3VLForConditionalGeneration

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": [1024, 3136, 16384],   # ← compile three buckets together
            "tensor_parallel_size": 8,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
    },
)
model.save_pretrained(os.path.basename(model_id))
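
The compiled directory saved above can then be loaded for inference without recompiling. The snippet below is a minimal sketch in the spirit of the Model Zoo's inference.py; export=False tells optimum-rbln to load the already compiled artifacts instead of exporting again.

import os

from optimum.rbln import RBLNQwen3VLForConditionalGeneration

model_id = "Qwen/Qwen3-VL-2B-Instruct"

# Load the artifacts produced by the compile script above; no recompilation happens here.
model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    os.path.basename(model_id),
    export=False,
)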

At runtime, the compiled model behaves as follows.

| Input patch count | Selected bucket |
|---|---|
| 800 | 1024 |
| 2000 | 3136 |
| 5000 | 16384 |
| 20000 | error: exceeds all buckets |

Note

Each value in max_seq_lens is the upper bound on the number of patches the ViT processes for a single image (or video frame). With Qwen3-VL's patch_size=16 and spatial_merge_size=2, the patch count for an H × W image is:

merged patches = (H / 16 / 2) × (W / 16 / 2)

For example, 1024×1024 yields 1,024 patches, 1792×1792 yields 3,136, and 4096×4096 yields 16,384. The bucket values in the example above (1024, 3136, 16384) cover up to those three resolutions, respectively.
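
The arithmetic can be checked with a few lines of plain Python. The sketch below only illustrates the formula above and the "smallest sufficient bucket" rule; it is not optimum-rbln's internal selection code.

def merged_patch_count(height: int, width: int, patch_size: int = 16, merge: int = 2) -> int:
    # Merged patches for one H × W image: (H / 16 / 2) × (W / 16 / 2)
    return (height // (patch_size * merge)) * (width // (patch_size * merge))

def select_bucket(num_patches: int, buckets: list[int]) -> int:
    # Smallest compiled bucket that can hold the input; error if none can.
    for bucket in sorted(buckets):
        if num_patches <= bucket:
            return bucket
    raise ValueError(f"{num_patches} patches exceeds all buckets {buckets}")

buckets = [1024, 3136, 16384]
print(merged_patch_count(1792, 1792))                          # 3136
print(select_bucket(merged_patch_count(1792, 1792), buckets))  # 3136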

Caution

The more buckets you add, the longer compilation takes and the more device memory the compiled model uses. Selecting too many can exhaust device memory, so 2–4 buckets chosen from the actual resolution distribution of your traffic is a reasonable default.

Tip

The general mechanics, trade-offs, and standalone examples of bucketing are covered in the Bucketing guide. ViT input-length bucketing applies only to models that explicitly support it; check the model's RBLN config file for a field named max_seq_lens (note the plural) declared as Union[int, List[int]] - rather than the singular max_seq_len.

Bucketing the decoder batch size: decoder_batch_sizes

Qwen3-VL's language-model decoder is also a graph whose shape is fixed at compile time. A decoder compiled with batch_size=8 always performs work proportional to 8 slots - whether the actual batch is 3 or 8. In real serving (continuous batching, in-flight batching), requests finish at different times and the active batch frequently shrinks below the compiled maximum, leaving compute on the unused slots wasted.

decoder_batch_sizes addresses this with bucketing at the decoder graph level. It is a top-level (LM-side) field defined on RBLNDecoderOnlyModelConfig and accepts a List[int]. The behavior is as follows.

  • When a list is provided, optimum-rbln compiles a separate decoder graph (decoder_batch_{N}) for each batch size.
  • Every value must be ≤ batch_size, and the maximum value should equal batch_size. If it is smaller, optimum-rbln automatically appends batch_size and emits a warning.
  • At inference time, the engine selects the most appropriate decoder graph for the active batch size during the decoding phase.

The configuration below compiles four decoder graphs (1, 2, 4, 8) alongside batch_size=8.

rbln_config={
    "visual": {
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "batch_size": 8,
    "decoder_batch_sizes": [1, 2, 4, 8],   # ← compile four decoder graphs
    "device": [0, 1, 2, 3, 4, 5, 6, 7],
}

At runtime, the compiled model selects a decoder graph based on the active batch size.

| Active batch | Selected decoder graph |
|---|---|
| 1 | decoder_batch_1 |
| 2 | decoder_batch_2 |
| 3–4 | decoder_batch_4 |
| 5–8 | decoder_batch_8 |

Caution

More decoder graphs mean longer compilation time and higher device memory usage. A common pattern is 1, batch_size//2, batch_size - i.e., 2–4 buckets - adjusted to your traffic profile.

Tip

ViT input-length bucketing (visual.max_seq_lens) and decoder batch-size bucketing (decoder_batch_sizes) are independent and can be combined within a single compilation.
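
For reference, a single compilation that combines both axes could look like the sketch below; the bucket values are illustrative and should be chosen from your own traffic.

rbln_config={
    "visual": {
        "max_seq_lens": [1024, 3136, 16384],   # ViT input-length buckets
        "tensor_parallel_size": 8,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "batch_size": 8,
    "decoder_batch_sizes": [1, 4, 8],          # decoder batch-size buckets
    "device": [0, 1, 2, 3, 4, 5, 6, 7],
}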

Other scenarios

Specific memory and device configurations, as well as compile/inference workflows, are covered step-by-step in Troubleshooting Multi-Module Models. Frequently encountered scenarios and their corresponding guides:

| Scenario | Guide |
|---|---|
| Separate compilation from runtime creation (create_runtimes: False) | Step 1 |
| Split visual and the LM into disjoint device pools (16 ATOM™) | Step 2 |
| KV cache block pool exhaustion at large batches (tune kvcache_num_blocks) | Step 3 |
| Cap the LM context length to the actual workload | Step 4 |
| Right-size the ViT input for the target resolution | Step 5 |

See also