Troubleshooting Multi-Module Models¶
This guide covers common issues when compiling and deploying a multi-module vision-language model on RBLN NPUs, using the Qwen3-VL-2B Model Zoo example as the reference. The same patterns apply to any Optimum RBLN model composed of independent submodules.
Prerequisites
Examples below use `tensor_parallel_size=8`, matching the Qwen3-VL-2B Model Zoo compile config. This guide uses two device layouts:
- 8 ATOM™ (devices 0–7) — the Model Zoo baseline. Both submodules share the same 8-device pool. Works with the default compile config and small batch sizes.
- 16 ATOM™ (devices 0–15) — `visual` and LM on disjoint pools (8–15 and 0–7). Required for heavier configs such as the `batch_size=8` scenario used throughout this guide.
Run `rbln-smi` to list available devices, and adjust `tensor_parallel_size` and device assignments to match your hardware.
Quick reference¶
| Symptom | Root cause | Go to |
|---|---|---|
| `Failed to create RBLN runtime` right after compilation | Default `create_runtimes=True` runs runtime creation during compile, which fails | Step 1 |
| `Failed to create RBLN runtime` when loading a saved model | Every submodule defaults to the same `[0..TP-1]` device pool; combined per-device footprint exceeds memory | Step 2 |
| `No memory blocks are available for allocation` | Paged attention block pool exhausted at higher batch sizes | Step 3 |
| `Memory is not enough for full sequence length` | KV cache sized for the architectural max, not the actual workload | Step 4 |
| ViT latency higher than expected for small images | `max_seq_lens` over-provisioned for actual resolution | Step 5 |
Step 1: Compile without creating runtimes¶
Symptom¶
Compilation with `batch_size=8` crashes immediately after it completes, reporting `Failed to create RBLN runtime`.
Root cause¶
`from_pretrained(..., export=True)` performs two operations in sequence:
1. Compile the model to RBLN IR — succeeds.
2. Create runtimes by loading the compiled modules onto device memory — fails.
Qwen3-VL consists of two independent submodules:
- `visual` — a Vision Transformer that encodes images and video frames.
- `model` — a causal language model that generates text.
During runtime creation, each submodule's device defaults to `list(range(tensor_parallel_size))` — `[0, 1, …, 7]` for TP=8. Both `visual` and the LM therefore shard across the same 8-device pool, and every device in that pool carries a ViT slice, an LM tensor-parallel shard, and a KV cache slice. Runtime creation fails when the combined per-device footprint exceeds device memory.
Resolution¶
`create_runtimes` defaults to `True`, so `from_pretrained(..., export=True)` proceeds from compilation directly into runtime creation — the step that fails. Setting `create_runtimes: False` at both the submodule and top levels stops `from_pretrained` after compilation, bypassing the failing runtime-creation step. This is the canonical Model Zoo `compile.py` pattern:
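A minimal sketch of that pattern, assuming an `RBLNQwen3VLForConditionalGeneration` entry point (the class name, model ID, and output path here are assumptions; match them to your `compile.py`):

```python
from optimum.rbln import RBLNQwen3VLForConditionalGeneration  # class name assumed

model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",  # model ID assumed
    export=True,                   # compile to RBLN IR
    rbln_config={
        "visual": {                         # submodule (ViT) scope
            "max_seq_lens": 16384,
            "tensor_parallel_size": 8,
            "create_runtimes": False,       # skip ViT runtime creation
        },
        "tensor_parallel_size": 8,          # top (LM) scope
        "kvcache_partition_len": 16384,
        "max_seq_len": 262144,              # architectural max; see Step 4
        "batch_size": 8,
        "create_runtimes": False,           # skip LM runtime creation
    },
)
model.save_pretrained("qwen3-vl-2b-rbln")   # write artifacts to disk
```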
Compilation completes and the artifacts are written to disk. The underlying memory pressure remains; it resurfaces at load time, which Step 2 handles with explicit device placement.
Each field, scoped to its level:
| Field | Level | Purpose |
|---|---|---|
| `visual.max_seq_lens` | submodule (ViT) | Max merged patches per image/frame the ViT graph accepts |
| `visual.tensor_parallel_size` | submodule (ViT) | TP degree for the ViT |
| `visual.create_runtimes` | submodule (ViT) | Skip ViT runtime creation at compile time |
| `tensor_parallel_size` | top (LM) | TP degree for the language model |
| `kvcache_partition_len` | top (LM) | Flash attention partition size — must divide `max_seq_len` |
| `max_seq_len` | top (LM) | Maximum LM position embedding; must be a multiple of `kvcache_partition_len` |
| `create_runtimes` | top (LM) | Skip LM runtime creation at compile time |
Tip
Always set `create_runtimes: False` when compiling multi-module models.
Step 2: Distribute submodules across devices at load time¶
Symptom¶
The model compiled in Step 1 with `batch_size=8` is loaded without any device hints:
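A sketch of the failing load (class name assumed, as in Step 1); note the absence of any `device` keys in `rbln_config`:

```python
from optimum.rbln import RBLNQwen3VLForConditionalGeneration  # class name assumed

# No device hints: both submodules default to devices [0..7] and collide.
model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    "qwen3-vl-2b-rbln",
    export=False,  # load the compiled artifacts from Step 1
)
```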
`Failed to create RBLN runtime` appears again — this time during load.
Root cause¶
Same underlying pressure as Step 1 — without an explicit `device` setting, `visual` and the LM share the default `[0..TP-1]` pool, and each device's combined ViT + LM + KV cache footprint exceeds device memory. Step 1 sidestepped the failure by skipping runtime creation; loading the saved artifacts surfaces it again.
Resolution¶
Assign every submodule to a device list explicitly. For the `batch_size=8` compile from Step 1, place `visual` and the LM on disjoint pools across 16 ATOM™ devices, so no device holds both ViT and LM memory:
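A sketch of the disjoint 16-device layout (the per-submodule `device` key name is an assumption; check your optimum-rbln version):

```python
from optimum.rbln import RBLNQwen3VLForConditionalGeneration  # class name assumed

model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    "qwen3-vl-2b-rbln",
    export=False,
    rbln_config={
        "visual": {"device": [8, 9, 10, 11, 12, 13, 14, 15]},  # ViT on devices 8-15
        "device": [0, 1, 2, 3, 4, 5, 6, 7],                    # LM + KV cache on devices 0-7
    },
)
```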
When only 8 ATOM™ devices are available, use the Model Zoo `inference.py` shared-pool layout instead (`visual` and LM both on devices 0–7). The combined footprint must fit each device, so the `batch_size=8` compile from Step 1 will not load on this layout; first reduce the workload via Step 3 or Step 4.
Step 3: Serve through `vllm-rbln` for higher batch sizes¶
Symptom¶
At higher batch sizes, the paged attention block pool is exhausted at runtime with `No memory blocks are available for allocation`. The error itself directs users to `vllm-rbln`; the rest of this step walks through the migration path.
Root cause¶
The KV cache uses paged attention. With `optimum-rbln` alone, the block pool is pre-allocated at compile time with a fixed `kvcache_num_blocks`; once the active batch needs more blocks than were reserved, the pool exhausts mid-generation and the error above is raised.
Resolution¶
Serve the model through `vllm-rbln` instead of calling the compiled artifacts directly. `vllm-rbln` manages the paged-attention block pool dynamically. The engine handles block allocation, eviction, and admission control, so workloads that overrun the pool are queued or rejected safely rather than crashing mid-generation.
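As a sketch of the migration (the engine arguments shown are standard vLLM options; how `vllm-rbln` maps them onto RBLN devices is an assumption, so check its docs, and the text-only prompt is only for brevity):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="qwen3-vl-2b-rbln",  # compiled artifact directory from Step 1
    max_num_seqs=8,            # concurrent sequences, matching batch_size=8
    max_model_len=32768,       # per-request context ceiling
)
outputs = llm.generate(["Describe this image."], SamplingParams(max_tokens=128))
```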
Standalone `optimum-rbln` — Tune `kvcache_num_blocks`¶
When you must run on `optimum-rbln` directly, set `kvcache_num_blocks` so the pre-allocated pool covers the batch:
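A sketch of the arithmetic for the Step 1 configuration (the full-coverage rule is an assumption extrapolated from the minimum quoted below; validate it against your setup):

```python
batch_size = 8
max_seq_len = 262144
kvcache_partition_len = 16384  # equals kvcache_block_size under flash attention

blocks_per_seq = max_seq_len // kvcache_partition_len  # 16

# Full coverage: every sequence in the batch can grow to max_seq_len.
kvcache_num_blocks = batch_size * blocks_per_seq        # 128

# Minimum accepted for this configuration (see below).
min_blocks = blocks_per_seq + 1                         # 17
```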
Under flash attention, `kvcache_block_size` equals `kvcache_partition_len`, so the two are interchangeable in the formula.
The minimum valid `kvcache_num_blocks` for this configuration is `(max_seq_len / kvcache_block_size) + 1 = 17`. Values above that minimum but below full batch coverage let inference start normally, but the pool exhausts before `max_seq_len` is reached and raises OOM.
Step 4: Cap the LM context length to the actual workload¶
Symptom¶
Compilation at higher batch sizes is rejected with `Memory is not enough for full sequence length`.
Root cause¶
In flash attention mode (activated by `kvcache_partition_len`), the KV cache is pre-allocated at compile time. Its memory footprint scales as:
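Schematically, using standard transformer KV-cache arithmetic (this helper is illustrative, not an optimum-rbln API; the per-model constants come from `config.json`):

```python
def kv_cache_bytes(batch_size: int, max_seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Every sequence reserves its full max_seq_len slot at compile time,
    # so the footprint grows linearly in both batch_size and max_seq_len.
    return batch_size * max_seq_len * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes
```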
The Model Zoo `compile.py` sets `max_seq_len` to 262,144 — Qwen3-VL's architectural maximum. With `batch_size > 1`, the engine reserves the full 256K-token slot for every sequence, quickly exhausting device memory even when actual requests are far shorter.
Resolution¶
Reduce `max_seq_len` to a realistic ceiling for the target workload. If requests stay under 32K tokens:
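For example, reusing the Step 1 compile call with the context capped (a sketch; 32,768 is a multiple of the 16,384 partition length, satisfying the constraint in the table below):

```python
rbln_config = {
    "visual": {
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16384,
    "max_seq_len": 32768,   # 2 * kvcache_partition_len; ~8x less KV memory than 262144
    "batch_size": 8,
    "create_runtimes": False,
}
```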
| Parameter | Constraint | Effect |
|---|---|---|
| `max_seq_len` | Must be a multiple of `kvcache_partition_len` | Sets the per-sequence KV memory ceiling |
| `kvcache_partition_len` | Flash attention partition size | Smaller = more flexibility; larger = less overhead |
| `batch_size` | Concurrent sequences | Multiplies KV memory linearly |
Halving `max_seq_len` roughly halves KV cache memory. When the workload needs both long contexts and larger batches, combine this with Step 3: reduce `max_seq_len` first, then tune `kvcache_num_blocks` within the remaining memory budget.
Step 5: Right-size ViT input for the target resolution¶
Symptom¶
Batched inference runs, but ViT latency is higher than expected — even for small images.
Root cause¶
`visual.max_seq_lens` sets the maximum number of merged patches the ViT graph accepts per image or video frame. The Model Zoo `compile.py` defaults to 16,384 — a conservative upper bound that covers any input up to roughly 4096×4096. If the deployment or serving scenario has a known maximum resolution below that, `max_seq_lens` can be shrunk to that scenario's upper bound to cut ViT compute and memory that would otherwise be wasted.
Merged patch count depends on the model's `patch_size` and `spatial_merge_size` (both defined in `config.json`):
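For a rectangular input this works out as follows (a sketch; the example values match the table below):

```python
def merged_patches(height: int, width: int, patch_size: int, spatial_merge_size: int) -> int:
    # One merged patch covers a square of patch_size * spatial_merge_size pixels per side.
    unit = patch_size * spatial_merge_size
    return (height // unit) * (width // unit)

merged_patches(1792, 1792, patch_size=16, spatial_merge_size=2)  # 3136
merged_patches(4096, 4096, patch_size=16, spatial_merge_size=2)  # 16384
```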
For Qwen3-VL (`patch_size=16`, `spatial_merge_size=2`):
| Image resolution | Merged patches |
|---|---|
| 1792 × 1792 | 3,136 |
| 4096 × 4096 | 16,384 |
Compiling with `max_seq_lens: 16384` for a 1792×1792 deployment reserves about 5× the patches the workload actually uses (16,384 vs. 3,136). The ViT runs at the compiled capacity regardless of the actual input, so the excess translates directly into wasted latency and memory.
Resolution¶
Two parameters form a pipeline and must stay in sync:
- `processor.max_pixels` — caps image size before patch extraction.
- `rbln_config["visual"]["max_seq_lens"]` — caps the patch count accepted by the compiled ViT graph.
| Relationship | Outcome |
|---|---|
| Patches from `max_pixels` fit within `max_seq_lens` | Inference runs |
| Patches from `max_pixels` closely match `max_seq_lens` | Compute and memory stay efficient |
For a deployment capped at 1792×1792:
**Compile**
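A sketch reusing the Step 1 compile call, with the ViT graph sized to the 3,136-patch bound (whether arbitrary values are accepted is an assumption; round up if the compiler requires it):

```python
rbln_config = {
    "visual": {
        "max_seq_lens": 3136,   # 1792x1792 at patch_size=16, spatial_merge_size=2
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16384,
    "max_seq_len": 32768,
    "create_runtimes": False,
}
```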
**Load**
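At load time, cap the processor side to match (the `max_pixels` kwarg follows the Qwen processor family; treat its availability on Qwen3-VL as an assumption):

```python
from transformers import AutoProcessor

# Cap image size before patch extraction so no request exceeds the compiled graph.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",  # model ID assumed, as in Step 1
    max_pixels=1792 * 1792,
)
```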
Note
`max_seq_lens` freezes the compiled graph shape. Size it to the realistic maximum for the deployment, not the architectural maximum the model supports. Recompile if the deployment resolution grows.
Bucketing — Variable image sizes¶
When the deployment serves a range of image sizes rather than a single fixed resolution, compile the ViT with multiple `max_seq_lens` buckets. The runtime picks the smallest fitting graph per request, so smaller images avoid the wasted compute described above without sacrificing the larger inputs that need the full graph.
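A sketch of a bucketed compile config (list-valued `max_seq_lens` is inferred from the plural parameter name; the bucket values are illustrative):

```python
rbln_config = {
    "visual": {
        # One compiled ViT graph per bucket; the runtime picks the smallest
        # graph whose capacity covers the request's merged patch count.
        "max_seq_lens": [1024, 3136, 16384],
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16384,
    "max_seq_len": 32768,
    "create_runtimes": False,
}
```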
See the VLM tutorial for the bucketing setup applied to image-size variation.