
Troubleshooting Multi-Module Models

This guide covers common issues when compiling and deploying a multi-module vision-language model on RBLN NPUs, using the Qwen3-VL-2B Model Zoo example as the reference. The same patterns apply to any Optimum RBLN model composed of independent submodules.

Prerequisites

Examples below use tensor_parallel_size=8, matching the Qwen3-VL-2B Model Zoo compile config. This guide uses two device layouts:

  • 8 ATOM™ (devices 0–7) — the Model Zoo baseline. Both submodules share the same 8-device pool. Works with the default compile config and small batch sizes.
  • 16 ATOM™ (devices 0–15) — visual and LM on disjoint pools (8–15 and 0–7). Required for heavier configs such as the batch_size=8 scenario used throughout this guide.

Run rbln-smi to list available devices, and adjust tensor_parallel_size and device assignments to match your hardware.

Quick reference

| Symptom | Root cause | Go to |
|---|---|---|
| Failed to create RBLN runtime right after compilation | Default create_runtimes=True runs runtime creation during compile, which fails | Step 1 |
| Failed to create RBLN runtime when loading a saved model | Every submodule defaults to the same [0..TP-1] device pool; combined per-device footprint exceeds memory | Step 2 |
| No memory blocks are available for allocation | Paged attention block pool exhausted at higher batch sizes | Step 3 |
| Memory is not enough for full sequence length | KV cache sized for the architectural max, not the actual workload | Step 4 |
| ViT latency higher than expected for small images | max_seq_lens over-provisioned for actual resolution | Step 5 |

Step 1: Compile without creating runtimes

Symptom

With batch_size=8, from_pretrained crashes immediately after the compile phase completes:

from optimum.rbln import RBLNAutoModelForVision2Seq

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 16384,
            "tensor_parallel_size": 8,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "batch_size": 8,
    },
)
Failed to create RBLN runtime: ...

If you only need to compile the model without loading it to NPU, you can use:
  from_pretrained(..., rbln_create_runtimes=False) or
  from_pretrained(..., rbln_config={..., 'create_runtimes': False})

Root cause

from_pretrained(..., export=True) performs two operations in sequence:

  1. Compile the model to RBLN IR — succeeds.
  2. Create runtimes by loading the compiled modules onto device memory — fails.

Qwen3-VL consists of two independent submodules:

  • visual — a Vision Transformer that encodes images and video frames.
  • model — a causal language model that generates text.

During runtime creation, each submodule's device defaults to list(range(tensor_parallel_size)), i.e. [0, 1, …, 7] for TP=8. Both visual and the LM therefore shard across the same 8-device pool, and every device in that pool carries a ViT slice, an LM tensor-parallel shard, and a KV cache slice. Runtime creation fails when the combined per-device footprint exceeds device memory.

Resolution

create_runtimes defaults to True, so from_pretrained(..., export=True) proceeds from compilation directly into runtime creation — the step that fails. Setting create_runtimes: False at both the submodule and top levels stops from_pretrained after compilation, bypassing the failing runtime-creation step. This is the canonical Model Zoo compile.py pattern:

import os

from optimum.rbln import RBLNAutoModelForVision2Seq

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model = RBLNAutoModelForVision2Seq.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 16384,
            "tensor_parallel_size": 8,
            "create_runtimes": False,                   # ← added
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "batch_size": 8,
        "create_runtimes": False,                       # ← added
    },
)
model.save_pretrained(os.path.basename(model_id))

Compilation completes and the artifacts are written to disk. The underlying memory pressure remains; it resurfaces at load time, which Step 2 handles with explicit device placement.

Each field, scoped to its level:

| Field | Level | Purpose |
|---|---|---|
| visual.max_seq_lens | submodule (ViT) | Max merged patches per image/frame the ViT graph accepts |
| visual.tensor_parallel_size | submodule (ViT) | TP degree for the ViT |
| visual.create_runtimes | submodule (ViT) | Skip ViT runtime creation at compile time |
| tensor_parallel_size | top (LM) | TP degree for the language model |
| kvcache_partition_len | top (LM) | Flash attention partition size; must divide max_seq_len |
| max_seq_len | top (LM) | Maximum LM position embedding; must be a multiple of kvcache_partition_len |
| create_runtimes | top (LM) | Skip LM runtime creation at compile time |
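The divisibility constraint between max_seq_len and kvcache_partition_len can be verified before a long compile run. A minimal sketch in plain Python, with no RBLN dependency; check_lm_config is a hypothetical helper for illustration, not an optimum-rbln API:

```python
def check_lm_config(cfg: dict) -> None:
    """Validate the flash-attention divisibility constraint before compiling."""
    max_seq_len = cfg["max_seq_len"]
    partition_len = cfg["kvcache_partition_len"]
    if max_seq_len % partition_len != 0:
        raise ValueError(
            f"max_seq_len ({max_seq_len}) must be a multiple of "
            f"kvcache_partition_len ({partition_len})"
        )

# The Step 1 config passes: 262_144 = 16 * 16_384.
check_lm_config({"max_seq_len": 262_144, "kvcache_partition_len": 16_384})
```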

Tip

Always set create_runtimes: False when compiling multi-module models.


Step 2: Distribute submodules across devices at load time

Symptom

The model compiled in Step 1 with batch_size=8 is loaded without any device hints:

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
)

Failed to create RBLN runtime appears again — this time during load.

Root cause

Same underlying pressure as Step 1 — without an explicit device setting, visual and the LM share the default [0..TP-1] pool, and each device's combined ViT + LM + KV cache footprint exceeds device memory. Step 1 sidestepped the failure by skipping runtime creation; loading the saved artifacts surfaces it again.

Resolution

Assign every submodule to a device list explicitly. For the batch_size=8 compile from Step 1, place visual and the LM on disjoint pools across 16 ATOM™ devices, so no device holds both ViT and LM memory:

from optimum.rbln import RBLNAutoModelForVision2Seq

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
    rbln_config={
        "visual": {
            "device": [8, 9, 10, 11, 12, 13, 14, 15],   # ViT on devices 8–15
        },
        "device": [0, 1, 2, 3, 4, 5, 6, 7],             # LM on devices 0–7
    },
)

When only 8 ATOM™ devices are available, use the Model Zoo inference.py shared-pool layout instead (visual and LM both on devices 0–7). The combined footprint must fit each device, so the batch_size=8 compile from Step 1 will not load on this layout; first reduce the workload via Step 3 or Step 4.
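For reference, the shared-pool load call would look like the following sketch; it assumes a compile whose combined per-device footprint fits 8 devices (for example, one produced after applying the reductions in Step 3 or Step 4):

```python
from optimum.rbln import RBLNAutoModelForVision2Seq

# Shared-pool layout: ViT and LM both shard across devices 0-7,
# so every device carries a slice of both submodules.
model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
    rbln_config={
        "visual": {"device": [0, 1, 2, 3, 4, 5, 6, 7]},
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
    },
)
```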


Step 3: Serve through vllm-rbln for higher batch sizes

Symptom

At higher batch sizes, the paged attention block pool is exhausted at runtime:

RuntimeError: No memory blocks are available for allocation.
The generate() API cannot complete this inference task because Paged Attention is not fully supported by optimum-rbln.
This is supported by vllm-rbln (see: https://docs.rbln.ai/software/model_serving/vllm_support/vllm-rbln.html).
Using vllm-rbln should fix this issue and enhance inference performance.

The error itself directs users to vllm-rbln; the rest of this step walks through the migration path.

Root cause

The KV cache uses paged attention. With optimum-rbln alone, the block pool is pre-allocated at compile time with a fixed kvcache_num_blocks; once the active batch needs more blocks than were reserved, the pool exhausts mid-generation and the error above is raised.

Resolution

Serve the model through vllm-rbln instead of calling the compiled artifacts directly. vllm-rbln manages the paged-attention block pool dynamically. The engine handles block allocation, eviction, and admission control, so workloads that overrun the pool are queued or rejected safely rather than crashing mid-generation.

Standalone optimum-rbln — Tune kvcache_num_blocks

When you must run on optimum-rbln directly, set kvcache_num_blocks so the pre-allocated pool covers the batch:

num_full_blocks = batch_size × (max_seq_len / kvcache_block_size)

Under flash attention, kvcache_block_size equals kvcache_partition_len, so the two are interchangeable in the formula.
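Plugging in the Step 1 values, the formula yields 128 blocks. A quick sanity check in plain Python:

```python
batch_size = 8
max_seq_len = 262_144
kvcache_block_size = 16_384   # equals kvcache_partition_len under flash attention

# num_full_blocks = batch_size x (max_seq_len / kvcache_block_size)
num_full_blocks = batch_size * (max_seq_len // kvcache_block_size)
print(num_full_blocks)  # 8 * 16 = 128
```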

rbln_config={
    "visual": {...},                 # see Step 1 for the full visual block
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "kvcache_num_blocks": 128,       # = num_full_blocks
    "batch_size": 8,
    "create_runtimes": False,
}

Valid range:

(max_seq_len / kvcache_block_size) + 1   ≤   kvcache_num_blocks   ≤   num_full_blocks

The minimum valid kvcache_num_blocks for this configuration is (max_seq_len / kvcache_block_size) + 1 = 17. Values below num_full_blocks but within the valid range let inference start normally; however, the pool can exhaust before sequences reach max_seq_len, raising the same no-memory-blocks error.


Step 4: Cap the LM context length to the actual workload

Symptom

Compilation at higher batch sizes is rejected with:

ValueError: Memory is not enough for full sequence length.
Please consider decreasing `max_seq_len` to reduce the number of blocks.

Root cause

In flash attention mode (activated by kvcache_partition_len), the KV cache is pre-allocated at compile time. Its memory footprint scales as:

KV memory  ∝  batch_size  ×  (max_seq_len / kvcache_partition_len)

The Model Zoo compile.py sets max_seq_len to 262,144 — Qwen3-VL's architectural maximum. With batch_size > 1, the engine reserves the full 256K-token slot for every sequence, quickly exhausting device memory even when actual requests are far shorter.

Resolution

Reduce max_seq_len to a realistic ceiling for the target workload. If requests stay under 32K tokens:

rbln_config={
    "visual": {
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 32_768,           # 262_144 → 32_768
    "batch_size": 8,
    "create_runtimes": False,
}

| Parameter | Constraint | Effect |
|---|---|---|
| max_seq_len | Must be a multiple of kvcache_partition_len | Sets the per-sequence KV memory ceiling |
| kvcache_partition_len | Flash attention partition size | Smaller = more flexibility; larger = less overhead |
| batch_size | Concurrent sequences | Multiplies KV memory linearly |

Halving max_seq_len roughly halves KV cache memory. When the workload needs both long contexts and larger batches, combine this with Step 3: reduce max_seq_len first, then tune kvcache_num_blocks within the remaining memory budget.
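The block arithmetic makes the savings concrete; a sketch in plain Python, assuming the block size equals kvcache_partition_len as in Step 3:

```python
def kv_blocks(batch_size: int, max_seq_len: int, partition_len: int) -> int:
    """Full-sequence KV cache blocks pre-allocated under flash attention."""
    return batch_size * (max_seq_len // partition_len)

# Architectural maximum vs. a 32K workload ceiling, both at batch_size=8:
print(kv_blocks(8, 262_144, 16_384))  # 128 blocks
print(kv_blocks(8, 32_768, 16_384))   # 16 blocks, an 8x reduction
```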


Step 5: Right-size ViT input for the target resolution

Symptom

Batched inference runs, but ViT latency is higher than expected — even for small images.

Root cause

visual.max_seq_lens sets the maximum number of merged patches the ViT graph accepts per image or video frame. The Model Zoo compile.py defaults to 16,384 — a conservative upper bound that covers any input up to roughly 4096×4096. If the deployment or serving scenario has a known maximum resolution below that, max_seq_lens can be shrunk to that scenario's upper bound to cut ViT compute and memory that would otherwise be wasted.

Merged patch count depends on the model's patch_size and spatial_merge_size (both defined in config.json):

patches = (H / patch_size / spatial_merge_size) × (W / patch_size / spatial_merge_size)

For Qwen3-VL (patch_size=16, spatial_merge_size=2):

| Image resolution | Merged patches |
|---|---|
| 1792 × 1792 | 3,136 |
| 4096 × 4096 | 16,384 |
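The table values follow directly from the formula; merged_patches below is a hypothetical helper that reproduces it, using Qwen3-VL's patch_size=16 and spatial_merge_size=2 from config.json:

```python
def merged_patches(height: int, width: int,
                   patch_size: int = 16, spatial_merge_size: int = 2) -> int:
    """Merged patch count per image: (H / p / m) x (W / p / m)."""
    merge = patch_size * spatial_merge_size
    return (height // merge) * (width // merge)

print(merged_patches(1792, 1792))  # 3136
print(merged_patches(4096, 4096))  # 16384
```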

Compiling with max_seq_lens: 16384 for a 1792×1792 deployment reserves about 5× the patches the workload actually uses (16,384 vs. 3,136). The ViT runs at the compiled capacity regardless of the actual input, so the excess translates directly into wasted latency and memory.

Resolution

Two parameters form a pipeline and must stay in sync:

  1. processor.max_pixels — caps image size before patch extraction.
  2. rbln_config["visual"]["max_seq_lens"] — caps the patch count accepted by the compiled ViT graph.

| Relationship | Outcome |
|---|---|
| Patches from max_pixels fit within max_seq_lens | Inference runs |
| Patches from max_pixels closely match max_seq_lens | Compute and memory stay efficient |
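A pre-flight check can keep the two in sync; a sketch assuming the Qwen3-VL patch geometry (patch_size=16, spatial_merge_size=2), with max_merged_patches as a hypothetical helper:

```python
def max_merged_patches(max_pixels: int,
                       patch_size: int = 16, spatial_merge_size: int = 2) -> int:
    """Upper bound on merged patches for any image within max_pixels."""
    return max_pixels // (patch_size * spatial_merge_size) ** 2

# Processor cap and compiled ViT cap for a 1792x1792 deployment:
max_pixels = 1792 * 1792
max_seq_lens = 3136
assert max_merged_patches(max_pixels) <= max_seq_lens, (
    "processor max_pixels can emit more patches than the compiled ViT accepts"
)
```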

For a deployment capped at 1792×1792:

Compile

rbln_config={
    "visual": {
        "max_seq_lens": 3136,
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "batch_size": 8,
    "create_runtimes": False,
}

Load

processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256 * 16 * 16,
    max_pixels=1792 * 1792,
)

Note

max_seq_lens freezes the compiled graph shape. Size it to the realistic maximum for the deployment, not the architectural maximum the model supports. Recompile if the deployment resolution grows.

Bucketing — Variable image sizes

When the deployment serves a range of image sizes rather than a single fixed resolution, compile the ViT with multiple max_seq_lens buckets. The runtime picks the smallest fitting graph per request, so smaller images avoid the wasted compute described above without sacrificing the larger inputs that need the full graph.

rbln_config={
    "visual": {
        "max_seq_lens": [3136, 8192, 16384],   # tiers chosen for the deployment's resolution mix
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    ...
}

See the VLM tutorial for the bucketing setup applied to image-size variation.
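The smallest-fitting-graph selection can be modeled in a few lines; pick_bucket is a hypothetical helper for illustration, not an optimum-rbln API:

```python
def pick_bucket(patches: int, buckets: list[int]) -> int:
    """Return the smallest compiled max_seq_lens bucket that fits the request."""
    for cap in sorted(buckets):
        if patches <= cap:
            return cap
    raise ValueError(f"{patches} patches exceed the largest bucket {max(buckets)}")

buckets = [3136, 8192, 16384]
print(pick_bucket(3136, buckets))   # 3136 (a 1792x1792 image uses the smallest graph)
print(pick_bucket(10000, buckets))  # 16384
```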