
Troubleshooting Multi-Module Models

This guide covers common issues when compiling and deploying a multi-module vision-language model on RBLN NPUs, using the Qwen3-VL-2B Model Zoo example as the reference. The same patterns apply to any Optimum RBLN model composed of independent submodules.

Prerequisites

Examples below use tensor_parallel_size=8, matching the Qwen3-VL-2B Model Zoo compile config. This guide uses two device layouts:

  • 8 ATOM™ (devices 0–7) — the Model Zoo baseline. Both submodules share the same 8-device pool. Works with the default compile config and small batch sizes.
  • 16 ATOM™ (devices 0–15) — visual and LM on disjoint pools (8–15 and 0–7). Required for heavier configs such as the batch_size=8 scenario used throughout this guide.

Run rbln-smi to list available devices, and adjust tensor_parallel_size and device assignments to match your hardware.

Quick reference

| Symptom | Root cause | Go to |
|---|---|---|
| Failed to create RBLN runtime right after compilation | Default create_runtimes=True runs runtime creation during compile, which fails | Step 1 |
| Failed to create RBLN runtime when loading a saved model | Every submodule defaults to the same [0..TP-1] device pool; combined per-device footprint exceeds memory | Step 2 |
| No memory blocks are available for allocation | Paged attention block pool exhausted at higher batch sizes | Step 3 |
| Memory is not enough for full sequence length | KV cache sized for the architectural max, not the actual workload | Step 4 |
| ViT latency higher than expected for small images | max_seq_lens over-provisioned for actual resolution | Step 5 |

Step 1: Compile without creating runtimes

Symptom

With batch_size=8, from_pretrained crashes immediately after the compile phase completes:

from optimum.rbln import RBLNAutoModelForVision2Seq

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 16384,
            "tensor_parallel_size": 8,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "batch_size": 8,
    },
)
Failed to create RBLN runtime: ...

If you only need to compile the model without loading it to NPU, you can use:
  from_pretrained(..., rbln_create_runtimes=False) or
  from_pretrained(..., rbln_config={..., 'create_runtimes': False})

Root cause

from_pretrained(..., export=True) performs two operations in sequence:

  1. Compile the model to RBLN IR — succeeds.
  2. Create runtimes by loading the compiled modules onto device memory — fails.

Qwen3-VL consists of two independent submodules:

  • visual — a Vision Transformer that encodes images and video frames.
  • model — a causal language model that generates text.

During runtime creation, each submodule's device defaults to list(range(tensor_parallel_size)), i.e. [0, 1, …, 7] for TP=8. Both visual and the LM therefore shard across the same 8-device pool, and every device in that pool carries a ViT slice, an LM tensor-parallel shard, and a KV cache slice. Runtime creation fails when the combined per-device footprint exceeds device memory.

Resolution

create_runtimes defaults to True, so from_pretrained(..., export=True) proceeds from compilation directly into runtime creation — the step that fails. Setting create_runtimes: False at both the submodule and top levels stops from_pretrained after compilation, bypassing the failing runtime-creation step. This is the canonical Model Zoo compile.py pattern:

import os

from optimum.rbln import RBLNAutoModelForVision2Seq

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model = RBLNAutoModelForVision2Seq.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 16384,
            "tensor_parallel_size": 8,
            "create_runtimes": False,                   # ← added
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "batch_size": 8,
        "create_runtimes": False,                       # ← added
    },
)
model.save_pretrained(os.path.basename(model_id))

Compilation completes and the artifacts are written to disk. The underlying memory pressure remains; it resurfaces at load time, which Step 2 handles with explicit device placement.

Each field, scoped to its level:

| Field | Level | Purpose |
|---|---|---|
| visual.max_seq_lens | submodule (ViT) | Max merged patches per image/frame the ViT graph accepts |
| visual.tensor_parallel_size | submodule (ViT) | TP degree for the ViT |
| visual.create_runtimes | submodule (ViT) | Skip ViT runtime creation at compile time |
| tensor_parallel_size | top (LM) | TP degree for the language model |
| kvcache_partition_len | top (LM) | Flash attention partition size; must divide max_seq_len |
| max_seq_len | top (LM) | Maximum LM position embedding; must be a multiple of kvcache_partition_len |
| create_runtimes | top (LM) | Skip LM runtime creation at compile time |
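The divisibility constraint between max_seq_len and kvcache_partition_len can be verified before a long compile run. A minimal sketch in plain Python, with no RBLN dependency; check_lm_config is a hypothetical helper for illustration, not an optimum-rbln API:

```python
def check_lm_config(cfg: dict) -> None:
    """Validate the flash-attention divisibility constraint before compiling."""
    max_seq_len = cfg["max_seq_len"]
    partition_len = cfg["kvcache_partition_len"]
    if max_seq_len % partition_len != 0:
        raise ValueError(
            f"max_seq_len ({max_seq_len}) must be a multiple of "
            f"kvcache_partition_len ({partition_len})"
        )

# The Step 1 config passes: 262_144 = 16 * 16_384.
check_lm_config({"max_seq_len": 262_144, "kvcache_partition_len": 16_384})
```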

Tip

Always set create_runtimes: False when compiling multi-module models.


Step 2: Distribute submodules across devices at load time

Symptom

The model compiled in Step 1 with batch_size=8 is loaded without any device hints:

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
)

Failed to create RBLN runtime appears again — this time during load.

Root cause

Same underlying pressure as Step 1 — without an explicit device setting, visual and the LM share the default [0..TP-1] pool, and each device's combined ViT + LM + KV cache footprint exceeds device memory. Step 1 sidestepped the failure by skipping runtime creation; loading the saved artifacts surfaces it again.

Resolution

Assign every submodule to a device list explicitly. For the batch_size=8 compile from Step 1, place visual and the LM on disjoint pools across 16 ATOM™ devices, so no device holds both ViT and LM memory:

from optimum.rbln import RBLNAutoModelForVision2Seq

model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
    rbln_config={
        "visual": {
            "device": [8, 9, 10, 11, 12, 13, 14, 15],   # ViT on devices 8–15
        },
        "device": [0, 1, 2, 3, 4, 5, 6, 7],             # LM on devices 0–7
    },
)

When only 8 ATOM™ devices are available, use the Model Zoo inference.py shared-pool layout instead (visual and LM both on devices 0–7). The combined footprint must fit each device, so the batch_size=8 compile from Step 1 will not load on this layout; first reduce the workload via Step 3 or Step 4.
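For reference, the shared-pool load call would look like the following sketch; it assumes a compile whose combined per-device footprint fits 8 devices (for example, one produced after applying the reductions in Step 3 or Step 4):

```python
from optimum.rbln import RBLNAutoModelForVision2Seq

# Shared-pool layout: ViT and LM both shard across devices 0-7,
# so every device carries a slice of both submodules.
model = RBLNAutoModelForVision2Seq.from_pretrained(
    "Qwen3-VL-2B-Instruct",
    export=False,
    rbln_config={
        "visual": {"device": [0, 1, 2, 3, 4, 5, 6, 7]},
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
    },
)
```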


Step 3: Serve through vllm-rbln for higher batch sizes

Symptom

At higher batch sizes, the paged attention block pool is exhausted at runtime:

RuntimeError: No memory blocks are available for allocation.
The generate() API cannot complete this inference task because Paged Attention is not fully supported by optimum-rbln.
This is supported by vllm-rbln (see: https://docs.rbln.ai/software/model_serving/vllm_support/vllm-rbln.html).
Using vllm-rbln should fix this issue and enhance inference performance.

The error itself directs users to vllm-rbln; the rest of this step walks through the migration path.

Root cause

The KV cache uses paged attention. With optimum-rbln alone, the block pool is pre-allocated at compile time with a fixed kvcache_num_blocks; once the active batch needs more blocks than were reserved, the pool exhausts mid-generation and the error above is raised.

Resolution

Serve the model through vllm-rbln instead of calling the compiled artifacts directly. vllm-rbln manages the paged-attention block pool dynamically. The engine handles block allocation, eviction, and admission control, so workloads that overrun the pool are queued or rejected safely rather than crashing mid-generation.

Standalone optimum-rbln — Tune kvcache_num_blocks

When you must run on optimum-rbln directly, set kvcache_num_blocks so the pre-allocated pool covers the batch:

num_full_blocks = batch_size × (max_seq_len / kvcache_block_size)

Under flash attention, kvcache_block_size equals kvcache_partition_len, so the two are interchangeable in the formula.
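Plugging in the Step 1 values, the formula yields 128 blocks. A quick sanity check in plain Python:

```python
batch_size = 8
max_seq_len = 262_144
kvcache_block_size = 16_384   # equals kvcache_partition_len under flash attention

# num_full_blocks = batch_size x (max_seq_len / kvcache_block_size)
num_full_blocks = batch_size * (max_seq_len // kvcache_block_size)
print(num_full_blocks)  # 8 * 16 = 128
```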

rbln_config={
    "visual": {...},                 # see Step 1 for the full visual block
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "kvcache_num_blocks": 128,       # = num_full_blocks
    "batch_size": 8,
    "create_runtimes": False,
}

Valid range:

(max_seq_len / kvcache_block_size) + 1   ≤   kvcache_num_blocks   ≤   num_full_blocks

The minimum valid kvcache_num_blocks for this configuration is (max_seq_len / kvcache_block_size) + 1 = 17. Values below num_full_blocks but within the valid range let inference start normally; however, the pool can exhaust before sequences reach max_seq_len, raising the same no-memory-blocks error.


Step 4: Cap the LM context length to the actual workload

Symptom

Compilation at higher batch sizes is rejected with:

ValueError: Memory is not enough for full sequence length.
Please consider decreasing `max_seq_len` to reduce the number of blocks.

Root cause

In flash attention mode (activated by kvcache_partition_len), the KV cache is pre-allocated at compile time. Its memory footprint scales as:

KV memory  ∝  batch_size  ×  (max_seq_len / kvcache_partition_len)

The Model Zoo compile.py sets max_seq_len to 262,144 — Qwen3-VL's architectural maximum. With batch_size > 1, the engine reserves the full 256K-token slot for every sequence, quickly exhausting device memory even when actual requests are far shorter.

Resolution

Reduce max_seq_len to a realistic ceiling for the target workload. If requests stay under 32K tokens:

rbln_config={
    "visual": {
        "max_seq_lens": 16384,
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 32_768,           # 262_144 → 32_768
    "batch_size": 8,
    "create_runtimes": False,
}

| Parameter | Constraint | Effect |
|---|---|---|
| max_seq_len | Must be a multiple of kvcache_partition_len | Sets the per-sequence KV memory ceiling |
| kvcache_partition_len | Flash attention partition size | Smaller = more flexibility; larger = less overhead |
| batch_size | Concurrent sequences | Multiplies KV memory linearly |

Halving max_seq_len roughly halves KV cache memory. When the workload needs both long contexts and larger batches, combine this with Step 3: reduce max_seq_len first, then tune kvcache_num_blocks within the remaining memory budget.
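The block arithmetic makes the savings concrete; a sketch in plain Python, assuming the block size equals kvcache_partition_len as in Step 3:

```python
def kv_blocks(batch_size: int, max_seq_len: int, partition_len: int) -> int:
    """Full-sequence KV cache blocks pre-allocated under flash attention."""
    return batch_size * (max_seq_len // partition_len)

# Architectural maximum vs. a 32K workload ceiling, both at batch_size=8:
print(kv_blocks(8, 262_144, 16_384))  # 128 blocks
print(kv_blocks(8, 32_768, 16_384))   # 16 blocks, an 8x reduction
```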


Step 5: Right-size ViT input for the target resolution

Symptom

Batched inference runs, but ViT latency is higher than expected — even for small images.

Root cause

visual.max_seq_lens sets the maximum number of merged patches the ViT graph accepts per image or video frame. The Model Zoo compile.py defaults to 16,384 — a conservative upper bound that covers any input up to roughly 4096×4096. If the deployment or serving scenario has a known maximum resolution below that, max_seq_lens can be shrunk to that scenario's upper bound to cut ViT compute and memory that would otherwise be wasted.

Merged patch count depends on the model's patch_size and spatial_merge_size (both defined in config.json):

patches = (H / patch_size / spatial_merge_size) × (W / patch_size / spatial_merge_size)

For Qwen3-VL (patch_size=16, spatial_merge_size=2):

| Image resolution | Merged patches |
|---|---|
| 1792 × 1792 | 3,136 |
| 4096 × 4096 | 16,384 |
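The table values follow directly from the formula; merged_patches below is a hypothetical helper that reproduces it, using Qwen3-VL's patch_size=16 and spatial_merge_size=2 from config.json:

```python
def merged_patches(height: int, width: int,
                   patch_size: int = 16, spatial_merge_size: int = 2) -> int:
    """Merged patch count per image: (H / p / m) x (W / p / m)."""
    merge = patch_size * spatial_merge_size
    return (height // merge) * (width // merge)

print(merged_patches(1792, 1792))  # 3136
print(merged_patches(4096, 4096))  # 16384
```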

Compiling with max_seq_lens: 16384 for a 1792×1792 deployment reserves about 5× the patches the workload actually uses (16,384 vs. 3,136). The ViT runs at the compiled capacity regardless of the actual input, so the excess translates directly into wasted latency and memory.

Resolution

Two parameters form a pipeline and must stay in sync:

  1. processor.max_pixels — caps image size before patch extraction.
  2. rbln_config["visual"]["max_seq_lens"] — caps the patch count accepted by the compiled ViT graph.

| Relationship | Outcome |
|---|---|
| Patches from max_pixels fit within max_seq_lens | Inference runs |
| Patches from max_pixels closely match max_seq_lens | Compute and memory stay efficient |
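A pre-flight check can keep the two in sync; a sketch assuming the Qwen3-VL patch geometry (patch_size=16, spatial_merge_size=2), with max_merged_patches as a hypothetical helper:

```python
def max_merged_patches(max_pixels: int,
                       patch_size: int = 16, spatial_merge_size: int = 2) -> int:
    """Upper bound on merged patches for any image within max_pixels."""
    return max_pixels // (patch_size * spatial_merge_size) ** 2

# Processor cap and compiled ViT cap for a 1792x1792 deployment:
max_pixels = 1792 * 1792
max_seq_lens = 3136
assert max_merged_patches(max_pixels) <= max_seq_lens, (
    "processor max_pixels can emit more patches than the compiled ViT accepts"
)
```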

For a deployment capped at 1792×1792:

Compile

rbln_config={
    "visual": {
        "max_seq_lens": 3136,
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "batch_size": 8,
    "create_runtimes": False,
}

Load

processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256 * 16 * 16,
    max_pixels=1792 * 1792,
)

Note

max_seq_lens freezes the compiled graph shape. Size it to the realistic maximum for the deployment, not the architectural maximum the model supports. Recompile if the deployment resolution grows.

Bucketing — Variable image sizes

When the deployment serves a range of image sizes rather than a single fixed resolution, compile the ViT with multiple max_seq_lens buckets. The runtime picks the smallest fitting graph per request, so smaller images avoid the wasted compute described above without sacrificing the larger inputs that need the full graph.

rbln_config={
    "visual": {
        "max_seq_lens": [3136, 8192, 16384],   # tiers chosen for the deployment's resolution mix
        "tensor_parallel_size": 8,
        "create_runtimes": False,
    },
    ...
}

See the VLM tutorial for the bucketing setup applied to image-size variation.
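The smallest-fitting-graph selection can be modeled in a few lines; pick_bucket is a hypothetical helper for illustration, not an optimum-rbln API:

```python
def pick_bucket(patches: int, buckets: list[int]) -> int:
    """Return the smallest compiled max_seq_lens bucket that fits the request."""
    for cap in sorted(buckets):
        if patches <= cap:
            return cap
    raise ValueError(f"{patches} patches exceed the largest bucket {max(buckets)}")

buckets = [3136, 8192, 16384]
print(pick_bucket(3136, buckets))   # 3136 (a 1792x1792 image uses the smallest graph)
print(pick_bucket(10000, buckets))  # 16384
```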