Qwen3-VL 2B¶

Overview¶

This tutorial explains how to run a multimodal model on vLLM using multiple RBLN NPUs. It uses the Qwen/Qwen3-VL-2B-Instruct model, which accepts images and videos as inputs.

Setup & Installation¶

Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:

System Requirements:
- Python: 3.10–3.13
- RBLN Driver

Package Requirements:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln – automatically installs vLLM

Installation Command:

pip install \
  --extra-index-url https://pypi.rbln.ai/simple \
  rebel-compiler==0.10.4.post1
pip install \
  --extra-index-url https://wheels.vllm.ai/0.18.0/cpu \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  vllm-rbln==0.10.4

Note

Installing rebel-compiler requires an RBLN Portal account.
The commands above are intended for a default pip install on Debian-based Linux such as Ubuntu. For all other configurations, refer to the Installation Guide for the supported install matrix and the applicable commands.
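
After installation, a quick check of the environment can catch a missing package before compilation starts. The snippet below is a minimal sketch using only the Python standard library. The distribution names are taken from the requirements list and the install commands above (rebel-compiler is assumed to be the distribution name of the RBLN Compiler), and the helper names are hypothetical.

```python
import sys
from importlib import metadata

# Distribution names as they appear in the requirements list and the
# pip commands above; "rebel-compiler" is assumed to be the RBLN Compiler.
REQUIRED = [
    "torch",
    "transformers",
    "numpy",
    "rebel-compiler",
    "optimum-rbln",
    "vllm-rbln",
    "vllm",
]


def python_supported(version=None):
    """Return True if the interpreter is within the 3.10-3.13 range."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 10) <= (major, minor) <= (3, 13)


def installed_versions(names):
    """Map each distribution name to its version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python supported:", python_supported())
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:16s} {version or 'NOT INSTALLED'}")
```

This only reports what pip has installed; it does not verify the RBLN Driver or that the NPUs are visible to the system.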

1. Execution: With Pre-Compilation¶

This step demonstrates how to use vLLM to serve the Qwen/Qwen3-VL-2B-Instruct model on multiple RBLN NPUs.

Warning

Sharing the default device pool across submodules can exhaust device memory as batch_size grows. The example below assigns the visual submodule and the language model (LM) to disjoint device pools to avoid this. See Multi-Module Models for details.

1.1. Model Compilation¶

You can modify the parameters of the main module as well as the submodules through rbln_config. For the original source code, refer to the RBLN Model Zoo. If you need the API reference, see RblnModelConfig.

visual submodule:
- max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), representing the number of patches in an image.
- device: Defines the device allocation for each submodule during runtime.
  - As Qwen3-VL consists of multiple submodules, loading them all onto a single device may exceed its memory capacity, especially as the batch size increases. Distributing the submodules across devices spreads that memory load.
- tensor_parallel_size: Defines the number of NPUs to be used for ViT inference.
main module:
- tensor_parallel_size: Defines the number of NPUs to be used for inference.
- kvcache_partition_len: Defines the length of KV cache partitions for flash attention.
- max_seq_len: Defines the maximum position embedding for the language model. It must be a multiple of kvcache_partition_len.
- device: Defines the device allocation for all modules other than submodules that have their own device setting.
- batch_size: Defines the batch size for compilation.
- decoder_batch_sizes: Defines the batch sizes for dynamic batching.
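
The constraints described in the two lists above can be checked before starting a long compilation. The following is a minimal sketch, not part of optimum-rbln: check_rbln_config is a hypothetical helper, and the rule that each device list has one entry per tensor-parallel rank is an assumption drawn from the example configuration on this page.

```python
def check_rbln_config(cfg: dict) -> list[str]:
    """Return a list of problems found in an rbln_config-style dict."""
    problems = []

    # max_seq_len must be a multiple of kvcache_partition_len.
    partition = cfg.get("kvcache_partition_len")
    max_len = cfg.get("max_seq_len")
    if partition and max_len and max_len % partition != 0:
        problems.append(
            f"max_seq_len={max_len} is not a multiple of "
            f"kvcache_partition_len={partition}"
        )

    # The visual submodule and the main module should use disjoint devices.
    visual = cfg.get("visual", {})
    overlap = sorted(set(cfg.get("device", [])) & set(visual.get("device", [])))
    if overlap:
        problems.append(f"devices shared by visual and main module: {overlap}")

    # ASSUMPTION: one device id per tensor-parallel rank, as in the example.
    for name, section in (("main", cfg), ("visual", visual)):
        devices = section.get("device")
        tp_size = section.get("tensor_parallel_size")
        if devices is not None and tp_size is not None and len(devices) != tp_size:
            problems.append(
                f"{name}: {len(devices)} devices for tensor_parallel_size={tp_size}"
            )
    return problems


if __name__ == "__main__":
    config = {
        "visual": {
            "device": [8, 9, 10, 11, 12, 13, 14, 15],
            "max_seq_lens": 16_384,
            "tensor_parallel_size": 8,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 8,
    }
    print(check_rbln_config(config) or "no problems found")
```

An empty list means none of these particular checks failed; it says nothing about whether the configuration fits in device memory.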

Note

Select the batch size based on the model size and the NPU specifications. vllm-rbln also supports Dynamic Batching to improve throughput and resource utilization. See Dynamic Batching for details.

from optimum.rbln import RBLNQwen3VLForConditionalGeneration

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "device": [8, 9, 10, 11, 12, 13, 14, 15],
            "max_seq_lens": 16_384,
            "tensor_parallel_size": 8,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 8,
    },
)
model.save_pretrained("rbln-Qwen3-VL-2B-Instruct")

1.2. Inference using vLLM¶

You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.

Note

Parameters passed directly to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len).

In contrast, the same parameters inside rbln_config must be written without the prefix (e.g., batch_size, max_seq_len).
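
To make the naming rule concrete, the sketch below converts prefixed keyword arguments into the unprefixed form used inside rbln_config. to_rbln_config is a hypothetical helper written for illustration only; it is not an optimum-rbln function.

```python
def to_rbln_config(kwargs: dict) -> dict:
    """Strip the 'rbln_' prefix from prefixed keys; drop all other keys."""
    prefix = "rbln_"
    return {
        key[len(prefix):]: value
        for key, value in kwargs.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    # Prefixed form, as it would be passed directly to from_pretrained.
    prefixed = {"rbln_batch_size": 8, "rbln_max_seq_len": 262_144}
    # Unprefixed form, as it would appear inside rbln_config.
    print(to_rbln_config(prefixed))
```

Either style names the same parameters; the point is only that the prefix belongs to the keyword-argument form and not to the keys of rbln_config.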

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams
import os

# If the video is long, set `VLLM_ENGINE_ITERATION_TIMEOUT_S`
# to a higher timeout value.
VIDEO_URLS = [
    "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
]

def generate_prompts_video(model_id: str):
    processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
    video_nums = len(VIDEO_URLS)
    messages = [[
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": VIDEO_URLS[i],
                },
                {
                    "type": "text",
                    "text": "Describe this video."
                },
            ],
        },
    ] for i in range(video_nums)]

    texts = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    arr_video_inputs = []
    for i in range(video_nums):
        _, video_inputs = process_vision_info(
            messages[i], return_video_metadata=True)
        arr_video_inputs.append(video_inputs)

    # With return_video_metadata=True, each video_input is already a
    # (video_tensor, metadata_dict) tuple, satisfying Qwen3-VL's
    # video_needs_metadata=True requirement.
    return [{
        "prompt": text,
        "multi_modal_data": {
            "video": video_inputs,
        },
        "mm_processor_kwargs": {
            "min_pixels": 1024 * 14 * 14,
            "max_pixels": 5120 * 14 * 14,
        },
    } for text, video_inputs in zip(texts, arr_video_inputs)]

def main():
    model_id = "rbln-Qwen3-VL-2B-Instruct"
    llm = LLM(model=model_id)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    sampling_params = SamplingParams(
        temperature=0,
        ignore_eos=False,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
        max_tokens=200
    )

    inputs = generate_prompts_video(model_id)
    outputs = llm.generate(inputs, sampling_params)

    for output in outputs:
        generated_text = output.outputs[0].text
        print(generated_text)

if __name__ == "__main__":
    main()

Example Output:

The video begins with a close-up shot of a clear plastic container filled with various fruits, including apples, pears, and grapes. The container is placed on a wooden table, and the background features a blue wall with a floral arrangement. The camera then zooms in on the container, highlighting its transparent design and the vibrant colors of the fruits inside. The video emphasizes the container's ability to keep the fruits fresh and visible.

Next, the video transitions to a demonstration of the container's durability. A hand is seen placing a heavy brick on top of the container, which remains intact and undamaged. This scene is repeated several times, each time with a different angle or lighting, to emphasize the container's strength and resistance to pressure. The text on the screen highlights the container's ability to withstand heavy loads, making it suitable for transporting and storing fruits.

The video then shifts to a close-up of the container being rotated on a white surface, showcasing its sleek design and the clarity of the plastic
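
The comment at the top of the script above mentions raising VLLM_ENGINE_ITERATION_TIMEOUT_S for long videos. One way to do this from Python is sketched below. The helper name and the 300-second value are illustrative assumptions, not recommended settings, and the variable should be set before the LLM engine is constructed.

```python
import os

TIMEOUT_VAR = "VLLM_ENGINE_ITERATION_TIMEOUT_S"


def raise_iteration_timeout(seconds: int) -> int:
    """Ensure the timeout is at least `seconds`; return the value in effect.

    An existing, larger value set by the user is left untouched.
    """
    current = int(os.environ.get(TIMEOUT_VAR, "0"))
    effective = max(current, seconds)
    os.environ[TIMEOUT_VAR] = str(effective)
    return effective


if __name__ == "__main__":
    # 300 is an arbitrary example value; tune it to the video length.
    print(TIMEOUT_VAR, "=", raise_iteration_timeout(300))
```

Exporting the variable in the shell before launching the script achieves the same result.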

2. Execution: Without Pre-Compilation (Beta)¶

Info

This is a beta feature. It compiles and serves Qwen/Qwen3-VL-2B-Instruct with the given vLLM parameters, placing the encoder on a separate set of NPUs.

2.1. Compilation and Inference using vLLM¶

When vLLM parameters are provided, vllm-rbln compiles the model with optimum-rbln at engine startup and then serves it for inference. The example below shows how to set up the vLLM engine.

Note

VLLM_RBLN_TP_SIZE sets the tensor-parallel size for the main module. To assign a tensor_parallel_size to a submodule, set it directly under that submodule in rbln_config, as in the example below.

Note

Compiled artifacts are saved under $VLLM_CACHE_ROOT/compiled_models. VLLM_CACHE_ROOT defaults to ~/.cache/vllm; set the environment variable to override the location.
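
The location described in the note above can be resolved in a few lines, which is useful when inspecting or clearing compiled artifacts. This is a minimal sketch that follows the default stated in the note; compiled_models_dir is a hypothetical helper, not a vllm-rbln function.

```python
import os
from pathlib import Path


def compiled_models_dir() -> Path:
    """Return $VLLM_CACHE_ROOT/compiled_models, using the documented default."""
    root = os.environ.get("VLLM_CACHE_ROOT", "~/.cache/vllm")
    return Path(root).expanduser() / "compiled_models"


if __name__ == "__main__":
    path = compiled_models_dir()
    print("compiled artifacts directory:", path)
    print("exists:", path.is_dir())
```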

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams
import os

# VIDEO_URLS and generate_prompts_video() unchanged.

def main():
    model_id = "Qwen/Qwen3-VL-2B-Instruct"
    os.environ["VLLM_RBLN_TP_SIZE"] = "8"
    llm = LLM(
        model=model_id,
        block_size=16_384,
        max_model_len=262_144,
        max_num_seqs=8,
        additional_config={
            "rbln_config": {
                # LM on devices 0-7
                "device": [0, 1, 2, 3, 4, 5, 6, 7],
                # ViT on devices 8-15
                "visual": {
                    "device": [8, 9, 10, 11, 12, 13, 14, 15],
                    "tensor_parallel_size": 8,
                    "max_seq_lens": 16_384,
                }
            }
        }
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # ... (rest of main() unchanged)

if __name__ == "__main__":
    main()

Example Output:

The video begins with a close-up shot of a clear plastic container filled with various fruits, including apples, pears, and grapes. The container is placed on a wooden table, and the background features a blue wall with a floral arrangement. The camera then zooms in on the container, highlighting its transparent design and the vibrant colors of the fruits inside. The video emphasizes the container's ability to keep the fruits fresh and visible.

Next, the video transitions to a demonstration of the container's durability. A hand is seen placing a heavy brick on top of the container, which remains intact and undamaged. This scene is repeated several times, each time with a different angle or lighting, to emphasize the container's strength and resistance to pressure. The text on the screen highlights the container's ability to withstand heavy loads, making it suitable for transporting and storing fruits.

The video then shifts to a close-up of the container being rotated on a white surface, showcasing its sleek design and the clarity of the plastic
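
The engine arguments in section 2.1 carry the same values as the compile-time parameters in section 1.1. The sketch below writes that correspondence down. The pairing of block_size with kvcache_partition_len, max_model_len with max_seq_len, and max_num_seqs with batch_size is an assumption inferred from those matching values, and engine_args_from_rbln_config is a hypothetical helper, not a vllm-rbln API.

```python
def engine_args_from_rbln_config(cfg: dict) -> dict:
    """Derive vLLM engine keyword arguments from compile-time parameters.

    ASSUMPTION: the key pairing below is inferred from the two examples
    on this page using identical values; check it against the vllm-rbln
    documentation before relying on it.
    """
    return {
        "block_size": cfg["kvcache_partition_len"],
        "max_model_len": cfg["max_seq_len"],
        "max_num_seqs": cfg["batch_size"],
    }


if __name__ == "__main__":
    compile_time = {
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "batch_size": 8,
    }
    print(engine_args_from_rbln_config(compile_time))
```

Keeping both configurations derived from one dictionary avoids the two paths drifting apart when a value such as the batch size is changed.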
