Qwen3-VL 2B
Overview
This tutorial explains how to run a multimodal model on vLLM using multiple RBLN NPUs. The guide uses the Qwen/Qwen3-VL-2B-Instruct model, which accepts images and videos as inputs.
Setup & Installation
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
- Package Requirements:
- Installation Command:
| pip install \
    --extra-index-url https://pypi.rbln.ai/simple \
    rebel-compiler==0.10.4
pip install \
    --extra-index-url https://wheels.vllm.ai/0.18.0/cpu \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    vllm-rbln==0.10.4
|
Note
- rebel-compiler requires an RBLN Portal account.
- The commands above are intended for a default pip install on Debian-based Linux such as Ubuntu. For all other configurations, refer to the Installation Guide for the supported install matrix and the applicable commands.
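After installation, it can help to confirm that the pinned versions are the ones actually present in the active environment. The following is a minimal sketch using only the Python standard library; the helper names `installed_versions` and `version_mismatches` are hypothetical and not part of any RBLN package, and the expected versions are simply the pins from the commands above.

```python
from importlib import metadata

# Versions pinned by the install commands in this guide.
EXPECTED = {"rebel-compiler": "0.10.4", "vllm-rbln": "0.10.4"}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(expected, found):
    """Return {package: (expected, installed)} for every package off its pin."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    found = installed_versions(EXPECTED)
    for name, (want, got) in version_mismatches(EXPECTED, found).items():
        print(f"{name}: expected {want}, found {got}")
```

An empty result from `version_mismatches` means both packages match the pins; a `None` entry means the package is not installed in this environment.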
1. Execution: With Pre-Compilation
This step demonstrates how to use vLLM to serve the Qwen/Qwen3-VL-2B-Instruct model on multiple RBLN NPUs.
Warning
Sharing the default device pool across submodules can exhaust device memory as batch_size grows. The example below assigns the visual submodule and the language model (LM) to disjoint device pools to avoid this. See Multi-Module Models for details.
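Before compiling, you may want to confirm that the device pools you plan to use really are disjoint. The sketch below is a plain-Python check and does not call any RBLN API; the function name `overlapping_devices` and the pool labels are hypothetical, and the device IDs are the ones used later in this guide.

```python
from itertools import combinations


def overlapping_devices(pools):
    """Return {(module_a, module_b): shared device IDs} for each overlapping pair."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(pools.items(), 2):
        shared = sorted(set(ids_a) & set(ids_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Pools as used in this guide: LM on NPUs 0-7, visual submodule on NPUs 8-15.
pools = {
    "language_model": list(range(0, 8)),
    "visual": list(range(8, 16)),
}

if overlapping_devices(pools):
    raise ValueError(f"Device pools overlap: {overlapping_devices(pools)}")
```

An empty dictionary means no NPU is assigned to more than one module.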
1.1. Model Compilation
You can modify the parameters of the main module as well as the submodules through rbln_config. For the original source code, refer to the RBLN Model Zoo.
If you need the API reference, see RblnModelConfig.
- visual submodule:
  - max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), representing the number of patches in an image.
  - device: Defines the device allocation for each submodule during runtime.
  - tensor_parallel_size: Defines the number of NPUs to be used for ViT inference.
  - As Qwen3-VL consists of multiple submodules, loading them all onto a single device may exceed its memory capacity, especially as the batch size increases. Distributing submodules across devices keeps memory usage in check for efficient runtime performance.
- main module:
  - tensor_parallel_size: Defines the number of NPUs to be used for inference.
  - kvcache_partition_len: Defines the length of KV cache partitions for flash attention.
  - max_seq_len: Defines the maximum position embedding for the language model; it must be a multiple of kvcache_partition_len.
  - device: Defines the device allocation for all modules other than submodules that have their own device setting.
  - batch_size: Defines the batch size for compilation.
  - decoder_batch_sizes: Defines the batch sizes for dynamic batching.
Note
Select the batch size based on the model size and NPU specifications. vllm-rbln also supports Dynamic Batching to improve throughput and resource utilization. See Dynamic Batching for details.
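The constraint that max_seq_len must be a multiple of kvcache_partition_len is easy to check before starting a long compilation. The following is a minimal sketch in plain Python, not an RBLN API: `check_main_module` is a hypothetical helper, and the rule that the device list has one entry per tensor-parallel rank is an assumption inferred from the example in this guide (8 devices, tensor_parallel_size of 8).

```python
def check_main_module(cfg):
    """Return a list of problems found in the main-module settings."""
    problems = []
    if cfg["max_seq_len"] % cfg["kvcache_partition_len"] != 0:
        problems.append("max_seq_len is not a multiple of kvcache_partition_len")
    # Assumption (inferred from this guide's example, not a documented rule):
    # one device ID per tensor-parallel rank.
    if len(cfg["device"]) != cfg["tensor_parallel_size"]:
        problems.append("device count differs from tensor_parallel_size")
    return problems


# Main-module values used in this guide.
main_cfg = {
    "tensor_parallel_size": 8,
    "kvcache_partition_len": 16_384,
    "max_seq_len": 262_144,
    "device": [0, 1, 2, 3, 4, 5, 6, 7],
    "batch_size": 8,
}

problems = check_main_module(main_cfg)
num_partitions = main_cfg["max_seq_len"] // main_cfg["kvcache_partition_len"]
print(problems, num_partitions)
```

With the values in this guide, the check reports no problems and the sequence length divides into 16 KV cache partitions of 16,384 tokens each.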
| from optimum.rbln import RBLNQwen3VLForConditionalGeneration

model_id = "Qwen/Qwen3-VL-2B-Instruct"

# export=True compiles the model for RBLN NPUs.
model = RBLNQwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        # Visual submodule (ViT) on NPUs 8-15
        "visual": {
            "device": [8, 9, 10, 11, 12, 13, 14, 15],
            "max_seq_lens": 16_384,
            "tensor_parallel_size": 8,
        },
        # Main module (language model) on NPUs 0-7
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 262_144,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 8,
    },
)

# Save the compiled model so that vLLM can load it by path.
model.save_pretrained("rbln-Qwen3-VL-2B-Instruct")
|
1.2. Inference using vLLM
You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.
Note
Parameters passed directly to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len).
In contrast, parameters inside rbln_config must not include the prefix (e.g., batch_size, max_seq_len).
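As a side illustration of this naming rule (separate from the vLLM inference example that follows), the sketch below converts prefixed keyword arguments into the unprefixed keys expected inside rbln_config. It only demonstrates the naming convention; `kwargs_to_rbln_config` is a hypothetical helper and does not reflect how optimum-rbln processes arguments internally.

```python
def kwargs_to_rbln_config(kwargs, prefix="rbln_"):
    """Split keyword arguments into unprefixed rbln_config keys and everything else."""
    rbln_config, others = {}, {}
    for key, value in kwargs.items():
        # "rbln_config" itself starts with the prefix but is not a prefixed parameter.
        if key.startswith(prefix) and key != "rbln_config":
            rbln_config[key[len(prefix):]] = value
        else:
            others[key] = value
    return rbln_config, others


rbln_config, others = kwargs_to_rbln_config(
    {"export": True, "rbln_batch_size": 8, "rbln_max_seq_len": 262_144}
)
print(rbln_config)
print(others)
```

Here `rbln_batch_size` and `rbln_max_seq_len` become `batch_size` and `max_seq_len`, while `export` is left untouched.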
| import os

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

# If the video is too long, set `VLLM_ENGINE_ITERATION_TIMEOUT_S`
# to a higher timeout value.
VIDEO_URLS = [
    "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
]


def generate_prompts_video(model_id: str):
    processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
    video_nums = len(VIDEO_URLS)
    messages = [[
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": VIDEO_URLS[i],
                },
                {
                    "type": "text",
                    "text": "Describe this video."
                },
            ],
        },
    ] for i in range(video_nums)]

    texts = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    arr_video_inputs = []
    for i in range(video_nums):
        _, video_inputs = process_vision_info(
            messages[i], return_video_metadata=True)
        arr_video_inputs.append(video_inputs)

    # With return_video_metadata=True, each video_input is already a
    # (video_tensor, metadata_dict) tuple, satisfying Qwen3-VL's
    # video_needs_metadata=True requirement.
    return [{
        "prompt": text,
        "multi_modal_data": {
            "video": video_inputs,
        },
        "mm_processor_kwargs": {
            "min_pixels": 1024 * 14 * 14,
            "max_pixels": 5120 * 14 * 14,
        },
    } for text, video_inputs in zip(texts, arr_video_inputs)]


def main():
    model_id = "rbln-Qwen3-VL-2B-Instruct"
    llm = LLM(model=model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    sampling_params = SamplingParams(
        temperature=0,
        ignore_eos=False,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
        max_tokens=200
    )
    inputs = generate_prompts_video(model_id)
    outputs = llm.generate(inputs, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    main()
|
Example Output:
| The video begins with a close-up shot of a clear plastic container filled with various fruits, including apples, pears, and grapes. The container is placed on a wooden table, and the background features a blue wall with a floral arrangement. The camera then zooms in on the container, highlighting its transparent design and the vibrant colors of the fruits inside. The video emphasizes the container's ability to keep the fruits fresh and visible.
Next, the video transitions to a demonstration of the container's durability. A hand is seen placing a heavy brick on top of the container, which remains intact and undamaged. This scene is repeated several times, each time with a different angle or lighting, to emphasize the container's strength and resistance to pressure. The text on the screen highlights the container's ability to withstand heavy loads, making it suitable for transporting and storing fruits.
The video then shifts to a close-up of the container being rotated on a white surface, showcasing its sleek design and the clarity of the plastic
|
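The comment at the top of the script above mentions raising `VLLM_ENGINE_ITERATION_TIMEOUT_S` for long videos. One way to do that from Python is sketched below; the value of 300 seconds is an arbitrary illustration, not a recommendation from this guide. Setting the variable before the engine is created (and, to be safe, before importing vllm) is an assumption about when it is read.

```python
import os

# Hypothetical value: 300 seconds is only an illustration.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("VLLM_ENGINE_ITERATION_TIMEOUT_S", "300")

timeout_s = int(os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"])
print(f"Engine iteration timeout: {timeout_s} s")
```

Exporting the variable in the shell before launching the script has the same effect and takes precedence over the default set here.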
2. Execution: Without Pre-Compilation (Beta)
Info
This is a beta feature. It compiles and serves Qwen/Qwen3-VL-2B-Instruct with the given vLLM parameters, placing the encoder on a separate set of NPUs.
2.1. Compilation and Inference using vLLM
By providing vLLM parameters, vLLM RBLN compiles the model with optimum-rbln at engine startup and serves it for inference. The example below shows how to set up the vLLM engine.
Note
VLLM_RBLN_TP_SIZE sets the tensor-parallel size for the main module. To assign a tensor_parallel_size to a submodule, set it directly under that submodule in rbln_config, as in the example below.
Note
Compiled artifacts are saved under $VLLM_CACHE_ROOT/compiled_models. VLLM_CACHE_ROOT defaults to ~/.cache/vllm; set the environment variable to override the location.
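As a side note to the cache-location rule (separate from the engine setup example that follows), the directory can be resolved from Python as sketched below. This mirrors only the rule stated in the note; `compiled_models_dir` is a hypothetical helper, not a vllm-rbln function.

```python
import os
from pathlib import Path


def compiled_models_dir(environ=None):
    """Resolve $VLLM_CACHE_ROOT/compiled_models, defaulting to ~/.cache/vllm."""
    env = os.environ if environ is None else environ
    root = env.get("VLLM_CACHE_ROOT", "~/.cache/vllm")
    return Path(root).expanduser() / "compiled_models"


print(compiled_models_dir())
```

Passing a dictionary instead of using `os.environ` makes it easy to preview where artifacts would land for a different `VLLM_CACHE_ROOT` without changing the environment.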
| import os

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

# VIDEO_URLS and generate_prompts_video() unchanged.


def main():
    model_id = "Qwen/Qwen3-VL-2B-Instruct"
    os.environ["VLLM_RBLN_TP_SIZE"] = "8"
    llm = LLM(
        model=model_id,
        block_size=16_384,
        max_model_len=262_144,
        max_num_seqs=8,
        additional_config={
            "rbln_config": {
                # LM on devices 0-7
                "device": [0, 1, 2, 3, 4, 5, 6, 7],
                # ViT on devices 8-15
                "visual": {
                    "device": [8, 9, 10, 11, 12, 13, 14, 15],
                    "tensor_parallel_size": 8,
                    "max_seq_lens": 16_384,
                }
            }
        }
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # ... (rest of main() unchanged)


if __name__ == "__main__":
    main()
|
Example Output:
| The video begins with a close-up shot of a clear plastic container filled with various fruits, including apples, pears, and grapes. The container is placed on a wooden table, and the background features a blue wall with a floral arrangement. The camera then zooms in on the container, highlighting its transparent design and the vibrant colors of the fruits inside. The video emphasizes the container's ability to keep the fruits fresh and visible.
Next, the video transitions to a demonstration of the container's durability. A hand is seen placing a heavy brick on top of the container, which remains intact and undamaged. This scene is repeated several times, each time with a different angle or lighting, to emphasize the container's strength and resistance to pressure. The text on the screen highlights the container's ability to withstand heavy loads, making it suitable for transporting and storing fruits.
The video then shifts to a close-up of the container being rotated on a white surface, showcasing its sleek design and the clarity of the plastic
|
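Both execution paths in this guide use the same numbers: the vLLM arguments in section 2.1 match the rbln_config values compiled in section 1.1. The sketch below records that correspondence and checks it. The mapping itself is an assumption inferred only from the matching values in this guide, not from a documented rule, and `mismatched_settings` is a hypothetical helper.

```python
# Assumed correspondence, inferred from the matching values in this guide.
VLLM_TO_RBLN = {
    "block_size": "kvcache_partition_len",
    "max_model_len": "max_seq_len",
    "max_num_seqs": "batch_size",
}


def mismatched_settings(vllm_args, rbln_config):
    """Return {vllm_key: (vllm value, rbln value)} wherever the two differ."""
    return {
        vllm_key: (vllm_args[vllm_key], rbln_config[rbln_key])
        for vllm_key, rbln_key in VLLM_TO_RBLN.items()
        if vllm_args[vllm_key] != rbln_config[rbln_key]
    }


# Values from section 2.1 (vLLM arguments) and section 1.1 (rbln_config).
vllm_args = {"block_size": 16_384, "max_model_len": 262_144, "max_num_seqs": 8}
compiled = {"kvcache_partition_len": 16_384, "max_seq_len": 262_144, "batch_size": 8}

print(mismatched_settings(vllm_args, compiled))
```

An empty result indicates that the two configurations describe the same compiled shape under the assumed mapping.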
References