Qwen2.5-VL-7B (Multi-modal)

Overview

This tutorial explains how to run a multi-modal model on vLLM using multiple RBLN NPUs. It uses the Qwen/Qwen2.5-VL-7B-Instruct model, which accepts both images and videos as input.

Note

Rebellions Scalable Design (RSD) is available on ATOM™+ (RBLN-CA12 and RBLN-CA22) and ATOM™-Max (RBLN-CA25). You can check your RBLN NPU type using the rbln-stat command.
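
For example, run the following command to list the installed RBLN devices and their NPU types:

$ rbln-stat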

Note

Qwen2.5-VL is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

Setup & Installation

Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes the Python packages used in this tutorial: rebel-compiler, optimum-rbln, and vllm-rbln.

Note

Please note that rebel-compiler requires an RBLN Portal account.
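
The command below is a minimal installation sketch using pip. The exact index URL and credentials for rebel-compiler come from your RBLN Portal account, and the package names optimum-rbln and vllm-rbln are inferred from the APIs used later in this tutorial.

$ pip install rebel-compiler optimum-rbln vllm-rbln  # assumes access to the RBLN package index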

Note

Please note that the Qwen/Qwen2.5-VL-7B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****

Compile Qwen2.5-VL-7B

You can modify the parameters of the main module as well as the submodules through rbln_config. For the original source code, refer to the RBLN Model Zoo. If you need the API reference, see RblnModelConfig.

  • visual submodule:
    • max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), i.e., the number of patches in an image.
    • device: Defines the device allocation for each submodule during runtime.
      • Because Qwen2.5-VL consists of multiple submodules, loading them all onto a single device may exceed its memory capacity, especially as the batch size increases. Distributing the submodules across devices keeps memory usage in check and enables efficient runtime performance.
  • main module:
    • export: Must be True to compile the model.
    • tensor_parallel_size: Defines the number of NPUs to be used for inference.
    • kvcache_partition_len: Defines the length of the KV cache partitions used for flash attention.
    • max_seq_len: Defines the maximum position embedding (context length) of the language model; it must be a multiple of kvcache_partition_len.
    • device: Defines the device allocation for all modules that are not given an explicit device assignment of their own.
    • batch_size: Defines the batch size for compilation.
    • decoder_batch_sizes: Defines the batch sizes for which decoders are compiled, enabling dynamic batching. See Inference with Dynamic Batch Sizes for more details.

The following example compiles Qwen2.5-VL-7B with the configuration described above and saves the compiled model:

from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 6400,
            "device": 0,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 114_688,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 2,
        "decoder_batch_sizes": [2, 1],
    },
)
model.save_pretrained("rbln-Qwen2-5-7B-Instruct")

Note

Parameters passed directly to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len). In contrast, parameters specified within rbln_config must not include the prefix.
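
As a minimal illustration of this naming convention (using the same model class as above; not a complete compilation recipe), the following two calls pass the same batch size:

from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

# Top-level keyword argument: the `rbln_` prefix is required.
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    export=True,
    rbln_batch_size=2,
)

# Inside `rbln_config`: the same parameter is given without the prefix.
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    export=True,
    rbln_config={"batch_size": 2},
)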

Use vLLM API for Inference

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

# If the video is too long, set `VLLM_ENGINE_ITERATION_TIMEOUT_S`
# to a higher timeout value.
VIDEO_URLS = [
    "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
]

model_id = "rbln-Qwen2-5-7B-Instruct"
batch_size = 2
max_seq_len = 114688
kvcache_partition_len = 16384

def generate_prompts_video(model_id):
    processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
    video_nums = len(VIDEO_URLS)
    messages = [[
        {
            "role":
            "user",
            "content": [
                {
                    "type": "video",
                    "video": VIDEO_URLS[i],
                },
                {
                    "type": "text",
                    "text": "Describe this video."
                },
            ],
        },
    ] for i in range(video_nums)]

    texts = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    arr_video_inputs = []
    arr_video_kwargs = []
    for i in range(video_nums):
        image_inputs, video_inputs, video_kwargs = process_vision_info(
            messages[i], return_video_kwargs=True)
        arr_video_inputs.append(video_inputs)
        arr_video_kwargs.append(video_kwargs)

    return [{
        "prompt": text,
        "multi_modal_data": {
            "video": video_inputs,
        },
        "mm_processor_kwargs": {
            "min_pixels": 1024 * 14 * 14,
            "max_pixels": 5120 * 14 * 14,
            **video_kwargs,
        },
    } for text, video_inputs, video_kwargs in zip(
        texts, arr_video_inputs, arr_video_kwargs)]

llm = LLM(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=kvcache_partition_len
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(
    temperature=0,
    ignore_eos=False,
    skip_special_tokens=True,
    stop_token_ids=[tokenizer.eos_token_id],
    max_tokens=200
)

inputs = generate_prompts_video(model_id)
outputs = llm.generate(inputs, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")

You can find more vLLM API usage examples for encoder-decoder and multi-modal models in the RBLN Model Zoo.

Please refer to the vLLM Docs for more information on the vLLM API.