Qwen2.5-VL-7B (Multi-modal)¶
Overview¶
This tutorial explains how to run a multi-modal model on vLLM using multiple RBLN NPUs. For this guide, we use the Qwen/Qwen2.5-VL-7B-Instruct model, which accepts both images and videos as inputs.
Note
Rebellions Scalable Design (RSD) is available on ATOM™+ (RBLN-CA12 and RBLN-CA22) and ATOM™-Max (RBLN-CA25). You can check your RBLN NPU type using the rbln-stat command.
Note
Qwen2.5-VL is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - torch
    - transformers
    - numpy
    - RBLN Compiler
    - optimum-rbln
    - huggingface_hub[cli]
    - vllm-rbln
    - vllm - It is automatically installed when you install vllm-rbln.
- Installation Command:
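The exact command, index URL, and pinned versions are provided on the RBLN Portal and in the RBLN installation guide; the following is a minimal sketch with a placeholder index URL:

```bash
# Hypothetical sketch -- check the RBLN Portal / installation guide for the
# exact private index URL and pinned package versions.

# The RBLN Compiler (rebel-compiler) is distributed through RBLN's private
# package index and requires an RBLN Portal account.
pip3 install -i <RBLN_PYPI_INDEX_URL> rebel-compiler

# optimum-rbln and vllm-rbln typically pull in torch, transformers, numpy,
# and vllm as dependencies.
pip3 install optimum-rbln vllm-rbln "huggingface_hub[cli]"
```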
Note
Please note that rebel-compiler requires an RBLN Portal account.
Note
Please note that the Qwen/Qwen2.5-VL-7B-Instruct model on Hugging Face has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
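```bash
# Replace <your_hf_token> with a Hugging Face access token that has been
# granted access to Qwen/Qwen2.5-VL-7B-Instruct, or run `huggingface-cli login`
# interactively without the flag.
huggingface-cli login --token <your_hf_token>
```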
Compile Qwen2.5-VL-7B¶
You can modify the parameters of the main module as well as the submodules through rbln_config. For the original source code, refer to the RBLN Model Zoo. If you need the API reference, see RblnModelConfig.
- visual submodule:
    - max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), representing the number of patches in an image.
    - device: Defines the device allocation for each submodule during runtime. As Qwen2.5-VL consists of multiple submodules, loading them all onto a single device may exceed its memory capacity, especially as the batch size increases. By distributing submodules across devices, memory usage can be optimized for efficient runtime performance.
- main module:
    - export: Must be True to compile the model.
    - tensor_parallel_size: Defines the number of NPUs to be used for inference.
    - kvcache_partition_len: Defines the length of KV cache partitions for flash attention.
    - max_seq_len: Defines the max position embedding for the language model; must be a multiple of kvcache_partition_len.
    - device: Defines the device allocation for all modules other than the explicitly device-allocated submodules.
    - batch_size: Defines the batch size for compilation.
    - decoder_batch_sizes: Defines dynamic batching. See Inference with Dynamic Batch Sizes for more details.
Note
Parameters passed directly to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len). In contrast, the same parameters specified inside rbln_config must not include the prefix (e.g., batch_size, max_seq_len).
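Putting these parameters together, a compile step with optimum-rbln might look like the sketch below. The class name RBLNQwen2_5_VLForConditionalGeneration, the sequence lengths, device indices, and output directory are illustrative assumptions; refer to the RBLN Model Zoo for the exact, tested script.

```python
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration  # class name assumed; see RBLN Model Zoo

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# Illustrative values only -- tune the sequence lengths, tensor_parallel_size,
# batch size, and device indices for your own NPUs and workload.
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,  # compile the model for RBLN NPUs
    rbln_config={
        # visual submodule: ViT patch sequence length and a dedicated device
        "visual": {
            "max_seq_lens": 6400,
            "device": 0,
        },
        # main (language) module
        "tensor_parallel_size": 8,       # number of NPUs used for inference
        "kvcache_partition_len": 16384,  # flash attention KV cache partition length
        "max_seq_len": 114688,           # must be a multiple of kvcache_partition_len
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 1,
    },
)

# Save the compiled artifacts so they can be served with vllm-rbln later.
model.save_pretrained("rbln-Qwen2.5-VL-7B-Instruct")
```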
Use vLLM API for Inference¶
You can find more vLLM API usage examples for encoder-decoder and multi-modal models in the RBLN Model Zoo.
Please refer to the vLLM Docs for more information on the vLLM API.
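As a rough sketch of serving the compiled model through the vLLM Python API (the engine arguments below, such as device, max_model_len, and block_size, are assumptions and must stay consistent with the values used at compile time; the RBLN Model Zoo contains the exact multi-modal example):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Directory produced by the compile step above (illustrative name).
model_path = "rbln-Qwen2.5-VL-7B-Instruct"

# Engine arguments are assumptions -- they should mirror the compile-time
# configuration (batch size, max_seq_len, kvcache_partition_len).
llm = LLM(
    model=model_path,
    device="rbln",
    max_num_seqs=1,
    max_model_len=114688,
    block_size=16384,
)

# Qwen2.5-VL chat template with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("sample.jpg")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```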