Qwen2.5-VL-7B (Multi-modal)¶
Overview¶
This tutorial explains how to run a multi-modal model on vLLM using multiple RBLN NPUs. For this guide, we will use the Qwen/Qwen2.5-VL-7B-Instruct model, which accepts images and videos as inputs.
Note
Rebellions Scalable Design (RSD) is available on ATOM™+ (RBLN-CA12 and RBLN-CA22) and ATOM™-Max (RBLN-CA25). You can check your RBLN NPU type using the rbln-stat command.
Note
Qwen2.5-VL is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - torch
    - transformers
    - numpy
    - RBLN Compiler
    - optimum-rbln
    - huggingface_hub[cli]
    - vllm-rbln
    - vllm: installed automatically when you install vllm-rbln.
- Installation Command:
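A plausible set of pip commands is sketched below; the exact package versions and the RBLN package index URL come from your environment and your RBLN Portal account.

```bash
pip3 install torch transformers numpy
pip3 install optimum-rbln vllm-rbln "huggingface_hub[cli]"
# rebel-compiler (the RBLN Compiler) is distributed through the RBLN package index;
# obtain the index URL and credentials from your RBLN Portal account.
pip3 install rebel-compiler --extra-index-url <RBLN-package-index-URL>
```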
Note
Please note that rebel-compiler requires an RBLN Portal account.
Note
Please note that the Qwen/Qwen2.5-VL-7B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
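```bash
huggingface-cli login
# When prompted, paste a Hugging Face access token for an account that has been
# granted access to Qwen/Qwen2.5-VL-7B-Instruct.
```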
Compile Qwen2.5-VL-7B¶
You can modify the parameters of the main module as well as the submodules through rbln_config. For the original source code, refer to the RBLN Model Zoo. If you need the API reference, see RblnModelConfig.
- visual submodule:
    - max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), representing the number of patches in an image.
    - device: Defines the device allocation for each submodule during runtime. Because Qwen2.5-VL consists of multiple submodules, loading them all onto a single device may exceed its memory capacity, especially as the batch size increases. Distributing submodules across devices keeps memory usage optimized for efficient runtime performance.
- main module:
    - export: Must be True to compile the model.
    - tensor_parallel_size: Defines the number of NPUs to be used for inference.
    - kvcache_partition_len: Defines the length of KV cache partitions for flash attention.
    - max_seq_len: Defines the maximum position embedding for the language model; must be a multiple of kvcache_partition_len.
    - device: Defines the device allocation for all modules other than the explicitly device-allocated submodules.
    - batch_size: Defines the batch size for compilation.
    - decoder_batch_sizes: Defines dynamic batching. See Inference with Dynamic Batch Sizes for more details.
Note
Parameters passed to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len). In contrast, parameters within rbln_config should not include the prefix; avoid using the rbln prefix when specifying the same parameters in rbln_config.
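The sketch below shows how these options might be combined when compiling the model with optimum-rbln. The class name RBLNQwen2_5_VLForConditionalGeneration and all numeric values (sequence lengths, device indices, NPU count) are illustrative assumptions; adjust them for your hardware, and see the RBLN Model Zoo for the reference script.

```python
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration  # class name assumed

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,  # compile the model instead of loading precompiled artifacts
    rbln_config={
        # visual submodule: ViT sequence length and its own device allocation
        "visual": {
            "max_seq_lens": 6400,  # illustrative; maximum number of image patches
            "device": 0,
        },
        # main module
        "tensor_parallel_size": 8,       # number of NPUs used for inference
        "kvcache_partition_len": 16384,  # KV cache partition length for flash attention
        "max_seq_len": 114688,           # must be a multiple of kvcache_partition_len
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 1,
    },
)

# Save the compiled artifacts so they can be served with vLLM later.
model.save_pretrained("rbln-Qwen2.5-VL-7B-Instruct")
```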
Use vLLM API for Inference¶
You can find more vLLM API usage examples for encoder-decoder models and multi-modal models in the RBLN Model Zoo.
Please refer to the vLLM Docs for more information on the vLLM API.
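As a rough sketch, offline multi-modal inference with the compiled model could look like the following. The max_model_len value, prompt template, file names, and any vllm-rbln-specific engine arguments are assumptions; consult the vLLM Docs and the RBLN Model Zoo for the exact serving configuration.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Directory produced by save_pretrained() in the compile step above.
llm = LLM(
    model="rbln-Qwen2.5-VL-7B-Instruct",
    max_model_len=114688,  # should match the max_seq_len used at compile time
)

# Qwen2.5-VL chat-style prompt with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("example.jpg")  # any RGB image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```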