Qwen2.5-VL 7B

Overview

This tutorial explains how to run a multimodal model on vLLM using multiple RBLN NPUs. It uses the Qwen/Qwen2.5-VL-7B-Instruct model, which accepts both image and video input.

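Environment Setup and Installation Check

Before you begin, make sure your system environment is configured correctly and that all required packages are installed. This includes the following:

System requirements:
- Python: 3.9–3.12
- RBLN Driver
Required packages:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln (vllm is installed automatically as a dependency)

Installation commands: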

pip install "optimum-rbln>=0.9.3.post1" "vllm-rbln>=0.9.3.post2"
pip install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.9.3.post1"

Note

Using rebel-compiler requires an RBLN Portal account.
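To confirm that the packages listed above are actually present, a short check along the following lines may help. This is a minimal sketch using only the standard library. The distribution names are taken from the package list and install commands in this guide, and the helper name `installed_versions` is made up for illustration. It reports versions but does not compare them against the minimums above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in this guide's package list and
# install commands.
REQUIRED = [
    "torch",
    "transformers",
    "numpy",
    "rebel-compiler",
    "optimum-rbln",
    "vllm-rbln",
]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```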

Execution

Model Compilation

Through rbln_config you can modify the parameters of the main module and of its submodules. For the original source code, see the RBLN Model Zoo. For the API reference, see RblnModelConfig.

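visual submodule:
- max_seq_lens: Defines the maximum sequence length for the Vision Transformer (ViT), which corresponds to the number of patches in an image.
- device: Defines the device allocated to each submodule at runtime.
  - Qwen2.5-VL consists of several submodules, so loading every module onto a single device can exceed its memory capacity, especially with large batch sizes. Distributing the submodules across multiple devices reduces per-device memory pressure and keeps execution efficient.
main module:
- export: Must be set to True to compile the model.
- tensor_parallel_size: Defines the number of NPUs used for inference.
- kvcache_partition_len: Defines the length of each KV cache partition for Flash Attention.
- max_seq_len: Defines the maximum position embeddings of the language model; it must be a multiple of kvcache_partition_len.
- device: Defines the device allocation for the main module, excluding any submodules assigned to specific devices.
- batch_size: Defines the batch size at compile time.
- decoder_batch_sizes: Defines the batch sizes used for dynamic batching.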
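The one hard constraint stated in the list above is that max_seq_len must be a multiple of kvcache_partition_len. The sketch below checks that rule before a long compile is started. The function name `kv_partitions` is hypothetical and not part of optimum-rbln, and reading the quotient as the number of partitions is an interpretation rather than something this guide states.

```python
def kv_partitions(max_seq_len: int, kvcache_partition_len: int) -> int:
    """Return max_seq_len // kvcache_partition_len.

    Raises ValueError when max_seq_len is not a multiple of
    kvcache_partition_len, which is the constraint described above.
    """
    if kvcache_partition_len <= 0:
        raise ValueError("kvcache_partition_len must be positive")
    quotient, remainder = divmod(max_seq_len, kvcache_partition_len)
    if remainder:
        raise ValueError(
            f"max_seq_len={max_seq_len} is not a multiple of "
            f"kvcache_partition_len={kvcache_partition_len}"
        )
    return quotient


# Values used in this tutorial's compile step: 114688 = 7 * 16384.
print(kv_partitions(114_688, 16_384))  # 7
```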

Note

Choose a batch size appropriate to the model size and the NPU specifications. vllm-rbln also supports dynamic batching to improve throughput and resource utilization. For details, see Dynamic Batching.

from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    export=True,
    rbln_config={
        "visual": {
            "max_seq_lens": 6400,
            "device": 0,
        },
        "tensor_parallel_size": 8,
        "kvcache_partition_len": 16_384,
        "max_seq_len": 114_688,
        "device": [0, 1, 2, 3, 4, 5, 6, 7],
        "batch_size": 2,
        "decoder_batch_sizes": [2, 1],
    },
)
model.save_pretrained("rbln-Qwen2-5-7B-Instruct")

Inference with vLLM

The compiled model can be used with vLLM. The example below shows how to set up the vLLM engine with the compiled model and run inference.

Note

Parameters passed directly to from_pretrained generally require the rbln prefix, as in rbln_batch_size and rbln_max_seq_len.

Parameters inside rbln_config, however, do not take this prefix. When setting the same parameters in rbln_config, never add the rbln prefix.
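The naming rule in the note can be illustrated without calling the compiler. The snippet below is a sketch only: `kwargs_to_rbln_config` is a hypothetical helper, not an optimum-rbln function. It strips the prefix to show how the two spellings of the same setting correspond.

```python
PREFIX = "rbln_"


def kwargs_to_rbln_config(**kwargs):
    """Hypothetical helper: map prefixed keyword names to rbln_config keys."""
    config = {}
    for key, value in kwargs.items():
        if not key.startswith(PREFIX):
            raise ValueError(f"expected an '{PREFIX}' prefix on {key!r}")
        # Inside rbln_config the same setting is written without the prefix.
        config[key[len(PREFIX):]] = value
    return config


as_kwargs = {"rbln_batch_size": 2, "rbln_max_seq_len": 114_688}
as_config = kwargs_to_rbln_config(**as_kwargs)
print(as_config)  # {'batch_size': 2, 'max_seq_len': 114688}
```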

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

# Long videos can exceed vLLM's engine iteration timeout; if that happens,
# set `VLLM_ENGINE_ITERATION_TIMEOUT_S` to a higher value (in seconds).
VIDEO_URLS = [
    "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
]

def generate_prompts_video(model_id: str):
    processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
    video_nums = len(VIDEO_URLS)
    messages = [[
        {
            "role":
            "user",
            "content": [
                {
                    "type": "video",
                    "video": VIDEO_URLS[i],
                },
                {
                    "type": "text",
                    "text": "Describe this video."
                },
            ],
        },
    ] for i in range(video_nums)]

    texts = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    arr_video_inputs = []
    arr_video_kwargs = []
    for i in range(video_nums):
        image_inputs, video_inputs, video_kwargs = process_vision_info(
            messages[i], return_video_kwargs=True)
        arr_video_inputs.append(video_inputs)
        arr_video_kwargs.append(video_kwargs)

    return [{
        "prompt": text,
        "multi_modal_data": {
            "video": video_inputs,
        },
        "mm_processor_kwargs": {
            "min_pixels": 1024 * 14 * 14,
            "max_pixels": 5120 * 14 * 14,
            **video_kwargs,
        },
    } for text, video_inputs, video_kwargs in zip(
        texts, arr_video_inputs, arr_video_kwargs)]

def main():
    model_id = "rbln-Qwen2-5-7B-Instruct"
    llm = LLM(model=model_id)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    sampling_params = SamplingParams(
        temperature=0,
        ignore_eos=False,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
        max_tokens=200
    )

    inputs = generate_prompts_video(model_id)
    outputs = llm.generate(inputs, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(generated_text)

if __name__ == "__main__":
    main()

Example output:

The video showcases a clear plastic food container being used to hold various fruits, demonstrating its features and benefits.The container is placed on a wooden table with a decorative background that includes a bouquet of artificial flowers and a plate with a mango and some berries.

The video begins with a close-up of the container filled with peaches, accompanied by text highlighting its versatility for different types of fruits such as longan, sliced watermelon, strawberries, and cherries. The container is then shown being opened and closed, emphasizing its easy-to-use design and secure locking mechanism.

Next, the container is filled with cherries, and the text explains that it is made of PET material, ensuring durability and quality. The container is then subjected to a durability test by placing two bricks on top of it, demonstrating its strength and ability to withstand pressure.

The video concludes with a final shot of the container filled with peaches, showcasing its transparency and the clear view of the contents inside. The text reiterates the container
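The comment in the inference script mentions raising VLLM_ENGINE_ITERATION_TIMEOUT_S for long videos. One way to do that from Python is sketched below, on the assumption that the variable is set before the LLM engine is constructed. The value of 180 seconds is an arbitrary illustration, not a recommendation from this guide.

```python
import os

# Keep any value already exported in the shell; otherwise use an
# illustrative 180 seconds (an assumed value, tune it for your videos).
os.environ.setdefault("VLLM_ENGINE_ITERATION_TIMEOUT_S", "180")

print(os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"])
```

The same effect can be had by exporting the variable in the shell before launching the script.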
