Inference with Dynamic Batch Sizes

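A production serving system often has to handle a varying number of concurrent requests efficiently: sometimes 1 request arrives, sometimes 3, 5, or 7. If the system always runs at the maximum batch size (for example, 8), compute is wasted whenever fewer requests are in flight. To avoid this, you can compile several decoders with different batch sizes and let the system automatically select the decoder whose batch size is closest to the actual number of requests.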

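The rbln_decoder_batch_sizes parameter lets you specify multiple batch sizes at compile time. At runtime, the most suitable decoder is selected automatically based on the actual number of requests, which improves throughput and resource utilization. For example, 3 requests are served by the batch-size-4 decoder, and 7 requests are served by the batch-size-8 decoder.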
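The selection rule described above can be sketched in a few lines of Python. This is only an illustration: select_decoder_batch_size is a hypothetical helper, not part of optimum-rbln or vLLM, and it assumes that "closest" means the smallest compiled batch size that can still hold every pending request, which matches the 3 to 4 and 7 to 8 examples.

```python
from typing import Sequence


def select_decoder_batch_size(num_requests: int, decoder_batch_sizes: Sequence[int]) -> int:
    """Hypothetical helper: pick the smallest compiled batch size that fits num_requests."""
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    # Keep only the decoders large enough to hold every pending request.
    candidates = [size for size in decoder_batch_sizes if size >= num_requests]
    if not candidates:
        raise ValueError(
            f"{num_requests} requests exceed the largest compiled batch size "
            f"({max(decoder_batch_sizes)})"
        )
    # Among those, the smallest one leaves the fewest unused slots.
    return min(candidates)


compiled_sizes = [8, 4, 1]
for num_requests in (1, 3, 5, 7):
    chosen = select_decoder_batch_size(num_requests, compiled_sizes)
    print(f"{num_requests} request(s) -> decoder with batch size {chosen}")
```

Under this assumption, 5 requests also fall through to the batch-size-8 decoder, because no compiled size between 4 and 8 exists.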

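Similar optimization techniques

This approach resembles other optimization techniques in vLLM:

CUDA Graph: cudagraph_capture_sizes - pre-captures CUDA graphs for a range of batch sizes
Inductor compilation: compile_sizes - pre-compiles kernels for specific input sizes

All of them rely on the same principle: optimize for the expected input sizes ahead of time to improve dynamic serving performance.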

Compiling a Model with Multiple Decoder Batch Sizes

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # Compile the model when True
    rbln_batch_size=8,                  # Maximum batch size
    rbln_max_seq_len=8192,              # Maximum sequence length
    rbln_tensor_parallel_size=4,        # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1], # Compile decoders for batch sizes 8, 4, and 1
)

# Save the compiled model
model.save_pretrained("rbln-dynamic-Llama-3-1-8B-Instruct")

Note

The rbln_decoder_batch_sizes list is automatically sorted in descending order. Every value must be less than or equal to rbln_batch_size. If the maximum batch size is not in the list, it is added automatically.
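The three rules in this note can be mirrored in plain Python to check a configuration before starting a long compilation. This is a minimal sketch, not the library's implementation: normalize_decoder_batch_sizes is a hypothetical helper, and dropping duplicate values is an assumption the note does not state.

```python
from typing import List, Sequence


def normalize_decoder_batch_sizes(batch_size: int, decoder_batch_sizes: Sequence[int]) -> List[int]:
    """Hypothetical helper mirroring the documented rules for rbln_decoder_batch_sizes."""
    sizes = set(decoder_batch_sizes)  # assumption: duplicate entries are dropped

    # Rule: every value must be less than or equal to the maximum batch size.
    too_large = sorted(size for size in sizes if size > batch_size)
    if too_large:
        raise ValueError(f"decoder batch sizes {too_large} exceed rbln_batch_size={batch_size}")
    if any(size < 1 for size in sizes):
        raise ValueError("decoder batch sizes must be positive")

    # Rule: the maximum batch size is added when it is missing.
    sizes.add(batch_size)

    # Rule: the list is sorted in descending order.
    return sorted(sizes, reverse=True)


print(normalize_decoder_batch_sizes(8, [1, 4]))     # [8, 4, 1]
print(normalize_decoder_batch_sizes(8, [4, 8, 1]))  # [8, 4, 1]
```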

Using the vLLM API for Efficient Dynamic Batch Inference

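This example contains three test cases. Because the model was compiled with decoder batch sizes 8, 4, and 1, the single request in small_batch_conversations fits the batch-size-1 decoder and the four requests in medium_batch_conversations fit the batch-size-4 decoder, which keeps latency low. The eight requests in large_batch_conversations use the batch-size-8 decoder for higher throughput and better resource utilization.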

small_batch_conversations = [
    [{"role": "user", "content": "What is the first letter of the English alphabet?"}]
]

medium_batch_conversations = [
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    [{"role": "user", "content": "What are the benefits of renewable energy?"}],
    [{"role": "user", "content": "Describe the process of photosynthesis."}],
    [{"role": "user", "content": "How does machine learning work?"}],
]

large_batch_conversations = [
    [{"role": "user", "content": "What is the theory of relativity?"}],
    [{"role": "user", "content": "Explain blockchain technology."}],
    [{"role": "user", "content": "Describe climate change effects."}],
    [{"role": "user", "content": "How do neural networks learn?"}],
    [{"role": "user", "content": "What is genetic engineering?"}],
    [{"role": "user", "content": "Explain the water cycle."}],
    [{"role": "user", "content": "How does the internet work?"}],
    [{"role": "user", "content": "What is sustainable development?"}],
]

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1]
model_id = "rbln-dynamic-Llama-3-1-8B-Instruct"
max_seq_len = 8192
batch_size = 8  # Maximum batch size

llm = LLM(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=[tokenizer.eos_token_id],
)

conversations = [
    small_batch_conversations,
    medium_batch_conversations,
    large_batch_conversations,
]

for conversation in conversations:
    chats = [
        tokenizer.apply_chat_template(
            conv,
            add_generation_prompt=True,
            tokenize=False,
        ) for conv in conversation
    ]

    outputs = llm.generate(chats, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Benefits of Dynamic Batch Compilation

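Improved throughput: The system automatically selects the best-fitting decoder for each batch of requests, which raises overall throughput.
Flexible serving: Workloads of varying size are handled efficiently without being tied to a single batch size.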
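To make the resource argument concrete, the sketch below counts unused batch slots for the three example workloads (1, 4, and 8 requests), comparing a single batch-size-8 decoder with the decoders compiled for [8, 4, 1]. It is a back-of-the-envelope illustration, not a benchmark: idle_slots is a hypothetical helper, and it assumes that unused slots in the selected decoder are simply padded and that the smallest fitting decoder is chosen.

```python
from typing import Sequence


def idle_slots(num_requests: int, decoder_batch_sizes: Sequence[int]) -> int:
    """Hypothetical helper: unused slots when the smallest fitting decoder is selected."""
    chosen = min(size for size in decoder_batch_sizes if size >= num_requests)
    return chosen - num_requests


workloads = {"small": 1, "medium": 4, "large": 8}  # request counts from the example above

for name, num_requests in workloads.items():
    fixed = idle_slots(num_requests, [8])         # only the maximum batch size is compiled
    dynamic = idle_slots(num_requests, [8, 4, 1])  # multiple decoder batch sizes are compiled
    print(f"{name}: {fixed} idle slot(s) with a fixed decoder, {dynamic} with dynamic selection")
```

The idle-slot count is only a rough proxy for wasted compute; actual latency and throughput depend on the hardware and the model.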