vLLM Native API

With vllm-rbln, you can easily use the vLLM API with large language models (LLMs). In this tutorial, you will learn how to run inference with the vLLM API on Llama3-8B using Eager Attention and on Llama3.1-8B using Flash Attention.

Prerequisites

The latest versions of the rebel-compiler, optimum-rbln, and vllm-rbln packages must be installed. Installing these packages requires access to the Rebellions private PyPI server. See the Installation Guide for details. The latest version of each package is listed in the Release Notes.

$ pip3 install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.8.0" "optimum-rbln>=0.8.0.post2" "vllm-rbln>=0.8.0"
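
Before compiling, it can help to confirm which versions actually ended up in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip command above.
REQUIRED_PACKAGES = ("rebel-compiler", "optimum-rbln", "vllm-rbln")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Compare the printed versions against the minimums in the install command and against the Release Notes.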

Basic Model Example: Llama3-8B

Step 1. Compile Llama3-8B

This tutorial uses the Llama3-8B model. First, compile Llama3-8B with optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM
import os

# Export the HuggingFace PyTorch Llama3 model to an RBLN-compiled model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                  # Compile the model when True
    rbln_max_seq_len=8192,        # Maximum sequence length
    rbln_tensor_parallel_size=4,  # Number of ATOM™+ devices for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # For continuous batching; batch_size > 1 is recommended
)

# Save the compiled result
compiled_model.save_pretrained(os.path.basename(model_id))

Note

Choose a batch size that is appropriate for serving. This example uses 4.
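
The compile script saves its output to `os.path.basename(model_id)`, which is why the inference script that follows refers to the model by a shorter name. A small sketch of what that call evaluates to (the variable names here are illustrative only):

```python
import os

hub_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# save_pretrained() in Step 1 receives only the last path component,
# so the compiled artifacts land in ./Meta-Llama-3-8B-Instruct.
local_model_dir = os.path.basename(hub_model_id)
print(local_model_dir)  # Meta-Llama-3-8B-Instruct
```

Because this is a relative path, the inference script presumably needs to run from the same working directory as the compile script, or be given the full path to the saved directory.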

Step 2. Use the vLLM API for Inference

You can run the compiled model with the vLLM API. The following code initializes the vLLM engine with the model compiled above and then runs inference.

vllm_api_example_llama3_8B.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Meta-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 4

engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)


async def run_multi(chats):
  # request_id is expected to be a string, so convert the index explicitly.
  tasks = [asyncio.create_task(run_single(chat, str(i))) for (i, chat) in enumerate(chats)]
  return [await task for task in tasks]

# Runs multiple inferences in parallel
conversations = [
  [{"role": "user", "content": "What is the first letter of English alphabets?"}],
  [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
  assert len(result.outputs) > 0, "Invalid output."
  print(result.outputs[0].text)

The vLLM API can also run various encoder-decoder and multimodal models. See the Model Zoo for the list of supported models.

For more details on the vLLM API, see the vLLM documentation.
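
The script above relies on two asyncio patterns: draining an async generator while keeping only its last item, and running several such coroutines concurrently as tasks. The sketch below isolates those patterns so they can be tried without an RBLN device. `fake_generate` is a hypothetical stand-in for `engine.generate`; it is not part of vLLM and only mimics the streaming shape of the output.

```python
import asyncio


async def fake_generate(prompt, request_id):
    """Stand-in for engine.generate(): yields progressively longer partial outputs."""
    text = ""
    for word in f"echo of {prompt}".split():
        await asyncio.sleep(0)  # hand control back to the event loop
        text = f"{text} {word}".strip()
        yield {"request_id": request_id, "text": text}


async def run_single(prompt, request_id):
    final_result = None
    async for result in fake_generate(prompt, request_id):
        final_result = result  # keep only the most recent (most complete) output
    return final_result


async def run_multi(prompts):
    # One task per request, each with its own string request ID.
    tasks = [asyncio.create_task(run_single(p, str(i))) for i, p in enumerate(prompts)]
    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_multi(["first", "last"]))
for r in results:
    print(r["request_id"], r["text"])
```

Creating all tasks before awaiting any of them is what lets the requests overlap; awaiting each `run_single` call in sequence would serialize them.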

Advanced Example: Llama3.1-8B with Flash Attention

Flash Attention reduces memory usage and improves throughput, which lets models such as Llama3.1-8B handle long contexts efficiently. To enable Flash Attention, add the rbln_kvcache_partition_len parameter when compiling with optimum-rbln.

Step 1. Compile Llama3.1-8B

from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Export the HuggingFace PyTorch Llama3.1 model to an RBLN-compiled model
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # Compile the model when True
    rbln_batch_size=1,                  # Batch size
    rbln_max_seq_len=131_072,           # Maximum sequence length
    rbln_tensor_parallel_size=8,        # Tensor parallelism
    rbln_kvcache_partition_len=16_384,  # KV cache partition size, required for Flash Attention
)

# Save the compiled result
model.save_pretrained(os.path.basename(model_id))

Note

Choose a batch size that fits your requirements. This example uses 1.
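
As a quick sanity check on the compile parameters, the arithmetic below shows how many KV cache partitions a full-length sequence spans with the values used in this example. The requirement that the sequence length divide evenly by the partition length is an assumption of this sketch, not something stated in this tutorial.

```python
max_seq_len = 131_072
kvcache_partition_len = 16_384

# Assumption for this sketch: the sequence length divides evenly into partitions.
assert max_seq_len % kvcache_partition_len == 0

num_partitions = max_seq_len // kvcache_partition_len
print(num_partitions)  # 8
```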

Step 2. Use the vLLM API for Inference

After compilation, you can use the model with the vLLM API:

Note

To use Flash Attention, block_size must match rbln_kvcache_partition_len.

vllm_api_example_llama3_1_8B.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Llama-3.1-8B-Instruct"
max_seq_len = 131_072
batch_size = 1
block_size = 16_384  # Must match `rbln_kvcache_partition_len` for Flash Attention.


engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=block_size,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
assert len(result.outputs) > 0, "Invalid output."
print(result.outputs[0].text)


async def run_multi(chats):
  # request_id is expected to be a string, so convert the index explicitly.
  tasks = [asyncio.create_task(run_single(chat, str(i))) for (i, chat) in enumerate(chats)]
  return [await task for task in tasks]

conversations = [
  [{"role": "user", "content": "What is the first letter of English alphabets?"}],
  [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
  assert len(result.outputs) > 0, "Invalid output."
  print(result.outputs[0].text)
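
The two scripts in this tutorial differ mainly in how `block_size` is chosen: the Eager Attention example sets it to the maximum sequence length, while the Flash Attention example sets it to the KV cache partition length. The sketch below captures that rule in one place. `CompileConfig` and `engine_kwargs` are hypothetical helpers introduced for illustration; only the keyword names come from the `AsyncEngineArgs` calls shown in this tutorial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompileConfig:
    """Hypothetical container for the rbln_* values passed at compile time."""
    batch_size: int
    max_seq_len: int
    kvcache_partition_len: Optional[int] = None  # set only when Flash Attention is enabled


def engine_kwargs(model_dir: str, cfg: CompileConfig) -> dict:
    """Build keyword arguments mirroring the AsyncEngineArgs calls in this tutorial."""
    # Flash Attention: block_size follows the partition length.
    # Otherwise: block_size equals the maximum sequence length.
    block_size = cfg.kvcache_partition_len or cfg.max_seq_len
    return {
        "model": model_dir,
        "device": "rbln",
        "max_num_seqs": cfg.batch_size,
        "max_num_batched_tokens": cfg.max_seq_len,
        "max_model_len": cfg.max_seq_len,
        "block_size": block_size,
    }


eager = engine_kwargs("Meta-Llama-3-8B-Instruct", CompileConfig(4, 8192))
flash = engine_kwargs("Llama-3.1-8B-Instruct", CompileConfig(1, 131_072, 16_384))
print(eager["block_size"], flash["block_size"])  # 8192 16384
```

Deriving the engine settings from a single record of the compile-time values is one way to keep the two from drifting apart.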

For more details on the vLLM API, see the vLLM documentation.

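Advanced Example: Multi-Batch Inference with Dynamic Batch Sizes

Real serving environments often need to handle a varying number of requests efficiently. For example, 1 request may arrive at one moment and 3, 5, or 7 concurrent requests at another. Always using the maximum batch size (for example, 8) wastes compute resources whenever fewer requests are in flight. To address this, you can compile several decoders with different batch sizes and let the system automatically select the decoder whose batch size most closely fits the actual number of requests.

The rbln_decoder_batch_sizes parameter lets you specify multiple batch sizes at compile time. The most suitable decoder is then selected automatically based on the actual number of requests, which improves throughput and resource utilization. For example, 3 requests select the batch-size-4 decoder, and 7 requests select the batch-size-8 decoder.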
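
The selection rule implied by these examples can be sketched as "pick the smallest compiled batch size that still holds every pending request". The function below is an illustration of that rule only; `select_decoder_batch_size` is a hypothetical name, and the actual selection logic inside vllm-rbln may differ.

```python
def select_decoder_batch_size(num_requests, decoder_batch_sizes):
    """Pick the smallest compiled batch size that can hold all pending requests."""
    candidates = [size for size in decoder_batch_sizes if size >= num_requests]
    if not candidates:
        raise ValueError(f"{num_requests} requests exceed the largest decoder batch size")
    return min(candidates)


compiled_sizes = [8, 4, 1]
for n in (1, 3, 7):
    print(n, "->", select_decoder_batch_size(n, compiled_sizes))
```

Under this rule, 5 requests would also use the batch-size-8 decoder, because the batch-size-4 decoder cannot hold them.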

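Similar optimization techniques

This approach is similar to other optimization techniques in vLLM:

CUDA Graph: cudagraph_capture_sizes - pre-captures CUDA graphs for various batch sizes
Inductor compilation: compile_sizes - pre-compiles kernels for specific input sizes

All of them share the same principle: optimizing for the expected input sizes ahead of time to improve dynamic serving performance.

Step 1. Compile the Model with Multiple Decoder Batch Sizes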

from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # Compile the model when True
    rbln_batch_size=8,                  # Maximum batch size
    rbln_max_seq_len=8192,              # Maximum sequence length
    rbln_tensor_parallel_size=4,        # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1], # Compile decoders for batch sizes 8, 4, and 1
)

# Save the compiled result
model.save_pretrained(os.path.basename(model_id))

Note

The rbln_decoder_batch_sizes list is automatically sorted in descending order. Every value must be less than or equal to rbln_batch_size. If the maximum batch size is not in the list, it is added automatically.
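
The three rules in the note can be expressed as a short function. This is a sketch of the described behavior, not the optimum-rbln implementation; `normalize_decoder_batch_sizes` is a hypothetical name, and dropping duplicate entries is an extra assumption of this sketch.

```python
def normalize_decoder_batch_sizes(batch_size, decoder_batch_sizes):
    """Apply the rules from the note: validate, add the maximum, sort descending."""
    too_large = [size for size in decoder_batch_sizes if size > batch_size]
    if too_large:
        raise ValueError(f"values {too_large} exceed rbln_batch_size={batch_size}")
    # Using a set adds the maximum batch size if it is missing (and drops duplicates).
    sizes = set(decoder_batch_sizes) | {batch_size}
    return sorted(sizes, reverse=True)


print(normalize_decoder_batch_sizes(8, [1, 4]))     # [8, 4, 1]
print(normalize_decoder_batch_sizes(8, [4, 8, 1]))  # [8, 4, 1]
```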

Step 2. Use the vLLM API for Efficient Multi-Batch Inference

vllm_api_example_multi_batch.py
import asyncio

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1]
model_id = "Meta-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 8  # Maximum batch size

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def stop_tokens():
    eot_id = next(
        (
            k
            for k, t in tokenizer.added_tokens_decoder.items()
            if t.content == "<|eot_id|>"
        ),
        None,
    )
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]


sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)


async def collect_async_result(
    engine: AsyncLLMEngine, chat, sampling_params: SamplingParams, request_id: str
):
    final_result = None
    async for result in engine.generate(chat, sampling_params, request_id=request_id):
        final_result = result
    return final_result


async def run_batch(chats, batch_name):
    print(f"=== {batch_name} ===")

    tasks = [
        asyncio.create_task(
            collect_async_result(engine, chat, sampling_params, f"{batch_name}_{i}")
        )
        for i, chat in enumerate(chats)
    ]

    results = await asyncio.gather(*tasks)

    for i, result in enumerate(results):
        output = result.outputs[0].text
        print(f"===================== Output {i} ==============================")
        print(output)
        print("===============================================================\n")

    return results


async def main():
    # Scenario 1: Single request (uses batch_size=1 decoder)
    single_conversation = [
        {
            "role": "user",
            "content": "Tell me a short story about artificial intelligence.",
        }
    ]
    single_chat = [
        tokenizer.apply_chat_template(
            single_conversation, add_generation_prompt=True, tokenize=False
        )
    ]

    await run_batch(single_chat, "Single Request")

    # Scenario 2: Medium batch (uses batch_size=4 decoder)
    medium_conversations = [
        [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        [{"role": "user", "content": "What are the benefits of renewable energy?"}],
        [{"role": "user", "content": "Describe the process of photosynthesis."}],
        [{"role": "user", "content": "How does machine learning work?"}],
    ]
    medium_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in medium_conversations
    ]

    await run_batch(medium_chats, "Medium Batch (4 requests)")

    # Scenario 3: Large batch (uses batch_size=8 decoder)
    large_conversations = [
        [{"role": "user", "content": "What is the theory of relativity?"}],
        [{"role": "user", "content": "Explain blockchain technology."}],
        [{"role": "user", "content": "Describe climate change effects."}],
        [{"role": "user", "content": "How do neural networks learn?"}],
        [{"role": "user", "content": "What is genetic engineering?"}],
        [{"role": "user", "content": "Explain the water cycle."}],
        [{"role": "user", "content": "How does the internet work?"}],
        [{"role": "user", "content": "What is sustainable development?"}],
    ]
    large_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in large_conversations
    ]

    await run_batch(large_chats, "Large Batch (8 requests)")


if __name__ == "__main__":
    asyncio.run(main())

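Benefits of Multi-Batch Compilation

Improved throughput: The system automatically selects the best-fitting decoder for each batch of requests, which improves overall throughput.
Flexible serving: Diverse workloads can be handled efficiently without being tied to a single batch size.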
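
To make the resource argument concrete, the sketch below counts unused batch slots for the request counts mentioned earlier (1, 3, 5, and 7), comparing a single batch-size-8 decoder with decoders compiled for 8, 4, and 1. Idle slots are only a rough proxy for wasted compute, the "smallest fitting decoder" rule is an assumption, and `idle_slots` is a hypothetical helper; none of this is a measured result.

```python
def idle_slots(num_requests, decoder_batch_sizes):
    """Unused batch slots when the smallest fitting decoder serves the requests."""
    chosen = min(size for size in decoder_batch_sizes if size >= num_requests)
    return chosen - num_requests


request_counts = [1, 3, 5, 7]
fixed = sum(idle_slots(n, [8]) for n in request_counts)
multi = sum(idle_slots(n, [8, 4, 1]) for n in request_counts)
print(fixed, multi)  # 16 5
```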