vLLM Native API
Using vllm-rbln, you can easily use the vLLM API to serve large language models (LLMs). In this tutorial, you will learn how to perform inference with the Llama3-8B and Llama3.1-8B models using the vLLM API, with Eager Attention and Flash Attention, respectively.
How to install
First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.
```bash
$ pip3 install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.8.0" "optimum-rbln>=0.8.0.post2" "vllm-rbln>=0.8.0"
```
Standard Model Example: Llama3-8B
Step1: Compile Llama3-8B
You need to compile the Llama3-8B model using optimum-rbln.
```python
from optimum.rbln import RBLNLlamaForCausalLM
import os

# Export HuggingFace PyTorch Llama3 model to RBLN compiled model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # To compile the model, this argument must be True
    rbln_max_seq_len=8192,  # Maximum sequence length
    rbln_tensor_parallel_size=4,  # Number of ATOM™+ for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,  # batch_size > 1 is recommended for continuous batching
)
compiled_model.save_pretrained(os.path.basename(model_id))
```
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 4.
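To sanity-check the compiled artifacts before serving, you can reload them with optimum-rbln. The snippet below is a minimal sketch that assumes the usual optimum-rbln loading convention (export=False loads an already-compiled model); the directory name comes from the save_pretrained call above.

```python
# Minimal sketch (assumption: standard optimum-rbln loading convention, where
# export=False loads precompiled artifacts instead of recompiling).
from optimum.rbln import RBLNLlamaForCausalLM

reloaded = RBLNLlamaForCausalLM.from_pretrained(
    "Meta-Llama-3-8B-Instruct",  # directory created by save_pretrained() above
    export=False,                # load the compiled model; do not recompile
)
```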
Step2: Use vLLM API for Inference
You can use the compiled model with vLLM APIs. The following code shows how to initialize the vLLM engine with the compiled model and run the inference with the engine.
vllm_api_example_llama3_8B.py

```python
import asyncio

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Meta-Llama-3-8B-Instruct"  # Compiled model directory saved in Step 1
max_seq_len = 8192
batch_size = 4

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def stop_tokens():
    eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]


sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
    results_generator = engine.generate(chat, sampling_params, request_id=request_id)
    final_result = None
    async for result in results_generator:
        # You can use the intermediate `result` here, if needed.
        final_result = result
    return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)


# Runs multiple inferences in parallel
async def run_multi(chats):
    tasks = [asyncio.create_task(run_single(chat, str(i))) for i, chat in enumerate(chats)]
    return [await task for task in tasks]


conversations = [
    [{"role": "user", "content": "What is the first letter of English alphabets?"}],
    [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
    tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
    assert len(result.outputs) > 0, "Invalid output."
    print(result.outputs[0].text)
```
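If you do not need asynchronous streaming, the same compiled model can also be driven through vLLM's synchronous LLM class. The snippet below is a minimal sketch that assumes vllm-rbln accepts the same engine arguments through LLM as through AsyncEngineArgs above.

```python
# Minimal sketch (assumption: vllm-rbln also supports vLLM's synchronous LLM class
# with the same engine arguments used for AsyncEngineArgs above).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Meta-Llama-3-8B-Instruct"  # compiled model directory from Step 1
tokenizer = AutoTokenizer.from_pretrained(model_id)

llm = LLM(
    model=model_id,
    device="rbln",
    max_num_seqs=4,
    max_num_batched_tokens=8192,
    max_model_len=8192,
    block_size=8192,
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the first letter of English alphabets?"}],
    add_generation_prompt=True,
    tokenize=False,
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, skip_special_tokens=True))
print(outputs[0].outputs[0].text)
```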
You can find more vLLM API usage examples for encoder-decoder models and multi-modal models in RBLN Model Zoo.
Please refer to the vLLM Docs for more information on the vLLM API.
Advanced Example: Llama3.1-8B with Flash Attention
Flash Attention enables efficient handling of long contexts in models like Llama3.1-8B by reducing memory usage and improving throughput. When working with optimum-rbln, Flash Attention can be enabled by adding the rbln_kvcache_partition_len parameter when compiling.
Step1: Compile Llama3.1-8B
```python
from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # To compile the model, this argument must be True
    rbln_batch_size=1,  # Batch size
    rbln_max_seq_len=131_072,  # Maximum sequence length
    rbln_tensor_parallel_size=8,  # Tensor parallelism
    rbln_kvcache_partition_len=16_384,  # Length of KV cache partitions for Flash Attention
)

# Save compiled results to disk
model.save_pretrained(os.path.basename(model_id))
```
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 1.
Step2: Use vLLM API for Inference
After compiling, you can use the model with vLLM APIs:
Note
For Flash Attention, block_size should match rbln_kvcache_partition_len.
vllm_api_example_llama3_1_8B.py

```python
import asyncio

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Llama-3.1-8B-Instruct"  # Compiled model directory saved in Step 1
max_seq_len = 131_072
batch_size = 1
block_size = 16_384  # Should match `rbln_kvcache_partition_len` for Flash Attention.

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=block_size,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def stop_tokens():
    eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]


sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
    results_generator = engine.generate(chat, sampling_params, request_id=request_id)
    final_result = None
    async for result in results_generator:
        # You can use the intermediate `result` here, if needed.
        final_result = result
    return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
assert len(result.outputs) > 0, "Invalid output."
print(result.outputs[0].text)


# Runs multiple inferences in parallel
async def run_multi(chats):
    tasks = [asyncio.create_task(run_single(chat, str(i))) for i, chat in enumerate(chats)]
    return [await task for task in tasks]


conversations = [
    [{"role": "user", "content": "What is the first letter of English alphabets?"}],
    [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
    tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
    assert len(result.outputs) > 0, "Invalid output."
    print(result.outputs[0].text)
```
Please refer to the vLLM Docs for more information on the vLLM API.
Advanced Example: Multi-Batch Inference with Dynamic Batch Sizes
In real-world serving scenarios, you often need to handle varying numbers of requests efficiently. For example, you might receive a single request at one moment and 3, 5, or 7 simultaneous requests the next. Instead of always using the maximum batch size (e.g., 8), which wastes computation for smaller request counts, you can compile multiple decoders with different batch sizes and let the system automatically choose the decoder whose batch size best matches the actual number of requests.
The rbln_decoder_batch_sizes parameter allows you to specify multiple batch sizes during compilation. This enables the model to automatically select the most appropriate decoder based on the actual number of requests, improving both throughput and resource utilization. For example, when 3 requests arrive, the decoder with batch size 4 is selected, and when 7 requests arrive, the decoder with batch size 8 is used.
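The exact selection policy is internal to vllm-rbln, but the examples above are consistent with picking the smallest compiled decoder that can hold the current number of running requests. The following is a hypothetical illustration of that idea, not the actual implementation:

```python
# Hypothetical illustration of decoder selection (not the vllm-rbln implementation):
# choose the smallest compiled decoder batch size that fits the running requests,
# falling back to the largest decoder otherwise.
def select_decoder_batch_size(num_requests: int, compiled_sizes=(8, 4, 1)) -> int:
    fitting = [s for s in sorted(compiled_sizes) if s >= num_requests]
    return fitting[0] if fitting else max(compiled_sizes)

assert select_decoder_batch_size(1) == 1   # single request -> batch-size-1 decoder
assert select_decoder_batch_size(3) == 4   # 3 requests -> batch-size-4 decoder
assert select_decoder_batch_size(7) == 8   # 7 requests -> batch-size-8 decoder
```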
Similar Optimization Techniques
This approach is similar to other vLLM optimization techniques:
- CUDA Graph: cudagraph_capture_sizes pre-captures CUDA graphs for different batch sizes
- Inductor Compilation: compile_sizes pre-compiles kernels for specific input sizes
All techniques share the principle of pre-optimizing for expected input sizes to improve dynamic serving performance.
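For comparison, here is a hedged sketch of how those GPU-side options are typically passed in stock vLLM. This is an assumption about recent vLLM releases, is unrelated to RBLN devices, and the exact import path and accepted fields may differ across versions.

```python
# Hedged sketch, GPU-only and version-dependent: recent stock vLLM releases expose
# cudagraph_capture_sizes and compile_sizes on CompilationConfig. This is not part
# of vllm-rbln; it only illustrates the analogous pre-optimization idea.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    compilation_config=CompilationConfig(
        cudagraph_capture_sizes=[1, 4, 8],  # pre-capture CUDA graphs for these batch sizes
        compile_sizes=[1, 4, 8],            # pre-compile kernels for these input sizes
    ),
)
```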
Step1: Compile Model with Multiple Decoder Batch Sizes
```python
from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # To compile the model, this argument must be True
    rbln_batch_size=8,  # Maximum batch size for prefill
    rbln_max_seq_len=8192,  # Maximum sequence length
    rbln_tensor_parallel_size=4,  # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1],  # Compile decoders for batch sizes 8, 4, and 1
)

# Save compiled results to disk
model.save_pretrained(os.path.basename(model_id))
```
Note
The rbln_decoder_batch_sizes list is automatically sorted in descending order. All values must be less than or equal to rbln_batch_size. If the maximum batch size is not included in the list, it will be added automatically.
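As a hypothetical illustration of that normalization (not the actual optimum-rbln code):

```python
# Hypothetical illustration of the normalization described above (not the actual
# optimum-rbln code): values are validated against rbln_batch_size, the maximum
# batch size is added if missing, and the list is sorted in descending order.
def normalize_decoder_batch_sizes(sizes: list[int], max_batch_size: int) -> list[int]:
    if any(s > max_batch_size for s in sizes):
        raise ValueError("all rbln_decoder_batch_sizes must be <= rbln_batch_size")
    if max_batch_size not in sizes:
        sizes = [*sizes, max_batch_size]
    return sorted(set(sizes), reverse=True)

print(normalize_decoder_batch_sizes([4, 1], 8))  # [8, 4, 1]
```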
Step2: Use vLLM API for Efficient Multi-Batch Inference
vllm_api_example_multi_batch.py

```python
import asyncio

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1].
model_id = "Meta-Llama-3-8B-Instruct"  # Compiled model directory saved in Step 1
max_seq_len = 8192
batch_size = 8  # Maximum batch size

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def stop_tokens():
    eot_id = next(
        (
            k
            for k, t in tokenizer.added_tokens_decoder.items()
            if t.content == "<|eot_id|>"
        ),
        None,
    )
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]


sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)


async def collect_async_result(
    engine: AsyncLLMEngine, chat, sampling_params: SamplingParams, request_id: str
):
    final_result = None
    async for result in engine.generate(chat, sampling_params, request_id=request_id):
        final_result = result
    return final_result


async def run_batch(chats, batch_name):
    print(f"=== {batch_name} ===")
    tasks = [
        asyncio.create_task(
            collect_async_result(engine, chat, sampling_params, f"{batch_name}_{i}")
        )
        for i, chat in enumerate(chats)
    ]
    results = await asyncio.gather(*tasks)
    for i, result in enumerate(results):
        output = result.outputs[0].text
        print(f"===================== Output {i} ==============================")
        print(output)
        print("===============================================================\n")
    return results


async def main():
    # Scenario 1: Single request (uses batch_size=1 decoder)
    single_conversation = [
        {
            "role": "user",
            "content": "Tell me a short story about artificial intelligence.",
        }
    ]
    single_chat = [
        tokenizer.apply_chat_template(
            single_conversation, add_generation_prompt=True, tokenize=False
        )
    ]
    await run_batch(single_chat, "Single Request")

    # Scenario 2: Medium batch (uses batch_size=4 decoder)
    medium_conversations = [
        [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        [{"role": "user", "content": "What are the benefits of renewable energy?"}],
        [{"role": "user", "content": "Describe the process of photosynthesis."}],
        [{"role": "user", "content": "How does machine learning work?"}],
    ]
    medium_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in medium_conversations
    ]
    await run_batch(medium_chats, "Medium Batch (4 requests)")

    # Scenario 3: Large batch (uses batch_size=8 decoder)
    large_conversations = [
        [{"role": "user", "content": "What is the theory of relativity?"}],
        [{"role": "user", "content": "Explain blockchain technology."}],
        [{"role": "user", "content": "Describe climate change effects."}],
        [{"role": "user", "content": "How do neural networks learn?"}],
        [{"role": "user", "content": "What is genetic engineering?"}],
        [{"role": "user", "content": "Explain the water cycle."}],
        [{"role": "user", "content": "How does the internet work?"}],
        [{"role": "user", "content": "What is sustainable development?"}],
    ]
    large_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in large_conversations
    ]
    await run_batch(large_chats, "Large Batch (8 requests)")


if __name__ == "__main__":
    asyncio.run(main())
```
Benefits of Multi-Batch Compilation
- Better Throughput: The system automatically selects the optimal decoder for each request batch size, improving overall throughput.
- Flexible Serving: Handle varying workloads efficiently without being constrained by a single batch size.