vLLM Native API

Using vllm-rbln, you can easily use the vLLM API to serve large language models (LLMs). In this tutorial, you will learn how to run inference with the Llama3-8B and Llama3.1-8B models through the vLLM API, using Eager Attention and Flash Attention, respectively.

How to install

First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of these packages in the Release Note.

$ pip3 install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.8.0" "optimum-rbln>=0.8.0.post2" "vllm-rbln>=0.8.0"
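To verify the installation, you can print the installed versions of the three packages, for example with a short Python check (distribution names as used in the command above):

import importlib.metadata

# Print the installed version of each required package, or report it as missing.
for pkg in ("rebel-compiler", "optimum-rbln", "vllm-rbln"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "is not installed")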

Standard Model Example: Llama3-8B

Step 1: Compile Llama3-8B

You need to compile the Llama3-8B model using optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM
import os

# Export the Hugging Face PyTorch Llama3 model to an RBLN compiled model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                  # To compile the model, this argument must be True
    rbln_max_seq_len=8192,        # Maximum sequence length
    rbln_tensor_parallel_size=4,  # Number of ATOM™+ devices for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # batch_size > 1 is recommended for continuous batching
)

compiled_model.save_pretrained(os.path.basename(model_id))

Note

You can choose an appropriate batch size for your serving needs. Here, it is set to 4.
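Once compiled, the artifacts saved by save_pretrained() can be reloaded later without recompiling. A minimal sketch, assuming the local directory name produced by the save call above:

from optimum.rbln import RBLNLlamaForCausalLM

# Load the previously compiled model from the local directory
# ("Meta-Llama-3-8B-Instruct", created by save_pretrained() above).
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="Meta-Llama-3-8B-Instruct",
    export=False,  # Load compiled artifacts instead of compiling again
)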

Step 2: Use vLLM API for Inference

You can use the compiled model with the vLLM API. The following code shows how to initialize the vLLM engine with the compiled model and run inference with it.

vllm_api_example_llama3_8B.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Meta-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 4

engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)


async def run_multi(chats):
  tasks = [asyncio.create_task(run_single(chat, i)) for (i, chat) in enumerate(chats)]
  return [await task for task in tasks]

# Runs multiple inferences in parallel
conversations = [
  [{"role": "user", "content": "What is the first letter of English alphabets?"}],
  [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
  assert len(result.outputs) > 0, "Invalid output."
  print(result.outputs[0].text)

You can find more vLLM API usage examples for encoder-decoder models and multi-modal models in the RBLN Model Zoo.

Please refer to the vLLM Docs for more information on the vLLM API.

Advanced Example: Llama3.1-8B with Flash Attention

Flash Attention enables efficient handling of long contexts in models such as Llama3.1-8B by reducing memory usage and improving throughput. With optimum-rbln, Flash Attention is enabled by passing the rbln_kvcache_partition_len parameter at compile time.

Step 1: Compile Llama3.1-8B

from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # To compile the model, this argument must be True
    rbln_batch_size=1,                  # Batch size
    rbln_max_seq_len=131_072,           # Maximum sequence length
    rbln_tensor_parallel_size=8,        # Tensor parallelism
    rbln_kvcache_partition_len=16_384,  # Length of KV cache partitions for Flash Attention
)

# Save compiled results to disk
model.save_pretrained(os.path.basename(model_id))

Note

You can choose an appropriate batch size for your serving needs. Here, it is set to 1.

Step 2: Use vLLM API for Inference

After compiling, you can use the model with the vLLM API:

Note

For Flash Attention, block_size must match rbln_kvcache_partition_len.

vllm_api_example_llama3_1_8B.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


# Please make sure the engine configurations match the parameters used when compiling.
model_id = "Llama-3.1-8B-Instruct"
max_seq_len = 131_072
batch_size = 1
block_size = 16_384  # Must match `rbln_kvcache_partition_len` for Flash Attention.


engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=block_size,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
assert len(result.outputs) > 0, "Invalid output."
print(result.outputs[0].text)


async def run_multi(chats):
  tasks = [asyncio.create_task(run_single(chat, i)) for (i, chat) in enumerate(chats)]
  return [await task for task in tasks]

conversations = [
  [{"role": "user", "content": "What is the first letter of English alphabets?"}],
  [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
for result in results:
  assert len(result.outputs) > 0, "Invalid output."
  print(result.outputs[0].text)

Please refer to the vLLM Docs for more information on the vLLM API.

Advanced Example: Multi-Batch Inference with Dynamic Batch Sizes

In real-world serving scenarios, you often need to handle a varying number of requests efficiently: sometimes 1 request arrives, sometimes 3, 5, or 7 arrive simultaneously. Instead of always using the maximum batch size (e.g., 8), which wastes computation for smaller request counts, you can compile multiple decoders with different batch sizes and let the system automatically choose the smallest compiled decoder that can accommodate the actual number of requests.

The rbln_decoder_batch_sizes parameter allows you to specify multiple batch sizes during compilation. This enables the model to automatically select the most appropriate decoder for the actual number of requests, improving both throughput and resource utilization. For example, when 3 requests arrive, the decoder with batch size 4 is selected, and when 7 requests arrive, the decoder with batch size 8 is used, as sketched below.
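The selection rule can be pictured as choosing the smallest compiled decoder batch size that still fits the number of in-flight requests. This is only an illustration of the behavior described above, not the actual scheduler code:

# Illustrative only: pick the smallest compiled decoder batch size that can
# hold the current number of in-flight requests. Assumes the compiled sizes
# include the maximum batch size and that num_requests does not exceed it.
def select_decoder_batch_size(num_requests: int, compiled_sizes=(1, 4, 8)) -> int:
    return min(size for size in compiled_sizes if size >= num_requests)

assert select_decoder_batch_size(1) == 1
assert select_decoder_batch_size(3) == 4
assert select_decoder_batch_size(7) == 8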

Similar Optimization Techniques

This approach is similar to other vLLM optimization techniques:

  • CUDA Graph: cudagraph_capture_sizes - pre-captures CUDA graphs for different batch sizes
  • Inductor Compilation: compile_sizes - pre-compiles kernels for specific input sizes

All techniques share the principle of pre-optimizing for expected input sizes to improve dynamic serving performance.
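For reference, on GPU backends these vLLM knobs are exposed through the compilation config. The sketch below is only meant to illustrate the analogy and is not needed for RBLN devices; the CompilationConfig field names follow recent vLLM releases and may differ in your installed version:

# GPU-side analogy only (not required for RBLN devices); field names may vary by vLLM version.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    compilation_config=CompilationConfig(
        cudagraph_capture_sizes=[1, 4, 8],  # pre-capture CUDA graphs for these batch sizes
        compile_sizes=[1, 4, 8],            # pre-compile kernels for these input sizes
    ),
)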

Step 1: Compile Model with Multiple Decoder Batch Sizes

from optimum.rbln import RBLNLlamaForCausalLM
import os

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # To compile the model, this argument must be True
    rbln_batch_size=8,                  # Maximum batch size for prefill
    rbln_max_seq_len=8192,              # Maximum sequence length
    rbln_tensor_parallel_size=4,        # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1], # Compile decoders for batch sizes 8, 4, and 1
)

# Save compiled results to disk
model.save_pretrained(os.path.basename(model_id))

Note

The rbln_decoder_batch_sizes list is automatically sorted in descending order. All values must be less than or equal to rbln_batch_size. If the maximum batch size is not included in the list, it is added automatically.
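The normalization described in the note can be pictured as follows. This is a hedged illustration of the documented behavior, not optimum-rbln's actual implementation:

# Illustrative only: mirror the documented normalization of rbln_decoder_batch_sizes.
def normalize_decoder_batch_sizes(decoder_batch_sizes, batch_size):
    sizes = sorted(set(decoder_batch_sizes), reverse=True)  # descending order
    if any(size > batch_size for size in sizes):
        raise ValueError("All decoder batch sizes must be <= rbln_batch_size.")
    if batch_size not in sizes:
        sizes.insert(0, batch_size)  # ensure the maximum batch size is present
    return sizes

assert normalize_decoder_batch_sizes([4, 1], batch_size=8) == [8, 4, 1]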

Step 2: Use vLLM API for Efficient Multi-Batch Inference

vllm_api_example_multi_batch.py
import asyncio
import time

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1]
model_id = "Meta-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 8  # Maximum batch size

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def stop_tokens():
    eot_id = next(
        (
            k
            for k, t in tokenizer.added_tokens_decoder.items()
            if t.content == "<|eot_id|>"
        ),
        None,
    )
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]


sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)


async def collect_async_result(
    engine: AsyncLLMEngine, chat, sampling_params: SamplingParams, request_id: str
):
    final_result = None
    async for result in engine.generate(chat, sampling_params, request_id=request_id):
        final_result = result
    return final_result


async def run_batch(chats, batch_name):
    print(f"=== {batch_name} ===")

    tasks = [
        asyncio.create_task(
            collect_async_result(engine, chat, sampling_params, f"{batch_name}_{i}")
        )
        for i, chat in enumerate(chats)
    ]

    results = await asyncio.gather(*tasks)

    for i, result in enumerate(results):
        output = result.outputs[0].text
        print(f"===================== Output {i} ==============================")
        print(output)
        print("===============================================================\n")

    return results


async def main():
    # Scenario 1: Single request (uses batch_size=1 decoder)
    single_conversation = [
        {
            "role": "user",
            "content": "Tell me a short story about artificial intelligence.",
        }
    ]
    single_chat = [
        tokenizer.apply_chat_template(
            single_conversation, add_generation_prompt=True, tokenize=False
        )
    ]

    await run_batch(single_chat, "Single Request")

    # Scenario 2: Medium batch (uses batch_size=4 decoder)
    medium_conversations = [
        [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        [{"role": "user", "content": "What are the benefits of renewable energy?"}],
        [{"role": "user", "content": "Describe the process of photosynthesis."}],
        [{"role": "user", "content": "How does machine learning work?"}],
    ]
    medium_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in medium_conversations
    ]

    await run_batch(medium_chats, "Medium Batch (4 requests)")

    # Scenario 3: Large batch (uses batch_size=8 decoder)
    large_conversations = [
        [{"role": "user", "content": "What is the theory of relativity?"}],
        [{"role": "user", "content": "Explain blockchain technology."}],
        [{"role": "user", "content": "Describe climate change effects."}],
        [{"role": "user", "content": "How do neural networks learn?"}],
        [{"role": "user", "content": "What is genetic engineering?"}],
        [{"role": "user", "content": "Explain the water cycle."}],
        [{"role": "user", "content": "How does the internet work?"}],
        [{"role": "user", "content": "What is sustainable development?"}],
    ]
    large_chats = [
        tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
        for conv in large_conversations
    ]

    await run_batch(large_chats, "Large Batch (8 requests)")


if __name__ == "__main__":
    asyncio.run(main())

Benefits of Multi-Batch Compilation

  1. Better Throughput: The system automatically selects the optimal decoder for each request batch size, improving overall throughput.

  2. Flexible Serving: Handle varying workloads efficiently without being constrained by a single batch size.