Inference with Dynamic Batch Sizes

Overview

In real-world serving, the number of concurrent requests varies: sometimes a single request arrives, and sometimes 3, 5, or 7 arrive at once. Always running the decoder at the maximum batch size (e.g., 8) wastes computation whenever fewer requests are active. Instead, you can compile multiple decoders with different batch sizes and let the system automatically choose the decoder with the smallest compiled batch size that fits the actual number of requests.

The rbln_decoder_batch_sizes parameter lets you specify multiple decoder batch sizes at compile time. At runtime, the model automatically selects the most appropriate decoder for the actual number of requests, which improves both throughput and resource utilization. For example, with decoders compiled for batch sizes 1, 4, and 8, the batch-size-4 decoder is selected when 3 requests arrive, and the batch-size-8 decoder is used when 7 requests arrive.
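
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual vllm-rbln or optimum-rbln implementation: the helper names select_decoder_batch_size and padded_slots are hypothetical, and the sketch assumes that the chosen decoder is the smallest compiled batch size that can hold all active requests.

```python
from bisect import bisect_left


def select_decoder_batch_size(num_requests, compiled_sizes):
    """Return the smallest compiled batch size that can hold num_requests.

    Hypothetical helper for illustration; not part of optimum-rbln or vllm-rbln.
    """
    sizes = sorted(compiled_sizes)  # ascending order, so bisect can be used
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    idx = bisect_left(sizes, num_requests)  # index of first size >= num_requests
    if idx == len(sizes):
        raise ValueError("num_requests exceeds the largest compiled batch size")
    return sizes[idx]


def padded_slots(num_requests, compiled_sizes):
    """Number of unused batch slots in the selected decoder."""
    return select_decoder_batch_size(num_requests, compiled_sizes) - num_requests


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        multi = padded_slots(n, [8, 4, 1])
        single = padded_slots(n, [8])
        print(f"{n} requests: {multi} idle slots with [8, 4, 1], {single} with [8]")
```

With decoders compiled for 8, 4, and 1, three requests leave one slot idle, whereas a single batch-size-8 decoder would leave five idle. That difference is the wasted computation the parameter is meant to reduce.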

Similar Optimization Techniques

This approach is similar to other vLLM optimization techniques:

- CUDA Graph (cudagraph_capture_sizes): pre-captures CUDA graphs for different batch sizes
- Inductor Compilation (compile_sizes): pre-compiles kernels for specific input sizes

All techniques share the principle of pre-optimizing for expected input sizes to improve dynamic serving performance.

Setup & Installation

Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:

System Requirements:
- Python: 3.9–3.12
- RBLN Driver
Package Requirements:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln – automatically installs vLLM

Installation Command:

pip install "optimum-rbln>=0.9.4" "vllm-rbln>=0.9.4"
pip install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.9.4"

Note

Installing rebel-compiler requires an RBLN Portal account.

Execution

Model Compilation with Multiple Decoder Batch Sizes

You can compile the model with multiple decoder batch sizes using rbln_decoder_batch_sizes.

Note

The rbln_decoder_batch_sizes list will be automatically sorted in descending order. All values must be less than or equal to rbln_batch_size. If the maximum batch size is not included in the list, it will be automatically added.

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                        # To compile the model, this argument must be True
    rbln_batch_size=8,                  # Maximum batch size for prefill
    rbln_max_seq_len=8192,              # Maximum sequence length
    rbln_tensor_parallel_size=4,        # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1], # Compile decoders for batch sizes 8, 4, and 1
)

# Save compiled results to disk
model.save_pretrained("rbln-dynamic-Llama-3-8B-Instruct")
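
The rules in the note above (descending sort, upper bound of rbln_batch_size, automatic addition of the maximum) can be sketched as a small validation function. This is a hypothetical helper, normalize_decoder_batch_sizes, written only to make the rules concrete; it is not the library's code. Dropping duplicate values is an additional assumption of this sketch.

```python
def normalize_decoder_batch_sizes(decoder_batch_sizes, batch_size):
    """Apply the documented rules to a list of decoder batch sizes.

    Hypothetical helper for illustration; not part of optimum-rbln.
    """
    # Rule: every value must be less than or equal to the maximum batch size.
    too_large = [b for b in decoder_batch_sizes if b > batch_size]
    if too_large:
        raise ValueError(f"values {too_large} exceed batch_size={batch_size}")

    # Rule: the maximum batch size is added if it is missing.
    # Assumption of this sketch: duplicates are removed via the set.
    sizes = set(decoder_batch_sizes) | {batch_size}

    # Rule: the list is sorted in descending order.
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    print(normalize_decoder_batch_sizes([1, 4], batch_size=8))
```

Under these assumptions, passing [1, 4] with a maximum batch size of 8 yields the same list as the compile example, [8, 4, 1].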

Inference using vLLM

You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.

Note

Parameters passed to from_pretrained typically require the rbln prefix (e.g., rbln_batch_size, rbln_max_seq_len).

In contrast, keys inside rbln_config must not include the prefix (e.g., batch_size, max_seq_len).
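
To make the naming convention concrete, the sketch below converts prefixed keyword arguments into the unprefixed keys that would go inside rbln_config. The function to_rbln_config is a hypothetical illustration of the naming rule only; it is not an optimum-rbln or vllm-rbln API, and it does not call either library.

```python
def to_rbln_config(prefixed_kwargs):
    """Map rbln_-prefixed keyword names to unprefixed rbln_config keys.

    Hypothetical helper for illustration; keys without the prefix are ignored.
    """
    prefix = "rbln_"
    return {
        key.removeprefix(prefix): value  # str.removeprefix needs Python 3.9+
        for key, value in prefixed_kwargs.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    kwargs = {
        "rbln_batch_size": 8,
        "rbln_max_seq_len": 8192,
        "rbln_decoder_batch_sizes": [8, 4, 1],
    }
    print(to_rbln_config(kwargs))
```

The same setting therefore appears as rbln_batch_size when passed directly to from_pretrained and as batch_size when placed inside rbln_config.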

small_batch_conversations = [
    [{"role": "user", "content": "What is the first letter of English alphabets?"}]
]

medium_batch_conversations = [
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    [{"role": "user", "content": "What are the benefits of renewable energy?"}],
    [{"role": "user", "content": "Describe the process of photosynthesis."}],
    [{"role": "user", "content": "How does machine learning work?"}],
]

large_batch_conversations = [
    [{"role": "user", "content": "What is the theory of relativity?"}],
    [{"role": "user", "content": "Explain blockchain technology."}],
    [{"role": "user", "content": "Describe climate change effects."}],
    [{"role": "user", "content": "How do neural networks learn?"}],
    [{"role": "user", "content": "What is genetic engineering?"}],
    [{"role": "user", "content": "Explain the water cycle."}],
    [{"role": "user", "content": "How does the internet work?"}],
    [{"role": "user", "content": "What is sustainable development?"}],
]

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

def main():
    model_id = "rbln-dynamic-Llama-3-8B-Instruct"
    llm = LLM(model=model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.0,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
    )

    conversations = [
        small_batch_conversations,
        medium_batch_conversations,
        large_batch_conversations,
    ]

    for conversation in conversations:
        chats = [
            tokenizer.apply_chat_template(
                conv,
                add_generation_prompt=True,
                tokenize=False,
            ) for conv in conversation
        ]

        outputs = llm.generate(chats, sampling_params)
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(generated_text)
        print("=====================================")

if __name__ == "__main__":
    main()

Example Output:

The first letter of the English alphabet is "A".
=====================================
Quantum computing! It's a fascinating topic that can be a bit tricky to
Renewable energy has numerous benefits, including:

1. **Sustainability**:
Photosynthesis is the process by which plants, algae, and some bacteria convert light
Machine learning is a subfield of artificial intelligence that involves training algorithms to learn from
=====================================
The theory of relativity, developed by Albert Einstein, is a fundamental concept in
Blockchain technology is a decentralized, distributed ledger system that enables secure, transparent, and
Climate change is having a profound impact on our planet, and its effects are widespread
Neural networks learn through a process called supervised learning, unsupervised learning,
Genetic engineering, also known as genetic modification (GM), is the process of
The water cycle, also known as the hydrologic cycle, is the continuous process
What a great question! The internet is a complex system, but I'll try
Sustainable development is a concept that was first introduced in the 1987 report
