Inference with Dynamic Batch Sizes
Overview
In real-world serving scenarios, you often need to handle varying numbers of requests efficiently.
For example, 1, 3, 5, or 7 requests may arrive at the same time. Always running the maximum batch size (e.g., 8) wastes computation when fewer requests are present. Instead, you can compile multiple decoders with different batch sizes and let the system automatically choose the smallest compiled batch size that can hold the actual number of requests.
The rbln_decoder_batch_sizes parameter allows you to specify multiple batch sizes during compilation.
This enables the model to automatically select the most appropriate decoder based on the actual number of requests, improving both throughput and resource utilization.
For example, when 3 requests arrive, the decoder with batch size 4 would be selected, and when 7 requests arrive, the decoder with batch size 8 would be used.
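The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the runtime picks the smallest compiled batch size that is at least the number of pending requests, as the examples in the text suggest. The helper name select_decoder_batch_size is hypothetical and is not part of the optimum-rbln or vllm-rbln API.

```python
from bisect import bisect_left


def select_decoder_batch_size(num_requests, decoder_batch_sizes):
    """Return the smallest compiled batch size that fits num_requests.

    Hypothetical helper for illustration only; the real selection logic
    lives inside the serving runtime and may differ in detail.
    """
    sizes = sorted(decoder_batch_sizes)
    if num_requests < 1 or num_requests > sizes[-1]:
        raise ValueError(
            f"num_requests must be between 1 and {sizes[-1]}, got {num_requests}"
        )
    # bisect_left finds the first compiled size that is >= num_requests.
    return sizes[bisect_left(sizes, num_requests)]


compiled_sizes = [8, 4, 1]
for n in (1, 3, 5, 7):
    chosen = select_decoder_batch_size(n, compiled_sizes)
    # Padded slots are batch positions that carry no real request.
    print(f"{n} request(s) -> decoder batch size {chosen}, {chosen - n} padded slot(s)")
```

With a single decoder of batch size 8, a batch of 3 requests would carry 5 padded slots; with decoders for 8, 4, and 1, the same batch carries only 1.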
Similar Optimization Techniques
This approach is similar to other vLLM optimization techniques:
- CUDA Graph: cudagraph_capture_sizes pre-captures CUDA graphs for different batch sizes
- Inductor Compilation: compile_sizes pre-compiles kernels for specific input sizes
All techniques share the principle of pre-optimizing for expected input sizes to improve dynamic serving performance.
Setup & Installation
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
- Package Requirements:
- Installation Command:
| pip install "optimum-rbln>=0.8.2" "vllm-rbln>=0.8.2"
pip install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.8.2"
|
Execution
Model Compilation with Multiple Decoder Batch Sizes
You can compile the model with multiple decoder batch sizes using rbln_decoder_batch_sizes.
Note
The rbln_decoder_batch_sizes list will be automatically sorted in descending order.
All values must be less than or equal to rbln_batch_size.
If the maximum batch size is not included in the list, it will be automatically added.
| from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # To compile the model, this argument must be True
    rbln_batch_size=8,  # Maximum batch size for prefill
    rbln_max_seq_len=8192,  # Maximum sequence length
    rbln_tensor_parallel_size=4,  # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1],  # Compile decoders for batch sizes 8, 4, and 1
)

# Save compiled results to disk
model.save_pretrained("rbln-dynamic-Llama-3-8B-Instruct")
|
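The three rules in the note above (descending sort, upper bound of rbln_batch_size, automatic addition of the maximum) can be mirrored in plain Python to check a list before starting a long compilation. This is only a sketch of the documented behavior; the function name normalize_decoder_batch_sizes is hypothetical, and the library's own validation may raise different errors or apply extra checks.

```python
def normalize_decoder_batch_sizes(decoder_batch_sizes, batch_size):
    """Mirror the documented rules for rbln_decoder_batch_sizes.

    Hypothetical helper for illustration; not part of optimum-rbln.
    """
    sizes = list(decoder_batch_sizes)
    # Rule: every value must be less than or equal to rbln_batch_size.
    too_large = [size for size in sizes if size > batch_size]
    if too_large:
        raise ValueError(
            f"decoder batch sizes {too_large} exceed rbln_batch_size={batch_size}"
        )
    # Rule: the maximum batch size is added when it is missing.
    if batch_size not in sizes:
        sizes.append(batch_size)
    # Rule: the list is sorted in descending order.
    return sorted(sizes, reverse=True)


print(normalize_decoder_batch_sizes([1, 4], batch_size=8))  # [8, 4, 1]
```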
Inference using vLLM
You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.
Note
Parameters passed to from_pretrained typically require the rbln_ prefix (e.g., rbln_batch_size, rbln_max_seq_len).
In contrast, parameters within rbln_config must not include the prefix.
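To make the naming rule in this note concrete, the sketch below converts prefixed keyword arguments into the unprefixed keys that would go inside rbln_config. The helper to_rbln_config is hypothetical and exists only to show the naming difference; it is not an optimum-rbln function, and whether a given key is accepted in rbln_config should be checked against the library documentation.

```python
def to_rbln_config(**prefixed_kwargs):
    """Strip the rbln_ prefix from keyword arguments.

    Hypothetical helper for illustration; not part of optimum-rbln.
    """
    prefix = "rbln_"
    config = {}
    for name, value in prefixed_kwargs.items():
        if not name.startswith(prefix):
            raise ValueError(f"expected a name starting with {prefix!r}, got {name!r}")
        # Keys inside rbln_config carry no prefix.
        config[name[len(prefix):]] = value
    return config


rbln_config = to_rbln_config(
    rbln_batch_size=8,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
    rbln_decoder_batch_sizes=[8, 4, 1],
)
print(rbln_config)
```

The resulting dictionary uses batch_size, max_seq_len, tensor_parallel_size, and decoder_batch_sizes as keys, which is the unprefixed form the note describes.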
| small_batch_conversations = [
    [{"role": "user", "content": "What is the first letter of English alphabets?"}]
]

medium_batch_conversations = [
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    [{"role": "user", "content": "What are the benefits of renewable energy?"}],
    [{"role": "user", "content": "Describe the process of photosynthesis."}],
    [{"role": "user", "content": "How does machine learning work?"}],
]

large_batch_conversations = [
    [{"role": "user", "content": "What is the theory of relativity?"}],
    [{"role": "user", "content": "Explain blockchain technology."}],
    [{"role": "user", "content": "Describe climate change effects."}],
    [{"role": "user", "content": "How do neural networks learn?"}],
    [{"role": "user", "content": "What is genetic engineering?"}],
    [{"role": "user", "content": "Explain the water cycle."}],
    [{"role": "user", "content": "How does the internet work?"}],
    [{"role": "user", "content": "What is sustainable development?"}],
]
|
| from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Make sure the engine configuration matches the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1]
model_id = "rbln-dynamic-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 8  # Maximum batch size

llm = LLM(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=[tokenizer.eos_token_id],
)

# Each entry is one batch: 1, 4, and 8 requests respectively
conversations = [
    small_batch_conversations,
    medium_batch_conversations,
    large_batch_conversations,
]

for conversation in conversations:
    # Render each chat into a prompt string using the model's chat template
    chats = [
        tokenizer.apply_chat_template(
            conv,
            add_generation_prompt=True,
            tokenize=False,
        ) for conv in conversation
    ]
    outputs = llm.generate(chats, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(generated_text)
    # Separator printed once per batch
    print("=====================================")
|
Example Output:
| The first letter of the English alphabet is "A".
=====================================
Quantum computing! It's a fascinating topic that can be a bit tricky to
Renewable energy has numerous benefits, including:
1. **Sustainability**:
Photosynthesis is the process by which plants, algae, and some bacteria convert light
Machine learning is a subfield of artificial intelligence that involves training algorithms to learn from
=====================================
The theory of relativity, developed by Albert Einstein, is a fundamental concept in
Blockchain technology is a decentralized, distributed ledger system that enables secure, transparent, and
Climate change is having a profound impact on our planet, and its effects are widespread
Neural networks learn through a process called supervised learning, unsupervised learning,
Genetic engineering, also known as genetic modification (GM), is the process of
The water cycle, also known as the hydrologic cycle, is the continuous process
What a great question! The internet is a complex system, but I'll try
Sustainable development is a concept that was first introduced in the 1987 report
|