Inference with Dynamic Batch Sizes
In real-world serving scenarios, you often need to handle varying numbers of requests efficiently: sometimes a single request arrives, sometimes 3, 5, or 7 requests arrive at once. Always using the maximum batch size (e.g., 8) wastes computation when fewer requests are in flight. Instead, you can compile multiple decoders with different batch sizes and let the system automatically choose the smallest compiled decoder that can accommodate the actual number of requests.
The `rbln_decoder_batch_sizes` parameter allows you to specify multiple batch sizes during compilation. This enables the model to automatically select the most appropriate decoder based on the actual number of requests, improving both throughput and resource utilization. For example, when 3 requests arrive, the decoder with batch size 4 is selected, and when 7 requests arrive, the decoder with batch size 8 is used.
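The selection rule itself is simple: use the smallest compiled decoder batch size that can hold the current number of running requests. The following is a minimal illustrative sketch of that rule, not the actual scheduler code in vllm-rbln or optimum-rbln; the helper name is made up for illustration.

```python
# Illustrative sketch of the selection rule described above -- not the actual
# scheduler implementation. `select_decoder_batch_size` is a hypothetical helper.
def select_decoder_batch_size(num_requests: int, compiled_sizes: list[int]) -> int:
    """Pick the smallest compiled decoder batch size that fits num_requests."""
    candidates = [size for size in compiled_sizes if size >= num_requests]
    # Fall back to the largest compiled decoder if the request count exceeds it;
    # in practice the scheduler caps running requests at the maximum batch size.
    return min(candidates) if candidates else max(compiled_sizes)

print(select_decoder_batch_size(3, [8, 4, 1]))  # -> 4
print(select_decoder_batch_size(7, [8, 4, 1]))  # -> 8
print(select_decoder_batch_size(1, [8, 4, 1]))  # -> 1
```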
Similar Optimization Techniques
This approach is similar to other vLLM optimization techniques:
- CUDA Graph: `cudagraph_capture_sizes` pre-captures CUDA graphs for different batch sizes
- Inductor Compilation: `compile_sizes` pre-compiles kernels for specific input sizes
All techniques share the principle of pre-optimizing for expected input sizes to improve dynamic serving performance.
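For comparison, the vLLM options mentioned above are typically passed through the engine's compilation config. The snippet below is only a sketch of that analogous API on GPU backends; the exact parameter names and accepted types depend on your vLLM version, so treat it as an assumption to verify against your installation.

```python
# Sketch of the analogous GPU-side vLLM options (version-dependent; verify
# against your installed vLLM). Shown only to illustrate the shared idea of
# pre-optimizing for a fixed set of batch sizes.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    compilation_config={
        "cudagraph_capture_sizes": [8, 4, 1],  # pre-capture CUDA graphs for these batch sizes
        "compile_sizes": [8, 4, 1],            # pre-compile Inductor kernels for these sizes
    },
)
```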
Compile Model with Multiple Decoder Batch Sizes
```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile with multiple decoder batch sizes
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,                         # To compile the model, this argument must be True
    rbln_batch_size=8,                   # Maximum batch size for prefill
    rbln_max_seq_len=8192,               # Maximum sequence length
    rbln_tensor_parallel_size=4,         # Tensor parallelism
    rbln_decoder_batch_sizes=[8, 4, 1],  # Compile decoders for batch sizes 8, 4, and 1
)

# Save compiled results to disk
model.save_pretrained("rbln-dynamic-Llama-3-8B-Instruct")
```
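After compilation, the saved directory can be reloaded without recompiling by passing `export=False`, following the usual optimum.rbln loading convention. A minimal sketch:

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Reload the already-compiled artifacts from disk; export=False skips recompilation.
model = RBLNLlamaForCausalLM.from_pretrained(
    "rbln-dynamic-Llama-3-8B-Instruct",
    export=False,
)
```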
Note
The `rbln_decoder_batch_sizes` list is automatically sorted in descending order. All values must be less than or equal to `rbln_batch_size`. If the maximum batch size is not included in the list, it is added automatically.
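In other words, the list is normalized at compile time. The sketch below mirrors that behavior in plain Python purely for illustration; the real validation happens inside optimum.rbln, and the helper name is hypothetical.

```python
# Hypothetical helper that mirrors the normalization described in the note;
# the actual checks are performed inside optimum.rbln at compile time.
def normalize_decoder_batch_sizes(decoder_batch_sizes, batch_size):
    if any(size > batch_size for size in decoder_batch_sizes):
        raise ValueError("All decoder batch sizes must be <= rbln_batch_size")
    sizes = set(decoder_batch_sizes)
    sizes.add(batch_size)               # the maximum batch size is added if missing
    return sorted(sizes, reverse=True)  # sorted in descending order

print(normalize_decoder_batch_sizes([1, 4], batch_size=8))  # -> [8, 4, 1]
```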
Use vLLM API for Efficient Dynamic Batch Inference
There are three test cases in this example. When processing `small_batch_conversations` (a single request), the decoder compiled with batch size 1 is expected to be selected, minimizing latency. For `medium_batch_conversations` (4 requests), the batch-size-4 decoder is used, and for `large_batch_conversations` (8 requests), the batch size increases to 8 to achieve higher throughput and better resource utilization.
```python
small_batch_conversations = [
    [{"role": "user", "content": "What is the first letter of the English alphabet?"}],
]

medium_batch_conversations = [
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    [{"role": "user", "content": "What are the benefits of renewable energy?"}],
    [{"role": "user", "content": "Describe the process of photosynthesis."}],
    [{"role": "user", "content": "How does machine learning work?"}],
]

large_batch_conversations = [
    [{"role": "user", "content": "What is the theory of relativity?"}],
    [{"role": "user", "content": "Explain blockchain technology."}],
    [{"role": "user", "content": "Describe climate change effects."}],
    [{"role": "user", "content": "How do neural networks learn?"}],
    [{"role": "user", "content": "What is genetic engineering?"}],
    [{"role": "user", "content": "Explain the water cycle."}],
    [{"role": "user", "content": "How does the internet work?"}],
    [{"role": "user", "content": "What is sustainable development?"}],
]
```
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
# This example assumes the model was compiled with rbln_decoder_batch_sizes=[8, 4, 1]
model_id = "rbln-dynamic-Llama-3-8B-Instruct"
max_seq_len = 8192
batch_size = 8  # Maximum batch size

llm = LLM(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=[tokenizer.eos_token_id],
)

conversations = [
    small_batch_conversations,
    medium_batch_conversations,
    large_batch_conversations,
]

for conversation in conversations:
    chats = [
        tokenizer.apply_chat_template(
            conv,
            add_generation_prompt=True,
            tokenize=False,
        )
        for conv in conversation
    ]
    outputs = llm.generate(chats, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Benefits of Dynamic Batch Compilation
- Better Throughput: The system automatically selects the optimal decoder for each request batch size, improving overall throughput.
- Flexible Serving: Handle varying workloads efficiently without being constrained by a single batch size.