Automatic Prefix Caching
Overview
vLLM RBLN supports Automatic Prefix Caching (APC), which reuses the KV cache from earlier requests when a new request shares the same prefix. The model can then skip computation for the shared prefix segment, which improves efficiency and throughput.
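To make the idea of a shared prefix concrete, the following is a toy sketch, not vLLM RBLN's actual implementation. It assumes that reuse happens in whole blocks of a fixed size; the function name reusable_prefix_len and the token values are hypothetical.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], new: Sequence[int], granularity: int) -> int:
    """Return how many leading tokens of `new` could be served from `cached`.

    Toy model: only whole blocks of `granularity` tokens count as a hit.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    common = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        common += 1
    # Round down to a whole number of blocks.
    return (common // granularity) * granularity


# The two requests share their first 5 tokens; with a granularity of 4,
# only the first full block of 4 tokens counts as reusable.
hit = reusable_prefix_len([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 9], granularity=4)
print(hit)
```

In this toy model, a coarser granularity can leave part of a shared prefix unused, because a partially matching block does not count as a hit.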
Enabling APC in vLLM RBLN
APC is configured the same way as in vLLM. It is enabled by default. To disable it, set enable_prefix_caching=False.
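As a minimal sketch of the setting described above, the snippet below collects the engine arguments with APC turned off. The helper names build_engine_kwargs and create_llm and the model path are hypothetical placeholders; only enable_prefix_caching comes from the text, and the sketch assumes it is passed to vLLM's LLM constructor as in upstream vLLM.

```python
from typing import Any, Dict


def build_engine_kwargs(model: str, enable_apc: bool = True) -> Dict[str, Any]:
    """Collect keyword arguments for vllm.LLM (hypothetical helper)."""
    return {"model": model, "enable_prefix_caching": enable_apc}


def create_llm(model: str, enable_apc: bool = True):
    """Build the engine; requires vLLM (with the RBLN plugin) to be installed."""
    # Imported lazily so build_engine_kwargs can be used without vLLM present.
    from vllm import LLM

    return LLM(**build_engine_kwargs(model, enable_apc))


# "path/to/compiled-model" is a placeholder, not a real model path.
kwargs = build_engine_kwargs("path/to/compiled-model", enable_apc=False)
print(kwargs)
```

Because APC is on by default, the flag only needs to be passed when disabling it.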
Advanced Configuration: Prefix Cache Hit Granularity
By default, the prefix cache hit granularity is determined by the prefill_chunk_size specified at model compilation time. You can modify this granularity by setting prefix_block_size in additional_config when initializing the LLM Engine, provided that prefix_block_size is a multiple of prefill_chunk_size.
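The constraint above can be checked before the engine is created. The following is a sketch under stated assumptions: the helper make_additional_config is hypothetical, the values 128 and 256 are illustrative, and the resulting dictionary is assumed to be passed as the additional_config argument when constructing the engine.

```python
from typing import Any, Dict


def make_additional_config(prefix_block_size: int, prefill_chunk_size: int) -> Dict[str, Any]:
    """Validate prefix_block_size and return an additional_config dict (hypothetical helper)."""
    if prefix_block_size <= 0 or prefill_chunk_size <= 0:
        raise ValueError("sizes must be positive integers")
    if prefix_block_size % prefill_chunk_size != 0:
        raise ValueError(
            f"prefix_block_size ({prefix_block_size}) must be a multiple of "
            f"prefill_chunk_size ({prefill_chunk_size})"
        )
    return {"prefix_block_size": prefix_block_size}


# Illustrative values: a model compiled with prefill_chunk_size=128.
additional_config = make_additional_config(prefix_block_size=256, prefill_chunk_size=128)
print(additional_config)

# Assumed usage, not executed here:
#   LLM(model="path/to/compiled-model", additional_config=additional_config)
```

Failing early with a clear message keeps a mismatched value from surfacing only at engine initialization.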