Automatic Prefix Caching

Overview

vLLM RBLN supports Automatic Prefix Caching (APC), which reuses the KV cache from previous requests when a new request shares the same prefix. The model can then skip computation for the overlapping prefix segment, improving efficiency and throughput.

Enabling APC in vLLM RBLN

APC is configured the same way as in vLLM. It is enabled by default. To disable it, set enable_prefix_caching=False.
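
As a minimal sketch of the setting above, the snippet below passes `enable_prefix_caching` to vLLM's `LLM` constructor. The helper names `build_engine_kwargs` and `make_llm` are hypothetical and exist only for illustration, and the model identifier is a placeholder to replace with your own compiled model. It assumes vLLM RBLN accepts the same constructor argument as upstream vLLM, as the text states.

```python
def build_engine_kwargs(model_id: str, enable_apc: bool = True) -> dict:
    """Collect keyword arguments for vllm.LLM.

    model_id is a placeholder supplied by the caller; enable_apc mirrors
    the enable_prefix_caching option (on by default, per the text above).
    """
    return {"model": model_id, "enable_prefix_caching": enable_apc}


def make_llm(model_id: str, enable_apc: bool = True):
    """Create the engine; vLLM is imported lazily so the helper above
    can be inspected without vLLM installed."""
    from vllm import LLM

    return LLM(**build_engine_kwargs(model_id, enable_apc))
```

Calling `make_llm("path/to/your-model", enable_apc=False)` would create an engine with APC turned off; omitting the second argument keeps the default.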

Limits

Currently, APC applies prefix caching in units of 128 tokens. Support for more flexible caching granularities will be added in future releases.
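
To make the 128-token granularity concrete, the sketch below estimates how much of a shared prefix could be served from the cache. It assumes that only whole 128-token units are reused and that any remainder is recomputed; this is an interpretation of the limit above, not a description of the actual implementation. The function names are hypothetical.

```python
BLOCK_TOKENS = 128  # caching unit stated in this section


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading token IDs that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_prefix_tokens(shared_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Round the shared prefix down to a whole number of caching units.

    Assumption: a partially shared unit is not reused.
    """
    if shared_tokens < 0:
        raise ValueError("shared_tokens must be non-negative")
    return (shared_tokens // block_tokens) * block_tokens


# Two prompts sharing their first 300 token IDs (synthetic values).
previous = list(range(300)) + [1001, 1002]
current = list(range(300)) + [2001]
shared = shared_prefix_len(previous, current)   # 300
reusable = reusable_prefix_tokens(shared)       # 256 under the assumption above
```

Under this assumption, a shared prefix shorter than 128 tokens yields no reuse, so prompts with long common system prompts or few-shot examples stand to benefit most.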