Automatic Prefix Caching

Overview

vLLM RBLN supports Automatic Prefix Caching (APC), which reuses the KV cache from earlier requests when a new request shares the same prefix. The model can then skip computation for the shared prefix segment, which improves efficiency and throughput.

Enabling APC in vLLM RBLN

APC is configured the same way as in vLLM and is enabled by default. To disable it, set enable_prefix_caching=False when creating the engine.
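As a minimal sketch of this setting, the snippet below builds the keyword arguments for an engine with APC turned off. The helper name `engine_kwargs` and the model path are hypothetical and used only for illustration; only `enable_prefix_caching` comes from the text. The engine call itself is left as a comment because it needs vLLM RBLN and a compiled model.

```python
def engine_kwargs(model: str, enable_apc: bool = True) -> dict:
    """Build keyword arguments for vllm.LLM with APC explicitly on or off.

    `engine_kwargs` is a hypothetical helper, not part of vLLM.
    """
    return {"model": model, "enable_prefix_caching": enable_apc}


# "path/to/compiled-model" is a placeholder, not a real model path.
kwargs = engine_kwargs("path/to/compiled-model", enable_apc=False)
print(kwargs)

# On a machine with vLLM RBLN installed, the arguments would be passed as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

Leaving `enable_apc` at its default keeps APC on, which matches the default behavior described above.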

Advanced Configuration: Prefix Cache Hit Granularity

By default, the prefix cache hit granularity is determined by the prefill_chunk_size specified at model compilation time. You can modify this granularity by setting prefix_block_size in additional_config when initializing the LLM Engine, provided that prefix_block_size is a multiple of prefill_chunk_size.

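from vllm import LLM

# MODEL is a placeholder for the path or ID of the compiled model.
llm = LLM(
    model=MODEL,
    additional_config={
        # Must be a multiple of the prefill_chunk_size used at compilation.
        "prefix_block_size": 64,
    },
)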
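The constraint and the idea of granularity can be sketched in plain Python. Both helpers below are hypothetical and not part of vLLM RBLN. The first checks the multiple-of rule stated above. The second assumes that a cache hit covers only whole blocks of `prefix_block_size` tokens, which is one reading of "granularity" and is not confirmed by the text. The chunk size of 32 is likewise an assumed value chosen for illustration.

```python
def validate_prefix_block_size(prefix_block_size: int, prefill_chunk_size: int) -> None:
    """Raise ValueError unless prefix_block_size is a positive multiple
    of prefill_chunk_size (the rule stated in the text)."""
    if prefix_block_size <= 0 or prefill_chunk_size <= 0:
        raise ValueError("sizes must be positive integers")
    if prefix_block_size % prefill_chunk_size != 0:
        raise ValueError(
            f"prefix_block_size={prefix_block_size} is not a multiple of "
            f"prefill_chunk_size={prefill_chunk_size}"
        )


def reusable_prefix_tokens(shared_prefix_len: int, prefix_block_size: int) -> int:
    """Tokens that could be served from the cache, ASSUMING hits are
    counted in whole blocks of prefix_block_size tokens."""
    return (shared_prefix_len // prefix_block_size) * prefix_block_size


# Assumed values for illustration only.
validate_prefix_block_size(prefix_block_size=64, prefill_chunk_size=32)

# Two requests share a 150-token prefix; with 64-token blocks, two whole
# blocks fit inside it, so 128 tokens would be reusable under the assumption.
print(reusable_prefix_tokens(shared_prefix_len=150, prefix_block_size=64))
```

Under this assumption, a smaller block size lets more of a shared prefix be reused, while a larger one means fewer, coarser cache entries.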