Automatic Prefix Caching (APC)

Overview

vLLM RBLN supports Automatic Prefix Caching (APC), which reuses the KV cache from previous requests when a new request shares the same prefix. The model can then skip computation for the overlapping prefix segment, improving efficiency and throughput.

Enabling APC in vLLM RBLN

APC is configured the same way as in vLLM. It is enabled by default. To disable it, set enable_prefix_caching=False.
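
As a minimal sketch of the setting described above, the snippet below passes enable_prefix_caching to the vLLM LLM constructor. The helper names (engine_kwargs, make_llm) and the model path are hypothetical placeholders introduced for illustration; only the enable_prefix_caching argument comes from the text above.

```python
def engine_kwargs(model_path: str, enable_apc: bool = True) -> dict:
    """Build keyword arguments for vllm.LLM with APC switched on or off.

    `model_path` is a placeholder; substitute your own compiled model.
    """
    return {"model": model_path, "enable_prefix_caching": enable_apc}


def make_llm(model_path: str, enable_apc: bool = True):
    """Create a vLLM engine; hypothetical convenience wrapper."""
    # Imported lazily so engine_kwargs() is usable without vLLM installed.
    from vllm import LLM

    return LLM(**engine_kwargs(model_path, enable_apc))


# Example (requires vLLM RBLN and a compiled model; the path is a placeholder):
# llm = make_llm("path/to/compiled-model", enable_apc=False)
```

Because APC is on by default, the flag only needs to be passed explicitly when turning it off, for example to compare latency with and without cache reuse.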

Limits

APC currently caches prefixes in units of 128 tokens. Support for more flexible caching granularities is planned for future releases.
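
To make the 128-token granularity concrete, here is a rough, self-contained illustration. It assumes that only complete 128-token blocks of a shared prefix can be reused, so the reusable length is the shared prefix length rounded down to a multiple of 128. That rounding rule and the function names are assumptions made for this sketch, not a description of the vLLM RBLN internals.

```python
BLOCK_SIZE = 128  # caching granularity stated in this section


def shared_prefix_len(cached_tokens, new_tokens):
    """Count the leading token IDs that two requests have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def reusable_tokens(cached_tokens, new_tokens, block_size=BLOCK_SIZE):
    """Shared prefix length rounded down to whole blocks (assumed rule)."""
    shared = shared_prefix_len(cached_tokens, new_tokens)
    return (shared // block_size) * block_size


# Two requests that share a 300-token system prompt, then diverge.
system_prompt = list(range(300))
request_a = system_prompt + [1001, 1002]
request_b = system_prompt + [2001, 2002, 2003]

print(reusable_tokens(request_a, request_b))  # 256: two full 128-token blocks
```

Under this assumption, a shared prefix shorter than 128 tokens yields no reuse, so padding or restructuring prompts so that the common part spans whole blocks may improve the hit rate.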