Profiling Guide

Maximizing the performance of vLLM RBLN requires low-level profiling that provides an in-depth view of RBLN NPU behavior, in addition to monitoring standard metrics such as TTFT (Time To First Token) and TPOT (Time Per Output Token).

This document covers two profiling approaches: the PyTorch-Level Profiler and the RBLN Profiler.

1. PyTorch-Level Profiler

This method uses the PyTorch Profiler to measure performance at the operation level, enabling analysis of kernel execution times and computational bottlenecks. vLLM RBLN supports the same interface described in the official vLLM profiling guide.

1.1 Environment Setup

You must set the VLLM_TORCH_PROFILER_DIR environment variable to specify the directory where the profiling results (traces) will be saved.

export VLLM_TORCH_PROFILER_DIR=/path/to/save/trace
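Note that this variable must be set in the environment before the vLLM process starts: export it before running an offline script, or before launching the API server.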

1.2 Offline Inference Profiling

For offline inference using Python scripts, use the start_profile() and stop_profile() methods to wrap the section you wish to measure.

from vllm import LLM, SamplingParams

# Initialize the LLM
llm = LLM(
    model="/path/to/model",
    # ... other configurations
)

# Define the inputs (example values)
prompts = ["Seoul is a"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Start profiling
llm.start_profile()

# Run inference; this section is captured in the trace
llm.generate(prompts, sampling_params)

# Stop profiling and flush the trace
llm.stop_profile()
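After stop_profile() returns, the collected trace is written to the directory specified by VLLM_TORCH_PROFILER_DIR.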

1.3 Online Inference Profiling

For online inference via the API server, profiling is controlled by calling dedicated endpoints:

1. Call the /start_profile endpoint to begin profiling.
2. Send the actual inference request (e.g., /v1/chat/completions).
3. Call the /stop_profile endpoint to stop profiling and save the results.

# 1. Start Profiling
curl -X POST http://localhost:8000/start_profile

# 2. Call Model Inference (Generate)
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/path/to/model",
        "messages": [
            {
                "role": "user",
                "content": "Seoul is a"
            }
        ]
    }'

# 3. Stop Profiling
curl -X POST http://localhost:8000/stop_profile
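These endpoints only work when the server was launched with VLLM_TORCH_PROFILER_DIR set. A minimal sketch of such a launch (the model path is a placeholder):

export VLLM_TORCH_PROFILER_DIR=/path/to/save/trace
vllm serve /path/to/model --port 8000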

1.4 Additional Options

For a more detailed analysis, you can set the following additional environment variables (see the example after this list):

  • VLLM_TORCH_PROFILER_RECORD_SHAPES: Records tensor shape information.
  • VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY: Profiles memory usage as well.
  • VLLM_TORCH_PROFILER_WITH_STACK: Records source code stack traces.
  • VLLM_TORCH_PROFILER_WITH_FLOPS: Estimates and records FLOPs (floating-point operation counts).
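For example, to record tensor shapes and stack traces in addition to the timing data (setting a variable to 1 to enable it is an assumption based on the usual vLLM convention):

export VLLM_TORCH_PROFILER_RECORD_SHAPES=1
export VLLM_TORCH_PROFILER_WITH_STACK=1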

1.5 Visualizing Profiling Results

The generated profiling trace files can be visualized using the Perfetto UI.
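Open https://ui.perfetto.dev in a browser and load a trace file from VLLM_TORCH_PROFILER_DIR (typically a .json or .json.gz file).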

2. RBLN Profiler

The RBLN Profiler is a software tool designed for in-depth performance analysis of workloads running on RBLN NPUs. For more details, please refer to the RBLN Profiler documentation.

2.1 Constraints and Configuration

In the vLLM RBLN environment, profiling with the RBLN Profiler is supported only when multiprocessing is disabled. Support for profiling with multiprocessing enabled will be added in an upcoming release.

Set the environment variables as shown below and run vLLM to generate profiling results.

  • RBLN_PROFILER=1: Enables the RBLN Profiler.
  • VLLM_ENABLE_V1_MULTIPROCESSING=0: Disables multiprocessing.

export RBLN_PROFILER=1
export VLLM_ENABLE_V1_MULTIPROCESSING=0
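With both variables exported, run your vLLM workload as usual; the profiling results are generated during execution. For example, with an offline inference script (offline_inference.py is a hypothetical file name):

python offline_inference.py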

2.2 Visualizing Profiling Results

The resulting profiling trace files can also be visualized using the Perfetto UI. For more details on trace analysis, please refer to the Perfetto Analysis Guide.