Custom Kernel

Overview

vLLM RBLN supports custom kernels for performance optimization. Currently, the RBLN NPU supports writing custom kernels with Triton.

Installation

No separate installation is required to use Triton: the ability to build custom kernels with Triton is built into the RBLN compiler (rebel-compiler), so installing the toolchain is sufficient.

For installation instructions, see the Installation guide.

Compilation Flow

When you write and run a custom kernel, Triton compiles it into RBLN IR artifacts. The RBLN compiler then compiles those artifacts into a target binary and runs it on the device.

The compilation pipeline is split into:

  • A frontend that performs target-independent analysis and optimization.
  • An RBLN backend that lowers Triton IR to RBLN-specific forms and emits artifacts consumable by rebel-compiler.


The compilation flow is:

  • Frontend (target-independent)

    • Parse the custom kernel and validate its semantics (shapes, pointer usage, and constraints required by the compiler).
    • Lower the custom kernel into Triton Tensor IR (TTIR), a target-independent intermediate representation used for further compilation.
    • Apply target-independent optimizations (e.g., canonicalization and simplification) before target-specific lowering.
  • RBLN backend (target-specific)

    • Run RBLN-specific compiler passes to make the program executable on RBLN NPU (target-specific lowering).
    • Convert TTIR into RBLN IR, the backend IR consumed by the RBLN Compiler.
    • Apply hardware-specific optimizations (e.g., layout and scheduling decisions) to match RBLN constraints.
    • Emit IR artifacts (files/objects) that the toolchain can compile into a target binary and execute.

Basic Custom Kernel Writing Guide

This section demonstrates actual usage with flash_attention_naive_prefill, a custom kernel used in vLLM RBLN, as an example. The related code can be found in flash attention.

How to use a custom kernel

1. Defining a Custom Kernel with Triton

Kernels are defined with the @triton.jit decorator; the kernel's behavior is implemented inside the decorated function.

@triton.jit
def flash_attention_naive_prefill(
    query,
    key,
    ...
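
The signature above is intentionally truncated. As a minimal, self-contained sketch of what a @triton.jit kernel looks like (a hypothetical element-wise add, not part of vLLM RBLN):

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Single-program sketch: BLOCK_SIZE is a compile-time constant
    # (tl.constexpr), so every tensor inside the kernel has a fixed shape.
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)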

2. Custom Kernel Compilation (warmup)

The Triton compiler converts kernels written in Triton into RBLN IR, which rebel-compiler recognizes. To trigger this compilation, run the warmup step provided by Triton.

def warmup(func, *args):
    ...
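
The body above is elided. A hedged sketch of how such a wrapper might look, assuming Triton's JITFunction.warmup(*args, grid=...) API and the hypothetical add_kernel from step 1:

import torch

def warmup(func, *args):
    # warmup compiles the kernel (emitting the RBLN IR artifacts) without
    # launching it; grid=(1,) is only a compile-time placeholder (see the
    # notes below).
    return func.warmup(*args, grid=(1,))

# Example: compile the hypothetical add_kernel for 1024-element inputs.
x, y, out = (torch.empty(1024) for _ in range(3))
warmup(add_kernel, x, y, out, x.numel(), 1024)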

3. PyTorch Operator Registration (triton_op, register_fake)

To make the custom kernel recognizable in PyTorch, register it as a PyTorch operator using triton_op and register_fake.

@triton_op("rbln_triton_ops::flash_attention_naive_prefill", mutates_args=())
def _(
    ...

@register_fake("rbln_triton_ops::flash_attention_naive_prefill")
def _(
    ...
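
The argument lists above are elided. As a hedged sketch of the registration pattern, using the hypothetical add_kernel from step 1 (the real flash_attention_naive_prefill registration takes different arguments) and assuming triton_op, wrap_triton, and register_fake from torch.library:

import torch
import triton
from torch.library import register_fake, triton_op, wrap_triton

# Real implementation: allocates the output and calls the Triton kernel via
# wrap_triton so that torch.compile can trace through it.
@triton_op("rbln_triton_ops::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    block_size = triton.next_power_of_2(x.numel())
    wrap_triton(add_kernel)[(1,)](x, y, out, x.numel(), BLOCK_SIZE=block_size)
    return out

# Fake (meta) implementation: only describes the output shape/dtype so
# PyTorch can trace the op without running the kernel.
@register_fake("rbln_triton_ops::add")
def _(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.empty_like(x)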

4. Operator Invocation (torch.ops.rbln_triton_ops)

When using the kernel in actual models, call it via torch.ops.rbln_triton_ops.<user_kernel_name>. In this example, <user_kernel_name> is flash_attention_naive_prefill.

    flash_attention_naive_prefill = torch.ops.rbln_triton_ops.flash_attention_naive_prefill

    attn_output = flash_attention_naive_prefill(
        ...
    )
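
Continuing the hypothetical add sketch from the previous steps, the registered operator would be invoked the same way:

add = torch.ops.rbln_triton_ops.add
out = add(x, y)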

How to write custom kernels

Notes when writing custom kernels

  • Use tl.static_range for loops

    • Loops must run a fixed number of iterations known at compile time. Use tl.static_range(...) with tl.constexpr loop bounds.
    • Currently, tl.range is not supported.
  • Set keep_dims=True for reductions (tl.max, tl.sum)

    • Many attention kernels broadcast the reduced value back to the original rank.
    • Use keep_dims=True to maintain dimensions and make broadcasting explicit.
  • The warmup grid is compile-time only

    • In this example, warmup is used only for compilation and artifact generation; runtime execution is handled by the RBLN toolchain.
    • Use a simple placeholder such as grid=(1,). A sketch illustrating these notes follows this list.
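
As a hedged illustration of these notes (a simple row-wise reduction sketch, not an RBLN-tuned kernel), a loop written with tl.static_range and reductions with keep_dims=True might look like this:

import triton
import triton.language as tl

@triton.jit
def rowwise_sum_kernel(x_ptr, out_ptr, N_ROWS: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    cols = tl.arange(0, BLOCK_SIZE)
    # tl.static_range: N_ROWS is a tl.constexpr, so the loop has a fixed
    # trip count and is unrolled at compile time (tl.range is not supported).
    for row in tl.static_range(N_ROWS):
        x = tl.load(x_ptr + row * BLOCK_SIZE + cols)
        # keep_dims=True keeps the reduced axis (shape (1,)), so subtracting
        # the max broadcasts explicitly across the (BLOCK_SIZE,) row.
        row_max = tl.max(x, axis=0, keep_dims=True)
        shifted = x - row_max
        tl.store(out_ptr + row, tl.sum(shifted, axis=0))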

Supported Ops

For information about currently supported operations, see Supported Ops. The set of supported operations will expand over time.

In addition to standard Triton operations, operations specialized for RBLN NPU hardware are also supported. For more information, see Specialized Operations.

vLLM RBLN kernel examples

Currently, vLLM RBLN provides custom kernels such as flash_attention_naive_prefill, shown above.

References

Triton official documentation