Custom Kernel¶
Overview¶
vLLM RBLN supports custom kernels for performance optimization. Currently, RBLN NPU supports writing custom kernels with Triton.
Installation¶
No separate installation is required to use Triton. The ability to build custom kernels using Triton is built into the RBLN compiler (rebel-compiler). Installing the toolchain is sufficient.
For installation instructions, see the Installation guide.
Compilation Flow¶
When you write and run a custom kernel, Triton compiles it into RBLN IR artifacts. The RBLN compiler then compiles those artifacts into a target binary and runs it on the device.
The compilation pipeline is split into:
- A frontend that performs target-independent analysis and optimization.
- An RBLN backend that lowers Triton IR to RBLN-specific forms and emits artifacts consumable by rebel-compiler.
The compilation flow is:
- Frontend (target-independent)
    - Parse the custom kernel and validate its semantics (shapes, pointer usage, and constraints required by the compiler).
    - Lower the custom kernel into Triton Tensor IR (TTIR), a target-independent intermediate representation used for further compilation.
    - Apply target-independent optimizations (e.g., canonicalization and simplification) before target-specific lowering.
- RBLN backend (target-specific)
    - Run RBLN-specific compiler passes to make the program executable on the RBLN NPU (target-specific lowering).
    - Convert TTIR into RBLN IR, the backend IR consumed by the RBLN compiler.
    - Apply hardware-specific optimizations (e.g., layout and scheduling decisions) to match RBLN constraints.
    - Emit IR artifacts (files/objects) that the toolchain can compile into a target binary and execute.
Basic Custom Kernel Writing Guide¶
This section demonstrates actual usage with flash_attention_naive_prefill, a kernel used in vLLM RBLN, as the example. The related code can be found in flash attention.
How to use a custom kernel¶
1. Defining a Custom Kernel with Triton¶
Kernels are defined with @triton.jit. Kernel behavior should be implemented inside the function.
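For illustration, a minimal, hypothetical element-wise kernel (not the actual flash_attention_naive_prefill implementation; the names and block size are placeholders) might look like this. The later steps reuse this add_kernel sketch.

```python
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance processes one BLOCK-sized chunk of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```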
2. Custom Kernel Compilation (warmup)¶
The Triton compiler converts kernels written in Triton into RBLN IR, which rebel-compiler recognizes. To trigger this conversion, run the warmup provided by Triton.
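Continuing the add_kernel sketch above, a warmup call might look like the following; the tensor shapes and BLOCK value are illustrative, and grid=(1,) is only a compile-time placeholder (see the notes further down).

```python
import torch

x = torch.empty(1024, dtype=torch.float32)
y = torch.empty(1024, dtype=torch.float32)
out = torch.empty(1024, dtype=torch.float32)

# warmup compiles the kernel for these argument types without launching it;
# grid=(1,) is a placeholder used only for compilation.
add_kernel.warmup(x, y, out, out.numel(), BLOCK=256, grid=(1,))
```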
3. PyTorch Operator Registration (triton_op, register_fake)¶
To make a custom kernel callable from PyTorch, it must be registered with PyTorch.
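A sketch of such a registration, again using the add_kernel example rather than the real flash attention kernel: the rbln_triton_ops::add name, the wrap_triton helper (available in recent PyTorch releases), and the grid/block choices are assumptions for illustration, and the actual registration in vLLM RBLN may differ.

```python
import torch
import triton
from torch.library import register_fake, triton_op, wrap_triton


@triton_op("rbln_triton_ops::add", mutates_args={})
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # wrap_triton keeps the kernel launch traceable by torch.compile/export.
    wrap_triton(add_kernel)[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)
    return out


@register_fake("rbln_triton_ops::add")
def _(x, y):
    # Fake (meta) implementation: only output shape/dtype, no computation.
    return torch.empty_like(x)
```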
4. Operator Invocation (torch.ops.rbln_triton_ops)¶
When using the kernel in an actual model, call torch.ops.rbln_triton_ops.<user_kernel_name>. In this example, <user_kernel_name> is flash_attention_naive_prefill.
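For example, the illustrative rbln_triton_ops::add op registered above would be invoked as shown below; the real flash_attention_naive_prefill op takes a different argument list.

```python
import torch

# Tensors should live on a device supported by the active Triton backend.
x = torch.randn(1024)
y = torch.randn(1024)

out = torch.ops.rbln_triton_ops.add(x, y)
```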
How to write custom kernels¶
Notes when writing custom kernels¶
- Use tl.static_range for loops (see the sketch after this list)
    - Loops operate with a fixed number of iterations. Use tl.static_range(...) and tl.constexpr parameters.
    - Currently, tl.range is not supported.
- Set keep_dims=True for reductions (tl.max, tl.sum)
    - Many attention kernels broadcast the reduced value back to the original rank.
    - Use keep_dims=True to maintain dimensions and make broadcasting explicit.
- In this example, the warmup grid is compile-time only
    - In this example, warmup is used for compilation and artifact generation, and runtime execution is handled by the RBLN toolchain.
    - Use a simple placeholder such as grid=(1,).
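The following sketch (with illustrative names and sizes) shows a loop and reduction written according to the first two notes:

```python
import triton
import triton.language as tl


@triton.jit
def blockwise_max_sub_kernel(x_ptr, out_ptr, N_BLOCKS: tl.constexpr, BLOCK: tl.constexpr):
    # Fixed trip count: N_BLOCKS is a tl.constexpr, and tl.static_range unrolls
    # the loop at compile time (tl.range is currently not supported).
    for i in tl.static_range(N_BLOCKS):
        offsets = i * BLOCK + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offsets)
        # keep_dims=True preserves the reduced axis, so the maximum
        # broadcasts back against x without an explicit reshape.
        m = tl.max(x, axis=0, keep_dims=True)
        tl.store(out_ptr + offsets, x - m)
```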
Supported Ops¶
For information about currently supported operations, see Supported Ops. The set of supported operations will expand over time.
In addition to standard Triton operations, additional operations specialized for RBLN NPU hardware are also supported. For more information, refer to Specialized Operations.
vLLM RBLN kernel examples¶
Currently, vLLM RBLN supports the following custom kernels.
