Qwen3 (Custom Kernel)¶
Overview¶
This tutorial shows how to enable Triton custom kernels in vLLM RBLN and run a Qwen3-0.6B example.
All paths are written relative to the vllm-rbln repository root.
You will wire a model call site to use torch.ops.rbln_triton_ops.* and validate the execution path end-to-end.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.10–3.13
    - RBLN Driver
- Package Requirements:
    - RBLN Compiler
    - vllm-rbln – install in Step 1
- Installation Command: a hedged sketch follows the note below.
Note
rebel-compiler requires an RBLN Portal account.
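A hedged sketch of the installation command (the index URL below is a placeholder; use the one provided with your RBLN Portal account):

```bash
# Install the RBLN compiler package.
# <RBLN_PORTAL_INDEX_URL> is a placeholder -- substitute the index from your RBLN Portal account.
pip3 install -i <RBLN_PORTAL_INDEX_URL> rebel-compiler
```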
Execution¶
Step labels
- Action: required steps to run the example.
- Reference: background and integration notes; follow as needed.
1) Action: start from the repository root¶
Clone vllm-rbln and move into the repository root.
All steps below assume you are working from the repository root: vllm-rbln/.
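A hedged sketch of this step (the repository URL and the editable install are assumptions; adjust to your setup):

```bash
# Clone the repository and work from its root (URL is an assumption).
git clone https://github.com/rebellions-sw/vllm-rbln.git
cd vllm-rbln

# Install vllm-rbln from the checked-out source (editable install is an assumption).
pip3 install -e .
```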
2) Reference: identify the integration touch points¶
Use this map to find the exact files you will edit for registration, invocation, and execution.
| Path | Why you open it | What you do |
|---|---|---|
| `vllm_rbln/__init__.py` | Initialization import hook (`register_ops()`). | Operators are registered because `register_ops()` imports `vllm_rbln.triton_kernels.*` during initialization. If you add a new operator module, ensure it is imported there (see the sketch after this table). |
| `vllm_rbln/triton_kernels/` | Operator definitions, schemas, and registration patterns. | Reference: use this to confirm operator interfaces (Step 3). You can proceed without editing these files for this guide. |
| `vllm_rbln/v1/attention/backends/flash_attention.py` | Model call site (operator invocation wiring). | Reference: confirm it selects `torch.ops.rbln_triton_ops.*` when `RBLN_USE_CUSTOM_KERNEL=1` (Step 4). |
| `examples/experimental/offline_inference_basic.py` | Execution entrypoint. | Run compilation and inference with env vars enabled (Step 5). |
3) Reference: operator interface and registration¶
Use this step as a reference to understand the operator contracts used by vLLM RBLN.
Terminology (and how it maps to code)
| Term | Meaning in this guide | Where it shows up |
|---|---|---|
| Custom kernel | A Triton kernel implementation exposed to model code via a PyTorch operator under `torch.ops.rbln_triton_ops.*`. | `@triton.jit` kernel functions under `vllm_rbln/triton_kernels/` |
| Registration | Making the PyTorch operator available under `torch.ops.rbln_triton_ops.<op_name>`. | `@triton_op("rbln_triton_ops::<op_name>", ...)` and `@register_fake("rbln_triton_ops::<op_name>")` |
| Invocation | Calling the operator from model code at runtime. | `torch.ops.rbln_triton_ops.<op_name>(...)` call sites |
| Interface item | Why it matters |
|---|---|
| Operator name (`<op_name>`) | Model code resolves the operator by name. Renaming breaks lookup. |
| Operator schema (signature) | The model passes tensors in a fixed argument order. Mismatches fail at runtime/compile time. |
| Tensor dtypes / shapes / layouts | Kernels often assume specific dtypes/layouts. Mismatches can fail compilation or produce incorrect results. |
Notes
- Schema compatibility: some operators include placeholder tensor arguments (e.g., `dummy0`) to keep the schema stable.
Example: @triton.jit kernel implementation
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
import torch
from rebel import triton
from rebel.triton import language as tl


@triton.jit
def flash_attention_naive_prefill(
    query,
    key,
    value,
    kv_cache,
    mask,
    output,
    qk_scale,
    seq_idx,
    block_table,
    block_size,
    H: tl.constexpr,
    G: tl.constexpr,
    D: tl.constexpr,
    L: tl.constexpr,
    NB: tl.constexpr,
    P: tl.constexpr,
    C: tl.constexpr,
    B: tl.constexpr,
    DIM_BLOCK_TABLE: tl.constexpr,
):
    NP: tl.constexpr = C // P
    for batch_id in tl.static_range(0, NB, 1):
        Q_block_ptr = tl.make_block_ptr(
            base=query,
            shape=(NB, H, G, L, D),
            strides=(H * G * L * D, G * L * D, L * D, D, 1),
            offsets=(batch_id, 0, 0, 0, 0),
            block_shape=(1, H, G, L, D),
            order=(4, 3, 2, 1, 0),
        )
        # ... more block pointers + compute ...
```
Example: @triton_op operator registration
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
# Note: `triton_op` is assumed here to be `torch.library.triton_op`; the
# repository may import it from a different module.
import torch
from torch.library import triton_op


@triton_op("rbln_triton_ops::flash_attention_naive_prefill", mutates_args=())
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    ...


@triton_op("rbln_triton_ops::flash_attention_naive_decode", mutates_args=())
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    ...
```
Example: @register_fake operator stub
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
# Note: `register_fake` is assumed here to be `torch.library.register_fake`; the
# repository may import it from a different module.
import torch
from torch.library import register_fake


@register_fake("rbln_triton_ops::flash_attention_naive_prefill")
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    return torch.empty_like(query)


@register_fake("rbln_triton_ops::flash_attention_naive_decode")
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    return torch.empty_like(query)
```
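These fake registrations give the tracer/compiler a shape-and-dtype-only implementation (here, an empty tensor shaped like `query`) so graphs can be built without executing the Triton kernels.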
4) Reference: confirm the model call site selects torch.ops.rbln_triton_ops¶
Confirm the model call site selects torch.ops.rbln_triton_ops.* so the custom kernel path is exercised at runtime.
Target: vllm_rbln/v1/attention/backends/flash_attention.py
RBLN_USE_CUSTOM_KERNEL=1 enables custom kernels used for attention:
Excerpt: kernel-mode selection for flash_attention_naive_*
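The repository excerpt is not reproduced here; the following is a minimal sketch of that selection logic, assuming a hypothetical helper name (`prefill_attention`) and a placeholder fallback (the actual code in `flash_attention.py` may differ):

```python
# Hedged sketch (not the repository code): when RBLN_USE_CUSTOM_KERNEL=1,
# route prefill attention through the registered custom operator.
import os

import torch

USE_CUSTOM_KERNEL = os.environ.get("RBLN_USE_CUSTOM_KERNEL", "0") == "1"


def prefill_attention(query, key, value, kv_cache, mask,
                      qk_scale, seq_idx, block_table, dummy0):
    if USE_CUSTOM_KERNEL:
        # Custom-kernel path: operator registered by vllm_rbln.triton_kernels
        return torch.ops.rbln_triton_ops.flash_attention_naive_prefill(
            query, key, value, kv_cache, mask,
            qk_scale, seq_idx, block_table, dummy0,
        )
    # Default (non-custom-kernel) attention path, omitted in this sketch.
    raise NotImplementedError("default attention path not shown here")
```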
5) Action: run inference with custom kernels enabled¶
Location: vllm-rbln/
Run inference:
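A hedged sketch of the command (how the Qwen3-0.6B model is selected, and any script arguments, may differ; check the example script):

```bash
# From the repository root (vllm-rbln/). The model used by the example
# (e.g. Qwen3-0.6B) may be configured inside the script or via its own arguments.
RBLN_USE_CUSTOM_KERNEL=1 \
VLLM_RBLN_COMPILE_MODEL=1 \
VLLM_RBLN_COMPILE_STRICT_MODE=1 \
VLLM_RBLN_USE_VLLM_MODEL=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_USE_V1=1 \
python examples/experimental/offline_inference_basic.py
```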
Environment variables used by the command:
| Environment variable | Description |
|---|---|
| `RBLN_USE_CUSTOM_KERNEL=1` | Enable the custom-kernel execution path (`torch.ops.rbln_triton_ops.*`). |
| `VLLM_RBLN_COMPILE_MODEL=1` | Compile the model for device execution (required on first run). |
| `VLLM_RBLN_COMPILE_STRICT_MODE=1` | Surface unsupported ops/kernels early (recommended for development). |
| `VLLM_RBLN_USE_VLLM_MODEL=1` | Use the vLLM model integration path expected by vLLM RBLN. |
| `VLLM_DISABLE_COMPILE_CACHE=1` | Bypass the compile cache to force a clean rebuild. |
| `VLLM_USE_V1=1` | Use the vLLM V1 execution path. |
Tip
If you modified kernel code and want to ensure it is rebuilt, keep VLLM_DISABLE_COMPILE_CACHE=1 for that run.
Troubleshooting¶
Troubleshooting checklist¶
If you hit issues during install/compile/run, start here:
- Operator lookup: confirm the initialization import runs and the operator name matches the registration (a quick check is sketched below this list).
- Rebuild after changes: keep `VLLM_DISABLE_COMPILE_CACHE=1` to trigger a full rebuild.
- Strict-mode signals: leave `VLLM_RBLN_COMPILE_STRICT_MODE=1` on during development to surface issues early.
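A quick, hedged sanity check for the operator-lookup item (assumes importing `vllm_rbln` triggers `register_ops()` as described in Step 2):

```python
# Verify the custom operators are visible after package initialization.
import torch
import vllm_rbln  # noqa: F401  -- the package __init__ runs register_ops()

for name in ("flash_attention_naive_prefill", "flash_attention_naive_decode"):
    print(name, hasattr(torch.ops.rbln_triton_ops, name))
```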
For more common issues and fixes, see Troubleshooting.