
Qwen3 (Custom Kernel)

Overview

This tutorial shows how to enable Triton custom kernels in vLLM RBLN and run a Qwen3-0.6B example. All paths are relative to the vllm-rbln repository root. You will confirm that the model call site uses torch.ops.rbln_triton_ops.* and validate the custom-kernel execution path end-to-end.

Setup & Installation

Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:

  • System Requirements
  • Package Requirements
  • Installation Command:
    pip install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler==0.10.0"
    

Note

rebel-compiler requires an RBLN Portal account.
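
To confirm the package is available before proceeding, you can query its installed version from Python. This is only a sanity check and assumes the distribution was installed under the name rebel-compiler, as in the command above:

    # Sanity check: print the installed rebel-compiler version.
    from importlib.metadata import version

    print(version("rebel-compiler"))  # expected: 0.10.0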

Execution

Step labels

  • Action: required steps to run the example.
  • Reference: background and integration notes; follow as needed.

1) Action: start from the repository root

Clone vllm-rbln and move into the repository root.

All steps below assume you are working from the repository root: vllm-rbln/.

git clone https://github.com/RBLN-SW/vllm-rbln.git
cd vllm-rbln
pip install -e .
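
A quick way to verify the editable install is to import the package and print where it resolves from. This assumes the package imports as vllm_rbln and should point into your checkout:

    # Sanity check: the editable install should resolve to the cloned tree.
    import vllm_rbln

    print(vllm_rbln.__file__)  # expected to point inside vllm-rbln/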

2) Reference: identify the integration touch points

Use this map to find the files involved in registration, invocation, and execution, and what to do in each.

Path | Why you open it | What you do
vllm_rbln/__init__.py | Initialization import hook (register_ops()). | Operators are registered because register_ops() imports vllm_rbln.triton_kernels.* during initialization. If you add a new operator module, ensure it is imported there (see the sketch after this table).
vllm_rbln/triton_kernels/ | Operator definitions, schemas, and registration patterns. | Reference: use this to confirm operator interfaces (Step 3). You can proceed without editing these files for this guide.
vllm_rbln/v1/attention/backends/flash_attention.py | Model call site (operator invocation wiring). | Reference: confirm it selects torch.ops.rbln_triton_ops.* when RBLN_USE_CUSTOM_KERNEL=1 (Step 4).
examples/experimental/offline_inference_basic.py | Execution entrypoint. | Run compilation and inference with env vars enabled (Step 5).
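
For orientation, the registration hook described in the first row can be pictured as below. This is a hypothetical sketch, not the actual contents of vllm_rbln/__init__.py; it only assumes that register_ops() imports the operator modules for their registration side effects:

    # Hypothetical sketch of the initialization hook (the actual file may differ).
    def register_ops() -> None:
        # Importing the module runs its @triton_op / @register_fake decorators,
        # which registers the operators under torch.ops.rbln_triton_ops.*.
        from vllm_rbln.triton_kernels import attention  # noqa: F401
        # A new operator module must be imported here as well, e.g.:
        # from vllm_rbln.triton_kernels import my_new_kernels  # noqa: F401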

3) Reference: operator interface and registration

Use this step as a reference to understand the operator contracts used by vLLM RBLN.

Terminology (and how it maps to code)

Term | Meaning in this guide | Where it shows up
Custom kernel | A Triton kernel implementation exposed to model code via a PyTorch operator under torch.ops.rbln_triton_ops.*. | @triton.jit kernel functions under vllm_rbln/triton_kernels/
Registration | Making the PyTorch operator available under torch.ops.rbln_triton_ops.<op_name>. | @triton_op("rbln_triton_ops::<op_name>", ...) and @register_fake("rbln_triton_ops::<op_name>")
Invocation | Calling the operator from model code at runtime. | torch.ops.rbln_triton_ops.<op_name>(...) call sites

Interface item | Why it matters
Operator name (<op_name>) | Model code resolves the operator by name. Renaming breaks lookup.
Operator schema (signature) | The model passes tensors in a fixed argument order. Mismatches fail at compile time or runtime.
Tensor dtypes / shapes / layouts | Kernels often assume specific dtypes and layouts. Mismatches can fail compilation or produce incorrect results.
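
For intuition, the same register/fake-stub/invoke contract can be reproduced with stock PyTorch's torch.library (PyTorch 2.4+). The namespace and operator below are purely illustrative; the real operators use the rebel @triton_op and @register_fake decorators shown in the excerpts that follow:

    # Illustrative only: a hypothetical "demo_ops::scale" operator.
    import torch

    @torch.library.custom_op("demo_ops::scale", mutates_args=())
    def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
        return x * factor

    @torch.library.register_fake("demo_ops::scale")
    def _(x: torch.Tensor, factor: float) -> torch.Tensor:
        # Fake stub: only shapes/dtypes matter, no real computation.
        return torch.empty_like(x)

    # Invocation: model code resolves the operator by name under torch.ops.
    y = torch.ops.demo_ops.scale(torch.ones(4), 2.0)
    print(y)  # tensor([2., 2., 2., 2.])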

Notes

  • Schema compatibility: some operators include placeholder tensor arguments (e.g., dummy0) to keep the schema stable.
Example: @triton.jit kernel implementation
    # Excerpt: `vllm_rbln/triton_kernels/attention.py`
    import torch
    from rebel import triton
    from rebel.triton import language as tl


    @triton.jit
    def flash_attention_naive_prefill(
        query,
        key,
        value,
        kv_cache,
        mask,
        output,
        qk_scale,
        seq_idx,
        block_table,
        block_size,
        H: tl.constexpr,
        G: tl.constexpr,
        D: tl.constexpr,
        L: tl.constexpr,
        NB: tl.constexpr,
        P: tl.constexpr,
        C: tl.constexpr,
        B: tl.constexpr,
        DIM_BLOCK_TABLE: tl.constexpr,
    ):
        NP: tl.constexpr = C // P
        for batch_id in tl.static_range(0, NB, 1):
            Q_block_ptr = tl.make_block_ptr(
                base=query,
                shape=(NB, H, G, L, D),
                strides=(H * G * L * D, G * L * D, L * D, D, 1),
                offsets=(batch_id, 0, 0, 0, 0),
                block_shape=(1, H, G, L, D),
                order=(4, 3, 2, 1, 0),
            )
            # ... more block pointers + compute ...
Example: @triton_op operator registration
    # Excerpt: `vllm_rbln/triton_kernels/attention.py`
    @triton_op("rbln_triton_ops::flash_attention_naive_prefill", mutates_args=())
    def _(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        mask: torch.Tensor,
        qk_scale: torch.Tensor,
        seq_idx: torch.Tensor,
        block_table: torch.Tensor,
        dummy0: torch.Tensor,
    ) -> torch.Tensor:
        ...

    @triton_op("rbln_triton_ops::flash_attention_naive_decode", mutates_args=())
    def _(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        mask: torch.Tensor,
        qk_scale: torch.Tensor,
        seq_idx: torch.Tensor,
        block_table: torch.Tensor,
        dummy0: torch.Tensor,
    ) -> torch.Tensor:
        ...
Example: @register_fake operator stub
    # Excerpt: `vllm_rbln/triton_kernels/attention.py`
    @register_fake("rbln_triton_ops::flash_attention_naive_prefill")
    def _(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        mask: torch.Tensor,
        qk_scale: torch.Tensor,
        seq_idx: torch.Tensor,
        block_table: torch.Tensor,
        dummy0: torch.Tensor,
    ) -> torch.Tensor:
        return torch.empty_like(query)

    @register_fake("rbln_triton_ops::flash_attention_naive_decode")
    def _(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        mask: torch.Tensor,
        qk_scale: torch.Tensor,
        seq_idx: torch.Tensor,
        block_table: torch.Tensor,
        dummy0: torch.Tensor,
    ) -> torch.Tensor:
        return torch.empty_like(query)
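
Once registration has run, the operators should resolve under the expected namespace. The check below assumes that importing vllm_rbln triggers the register_ops() hook described in Step 2; if registration happens later in your setup, run the same check from inside the example script instead:

    # Quick lookup check for the registered operators.
    import torch
    import vllm_rbln  # noqa: F401  (expected to trigger operator registration)

    for name in ("flash_attention_naive_prefill", "flash_attention_naive_decode"):
        print(name, "registered:", hasattr(torch.ops.rbln_triton_ops, name))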

4) Reference: confirm the model call site selects torch.ops.rbln_triton_ops

Confirm the model call site selects torch.ops.rbln_triton_ops.* so the custom kernel path is exercised at runtime.

Target: vllm_rbln/v1/attention/backends/flash_attention.py

RBLN_USE_CUSTOM_KERNEL=1 enables the custom kernels used for attention:

Excerpt: kernel-mode selection for flash_attention_naive_*
    if envs.VLLM_RBLN_KERNEL_MODE == "torch_triton":
        flash_attention_naive_prefill = (
            torch.ops.rbln_triton_ops.flash_attention_naive_prefill
        )
        flash_attention_naive_decode = (
            torch.ops.rbln_triton_ops.flash_attention_naive_decode
        )
    # ...

5) Action: run inference with custom kernels enabled

Location: vllm-rbln/

Run inference:

env \
  RBLN_USE_CUSTOM_KERNEL=1 \
  VLLM_RBLN_COMPILE_MODEL=1 \
  VLLM_RBLN_COMPILE_STRICT_MODE=1 \
  VLLM_RBLN_USE_VLLM_MODEL=1 \
  VLLM_DISABLE_COMPILE_CACHE=1 \
  VLLM_USE_V1=1 \
  python examples/experimental/offline_inference_basic.py \
  --model Qwen/Qwen3-0.6B \
  --max-num-seqs 1 \
  --max-model-len 4096

Environment variables used by the command:

Environment variable | Description
RBLN_USE_CUSTOM_KERNEL=1 | Enable the custom-kernel execution path (torch.ops.rbln_triton_ops.*).
VLLM_RBLN_COMPILE_MODEL=1 | Compile the model for device execution (required on first run).
VLLM_RBLN_COMPILE_STRICT_MODE=1 | Surface unsupported ops/kernels early (recommended for development).
VLLM_RBLN_USE_VLLM_MODEL=1 | Use the vLLM model integration path expected by vLLM RBLN.
VLLM_DISABLE_COMPILE_CACHE=1 | Bypass the compile cache to force a clean rebuild.
VLLM_USE_V1=1 | Use the vLLM V1 execution path.
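
As an alternative to exporting the flags on the shell command line, the same run can be expressed in Python by setting the variables before importing vLLM. The sketch below uses the standard vllm LLM/SamplingParams API; the actual offline_inference_basic.py script may differ in its details:

    # Sketch: set the flags in-process, then run a minimal offline inference.
    import os

    os.environ.update({
        "RBLN_USE_CUSTOM_KERNEL": "1",
        "VLLM_RBLN_COMPILE_MODEL": "1",
        "VLLM_RBLN_COMPILE_STRICT_MODE": "1",
        "VLLM_RBLN_USE_VLLM_MODEL": "1",
        "VLLM_DISABLE_COMPILE_CACHE": "1",
        "VLLM_USE_V1": "1",
    })

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-0.6B", max_num_seqs=1, max_model_len=4096)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)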

Tip

If you modified kernel code and want to ensure it is rebuilt, keep VLLM_DISABLE_COMPILE_CACHE=1 for that run.

Troubleshooting

Troubleshooting checklist

If you hit issues during install/compile/run, start here:

  • Operator lookup: confirm the initialization import runs and the operator name matches the registration.
  • Rebuild after changes: keep VLLM_DISABLE_COMPILE_CACHE=1 to trigger a full rebuild.
  • Strict-mode signals: leave VLLM_RBLN_COMPILE_STRICT_MODE=1 on during development to surface issues early.
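
The snippet below bundles the operator-lookup check from the first item; it assumes importing vllm_rbln performs the registration and that the flags were set for the current process:

    # Diagnostic: confirm the flags are visible and the operators resolve.
    import os
    import torch
    import vllm_rbln  # noqa: F401

    for flag in ("RBLN_USE_CUSTOM_KERNEL",
                 "VLLM_RBLN_COMPILE_STRICT_MODE",
                 "VLLM_DISABLE_COMPILE_CACHE"):
        print(flag, "=", os.environ.get(flag))
    print(torch.ops.rbln_triton_ops.flash_attention_naive_prefill)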

For more common issues and fixes, see the dedicated Troubleshooting page.
