Qwen3 (Custom Kernel)¶
Overview¶
This tutorial shows how to enable Triton custom kernels in vLLM RBLN and run a Qwen3-0.6B example.
All paths are written relative to the vllm-rbln repository root.
You will wire a model call site to use torch.ops.rbln_triton_ops.* and validate the execution path end-to-end.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.10–3.13
    - RBLN Driver
- Package Requirements:
    - RBLN Compiler
    - vllm-rbln – install in Step 1
- Installation Command: a hedged sketch follows the note below.
Note
rebel-compiler requires an RBLN Portal account.
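A hedged sketch of the installation command (the index URL below is a placeholder; use the one provided with your RBLN Portal account):

```bash
# Install the RBLN compiler package.
# <RBLN_PORTAL_INDEX_URL> is a placeholder -- substitute the index from your RBLN Portal account.
pip3 install -i <RBLN_PORTAL_INDEX_URL> rebel-compiler
```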
Execution¶
Step labels
- Action: required steps to run the example.
- Reference: background and integration notes; follow as needed.
1) Action: start from the repository root¶
Clone vllm-rbln and move into the repository root.
All steps below assume you are working from the repository root: vllm-rbln/.
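A hedged sketch of this step (the repository URL and the editable install are assumptions; adjust to your setup):

```bash
# Clone the repository and work from its root (URL is an assumption).
git clone https://github.com/rebellions-sw/vllm-rbln.git
cd vllm-rbln

# Install vllm-rbln from the checked-out source (editable install is an assumption).
pip3 install -e .
```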
2) Reference: identify the integration touch points¶
Use this map to find the exact files you will edit for registration, invocation, and execution.
| Path | Why you open it | What you do |
|---|---|---|
| `vllm_rbln/__init__.py` | Initialization import hook (`register_ops()`). | Operators are registered because `register_ops()` imports `vllm_rbln.triton_kernels.*` during initialization. If you add a new operator module, ensure it is imported there (see the sketch after this table). |
| `vllm_rbln/triton_kernels/` | Operator definitions, schemas, and registration patterns. | Reference: use this to confirm operator interfaces (Step 3). You can proceed without editing these files for this guide. |
| `vllm_rbln/v1/attention/backends/flash_attention.py` | Model call site (operator invocation wiring). | Reference: confirm it selects `torch.ops.rbln_triton_ops.*` when `RBLN_USE_CUSTOM_KERNEL=1` (Step 4). |
| `examples/experimental/offline_inference_basic.py` | Execution entrypoint. | Run compilation and inference with env vars enabled (Step 5). |
3) Reference: operator interface and registration¶
Use this step as a reference to understand the operator contracts used by vLLM RBLN.
Terminology (and how it maps to code)
| Term | Meaning in this guide | Where it shows up |
|---|---|---|
| Custom kernel | A Triton kernel implementation exposed to model code via a PyTorch operator under `torch.ops.rbln_triton_ops.*`. | `@triton.jit` kernel functions under `vllm_rbln/triton_kernels/` |
| Registration | Making the PyTorch operator available under `torch.ops.rbln_triton_ops.<op_name>`. | `@triton_op("rbln_triton_ops::<op_name>", ...)` and `@register_fake("rbln_triton_ops::<op_name>")` |
| Invocation | Calling the operator from model code at runtime. | `torch.ops.rbln_triton_ops.<op_name>(...)` call sites |
| Interface item | Why it matters |
|---|---|
| Operator name (`<op_name>`) | Model code resolves the operator by name. Renaming breaks lookup. |
| Operator schema (signature) | The model passes tensors in a fixed argument order. Mismatches fail at runtime/compile time. |
| Tensor dtypes / shapes / layouts | Kernels often assume specific dtypes/layouts. Mismatches can fail compilation or produce incorrect results. |
Notes
- Schema compatibility: some operators include placeholder tensor arguments (e.g., `dummy0`) to keep the schema stable.
Example: @triton.jit kernel implementation
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
import torch
from rebel import triton
from rebel.triton import language as tl


@triton.jit
def flash_attention_naive_prefill(
    query,
    key,
    value,
    kv_cache,
    mask,
    output,
    qk_scale,
    seq_idx,
    block_table,
    block_size,
    H: tl.constexpr,
    G: tl.constexpr,
    D: tl.constexpr,
    L: tl.constexpr,
    NB: tl.constexpr,
    P: tl.constexpr,
    C: tl.constexpr,
    B: tl.constexpr,
    DIM_BLOCK_TABLE: tl.constexpr,
):
    NP: tl.constexpr = C // P
    for batch_id in tl.static_range(0, NB, 1):
        Q_block_ptr = tl.make_block_ptr(
            base=query,
            shape=(NB, H, G, L, D),
            strides=(H * G * L * D, G * L * D, L * D, D, 1),
            offsets=(batch_id, 0, 0, 0, 0),
            block_shape=(1, H, G, L, D),
            order=(4, 3, 2, 1, 0),
        )
        # ... more block pointers + compute ...
```
Example: @triton_op operator registration
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
# Note: `triton_op` is assumed here to be `torch.library.triton_op`; the
# repository may import it from a different module.
import torch
from torch.library import triton_op


@triton_op("rbln_triton_ops::flash_attention_naive_prefill", mutates_args=())
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    ...


@triton_op("rbln_triton_ops::flash_attention_naive_decode", mutates_args=())
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    ...
```
Example: @register_fake operator stub
```python
# Excerpt: `vllm_rbln/triton_kernels/attention.py`
# Note: `register_fake` is assumed here to be `torch.library.register_fake`; the
# repository may import it from a different module.
import torch
from torch.library import register_fake


@register_fake("rbln_triton_ops::flash_attention_naive_prefill")
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    return torch.empty_like(query)


@register_fake("rbln_triton_ops::flash_attention_naive_decode")
def _(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    kv_cache: torch.Tensor,
    mask: torch.Tensor,
    qk_scale: torch.Tensor,
    seq_idx: torch.Tensor,
    block_table: torch.Tensor,
    dummy0: torch.Tensor,
) -> torch.Tensor:
    return torch.empty_like(query)
```
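These fake registrations give the tracer/compiler a shape-and-dtype-only implementation (here, an empty tensor shaped like `query`) so graphs can be built without executing the Triton kernels.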
4) Reference: confirm the model call site selects torch.ops.rbln_triton_ops¶
Confirm the model call site selects torch.ops.rbln_triton_ops.* so the custom kernel path is exercised at runtime.
Target: vllm_rbln/v1/attention/backends/flash_attention.py
RBLN_USE_CUSTOM_KERNEL=1 enables custom kernels used for attention:
Excerpt: kernel-mode selection for flash_attention_naive_*
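The repository excerpt is not reproduced here; the following is a minimal sketch of that selection logic, assuming a hypothetical helper name (`prefill_attention`) and a placeholder fallback (the actual code in `flash_attention.py` may differ):

```python
# Hedged sketch (not the repository code): when RBLN_USE_CUSTOM_KERNEL=1,
# route prefill attention through the registered custom operator.
import os

import torch

USE_CUSTOM_KERNEL = os.environ.get("RBLN_USE_CUSTOM_KERNEL", "0") == "1"


def prefill_attention(query, key, value, kv_cache, mask,
                      qk_scale, seq_idx, block_table, dummy0):
    if USE_CUSTOM_KERNEL:
        # Custom-kernel path: operator registered by vllm_rbln.triton_kernels
        return torch.ops.rbln_triton_ops.flash_attention_naive_prefill(
            query, key, value, kv_cache, mask,
            qk_scale, seq_idx, block_table, dummy0,
        )
    # Default (non-custom-kernel) attention path, omitted in this sketch.
    raise NotImplementedError("default attention path not shown here")
```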
5) Action: run inference with custom kernels enabled¶
Location: vllm-rbln/
Run inference:
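A hedged sketch of the command (how the Qwen3-0.6B model is selected, and any script arguments, may differ; check the example script):

```bash
# From the repository root (vllm-rbln/). The model used by the example
# (e.g. Qwen3-0.6B) may be configured inside the script or via its own arguments.
RBLN_USE_CUSTOM_KERNEL=1 \
VLLM_RBLN_COMPILE_MODEL=1 \
VLLM_RBLN_COMPILE_STRICT_MODE=1 \
VLLM_RBLN_USE_VLLM_MODEL=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_USE_V1=1 \
python examples/experimental/offline_inference_basic.py
```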
Environment variables used by the command:
| Environment variable | Description |
|---|---|
| `RBLN_USE_CUSTOM_KERNEL=1` | Enable the custom-kernel execution path (`torch.ops.rbln_triton_ops.*`). |
| `VLLM_RBLN_COMPILE_MODEL=1` | Compile the model for device execution (required on first run). |
| `VLLM_RBLN_COMPILE_STRICT_MODE=1` | Surface unsupported ops/kernels early (recommended for development). |
| `VLLM_RBLN_USE_VLLM_MODEL=1` | Use the vLLM model integration path expected by vLLM RBLN. |
| `VLLM_DISABLE_COMPILE_CACHE=1` | Bypass the compile cache to force a clean rebuild. |
| `VLLM_USE_V1=1` | Use the vLLM V1 execution path. |
Tip
If you modified kernel code and want to ensure it is rebuilt, keep VLLM_DISABLE_COMPILE_CACHE=1 for that run.
Troubleshooting¶
Troubleshooting checklist¶
If you hit issues during install/compile/run, start here:
- Operator lookup: confirm the initialization import runs and the operator name matches the registration (a quick check is sketched below this list).
- Rebuild after changes: keep `VLLM_DISABLE_COMPILE_CACHE=1` to trigger a full rebuild.
- Strict-mode signals: leave `VLLM_RBLN_COMPILE_STRICT_MODE=1` on during development to surface issues early.
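A quick, hedged sanity check for the operator-lookup item (assumes importing `vllm_rbln` triggers `register_ops()` as described in Step 2):

```python
# Verify the custom operators are visible after package initialization.
import torch
import vllm_rbln  # noqa: F401  -- the package __init__ runs register_ops()

for name in ("flash_attention_naive_prefill", "flash_attention_naive_decode"):
    print(name, hasattr(torch.ops.rbln_triton_ops, name))
```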
For more common issues and fixes, see Troubleshooting.