
PyTorch RBLN: A PyTorch extension for RBLN NPUs

PyTorch RBLN beta status

PyTorch RBLN is currently in beta and under active development. APIs may change or be removed between releases, and backward compatibility is not guaranteed. Operator coverage is currently limited. We do not recommend using this integration in production workloads yet; we encourage you to try it and share feedback to help us stabilize it for general availability.

Overview

PyTorch RBLN (torch-rbln) is a PyTorch extension that enables RBLN NPUs to be used directly from PyTorch. Because it supports eager (define-by-run) execution, it preserves the familiar PyTorch workflow across model development, deployment, and serving, and makes step-by-step debugging practical.

With the same device-oriented programming model used for CPUs and GPUs, PyTorch RBLN is designed to provide a seamless way to use RBLN NPUs from standard PyTorch code. This page explains how execution reaches the device, what software layers make that possible, and where to go next for installation, supported operators, and troubleshooting.
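
A minimal sketch of that device-oriented model follows. The `"rbln"` device string and the `torch_rbln` import are assumptions based on this page (mirroring how `"cuda"` works for GPUs); the sketch falls back to CPU so it also runs on machines without an RBLN NPU.

```python
import torch

# Assumption: installing torch-rbln and importing it registers the
# PrivateUse1 backend under the "rbln" device name, as this page describes.
# Fall back to CPU so the sketch runs without the NPU stack installed.
try:
    import torch_rbln  # noqa: F401  (registers the out-of-tree backend)
    device = "rbln"
except ImportError:
    device = "cpu"

x = torch.ones(4, device=device)
y = torch.arange(4.0, device=device)
z = (x + y).cpu()  # eager (define-by-run) execution on the selected device
print(z.tolist())  # → [1.0, 2.0, 3.0, 4.0]
```

The point of the sketch is that no RBLN-specific call sites appear: moving work onto the NPU is a matter of tensor device placement, exactly as with CPUs and GPUs.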

Execution flow

When a PyTorch operation runs, the dispatcher selects the backend implementation based on tensor device metadata. For tensors on the rbln device, the dispatcher routes execution through PrivateUse1 and then proceeds through torch.compile to the RBLN Compiler.

Figure: RBLN runtime path in PyTorch. The main flow goes from ATen op to the Dispatcher, which routes RBLN tensors through PrivateUse1, then to torch.compile, the RBLN Compiler, and the RBLN Driver; an inset shows the dispatcher matrix with the PrivateUse1 column highlighted.

  1. User-facing operations such as torch.add resolve to ATen operators such as aten::add.
  2. The PyTorch dispatcher selects the implementation for the target backend and routes rbln tensors through PrivateUse1.
  3. PrivateUse1 is the dispatch key reserved for Out-of-Tree (OOT) extensions and used by third-party backends such as RBLN.
  4. torch.compile serves as the entry point to the RBLN path.
  5. The RBLN Compiler handles compilation and execution for the RBLN path.
  6. The RBLN Driver provides the low-level device interface.
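
Steps 1–3 can be pictured with a toy model in pure Python. This is illustrative only, not PyTorch internals: each operator keeps a table of implementations keyed by dispatch key, the tensor's device selects the key, and PrivateUse1 is the reserved slot that an out-of-tree backend such as RBLN fills.

```python
# Toy model of dispatch-key routing; names and structure are illustrative,
# not the real PyTorch dispatcher.
ATEN_ADD = {}  # dispatch key -> implementation

def register(key):
    def deco(fn):
        ATEN_ADD[key] = fn
        return fn
    return deco

@register("CPU")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("PrivateUse1")  # the slot an out-of-tree backend fills
def add_rbln(a, b):
    # A real backend would hand the op to its compiler/runtime here.
    return [x + y for x, y in zip(a, b)]

def dispatch_key_for(device):
    # Tensors on the "rbln" device map onto the reserved PrivateUse1 key.
    return "PrivateUse1" if device == "rbln" else "CPU"

def torch_add(a, b, device="cpu"):
    return ATEN_ADD[dispatch_key_for(device)](a, b)

print(torch_add([1, 2], [3, 4], device="rbln"))  # → [4, 6]
```

In the real system, the PrivateUse1 implementation does not compute the result itself; it hands the operation to the RBLN path (torch.compile, then the RBLN Compiler and Driver) as described in steps 4–6.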

Software stack

The runtime flow above shows execution order. The stack below shows the same system as software layers: PyTorch, torch-rbln, the RBLN Compiler, and the RBLN Driver. The compiler layer includes runtime components for compilation and execution, while the driver provides the low-level hardware interface to RBLN NPUs.

Figure: PyTorch RBLN software stack. Four numbered layers (1–4), from PyTorch down to the RBLN Driver; the compiler layer includes runtime components for compilation and execution, and the driver is the low-level interface to RBLN NPUs.

The diagram uses short labels only. Each box corresponds to one row in the table below, in the same top-to-bottom order within each layer. Use the Library / file column for concrete artifacts and the Role column for fuller descriptions.

| Layer | Library / file | Role |
| --- | --- | --- |
| PyTorch (torch) | libtorch_python.so | Interface between Python and the PyTorch C++ backend |
| | libtorch.so | Main PyTorch library; provides the PyTorch C++ APIs |
| | libtorch_cpu.so | LibTorch CPU backend |
| | libc10.so | Low-level PyTorch utilities and core structures (tensor management, device abstraction, memory allocation, etc.) |
| PyTorch RBLN (torch-rbln) | libtorch_rbln.so | ATen operator C++ library for RBLN runtime calls such as copy and resize |
| | register_ops.py | Python operator implementations for RBLN NPUs (via torch.compile) |
| | libc10_rbln.so | RBLN C10 extension: extends libc10 for RBLN device types and related hooks |
| RBLN Compiler (rebel-compiler) | librbln.so | Compiler library with runtime components for RBLN NPUs |
| RBLN Driver | librbln-thunk.so | Low-level device / hardware interface for RBLN NPUs |
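
Since torch.compile is the entry point to the RBLN path (register_ops.py in the table above), the user-facing call looks like the following minimal sketch. On an RBLN system, torch-rbln supplies the operator implementations behind torch.compile; here the stock `backend="eager"` debug backend is passed only so the sketch runs without the RBLN stack, since this page does not document a torch-rbln backend name.

```python
import torch

def fn(x):
    return torch.relu(x) + 1

# `backend="eager"` is a built-in debug backend; it stands in here for the
# RBLN path, which torch-rbln wires up behind torch.compile on real hardware.
compiled = torch.compile(fn, backend="eager")
out = compiled(torch.tensor([-1.0, 2.0]))
print(out.tolist())  # → [1.0, 3.0]
```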

Next steps

  • Installation — Install from pre-built wheels or build from source.
  • Tutorials — Learn the basic workflow and try example models.
  • Supported ops — Check the current operator coverage.
  • Troubleshoot — Resolve common setup and runtime issues.