
PyTorch RBLN: A PyTorch extension for RBLN NPUs

PyTorch RBLN beta status

PyTorch RBLN is currently in beta and under active development. APIs may change or be removed between releases, and backward compatibility is not guaranteed. Operator coverage is currently limited. We do not recommend using this integration in production workloads yet; we encourage you to try it and share feedback to help us stabilize it for general availability.

Overview

PyTorch RBLN (torch-rbln) is a PyTorch extension that enables RBLN NPUs to be used directly from PyTorch. Because it supports eager (define-by-run) execution, it preserves the familiar PyTorch workflow across model development, deployment, and serving, and makes step-by-step debugging practical.

With the same device-oriented programming model used for CPUs and GPUs, PyTorch RBLN is designed to provide a seamless way to use RBLN NPUs from standard PyTorch code. This page explains how execution reaches the device, what software layers make that possible, and where to go next for installation, supported operators, and troubleshooting.
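
A minimal sketch of that device-oriented model follows. The `"rbln"` device string and the `torch_rbln` import are assumptions based on this page (mirroring how `"cuda"` works for GPUs); the sketch falls back to CPU so it also runs on machines without an RBLN NPU.

```python
import torch

# Assumption: installing torch-rbln and importing it registers the
# PrivateUse1 backend under the "rbln" device name, as this page describes.
# Fall back to CPU so the sketch runs without the NPU stack installed.
try:
    import torch_rbln  # noqa: F401  (registers the out-of-tree backend)
    device = "rbln"
except ImportError:
    device = "cpu"

x = torch.ones(4, device=device)
y = torch.arange(4.0, device=device)
z = (x + y).cpu()  # eager (define-by-run) execution on the selected device
print(z.tolist())  # → [1.0, 2.0, 3.0, 4.0]
```

The point of the sketch is that no RBLN-specific call sites appear: moving work onto the NPU is a matter of tensor device placement, exactly as with CPUs and GPUs.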

Execution flow

When a PyTorch operation runs, the dispatcher selects the backend implementation based on tensor device metadata. For tensors on the rbln device, the dispatcher routes execution through PrivateUse1 and then proceeds through torch.compile to the RBLN Compiler.

Figure: RBLN runtime path in PyTorch. The main flow goes from ATen op to the Dispatcher, which routes RBLN tensors through PrivateUse1, then to torch.compile, the RBLN Compiler, and the RBLN Driver; an inset shows the dispatcher matrix with the PrivateUse1 column highlighted.

  1. User-facing operations such as torch.add resolve to ATen operators such as aten::add.
  2. The PyTorch dispatcher selects the implementation for the target backend and routes rbln tensors through PrivateUse1.
  3. PrivateUse1 is the dispatch key reserved for Out-of-Tree (OOT) extensions and used by third-party backends such as RBLN.
  4. torch.compile serves as the entry point to the RBLN path.
  5. The RBLN Compiler handles compilation and execution for the RBLN path.
  6. The RBLN Driver provides the low-level device interface.
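
Steps 1–3 can be pictured with a toy model in pure Python. This is illustrative only, not PyTorch internals: each operator keeps a table of implementations keyed by dispatch key, the tensor's device selects the key, and PrivateUse1 is the reserved slot that an out-of-tree backend such as RBLN fills.

```python
# Toy model of dispatch-key routing; names and structure are illustrative,
# not the real PyTorch dispatcher.
ATEN_ADD = {}  # dispatch key -> implementation

def register(key):
    def deco(fn):
        ATEN_ADD[key] = fn
        return fn
    return deco

@register("CPU")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("PrivateUse1")  # the slot an out-of-tree backend fills
def add_rbln(a, b):
    # A real backend would hand the op to its compiler/runtime here.
    return [x + y for x, y in zip(a, b)]

def dispatch_key_for(device):
    # Tensors on the "rbln" device map onto the reserved PrivateUse1 key.
    return "PrivateUse1" if device == "rbln" else "CPU"

def torch_add(a, b, device="cpu"):
    return ATEN_ADD[dispatch_key_for(device)](a, b)

print(torch_add([1, 2], [3, 4], device="rbln"))  # → [4, 6]
```

In the real system, the PrivateUse1 implementation does not compute the result itself; it hands the operation to the RBLN path (torch.compile, then the RBLN Compiler and Driver) as described in steps 4–6.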

Software stack

The runtime flow above shows execution order. The stack below shows the same system as software layers: PyTorch, torch-rbln, the RBLN Compiler, and the RBLN Driver. The compiler layer includes runtime components for compilation and execution, while the driver provides the low-level hardware interface to RBLN NPUs.

Figure: PyTorch RBLN software stack. Four numbered layers (1–4), from PyTorch down to the RBLN Driver; the compiler layer includes runtime components for compilation and execution, and the driver is the low-level interface to RBLN NPUs.

The diagram uses short labels only. Each box corresponds to one row in the table below, in the same top-to-bottom order within each layer. Use the Library / file column for concrete artifacts and the Role column for fuller descriptions.

| Layer | Library / file | Role |
| --- | --- | --- |
| PyTorch (torch) | libtorch_python.so | Interface between Python and the PyTorch C++ backend |
| | libtorch.so | Main PyTorch library; provides the PyTorch C++ APIs |
| | libtorch_cpu.so | LibTorch CPU backend |
| | libc10.so | Low-level PyTorch utilities and core structures (tensor management, device abstraction, memory allocation, etc.) |
| PyTorch RBLN (torch-rbln) | libtorch_rbln.so | ATen operator C++ library for RBLN runtime calls such as copy and resize |
| | register_ops.py | Python operator implementations for RBLN NPUs (via torch.compile) |
| | libc10_rbln.so | RBLN C10 extension: extends libc10 for RBLN device types and related hooks |
| RBLN Compiler (rebel-compiler) | librbln.so | Compiler library with runtime components for RBLN NPUs |
| RBLN Driver | librbln-thunk.so | Low-level device / hardware interface for RBLN NPUs |
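
Since torch.compile is the entry point to the RBLN path (register_ops.py in the table above), the user-facing call looks like the following minimal sketch. On an RBLN system, torch-rbln supplies the operator implementations behind torch.compile; here the stock `backend="eager"` debug backend is passed only so the sketch runs without the RBLN stack, since this page does not document a torch-rbln backend name.

```python
import torch

def fn(x):
    return torch.relu(x) + 1

# `backend="eager"` is a built-in debug backend; it stands in here for the
# RBLN path, which torch-rbln wires up behind torch.compile on real hardware.
compiled = torch.compile(fn, backend="eager")
out = compiled(torch.tensor([-1.0, 2.0]))
print(out.tolist())  # → [1.0, 3.0]
```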

Next steps

  • Installation — Install from pre-built wheels or build from source.
  • Tutorials — Learn the basic workflow and try example models.
  • Supported ops — Check the current operator coverage.
  • Troubleshoot — Resolve common setup and runtime issues.