Skip to content

RBLN NPU Operator

The rbln-npu-operator deploys and manages the Rebellions software components needed to use the RBLN NPU family across Kubernetes and OpenShift clusters. A single RBLNClusterPolicy custom resource controls the workflow for the whole cluster. When you create a RBLNClusterPolicy custom resource, the operator automatically:

  • Detect hardware and labels nodes using Node Feature Discovery (NFD)
  • Expose devices for two workload modes: container and VM passthrough
  • Install and maintains the NPU driver via the Driver Manager (driven by the RBLNDriver custom resource)
  • Deploy and manages the components required to use NPUs in Kubernetes environments
  • Configure RBAC/SCC automatically so it works on both OpenShift and standard Kubernetes

For installation, upgrade, removal, and related operator guides, see:

Core Components

The table below summarizes the key components that the operator deploys and manages.

Component Description
Device Plugin Exposes NPU resources such as rebellions.ai/npu so Pods can request the device through resources.limits.
NPU DRA Driver Allocates NPUs through Kubernetes Dynamic Resource Allocation (DeviceClass / ResourceSlice / ResourceClaim). Cannot be enabled together with the Device Plugin.
Sandbox Device Plugin Exposes VFIO resources for each card, such as rebellions.ai/ATOM_PT, for VM passthrough; designed for KubeVirt environments.
VFIO Manager Binds and unbinds RBLN PCI functions to the host's vfio-pci driver for the Sandbox Device Plugin.
Driver Manager Reconciles the kernel driver, UMD libraries, and rbln-smi on each node according to the RBLNDriver custom resource.
RBLN Daemon Runs on each node as a host service used by NPU workloads (default hostPort 50051). Disabled by default; enable with rblnDaemon.enabled=true.
NPU Feature Discovery Detects RBLN PCI devices (vendor 1eff) and applies attributes such as rebellions.ai/npu.present, rebellions.ai/npu.count, and rebellions.ai/npu.product as node labels.
Container Toolkit Generates CDI specifications and configures the container runtime so the Device Plugin and workloads can use CDI-based device injection.
Metrics Exporter Exports NPU telemetry (utilization, temperature, DRAM, power) in Prometheus format.
Operator Validator Verifies driver and Container Toolkit readiness on each node and writes a readiness state file via hostPath that other components depend on.
Operator Manager Reconciles the RBLNClusterPolicy and RBLNDriver CRDs and orchestrates the components above.