RBLN NPU Operator¶
The rbln-npu-operator automates the deployment and management of all Rebellions software components required for provisioning the RBLN NPU family across Kubernetes and OpenShift clusters. A singleton RBLNClusterPolicy custom resource drives the entire workflow. Once that policy is created, the operator automatically performs:
- Hardware detection and node labeling via Node Feature Discovery (NFD)
- Device exposure for two workload types: container passthrough and VM passthrough
- Driver installation and lifecycle management via the Driver Manager (driven by the RBLNDriver custom resource)
- Deployment and lifecycle management of adjacent components, including VFIO binding, device plugins, and metrics exporters
- Automatic RBAC/SCC configuration so it works on both OpenShift and vanilla Kubernetes
Core Components¶
The table below summarizes the key components that the operator deploys and manages.
| Component | Description |
|---|---|
| Device Plugin | Publishes resources such as rebellions.ai/ATOM to Pods using the rebellions/k8s-device-plugin image. Maps per-card resources in resourceList. |
| Sandbox Device Plugin | Exposes resources like rebellions.ai/ATOM_PT for VM passthrough. Works with the VFIO checker container and targets KubeVirt environments. |
| VFIO Manager | Runs the rebellions/rbln-vfio-manager image and supplies the vfio-manage.sh script through a ConfigMap to handle VFIO bind/unbind. |
| Driver Manager | Installs and maintains NPU drivers based on the RBLNDriver custom resource. |
| NPU Feature Discovery | Uses rebellions/rbln-npu-feature-discovery to detect devices with PCI vendor ID 1eff and apply labels such as rebellions.ai/npu.present. |
| Container Toolkit | Generates CDI specs and configures/restarts the container runtime so workloads can use CDI-based device injection. |
| Metrics Exporter | Exposes NPU metrics in Prometheus format through the rebellions/rbln-metrics-exporter image. |
| Operator Validator | Validates driver and container toolkit readiness and writes a hostPath ready file that other components can depend on. |
| Operator Manager | The controller loaded from cmd/main.go enforces the singleton RBLNClusterPolicy and watches for changes to Kubernetes Nodes and component DaemonSets. |
Installing the NPU Operator¶
The RBLN NPU Operator Helm chart automatically discovers RBLN NPUs in your cluster, deploys the required device plugins, and monitors the health of the related components. Follow these steps to get up and running quickly.
- Prerequisites
    - Kubernetes 1.19+ cluster with access to kubectl and helm
    - A dedicated namespace such as rbln-system is recommended for the operator
    - Worker nodes equipped with NPUs and Node Feature Discovery installed (set nfd.enabled=true in the chart to install it together)
- Install Helm (if needed)
- Add the Rebellions Helm repository
- Install the NPU Operator
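The repository and install steps above can be sketched as a small shell function. This is a minimal sketch, not the official procedure: the repository URL and chart name are not given in this guide, so they must be supplied through the hypothetical environment variables `RBLN_REPO_URL` and `RBLN_CHART`. The repository alias `rebellions` and the release name `rbln-npu-operator` are also assumptions.

```shell
# Sketch only: RBLN_REPO_URL and RBLN_CHART are hypothetical variables that you
# must set to the values Rebellions publishes; this guide does not state them.
install_rbln_operator() (
  set -e
  repo_url="${RBLN_REPO_URL:?set RBLN_REPO_URL to the Rebellions Helm repository URL}"
  chart="${RBLN_CHART:?set RBLN_CHART to the operator chart name}"
  repo_name="${RBLN_REPO_NAME:-rebellions}"        # assumed local alias
  release="${RBLN_RELEASE:-rbln-npu-operator}"     # assumed release name
  namespace="${RBLN_NAMESPACE:-rbln-system}"       # namespace recommended above

  # Register the chart repository and refresh the local index.
  helm repo add "$repo_name" "$repo_url"
  helm repo update

  # Install into a dedicated namespace; nfd.enabled=true installs NFD as well.
  helm install "$release" "$repo_name/$chart" \
    --namespace "$namespace" --create-namespace \
    --set nfd.enabled=true
)
```

After exporting the two required variables, run `install_rbln_operator`. Drop the `nfd.enabled=true` override if Node Feature Discovery already runs in the cluster.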
Verify Installation¶
After Helm reports a successful install, confirm that the RBLNClusterPolicy custom resource is present and reconciled:
If driver management is enabled, confirm that the RBLNDriver custom resource is created:
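A minimal sketch of both checks follows. It assumes the CRDs register the lowercase kind names (`rblnclusterpolicy`, `rblndriver`) as resource names, which is the usual convention but is not confirmed by this guide. `kubectl api-resources` lists the names actually registered.

```shell
# Assumption: resource names follow the lowercase-kind convention.
# Verify with: kubectl api-resources | grep -i rbln
show_rbln_custom_resources() (
  # -A is harmless for cluster-scoped resources and covers all namespaces otherwise.
  kubectl get rblnclusterpolicy -A
  # Only expected when driver management (driver.enabled) is turned on.
  kubectl get rblndriver -A
)
```

Call `show_rbln_custom_resources` once Helm has finished. An empty result for the second command is expected when driver management is disabled.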
Next, inspect the operator namespace to verify the health of the controller and operand pods:
- Controller component: controller-manager-* reconciles the RBLNClusterPolicy.
- Driver management: RBLNDriver custom resources declare the NPU driver version, and the Driver Manager installs it across the cluster.
- Validator: rbln-operator-validator-* verifies driver and container toolkit readiness and records the result as a hostPath ready file.
- Operands: rbln-device-plugin, rbln-npu-feature-discovery, rbln-container-toolkit, and rbln-metrics-exporter handle device exposure, labeling, CDI enablement, and telemetry.
If any pod is stuck or CrashLooping, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
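The pod inspection described above might look like the following sketch. The namespace defaults to rbln-system as recommended earlier. The function names are illustrative and not part of the operator.

```shell
# List operator pods, then narrow the view to pods that are not Running.
check_rbln_pods() (
  ns="${1:-rbln-system}"
  # Full listing with node placement; read the STATUS column for CrashLoopBackOff,
  # because such pods still report phase Running.
  kubectl get pods -n "$ns" -o wide
  # Pods that are Pending, Failed, Succeeded, or Unknown.
  kubectl get pods -n "$ns" --field-selector 'status.phase!=Running'
)

# Print recent log lines for one pod.
show_rbln_pod_logs() (
  pod="${1:?usage: show_rbln_pod_logs <pod> [namespace]}"
  ns="${2:-rbln-system}"
  kubectl logs "$pod" -n "$ns" --tail=100
)
```

For a container that has already restarted, adding `--previous` to the `kubectl logs` call shows the output of the prior instance.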
Driver Management¶
The operator defines the RBLNDriver CRD to manage NPU driver installation. Create an RBLNDriver custom resource with the desired driver version, and the Driver Manager installs and maintains that version across the cluster.
RBLNDriver Sample¶
What the Driver Manager Installs¶
When an RBLNDriver resource is applied, the Driver Manager installs:
- Kernel driver
- UMD libraries
- Tools such as rbln-smi
Driver Image Selection¶
The Driver Manager selects the driver container image by combining Node Feature Discovery labels on each node:
- feature.node.kubernetes.io/system-os_release.ID
- feature.node.kubernetes.io/system-os_release.VERSION_ID
- feature.node.kubernetes.io/kernel-version.full
For example, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic can resolve to:
docker.io/rebellions/rbln-driver:3.0.0-rc3-6.8.0-90-generic-ubuntu22.04.
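To make the naming pattern concrete, the sketch below rebuilds the example image reference from the three label values. The tag layout (`<version>-<kernel>-<ID><VERSION_ID>`) is inferred from that single example; the Driver Manager's actual logic may differ. Both function names are hypothetical.

```shell
# Compose an image reference following the pattern seen in the example above.
# Inferred layout: <repo>:<driver-version>-<kernel-version.full>-<ID><VERSION_ID>
resolve_driver_image() (
  repo="$1"         # e.g. docker.io/rebellions/rbln-driver
  version="$2"      # e.g. 3.0.0-rc3
  os_id="$3"        # value of feature.node.kubernetes.io/system-os_release.ID
  os_version="$4"   # value of feature.node.kubernetes.io/system-os_release.VERSION_ID
  kernel="$5"       # value of feature.node.kubernetes.io/kernel-version.full
  printf '%s:%s-%s-%s%s\n' "$repo" "$version" "$kernel" "$os_id" "$os_version"
)

# Show the three NFD labels as extra columns for every node.
show_driver_labels() (
  kubectl get nodes \
    -L feature.node.kubernetes.io/system-os_release.ID \
    -L feature.node.kubernetes.io/system-os_release.VERSION_ID \
    -L feature.node.kubernetes.io/kernel-version.full
)
```

For instance, `resolve_driver_image docker.io/rebellions/rbln-driver 3.0.0-rc3 ubuntu 22.04 6.8.0-90-generic` reproduces the reference shown in the example. Comparing that string with the labels from `show_driver_labels` can help explain an image pull failure on a node with an unexpected kernel.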
Operator Validator¶
rbln-validator verifies the readiness of the RBLN NPU stack, including the driver and container toolkit. It runs inside the operator-managed Validator DaemonSet and records the result as a hostPath ready file so other components can depend on it.
Container Toolkit¶
The NPU Operator deploys the Container Toolkit DaemonSet based on the containerToolkit settings in the RBLNClusterPolicy. It generates CDI specs and configures/restarts the container runtime so the Device Plugin and workloads can use CDI-based device injection.
Checking NPU Status¶
Inspect the NPU capacity that Kubernetes reports for each node to confirm that the device plugin published the expected resources:
If you expose additional resource names (for example, rebellions.ai/ATOM_PT for sandbox workloads), substitute that resource in the command and confirm the corresponding column is populated.
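One way to run this check is with kubectl's custom-columns output, sketched below. The function name and column headers are illustrative. The resource name defaults to rebellions.ai/ATOM and can be replaced with rebellions.ai/ATOM_PT.

```shell
# Print per-node capacity and allocatable counts for one extended resource.
show_npu_capacity() (
  resource="${1:-rebellions.ai/ATOM}"
  # Dots inside the resource name must be escaped in the column expression.
  escaped=$(printf '%s' "$resource" | sed 's/\./\\./g')
  kubectl get nodes \
    -o "custom-columns=NODE:.metadata.name,CAPACITY:.status.capacity.$escaped,ALLOCATABLE:.status.allocatable.$escaped"
)
```

Run `show_npu_capacity` for container workloads or `show_npu_capacity rebellions.ai/ATOM_PT` for sandbox workloads. Nodes that do not expose the resource show `<none>` in both columns.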
Creating an NPU-enabled Pod¶
- Create a manifest (for example, npu-demo-pod.yaml) that requests four rebellions.ai/ATOM devices.
- Create the Pod.
- Verify the Pod status and resource assignment.
  Use kubectl describe pod npu-pod to confirm the requested NPU resources are bound and the Pod is scheduled onto an NPU-capable node.
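The three steps above can be sketched as follows. Only the file name, the Pod name (npu-pod), and the request for four rebellions.ai/ATOM devices come from this guide. The container name, image, and command are placeholders to be replaced with a real NPU workload.

```shell
# Step 1: write the manifest. The image and command are placeholders.
cat > npu-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod
spec:
  restartPolicy: Never
  containers:
    - name: npu-demo                 # placeholder container name
      image: ubuntu:22.04            # placeholder image
      command: ["sleep", "3600"]     # placeholder workload
      resources:
        limits:
          rebellions.ai/ATOM: 4      # extended resources are requested via limits
EOF

# Steps 2 and 3: create the Pod, then inspect scheduling and resource assignment.
create_npu_pod() (
  set -e
  kubectl apply -f npu-demo-pod.yaml
  kubectl get pod npu-pod -o wide
  kubectl describe pod npu-pod
)
```

If the Pod stays Pending, the Events section of the describe output usually states whether any node has enough rebellions.ai/ATOM capacity left.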
Chart Customization Options¶
The tables below group commonly tuned values.yaml entries and describe when you might adjust them.
Global & Dependencies¶
| Key | Default / Description | When to Customize |
|---|---|---|
| name | rbln – prefix for all child resources | Avoid naming collisions when running multiple operator instances |
| nfd.enabled | false – whether the chart installs NFD | Set to true if your cluster does not already run NFD |
| npuFeatureDiscovery.enabled / image.* | true, rebellions/rbln-npu-feature-discovery:latest | Disable if you already manage labels manually |
Operator Deployment¶
| Key | Default / Description | When to Customize |
|---|---|---|
| operator.image.* | rebellions/rbln-npu-operator:latest | Pin to a private registry or a specific version |
| operator.replicas | 1 | Increase to 2+ for high availability |
| operator.resources | CPU/memory requests and limits | Tune to fit cluster capacity |
| operator.service.* | ClusterIP, ports 8443/8443 | Change when integrating with ingress/service mesh |
| operator.securityContext.runAsNonRoot | true | Relax or tighten privileges per security policy |
| operator.affinity / tolerations | Empty by default | Force scheduling on control-plane or specific worker nodes |
Container Workload Stack¶
| Key | Default / Description | When to Customize |
|---|---|---|
| devicePlugin.enabled | true | Disable if you only run VM workloads |
| devicePlugin.image.* | rebellions/k8s-device-plugin:latest | Pin to a private registry or a specific version |
| devicePlugin.resourceList[] | rebellions.ai/ATOM | Add custom resource names or prefixes |
| metricsExporter.enabled / image.* | true, rebellions/rbln-metrics-exporter:latest | Disable when another telemetry pipeline exists |
Driver Configuration¶
| Key | Default / Description | When to Customize |
|---|---|---|
| driver.enabled | true | Disable if drivers are preinstalled or managed outside the operator |
| driver.image.* | harbor.k8s.rebellions.in/playground/rbln-driver:3.0.0-rc3, IfNotPresent | Use a private registry or pin a specific driver version |
| driver.imagePullSecrets | [] | Required when pulling from a private registry |
| driver.nodeSelector / nodeAffinity / tolerations | Empty by default | Constrain driver pods to specific nodes or architectures |
| driver.labels / annotations | {} | Add metadata for policy, audit, or monitoring integration |
| driver.priorityClassName | "" | Increase scheduling priority (for example, system-node-critical) |
| driver.resources | {} | Set CPU/memory requests and limits |
| driver.args / driver.env | [] | Pass custom arguments or environment variables (for example, log levels) |
| driver.manager.* | rebellions/rbln-k8s-driver-manager:latest, IfNotPresent | Override the Driver Manager image or version |
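As an illustration of overriding driver values, the sketch below writes a small values file and applies it with `helm upgrade --install`. The keys come from the table above and are nested following Helm's dotted-path convention. The file name, function name, and the release and chart arguments are hypothetical.

```shell
# Override file that raises the scheduling priority of driver pods.
cat > rbln-driver-values.yaml <<'EOF'
driver:
  enabled: true
  priorityClassName: system-node-critical
EOF

# Apply the overrides; release and chart are supplied by the caller.
apply_driver_values() (
  release="${1:?usage: apply_driver_values <release> <chart> [namespace]}"
  chart="${2:?usage: apply_driver_values <release> <chart> [namespace]}"
  ns="${3:-rbln-system}"
  helm upgrade --install "$release" "$chart" -n "$ns" -f rbln-driver-values.yaml
)
```

Setting `enabled: false` in the same file is the equivalent override for clusters where drivers are preinstalled.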
Validator Configuration¶
| Key | Default / Description | When to Customize |
|---|---|---|
| validator.registry / image / tag | harbor.k8s.rebellions.in, rebellions/rbln-npu-operator-validator, driver-test | Override the validator image location or version |
| validator.pullPolicy | IfNotPresent | Reuse cached images when available |
| validator.imagePullSecrets | [] | Required when pulling from a private registry |
| validator.resources | {} | Set CPU/memory requests and limits |
| validator.args / validator.env | [] | Pass custom arguments or environment variables |
Container Toolkit Configuration¶
| Key | Default / Description | When to Customize |
|---|---|---|
| containerToolkit.enabled | true | Disable if CDI specs and runtime config are managed externally |
| containerToolkit.image.* | harbor.k8s.rebellions.in/rebellions/rbln-container-toolkit:latest, Always | Override the toolkit image location or version |
| containerToolkit.imagePullSecrets | [] | Required when pulling from a private registry |
| containerToolkit.resources | {} | Set CPU/memory requests and limits |
| containerToolkit.args / containerToolkit.env | [] | Pass custom arguments or environment variables |
VM Passthrough Stack¶
| Key | Default / Description | When to Customize |
|---|---|---|
| sandboxDevicePlugin.enabled | false (disabled by default) | Enable for KubeVirt or other VM environments |
| sandboxDevicePlugin.resourceList[] | rebellions.ai/ATOM_PT | Match VFIO resource names per card model |
| sandboxDevicePlugin.vfioChecker.image.* | rebellions/rbln-vfio-manager:latest | Swap in a different validation image |
| vfioManager.enabled | false | Enable when configuring VM passthrough |
| vfioManager.image.* | rebellions/rbln-vfio-manager:latest | Pin to a private registry or a specific version |
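A hedged sketch of enabling the VM passthrough stack on an existing release is shown below. It toggles the two `enabled` keys from the table with `--set` and keeps previously supplied values with `--reuse-values`. The function name and the release and chart arguments are hypothetical.

```shell
# Turn on the sandbox device plugin and the VFIO manager for an existing release.
enable_vm_passthrough() (
  release="${1:?usage: enable_vm_passthrough <release> <chart> [namespace]}"
  chart="${2:?usage: enable_vm_passthrough <release> <chart> [namespace]}"
  ns="${3:-rbln-system}"
  helm upgrade --install "$release" "$chart" -n "$ns" --reuse-values \
    --set sandboxDevicePlugin.enabled=true \
    --set vfioManager.enabled=true
)
```

Once the operands are running, nodes should report the rebellions.ai/ATOM_PT resource in their capacity.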