RBLN NPU Operator
The rbln-npu-operator automates the deployment and management of all Rebellions software components required for provisioning the RBLN NPU family across Kubernetes and OpenShift clusters. A singleton RBLNClusterPolicy custom resource drives the entire workflow. Once that policy is created, the operator automatically performs:
- Hardware detection and node labeling via Node Feature Discovery (NFD)
- Device exposure for two workload types: container passthrough and VM passthrough
- Deployment and lifecycle management of adjacent components, including VFIO binding, device plugins, and metrics exporters
- Automatic RBAC/SCC configuration so it works on both OpenShift and vanilla Kubernetes
Core Components
The table below summarizes the key components that the operator deploys and manages.
| Component | Description |
|---|---|
| Device Plugin | Publishes resources such as rebellions.ai/ATOM to Pods using the rebellions/k8s-device-plugin image. Maps per-card resources in resourceList. |
| Sandbox Device Plugin | Exposes resources like rebellions.ai/ATOM_PT for VM passthrough. Works with the VFIO checker container and targets KubeVirt environments. |
| VFIO Manager | Delivers the rebellions/rbln-vfio-manager image and vfio-manage.sh script through a ConfigMap to handle VFIO bind/unbind. |
| NPU Feature Discovery | Uses rebellions/rbln-npu-feature-discovery to detect PCI devices with vendor ID 1eff and apply labels such as rebellions.ai/npu.present. |
| Metrics Exporter | Exposes NPU metrics in Prometheus format through the rebellions/rbln-metrics-exporter image. |
| Operator Manager | The controller loaded from cmd/main.go enforces the singleton RBLNClusterPolicy and watches for changes to Kubernetes Nodes and component DaemonSets. |
Installing the NPU Operator
The RBLN NPU Operator Helm chart automatically discovers RBLN NPUs in your cluster, deploys the required device plugins, and monitors the health of the related components. Follow these steps to get up and running quickly.
- Prerequisites
  - A Kubernetes 1.19+ cluster with access to kubectl and helm
  - A dedicated namespace such as rbln-system is recommended for the operator
  - Worker nodes equipped with NPUs and Node Feature Discovery installed (set nfd.enabled=true in the chart to install it together)
- Install Helm (if needed)
- Add the Rebellions Helm repository
- Install the NPU Operator
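The last two steps above can be sketched as a small shell function. This is only an outline under stated assumptions: the repository URL and chart reference are left as arguments because they must come from the Rebellions documentation, and the repository alias rebellions and the release name rbln-npu-operator are illustrative choices, not values confirmed by this page.

```shell
# Sketch only. Arguments:
#   $1  Helm repository URL (take it from the Rebellions documentation)
#   $2  chart reference in the form <repo-alias>/<chart-name>
# The alias "rebellions" and the release name "rbln-npu-operator" are assumptions.
install_npu_operator() {
  repo_url="$1"
  chart="$2"
  if [ -z "$repo_url" ] || [ -z "$chart" ]; then
    echo "usage: install_npu_operator <repo-url> <repo-alias/chart-name>" >&2
    return 1
  fi
  # Register the repository, refresh the index, then install into rbln-system.
  # nfd.enabled=true also installs Node Feature Discovery; drop the flag if
  # the cluster already runs NFD.
  helm repo add rebellions "$repo_url" &&
    helm repo update &&
    helm install rbln-npu-operator "$chart" \
      --namespace rbln-system --create-namespace \
      --set nfd.enabled=true
}
```

Call it with both arguments filled in, for example install_npu_operator <repo-url> rebellions/<chart-name>. The && chain stops at the first failing step, so a wrong repository URL never reaches the install.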
Verify Installation
After Helm reports a successful install, confirm that the RBLNClusterPolicy custom resource is present and reconciled:
Next, inspect the operator namespace to verify the health of the controller and operand pods:
- Controller component: controller-manager-* reconciles the RBLNClusterPolicy.
- Operands: rbln-device-plugin, rbln-npu-feature-discovery, and rbln-metrics-exporter handle device exposure, labeling, and telemetry.
If any pod is stuck or CrashLooping, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
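The two checks in this section can be combined into one helper, sketched below. The resource name rblnclusterpolicy assumes the CRD registers the lowercase kind as its singular name (kubectl api-resources shows the actual name), and rbln-system is the namespace suggested in the prerequisites; the function name is hypothetical.

```shell
# Sketch: list the cluster policy, then the pods in the operator namespace.
# "rblnclusterpolicy" is an assumed resource name; confirm it with
# `kubectl api-resources` before relying on it.
check_npu_operator() {
  namespace="${1:-rbln-system}"
  # --all-namespaces is harmless for a cluster-scoped resource and still
  # finds the policy if the CRD turns out to be namespaced.
  kubectl get rblnclusterpolicy --all-namespaces &&
    kubectl get pods --namespace "$namespace" -o wide
}
```

Run check_npu_operator with no argument for rbln-system, or pass another namespace if the operator was installed elsewhere.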
Checking NPU Status
Inspect the NPU capacity that Kubernetes reports for each node to confirm that the device plugin published the expected resources:
If you expose additional resource names (for example, rebellions.ai/ATOM_PT for sandbox workloads), substitute that resource in the command and confirm the corresponding column is populated.
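One way to run this check is with kubectl's custom-columns output, sketched below as a helper whose argument selects the resource name. The function name npu_capacity is hypothetical, and the sketch assumes every resource lives under the rebellions.ai/ prefix as in the examples above.

```shell
# Sketch: print each node with the capacity it reports for one
# rebellions.ai/<resource> extended resource (default: ATOM).
npu_capacity() {
  resource="${1:-ATOM}"
  # The dot in "rebellions.ai" is escaped so kubectl treats it as part of
  # the key rather than as a path separator.
  kubectl get nodes \
    -o custom-columns="NODE:.metadata.name,${resource}:.status.capacity.rebellions\.ai/${resource}"
}
```

npu_capacity lists rebellions.ai/ATOM, and npu_capacity ATOM_PT lists the passthrough resource. Nodes that do not advertise the resource show <none> in that column.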
Creating an NPU-enabled Pod
- Create a manifest (for example, npu-demo-pod.yaml) that requests four rebellions.ai/ATOM devices.
- Create the Pod.
- Verify the Pod status and resource assignment.
  Use kubectl describe pod npu-pod to confirm the requested NPU resources are bound and the Pod is scheduled onto an NPU-capable node.
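A minimal manifest for the first step might look like the sketch below, written to npu-demo-pod.yaml with a heredoc. The Pod name npu-pod matches the describe command above; the container name, image, and command are placeholders for illustration and should be replaced with an RBLN-enabled workload image.

```shell
# Write a Pod manifest that requests four rebellions.ai/ATOM devices.
# Container name, image, and command are placeholders, not values from the chart.
cat > npu-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod
spec:
  restartPolicy: Never
  containers:
    - name: npu-demo
      image: ubuntu:22.04        # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          rebellions.ai/ATOM: 4  # extended resources are whole integers
EOF
```

Apply it with kubectl apply -f npu-demo-pod.yaml. Extended resources only need to appear under limits, because Kubernetes sets the request equal to the limit when the request is omitted.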
Chart Customization Options
The tables below group commonly tuned values.yaml entries and describe when you might adjust them.
Global & Dependencies
| Key | Default / Description | When to Customize |
|---|---|---|
| name | rbln – prefix for all child resources | Avoid naming collisions when running multiple operator instances |
| nfd.enabled | false – whether the chart installs NFD | Set to true if your cluster does not already run NFD |
| npuFeatureDiscovery.enabled / image.* | true, rebellions/rbln-npu-feature-discovery:latest | Disable if you already manage labels manually |
Operator Deployment
| Key | Default / Description | When to Customize |
|---|---|---|
| operator.image.* | rebellions/rbln-npu-operator:latest | Pin to a private registry or a specific version |
| operator.replicas | 1 | Increase to 2+ for high availability |
| operator.resources | CPU/memory requests and limits | Tune to fit cluster capacity |
| operator.service.* | ClusterIP, ports 8443/8443 | Change when integrating with ingress/service mesh |
| operator.securityContext.runAsNonRoot | true | Relax or tighten privileges per security policy |
| operator.affinity / tolerations | Empty by default | Force scheduling on control-plane or specific worker nodes |
Container Workload Stack
| Key | Default / Description | When to Customize |
|---|---|---|
| devicePlugin.enabled | true | Disable if you only run VM workloads |
| devicePlugin.image.* | rebellions/k8s-device-plugin:latest | Pin to a private registry or a specific version |
| devicePlugin.resourceList[] | rebellions.ai/ATOM | Add custom resource names or prefixes |
| metricsExporter.enabled / image.* | true, rebellions/rbln-metrics-exporter:latest | Disable when another telemetry pipeline exists |
VM Passthrough Stack
| Key | Default / Description | When to Customize |
|---|---|---|
| sandboxDevicePlugin.enabled | false (disabled by default) | Enable for KubeVirt or other VM environments |
| sandboxDevicePlugin.resourceList[] | rebellions.ai/ATOM_PT | Match VFIO resource names per card model |
| sandboxDevicePlugin.vfioChecker.image.* | rebellions/rbln-vfio-manager:latest | Swap in a different validation image |
| vfioManager.enabled | false | Enable when configuring VM passthrough |
| vfioManager.image.* | rebellions/rbln-vfio-manager:latest | Pin to a private registry or a specific version |
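As an illustration of how these keys combine, the sketch below writes an override file that installs NFD, runs two operator replicas, and turns on the VM passthrough stack alongside the container stack. The file name custom-values.yaml is hypothetical, and the nesting assumes the dotted keys in the tables map to nested YAML in the usual Helm way.

```shell
# Sketch of a values override; every key comes from the tables above, but the
# nested layout is an assumption based on standard Helm conventions.
cat > custom-values.yaml <<'EOF'
nfd:
  enabled: true          # let the chart install Node Feature Discovery
operator:
  replicas: 2            # more than one replica for high availability
devicePlugin:
  enabled: true          # container workloads (rebellions.ai/ATOM)
sandboxDevicePlugin:
  enabled: true          # VM passthrough (rebellions.ai/ATOM_PT)
vfioManager:
  enabled: true          # VFIO bind/unbind for passthrough
EOF
```

Pass the file to Helm with -f custom-values.yaml on helm install or helm upgrade. Keeping overrides in a file rather than a chain of --set flags makes the change reviewable and repeatable.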