RBLN NPU Operator¶

The rbln-npu-operator automates the deployment and management of all Rebellions software components required to provision the RBLN NPU family on Kubernetes and OpenShift clusters. A singleton RBLNClusterPolicy custom resource drives the entire workflow. Once that policy is created, the operator automatically handles:

- Hardware detection and node labeling via Node Feature Discovery (NFD)
- Device exposure for two workload types: container passthrough and VM passthrough
- Driver installation and lifecycle management via the Driver Manager (driven by the RBLNDriver custom resource)
- Deployment and lifecycle management of adjacent components, including VFIO binding, device plugins, and metrics exporters
- Automatic RBAC/SCC configuration so it works on both OpenShift and vanilla Kubernetes

Core Components¶

The table below summarizes the key components that the operator deploys and manages.

Component	Description
Device Plugin	Publishes resources such as `rebellions.ai/ATOM` to Pods using the `rebellions/k8s-device-plugin` image. Maps per-card resources in `resourceList`.
Sandbox Device Plugin	Exposes resources like `rebellions.ai/ATOM_PT` for VM passthrough. Works with the VFIO checker container and targets KubeVirt environments.
VFIO Manager	Delivers the `rebellions/rbln-vfio-manager` image and `vfio-manage.sh` script through a ConfigMap to handle VFIO bind/unbind.
Driver Manager	Installs and maintains NPU drivers based on the `RBLNDriver` custom resource.
NPU Feature Discovery	Uses `rebellions/rbln-npu-feature-discovery` to detect `PCI-1eff` vendor devices and apply labels such as `rebellions.ai/npu.present`.
Container Toolkit	Generates CDI specs and configures/restarts the container runtime so workloads can use CDI-based device injection.
Metrics Exporter	Exposes NPU metrics in Prometheus format through the `rebellions/rbln-metrics-exporter` image.
Operator Validator	Validates driver and container toolkit readiness and writes a hostPath ready file that other components can depend on.
Operator Manager	The controller loaded from `cmd/main.go` enforces the singleton `RBLNClusterPolicy` and watches for changes to Kubernetes Nodes and component DaemonSets.

Installing the NPU Operator¶

The RBLN NPU Operator Helm chart automatically discovers RBLN NPUs in your cluster, deploys the required device plugins, and monitors the health of the related components. Follow these steps to get up and running quickly.

Prerequisites
- Kubernetes 1.19+ cluster with access to kubectl and helm
- Node Feature Discovery installed (or set nfd.enabled=true in the chart to install it together with the operator)
- A dedicated namespace such as rbln-system is recommended for the operator
- Worker nodes equipped with NPUs

Install Helm (if needed)

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

Add the Rebellions Helm repository

helm repo add rebellions https://rbln-sw.github.io/rbln-npu-operator
helm repo update

Install the NPU Operator

helm install --wait --generate-name \
     -n rbln-system --create-namespace \
     rebellions/rbln-npu-operator
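
If NFD is not already running in the cluster, the chart can install it alongside the operator, as noted in the prerequisites. The sketch below writes a small override file for that case. The file name custom-values.yaml is an arbitrary choice, and the nested YAML layout assumes the usual Helm mapping of the dotted key nfd.enabled.

```shell
# Write a minimal values override that asks the chart to install NFD as well.
# The file name is arbitrary; nfd.enabled is the chart value named in the prerequisites.
cat > custom-values.yaml <<'EOF'
nfd:
  enabled: true
EOF

# Then pass the file to Helm (not executed here):
#   helm install --wait --generate-name \
#        -n rbln-system --create-namespace \
#        -f custom-values.yaml \
#        rebellions/rbln-npu-operator
```

Keeping overrides in a file rather than in --set flags makes it easier to review and reuse them on later upgrades.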

Verify Installation¶

After Helm reports a successful install, confirm that the RBLNClusterPolicy custom resource exists:

kubectl get rblnclusterpolicies.rebellions.ai -n rbln-system
NAME                  AGE
rbln-cluster-policy   8m

If driver management is enabled, confirm that the RBLNDriver custom resource is created:

kubectl get rblndrivers.rebellions.ai -n rbln-system
NAME          AGE
rbln-driver   99m

Next, inspect the operator namespace to verify the health of the controller and operand pods:

kubectl get pods -n rbln-system
NAME                                             READY   STATUS    AGE
controller-manager-797798d7b8-rjzht              1/1     Running   8m
rbln-device-plugin-4qgxc                         1/1     Running   8m
rbln-metrics-exporter-jghbg                      1/1     Running   8m
rbln-npu-feature-discovery-zg47r                 1/1     Running   8m
rbln-container-toolkit-ttz2c                     1/1     Running   8m
rbln-driver-ubuntu22.04-6.8.0-90-generic-6gtrc   1/1     Running   8m
rbln-operator-validator-qhf4t                    1/1     Running   8m

- Controller component: controller-manager-* reconciles the RBLNClusterPolicy.
- Driver management: RBLNDriver custom resources declare the NPU driver version, and the Driver Manager installs it across the cluster.
- Validator: rbln-operator-validator-* verifies driver and container toolkit readiness and records the result as a hostPath ready file.
- Operands: rbln-device-plugin, rbln-npu-feature-discovery, rbln-container-toolkit, and rbln-metrics-exporter handle device exposure, labeling, CDI enablement, and telemetry.

If any pod is stuck in Pending or is in CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
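
On clusters with many nodes the pod list gets long, so it can help to filter it down to the pods that need attention. The helper below is a hypothetical sketch, not part of the operator: it assumes the default column order of kubectl get pods --no-headers (NAME, READY, STATUS, RESTARTS, AGE) and prints the name of every pod that is not fully ready or not in the Running state.

```shell
# Hypothetical helper: list pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin, where the columns are
# NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '{
    split($2, ready, "/")    # READY looks like "1/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -n rbln-system --no-headers | not_ready_pods
```

An empty result means every listed pod is Running with all containers ready. Pods that have finished normally (status Completed) are also reported by this filter, so treat the output as a list of candidates to inspect rather than a list of failures.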

Driver Management¶

The operator defines the RBLNDriver CRD to manage NPU driver installation. Create an RBLNDriver custom resource with the desired driver version, and the Driver Manager installs and maintains that version across the cluster.

RBLNDriver Sample¶

apiVersion: rebellions.ai/v1alpha1
kind: RBLNDriver
metadata:
  labels:
    app.kubernetes.io/name: rbln-driver
  name: rblndriver-sample
spec:
  registry: docker.io
  image: rebellions/rbln-driver
  version: "3.0.0"
  imagePullPolicy: IfNotPresent
  labels:
    app.kubernetes.io/component: driver
  resources:
    requests:
      cpu: 250m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 128Mi
  manager:
    registry: docker.io
    image: rebellions/rbln-k8s-driver-manager
    version: "latest"
    imagePullPolicy: IfNotPresent

What the Driver Manager Installs¶

When an RBLNDriver resource is applied, the Driver Manager installs:

- Kernel driver
- UMD libraries
- Tools such as rbln-smi

Driver Image Selection¶

The Driver Manager selects the driver container image by combining Node Feature Discovery labels on each node:

- feature.node.kubernetes.io/system-os_release.ID
- feature.node.kubernetes.io/system-os_release.VERSION_ID
- feature.node.kubernetes.io/kernel-version.full

For example, with driver version 3.0.0-rc3, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic can resolve to: docker.io/rebellions/rbln-driver:3.0.0-rc3-6.8.0-90-generic-ubuntu22.04.
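
To make the example concrete, the sketch below rebuilds that image reference from the three label values. The function name compose_driver_image is hypothetical, and the tag layout (version, then kernel, then OS ID and version ID) is inferred from the single example above, so treat it as an assumption rather than the Driver Manager's documented rule.

```shell
# Hypothetical helper: compose a driver image reference from NFD label values.
# Assumed tag layout (inferred from one example): <version>-<kernel>-<os id><os version id>
compose_driver_image() {
  registry="$1"        # e.g. docker.io
  image="$2"           # e.g. rebellions/rbln-driver
  version="$3"         # driver version from the RBLNDriver spec
  os_id="$4"           # value of system-os_release.ID
  os_version_id="$5"   # value of system-os_release.VERSION_ID
  kernel="$6"          # value of kernel-version.full
  printf '%s/%s:%s-%s-%s%s\n' \
    "$registry" "$image" "$version" "$kernel" "$os_id" "$os_version_id"
}

compose_driver_image docker.io rebellions/rbln-driver 3.0.0-rc3 ubuntu 22.04 6.8.0-90-generic
```

To see the label values on your own nodes, kubectl get nodes -L feature.node.kubernetes.io/system-os_release.ID,feature.node.kubernetes.io/system-os_release.VERSION_ID,feature.node.kubernetes.io/kernel-version.full prints them as extra columns.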

Operator Validator¶

rbln-validator verifies the readiness of the RBLN NPU stack, including the driver and container toolkit. It runs inside the operator-managed Validator DaemonSet and records the result as a hostPath ready file so other components can depend on it.

Container Toolkit¶

The NPU Operator deploys the Container Toolkit DaemonSet based on the containerToolkit settings in the RBLNClusterPolicy. It generates CDI specs and configures/restarts the container runtime so the Device Plugin and workloads can use CDI-based device injection.

Checking NPU Status¶

Inspect the NPU capacity that Kubernetes reports for each node to confirm that the device plugin published the expected resources:

kubectl get nodes -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/ATOM'
NAME                            NPUs
rbln-npu-worker-01              16

If you expose additional resource names (for example, rebellions.ai/ATOM_PT for sandbox workloads), substitute that resource in the command and confirm the corresponding column is populated.
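
When several nodes carry NPUs, a cluster-wide total is a quick sanity check. The helper below is a hypothetical sketch that sums the second column of the custom-columns output shown above; it assumes a single header line and skips nodes where kubectl prints a non-numeric placeholder such as <none>.

```shell
# Hypothetical helper: sum the NPUs column of the custom-columns output.
# Expects a header line followed by "<node name> <count>" rows on stdin.
total_npus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes \
#     -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/ATOM' \
#     | total_npus
```

The same filter works for other resource names as long as the count stays in the second column.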

Creating an NPU-enabled Pod¶

Create a manifest (for example, npu-demo-pod.yaml) that requests four rebellions.ai/ATOM devices:

apiVersion: v1
kind: Pod
metadata:
  name: npu-pod
spec:
  containers:
    - name: ubuntu
      image: ubuntu:latest
      imagePullPolicy: IfNotPresent
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 300000; done;"]
      resources:
        limits:
          rebellions.ai/ATOM: 4

Create the Pod.
kubectl apply -f npu-demo-pod.yaml
Verify the Pod status and resource assignment.
kubectl get pod npu-pod
Use kubectl describe pod npu-pod to confirm the requested NPU resources are bound and the Pod is scheduled onto an NPU-capable node.
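
If you need the same Pod with different names or device counts, the manifest can be generated instead of edited by hand. The function render_npu_pod below is a hypothetical sketch that prints the manifest from this section with the Pod name and the rebellions.ai/ATOM count filled in from its arguments.

```shell
# Hypothetical helper: print a Pod manifest that requests N rebellions.ai/ATOM devices.
# $1 = Pod name, $2 = number of devices to request.
render_npu_pod() {
  pod_name="$1"
  npu_count="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
spec:
  containers:
    - name: ubuntu
      image: ubuntu:latest
      imagePullPolicy: IfNotPresent
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 300000; done;"]
      resources:
        limits:
          rebellions.ai/ATOM: ${npu_count}
EOF
}

# Usage against a live cluster (not executed here):
#   render_npu_pod npu-pod-2 2 | kubectl apply -f -
```

Requesting more devices than any single node advertises leaves the Pod in Pending, so compare the count against the per-node capacity first.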

Chart Customization Options¶

The tables below group commonly tuned values.yaml entries and describe when you might adjust them.

Global & Dependencies¶

Key	Default / Description	When to Customize
`name`	`rbln` – prefix for all child resources	Avoid naming collisions when running multiple operator instances
`nfd.enabled`	`false` – whether the chart installs NFD	Set to `true` if your cluster does not already run NFD
`npuFeatureDiscovery.enabled` / `image.*`	`true`, `rebellions/rbln-npu-feature-discovery:latest`	Disable if you already manage labels manually

Operator Deployment¶

Key	Default / Description	When to Customize
`operator.image.*`	`rebellions/rbln-npu-operator:latest`	Pin to a private registry or a specific version
`operator.replicas`	`1`	Increase to 2+ for high availability
`operator.resources`	CPU/memory requests and limits	Tune to fit cluster capacity
`operator.service.*`	`ClusterIP`, ports 8443/8443	Change when integrating with ingress/service mesh
`operator.securityContext.runAsNonRoot`	`true`	Relax or tighten privileges per security policy
`operator.affinity` / `tolerations`	Empty by default	Force scheduling on control-plane or specific worker nodes

Container Workload Stack¶

Key	Default / Description	When to Customize
`devicePlugin.enabled`	`true`	Disable if you only run VM workloads
`devicePlugin.image.*`	`rebellions/k8s-device-plugin:latest`	Pin to a private registry or a specific version
`devicePlugin.resourceList[]`	`rebellions.ai/ATOM`	Add custom resource names or prefixes
`metricsExporter.enabled` / `image.*`	`true`, `rebellions/rbln-metrics-exporter:latest`	Disable when another telemetry pipeline exists

Driver Configuration¶

Key	Default / Description	When to Customize
`driver.enabled`	`true`	Disable if drivers are preinstalled or managed outside the operator
`driver.image.*`	`harbor.k8s.rebellions.in/playground/rbln-driver:3.0.0-rc3`, `IfNotPresent`	Use a private registry or pin a specific driver version
`driver.imagePullSecrets`	`[]`	Required when pulling from a private registry
`driver.nodeSelector` / `nodeAffinity` / `tolerations`	Empty by default	Constrain driver pods to specific nodes or architectures
`driver.labels` / `annotations`	`{}`	Add metadata for policy, audit, or monitoring integration
`driver.priorityClassName`	`""`	Increase scheduling priority (for example, `system-node-critical`)
`driver.resources`	`{}`	Set CPU/memory requests and limits
`driver.args` / `driver.env`	`[]`	Pass custom arguments or environment variables (for example, log levels)
`driver.manager.*`	`rebellions/rbln-k8s-driver-manager:latest`, `IfNotPresent`	Override the Driver Manager image or version

Validator Configuration¶

Key	Default / Description	When to Customize
`validator.registry` / `image` / `tag`	`harbor.k8s.rebellions.in`, `rebellions/rbln-npu-operator-validator`, `driver-test`	Override the validator image location or version
`validator.pullPolicy`	`IfNotPresent`	Reuse cached images when available
`validator.imagePullSecrets`	`[]`	Required when pulling from a private registry
`validator.resources`	`{}`	Set CPU/memory requests and limits
`validator.args` / `validator.env`	`[]`	Pass custom arguments or environment variables

Container Toolkit Configuration¶

Key	Default / Description	When to Customize
`containerToolkit.enabled`	`true`	Disable if CDI specs and runtime config are managed externally
`containerToolkit.image.*`	`harbor.k8s.rebellions.in/rebellions/rbln-container-toolkit:latest`, `Always`	Override the toolkit image location or version
`containerToolkit.imagePullSecrets`	`[]`	Required when pulling from a private registry
`containerToolkit.resources`	`{}`	Set CPU/memory requests and limits
`containerToolkit.args` / `containerToolkit.env`	`[]`	Pass custom arguments or environment variables

VM Passthrough Stack¶

Key	Default / Description	When to Customize
`sandboxDevicePlugin.enabled`	`false` (disabled by default)	Enable for KubeVirt or other VM environments
`sandboxDevicePlugin.resourceList[]`	`rebellions.ai/ATOM_PT`	Match VFIO resource names per card model
`sandboxDevicePlugin.vfioChecker.image.*`	`rebellions/rbln-vfio-manager:latest`	Swap in a different validation image
`vfioManager.enabled`	`false`	Enable when configuring VM passthrough
`vfioManager.image.*`	`rebellions/rbln-vfio-manager:latest`	Pin to a private registry or a specific version
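
As an illustration of how these keys combine, the sketch below writes an override file for a cluster that runs only VM passthrough workloads. The file name is arbitrary, the nested layout assumes the usual Helm mapping of the dotted keys in the tables above, and disabling devicePlugin is only appropriate if no container workloads use the NPUs.

```shell
# Write a values override for a VM-passthrough-only cluster.
# Keys come from the tables above; the file name is an arbitrary choice.
cat > vm-passthrough-values.yaml <<'EOF'
devicePlugin:
  enabled: false        # only if no container workloads use the NPUs
sandboxDevicePlugin:
  enabled: true         # exposes rebellions.ai/ATOM_PT by default
vfioManager:
  enabled: true         # handles VFIO bind/unbind on the nodes
EOF

# Apply it to an existing release (not executed here); look up the generated
# release name first with `helm list -n rbln-system`:
#   helm upgrade <release> rebellions/rbln-npu-operator \
#        -n rbln-system -f vm-passthrough-values.yaml
```

After the upgrade, check node capacity for rebellions.ai/ATOM_PT to confirm that the sandbox device plugin has published the passthrough resource.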