RBLN NPU Operator

The rbln-npu-operator automates the deployment and management of all Rebellions software components required to provision the RBLN NPU family on Kubernetes and OpenShift clusters. A singleton RBLNClusterPolicy custom resource drives the entire workflow. Once that policy is created, the operator handles the following:

  • Hardware detection and node labeling via Node Feature Discovery (NFD)
  • Device exposure for two workload types: container passthrough and VM passthrough
  • Driver installation and lifecycle management via the Driver Manager (driven by the RBLNDriver custom resource)
  • Deployment and lifecycle management of adjacent components, including VFIO binding, device plugins, and metrics exporters
  • RBAC and SCC configuration, so the operator runs on both OpenShift and vanilla Kubernetes

Core Components

The table below summarizes the key components that the operator deploys and manages.

Component | Description
Device Plugin | Publishes resources such as rebellions.ai/ATOM to Pods using the rebellions/k8s-device-plugin image. Maps per-card resources in resourceList.
Sandbox Device Plugin | Exposes resources like rebellions.ai/ATOM_PT for VM passthrough. Works with the VFIO checker container and targets KubeVirt environments.
VFIO Manager | Delivers the rebellions/rbln-vfio-manager image and vfio-manage.sh script through a ConfigMap to handle VFIO bind/unbind.
Driver Manager | Installs and maintains NPU drivers based on the RBLNDriver custom resource.
NPU Feature Discovery | Uses rebellions/rbln-npu-feature-discovery to detect PCI-1eff vendor devices and apply labels such as rebellions.ai/npu.present.
Container Toolkit | Generates CDI specs and configures/restarts the container runtime so workloads can use CDI-based device injection.
Metrics Exporter | Exposes NPU metrics in Prometheus format through the rebellions/rbln-metrics-exporter image.
Operator Validator | Validates driver and container toolkit readiness and writes a hostPath ready file that other components can depend on.
Operator Manager | The controller loaded from cmd/main.go enforces the singleton RBLNClusterPolicy and watches for changes to Kubernetes Nodes and component DaemonSets.

Installing the NPU Operator

The RBLN NPU Operator Helm chart automatically discovers RBLN NPUs in your cluster, deploys the required device plugins, and monitors the health of the related components. Follow these steps to get up and running quickly.

  1. Prerequisites

    • Kubernetes 1.19+ cluster with access to kubectl and helm
    • A dedicated namespace such as rbln-system is recommended for the operator
    • Worker nodes equipped with NPUs and Node Feature Discovery installed (set nfd.enabled=true in the chart to install it together)
  2. Install Helm (if needed)

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
    

  3. Add the Rebellions Helm repository

    helm repo add rebellions https://rebellions-sw.github.io/rbln-npu-operator
    helm repo update
    

  4. Install the NPU Operator

    helm install --wait --generate-name \
         -n rbln-system --create-namespace \
         rebellions/rbln-npu-operator
    

Verify Installation

After Helm reports a successful install, confirm that the RBLNClusterPolicy custom resource is present and reconciled:

kubectl get rblnclusterpolicies.rebellions.ai -n rbln-system
NAME                  AGE
rbln-cluster-policy   8m

If driver management is enabled, confirm that the RBLNDriver custom resource is created:

kubectl get rblndrivers.rebellions.ai -n rbln-system
NAME          AGE
rbln-driver   99m

Next, inspect the operator namespace to verify the health of the controller and operand pods:

kubectl get pods -n rbln-system
NAME                                             READY   STATUS    AGE
controller-manager-797798d7b8-rjzht              1/1     Running   8m
rbln-device-plugin-4qgxc                         1/1     Running   8m
rbln-metrics-exporter-jghbg                      1/1     Running   8m
rbln-npu-feature-discovery-zg47r                 1/1     Running   8m
rbln-container-toolkit-ttz2c                     1/1     Running   8m
rbln-driver-ubuntu22.04-6.8.0-90-generic-6gtrc   1/1     Running   8m
rbln-operator-validator-qhf4t                    1/1     Running   8m

In this output:

  • Controller component: controller-manager-* reconciles the RBLNClusterPolicy.
  • Driver management: RBLNDriver custom resources declare the NPU driver version, and the Driver Manager installs it across the cluster.
  • Validator: rbln-operator-validator-* verifies driver and container toolkit readiness and records the result as a hostPath ready file.
  • Operands: rbln-device-plugin, rbln-npu-feature-discovery, rbln-container-toolkit, and rbln-metrics-exporter handle device exposure, labeling, CDI enablement, and telemetry.

If any pod is stuck or in CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
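
To spot unhealthy pods quickly, the pod listing can be filtered by its STATUS column. The following is a minimal sketch, not part of the operator: the helper name unhealthy_pods is hypothetical, and the failing pod in the sample input is made up for illustration.

```shell
# Hypothetical helper: print pods whose STATUS column is not "Running".
# Reads `kubectl get pods` output on stdin; STATUS is the third column.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Illustrative input only (the failing pod below is invented for this example).
# On a cluster: kubectl get pods -n rbln-system | unhealthy_pods
unhealthy_pods <<'EOF'
NAME                                  READY   STATUS             AGE
controller-manager-797798d7b8-rjzht   1/1     Running            8m
rbln-device-plugin-4qgxc              0/1     CrashLoopBackOff   8m
EOF
```

Each pod the helper prints is a candidate for kubectl logs and kubectl describe pod. Pods that finished normally (STATUS Completed) are listed as well, so treat the output as a starting point rather than a definitive failure list.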

Driver Management

The operator defines the RBLNDriver CRD to manage NPU driver installation. Create an RBLNDriver custom resource with the desired driver version, and the Driver Manager installs and maintains that version across the cluster.

RBLNDriver Sample

apiVersion: rebellions.ai/v1alpha1
kind: RBLNDriver
metadata:
  labels:
    app.kubernetes.io/name: rbln-driver
  name: rblndriver-sample
spec:
  registry: docker.io
  image: rebellions/rbln-driver
  version: "3.0.0"
  imagePullPolicy: IfNotPresent
  labels:
    app.kubernetes.io/component: driver
  resources:
    requests:
      cpu: 250m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 128Mi
  manager:
    registry: docker.io
    image: rebellions/rbln-k8s-driver-manager
    version: "latest"
    imagePullPolicy: IfNotPresent

What the Driver Manager Installs

When an RBLNDriver resource is applied, the Driver Manager installs:

  • Kernel driver
  • UMD libraries
  • Tools such as rbln-smi

Driver Image Selection

The Driver Manager selects the driver container image by combining Node Feature Discovery labels on each node:

  • feature.node.kubernetes.io/system-os_release.ID
  • feature.node.kubernetes.io/system-os_release.VERSION_ID
  • feature.node.kubernetes.io/kernel-version.full

For example, with driver version 3.0.0-rc3, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic can resolve to docker.io/rebellions/rbln-driver:3.0.0-rc3-6.8.0-90-generic-ubuntu22.04.
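
The way the label values combine into an image reference can be sketched in shell. This is an illustration only: it assumes the tag follows the pattern <version>-<kernel>-<OS ID><OS version ID>, which is inferred from the single example above rather than from the operator source. The function name driver_image is hypothetical, and the Driver Manager's actual logic may differ.

```shell
# Hypothetical helper: compose a driver image reference from NFD label values.
# Assumes the tag pattern <version>-<kernel>-<os_id><os_version_id>, inferred
# from the one example in this section, not from the operator source.
driver_image() {
  registry=$1      # e.g. docker.io
  image=$2         # e.g. rebellions/rbln-driver
  version=$3       # driver version from the RBLNDriver spec
  os_id=$4         # value of system-os_release.ID
  os_version=$5    # value of system-os_release.VERSION_ID
  kernel=$6        # value of kernel-version.full
  printf '%s/%s:%s-%s-%s%s\n' \
    "$registry" "$image" "$version" "$kernel" "$os_id" "$os_version"
}

# Reproduces the example reference for Ubuntu 22.04 with kernel 6.8.0-90-generic.
driver_image docker.io rebellions/rbln-driver 3.0.0-rc3 ubuntu 22.04 6.8.0-90-generic
```

On a live cluster, the three label values for a node can be read from kubectl get node <node> --show-labels.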

Operator Validator

rbln-validator verifies the readiness of the RBLN NPU stack, including the driver and container toolkit. It runs inside the operator-managed Validator DaemonSet and records the result as a hostPath ready file so other components can depend on it.

Container Toolkit

The NPU Operator deploys the Container Toolkit DaemonSet based on the containerToolkit settings in the RBLNClusterPolicy. It generates CDI specs and configures/restarts the container runtime so the Device Plugin and workloads can use CDI-based device injection.

Checking NPU Status

Inspect the NPU capacity that Kubernetes reports for each node to confirm that the device plugin published the expected resources:

kubectl get nodes -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/ATOM'
NAME                            NPUs
rbln-npu-worker-01              16

If you expose additional resource names (for example, rebellions.ai/ATOM_PT for sandbox workloads), substitute that resource in the command and confirm the corresponding column is populated.
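
On clusters with many nodes, it can help to total the NPUs column instead of reading it row by row. The sketch below assumes the two-column output format shown above. The helper name total_npus and the second node in the sample input are hypothetical.

```shell
# Hypothetical helper: total the NPUs column of the custom-columns output.
# Nodes that do not report the resource print <none>, which the numeric
# check skips.
total_npus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Illustrative input; rbln-cpu-worker-01 is a made-up node without NPUs.
# On a cluster, pipe the kubectl get nodes command into total_npus instead.
total_npus <<'EOF'
NAME                 NPUs
rbln-npu-worker-01   16
rbln-cpu-worker-01   <none>
EOF
# prints 16
```

Comparing the total against the number of cards physically installed is a quick way to notice a node whose device plugin has not registered its resources.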

Creating an NPU-enabled Pod

  1. Create a manifest (for example, npu-demo-pod.yaml) that requests four rebellions.ai/ATOM devices:

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-pod
    spec:
      containers:
        - name: ubuntu
          image: ubuntu:latest
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "--"]
          args: ["while true; do sleep 300000; done;"]
          resources:
            limits:
              rebellions.ai/ATOM: 4
    

  2. Create the Pod.

    kubectl apply -f npu-demo-pod.yaml
    

  3. Verify the Pod status and resource assignment.

    kubectl get pod npu-pod
    
    Use kubectl describe pod npu-pod to confirm the requested NPU resources are bound and the Pod is scheduled onto an NPU-capable node.

Chart Customization Options

The tables below group commonly tuned values.yaml entries and describe when you might adjust them.

Global & Dependencies

Key | Default / Description | When to Customize
name | rbln (prefix for all child resources) | Avoid naming collisions when running multiple operator instances
nfd.enabled | false (whether the chart installs NFD) | Set to true if your cluster does not already run NFD
npuFeatureDiscovery.enabled / image.* | true, rebellions/rbln-npu-feature-discovery:latest | Disable if you already manage labels manually
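
Dotted keys in these tables correspond to nested YAML in a values file. As a minimal sketch, the following writes an override that installs NFD together with the operator. The file name rbln-values.yaml is arbitrary, and the nesting assumes the chart follows the usual Helm convention for dotted keys.

```shell
# Write a values override that installs NFD together with the operator.
# The file name is arbitrary; nfd.enabled is the key from the table above.
cat > rbln-values.yaml <<'EOF'
nfd:
  enabled: true
EOF
```

Pass the file to helm install with -f rbln-values.yaml. For a single key, --set nfd.enabled=true has the same effect.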

Operator Deployment

Key | Default / Description | When to Customize
operator.image.* | rebellions/rbln-npu-operator:latest | Pin to a private registry or a specific version
operator.replicas | 1 | Increase to 2+ for high availability
operator.resources | CPU/memory requests and limits | Tune to fit cluster capacity
operator.service.* | ClusterIP, ports 8443/8443 | Change when integrating with ingress/service mesh
operator.securityContext.runAsNonRoot | true | Relax or tighten privileges per security policy
operator.affinity / tolerations | Empty by default | Force scheduling on control-plane or specific worker nodes

Container Workload Stack

Key | Default / Description | When to Customize
devicePlugin.enabled | true | Disable if you only run VM workloads
devicePlugin.image.* | rebellions/k8s-device-plugin:latest | Pin to a private registry or a specific version
devicePlugin.resourceList[] | rebellions.ai/ATOM | Add custom resource names or prefixes
metricsExporter.enabled / image.* | true, rebellions/rbln-metrics-exporter:latest | Disable when another telemetry pipeline exists

Driver Configuration

Key | Default / Description | When to Customize
driver.enabled | true | Disable if drivers are preinstalled or managed outside the operator
driver.image.* | harbor.k8s.rebellions.in/playground/rbln-driver:3.0.0-rc3, IfNotPresent | Use a private registry or pin a specific driver version
driver.imagePullSecrets | [] | Required when pulling from a private registry
driver.nodeSelector / nodeAffinity / tolerations | Empty by default | Constrain driver pods to specific nodes or architectures
driver.labels / annotations | {} | Add metadata for policy, audit, or monitoring integration
driver.priorityClassName | "" | Increase scheduling priority (for example, system-node-critical)
driver.resources | {} | Set CPU/memory requests and limits
driver.args / driver.env | [] | Pass custom arguments or environment variables (for example, log levels)
driver.manager.* | rebellions/rbln-k8s-driver-manager:latest, IfNotPresent | Override the Driver Manager image or version

Validator Configuration

Key | Default / Description | When to Customize
validator.registry / image / tag | harbor.k8s.rebellions.in, rebellions/rbln-npu-operator-validator, driver-test | Override the validator image location or version
validator.pullPolicy | IfNotPresent | Reuse cached images when available
validator.imagePullSecrets | [] | Required when pulling from a private registry
validator.resources | {} | Set CPU/memory requests and limits
validator.args / validator.env | [] | Pass custom arguments or environment variables

Container Toolkit Configuration

Key | Default / Description | When to Customize
containerToolkit.enabled | true | Disable if CDI specs and runtime config are managed externally
containerToolkit.image.* | harbor.k8s.rebellions.in/rebellions/rbln-container-toolkit:latest, Always | Override the toolkit image location or version
containerToolkit.imagePullSecrets | [] | Required when pulling from a private registry
containerToolkit.resources | {} | Set CPU/memory requests and limits
containerToolkit.args / containerToolkit.env | [] | Pass custom arguments or environment variables

VM Passthrough Stack

Key | Default / Description | When to Customize
sandboxDevicePlugin.enabled | false (disabled by default) | Enable for KubeVirt or other VM environments
sandboxDevicePlugin.resourceList[] | rebellions.ai/ATOM_PT | Match VFIO resource names per card model
sandboxDevicePlugin.vfioChecker.image.* | rebellions/rbln-vfio-manager:latest | Swap in a different validation image
vfioManager.enabled | false | Enable when configuring VM passthrough
vfioManager.image.* | rebellions/rbln-vfio-manager:latest | Pin to a private registry or a specific version
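
As a sketch of how the VM passthrough keys might be combined, the following writes a values override that enables both passthrough components and turns off the container device plugin for a VM-only cluster. The file name is arbitrary, the nesting assumes the usual Helm convention for dotted keys, and whether your cluster should disable devicePlugin depends on your workloads.

```shell
# Write a values override for a VM-only (KubeVirt) cluster.
# Key names come from the tables in this section; the file name is arbitrary.
cat > vm-passthrough-values.yaml <<'EOF'
# Enable the VM passthrough stack (both keys default to false).
sandboxDevicePlugin:
  enabled: true
vfioManager:
  enabled: true
# Assumption: no container workloads, so the container device plugin is off.
devicePlugin:
  enabled: false
EOF
```

Because the install command uses --generate-name, look up the release name with helm list -n rbln-system. Then apply the file with helm upgrade <release> rebellions/rbln-npu-operator -n rbln-system -f vm-passthrough-values.yaml.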