
RBLN NPU Operator

The rbln-npu-operator automates the deployment and management of all Rebellions software components required for provisioning the RBLN NPU family across Kubernetes and OpenShift clusters. A singleton RBLNClusterPolicy custom resource drives the entire workflow. Once that policy is created, the operator automatically performs:

  • Hardware detection and node labeling via Node Feature Discovery (NFD)
  • Device exposure for two workload types: container passthrough and VM passthrough
  • Deployment and lifecycle management of adjacent components, including VFIO binding, device plugins, and metrics exporters
  • Automatic RBAC/SCC configuration so it works on both OpenShift and vanilla Kubernetes

Core Components

The table below summarizes the key components that the operator deploys and manages.

Component | Description
--- | ---
Device Plugin | Publishes resources such as rebellions.ai/ATOM to Pods using the rebellions/k8s-device-plugin image. Maps per-card resources in resourceList.
Sandbox Device Plugin | Exposes resources like rebellions.ai/ATOM_PT for VM passthrough. Works with the VFIO checker container and targets KubeVirt environments.
VFIO Manager | Delivers the rebellions/rbln-vfio-manager image and vfio-manage.sh script through a ConfigMap to handle VFIO bind/unbind.
NPU Feature Discovery | Uses rebellions/rbln-npu-feature-discovery to detect PCI-1eff vendor devices and apply labels such as rebellions.ai/npu.present.
Metrics Exporter | Exposes NPU metrics in Prometheus format through the rebellions/rbln-metrics-exporter image.
Operator Manager | The controller loaded from cmd/main.go enforces the singleton RBLNClusterPolicy and watches for changes to Kubernetes Nodes and component DaemonSets.

Installing the NPU Operator

The RBLN NPU Operator Helm chart automatically discovers RBLN NPUs in your cluster, deploys the required device plugins, and monitors the health of the related components. Follow these steps to get up and running quickly.

  1. Prerequisites

    • Kubernetes 1.19+ cluster with access to kubectl and helm
    • A dedicated namespace such as rbln-system is recommended for the operator
    • Worker nodes equipped with NPUs and Node Feature Discovery installed (set nfd.enabled=true in the chart to install it together)
  2. Install Helm (if needed)

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
    

  3. Add the Rebellions Helm repository

    helm repo add rebellions https://rebellions-sw.github.io/rbln-npu-operator
    helm repo update
    

  4. Install the NPU Operator

    helm install --wait --generate-name \
         -n rbln-system --create-namespace \
         rebellions/rbln-npu-operator
    

Verify Installation

After Helm reports a successful install, confirm that the RBLNClusterPolicy custom resource is present and reconciled:

    kubectl get rblnclusterpolicies.rebellions.ai -n rbln-system
    NAME                  AGE
    rbln-cluster-policy   8m

Next, inspect the operator namespace to verify the health of the controller and operand pods:

    kubectl get pods -n rbln-system
    NAME                                             READY   STATUS    AGE
    controller-manager-797798d7b8-rjzht              1/1     Running   8m
    rbln-device-plugin-4qgxc                         1/1     Running   8m
    rbln-metrics-exporter-jghbg                      1/1     Running   8m
    rbln-npu-feature-discovery-zg47r                 1/1     Running   8m

  • Controller component: controller-manager-* reconciles the RBLNClusterPolicy.
  • Operands: rbln-device-plugin, rbln-npu-feature-discovery, and rbln-metrics-exporter handle device exposure, labeling, and telemetry.

If any pod stays in a non-Running state such as Pending or CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
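
The health check above can be scripted so that only problem pods are listed. The following is a minimal sketch, assuming the default column order of kubectl get pods --no-headers (NAME, READY, STATUS, ...). The function name list_unhealthy_pods is hypothetical and not part of the operator.

```shell
# Hypothetical helper: print pods whose STATUS is not Running/Completed,
# or that are Running but not fully ready (for example 0/1).
list_unhealthy_pods() {
  ns="${1:-rbln-system}"
  kubectl get pods -n "$ns" --no-headers | awk '
    {
      split($2, ready, "/")   # READY column looks like "1/1"
      not_ok = ($3 != "Running" && $3 != "Completed")
      half_ready = ($3 == "Running" && ready[1] != ready[2])
      if (not_ok || half_ready) print $1, $2, $3
    }'
}
```

Calling list_unhealthy_pods rbln-system prints one line per affected pod (name, ready count, status); an empty result means every pod is Running and ready. Pass each reported name to kubectl logs as described above.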

Checking NPU Status

Inspect the NPU capacity that Kubernetes reports for each node to confirm that the device plugin published the expected resources:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/ATOM'
    NAME                            NPUs
    rbln-npu-worker-01              16

If you expose additional resource names (for example, rebellions.ai/ATOM_PT for sandbox workloads), substitute that resource in the command and confirm the corresponding column is populated.
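
For clusters with many nodes, the same query can be summarized per node with a cluster-wide total. This is a rough sketch, assuming kubectl prints <none> for nodes that do not report the resource. The function name npu_capacity_report and its argument (the resource suffix, such as ATOM or ATOM_PT) are hypothetical.

```shell
# Hypothetical helper: list each node's reported NPU count, then the total.
# $1 is the resource suffix after "rebellions.ai/" (defaults to ATOM).
npu_capacity_report() {
  resource="${1:-ATOM}"
  kubectl get nodes --no-headers \
    -o custom-columns="NAME:.metadata.name,NPUS:.status.capacity.rebellions\.ai/${resource}" |
    awk '
      $2 ~ /^[0-9]+$/ { total += $2; print $1 ": " $2; next }
      { print $1 ": none" }    # node does not report this resource
      END { print "total: " (total + 0) }'
}
```

Run npu_capacity_report for container workloads or npu_capacity_report ATOM_PT for the sandbox resource.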

Creating an NPU-enabled Pod

  1. Create a manifest (for example, npu-demo-pod.yaml) that requests four rebellions.ai/ATOM devices:

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-pod
    spec:
      containers:
        - name: ubuntu
          image: ubuntu:latest
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "--"]
          args: ["while true; do sleep 300000; done;"]
          resources:
            limits:
              rebellions.ai/ATOM: 4
    

  2. Create the Pod.

    kubectl apply -f npu-demo-pod.yaml
    

  3. Verify the Pod status and resource assignment.

    kubectl get pod npu-pod
    
    Use kubectl describe pod npu-pod to confirm the requested NPU resources are bound and the Pod is scheduled onto an NPU-capable node.
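
The verification step can also be done non-interactively. The sketch below waits for the Pod to become Ready and then prints the node it landed on together with the NPU limit of its first container. The function name check_npu_pod and the 120-second timeout are illustrative assumptions.

```shell
# Hypothetical helper: wait for the demo Pod, then print "<node> <ATOM limit>".
check_npu_pod() {
  pod="${1:-npu-pod}"
  # Give up if the Pod is not Ready within the (assumed) timeout.
  kubectl wait --for=condition=Ready "pod/${pod}" --timeout=120s >/dev/null || return 1
  # Dots inside the resource name must be escaped in kubectl JSONPath.
  kubectl get pod "$pod" \
    -o jsonpath='{.spec.nodeName}{" "}{.spec.containers[0].resources.limits.rebellions\.ai/ATOM}{"\n"}'
}
```

For the manifest above, a successful run prints the node name followed by 4. A non-zero exit status usually means the Pod is still Pending, for example because no node has four free rebellions.ai/ATOM devices.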

Chart Customization Options

The tables below group commonly tuned values.yaml entries and describe when you might adjust them.

Global & Dependencies

Key | Default / Description | When to Customize
--- | --- | ---
name | rbln – prefix for all child resources | Avoid naming collisions when running multiple operator instances
nfd.enabled | false – whether the chart installs NFD | Set to true if your cluster does not already run NFD
npuFeatureDiscovery.enabled / image.* | true, rebellions/rbln-npu-feature-discovery:latest | Disable if you already manage labels manually

Operator Deployment

Key | Default / Description | When to Customize
--- | --- | ---
operator.image.* | rebellions/rbln-npu-operator:latest | Pin to a private registry or a specific version
operator.replicas | 1 | Increase to 2+ for high availability
operator.resources | CPU/memory requests and limits | Tune to fit cluster capacity
operator.service.* | ClusterIP, ports 8443/8443 | Change when integrating with ingress/service mesh
operator.securityContext.runAsNonRoot | true | Relax or tighten privileges per security policy
operator.affinity / tolerations | Empty by default | Force scheduling on control-plane or specific worker nodes

Container Workload Stack

Key | Default / Description | When to Customize
--- | --- | ---
devicePlugin.enabled | true | Disable if you only run VM workloads
devicePlugin.image.* | rebellions/k8s-device-plugin:latest | Pin to a private registry or a specific version
devicePlugin.resourceList[] | rebellions.ai/ATOM | Add custom resource names or prefixes
metricsExporter.enabled / image.* | true, rebellions/rbln-metrics-exporter:latest | Disable when another telemetry pipeline exists

VM Passthrough Stack

Key | Default / Description | When to Customize
--- | --- | ---
sandboxDevicePlugin.enabled | false (disabled by default) | Enable for KubeVirt or other VM environments
sandboxDevicePlugin.resourceList[] | rebellions.ai/ATOM_PT | Match VFIO resource names per card model
sandboxDevicePlugin.vfioChecker.image.* | rebellions/rbln-vfio-manager:latest | Swap in a different validation image
vfioManager.enabled | false | Enable when configuring VM passthrough
vfioManager.image.* | rebellions/rbln-vfio-manager:latest | Pin to a private registry or a specific version
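
As an illustration of overriding these values, the sketch below enables the VM passthrough components on an existing release. It assumes that sandboxDevicePlugin.enabled and vfioManager.enabled are switched on together, which the tables suggest but do not state outright. The function name enable_vm_passthrough is hypothetical. Because the install step used --generate-name, look up the actual release name with helm list -n rbln-system first.

```shell
# Hypothetical helper: turn on the VM passthrough stack for an existing release.
# $1 is the Helm release name (see: helm list -n rbln-system).
enable_vm_passthrough() {
  release="$1"
  if [ -z "$release" ]; then
    echo "usage: enable_vm_passthrough <release-name>" >&2
    return 1
  fi
  # --reuse-values keeps earlier overrides; only the two flags below change.
  helm upgrade "$release" rebellions/rbln-npu-operator \
    -n rbln-system --reuse-values \
    --set sandboxDevicePlugin.enabled=true \
    --set vfioManager.enabled=true
}
```

The other keys in the tables can be overridden the same way, either with more --set flags or by collecting them in a values file passed with -f.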