Installing the RBLN NPU Operator

This page covers how to perform a fresh installation of the RBLN NPU Operator from its Helm chart, verify the installation, run a quick-start NPU workload, pin workloads to specific NPU models, and customize the chart.

For an architectural overview of the operator, see RBLN NPU Operator. For upgrade and removal, see Upgrading the NPU Operator and Uninstalling the NPU Operator.

Prerequisites

  • Kubernetes 1.19 or later cluster with access to kubectl and helm
  • Helm 3.8 or later (required for OCI registry support)
  • Node Feature Discovery installed in the cluster. To keep its lifecycle separate from the operator, we recommend deploying NFD separately using the upstream Helm chart into a dedicated node-feature-discovery namespace. The chart's bundled nfd.enabled=true option is convenient for quick trials, but it ties NFD's lifecycle to this Helm release.
  • A dedicated namespace for the operator, such as rbln-system
  • Worker nodes equipped with NPUs
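
If NFD is not yet installed, a standalone deployment typically looks like the following. The chart location shown is the upstream NFD project's OCI registry; confirm the current chart source and version against the Node Feature Discovery documentation:

```shell
$ helm install nfd oci://registry.k8s.io/nfd/charts/node-feature-discovery \
     --namespace node-feature-discovery --create-namespace
```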

If you do not have Helm installed (or your version is below 3.8), install it first:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

Creating an Image Pull Secret

The driver container image and rbln-daemon container image are hosted on repo.rebellions.ai, which requires RBLN Portal account authentication. Before installing, create a Docker registry secret in the operator namespace:

$ kubectl create secret docker-registry drivercred \
  --docker-server=repo.rebellions.ai \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  --docker-email=<your-email> \
  -n rbln-system

Using this exact secret name matches the chart's default driver.imagePullSecrets, so you do not need to override any Helm values.
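
You can confirm the secret exists and has the expected type before proceeding (the AGE shown is illustrative):

```shell
$ kubectl get secret drivercred -n rbln-system
NAME         TYPE                             DATA   AGE
drivercred   kubernetes.io/dockerconfigjson   1      10s
```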

Driver Installation

Before deploying the operator chart, decide whether the operator should install the kernel driver through a container or detect a driver installed directly on each host. See NPU Driver Installation for the two modes and how to configure them with the driver.enabled chart value.

Installing the NPU Operator from the OCI Registry

Starting with v0.3.4, the chart is published to Docker Hub as an OCI artifact at oci://docker.io/rebellions/rbln-npu-operator-chart. Pin a version for reproducible installation. Available versions are listed on the chart page on Docker Hub.

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace

Override individual values with --set:

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace \
     --set driver.image.tag="3.0.0"

Or supply a custom values file:

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace \
     -f my-values.yaml

Previous chart repository

Earlier versions (≤0.3.3) remain available from the GitHub Pages chart repository at https://rbln-sw.github.io/rbln-npu-operator, but new releases are published only to OCI.
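
If you need one of those earlier versions, add the repository the classic way. The chart name within the repository is assumed to match the OCI artifact name; run helm search repo to confirm it:

```shell
$ helm repo add rbln-npu-operator https://rbln-sw.github.io/rbln-npu-operator
$ helm repo update
$ helm install rbln-npu-operator rbln-npu-operator/rbln-npu-operator-chart \
     --version 0.3.3 --namespace rbln-system --create-namespace
```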

Verifying the Installation

After Helm reports a successful installation, ensure that the RBLNClusterPolicy exists and has been reconciled. The operator's custom resources (RBLNClusterPolicy and RBLNDriver) are cluster-scoped, so the -n flag is not needed:

$ kubectl get rblnclusterpolicies.rebellions.ai
NAME                  STATUS   AGE
rbln-cluster-policy   ready    8m

If driver management is enabled, confirm that the RBLNDriver custom resource is created:

$ kubectl get rblndrivers.rebellions.ai
NAME          STATUS   AGE
rbln-driver   ready    99m

The STATUS column reflects .status.state (ready or notReady), and the operator reconciles this value.
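
Because STATUS mirrors .status.state, you can also read the field directly, which is convenient in install scripts or CI:

```shell
$ kubectl get rblndrivers.rebellions.ai rbln-driver -o jsonpath='{.status.state}'
ready
```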

Next, check that the controller and component pods are running in the operator namespace:

$ kubectl get pods -n rbln-system
NAME                                             READY   STATUS    AGE
controller-manager-797798d7b8-rjzht              1/1     Running   8m
rbln-daemon-45k2k                                1/1     Running   8m
rbln-device-plugin-4qgxc                         1/1     Running   8m
rbln-metrics-exporter-jghbg                      1/1     Running   8m
rbln-npu-feature-discovery-zg47r                 1/1     Running   8m
rbln-container-toolkit-ttz2c                     1/1     Running   8m
rbln-driver-ubuntu22.04-6.8.0-90-generic-6gtrc   1/1     Running   8m
rbln-operator-validator-qhf4t                    1/1     Running   8m

If all pods show Running, the operator is healthy. Pod names follow the rbln-<component>-* pattern. See Core Components for the role of each component. The driver pod's suffix (<os>-<kernel>) is composed from NFD labels. See Driver Image Selection for the rules.

If any pod is stuck or CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
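
For example, to inspect a failing pod's events and the logs of its previous (crashed) container:

```shell
$ kubectl describe pod <pod> -n rbln-system   # the Events section lists image pull or scheduling failures
$ kubectl logs <pod> -n rbln-system --previous
```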

When the operator installs the driver through the driver container (driver.enabled=true in the chart), you can verify that the kernel module is loaded and matches the requested driver version. See Verifying the running driver.

Checking NPU Status

Check the NPU capacity that Kubernetes reports for each node. This confirms that the device plugin exposed the expected resources. With devicePlugin.useGenericResourceName: true (the chart default since device plugin v0.4.0), the operator exposes the generic resource name rebellions.ai/npu:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/npu'
NAME                            NPUs
rbln-npu-worker-01              16

If the cluster contains multiple NPU models and you need to schedule workloads onto a specific one, combine the generic rebellions.ai/npu resource with the product label applied by NPU Feature Discovery. See Targeting a Specific NPU Product below.
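
To see which product each node carries, print the product label as an extra column. The node name and label value below are illustrative:

```shell
$ kubectl get nodes -L rebellions.ai/npu.product
NAME                 STATUS   ROLES    AGE   VERSION   NPU.PRODUCT
rbln-npu-worker-01   Ready    <none>   21d   v1.29.0   RBLN-CA25
```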

Creating a Pod with an NPU

  1. Create a manifest (for example, npu-demo-pod.yaml). This example requests four rebellions.ai/npu devices:

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-pod
    spec:
      containers:
        - name: ubuntu
          image: ubuntu:latest
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "--"]
          args: ["while true; do sleep 300000; done;"]
          resources:
            limits:
              rebellions.ai/npu: 4
    

  2. Create the Pod.

    $ kubectl apply -f npu-demo-pod.yaml
    

  3. Verify the Pod status and resource assignment.

    $ kubectl get pod npu-pod
    
    Use kubectl describe pod npu-pod to confirm that the requested NPU resources are bound and that the Pod is scheduled onto a node with an NPU.
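
When you are done with the demo, delete the Pod so its NPUs return to the node's allocatable pool:

```shell
$ kubectl delete pod npu-pod
```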

Targeting a Specific NPU Product

In a cluster with mixed NPU products (for example, both RBLN-CA25 and RBLN-CR03 cards), a Pod that requests the generic rebellions.ai/npu resource may be scheduled onto any node with an NPU. Use the product label applied by NPU Feature Discovery, rebellions.ai/npu.product, together with nodeSelector or nodeAffinity to pin the workload to a specific model.

nodeSelector: single product

apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-ca25
spec:
  nodeSelector:
    rebellions.ai/npu.product: RBLN-CA25
  containers:
    - name: workload
      image: ubuntu:latest
      resources:
        limits:
          rebellions.ai/npu: 1

nodeAffinity: multiple products

apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-atom-family
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: rebellions.ai/npu.family
                operator: In
                values: [ATOM]
  containers:
    - name: workload
      image: ubuntu:latest
      resources:
        limits:
          rebellions.ai/npu: 1
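
After applying either manifest, you can verify that the Pod landed on a node with the matching label. A sketch using the single-product example (label escaping follows kubectl's JSONPath syntax, as in the capacity query above):

```shell
$ kubectl get pod npu-pod-ca25 -o wide
$ NODE=$(kubectl get pod npu-pod-ca25 -o jsonpath='{.spec.nodeName}')
$ kubectl get node ${NODE} -o jsonpath="{.metadata.labels['rebellions\.ai/npu\.product']}"
RBLN-CA25
```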

Previous resource names by card

The per-card resource names (rebellions.ai/ATOM, rebellions.ai/REBEL) are no longer recommended for new deployments. Prefer the generic rebellions.ai/npu resource paired with NFD labels as shown above. See RBLN NPU Feature Discovery for the full label reference.

Configuration Reference

The tables below summarize the key configuration options in values.yaml. Each section maps to an entry in the Core Components overview.

Pass any of these values with --set <key>=<value> (or -f my-values.yaml) when running helm install or helm upgrade.
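
For example, a minimal my-values.yaml that enables driver management and pins the driver image tag might look like this (the tag shown is illustrative; both keys appear in the tables below):

```yaml
driver:
  enabled: true
  image:
    tag: "3.0.0"
```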

To see the chart's full values list and template comments, including keys not shown in the tables below, run:

$ export CHART_VERSION=0.4.0

$ helm show values oci://docker.io/rebellions/rbln-npu-operator-chart --version ${CHART_VERSION}

The image.* blocks follow the standard Helm chart convention with four sub-keys:

image:
  registry: <registry>
  repository: <repository>
  tag: <chart default>     # actual value fixed by the chart; see `helm show values` above
  pullPolicy: <pullPolicy>

In the Default column below, this is condensed as <registry>/<repository>:<tag> plus the pullPolicy: on a second line.

Chart-wide

Key Description Default
nameOverride Prefix applied to every child resource name. Override to avoid collisions when running multiple operator instances. rbln-npu-operator
workloadType Workload mode for the cluster. Set to vm-passthrough for KubeVirt deployments. In this mode, the CRD also validates that vfioManager.enabled and sandboxDevicePlugin.enabled are true. container
nfd.enabled Whether the chart deploys Node Feature Discovery as a subchart. Leave false and install NFD separately through its upstream Helm chart to keep its lifecycle separate from the operator. Set true only for quick trials. false
podDefaults.labels Labels applied to every DaemonSet pod managed by the operator (replaces the removed daemonsets block). {}
podDefaults.annotations Annotations applied to every DaemonSet pod managed by the operator. {}
podDefaults.tolerations Tolerations applied to every DaemonSet pod managed by the operator. []
podDefaults.priorityClassName Priority class applied to every operator-managed DaemonSet pod. ""
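
As a sketch, the podDefaults block can attach a common toleration and priority class to every operator-managed DaemonSet pod. The taint key below is illustrative, not one the operator defines:

```yaml
podDefaults:
  tolerations:
    - key: "npu-only"          # illustrative taint key on your NPU nodes
      operator: "Exists"
      effect: "NoSchedule"
  priorityClassName: "system-node-critical"
```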

Operator

Key Description Default
operator.image.* Image of the controller-manager pod. Override tag to pin a specific operator version. docker.io/rebellions/rbln-npu-operator:<chart default>
pullPolicy: IfNotPresent
operator.replicas Number of operator pods. Increase to 2+ for high availability. 1
operator.resources.requests Minimum guaranteed resources for the operator pod. cpu: 50m, memory: 128Mi
operator.resources.limits Maximum resources for the operator pod. cpu: 500m, memory: 256Mi
operator.service.type Service type for the operator's webhook/metrics endpoint. Change when integrating with ingress or service mesh. ClusterIP
operator.service.port Service port. 8443
operator.service.targetPort Container port the service forwards to. 8443
operator.securityContext.runAsNonRoot Pod-level security context. Adjust based on your cluster's security policy. true
operator.affinity Affinity for the operator pod (e.g., to pin to control-plane nodes). {}
operator.tolerations Tolerations for the operator pod (for example, to allow scheduling on nodes with taints). []

Driver Manager

Key Description Default
driver.enabled Whether the operator installs and manages the NPU driver. Leave false if drivers are already installed on hosts. false
driver.image.* Driver container image. Override this value to pin a private mirror or a specific driver release. repo.rebellions.ai/rebellions/rbln-driver:<chart default>
pullPolicy: IfNotPresent
driver.imagePullSecrets Image pull secret for the driver image. Change this value to match your actual secret name if you do not use the drivercred name created earlier. [drivercred]
driver.nodeSelector Restrict driver pods to specific nodes. {}
driver.tolerations Tolerations for driver pods. []
driver.annotations Annotations for driver pods. {}
driver.priorityClassName Pod priority class. The CRD defaults to system-node-critical when this value is empty. ""
driver.resources CPU/memory requests and limits for the driver pod. The CRD field is required; the chart fills defaults when unset. {}
driver.env Environment variables passed to the driver container (e.g., log levels). []
driver.manager.image.* Image of the driver manager initContainer that performs reconciliation on the node. docker.io/rebellions/rbln-k8s-driver-manager:<chart default>
pullPolicy: IfNotPresent

driver.upgradePolicy.* keys (autoUpgrade, drain, reboot, etc.) are documented in NPU Driver Upgrade Workflow.

Device Plugin

Key Description Default
devicePlugin.enabled Whether the standard container Device Plugin is deployed. Disable when running only DRA (draKubeletPlugin) or only VM workloads. true
devicePlugin.image.* Device Plugin image. docker.io/rebellions/k8s-device-plugin:<chart default>
pullPolicy: IfNotPresent
devicePlugin.useGenericResourceName Whether to expose the generic rebellions.ai/npu resource. Keep true; setting it to false selects the per-card naming mode, which is not recommended for new deployments. true

DRA Driver

Key Description Default
draKubeletPlugin.enabled Enable on Kubernetes 1.34 or later to use Dynamic Resource Allocation. Mutually exclusive with devicePlugin.enabled. false
draKubeletPlugin.image.* DRA kubelet plugin image. docker.io/rebellions/k8s-dra-driver-npu:<chart default>
pullPolicy: IfNotPresent
draKubeletPlugin.driverName Driver name; must match the value referenced from DeviceClass.spec.config.driver. npu.rebellions.ai
draKubeletPlugin.kubeletRegistrarDirectoryPath Host path where the plugin registers with the kubelet. /var/lib/kubelet/plugins_registry
draKubeletPlugin.kubeletPluginsDirectoryPath Host path where the plugin sockets live. /var/lib/kubelet/plugins
draKubeletPlugin.healthcheckPort TCP port for the plugin's health-check endpoint. 51515

See NPU DRA Driver for full DRA usage.

Sandbox Device Plugin

Key Description Default
sandboxDevicePlugin.enabled Whether the VFIO Sandbox Device Plugin is deployed. Enable for KubeVirt or other VM environments. false
sandboxDevicePlugin.image.* Sandbox Device Plugin image. docker.io/rebellions/k8s-device-plugin:<chart default>
pullPolicy: IfNotPresent
sandboxDevicePlugin.resourceList[] VFIO resource map by card. Add entries as new SKUs ship. RBLN-CA12 → ATOM_PT
RBLN-CA25 → ATOM_MAX_PT
RBLN-CR03 → REBEL_PT

VFIO Manager

Key Description Default
vfioManager.enabled Whether the VFIO bind/unbind helper is deployed. Enable it together with the Sandbox Device Plugin for VM passthrough. false
vfioManager.image.* VFIO Manager image. docker.io/rebellions/rbln-vfio-manager:<chart default>
pullPolicy: IfNotPresent

Container Toolkit

Key Description Default
containerToolkit.enabled Whether the Container Toolkit DaemonSet is deployed. Disable if CDI specs and runtime config are managed externally. true
containerToolkit.image.* Container Toolkit image. docker.io/rebellions/rbln-container-toolkit:<chart default>
pullPolicy: IfNotPresent
containerToolkit.imagePullSecrets Image pull secret(s); required when pulling from a private registry. []
containerToolkit.resources CPU/memory requests and limits for the toolkit pod. {}
containerToolkit.env Environment variables passed to the toolkit, including RBLN_CTK_DAEMON_SOCKET and RBLN_CTK_DAEMON_CONFIG_PATH. []

NPU Feature Discovery

Key Description Default
npuFeatureDiscovery.enabled Whether the operator deploys NPU Feature Discovery. Disable if you manage NPU node labels through another mechanism. true
npuFeatureDiscovery.image.* NPU Feature Discovery image. docker.io/rebellions/rbln-npu-feature-discovery:<chart default>
pullPolicy: IfNotPresent

Metrics Exporter

Key Description Default
metricsExporter.enabled Whether the Prometheus metrics exporter is deployed. Disable when another telemetry pipeline already covers NPUs. true
metricsExporter.image.* Metrics exporter image. docker.io/rebellions/rbln-metrics-exporter:<chart default>
pullPolicy: IfNotPresent

Operator Validator

Key Description Default
validator.image.* Validator DaemonSet image. docker.io/rebellions/rbln-npu-operator-validator:<chart default>
pullPolicy: IfNotPresent
validator.imagePullSecrets Image pull secret(s); required when pulling from a private registry. []
validator.resources CPU/memory requests and limits for the Validator pod. {}
validator.env Environment variables for the top-level Validator process. []
validator.toolkit.env Environment variables for the Container Toolkit readiness subcheck. []
validator.driver.env Environment variables for the driver readiness subcheck. []

RBLN Daemon

Key Description Default
rblnDaemon.enabled Whether the rbln-daemon host service is deployed on each node. Enable when cluster workloads depend on the RBLN Daemon. false
rblnDaemon.image.* RBLN Daemon image. repo.rebellions.ai/rebellions/rbln-daemon:<chart default>
pullPolicy: IfNotPresent
rblnDaemon.imagePullSecrets Image pull secret. Uses the same drivercred secret as the driver image. [drivercred]
rblnDaemon.hostPort TCP port the daemon listens on, exposed via hostPort on each node. 50051
rblnDaemon.resources CPU/memory requests and limits for the daemon pod. {}
rblnDaemon.env Environment variables passed to the daemon container. []