Installing the RBLN NPU Operator

This page covers how to perform a fresh installation of the RBLN NPU Operator from its Helm chart, verify the installation, run a quick-start NPU workload, pin workloads to specific NPU models, and customize the chart.

For an architectural overview of the operator, see RBLN NPU Operator. For upgrade and removal, see Upgrading the NPU Operator and Uninstalling the NPU Operator.

Prerequisites

  • Kubernetes 1.19 or later cluster with access to kubectl and helm
  • Helm 3.8 or later (required for OCI registry support)
  • Node Feature Discovery installed in the cluster. To keep its lifecycle separate from the operator, we recommend deploying NFD separately using the upstream Helm chart into a dedicated node-feature-discovery namespace. The chart's bundled nfd.enabled=true option is convenient for quick trials, but it ties NFD's lifecycle to this Helm release.
  • A dedicated namespace for the operator, such as rbln-system
  • Worker nodes equipped with NPUs
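
If NFD is not yet installed, a standalone deployment typically looks like the following. The chart location shown is the upstream NFD project's OCI registry; confirm the current chart source and version against the Node Feature Discovery documentation:

```shell
$ helm install nfd oci://registry.k8s.io/nfd/charts/node-feature-discovery \
     --namespace node-feature-discovery --create-namespace
```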

If you do not have Helm installed (or your version is below 3.8), install it first:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

Creating an Image Pull Secret

The driver container image and rbln-daemon container image are hosted on repo.rebellions.ai, which requires RBLN Portal account authentication. Before installing, create a Docker registry secret in the operator namespace:

$ kubectl create secret docker-registry drivercred \
  --docker-server=repo.rebellions.ai \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  --docker-email=<your-email> \
  -n rbln-system

Using this exact secret name matches the chart's default driver.imagePullSecrets, so you do not need to override any Helm values.
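
You can confirm the secret exists and has the expected type before proceeding (the AGE shown is illustrative):

```shell
$ kubectl get secret drivercred -n rbln-system
NAME         TYPE                             DATA   AGE
drivercred   kubernetes.io/dockerconfigjson   1      10s
```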

Driver Installation

Before deploying the operator chart, decide whether the operator should install the kernel driver through a container or detect a driver installed directly on each host. See NPU Driver Installation for the two modes and how to configure them with the driver.enabled chart value.

Installing the NPU Operator from the OCI Registry

Starting with v0.3.4, the chart is published to Docker Hub as an OCI artifact at oci://docker.io/rebellions/rbln-npu-operator-chart. Pin a version for reproducible installation. Available versions are listed on the chart page on Docker Hub.

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace

Override individual values with --set:

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace \
     --set driver.image.tag="3.0.0"

Or supply a custom values file:

$ export CHART_VERSION=0.4.0

$ helm install rbln-npu-operator \
     oci://docker.io/rebellions/rbln-npu-operator-chart \
     --version ${CHART_VERSION} \
     --namespace rbln-system --create-namespace \
     -f my-values.yaml

Previous chart repository

Earlier versions (≤0.3.3) remain available from the GitHub Pages chart repository at https://rbln-sw.github.io/rbln-npu-operator, but new releases are published only to OCI.
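
If you need one of those earlier versions, add the repository the classic way. The chart name within the repository is assumed to match the OCI artifact name; run helm search repo to confirm it:

```shell
$ helm repo add rbln-npu-operator https://rbln-sw.github.io/rbln-npu-operator
$ helm repo update
$ helm install rbln-npu-operator rbln-npu-operator/rbln-npu-operator-chart \
     --version 0.3.3 --namespace rbln-system --create-namespace
```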

Verifying the Installation

After Helm reports a successful installation, ensure that the RBLNClusterPolicy exists and has been reconciled. The operator's custom resources (RBLNClusterPolicy and RBLNDriver) are cluster-scoped, so the -n flag is not needed:

$ kubectl get rblnclusterpolicies.rebellions.ai
NAME                  STATUS   AGE
rbln-cluster-policy   ready    8m

If driver management is enabled, confirm that the RBLNDriver custom resource is created:

$ kubectl get rblndrivers.rebellions.ai
NAME          STATUS   AGE
rbln-driver   ready    99m

The STATUS column reflects .status.state (ready or notReady), and the operator reconciles this value.
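
Because STATUS mirrors .status.state, you can also read the field directly, which is convenient in install scripts or CI:

```shell
$ kubectl get rblndrivers.rebellions.ai rbln-driver -o jsonpath='{.status.state}'
ready
```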

Next, check that the controller and component pods are running in the operator namespace:

$ kubectl get pods -n rbln-system
NAME                                             READY   STATUS    AGE
controller-manager-797798d7b8-rjzht              1/1     Running   8m
rbln-daemon-45k2k                                1/1     Running   8m
rbln-device-plugin-4qgxc                         1/1     Running   8m
rbln-metrics-exporter-jghbg                      1/1     Running   8m
rbln-npu-feature-discovery-zg47r                 1/1     Running   8m
rbln-container-toolkit-ttz2c                     1/1     Running   8m
rbln-driver-ubuntu22.04-6.8.0-90-generic-6gtrc   1/1     Running   8m
rbln-operator-validator-qhf4t                    1/1     Running   8m

If all pods show Running, the operator is healthy. Pod names follow the rbln-<component>-* pattern. See Core Components for the role of each component. The driver pod's suffix (<os>-<kernel>) is composed from NFD labels. See Driver Image Selection for the rules.

If any pod is stuck or CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review the Helm values for missing prerequisites.
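
For example, to inspect a failing pod's events and the logs of its previous (crashed) container:

```shell
$ kubectl describe pod <pod> -n rbln-system   # the Events section lists image pull or scheduling failures
$ kubectl logs <pod> -n rbln-system --previous
```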

When the operator installs the driver through the driver container (driver.enabled=true in the chart), you can verify that the kernel module is loaded and matches the requested driver version. See Verifying the running driver.

Checking NPU Status

Check the NPU capacity that Kubernetes reports for each node. This confirms that the device plugin exposed the expected resources. With devicePlugin.useGenericResourceName: true (the chart default since device plugin v0.4.0), the operator exposes the generic resource name rebellions.ai/npu:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,NPUs:.status.capacity.'rebellions\.ai/npu'
NAME                            NPUs
rbln-npu-worker-01              16

If the cluster contains multiple NPU models and you need to schedule workloads onto a specific one, combine the generic rebellions.ai/npu resource with the product label applied by NPU Feature Discovery. See Targeting a Specific NPU Product below.
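
To see which product each node carries, print the product label as an extra column. The node name and label value below are illustrative:

```shell
$ kubectl get nodes -L rebellions.ai/npu.product
NAME                 STATUS   ROLES    AGE   VERSION   NPU.PRODUCT
rbln-npu-worker-01   Ready    <none>   21d   v1.29.0   RBLN-CA25
```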

Creating a Pod with an NPU

  1. Create a manifest (for example, npu-demo-pod.yaml). This example requests four rebellions.ai/npu devices:

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-pod
    spec:
      containers:
        - name: ubuntu
          image: ubuntu:latest
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c", "--"]
          args: ["while true; do sleep 300000; done;"]
          resources:
            limits:
              rebellions.ai/npu: 4
    

  2. Create the Pod.

    $ kubectl apply -f npu-demo-pod.yaml
    

  3. Verify the Pod status and resource assignment.

    $ kubectl get pod npu-pod
    
    Use kubectl describe pod npu-pod to confirm that the requested NPU resources are bound and that the Pod is scheduled onto a node with an NPU.
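
When you are done with the demo, delete the Pod so its NPUs return to the node's allocatable pool:

```shell
$ kubectl delete pod npu-pod
```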

Targeting a Specific NPU Product

In a cluster with mixed NPU products (for example, both RBLN-CA25 and RBLN-CR03 cards), a Pod that requests the generic rebellions.ai/npu resource may be scheduled onto any node with an NPU. Use the product label applied by NPU Feature Discovery, rebellions.ai/npu.product, together with nodeSelector or nodeAffinity to pin the workload to a specific model.

nodeSelector: single product

apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-ca25
spec:
  nodeSelector:
    rebellions.ai/npu.product: RBLN-CA25
  containers:
    - name: workload
      image: ubuntu:latest
      resources:
        limits:
          rebellions.ai/npu: 1

nodeAffinity: multiple products

apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-atom-family
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: rebellions.ai/npu.family
                operator: In
                values: [ATOM]
  containers:
    - name: workload
      image: ubuntu:latest
      resources:
        limits:
          rebellions.ai/npu: 1
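
After applying either manifest, you can verify that the Pod landed on a node with the matching label. A sketch using the single-product example (label escaping follows kubectl's JSONPath syntax, as in the capacity query above):

```shell
$ kubectl get pod npu-pod-ca25 -o wide
$ NODE=$(kubectl get pod npu-pod-ca25 -o jsonpath='{.spec.nodeName}')
$ kubectl get node ${NODE} -o jsonpath="{.metadata.labels['rebellions\.ai/npu\.product']}"
RBLN-CA25
```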

Previous resource names by card

The per-card resource names (rebellions.ai/ATOM, rebellions.ai/REBEL) are no longer recommended for new deployments. Prefer the generic rebellions.ai/npu resource paired with NFD labels as shown above. See RBLN NPU Feature Discovery for the full label reference.

Configuration Reference

The tables below summarize the key configuration options in values.yaml. Each section maps to an entry in the Core Components overview.

Pass any of these values with --set <key>=<value> (or -f my-values.yaml) when running helm install or helm upgrade.
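
For example, a minimal my-values.yaml that enables driver management and pins the driver image tag might look like this (the tag shown is illustrative; both keys appear in the tables below):

```yaml
driver:
  enabled: true
  image:
    tag: "3.0.0"
```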

To see the chart's full values list and template comments, including keys not shown in the tables below, run:

$ export CHART_VERSION=0.4.0

$ helm show values oci://docker.io/rebellions/rbln-npu-operator-chart --version ${CHART_VERSION}

The image.* blocks follow the standard Helm chart convention with four sub-keys:

image:
  registry: <registry>
  repository: <repository>
  tag: <chart default>     # actual value fixed by the chart; see `helm show values` above
  pullPolicy: <pullPolicy>

In the Default column below, this is condensed as <registry>/<repository>:<tag> plus the pullPolicy: on a second line.

Chart-wide

Key Description Default
nameOverride Prefix applied to every child resource name. Override to avoid collisions when running multiple operator instances. rbln-npu-operator
workloadType Workload mode for the cluster. Set to vm-passthrough for KubeVirt deployments. In this mode, the CRD also validates that vfioManager.enabled and sandboxDevicePlugin.enabled are true. container
nfd.enabled Whether the chart deploys Node Feature Discovery as a subchart. Leave false and install NFD separately through its upstream Helm chart to keep its lifecycle separate from the operator. Set true only for quick trials. false
podDefaults.labels Labels applied to every DaemonSet pod managed by the operator (replaces the removed daemonsets block). {}
podDefaults.annotations Annotations applied to every DaemonSet pod managed by the operator. {}
podDefaults.tolerations Tolerations applied to every DaemonSet pod managed by the operator. []
podDefaults.priorityClassName Priority class applied to every operator-managed DaemonSet pod. ""
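
As a sketch, the podDefaults block can attach a common toleration and priority class to every operator-managed DaemonSet pod. The taint key below is illustrative, not one the operator defines:

```yaml
podDefaults:
  tolerations:
    - key: "npu-only"          # illustrative taint key on your NPU nodes
      operator: "Exists"
      effect: "NoSchedule"
  priorityClassName: "system-node-critical"
```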

Operator

Key Description Default
operator.image.* Image of the controller-manager pod. Override tag to pin a specific operator version. docker.io/rebellions/rbln-npu-operator:<chart default>
pullPolicy: IfNotPresent
operator.replicas Number of operator pods. Increase to 2+ for high availability. 1
operator.resources.requests Minimum guaranteed resources for the operator pod. cpu: 50m, memory: 128Mi
operator.resources.limits Maximum resources for the operator pod. cpu: 500m, memory: 256Mi
operator.service.type Service type for the operator's webhook/metrics endpoint. Change when integrating with ingress or service mesh. ClusterIP
operator.service.port Service port. 8443
operator.service.targetPort Container port the service forwards to. 8443
operator.securityContext.runAsNonRoot Pod-level security context. Adjust based on your cluster's security policy. true
operator.affinity Affinity for the operator pod (e.g., to pin to control-plane nodes). {}
operator.tolerations Tolerations for the operator pod (for example, to allow scheduling on nodes with taints). []

Driver Manager

Key Description Default
driver.enabled Whether the operator installs and manages the NPU driver. Leave false if drivers are already installed on hosts. false
driver.image.* Driver container image. Override this value to pin a private mirror or a specific driver release. repo.rebellions.ai/rebellions/rbln-driver:<chart default>
pullPolicy: IfNotPresent
driver.imagePullSecrets Image pull secret for the driver image. Change this value to match your actual secret name if you do not use the drivercred name created earlier. [drivercred]
driver.nodeSelector Restrict driver pods to specific nodes. {}
driver.tolerations Tolerations for driver pods. []
driver.annotations Annotations for driver pods. {}
driver.priorityClassName Pod priority class. The CRD defaults to system-node-critical when this value is empty. ""
driver.resources CPU/memory requests and limits for the driver pod. The CRD field is required; the chart fills defaults when unset. {}
driver.env Environment variables passed to the driver container (e.g., log levels). []
driver.manager.image.* Image of the driver manager initContainer that performs reconciliation on the node. docker.io/rebellions/rbln-k8s-driver-manager:<chart default>
pullPolicy: IfNotPresent

driver.upgradePolicy.* keys (autoUpgrade, drain, reboot, etc.) are documented in NPU Driver Upgrade Workflow.

Device Plugin

Key Description Default
devicePlugin.enabled Whether the standard container Device Plugin is deployed. Disable when running only DRA (draKubeletPlugin) or only VM workloads. true
devicePlugin.image.* Device Plugin image. docker.io/rebellions/k8s-device-plugin:<chart default>
pullPolicy: IfNotPresent
devicePlugin.useGenericResourceName Whether to expose the generic rebellions.ai/npu resource. Keep true; setting it to false selects the per-card naming mode, which is not recommended for new deployments. true

DRA Driver

Key Description Default
draKubeletPlugin.enabled Enable on Kubernetes 1.34 or later to use Dynamic Resource Allocation. Mutually exclusive with devicePlugin.enabled. false
draKubeletPlugin.image.* DRA kubelet plugin image. docker.io/rebellions/k8s-dra-driver-npu:<chart default>
pullPolicy: IfNotPresent
draKubeletPlugin.driverName Driver name; must match the value referenced from DeviceClass.spec.config.driver. npu.rebellions.ai
draKubeletPlugin.kubeletRegistrarDirectoryPath Host path where the plugin registers with the kubelet. /var/lib/kubelet/plugins_registry
draKubeletPlugin.kubeletPluginsDirectoryPath Host path where the plugin sockets live. /var/lib/kubelet/plugins
draKubeletPlugin.healthcheckPort TCP port for the plugin's health-check endpoint. 51515

See NPU DRA Driver for full DRA usage.

Sandbox Device Plugin

Key Description Default
sandboxDevicePlugin.enabled Whether the VFIO Sandbox Device Plugin is deployed. Enable for KubeVirt or other VM environments. false
sandboxDevicePlugin.image.* Sandbox Device Plugin image. docker.io/rebellions/k8s-device-plugin:<chart default>
pullPolicy: IfNotPresent
sandboxDevicePlugin.resourceList[] VFIO resource map by card. Add entries as new SKUs ship. RBLN-CA12 → ATOM_PT
RBLN-CA25 → ATOM_MAX_PT
RBLN-CR03 → REBEL_PT

VFIO Manager

Key Description Default
vfioManager.enabled Whether the VFIO bind/unbind helper is deployed. Enable it together with the Sandbox Device Plugin for VM passthrough. false
vfioManager.image.* VFIO Manager image. docker.io/rebellions/rbln-vfio-manager:<chart default>
pullPolicy: IfNotPresent

Container Toolkit

Key Description Default
containerToolkit.enabled Whether the Container Toolkit DaemonSet is deployed. Disable if CDI specs and runtime config are managed externally. true
containerToolkit.image.* Container Toolkit image. docker.io/rebellions/rbln-container-toolkit:<chart default>
pullPolicy: IfNotPresent
containerToolkit.imagePullSecrets Image pull secret(s); required when pulling from a private registry. []
containerToolkit.resources CPU/memory requests and limits for the toolkit pod. {}
containerToolkit.env Environment variables passed to the toolkit, including RBLN_CTK_DAEMON_SOCKET and RBLN_CTK_DAEMON_CONFIG_PATH. []

NPU Feature Discovery

Key Description Default
npuFeatureDiscovery.enabled Whether the operator deploys NPU Feature Discovery. Disable if you manage NPU node labels through another mechanism. true
npuFeatureDiscovery.image.* NPU Feature Discovery image. docker.io/rebellions/rbln-npu-feature-discovery:<chart default>
pullPolicy: IfNotPresent

Metrics Exporter

Key Description Default
metricsExporter.enabled Whether the Prometheus metrics exporter is deployed. Disable when another telemetry pipeline already covers NPUs. true
metricsExporter.image.* Metrics exporter image. docker.io/rebellions/rbln-metrics-exporter:<chart default>
pullPolicy: IfNotPresent

Operator Validator

Key Description Default
validator.image.* Validator DaemonSet image. docker.io/rebellions/rbln-npu-operator-validator:<chart default>
pullPolicy: IfNotPresent
validator.imagePullSecrets Image pull secret(s); required when pulling from a private registry. []
validator.resources CPU/memory requests and limits for the Validator pod. {}
validator.env Environment variables for the top-level Validator process. []
validator.toolkit.env Environment variables for the Container Toolkit readiness subcheck. []
validator.driver.env Environment variables for the driver readiness subcheck. []

RBLN Daemon

Key Description Default
rblnDaemon.enabled Whether the rbln-daemon host service is deployed on each node. Enable when cluster workloads depend on the RBLN Daemon. false
rblnDaemon.image.* RBLN Daemon image. repo.rebellions.ai/rebellions/rbln-daemon:<chart default>
pullPolicy: IfNotPresent
rblnDaemon.imagePullSecrets Image pull secret. Uses the same drivercred secret as the driver image. [drivercred]
rblnDaemon.hostPort TCP port the daemon listens on, exposed via hostPort on each node. 50051
rblnDaemon.resources CPU/memory requests and limits for the daemon pod. {}
rblnDaemon.env Environment variables passed to the daemon container. []