Installing the RBLN NPU Operator
This page explains how to install the RBLN NPU Operator from its Helm chart, verify the installation, run a quick-start NPU workload, schedule workloads onto specific NPU models, and customize the chart.
For an architectural overview of the operator, see RBLN NPU Operator. For upgrade and removal, see Upgrading the NPU Operator and Uninstalling the NPU Operator.
Prerequisites
- A Kubernetes cluster running version 1.19 or later, with access to kubectl and helm
- Helm 3.8 or later (required for OCI registry support)
- Node Feature Discovery installed in the cluster. To keep its lifecycle separate from the operator, we recommend deploying NFD separately using the upstream Helm chart into a dedicated node-feature-discovery namespace. The chart's bundled nfd.enabled=true option is convenient for quick trials, but it ties NFD's lifecycle to this Helm release.
- A dedicated namespace for the operator, such as rbln-system
- Worker nodes equipped with NPUs
If you do not have Helm installed (or your version is below 3.8), install it first:
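One common route is the installer script published by the Helm project. The sketch below is an assumption about how you might check and install Helm, not a step taken from this page: the helper names helm_version_ok and install_helm are hypothetical, and the script URL is the one given in Helm's own installation guide.

```shell
# Hypothetical helper: succeeds when a Helm version string such as
# "v3.14.0+g3fc9f4b" is 3.8 or later.
helm_version_ok() {
  ver="${1#v}"         # drop the leading "v"
  major="${ver%%.*}"   # text before the first dot
  rest="${ver#*.}"     # text after the first dot
  minor="${rest%%.*}"  # text before the next dot
  # Reject empty or non-numeric fields (for example, when helm is missing).
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  case "$minor" in ''|*[!0-9]*) return 1 ;; esac
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Hypothetical helper: download and run Helm's official installer script.
install_helm() {
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
}
```

A typical call is `helm_version_ok "$(helm version --short 2>/dev/null)" || install_helm`, which installs Helm only when the existing binary is missing or older than 3.8.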
Creating an Image Pull Secret
The driver container image and rbln-daemon container image are hosted on repo.rebellions.ai, which requires RBLN Portal account authentication. Before installing, create a Docker registry secret in the operator namespace:
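A minimal sketch of the secret creation, assuming the rbln-system namespace from the prerequisites and the drivercred name that the chart defaults expect. The create_drivercred wrapper is a hypothetical helper; the credentials are whatever your RBLN Portal account provides.

```shell
# Hypothetical helper: create the registry secret used to pull from
# repo.rebellions.ai.
#   $1: RBLN Portal username
#   $2: RBLN Portal password
create_drivercred() {
  kubectl create secret docker-registry drivercred \
    --namespace rbln-system \
    --docker-server=repo.rebellions.ai \
    --docker-username="$1" \
    --docker-password="$2"
}
```

Wrapping the command in a function keeps the password out of the snippet itself; pass it from an environment variable or a prompt rather than typing it inline.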
Naming the secret drivercred matches the chart's default driver.imagePullSecrets (and rblnDaemon.imagePullSecrets), so you do not need to override any Helm values.
Driver Installation
Before deploying the operator chart, decide whether the operator should install the kernel driver through a container or detect a driver installed directly on each host. See NPU Driver Installation for the two modes and how to configure them with the driver.enabled chart value.
Installing the NPU Operator from the OCI Registry
Starting with v0.3.4, the chart is published to Docker Hub as an OCI artifact at oci://docker.io/rebellions/rbln-npu-operator-chart. Pin a chart version with --version so that installations are reproducible. Available versions are listed on the chart page on Docker Hub.
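A pinned installation might look like the sketch below. The release name rbln-npu-operator and the install_npu_operator wrapper are assumptions made for illustration; the chart version is passed in by the caller rather than hard-coded, so take it from the chart page.

```shell
# Hypothetical helper: install a pinned chart version from the OCI registry.
#   $1: chart version to install
#   any further arguments are passed through to helm unchanged
install_npu_operator() {
  version="$1"
  shift
  helm install rbln-npu-operator \
    oci://docker.io/rebellions/rbln-npu-operator-chart \
    --version "$version" \
    --namespace rbln-system \
    "$@"
}
```

Extra flags such as --wait can be appended after the version, since the function forwards them to helm as-is.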
Override individual values with --set:
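As an illustration, the following sketch turns on operator-managed driver installation and turns off the metrics exporter. Both keys come from the Configuration Reference on this page; the release name and the install_with_set wrapper are hypothetical.

```shell
# Hypothetical helper: install with two values overridden on the command line.
#   $1: chart version to install
install_with_set() {
  helm install rbln-npu-operator \
    oci://docker.io/rebellions/rbln-npu-operator-chart \
    --version "$1" \
    --namespace rbln-system \
    --set driver.enabled=true \
    --set metricsExporter.enabled=false
}
```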
Or supply a custom values file:
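A values file keeps overrides under version control. The sketch below writes a small my-values.yaml and passes it with -f; the keys are taken from the Configuration Reference, while the chosen values, the release name, and the install_from_values wrapper are illustrative assumptions.

```shell
# Write an example values file. Keys are documented on this page;
# the values are examples only.
cat > my-values.yaml <<'EOF'
driver:
  enabled: true        # let the operator install the kernel driver
operator:
  replicas: 2          # run two operator pods
metricsExporter:
  enabled: false       # skip the Prometheus exporter
EOF

# Hypothetical helper: install using the values file written above.
#   $1: chart version to install
install_from_values() {
  helm install rbln-npu-operator \
    oci://docker.io/rebellions/rbln-npu-operator-chart \
    --version "$1" \
    --namespace rbln-system \
    -f my-values.yaml
}
```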
Previous chart repository
Earlier versions (≤0.3.3) remain available from the GitHub Pages chart repository at https://rbln-sw.github.io/rbln-npu-operator, but new releases are published only to OCI.
Verifying the Installation
After Helm reports a successful installation, ensure that the RBLNClusterPolicy exists and has been reconciled. Both CRDs are scoped to the cluster, so the -n flag is not needed:
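A sketch of this check follows. It assumes the CRD registers the singular resource name rblnclusterpolicy, which this page does not state; list_rbln_crds is included so you can confirm the actual names first. Both function names are hypothetical.

```shell
# Hypothetical helper: list the operator's CRDs to confirm resource names.
list_rbln_crds() {
  kubectl get crd | grep -i rbln
}

# Hypothetical helper: show the cluster-scoped policy object (no -n flag).
# The resource name is an assumption; adjust it to match list_rbln_crds.
check_cluster_policy() {
  kubectl get rblnclusterpolicy
}
```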
If driver management is enabled, confirm that the RBLNDriver custom resource is created:
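The following sketch assumes the resource can be addressed as rblndriver (an assumption, since this page does not give the registered name). The second helper prints only each object's name and .status.state through a JSONPath template. Both function names are hypothetical.

```shell
# Hypothetical helper: list RBLNDriver objects (cluster-scoped, no -n flag).
check_driver_cr() {
  kubectl get rblndriver
}

# Hypothetical helper: print "<name><TAB><state>" for each RBLNDriver.
driver_states() {
  kubectl get rblndriver \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
}
```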
The STATUS column reflects .status.state (ready or notReady), and the operator reconciles this value.
Next, check that the controller and component pods are running in the operator namespace:
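A minimal sketch of the pod check, assuming the rbln-system namespace. The function names are hypothetical, and wait_for_operator_pods further assumes that every pod in the namespace is long-running, because a completed pod never reports Ready.

```shell
# Hypothetical helper: list operator pods with node placement.
#   $1: namespace (defaults to rbln-system)
check_operator_pods() {
  kubectl get pods --namespace "${1:-rbln-system}" -o wide
}

# Hypothetical helper: block until all pods are Ready or 5 minutes pass.
#   $1: namespace (defaults to rbln-system)
wait_for_operator_pods() {
  kubectl wait --for=condition=Ready pods --all \
    --namespace "${1:-rbln-system}" --timeout=300s
}
```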
If all pods show Running, the operator is healthy. Pod names follow the rbln-<component>-* pattern. See Core Components for the role of each component. The driver pod's suffix (<os>-<kernel>) is composed from NFD labels. See Driver Image Selection for the rules.
If any pod is stuck in a non-Running state or in CrashLoopBackOff, check its logs (kubectl logs <pod> -n rbln-system) and review your Helm values and the prerequisites above.
When the operator installs the driver through the driver container (driver.enabled=true in the chart), you can verify that the kernel module is loaded and matches the requested driver version. See Verifying the running driver.
Checking NPU Status
Check the NPU capacity that Kubernetes reports for each node. This confirms that the device plugin exposed the expected resources. With devicePlugin.useGenericResourceName: true (the chart default since device plugin v0.4.0), the operator exposes the generic resource name rebellions.ai/npu:
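One way to view this per node is a custom-columns query, sketched below with a hypothetical npu_capacity helper. The dots in the resource name are escaped so that kubectl treats rebellions.ai/npu as a single map key.

```shell
# Hypothetical helper: show NPU capacity and allocatable count per node.
npu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,CAPACITY:.status.capacity.rebellions\.ai/npu,ALLOCATABLE:.status.allocatable.rebellions\.ai/npu'
}
```

Nodes without NPUs show `<none>` in both columns, which makes it easy to spot nodes where the device plugin has not registered.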
If the cluster contains multiple NPU models and you need to schedule workloads onto a specific one, combine the generic rebellions.ai/npu resource with the product label applied by NPU Feature Discovery. See Targeting a Specific NPU Product below.
Creating a Pod with an NPU
- Create a manifest (for example, npu-demo-pod.yaml). This example requests four rebellions.ai/npu devices:
- Create the Pod.
- Verify the Pod status and resource assignment. Use kubectl describe pod npu-pod to confirm that the requested NPU resources are bound and that the Pod is scheduled onto a node with an NPU.
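The steps above can be sketched as follows. The Pod name npu-pod and the file name come from the text; the container name, image, and command are placeholders chosen for illustration, and the two wrapper functions are hypothetical.

```shell
# Step 1: write the manifest. Image and command are placeholders
# (assumptions); replace them with your own workload.
cat > npu-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod
spec:
  restartPolicy: Never
  containers:
    - name: npu-demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "3600"]   # keeps the Pod alive for inspection
      resources:
        limits:
          rebellions.ai/npu: 4     # request four NPU devices
EOF

# Step 2 (hypothetical helper): create the Pod.
create_npu_pod() {
  kubectl apply -f npu-demo-pod.yaml
}

# Step 3 (hypothetical helper): inspect scheduling and resource assignment.
describe_npu_pod() {
  kubectl describe pod npu-pod
}
```

The Pod stays Pending if no single node has four unallocated NPUs, so lower the limit when testing on smaller nodes.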
Targeting a Specific NPU Product
In a cluster with mixed NPU products (for example, both RBLN-CA25 and RBLN-CR03 cards), a Pod that requests the generic rebellions.ai/npu resource may be scheduled onto any node with an NPU. Use the product label applied by NPU Feature Discovery, rebellions.ai/npu.product, together with nodeSelector or nodeAffinity to pin the workload to a specific model.
nodeSelector: single product
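A minimal sketch of a Pod pinned to one product with nodeSelector. It assumes the label value is the product name exactly as written above (RBLN-CA25); show_npu_products lists the values actually present on your nodes. The Pod name, file name, image, and helper name are hypothetical.

```shell
# Hypothetical helper: show each node's product label to confirm its value.
show_npu_products() {
  kubectl get nodes -L rebellions.ai/npu.product
}

# Manifest pinned to a single NPU product.
cat > npu-nodeselector-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-ca25-pod               # hypothetical name
spec:
  nodeSelector:
    rebellions.ai/npu.product: RBLN-CA25   # label value is an assumption
  containers:
    - name: npu-demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          rebellions.ai/npu: 1
EOF
```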
nodeAffinity: multiple products
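When a workload can run on more than one product, a required node affinity with the In operator accepts any of the listed values. The sketch below assumes the label values match the product names used above; the Pod name, file name, and image are hypothetical.

```shell
# Manifest that may be scheduled onto either of two NPU products.
cat > npu-affinity-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-multi-product-pod      # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: rebellions.ai/npu.product
                operator: In
                values:            # label values are assumptions
                  - RBLN-CA25
                  - RBLN-CR03
  containers:
    - name: npu-demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          rebellions.ai/npu: 1
EOF
```

Apply it with kubectl apply -f npu-affinity-pod.yaml; the scheduler then considers only nodes whose product label matches one of the listed values.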
Previous resource names by card
The resource names by card (rebellions.ai/ATOM, rebellions.ai/REBEL) are no longer recommended for new deployments. Prefer the generic rebellions.ai/npu resource paired with NFD labels as shown above. See RBLN NPU Feature Discovery for the full label reference.
Configuration Reference
The tables below summarize the key configuration options in values.yaml. Each section maps to an entry in the Core Components overview.
Pass any of these values with --set <key>=<value> (or -f my-values.yaml) when running helm install or helm upgrade.
To see the chart's full values list and template comments, including keys not shown in the tables below, run:
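A sketch using helm show values against the OCI reference. The show_chart_values wrapper is a hypothetical helper; the chart version is supplied by the caller.

```shell
# Hypothetical helper: print the chart's default values.yaml.
#   $1: chart version to inspect
show_chart_values() {
  helm show values oci://docker.io/rebellions/rbln-npu-operator-chart \
    --version "$1"
}
```

Redirecting the output to a file gives you a starting point for a custom values file that you can trim down to only the keys you change.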
The image.* blocks follow the standard Helm chart convention with four sub-keys: registry, repository, tag, and pullPolicy.
In the Default column below, this is condensed as <registry>/<repository>:<tag> plus the pullPolicy: on a second line.
Chart-wide
| Key | Description | Default |
|---|---|---|
| nameOverride | Prefix applied to every child resource name. Override to avoid collisions when running multiple operator instances. | rbln-npu-operator |
| workloadType | Workload mode for the cluster. Set to vm-passthrough for KubeVirt deployments. In this mode, the CRD also validates that vfioManager.enabled and sandboxDevicePlugin.enabled are true. | container |
| nfd.enabled | Whether the chart deploys Node Feature Discovery as a subchart. Leave false and install NFD separately through its upstream Helm chart to keep its lifecycle separate from the operator. Set true only for quick trials. | false |
| podDefaults.labels | Labels applied to every DaemonSet pod managed by the operator (replaces the removed daemonsets block). | {} |
| podDefaults.annotations | Annotations applied to every DaemonSet pod managed by the operator. | {} |
| podDefaults.tolerations | Tolerations applied to every DaemonSet pod managed by the operator. | [] |
| podDefaults.priorityClassName | Priority class applied to every operator-managed DaemonSet pod. | "" |
Operator
| Key | Description | Default |
|---|---|---|
| operator.image.* | Image of the controller-manager pod. Override tag to pin a specific operator version. | docker.io/rebellions/rbln-npu-operator:<chart default><br>pullPolicy: IfNotPresent |
| operator.replicas | Number of operator pods. Increase to 2+ for high availability. | 1 |
| operator.resources.requests | Minimum guaranteed resources for the operator pod. | cpu: 50m, memory: 128Mi |
| operator.resources.limits | Maximum resources for the operator pod. | cpu: 500m, memory: 256Mi |
| operator.service.type | Service type for the operator's webhook/metrics endpoint. Change when integrating with ingress or service mesh. | ClusterIP |
| operator.service.port | Service port. | 8443 |
| operator.service.targetPort | Container port the service forwards to. | 8443 |
| operator.securityContext.runAsNonRoot | Pod-level security context. Adjust based on your cluster's security policy. | true |
| operator.affinity | Affinity for the operator pod (e.g., to pin to control-plane nodes). | {} |
| operator.tolerations | Tolerations for the operator pod (for example, to allow scheduling on nodes with taints). | [] |
Driver Manager
| Key | Description | Default |
|---|---|---|
| driver.enabled | Whether the operator installs and manages the NPU driver. Leave false if drivers are already installed on hosts. | false |
| driver.image.* | Driver container image. Override this value to pin a private mirror or a specific driver release. | repo.rebellions.ai/rebellions/rbln-driver:<chart default><br>pullPolicy: IfNotPresent |
| driver.imagePullSecrets | Image pull secret for the driver image. Change this value to match your actual secret name if you do not use the drivercred name created earlier. | [drivercred] |
| driver.nodeSelector | Restrict driver pods to specific nodes. | {} |
| driver.tolerations | Tolerations for driver pods. | [] |
| driver.annotations | Annotations for driver pods. | {} |
| driver.priorityClassName | Pod priority class. The CRD defaults to system-node-critical when this value is empty. | "" |
| driver.resources | CPU/memory requests and limits for the driver pod. The CRD field is required; the chart fills defaults when unset. | {} |
| driver.env | Environment variables passed to the driver container (e.g., log levels). | [] |
| driver.manager.image.* | Image of the driver manager initContainer that performs reconciliation on the node. | docker.io/rebellions/rbln-k8s-driver-manager:<chart default><br>pullPolicy: IfNotPresent |
The driver.upgradePolicy.* keys (autoUpgrade, drain, reboot, etc.) are documented in NPU Driver Upgrade Workflow.
Device Plugin
| Key | Description | Default |
|---|---|---|
| devicePlugin.enabled | Whether the standard container Device Plugin is deployed. Disable when running only DRA (draKubeletPlugin) or only VM workloads. | true |
| devicePlugin.image.* | Device Plugin image. | docker.io/rebellions/k8s-device-plugin:<chart default><br>pullPolicy: IfNotPresent |
| devicePlugin.useGenericResourceName | Whether to expose the generic rebellions.ai/npu resource. Keep true; setting false selects a naming mode by card that is not recommended for new deployments. | true |
DRA Driver
| Key | Description | Default |
|---|---|---|
| draKubeletPlugin.enabled | Enable on Kubernetes 1.34 or later to use Dynamic Resource Allocation. Mutually exclusive with devicePlugin.enabled. | false |
| draKubeletPlugin.image.* | DRA kubelet plugin image. | docker.io/rebellions/k8s-dra-driver-npu:<chart default><br>pullPolicy: IfNotPresent |
| draKubeletPlugin.driverName | Driver name; must match the value referenced from DeviceClass.spec.config.driver. | npu.rebellions.ai |
| draKubeletPlugin.kubeletRegistrarDirectoryPath | Host path where the plugin registers with the kubelet. | /var/lib/kubelet/plugins_registry |
| draKubeletPlugin.kubeletPluginsDirectoryPath | Host path where the plugin sockets live. | /var/lib/kubelet/plugins |
| draKubeletPlugin.healthcheckPort | TCP port for the plugin's health-check endpoint. | 51515 |
See NPU DRA Driver for full DRA usage.
Sandbox Device Plugin
| Key | Description | Default |
|---|---|---|
| sandboxDevicePlugin.enabled | Whether the VFIO Sandbox Device Plugin is deployed. Enable for KubeVirt or other VM environments. | false |
| sandboxDevicePlugin.image.* | Sandbox Device Plugin image. | docker.io/rebellions/k8s-device-plugin:<chart default><br>pullPolicy: IfNotPresent |
| sandboxDevicePlugin.resourceList[] | VFIO resource map by card. Add entries as new SKUs ship. | RBLN-CA12 → ATOM_PT<br>RBLN-CA25 → ATOM_MAX_PT<br>RBLN-CR03 → REBEL_PT |
VFIO Manager
| Key | Description | Default |
|---|---|---|
| vfioManager.enabled | Whether the VFIO bind/unbind helper is deployed. Enable it together with the Sandbox Device Plugin for VM passthrough. | false |
| vfioManager.image.* | VFIO Manager image. | docker.io/rebellions/rbln-vfio-manager:<chart default><br>pullPolicy: IfNotPresent |
Container Toolkit
| Key | Description | Default |
|---|---|---|
| containerToolkit.enabled | Whether the Container Toolkit DaemonSet is deployed. Disable if CDI specs and runtime config are managed externally. | true |
| containerToolkit.image.* | Container Toolkit image. | docker.io/rebellions/rbln-container-toolkit:<chart default><br>pullPolicy: IfNotPresent |
| containerToolkit.imagePullSecrets | Image pull secret(s); required when pulling from a private registry. | [] |
| containerToolkit.resources | CPU/memory requests and limits for the toolkit pod. | {} |
| containerToolkit.env | Environment variables passed to the toolkit, including RBLN_CTK_DAEMON_SOCKET and RBLN_CTK_DAEMON_CONFIG_PATH. | [] |
NPU Feature Discovery
| Key | Description | Default |
|---|---|---|
| npuFeatureDiscovery.enabled | Whether the operator deploys NPU Feature Discovery. Disable if you manage NPU node labels through another mechanism. | true |
| npuFeatureDiscovery.image.* | NPU Feature Discovery image. | docker.io/rebellions/rbln-npu-feature-discovery:<chart default><br>pullPolicy: IfNotPresent |
Metrics Exporter
| Key | Description | Default |
|---|---|---|
| metricsExporter.enabled | Whether the Prometheus metrics exporter is deployed. Disable when another telemetry pipeline already covers NPUs. | true |
| metricsExporter.image.* | Metrics exporter image. | docker.io/rebellions/rbln-metrics-exporter:<chart default><br>pullPolicy: IfNotPresent |
Operator Validator
| Key | Description | Default |
|---|---|---|
| validator.image.* | Validator DaemonSet image. | docker.io/rebellions/rbln-npu-operator-validator:<chart default><br>pullPolicy: IfNotPresent |
| validator.imagePullSecrets | Image pull secret(s); required when pulling from a private registry. | [] |
| validator.resources | CPU/memory requests and limits for the Validator pod. | {} |
| validator.env | Environment variables for the top-level Validator process. | [] |
| validator.toolkit.env | Environment variables for the Container Toolkit readiness subcheck. | [] |
| validator.driver.env | Environment variables for the driver readiness subcheck. | [] |
RBLN Daemon
| Key | Description | Default |
|---|---|---|
| rblnDaemon.enabled | Whether the rbln-daemon host service is deployed on each node. Enable when workloads need the RBLN Daemon in the cluster. | false |
| rblnDaemon.image.* | RBLN Daemon image. | repo.rebellions.ai/rebellions/rbln-daemon:<chart default><br>pullPolicy: IfNotPresent |
| rblnDaemon.imagePullSecrets | Image pull secret. Uses the same drivercred secret as the driver image. | [drivercred] |
| rblnDaemon.hostPort | TCP port the daemon listens on, exposed via hostPort on each node. | 50051 |
| rblnDaemon.resources | CPU/memory requests and limits for the daemon pod. | {} |
| rblnDaemon.env | Environment variables passed to the daemon container. | [] |