NPU Driver Upgrade Workflow¶
This document covers the container driver mode. If you install the kernel driver directly on hosts in host driver mode, see NPU Driver Installation.
This document covers:
- Defining the RBLNDriver: the CRD structure, what the Driver Manager installs, and how the driver image is selected per node
- Architecture and flow: the two components that coordinate upgrades (operator and driver-manager) and what happens on each node when a driver pod starts
- Upgrade modes and policy: rollout managed by the operator and manual rollout, and how to configure `upgradePolicy` (cordon, drain, reboot, etc.)
- Operational options: how to exclude specific nodes from upgrades
Defining the RBLNDriver¶
The operator defines the RBLNDriver CRD to manage NPU driver installation. When you create an RBLNDriver custom resource with the desired driver version, the Driver Manager installs and maintains that version across the cluster.
RBLNDriver Sample¶
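A minimal sample resource is shown below. The `apiVersion` group/version, metadata name, and version value are illustrative assumptions; `spec.version` is the field the operator reconciles against (see "Verifying the Running Driver"). Check the CRD shipped with your chart for the authoritative schema.

```yaml
# Illustrative RBLNDriver custom resource.
# apiVersion group/version and the name are assumptions -- consult the
# installed CRD (kubectl explain rblndriver) for the exact schema.
apiVersion: rebellions.ai/v1alpha1
kind: RBLNDriver
metadata:
  name: rbln-driver
spec:
  # Desired NPU driver version; after an upgrade, rbln-smi on the node
  # should report a matching KMD version.
  version: "3.0.0"
```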
What the Driver Manager Installs¶
When an RBLNDriver resource is applied, the Driver Manager installs:
- Kernel driver
- UMD libraries
- Tools such as `rbln-smi`
Driver Image Selection¶
The Driver Manager selects the driver container image by combining Node Feature Discovery labels on each node:
- `feature.node.kubernetes.io/system-os_release.ID`
- `feature.node.kubernetes.io/system-os_release.VERSION_ID`
- `feature.node.kubernetes.io/kernel-version.full`
For example, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic can select `docker.io/rebellions/rbln-driver:3.0.0-6.8.0-90-generic-ubuntu22.04`.
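To see which image a given node would select, you can inspect its NFD labels directly (node name is illustrative):

```shell
# Print the three NFD labels the Driver Manager combines for image selection
kubectl get node worker-1 -o jsonpath='{.metadata.labels}' \
  | tr ',' '\n' \
  | grep -E 'system-os_release.ID|system-os_release.VERSION_ID|kernel-version.full'
```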
Architecture and Flow¶
Architecture Overview¶
NPU driver upgrades are coordinated by two components:
| Component | Role |
|---|---|
| `rbln-npu-operator` | Cluster-level orchestration and upgrade policy enforcement |
| `rbln-k8s-driver-manager` | Driver lifecycle reconciliation on each node |
Depending on configuration, upgrades operate in one of the following modes:
| Mode | Setting | Description |
|---|---|---|
| Rollout managed by the operator | `autoUpgrade: true` | Operator orchestrates the rollout across nodes |
| Manual rollout | `autoUpgrade: false` | Administrator triggers upgrades explicitly |
Driver upgrades are handled by two layers with distinct responsibilities.
1. rbln-npu-operator (Cluster Orchestration)¶
The operator manages upgrade orchestration across the cluster.
Responsibilities include:
- Detecting nodes that require driver upgrades
- Enforcing upgrade policy (`upgradePolicy`)
- Controlling rollout parallelism (`maxParallelUpgrades`)
- Coordinating node maintenance actions such as `cordon`, `drain`, and `reboot`
- Managing rollout progression across nodes
The operator does not directly manage driver state on nodes. Instead, it triggers driver pod restarts, which start local reconciliation on the node.
2. rbln-k8s-driver-manager (Node Driver Reconciliation)¶
rbln-k8s-driver-manager runs within the driver DaemonSet and reconciles the driver state on each node.
Responsibilities include:
- Detecting the current driver state on the node
- Temporarily pausing related components during driver upgrades
- Performing driver uninstall/install when necessary
- Restoring node labels so workloads can resume
Reconciliation runs whenever a driver pod starts on a node.
Driver Reconciliation Flow¶
When a driver pod starts on a node, the initContainer runs the reconcile-driver-state logic implemented by rbln-k8s-driver-manager.
| Step | Action |
|---|---|
| 1. Read node labels | Read rebellions.ai/npu.deploy.* labels that determine which Rebellions components run on the node |
| 2. Pause related Rebellions components | Replace labels with paused-for-driver-upgrade so DaemonSets stop and existing pods terminate |
| 3. Wait for pods to terminate | Wait until relevant Rebellions component pods have exited |
| 4. Reconcile driver state | If the driver image digest matches the desired state, skip uninstalling the existing driver. Otherwise, unload kernel modules, remove old artifacts, and install the new driver. |
| 5. Restore node labels | Restore original labels so Rebellions components can be scheduled again |
As a result, all node components restart with the upgraded driver.
Cleanup of stale driver DaemonSets¶
The driver image is selected from NFD labels (OS, kernel version), so a node kernel change, for example after apt upgrade and a reboot, selects a different image tag and creates a new driver DaemonSet. The operator automatically detects and deletes driver DaemonSets whose node selector no longer matches any node, preventing stale driver pods from accumulating after kernel upgrades.
Upgrade Modes and Policy¶
AutoUpgrade Mode (autoUpgrade: true)¶
When AutoUpgrade is enabled, the operator performs a rollout across nodes according to policy.
Upgrade behavior is controlled through upgradePolicy.
Upgrade Flow¶
For each node selected by the operator, the following steps run:
| Step | Action |
|---|---|
| 1 | The node is cordoned to prevent new workloads from being scheduled. |
| 2 | Existing NPU workloads are handled according to policy: waitForCompletion, npuPodDeletion, drain |
| 3 | Driver pod restart triggers local reconciliation on the node |
| 4 | Node reboot is performed if configured |
| 5 | The node is validated |
| 6 | Node is uncordoned |
The operator then proceeds to the next batch of nodes according to maxParallelUpgrades.
Node Upgrade Selection¶
Nodes that require an upgrade are detected when:
- driver DaemonSet revision changes
- an explicit upgrade request is issued
The operator selects nodes for upgrade based on maxParallelUpgrades:
| Value | Behavior |
|---|---|
| `1` | Upgrade one node at a time (sequential) |
| `0` | Unlimited parallel upgrades |
Reboot Workflow¶
Optional reboot can be enabled:
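A minimal policy excerpt is sketched below, using the settings listed in the Upgrade Policy Reference; the exact nesting under your chart's values may differ.

```yaml
# upgradePolicy excerpt -- enables the optional reboot step.
upgradePolicy:
  reboot:
    enable: true
    # Max seconds to wait for the node to return to Ready; 0 = no timeout
    rebootTimeoutSeconds: 0
```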
This block is disabled by default in the chart (reboot.enable: false).
When enabled, the workflow is:
| Step | Action |
|---|---|
| 1 | Operator triggers a reboot through the reboot helper pod |
| 2 | Node temporarily becomes NotReady |
| 3 | Node returns to Ready after reboot validation |
Manual Mode (autoUpgrade: false)¶
When AutoUpgrade is disabled, the driver DaemonSet uses the OnDelete strategy.
Driver pods are not automatically restarted when the DaemonSet template changes.
Instead, upgrades occur only when an administrator performs explicit actions.
Manual Upgrade Procedure¶
1. Administrator selects a node
2. Node maintenance actions are performed (typically cordon and drain)
3. Administrator deletes the driver pod on that node
4. Kubernetes DaemonSet controller creates a new driver pod
5. The initContainer triggers the driver reconciliation flow
During reconciliation, rbln-k8s-driver-manager:
- pauses related Rebellions components
- updates the node driver state
- restores node labels once complete
After reconciliation, Rebellions components are automatically rescheduled.
Helm Configuration¶
If driver.enabled: false at the chart level, the RBLNDriver resource is never created and the upgrade behavior described on this page does not apply. The remainder of this section assumes driver.enabled: true.
All upgrade actions must be enabled explicitly. The chart ships with autoUpgrade, drain.enable, and reboot.enable set to false by default.
To turn on rollout managed by the operator, set autoUpgrade: true and enable the subblocks (drain.enable, reboot.enable, etc.) for any maintenance actions you want the operator to perform.
Example configuration:
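The excerpt below sketches such a configuration; the nesting of `upgradePolicy` under the `driver` block is an assumption, so consult the chart's `values.yaml` for the authoritative layout.

```yaml
# Illustrative Helm values excerpt for operator-managed rollout.
driver:
  enabled: true
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1   # upgrade one node at a time
    drain:
      enable: true
      timeoutSeconds: 300
    reboot:
      enable: false          # opt in explicitly if a reboot is required
```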
Upgrade Policy Reference¶
| Setting | Description |
|---|---|
| `autoUpgrade` | Driver upgrades managed by the operator. `false` (default) = operator does not orchestrate rollout; `true` = operator performs rollout |
| `maxParallelUpgrades` | Maximum number of nodes upgraded concurrently. `1` = sequential (default), `0` = unlimited |
| `waitForCompletion.timeoutSeconds` | Maximum seconds to wait for selected pods to complete before removal. `0` (default) = wait indefinitely |
| `waitForCompletion.podSelector` | Label selector for pods to wait on. Empty string (default) = skip the wait step entirely |
| `npuPodDeletion.force` | `false` (default) = conservative removal (blocks on pods lacking a controller); `true` = forced removal |
| `npuPodDeletion.timeoutSeconds` | Maximum seconds before forcibly deleting remaining pods. `0` = wait indefinitely (default `300`) |
| `drain.enable` | `false` (default) = drain skipped; `true` = operator drains the node before pod restart |
| `drain.force` | `false` (default) = drain fails on blocking pods; `true` = drain proceeds even when blocking pods exist |
| `drain.deleteEmptyDirData` | `false` (default) = pods using emptyDir storage block the drain; `true` = those pods are removed (data is lost) |
| `drain.podSelector` | Label selector restricting drain to matching pods. Empty string (default) = drain all pods on the node |
| `drain.timeoutSeconds` | Maximum seconds for drain to complete. `0` = wait indefinitely (default `300`) |
| `reboot.enable` | `false` (default) = no reboot; `true` = node is rebooted as part of the upgrade |
| `reboot.rebootTimeoutSeconds` | Maximum seconds to wait for the rebooted node to return to Ready. Only relevant when `reboot.enable: true`. `0` (default) = no timeout |
Operational Options¶
Skipping Driver Upgrades¶
To exclude a node from driver upgrades, set the following label:
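For example (node name is illustrative):

```shell
kubectl label node worker-1 rebellions.ai/npu-driver-upgrade.skip=true
```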
To re-enable upgrades, remove the label:
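For example (node name is illustrative):

```shell
# Trailing "-" removes the label
kubectl label node worker-1 rebellions.ai/npu-driver-upgrade.skip-
```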
The operator re-checks this label on every upgrade attempt, so it applies to both autoUpgrade: true and manual rollouts.
Don't confuse with rebellions.ai/npu.deploy.skip¶
| Label | Effect |
|---|---|
| `rebellions.ai/npu-driver-upgrade.skip=true` | Pauses driver upgrades only. The current driver keeps running, and other NPU components are not affected. |
| `rebellions.ai/npu.deploy.skip=true` | Removes all RBLN NPU components from the node, including the driver itself. NPU workloads on the node will stop working. |
Use npu-driver-upgrade.skip to hold back upgrades. Use npu.deploy.skip only when you want to exclude the node from NPU workloads entirely. See Workload Labeling by Node for the full npu.deploy.skip workflow.
Inspecting State¶
To see which nodes are excluded and where the remaining nodes are in the upgrade state machine:
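The skip label can be listed directly; the operator's per-node upgrade-state label key is not specified on this page, so only the skip label is shown here:

```shell
# Show which nodes carry the upgrade-skip label
kubectl get nodes -L rebellions.ai/npu-driver-upgrade.skip
```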
Verifying the Running Driver¶
To confirm the driver pod has loaded the kernel module and the expected KMD version is in use, run rbln-smi inside the driver container on the target node.
First, find the driver pod for the node. Driver pods follow the rbln-driver-<os>-<kernel>-<hash> name pattern (see Driver Image Selection):
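For example (namespace and node name are illustrative):

```shell
# List driver pods scheduled on the target node
kubectl get pods -n <namespace> -o wide \
  --field-selector spec.nodeName=worker-1 | grep rbln-driver
```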
Then exec into the rbln-driver-container and invoke rbln-smi:
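For example (namespace and pod name are illustrative placeholders following the pattern above):

```shell
kubectl exec -n <namespace> rbln-driver-ubuntu22.04-6.8.0-90-generic-abcde \
  -c rbln-driver-container -- rbln-smi
```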
The header reports the running KMD version, and the device table lists the NPUs the driver has bound on this node:
After an upgrade, the KMD ver line should match spec.version in your RBLNDriver.