NPU Driver Upgrade Workflow
This document describes how NPU driver upgrades work in Kubernetes and how to configure them. It covers:
- Architecture and flow: the two components that coordinate upgrades (operator and driver-manager) and what happens on each node when a driver pod starts
- Upgrade modes and policy: operator-managed vs manual rollout, and how to configure `upgradePolicy` (cordon, drain, reboot, etc.)
- Operational options: how to exclude specific nodes from upgrades
Architecture and flow
Architecture Overview
NPU driver upgrades are coordinated by two components:
| Component | Role |
|---|---|
| `rbln-npu-operator` | Cluster-level orchestration and upgrade policy enforcement |
| `rbln-k8s-driver-manager` | Node-local driver lifecycle reconciliation |
Depending on configuration, upgrades operate in one of the following modes:
| Mode | Setting | Description |
|---|---|---|
| Operator-managed rollout | `autoUpgrade: true` | Operator orchestrates rollout across nodes |
| Manual rollout | `autoUpgrade: false` | Administrator triggers upgrades explicitly |
Driver upgrades are handled by two layers of components with distinct responsibilities.
1. rbln-npu-operator (Cluster Orchestration)
The operator manages cluster-wide upgrade orchestration.
Responsibilities include:
- Detecting nodes that require driver upgrades
- Enforcing upgrade policy (`upgradePolicy`)
- Controlling rollout parallelism (`maxParallelUpgrades`)
- Coordinating node maintenance actions such as:
  - cordon
  - drain
  - reboot
- Managing rollout progression across nodes
The operator does not directly manage driver state on nodes. Instead, it triggers driver pod restarts, which initiate node-local reconciliation.
2. rbln-k8s-driver-manager (Node Driver Reconciliation)
rbln-k8s-driver-manager runs within the driver DaemonSet and is
responsible for reconciling the driver state on each node.
Responsibilities include:
- Detecting the current driver state on the node
- Temporarily pausing related system components during driver upgrades
- Performing driver uninstall/install when necessary
- Restoring node labels so workloads can resume
The reconciliation process is triggered whenever a driver pod starts on a node.
Driver Reconciliation Flow
When a driver pod starts on a node, the initContainer runs the
reconcile-driver-state logic implemented by rbln-k8s-driver-manager.
| Step | Action |
|---|---|
| 1. Read node labels | Read rebellions.ai/npu.deploy.* labels that determine which Rebellions components run on the node |
| 2. Pause related Rebellions system components | Replace labels with paused-for-driver-upgrade so DaemonSets stop and existing pods terminate |
| 3. Wait for pods to terminate | Wait until relevant Rebellions component pods have exited |
| 4. Reconcile driver state | If the driver image digest matches the desired state, uninstall is skipped. Otherwise, unload kernel modules, remove old artifacts, and install the new driver |
| 5. Restore node labels | Restore original labels so Rebellions components can be scheduled again |
As a result, all node components restart using the upgraded driver.
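The five steps above can be sketched as a small simulation over an in-memory label dictionary. This is an illustration only, not the actual `rbln-k8s-driver-manager` code: the function name, the log entries, the label suffix `device-plugin`, and the label value `"true"` are all assumptions made for the example. Only the `rebellions.ai/npu.deploy.` prefix and the `paused-for-driver-upgrade` value come from the table.

```python
PAUSED = "paused-for-driver-upgrade"
DEPLOY_PREFIX = "rebellions.ai/npu.deploy."


def reconcile_driver_state(labels, installed_digest, desired_digest, log):
    """Walk the five reconciliation steps against an in-memory label dict.

    Hypothetical helper: it records what would happen instead of touching
    a real node, and returns the digest installed afterwards.
    """
    # Step 1: read the deploy labels that control Rebellions components.
    original = {k: v for k, v in labels.items() if k.startswith(DEPLOY_PREFIX)}

    # Step 2: pause the components by overwriting each deploy label.
    for key in original:
        labels[key] = PAUSED
    log.append("paused")

    # Step 3: a real implementation would poll here until component pods exit.
    log.append("waited")

    # Step 4: replace the driver only when the digest differs from the desired state.
    if installed_digest == desired_digest:
        log.append("skip-uninstall")
    else:
        log.extend(["unload-modules", "remove-artifacts", "install-driver"])
        installed_digest = desired_digest

    # Step 5: restore the saved labels so components can be scheduled again.
    labels.update(original)
    log.append("restored")
    return installed_digest


# Example run with a made-up label suffix and digests.
node_labels = {DEPLOY_PREFIX + "device-plugin": "true", "kubernetes.io/os": "linux"}
steps = []
new_digest = reconcile_driver_state(node_labels, "sha256:old", "sha256:new", steps)
```

Saving the original label values before pausing is what lets step 5 put the node back exactly as it was, regardless of whether the driver was replaced.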
Upgrade modes and policy
AutoUpgrade Mode (autoUpgrade: true)
When AutoUpgrade is enabled, the operator performs a policy-driven rollout across nodes.
Upgrade behavior is controlled through upgradePolicy.
Upgrade Flow
For each node selected by the operator:
| Step | Action |
|---|---|
| 1 | The node is cordoned to prevent new workloads from being scheduled. |
| 2 | Existing NPU workloads are handled per policy: waitForCompletion, npuPodDeletion, drain. |
| 3 | The driver pod restart triggers node-local reconciliation. |
| 4 | An optional node reboot may occur. |
| 5 | The node is validated. |
| 6 | The node is uncordoned. |
The operator then proceeds to the next batch of nodes according to
maxParallelUpgrades.
Node Upgrade Selection
Nodes that require an upgrade are detected when:
- the driver DaemonSet revision changes
- an explicit upgrade request is issued
The operator selects nodes for upgrade based on maxParallelUpgrades:
| Value | Behavior |
|---|---|
| `1` | Upgrade one node at a time (sequential) |
| `0` | Unlimited parallel upgrades |
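To make the effect of `maxParallelUpgrades` concrete, the following sketch splits a list of pending nodes into batches. It is a simplified model, assuming fixed-size batches processed one after another; the operator's real scheduling logic is not shown in this document, and `plan_batches` is a hypothetical name.

```python
def plan_batches(nodes, max_parallel_upgrades):
    """Group pending nodes into upgrade batches.

    0 means unlimited parallelism (one batch with every node);
    any positive value caps the batch size.
    """
    if max_parallel_upgrades < 0:
        raise ValueError("max_parallel_upgrades must be 0 or a positive integer")
    nodes = list(nodes)
    if not nodes:
        return []
    if max_parallel_upgrades == 0:
        return [nodes]
    size = max_parallel_upgrades
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


pending = ["node-a", "node-b", "node-c"]
sequential = plan_batches(pending, 1)  # one node per batch
unlimited = plan_batches(pending, 0)   # all nodes in a single batch
```

With three pending nodes, a value of `1` produces three single-node batches, while `0` produces one batch containing all three.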
Reboot Workflow
Optional reboot support can be enabled through the `reboot` setting of the upgrade policy. When enabled:
| Step | Action |
|---|---|
| 1 | Operator triggers a reboot through the reboot helper pod |
| 2 | Node temporarily becomes NotReady |
| 3 | Node returns to Ready after reboot validation |
Manual Mode (autoUpgrade: false)
When AutoUpgrade is disabled, the driver DaemonSet uses the OnDelete strategy.
Driver pods are not automatically restarted when the DaemonSet template changes.
Instead, upgrades occur through explicit administrator actions.
Manual Upgrade Procedure
- The administrator selects a node
- Node maintenance actions are performed (typically cordon and drain)
- The administrator deletes the driver pod
- The Kubernetes DaemonSet controller creates a new driver pod
- The initContainer triggers the driver reconciliation flow
During reconciliation, rbln-k8s-driver-manager:
- pauses related Rebellions system components
- updates the node driver state
- restores node labels once complete
After reconciliation, Rebellions components are automatically rescheduled.
Helm Configuration
Example configuration:
Upgrade Policy Reference
| Setting | Description |
|---|---|
| `autoUpgrade` | Enables or disables operator-managed driver upgrades |
| `maxParallelUpgrades` | Maximum number of nodes upgraded concurrently. `1` = sequential, `0` = unlimited |
| `waitForCompletion` | Waits for selected pods or jobs to complete before starting pod eviction |
| `npuPodDeletion` | Eviction behavior for NPU pods. `force: false` = conservative (may block on unmanaged pods). `force: true` = aggressive eviction |
| `drain` | Node drain. `enable: true` = operator performs drain. `force: true` = proceed even when blocking pods exist |
| `reboot` | Optional reboot workflow. `rebootTimeoutSeconds` = maximum time for reboot validation; `0` = no timeout |
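As a rough illustration of how these settings interact, the sketch below models the policy as a Python dataclass and derives the per-node action sequence from it. The field names are Python-style stand-ins, not the real Helm keys, and several points are assumptions: that `waitForCompletion` and `reboot` behave as simple on/off switches, that NPU pod deletion always runs, and that the steps occur in the order shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UpgradePolicy:
    # Stand-in field names; the actual Helm value layout is not shown here.
    auto_upgrade: bool = True
    max_parallel_upgrades: int = 1
    wait_for_completion: bool = False  # assumed to be an on/off switch
    npu_pod_deletion_force: bool = False
    drain_enable: bool = False
    drain_force: bool = False
    reboot_enable: bool = False        # assumed switch; only the timeout is documented
    reboot_timeout_seconds: int = 0


def reboot_timeout(policy: UpgradePolicy) -> Optional[int]:
    """Return the reboot validation timeout, or None because 0 means no timeout."""
    return policy.reboot_timeout_seconds or None


def node_steps(policy: UpgradePolicy) -> List[str]:
    """List the per-node actions of an operator-managed rollout (assumed order)."""
    if not policy.auto_upgrade:
        return []  # manual mode: the administrator drives each step
    steps = ["cordon"]
    if policy.wait_for_completion:
        steps.append("wait-for-completion")
    steps.append("delete-npu-pods")
    if policy.drain_enable:
        steps.append("drain")
    steps.append("restart-driver-pod")
    if policy.reboot_enable:
        steps.append("reboot")
    steps += ["validate", "uncordon"]
    return steps


policy = UpgradePolicy(drain_enable=True, reboot_enable=True)
planned = node_steps(policy)
```

Treating `0` as "no timeout" in a dedicated helper keeps callers from accidentally passing a zero-second deadline to a wait loop.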
Operational options
Skipping Driver Upgrades
To skip driver upgrades on a specific node:
To re-enable upgrades:
This label is evaluated by the operator when selecting nodes for upgrade and is mainly intended for operator-managed rollouts (autoUpgrade: true).