NPU Driver Upgrade Workflow¶

This document covers the container driver mode. If you install the kernel driver directly on hosts in host driver mode, see NPU Driver Installation.

This document covers:

Defining the RBLNDriver: the CRD structure, what the Driver Manager installs, and how the driver image is selected per node
Architecture and flow: the two components that coordinate upgrades (operator and driver-manager) and what happens on each node when a driver pod starts
Upgrade modes and policy: rollout managed by the operator and manual rollout, and how to configure upgradePolicy (cordon, drain, reboot, etc.)
Operational options: how to exclude specific nodes from upgrades

Defining the RBLNDriver¶

The operator defines the RBLNDriver CRD to manage NPU driver installation. When you create an RBLNDriver custom resource with the desired driver version, the Driver Manager installs and maintains that version across the cluster.

RBLNDriver Sample¶

apiVersion: rebellions.ai/v1alpha1
kind: RBLNDriver
metadata:
  labels:
    app.kubernetes.io/name: rbln-driver
  name: rblndriver-sample
spec:
  registry: repo.rebellions.ai
  image: rebellions/rbln-driver
  version: "3.0.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    - drivercred
  resources:
    requests:
      cpu: 250m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 128Mi
  manager:
    registry: docker.io
    image: rebellions/rbln-k8s-driver-manager
    version: v0.1.3
    imagePullPolicy: IfNotPresent

What the Driver Manager Installs¶

When an RBLNDriver resource is applied, the Driver Manager installs:

Kernel driver
UMD libraries
Tools such as rbln-smi

Driver Image Selection¶

The Driver Manager selects the driver container image by combining Node Feature Discovery labels on each node:

feature.node.kubernetes.io/system-os_release.ID
feature.node.kubernetes.io/system-os_release.VERSION_ID
feature.node.kubernetes.io/kernel-version.full

For example, for driver version 3.0.0, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic resolves to the image docker.io/rebellions/rbln-driver:3.0.0-6.8.0-90-generic-ubuntu22.04.
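The way the tag is assembled can be illustrated with a short shell sketch. It assumes the layout `<version>-<kernel>-<os id><os version id>`, which is inferred from the single example above rather than from a documented format. `driver_image_ref` is a hypothetical helper and not part of the operator.

```shell
# Hypothetical helper (not part of the operator): compose an image reference
# from the driver version and the three NFD label values.
# Assumed tag layout, inferred from the example above:
#   <version>-<kernel>-<os_id><os_version_id>
driver_image_ref() {
  repo="$1"        # registry and image, e.g. docker.io/rebellions/rbln-driver
  version="$2"     # spec.version of the RBLNDriver
  kernel="$3"      # feature.node.kubernetes.io/kernel-version.full
  os_id="$4"       # feature.node.kubernetes.io/system-os_release.ID
  os_version="$5"  # feature.node.kubernetes.io/system-os_release.VERSION_ID
  printf '%s:%s-%s-%s%s\n' "$repo" "$version" "$kernel" "$os_id" "$os_version"
}

driver_image_ref docker.io/rebellions/rbln-driver 3.0.0 6.8.0-90-generic ubuntu 22.04
```

The label values for a real node can be inspected with `kubectl get node <node-name> --show-labels`.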

Architecture and Flow¶

Architecture Overview¶

NPU driver upgrades are coordinated by two components:

Component	Role
`rbln-npu-operator`	Cluster level orchestration and upgrade policy enforcement
`rbln-k8s-driver-manager`	Driver lifecycle reconciliation on each node

Depending on configuration, upgrades operate in one of the following modes:

Mode	Setting	Description
Rollout managed by the operator	`autoUpgrade: true`	Operator orchestrates the rollout across nodes
Manual rollout	`autoUpgrade: false`	Administrator triggers upgrades explicitly

The two components have distinct responsibilities.

1. rbln-npu-operator (Cluster Orchestration)¶

The operator manages upgrade orchestration across the cluster.

Responsibilities include:

Detecting nodes that require driver upgrades
Enforcing upgrade policy (upgradePolicy)
Controlling rollout parallelism (maxParallelUpgrades)
Coordinating node maintenance actions such as:
- cordon
- drain
- reboot
Managing rollout progression across nodes

The operator does not directly manage driver state on nodes. Instead, it triggers driver pod restarts, which start local reconciliation on the node.

2. rbln-k8s-driver-manager (Node Driver Reconciliation)¶

rbln-k8s-driver-manager runs within the driver DaemonSet and reconciles the driver state on each node.

Responsibilities include:

Detecting the current driver state on the node
Temporarily pausing related components during driver upgrades
Performing driver uninstall/install when necessary
Restoring node labels so workloads can resume

Reconciliation runs whenever a driver pod starts on a node.

Driver Reconciliation Flow¶

When a driver pod starts on a node, the initContainer runs the reconcile-driver-state logic implemented by rbln-k8s-driver-manager.

Step	Action
1. Read node labels	Read `rebellions.ai/npu.deploy.*` labels that determine which Rebellions components run on the node
2. Pause related Rebellions components	Replace labels with `paused-for-driver-upgrade` so DaemonSets stop and existing pods terminate
3. Wait for pods to terminate	Wait until relevant Rebellions component pods have exited
4. Reconcile driver state	If the driver image digest matches the desired state, skip uninstalling the existing driver. Otherwise, unload kernel modules, remove old artifacts, and install the new driver.
5. Restore node labels	Restore original labels so Rebellions components can be scheduled again

As a result, all node components restart with the upgraded driver.
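The decision in step 4 and the ordering of the surrounding steps can be sketched as a dry run. This is only an illustration of the table above: `reconcile_plan` is a hypothetical function, the digests are treated as opaque strings, and the real logic in rbln-k8s-driver-manager may differ in detail.

```shell
# Illustrative dry run: print the steps a reconcile pass would take.
# How the driver manager obtains the two digests is not modeled here.
reconcile_plan() {
  current_digest="$1"   # digest of the driver currently installed on the node
  desired_digest="$2"   # digest of the driver image the pod was started with
  echo "read rebellions.ai/npu.deploy.* labels"
  echo "set labels to paused-for-driver-upgrade"
  echo "wait for component pods to terminate"
  if [ "$current_digest" = "$desired_digest" ]; then
    echo "driver up to date: skip uninstall"
  else
    echo "unload kernel modules"
    echo "remove old artifacts"
    echo "install new driver"
  fi
  echo "restore original labels"
}

reconcile_plan sha256:aaa sha256:bbb
```

Note that the labels are paused and restored in both branches; only the uninstall and install work is conditional.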

Cleanup of stale driver DaemonSets¶

The driver image is selected from NFD labels (OS, kernel version), so a node kernel change, for example after apt upgrade and a reboot, selects a different image tag and creates a new driver DaemonSet. The operator automatically detects and deletes driver DaemonSets whose node selector no longer matches any node, preventing stale driver pods from accumulating after kernel upgrades.

Driver readiness signaling¶

Starting with chart 0.4.3, the operator determines driver readiness by checking a marker file that the driver container writes after the driver installation completes. The readiness probe therefore does not mark the pod ready while a firmware update or module reload is still in progress.

The overall state and the state of each node pool are reported in the RBLNDriver status:

$ kubectl get rblndriver rbln-driver -o jsonpath='{.status.state}'
ready

$ kubectl get rblndriver rbln-driver \
    -o jsonpath='{range .status.nodePools[*]}{.name}{"\t"}{.state}{"\t"}{.ready}/{.desired}{"\n"}{end}'
rbln-driver-ubuntu22.04-6.8.0-90-generic        ready   1/1
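For scripting, the per-pool output can be filtered down to the pools that still need attention. The sketch below assumes the tab-separated `<name> <state> <ready>/<desired>` lines produced by the jsonpath query above; `pools_not_ready` is a hypothetical helper.

```shell
# Hypothetical helper: read "<name>\t<state>\t<ready>/<desired>" lines on stdin
# and print the name of every pool that is not fully ready.
pools_not_ready() {
  awk -F'\t' '{
    split($3, count, "/")
    if ($2 != "ready" || count[1] != count[2]) print $1
  }'
}

# Usage (requires cluster access):
#   kubectl get rblndriver rbln-driver \
#     -o jsonpath='{range .status.nodePools[*]}{.name}{"\t"}{.state}{"\t"}{.ready}/{.desired}{"\n"}{end}' \
#     | pools_not_ready
```

An empty result means every pool reports `ready` with all desired pods ready.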

Upgrade Modes and Policy¶

AutoUpgrade Mode (`autoUpgrade: true`)¶

When AutoUpgrade is enabled, the operator performs a rollout across nodes according to policy.

Upgrade behavior is controlled through upgradePolicy.

Upgrade Flow¶

For each node selected by the operator, the following steps run:

Step	Action
1	Node is cordoned to prevent new workloads from being scheduled
2	Existing NPU workloads are handled according to policy: `waitForCompletion`, `npuPodDeletion`, `drain`
3	Driver pod is restarted, which triggers local reconciliation on the node
4	Node is rebooted, if configured
5	Node is validated
6	Node is uncordoned

The operator then proceeds to the next batch of nodes according to maxParallelUpgrades.

Node Upgrade Selection¶

Nodes that require an upgrade are detected when:

driver DaemonSet revision changes
an explicit upgrade request is issued

The operator selects nodes for upgrade based on maxParallelUpgrades:

Value	Behavior
`1`	Upgrade one node at a time (sequential)
`0`	Unlimited parallel upgrades
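The effect of maxParallelUpgrades can be sketched by grouping a node list into batches. This is a simplification: `plan_batches` is a hypothetical helper that uses fixed batches, whereas the operator may schedule nodes differently (for example, starting the next node as soon as a slot frees up).

```shell
# Hypothetical helper: group nodes into upgrade batches, one batch per line.
# A limit of 0 is treated as unlimited, so all nodes land in a single batch.
plan_batches() {
  max="$1"
  shift
  [ "$#" -eq 0 ] && return 0
  if [ "$max" -eq 0 ]; then
    echo "$*"
    return 0
  fi
  count=0
  batch=""
  for node in "$@"; do
    batch="${batch:+$batch }$node"
    count=$((count + 1))
    if [ "$count" -eq "$max" ]; then
      echo "$batch"
      batch=""
      count=0
    fi
  done
  # Print the final, partially filled batch if there is one.
  if [ -n "$batch" ]; then
    echo "$batch"
  fi
  return 0
}

plan_batches 2 node-a node-b node-c
```

With a limit of 2 and three nodes, the first line lists two nodes and the second line lists the remaining one.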

Reboot Workflow¶

To reboot each node as part of the upgrade, enable the reboot block:

driver:
  upgradePolicy:
    reboot:
      enable: true
      rebootTimeoutSeconds: 600
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest

This block is disabled by default in the chart (reboot.enable: false).

When enabled, the workflow is:

Step	Action
1	Operator triggers a reboot through the reboot helper pod
2	Node temporarily becomes `NotReady`
3	Node returns to `Ready` after reboot validation

Manual Mode (`autoUpgrade: false`)¶

When AutoUpgrade is disabled, the driver DaemonSet uses the OnDelete strategy.

Driver pods are not automatically restarted when the DaemonSet template changes.

Instead, upgrades occur only when an administrator performs explicit actions.

Manual Upgrade Procedure¶

1. Administrator selects a node
2. Node maintenance actions are performed (typically cordon and drain)
3. Administrator deletes the driver pod:

$ kubectl delete pod <driver-pod> -n rbln-system

4. Kubernetes DaemonSet controller creates a new driver pod
5. The initContainer triggers the driver reconciliation flow

During reconciliation, rbln-k8s-driver-manager:

pauses related Rebellions components
updates the node driver state
restores node labels once complete

After reconciliation, Rebellions components are automatically rescheduled.
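The manual procedure for a single node can be wrapped in two small shell functions. This is a sketch, not a supported tool: the function names are hypothetical, the `rbln-system` namespace is taken from the verification commands later on this page, and the drain flags (`--ignore-daemonsets`, `--timeout=300s`) are example choices you should adapt to your workloads. Setting `KUBECTL=echo` gives a dry run that only prints the arguments.

```shell
# Sketch of the manual procedure for one node.
# Set KUBECTL=echo for a dry run that prints the kubectl arguments instead.
start_manual_upgrade() {
  node="$1"
  driver_pod="$2"
  kubectl_cmd="${KUBECTL:-kubectl}"
  # Stop at the first failing step so the driver pod is never deleted
  # on a node that could not be cordoned or drained.
  "$kubectl_cmd" cordon "$node" &&
    "$kubectl_cmd" drain "$node" --ignore-daemonsets --timeout=300s &&
    "$kubectl_cmd" delete pod "$driver_pod" -n rbln-system
}

# Run this only after the replacement driver pod is Running and Ready.
finish_manual_upgrade() {
  "${KUBECTL:-kubectl}" uncordon "$1"
}
```

Keeping the uncordon step separate leaves room to verify the new driver before workloads return to the node.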

Helm Configuration¶

If driver.enabled: false at the chart level, the RBLNDriver resource is never created and the upgrade behavior described on this page does not apply. The remainder of this section assumes driver.enabled: true.

All upgrade actions must be enabled explicitly. The chart ships with autoUpgrade, drain.enable, and reboot.enable set to false by default.

To turn on rollout managed by the operator, set autoUpgrade: true and enable the subblocks (drain.enable, reboot.enable, etc.) for any maintenance actions you want the operator to perform.

Example configuration:

driver:
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    npuPodDeletion:
      force: false
      timeoutSeconds: 300
    drain:
      enable: true
      force: false
      deleteEmptyDirData: false
      podSelector: ""
      timeoutSeconds: 300
    reboot:
      enable: true
      rebootTimeoutSeconds: 0
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest
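The same settings can also be switched on from the command line instead of a values file. The sketch below enables only operator-managed rollout and drain, leaving reboot at its default. The release name, chart reference, and namespace are placeholders you must supply, the function name is hypothetical, and the value paths assume `driver` is a top-level key in the chart values, as in the example above. Setting `HELM=echo` prints the arguments without running Helm.

```shell
# Sketch: enable operator-managed rollout with drain on an existing release.
# All three arguments are placeholders for your own installation.
enable_auto_upgrade() {
  release="$1"    # Helm release name
  chart="$2"      # chart reference or local chart path
  namespace="$3"  # namespace the release is installed in
  "${HELM:-helm}" upgrade "$release" "$chart" -n "$namespace" --reuse-values \
    --set driver.upgradePolicy.autoUpgrade=true \
    --set driver.upgradePolicy.drain.enable=true
}
```

`--reuse-values` keeps the values from the previous release, so only the two listed keys change.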

Upgrade Policy Reference¶

Setting	Description
`autoUpgrade`	Driver upgrades managed by the operator. `false` (default) = operator does not orchestrate rollout; `true` = operator performs rollout
`maxParallelUpgrades`	Maximum number of nodes upgraded concurrently. `1` = sequential (default), `0` = unlimited
`waitForCompletion.timeoutSeconds`	Maximum seconds to wait for selected pods to complete before removal. `0` (default) = wait indefinitely
`waitForCompletion.podSelector`	Label selector for pods to wait on. Empty string (default) = skip the wait step entirely
`npuPodDeletion.force`	`false` (default) = conservative removal (blocks on pods lacking a controller); `true` = forced removal
`npuPodDeletion.timeoutSeconds`	Maximum seconds to wait before forcibly deleting remaining pods. Default `300`; `0` = wait indefinitely
`drain.enable`	`false` (default) = drain skipped; `true` = operator drains the node before pod restart
`drain.force`	`false` (default) = drain fails on blocking pods; `true` = drain proceeds even when blocking pods exist
`drain.deleteEmptyDirData`	`false` (default) = pods using `emptyDir` storage block the drain; `true` = those pods are removed (data is lost)
`drain.podSelector`	Label selector restricting drain to matching pods. Empty string (default) = drain all pods on the node
`drain.timeoutSeconds`	Maximum seconds for the drain to complete. Default `300`; `0` = wait indefinitely
`reboot.enable`	`false` (default) = no reboot; `true` = node is rebooted as part of the upgrade
`reboot.rebootTimeoutSeconds`	Max seconds to wait for the rebooted node to return `Ready`. Only relevant when `reboot.enable: true`. `0` (default) = no timeout
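All timeout settings in the table share the convention that `0` disables the limit. A minimal polling loop with the same convention looks like this; `wait_until` is a hypothetical helper written for illustration and does not reflect how the operator implements its timeouts.

```shell
# Hypothetical helper: run a command once per second until it succeeds.
# timeout_seconds = 0 means wait indefinitely; otherwise give up (return 1)
# once that many seconds have elapsed.
wait_until() {
  timeout_seconds="$1"
  shift
  elapsed=0
  until "$@"; do
    if [ "$timeout_seconds" -gt 0 ] && [ "$elapsed" -ge "$timeout_seconds" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example: succeeds immediately because `true` always exits 0.
wait_until 5 true && echo "condition met"
```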

Operational Options¶

Skipping Driver Upgrades¶

To exclude a node from driver upgrades, set the following label:

$ kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip=true

To re-enable upgrades, remove the label:

$ kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip-

The operator re-checks this label on every upgrade attempt, so it applies to both autoUpgrade: true and manual rollouts.

Don't confuse with `rebellions.ai/npu.deploy.skip`¶

Label	Effect
`rebellions.ai/npu-driver-upgrade.skip=true`	Pauses driver upgrades only. The current driver keeps running, and other NPU components are not affected.
`rebellions.ai/npu.deploy.skip=true`	Removes all RBLN NPU components from the node, including the driver itself. NPU workloads on the node will stop working.

Use npu-driver-upgrade.skip to hold back upgrades. Use npu.deploy.skip only when you want to exclude the node from NPU workloads entirely. See Workload Labeling by Node for the full npu.deploy.skip workflow.

Inspecting state¶

To see which nodes are excluded and where the remaining nodes are in the upgrade state machine:

$ kubectl get nodes \
  -L rebellions.ai/npu.present \
  -L rebellions.ai/npu-driver-upgrade-state \
  -L rebellions.ai/npu-driver-upgrade.skip
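To list only the excluded nodes, or only the nodes that remain eligible, a label selector on the skip label is enough. The function names below are hypothetical wrappers around standard `kubectl get nodes -l` selectors; setting `KUBECTL=echo` prints the arguments instead of querying the cluster.

```shell
# Nodes explicitly excluded from driver upgrades.
list_skipped_nodes() {
  "${KUBECTL:-kubectl}" get nodes -o name \
    -l 'rebellions.ai/npu-driver-upgrade.skip=true'
}

# Nodes where the skip label is absent or set to something other than "true".
list_upgradable_nodes() {
  "${KUBECTL:-kubectl}" get nodes -o name \
    -l 'rebellions.ai/npu-driver-upgrade.skip!=true'
}
```

The `!=` selector also matches nodes that do not carry the label at all, so the second list may include nodes without NPUs.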

Verifying the Running Driver¶

To confirm the driver pod has loaded the kernel module and the expected KMD version is in use, run rbln-smi inside the driver container on the target node.

First, find the driver pod for the node. Driver pods follow the rbln-driver-<os>-<kernel>-<hash> name pattern (see Driver Image Selection):

$ kubectl get pods -n rbln-system -o wide | grep rbln-driver

Then exec into the rbln-driver-container and invoke rbln-smi:

$ kubectl exec -n rbln-system <driver-pod> -c rbln-driver-container -- rbln-smi

The header reports the running KMD version, and the device table lists the NPUs the driver has bound on this node:

+-------------------------------------------------------------------------------------------------+
|                                Device Information KMD ver: 3.0.0                                |
+-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
| NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
+=====+===========+=========+===============+======+=========+======+=====================+=======+
| 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  37C |  58.7W  | P14  |    0.0B / 15.7GiB   |   0.0 |
| 1   |           | rbln1   |  0000:06:00.0 |  39C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
+-----+-----------+---------+---------------+------+---------+------+---------------------+-------+

After an upgrade, the KMD ver line should match spec.version in your RBLNDriver.
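This comparison can be scripted by extracting the version from the header line. The sketch assumes the header keeps the `KMD ver: <version>` form shown above; `kmd_version` and `check_kmd_version` are hypothetical helpers.

```shell
# Extract the KMD version from rbln-smi output on stdin.
# Assumes a header line containing "KMD ver: <version>".
kmd_version() {
  sed -n 's/.*KMD ver: *\([^ |]*\).*/\1/p' | head -n 1
}

# Compare the version found on stdin with the expected one.
check_kmd_version() {
  expected="$1"
  actual="$(kmd_version)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: KMD $actual"
  else
    echo "mismatch: KMD '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage (requires cluster access):
#   expected="$(kubectl get rblndriver <name> -o jsonpath='{.spec.version}')"
#   kubectl exec -n rbln-system <driver-pod> -c rbln-driver-container -- rbln-smi \
#     | check_kmd_version "$expected"
```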