
NPU Driver Upgrade Workflow

This document covers the container driver mode. If you install the kernel driver directly on hosts in host driver mode, see NPU Driver Installation.

This document covers:

  • Defining the RBLNDriver: the CRD structure, what the Driver Manager installs, and how the driver image is selected per node
  • Architecture and flow: the two components that coordinate upgrades (operator and driver-manager) and what happens on each node when a driver pod starts
  • Upgrade modes and policy: rollout managed by the operator and manual rollout, and how to configure upgradePolicy (cordon, drain, reboot, etc.)
  • Operational options: how to exclude specific nodes from upgrades

Defining the RBLNDriver

The operator defines the RBLNDriver CRD to manage NPU driver installation. When you create an RBLNDriver custom resource with the desired driver version, the Driver Manager installs and maintains that version across the cluster.

RBLNDriver Sample

apiVersion: rebellions.ai/v1alpha1
kind: RBLNDriver
metadata:
  labels:
    app.kubernetes.io/name: rbln-driver
  name: rblndriver-sample
spec:
  registry: repo.rebellions.ai
  image: rebellions/rbln-driver
  version: "3.0.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    - drivercred
  resources:
    requests:
      cpu: 250m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 128Mi
  manager:
    registry: docker.io
    image: rebellions/rbln-k8s-driver-manager
    version: v0.1.3
    imagePullPolicy: IfNotPresent

What the Driver Manager Installs

When an RBLNDriver resource is applied, the Driver Manager installs:

  • Kernel driver
  • UMD libraries
  • Tools such as rbln-smi

Driver Image Selection

The Driver Manager selects the driver container image by combining the following Node Feature Discovery (NFD) labels on each node:

  • feature.node.kubernetes.io/system-os_release.ID
  • feature.node.kubernetes.io/system-os_release.VERSION_ID
  • feature.node.kubernetes.io/kernel-version.full

For example, a node running Ubuntu 22.04 with kernel 6.8.0-90-generic selects docker.io/rebellions/rbln-driver:3.0.0-6.8.0-90-generic-ubuntu22.04.
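The tag composition can be sketched as follows. This is a minimal illustration, not the Driver Manager's actual code; the label values are hardcoded examples, whereas the real values are read from each node's NFD labels.

```shell
# Image coordinates from the RBLNDriver spec
REGISTRY=docker.io
IMAGE=rebellions/rbln-driver
VERSION=3.0.0

# Example NFD label values (normally read from the Node object, e.g. via
# kubectl get node <node> -o jsonpath='{.metadata.labels}')
OS_ID=ubuntu                # feature.node.kubernetes.io/system-os_release.ID
OS_VERSION_ID=22.04         # feature.node.kubernetes.io/system-os_release.VERSION_ID
KERNEL=6.8.0-90-generic     # feature.node.kubernetes.io/kernel-version.full

# Compose the per-node driver image reference
TAG="${VERSION}-${KERNEL}-${OS_ID}${OS_VERSION_ID}"
echo "${REGISTRY}/${IMAGE}:${TAG}"
```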


Architecture and Flow

Architecture Overview

NPU driver upgrades are coordinated by two components:

Component                 Role
rbln-npu-operator         Cluster-level orchestration and upgrade policy enforcement
rbln-k8s-driver-manager   Driver lifecycle reconciliation on each node

Depending on configuration, upgrades operate in one of the following modes:

Mode                             Setting             Description
Rollout managed by the operator  autoUpgrade: true   Operator orchestrates the rollout across nodes
Manual rollout                   autoUpgrade: false  Administrator triggers upgrades explicitly

Driver upgrades are handled by two layers with distinct responsibilities.

1. rbln-npu-operator (Cluster Orchestration)

The operator manages upgrade orchestration across the cluster.

Responsibilities include:

  • Detecting nodes that require driver upgrades
  • Enforcing upgrade policy (upgradePolicy)
  • Controlling rollout parallelism (maxParallelUpgrades)
  • Coordinating node maintenance actions such as:
    • cordon
    • drain
    • reboot
  • Managing rollout progression across nodes

The operator does not directly manage driver state on nodes. Instead, it triggers driver pod restarts, which start local reconciliation on the node.

2. rbln-k8s-driver-manager (Node Driver Reconciliation)

rbln-k8s-driver-manager runs within the driver DaemonSet and reconciles the driver state on each node.

Responsibilities include:

  • Detecting the current driver state on the node
  • Temporarily pausing related components during driver upgrades
  • Performing driver uninstall/install when necessary
  • Restoring node labels so workloads can resume

Reconciliation runs whenever a driver pod starts on a node.

Driver Reconciliation Flow

When a driver pod starts on a node, the initContainer runs the reconcile-driver-state logic implemented by rbln-k8s-driver-manager.

  1. Read node labels: read the rebellions.ai/npu.deploy.* labels that determine which Rebellions components run on the node.
  2. Pause related Rebellions components: replace the labels with paused-for-driver-upgrade so the DaemonSets stop and existing pods terminate.
  3. Wait for pods to terminate: wait until the relevant Rebellions component pods have exited.
  4. Reconcile driver state: if the driver image digest already matches the desired state, skip uninstalling the existing driver; otherwise, unload the kernel modules, remove old artifacts, and install the new driver.
  5. Restore node labels: restore the original labels so Rebellions components can be scheduled again.

As a result, all node components restart with the upgraded driver.
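The decision in step 4 can be sketched as follows. The variable names and digest values are illustrative only; in practice the current digest comes from the installed driver and the desired digest from the RBLNDriver spec.

```shell
# Hypothetical digests standing in for the installed and desired driver images
current_digest="sha256:aaaa"
desired_digest="sha256:bbbb"

if [ "$current_digest" = "$desired_digest" ]; then
  action="skip"        # driver already matches the desired state; leave it in place
else
  action="reinstall"   # unload kernel modules, remove old artifacts, install new driver
fi
echo "$action"
```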

Cleanup of stale driver DaemonSets

Because the driver image is selected from NFD labels (OS and kernel version), a kernel change on a node, for example after an apt upgrade and a reboot, selects a different image tag and creates a new driver DaemonSet. The operator automatically detects and deletes driver DaemonSets whose node selector no longer matches any node, preventing stale driver pods from accumulating after kernel upgrades.
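The staleness check can be sketched as follows. The DaemonSet names and kernel list below are hypothetical; the operator reads the real DaemonSets and node labels from the API server.

```shell
# Kernels currently present on cluster nodes (hypothetical)
node_kernels="6.8.0-90-generic"

# Per-kernel driver DaemonSets that exist in the cluster (hypothetical names)
ds_list="rbln-driver-ubuntu22.04-6.8.0-79-generic rbln-driver-ubuntu22.04-6.8.0-90-generic"

stale=""
for ds in $ds_list; do
  # Kernel version this DaemonSet targets, recovered from its name
  kernel="${ds#rbln-driver-ubuntu22.04-}"
  case " $node_kernels " in
    *" $kernel "*) ;;                 # still matches a node: keep
    *) stale="$stale $ds" ;;          # matches no node: mark for deletion
  esac
done
echo "stale DaemonSets:$stale"
```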


Upgrade Modes and Policy

AutoUpgrade Mode (autoUpgrade: true)

When AutoUpgrade is enabled, the operator performs a rollout across nodes according to policy.

Upgrade behavior is controlled through upgradePolicy.

Upgrade Flow

For each node selected by the operator, the following steps run:

  1. The node is cordoned to prevent new workloads from being scheduled.
  2. Existing NPU workloads are handled according to policy (waitForCompletion, npuPodDeletion, drain).
  3. A driver pod restart triggers local reconciliation on the node.
  4. The node is rebooted, if configured.
  5. The node is validated.
  6. The node is uncordoned.

The operator then proceeds to the next batch of nodes according to maxParallelUpgrades.

Node Upgrade Selection

Nodes that require an upgrade are detected when:

  • driver DaemonSet revision changes
  • an explicit upgrade request is issued

The operator selects nodes for upgrade based on maxParallelUpgrades:

Value   Behavior
1       Upgrade one node at a time (sequential)
0       Unlimited parallel upgrades
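The batching behavior can be sketched as follows. The node names and helper variables are illustrative; the operator works on real Node objects that it has identified as needing an upgrade.

```shell
# Hypothetical set of nodes that need an upgrade
nodes="node-a node-b node-c node-d"
max_parallel=2     # maxParallelUpgrades; 0 would mean "no limit"

batch=""
count=0
for n in $nodes; do
  # Stop adding nodes once the batch reaches maxParallelUpgrades (unless 0)
  if [ "$max_parallel" -gt 0 ] && [ "$count" -ge "$max_parallel" ]; then
    break
  fi
  batch="$batch $n"
  count=$((count + 1))
done
echo "upgrading:$batch"
```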

Reboot Workflow

Optional reboot can be enabled:

driver:
  upgradePolicy:
    reboot:
      enable: true
      rebootTimeoutSeconds: 600
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest

This block is disabled by default in the chart (reboot.enable: false).

When enabled, the workflow is:

  1. The operator triggers a reboot through the reboot helper pod.
  2. The node temporarily becomes NotReady.
  3. The node returns to Ready after reboot validation.
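The rebootTimeoutSeconds semantics (0 = no timeout, otherwise a bound on the wait) can be sketched as follows. check_ready is a hypothetical stand-in for a real readiness probe; here it just tests a marker file so the sketch runs without a cluster.

```shell
# Hypothetical readiness probe; in reality this would read the node's
# Ready condition from the API server.
check_ready() { [ -f /tmp/node-ready ]; }

touch /tmp/node-ready       # simulate the node coming back after reboot

timeout=600                 # rebootTimeoutSeconds; 0 would mean "wait forever"
interval=5
elapsed=0
while ! check_ready; do
  if [ "$timeout" -gt 0 ] && [ "$elapsed" -ge "$timeout" ]; then
    echo "timed out waiting for node Ready"
    exit 1
  fi
  sleep "$interval"
  elapsed=$((elapsed + interval))
done
echo "node Ready after ${elapsed}s"
rm -f /tmp/node-ready
```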

Manual Mode (autoUpgrade: false)

When AutoUpgrade is disabled, the driver DaemonSet uses the OnDelete strategy.

Driver pods are not automatically restarted when the DaemonSet template changes.

Instead, upgrades occur only when an administrator performs explicit actions.

Manual Upgrade Procedure

  1. Administrator selects a node
  2. Node maintenance actions are performed (typically cordon and drain)
  3. Administrator deletes the driver pod:
    $ kubectl delete pod -n rbln-system <driver-pod>
    
  4. Kubernetes DaemonSet controller creates a new driver pod
  5. The initContainer triggers the driver reconciliation flow

During reconciliation, rbln-k8s-driver-manager:

  • pauses related Rebellions components
  • updates the node driver state
  • restores node labels once complete

After reconciliation, Rebellions components are automatically rescheduled.

Helm Configuration

If driver.enabled: false at the chart level, the RBLNDriver resource is never created and the upgrade behavior described on this page does not apply. The remainder of this section assumes driver.enabled: true.

All upgrade actions must be enabled explicitly. The chart ships with autoUpgrade, drain.enable, and reboot.enable set to false by default.

To turn on rollout managed by the operator, set autoUpgrade: true and enable the subblocks (drain.enable, reboot.enable, etc.) for any maintenance actions you want the operator to perform.

Example configuration:

driver:
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    npuPodDeletion:
      force: false
      timeoutSeconds: 300
    drain:
      enable: true
      force: false
      deleteEmptyDirData: false
      podSelector: ""
      timeoutSeconds: 300
    reboot:
      enable: true
      rebootTimeoutSeconds: 0
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest

Upgrade Policy Reference

  • autoUpgrade: whether driver upgrades are managed by the operator. false (default) = the operator does not orchestrate a rollout; true = the operator performs the rollout
  • maxParallelUpgrades: maximum number of nodes upgraded concurrently. 1 (default) = sequential; 0 = unlimited
  • waitForCompletion.timeoutSeconds: maximum seconds to wait for selected pods to complete before removal. 0 (default) = wait indefinitely
  • waitForCompletion.podSelector: label selector for pods to wait on. Empty string (default) = skip the wait step entirely
  • npuPodDeletion.force: false (default) = conservative removal (blocks on pods lacking a controller); true = forced removal
  • npuPodDeletion.timeoutSeconds: maximum seconds before remaining pods are forcibly deleted. 0 = wait indefinitely; default is 300
  • drain.enable: false (default) = drain is skipped; true = the operator drains the node before the pod restart
  • drain.force: false (default) = drain fails on blocking pods; true = drain proceeds even when blocking pods exist
  • drain.deleteEmptyDirData: false (default) = pods using emptyDir storage block the drain; true = those pods are removed (their data is lost)
  • drain.podSelector: label selector restricting the drain to matching pods. Empty string (default) = drain all pods on the node
  • drain.timeoutSeconds: maximum seconds for the drain to complete. 0 = wait indefinitely; default is 300
  • reboot.enable: false (default) = no reboot; true = the node is rebooted as part of the upgrade
  • reboot.rebootTimeoutSeconds: maximum seconds to wait for the rebooted node to return to Ready; only relevant when reboot.enable: true. 0 (default) = no timeout

Operational Options

Skipping Driver Upgrades

To exclude a node from driver upgrades, set the following label:

$ kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip=true

To re-enable upgrades, remove the label:

$ kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip-

The operator re-checks this label on every upgrade attempt, so it applies to both autoUpgrade: true and manual rollouts.

Don't confuse with rebellions.ai/npu.deploy.skip

  • rebellions.ai/npu-driver-upgrade.skip=true: pauses driver upgrades only. The current driver keeps running, and other NPU components are not affected.
  • rebellions.ai/npu.deploy.skip=true: removes all RBLN NPU components from the node, including the driver itself. NPU workloads on the node will stop working.

Use npu-driver-upgrade.skip to hold back upgrades. Use npu.deploy.skip only when you want to exclude the node from NPU workloads entirely. See Workload Labeling by Node for the full npu.deploy.skip workflow.

Inspecting state

To see which nodes are excluded and where the remaining nodes are in the upgrade state machine:

$ kubectl get nodes \
  -L rebellions.ai/npu.present \
  -L rebellions.ai/npu-driver-upgrade-state \
  -L rebellions.ai/npu-driver-upgrade.skip

Verifying the Running Driver

To confirm the driver pod has loaded the kernel module and the expected KMD version is in use, run rbln-smi inside the driver container on the target node.

First, find the driver pod for the node. Driver pods follow the rbln-driver-<os>-<kernel>-<hash> name pattern (see Driver Image Selection):

$ kubectl get pods -n rbln-system -o wide | grep rbln-driver

Then exec into the rbln-driver-container and invoke rbln-smi:

$ kubectl exec -n rbln-system <driver-pod> -c rbln-driver-container -- rbln-smi

The header reports the running KMD version, and the device table lists the NPUs the driver has bound on this node:

+-------------------------------------------------------------------------------------------------+
|                                Device Information KMD ver: 3.0.0                                |
+-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
| NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
+=====+===========+=========+===============+======+=========+======+=====================+=======+
| 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  37C |  58.7W  | P14  |    0.0B / 15.7GiB   |   0.0 |
| 1   |           | rbln1   |  0000:06:00.0 |  39C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
+-----+-----------+---------+---------------+------+---------+------+---------------------+-------+

After an upgrade, the KMD ver line should match spec.version in your RBLNDriver.