NPU Driver Upgrade Workflow

This document describes how NPU driver upgrades work in Kubernetes and how to configure them. It covers:

  • Architecture and flow — the two components that coordinate upgrades (operator and driver-manager) and what happens on each node when a driver pod starts
  • Upgrade modes and policy — operator-managed vs manual rollout, and how to configure upgradePolicy (cordon, drain, reboot, etc.)
  • Operational options — how to exclude specific nodes from upgrades

Architecture and flow

Architecture Overview

NPU driver upgrades are coordinated by two components:

Component                 Role
rbln-npu-operator         Cluster-level orchestration and upgrade policy enforcement
rbln-k8s-driver-manager   Node-local driver lifecycle reconciliation

Depending on configuration, upgrades operate in one of the following modes:

Mode                       Setting              Description
Operator-managed rollout   autoUpgrade: true    Operator orchestrates rollout across nodes
Manual rollout             autoUpgrade: false   Administrator triggers upgrades explicitly

The two components have distinct responsibilities:

1. rbln-npu-operator (Cluster Orchestration)

The operator manages cluster-wide upgrade orchestration.

Responsibilities include:

  • Detecting nodes that require driver upgrades
  • Enforcing upgrade policy (upgradePolicy)
  • Controlling rollout parallelism (maxParallelUpgrades)
  • Coordinating node maintenance actions such as:
    • cordon
    • drain
    • reboot
  • Managing rollout progression across nodes

The operator does not directly manage driver state on nodes. Instead, it triggers driver pod restarts, which initiate node-local reconciliation.

2. rbln-k8s-driver-manager (Node Driver Reconciliation)

rbln-k8s-driver-manager runs within the driver DaemonSet and is responsible for reconciling the driver state on each node.

Responsibilities include:

  • Detecting the current driver state on the node
  • Temporarily pausing related system components during driver upgrades
  • Performing driver uninstall/install when necessary
  • Restoring node labels so workloads can resume

The reconciliation process is triggered whenever a driver pod starts on a node.

Driver Reconciliation Flow

When a driver pod starts on a node, the initContainer runs the reconcile-driver-state logic implemented by rbln-k8s-driver-manager.

Step                                            Action
1. Read node labels                             Read the rebellions.ai/npu.deploy.* labels, which determine which Rebellions components run on the node
2. Pause related Rebellions system components   Replace the labels with paused-for-driver-upgrade so that DaemonSets stop and existing pods terminate
3. Wait for pods to terminate                   Wait until the relevant Rebellions component pods have exited
4. Reconcile driver state                       If the driver image digest matches the desired state, the uninstall is skipped. Otherwise, unload kernel modules, remove old artifacts, and install the new driver
5. Restore node labels                          Restore the original labels so Rebellions components can be scheduled again

As a result, all node components restart using the upgraded driver.
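The five steps above can be sketched as a small simulation over an in-memory label dictionary. This is only an illustration of the ordering, not the rbln-k8s-driver-manager implementation: the function name, the log entries, and the example label key suffix (device-plugin) are hypothetical, and the sketch assumes that "replacing labels" means overwriting each label value with paused-for-driver-upgrade.

```python
PAUSED = "paused-for-driver-upgrade"
DEPLOY_PREFIX = "rebellions.ai/npu.deploy."


def reconcile_driver_state(labels, current_digest, desired_digest, log):
    """Walk the five reconciliation steps against an in-memory label dict.

    `labels` is mutated in place; `log` collects the actions taken.
    Returns the digest considered installed once the steps have run.
    """
    # Step 1: read the deploy labels that select Rebellions components.
    saved = {k: v for k, v in labels.items() if k.startswith(DEPLOY_PREFIX)}

    # Step 2: pause the components (assumption: the label value is overwritten).
    for key in saved:
        labels[key] = PAUSED
    log.append("paused")

    # Step 3: a real implementation would block here until the pods exit.
    log.append("waited")

    # Step 4: skip the uninstall/install cycle when the digest already matches.
    if current_digest == desired_digest:
        log.append("skip-reinstall")
    else:
        log.extend(["unload-modules", "remove-artifacts", "install"])

    # Step 5: restore the original label values so components reschedule.
    labels.update(saved)
    log.append("restored")
    return desired_digest


if __name__ == "__main__":
    node_labels = {
        "rebellions.ai/npu.deploy.device-plugin": "true",  # hypothetical key
        "kubernetes.io/os": "linux",
    }
    actions = []
    reconcile_driver_state(node_labels, "sha256:old", "sha256:new", actions)
    print(actions)
```

Saving the original values before pausing is what allows step 5 to put the node back exactly as it was, regardless of whether the driver was reinstalled.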


Upgrade modes and policy

AutoUpgrade Mode (autoUpgrade: true)

When AutoUpgrade is enabled, the operator performs policy-driven rollout across nodes.

Upgrade behavior is controlled through upgradePolicy.

Upgrade Flow

For each node selected by the operator:

Step   Action
1      The node is cordoned to prevent new workloads from being scheduled.
2      Existing NPU workloads are handled according to policy: waitForCompletion, npuPodDeletion, drain.
3      The driver pod is restarted, which triggers node-local reconciliation.
4      If reboot is enabled, the node is rebooted.
5      The node is validated.
6      The node is uncordoned.

The operator then proceeds to the next batch of nodes according to maxParallelUpgrades.
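As a rough illustration of how the policy shapes the per-node sequence, the sketch below builds the ordered step list from an upgradePolicy-like dictionary. The function name and step identifiers are hypothetical, and the gating is an assumption: only drain and reboot are treated as optional (via their enable flags), while the other steps are always listed.

```python
def node_upgrade_steps(policy):
    """Return the ordered per-node steps for an upgradePolicy-like dict.

    Assumption: drain and reboot are gated by their `enable` flag;
    every other step always appears in the sequence.
    """
    steps = ["cordon", "wait-for-completion", "npu-pod-deletion"]
    if policy.get("drain", {}).get("enable", False):
        steps.append("drain")
    steps.append("restart-driver-pod")
    if policy.get("reboot", {}).get("enable", False):
        steps.append("reboot")
    steps.extend(["validate", "uncordon"])
    return steps


if __name__ == "__main__":
    example_policy = {"drain": {"enable": True}, "reboot": {"enable": False}}
    print(node_upgrade_steps(example_policy))
```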

Node Upgrade Selection

Nodes that require upgrade are detected when:

  • driver DaemonSet revision changes
  • an explicit upgrade request is issued

The operator selects nodes for upgrade based on maxParallelUpgrades:

Value   Behavior
1       Upgrade one node at a time (sequential)
0       Unlimited parallel upgrades
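The effect of maxParallelUpgrades on rollout order can be illustrated with a small batching helper. This is a sketch under the assumption that nodes are processed in fixed-size batches in list order; the operator's actual selection logic may differ, and the function name is hypothetical.

```python
def plan_batches(nodes, max_parallel_upgrades):
    """Split `nodes` into upgrade batches.

    A value of 0 means unlimited parallelism, so all nodes form one batch.
    """
    if max_parallel_upgrades < 0:
        raise ValueError("max_parallel_upgrades must be >= 0")
    nodes = list(nodes)
    if not nodes:
        return []
    if max_parallel_upgrades == 0:
        return [nodes]
    size = max_parallel_upgrades
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


if __name__ == "__main__":
    print(plan_batches(["node-a", "node-b", "node-c"], 1))
    print(plan_batches(["node-a", "node-b", "node-c"], 0))
```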

Reboot Workflow

Optional reboot support can be enabled:

driver:
  upgradePolicy:
    reboot:
      enable: true
      rebootTimeoutSeconds: 600
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest

When enabled:

Step   Action
1      The operator triggers a reboot through the reboot helper pod
2      The node temporarily becomes NotReady
3      The node returns to Ready after reboot validation
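The waiting behavior implied by rebootTimeoutSeconds can be sketched as a polling loop in which 0 disables the deadline. This is an illustration only: the function name, the polling interval, and the is_ready callback are hypothetical stand-ins for whatever readiness check the operator performs.

```python
import time


def wait_for_ready(is_ready, timeout_seconds, poll_seconds=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or the timeout elapses.

    Returns True when the node became Ready and False on timeout.
    A timeout of 0 means "no timeout": the loop waits indefinitely.
    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    start = clock()
    while not is_ready():
        if timeout_seconds > 0 and clock() - start >= timeout_seconds:
            return False
        sleep(poll_seconds)
    return True


if __name__ == "__main__":
    # Simulated node that reports Ready on the third check.
    checks = iter([False, False, True])
    print(wait_for_ready(lambda: next(checks), timeout_seconds=600,
                         poll_seconds=0.01))
```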

Manual Mode (autoUpgrade: false)

When AutoUpgrade is disabled, the driver DaemonSet uses the OnDelete strategy.

Driver pods are not automatically restarted when the DaemonSet template changes.

Instead, upgrades occur through explicit administrator actions.

Manual Upgrade Procedure

  1. The administrator selects a node.
  2. The administrator performs node maintenance actions (typically cordon and drain).
  3. The administrator deletes the driver pod:
    kubectl delete pod <driver-pod>

  4. The Kubernetes DaemonSet controller creates a new driver pod.
  5. The initContainer triggers the driver reconciliation flow.

During reconciliation, rbln-k8s-driver-manager:

  • pauses related Rebellions system components
  • updates the node driver state
  • restores node labels once complete

After reconciliation, Rebellions components are automatically rescheduled.
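For administrators who script the manual procedure, the sketch below assembles the kubectl invocations in order. The function name and the namespace parameter are hypothetical. The final uncordon step is an assumption that mirrors the last step of the operator-managed flow, and it should only be run once reconciliation on the node has finished. The --ignore-daemonsets flag is included because kubectl drain otherwise refuses to proceed when DaemonSet-managed pods are present.

```python
import shlex


def manual_upgrade_commands(node, driver_pod, namespace):
    """Return the kubectl argv lists for one manual node upgrade, in order."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets"],
        # Deleting the pod makes the DaemonSet controller create a new one,
        # whose initContainer runs the driver reconciliation flow.
        ["kubectl", "delete", "pod", driver_pod, "-n", namespace],
        # Assumption: run only after reconciliation has completed.
        ["kubectl", "uncordon", node],
    ]


if __name__ == "__main__":
    # Placeholder names; substitute the real node, pod, and namespace.
    for argv in manual_upgrade_commands("worker-1", "driver-pod-abc", "example-ns"):
        print(shlex.join(argv))
```

Returning argv lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting concerns.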

Helm Configuration

Example configuration:

driver:
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    npuPodDeletion:
      force: false
      timeoutSeconds: 300
    drain:
      enable: true
      force: false
      deleteEmptyDirData: false
      podSelector: ""
      timeoutSeconds: 300
    reboot:
      enable: true
      rebootTimeoutSeconds: 0
      image:
        registry: docker.io
        image: rebellions/rbln-node-reboot
        version: latest

Upgrade Policy Reference

Setting               Description
autoUpgrade           Enables or disables operator-managed driver upgrades
maxParallelUpgrades   Maximum number of nodes upgraded concurrently. 1 = sequential, 0 = unlimited
waitForCompletion     Waits for selected pods or jobs to complete before starting pod eviction
npuPodDeletion        Eviction behavior for NPU pods. force: false = conservative (may block on unmanaged pods). force: true = aggressive eviction
drain                 Node drain. enable: true = operator performs drain. force: true = proceed even when blocking pods exist
reboot                Optional reboot workflow. rebootTimeoutSeconds = maximum time for reboot validation; 0 = no timeout

Operational options

Skipping Driver Upgrades

To skip driver upgrades on a specific node:

kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip=true

To re-enable upgrades:

kubectl label node <node-name> rebellions.ai/npu-driver-upgrade.skip-

This label is evaluated by the operator when selecting nodes for upgrade and is mainly intended for operator-managed rollouts (autoUpgrade: true).
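The selection rule can be pictured as a filter over node labels. The sketch below is illustrative only: the function name is hypothetical, and it assumes that only the exact value "true" excludes a node, which the operator may or may not enforce.

```python
SKIP_LABEL = "rebellions.ai/npu-driver-upgrade.skip"


def upgrade_candidates(node_labels):
    """Return the names of nodes that are not excluded by the skip label.

    `node_labels` maps node name -> dict of that node's labels.
    Assumption: only the value "true" marks a node as skipped.
    """
    return sorted(
        name
        for name, labels in node_labels.items()
        if labels.get(SKIP_LABEL) != "true"
    )


if __name__ == "__main__":
    cluster = {
        "worker-1": {},
        "worker-2": {SKIP_LABEL: "true"},
        "worker-3": {"kubernetes.io/os": "linux"},
    }
    print(upgrade_candidates(cluster))
```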