Workload Labeling by Node¶
The RBLNClusterPolicy is a single cluster-scoped CR that controls the operator's deployment behavior across all nodes with NPUs. Two node labels let you override its settings for a specific node, or exclude that node from deployment, without editing the policy itself:
- `rebellions.ai/npu.deploy.skip`: exclude a specific node from operator component deployment entirely.
- `rebellions.ai/npu.workload.config`: set `container` or `vm-passthrough` mode for a specific node, overriding the cluster default.
The operator acts on changes to either label during its next reconciliation loop.
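For example, both labels are applied with `kubectl label` (the node name `worker-1` is a placeholder for one of your nodes):

```shell
# Exclude worker-1 from all operator component deployment.
kubectl label node worker-1 rebellions.ai/npu.deploy.skip=true

# Override the cluster-wide workload type for worker-1 only.
kubectl label node worker-1 rebellions.ai/npu.workload.config=vm-passthrough
```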
Excluding a Node from Component Deployment¶
`rebellions.ai/npu.deploy.skip=true` keeps the node in the cluster and visible to NFD, but prevents the operator from scheduling any components on that node.
Use this in the following cases:
- When a debug or maintenance node should not run daemons managed by the operator.
- When a node is managed by a separate tool and should be left alone by the operator.
- When you need temporary isolation while investigating a hardware or driver issue.
The label value must be exactly `true`; any other value is ignored.
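A minimal sketch of applying and later removing the exclusion (node name is an example):

```shell
# Apply the exclusion; only the exact value "true" is honored.
kubectl label node worker-1 rebellions.ai/npu.deploy.skip=true

# Remove the label (trailing "-") to let the operator deploy
# components to the node again on its next reconciliation.
kubectl label node worker-1 rebellions.ai/npu.deploy.skip-
```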
When the label is applied:
- Every `rebellions.ai/npu.deploy.<component>` label on the node is removed, which causes the operator's DaemonSets to remove their pods from that node.
- The `rebellions.ai/npu-driver-upgrade-enabled` annotation is cleared, so the node is also excluded from driver upgrades managed by the operator. In other words, `npu.deploy.skip` includes all behavior from Skipping Driver Upgrades and also excludes the node from component deployment.
- NFD labels (`rebellions.ai/npu.present`, `rebellions.ai/npu.product`, `rebellions.ai/npu.family`, etc.) are preserved. The device itself remains discoverable.
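You can confirm the effect by listing the node's remaining `rebellions.ai` labels; the NFD discovery labels should still appear while the `npu.deploy.<component>` labels should be gone (node name is an example):

```shell
# Print the node's labels one per line and keep only the operator's namespace.
kubectl get node worker-1 -o jsonpath='{.metadata.labels}' \
  | tr ',' '\n' | grep 'rebellions.ai'
```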
Selecting a Node's Workload Mode¶
`rebellions.ai/npu.workload.config` overrides `RBLNClusterPolicy.spec.workloadType` for a single node. Allowed values are `container` and `vm-passthrough`; unknown values fall back to the spec default.
Use this for hybrid deployments, where most nodes serve container workloads while a few nodes host KubeVirt VMs with NPU passthrough.
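In such a hybrid cluster, only the VM nodes need the label; everything else inherits the cluster default (node names here are examples):

```shell
# Dedicate worker-2 and worker-3 to KubeVirt VMs with NPU passthrough.
# All unlabeled nodes keep the default from RBLNClusterPolicy.spec.workloadType.
kubectl label node worker-2 worker-3 \
  rebellions.ai/npu.workload.config=vm-passthrough
```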
Required chart configuration¶
For the node override to take effect, both `vfioManager` and `sandboxDevicePlugin` must be enabled across the cluster. The CRD intentionally allows the hybrid pattern below, where `container` is the default while both VM-passthrough components are also enabled:
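A sketch of that hybrid policy spec is shown here; the top-level field names come from this page, but the exact nesting (such as an `enabled` flag under each component) is an assumption, so check the CRD schema for your operator version:

```yaml
# RBLNClusterPolicy fragment (field layout under each component is assumed)
spec:
  workloadType: container      # cluster-wide default mode
  vfioManager:
    enabled: true              # required for any vm-passthrough node
  sandboxDevicePlugin:
    enabled: true              # required for any vm-passthrough node
```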
Label Configuration¶
Each node with the label receives only the components for its mode:
- `container` nodes: driver, Device Plugin, Container Toolkit, Metrics Exporter, NPU Feature Discovery, Validator (optionally `rbln-daemon` and the DRA kubelet plugin).
- `vm-passthrough` nodes: only `vfio-manager` and `sandbox-device-plugin`. The host driver is intentionally absent because the device is bound to `vfio-pci` for the guest VM.
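One way to confirm which components landed on a given node is to list the operator's pods scheduled there; the namespace name below is an example and depends on how you installed the operator:

```shell
# Show operator pods on worker-2; a vm-passthrough node should list only
# the vfio-manager and sandbox-device-plugin pods.
kubectl get pods -n rbln-operator -o wide \
  --field-selector spec.nodeName=worker-2
```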
Switching the mode of a node currently in use¶
Always drain the node before changing its workload mode. During the switch, the device is rebound between the kernel driver and `vfio-pci`, so any container still using the NPU may be disconnected abruptly.
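The safe sequence can be sketched as drain, relabel, uncordon (node name is an example):

```shell
# 1. Evict workloads so nothing is using the NPU during the rebind.
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data

# 2. Switch the mode; the operator rebinds the device to vfio-pci
#    on its next reconciliation.
kubectl label node worker-2 \
  rebellions.ai/npu.workload.config=vm-passthrough --overwrite

# 3. Return the node to service once the mode switch has completed.
kubectl uncordon worker-2
```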
Label Precedence¶
If both labels are set on the same node, `rebellions.ai/npu.deploy.skip=true` takes precedence: every `npu.deploy.<component>` label is removed regardless of the `workload.config` value.
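For illustration, with both labels on one node the skip label wins; the mode label only takes effect after the skip label is removed (node name is an example):

```shell
# skip=true suppresses all deployment, so workload.config is
# effectively ignored while both labels are present.
kubectl label node worker-1 \
  rebellions.ai/npu.deploy.skip=true \
  rebellions.ai/npu.workload.config=container
```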