Sandboxed Workloads with the Rebellions NPU Operator
BREAKING CHANGE
The Sandbox Device Plugin no longer takes a sandboxDevicePlugin.resourceList. It now advertises one resource per NPU product, derived automatically, so the resource names change. For example, rebellions.ai/ATOM_CA25_PT becomes rebellions.ai/RBLN-CA25_PF. When upgrading, update the resourceName entries in your KubeVirt permittedHostDevices and the deviceName in each VM's hostDevices to match. See Resource naming.
Overview
The RBLN NPU Operator can expose NPUs to guest VMs through VFIO so that virtualized AI workloads achieve near-native acceleration. When sandbox mode is enabled, the operator does the following:
- Rebinds PCI devices to the VFIO driver
  The vfio-manager DaemonSet ships the vfio-manage.sh helper through a ConfigMap. It detaches each RBLN PCI device from its native driver and reattaches it to vfio-pci, making the device safe for passthrough.
- Announces VFIO-backed resources
  The sandbox-device-plugin DaemonSet scans VFIO-managed NPUs and automatically advertises one resource per NPU product, such as rebellions.ai/RBLN-CA22_PF and rebellions.ai/RBLN-CA25_PF. Any workload that requests resources through a Kubernetes device plugin, including KubeVirt, can consume them.
- Labels eligible nodes
  Node Feature Discovery (NFD) reports the underlying hardware (feature.node.kubernetes.io/pci-1eff.present=true). The operator labels those nodes with rebellions.ai/npu.present=true and workload-specific keys so that only nodes capable of VFIO passthrough run the sandbox components.
After those controllers reconcile, the NPUs are exposed to KubeVirt VirtualMachine objects through the rebellions.ai/* resource names referenced in each hostDevices stanza.
The Helm setup below enables sandbox mode cluster-wide. To run a hybrid deployment — most nodes serving container workloads, only a few hosting VFIO-backed VMs — keep workloadType: container in the chart, enable both vfioManager.enabled and sandboxDevicePlugin.enabled, and set rebellions.ai/npu.workload.config=vm-passthrough on the nodes that should host VMs. See Workload Labeling by Node for the full procedure and operational caveats.
Device lifecycle
Sandbox mode binds and unbinds RBLN PCI devices to the vfio-pci driver on each node according to the node's workload label. The lifecycle is bidirectional, so the same node can be moved between container and VM-passthrough modes without manual cleanup.
- Label selection: Apply the rebellions.ai/npu.workload.config=vm-passthrough label to the node.
- Bind on entry: When the VFIO Manager DaemonSet is scheduled on the node, its initContainer unloads the host RBLN kernel driver and the main container binds each NPU PCI device to vfio-pci. The sandbox device plugin's initContainer then verifies the binding state of every device before the sandbox device plugin starts advertising VFIO resources.
- Hybrid clusters: When spec.workloadType: container is set, first enable both vfioManager.enabled: true and sandboxDevicePlugin.enabled: true. Then opt nodes in by setting rebellions.ai/npu.workload.config=vm-passthrough. Container workloads on the remaining nodes keep using the RBLN driver, untouched. See Workload Labeling by Node for label precedence rules.
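To make the opt-in step concrete, the node label can be written as a small merge patch. This is only a sketch: the file name vm-passthrough-label.yaml is hypothetical, and the label key and value are the ones given in the list above.

```yaml
# vm-passthrough-label.yaml (hypothetical file name)
# Merge patch that opts a single node into VM passthrough.
metadata:
  labels:
    rebellions.ai/npu.workload.config: vm-passthrough
```

Assuming a kubectl version that supports --patch-file, one way to apply it is kubectl patch node <node-name> --patch-file vm-passthrough-label.yaml. Running kubectl label node <node-name> rebellions.ai/npu.workload.config=vm-passthrough sets the same label without a file.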
Prerequisites
- Kubernetes 1.19+ cluster
- Worker nodes with RBLN NPUs (RBLN-CA22/CA25)
- IOMMU enabled in the BIOS and on the kernel command line (intel_iommu=on or amd_iommu=on), and the VFIO kernel modules (vfio, vfio_pci, vfio_iommu_type1)
- KubeVirt Operator installed and ready to schedule VMs
- Node Feature Discovery (can be deployed by the Helm chart itself)
Helm Deployment for Sandboxed Workloads
- Install Helm (if necessary)
- Add the Rebellions chart repository
- Configure the sandbox workload profile
  The chart ships with a ready-made example at sample-values-SandboxWorkload.yaml. It enables the VFIO Manager and the Sandbox Device Plugin. The plugin discovers each NPU and derives its resource name automatically, so no resource list is required. See Resource naming for the convention.
  You can also copy the base values.yaml and toggle the relevant keys manually:
  - Set sandboxDevicePlugin.enabled=true
  - Set vfioManager.enabled=true
  - Ensure nfd.enabled=true if NFD is not already running
- Install with the sandbox profile
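For the configuration step above, a values override that toggles the three listed keys might look like the sketch below. It assumes the dotted key names map to nested YAML in the usual Helm way. The file name is hypothetical, and the shipped sample-values-SandboxWorkload.yaml may contain additional settings not shown here.

```yaml
# my-sandbox-values.yaml (hypothetical file name)
# Keys mirror the dotted names listed in the configuration step.
vfioManager:
  enabled: true        # rebind RBLN PCI devices to vfio-pci
sandboxDevicePlugin:
  enabled: true        # advertise one resource per NPU product
nfd:
  enabled: true        # only needed if NFD is not already running
```

A file like this is typically passed to helm install or helm upgrade with the -f flag.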
Resource naming
The Sandbox Device Plugin advertises one Kubernetes resource per NPU product. On each node it scans the NPUs bound to vfio-pci and resolves their product names from a PCI ID database bundled with the plugin under the Rebellions vendor ID 1eff. The names are derived automatically. There is no resourceList to maintain.
The product name from the database becomes the resource suffix: it is upper-cased, spaces are replaced with _, and characters outside letters, digits, _, and - are dropped. For example, the entry 1250 RBLN-CA25 (PF) is advertised as rebellions.ai/RBLN-CA25_PF:
| pci.ids entry (vendor 1eff) | Advertised resource |
|---|---|
| 1220 RBLN-CA22 (PF) | rebellions.ai/RBLN-CA22_PF |
| 1250 RBLN-CA25 (PF) | rebellions.ai/RBLN-CA25_PF |
| 2030 RBLN-CR03 (PF) | rebellions.ai/RBLN-CR03_PF |
Note
The plugin carries its own PCI ID database, so resource names need no host setup. A product newer than the bundled database is advertised under the device-ID name rebellions.ai/RBLN-<device-id> (for example, rebellions.ai/RBLN-1250) until the plugin is updated to a build that includes it.
Consuming the VFIO Resources from KubeVirt
Note
Enable KubeVirt's HostDevices feature gate and list each Rebellions PCI resource under permittedHostDevices.pciHostDevices before attaching them to VMs. Set externalResourceProvider: true on each entry so that KubeVirt delegates allocation to the sandbox device plugin instead of advertising the same resourceName itself.
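A KubeVirt custom resource with these settings could look roughly as follows. This is a sketch under stated assumptions: the CR name and namespace (kubevirt/kubevirt) are those of a default KubeVirt installation, and each pciVendorSelector is assembled from vendor ID 1eff and the device IDs shown in the Resource naming table. List only the products actually present in your cluster.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt          # assumption: default CR name
  namespace: kubevirt     # assumption: default namespace
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - HostDevices
    permittedHostDevices:
      pciHostDevices:
        # Selector format is vendor:device, taken from the pci.ids
        # entries under vendor 1eff.
        - pciVendorSelector: "1EFF:1250"
          resourceName: rebellions.ai/RBLN-CA25_PF
          externalResourceProvider: true   # sandbox device plugin allocates
        - pciVendorSelector: "1EFF:1220"
          resourceName: rebellions.ai/RBLN-CA22_PF
          externalResourceProvider: true
```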
Create a VirtualMachine manifest where each hostDevices entry references the resource published by the sandbox device plugin.
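As an illustration, a minimal VirtualMachine that attaches one RBLN-CA25 device might look like the sketch below. The VM name, memory size, and guest image are placeholders rather than values from the chart. Only the hostDevices entry and its deviceName come from the naming convention described above.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rbln-vm                     # hypothetical name
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          hostDevices:
            - name: rbln1           # identifier for this device within the VM spec
              deviceName: rebellions.ai/RBLN-CA25_PF
        resources:
          requests:
            memory: 8Gi             # placeholder size
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest   # placeholder guest image
```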
Tips
- Each requested unit corresponds to one VFIO-bound NPU function.
- To request multiple devices, increase both requests and limits and add multiple hostDevices entries (rbln1, rbln2, …).
- The plugin advertises a separate resource per NPU product, so when a single Kubernetes cluster includes multiple RBLN device types, each workload requests the exact product it needs (for example, rebellions.ai/RBLN-CA22_PF vs rebellions.ai/RBLN-CA25_PF).
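The multi-device tip can be sketched as a hostDevices stanza with two entries for the same product. Only this fragment of the VirtualMachine spec is shown. The entry names follow the rbln1, rbln2 pattern mentioned above, and any requests and limits in your manifest would need to be raised to match.

```yaml
# Fragment of spec.template.spec.domain.devices in a VirtualMachine
hostDevices:
  - name: rbln1
    deviceName: rebellions.ai/RBLN-CA25_PF
  - name: rbln2                     # second device of the same product
    deviceName: rebellions.ai/RBLN-CA25_PF
```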