Sandboxed Workloads with the Rebellions NPU Operator
Overview
The RBLN NPU Operator can expose NPUs to guest VMs through VFIO so that virtualized AI workloads achieve near-native acceleration. When sandbox mode is enabled, the operator does the following:
- Rebinds PCI devices to the VFIO driver
  The vfio-manager DaemonSet ships the vfio-manage.sh helper through a ConfigMap. It detaches each RBLN PCI device from its native driver and reattaches it to vfio-pci, making the device safe for passthrough.
- Announces VFIO-backed resources
  The sandbox-device-plugin DaemonSet scans VFIO-managed NPUs and advertises resources such as rebellions.ai/ATOM_CA22_PT and rebellions.ai/ATOM_CA25_PT. Any workload that requests resources through a Kubernetes device plugin, including KubeVirt, can consume them.
- Labels eligible nodes
  Node Feature Discovery (NFD) reports the underlying hardware (feature.node.kubernetes.io/pci-1eff.present=true). The operator labels those nodes with rebellions.ai/npu.present=true and workload-specific keys so that only nodes capable of VFIO passthrough run the sandbox components.
After those controllers reconcile, the NPUs are exposed to KubeVirt VirtualMachine objects through the rebellions.ai/* resource names referenced in each hostDevices stanza.
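For orientation, the rebind step can be pictured through the standard Linux sysfs PCI interface. The sketch below is not the shipped vfio-manage.sh. It is a minimal illustration that assumes the usual driver_override flow; the function name rebind_to_vfio and the SYSFS_ROOT variable are hypothetical, and SYSFS_ROOT exists only so the logic can be tried against a scratch directory.

```shell
# Hypothetical sketch of a VFIO rebind via sysfs; not the operator's vfio-manage.sh.
# SYSFS_ROOT is an assumption that lets the function run against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

rebind_to_vfio() {
  local bdf="$1"                                  # PCI address, e.g. 0000:3b:00.0
  local dev="$SYSFS_ROOT/bus/pci/devices/$bdf"

  [ -d "$dev" ] || { echo "no such PCI device: $bdf" >&2; return 1; }

  # Detach from the currently bound driver, if there is one.
  if [ -e "$dev/driver/unbind" ]; then
    echo "$bdf" > "$dev/driver/unbind"
  fi

  # Pin the device to vfio-pci, then ask the kernel to re-probe it.
  echo "vfio-pci" > "$dev/driver_override"
  echo "$bdf" > "$SYSFS_ROOT/bus/pci/drivers_probe"
}
```

On a real node this would be run as root with the vfio-pci module already loaded, for example `rebind_to_vfio 0000:3b:00.0` (the address is only an example). In sandbox mode the operator performs this step for you, so the sketch is meant for understanding and troubleshooting rather than routine use.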
The Helm setup below enables sandbox mode cluster-wide. To run a hybrid deployment — most nodes serving container workloads, only a few hosting VFIO-backed VMs — keep workloadType: container in the chart, enable both vfioManager.enabled and sandboxDevicePlugin.enabled, and set rebellions.ai/npu.workload.config=vm-passthrough on the nodes that should host VMs. See Workload Labeling by Node for the full procedure and operational caveats.
Device lifecycle
Sandbox mode binds and unbinds RBLN PCI devices to the vfio-pci driver on each node according to the node's workload label. The lifecycle is bidirectional, so the same node can be moved between container and VM-passthrough modes without manual cleanup.
- Label selection: Apply the rebellions.ai/npu.workload.config=vm-passthrough label to the node.
- Bind on entry: When the VFIO Manager DaemonSet is scheduled on the node, its initContainer unloads the host RBLN kernel driver and the main container binds each NPU PCI device to vfio-pci. The sandbox device plugin's initContainer then verifies the binding state of every device before the sandbox device plugin starts advertising VFIO resources.
- Hybrid clusters: When spec.workloadType: container is set, first enable both vfioManager.enabled: true and sandboxDevicePlugin.enabled: true. Then opt nodes in by setting rebellions.ai/npu.workload.config=vm-passthrough. Container workloads on the remaining nodes keep using the RBLN driver, untouched. See Workload Labeling by Node for label precedence rules.
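The label step above can be sketched with plain kubectl. The helper names are hypothetical; only the label key and the vm-passthrough value come from this guide. The sketch also assumes that removing the label returns the node to the cluster-wide workloadType, which should be checked against Workload Labeling by Node.

```shell
# Hypothetical helpers; only the label key and the vm-passthrough value
# are taken from this guide.
WORKLOAD_LABEL="rebellions.ai/npu.workload.config"

# Opt a node in to VFIO passthrough.
enable_vm_passthrough() {
  kubectl label node "$1" "${WORKLOAD_LABEL}=vm-passthrough" --overwrite
}

# Remove the label again. Assumption: the node then falls back to the
# cluster-wide workloadType; see Workload Labeling by Node for the exact rules.
disable_vm_passthrough() {
  kubectl label node "$1" "${WORKLOAD_LABEL}-"
}
```

After `enable_vm_passthrough worker-1` (the node name is a placeholder), `kubectl get nodes -L rebellions.ai/npu.workload.config` shows which nodes are currently opted in.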
Prerequisites
- Kubernetes 1.19+ cluster
- Worker nodes with RBLN NPUs (RBLN-CA22/CA25)
- IOMMU enabled in BIOS and on the kernel command line (intel_iommu=on or amd_iommu=on), plus the VFIO kernel modules (vfio, vfio_pci, vfio_iommu_type1)
- KubeVirt Operator installed and ready to schedule VMs
- Node Feature Discovery (can be deployed by the Helm chart itself)
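A quick preflight check on a worker node might look like the sketch below. It is not part of the operator. The function name and the two path variables are hypothetical, and the check is deliberately strict: a VFIO module that is installed but not yet loaded is reported as missing, so run modprobe first if that applies.

```shell
# Hypothetical preflight check for a worker node; paths are overridable for testing.
PROC_CMDLINE="${PROC_CMDLINE:-/proc/cmdline}"
SYS_MODULE_DIR="${SYS_MODULE_DIR:-/sys/module}"

check_vfio_prereqs() {
  local failed=0 mod

  # The IOMMU must be requested on the kernel command line.
  if ! grep -Eq '(intel|amd)_iommu=on' "$PROC_CMDLINE"; then
    echo "missing intel_iommu=on or amd_iommu=on" >&2
    failed=1
  fi

  # A loaded module has a directory under /sys/module.
  for mod in vfio vfio_pci vfio_iommu_type1; do
    if [ ! -d "$SYS_MODULE_DIR/$mod" ]; then
      echo "VFIO module not loaded: $mod" >&2
      failed=1
    fi
  done

  return "$failed"
}
```

Calling `check_vfio_prereqs` returns 0 when everything is in place and prints one line per missing item otherwise.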
Helm Deployment for Sandboxed Workloads
- Install Helm (if necessary)
- Add the Rebellions chart repository
- Configure the sandbox workload profile
  The chart ships with a ready-made example at sample-values-SandboxWorkload.yaml. It enables the VFIO Manager and the Sandbox Device Plugin and sets suitable resource names.
  You can also copy the base values.yaml and toggle the relevant keys manually:
  - Set sandboxDevicePlugin.enabled=true
  - Set vfioManager.enabled=true
  - Adjust sandboxDevicePlugin.resourceList[] to match each card model and VFIO resource name your VMs expect
  - Ensure nfd.enabled=true if NFD is not already running
- Install with the sandbox profile
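The configure and install steps could be scripted roughly as follows. This is a sketch under assumptions. The function names, the values file name, and the release and chart arguments are placeholders. The three keys are the ones named in this guide, but their nesting is inferred from the dotted names. sandboxDevicePlugin.resourceList is left out on purpose because its schema is defined by the chart's own sample file.

```shell
# Hypothetical sketch: function names and arguments are placeholders.
# Only the three keys below come from this guide; their nesting is assumed
# from the dotted names, and resourceList is left to the chart's sample file.
write_sandbox_values() {
  cat > "$1" <<'EOF'
vfioManager:
  enabled: true
sandboxDevicePlugin:
  enabled: true
nfd:
  enabled: true   # set to false if Node Feature Discovery already runs
EOF
}

# Install or upgrade a release with the generated values file.
install_sandbox_profile() {
  local release="$1" chart="$2" values="$3"
  helm upgrade --install "$release" "$chart" -f "$values"
}
```

A typical run would be `write_sandbox_values sandbox-values.yaml` followed by `install_sandbox_profile <release> <chart> sandbox-values.yaml`, where the release name and chart reference come from the repository you added in the earlier step.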
Consuming the VFIO Resources from KubeVirt
Note
Enable KubeVirt's HostDevices feature gate and list each Rebellions PCI resource under permittedHostDevices.pciHostDevices before attaching them to VMs:
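As an illustration, the helper below writes a KubeVirt custom resource fragment with the feature gate and one permitted PCI device. It is a sketch, not the operator's own manifest. The function name is hypothetical, the kubevirt name and namespace assume a default KubeVirt install, and the vendor:device selector must be supplied by you: 1eff is taken from the NFD label shown earlier, while the device ID has to be read from the node (for example with lspci -nn). externalResourceProvider: true is set on the assumption that the sandbox device plugin, not KubeVirt, advertises the resource.

```shell
# Hypothetical sketch: writes a KubeVirt CR fragment to merge into your existing CR.
# The metadata name/namespace assume a default KubeVirt installation.
write_kubevirt_hostdevices() {
  local out="$1" vendor_device="$2" resource="$3"
  cat > "$out" <<EOF
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - HostDevices
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "${vendor_device}"
          resourceName: ${resource}
          externalResourceProvider: true
EOF
}
```

For example, `write_kubevirt_hostdevices kubevirt-hostdevices.yaml "1eff:<device-id>" rebellions.ai/ATOM_CA22_PT` produces a file to review and merge into your existing KubeVirt resource. Add one pciHostDevices entry per resource name you intend to attach.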
Create a VirtualMachine manifest where each hostDevices entry references the resource published by the sandbox device plugin:
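A minimal manifest of that shape is sketched below. The function name, VM name, memory size, and container disk image are placeholders chosen for illustration; only the hostDevices entry (name rbln1 plus a rebellions.ai/* device name) reflects what this guide describes.

```shell
# Hypothetical sketch: VM name, memory size, and disk image are placeholders.
write_rbln_vm() {
  local out="$1" resource="$2"
  cat > "$out" <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rbln-vm
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
          hostDevices:
            - name: rbln1
              deviceName: ${resource}
        resources:
          requests:
            memory: 8Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest
EOF
}
```

Running `write_rbln_vm rbln-vm.yaml rebellions.ai/ATOM_CA22_PT` and then `kubectl apply -f rbln-vm.yaml` would create the VM, provided the resource name matches one published by the sandbox device plugin and permitted in the KubeVirt configuration.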
Tips
- Each requested unit corresponds to one VFIO-bound NPU function.
- To request multiple devices, increase both requests and limits and add multiple hostDevices entries (rbln1, rbln2, …).
- Use distinct resource names in sandboxDevicePlugin.resourceList (for example, rebellions.ai/ATOM_CA22_PT vs rebellions.ai/ATOM_CA25_PT) when a single Kubernetes cluster includes multiple RBLN device types so workloads can request the exact model they need.
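For the multi-device case, the hostDevices entries can be generated rather than typed by hand. The helper below is a hypothetical convenience that prints only the hostDevices stanza; the requests and limits counts mentioned above still have to be raised to match in the same manifest.

```shell
# Hypothetical helper: prints N hostDevices entries named rbln1..rblnN.
# Matching requests/limits counts must still be raised in the same manifest.
print_host_devices() {
  local count="$1" resource="$2" i=1
  echo "hostDevices:"
  while [ "$i" -le "$count" ]; do
    echo "  - name: rbln${i}"
    echo "    deviceName: ${resource}"
    i=$((i + 1))
  done
}
```

For instance, `print_host_devices 2 rebellions.ai/ATOM_CA25_PT` prints two entries, rbln1 and rbln2, both pointing at the CA25 passthrough resource.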