Sandboxed Workloads with the Rebellions NPU Operator

BREAKING CHANGE

The Sandbox Device Plugin no longer takes a sandboxDevicePlugin.resourceList. It now advertises one resource per NPU product, derived automatically, so the resource names change. For example, rebellions.ai/ATOM_CA25_PT becomes rebellions.ai/RBLN-CA25_PF. When upgrading, update the resourceName entries in your KubeVirt permittedHostDevices and the deviceName in each VM's hostDevices to match. See Resource naming.

Overview

The RBLN NPU Operator can expose NPUs to guest VMs through VFIO so that virtualized AI workloads achieve near-native acceleration. When sandbox mode is enabled, the operator does three things:

  1. Rebinds PCI devices to the VFIO driver
    The vfio-manager DaemonSet ships the vfio-manage.sh helper through a ConfigMap. It detaches each RBLN PCI device from its native driver and reattaches it to vfio-pci, making the device safe for passthrough.

  2. Announces VFIO-backed resources
    The sandbox-device-plugin DaemonSet scans VFIO-managed NPUs and automatically advertises one resource per NPU product, such as rebellions.ai/RBLN-CA22_PF and rebellions.ai/RBLN-CA25_PF. Any workload that requests resources through a Kubernetes device plugin, including KubeVirt, can consume them.

  3. Labels eligible nodes
    Node Feature Discovery (NFD) reports the underlying hardware (feature.node.kubernetes.io/pci-1eff.present=true). The operator labels those nodes with rebellions.ai/npu.present=true and workload-specific keys so that only nodes capable of VFIO passthrough run the sandbox components.

After those controllers reconcile, the NPUs are exposed to KubeVirt VirtualMachine objects through the rebellions.ai/* resource names referenced in each hostDevices stanza.
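To confirm the three effects above on a running cluster, one option is to list the labeled nodes and inspect what a node advertises. The sketch below is an assumption-laden example, not part of the operator: `rbln_resources` is a hypothetical helper that pulls `rebellions.ai/*` names out of whatever text `kubectl` prints, and `NODE` stands for one of your node names.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the operator): read text on stdin and
# print each distinct rebellions.ai/* resource name found in it.
rbln_resources() {
  grep -o 'rebellions\.ai/[A-Za-z0-9_.-]*' | sort -u
}

# Usage against a cluster (NODE is a placeholder for a real node name):
#   kubectl get nodes -l rebellions.ai/npu.present=true
#   kubectl get node "$NODE" -o jsonpath='{.status.allocatable}' | rbln_resources
```

If the second command prints nothing for a node that should host VMs, the sandbox device plugin has probably not registered its resources there yet.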

The Helm setup below enables sandbox mode cluster-wide. To run a hybrid deployment — most nodes serving container workloads, only a few hosting VFIO-backed VMs — keep workloadType: container in the chart, enable both vfioManager.enabled and sandboxDevicePlugin.enabled, and set rebellions.ai/npu.workload.config=vm-passthrough on the nodes that should host VMs. See Workload Labeling by Node for the full procedure and operational caveats.

Device lifecycle

Sandbox mode binds RBLN PCI devices to the vfio-pci driver, and unbinds them from it, on each node according to the node's workload label. The lifecycle is bidirectional, so the same node can be moved between container and VM-passthrough modes without manual cleanup.

  1. Label selection: Apply the rebellions.ai/npu.workload.config=vm-passthrough label to the node.
  2. Bind on entry: When the VFIO Manager DaemonSet is scheduled on the node, its initContainer unloads the host RBLN kernel driver and the main container binds each NPU PCI device to vfio-pci. The sandbox device plugin's initContainer then verifies the binding state of every device before the sandbox device plugin starts advertising VFIO resources.
  3. Hybrid clusters: When spec.workloadType: container is set, first enable both vfioManager.enabled: true and sandboxDevicePlugin.enabled: true. Then opt nodes in by setting rebellions.ai/npu.workload.config=vm-passthrough. Container workloads on the remaining nodes keep using the RBLN driver, untouched. See Workload Labeling by Node for label precedence rules.
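The label selection and bind steps above can be exercised from a shell. This is a minimal sketch under stated assumptions: both function names are illustrative, `kubectl` and `lspci` are assumed to be installed where each is run, and the check only reads driver state rather than changing it.

```shell
#!/bin/sh
# Illustrative helpers; neither function is part of the operator.

# Opt a node in to VM passthrough using the label from step 1.
# $1: node name
label_vm_passthrough() {
  kubectl label node "$1" rebellions.ai/npu.workload.config=vm-passthrough --overwrite
}

# Read `lspci -nnk` output on stdin and print the kernel driver bound to each device.
bound_drivers() {
  awk '/Kernel driver in use:/ { print $NF }'
}

# Usage:
#   label_vm_passthrough NODE                        # from a machine with cluster access
#   lspci -nnk -d 1eff: | bound_drivers | sort -u    # on the node itself
```

Once the VFIO Manager has finished on the node, the second command should report only vfio-pci for the Rebellions devices (vendor ID 1eff).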

Prerequisites

  • Kubernetes 1.19+ cluster
  • Worker nodes with RBLN NPUs (RBLN-CA22/CA25)
  • IOMMU enabled in BIOS and on the kernel command line (intel_iommu=on or amd_iommu=on), with the VFIO kernel modules available (vfio, vfio_pci, vfio_iommu_type1)
  • KubeVirt Operator installed and ready to schedule VMs
  • Node Feature Discovery (can be deployed by the Helm chart itself)
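The IOMMU and VFIO prerequisites can be checked on a worker node before installing the chart. The following is a rough sketch, not an official preflight tool: the function names are made up for illustration, and the module check looks only at an `lsmod`-style listing, so modules compiled into the kernel would be reported as missing.

```shell
#!/bin/sh
# Illustrative preflight helpers; the names are not part of any Rebellions tool.

# Succeeds if the kernel command line in $1 carries one of the IOMMU flags listed above.
has_iommu_flag() {
  case " $1 " in
    *" intel_iommu=on "*|*" amd_iommu=on "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints every required VFIO module that does not appear in the lsmod-style listing in $1.
missing_vfio_modules() {
  for mod in vfio vfio_pci vfio_iommu_type1; do
    printf '%s\n' "$1" \
      | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' \
      || printf '%s\n' "$mod"
  done
}

# Usage on a worker node:
#   has_iommu_flag "$(cat /proc/cmdline)" || echo "IOMMU flag not set"
#   missing_vfio_modules "$(lsmod)"
```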

Helm Deployment for Sandboxed Workloads

  1. Install Helm (if necessary)

    $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
         && chmod 700 get_helm.sh \
         && ./get_helm.sh
    

  2. Add the Rebellions chart repository

    $ helm repo add rebellions https://RBLN-SW.github.io/rbln-npu-operator
    $ helm repo update
    

  3. Configure the sandbox workload profile
    The chart ships with a ready-made example at sample-values-SandboxWorkload.yaml. It enables the VFIO Manager and the Sandbox Device Plugin:

    name: rbln
    nfd:
      enabled: true
    
    operator:
      replicas: 1
    
    sandboxDevicePlugin:
      enabled: true
    
    vfioManager:
      enabled: true
    

    The plugin discovers each NPU and derives its resource name automatically, so no resource list is required. See Resource naming for the convention.

    You can also copy the base values.yaml and toggle the relevant keys manually:

      • Set sandboxDevicePlugin.enabled=true
      • Set vfioManager.enabled=true
      • Ensure nfd.enabled=true if NFD is not already running

  4. Install with the sandbox profile

    $ helm install rbln-npu-operator \
         rebellions/rbln-npu-operator \
         -n rbln-system --create-namespace \
         -f sample-values-SandboxWorkload.yaml
    

Resource naming

The Sandbox Device Plugin advertises one Kubernetes resource per NPU product. On each node it scans the NPUs bound to vfio-pci and resolves their product names from a PCI ID database bundled with the plugin under the Rebellions vendor ID 1eff. The names are derived automatically. There is no resourceList to maintain.

The product name from the database becomes the resource suffix: it is upper-cased, spaces are replaced with _, and characters outside letters, digits, _, and - are dropped. For example, the entry 1250 RBLN-CA25 (PF) is advertised as rebellions.ai/RBLN-CA25_PF:

pci.ids entry (vendor 1eff)    Advertised resource
1220  RBLN-CA22 (PF)           rebellions.ai/RBLN-CA22_PF
1250  RBLN-CA25 (PF)           rebellions.ai/RBLN-CA25_PF
2030  RBLN-CR03 (PF)           rebellions.ai/RBLN-CR03_PF
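
The naming rule can be reproduced in a few lines of shell, which is handy for predicting the resource name to put in permittedHostDevices. This is only a sketch of the rule as described in this section: `resource_name` is a hypothetical helper, and the plugin's real implementation may differ in details not documented here.

```shell
#!/bin/sh
# Hypothetical helper that mirrors the documented rule: upper-case the product
# name, replace spaces with "_", then drop anything that is not a letter,
# digit, "_" or "-".
# $1: product name from the PCI ID database, e.g. "RBLN-CA25 (PF)"
resource_name() {
  suffix=$(printf '%s' "$1" \
    | LC_ALL=C tr '[:lower:]' '[:upper:]' \
    | LC_ALL=C tr ' ' '_' \
    | LC_ALL=C tr -cd 'A-Z0-9_-')
  printf 'rebellions.ai/%s\n' "$suffix"
}

resource_name "RBLN-CA25 (PF)"   # prints rebellions.ai/RBLN-CA25_PF
```

The parentheses around PF are dropped by the last step, which is why the space before them becomes the only separator in the suffix.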

Note

The plugin carries its own PCI ID database, so resource names need no host setup. A product newer than the bundled database is advertised under the device-ID name rebellions.ai/RBLN-<device-id> (for example, rebellions.ai/RBLN-1250) until the plugin is updated to a build that includes it.

Consuming the VFIO Resources from KubeVirt

Note

Enable KubeVirt's HostDevices feature gate and list each Rebellions PCI resource under permittedHostDevices.pciHostDevices before attaching them to VMs. Set externalResourceProvider: true on each entry so that KubeVirt delegates allocation to the sandbox device plugin instead of advertising the same resourceName itself:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - HostDevices
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: 1eff:1220
        resourceName: rebellions.ai/RBLN-CA22_PF
        externalResourceProvider: true
      - pciVendorSelector: 1eff:1250
        resourceName: rebellions.ai/RBLN-CA25_PF
        externalResourceProvider: true

Create a VirtualMachine manifest where each hostDevices entry references the resource published by the sandbox device plugin:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-npu-workload
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        app: vm-npu
    spec:
      domain:
        devices:
          hostDevices:
          - name: rbln0
            deviceName: rebellions.ai/RBLN-CA25_PF
            tag: "pci"
        resources:
          requests:
            rebellions.ai/RBLN-CA25_PF: 1
          limits:
            rebellions.ai/RBLN-CA25_PF: 1

Tips

  • Each requested unit corresponds to one VFIO-bound NPU function.
  • To request multiple devices, increase both requests and limits and add multiple hostDevices entries (rbln1, rbln2, …).
  • The plugin advertises a separate resource per NPU product, so when a single Kubernetes cluster includes multiple RBLN device types, each workload requests the exact product it needs (for example, rebellions.ai/RBLN-CA22_PF vs rebellions.ai/RBLN-CA25_PF).
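
For the multi-device tip above, the hostDevices entries and the resource amounts have to stay in step. The sketch below prints both for a given product and count so they can be pasted into a VirtualMachine manifest. `render_npu_request` is a hypothetical helper written for this page, and the output follows the field layout of the manifest shown earlier.

```shell
#!/bin/sh
# Hypothetical generator: print COUNT hostDevices entries (rbln0, rbln1, ...)
# plus matching requests and limits for one resource name.
# $1: resource name, $2: device count
render_npu_request() {
  resource=$1
  count=$2
  echo "hostDevices:"
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "- name: rbln$i"
    echo "  deviceName: $resource"
    echo "  tag: \"pci\""
    i=$((i + 1))
  done
  echo "resources:"
  echo "  requests:"
  echo "    $resource: $count"
  echo "  limits:"
  echo "    $resource: $count"
}

render_npu_request rebellions.ai/RBLN-CA25_PF 2
```

The printed fragment still needs to be indented to sit under spec.template.spec.domain in the manifest.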