
Sandboxed Workloads with the Rebellions NPU Operator

Overview

The RBLN NPU Operator can expose NPUs to guest VMs through VFIO so that virtualized AI workloads achieve near-native acceleration. When sandbox mode is enabled, the operator does the following:

  1. Rebinds PCI devices to the VFIO driver
    The vfio-manager DaemonSet ships the vfio-manage.sh helper through a ConfigMap. It detaches each RBLN PCI device from its native driver and reattaches it to vfio-pci, making the device safe for passthrough.

  2. Announces VFIO-backed resources
    The sandbox-device-plugin DaemonSet scans VFIO-managed NPUs and advertises resources such as rebellions.ai/ATOM_CA22_PT and rebellions.ai/ATOM_CA25_PT. Any workload that requests resources through a Kubernetes device plugin—including KubeVirt—can consume them.

  3. Labels eligible nodes
    Node Feature Discovery (NFD) reports the underlying hardware (feature.node.kubernetes.io/pci-1eff.present=true). The operator labels those nodes with rebellions.ai/npu.present=true and workload-specific keys so that only nodes capable of VFIO passthrough run the sandbox components.

After those controllers reconcile, the NPUs are exposed to KubeVirt VirtualMachine objects through the rebellions.ai/* resource names referenced in each hostDevices stanza.
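Once reconciliation has finished, the labels and resources described above can be checked from any machine with cluster access. The following is a minimal sketch, not an official procedure. The `NODE` default and the `npu_resource_lines` helper are hypothetical names introduced here for illustration; the label and the `rebellions.ai/` prefix are the ones named above.

```shell
# Filter stdin down to lines that mention rebellions.ai/* labels or resources.
npu_resource_lines() {
  grep 'rebellions\.ai/' || true   # grep exits 1 on no match; treat that as "nothing found"
}

# Assumption: NODE names one of your NPU worker nodes (hypothetical default).
NODE="${NODE:-npu-node-1}"

if command -v kubectl >/dev/null 2>&1; then
  # Nodes the operator has labeled as carrying an NPU.
  kubectl get nodes -l rebellions.ai/npu.present=true
  # Shows the node's rebellions.ai/* labels and its Capacity/Allocatable entries.
  kubectl describe node "$NODE" | npu_resource_lines
else
  echo "kubectl not found; run this from a machine with cluster access"
fi
```

If the sandbox device plugin has registered, the filtered output should include the passthrough resource names such as rebellions.ai/ATOM_CA25_PT.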

The Helm setup below enables sandbox mode cluster-wide. To run a hybrid deployment — most nodes serving container workloads, only a few hosting VFIO-backed VMs — keep workloadType: container in the chart, enable both vfioManager.enabled and sandboxDevicePlugin.enabled, and set rebellions.ai/npu.workload.config=vm-passthrough on the nodes that should host VMs. See Workload Labeling by Node for the full procedure and operational caveats.

Device lifecycle

Sandbox mode binds RBLN PCI devices to the vfio-pci driver, or unbinds them from it, on each node according to the node's workload label. The lifecycle is bidirectional, so the same node can be moved between container and VM-passthrough modes without manual cleanup.

  1. Label selection: Apply the rebellions.ai/npu.workload.config=vm-passthrough label to the node.
  2. Bind on entry: When the VFIO Manager DaemonSet is scheduled on the node, its initContainer unloads the host RBLN kernel driver and the main container binds each NPU PCI device to vfio-pci. The sandbox device plugin's initContainer then verifies the binding state of every device before the sandbox device plugin starts advertising VFIO resources.
  3. Hybrid clusters: When spec.workloadType: container is set, first enable both vfioManager.enabled: true and sandboxDevicePlugin.enabled: true. Then opt nodes in by setting rebellions.ai/npu.workload.config=vm-passthrough. Container workloads on the remaining nodes keep using the RBLN driver, untouched. See Workload Labeling by Node for label precedence rules.
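To see the effect of steps 1 and 2 on a single node, the driver currently bound to each NPU can be read from sysfs. This is a sketch under assumptions: `list_npu_drivers` and the `SYSFS_ROOT` override are hypothetical names introduced here, and the function only reports state; it does not bind or unbind anything.

```shell
# Step 1, from a machine with cluster access (replace <node-name>):
#   kubectl label node <node-name> rebellions.ai/npu.workload.config=vm-passthrough

# Step 2 check, run on the node itself.
# Prints "<pci-address> <driver>" for each PCI device with the given vendor ID.
# SYSFS_ROOT is a hypothetical override, useful only for trying the function on a fake tree.
list_npu_drivers() {
  vendor="$1"
  root="${SYSFS_ROOT:-/sys}"
  for dev in "$root"/bus/pci/devices/*; do
    [ -r "$dev/vendor" ] || continue
    [ "$(cat "$dev/vendor")" = "0x$vendor" ] || continue
    if [ -L "$dev/driver" ]; then
      # The driver entry is a symlink whose last path component is the driver name.
      drv=$(basename "$(readlink "$dev/driver")")
    else
      drv="none"
    fi
    echo "$(basename "$dev") $drv"
  done
}

# 1eff is the PCI vendor ID that appears in the NFD label and the KubeVirt selectors in this guide.
list_npu_drivers 1eff
```

After the VFIO Manager has run on a labeled node, each line should end in vfio-pci. On nodes that stay in container mode, the host RBLN driver should appear instead.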

Prerequisites

  • Kubernetes 1.19+ cluster
  • Worker nodes with RBLN NPUs (RBLN-CA22/CA25)
  • IOMMU enabled in the BIOS/UEFI firmware and on the kernel command line (intel_iommu=on or amd_iommu=on), plus the VFIO kernel modules (vfio, vfio_pci, vfio_iommu_type1)
  • KubeVirt Operator installed and ready to schedule VMs
  • Node Feature Discovery (can be deployed by the Helm chart itself)
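The IOMMU and VFIO prerequisites can be spot-checked on a worker node before installing anything. The script below is a rough sketch rather than a definitive test: `cmdline_has_iommu` is a hypothetical helper, and a missing kernel parameter or module entry is not conclusive, because some kernels enable the IOMMU by default or build VFIO in.

```shell
# Returns 0 if the given kernel command line contains one of the IOMMU parameters listed above.
cmdline_has_iommu() {
  case " $1 " in
    *" intel_iommu=on "*|*" amd_iommu=on "*) return 0 ;;
    *) return 1 ;;
  esac
}

if cmdline_has_iommu "$(cat /proc/cmdline 2>/dev/null)"; then
  echo "IOMMU kernel parameter: present"
else
  echo "IOMMU kernel parameter: not found (may still be enabled by default)"
fi

# /sys/kernel/iommu_groups contains one directory per group when the IOMMU is active.
group_count=0
for g in /sys/kernel/iommu_groups/*; do
  if [ -e "$g" ]; then
    group_count=$((group_count + 1))
  fi
done
echo "IOMMU groups: $group_count"

# Loaded modules appear under /sys/module; built-in ones may not.
for m in vfio vfio_pci vfio_iommu_type1; do
  if [ -d "/sys/module/$m" ]; then
    echo "$m: loaded"
  else
    echo "$m: not loaded"
  fi
done
```

A group count of zero is the strongest signal that passthrough will not work on that node.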

Helm Deployment for Sandboxed Workloads

  1. Install Helm (if necessary)

    $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
         && chmod 700 get_helm.sh \
         && ./get_helm.sh
    

  2. Add the Rebellions chart repository

    $ helm repo add rebellions https://RBLN-SW.github.io/rbln-npu-operator
    $ helm repo update
    

  3. Configure the sandbox workload profile
    The chart ships with a ready-made example at sample-values-SandboxWorkload.yaml. It enables the VFIO Manager and the Sandbox Device Plugin and sets suitable resource names:

    name: rbln
    nfd:
      enabled: true
    
    operator:
      replicas: 1
    
    sandboxDevicePlugin:
      enabled: true
      resourceList:
      - resourceName: ATOM_CA22_PT
        resourcePrefix: rebellions.ai
        productCardNames:
        - RBLN-CA22
      - resourceName: ATOM_CA25_PT
        resourcePrefix: rebellions.ai
        productCardNames:
        - RBLN-CA25
    
    vfioManager:
      enabled: true
    

    You can also copy the base values.yaml and toggle the relevant keys manually:

      • Set sandboxDevicePlugin.enabled=true
      • Set vfioManager.enabled=true
      • Adjust sandboxDevicePlugin.resourceList[] to match each card model and VFIO resource name your VMs expect
      • Ensure nfd.enabled=true if NFD is not already running

  4. Install with the sandbox profile

    $ helm install rbln-npu-operator \
         rebellions/rbln-npu-operator \
         -n rbln-system --create-namespace \
         -f sample-values-SandboxWorkload.yaml
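Step 3 also mentions toggling the keys by hand instead of using the sample file. One way to do that, sketched here under assumptions, is a small overrides file that Helm merges over the chart defaults. The file name `sandbox-overrides.yaml` is hypothetical, and the sketch leaves sandboxDevicePlugin.resourceList at whatever the chart ships by default, which may differ from the sample profile.

```shell
# Write only the keys named in step 3; everything else stays at the chart defaults.
cat > sandbox-overrides.yaml <<'EOF'
nfd:
  enabled: true          # drop this if NFD is already running in the cluster
vfioManager:
  enabled: true
sandboxDevicePlugin:
  enabled: true
EOF

# Show what was written.
cat sandbox-overrides.yaml
```

Passing `-f sandbox-overrides.yaml` in place of the sample file in step 4 applies these overrides. Running `helm show values rebellions/rbln-npu-operator` prints the chart defaults if you would rather start from the full values.yaml.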
    

Consuming the VFIO Resources from KubeVirt

Note

Enable KubeVirt's HostDevices feature gate and list each Rebellions PCI resource under permittedHostDevices.pciHostDevices before attaching them to VMs:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - HostDevices
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: 1eff:1220
        resourceName: rebellions.ai/ATOM_CA22_PT
      - pciVendorSelector: 1eff:1250
        resourceName: rebellions.ai/ATOM_CA25_PT

Create a VirtualMachine manifest where each hostDevices entry references the resource published by the sandbox device plugin:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-npu-workload
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        app: vm-npu
    spec:
      domain:
        devices:
          hostDevices:
          - name: rbln0
            deviceName: rebellions.ai/ATOM_CA25_PT
            tag: "pci"
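        resources:
          requests:
            # Must match the deviceName above and a resource advertised by the sandbox device plugin.
            rebellions.ai/ATOM_CA25_PT: 1
          limits:
            rebellions.ai/ATOM_CA25_PT: 1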

Tips

  • Each requested unit corresponds to one VFIO-bound NPU function.
  • To request multiple devices, increase both requests and limits and add multiple hostDevices entries (rbln1, rbln2, …).
  • Use distinct resource names in sandboxDevicePlugin.resourceList (for example, rebellions.ai/ATOM_CA22_PT vs rebellions.ai/ATOM_CA25_PT) when a single Kubernetes cluster includes multiple RBLN device types so workloads can request the exact model they need.
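The multi-device tip can be illustrated with a two-NPU variant of the manifest above. This is a sketch that follows the tips as written: the file name and VM name are hypothetical, and memory, disks, and volumes are omitted just as in the original example, so the manifest needs those added before it describes a bootable VM.

```shell
# Write a two-NPU variant of the VirtualMachine manifest (hypothetical file and VM names).
cat > vm-two-npus.yaml <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-two-npus
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          hostDevices:
          # One entry per device, each with a unique name.
          - name: rbln0
            deviceName: rebellions.ai/ATOM_CA25_PT
          - name: rbln1
            deviceName: rebellions.ai/ATOM_CA25_PT
        resources:
          # Counts raised to match the number of hostDevices entries.
          requests:
            rebellions.ai/ATOM_CA25_PT: 2
          limits:
            rebellions.ai/ATOM_CA25_PT: 2
EOF

# Sanity check: the number of hostDevices entries should equal the requested count.
grep -c 'deviceName:' vm-two-npus.yaml
```

The file can then be submitted with `kubectl apply -f vm-two-npus.yaml` once the remaining VM fields are filled in.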