
Sandboxed Workloads with the Rebellions NPU Operator

Overview

The RBLN NPU Operator can expose NPUs to guest VMs through VFIO so that virtualized AI workloads achieve near-native acceleration. When sandbox mode is enabled, the operator:

  1. Rebinds PCI devices to the vfio-pci driver
    The vfio-manager DaemonSet ships the vfio-manage.sh helper through a ConfigMap. It detaches each RBLN PCI function from its native driver and reattaches it to vfio-pci, making the device safe for passthrough (a sysfs-level sketch follows this overview).

  2. Announces VFIO-backed resources
    The sandbox-device-plugin DaemonSet scans the VFIO-managed NPUs and advertises resources such as rebellions.ai/ATOM_CA22_PT and rebellions.ai/ATOM_CA25_PT. Any workload that requests these extended resources, including KubeVirt virtual machines, can consume the devices.

  3. Labels eligible nodes
    Node Feature Discovery (NFD) reports the underlying hardware (feature.node.kubernetes.io/pci-1eff.present=true). The operator labels those nodes with rebellions.ai/npu.present=true and workload-specific keys so that only nodes capable of VFIO passthrough run the sandbox components.

After those controllers reconcile, the NPUs are exposed to KubeVirt VirtualMachine objects through the rebellions.ai/* resource names referenced in each hostDevices stanza.
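
For illustration, the sysfs-level rebinding that step 1 automates looks roughly like the sketch below when done by hand as root on a worker node. This is a minimal sketch, not the contents of vfio-manage.sh, and the PCI address is a placeholder:

    # Hypothetical PCI address of one RBLN NPU function; list the real ones with:
    #   lspci -nn -d 1eff:
    BDF=0000:3b:00.0

    # Unbind the function from its current driver, if any.
    if [ -e /sys/bus/pci/devices/$BDF/driver ]; then
        echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
    fi

    # Prefer vfio-pci for this device, then ask the kernel to re-probe it.
    echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
    echo "$BDF" > /sys/bus/pci/drivers_probe

    # The driver symlink should now point at vfio-pci.
    readlink /sys/bus/pci/devices/$BDF/driver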

Prerequisites

  • Kubernetes 1.19+ cluster
  • Worker nodes with RBLN NPUs (RBLN-CA12/CA22/CA25)
  • IOMMU enabled in the BIOS/UEFI and on the kernel command line (intel_iommu=on or amd_iommu=on), with the VFIO kernel modules available (vfio, vfio_pci, vfio_iommu_type1); see the checks after this list
  • KubeVirt Operator installed and ready to schedule VMs
  • Node Feature Discovery (can be deployed by the Helm chart itself)
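
A quick way to sanity-check the IOMMU and VFIO requirements on a worker node (run on the host; exact output varies by distribution):

    # IOMMU must be enabled on the kernel command line.
    grep -oE 'intel_iommu=on|amd_iommu=on' /proc/cmdline

    # IOMMU groups only exist when the IOMMU is actually active.
    ls /sys/kernel/iommu_groups/

    # Load the VFIO modules if they are not built in or already loaded.
    sudo modprobe -a vfio vfio_pci vfio_iommu_type1
    lsmod | grep vfio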

Helm Deployment for Sandboxed Workloads

  1. Install Helm (if necessary)

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
         && chmod 700 get_helm.sh \
         && ./get_helm.sh
    

  2. Add the Rebellions chart repository

    helm repo add rebellions https://rebellions-sw.github.io/rbln-npu-operator
    helm repo update
    

  3. Configure the sandbox workload profile
    The chart ships with a ready-made example, sample-values-SandboxWorkload.yaml, which enables the VFIO Manager and the Sandbox Device Plugin and sets suitable resource names:

    name: rbln
    nfd:
      enabled: true
    
    operator:
      replicas: 1
    
    sandboxDevicePlugin:
      enabled: true
      resourceList:
      - resourceName: ATOM_CA22_PT
        resourcePrefix: rebellions.ai
        productCardNames:
        - RBLN-CA22
      - resourceName: ATOM_CA25_PT
        resourcePrefix: rebellions.ai
        productCardNames:
        - RBLN-CA25
    
    vfioManager:
      enabled: true
    

    You can also copy the base values.yaml and toggle the relevant keys manually:
      • set sandboxDevicePlugin.enabled=true
      • set vfioManager.enabled=true
      • adjust sandboxDevicePlugin.resourceList[] to match each card model and VFIO resource name your VMs expect
      • ensure nfd.enabled=true if NFD is not already running
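
    If you prefer to start from the chart defaults instead of the sample file, pull them with a standard Helm command (the output file name is arbitrary) and edit the keys listed above:

     # Dump the chart's default values for editing.
     helm show values rebellions/rbln-npu-operator > my-values.yaml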

  4. Install with the sandbox profile

    helm install rbln-npu-operator \
         rebellions/rbln-npu-operator \
         -n rbln-system --create-namespace \
         -f sample-values-SandboxWorkload.yaml
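
    Once the release is installed, a few kubectl checks confirm that the sandbox components are running and the VFIO resources are advertised (the node label comes from the overview above; <node-name> is a placeholder):

     # Operator, vfio-manager, and sandbox-device-plugin pods should be Running.
     kubectl get pods -n rbln-system

     # Nodes eligible for passthrough carry the operator's label.
     kubectl get nodes -l rebellions.ai/npu.present=true

     # The VFIO-backed resources appear in the node's allocatable list.
     kubectl describe node <node-name> | grep rebellions.ai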
    

Consuming the VFIO Resources from KubeVirt

Note

Enable KubeVirt's HostDevices feature gate and list each Rebellions PCI resource under permittedHostDevices.pciHostDevices before attaching them to VMs:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - HostDevices
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: 1eff:1220
        resourceName: rebellions.ai/ATOM_CA22_PT
      - pciVendorSelector: 1eff:1250
        resourceName: rebellions.ai/ATOM_CA25_PT
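
One way to apply this change, assuming the KubeVirt custom resource is named kubevirt in the kubevirt namespace as above, is a merge patch with a reasonably recent kubectl; the patch file name is illustrative and should contain the spec block shown above:

kubectl patch kubevirt kubevirt -n kubevirt --type merge \
     --patch-file kubevirt-hostdevices.yaml

# Confirm the permitted host devices were accepted.
kubectl get kubevirt kubevirt -n kubevirt \
     -o jsonpath='{.spec.configuration.permittedHostDevices}'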

Create a VirtualMachine manifest where each hostDevices entry references the resource published by the sandbox device plugin:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-npu-workload
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        app: vm-npu
    spec:
      domain:
        devices:
          hostDevices:
          - name: rbln0
            deviceName: rebellions.ai/ATOM_CA25_PT
            tag: "pci"
        resources:
          requests:
            rebellions.ai/ATOM_CA25_PT: 1
          limits:
            rebellions.ai/ATOM_CA25_PT: 1
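
Assuming the manifest is saved as vm-npu-workload.yaml and the virtctl client is installed, the VM can be created and checked as follows:

kubectl apply -f vm-npu-workload.yaml

# runStrategy: Always starts the VM immediately; watch the instance come up.
kubectl get vmi vm-npu-workload -w

# From the guest console, the passed-through NPU is visible as a PCI device
# with the Rebellions vendor ID (1eff), e.g. via: lspci -nn | grep 1eff
virtctl console vm-npu-workload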

Tips

  • Each requested unit corresponds to one VFIO-bound NPU function.
  • To request multiple devices, increase both requests and limits and add multiple hostDevices entries (rbln1, rbln2, …).
  • Use distinct resource names in sandboxDevicePlugin.resourceList (for example, rebellions.ai/ATOM_CA22_PT vs rebellions.ai/ATOM_CA25_PT) when a single Kubernetes cluster includes multiple RBLN device types so workloads can request the exact model they need.