RBLN Metrics Exporter

The RBLN SDK provides a Metrics Exporter that exposes detailed metrics for Rebellions NPU devices in Prometheus format. Prometheus can scrape these metrics and Grafana can visualize them, so you can monitor the temperature, power draw, memory usage, utilization, and health of each NPU.

Deployment

Step 1: Prepare NPU Nodes

Follow the steps in the device plugin documentation to prepare Kubernetes nodes equipped with RBLN NPUs, and make sure the RBLN Driver is installed on each of them.

Step 2: Deploy Prometheus

Install Prometheus in your Kubernetes cluster using either Helm or the Prometheus Operator.

Note that deploying the RBLN Metrics Exporter does not require Prometheus to be set up beforehand.

Step 3: Deploy RBLN Metrics Exporter

Deploy the RBLN Metrics Exporter as a DaemonSet, which runs one exporter Pod on each eligible node, with the following command:

$ kubectl apply -f https://raw.githubusercontent.com/rebellions-sw/rbln-metrics-exporter/refs/heads/main/deployments/kubernetes/daemonset.yaml

The provided manifest includes affinity rules to ensure the Metrics Exporter is deployed only on nodes equipped with RBLN NPUs. Specifically, it uses nodeAffinity to target nodes where the rebellions.ai/npu.present label is set to "true", which is typically set by rbln-npu-feature-discovery.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: rebellions.ai/npu.present
              operator: In
              values:
                - "true"
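To reason about which nodes this rule selects, the matching logic can be sketched in Python. This is a minimal illustration of the `In` operator only, not the scheduler's implementation. The function name `node_matches` and the sample label sets are hypothetical.

```python
# Sketch of how a nodeAffinity `In` expression is evaluated against node labels.
# `node_matches` and the sample label sets are illustrative, not part of the exporter.

NPU_RULE = {
    "key": "rebellions.ai/npu.present",
    "operator": "In",
    "values": ["true"],
}


def node_matches(labels: dict, expressions: list) -> bool:
    """Return True if every `In` expression is satisfied by the node labels."""
    for expr in expressions:
        if expr["operator"] != "In":
            raise ValueError(f"unsupported operator: {expr['operator']}")
        # `In` requires the label to exist and its value to be one of `values`.
        if labels.get(expr["key"]) not in expr["values"]:
            return False
    return True


# Hypothetical nodes: one labeled by feature discovery, one without the label.
npu_node = {"kubernetes.io/hostname": "worker-01", "rebellions.ai/npu.present": "true"}
cpu_node = {"kubernetes.io/hostname": "worker-02"}

print(node_matches(npu_node, [NPU_RULE]))  # True
print(node_matches(cpu_node, [NPU_RULE]))  # False
```

A node that lacks the label entirely is rejected, which is why the exporter Pod is not scheduled on nodes where feature discovery has not run.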

Kubernetes Mode

The exporter has a Kubernetes mode switch. Turning it off disables pod-resource lookups and label dependencies, so the exporter can run in non-Kubernetes environments. Set it in either of the following ways:

Manifest env: set RBLN_METRICS_EXPORTER_KUBERNETES_MODE=off
Binary flag: run ./rbln-metrics-exporter --kubernetes-mode=off

Step 4: (Optional) Configure Prometheus to Scrape Metrics

To allow Prometheus to automatically discover and scrape metrics from the RBLN Metrics Exporter, you can create a ServiceMonitor resource. This is especially useful if you're using the Prometheus Operator. Here's an example ServiceMonitor configuration:

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rbln-metrics-exporter
  namespace: monitoring
  labels:
    release: prometheus  # Must match the serviceMonitorSelector you configured when installing Prometheus so that this ServiceMonitor is discovered
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rbln-metrics-exporter
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

Make sure the selector labels in the ServiceMonitor match the labels on the Service that exposes the RBLN Metrics Exporter (a ServiceMonitor selects Services, not Pods), and that the release label (if used) matches your Prometheus deployment. Save the manifest as servicemonitor.yaml and apply it with kubectl apply -f servicemonitor.yaml.
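When Prometheus does not pick up the exporter, the usual cause is a mismatch in one of three places: the namespace, the Service labels, or the port name. The following Python sketch checks those three conditions against the example ServiceMonitor above. It is a simplified model, not the Prometheus Operator's logic. The function `selection_problems` and the Service description passed to it are hypothetical.

```python
# Sketch: would the example ServiceMonitor select a given Service?
# Covers only matchLabels, namespaceSelector.matchNames, and named ports.
# The Service description below is a hypothetical example, not the shipped manifest.

SERVICE_MONITOR_SPEC = {
    "selector": {"matchLabels": {"app.kubernetes.io/name": "rbln-metrics-exporter"}},
    "namespaceSelector": {"matchNames": ["kube-system"]},
    "endpoints": [{"port": "metrics", "path": "/metrics"}],
}


def selection_problems(spec: dict, service: dict) -> list:
    """Return the reasons the Service would not be scraped (empty list if none)."""
    problems = []
    if service["namespace"] not in spec["namespaceSelector"]["matchNames"]:
        problems.append(f"namespace {service['namespace']!r} is not selected")
    for key, value in spec["selector"]["matchLabels"].items():
        if service["labels"].get(key) != value:
            problems.append(f"label {key}={value} is missing on the Service")
    for endpoint in spec["endpoints"]:
        if endpoint["port"] not in service["port_names"]:
            problems.append(f"Service has no port named {endpoint['port']!r}")
    return problems


# Hypothetical Service with the right labels and port, but in the wrong namespace.
service = {
    "namespace": "monitoring",
    "labels": {"app.kubernetes.io/name": "rbln-metrics-exporter"},
    "port_names": ["metrics"],
}

for problem in selection_problems(SERVICE_MONITOR_SPEC, service):
    print(problem)
```

An empty result means all three conditions hold. Whether Prometheus itself discovers the ServiceMonitor still depends on the release label discussed above.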

Step 5: (Optional) Deploy Grafana

If you wish to visualize the Prometheus metrics through Grafana dashboards, deploy Grafana in your Kubernetes cluster using either Helm or the Grafana Operator.

Exported Metrics

The following metrics are exported for each NPU device, tagged with the device UUID, card name, and character device node (rblnN).

Name	Description	Unit
`RBLN_DEVICE_STATUS:TEMPERATURE`	Temperature	°C
`RBLN_DEVICE_STATUS:CARD_POWER`	Power usage	W
`RBLN_DEVICE_STATUS:DRAM_USED`	DRAM in use	Bytes
`RBLN_DEVICE_STATUS:DRAM_TOTAL`	Total DRAM	Bytes
`RBLN_DEVICE_STATUS:UTILIZATION`	Utilization	%
`RBLN_DEVICE_STATUS:HEALTH`	NPU health status	0/1

Note: RBLN_DEVICE_STATUS:HEALTH is a binary state metric. 0 means the NPU is active, while 1 means it is inactive.
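Because 1 signals a problem here, a check on this metric should look for the value 1 rather than 0. The Python sketch below scans exposition text and lists devices reporting 1. The function name `inactive_devices` is hypothetical, the sample lines are shortened to two labels, and the `rbln1` sample with value 1 is invented for illustration. The label regex is deliberately simple and does not handle escaped quotes inside label values.

```python
import re

# One sample line: metric name, label block in braces, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{(.*)\}\s+(\S+)$')
# One label pair inside the braces (no support for escaped quotes).
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def inactive_devices(text: str) -> list:
    """Return the `name` label of every device whose HEALTH value is 1."""
    names = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group(1) != "RBLN_DEVICE_STATUS:HEALTH":
            continue  # comment lines, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group(2)))
        if float(match.group(3)) == 1:
            names.append(labels["name"])
    return names


# Shortened, partly hypothetical sample: rbln1 is shown as inactive.
sample = """\
# HELP RBLN_DEVICE_STATUS:HEALTH NPU health status
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",name="rbln0"} 0
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",name="rbln1"} 1
"""

print(inactive_devices(sample))  # ['rbln1']
```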

Common NPU Metrics Label Attributes

Label	Description
`name`	Character device node exposed by the kernel driver (`Device.name`, e.g., `rbln0`).
`uuid`	Globally unique identifier for the NPU device (`Device.uuid`).
`card`	Hardware card string surfaced via `DeviceInfo.name` (e.g., `RBLN-CA25`).
`deviceID`	PCIe device ID reported in the proto (`Device.dev_id`, e.g., `1250`).
`hostname`	Name of the Kubernetes node where the Pod using the device is scheduled.
`driver_version`	Kernel driver build returned by `VersionInfo.drv_version`.
`firmware_version`	Firmware revision returned by `VersionInfo.fw_version`.

Kubernetes NPU Metrics Label Attributes

Label	Description
`namespace`	Namespace for the workload using the device. Taken from `Pod.metadata.namespace`.
`container`	Name of the container consuming the NPU. Taken from `Pod.spec.containers[].name`.
`pod`	Name of the Pod holding the NPU allocation. Taken from `Pod.metadata.name`.

Metrics Example

The following sample shows the metrics as the exporter emits them.

# TYPE RBLN_DEVICE_STATUS:DRAM_TOTAL (Bytes)
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 1.6877879296e+10
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 1.6877879296e+10
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 1.6877879296e+10
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 1.6877879296e+10

# TYPE RBLN_DEVICE_STATUS:DRAM_USED (Bytes)
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 0
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 150994944
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 0
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 0

# TYPE RBLN_DEVICE_STATUS:TEMPERATURE NPU temperature (C)
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 51
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 54
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 41
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 32

# TYPE RBLN_DEVICE_STATUS:UTILIZATION (%)
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 0

# TYPE RBLN_DEVICE_STATUS:CARD_POWER usage (w)
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 59.33700180053711
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 59.33700180053711
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 59.33700180053711
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 59.33700180053711

# HELP RBLN_DEVICE_STATUS:HEALTH NPU health status
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln0",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="55668c63-d739-4193-8212-ad7ba933520c"} 0
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln1",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 0
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln2",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="50e9c1d4-1d88-44ee-94e6-fc0dac081b1e"} 0
RBLN_DEVICE_STATUS:HEALTH{card="RBLN-CA25",container="ubuntu",deviceID="1250",driver_version="2.0.1",firmware_version="2.0.1",hostname="sw-mpc-clsdk-bm-worker-01",name="rbln3",namespace="default",pod="rebel-device-plugin-testpod-1",uuid="8e65fc0d-df7d-4e21-a81b-a76a1a1e69ab"} 0
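As a worked example of combining two of these metrics, the Python sketch below derives DRAM usage as a percentage from DRAM_USED and DRAM_TOTAL, joining the two series on the uuid label. The function name `dram_usage_percent` is hypothetical, and the sample reuses the rbln1 values from the output above with the label set shortened to name and uuid.

```python
import re

# One sample line: metric name, label block in braces, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{(.*)\}\s+(\S+)$')
# One label pair inside the braces (no support for escaped quotes).
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def dram_usage_percent(text: str) -> dict:
    """Map each device uuid to DRAM_USED / DRAM_TOTAL as a percentage."""
    used, total = {}, {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comment lines and blank lines
        metric, raw_labels, value = match.groups()
        uuid = dict(LABEL_RE.findall(raw_labels)).get("uuid")
        if uuid is None:
            continue
        if metric == "RBLN_DEVICE_STATUS:DRAM_USED":
            used[uuid] = float(value)
        elif metric == "RBLN_DEVICE_STATUS:DRAM_TOTAL":
            total[uuid] = float(value)
    # Skip devices with a missing or zero total to avoid dividing by zero.
    return {u: 100.0 * used[u] / total[u] for u in used if total.get(u)}


SAMPLE = """\
RBLN_DEVICE_STATUS:DRAM_TOTAL{name="rbln1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 1.6877879296e+10
RBLN_DEVICE_STATUS:DRAM_USED{name="rbln1",uuid="84389d45-ebf3-4b74-9d80-6ec8a09d8be4"} 150994944
"""

for uuid, pct in dram_usage_percent(SAMPLE).items():
    print(f"{uuid}: {pct:.2f}%")  # 84389d45-ebf3-4b74-9d80-6ec8a09d8be4: 0.89%
```

In the sample, 150994944 bytes is 144 MiB out of 16096 MiB, which is just under 0.9% of the device's DRAM.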