RBLN Prometheus Metrics Exporter
The RBLN SDK provides a Metrics Exporter that exposes per-device metrics for Rebellions NPUs in Prometheus format. Prometheus can scrape these metrics and Grafana can visualize them, so you can monitor the temperature, power, memory, and utilization of your Rebellions NPU devices.
Deployment
Step 1: Prepare NPU Nodes
Follow the same steps as outlined in the device plugin documentation to prepare Kubernetes nodes equipped with RBLN NPUs and ensure the RBLN Driver is installed.
Step 2: Deploy Prometheus
Install Prometheus in your Kubernetes cluster using either Helm or the Prometheus Operator.
Note that deploying the RBLN Metrics Exporter does not require Prometheus to be set up beforehand.
Step 3: Deploy RBLN Metrics Exporter
Deploy the RBLN Metrics Exporter as a DaemonSet, which runs one exporter Pod on each eligible node, with the following command:
| $ kubectl apply -f https://raw.githubusercontent.com/rebellions-sw/rbln-metrics-exporter/refs/heads/main/deployments/kubernetes/daemonset.yaml
|
The provided manifest includes affinity rules to ensure the Metrics Exporter is deployed only on nodes equipped with RBLN NPUs. Specifically, it uses nodeAffinity to target nodes where the rebellions.ai/npu.present label is set to "true", which is typically set by rbln-npu-feature-discovery.
| affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: rebellions.ai/npu.present
                operator: In
                values:
                  - "true"
|
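To make the matching rule concrete, the following Python sketch mimics how an In expression from matchExpressions is evaluated against a node's labels. It is an illustration only, not the scheduler's implementation: the function name node_matches and both sample label sets are hypothetical, and only the In operator is modeled.

```python
from typing import Dict, List


def node_matches(labels: Dict[str, str], expressions: List[dict]) -> bool:
    """Return True if every expression is satisfied (expressions are ANDed).

    Only the `In` operator used by the exporter manifest is modeled here.
    """
    for expr in expressions:
        if expr["operator"] != "In":
            raise ValueError(f"unsupported operator: {expr['operator']}")
        # `In` requires the label to exist and its value to be one of `values`.
        if labels.get(expr["key"]) not in expr["values"]:
            return False
    return True


# The single expression from the exporter's nodeAffinity rule.
NPU_EXPRESSIONS = [
    {"key": "rebellions.ai/npu.present", "operator": "In", "values": ["true"]}
]

# Hypothetical node label sets, for illustration only.
npu_node = {"kubernetes.io/hostname": "node-a", "rebellions.ai/npu.present": "true"}
cpu_node = {"kubernetes.io/hostname": "node-b"}

print(node_matches(npu_node, NPU_EXPRESSIONS))  # True: label present and "true"
print(node_matches(cpu_node, NPU_EXPRESSIONS))  # False: label missing
```

A node without the label, or with the label set to any value other than "true", is skipped, which is why the exporter Pod never lands on nodes without an NPU.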
To allow Prometheus to automatically discover and scrape metrics from the RBLN Metrics Exporter, you can create a ServiceMonitor resource. This is especially useful if you're using the Prometheus Operator. Here's an example ServiceMonitor configuration:
| ---
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: rbln-metrics-exporter
    namespace: monitoring
    labels:
      release: prometheus # Must match the serviceMonitorSelector you configured when installing Prometheus so that this ServiceMonitor is discovered
  spec:
    selector:
      matchLabels:
        app.kubernetes.io/name: rbln-metrics-exporter
    namespaceSelector:
      matchNames:
        - kube-system
    endpoints:
      - port: metrics
        path: /metrics
        interval: 30s
        scrapeTimeout: 10s
|
Make sure the selector labels in the ServiceMonitor match the labels on the Service that exposes your RBLN Metrics Exporter Pods (a ServiceMonitor selects Services, not Pods directly), and that the release label (if used) matches your Prometheus deployment. You can apply this ServiceMonitor with kubectl apply -f servicemonitor.yaml.
Step 4: (Optional) Deploy Grafana
If you wish to visualize the Prometheus metrics through Grafana dashboards, deploy Grafana in your Kubernetes cluster using either Helm or the Grafana Operator.
Exported Metrics
The following metrics are exported for each NPU device, tagged with the device UUID, card name, and character device node (rblnN).
Name | Description |
RBLN_DEVICE_STATUS:TEMPERATURE | Temperature (°C) |
RBLN_DEVICE_STATUS:CARD_POWER | Power usage (W) |
RBLN_DEVICE_STATUS:DRAM_USED | DRAM in use (GiB) |
RBLN_DEVICE_STATUS:DRAM_TOTAL | Total DRAM (GiB) |
RBLN_DEVICE_STATUS:UTILIZATION | Utilization (%) |
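Once Prometheus is scraping the exporter, these gauges can be combined in PromQL and read back through the Prometheus HTTP API (the /api/v1/query endpoint). The sketch below builds an instant query for per-device DRAM usage as a percentage. The Prometheus address http://localhost:9090 is an assumption (for example, a local port-forward), and the helper names build_query_url and run_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address (e.g. via a port-forward).
PROMETHEUS_URL = "http://localhost:9090"

# DRAM usage in percent, computed per device from two exported gauges.
# Both series carry the same labels, so PromQL matches them one-to-one.
DRAM_USAGE_PERCENT = (
    "100 * RBLN_DEVICE_STATUS:DRAM_USED / RBLN_DEVICE_STATUS:DRAM_TOTAL"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def run_query(base_url: str, promql: str) -> list:
    """Run an instant query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    # Each entry looks like {"metric": {labels...}, "value": [timestamp, "value"]}.
    return payload["data"]["result"]


# Printing the URL needs no running Prometheus; run_query() does.
print(build_query_url(PROMETHEUS_URL, DRAM_USAGE_PERCENT))
```

With a reachable Prometheus, run_query(PROMETHEUS_URL, DRAM_USAGE_PERCENT) returns one series per NPU device, and each series' metric field carries the card, uuid, and device labels described above.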
Metrics Example
The following is sample output from the exporter's metrics endpoint:
| # TYPE RBLN_DEVICE_STATUS:DRAM_TOTAL gauge
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln1"} 15.71875
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln1"} 15.71875
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 15.71875
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln0"} 15.71875
# TYPE RBLN_DEVICE_STATUS:TEMPERATURE gauge
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln1"} 46
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln0"} 46
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln1"} 45
RBLN_DEVICE_STATUS:TEMPERATURE{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 51
# TYPE RBLN_DEVICE_STATUS:UTILIZATION gauge
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln0"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln1"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 0
RBLN_DEVICE_STATUS:UTILIZATION{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln1"} 0
# TYPE RBLN_DEVICE_STATUS:CARD_POWER gauge
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln0"} 0
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln1"} 0
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln1"} 32.7760009765625
RBLN_DEVICE_STATUS:CARD_POWER{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 35.777000427246094
# TYPE RBLN_DEVICE_STATUS:DRAM_USED gauge
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln1"} 0
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 0.140625
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln1"} 0
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA02",uuid="00000000-0000-0000-0000-000000000000",device="rbln0"} 0
|
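For quick checks without Prometheus, output in this format can also be parsed directly. The following Python sketch is a deliberately simplified parser, not a full implementation of the Prometheus text exposition format: it ignores timestamps and escaped quotes in label values. The names parse_metrics and EXAMPLE are hypothetical, and the input is a two-line excerpt of the sample output above.

```python
import re
from typing import Dict, List, Tuple

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse exposition text into (name, labels, value) tuples.

    Comment lines (# TYPE, # HELP) and blank lines are skipped, as are
    lines this simplified pattern does not recognize.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Excerpt of the sample output: one device, two gauges.
EXAMPLE = '''\
# TYPE RBLN_DEVICE_STATUS:DRAM_TOTAL gauge
RBLN_DEVICE_STATUS:DRAM_TOTAL{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 15.71875
# TYPE RBLN_DEVICE_STATUS:DRAM_USED gauge
RBLN_DEVICE_STATUS:DRAM_USED{card="RBLN-CA02",uuid="ffffffff-ffff-ffff-ffff-ffffffffffff",device="rbln0"} 0.140625
'''

samples = parse_metrics(EXAMPLE)
# Keying by name alone is safe only because the excerpt has a single device.
by_name = {name: value for name, _labels, value in samples}
used = by_name["RBLN_DEVICE_STATUS:DRAM_USED"]
total = by_name["RBLN_DEVICE_STATUS:DRAM_TOTAL"]
print(f"DRAM usage: {100 * used / total:.2f}%")  # DRAM usage: 0.89%
```

For output covering several devices, group samples by the uuid and device labels before combining gauges.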