
RBLN System Management Daemon

Overview

RSMD (Rebellions System Management Daemon) is a system management daemon that monitors and controls RBLN NPU devices. RSMD sits between the kernel driver interfaces and client applications, exposing device management operations through a unified gRPC API.

RSMD enables real-time monitoring of device health metrics including temperature, power consumption, memory usage, clock frequencies, and utilization. The daemon collects kernel events via netlink sockets and maintains event history for diagnostic purposes. Device control operations such as resets are exposed through the gRPC interface.

The system provides a command-line interface (rbln-smdi) for interactive device management. RSMD operates as a systemd service, providing centralized management capabilities for multiple NPU devices in production environments.


Core Components

rbln_daemon

The rbln_daemon is the core service that runs continuously in the background, providing device management capabilities through a gRPC interface.

Functionality

The daemon monitors kernel events via netlink sockets and persists them as CSV logs. Device telemetry including temperature, power consumption, memory usage, clock frequencies, and utilization is exposed through the gRPC interface. Device control operations such as resets are handled through the same interface.

Features

  • Operates as a systemd service (rbln_daemon.service) for automatic startup and lifecycle management
  • Supports both TCP (default port 50051) and Unix domain sockets for client connections
  • Configurable event retention and automatic rotation

Configuration

Environment variables control daemon behavior:

  • RBLN_SMD_PORT (default: 50051): TCP port for gRPC server
  • MAX_RBLN_EVENT (default: 1000): Maximum events stored per device
  • REMOVE_EVENT_NUM_AFTER_EXCEED (default: 500): Number of events to delete when the limit is reached
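How the two event limits interact can be illustrated with a small simulation. This is a minimal sketch, not the daemon's implementation. It assumes that trimming happens once the stored count exceeds MAX_RBLN_EVENT and that the oldest events are the ones removed; the documentation above does not state either detail. The function name append_event is hypothetical.

```python
import os

# Defaults mirror the documented environment variables.
MAX_EVENTS = int(os.environ.get("MAX_RBLN_EVENT", "1000"))
REMOVE_COUNT = int(os.environ.get("REMOVE_EVENT_NUM_AFTER_EXCEED", "500"))


def append_event(events, event, max_events=MAX_EVENTS, remove_count=REMOVE_COUNT):
    """Append an event and trim the history once the limit is exceeded.

    Assumption: the oldest entries (front of the list) are removed first.
    """
    events.append(event)
    if len(events) > max_events:
        del events[:remove_count]
    return events


if __name__ == "__main__":
    history = []
    for i in range(1001):
        append_event(history, i, max_events=1000, remove_count=500)
    # Shows how many events remain after the first trim and which is oldest.
    print(len(history), history[0])
```

Under these assumptions, a larger REMOVE_EVENT_NUM_AFTER_EXCEED means fewer trim operations but a bigger loss of history each time one occurs.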

Service Management

The daemon runs as a systemd service. Service status can be checked with:

sudo systemctl status rbln_daemon

Logs are available through the systemd journal:

sudo journalctl -u rbln_daemon -f

The daemon can be configured to use a Unix domain socket in addition to the TCP port by specifying the --uri option in the systemd service configuration:

rbln_daemon --uri unix:///var/run/rbln.sock
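A client script can prefer the Unix domain socket when it exists and fall back to TCP otherwise. The sketch below only builds the gRPC target string; the helper name grpc_target is hypothetical, and the socket path is the one from the example above, which may differ on a given installation.

```python
import os


def grpc_target(socket_path="/var/run/rbln.sock", host="localhost", port=50051):
    """Return a gRPC target string, preferring the Unix socket when present."""
    if socket_path and os.path.exists(socket_path):
        # gRPC name syntax for an absolute Unix socket path: unix:///abs/path
        return "unix://" + socket_path
    # Fall back to the TCP endpoint (documented default port 50051).
    return f"{host}:{port}"


if __name__ == "__main__":
    print(grpc_target())
```

The returned string can be passed to a gRPC channel constructor in any client generated from rbln_services.proto.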

Event Logs:

Device events are stored in /var/log/rebellions/rsmd_<device>.event as CSV files. These logs contain:

  • Event type and source
  • Timestamps (UTC and kernel time)
  • Event data and sub-values

Protobuf API

The protobuf API defines the gRPC service interface that all RSMD clients use. The protobuf schema (rbln_services.proto) is a language-agnostic interface definition from which clients can be generated in multiple programming languages, including Python, C++, Go, and Java. Because every message field is strictly typed, clients and the daemon agree on the data layout, and messages are exchanged in the compact Protocol Buffers binary encoding.

Available Services:

The RBLNServices gRPC service provides the following RPC methods:

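+--------------------------+----------------------------------------+----------------------+
| Method                   | Description                            | Returns              |
+--------------------------+----------------------------------------+----------------------+
| getDeviceList            | List all detected devices              | Stream of Device     |
| getServiceableDeviceList | List devices ready for operations      | Stream of Device     |
| resetDevice              | Reset a specific device                | StatusMsg            |
| resetAllDevice           | Reset all devices in system            | StatusMsg            |
| getVersion               | Get firmware, driver, and SMC versions | VersionInfo          |
| getHWInfo                | Get temperature and power consumption  | HWInfo               |
| getMemoryInfo            | Get total and used memory              | MemoryInfo           |
| getClockInfo             | Get device clock frequencies           | ClockInfo            |
| getEventInfo             | Get hardware events from kernel        | Stream of EventInfo  |
| getTotalInfo             | Get comprehensive device information   | Stream of DeviceInfo |
| getUtilization           | Get NPU utilization percentage         | UtilInfo             |
+--------------------------+----------------------------------------+----------------------+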

Message Types:

Key message types include:

  • Device: Device identifier (name, UUID, PCI bus ID, device ID)
  • DeviceInfo: Complete device status (memory, temperature, power, version, utilization, status)
  • HWInfo: Hardware telemetry (temperature in milli-Celsius, power in micro-watts)
  • MemoryInfo: Memory usage (total and used in GB)
  • ClockInfo: Clock frequencies for CP, DNC, Bus, SHM, DRAM (MHz)
  • EventInfo: Kernel-reported events with timestamps
  • VersionInfo: Firmware, driver, and SMC versions
  • UtilInfo: Device utilization percentage

Protocol Definition

The protocol buffer definition file (rbln_services.proto) is typically installed in /opt/rebellions/etc/ or a similar system location. Client applications use this file to generate language-specific bindings for gRPC communication with the daemon.
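For Python clients, bindings are commonly generated with the grpcio-tools package. The sketch below only assembles the command line; it assumes the .proto file sits in /opt/rebellions/etc (the "typical" location mentioned above, which may differ on your system) and that grpcio-tools is installed before the command is actually run. The helper name protoc_command is hypothetical.

```python
import sys
from pathlib import Path


def protoc_command(proto_dir="/opt/rebellions/etc", out_dir="."):
    """Build the grpc_tools.protoc invocation for rbln_services.proto."""
    proto_dir = Path(proto_dir)
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{proto_dir}",                  # directory searched for imports
        f"--python_out={out_dir}",         # message classes (*_pb2.py)
        f"--grpc_python_out={out_dir}",    # service stubs (*_pb2_grpc.py)
        str(proto_dir / "rbln_services.proto"),
    ]


if __name__ == "__main__":
    # Print the command; execute it with subprocess.run(cmd, check=True)
    # once grpcio-tools is available.
    print(" ".join(protoc_command()))
```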


rbln-smdi

The rbln-smdi command-line interface provides interactive access to RSMD functionality. The tool offers formatted table output for human-readable results, with optional JSON output for automation and scripting. The CLI supports remote connections to daemon instances running on different hosts.

Common Options

All commands support these global options:

  • --ip <address>: gRPC server IP (default: localhost)
  • --port <number>: gRPC server port (default: 50051)
  • --jsons: Output in JSON format instead of tables
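For scripting, the global options can be assembled programmatically. The following is a minimal sketch that uses only the flags listed above; the helper names smdi_command and run_smdi are hypothetical. It also assumes that rbln-smdi is on PATH and signals failure through a non-zero exit status, which this document does not confirm.

```python
import subprocess


def smdi_command(*subcommand, ip=None, port=None, as_json=False):
    """Build an rbln-smdi argument list from the documented global options."""
    argv = ["rbln-smdi", *subcommand]
    if ip is not None:
        argv += ["--ip", str(ip)]
    if port is not None:
        argv += ["--port", str(port)]
    if as_json:
        argv.append("--jsons")
    return argv


def run_smdi(*subcommand, **options):
    """Run rbln-smdi and return its standard output as text.

    Assumption: a non-zero exit status indicates failure (check=True raises).
    """
    result = subprocess.run(
        smdi_command(*subcommand, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(smdi_command("get", "list", ip="192.168.1.100", port=50051))
```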

CLI Examples

Summary

Lists all detected devices (local/remote) and supports JSON output for automation.

Command

Command (local)
$ rbln-smdi get list
Command (remote)
$ rbln-smdi get list --ip 192.168.1.100 --port 50051
Command (JSON)
$ rbln-smdi get list --jsons

Output (example)

Device list (example)
Rebellions Device List

+-----+--------------+
| IDX | DEVICE NAME  |
+-----+--------------+
|  0  |    rbln0     |
|  1  |    rbln1     |
+-----+--------------+
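When only the table output is at hand, it can be converted into records with a few lines of Python. This is a sketch that assumes the layout matches the example above (a header row followed by data rows, all delimited by "|"); the function name parse_table is hypothetical. Because the JSON schema produced by --jsons is not described here, the sketch works on the table form.

```python
def parse_table(text):
    """Parse a '+---+' bordered table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only content rows; border lines start with '+'.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """\
+-----+--------------+
| IDX | DEVICE NAME  |
+-----+--------------+
|  0  |    rbln0     |
|  1  |    rbln1     |
+-----+--------------+
"""

if __name__ == "__main__":
    for device in parse_table(SAMPLE):
        print(device["IDX"], device["DEVICE NAME"])
```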

Summary

Retrieves device information (summary for all devices, detailed info for a specific device) and supports JSON output for automation.

Command

Command (summary)
$ rbln-smdi get all
Command (detail)
$ rbln-smdi get info rbln0
Command (JSON)
$ rbln-smdi get all --jsons

Output (example)

Device information (example)
Device Information

+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+
| DEV NAME |                 UUID                 | TOTAL MEM(GB)| USED MEM(GB)| TEMP.(c) | Power (mW) | P-state | FW VER  | DRV VER | UTIL(%)  |
+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+
|  rbln0   | 00000000-0000-0000-0000-000000000001 |     16.0     |    2.5      |  45.23   |   100.5    |    0    | 1.2.3   | 2.1.0   |   15.5   |
+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+

Output fields

The output includes:

  • Device name and UUID
  • Memory usage (total/used in GB)
  • Temperature (°C)
  • Power consumption (mW)
  • P-state
  • Firmware and driver versions
  • Utilization (%)

Summary

Prints event history for a device.

Command

$ rbln-smdi get event rbln0

Output (example)

Event information (example)
Device rbln0 Event Information

+-----+----------+------------------+---------------------+--------------+-------------------+-------+
| IDX | DEV NAME |       TYPE       |      UTC TIME       | KERNEL TIME  |       DATA1       | DATA2 |
+-----+----------+------------------+---------------------+--------------+-------------------+-------+
|  0  |  rbln0   |  NO_RESPONSE     | 2024-01-15 10:30:45 | 12345.678    |     CP_EVENT      |  0x0  |
|  1  |  rbln0   | RESPONSE_REQUIRED| 2024-01-15 10:25:12 | 12300.123    | SINGLE_HARD_RESET |  0x1  |
+-----+----------+------------------+---------------------+--------------+-------------------+-------+

Notes

Events show:

  • Event type (for example: NO_RESPONSE, RESPONSE_REQUIRED)
  • Event source (for example: SINGLE_HARD_RESET, RSD_HARD_RESET, TDR_EVENT, CP_EVENT)
  • UTC and kernel timestamps
  • Event data values
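Filtering the event table by type is a common follow-up, for example to find events marked RESPONSE_REQUIRED. The sketch below assumes the column names and timestamp format shown in the example output; the function names are hypothetical, and the sample rows are copied from that example.

```python
from datetime import datetime

SAMPLE = """\
| IDX | DEV NAME |       TYPE       |      UTC TIME       | KERNEL TIME  |       DATA1       | DATA2 |
|  0  |  rbln0   |  NO_RESPONSE     | 2024-01-15 10:30:45 | 12345.678    |     CP_EVENT      |  0x0  |
|  1  |  rbln0   | RESPONSE_REQUIRED| 2024-01-15 10:25:12 | 12300.123    | SINGLE_HARD_RESET |  0x1  |
"""


def parse_events(text):
    """Turn '|'-delimited event rows into dicts with a parsed UTC timestamp."""
    rows = [
        [cell.strip() for cell in line.strip()[1:-1].split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|") and line.strip().endswith("|")
    ]
    if not rows:
        return []
    header, *body = rows
    events = [dict(zip(header, row)) for row in body]
    for event in events:
        # Assumed format, taken from the example output.
        event["UTC TIME"] = datetime.strptime(event["UTC TIME"], "%Y-%m-%d %H:%M:%S")
    return events


def events_of_type(events, event_type):
    """Return only the events whose TYPE column equals event_type."""
    return [e for e in events if e["TYPE"] == event_type]


if __name__ == "__main__":
    for e in events_of_type(parse_events(SAMPLE), "RESPONSE_REQUIRED"):
        print(e["UTC TIME"].isoformat(), e["DATA1"], e["DATA2"])
```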

Summary

Resets a specific device or all devices. Supports remote daemon connections.

Command

Command (single device)
$ rbln-smdi reset rbln0
Command (all devices)
$ rbln-smdi reset all
Command (remote daemon)
$ rbln-smdi reset rbln0 --ip 192.168.1.100 --port 50051

Output (example)

Reset result (example)
device rbln0 reset succeeded

On failure

Reset result (failure example)
device rbln0 reset failed
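In automation, the result line can be checked explicitly rather than read by eye. This is a sketch that matches only the two single-device messages shown above; the output of a "reset all" run is not shown in this document, so it is not handled. The function name parse_reset_result is hypothetical.

```python
import re

# Matches the documented messages: "device <name> reset succeeded|failed".
_RESULT = re.compile(r"^device (\S+) reset (succeeded|failed)$")


def parse_reset_result(line):
    """Return (device_name, succeeded) for a single reset result line."""
    match = _RESULT.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized reset output: {line!r}")
    return match.group(1), match.group(2) == "succeeded"


if __name__ == "__main__":
    print(parse_reset_result("device rbln0 reset succeeded"))
    print(parse_reset_result("device rbln0 reset failed"))
```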

Telemetry Units

RSMD uses the following units for telemetry data:

  • Temperature: milli-degree Celsius (Divide by 1,000 to convert to °C)
  • Power: micro-watts (Divide by 1,000,000 to convert to Watts)
  • Memory: gigabytes (GB)
  • Clock: megahertz (MHz)
  • Utilization: percentage (0-100)

NOTE: The CLI tool converts these values automatically for display, showing temperature in °C and power in mW.
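Clients that read raw API values need to apply these conversions themselves. The helpers below are a minimal sketch with hypothetical names; the raw inputs in the usage lines are illustrative values chosen to correspond to the 45.23 °C and 100.5 mW shown in the CLI example, not captured API output.

```python
def milli_celsius_to_celsius(value):
    """Convert milli-degrees Celsius to degrees Celsius."""
    return value / 1_000


def micro_watts_to_watts(value):
    """Convert micro-watts to watts."""
    return value / 1_000_000


def micro_watts_to_milli_watts(value):
    """Convert micro-watts to milli-watts (the unit the CLI displays)."""
    return value / 1_000


if __name__ == "__main__":
    print(milli_celsius_to_celsius(45230))      # degrees Celsius
    print(micro_watts_to_milli_watts(100500))   # milli-watts
    print(micro_watts_to_watts(100500))         # watts
```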


Recommendations

  1. Regular monitoring: We recommend running scheduled checks to track device health metrics over time.
  2. Event retention: Tune MAX_RBLN_EVENT based on your retention requirements and available disk space.
  3. Network exposure: Prefer Unix domain sockets for local access; if you expose TCP, restrict access with firewall rules.
  4. Error handling: Check err_status fields in API responses before consuming results.
  5. Log review: Periodically review event logs to identify anomalies and recurring patterns.

References

  • gRPC Definition: The protocol buffer definition file (rbln_services.proto) is typically located in /opt/rebellions/etc/ or a similar system configuration directory
  • Systemd Service: The service unit file is installed at /etc/systemd/system/rbln_daemon.service