RBLN System Management Daemon
Overview
RSMD (Rebellions System Management Daemon) is a system management daemon that monitors and controls RBLN NPU devices. It bridges the kernel driver interfaces and client applications, providing a unified gRPC-based API surface for device management operations.
RSMD enables real-time monitoring of device health metrics including temperature, power consumption, memory usage, clock frequencies, and utilization. The daemon collects kernel events via netlink sockets and maintains event history for diagnostic purposes. Device control operations such as resets are exposed through the gRPC interface.
The system provides a command-line interface (rbln-smdi) for interactive device management. RSMD operates as a systemd service, providing centralized management capabilities for multiple NPU devices in production environments.
Core Components
rbln_daemon
The rbln_daemon is the core service that runs continuously in the background, providing device management capabilities through a gRPC interface.
Functionality
The daemon monitors kernel events via netlink sockets and persists them as CSV logs. Device telemetry including temperature, power consumption, memory usage, clock frequencies, and utilization is exposed through the gRPC interface. Device control operations such as resets are handled through the same interface.
Features
- Operates as a systemd service (`rbln_daemon.service`) for automatic startup and lifecycle management
- Supports both TCP (default port 50051) and Unix domain sockets for client connections
- Configurable event retention and automatic rotation
Configuration
Environment variables control daemon behavior:
- `RBLN_SMD_PORT` (default: `50051`): TCP port for the gRPC server
- `MAX_RBLN_EVENT` (default: `1000`): Maximum number of events stored per device
- `REMOVE_EVENT_NUM_AFTER_EXCEED` (default: `500`): Number of events to delete when the limit is reached
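The interaction between the two event limits can be sketched as follows. This is a minimal illustration, not the daemon's implementation: it assumes that the oldest events are the ones deleted and that deletion is triggered as soon as the stored count exceeds the maximum. `EventStore` is a hypothetical name used only for this example.

```python
import os
from collections import deque

# Variable names and defaults come from the configuration list above.
MAX_EVENTS = int(os.environ.get("MAX_RBLN_EVENT", "1000"))
REMOVE_COUNT = int(os.environ.get("REMOVE_EVENT_NUM_AFTER_EXCEED", "500"))


class EventStore:
    """Hypothetical per-device event buffer (assumption: oldest events go first)."""

    def __init__(self, max_events=MAX_EVENTS, remove_count=REMOVE_COUNT):
        self.max_events = max_events
        self.remove_count = remove_count
        self.events = deque()

    def add(self, event):
        self.events.append(event)
        # Assumption: trimming happens once the count exceeds the limit.
        if len(self.events) > self.max_events:
            for _ in range(min(self.remove_count, len(self.events))):
                self.events.popleft()


store = EventStore(max_events=1000, remove_count=500)
for i in range(1001):
    store.add(i)
print(len(store.events))  # 501: the 1001st event triggered one trim of 500
```

Under these assumptions, a larger `REMOVE_EVENT_NUM_AFTER_EXCEED` means fewer trims but a bigger loss of history each time the limit is hit.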
Service Management
The daemon runs as a systemd service. Service status can be checked with `systemctl status rbln_daemon.service`.
Logs are available through the systemd journal with `journalctl -u rbln_daemon.service`.
The daemon can be configured to use a Unix domain socket in addition to the TCP port by specifying the --uri option in the systemd service configuration.
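On the client side, a gRPC channel is addressed by a target string, and the form differs between TCP and Unix domain sockets. The sketch below builds both forms using gRPC's standard naming syntax. It does not describe the value format expected by the daemon's --uri option, and the socket path shown is hypothetical.

```python
def grpc_target(host="localhost", port=50051, unix_socket=None):
    """Return a gRPC client target string for TCP or a Unix domain socket."""
    if unix_socket is not None:
        # gRPC's "unix://" form requires an absolute path, which yields
        # three slashes in total (two from the scheme, one from the path).
        if not unix_socket.startswith("/"):
            raise ValueError("unix_socket must be an absolute path")
        return "unix://" + unix_socket
    return f"{host}:{port}"


print(grpc_target())  # localhost:50051
# The socket path below is a placeholder, not a documented RSMD location.
print(grpc_target(unix_socket="/run/example/rsmd.sock"))  # unix:///run/example/rsmd.sock
```

The resulting string is what a gRPC client library would take when opening a channel, for example as the argument to `grpc.insecure_channel` in the Python gRPC package.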
Event Logs
Device events are stored in /var/log/rebellions/rsmd_<device>.event as CSV files. These logs contain:
- Event type and source
- Timestamps (UTC and kernel time)
- Event data and sub-values
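Because the event logs are plain CSV, they can be read with standard tooling. The sketch below uses Python's `csv` module. The column order in `ASSUMED_FIELDS` is hypothetical (the list above names the kinds of fields but not their layout), so check a real log file and pass the actual column names before relying on it. The sample rows are illustrative, not real daemon output.

```python
import csv
import io

# Hypothetical column layout; verify against a real log file.
ASSUMED_FIELDS = ["type", "source", "utc_time", "kernel_time", "data", "sub_value"]


def read_events(fileobj, fieldnames=ASSUMED_FIELDS):
    """Yield one dict per CSV row, keyed by the supplied column names."""
    for row in csv.DictReader(fileobj, fieldnames=fieldnames):
        yield row


# Illustrative rows only.
sample = io.StringIO(
    "NO_RESPONSE,CP_EVENT,2024-01-15 10:30:45,12345.678,0x0,0\n"
    "RESPONSE_REQUIRED,SINGLE_HARD_RESET,2024-01-15 10:25:12,12300.123,0x1,0\n"
)
events = list(read_events(sample))
needs_action = [e for e in events if e["type"] == "RESPONSE_REQUIRED"]
print(len(events), needs_action[0]["source"])
```

To read an actual log, replace the `io.StringIO` sample with a file opened via `open(path, newline="")`.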
Protobuf API
The protobuf API defines the gRPC service interface that all RSMD clients use. The schema (rbln_services.proto) is a language-agnostic interface definition from which clients can be generated in multiple programming languages, including Python, C++, Go, and Java. Because message fields are strictly typed, type mismatches are caught when messages are built or parsed, and the compact binary wire format keeps payloads small.
Available Services
The RBLNServices gRPC service provides the following RPC methods:
| Method | Description | Returns |
|---|---|---|
| `getDeviceList` | List all detected devices | Stream of Device |
| `getServiceableDeviceList` | List devices ready for operations | Stream of Device |
| `resetDevice` | Reset a specific device | StatusMsg |
| `resetAllDevice` | Reset all devices in system | StatusMsg |
| `getVersion` | Get firmware, driver, and SMC versions | VersionInfo |
| `getHWInfo` | Get temperature and power consumption | HWInfo |
| `getMemoryInfo` | Get total and used memory | MemoryInfo |
| `getClockInfo` | Get device clock frequencies | ClockInfo |
| `getEventInfo` | Get hardware events from kernel | Stream of EventInfo |
| `getTotalInfo` | Get comprehensive device information | Stream of DeviceInfo |
| `getUtilization` | Get NPU utilization percentage | UtilInfo |
Message Types
Key message types include:
- `Device`: Device identifier (name, UUID, PCI bus ID, device ID)
- `DeviceInfo`: Complete device status (memory, temperature, power, version, utilization, status)
- `HWInfo`: Hardware telemetry (temperature in millidegrees Celsius, power in microwatts)
- `MemoryInfo`: Memory usage (total and used, in GB)
- `ClockInfo`: Clock frequencies for CP, DNC, Bus, SHM, and DRAM (in MHz)
- `EventInfo`: Kernel-reported events with timestamps
- `VersionInfo`: Firmware, driver, and SMC versions
- `UtilInfo`: Device utilization percentage
Protocol Definition
The protocol buffer definition file (rbln_services.proto) is typically installed in /opt/rebellions/etc/ or a similar system location. Client applications use this file to generate language-specific bindings for gRPC communication with the daemon.
rbln-smdi
The rbln-smdi command-line interface provides interactive access to RSMD functionality. The tool offers formatted table output for human-readable results, with optional JSON output for automation and scripting. The CLI supports remote connections to daemon instances running on different hosts.
Common Options
All commands support these global options:
- `--ip <address>`: gRPC server IP (default: `localhost`)
- `--port <number>`: gRPC server port (default: `50051`)
- `--jsons`: Output in JSON format instead of tables
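For scripting, the global options can be assembled programmatically before invoking the CLI. The helper below is a sketch: `smdi_command` is a hypothetical name, the subcommand is left as a placeholder for the caller to supply, and placing the global options after the subcommand is an assumption rather than documented behavior.

```python
import shlex


def smdi_command(subcommand_args, ip="localhost", port=50051, json_output=False):
    """Build an argv list for rbln-smdi; subcommand_args comes from the caller."""
    # Assumption: global options may follow the subcommand.
    argv = ["rbln-smdi", *subcommand_args, "--ip", ip, "--port", str(port)]
    if json_output:
        argv.append("--jsons")
    return argv


# "<subcommand>" is a placeholder, not a real rbln-smdi subcommand name.
argv = smdi_command(["<subcommand>"], ip="192.0.2.10", json_output=True)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True, check=True)`. Passing a list rather than a shell string avoids quoting problems with host names or paths.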
CLI Examples
Summary
Lists all detected devices (local/remote) and supports JSON output for automation.
Command
Output (example)
Summary
Retrieves device information (summary for all devices, detailed info for a specific device) and supports JSON output for automation.
Command
Output (example)
Device Information
+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+
| DEV NAME | UUID | TOTAL MEM(GB)| USED MEM(GB)| TEMP.(c) | Power (mW) | P-state | FW VER | DRV VER | UTIL(%) |
+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+
| rbln0 | 00000000-0000-0000-0000-000000000001 | 16.0 | 2.5 | 45.23 | 100.5 | 0 | 1.2.3 | 2.1.0 | 15.5 |
+----------+--------------------------------------+--------------+-------------+----------+------------+---------+---------+---------+----------+
Output fields
The output includes:
- Device name and UUID
- Memory usage (total/used in GB)
- Temperature (°C)
- Power consumption (mW)
- P-state
- Firmware and driver versions
- Utilization (%)
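JSON output is the better choice for automation, but when only the table form is at hand (for example, in a saved log), the bordered layout shown above is regular enough to parse. The sketch below is a fallback parser with the hypothetical name `parse_table`. The sample is a reduced, three-column subset of the example output.

```python
def parse_table(text):
    """Parse a '+---+'-bordered table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders, titles, and blank lines
        # Splitting on "|" leaves empty strings at both ends; drop them.
        rows.append([cell.strip() for cell in line.split("|")[1:-1]])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample = """\
Device Information
+----------+----------+---------+
| DEV NAME | TEMP.(c) | UTIL(%) |
+----------+----------+---------+
| rbln0    | 45.23    | 15.5    |
+----------+----------+---------+
"""
devices = parse_table(sample)
print(devices[0]["DEV NAME"], float(devices[0]["TEMP.(c)"]))  # rbln0 45.23
```

All values come back as strings, so numeric fields need an explicit `float()` conversion as shown.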
Summary
Prints event history for a device.
Command
Output (example)
Device rbln0 Event Information
+-----+----------+------------------+---------------------+--------------+-------------------+-------+
| IDX | DEV NAME | TYPE | UTC TIME | KERNEL TIME | DATA1 | DATA2 |
+-----+----------+------------------+---------------------+--------------+-------------------+-------+
| 0 | rbln0 | NO_RESPONSE | 2024-01-15 10:30:45 | 12345.678 | CP_EVENT | 0x0 |
| 1 | rbln0 | RESPONSE_REQUIRED| 2024-01-15 10:25:12 | 12300.123 | SINGLE_HARD_RESET | 0x1 |
+-----+----------+------------------+---------------------+--------------+-------------------+-------+
Notes
Events show:
- Event type (for example: `NO_RESPONSE`, `RESPONSE_REQUIRED`)
- Event source (for example: `SINGLE_HARD_RESET`, `RSD_HARD_RESET`, `TDR_EVENT`, `CP_EVENT`)
- UTC and kernel timestamps
- Event data values
Summary
Resets a specific device or all devices. Supports remote daemon connections.
Command
Output (example)
On failure
Telemetry Units
RSMD uses the following units for telemetry data:
- Temperature: millidegrees Celsius (divide by 1,000 to convert to °C)
- Power: microwatts (divide by 1,000,000 to convert to watts)
- Memory: gigabytes (GB)
- Clock: megahertz (MHz)
- Utilization: percentage (0-100)
NOTE: The CLI tool automatically converts these into human-readable formats.
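Clients that read raw values through the gRPC API need to apply these conversions themselves. A minimal sketch of the arithmetic follows. The function names are hypothetical, and the raw readings are illustrative values chosen to match the CLI example shown earlier (45.23 °C and 100.5 mW).

```python
def millicelsius_to_celsius(value):
    """Convert a raw temperature reading to degrees Celsius."""
    return value / 1_000


def microwatts_to_milliwatts(value):
    """Convert a raw power reading to milliwatts, as displayed by the CLI."""
    return value / 1_000


def microwatts_to_watts(value):
    """Convert a raw power reading to watts."""
    return value / 1_000_000


raw_temp, raw_power = 45_230, 100_500  # illustrative raw readings
print(millicelsius_to_celsius(raw_temp))    # 45.23
print(microwatts_to_milliwatts(raw_power))  # 100.5
print(microwatts_to_watts(raw_power))       # 0.1005
```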
Recommendations
- Regular monitoring: Run scheduled checks to track device health metrics over time.
- Event retention: Tune `MAX_RBLN_EVENT` based on your retention requirements and available disk space.
- Network exposure: Prefer Unix domain sockets for local access; if you expose TCP, restrict access with firewall rules.
- Error handling: Check `err_status` fields in API responses before consuming results.
- Log review: Periodically review event logs to identify anomalies and recurring patterns.
References
- gRPC Definition: The protocol buffer definition file (`rbln_services.proto`) is typically located in `/opt/rebellions/etc/` or a system configuration directory.
- Systemd Service: The service unit file is installed at `/etc/systemd/system/rbln_daemon.service`.