RBLN System Management Daemon

Overview

RSMD (Rebellions System Management Daemon) monitors and controls RBLN NPU devices. It sits between the kernel driver interfaces and client applications, exposing a unified gRPC API for device management operations.

RSMD enables real-time monitoring of device health metrics including temperature, power consumption, memory usage, clock frequencies, and utilization. The daemon collects kernel events via netlink sockets and maintains event history and coredumps for diagnostic purposes. Device control operations such as resets, RSD group management, and runtime configuration changes are exposed through the gRPC interface.

RSMD includes a command-line interface (rbln-smdi) for interactive device management. The daemon runs as a systemd service and centralizes management of multiple NPU devices in production environments; mTLS and IP/CN-based access control secure remote operation.


Architecture

RSMD follows a daemon–client model. A single rbln-smd process owns access to every NPU on the host and exposes them through one gRPC interface; the rbln-smdi CLI, remote clients, and automation connect to that interface rather than to the kernel directly. Centralizing access behind one endpoint keeps a continuous record of kernel events that on-demand tools would miss and places every remote operation behind a single mTLS and allowlist boundary.

[Figure: RSMD architecture]

Component | Role
rbln-smd | Background daemon: collects kernel events, reads device telemetry, and executes device control, exposing all of it over gRPC.
rbln-smdi | Command-line client: interactive queries and control, including connections to a daemon on a remote host.
rbln_smd.proto | gRPC contract: the service and message definitions clients use to generate language bindings.

Core Components

Daemon (rbln-smd)

rbln-smd is the core daemon that runs continuously in the background, providing device management capabilities through a gRPC interface.

Deprecation

A binary-identical copy is also installed as rbln_daemon for backward compatibility with earlier configurations. This legacy name is scheduled for removal in v4.0; migrate scripts and tooling to rbln-smd.

Functionality

rbln-smd performs the following:

  • Monitors kernel events via netlink sockets and persists them as CSV logs
  • Exposes device telemetry (temperature, power, memory, clock, utilization) through the gRPC interface
  • Handles device control operations such as resets
  • Manages RSD groups (creation and destruction)
  • Enforces mTLS and IP/CN-based access control

Features

  • Operates as a systemd service (rbln-smd.service) for automatic startup and lifecycle management
  • Supports both TCP (default port 50051) and Unix domain sockets for client connections
  • Provides configurable event and coredump retention with automatic rotation

Configuration

The following environment variables control daemon behavior:

  • RBLN_SMD_PORT (default: 50051): TCP port for gRPC server
  • RSMD_MAX_EVENTS_PER_DEV (default: 1000): Maximum events retained per device. The oldest events are removed when the limit is reached.
  • RSMD_MAX_COREDUMPS_PER_DEV (default: 200): Maximum coredump directories retained per device. The oldest entries are removed when the limit is reached.
  • RSMD_CERT_PATH: Server certificate directory when mTLS is enabled. --cert takes precedence.
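The retention rule described above (the oldest entry is dropped once the limit is reached) behaves like a bounded queue. The following is a minimal sketch of that behavior, not the daemon's implementation: the helper names read_limits and make_event_buffer are hypothetical, and only the variable names and defaults come from the list above.

```python
import os
from collections import deque

# Defaults taken from the list above; the helper names are illustrative only.
DEFAULTS = {
    "RBLN_SMD_PORT": 50051,
    "RSMD_MAX_EVENTS_PER_DEV": 1000,
    "RSMD_MAX_COREDUMPS_PER_DEV": 200,
}

def read_limits(env=None):
    """Return the integer settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}

def make_event_buffer(max_events):
    """A deque with maxlen discards the oldest item once the limit is reached."""
    return deque(maxlen=max_events)

# Simulate a very small limit to make the rotation visible.
limits = read_limits({"RSMD_MAX_EVENTS_PER_DEV": "3"})
events = make_event_buffer(limits["RSMD_MAX_EVENTS_PER_DEV"])
for i in range(5):
    events.append(f"event-{i}")
print(list(events))  # ['event-2', 'event-3', 'event-4']
```

With a limit of 3 and five appended events, only the three most recent remain, which mirrors how a larger RSMD_MAX_EVENTS_PER_DEV trades disk space for longer history.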

Command-line Flags

The daemon accepts the following flags (append to ExecStart= in the systemd unit, or override with systemctl edit):

  • --mtls: Enable mTLS mode.
  • --allow-reset: Allow device reset calls (resetDevice, resetAllDevice).
  • --cert <dir>: TLS server certificate directory.
  • -a, --allowlist <file>: Per-client access restriction file (IP / CN based).
  • --uri <uri>: Alternate binding (e.g., unix:///var/run/rbln.sock).
  • --pid-file <path>: PID file path (default /run/rbln-smd.pid).

Info

See Access Control and mTLS for the full behavior of the security-related flags (--mtls, --allow-reset, --cert, -a, --allowlist).

Service Management

The daemon runs as a systemd service. Service status can be checked with:

sudo systemctl status rbln-smd

Logs are available through the systemd journal:

sudo journalctl -u rbln-smd -f

The daemon can be configured to use a Unix domain socket in addition to the TCP port by specifying the --uri option in the systemd service configuration:

rbln-smd --uri unix:///var/run/rbln.sock

Info

rbln-smd.service and rbln_daemon.service point to the same binary and are made mutually exclusive through the systemd Conflicts= directive. If both units are enabled, only one of them runs.

Event Logs

Device events are stored in /var/log/rebellions/rsmd_<device>.event as CSV files. These logs contain:

  • Event type and source
  • Timestamps (UTC and kernel time)
  • Event data and sub-values
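For offline inspection, the logs can be read with any CSV parser. The sketch below relies only on the file name pattern given above (rsmd_<device>.event). The column order and the presence of a header row are not specified in this document, so rows are returned unmodified rather than mapped to field names; the function names are hypothetical.

```python
import csv
import re
from pathlib import Path

# File name pattern from the text above: rsmd_<device>.event
LOG_NAME = re.compile(r"^rsmd_(?P<device>.+)\.event$")

def device_from_log(path):
    """Return the device name encoded in an event log file name, or None."""
    match = LOG_NAME.match(Path(path).name)
    return match.group("device") if match else None

def read_event_rows(path):
    """Yield each CSV record as a list of strings, skipping blank lines.

    The column layout is not documented here, so no field names are assumed.
    """
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:
                yield row

def rows_per_device(log_dir="/var/log/rebellions"):
    """Count stored records per device across all event logs in log_dir."""
    counts = {}
    for path in sorted(Path(log_dir).glob("rsmd_*.event")):
        device = device_from_log(path)
        if device is not None:
            counts[device] = sum(1 for _ in read_event_rows(path))
    return counts
```

Calling rows_per_device() on a host gives a quick view of how close each device is to its retention limit; if the files carry a header row, subtract one from each count.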

Protobuf API

The protobuf API defines the gRPC service interface that all RSMD clients use. The protobuf schema (rbln_smd.proto) provides a language-agnostic interface definition that enables client generation in multiple programming languages including Python, C++, Go, and Java. Because every field is strictly typed, malformed messages are rejected at the interface boundary, and the compact binary wire format keeps serialization overhead low.

Available Services

The RBLNServices gRPC service provides the following RPC methods:

Method | Description | Returns
getDeviceList | List all detected devices | Stream of Device
getServiceableDeviceList | List devices ready for operations | Stream of Device
resetDevice | Reset a specific device | StatusMsg
resetAllDevice | Reset all devices in system | StatusMsg
getVersion | Get firmware, driver, and SMC versions | VersionInfo
getHWInfo | Get temperature and power consumption | HWInfo
getMemoryInfo | Get total and used memory | MemoryInfo
getClockInfo | Get device clock frequencies | ClockInfo
getEventInfo | Get hardware events from kernel | Stream of EventInfo
getTotalInfo | Get comprehensive device information | Stream of DeviceInfo
getUtilization | Get NPU utilization percentage | UtilInfo
RblnListTopology | Get device topology (NUMA, CPU affinity, RSD group, PCIe link) | RblnListTopologyResponse
RblnListCoredumps | List coredump entries for devices | RblnListCoredumpsResponse
RblnGetConfig | Get a runtime configuration value | RblnGetConfigResponse
RblnSetConfig | Set a runtime configuration value | RblnSetConfigResponse
RblnListGroups | List active RSD groups | RblnListGroupsResponse
RblnCreateGroup | Create an RSD group with the specified devices | RblnGroupOpResponse
RblnDestroyGroup | Destroy an RSD group by group ID (-1 destroys all) | RblnGroupOpResponse

Message Types

Key message types include:

  • Device: Device identifier (name, UUID, PCI bus ID, device ID, card name)
  • DeviceInfo: Complete device status (memory, temperature, power, version, utilization, status, P-state)
  • HWInfo: Hardware telemetry (temperature in milli-Celsius, power in micro-watts)
  • MemoryInfo: Memory usage (total and used in GB)
  • ClockInfo: Clock frequencies for CP, DNC, Bus, SHM, DRAM (MHz)
  • EventInfo: Kernel-reported events with timestamps
  • VersionInfo: Firmware, driver, and SMC versions
  • UtilInfo: Device utilization percentage
  • TopologyEntry: NUMA node, CPU affinity, RSD group, PCIe link speed/width
  • CoredumpEntry: Coredump directory path with timestamp
  • GroupEntry: RSD group ID, name, and member devices
  • DeviceFilter: Server-side device filter (device_ids / group_ids, mutually exclusive)
  • ErrorDetail: Per-item error code and message (empty on success)

Protocol Definition

The protocol buffer definition file (rbln_smd.proto) is typically installed in /opt/rebellions/etc/ or a similar system location. Client applications use this file to generate language-specific bindings for gRPC communication with the daemon.
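As an illustration of generating Python bindings, the sketch below drives the code generator from the grpcio-tools package. It assumes that package is installed and that the proto file sits in /opt/rebellions/etc/ (adjust the directory if your installation differs); the function names are hypothetical.

```python
from pathlib import Path

def build_protoc_args(proto_dir="/opt/rebellions/etc", out_dir="."):
    """Build the argument list for grpc_tools.protoc (grpcio-tools package)."""
    proto_dir = Path(proto_dir)
    return [
        "grpc_tools.protoc",               # argv[0] placeholder expected by protoc.main
        f"-I{proto_dir}",                  # import root for the proto file
        f"--python_out={out_dir}",         # message classes
        f"--grpc_python_out={out_dir}",    # service stubs
        str(proto_dir / "rbln_smd.proto"),
    ]

def generate_bindings(proto_dir="/opt/rebellions/etc", out_dir="."):
    """Run the generator; protoc.main returns 0 on success."""
    from grpc_tools import protoc  # imported lazily; requires grpcio-tools
    return protoc.main(build_protoc_args(proto_dir, out_dir))
```

Following the standard protoc naming for Python, a successful run produces rbln_smd_pb2.py (messages) and rbln_smd_pb2_grpc.py (service stubs) in out_dir.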

Deprecation

The same definition is also installed under the legacy name rbln_services.proto for backward compatibility. This alias is scheduled for removal in v4.0; update build scripts to reference rbln_smd.proto.


CLI (rbln-smdi)

The rbln-smdi command-line interface provides interactive access to RSMD functionality. The tool offers formatted table output for human-readable results, with optional JSON output for automation and scripting use cases. The CLI supports remote connections to daemon instances running on different hosts.

Common Options

Most subcommands accept:

  • --ip <address>: gRPC server IP (default: localhost)
  • --port <number>: gRPC server port (overrides ~/.rbln-smdi/config)
  • --jsons: Output in JSON format instead of tables

Global flags appear before the subcommand:

  • --tls <path>: Path to the client certificate for one-off mTLS invocations.
  • --uri <URI>: Daemon connection URI (e.g., unix:///var/run/rbln.sock). Overrides --ip and --port.
  • -V, --version: Print the CLI version.
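The precedence among --uri, --ip, --port, and the configured port can be summarized in a few lines. This is a sketch of the documented rules only; resolve_target is a hypothetical helper, and falling back to 50051 when no port is given anywhere is an assumption based on the daemon's default port.

```python
DEFAULT_PORT = 50051  # the daemon's default TCP port; client fallback is assumed

def resolve_target(uri=None, ip=None, port=None, config_port=None):
    """Return a gRPC target string following the documented precedence."""
    if uri:
        return uri  # --uri overrides --ip and --port
    host = ip or "localhost"  # documented default for --ip
    chosen_port = port or config_port or DEFAULT_PORT  # --port overrides the config
    return f"{host}:{chosen_port}"

print(resolve_target())                                       # localhost:50051
print(resolve_target(ip="192.168.1.100", config_port=6000))   # 192.168.1.100:6000
print(resolve_target(uri="unix:///var/run/rbln.sock", port=7000))  # unix:///var/run/rbln.sock
```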

status, topo, and journal accept additional filter options:

  • -d <SPEC>: Device number filter. N selects rblnN. Examples: 0, 0,2, 0-3. Default: all devices.
  • -g <SPEC>: RSD group filter. N selects rsdN. Ignored when -d is also specified.
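The filter syntax can be illustrated with a small parser. This sketch assumes that -g accepts the same list and range forms shown for -d and that forms may be mixed (for example 0,2-3); neither point is stated above. The function names are hypothetical.

```python
def parse_spec(spec):
    """Expand a filter spec such as "0", "0,2" or "0-3" into sorted indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in spec {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range {part!r}")
            indices.update(range(start, end + 1))  # ranges are inclusive
        else:
            indices.add(int(part))
    return sorted(indices)

def select_targets(device_spec=None, group_spec=None):
    """Apply the documented rule: -g is ignored when -d is also given."""
    if device_spec is not None:
        return [f"rbln{n}" for n in parse_spec(device_spec)]
    if group_spec is not None:
        return [f"rsd{n}" for n in parse_spec(group_spec)]
    return []  # an empty selection stands for "all devices" in this sketch

print(select_targets(device_spec="0-3"))  # ['rbln0', 'rbln1', 'rbln2', 'rbln3']
```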

Persistent Configuration

The file ~/.rbln-smdi/config accepts the following keys:

  • secure=true|false: Whether to use TLS (default false).
  • cert_path=<path>: TLS client certificate directory.
  • port=<number>: Default port used when --port is not specified.
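To make the key format concrete, the following sketch parses a config file body into typed values. It assumes one key=value pair per line and that blank lines and lines starting with # are ignored; the file's comment syntax is not documented here, and parse_client_config is a hypothetical helper.

```python
def parse_client_config(text):
    """Parse key=value lines into typed settings.

    Blank lines, '#' comments, and unknown keys are skipped (assumed behavior).
    """
    settings = {"secure": False, "cert_path": None, "port": None}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (item.strip() for item in line.split("=", 1))
        if key == "secure":
            settings["secure"] = value.lower() == "true"
        elif key == "cert_path":
            settings["cert_path"] = value
        elif key == "port":
            settings["port"] = int(value)
    return settings

example = "secure=true\ncert_path=/home/user/certs\nport=50052\n"
print(parse_client_config(example))
```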

Info

See Access Control and mTLS for the full behavior of the security-related options (--tls, secure, cert_path).

CLI Examples

rbln-smdi discovery

Summary

Lists all detected NPU devices with card name, PCI BDF, UUID, firmware versions, RSD group, and status.

Command

Command (local)
$ rbln-smdi discovery
Command (remote)
$ rbln-smdi discovery --ip 192.168.1.100 --port 50051
Command (JSON)
$ rbln-smdi discovery --jsons

Output (example)

Device list (example)
+-------------------------------------------------------------------------+
|                             Device Discovery                            |
+=====+======================================+============+======+========+
| NPU | Identifiers                          | FW Ver     | RSD  | Status |
+-----+--------------------------------------+------------+------+--------+
|   0 | rbln0                      RBLN-CA25 | CP:  1.2.3 | rsd0 | READY  |
|     | 0000:47:00.0                         | SMC: 4.5.6 |      |        |
|     | 00000000-0000-0000-0000-000000000001 |            |      |        |
+-----+--------------------------------------+------------+------+--------+

rbln-smdi status

Summary

Displays temperature, power, memory usage, NPU utilization, and device status. Narrow the scope with -d or -g.

Command

Command (all)
$ rbln-smdi status
Command (device filter)
$ rbln-smdi status -d 0,2
Command (JSON)
$ rbln-smdi status --jsons

Output (example)

Device status (example)
+---------------------------------------------------------------------------------------------------+
|                                           Device Status                                           |
+======+=====+======+========+=======+===========+======+==========+=============+=========+========+
| RSD  | NPU | Name | Device |  Temp | Power(mW) | Perf | DNC(MHz) |    Memory   | Util(%) | Status |
+------+-----+------+--------+-------+-----------+------+----------+-------------+---------+--------+
| rsd0 |   0 | CA25 | rbln0  | 45.23 |     100.5 |  0   |   1000   | 2.5/16.0 GB |   15.5  | READY  |
+------+-----+------+--------+-------+-----------+------+----------+-------------+---------+--------+

rbln-smdi topo

Summary

Displays NUMA node, CPU affinity, RSD group, and PCIe link information.

Command

Command (all)
$ rbln-smdi topo
Command (group filter)
$ rbln-smdi topo -g 0
Command (JSON)
$ rbln-smdi topo --jsons

Output (example)

Device topology (example)
+----------------------------------------------------------------------------------------------+
|                                       Device Topology                                        |
+===========+=====+========+==============+===========+==============+============+============+
| RSD Group | NPU | Device |   PCI BDF    | NUMA Node | CPU Affinity | PCIe Speed | PCIe Width |
+-----------+-----+--------+--------------+-----------+--------------+------------+------------+
| rsd0      |   0 | rbln0  | 0000:47:00.0 |     0     | 0-63         |  16 GT/s   |    x16     |
+-----------+-----+--------+--------------+-----------+--------------+------------+------------+

rbln-smdi journal

Summary

Shows kernel event history and coredump entries together. Use --type to limit the output to one of them.

Command

Command (both)
$ rbln-smdi journal
Command (events only)
$ rbln-smdi journal --type event
Command (coredumps only)
$ rbln-smdi journal --type coredump

Output (example)

Events / coredumps (example)
+-------------------------------------------------------------------------------------+
|                                        Events                                       |
+=====+==========+=============+=====================+=============+==========+=======+
| IDX | DEV NAME |     TYPE    |       UTC TIME      | KERNEL TIME |  DATA1   | DATA2 |
+-----+----------+-------------+---------------------+-------------+----------+-------+
|  0  |  rbln0   | NO_RESPONSE | 2026-01-15 10:30:45 |  12345.678  | CP_EVENT |  0x0  |
+-----+----------+-------------+---------------------+-------------+----------+-------+

+-------------------------------------------------------------------------------+
|                                   Coredumps                                   |
+=====+==========+=====================+========================================+
| IDX | DEV NAME |      TIMESTAMP      |                  PATH                  |
+-----+----------+---------------------+----------------------------------------+
|  0  |  rbln0   | 2026-01-15 10:25:12 | /var/lib/rebellions/coredump/rbln0/... |
+-----+----------+---------------------+----------------------------------------+

rbln-smdi group

Summary

Lists, creates, or destroys RSD groups. Use --attach with --create to specify group-local NPU IDs. --destroy all destroys every group.

Command

Command (list)
$ rbln-smdi group
Command (create)
$ rbln-smdi group --create 1 --attach 0,1
Command (destroy)
$ rbln-smdi group --destroy 1
Command (destroy all)
$ rbln-smdi group --destroy all

Output (example)

RSD groups (example)
+------------------------------------------------+
|                   RSD Groups                   |
+==========+============+=========+==============+
| Group ID | Group Name | NPU IDs | Device Count |
+----------+------------+---------+--------------+
|    1     |    rsd1    |   0,1   |      2       |
+----------+------------+---------+--------------+

rbln-smdi config

Summary

Gets or sets a daemon runtime configuration value. Changes take effect immediately without a daemon restart.

Command

Command (all)
$ rbln-smdi config
Command (get)
$ rbln-smdi config --get max_events_per_dev
Command (set)
$ rbln-smdi config --set max_events_per_dev --value 2000

Output (example)

All config values (example)
+----------------------------+
|           Config           |
+----------------------------+
| RSMD Ver: 3.2.0            |
| max_events_per_dev: 1000   |
| max_coredumps_per_dev: 200 |
| log_level: 6               |
+----------------------------+
Get a value (example)
+--------------------------+
|          Config          |
+--------------------------+
| max_events_per_dev: 1000 |
+--------------------------+
Set a value (example)
+--------------------------+
|     Config (updated)     |
+--------------------------+
| max_events_per_dev: 2000 |
+--------------------------+

rbln-smdi reset

Summary

Resets a specific device or all devices. The daemon must be running with --allow-reset.

Command

Command (single device)
$ rbln-smdi reset rbln0
Command (all devices)
$ rbln-smdi reset all
Command (remote daemon)
$ rbln-smdi reset rbln0 --ip 192.168.1.100 --port 50051

Output (example)

Reset result (example)
device rbln0 reset succeeded

On failure

Reset result (failure example)
device rbln0 reset failed

Access Control and mTLS

This section covers the configuration needed to operate RSMD with mTLS and access control, split into daemon-side and client-side settings. Both sides must be configured consistently for the secure channel to come up.

Check the current mode

sudo systemctl status rbln-smd
Match the daemon's command line against the "Operating Modes" table below. If no flags appear, the daemon is running the default "Monitoring only" mode.

Daemon (rbln-smd) Configuration

rbln-smd configures transport security and access control with three independent flags plus a certificate directory. Specify them in the systemd unit or as CLI options when starting the daemon.

Access Control Flags

Flag | Purpose
--mtls | Enable mTLS mode. Opens a TLS-only channel on RBLN_SMD_PORT. Without this flag, an insecure channel is opened.
--allow-reset | Allow device reset calls (resetDevice, resetAllDevice). Without this flag, reset calls are always denied.
--allowlist <path> | 2-tier allowlist file. Restricts which clients may access the daemon and which may call reset.

Operating Modes

Mode | Flags | Behavior
Monitoring only (default) | (none) | Insecure channel, reset denied
Insecure + reset | --allow-reset | Insecure channel, any client may reset
mTLS monitoring | --mtls | TLS-only channel, reset denied
mTLS + reset | --mtls --allow-reset | TLS-only channel, any mTLS client may reset
Restricted access | --allow-reset -a /path/to/allowlist | Allowlisted clients only; privileged tier may reset
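The table can also be read as a function of the three flags. The sketch below encodes that reading so a command line copied from systemctl status can be checked quickly; describe_mode and mode_from_argv are hypothetical helpers, and the wording of the returned strings is illustrative.

```python
def describe_mode(mtls=False, allow_reset=False, allowlist=None):
    """Summarize channel, access, and reset policy for a flag combination."""
    channel = "TLS-only" if mtls else "insecure"
    access = "allowlisted clients only" if allowlist else "any client that can connect"
    if not allow_reset:
        reset = "denied"  # without --allow-reset, reset calls are always denied
    elif allowlist:
        reset = "privileged tier only"
    else:
        reset = "any mTLS client" if mtls else "any client"
    return {"channel": channel, "access": access, "reset": reset}

def mode_from_argv(argv):
    """Derive the same summary from a daemon command line split into tokens."""
    allowlist = None
    for index, token in enumerate(argv):
        if token in ("-a", "--allowlist") and index + 1 < len(argv):
            allowlist = argv[index + 1]
    return describe_mode("--mtls" in argv, "--allow-reset" in argv, allowlist)

print(mode_from_argv(["rbln-smd", "--mtls", "--allow-reset"]))
```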

Warning

With --mtls alone, any client presenting a valid certificate signed by the trusted CA can access all APIs except reset. To restrict per-client access, combine --mtls with --allowlist.

Allowlist

The allowlist is an INI-style file with [basic] and [privileged] sections. When the specified file does not exist, the daemon creates a template with guidance comments.

# rbln-smd allowlist configuration
# Changes take effect on the next gRPC request (no restart needed).

[basic]
# Calls other than reset (getDeviceList, getVersion, getHWInfo, etc.)
192.168.1.0/24
::1
monitoring-service

[privileged]
# All operations including resetDevice / resetAllDevice
# [privileged] implies [basic] access.
10.0.0.1
admin-client

Supported entry types:

  • IPv4 address: 192.168.1.100
  • IPv4 CIDR: 192.168.10.0/24
  • IPv6 address: ::1
  • IPv6 CIDR: fe80::/10
  • CN string: client-service-name (evaluated only on mTLS channels)
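The file format and entry types above can be illustrated with a short parser built on the standard ipaddress module. This is a sketch, not the daemon's parser: it assumes that any entry that does not parse as an IP address or CIDR block is a CN string, and the function names are hypothetical.

```python
import ipaddress

def classify_entry(entry):
    """Return ("ip", network) for an IP or CIDR entry, else ("cn", entry)."""
    try:
        # Single addresses become /32 or /128 networks; strict=False tolerates host bits.
        return "ip", ipaddress.ip_network(entry, strict=False)
    except ValueError:
        return "cn", entry

def parse_allowlist(text):
    """Parse [basic] and [privileged] sections into IP networks and CN strings."""
    tiers = {"basic": {"ip": [], "cn": []}, "privileged": {"ip": [], "cn": []}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = tiers.get(line[1:-1].strip().lower())  # unknown sections are ignored
            continue
        if current is not None:
            kind, value = classify_entry(line)
            current[kind].append(value)
    return tiers

sample = "[basic]\n192.168.1.0/24\n::1\nmonitoring-service\n\n[privileged]\n10.0.0.1\nadmin-client\n"
print(parse_allowlist(sample)["privileged"])
```

A hand-written loop is used instead of configparser because entries are bare values, and an IPv6 entry such as ::1 contains the colon that configparser treats as a key/value delimiter by default.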

Channel rules:

  • Insecure channel: Only IP / CIDR entries are evaluated.
  • mTLS channel: Entries are evaluated in this order, and the first matching step decides the result:

    1. CN listed in [privileged] → privileged
    2. CN listed in [basic] → basic
    3. IP / CIDR matches [privileged] → privileged
    4. IP / CIDR matches [basic] → basic
    5. No match → denied

    When a client is registered under both CN and IP, the earlier step wins. For example, a client whose CN appears in [basic] while its IP appears in [privileged] resolves to basic.
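The evaluation order above maps directly onto a chain of checks. The following sketch assumes the allowlist has already been parsed into IP networks and CN strings, and that on the insecure channel the privileged tier is checked before the basic tier (the order for that channel is not stated above). resolve_tier and the data layout are hypothetical.

```python
import ipaddress

def ip_matches(client_ip, networks):
    """True when client_ip falls inside any network of the same IP version."""
    address = ipaddress.ip_address(client_ip)
    return any(address.version == net.version and address in net for net in networks)

def resolve_tier(allowlist, client_ip, client_cn=None, mtls=False):
    """Return "privileged", "basic", or "denied"; the first matching step wins."""
    privileged, basic = allowlist["privileged"], allowlist["basic"]
    if mtls and client_cn is not None:  # CN entries are evaluated only on mTLS channels
        if client_cn in privileged["cn"]:
            return "privileged"
        if client_cn in basic["cn"]:
            return "basic"
    if ip_matches(client_ip, privileged["ip"]):
        return "privileged"
    if ip_matches(client_ip, basic["ip"]):
        return "basic"
    return "denied"

allowlist = {
    "basic": {"ip": [ipaddress.ip_network("192.168.1.0/24")], "cn": ["monitoring-service"]},
    "privileged": {"ip": [ipaddress.ip_network("10.0.0.1/32")], "cn": ["admin-client"]},
}
# CN listed in [basic], IP listed in [privileged]: the CN step comes first.
print(resolve_tier(allowlist, "10.0.0.1", "monitoring-service", mtls=True))  # basic
```

The same client connecting over the insecure channel would resolve to privileged, because CN entries are skipped and only its IP is evaluated.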

The daemon checks the allowlist file's modification time (mtime) on each gRPC request, so changes take effect on the next request without a restart. Clients that do not match any entry in either section are denied all access, including monitoring APIs.
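Reloading on an mtime change is a common pattern and can be sketched as a small cache that re-reads the file only when its timestamp differs from the one last seen. This illustrates the general technique rather than the daemon's code; the class name ReloadingFile and its reloads counter are hypothetical.

```python
import os

class ReloadingFile:
    """Cache a file's text and re-read it only when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None  # timestamp of the version currently cached
        self._text = ""
        self.reloads = 0       # number of times the file was actually read

    def get(self):
        """Return the current file content, reloading if the file was modified."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path) as handle:
                self._text = handle.read()
            self._mtime_ns = mtime_ns
            self.reloads += 1
        return self._text
```

Each request costs one stat call, and the file is parsed again only after an edit, which is why no restart or explicit reload signal is needed.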

Server Certificates

With --mtls enabled, the daemon reads the following three files from the certificate directory:

Filename | Purpose
rsmd_ca.crt | CA certificate used to verify client certificates
rsmd_server.crt | Server certificate
rsmd_server.key | Server private key

The daemon resolves the directory in this order:

  1. The --cert <dir> CLI flag
  2. The RSMD_CERT_PATH environment variable
  3. The OS default path — /etc/ssl/certs/ on Debian / Ubuntu, /etc/pki/tls/certs/ on RHEL / CentOS / Fedora
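The lookup order can be expressed as a short fallback chain. In this sketch the os_family parameter is a hypothetical stand-in for however the daemon detects the distribution; the paths and file names are those listed above.

```python
import os

# OS default directories from the list above, keyed by an illustrative family name.
OS_DEFAULTS = {
    "debian": "/etc/ssl/certs/",      # Debian / Ubuntu
    "rhel": "/etc/pki/tls/certs/",    # RHEL / CentOS / Fedora
}

SERVER_FILES = ("rsmd_ca.crt", "rsmd_server.crt", "rsmd_server.key")

def resolve_server_cert_dir(cli_cert=None, env=None, os_family="debian"):
    """Return the certificate directory: --cert, then RSMD_CERT_PATH, then OS default."""
    env = os.environ if env is None else env
    if cli_cert:
        return cli_cert
    if env.get("RSMD_CERT_PATH"):
        return env["RSMD_CERT_PATH"]
    return OS_DEFAULTS[os_family]

def server_cert_files(cert_dir):
    """Map each expected file name to its full path inside cert_dir."""
    return {name: os.path.join(cert_dir, name) for name in SERVER_FILES}

cert_dir = resolve_server_cert_dir(env={"RSMD_CERT_PATH": "/srv/rsmd/certs"})
print(server_cert_files(cert_dir)["rsmd_server.key"])  # /srv/rsmd/certs/rsmd_server.key
```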

Restrict the private key (rsmd_server.key) so that only the daemon can read it, for example by making it owned by the daemon's user account with mode 0600. Symbolic links are rejected for safety (O_NOFOLLOW).

Client (rbln-smdi) Configuration

rbln-smdi decides whether to connect to the daemon over an mTLS channel and which certificate to present. When the daemon runs with --mtls, the client must also enable TLS.

TLS Activation

TLS channel use and the certificate location are resolved in this order:

  1. The --tls <path> CLI flag — enables TLS for the current invocation and connects with the certificate at the given path. Accepts either a .crt / .key file or a base path without the extension.
  2. The RSMD_IF_CERT_PATH environment variable — same base path format as above, applied per shell session.
  3. The secure=true and cert_path keys in ~/.rbln-smdi/config — a persistent setting that applies to every subsequent invocation.

When --tls or RSMD_IF_CERT_PATH is set and the files are not found at the resolved path, the client exits with an error. cert_path acts only as a fallback: if its files are missing, TLS is silently disabled. If none of the three sources applies, the client connects over the insecure channel.
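The difference between the two explicit sources (hard error) and the config fallback (silent downgrade) is easy to miss, so it is sketched below. How a path maps to certificate files is left to a caller-supplied exists predicate, since that detail is not fully specified here; resolve_client_tls is a hypothetical helper, and returning None when secure=true is set without a cert_path is an assumption.

```python
import os

def resolve_client_tls(exists, tls_flag=None, env=None, config=None):
    """Return the certificate path to use, or None for the insecure channel.

    exists(path) is a caller-supplied check that the certificate files are present.
    """
    env = os.environ if env is None else env
    config = config or {}
    explicit_sources = (
        ("--tls", tls_flag),
        ("RSMD_IF_CERT_PATH", env.get("RSMD_IF_CERT_PATH")),
    )
    for source, path in explicit_sources:
        if path:
            if not exists(path):
                # Explicit requests fail loudly instead of falling back.
                raise FileNotFoundError(f"{source}: no certificate files at {path}")
            return path
    cert_path = config.get("cert_path")
    if config.get("secure") and cert_path and exists(cert_path):
        return cert_path
    return None  # config fallback missing or incomplete: insecure channel
```

Because the config fallback downgrades silently, scripts that must never use the insecure channel should pass --tls or set RSMD_IF_CERT_PATH explicitly.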

Client Certificates

When the client uses the mTLS channel, it reads the following three files:

Filename | Purpose
rsmd_ca.crt | CA certificate used to verify the server certificate
rsmd_client.crt | Client certificate
rsmd_client.key | Client private key

Symbolic links are rejected (O_NOFOLLOW). When per-client access control is required, register the CN of the client certificate in the server-side allowlist.
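For clients written against the gRPC API directly, the same three files can be loaded into channel credentials. The sketch below assumes the grpcio package is installed, that all three files live in one directory, and that the platform provides O_NOFOLLOW (Linux does); the function names are hypothetical.

```python
import os

CLIENT_FILES = ("rsmd_ca.crt", "rsmd_client.crt", "rsmd_client.key")

def read_no_follow(path):
    """Read a file, raising OSError if the final path component is a symlink."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    with os.fdopen(fd, "rb") as handle:
        return handle.read()

def load_client_files(cert_dir):
    """Return the raw bytes of the CA certificate, client certificate, and key."""
    return {name: read_no_follow(os.path.join(cert_dir, name)) for name in CLIENT_FILES}

def open_mtls_channel(target, cert_dir):
    """Open a mutually authenticated gRPC channel to the daemon."""
    import grpc  # imported lazily; requires grpcio
    files = load_client_files(cert_dir)
    credentials = grpc.ssl_channel_credentials(
        root_certificates=files["rsmd_ca.crt"],     # verifies the server
        private_key=files["rsmd_client.key"],       # proves the client's identity
        certificate_chain=files["rsmd_client.crt"],
    )
    return grpc.secure_channel(target, credentials)
```

A call such as open_mtls_channel("192.168.1.100:50051", "/home/user/certs") returns a channel that can be passed to the stub class generated from rbln_smd.proto; the address and directory here are placeholders.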

Certificate Renewal

When overwriting certificates in place, the server side requires a daemon restart and incurs a brief downtime. The client side picks up the new certificate on the next call.

When moving to new paths, update RSMD_CERT_PATH or --cert on the server and restart the daemon. On the client, update RSMD_IF_CERT_PATH or the cert_path value in the config.

Renewing only the client-side certificate requires no daemon restart.


Telemetry Units

RSMD uses the following units for telemetry data:

  • Temperature: milli-degrees Celsius (divide by 1,000 to convert to °C)
  • Power: micro-watts (divide by 1,000,000 to convert to watts)
  • Memory: GB (no conversion needed)
  • Clock: MHz (no conversion needed)
  • Utilization: percentage (0-100)

NOTE: The CLI tool automatically converts these into human-readable formats.
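Clients that read the raw API values need to apply the conversions themselves. A minimal sketch follows; the function names are hypothetical and the sample readings are made up for illustration.

```python
def to_celsius(milli_celsius):
    """Convert a raw temperature reading (milli-degrees Celsius) to °C."""
    return milli_celsius / 1000

def to_watts(micro_watts):
    """Convert a raw power reading (micro-watts) to watts."""
    return micro_watts / 1_000_000

def format_hw(milli_celsius, micro_watts):
    """Render both readings with two decimal places."""
    return f"{to_celsius(milli_celsius):.2f} °C, {to_watts(micro_watts):.2f} W"

print(format_hw(45230, 25_500_000))  # 45.23 °C, 25.50 W
```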


Recommendations

  1. Regular monitoring: Run scheduled checks to track device health metrics over time.
  2. Event and coredump retention: Tune RSMD_MAX_EVENTS_PER_DEV and RSMD_MAX_COREDUMPS_PER_DEV to match available disk space.
  3. Network security: Prefer Unix domain sockets for local access; for remote access, combine --mtls with --allowlist. Restrict the TCP port further with firewall rules.
  4. Error handling: Check err_status fields in API responses before consuming results.
  5. Log review: Periodically review event logs to identify anomalies and recurring patterns.

References

  • gRPC Definition: The protocol buffer definition file (rbln_smd.proto) is typically located in /opt/rebellions/etc/ or a similar system configuration directory. Until v4.0, a copy is also installed under the name rbln_services.proto.
  • Systemd Service: The service unit file is installed at /etc/systemd/system/rbln-smd.service. A compatibility alias /etc/systemd/system/rbln_daemon.service runs a binary-identical copy of the same daemon (scheduled for removal in v4.0).