Hardware Diagnostics (rbln-vs)¶
rbln-vs is a hardware and software diagnostic tool that validates RBLN NPU health through a tiered test suite covering software integrity, hardware presence, PCIe bandwidth, and NPU memory.
Info
- rbln-vs requires root privileges (sudo).
- rbln-vs is supported on the following server models. It exits with an error on unsupported models.
  - AS -4125GS-TNRT2
  - AS -5126GS-TNRT2
- The following system tools must be available: lspci, lsmod, lscpu, dmidecode, dmesg, fuser.
- The following RBLN tools must be installed: rbln-smi, rblnBandwidthLatencyTest.
Quick Start¶
Key Concepts and Terminology¶
Test levels¶
rbln-vs organizes tests into two levels:
- L1 (basic checks): Validates the software stack and hardware presence without stressing the device.
- L2 (stress tests): Exercises PCIe links, NPU memory integrity, and memory bandwidth stability.
Test targets¶
Individual targets can be selected with -t, --target:
| Target | Level | Description |
|---|---|---|
| software_integrity | L1 | Validates rbln-smi binary, kernel module, driver/FW/SMC versions, and /dev/rbln* access permissions. |
| hardware_presence | L1 | Checks PCI device recognition, device node access, occupancy, idle health (temperature/power), and dmesg error history. |
| pcie_bandwidth | L2 | Verifies PCIe link speed/width, host-to-device bandwidth, peer-to-peer bandwidth/latency, and PCIe remove/rescan. |
| npu_memory | L2 | Writes and reads multiple data patterns across device memory to verify integrity. |
| memory_bandwidth | L2 | Runs repeated bandwidth measurements and checks result stability. |
Group targets run multiple tests at once:
| Target | Equivalent tests |
|---|---|
| l1_test | software_integrity + hardware_presence |
| l2_test | All L1 tests + pcie_bandwidth + npu_memory + memory_bandwidth |
| all | Same as l2_test |
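The group-to-test mapping above can be written down as a small lookup. The following is an illustrative sketch only, not rbln-vs's own implementation; the names `GROUPS` and `expand_targets` are assumptions introduced here.

```python
# Mapping copied from the target tables above. GROUPS and expand_targets
# are illustrative names, not part of rbln-vs.
L1_TESTS = ["software_integrity", "hardware_presence"]
L2_TESTS = ["pcie_bandwidth", "npu_memory", "memory_bandwidth"]

GROUPS = {
    "l1_test": L1_TESTS,
    "l2_test": L1_TESTS + L2_TESTS,
    "all": L1_TESTS + L2_TESTS,
}


def expand_targets(targets):
    """Expand group targets into individual tests, keeping order and dropping duplicates."""
    known = set(L1_TESTS + L2_TESTS)
    expanded = []
    for target in targets:
        if target not in GROUPS and target not in known:
            raise ValueError(f"unknown target: {target}")
        for test in GROUPS.get(target, [target]):
            if test not in expanded:
                expanded.append(test)
    return expanded


print(expand_targets(["l1_test", "npu_memory"]))
# ['software_integrity', 'hardware_presence', 'npu_memory']
```

Because l2_test already includes every L1 test, requesting l1_test together with l2_test expands to the same five tests as all.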
Output directory¶
Each run creates a timestamped directory under the results path (default: results/):
| File | Content |
|---|---|
| run.log | Timestamped execution log with per-step PASS/FAIL/SKIP counts. |
| system_info.txt | Server hardware, OS, kernel, driver, and per-NPU details. |
| test_report.txt | Formatted final report with per-step results, metrics, and pass rate. |
| test_logs/<name>.json | Machine-readable per-test result. |
| test_logs/<name>.txt | Human-readable per-test summary. |
| debug_logs.tar.gz | Encrypted debug log archive (firmware logs, SMC logs, kernel logs). |
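For scripted post-processing, the layout above can be walked with a few lines of Python. This is a minimal sketch that assumes run directories are named `YYYYMMDD_HHMMSS`, as in the example paths in this document; the schema of the per-test JSON files is not documented here, so the sketch only loads them. `latest_run_dir` and `load_test_logs` are hypothetical helper names.

```python
import json
from datetime import datetime
from pathlib import Path


def latest_run_dir(results_root="results"):
    """Return the newest run directory named YYYYMMDD_HHMMSS, or None if there is none."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = []
    for entry in root.iterdir():
        if not entry.is_dir():
            continue
        try:
            stamp = datetime.strptime(entry.name, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # not a run directory
        runs.append((stamp, entry))
    if not runs:
        return None
    return max(runs, key=lambda item: item[0])[1]


def load_test_logs(run_dir):
    """Load every test_logs/<name>.json file into a dict keyed by <name>."""
    logs = {}
    for path in sorted(Path(run_dir, "test_logs").glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            logs[path.stem] = json.load(handle)
    return logs


run = latest_run_dir("results")
if run is not None:
    print(run, sorted(load_test_logs(run)))
```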
Command Reference¶
General usage¶
$ sudo rbln-vs -t <target> [options]
$ sudo rbln-vs -t <target1> <target2> ... [options]
$ sudo rbln-vs -l
Tip
For the full, version-specific option reference, run rbln-vs --help.
Options¶
| Option | Description |
|---|---|
| -h, --help | Show help message and exit. |
| --version | Show version information and exit. |
| -t, --target <TARGET ...> | One or more test targets: software_integrity, hardware_presence, pcie_bandwidth, npu_memory, memory_bandwidth, l1_test, l2_test, or all. |
| -l, --list | List available targets and exit. |
| -j, --json | Output results in JSON format. |
| -d, --npu <NPU> | Restrict L2 tests to a specific NPU index (e.g., 0). |
| -r, --results <PATH> | Base output directory (default: results/). |
Note
The -d, --npu option only applies to L2 tests. It is ignored when running L1-only targets.
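When rbln-vs is driven from a script, the options above can be assembled into an argument list. The helper below is a hedged sketch: `build_command` and `L1_ONLY` are hypothetical names, and only flags listed in the options table are used. It omits `-d` when every target is L1-only, since the tool ignores the option in that case.

```python
# Targets that contain only L1 tests, per the target tables in this document.
L1_ONLY = {"software_integrity", "hardware_presence", "l1_test"}


def build_command(targets, npu=None, results=None, as_json=False):
    """Assemble an rbln-vs argument list from the documented options."""
    if not targets:
        raise ValueError("at least one target is required")
    cmd = ["sudo", "rbln-vs", "-t", *targets]
    # -d/--npu only affects L2 tests, so leave it out for L1-only runs.
    if npu is not None and not set(targets) <= L1_ONLY:
        cmd += ["-d", str(npu)]
    if results is not None:
        cmd += ["-r", str(results)]
    if as_json:
        cmd.append("-j")
    return cmd


print(" ".join(build_command(["pcie_bandwidth"], npu=0)))
# sudo rbln-vs -t pcie_bandwidth -d 0
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.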
CLI Examples¶
Summary
Lists available test targets on the current system.
Command
Output (example)
Summary
Runs the full L1 + L2 test suite on all detected NPUs.
Command
Output (example)
+-------------------- System Information --------------------+
| |
| OS : ubuntu 22.04.5 LTS (Jammy Jellyfish) |
| Kernel : 6.8.0-90-generic |
| CPU Model : AMD EPYC 9355 32-Core Processor |
| Manufacturer : Supermicro |
| Architecture : x86_64 |
| |
+========================= Level 1 ==========================+
| |
+--- Software Integrity -------------------------------------+
| |
| Result : PASS |
| -------------------------------------------------------- |
| Binary Integrity : PASS |
| Module Status : PASS |
| Version Consistency : PASS |
| Access Permission : PASS |
| |
+--- Hardware Presence --------------------------------------+
| |
| Result : PASS |
| -------------------------------------------------------- |
| Device Count : 4 |
| Device Recognition : PASS |
| Device Occupancy : PASS |
| Device Health : PASS |
| Error History : PASS |
| |
+========================= Level 2 ==========================+
| |
+--- NPU Memory ---------------------------------------------+
| |
| Result : PASS |
| Test Duration : 25.81 seconds (0.43 minutes) |
| -------------------------------------------------------- |
| Device Count : 4 |
| Pattern Integrity : PASS |
| |
+--- Memory Bandwidth ---------------------------------------+
| |
| Result : PASS |
| Test Duration : 164.98 seconds (2.75 minutes) |
| -------------------------------------------------------- |
| Device Count : 4 |
| Data Integrity : PASS |
| Performance Stability : PASS |
| |
+--- PCIe Bandwidth -----------------------------------------+
| |
| Result : PASS |
| Test Duration : 23.38 seconds (0.39 minutes) |
| -------------------------------------------------------- |
| Device Count : 4 |
| Link & Signal Health : PASS |
| Host-Device Speed : PASS |
| P2P Connectivity : PASS |
| Transfer Latency : PASS |
| PCIe Remove/Rescan : PASS |
| |
+============================================================+
Debug logs archived: results/20260223_155311/debug_logs.tar.gz
Results saved to: results/20260223_155311/
Summary
Runs basic software and hardware presence checks without stressing the device.
Command
Output (example)
+-------------------- System Information --------------------+
| |
| OS : ubuntu 22.04.5 LTS (Jammy Jellyfish) |
| Kernel : 6.8.0-90-generic |
| CPU Model : AMD EPYC 9355 32-Core Processor |
| Manufacturer : Supermicro |
| Architecture : x86_64 |
| |
+========================= Level 1 ==========================+
| |
+--- Software Integrity -------------------------------------+
| |
| Result : PASS |
| -------------------------------------------------------- |
| Binary Integrity : PASS |
| Module Status : PASS |
| Version Consistency : PASS |
| Access Permission : PASS |
| |
+--- Hardware Presence --------------------------------------+
| |
| Result : PASS |
| -------------------------------------------------------- |
| Device Count : 4 |
| Device Recognition : PASS |
| Device Occupancy : PASS |
| Device Health : PASS |
| Error History : PASS |
| |
+============================================================+
Debug logs archived: results/20260223_160012/debug_logs.tar.gz
Results saved to: results/20260223_160012/
Summary
Runs a specific L2 test restricted to a single NPU.
Command
Output (example)
+-------------------- System Information --------------------+
| |
| OS : ubuntu 22.04.5 LTS (Jammy Jellyfish) |
| Kernel : 6.8.0-90-generic |
| CPU Model : AMD EPYC 9355 32-Core Processor |
| Manufacturer : Supermicro |
| Architecture : x86_64 |
| |
+--- PCIe Bandwidth -----------------------------------------+
| |
| Result : PASS |
| Test Duration : 4.80 seconds (0.08 minutes) |
| -------------------------------------------------------- |
| Device Count : 1 |
| NPU IDs : [0] |
| Link & Signal Health : PASS |
| Host-Device Speed : PASS |
| P2P Connectivity : PASS |
| Transfer Latency : PASS |
| PCIe Remove/Rescan : PASS |
| |
+------------------------------------------------------------+
Debug logs archived: results/20260223_155844/debug_logs.tar.gz
Results saved to: results/20260223_155844/
Summary
Saves test results to a custom directory.
Command
Results are saved under ./my_results/<YYYYMMDD_HHMMSS>/.
Summary
Outputs test results in JSON format for programmatic consumption.
Command
Output (example)
{
  "version": "3.2.0",
  "system_information": {
    "os": "ubuntu 22.04.5 LTS (Jammy Jellyfish)",
    "kernel": "6.8.0-90-generic",
    "cpu_model": "AMD EPYC 9355 32-Core Processor",
    "manufacturer": "Supermicro",
    "architecture": "x86_64"
  },
  "level_1": {
    "software_integrity": {
      "result": "PASS",
      "binary_integrity": "PASS",
      "module_status": "PASS",
      "version_consistency": "PASS",
      "access_permission": "PASS"
    },
    "hardware_presence": {
      "result": "PASS",
      "device_count": 4,
      "device_recognition": "PASS",
      "device_occupancy": "PASS",
      "device_health": "PASS",
      "error_history": "PASS"
    }
  }
}
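A report shaped like the example above can be scanned for failing checks with a short script. This sketch assumes that failing checks are reported with the string "FAIL" and that any Level 2 section follows the same `level_N` / test / check nesting as `level_1`; neither is confirmed by the example, which shows only passing L1 results. `failed_checks` is a hypothetical helper name.

```python
import json


def failed_checks(report):
    """Return (level, test, check) for every check whose value is "FAIL"."""
    failures = []
    for level, tests in report.items():
        # Skip top-level metadata such as "version" and "system_information".
        if not level.startswith("level_") or not isinstance(tests, dict):
            continue
        for test, checks in tests.items():
            if not isinstance(checks, dict):
                continue
            for check, value in checks.items():
                if value == "FAIL":
                    failures.append((level, test, check))
    return failures


# Hand-written sample for illustration; not real tool output.
sample = json.loads("""
{
  "version": "3.2.0",
  "level_1": {
    "hardware_presence": {
      "result": "FAIL",
      "device_count": 4,
      "device_health": "FAIL",
      "error_history": "PASS"
    }
  }
}
""")

for level, test, check in failed_checks(sample):
    print(f"{level}.{test}.{check} failed")
```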
Test Details¶
L1: Software integrity¶
Validates that the RBLN software stack is correctly installed, including rbln-smi binary, kernel module, driver/firmware/SMC versions, and device node permissions.
L1: Hardware presence¶
Validates that NPU hardware is present and healthy at idle, including PCI device recognition, device node access, temperature/power readings, and dmesg error history.
Note
Idle health check thresholds (temperature/power) are configurable via config/test_parameter.yml. Set the RBLN_VS_TEST_PARAMETER_CONFIG environment variable to use a custom configuration file instead of the default path. If the configuration file is missing or fails to parse, built-in defaults apply.
L2: PCIe bandwidth¶
Exercises the PCIe link per device: link speed/width, host-to-device bandwidth, peer-to-peer bandwidth and latency, and PCIe remove/rescan recovery.
L2: NPU memory¶
Writes and reads device memory with multiple data patterns to verify data integrity.
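The idea behind a pattern test can be illustrated on an ordinary host buffer. The sketch below is not how rbln-vs accesses NPU memory, and the pattern set is an assumption (the patterns the tool actually uses are not documented here). A `bytearray` stands in for device memory, and `StuckBit` is a hypothetical fault model used only to show a detected mismatch.

```python
# Hypothetical pattern set: all zeros, all ones, and two alternating-bit patterns.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def check_patterns(memory):
    """Fill a writable buffer with each pattern, read it back, and
    return mismatches as (pattern, offset, value_read)."""
    mismatches = []
    size = len(memory)
    for pattern in PATTERNS:
        memory[:] = bytes([pattern]) * size       # write phase
        for offset, value in enumerate(memory):   # read-back phase
            if value != pattern:
                mismatches.append((pattern, offset, value))
    return mismatches


class StuckBit(bytearray):
    """Simulated faulty memory: bit 0 of byte 3 always reads back as 1."""

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        super().__setitem__(3, self[3] | 0x01)


print(check_patterns(bytearray(16)))  # healthy buffer
print(check_patterns(StuckBit(16)))   # faulty buffer
```

Using complementary patterns matters: a bit stuck at 1 is invisible to 0xFF and 0x55 here, and is caught only by the patterns that expect a 0 in that position.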
L2: Memory bandwidth¶
Runs repeated bandwidth measurements and verifies that all results are stable and pass data integrity checks.
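One common way to express "stable" for repeated measurements is a bound on the coefficient of variation (standard deviation divided by mean). The criterion and threshold rbln-vs actually applies are not documented here, so the 5% limit and the function name `is_stable` below are assumptions for illustration only.

```python
from statistics import mean, pstdev


def is_stable(samples, max_cv=0.05):
    """Return True when stddev / mean of the samples is at most max_cv."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    average = mean(samples)
    if average <= 0:
        return False
    return pstdev(samples) / average <= max_cv


# Made-up bandwidth readings, in arbitrary units.
print(is_stable([100.0, 101.0, 99.5]))   # True
print(is_stable([100.0, 60.0, 100.0]))   # False
```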
Troubleshooting¶
Permission denied¶
rbln-vs requires root privileges. Rerun the command with sudo.
Unsupported server model¶
The tool exits with an error if the current server model is not in the supported list:
Error: Unsupported server model: '<model>'
rbln-vs is only supported on: AS -4125GS-TNRT2, AS -5126GS-TNRT2
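To check a host before deploying the tool, the model string can be compared against the supported list. This sketch assumes the model is the DMI system product name as reported by `dmidecode -s system-product-name`; how rbln-vs itself determines the model is not stated in this document. `is_supported` and `read_server_model` are hypothetical helper names.

```python
import subprocess

# Supported models as listed in this document.
SUPPORTED_MODELS = ("AS -4125GS-TNRT2", "AS -5126GS-TNRT2")


def is_supported(model):
    """Return True if the (whitespace-trimmed) model string is in the supported list."""
    return model.strip() in SUPPORTED_MODELS


def read_server_model():
    """Read the DMI system product name. Requires root and dmidecode.

    Assumption: this is the same string rbln-vs compares against.
    """
    result = subprocess.run(
        ["dmidecode", "-s", "system-product-name"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


print(is_supported("AS -4125GS-TNRT2"))  # True
```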
Missing system tools¶
The tool exits with an error if any required system tool (lspci, lsmod, lscpu, dmidecode, dmesg, fuser) is not installed. Install the missing tools and rerun the command.
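A quick pre-flight check for the prerequisites can be sketched with `shutil.which`. The tool names come from the prerequisites listed in this document; `missing_tools` is a hypothetical helper. Note that the lookup uses the current user's PATH, which may differ from the PATH that sudo provides to rbln-vs.

```python
import shutil

SYSTEM_TOOLS = ("lspci", "lsmod", "lscpu", "dmidecode", "dmesg", "fuser")
RBLN_TOOLS = ("rbln-smi", "rblnBandwidthLatencyTest")


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(SYSTEM_TOOLS + RBLN_TOOLS)
print("missing:", ", ".join(missing) if missing else "none")
```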
L2 tests fail with "BUSY"¶
Another process is using the NPU. rbln-vs retries BUSY errors automatically with increasing backoff; the maximum number of retries varies by test (between 2 and 10). If the error persists after the retries are exhausted, stop all workloads on the NPU and rerun the test.
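Retrying with increasing backoff follows a familiar pattern, sketched below. The doubling schedule, the default values, and the `DeviceBusy` exception are assumptions made for illustration; the actual delays and retry limits used by rbln-vs are not documented here.

```python
import time


class DeviceBusy(Exception):
    """Hypothetical stand-in for a BUSY error reported by a test step."""


def run_with_retry(test, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call test(); on DeviceBusy, wait base_delay * 2**attempt and try again.

    Re-raises DeviceBusy once max_retries retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return test()
        except DeviceBusy:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Example: a step that is busy twice and then succeeds.
calls = {"count": 0}


def flaky_step():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise DeviceBusy()
    return "PASS"


print(run_with_retry(flaky_step, sleep=lambda seconds: None))  # PASS
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.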
PCIe remove/rescan does not recover¶
- Verify that no process holds a reference to the device nodes during the test.
- Check dmesg for PCIe AER errors that may indicate a hardware issue.
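Scanning the kernel log for AER messages can be scripted as a simple filter. This sketch assumes AER reports contain the literal text "AER", which is how the Linux kernel typically tags them; the sample lines are made up for illustration, and `aer_lines` is a hypothetical helper name.

```python
def aer_lines(dmesg_text):
    """Return the kernel log lines that mention PCIe AER."""
    return [line for line in dmesg_text.splitlines() if "AER" in line]


# Made-up sample log for illustration; real input would come from `sudo dmesg`.
sample_log = """\
[   10.100000] pci 0000:01:00.0: enabling device
[   12.345678] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[   12.345700] usb 1-1: new high-speed USB device
"""

for line in aer_lines(sample_log):
    print(line)
```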
See also¶
- rbln-smi: NPU status monitoring and topology inspection
- rblnBandwidthLatencyTest: host-to-NPU and NPU-to-NPU benchmark
- rbln-bios: system configuration validation
- rbln-flash: firmware update tool