Hardware Diagnostics (rbln-vs)

rbln-vs is a hardware and software diagnostic tool that validates RBLN NPU health through a tiered test suite covering software integrity, hardware presence, PCIe bandwidth, and NPU memory.

Info

  • rbln-vs requires root privileges (sudo).
  • rbln-vs is supported on the following server models. It exits with an error on unsupported models.
    • AS -4125GS-TNRT2
    • AS -5126GS-TNRT2
  • The following system tools must be available: lspci, lsmod, lscpu, dmidecode, dmesg, fuser.
  • The following RBLN tools must be installed: rbln-smi, rblnBandwidthLatencyTest.
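The prerequisites above can be checked before a run. The following is a minimal sketch, assuming the tools are resolved through PATH; the helper name `missing_tools` is illustrative and not part of rbln-vs.

```python
import shutil

# Tool names copied from the prerequisites listed above.
SYSTEM_TOOLS = ["lspci", "lsmod", "lscpu", "dmidecode", "dmesg", "fuser"]
RBLN_TOOLS = ["rbln-smi", "rblnBandwidthLatencyTest"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(SYSTEM_TOOLS + RBLN_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```

Because rbln-vs runs under sudo, the PATH seen by root may differ from the PATH of the current user, so a tool found here may still be reported as missing.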

Quick Start

$ sudo rbln-vs -t all

Key Concepts and Terminology

Test levels

rbln-vs organizes tests into two levels:

  • L1 (basic checks): Validates the software stack and hardware presence without stressing the device.
  • L2 (stress tests): Exercises PCIe links, NPU memory integrity, and memory bandwidth stability.

Test targets

Individual targets can be selected with -t, --target:

Target | Level | Description
--- | --- | ---
software_integrity | L1 | Validates rbln-smi binary, kernel module, driver/FW/SMC versions, and /dev/rbln* access permissions.
hardware_presence | L1 | Checks PCI device recognition, device node access, occupancy, idle health (temperature/power), and dmesg error history.
pcie_bandwidth | L2 | Verifies PCIe link speed/width, host-to-device bandwidth, peer-to-peer bandwidth/latency, and PCIe remove/rescan.
npu_memory | L2 | Writes and reads multiple data patterns across device memory to verify integrity.
memory_bandwidth | L2 | Runs repeated bandwidth measurements and checks result stability.

Group targets run multiple tests at once:

Target | Equivalent tests
--- | ---
l1_test | software_integrity + hardware_presence
l2_test | All L1 tests + pcie_bandwidth + npu_memory + memory_bandwidth
all | Same as l2_test
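The group targets can be thought of as a lookup from a group name to individual tests. The sketch below models that mapping; the `GROUPS` dictionary and `expand_targets` helper are illustrative, and the order in which rbln-vs actually executes tests is not implied.

```python
L1_TESTS = ["software_integrity", "hardware_presence"]
L2_ONLY_TESTS = ["pcie_bandwidth", "npu_memory", "memory_bandwidth"]

# Group targets as described in the table above.
GROUPS = {
    "l1_test": L1_TESTS,
    "l2_test": L1_TESTS + L2_ONLY_TESTS,
    "all": L1_TESTS + L2_ONLY_TESTS,
}


def expand_targets(targets):
    """Expand group targets into individual tests, keeping order and dropping duplicates."""
    expanded = []
    for target in targets:
        for test in GROUPS.get(target, [target]):
            if test not in expanded:
                expanded.append(test)
    return expanded


print(expand_targets(["l1_test", "npu_memory"]))
# ['software_integrity', 'hardware_presence', 'npu_memory']
```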

Output directory

Each run creates a timestamped directory under the results path (default: results/):

File | Content
--- | ---
run.log | Timestamped execution log with per-step PASS/FAIL/SKIP counts.
system_info.txt | Server hardware, OS, kernel, driver, and per-NPU details.
test_report.txt | Formatted final report with per-step results, metrics, and pass rate.
test_logs/<name>.json | Machine-readable per-test result.
test_logs/<name>.txt | Human-readable per-test summary.
debug_logs.tar.gz | Encrypted debug log archive (firmware logs, SMC logs, kernel logs).
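A script that collects results may want to confirm that a run directory has the layout listed above. This is a rough sketch, assuming every listed top-level file is written on each run; `summarize_run` is a hypothetical helper name.

```python
from pathlib import Path

# Top-level files from the table above.
EXPECTED_FILES = ["run.log", "system_info.txt", "test_report.txt", "debug_logs.tar.gz"]


def summarize_run(run_dir):
    """Report which expected files are missing and which per-test JSON logs exist."""
    run_dir = Path(run_dir)
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    per_test_json = sorted(p.name for p in (run_dir / "test_logs").glob("*.json"))
    return {"missing": missing, "per_test_json": per_test_json}


print(summarize_run("results/20260223_155311"))
```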

Command Reference

General usage

$ sudo rbln-vs -t <target> [options]
$ sudo rbln-vs -t <target1> <target2> ... [options]
$ sudo rbln-vs -l

Tip

For the full, version-specific option reference, run rbln-vs --help.

Options

Option | Description
--- | ---
-h, --help | Show help message and exit.
--version | Show version information and exit.
-t, --target <TARGET ...> | One or more test targets: software_integrity, hardware_presence, pcie_bandwidth, npu_memory, memory_bandwidth, l1_test, l2_test, or all.
-l, --list | List available targets and exit.
-j, --json | Output results in JSON format.
-d, --npu <NPU> | Restrict L2 tests to a specific NPU index (e.g., 0).
-r, --results <PATH> | Base output directory (default: results/).

Note

The -d, --npu option only applies to L2 tests. It is ignored when running L1-only targets.
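When rbln-vs is driven from automation, the options above can be assembled into an argument list. The helper below is a sketch with a hypothetical name (`build_command`); it uses only the flags documented in this section and assumes they can be combined freely.

```python
import shlex


def build_command(targets, npu=None, results=None, json_output=False):
    """Build an rbln-vs argument list from the documented options."""
    cmd = ["sudo", "rbln-vs", "-t", *targets]
    if npu is not None:
        cmd += ["-d", str(npu)]  # only affects L2 tests
    if results is not None:
        cmd += ["-r", str(results)]
    if json_output:
        cmd.append("-j")
    return cmd


print(shlex.join(build_command(["pcie_bandwidth"], npu=0)))
# sudo rbln-vs -t pcie_bandwidth -d 0
```

The returned list can be passed to `subprocess.run` without going through a shell.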

CLI Examples

Summary

Lists available test targets on the current system.

Command
$ sudo rbln-vs -l

Output (example)

Targets list (example)
Available targets:
        software_integrity
        hardware_presence
        npu_memory
        memory_bandwidth
        pcie_bandwidth
        l1_test (runs up to L1: software_integrity, hardware_presence)
        l2_test (runs up to L2: L1 + pcie_bandwidth, npu_memory, memory_bandwidth)
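If a script needs the target names, the listing can be parsed line by line. This sketch assumes the layout shown in the example above (a heading line ending in a colon, then one target per line); the real output may differ between versions.

```python
# Copy of the example listing above.
LISTING = """Available targets:
        software_integrity
        hardware_presence
        npu_memory
        memory_bandwidth
        pcie_bandwidth
        l1_test (runs up to L1: software_integrity, hardware_presence)
        l2_test (runs up to L2: L1 + pcie_bandwidth, npu_memory, memory_bandwidth)
"""


def parse_targets(listing):
    """Return the first word of every non-heading line."""
    targets = []
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and the "Available targets:" heading
        targets.append(line.split()[0])
    return targets


print(parse_targets(LISTING))
```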

Summary

Runs the full L1 + L2 test suite on all detected NPUs.

Command
$ sudo rbln-vs -t all

Output (example)

Full test run (example)
+-------------------- System Information --------------------+
|                                                            |
|  OS             : ubuntu 22.04.5 LTS (Jammy Jellyfish)     |
|  Kernel         : 6.8.0-90-generic                         |
|  CPU Model      : AMD EPYC 9355 32-Core Processor          |
|  Manufacturer   : Supermicro                               |
|  Architecture   : x86_64                                   |
|                                                            |
+========================= Level 1 ==========================+
|                                                            |
+--- Software Integrity -------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  --------------------------------------------------------  |
|  Binary Integrity         : PASS                           |
|  Module Status            : PASS                           |
|  Version Consistency      : PASS                           |
|  Access Permission        : PASS                           |
|                                                            |
+--- Hardware Presence --------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  --------------------------------------------------------  |
|  Device Count             : 4                              |
|  Device Recognition       : PASS                           |
|  Device Occupancy         : PASS                           |
|  Device Health            : PASS                           |
|  Error History            : PASS                           |
|                                                            |
+========================= Level 2 ==========================+
|                                                            |
+--- NPU Memory ---------------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  Test Duration            : 25.81 seconds (0.43 minutes)   |
|  --------------------------------------------------------  |
|  Device Count             : 4                              |
|  Pattern Integrity        : PASS                           |
|                                                            |
+--- Memory Bandwidth ---------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  Test Duration            : 164.98 seconds (2.75 minutes)  |
|  --------------------------------------------------------  |
|  Device Count             : 4                              |
|  Data Integrity           : PASS                           |
|  Performance Stability    : PASS                           |
|                                                            |
+--- PCIe Bandwidth -----------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  Test Duration            : 23.38 seconds (0.39 minutes)   |
|  --------------------------------------------------------  |
|  Device Count             : 4                              |
|  Link & Signal Health     : PASS                           |
|  Host-Device Speed        : PASS                           |
|  P2P Connectivity         : PASS                           |
|  Transfer Latency         : PASS                           |
|  PCIe Remove/Rescan       : PASS                           |
|                                                            |
+============================================================+

Debug logs archived: results/20260223_155311/debug_logs.tar.gz

Results saved to: results/20260223_155311/

Summary

Runs basic software and hardware presence checks without stressing the device.

Command
$ sudo rbln-vs -t l1_test

Output (example)

L1 test run (example)
+-------------------- System Information --------------------+
|                                                            |
|  OS             : ubuntu 22.04.5 LTS (Jammy Jellyfish)     |
|  Kernel         : 6.8.0-90-generic                         |
|  CPU Model      : AMD EPYC 9355 32-Core Processor          |
|  Manufacturer   : Supermicro                               |
|  Architecture   : x86_64                                   |
|                                                            |
+========================= Level 1 ==========================+
|                                                            |
+--- Software Integrity -------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  --------------------------------------------------------  |
|  Binary Integrity         : PASS                           |
|  Module Status            : PASS                           |
|  Version Consistency      : PASS                           |
|  Access Permission        : PASS                           |
|                                                            |
+--- Hardware Presence --------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  --------------------------------------------------------  |
|  Device Count             : 4                              |
|  Device Recognition       : PASS                           |
|  Device Occupancy         : PASS                           |
|  Device Health            : PASS                           |
|  Error History            : PASS                           |
|                                                            |
+============================================================+

Debug logs archived: results/20260223_160012/debug_logs.tar.gz

Results saved to: results/20260223_160012/

Summary

Runs a specific L2 test restricted to a single NPU.

Command
$ sudo rbln-vs -t pcie_bandwidth -d 0

Output (example)

Single NPU test (example)
+-------------------- System Information --------------------+
|                                                            |
|  OS             : ubuntu 22.04.5 LTS (Jammy Jellyfish)     |
|  Kernel         : 6.8.0-90-generic                         |
|  CPU Model      : AMD EPYC 9355 32-Core Processor          |
|  Manufacturer   : Supermicro                               |
|  Architecture   : x86_64                                   |
|                                                            |
+--- PCIe Bandwidth -----------------------------------------+
|                                                            |
|  Result                   : PASS                           |
|  Test Duration            : 4.80 seconds (0.08 minutes)    |
|  --------------------------------------------------------  |
|  Device Count             : 1                              |
|  NPU IDs                  : [0]                            |
|  Link & Signal Health     : PASS                           |
|  Host-Device Speed        : PASS                           |
|  P2P Connectivity         : PASS                           |
|  Transfer Latency         : PASS                           |
|  PCIe Remove/Rescan       : PASS                           |
|                                                            |
+------------------------------------------------------------+

Debug logs archived: results/20260223_155844/debug_logs.tar.gz

Results saved to: results/20260223_155844/

Summary

Saves test results to a custom directory.

Command
$ sudo rbln-vs -t all -r ./my_results/

Results are saved under ./my_results/<YYYYMMDD_HHMMSS>/.
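Since run directories are named by timestamp, the most recent run can be located by name. A minimal sketch, assuming the `YYYYMMDD_HHMMSS` format shown above; `parse_run_timestamp` and `latest_run` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import Path


def parse_run_timestamp(name):
    """Parse a run directory name such as 20260223_155311, or return None."""
    try:
        return datetime.strptime(name, "%Y%m%d_%H%M%S")
    except ValueError:
        return None


def latest_run(results_dir):
    """Return the newest timestamped run directory, or None if there is none."""
    runs = [
        path
        for path in Path(results_dir).iterdir()
        if path.is_dir() and parse_run_timestamp(path.name) is not None
    ]
    # Fixed-width timestamps sort chronologically when compared as strings.
    return max(runs, key=lambda path: path.name, default=None)


print(parse_run_timestamp("20260223_155311"))  # 2026-02-23 15:53:11
```

`latest_run("./my_results")` raises `FileNotFoundError` if the results directory does not exist yet.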

Summary

Outputs test results in JSON format for programmatic consumption.

Command
$ sudo rbln-vs -t l1_test -j

Output (example)

JSON output (example)
{
    "version": "3.2.0",
    "system_information": {
        "os": "ubuntu 22.04.5 LTS (Jammy Jellyfish)",
        "kernel": "6.8.0-90-generic",
        "cpu_model": "AMD EPYC 9355 32-Core Processor",
        "manufacturer": "Supermicro",
        "architecture": "x86_64"
    },
    "level_1": {
        "software_integrity": {
            "result": "PASS",
            "binary_integrity": "PASS",
            "module_status": "PASS",
            "version_consistency": "PASS",
            "access_permission": "PASS"
        },
        "hardware_presence": {
            "result": "PASS",
            "device_count": 4,
            "device_recognition": "PASS",
            "device_occupancy": "PASS",
            "device_health": "PASS",
            "error_history": "PASS"
        }
    }
}
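The JSON output can be checked in a script. The sketch below walks a report shaped like the example above and collects failing fields. It assumes that failures are reported as the string "FAIL" and that every level uses a key starting with `level_`; neither is confirmed by the example, which shows only passing L1 results.

```python
import json

# Trimmed copy of the example output above.
SAMPLE = """
{
    "version": "3.2.0",
    "level_1": {
        "software_integrity": {"result": "PASS", "module_status": "PASS"},
        "hardware_presence": {"result": "PASS", "device_count": 4, "device_health": "PASS"}
    }
}
"""


def failed_checks(report):
    """Return dotted paths of every field whose value is "FAIL"."""
    failures = []
    for level, tests in report.items():
        if not level.startswith("level_"):
            continue  # skip "version", "system_information", and similar keys
        for test_name, fields in tests.items():
            for field, value in fields.items():
                if value == "FAIL":
                    failures.append(f"{level}.{test_name}.{field}")
    return failures


report = json.loads(SAMPLE)
print(failed_checks(report))  # []
```

A wrapper script could exit with a non-zero status whenever the returned list is non-empty.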

Test Details

L1: Software integrity

Validates that the RBLN software stack is correctly installed, including rbln-smi binary, kernel module, driver/firmware/SMC versions, and device node permissions.

L1: Hardware presence

Validates that NPU hardware is present and healthy at idle, including PCI device recognition, device node access, temperature/power readings, and dmesg error history.

Note

Idle health check thresholds (temperature/power) are configurable via config/test_parameter.yml. Set the RBLN_VS_TEST_PARAMETER_CONFIG environment variable to use a custom configuration file instead of the default path. If the configuration file is missing or fails to parse, built-in defaults apply.

L2: PCIe bandwidth

Exercises the PCIe link per device: link speed/width, host-to-device bandwidth, peer-to-peer bandwidth and latency, and PCIe remove/rescan recovery.

L2: NPU memory

Writes and reads device memory with multiple data patterns to verify data integrity.
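The general idea of a pattern test can be illustrated on an ordinary host buffer. This is only a sketch of the technique: the patterns below are common choices assumed for illustration, and the patterns, sizes, and access method used by rbln-vs on device memory are not documented here.

```python
# Assumed patterns: all zeros, all ones, and two alternating-bit values.
PATTERNS = [0x00, 0xFF, 0xAA, 0x55]


def mismatched_offsets(pattern, data):
    """Return the offsets whose byte differs from the expected pattern."""
    return [offset for offset, value in enumerate(data) if value != pattern]


def verify_patterns(buffer, patterns=PATTERNS):
    """Fill the buffer with each pattern, read it back, and collect mismatches."""
    failures = {}
    for pattern in patterns:
        buffer[:] = bytes([pattern]) * len(buffer)  # write phase
        bad = mismatched_offsets(pattern, buffer)   # read-back phase
        if bad:
            failures[pattern] = bad
    return failures


print(verify_patterns(bytearray(1024)))  # {}
```

Alternating-bit patterns such as 0xAA and 0x55 are commonly used because together they drive every bit to both 0 and 1.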

L2: Memory bandwidth

Runs repeated bandwidth measurements and verifies that all results are stable and pass data integrity checks.
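One common way to judge whether repeated measurements are stable is the coefficient of variation (standard deviation divided by mean). The sketch below uses that metric with an assumed 5% limit; the criterion and thresholds rbln-vs actually applies are not stated in this document.

```python
from statistics import mean, pstdev


def is_stable(samples, max_cv=0.05):
    """Return True if the coefficient of variation is at most max_cv.

    samples: repeated bandwidth measurements in any consistent unit.
    max_cv:  assumed tolerance; not an rbln-vs parameter.
    """
    average = mean(samples)
    if average <= 0:
        return False
    return pstdev(samples) / average <= max_cv


print(is_stable([10.0, 10.1, 9.9]))   # True
print(is_stable([10.0, 5.0, 10.0]))   # False
```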

Troubleshooting

Permission denied

rbln-vs requires root privileges. If it is run without them, it prints the following error; rerun the command with sudo.

Error: This script requires sudo privileges. Please run with sudo.

Unsupported server model

The tool exits with an error if the current server model is not in the supported list:

Error: Unsupported server model: '<model>'
rbln-vs is only supported on: AS -4125GS-TNRT2, AS -5126GS-TNRT2
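The server model can be checked ahead of time. The sketch below compares a product name against the supported list. Reading the name with `dmidecode -s system-product-name` is an assumption about where the model string comes from; rbln-vs may use a different field.

```python
import subprocess

# Supported models as listed in this document.
SUPPORTED_MODELS = ("AS -4125GS-TNRT2", "AS -5126GS-TNRT2")


def is_supported_model(product_name):
    """Return True if the product name matches a supported server model."""
    return product_name.strip() in SUPPORTED_MODELS


def read_product_name():
    """Read the SMBIOS system product name (requires root)."""
    result = subprocess.run(
        ["dmidecode", "-s", "system-product-name"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(is_supported_model("AS -4125GS-TNRT2\n"))  # True
```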

Missing system tools

The tool exits with an error if required system tools are not installed:

Error: Required system tools not found: <tools>
Please install the missing tools and try again.

L2 tests fail with "BUSY"

Another process is using the NPU. rbln-vs retries BUSY errors automatically with increasing backoff; the maximum number of retries varies by test, from 2 to 10. If the error persists after the retries, stop all workloads that are using the NPU and rerun the test.
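Retrying with increasing backoff generally looks like the sketch below. It is an illustration of the pattern, not the rbln-vs implementation: `DeviceBusyError`, `retry_on_busy`, the exponential schedule, and the default values are all assumptions.

```python
import time


class DeviceBusyError(Exception):
    """Hypothetical error raised when the NPU reports BUSY."""


def retry_on_busy(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on BUSY with a delay that doubles each time."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except DeviceBusyError:
            if attempt == max_retries:
                raise  # retries exhausted: manual intervention is needed
            sleep(base_delay * 2 ** attempt)


# Demo: an operation that is busy twice, then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeviceBusyError("BUSY")
    return "ok"


delays = []
print(retry_on_busy(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```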

PCIe remove/rescan does not recover

  • Verify that no process holds a reference to the device nodes during the test.
  • Check dmesg for PCIe AER errors that may indicate a hardware issue.
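Kernel log lines that mention AER can be pulled out of captured dmesg output with a simple filter. The sample log lines below are made up for illustration and are not real output from an RBLN system.

```python
def aer_lines(dmesg_text):
    """Return the kernel log lines that mention PCIe Advanced Error Reporting."""
    return [line for line in dmesg_text.splitlines() if "AER" in line]


# Made-up sample in the general style of kernel messages.
SAMPLE_LOG = """\
[  12.345678] pcieport 0000:00:01.0: AER: Corrected error received: 0000:01:00.0
[  12.345700] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

for line in aer_lines(SAMPLE_LOG):
    print(line)
```

The text to filter can be captured by redirecting the output of `sudo dmesg` to a file.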
