Bandwidth Benchmark (rblnBandwidthLatencyTest)

rblnBandwidthLatencyTest measures transfer performance between the host and an NPU, and between NPUs, and reports:

  • Bandwidth results (GB/s)
  • Latency results (us)
  • NPU connectivity (distance matrix) and topology context

Quick Start

Default run (recommended)

$ rblnBandwidthLatencyTest

Environment and topology

$ rblnBandwidthLatencyTest --test_env_info

Key Concepts and Terminology

Transfer directions

  • H2D: host -> NPU
  • D2H: NPU -> host
  • D2D: NPU -> NPU (peer-to-peer)

Use --only_d2d when you want to focus on NPU-to-NPU transfers only.
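
For example, a default sweep restricted to D2D transfers (the output depends on your topology):

$ rblnBandwidthLatencyTest --only_d2d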

Measurement configuration

  • --test_size: payload size (default: 512M).
  • --iteration_cnt: measurement iterations.
  • --warmup: run one warmup iteration before measurement.
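
As a sketch, a longer, steadier measurement might combine all three options (the values are illustrative, not recommendations):

$ rblnBandwidthLatencyTest --test_size 1G --iteration_cnt 10 --warmup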

Target selection

  • Identifiers
    • NPU ID: numeric index used by --npu_id_list (for example: 0, 1).
    • Device label: runtime label shown in output (for example: rbln0, rbln1).
  • Selection rules
    • --npu_id_list: comma-separated list of NPU IDs (for example: 0,1,2).
    • For test cases 3-6, the set of NPUs involved depends on whether --npu_id_list is provided.
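
For example, to restrict a run to the first three NPUs (the IDs are illustrative; use the IDs enumerated on your system):

$ rblnBandwidthLatencyTest --npu_id_list 0,1,2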

Test cases

Run --test_list to print the available test cases. A test case's ID is the 0-based index of its entry in that list.

Based on the --test_list output shown later on this page:

ID (--tc)  --test_list entry                  Purpose (high level)
0          RBLN_PERF_TEST_ALL                 Run the default benchmark set (combined bandwidth + latency sweep).
1          RBLN_PERF_BANDWIDTH_TEST_ALL       Bandwidth-only sweep.
2          RBLN_PERF_LATENCY_TEST_ALL         Latency-only sweep.
3          RBLN_PERF_BANDWIDTH_TEST           Bandwidth test for a specific pair (or “first device vs others” if --npu_id_list is not provided).
4          RBLN_PERF_LATENCY_TEST             Latency test for a specific pair (or “first device vs others” if --npu_id_list is not provided).
5          RBLN_PERF_BIDIR_BANDWIDTH_TEST     Bidirectional bandwidth test across devices.
6          RBLN_PERF_PARALLEL_BANDWIDTH_TEST  Parallel bandwidth test across devices.

You can run a specific test case by specifying its ID with the --tc option.
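
For example, assuming the ordering shown in the table above, a bandwidth-only sweep would be:

$ rblnBandwidthLatencyTest --tc 1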

Note

If your environment prints a different --test_list order, treat the table above as a guide and rely on the --test_list output as the source of truth.
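
If you want to derive the IDs mechanically, you can number the entries yourself. This sketch assumes the output format shown later on this page (a "List of tests:" header followed by one entry per line) and uses standard Unix tools:

$ rblnBandwidthLatencyTest --test_list | tail -n +2 | nl -v 0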

Command Reference

General usage

$ rblnBandwidthLatencyTest [options]
$ rblnBandwidthLatencyTest --tc <id> [options]

Global options

These options are accepted by the base command and the test-case execution path.

Common options

Option                       Description
-h, --help                   Show the help message and exit.
-t, --timeout <int>          Set the timeout in seconds.
-i, --iteration_cnt <int>    Set the iteration count.
--warmup                     Run a warmup iteration before measurement.
--test_size <str>            Set the data size (e.g., 4K, 128M, 1G). Default: 512M.
--npu_id_list <list>         Provide a comma-separated list of NPU IDs (e.g., 0,1,2).
--only_d2d                   Execute only D2D transfers; skip the H2D/D2H stages.
--test_list                  Print the list of available test cases.
--tc <int>                   Select the test case to run by its 0-based ID from --test_list.
--test_env_info              Print system and device information.
--skip_test_env_info         Skip printing environment information when running tests.
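
Putting several common options together (the values are illustrative):

$ rblnBandwidthLatencyTest --npu_id_list 0,1 --test_size 128M --iteration_cnt 5 --warmup --timeout 120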

Advanced options

Option                       Description
-n, --device_name <int>      Specify the source device name (ID).
--get_perf <int>             Enable performance reporting (1: average, 2: per_inference).
--priority <int>             Set context priority (0-2: min/normal/high, 3: kernel, 4: real-time).
-r, --hdma_read <int>        Configure the HDMA direction (0: write, 1: read).
--numa_node <int>            Force a NUMA node (default: -1).
--data_pattern <int>         Set the data pattern value.
-f, --flags <int>            Set context flags.
-g <int>                     Set the execution group ID (no long option is provided).
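
A sketch combining advanced options; the value semantics follow the table above, and sensible choices depend on your system (for example, the forced NUMA node must exist):

$ rblnBandwidthLatencyTest --get_perf 1 --numa_node 0 --hdma_read 1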

Tip

If you are unsure about an option, start with Quick Start and rely on rblnBandwidthLatencyTest --help for the complete help text.

CLI Examples

Default run

The default invocation prints the following output blocks:

Output block                 What it is
System + device enumeration  CPU and device list (typically includes PCI BDF and NUMA node).
P2P Connectivity Distance    NPU-to-NPU connectivity matrix (reported by the tool).
Bandwidth table              Throughput results (GB/s) across HOST and rbln* device labels.
Latency results              Synchronization latency in microseconds (us).

Command

$ rblnBandwidthLatencyTest

Output (excerpt)

CPU : AMD EPYC 9254 24-Core Processor, Vendor ID : AuthenticAMD
Device :
0000:00:01.1/01:00.0 PCI bridge: Broadcom / LSI PEX890xx PCIe Gen 5 Switch (rev b0)
    0000:05:00.0 NODE 0, rbln0, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 70
    0000:06:00.0 NODE 1, rbln1, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 71
    ...
P2P Connectivity Distance
rbln0  0  4  4  ...
rbln1  4  0  4  ...

Device Write bandwidth (GB/s)
S \ D  HOST  rbln0  rbln1 ...
HOST   0.00  50.91  50.72 ...
rbln0 59.56   0.00  59.31 ...

Device P2P sync latency (us)
rbln0 -> rbln1 2.07 (us)

List test cases (--test_list)

Prints the list of available performance test cases.

Command

$ rblnBandwidthLatencyTest --test_list

Output (excerpt)

List of tests:
RBLN_PERF_TEST_ALL
RBLN_PERF_BANDWIDTH_TEST_ALL
RBLN_PERF_LATENCY_TEST_ALL
RBLN_PERF_BANDWIDTH_TEST
RBLN_PERF_LATENCY_TEST
RBLN_PERF_BIDIR_BANDWIDTH_TEST
RBLN_PERF_PARALLEL_BANDWIDTH_TEST

Environment info (--test_env_info)

Prints CPU information and system topology details (PCI BDF, NUMA node, and PCIe link information).

Command

$ rblnBandwidthLatencyTest --test_env_info

Output (excerpt)

CPU : AMD EPYC 9254 24-Core Processor, Vendor ID : AuthenticAMD
Device :
0000:00:01.1/01:00.0 PCI bridge: Broadcom / LSI PEX890xx PCIe Gen 5 Switch (rev b0)
    0000:05:00.0 NODE 0, rbln0, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 70
                                   PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
                                   MaxPayload 512 bytes, MaxReadReq 512 bytes
...
0000:ce:00.0 NODE 30, rbln30, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 1, IOMMU: 136
                                   PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
                                   MaxPayload 512 bytes, MaxReadReq 512 bytes
0000:cf:00.0 NODE 31, rbln31, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 1, IOMMU: 137
                                   PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
                                   MaxPayload 512 bytes, MaxReadReq 512 bytes

Bandwidth test for an NPU pair (--tc 3)

Runs a single bandwidth test case (RBLN_PERF_BANDWIDTH_TEST) for a specific NPU pair.

Command

$ rblnBandwidthLatencyTest --tc 3 --npu_id_list 0,1 --iteration_cnt 1 --test_size 1M --skip_test_env_info

Output (excerpt)

This test will run with size (0x100000)
==============================================================================
[      Run ] : RBLN_PERF_BANDWIDTH_TEST
==============================================================================

[Measure Iteration cnt : 1]
[Measure Buffer Size : 0x100000]

P2P Connectivity Distance
rbln0       0   4   4   4   8   ...
rbln1       4   0   4   4   8   ...

---------------------------------
Device bandwidth (GB/s) *Write
---------------------------------
[H2D] HOST -> rbln0 : MAX 14.71, MIN 14.71, AVG 14.71
[D2D] rbln0 -> rbln1 : MAX 10.29, MIN 10.29, AVG 10.29
[D2H] rbln1 -> HOST : MAX 19.41, MIN 19.41, AVG 19.41

==============================================================================
[     PASS ] : RBLN_PERF_BANDWIDTH_TEST
==============================================================================

Latency test for an NPU pair (--tc 4)

Runs a single latency test case (RBLN_PERF_LATENCY_TEST) for a specific NPU pair.

Command

$ rblnBandwidthLatencyTest --tc 4 --npu_id_list 0,1 --skip_test_env_info

Output (excerpt)

==============================================================================
[      Run ] : RBLN_PERF_LATENCY_TEST
==============================================================================

[Measure Iteration cnt : 1]
[Measure Buffer Size : 0x20000000]

P2P Connectivity Distance
rbln0       0   4   4   4   8   ...
rbln1       4   0   4   4   8   ...

---------------------------------
Device P2P sync latency (us)
---------------------------------
rbln0 -> rbln1 2.07 (us)

==============================================================================
[     PASS ] : RBLN_PERF_LATENCY_TEST
==============================================================================

Troubleshooting

The command times out

  • Increase --timeout.
  • Reduce --test_size or --iteration_cnt to shorten the test.
  • Run --test_env_info first to confirm devices are detected and topology is printed.
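
For example, a quicker run with a larger timeout (illustrative values):

$ rblnBandwidthLatencyTest --timeout 300 --test_size 64M --iteration_cnt 1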

Results fluctuate a lot

  • Use --warmup.
  • Increase --iteration_cnt and compare averages.
  • Keep the system load stable (pin workloads, avoid background jobs during benchmarking).
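
For example (illustrative iteration count):

$ rblnBandwidthLatencyTest --warmup --iteration_cnt 10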

I only want device-to-device numbers

Use --only_d2d to skip H2D/D2H stages and focus on NPU-to-NPU transfers.
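
For example, combined with the bandwidth-only sweep (assuming the test-case ordering shown earlier):

$ rblnBandwidthLatencyTest --tc 1 --only_d2d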
