Bandwidth Benchmark (rblnBandwidthLatencyTest)¶
rblnBandwidthLatencyTest measures transfer performance between host and NPU, and between NPUs, and reports:
- Bandwidth results (GB/s)
- Latency results (us)
- NPU connectivity (distance matrix) and topology context
Quick Start¶
Default run (recommended)
Environment and topology
Next steps¶
- For concrete command+output pairs, see CLI Examples.
- For --tc IDs and selection rules, see Test cases.
Key Concepts and Terminology¶
Transfer directions¶
- H2D: host -> NPU
- D2H: NPU -> host
- D2D: NPU -> NPU (peer-to-peer)
Use --only_d2d when you want to focus on NPU-to-NPU transfers only.
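The three directions can be told apart from the source and destination labels alone. The following is a minimal sketch, assuming the host side always appears under the label HOST (as it does in the output excerpts on this page); the function name is hypothetical and not part of the tool.

```python
HOST_LABEL = "HOST"  # label used for the host side in the output excerpts


def transfer_direction(src: str, dst: str) -> str:
    """Classify a (source, destination) pair as H2D, D2H, or D2D."""
    src_is_host = src == HOST_LABEL
    dst_is_host = dst == HOST_LABEL
    if src_is_host and dst_is_host:
        raise ValueError("host-to-host is not one of the three directions")
    if src_is_host:
        return "H2D"
    if dst_is_host:
        return "D2H"
    return "D2D"


print(transfer_direction("HOST", "rbln0"))   # H2D
print(transfer_direction("rbln1", "HOST"))   # D2H
print(transfer_direction("rbln0", "rbln1"))  # D2D
```

With --only_d2d, only pairs that classify as D2D would be expected in the results.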
Measurement configuration¶
- --test_size: payload size (default: 512M).
- --iteration_cnt: number of measurement iterations.
- --warmup: run one warmup iteration before measurement.
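The output excerpts on this page report --test_size 1M as 0x100000 and the 512M default as 0x20000000, which is consistent with binary multiples (K = 2^10, M = 2^20, G = 2^30). The helper below is a sketch under that reading; it is not the tool's own parser, and treating a bare number as a byte count is an assumption.

```python
import re

# Binary multiples, inferred from the excerpts (1M -> 0x100000).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_test_size(text: str) -> int:
    """Convert a size string such as '4K', '128M', or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    count, unit = match.groups()
    return int(count) * _UNITS[unit]


print(hex(parse_test_size("1M")))    # 0x100000
print(hex(parse_test_size("512M")))  # 0x20000000
```

This is handy for cross-checking the "Measure Buffer Size" line against the size you requested.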
Target selection¶
- Identifiers
  - NPU ID: numeric index used by --npu_id_list (for example: 0,1).
  - Device label: runtime label shown in output (for example: rbln0, rbln1).
- Selection rules
  - --npu_id_list: comma-separated list of NPU IDs (for example: 0,1,2).
  - For test cases 3-6, the set of NPUs involved depends on whether --npu_id_list is provided.
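To make the two identifiers concrete, here is a small sketch that renders NPU IDs into the --npu_id_list value and derives the matching device labels. Both helper names are hypothetical. The "rbln" + ID mapping is an assumption based on the excerpts (NODE 0 appears as rbln0), and rejecting duplicates is a choice made here, not a documented rule of the tool.

```python
def format_npu_id_list(npu_ids):
    """Render NPU IDs as the comma-separated value for --npu_id_list."""
    ids = list(npu_ids)
    if not ids:
        raise ValueError("at least one NPU ID is required")
    if any(not isinstance(i, int) or i < 0 for i in ids):
        raise ValueError("NPU IDs must be non-negative integers")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate NPU IDs")  # strictness chosen for this sketch
    return ",".join(str(i) for i in ids)


def device_label(npu_id: int) -> str:
    # Assumption: the runtime label is "rbln" followed by the NPU ID.
    return f"rbln{npu_id}"


print(format_npu_id_list([0, 1, 2]))        # 0,1,2
print([device_label(i) for i in (0, 1)])    # ['rbln0', 'rbln1']
```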
Test cases¶
The available test case IDs are printed by --test_list. The ID is the 0-based index of the entry in the list.
Based on the --test_list output shown later on this page:
| ID (--tc) | --test_list entry | Purpose (high level) |
|---|---|---|
| 0 | RBLN_PERF_TEST_ALL | Run the default benchmark set (combined bandwidth + latency sweep). |
| 1 | RBLN_PERF_BANDWIDTH_TEST_ALL | Bandwidth-only sweep. |
| 2 | RBLN_PERF_LATENCY_TEST_ALL | Latency-only sweep. |
| 3 | RBLN_PERF_BANDWIDTH_TEST | Bandwidth test for a specific pair (or “first device vs others” if --npu_id_list is not provided). |
| 4 | RBLN_PERF_LATENCY_TEST | Latency test for a specific pair (or “first device vs others” if --npu_id_list is not provided). |
| 5 | RBLN_PERF_BIDIR_BANDWIDTH_TEST | Bidirectional bandwidth test across devices. |
| 6 | RBLN_PERF_PARALLEL_BANDWIDTH_TEST | Parallel bandwidth test across devices. |
You can run a specific test case by specifying its ID with the --tc option.
Note
If your environment prints a different --test_list order, treat the table above as a guide and rely on the --test_list output as the source of truth.
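If you drive the tool from a script, the ID-to-name mapping can be kept in one place. The enum below is a sketch that mirrors the table on this page; the class and helper names are hypothetical, and the values should be checked against your own --test_list output, which remains the source of truth.

```python
from enum import IntEnum


class PerfTestCase(IntEnum):
    """Test case IDs as listed on this page (0-based --test_list index)."""
    RBLN_PERF_TEST_ALL = 0
    RBLN_PERF_BANDWIDTH_TEST_ALL = 1
    RBLN_PERF_LATENCY_TEST_ALL = 2
    RBLN_PERF_BANDWIDTH_TEST = 3
    RBLN_PERF_LATENCY_TEST = 4
    RBLN_PERF_BIDIR_BANDWIDTH_TEST = 5
    RBLN_PERF_PARALLEL_BANDWIDTH_TEST = 6


def depends_on_npu_id_list(tc: int) -> bool:
    """True for test cases 3-6, whose NPU set depends on --npu_id_list."""
    return 3 <= PerfTestCase(tc) <= 6


print(PerfTestCase(3).name)          # RBLN_PERF_BANDWIDTH_TEST
print(depends_on_npu_id_list(0))     # False
print(depends_on_npu_id_list(4))     # True
```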
Command Reference¶
General usage¶
Global options¶
These options are accepted by the base command and the test-case execution path.
Common options
| Option | Description |
|---|---|
| -h, --help | Show the help message and exit. |
| -t, --timeout <int> | Set the timeout in seconds. |
| -i, --iteration_cnt <int> | Set the iteration count. |
| --warmup | Run a warmup iteration before measurement. |
| --test_size <str> | Set the data size (e.g., 4K, 128M, 1G). Default: 512M. |
| --npu_id_list <list> | Provide a comma-separated list of NPU IDs (e.g., 0,1,2). |
| --only_d2d | Execute only D2D transfers; skip H2D/D2H stages. |
| --test_list | Print the list of available test cases. |
| --tc <int> | Select the test case ID to run (0-based index). Use the index from --test_list. |
| --test_env_info | Print system and device information. |
| --skip_test_env_info | Skip printing environment information when running tests. |
Advanced options
| Option | Description |
|---|---|
| -n, --device_name <int> | Specify the source device name (ID). |
| --get_perf <int> | Enable performance reporting (1: average, 2: per_inference). |
| --priority <int> | Set context priority (0-2: min/normal/high, 3: kernel, 4: real-time). |
| -r, --hdma_read <int> | Configure the HDMA direction (0: write, 1: read). |
| --numa_node <int> | Force a NUMA node (default: -1). |
| --data_pattern <int> | Set the data pattern value. |
| -f, --flags <int> | Set context flags. |
| -g <int> | Set the execution group ID (no long option is provided). |
Tip
If you are unsure about an option, start with Quick Start and rely on rblnBandwidthLatencyTest --help for the complete help text.
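For scripted runs, the common options can be assembled into an argument list before invoking the tool. This is a minimal sketch: build_command and run are hypothetical helpers, only flags from the Common options table are used, and the wrapper assumes the binary is on PATH. How the tool signals failure through its exit code is not documented here, so the caller is left to inspect the result.

```python
import shutil
import subprocess

TOOL = "rblnBandwidthLatencyTest"


def build_command(tc=None, npu_ids=None, iteration_cnt=None, test_size=None,
                  warmup=False, only_d2d=False, timeout=None,
                  skip_test_env_info=False):
    """Assemble an argument list from the documented common options."""
    cmd = [TOOL]
    if tc is not None:
        cmd += ["--tc", str(tc)]
    if npu_ids:
        cmd += ["--npu_id_list", ",".join(str(i) for i in npu_ids)]
    if iteration_cnt is not None:
        cmd += ["--iteration_cnt", str(iteration_cnt)]
    if test_size is not None:
        cmd += ["--test_size", test_size]
    if warmup:
        cmd.append("--warmup")
    if only_d2d:
        cmd.append("--only_d2d")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if skip_test_env_info:
        cmd.append("--skip_test_env_info")
    return cmd


def run(cmd, wall_clock_limit=None):
    """Run the tool and return the CompletedProcess (hypothetical wrapper)."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=wall_clock_limit)


cmd = build_command(tc=3, npu_ids=[0, 1], iteration_cnt=1,
                    test_size="1M", skip_test_env_info=True)
print(" ".join(cmd))
# rblnBandwidthLatencyTest --tc 3 --npu_id_list 0,1 --iteration_cnt 1 --test_size 1M --skip_test_env_info
```

Passing a list to subprocess.run avoids shell quoting issues; wall_clock_limit is an outer safety net in the script and is separate from the tool's own --timeout.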
CLI Examples¶
Summary
| Output block | What it is |
|---|---|
| System + device enumeration | CPU and device list (typically includes PCI BDF and NUMA node). |
| P2P Connectivity Distance | NPU-to-NPU connectivity matrix (reported by the tool). |
| Bandwidth table | Throughput results (GB/s) across HOST and rbln* device labels. |
| Latency results | Synchronization latency in microseconds (us). |
Command
Output (excerpt)
CPU : AMD EPYC 9254 24-Core Processor, Vendor ID : AuthenticAMD
Device :
0000:00:01.1/01:00.0 PCI bridge: Broadcom / LSI PEX890xx PCIe Gen 5 Switch (rev b0)
0000:05:00.0 NODE 0, rbln0, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 70
0000:06:00.0 NODE 1, rbln1, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 71
...
P2P Connectivity Distance
rbln0 0 4 4 ...
rbln1 4 0 4 ...
Device Write bandwidth (GB/s)
S \ D HOST rbln0 rbln1 ...
HOST 0.00 50.91 50.72 ...
rbln0 59.56 0.00 59.31 ...
Device P2P sync latency (us)
rbln0 -> rbln1 2.07 (us)
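The P2P Connectivity Distance block can be turned into a lookup table when you want to group peers by distance. The parser below is a sketch, not part of the tool. It assumes each row is a device label followed by integer distances, and that the columns follow the same order as the rows (the zero diagonal in the excerpt suggests this, but it is not stated). The sample is the excerpt above trimmed to two devices.

```python
import re

# A matrix row: device label followed only by whitespace-separated integers.
ROW = re.compile(r"^(rbln\d+)((?:\s+\d+)+)\s*$")


def parse_distance_matrix(text):
    """Return {source_label: {peer_label: distance}} from the tool output."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match.group(1)] = [int(v) for v in match.group(2).split()]
    labels = list(rows)  # assumed column order: same as row order
    return {src: dict(zip(labels, dists)) for src, dists in rows.items()}


SAMPLE = """\
P2P Connectivity Distance
rbln0 0 4
rbln1 4 0
"""

print(parse_distance_matrix(SAMPLE))
# {'rbln0': {'rbln0': 0, 'rbln1': 4}, 'rbln1': {'rbln0': 4, 'rbln1': 0}}
```

Bandwidth and latency rows contain decimals or arrows, so the pattern does not pick them up by accident.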
Summary
Prints the list of available performance test cases.
Command
$ rblnBandwidthLatencyTest --test_list
Output (excerpt)
Summary
Prints CPU information and system topology details (PCI BDF, NUMA node, and PCIe link information).
Command
$ rblnBandwidthLatencyTest --test_env_info
Output (excerpt)
CPU : AMD EPYC 9254 24-Core Processor, Vendor ID : AuthenticAMD
Device :
0000:00:01.1/01:00.0 PCI bridge: Broadcom / LSI PEX890xx PCIe Gen 5 Switch (rev b0)
0000:05:00.0 NODE 0, rbln0, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 0, IOMMU: 70
PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
MaxPayload 512 bytes, MaxReadReq 512 bytes
...
0000:ce:00.0 NODE 30, rbln30, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 1, IOMMU: 136
PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
MaxPayload 512 bytes, MaxReadReq 512 bytes
0000:cf:00.0 NODE 31, rbln31, RBLN-CA25(rev-03), pciDevID: 0x1250, NUMA: 1, IOMMU: 137
PCIe full bandwidth 64.0 GB/s (MAX: 64.0), PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)
MaxPayload 512 bytes, MaxReadReq 512 bytes
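The PCIe line in this output is a quick way to spot a link that trained below its maximum. The sketch below parses that line and compares the current generation and lane count with the reported MAX values. The field names and helpers are hypothetical, and the pattern assumes the exact line layout shown in the excerpt. The 50.91 GB/s figure in the example is the HOST -> rbln0 cell from the default-run excerpt on this page.

```python
import re

LINK = re.compile(
    r"PCIe full bandwidth (?P<bw>[\d.]+) GB/s \(MAX: (?P<bw_max>[\d.]+)\), "
    r"PCIe Gen (?P<gen>\d+) \(MAX: (?P<gen_max>\d+)\), "
    r"(?P<lanes>\d+) lane \(MAX: (?P<lanes_max>\d+)\)"
)


def parse_link(line):
    """Extract current and maximum link parameters, or None if no match."""
    m = LINK.search(line)
    if m is None:
        return None
    return {
        "bw": float(m["bw"]), "bw_max": float(m["bw_max"]),
        "gen": int(m["gen"]), "gen_max": int(m["gen_max"]),
        "lanes": int(m["lanes"]), "lanes_max": int(m["lanes_max"]),
    }


def is_degraded(link):
    """True if the link runs below its maximum generation or width."""
    return link["gen"] < link["gen_max"] or link["lanes"] < link["lanes_max"]


def efficiency(measured_gbps, link):
    """Measured throughput as a fraction of the reported full bandwidth."""
    return measured_gbps / link["bw"]


line = ("PCIe full bandwidth 64.0 GB/s (MAX: 64.0), "
        "PCIe Gen 5 (MAX: 5), 16 lane (MAX: 16)")
link = parse_link(line)
print(is_degraded(link))                 # False
print(f"{efficiency(50.91, link):.1%}")  # 79.5%
```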
Summary
Runs a single bandwidth test case (RBLN_PERF_BANDWIDTH_TEST) for a specific NPU pair.
Command
$ rblnBandwidthLatencyTest --tc 3 --npu_id_list 0,1 --iteration_cnt 1 --test_size 1M --skip_test_env_info
Output (excerpt)
This test will run with size (0x100000)
==============================================================================
[ Run ] : RBLN_PERF_BANDWIDTH_TEST
==============================================================================
[Measure Iteration cnt : 1]
[Measure Buffer Size : 0x100000]
P2P Connectivity Distance
rbln0 0 4 4 4 8 ...
rbln1 4 0 4 4 8 ...
---------------------------------
Device bandwidth (GB/s) *Write
---------------------------------
[H2D] HOST -> rbln0 : MAX 14.71, MIN 14.71, AVG 14.71
[D2D] rbln0 -> rbln1 : MAX 10.29, MIN 10.29, AVG 10.29
[D2H] rbln1 -> HOST : MAX 19.41, MIN 19.41, AVG 19.41
==============================================================================
[ PASS ] : RBLN_PERF_BANDWIDTH_TEST
==============================================================================
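The per-direction result lines follow a regular layout, so they are easy to collect for logging or comparison. This is a sketch of one way to parse them, assuming the "[DIR] src -> dst : MAX x, MIN y, AVG z" layout shown in the excerpt above; the type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class BandwidthRow(NamedTuple):
    direction: str
    src: str
    dst: str
    max_gbps: float
    min_gbps: float
    avg_gbps: float


ROW = re.compile(
    r"^\[(H2D|D2H|D2D)\]\s+(\S+)\s+->\s+(\S+)\s*:\s*"
    r"MAX\s+([\d.]+),\s*MIN\s+([\d.]+),\s*AVG\s+([\d.]+)\s*$"
)


def parse_bandwidth(text):
    """Return one BandwidthRow per result line found in the tool output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            direction, src, dst, *nums = match.groups()
            rows.append(BandwidthRow(direction, src, dst, *map(float, nums)))
    return rows


SAMPLE = """\
[H2D] HOST -> rbln0 : MAX 14.71, MIN 14.71, AVG 14.71
[D2D] rbln0 -> rbln1 : MAX 10.29, MIN 10.29, AVG 10.29
[D2H] rbln1 -> HOST : MAX 19.41, MIN 19.41, AVG 19.41
"""

for row in parse_bandwidth(SAMPLE):
    print(row.direction, row.src, row.dst, row.avg_gbps)
# H2D HOST rbln0 14.71
# D2D rbln0 rbln1 10.29
# D2H rbln1 HOST 19.41
```

With --iteration_cnt 1, as in this run, MAX, MIN, and AVG are identical; the three columns only diverge with more iterations.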
Summary
Runs a single latency test case (RBLN_PERF_LATENCY_TEST) for a specific NPU pair.
Command
Output (excerpt)
==============================================================================
[ Run ] : RBLN_PERF_LATENCY_TEST
==============================================================================
[Measure Iteration cnt : 1]
[Measure Buffer Size : 0x20000000]
P2P Connectivity Distance
rbln0 0 4 4 4 8 ...
rbln1 4 0 4 4 8 ...
---------------------------------
Device P2P sync latency (us)
---------------------------------
rbln0 -> rbln1 2.07 (us)
==============================================================================
[ PASS ] : RBLN_PERF_LATENCY_TEST
==============================================================================
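The latency lines and the final verdict can be checked the same way. The sketch below assumes the "src -> dst value (us)" layout and the "[ PASS ] : NAME" banner shown above. Only PASS appears in the excerpts on this page, so the helper reports whether a PASS banner is present and makes no assumption about how other outcomes are printed. Both function names are hypothetical.

```python
import re

LATENCY = re.compile(r"^(\S+)\s+->\s+(\S+)\s+([\d.]+)\s+\(us\)\s*$")
BANNER = re.compile(r"^\[\s*(\w+)\s*\]\s*:\s*(\w+)\s*$")


def parse_latency(text):
    """Return {(src, dst): latency_us} for each latency line."""
    result = {}
    for line in text.splitlines():
        match = LATENCY.match(line.strip())
        if match:
            result[(match.group(1), match.group(2))] = float(match.group(3))
    return result


def passed(text, test_name):
    """True if the output contains a '[ PASS ] : <test_name>' banner."""
    for line in text.splitlines():
        match = BANNER.match(line.strip())
        if match and match.group(1) == "PASS" and match.group(2) == test_name:
            return True
    return False


SAMPLE = """\
[ Run ] : RBLN_PERF_LATENCY_TEST
rbln0 -> rbln1 2.07 (us)
[ PASS ] : RBLN_PERF_LATENCY_TEST
"""

print(parse_latency(SAMPLE))                     # {('rbln0', 'rbln1'): 2.07}
print(passed(SAMPLE, "RBLN_PERF_LATENCY_TEST"))  # True
```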
Troubleshooting¶
The command times out¶
- Increase --timeout.
- Reduce --test_size or --iteration_cnt to shorten the test.
- Run --test_env_info first to confirm devices are detected and topology is printed.
Results fluctuate a lot¶
- Use --warmup.
- Increase --iteration_cnt and compare averages.
- Keep the system load stable (pin workloads, avoid background jobs during benchmarking).
I only want device-to-device numbers¶
Use --only_d2d to skip H2D/D2H stages and focus on NPU-to-NPU transfers.
See also¶
- rbln-smi: device monitoring and topology inspection
- rblnvs: system validation
- rbln-flash: firmware update tool