Troubleshooting
How to generate a core dump file¶
If you encounter a problem while running PyTorch RBLN, please send the generated core dump file to client_support@rebellions.ai. To create a core dump file, first remove the ulimit restriction on core file size with the following command:
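```bash
ulimit -c unlimited
```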
Verify that the ulimit restrictions have been removed by running:
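```bash
ulimit -c   # should print "unlimited" once the restriction has been removed
```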
Re-run the problematic model script. When the error occurs, a core dump file will be created under /var/crash.
Logging Operators Running on CPU¶
When PyTorch RBLN encounters a PyTorch operator or data type it does not yet support, the operation is executed on the CPU instead to ensure seamless execution.
While this feature enhances model compatibility, these operations do not leverage the performance benefits of the NPU. Therefore, it is crucial to identify which operations are falling back to the CPU during the optimization process.
By default, the PyTorch RBLN log level is set to WARNING, so DEBUG messages are not displayed. To identify all operators running on the CPU for NPU performance optimization, you must explicitly set the log level environment variable to DEBUG as shown below.
Usage:
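A sketch, assuming the log level is exposed through an environment variable named `RBLN_LOG_LEVEL` (confirm the exact name against your PyTorch RBLN release):

```bash
export RBLN_LOG_LEVEL=DEBUG   # assumed variable name; enables DEBUG-level logs
```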
To set it back to the default value, set the environment variable as follows.
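Using the same assumed variable name:

```bash
export RBLN_LOG_LEVEL=WARNING   # assumed variable name; restores the default level
```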
With this environment variable set, running a model in Eager Mode prints a log entry whenever an operation is performed on the CPU instead of the Rebellions NPU, containing the operator's name and (if traceable) the source code location.

Example Output:
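The message below is illustrative only; the exact format may differ between releases:

```
[DEBUG] Falling back to CPU: aten::erf (model.py:42)
```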
Wrong results after torch.Tensor.to¶
Symptoms¶
- Mismatch vs. CPU baselines after moving tensors from RBLN to CPU; odd FP16 behavior in downstream CPU code.
Impact¶
- Numerical drift or silent accuracy issues.
Root cause¶
- RBLN runs FP16 math with a custom 16-bit floating-point format. Tensors can display as `torch.float16` yet retain the custom-FP semantics if not explicitly materialized; using `to("cpu")` alone carries those semantics over to CPU.
Resolution¶
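A minimal sketch of the fix, assuming the backend registers the "rbln" device string: pass the dtype explicitly when moving a tensor off the device, so it is materialized as standard `torch.float16`.

```python
import torch

# Hypothetical setup; assumes the "rbln" device string registered by the backend.
y = torch.randn(8, dtype=torch.float16).to("rbln")

# Plain .to("cpu") can carry the custom-FP semantics over to the CPU tensor.
# Passing dtype explicitly forces materialization as standard torch.float16.
y_cpu = y.to("cpu", dtype=torch.float16)
```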
Verify (minimal comparison)¶
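A minimal comparison sketch under the same assumption, checking a device result against a CPU baseline:

```python
import torch

# Assumes the PyTorch RBLN backend is installed and registers the "rbln"
# device string (adjust if your build differs).
x = torch.randn(4, 4, dtype=torch.float16)

# Same computation on device and on CPU; materialize standard FP16 on the way back.
y_dev = (x.to("rbln") * 2).to("cpu", dtype=torch.float16)
y_cpu = x * 2

print("max abs diff:", (y_dev - y_cpu).abs().max().item())
```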
Example Output¶
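Illustrative output from the comparison sketch above (exact values depend on your inputs); a zero or negligibly small difference indicates the tensor was materialized correctly:

```
max abs diff: 0.0
```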
Notes¶
- On device, `tensor.to(dtype=torch.float16)` materializes the tensor if needed
- When moving to CPU, prefer `to("cpu", dtype=torch.float16)` before comparisons or CPU-side ops
Memory Statistics Showing Lower Than Expected Values¶
Symptoms¶
- Memory statistics APIs (e.g., `memory_allocated()`, `memory_stats()`) return lower values than expected immediately after creating tensors on an RBLN device
- Device memory usage appears to be zero or very low even after allocating large tensors
Impact¶
- Confusion about actual memory usage
- Difficulty in monitoring memory consumption during model execution
Root Cause¶
- RBLN tensors use lazy memory allocation. When you create a tensor on an RBLN device:
    - The tensor is initially allocated in CPU memory immediately upon creation
    - Device memory allocation is deferred until the tensor is actually needed for device operations
    - When a device operation is required, the tensor data is lazily transferred from CPU to device memory
- All memory-related APIs (e.g., `memory_allocated()`, `memory_reserved()`, `memory_stats()`) reflect device memory only, not CPU memory
- Additionally, Dynamo caching (used by `torch.compile()`) may cache compiled graphs and their associated device memory, which can also contribute to device memory usage. To see accurate memory statistics for your tensors only, reset the Dynamo cache before checking statistics
Resolution¶
- Check memory statistics after performing operations that require the tensors to be materialized on the device
- Memory statistics will increase when tensors are actually used in device computations
Example¶
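A sketch of the behavior, assuming the "rbln" device string and a `torch.rbln` memory-stats namespace mirroring `torch.cuda` (both are assumptions; verify the exact API against your installed PyTorch RBLN release):

```python
import torch
import torch._dynamo

# Drop cached compiled graphs so they don't skew the device memory stats.
torch._dynamo.reset()

# Lazily allocated: the data initially lives in CPU memory only.
x = torch.randn(1024, 1024, device="rbln")  # "rbln" device string is an assumption
print("after creation:", torch.rbln.memory_allocated())   # often 0 or very low

# A device operation forces materialization onto the NPU.
y = x + x
print("after device op:", torch.rbln.memory_allocated())  # now reflects the device buffers
```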
Notes¶
- This is expected behavior and not a bug. The lazy allocation strategy optimizes memory usage by deferring device memory allocation until needed
- To see accurate device memory usage, check statistics after performing operations that require the tensors to be materialized on the device
- CPU memory is allocated immediately, but device memory statistics only reflect memory after materialization
- Dynamo caching can hold compiled graphs in device memory. Use `torch._dynamo.reset()` before checking memory statistics to exclude cached graph memory and see only your tensor memory usage