Troubleshooting

How to generate a core dump file

If you encounter a problem while running PyTorch RBLN, please send the generated core dump file to client_support@rebellions.ai. To create a core dump file, first remove the ulimit restriction on core dump size by running the following command.

$ ulimit -c unlimited

Verify that the restriction has been removed by running:

$ ulimit -c
unlimited
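
You can also confirm the limit from inside Python with the standard-library resource module. This is only a convenience check of the current process and is not part of PyTorch RBLN.

import resource

# (soft, hard) core dump size limits of the current process
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core dump limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)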

Re-run the problematic model script. When the error occurs, a core dump file will be created under /var/crash.

$ ls -l /var/crash/*.crash
-rw-r----- 1 rebel1    root   779026 Jul  2 17:50 /var/crash/_usr_bin_python3.10.2029.crash
-rw-r----- 1 rebel2    root 94849351 Jun 25 18:27 /var/crash/_usr_bin_python3.10.2035.crash

Logging Operators Running on CPU

When PyTorch RBLN encounters a PyTorch operator or data type that it does not yet support, the operation is executed on the CPU instead to ensure seamless execution.

While this feature enhances model compatibility, these operations do not leverage the performance benefits of the NPU. Therefore, it is crucial to identify which operations are falling back to the CPU during the optimization process.

By default, the PyTorch RBLN log level is set to WARNING, so DEBUG messages are not displayed. To identify all operators running on the CPU during NPU performance optimization, explicitly set the TORCH_RBLN_LOG environment variable to DEBUG as shown below.

Usage:

$ export TORCH_RBLN_LOG=DEBUG

To restore the default log level, set the environment variable back as follows.

$ export TORCH_RBLN_LOG=WARNING
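
If you prefer to control the log level from the script itself, you can also set the variable with os.environ. This is a minimal sketch that assumes TORCH_RBLN_LOG is read when PyTorch RBLN is imported, so the variable is set before the import.

import os

# Assumption: the variable must be set before PyTorch RBLN is imported
os.environ["TORCH_RBLN_LOG"] = "DEBUG"

import torch
import torch.rbln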

Example Output:

With this setting, running a model in Eager Mode prints a log entry, as shown below, whenever an operation is performed on the CPU instead of the Rebellions NPU. Each entry contains the operator's name and, if traceable, the source code location.

[TORCH-RBLN][DEBUG] 'aten::pow' ran on CPU instead of RBLN
/transformers/models/llama/modeling_llama.py:73: UserWarning: TRACE
  variance = hidden_states.pow(2).mean(-1, keepdim=True)
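
As a minimal way to reproduce such a log entry, the sketch below runs the same kind of computation as the traced line above on the rbln device in Eager Mode, with TORCH_RBLN_LOG=DEBUG exported. Which operators, if any, actually fall back to the CPU depends on your PyTorch RBLN version, so treat the output as illustrative.

import torch
import torch.rbln

hidden_states = torch.randn(2, 8, device="rbln")
# Mirrors the traced line in the log above; run with TORCH_RBLN_LOG=DEBUG
# to see a [TORCH-RBLN][DEBUG] message if this operator falls back to the CPU.
variance = hidden_states.pow(2).mean(-1, keepdim=True)
print(variance.to("cpu"))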

Wrong results after torch.Tensor.to

Symptoms

  • Mismatch vs. CPU baselines after moving tensors from RBLN to CPU; odd FP16 behavior in downstream CPU code.

Impact

  • Numerical drift or silent accuracy issues.

Root cause

  • RBLN runs FP16 math with a custom 16-bit floating-point format. Tensors can report a dtype of torch.float16 yet retain the custom-format semantics if they are not explicitly materialized. Calling to("cpu") alone carries those semantics over to the CPU.

Resolution

tensor = tensor.to("cpu", dtype=torch.float16)  # explicit dtype materializes the value as regular float16 on the CPU

Verify (minimal comparison)

import torch
import torch.rbln  # needed for the "rbln" device, as in the examples below

def elements_sum(n, device):
    # Build two FP16 tensors on the given device and add them.
    x = torch.ones(n, dtype=torch.float16, device=device)
    y = torch.full_like(x, 0.5, dtype=torch.float16, device=device)
    return x + y

out = elements_sum(2, "rbln")
baseline = elements_sum(2, "cpu")

# Move the device result to CPU with and without an explicit dtype.
implicit_move = out.to("cpu")
explicit_cast = out.to("cpu", dtype=torch.float16)

print("baseline     :", baseline.tolist())
print("implicit_move:", implicit_move.tolist())
print("explicit_cast:", explicit_cast.tolist())

Example Output

baseline     : [1.5, 1.5]
implicit_move: [1.75, 1.75]
explicit_cast: [1.5, 1.5]

Notes

  • On the device, tensor.to(dtype=torch.float16) materializes the tensor if needed.
  • When moving to CPU, prefer to("cpu", dtype=torch.float16) before comparisons or CPU-side ops (see the short sketch after these notes).
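
A minimal sketch combining both notes, assuming the same rbln device and torch.rbln import used in the other examples:

import torch
import torch.rbln

t = torch.ones(4, dtype=torch.float16, device="rbln")
t = t.to(dtype=torch.float16)             # materialize on the device if needed
cpu_t = t.to("cpu", dtype=torch.float16)  # explicit dtype when moving to CPU
print(cpu_t.tolist())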

Memory Statistics Showing Lower Than Expected Values

Symptoms

  • Memory statistics APIs (e.g., memory_allocated(), memory_stats()) return lower values than expected immediately after creating tensors on an RBLN device
  • Device memory usage appears to be zero or very low even after allocating large tensors

Impact

  • Confusion about actual memory usage
  • Difficulty in monitoring memory consumption during model execution

Root Cause

  • RBLN tensors use lazy memory allocation. When you create a tensor on an RBLN device:
    • The tensor is initially allocated in CPU memory immediately upon creation
    • Device memory allocation is deferred until the tensor is actually needed for device operations
    • When a device operation is required, the tensor data is lazily transferred from CPU to device memory
  • All memory-related APIs (e.g., memory_allocated(), memory_reserved(), memory_stats()) reflect device memory only, not CPU memory
  • Additionally, Dynamo caching (used by torch.compile()) may keep compiled graphs and their associated device memory alive, which also contributes to device memory usage. To see memory statistics for your tensors only, reset the Dynamo cache before checking them, as in the example below

Resolution

  • Check memory statistics after performing operations that require the tensors to be materialized on the device
  • Memory statistics will increase when tensors are actually used in device computations

Example

import torch
import torch.rbln

# Create tensors
x = torch.randn(1024, 1024, device="rbln")
y = torch.randn(1024, 1024, device="rbln")

# Memory stats immediately after creation may be low
print("After creation:", torch.rbln.memory_allocated() / 1024, "KB")

# Perform a device operation to materialize tensors
z = x + y

# Reset Dynamo cache to exclude cached graph memory from statistics
torch._dynamo.reset()

# Memory stats after materialization will reflect actual device memory usage
print("After operation:", torch.rbln.memory_allocated() / 1024, "KB")

Notes

  • This is expected behavior and not a bug. The lazy allocation strategy optimizes memory usage by deferring device memory allocation until needed
  • To see accurate device memory usage, check statistics after performing operations that require the tensors to be materialized on the device (a small helper for this is sketched after these notes)
  • CPU memory is allocated immediately, but device memory statistics only reflect memory after materialization
  • Dynamo caching can hold compiled graphs in device memory. Use torch._dynamo.reset() before checking memory statistics to exclude cached graph memory and see only your tensor memory usage
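
The sketch below wraps the same idea in a small helper that reports how much device memory a block of code materializes. It relies only on torch.rbln.memory_allocated(), which the example above already uses; the context-manager wrapper itself is illustrative and not part of the PyTorch RBLN API.

import contextlib
import torch
import torch.rbln

@contextlib.contextmanager
def report_device_memory(label):
    # Compare allocated device memory before and after the wrapped block.
    before = torch.rbln.memory_allocated()
    yield
    after = torch.rbln.memory_allocated()
    print(f"{label}: {(after - before) / 1024:.1f} KB of device memory materialized")

x = torch.randn(1024, 1024, device="rbln")

with report_device_memory("x + x"):
    y = x + x  # the device operation triggers the lazy transfer to device memory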