Troubleshooting

How to generate a core dump file

If you encounter a problem while running PyTorch RBLN, please send the generated core dump file to client_support@rebellions.ai. To create a core dump file, first remove the ulimit restriction on core dump size by running the following command.

$ ulimit -c unlimited

Verify that the restriction has been removed by running:

$ ulimit -c
unlimited
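
You can also confirm the limit from inside Python with the standard-library resource module. This is only a convenience check of the current process and is not part of PyTorch RBLN.

import resource

# (soft, hard) core dump size limits of the current process
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core dump limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)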

Re-run the problematic model script. When the error occurs, a core dump file will be created under /var/crash.

$ ls -l /var/crash/*.crash
-rw-r----- 1 rebel1    root   779026 Jul  2 17:50 /var/crash/_usr_bin_python3.10.2029.crash
-rw-r----- 1 rebel2    root 94849351 Jun 25 18:27 /var/crash/_usr_bin_python3.10.2035.crash

Logging Operators Running on CPU

When PyTorch RBLN encounters a PyTorch operator or data type that it does not yet support, the operation is executed on the CPU instead to ensure seamless execution.

While this feature enhances model compatibility, these operations do not leverage the performance benefits of the NPU. Therefore, it is crucial to identify which operations are falling back to the CPU during the optimization process.

By default, the PyTorch RBLN log level is set to WARNING, so DEBUG messages are not displayed. To identify all operators running on the CPU during NPU performance optimization, explicitly set the TORCH_RBLN_LOG environment variable to DEBUG as shown below.

Usage:

$ export TORCH_RBLN_LOG=DEBUG

To restore the default log level, set the environment variable back as follows.

$ export TORCH_RBLN_LOG=WARNING
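
If you prefer to control the log level from the script itself, you can also set the variable with os.environ. This is a minimal sketch that assumes TORCH_RBLN_LOG is read when PyTorch RBLN is imported, so the variable is set before the import.

import os

# Assumption: the variable must be set before PyTorch RBLN is imported
os.environ["TORCH_RBLN_LOG"] = "DEBUG"

import torch
import torch.rbln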

Example Output:

With this setting, running a model in Eager Mode prints a log entry, as shown below, whenever an operation is performed on the CPU instead of the Rebellions NPU. Each entry contains the operator's name and, if traceable, the source code location.

[TORCH-RBLN][DEBUG] 'aten::pow' ran on CPU instead of RBLN
/transformers/models/llama/modeling_llama.py:73: UserWarning: TRACE
  variance = hidden_states.pow(2).mean(-1, keepdim=True)
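
As a minimal way to reproduce such a log entry, the sketch below runs the same kind of computation as the traced line above on the rbln device in Eager Mode, with TORCH_RBLN_LOG=DEBUG exported. Which operators, if any, actually fall back to the CPU depends on your PyTorch RBLN version, so treat the output as illustrative.

import torch
import torch.rbln

hidden_states = torch.randn(2, 8, device="rbln")
# Mirrors the traced line in the log above; run with TORCH_RBLN_LOG=DEBUG
# to see a [TORCH-RBLN][DEBUG] message if this operator falls back to the CPU.
variance = hidden_states.pow(2).mean(-1, keepdim=True)
print(variance.to("cpu"))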

Wrong results after torch.Tensor.to

Symptoms

  • Mismatch vs. CPU baselines after moving tensors from RBLN to CPU; odd FP16 behavior in downstream CPU code.

Impact

  • Numerical drift or silent accuracy issues.

Root cause

  • RBLN runs FP16 math with a custom 16-bit floating-point format. Tensors can report a dtype of torch.float16 yet retain the custom-format semantics if they are not explicitly materialized. Calling to("cpu") alone carries those semantics over to the CPU.

Resolution

tensor = tensor.to("cpu", dtype=torch.float16)  # explicit dtype materializes the value as regular float16 on the CPU

Verify (minimal comparison)

import torch
import torch.rbln  # needed for the "rbln" device, as in the examples below

def elements_sum(n, device):
    # Build two FP16 tensors on the given device and add them.
    x = torch.ones(n, dtype=torch.float16, device=device)
    y = torch.full_like(x, 0.5, dtype=torch.float16, device=device)
    return x + y

out = elements_sum(2, "rbln")
baseline = elements_sum(2, "cpu")

# Move the device result to CPU with and without an explicit dtype.
implicit_move = out.to("cpu")
explicit_cast = out.to("cpu", dtype=torch.float16)

print("baseline     :", baseline.tolist())
print("implicit_move:", implicit_move.tolist())
print("explicit_cast:", explicit_cast.tolist())

Example Output

baseline     : [1.5, 1.5]
implicit_move: [1.75, 1.75]
explicit_cast: [1.5, 1.5]

Notes

  • On the device, tensor.to(dtype=torch.float16) materializes the tensor if needed.
  • When moving to CPU, prefer to("cpu", dtype=torch.float16) before comparisons or CPU-side ops (see the short sketch after these notes).
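
A minimal sketch combining both notes, assuming the same rbln device and torch.rbln import used in the other examples:

import torch
import torch.rbln

t = torch.ones(4, dtype=torch.float16, device="rbln")
t = t.to(dtype=torch.float16)             # materialize on the device if needed
cpu_t = t.to("cpu", dtype=torch.float16)  # explicit dtype when moving to CPU
print(cpu_t.tolist())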

Memory Statistics Showing Lower Than Expected Values

Symptoms

  • Memory statistics APIs (e.g., memory_allocated(), memory_stats()) return lower values than expected immediately after creating tensors on an RBLN device
  • Device memory usage appears to be zero or very low even after allocating large tensors

Impact

  • Confusion about actual memory usage
  • Difficulty in monitoring memory consumption during model execution

Root Cause

  • RBLN tensors use lazy memory allocation. When you create a tensor on an RBLN device:
    • The tensor is initially allocated in CPU memory immediately upon creation
    • Device memory allocation is deferred until the tensor is actually needed for device operations
    • When a device operation is required, the tensor data is lazily transferred from CPU to device memory
  • All memory-related APIs (e.g., memory_allocated(), memory_reserved(), memory_stats()) reflect device memory only, not CPU memory
  • Additionally, Dynamo caching (used by torch.compile()) may keep compiled graphs and their associated device memory alive, which also contributes to device memory usage. To see memory statistics for your tensors only, reset the Dynamo cache before checking them, as in the example below

Resolution

  • Check memory statistics after performing operations that require the tensors to be materialized on the device
  • Memory statistics will increase when tensors are actually used in device computations

Example

import torch
import torch.rbln

# Create tensors
x = torch.randn(1024, 1024, device="rbln")
y = torch.randn(1024, 1024, device="rbln")

# Memory stats immediately after creation may be low
print("After creation:", torch.rbln.memory_allocated() / 1024, "KB")

# Perform a device operation to materialize tensors
z = x + y

# Reset Dynamo cache to exclude cached graph memory from statistics
torch._dynamo.reset()

# Memory stats after materialization will reflect actual device memory usage
print("After operation:", torch.rbln.memory_allocated() / 1024, "KB")

Notes

  • This is expected behavior and not a bug. The lazy allocation strategy optimizes memory usage by deferring device memory allocation until needed
  • To see accurate device memory usage, check statistics after performing operations that require the tensors to be materialized on the device (a small helper for this is sketched after these notes)
  • CPU memory is allocated immediately, but device memory statistics only reflect memory after materialization
  • Dynamo caching can hold compiled graphs in device memory. Use torch._dynamo.reset() before checking memory statistics to exclude cached graph memory and see only your tensor memory usage
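
The sketch below wraps the same idea in a small helper that reports how much device memory a block of code materializes. It relies only on torch.rbln.memory_allocated(), which the example above already uses; the context-manager wrapper itself is illustrative and not part of the PyTorch RBLN API.

import contextlib
import torch
import torch.rbln

@contextlib.contextmanager
def report_device_memory(label):
    # Compare allocated device memory before and after the wrapped block.
    before = torch.rbln.memory_allocated()
    yield
    after = torch.rbln.memory_allocated()
    print(f"{label}: {(after - before) / 1024:.1f} KB of device memory materialized")

x = torch.randn(1024, 1024, device="rbln")

with report_device_memory("x + x"):
    y = x + x  # the device operation triggers the lazy transfer to device memory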