
Troubleshoot

Core issues and resolutions for Optimum RBLN.


Runtime creation fails after compilation

Failed to create RBLN runtime: ...

If you only need to compile the model without loading it to NPU, you can use:
  from_pretrained(..., rbln_create_runtimes=False) or
  from_pretrained(..., rbln_config={..., 'create_runtimes': False})

from_pretrained creates runtimes on the NPU immediately after compilation, so this error can appear even when compilation itself succeeded. Device memory exhaustion is not the only possible cause; inspect the original exception at the top of the message first. To skip runtime creation and only compile and save the model:

model = RBLNModel.from_pretrained(model_id, rbln_create_runtimes=False)
model.save_pretrained(save_dir)

Load the saved artifacts separately:

model = RBLNModel.from_pretrained(save_dir)

Device / tensor parallel configuration errors

Validation of device and tensor_parallel_size may raise any of these errors:

Device {device_id} is not a valid NPU device. Please check your NPU status with 'rbln-smi' command.
Tensor parallel size {N} is greater than the number of available devices {M}.
The number of devices ({len_device}) does not match tensor parallel size ({tensor_parallel_size}).

Run rbln-smi to list available devices, then verify the following:

  • The specified device IDs exist on the system.
  • tensor_parallel_size ≤ the number of available devices.
  • The device list length equals tensor_parallel_size.
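The three checks above can be run as a pre-flight step before compilation. The sketch below is illustrative only: check_device_config and its arguments are hypothetical names, not part of Optimum RBLN, and the list of available device IDs is assumed to have been read off the rbln-smi output by hand.

```python
from typing import List, Sequence


def check_device_config(
    devices: Sequence[int],
    tensor_parallel_size: int,
    available: Sequence[int],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passes.

    Hypothetical helper: mirrors the three conditions listed above,
    not the library's own validation code.
    """
    problems = []

    # 1. Every requested device ID must exist on the system.
    for device_id in devices:
        if device_id not in available:
            problems.append(f"device {device_id} is not an available NPU")

    # 2. tensor_parallel_size must not exceed the number of available devices.
    if tensor_parallel_size > len(available):
        problems.append(
            f"tensor_parallel_size {tensor_parallel_size} exceeds "
            f"{len(available)} available devices"
        )

    # 3. The device list length must equal tensor_parallel_size.
    if len(devices) != tensor_parallel_size:
        problems.append(
            f"{len(devices)} devices given, but tensor_parallel_size is "
            f"{tensor_parallel_size}"
        )

    return problems


if __name__ == "__main__":
    # Assumed example: four NPUs (IDs 0-3) reported by rbln-smi.
    for problem in check_device_config([0, 5], 4, [0, 1, 2, 3]):
        print(problem)
```

Collecting every problem instead of stopping at the first one makes it easier to fix a misconfigured device list in a single pass.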

Flash attention configuration errors

Flash attention is enabled by setting attn_impl="flash_attn" or by specifying kvcache_partition_len. Compilation fails with one of the following errors when a constraint is violated:

`max_seq_len` ({X}) must be a multiple of `kvcache_partition_len` ({Y}) when using 'flash_attn'.
`kvcache_partition_len` ({X}) is out of the supported range (4096 <= kvcache_partition_len <= 32768).
`max_seq_len` ({X}) is too small for 'flash_attn'. The minimum supported value is 8192.

The constraints behind these errors are:

  • 4,096 ≤ kvcache_partition_len ≤ 32,768
  • max_seq_len ≥ 8,192
  • max_seq_len must be a multiple of kvcache_partition_len and at least 2 × kvcache_partition_len
    (e.g. kvcache_partition_len=16,384 requires max_seq_len ≥ 32,768)
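The constraints above can be checked before starting a long compilation. The following is a minimal sketch; flash_attn_config_errors is a hypothetical helper written for this guide, not a function provided by Optimum RBLN, and it only restates the limits listed above.

```python
from typing import List

# Limits taken from the constraints listed above.
MIN_PARTITION_LEN = 4096
MAX_PARTITION_LEN = 32768
MIN_MAX_SEQ_LEN = 8192


def flash_attn_config_errors(max_seq_len: int, kvcache_partition_len: int) -> List[str]:
    """Return every violated flash attention constraint (empty list if none)."""
    errors = []

    if not MIN_PARTITION_LEN <= kvcache_partition_len <= MAX_PARTITION_LEN:
        errors.append("kvcache_partition_len is outside 4096..32768")

    if max_seq_len < MIN_MAX_SEQ_LEN:
        errors.append("max_seq_len is below 8192")

    # Guard against a zero or negative partition length before using the modulo.
    if kvcache_partition_len <= 0 or max_seq_len % kvcache_partition_len != 0:
        errors.append("max_seq_len is not a multiple of kvcache_partition_len")

    if max_seq_len < 2 * kvcache_partition_len:
        errors.append("max_seq_len is less than 2 x kvcache_partition_len")

    return errors


if __name__ == "__main__":
    print(flash_attn_config_errors(32768, 16384))  # valid pair from the example above
    print(flash_attn_config_errors(16384, 16384))  # violates the 2 x rule
```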

Logging and debugging

Control log verbosity with the OPTIMUM_RBLN_VERBOSE environment variable (default: info):

$ OPTIMUM_RBLN_VERBOSE=debug python inference.py    # detailed logs
$ OPTIMUM_RBLN_VERBOSE=warning python inference.py  # warnings and errors only

Supported levels: debug, info, warning, error, critical.
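When the script is launched from a notebook or another Python process, the variable can also be set from Python. This is a sketch under two assumptions: set_rbln_verbosity is a hypothetical helper, and the variable is assumed to be read when optimum.rbln is imported, so the call is placed before that import.

```python
import os

# Levels listed above as supported.
SUPPORTED_LEVELS = ("debug", "info", "warning", "error", "critical")


def set_rbln_verbosity(level: str) -> str:
    """Validate the level and export it as OPTIMUM_RBLN_VERBOSE.

    Hypothetical helper; returns the normalized level that was set.
    """
    normalized = level.strip().lower()
    if normalized not in SUPPORTED_LEVELS:
        raise ValueError(
            f"unsupported level {level!r}; expected one of {SUPPORTED_LEVELS}"
        )
    os.environ["OPTIMUM_RBLN_VERBOSE"] = normalized
    return normalized


if __name__ == "__main__":
    # Assumption: call this before importing optimum.rbln so the setting is picked up.
    set_rbln_verbosity("debug")
```

Rejecting unknown values early avoids silently falling back to a level other than the one intended.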

