Triton Ops API (RBLN NPU-Specific)
Functions
to_dynamic_index(data, max=-1)
Convert a constant buffer to a dynamic index for memory load and store operations.
Syntax
to_dynamic_index(data, max=-1) -> index
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | A constant buffer holding the dynamic integer index. | required |
| max | int | An upper bound value for the dynamic index. This serves as a compiler hint. If not specified, the compiler infers the size from the output shape of the operation. Defaults to -1. | -1 |
Returns:
| Name | Type | Description |
|---|---|---|
| index | dynamic_index | The dynamic index for memory load and store operations. |
dynamic_load(ptr, axis=-1, index=None)
Load a value from memory (ptr) with a specified length along a given axis.
Syntax
dynamic_load(ptr, axis=-1, index=None) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ptr | tensor | The memory pointer from which data is retrieved. | required |
| axis | int | The axis along which the data is loaded. If neither axis nor index is specified, the entire data in ptr is loaded. Defaults to -1. | -1 |
| index | dynamic_index | The size of the data to be loaded. This value comes from the result of a to_dynamic_index operation. Defaults to None. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The loaded data. |
dynamic_store(ptr, value, axis=-1, index=None)
Store a value to memory (ptr) with a specified length along a given axis.
Syntax
dynamic_store(ptr, value, axis=-1, index=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ptr | tensor | The memory pointer where the data will be stored. | required |
| value | tensor | The tensor data to be stored. | required |
| axis | int | The axis along which the data is stored. If neither axis nor index is specified, the entire value is stored to ptr. Defaults to -1. | -1 |
| index | dynamic_index | The size of the data to be stored. This value comes from the result of a to_dynamic_index operation. Defaults to None. | None |
|
Returns:
| Type | Description |
|---|---|
| None | |
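A minimal usage sketch combining the three ops above, assuming a kernel that copies only the first valid rows of a buffer. seq_len_buf, src, dst, and MAX_SEQ are illustrative names; the surrounding kernel boilerplate is omitted.
# seq_len_buf is a constant buffer holding the runtime length
idx = to_dynamic_index(seq_len_buf, max=MAX_SEQ)   # MAX_SEQ is a compile-time upper bound
# load only the valid rows of src along axis 0
valid = dynamic_load(src, axis=0, index=idx)
# store the same number of rows back to dst along axis 0
dynamic_store(dst, valid, axis=0, index=idx)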
insert(data, updates, axis, index)
Updates the data tensor with updates at the specified index along the given axis.
Syntax
insert(data, updates, axis, index) -> updated
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The original tensor. | required |
| updates | tensor | The new tensor containing updates. The shape of updates must match the shape of data, except along the specified axis. | required |
| axis | int | The axis along which updates are inserted. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index < data.shape[axis]. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| updated | tensor | The updated tensor. |
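For example, insert can be used to write the current decoding step's key/value tensor into a preallocated cache at a runtime position. A sketch, assuming a cache of shape [Nmb, Head, MaxSeq, Dim] and a position obtained from to_dynamic_index (kv_cache, kv_new, step_buf, and MAX_SEQ are illustrative names):
pos = to_dynamic_index(step_buf, max=MAX_SEQ)           # current decoding position
# kv_new matches kv_cache except along axis 2 (the sequence axis)
kv_cache = insert(kv_cache, kv_new, axis=2, index=pos)  # write at position pos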
softmax(input, axis=-1)
Applies the softmax function to the input tensor along the specified axis.
Syntax
softmax(input, axis=-1) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. | required |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input tensor is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The softmax output. |
masked_softmax(input, mask, sink=None, axis=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by the mask. Only "valid" elements (where the mask is non-zero) contribute to the exponential sum, and "invalid" elements (where the mask is zero) are assigned a probability of zero in the output. This is typically used in attention mechanisms. When the sink tensor is specified, a sink attention operation is performed.
Syntax
masked_softmax(input, mask, sink=None, axis=-1) -> output
Algorithm
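A pseudocode sketch of the computation, mirroring the algorithm shown for dynamic_masked_softmax below; the exact kernel implementation may differ. MASK_VALUE denotes a large negative fill value.
softmax_input = where(mask != 0, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]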
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The masked softmax output. |
dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated causal mask. The causal mask is created from the mask tensor by applying a causal constraint starting at the offset given by index (the current decoding step). Invalid elements contribute zero to the exponential sum and result in a probability of zero. When the sink tensor is specified, a sink attention operation is performed.
Syntax
dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1) -> output
Algorithm
if mask is None:
    # mask[i] = 1 if i < length else 0
    mask = create_mask(length, channel)
# mask.shape = [Nmb, Channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. Defaults to None. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The causal masked softmax output. |
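A usage sketch for a decode step, where scores is the attention score tensor and the current offset comes from to_dynamic_index (scores, step_buf, and MAX_SEQ are illustrative names):
offset = to_dynamic_index(step_buf, max=MAX_SEQ)
# the causal mask starting at offset is generated internally
probs = dynamic_masked_softmax(scores, offset, axis=-1)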
flash_attn_tile(input, mask, row_max_i=None)
Performs a partial softmax operation on a specific partition (tile) of the attention matrix to support the Flash Attention algorithm. It produces the updated row statistics — row_max_global, row_exp_norm, row_sum_cur — required for incremental normalization. Only "valid" elements (where the mask is non-zero) contribute to the statistics.
Syntax
flash_attn_tile(input, mask, row_max_i=None) -> (row_max_global, row_exp_norm, row_sum_cur)
Algorithm
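A pseudocode sketch of the per-tile statistics computation, mirroring the algorithm shown for dynamic_flash_attn_tile below; the exact kernel implementation may differ. MASK_VALUE denotes a large negative fill value.
softmax_input = where(mask != 0, input, MASK_VALUE)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)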
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| row_max_i | tensor | The running maximum value of the softmax rows accumulated from previous computation partitions. When the current partition is the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. Defaults to None. | None |
|
Returns:
| Name | Type | Description |
|---|---|---|
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. The shape of row_max_global is the same as the row_max_i tensor. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. The shape of row_exp_norm is the same as the input tensor. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. The shape of row_sum_cur is the same as the row_max_i tensor. |
dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None)
Performs a partial softmax operation on a specific partition of the attention matrix with an internally generated dynamic mask. It computes updated row statistics (row_max_global, row_exp_norm, row_sum_cur) required for the Flash Attention algorithm. The dynamic mask is internally generated by applying a causal constraint to the mask tensor starting from a given offset. Only "valid" elements (where the mask is non-zero) contribute to the statistics.
Syntax
dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None) -> (row_max_global, row_exp_norm, row_sum_cur)
Algorithm
if mask is None:
    # mask[i] = 1 if i < length else 0
    mask = create_padding_mask(length, channel)
# mask.shape = [Nmb, Channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)
if sink is not None:
    row_exp_norm = row_exp_norm[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| row_max_i | tensor | The running maximum value of the softmax rows accumulated from previous computation partitions. When the current partition is the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. Defaults to None. | None |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. Defaults to None. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
|
Returns:
| Name | Type | Description |
|---|---|---|
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. The shape of row_max_global is the same as the row_max_i tensor. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. The shape of row_exp_norm is the same as the input tensor. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. The shape of row_sum_cur is the same as the row_max_i tensor. |
flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur)
Updates and merges attention outputs and row statistics from different partitions to produce a globally normalized result. It is used in the Flash Attention algorithm to consolidate local results (row_max_global, row_sum_cur, attn_out_cur) with previously accumulated statistics (row_max_prev, row_sum_prev, attn_out_prev).
Syntax
flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur) -> (row_sum_prev, attn_out_prev)
Algorithm
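This op corresponds to the standard Flash Attention rescale-and-accumulate step. A pseudocode sketch (scale_prev is an intermediate name introduced here; the exact kernel implementation may differ):
# rescale previously accumulated statistics to the new global row maximum
scale_prev = exp(row_max_prev - row_max_global)
row_sum_prev = row_sum_prev * scale_prev + row_sum_cur
attn_out_prev = attn_out_prev * scale_prev + attn_out_cur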
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row_max_prev | tensor | The previous maximum value for each row. | required |
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. | required |
| row_sum_prev | tensor | The sum of exponentials for the previous tile. | required |
| row_sum_cur | tensor | The sum of exponentials for the current tile. | required |
| attn_out_prev | tensor | The attention output for the previous tile. | required |
| attn_out_cur | tensor | The attention output for the current tile. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| row_sum_prev | tensor | The updated sum of exponentials for the previous tile. |
| attn_out_prev | tensor | The updated attention output for the previous tile. |
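Putting the two ops together, a tiled attention loop might look like the following sketch. q, tiles, k_tile, v_tile, mask_tile, dot, and trans are illustrative names; the final division by the accumulated row sum happens after the loop, outside both ops.
row_max_prev = None
row_sum_prev = None
attn_out_prev = None
for k_tile, v_tile, mask_tile in tiles:
    scores = dot(q, trans(k_tile))                       # attention scores for this tile
    row_max_global, row_exp_norm, row_sum_cur = flash_attn_tile(
        scores, mask_tile, row_max_i=row_max_prev)
    attn_out_cur = dot(row_exp_norm, v_tile)             # partial attention output
    if attn_out_prev is None:
        row_sum_prev, attn_out_prev = row_sum_cur, attn_out_cur
    else:
        row_sum_prev, attn_out_prev = flash_attn_recompute(
            row_max_prev, row_max_global, row_sum_prev, row_sum_cur,
            attn_out_prev, attn_out_cur)
    row_max_prev = row_max_global
output = attn_out_prev / row_sum_prev                    # final normalization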
window_insert(data, updates, axis, index)
Insert updates into the data tensor at index along the axis, or append to the end if index >= data.shape[axis].
Syntax
window_insert(data, updates, axis, index) -> updated
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The base tensor. | required |
| updates | tensor | The tensor containing the elements to be inserted. | required |
| axis | int | The axis of insertion. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| updated | tensor | The updated tensor. |
window_slice(data, index, axis, window_size)
Slices data along axis using index as the offset. The selected range along axis is [0:index] if index < window_size; [index - window_size : index] if window_size <= index < data.shape[axis]; and [data.shape[axis] - window_size : data.shape[axis]] if index >= data.shape[axis].
Syntax
window_slice(data, index, axis, window_size) -> sliced
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The input tensor. | required |
| index | dynamic_index | The slicing offset. Constraint: 0 <= index. | required |
| axis | int | The axis along which to slice. Constraint: 0 <= axis < len(data.shape). | required |
| window_size | int | The size of the window to slice. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sliced | tensor | The sliced tensor. |
window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated sliding_window_mask. The sliding_window_mask is built from window_size and the starting offset given by index; it limits the valid range for each sequence index i to the interval [i + offset - W + 1, i + offset], where W is window_size. When the sink tensor is specified, a sink attention operation is performed.
Syntax
window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1) -> output
Algorithm
# mask[i] = 1 if i < offset else 0
mask = create_padding_mask(offset, channel)
# mask.shape = [Nmb, Channel], sliding_window_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# sliding_window_mask: [i + offset - W + 1, i + offset]
sliding_window_mask = make_sliding_window_mask(expanded_mask, offset, window_size)
softmax_input = where(sliding_window_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. | required |
| index | dynamic_index | The starting offset of the current input sequence. Constraint: 0 <= index <= window_size. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| window_size | int | The size of the sliding window. Constraint: window_size >= 128 and window_size % 64 == 0. Defaults to None. | None |
| axis | int | The axis along which softmax is computed. Defaults to -1. | -1 |
| pad_c | int | Padding constant. Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The window softmax output. |
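A sliding-window attention sketch combining the window ops above. The names (step_buf, win_step_buf, k_cache, k_new, q, W, MAX_SEQ, dot, trans) are illustrative, and the exact relationship between the absolute cache position and the in-window offset depends on the kernel.
pos = to_dynamic_index(step_buf, max=MAX_SEQ)        # absolute cache position
win_pos = to_dynamic_index(win_step_buf, max=W)      # offset within the window, 0 <= win_pos <= W
# insert the new keys at pos, or append at the end once pos reaches the cache length
k_cache = window_insert(k_cache, k_new, axis=2, index=pos)
# keep only the window_size entries visible at pos
k_window = window_slice(k_cache, pos, axis=2, window_size=W)
scores = dot(q, trans(k_window))
# softmax restricted to the sliding window determined by win_pos
probs = window_softmax(scores, win_pos, window_size=W, axis=-1)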
nn_pad(input, value, axis, pad_size, mode)
Pads a tensor along a specified axis with a constant value. The input tensor is padded along the given axis by adding pad_size elements on both sides (symmetric padding) or only one side (depending on mode). Commonly used for preparing tensors for convolution, pooling, or other operations that require specific input dimensions (e.g., padding to a multiple of 64 for hardware vectorization in Rebel).
Syntax
nn_pad(input, value, axis, pad_size, mode) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor to be padded. | required |
| value | int | The constant value used to fill the padded regions. The type of the value must match the input tensor's data type (e.g., float32 value for a float32 tensor). | required |
| axis | int | The dimension along which padding is applied. | required |
| pad_size | int | The number of elements to pad on each side of the axis. pad_size = [left_pad_value, right_pad_value]. The total number of added elements is 2 * pad_size (for symmetric constant/reflect/replicate modes). | required |
| mode | str | The padding mode. Supported values: "constant" (default) - Pad with the specified constant value; "reflect" - Pad by reflecting values at the border; "edge" - Pad by repeating the edge value. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The padded tensor with the same data type as the input. The shape along the specified axis increases by 2 * pad_size (for symmetric constant/reflect/replicate modes). |
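A minimal usage sketch, assuming the single-integer, symmetric form of pad_size described above (so the axis grows by 2 * pad_size). x and PAD are illustrative names.
# x has shape [N, C]; add PAD zero elements on each side of axis 1
x_padded = nn_pad(x, 0, axis=1, pad_size=PAD, mode="constant")
# x_padded.shape[1] == x.shape[1] + 2 * PAD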