Triton Ops API (RBLN NPU-Specific)
Functions
to_dynamic_index(data, max=-1)
Convert a constant buffer to a dynamic index for memory load and store operations.
Syntax
to_dynamic_index(data, max=-1) -> index
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | A constant buffer holding the dynamic integer index. | required |
| max | int | An upper bound value for the dynamic index. This serves as a compiler hint. If not specified, the compiler infers the size from the output shape of the operation. Defaults to -1. | -1 |
Returns:
| Name | Type | Description |
|---|---|---|
| index | dynamic_index | The dynamic index for memory load and store operations. |
dynamic_load(ptr, axis=-1, index=None)
Load a value from memory (ptr) with a specified length along a given axis.
Syntax
dynamic_load(ptr, axis=-1, index=None) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ptr | tensor | The memory pointer from which data is retrieved. | required |
| axis | int | The axis along which the data is loaded. If neither axis nor index is specified, the entire data in ptr is loaded. Defaults to -1. | -1 |
| index | dynamic_index | The size of the data to be loaded. This value comes from the result of a to_dynamic_index operation. Defaults to None. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The loaded data. |
dynamic_store(ptr, value, axis=-1, index=None)
Store a value to memory (ptr) with a specified length along a given axis.
Syntax
dynamic_store(ptr, value, axis=-1, index=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ptr | tensor | The memory pointer where the data will be stored. | required |
| value | tensor | The tensor data to be stored. | required |
| axis | int | The axis along which the data is stored. If neither axis nor index is specified, the entire value is stored to ptr. Defaults to -1. | -1 |
| index | dynamic_index | The size of the data to be stored. This value comes from the result of a to_dynamic_index operation. Defaults to None. | None |
|
Returns:
| Type | Description |
|---|---|
| None | |
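A minimal usage sketch combining the three ops above, assuming a kernel that copies only the first valid rows of a buffer. seq_len_buf, src, dst, and MAX_SEQ are illustrative names; the surrounding kernel boilerplate is omitted.
# seq_len_buf is a constant buffer holding the runtime length
idx = to_dynamic_index(seq_len_buf, max=MAX_SEQ)   # MAX_SEQ is a compile-time upper bound
# load only the valid rows of src along axis 0
valid = dynamic_load(src, axis=0, index=idx)
# store the same number of rows back to dst along axis 0
dynamic_store(dst, valid, axis=0, index=idx)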
insert(data, updates, axis, index)
Updates the data tensor with updates at the specified index along the given axis.
Syntax
insert(data, updates, axis, index) -> updated
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The original tensor. | required |
| updates | tensor | The new tensor containing updates. The shape of updates must match the shape of data, except along the specified axis. | required |
| axis | int | The axis along which updates are inserted. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index < data.shape[axis]. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| updated | tensor | The updated tensor. |
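For example, insert can be used to write the current decoding step's key/value tensor into a preallocated cache at a runtime position. A sketch, assuming a cache of shape [Nmb, Head, MaxSeq, Dim] and a position obtained from to_dynamic_index (kv_cache, kv_new, step_buf, and MAX_SEQ are illustrative names):
pos = to_dynamic_index(step_buf, max=MAX_SEQ)           # current decoding position
# kv_new matches kv_cache except along axis 2 (the sequence axis)
kv_cache = insert(kv_cache, kv_new, axis=2, index=pos)  # write at position pos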
softmax(input, axis=-1)
Applies the softmax function to the input tensor along the specified axis.
Syntax
softmax(input, axis=-1) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. | required |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input tensor is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The softmax output. |
masked_softmax(input, mask, sink=None, axis=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by the mask. Only "valid" elements (where the mask is non-zero) contribute to the exponential sum, and "invalid" elements (where the mask is zero) are assigned a probability of zero in the output. This is typically used in attention mechanisms. When the sink tensor is specified, a sink attention operation is performed.
Syntax
masked_softmax(input, mask, sink=None, axis=-1) -> output
Algorithm
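A pseudocode sketch of the computation, mirroring the algorithm shown for dynamic_masked_softmax below; the exact kernel implementation may differ. MASK_VALUE denotes a large negative fill value.
softmax_input = where(mask != 0, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]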
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The masked softmax output. |
dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated causal mask. The causal mask is created from the mask tensor by applying a causal constraint starting at the offset given by index (the current decoding step). Invalid elements contribute zero to the exponential sum and result in a probability of zero. When the sink tensor is specified, a sink attention operation is performed.
Syntax
dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1) -> output
Algorithm
if mask is None:
    # mask[i] = 1 if i < length else 0
    mask = create_mask(length, channel)
# mask.shape = [Nmb, Channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. Defaults to None. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used as the axis. Constraint: 0 <= axis < len(input.shape). Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The causal masked softmax output. |
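A usage sketch for a decode step, where scores is the attention score tensor and the current offset comes from to_dynamic_index (scores, step_buf, and MAX_SEQ are illustrative names):
offset = to_dynamic_index(step_buf, max=MAX_SEQ)
# the causal mask starting at offset is generated internally
probs = dynamic_masked_softmax(scores, offset, axis=-1)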
flash_attn_tile(input, mask, row_max_i=None)
Performs a partial softmax operation on a specific partition (tile) of the attention matrix to support the Flash Attention algorithm. It produces the updated row statistics — row_max_global, row_exp_norm, row_sum_cur — required for incremental normalization. Only "valid" elements (where the mask is non-zero) contribute to the statistics.
Syntax
flash_attn_tile(input, mask, row_max_i=None) -> (row_max_global, row_exp_norm, row_sum_cur)
Algorithm
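A pseudocode sketch of the per-tile statistics computation, mirroring the algorithm shown for dynamic_flash_attn_tile below; the exact kernel implementation may differ. MASK_VALUE denotes a large negative fill value.
softmax_input = where(mask != 0, input, MASK_VALUE)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)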
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| row_max_i | tensor | The running maximum value of the softmax rows accumulated from previous computation partitions. When the current partition is the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. Defaults to None. | None |
|
Returns:
| Name | Type | Description |
|---|---|---|
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. The shape of row_max_global is the same as the row_max_i tensor. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. The shape of row_exp_norm is the same as the input tensor. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. The shape of row_sum_cur is the same as the row_max_i tensor. |
dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None)
Performs a partial softmax operation on a specific partition of the attention matrix with an internally generated dynamic mask. It computes updated row statistics (row_max_global, row_exp_norm, row_sum_cur) required for the Flash Attention algorithm. The dynamic mask is internally generated by applying a causal constraint to the mask tensor starting from a given offset. Only "valid" elements (where the mask is non-zero) contribute to the statistics.
Syntax
dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None) -> (row_max_global, row_exp_norm, row_sum_cur)
Algorithm
if mask is None:
    # mask[i] = 1 if i < length else 0
    mask = create_padding_mask(length, channel)
# mask.shape = [Nmb, Channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)
if sink is not None:
    row_exp_norm = row_exp_norm[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| row_max_i | tensor | The running maximum value of the softmax rows accumulated from previous computation partitions. When the current partition is the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. Defaults to None. | None |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. Defaults to None. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
|
Returns:
| Name | Type | Description |
|---|---|---|
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. The shape of row_max_global is the same as the row_max_i tensor. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. The shape of row_exp_norm is the same as the input tensor. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. The shape of row_sum_cur is the same as the row_max_i tensor. |
flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur)
Updates and merges attention outputs and row statistics from different partitions to produce a globally normalized result. It is used in the Flash Attention algorithm to consolidate local results (row_max_global, row_sum_cur, attn_out_cur) with previously accumulated statistics (row_max_prev, row_sum_prev, attn_out_prev).
Syntax
flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur) -> (row_sum_prev, attn_out_prev)
Algorithm
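This op corresponds to the standard Flash Attention rescale-and-accumulate step. A pseudocode sketch (scale_prev is an intermediate name introduced here; the exact kernel implementation may differ):
# rescale previously accumulated statistics to the new global row maximum
scale_prev = exp(row_max_prev - row_max_global)
row_sum_prev = row_sum_prev * scale_prev + row_sum_cur
attn_out_prev = attn_out_prev * scale_prev + attn_out_cur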
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row_max_prev | tensor | The previous maximum value for each row. | required |
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. | required |
| row_sum_prev | tensor | The sum of exponentials for the previous tile. | required |
| row_sum_cur | tensor | The sum of exponentials for the current tile. | required |
| attn_out_prev | tensor | The attention output for the previous tile. | required |
| attn_out_cur | tensor | The attention output for the current tile. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| row_sum_prev | tensor | The updated sum of exponentials for the previous tile. |
| attn_out_prev | tensor | The updated attention output for the previous tile. |
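Putting the two ops together, a tiled attention loop might look like the following sketch. q, tiles, k_tile, v_tile, mask_tile, dot, and trans are illustrative names; the final division by the accumulated row sum happens after the loop, outside both ops.
row_max_prev = None
row_sum_prev = None
attn_out_prev = None
for k_tile, v_tile, mask_tile in tiles:
    scores = dot(q, trans(k_tile))                       # attention scores for this tile
    row_max_global, row_exp_norm, row_sum_cur = flash_attn_tile(
        scores, mask_tile, row_max_i=row_max_prev)
    attn_out_cur = dot(row_exp_norm, v_tile)             # partial attention output
    if attn_out_prev is None:
        row_sum_prev, attn_out_prev = row_sum_cur, attn_out_cur
    else:
        row_sum_prev, attn_out_prev = flash_attn_recompute(
            row_max_prev, row_max_global, row_sum_prev, row_sum_cur,
            attn_out_prev, attn_out_cur)
    row_max_prev = row_max_global
output = attn_out_prev / row_sum_prev                    # final normalization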
window_insert(data, updates, axis, index)
Insert updates into the data tensor at index along the axis, or append to the end if index >= data.shape[axis].
Syntax
window_insert(data, updates, axis, index) -> updated
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The base tensor. | required |
| updates | tensor | The tensor containing the elements to be inserted. | required |
| axis | int | The axis of insertion. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| updated | tensor | The updated tensor. |
window_slice(data, index, axis, window_size)
Slices data along axis using index as the offset. The selected range along axis is [0:index] if index < window_size; [index - window_size : index] if window_size <= index < data.shape[axis]; and [data.shape[axis] - window_size : data.shape[axis]] if index >= data.shape[axis].
Syntax
window_slice(data, index, axis, window_size) -> sliced
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | tensor | The input tensor. | required |
| index | dynamic_index | The slicing offset. Constraint: 0 <= index. | required |
| axis | int | The axis along which to slice. Constraint: 0 <= axis < len(data.shape). | required |
| window_size | int | The size of the window to slice. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sliced | tensor | The sliced tensor. |
window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1)
Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated sliding_window_mask. The sliding_window_mask is built from window_size and the starting offset given by index; it limits the valid range for each sequence index i to the interval [i + offset - W + 1, i + offset], where W is window_size. When the sink tensor is specified, a sink attention operation is performed.
Syntax
window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1) -> output
Algorithm
# mask[i] = 1 if i < offset else 0
mask = create_padding_mask(offset, channel)
# mask.shape = [Nmb, Channel], sliding_window_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# sliding_window_mask: [i + offset - W + 1, i + offset]
sliding_window_mask = make_sliding_window_mask(expanded_mask, offset, window_size)
softmax_input = where(sliding_window_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[..., :-1]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor. | required |
| index | dynamic_index | The starting offset of the current input sequence. Constraint: 0 <= index <= window_size. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. Defaults to None. | None |
| window_size | int | The size of the sliding window. Constraint: window_size >= 128 and window_size % 64 == 0. Defaults to None. | None |
| axis | int | The axis along which softmax is computed. Defaults to -1. | -1 |
| pad_c | int | Padding constant. Defaults to -1. | -1 |
|
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The window softmax output. |
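A sliding-window attention sketch combining the window ops above. The names (step_buf, win_step_buf, k_cache, k_new, q, W, MAX_SEQ, dot, trans) are illustrative, and the exact relationship between the absolute cache position and the in-window offset depends on the kernel.
pos = to_dynamic_index(step_buf, max=MAX_SEQ)        # absolute cache position
win_pos = to_dynamic_index(win_step_buf, max=W)      # offset within the window, 0 <= win_pos <= W
# insert the new keys at pos, or append at the end once pos reaches the cache length
k_cache = window_insert(k_cache, k_new, axis=2, index=pos)
# keep only the window_size entries visible at pos
k_window = window_slice(k_cache, pos, axis=2, window_size=W)
scores = dot(q, trans(k_window))
# softmax restricted to the sliding window determined by win_pos
probs = window_softmax(scores, win_pos, window_size=W, axis=-1)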
nn_pad(input, value, axis, pad_size, mode)
Pads a tensor along a specified axis with a constant value. The input tensor is padded along the given axis by adding pad_size elements on both sides (symmetric padding) or only one side (depending on mode). Commonly used for preparing tensors for convolution, pooling, or other operations that require specific input dimensions (e.g., padding to a multiple of 64 for hardware vectorization in Rebel).
Syntax
nn_pad(input, value, axis, pad_size, mode) -> output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | tensor | The input tensor to be padded. | required |
| value | int | The constant value used to fill the padded regions. The type of the value must match the input tensor's data type (e.g., float32 value for a float32 tensor). | required |
| axis | int | The dimension along which padding is applied. | required |
| pad_size | int | The number of elements to pad on each side of the axis. pad_size = [left_pad_value, right_pad_value]. The total number of added elements is 2 * pad_size (for symmetric constant/reflect/replicate modes). | required |
| mode | str | The padding mode. Supported values: "constant" (default) - Pad with the specified constant value; "reflect" - Pad by reflecting values at the border; "edge" - Pad by repeating the edge value. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| output | tensor | The padded tensor with the same data type as the input. The shape along the specified axis increases by 2 * pad_size (for symmetric constant/reflect/replicate modes). |
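A minimal usage sketch, assuming the single-integer, symmetric form of pad_size described above (so the axis grows by 2 * pad_size). x and PAD are illustrative names.
# x has shape [N, C]; add PAD zero elements on each side of axis 1
x_padded = nn_pad(x, 0, axis=1, pad_size=PAD, mode="constant")
# x_padded.shape[1] == x.shape[1] + 2 * PAD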