Triton Ops API (RBLN NPU-specific)

Functions

to_dynamic_index(data, max=-1)

Convert a constant buffer to a dynamic index for memory load and store operations.

Syntax

to_dynamic_index(data, max=-1) -> index

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | tensor | A constant buffer holding the dynamic integer index. | required |
| max | int | An upper bound for the dynamic index, used as a compiler hint. If not specified, the compiler infers the size from the output shape of the operation. | -1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| index | dynamic_index | The dynamic index for memory load and store operations. |

dynamic_load(ptr, axis=-1, index=None)

Load a value from memory (ptr) with a specified length along a given axis.

Syntax

dynamic_load(ptr, axis=-1, index=None) -> output

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ptr | tensor | The memory pointer from which data is retrieved. | required |
| axis | int | The axis along which the data is loaded. If neither axis nor index is specified, the entire data in ptr is loaded. | -1 |
| index | dynamic_index | The size of the data to be loaded, produced by a to_dynamic_index operation. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The loaded data. |

dynamic_store(ptr, value, axis=-1, index=None)

Store a value to memory (ptr) with a specified length along a given axis.

Syntax

dynamic_store(ptr, value, axis=-1, index=None)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ptr | tensor | The memory pointer where the data will be stored. | required |
| value | tensor | The tensor data to be stored. | required |
| axis | int | The axis along which the data is stored. If neither axis nor index is specified, the entire value is stored to ptr. | -1 |
| index | dynamic_index | The size of the data to be stored, produced by a to_dynamic_index operation. | None |

Returns:

| Type | Description |
| --- | --- |
| None | This operation does not return a value. |
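
For illustration, the sketch below combines the three operations above: a length read from a constant buffer becomes a dynamic index that bounds both the load and the store, so only the valid prefix of a cache is moved. The rtl alias (an assumed import path for the module that exposes these ops) and all kernel, buffer, and constant names are assumptions for the example, not part of the API.

Example
# Illustrative sketch, not a reference implementation.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path for the ops documented here

@triton.jit
def copy_valid_prefix_kernel(src_ptr, dst_ptr, len_ptr,
                             MAX_SEQ: tl.constexpr, HEAD_DIM: tl.constexpr):
    # Read the current valid length from a 1-element constant buffer.
    cur_len = tl.load(len_ptr + tl.arange(0, 1))
    # Convert it to a dynamic index; MAX_SEQ is the compiler hint for its upper bound.
    idx = rtl.to_dynamic_index(cur_len, max=MAX_SEQ)

    offs_s = tl.arange(0, MAX_SEQ)
    offs_d = tl.arange(0, HEAD_DIM)
    src = src_ptr + offs_s[:, None] * HEAD_DIM + offs_d[None, :]
    dst = dst_ptr + offs_s[:, None] * HEAD_DIM + offs_d[None, :]

    # Load only the first `idx` rows along the sequence axis (axis 0), then store them back.
    block = rtl.dynamic_load(src, axis=0, index=idx)
    rtl.dynamic_store(dst, block, axis=0, index=idx)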

insert(data, updates, axis, index)

Updates the data tensor by inserting updates at the given index along the specified axis.

Syntax

insert(data, updates, axis, index) -> updated

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | tensor | The original tensor. | required |
| updates | tensor | The new tensor containing updates. The shape of updates must match the shape of data, except along the specified axis. | required |
| axis | int | The axis along which updates are inserted. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index < data.shape[axis]. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| updated | tensor | The updated tensor. |
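
A typical use of insert is updating a single position of a preallocated KV cache at the current decoding step, as sketched below. The rtl alias (an assumed import path) and the kernel, buffer, and constant names are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def kv_cache_update_kernel(cache_ptr, new_kv_ptr, step_ptr,
                           MAX_SEQ: tl.constexpr, HEAD_DIM: tl.constexpr):
    offs_s = tl.arange(0, MAX_SEQ)
    offs_d = tl.arange(0, HEAD_DIM)
    # Cache block [MAX_SEQ, HEAD_DIM] and new entry [1, HEAD_DIM]:
    # the shapes differ only along axis 0, as insert requires.
    cache = tl.load(cache_ptr + offs_s[:, None] * HEAD_DIM + offs_d[None, :])
    new_kv = tl.load(new_kv_ptr + tl.arange(0, 1)[:, None] * HEAD_DIM + offs_d[None, :])

    # Current decoding step, converted to a dynamic index bounded by MAX_SEQ.
    step = rtl.to_dynamic_index(tl.load(step_ptr + tl.arange(0, 1)), max=MAX_SEQ)

    # Write the new entry into the cache at position `step` along the sequence axis.
    updated = rtl.insert(cache, new_kv, axis=0, index=step)
    tl.store(cache_ptr + offs_s[:, None] * HEAD_DIM + offs_d[None, :], updated)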

softmax(input, axis=-1)

Applies the softmax function to the input tensor along the specified axis.

Syntax

softmax(input, axis=-1) -> output

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. | required |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input tensor is used. Constraint: 0 <= axis < len(input.shape). | -1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The softmax output. |
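
For reference, a minimal sketch that normalizes each row of a 2-D block of attention scores. The rtl alias (an assumed import path) and all names and tile sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def softmax_rows_kernel(score_ptr, out_ptr, SEQ: tl.constexpr, CHANNEL: tl.constexpr):
    offs_s = tl.arange(0, SEQ)
    offs_c = tl.arange(0, CHANNEL)
    ptrs = score_ptr + offs_s[:, None] * CHANNEL + offs_c[None, :]
    scores = tl.load(ptrs)  # [SEQ, CHANNEL] block

    # Normalize along the last axis: one probability distribution per row.
    probs = rtl.softmax(scores, axis=-1)
    tl.store(out_ptr + offs_s[:, None] * CHANNEL + offs_c[None, :], probs)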

masked_softmax(input, mask, sink=None, axis=-1)

Applies the softmax function only to the valid elements of the input tensor, as defined by the mask. Only "valid" elements (where the mask is non-zero) contribute to the exponential sum, and "invalid" elements (where the mask is zero) are assigned a probability of zero in the output. This is typically used in attention mechanisms. When the sink tensor is specified, a sink attention operation is performed.

Syntax

masked_softmax(input, mask, sink=None, axis=-1) -> output

Algorithm
softmax_input = where(mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[...,:-1]

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used. Constraint: 0 <= axis < len(input.shape). | -1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The masked softmax output. |
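
As an illustration, the sketch below applies masked_softmax to one [1, 1, Seq, Channel] tile of attention scores with a per-batch padding mask, one program instance per (batch, fused head) pair. The rtl alias (an assumed import path), the row-major buffer layouts, and all names and tile sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def masked_attn_probs_kernel(score_ptr, mask_ptr, out_ptr, HG: tl.constexpr,
                             SEQ: tl.constexpr, CHANNEL: tl.constexpr):
    n = tl.program_id(0)      # batch (Nmb) index
    h = tl.program_id(1)      # fused Head x Group index
    offs_n = tl.arange(0, 1)  # singleton dims only shape the block as [1, 1, Seq, Channel]
    offs_h = tl.arange(0, 1)
    offs_s = tl.arange(0, SEQ)
    offs_c = tl.arange(0, CHANNEL)
    off4 = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
            + offs_s[None, None, :, None] * CHANNEL + offs_c[None, None, None, :])

    scores = tl.load(score_ptr + (n * HG + h) * SEQ * CHANNEL + off4)  # [1, 1, Seq, Channel]
    mask = tl.load(mask_ptr + n * SEQ * CHANNEL + off4)                # per-batch mask tile

    # Valid positions (mask != 0) are normalized; masked positions get probability zero.
    probs = rtl.masked_softmax(scores, mask, axis=-1)
    tl.store(out_ptr + (n * HG + h) * SEQ * CHANNEL + off4, probs)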

dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1)

Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated causal mask. The causal mask is created from the mask tensor with a causal constraint starting from index (the starting offset of the current input sequence, i.e. the current decoding step). Invalid elements contribute zero to the exponential sum and result in a probability of zero. When the sink tensor is specified, a sink attention operation is performed.

Syntax

dynamic_masked_softmax(input, index, mask=None, sink=None, axis=-1) -> output

Algorithm
if mask is None:
    # length = index (the starting offset of the current input sequence)
    # mask[i] = 1 if i < length else 0
    mask = create_mask(length, channel)
# mask.shape = [Nmb, channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[...,:-1]

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. | None |
| axis | int | The axis along which the softmax is computed. If not specified, the last dimension of the input is used. Constraint: 0 <= axis < len(input.shape). | -1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The causal masked softmax output. |
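
For the decode phase, the sketch below relies on the internally generated causal mask (mask=None) and only supplies the starting offset. The rtl alias (an assumed import path) and all names and tile sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def decode_attn_probs_kernel(score_ptr, step_ptr, out_ptr, HG: tl.constexpr,
                             SEQ: tl.constexpr, CHANNEL: tl.constexpr):
    n = tl.program_id(0)
    h = tl.program_id(1)
    offs_n = tl.arange(0, 1)
    offs_h = tl.arange(0, 1)
    offs_s = tl.arange(0, SEQ)
    offs_c = tl.arange(0, CHANNEL)
    off4 = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
            + offs_s[None, None, :, None] * CHANNEL + offs_c[None, None, None, :])

    # [1, 1, Seq, Channel] tile of attention scores for this (batch, head) pair.
    scores = tl.load(score_ptr + (n * HG + h) * SEQ * CHANNEL + off4)

    # Starting offset of the current sequence (e.g. the number of already-cached tokens).
    step = rtl.to_dynamic_index(tl.load(step_ptr + tl.arange(0, 1)), max=CHANNEL)

    # mask=None: the padding-plus-causal mask is generated internally from `step`.
    probs = rtl.dynamic_masked_softmax(scores, step, axis=-1)
    tl.store(out_ptr + (n * HG + h) * SEQ * CHANNEL + off4, probs)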

flash_attn_tile(input, mask, row_max_i=None)

Performs a partial softmax operation on a specific partition (tile) of the attention matrix to support the Flash Attention algorithm. It produces the updated row statistics — row_max_global, row_exp_norm, row_sum_cur — required for incremental normalization. Only "valid" elements (where the mask is non-zero) contribute to the statistics.

Syntax

flash_attn_tile(input, mask, row_max_i=None) -> (row_max_global, row_exp_norm, row_sum_cur)

Algorithm
softmax_input = where(mask, input, MASK_VALUE)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| mask | tensor | The mask tensor. Shape: [Nmb, 1, 1, Seq, Channel] or [Nmb, 1, Seq, Channel]. | required |
| row_max_i | tensor | The running maximum of the softmax rows accumulated from previous computation partitions. For the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| row_max_global | tensor | The updated maximum for each row, considering both previous partitions and the current partition. Same shape as row_max_i. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. Same shape as input. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. Same shape as row_max_i. |
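
For illustration, a first-tile sketch: row_max_i is None because no earlier partition exists, and the three returned statistics are written out so a later merge step (see flash_attn_recompute below) can fold in subsequent tiles. The rtl alias (an assumed import path), the buffer layouts, and all names and tile sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def first_tile_stats_kernel(score_ptr, mask_ptr, exp_ptr, max_ptr, sum_ptr,
                            HG: tl.constexpr, SEQ: tl.constexpr, TILE_C: tl.constexpr):
    n = tl.program_id(0)
    h = tl.program_id(1)
    offs_n = tl.arange(0, 1)
    offs_h = tl.arange(0, 1)
    offs_1 = tl.arange(0, 1)
    offs_s = tl.arange(0, SEQ)
    offs_c = tl.arange(0, TILE_C)
    off4 = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
            + offs_s[None, None, :, None] * TILE_C + offs_c[None, None, None, :])
    off_row = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
               + offs_s[None, None, :, None] + offs_1[None, None, None, :])

    scores = tl.load(score_ptr + (n * HG + h) * SEQ * TILE_C + off4)  # [1, 1, Seq, TileC]
    mask = tl.load(mask_ptr + n * SEQ * TILE_C + off4)                # per-batch mask tile

    # First partition: no running maximum yet, so row_max_i stays None.
    row_max, row_exp, row_sum = rtl.flash_attn_tile(scores, mask, row_max_i=None)

    # Persist the per-row statistics ([1, 1, Seq, 1]) and the normalized exponentials
    # for the merge step handled by flash_attn_recompute.
    tl.store(max_ptr + (n * HG + h) * SEQ + off_row, row_max)
    tl.store(sum_ptr + (n * HG + h) * SEQ + off_row, row_sum)
    tl.store(exp_ptr + (n * HG + h) * SEQ * TILE_C + off4, row_exp)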

dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None)

Performs a partial softmax operation on a specific partition of the attention matrix with an internally generated dynamic mask. It computes updated row statistics (row_max_global, row_exp_norm, row_sum_cur) required for the Flash Attention algorithm. The dynamic mask is internally generated by applying a causal constraint to the mask tensor starting from a given offset. Only "valid" elements (where the mask is non-zero) contribute to the statistics.

Syntax

dynamic_flash_attn_tile(input, index, row_max_i=None, mask=None, sink=None) -> (row_max_global, row_exp_norm, row_sum_cur)

Algorithm
if mask is None:
    # length = index (the starting offset of the current input sequence)
    # mask[i] = 1 if i < length else 0
    mask = create_padding_mask(length, channel)
# mask.shape = [Nmb, channel], causal_mask = [Nmb, 1, Seq, Channel]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# causal(tril) on expanded_mask[:, 0, 0:Seq, length:length+Seq]
causal_mask = make_causal_mask(expanded_mask, length)
softmax_input = where(causal_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
row_max_current = max(softmax_input, axis=-1)
if row_max_i is not None:
    row_max_global = max(row_max_i, row_max_current)
else:
    row_max_global = row_max_current
row_exp_norm = exp(softmax_input - row_max_global)
row_sum_cur = sum(row_exp_norm, axis=-1)
if sink is not None:
    row_exp_norm = row_exp_norm[...,:-1]

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. Shape: [Nmb, Head, Group, Seq, Channel] or [Nmb, Head x Group, Seq, Channel]. | required |
| index | dynamic_index | The starting offset of the current input sequence. | required |
| row_max_i | tensor | The running maximum of the softmax rows accumulated from previous computation partitions. For the first partition, row_max_i is None. Shape: [Nmb, Head, Group, Seq, 1] or [Nmb, Head x Group, Seq, 1]. | None |
| mask | tensor | The mask tensor. Shape: [Nmb, Channel]. The Head and Group dimensions are broadcast to the input shape, and the Seq dimension of the mask is generated internally. | None |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| row_max_global | tensor | The updated maximum for each row, considering both previous partitions and the current partition. Same shape as row_max_i. |
| row_exp_norm | tensor | The current partition's exponential values, normalized by row_max_global. Same shape as input. |
| row_sum_cur | tensor | The sum of exponentials for the current tile, used to update the running denominator. Same shape as row_max_i. |

flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur)

Updates and merges attention outputs and row statistics from different partitions to produce a globally normalized result. It is used in the Flash Attention algorithm to consolidate local results (row_max_global, row_sum_cur, attn_out_cur) with previously accumulated statistics (row_max_prev, row_sum_prev, attn_out_prev).

Syntax

flash_attn_recompute(row_max_prev, row_max_global, row_sum_prev, row_sum_cur, attn_out_prev, attn_out_cur) -> (row_sum_prev, attn_out_prev)

Algorithm
scale_factor = exp(row_max_prev - row_max_global)
row_sum_prev = scale_factor * row_sum_prev + row_sum_cur
attn_out_prev = scale_factor * attn_out_prev + attn_out_cur

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| row_max_prev | tensor | The previous maximum value for each row. | required |
| row_max_global | tensor | The updated maximum value for each row, considering both previous partitions and the current partition. | required |
| row_sum_prev | tensor | The sum of exponentials for the previous tile. | required |
| row_sum_cur | tensor | The sum of exponentials for the current tile. | required |
| attn_out_prev | tensor | The attention output for the previous tile. | required |
| attn_out_cur | tensor | The attention output for the current tile. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| row_sum_prev | tensor | The updated sum of exponentials for the previous tile. |
| attn_out_prev | tensor | The updated attention output for the previous tile. |
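
Putting the pieces together, the sketch below shows the merge for one partition: statistics produced by flash_attn_tile for the current tile are folded into the running accumulators. In a full kernel this runs once per partition, and after the last partition attn_out_prev is typically normalized by row_sum_prev. The rtl alias (an assumed import path), the buffer layouts, and all names and sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def merge_tile_kernel(max_prev_ptr, max_glob_ptr, sum_prev_ptr, sum_cur_ptr,
                      out_prev_ptr, out_cur_ptr, HG: tl.constexpr,
                      SEQ: tl.constexpr, HEAD_DIM: tl.constexpr):
    n = tl.program_id(0)
    h = tl.program_id(1)
    offs_n = tl.arange(0, 1)
    offs_h = tl.arange(0, 1)
    offs_1 = tl.arange(0, 1)
    offs_s = tl.arange(0, SEQ)
    offs_d = tl.arange(0, HEAD_DIM)
    # Per-row statistics tiles are [1, 1, Seq, 1]; attention-output tiles are [1, 1, Seq, HeadDim].
    off_row = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
               + offs_s[None, None, :, None] + offs_1[None, None, None, :])
    off_out = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
               + offs_s[None, None, :, None] * HEAD_DIM + offs_d[None, None, None, :])
    row_base = (n * HG + h) * SEQ
    out_base = (n * HG + h) * SEQ * HEAD_DIM

    row_max_prev = tl.load(max_prev_ptr + row_base + off_row)
    row_max_global = tl.load(max_glob_ptr + row_base + off_row)  # from flash_attn_tile
    row_sum_prev = tl.load(sum_prev_ptr + row_base + off_row)
    row_sum_cur = tl.load(sum_cur_ptr + row_base + off_row)      # from flash_attn_tile
    attn_out_prev = tl.load(out_prev_ptr + out_base + off_out)
    attn_out_cur = tl.load(out_cur_ptr + out_base + off_out)     # e.g. row_exp_norm x V for this tile

    # Rescale the running accumulators to the new global max and fold in the current tile.
    row_sum_prev, attn_out_prev = rtl.flash_attn_recompute(
        row_max_prev, row_max_global, row_sum_prev, row_sum_cur,
        attn_out_prev, attn_out_cur)

    tl.store(sum_prev_ptr + row_base + off_row, row_sum_prev)
    tl.store(out_prev_ptr + out_base + off_out, attn_out_prev)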

window_insert(data, updates, axis, index)

Insert updates into the data tensor at index along the axis, or append to the end if index >= data.shape[axis].

Syntax

window_insert(data, updates, axis, index) -> updated

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | tensor | The base tensor. | required |
| updates | tensor | The tensor containing the elements to be inserted. | required |
| axis | int | The axis of insertion. Constraint: 0 <= axis < len(data.shape). | required |
| index | dynamic_index | The location where updates are inserted into the data tensor. Constraint: 0 <= index. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| updated | tensor | The updated tensor. |
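
A typical use of window_insert is appending the newest key/value entry to a fixed-size sliding-window cache, as sketched below. The rtl alias (an assumed import path) and all names and sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def sliding_cache_append_kernel(cache_ptr, new_kv_ptr, step_ptr,
                                CACHE_LEN: tl.constexpr, HEAD_DIM: tl.constexpr):
    offs_s = tl.arange(0, CACHE_LEN)
    offs_d = tl.arange(0, HEAD_DIM)
    off2 = offs_s[:, None] * HEAD_DIM + offs_d[None, :]
    cache = tl.load(cache_ptr + off2)  # [CACHE_LEN, HEAD_DIM]
    new_kv = tl.load(new_kv_ptr + tl.arange(0, 1)[:, None] * HEAD_DIM + offs_d[None, :])

    # Global decoding step; once step >= CACHE_LEN, the entry is appended at the end
    # of the window instead of indexing past the cache (see the description above).
    step = rtl.to_dynamic_index(tl.load(step_ptr + tl.arange(0, 1)))
    updated = rtl.window_insert(cache, new_kv, axis=0, index=step)
    tl.store(cache_ptr + off2, updated)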

window_slice(data, index, axis, window_size)

Slices data along axis relative to the offset given by index: the result is data[0:offset] if offset < window_size; data[offset - window_size:offset] if window_size <= offset < data.shape[axis]; and data[data.shape[axis] - window_size:data.shape[axis]] if offset >= data.shape[axis].

Syntax

window_slice(data, index, axis, window_size) -> sliced

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | tensor | The input tensor. | required |
| index | dynamic_index | The slicing offset. Constraint: 0 <= index. | required |
| axis | int | The axis along which to slice. Constraint: 0 <= axis < len(data.shape). | required |
| window_size | int | The size of the window to slice. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| sliced | tensor | The sliced tensor. |
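
The sketch below pairs naturally with window_insert: it retrieves the window_size most recent cache entries ending at the current offset, following the case analysis described above. The rtl alias (an assumed import path) and all names and sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def sliding_window_fetch_kernel(cache_ptr, step_ptr, out_ptr, CACHE_LEN: tl.constexpr,
                                WINDOW: tl.constexpr, HEAD_DIM: tl.constexpr):
    offs_s = tl.arange(0, CACHE_LEN)
    offs_w = tl.arange(0, WINDOW)
    offs_d = tl.arange(0, HEAD_DIM)
    cache = tl.load(cache_ptr + offs_s[:, None] * HEAD_DIM + offs_d[None, :])

    # Current offset into the cache; the slice covers at most the last WINDOW entries
    # ending at this offset, clamped to the cache bounds.
    step = rtl.to_dynamic_index(tl.load(step_ptr + tl.arange(0, 1)))
    window = rtl.window_slice(cache, step, axis=0, window_size=WINDOW)  # assumed [WINDOW, HEAD_DIM] block
    tl.store(out_ptr + offs_w[:, None] * HEAD_DIM + offs_d[None, :], window)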

window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1)

Applies the softmax function only to the valid elements of the input tensor as defined by an internally generated sliding-window mask. The sliding-window mask is generated from window_size and the start offset given by index: for each sequence position i, the valid range is limited to the interval [i + offset - W + 1, i + offset], where W is window_size and offset is the value of index. When the sink tensor is not None, a sink attention operation is performed.

Syntax

window_softmax(input, index, sink=None, window_size=None, axis=-1, pad_c=-1) -> output

Algorithm
# offset = index (the start offset of the current input sequence)
# mask[i] = 1 if i < offset else 0
mask = create_padding_mask(offset, channel)
# mask.shape = [Nmb, channel], sliding_window_mask = [Nmb,1,Seq,C]
expanded_mask = broadcast(mask, [Nmb, 1, Seq, Channel])
# sliding_window_mask: [i + offset - W + 1, i + offset]
sliding_window_mask = make_sliding_window_mask(
    expanded_mask, offset, window_size,
)
softmax_input = where(sliding_window_mask, input, MASK_VALUE)
if sink is not None:
    softmax_input = concatenate((softmax_input, sink), axis=-1)
output = softmax(softmax_input, axis=-1)
if sink is not None:
    output = output[...,:-1]

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor. | required |
| index | dynamic_index | The starting offset of the current input sequence. Constraint: 0 <= index <= window_size. | required |
| sink | tensor | The sink tensor. Shape: [1, Head, Group, 1, 1] or [1, Head x Group, 1, 1]. | None |
| window_size | int | The size of the sliding window. Constraint: window_size >= 128 and window_size % 64 == 0. | None |
| axis | int | The axis along which the softmax is computed. | -1 |
| pad_c | int | The padding constant. | -1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The window softmax output. |
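
For a sliding-window attention kernel, the sketch below lets the op build its mask internally from the window size and the start offset. The rtl alias (an assumed import path) and all names and tile sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def window_attn_probs_kernel(score_ptr, step_ptr, out_ptr, HG: tl.constexpr,
                             SEQ: tl.constexpr, CHANNEL: tl.constexpr,
                             WINDOW: tl.constexpr):
    n = tl.program_id(0)
    h = tl.program_id(1)
    offs_n = tl.arange(0, 1)
    offs_h = tl.arange(0, 1)
    offs_s = tl.arange(0, SEQ)
    offs_c = tl.arange(0, CHANNEL)
    off4 = (offs_n[:, None, None, None] + offs_h[None, :, None, None]
            + offs_s[None, None, :, None] * CHANNEL + offs_c[None, None, None, :])

    scores = tl.load(score_ptr + (n * HG + h) * SEQ * CHANNEL + off4)  # [1, 1, Seq, Channel]

    # Start offset of the current sequence; the sliding-window mask is generated internally.
    step = rtl.to_dynamic_index(tl.load(step_ptr + tl.arange(0, 1)), max=WINDOW)
    probs = rtl.window_softmax(scores, step, window_size=WINDOW, axis=-1)
    tl.store(out_ptr + (n * HG + h) * SEQ * CHANNEL + off4, probs)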

nn_pad(input, value, axis, pad_size, mode)

Pads a tensor along a specified axis with a constant value. The input tensor is padded along the given axis by adding pad_size elements before and/or after its existing values. Commonly used for preparing tensors for convolution, pooling, or other operations that require specific input dimensions (e.g., padding to a multiple of 64 for hardware vectorization in Rebel).

Syntax

nn_pad(input, value, axis, pad_size, mode) -> output

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | tensor | The input tensor to be padded. | required |
| value | int | The constant value used to fill the padded regions. The type of the value must match the input tensor's data type (e.g., a float32 value for a float32 tensor). | required |
| axis | int | The dimension along which padding is applied. | required |
| pad_size | int | The number of elements to pad on each side of the axis, given as pad_size = [left_pad_value, right_pad_value]. For symmetric padding, the total number of added elements is 2 * pad_size. | required |
| mode | str | The padding mode. Supported values: "constant" (pad with the specified constant value), "reflect" (pad by reflecting values at the border), "edge" (pad by repeating the edge value). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| output | tensor | The padded tensor, with the same data type as the input. The shape along the specified axis increases by the total padding amount (2 * pad_size for symmetric padding). |
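
For illustration, the sketch below pads the channel axis of a 2-D tile with zeros using the scalar pad_size form (2 * pad_size elements added in total, as described in the table). The rtl alias (an assumed import path) and all names and sizes are assumptions for the example.

Example
# Illustrative sketch; `rtl` is an assumed alias for the module exposing these ops.
import triton
import triton.language as tl
import rbln_triton_ops as rtl  # assumed import path

@triton.jit
def pad_channels_kernel(in_ptr, out_ptr, ROWS: tl.constexpr,
                        CHANNEL: tl.constexpr, PAD: tl.constexpr):
    offs_r = tl.arange(0, ROWS)
    offs_c = tl.arange(0, CHANNEL)
    x = tl.load(in_ptr + offs_r[:, None] * CHANNEL + offs_c[None, :])  # [ROWS, CHANNEL]

    # Pad the channel axis with zeros; PAD elements are added on each side,
    # so the padded width is CHANNEL + 2 * PAD (chosen to be a multiple of 64).
    y = rtl.nn_pad(x, 0, axis=1, pad_size=PAD, mode="constant")

    offs_p = tl.arange(0, CHANNEL + 2 * PAD)
    tl.store(out_ptr + offs_r[:, None] * (CHANNEL + 2 * PAD) + offs_p[None, :], y)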