How to analyze the RBLN Profiler data using Perfetto¶
This section provides the foundation for analyzing profiling results by explaining the information displayed on the Perfetto screen.
Understanding Profiled Information¶
Task Details¶
- Name Structure

  The RBLN Profiler uses a naming convention for tasks structured as {sequence_index}_{layer_name}_{command_index}. This format helps identify the specific stage in a neural network layer's operation and evaluate its performance at that stage.

  - sequence_index represents the order in which a module is executed, indicating its position in a series of runtime iterations.
  - layer_name refers to the specific tensor-level operation being performed. When multiple operations are fused, they are grouped under the label "fused()" with "+" joining the original names.
  - command_index identifies the specific command generated by the Compiler when an operation is divided into multiple commands, distinguishing each individual command within that division.
- Category

  The RBLN Profiler classifies commands into the following seven categories. For more detailed information on each category, please refer to the RBLN NPU Architecture page.

  - Host
  - Neural Engine Clusters
  - Neural DMA
  - Task DMA
  - External HDMA
  - Device HDMA
  - Device Sync

  The numbers next to the command names in the Perfetto UI are internal ID values assigned by the RBLN Profiler. These numbers are unrelated to the actual performance of the RBLN NPU.
- Flow Connection (Preceding Flows & Following Flows)

  The Preceding Flows and Following Flows displayed at the bottom of Perfetto represent dependencies among commands. During model compilation, these dependencies are determined to ensure that the Command Processor handles them correctly, taking the device's configuration and allocation status into account. Dependencies related to Neural DMAs are managed by the Task Manager within the Neural Engine and are therefore not displayed in Perfetto.

  - Between External HDMA and Task DMA

    External HDMA transfers weight and input data from host DRAM to ATOM™ device DRAM, creating a dependency with the initial Task DMA command that sends data from the ATOM™ device DRAM to the on-chip Shared Memory. Similarly, since the final output data retrieved by Task DMA must be sent back to host DRAM via External HDMA, it is essential to enforce a dependency ensuring that External HDMA executes only after Task DMA completes.

    These visualized flows show explicit dependencies among commands. However, a command without connections is not necessarily free of sequential dependencies. For instance, the Task Managers in each Neural Engine preside over synchronization at the local hardware level. There is also a barrier that prevents the first Neural Engine Clusters commands from executing until the related External HDMA commands are done.
  - Between Neural Engine Clusters and Task DMA

    The connection between Neural Engine Clusters and Task DMA mainly shows dependencies related to transferring weights or input/output data between neural network layers, and it is designed to support each task accordingly. In this example, the dependency is determined by the compiler because the kernel weights need to be loaded for the given Neural Engine Clusters operation 0_fused(bn1+conv1+relu)_1. More generally, whenever the compiler determines that a connection is needed between Neural Engine Clusters and Task DMA, it is treated as a dependency. For example, data generated after a neural network layer's computation is temporarily stored in the ATOM™ device memory because of its large size.
  - Between Device HDMA and Device Sync

    In large models that use multiple ATOM™ devices, layers are divided through sharding, allowing them to run across multiple devices.

    In case 1, the root ATOM™ device (node 0) synchronizes with the leaf ATOM™ devices (nodes 1, 2, 3) via Device HDMA to receive data. Figure (1) illustrates how data are gathered into the root ATOM™ device (node 0) through Device HDMA, after which the synchronization on each leaf ATOM™ device (nodes 1, 2, 3) is released.

    Conversely, in case 2, the root ATOM™ device (node 0) also synchronizes with the leaf ATOM™ devices (nodes 1, 2, 3) via Device HDMA, this time to send data. Figure (2) illustrates how data are distributed to the leaf ATOM™ devices (nodes 1, 2, 3) through Device HDMA. Afterward, the leaf devices send a synchronization signal to confirm that they have correctly received the data.

    These tensor-parallel models generate synchronization commands during compilation to establish dependencies with the Device HDMA commands.
Naming Policy in Detail¶
Execution of a neural network can be represented as a sequence of tensor operations, and with the RBLN Compiler, each tensor operation can be represented as a sequence of commands. Therefore, a sequence of commands in a visualized profiling result can be mapped back to a neural network module. This section introduces the naming policy of commands in the timeline.
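To make the mapping concrete, here is a minimal sketch in plain Python (it is not part of the RBLN SDK; the helper name and the regular expression are assumptions for illustration) that splits a timeline task name into the three fields described above:

```python
import re

# Hypothetical helper: split a timeline task name such as
# "0_fused(bn1+conv1+relu)_1" into the three fields of the naming convention.
TASK_NAME_PATTERN = re.compile(
    r"^(?P<sequence_index>\d+)_(?P<layer_name>.+)_(?P<command_index>\d+)$"
)

def parse_task_name(name: str) -> dict:
    """Return sequence_index, layer_name, and command_index from a task name."""
    match = TASK_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected task name format: {name}")
    fields = match.groupdict()
    # Fused operations appear as fused(name0+name1+...) inside layer_name.
    fields["fused_layers"] = (
        fields["layer_name"][len("fused("):-1].split("+")
        if fields["layer_name"].startswith("fused(")
        else [fields["layer_name"]]
    )
    return fields

print(parse_task_name("0_fused(bn1+conv1+relu)_1"))
# {'sequence_index': '0', 'layer_name': 'fused(bn1+conv1+relu)',
#  'command_index': '1', 'fused_layers': ['bn1', 'conv1', 'relu']}
```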
Case 1. Matching Names with Model (PyTorch)¶
We can recognize the structure of a model from its printed specification or its code-level implementation. By comparing that information with the visualized nodes, we can identify how an operation is converted into a sequence of commands. In the above example, model.layers[0].self_attn.q_proj of a LlamaForCausalLM instance is converted into 0_causal_lm.model.0.self_attn.q_proj.
The structure of a model and the order of its layers may differ from the actual sequence in which commands are processed. Several commands can be omitted during compilation; in that case, the command indices of a single layer can be discontinuous. Additional commands are implicitly generated during compilation to execute various operations efficiently on ATOM™. In the above example, 0_causal_lm.model_0 is added to process a rotary positional embedding (RoPE) layer.
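For reference, one way to obtain the model-side names to compare against the profiled command names is to print the module hierarchy. The sketch below does this for a Hugging Face LlamaForCausalLM; the checkpoint identifier is only a placeholder and assumes you have access to the corresponding weights.

```python
from transformers import LlamaForCausalLM

# Placeholder checkpoint; substitute the model you actually compiled.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Print the module hierarchy so it can be matched against profiled task names,
# e.g. model.layers.0.self_attn.q_proj <-> 0_causal_lm.model.0.self_attn.q_proj.
for name, module in model.named_modules():
    if name.startswith("model.layers.0.self_attn"):
        print(name, "->", module.__class__.__name__)
```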
Case 2. Command Generation¶
To improve performance and efficiency, the RBLN Compiler supports operation fusion, which combines a sequence of operations into a single function. The command_name of a fused function takes the form fused(name0+name1+...). In the above example, a convolution layer, a batch normalization layer, and an activation function layer are merged into a single function named fused(layer1.2+layer1.2.bn3+layer1.2.conv3).
A sequence of commands for processing a module shares the same command_name, and each command can be distinguished by its command_index. Several commands, such as device memory accesses on Task DMA and executions on the Neural Engine Clusters for running fused(layer1.2+layer1.2.bn3+layer1.2.conv3), are expressed in the form 0_fused(layer1.2+layer1.2.bn3+layer1.2.conv3)_{command_index}.
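As an illustration only (the command names below are hypothetical examples following the format above, not captured profiler output), the following snippet groups a list of task names by their shared command_name so that the individual command indices of a fused function are easy to inspect:

```python
from collections import defaultdict

# Hypothetical task names as they might appear on the timeline.
commands = [
    "0_fused(layer1.2+layer1.2.bn3+layer1.2.conv3)_0",
    "0_fused(layer1.2+layer1.2.bn3+layer1.2.conv3)_1",
    "0_fused(layer1.2+layer1.2.bn3+layer1.2.conv3)_2",
]

# Group by everything before the trailing command_index.
grouped = defaultdict(list)
for command in commands:
    command_name, _, command_index = command.rpartition("_")
    grouped[command_name].append(int(command_index))

for command_name, indices in grouped.items():
    print(command_name, "-> command indices:", indices)
```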
Case 3. Reusing Layers (PyTorch)¶
Inline operations such as torch.add and torch.nn.functional.relu are compiled without a name because the model does not expose attribute names for them. This makes it hard to identify which commands correspond to these operations in Perfetto. To avoid this, we highly recommend using explicit class attributes that inherit from torch.nn.Module instead of inline operations.
Additionally, repeatedly using a single class attribute may also make it difficult to map commands in Perfetto to specific operations. For example, in the ResNet50 model, as shown in the PyTorch code, there is only one explicit class attribute named self.relu in the Bottleneck module, which holds an nn.ReLU instance. In this case, as seen in the profiling results above, relu is omitted from the command names of the second and third Neural Engine Clusters commands, because the same self.relu attribute is reused in the forward function; only the first instance retains relu.
If you modify the forward function in the Bottleneck module to use different attributes (self.relu1, self.relu2, self.relu3), as shown in the PyTorch code above, you will notice that relu is explicitly included not only in the first Neural Engine Clusters command name (1) but also in the subsequent ones (2) and (3), without being omitted. A simplified sketch of this pattern follows below.
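The following is a minimal sketch of that pattern, assuming a simplified Bottleneck module. It is not the full torchvision implementation; the channel sizes and the absence of a downsample path are simplifications for illustration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified Bottleneck where each activation has its own attribute,
    so "relu" stays visible in every corresponding command name."""

    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, in_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_channels)
        # Separate attributes instead of reusing a single self.relu.
        self.relu1 = nn.ReLU(inplace=True)
        self.relu2 = nn.ReLU(inplace=True)
        self.relu3 = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu1(self.bn1(self.conv1(x)))
        out = self.relu2(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu3(out + identity)

# Quick shape check of the sketch.
block = Bottleneck(in_channels=256, channels=64)
print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```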