
RBLN NPU Architecture

ATOM™ Architecture

ATOM™ is a multi-core System-on-Chip (SoC) that integrates all essential components for running deep neural networks into a single chip. It combines Neural Engines, a Command Processor, on-chip local and global scratchpad memory hierarchy, Network-on-Chip (NoC) bus fabric, and PCIe 5.0 and GDDR6 interfaces into a compact design. ATOM™ features 4MB of local SRAM (Scratch Pad) within each Neural Engine, 32MB of global SRAM (Shared Memory) shared across all Neural Engines, and 16GB of off-chip GDDR6 DRAM. This hierarchical structure of on-chip and off-chip memory reduces execution time while minimizing energy consumption for memory access.
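
The sizes above determine whether a given model can run on a single device. Below is a minimal Python sketch of the memory hierarchy described in this section; the sizes come from the text, while the helper function and names are hypothetical and not part of the RBLN SDK.

```python
# Memory hierarchy sizes as stated above (per ATOM(TM) device).
SCRATCH_PAD_PER_ENGINE = 4 * 1024**2   # 4MB local SRAM inside each Neural Engine
SHARED_MEMORY          = 32 * 1024**2  # 32MB global SRAM shared by all Neural Engines
DEVICE_DRAM            = 16 * 1024**3  # 16GB off-chip GDDR6 DRAM

def fits_on_single_device(model_bytes: int) -> bool:
    """Return True if the model's weights fit in one device's DRAM."""
    return model_bytes <= DEVICE_DRAM

# Example: a 13B-parameter model in FP16 (2 bytes per parameter) exceeds
# a single device's 16GB DRAM, so it must be partitioned (see RSD below).
weights = 13_000_000_000 * 2
print(fits_on_single_device(weights))  # False
```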

For large-scale models, such as large language models (LLMs) that often require more than 16GB of memory, a single ATOM™ device may not be able to load the entire model at once. In this case, the model is partitioned across multiple devices for parallel execution through the Rebellions Scalable Design (RSD) architecture. During this process, the devices must communicate and synchronize data to ensure correct execution.
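
As a hedged illustration of the idea behind RSD partitioning (not the RBLN Compiler's actual algorithm), the sketch below splits a model's layers into contiguous chunks, one per device; the layer names and helper are invented for this example.

```python
# Conceptual sketch: assign contiguous, near-equal chunks of layers to devices.
def partition_layers(layers: list, num_devices: int) -> list[list]:
    """Split layers into num_devices contiguous chunks of near-equal size."""
    chunk = -(-len(layers) // num_devices)  # ceiling division
    return [layers[i:i + chunk] for i in range(0, len(layers), chunk)]

layers = [f"decoder_block_{i}" for i in range(32)]
for dev, part in enumerate(partition_layers(layers, 4)):
    print(f"device {dev}: {part[0]} .. {part[-1]}")  # 8 blocks per device
```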

For more detailed information, please refer to our white papers.

Commands

RBLN Profiler provides a total of seven commands: Host, Neural Engine Clusters, Neural DMA, Task DMA, External HDMA, Device HDMA, and Device Sync. Each command represents a specific operation, such as computation, data movement, or signal transmission between hardware components. The Profiler chronologically traces these commands and visualizes them using Perfetto. More detailed information about Perfetto can be found in the Introduction to Perfetto. To effectively analyze profiling results, refer to the image below, which illustrates the connections between hardware components and the commands described above.
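
As a hedged illustration of what such a chronological trace looks like, the snippet below writes a few command events in the Chrome JSON trace format, which the Perfetto UI can open. The event names match the commands above, but the timestamps, durations, and track IDs are invented for illustration.

```python
import json

# One complete event ("ph": "X") per command, on separate tracks (tid).
events = [
    {"name": "External HDMA",          "ph": "X", "ts": 0,   "dur": 120, "pid": 0, "tid": 0},
    {"name": "Neural DMA",             "ph": "X", "ts": 130, "dur": 40,  "pid": 0, "tid": 1},
    {"name": "Neural Engine Clusters", "ph": "X", "ts": 175, "dur": 300, "pid": 0, "tid": 2},
    {"name": "Device Sync",            "ph": "X", "ts": 480, "dur": 15,  "pid": 0, "tid": 3},
]

with open("trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)
# Open trace.json at https://ui.perfetto.dev to view the timeline.
```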

Host

The Host command represents operations that are offloaded to the host CPU when they are either more efficient than running on the NPU or not supported by the NPU.
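
A minimal sketch of the placement decision this describes, assuming a hypothetical supported-op set; the real decision also weighs efficiency, and none of the names below come from the RBLN SDK.

```python
# Hypothetical set of ops the NPU can execute directly.
NPU_SUPPORTED_OPS = {"conv2d", "matmul", "softmax", "layernorm"}

def placement(op_name: str) -> str:
    """Run on the NPU when supported; otherwise fall back to the host CPU."""
    return "npu" if op_name in NPU_SUPPORTED_OPS else "host"

print(placement("matmul"))      # npu
print(placement("custom_op"))   # host (offloaded to the host CPU)
```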

Neural Engine Clusters

The Neural Engine Clusters command represents operations that run on the Neural Engines in ATOM™. The Neural Engines are designed for computation with a focus on low latency, high utilization, and flexibility. The list of operations supported by the Neural Engines is summarized in the Supported OPs.

Neural DMA

The Neural DMA command represents data transfers between the device DRAM and the Neural Engine's Scratch Pad, including program binaries, input tensors, and kernel weights of the target model. The Neural DMA command can operate simultaneously with other commands, so the RBLN Compiler generates the required dependencies between the commands to ensure correct execution. These dependencies are managed by the Neural Engine's Task Manager at runtime.
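
The sketch below illustrates this dependency handling in a hedged way: commands carry compiler-generated dependencies, and a runtime loop (standing in for the Task Manager) issues a command only once all of its dependencies have completed. The classes and scheduler are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Command:
    name: str
    deps: list["Command"] = field(default_factory=list)
    done: bool = False

def run(commands: list[Command]) -> None:
    """Issue each command only after all of its dependencies complete."""
    pending = list(commands)
    while pending:
        for cmd in pending:
            if all(d.done for d in cmd.deps):  # dependencies satisfied?
                print(f"issue {cmd.name}")
                cmd.done = True
                pending.remove(cmd)
                break
        else:
            raise RuntimeError("dependency cycle")

load_weights = Command("Neural DMA: weights -> Scratch Pad")
compute      = Command("Neural Engine: matmul", deps=[load_weights])
run([compute, load_weights])  # issues the DMA first, then the compute
```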

Task DMA

The Task DMA command represents data transfers between the device DRAM and the Shared Memory in the SoC, including input tensors, intermediate tensors, and kernel weights of the target model. The RBLN Compiler leverages both command types, Task DMA and Neural DMA, to achieve optimal performance for the target workloads.
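
A minimal sketch of how the two DMA paths differ by endpoint, assuming hypothetical endpoint names: Neural DMA serves a Neural Engine's Scratch Pad, while Task DMA serves the on-chip Shared Memory.

```python
def dma_engine(src: str, dst: str) -> str:
    """Pick the DMA command type based on the on-chip endpoint involved."""
    endpoints = {src, dst}
    if "scratch_pad" in endpoints:
        return "Neural DMA"    # device DRAM <-> Neural Engine Scratch Pad
    if "shared_memory" in endpoints:
        return "Task DMA"      # device DRAM <-> Shared Memory
    raise ValueError("unknown transfer path")

print(dma_engine("dram", "scratch_pad"))    # Neural DMA
print(dma_engine("dram", "shared_memory"))  # Task DMA
```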

External/Device HDMA

The External HDMA command represents data transfers between the host DRAM and the device DRAM, while the Device HDMA command represents data transfers between the device DRAMs or the Shared Memories of different devices under the RSD configuration.
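
The distinction can be summarized by the transfer endpoints. The sketch below, with hypothetical endpoint names, classifies a transfer as External HDMA when it crosses the host/device boundary and as Device HDMA otherwise.

```python
def hdma_kind(src: str, dst: str) -> str:
    """External HDMA crosses the host boundary; Device HDMA stays device-to-device."""
    if "host_dram" in (src, dst):
        return "External HDMA"
    return "Device HDMA"

print(hdma_kind("host_dram", "dev0_dram"))  # External HDMA
print(hdma_kind("dev0_dram", "dev1_dram"))  # Device HDMA (RSD configuration)
```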

Device Sync

The Device Sync command represents synchronization between different devices under the RSD configuration. The RBLN Compiler ensures data synchronization across multiple devices while minimizing inter-device communication overhead and maximizing effective memory bandwidth. Device Sync commands generated by the RBLN Compiler are managed by the Command Processor at runtime to verify that data communication is successfully completed and to enable the immediate execution of subsequent commands.
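
As a hedged sketch of the role a Device Sync command plays, the snippet below uses a thread event to stand in for the Command Processor's completion check: the dependent command waits until the inter-device transfer has finished, then starts immediately. All names are illustrative, not SDK APIs.

```python
import threading

transfer_done = threading.Event()

def device_hdma():
    # ... Device HDMA: move a tensor between device DRAMs ...
    transfer_done.set()   # signal that the transfer completed

def device_sync_then_compute():
    transfer_done.wait()  # Device Sync: block until the data has arrived
    print("dependent command starts immediately after sync")

consumer = threading.Thread(target=device_sync_then_compute)
producer = threading.Thread(target=device_hdma)
consumer.start()
producer.start()
producer.join()
consumer.join()
```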