YOLOv8 (Object Detection)¶
YOLOv8 is a prominent deep learning model for object detection and can be applied across a wide range of tasks. It shows excellent performance relative to its latency and parameter count. YOLOv8 uses C2f and Spatial Pyramid Pooling Fast (SPPF) blocks in its backbone to perform various tasks effectively. Detailed information about YOLOv8 can be found in the YOLOv8 official docs.
Preliminaries¶
Installation¶
Compile the Model and Extract Profiled Data¶
When using the RBLN Runtime API, set activate_profiler=True to activate the RBLN Profiler.
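As a minimal sketch of this step (the model file name here is illustrative, and the exact runtime call should be checked against the RBLN SDK reference), the profiler is enabled when the compiled model is loaded:

```python
import rebel  # RBLN SDK runtime (assumed available on the target machine)

# Load a precompiled YOLOv8l model with profiling enabled;
# activate_profiler=True makes the runtime emit profiled data that
# RBLN Profiler can visualize. "yolov8l.rbln" is an illustrative
# file name, not one mandated by the SDK.
module = rebel.Runtime("yolov8l.rbln", activate_profiler=True)
```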
Analysis of Profiled data from YOLOv8l with RBLN Profiler¶
Profiling Result¶
As seen in the figure above, YOLOv8l is well accelerated by ATOM™, and we will now analyze these operations in detail.
Input Flow¶
As shown in the example above, the first Neural Engine Clusters command of YOLOv8l is 0_0.conv.input_np_1. This command's flow indicates a dependency on the immediately preceding Task DMA command (0_0.conv.input_np_0). This dependency can also be inferred from the fact that the two commands (0_0.conv.input_np_0, 0_0.conv.input_np_1) share the same name except for the postfix index. This Neural Engine Clusters command represents the preprocessing step performed on the input data before it is computed on ATOM™. The names of these commands are derived from a combination of conv, the name of the first operation in the model, and input_np, the name assigned to the input data during compilation.
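The relationship described above can be made concrete with a small helper (not part of the RBLN SDK; a sketch for reading profiler traces) that groups command names differing only in their trailing postfix index:

```python
import re

# Illustrative helper: strip the trailing postfix index (_0, _1, ...) so
# that commands belonging to the same logical step, such as the Task DMA
# command 0_0.conv.input_np_0 and the Neural Engine Clusters command
# 0_0.conv.input_np_1, fall into the same group.
def group_by_base(names):
    groups = {}
    for name in names:
        base = re.sub(r"_\d+$", "", name)  # drop the postfix index
        groups.setdefault(base, []).append(name)
    return groups

cmds = ["0_0.conv.input_np_0", "0_0.conv.input_np_1"]
groups = group_by_base(cmds)
# Both commands share the base name "0_0.conv.input_np".
```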
The input data is first transferred from host memory to the device DRAM by the External HDMA command (0_H0_hdma). Subsequently, the input data is transferred to the shared memory using the Task DMA command (0_0.conv.input_np_0) and is then used as an operand in the Neural Engine Clusters command (0_0.conv.input_np_1).
Similarly, the output data is transferred in the reverse order of the input data transmission: using Task DMA and External HDMA commands, the required output data is delivered from shared memory to DRAM, and from DRAM to host memory, respectively, until it reaches the host.
Analysis of C2f block¶
Ambiguous command name¶
The operation 0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1 in the blue box corresponds to the fused self.cv1(x) block in the PyTorch code and to the first (cv1) block in the output of print(model). Therefore, we can infer that the subsequent 0_2_0 in the red box corresponds to the .split((self.c, self.c), 1) in the PyTorch code. For detailed information about how these names are assigned, refer to the Details in Naming Policy.
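To illustrate the naming scheme just described, a hypothetical helper (not part of the RBLN SDK) can recover the list of fused PyTorch submodules from such a command name:

```python
import re

# Illustrative parser: a fused command name such as
# 0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1 encodes, between the parentheses,
# the submodules that were merged into a single command, joined by "+".
def fused_modules(command):
    m = re.search(r"fused\(([^)]*)\)", command)
    return m.group(1).split("+") if m else []

fused_modules("0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1")
# -> ["2.cv1", "2.cv1.bn", "2.cv1.conv"]
```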
Connection flow¶
The Neural Engine Clusters commands such as (1-1) and (2-1), which come after 0_2_0, correspond to the convolution blocks within the bottleneck operation defined by self.m in the PyTorch code. The flows between the Task DMA commands (1-0, 2-0) and the Neural Engine Clusters commands (1-1, 2-1) show that a Neural Engine Clusters command needs its weights pre-loaded before it performs the actual tensor operations. Command (3) corresponds to the add operation inside the bottleneck blocks, and command (4) corresponds to the concatenation operation in the source code, self.cv2(torch.cat(y, 1)).
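The dataflow traced above can be sketched in plain Python (lists stand in for tensors and identity functions for the convolutions; the real C2f block in Ultralytics' YOLOv8 uses nn.Conv2d modules): cv1 expands the input, the result is split in two, each bottleneck consumes the previous output (which is why commands (1-1) and (2-1) form a serial chain), and all intermediates are concatenated before cv2.

```python
# Toy model of the C2f dataflow; x is a flat list of "channel" values.
def c2f_forward(x, bottlenecks):
    c = len(x)           # hidden channel count after the split
    cv1 = x + x          # stand-in for self.cv1: expand to 2*c channels
    y = [cv1[:c], cv1[c:]]  # .split((self.c, self.c), 1)
    for m in bottlenecks:
        prev = y[-1]     # each bottleneck reads only the latest output
        # shortcut add inside the bottleneck -- command (3) above
        y.append([a + b for a, b in zip(m(prev), prev)])
    # torch.cat(y, 1) before cv2 -- command (4) above
    return [v for part in y for v in part]

identity = lambda t: list(t)  # stand-in for a bottleneck's convolutions
out = c2f_forward([1.0, 2.0], [identity, identity])
# -> [1.0, 2.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0]
```

With two bottlenecks the concatenation holds four parts (the two split halves plus one output per bottleneck), matching the four inputs feeding command (4) in the trace.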