YOLOv8 (Object Detection)¶

YOLOv8 is a prominent deep learning model for object detection that can be applied to a wide range of tasks. It performs well relative to its latency and parameter count. Its backbone is built from C2f blocks and a Spatial Pyramid Pooling Fast (SPPF) block. For details, see the official YOLOv8 docs.

Preliminaries¶

Installation¶

$ cd rbln-model-zoo/pytorch/vision/detection/yolov8
$ git submodule update --init --remote ./ultralytics
$ pip3 install -r requirements.txt

Compile Model and Activate RBLN Profiler¶

# sample image: people.jpg
$ python3 compile.py --model_name yolov8l
$ RBLN_PROFILER=1 python3 inference.py --model_name yolov8l

When using the RBLN Runtime API, set activate_profiler=True to activate RBLN Profiler:

# Revise `inference.py` as shown below.
...
# Load compiled model to RBLN runtime module
module = rebel.Runtime(f"{model_name}.rbln", activate_profiler=True)
...
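
If you would rather toggle profiling from the command line than edit the script for each run, one option is a boolean flag that feeds activate_profiler. The sketch below is an assumption about how inference.py could be arranged, not its actual contents: the --profile flag and the runtime_kwargs helper are hypothetical, and only --model_name and activate_profiler come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with --model_name plus a hypothetical --profile flag."""
    parser = argparse.ArgumentParser(description="YOLOv8 inference (sketch)")
    parser.add_argument("--model_name", default="yolov8l")
    # Hypothetical flag: not part of the original inference.py.
    parser.add_argument("--profile", action="store_true",
                        help="activate RBLN Profiler for this run")
    return parser


def runtime_kwargs(args: argparse.Namespace) -> dict:
    """Translate parsed arguments into keyword arguments for rebel.Runtime."""
    return {"activate_profiler": args.profile}


# Parse an explicit argument list here instead of sys.argv.
args = build_parser().parse_args(["--model_name", "yolov8l", "--profile"])
kwargs = runtime_kwargs(args)
print(f"{args.model_name}.rbln", kwargs)

# In inference.py this would become (requires the RBLN SDK):
# module = rebel.Runtime(f"{args.model_name}.rbln", **kwargs)
```

With this arrangement, omitting --profile leaves activate_profiler set to False, so the default run is unprofiled.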

Extract Profiled Data with Inference¶

Profiling data is collected when the model runs inference:

# `inference.py`
...
rebel_result = module.run(batch)
...

Alternatively, you can use scoped profiling methods as shown below:

# `inference.py`
...
from rebel.profiler import profile

with profile(output_dir="./temp"):
  rebel_result = module.run(batch)
...
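
To keep the output of successive runs apart, you might give each scoped run its own directory. The sketch below only builds the directory; make_profile_dir and the timestamped naming scheme are hypothetical, and since it is not stated here whether profile creates a missing output_dir itself, the helper creates it up front.

```python
from datetime import datetime
from pathlib import Path


def make_profile_dir(root: str = "./temp", tag: str = "yolov8l") -> Path:
    """Create and return a per-run directory named <tag>_<YYYYmmdd_HHMMSS> under root."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(root) / f"{tag}_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Usage inside inference.py (requires the RBLN SDK):
# with profile(output_dir=str(make_profile_dir())):
#     rebel_result = module.run(batch)
```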

Analysis of Profiled data from YOLOv8l with RBLN Profiler¶

Profiling Result¶

As seen in the figure above, YOLOv8l is well accelerated by ATOM™. The following sections analyze its operations in detail.

`Host` Command¶

As shown in the figure above, the Host commands (0_host_default_fused_nn_pad_5_0 and 0_host_default_fused_transpose_1_1) transform the input data into a format suitable for ATOM™ before it is transferred via the External HDMA command. Similarly, after inference, the results are sent back to the host CPU through External HDMA and then converted into the desired output format by the corresponding Host commands. The names of Host commands are generated by internal logic and always include the host prefix.
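
When working with a list of command names exported from a trace, the host prefix gives a simple way to pick out Host commands. The sketch below assumes each name has the layout "<index>_<rest>", as in the two examples above; is_host_command is a hypothetical helper, and the exact naming rules are defined by the profiler, not by this code.

```python
def is_host_command(name: str) -> bool:
    """Return True if the name carries the host prefix after its leading index."""
    # Assumed layout: "<index>_<rest>", e.g. "0_host_default_fused_nn_pad_5_0".
    _, _, rest = name.partition("_")
    return rest.startswith("host")


names = [
    "0_host_default_fused_nn_pad_5_0",
    "0_host_default_fused_transpose_1_1",
    "0_H0_hdma",
    "0_0.conv.input_np_0",
]
host_names = [n for n in names if is_host_command(n)]
print(host_names)
```

The check is case-sensitive, so the External HDMA command 0_H0_hdma is not mistaken for a Host command.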

Input Flow¶

As shown in the example above, the first Neural Engine Clusters command of YOLOv8l is 0_0.conv.input_np_1. Its flow indicates a dependency on the immediately preceding Task DMA command (0_0.conv.input_np_0). The same dependency can be inferred from the names: the two commands (0_0.conv.input_np_0 and 0_0.conv.input_np_1) are identical except for the trailing index. The names combine conv, the name of the first operation in the model, with input_np, the name assigned to the input data during compilation.

# rbln-model-zoo/pytorch/vision/detection/yolov8/compile.py
input_info = [
    ("input_np", [1, 3, 640, 640], torch.float32),
]
compiled_model = rebel.compile_from_torch(model, input_info)

# >>> print(model)
'''
DetectionModel(
  (model): Sequential(
    (0): Conv(
      (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
  )
)
'''

This input data is transferred from the host to the device memory inside ATOM™ via the External HDMA command (0_H0_hdma). Subsequently, the input data is transferred to the shared memory using the Task DMA command (0_0.conv.input_np_0) and is then used as an operand in the Neural Engine Clusters command (0_0.conv.input_np_1).

The output data travels the same path in reverse: a Task DMA command moves it from shared memory to DRAM, and an External HDMA command then moves it from DRAM to host memory.
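
The naming relationship described in this section, where related commands differ only in their trailing index, can be checked with a few lines of string handling. This is a minimal sketch: split_suffix and group_by_base are hypothetical helpers, and they assume every name passed in ends in "_<integer>", which holds for the two commands above but not for names such as 0_H0_hdma.

```python
from collections import defaultdict


def split_suffix(name: str) -> tuple[str, int]:
    """Split 'base_N' into ('base', N), using the last underscore as the separator."""
    base, _, index = name.rpartition("_")
    return base, int(index)  # raises ValueError if the suffix is not an integer


def group_by_base(names: list[str]) -> dict[str, list[int]]:
    """Group command names that differ only in their trailing index."""
    groups: dict[str, list[int]] = defaultdict(list)
    for name in names:
        base, index = split_suffix(name)
        groups[base].append(index)
    return {base: sorted(indices) for base, indices in groups.items()}


commands = ["0_0.conv.input_np_0", "0_0.conv.input_np_1"]
print(group_by_base(commands))  # {'0_0.conv.input_np': [0, 1]}
```

Both commands fall under the single base name 0_0.conv.input_np, which mirrors the Task DMA to Neural Engine Clusters dependency shown by the flow.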

Analysis of C2f block¶

# https://github.com/autogyro/yolo-V8/blob/cc3c774bde86ffce694d202b7383da6cc1721c1b/ultralytics/nn/modules.py#L179C1-L191C41
class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

# >>> print(model)
'''
(2): C2f(
  (cv1): Conv(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
    (act): SiLU(inplace=True)
  )
  (cv2): Conv(
    (conv): Conv2d(96, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
    (act): SiLU(inplace=True)
  )
  (m): ModuleList(
    (0): Bottleneck(
      (cv1): Conv(
        (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU(inplace=True)
      )
      (cv2): Conv(
        (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU(inplace=True)
      )
    )
  )
)
'''
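
The channel counts in the printed module follow directly from the arithmetic in C2f.__init__. The sketch below reproduces that arithmetic without PyTorch; c2f_channels is a hypothetical helper, and the bottleneck widths (c in, c out for both convolutions) are taken from the printed output above rather than from the Bottleneck source, which is not shown here.

```python
def c2f_channels(c1: int, c2: int, n: int = 1, e: float = 0.5) -> dict[str, tuple[int, int]]:
    """Return (in_channels, out_channels) for each convolution inside a C2f block."""
    c = int(c2 * e)  # hidden channels, as in C2f.__init__
    layers = {
        "cv1": (c1, 2 * c),        # output is split into two chunks of c channels
        "cv2": ((2 + n) * c, c2),  # input is the concat of 2 chunks + n bottleneck outputs
    }
    for i in range(n):
        # Per the printed module, both 3x3 convolutions keep c channels.
        layers[f"m.{i}.cv1"] = (c, c)
        layers[f"m.{i}.cv2"] = (c, c)
    return layers


for name, (cin, cout) in c2f_channels(c1=64, c2=64, n=1).items():
    print(f"{name}: Conv2d({cin}, {cout})")
```

For c1=64, c2=64, n=1 this gives cv1 as Conv2d(64, 64), cv2 as Conv2d(96, 64), and both bottleneck convolutions as Conv2d(32, 32), matching the print(model) output.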

Ambiguous command name¶

The operation 0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1 in the blue box corresponds to the fused self.cv1(x) block in the PyTorch code and to the first (cv1) block in the output of print(model). The name of the subsequent command in the red box, 0_2_0, is ambiguous on its own, but from its position we can infer that it corresponds to .split((self.c, self.c), 1) in the PyTorch code. For details on how these names are assigned, refer to the Details in Naming Policy.

Connection flow¶

The Neural Engine Clusters commands that follow 0_2_0, such as (1-1) and (2-1), correspond to the convolution blocks within the bottleneck defined by self.m in the PyTorch code. The flows between the Task DMA commands (1-0, 2-0) and the Neural Engine Clusters commands (1-1, 2-1) show that a Neural Engine Clusters command needs its weights preloaded before it can perform the actual tensor operation. Command (3) corresponds to the add operation inside the bottleneck block, and command (4) corresponds to the concatenation in self.cv2(torch.cat(y, 1)).

`Device Workload`¶

As described above, the device processes a large number of commands. The period during which the device is occupied is represented by the Device Workload of the Host command, which makes the start and end of the tasks performed on the device easy to identify.

`Profiling Overhead`¶

Running the RBLN Profiler involves organizing the profiled data collected from ATOM™ after Device Workload has finished. This step takes some time, which is recorded as Profiling Overhead for the Host command. Once the profiled data has been organized, additional Host commands may be processed.
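
For a rough host-side view of how much time profiling adds, you could time the same call with and without the profiler and compare the medians. The sketch below is a generic wall-clock helper, not part of the RBLN SDK: median_seconds, profiled_module, and plain_module are hypothetical names, and the difference it measures is not guaranteed to equal the Profiling Overhead span shown in the trace.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Call fn `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs without the RBLN SDK.
baseline = median_seconds(lambda: sum(range(10_000)))
print(f"median: {baseline * 1e3:.3f} ms")

# With the SDK, the same helper could wrap module.run for two Runtime objects,
# one created with activate_profiler=True and one without (hypothetical names):
# with_profiler = median_seconds(lambda: profiled_module.run(batch))
# without_profiler = median_seconds(lambda: plain_module.run(batch))
```

The median is used rather than the mean so that a single slow first call does not dominate the comparison.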