YOLOv8 (Object Detection)

YOLOv8 is a prominent deep learning model for object detection that can be applied across a wide range of tasks. It performs well relative to its latency and parameter count. Its backbone is built from C2f and Spatial Pyramid Pooling Fast (SPPF) blocks. Detailed information about YOLOv8 can be found in the official YOLOv8 docs.

Preliminaries

Installation

$ cd RBLN_MODEL_ZOO_PATH/pytorch/vision/detection/yolov8
$ git submodule update --init --remote ./ultralytics
$ pip3 install -r requirements.txt

Compile the Model and Extract Profiled Data

# sample image: people.jpg
$ python3 compile.py --model_name yolov8l
$ RBLN_PROFILER=1 python3 inference.py --model_name yolov8l

When using the RBLN Runtime API, set activate_profiler=True to activate the RBLN Profiler.

# Revise `inference.py` as shown below.
...
# Load compiled model to RBLN runtime module
module = rebel.Runtime(f"{model_name}.rbln", activate_profiler=True)
...

Analysis of Profiled Data from YOLOv8l with RBLN Profiler

Profiling Result

As seen in the figure above, YOLOv8l is well accelerated by ATOM™. The following sections analyze its operations in detail.

Input Flow

As shown in the example above, the first Neural Engine Clusters command of YOLOv8l is 0_0.conv.input_np_1. The flow of this command indicates a dependency on the immediately preceding Task DMA command (0_0.conv.input_np_0). The same dependency can be inferred from the names: the two commands (0_0.conv.input_np_0 and 0_0.conv.input_np_1) are identical except for the postfix index. This Neural Engine Clusters command represents the preprocessing step applied to the input data before it is computed on ATOM™. The names of these commands combine conv, the name of the first operation in the model, with input_np, the name assigned to the input data during compilation.

# rbln_model_zoo/pytorch/vision/detection/yolov8/compile.py
input_info = [
    ("input_np", [1, 3, 640, 640], torch.float32),
]
compiled_model = rebel.compile_from_torch(model, input_info)

# >>> print(model)
'''
DetectionModel(
  (model): Sequential(
    (0): Conv(
      (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
  )
)
'''

The input data is transferred from the host to the device memory inside ATOM™ by the External HDMA command (0_H0_hdma). It is then moved to the shared memory by the Task DMA command (0_0.conv.input_np_0) and used as an operand by the Neural Engine Clusters command (0_0.conv.input_np_1).

The output data is transferred in the reverse order. A Task DMA command moves the required output data from shared memory to DRAM, and an External HDMA command then moves it from DRAM to host memory.
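The postfix-index relationship described in this section can be sketched in a few lines of Python. This is a minimal illustration, not part of RBLN Profiler: the helpers `split_postfix` and `group_by_base` are hypothetical, and the pattern "base name, underscore, integer index" is an assumption inferred only from the command names quoted above.

```python
import re
from collections import defaultdict

# Assumed shape, inferred from the names quoted in this section:
# "<base>_<postfix index>", e.g. "0_0.conv.input_np_1".
_POSTFIX = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def split_postfix(name):
    """Return (base, index), or (name, None) when there is no numeric postfix."""
    match = _POSTFIX.match(name)
    if match is None:
        return name, None
    return match.group("base"), int(match.group("index"))


def group_by_base(names):
    """Group command names that are identical except for the postfix index."""
    groups = defaultdict(list)
    for name in names:
        base, index = split_postfix(name)
        if index is not None:  # skip names such as "0_H0_hdma"
            groups[base].append(index)
    return {base: sorted(indices) for base, indices in groups.items()}


commands = ["0_H0_hdma", "0_0.conv.input_np_0", "0_0.conv.input_np_1"]
print(group_by_base(commands))
```

Commands that land in the same group are candidates for a Task DMA / Neural Engine Clusters pair; the flow arrows in the profiler view remain the authoritative source for the actual dependency.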

Analysis of C2f block

# https://github.com/autogyro/yolo-V8/blob/cc3c774bde86ffce694d202b7383da6cc1721c1b/ultralytics/nn/modules.py#L179C1-L191C41
class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

# >>> print(model)
'''
(2): C2f(
  (cv1): Conv(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
    (act): SiLU(inplace=True)
  )
  (cv2): Conv(
    (conv): Conv2d(96, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
    (act): SiLU(inplace=True)
  )
  (m): ModuleList(
    (0): Bottleneck(
      (cv1): Conv(
        (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU(inplace=True)
      )
      (cv2): Conv(
        (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU(inplace=True)
      )
    )
  )
)
'''
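
The channel counts in the printed module follow directly from the arithmetic in `C2f.__init__`. The sketch below reproduces that arithmetic in plain Python without building any layers; `c2f_channels` is a hypothetical helper written for this illustration, and the arguments `c1=64, c2=64, n=1` are read off the printed block above.

```python
def c2f_channels(c1, c2, n=1, e=0.5):
    """Mirror the channel arithmetic of C2f.__init__ as (in, out) pairs."""
    c = int(c2 * e)  # hidden channels (self.c)
    return {
        "hidden": c,
        "cv1": (c1, 2 * c),        # Conv(c1, 2 * self.c, 1, 1)
        "bottleneck": (c, c),      # each Bottleneck(self.c, self.c, ...)
        "cv2": ((2 + n) * c, c2),  # Conv((2 + n) * self.c, c2, 1)
    }


layout = c2f_channels(c1=64, c2=64, n=1)
for name, channels in layout.items():
    print(name, channels)
```

With these arguments the hidden width is 32, so `cv1` maps 64 to 64 channels, each bottleneck convolution maps 32 to 32, and `cv2` receives (2 + 1) * 32 = 96 channels: the two halves of the split plus one bottleneck output.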

Ambiguous command name

The operation 0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1 in the blue box corresponds to the fused self.cv1(x) block in the PyTorch code and to the first (cv1) block in the output of print(model). From this, we can infer that the subsequent 0_2_0 in the red box corresponds to .split((self.c, self.c), 1) in the PyTorch code. For details on how these names are assigned, refer to Details in Naming Policy.
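
To map a fused command back to the modules in print(model), the module paths inside the parentheses can be extracted programmatically. The following is a minimal sketch: `parse_fused` is a hypothetical helper, and the pattern is an assumption inferred from the single fused name quoted above, not from the naming policy itself. The field names `prefix` and `index` are labels chosen for this example only.

```python
import re

# Assumed shape, inferred from one example in this section:
# "<number>_fused(<module>+<module>+...)_<postfix index>"
_FUSED = re.compile(
    r"^(?P<prefix>\d+)_fused\((?P<members>[^)]*)\)_(?P<index>\d+)$"
)


def parse_fused(name):
    """Split a fused command name into its parts, or return None if it does not match."""
    match = _FUSED.match(name)
    if match is None:
        return None
    return {
        "prefix": int(match.group("prefix")),
        "members": match.group("members").split("+"),
        "index": int(match.group("index")),
    }


print(parse_fused("0_fused(2.cv1+2.cv1.bn+2.cv1.conv)_1"))
print(parse_fused("0_2_0"))  # not a fused name under this assumed pattern
```

Each entry in `members` is a module path such as 2.cv1.conv, which can be matched against the nested names in the print(model) output.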

Connection flow

The Neural Engine Clusters commands that follow 0_2_0, such as (1-1) and (2-1), correspond to the convolution blocks within the bottleneck defined by self.m in the PyTorch code. The flows from the Task DMA commands (1-0, 2-0) to the Neural Engine Clusters commands (1-1, 2-1) show that each Neural Engine Clusters command needs its weights loaded in advance before it performs the actual tensor operation. Command (3) corresponds to the add operation inside the bottleneck block, and command (4) corresponds to the concatenation torch.cat(y, 1) in self.cv2(torch.cat(y, 1)) in the source code.