YOLOv8

In this tutorial, we will guide you through the steps required to integrate RBLN SDK with TorchServe using a precompiled YOLOv8 model. For instructions on setting up the TorchServe environment, refer to TorchServe.

You can check out the actual commands required to compile the YOLOv8 model in our model zoo.

Note

This tutorial assumes that you are familiar with compiling and running inference using the RBLN SDK. If you are not familiar with RBLN SDK, refer to PyTorch/TensorFlow tutorials and the API Documentation.

Prerequisites

Before we start, please make sure the following prerequisites are prepared on your system:

  • A working TorchServe environment (see the TorchServe setup guide linked above)
  • The RBLN SDK and the precompiled yolov8l.rbln model file
  • The ultralytics package, which the request handler below depends on

Quick Start with TorchServe

In TorchServe, models are served as Model Archive (.mar) units, which contain all necessary information for serving the model. The following guide explains how to create a .mar file and use it for model serving.

Write the Model Request Handler

Below is a simple handler that inherits from the TorchServe BaseHandler to serve YOLOv8 inference requests. It defines initialize(), preprocess(), inference(), postprocess(), and handle() for model serving. The initialize() method is called when the model is loaded from the model_store directory, and handle() is invoked for each prediction request made through the TorchServe Inference API.

yolov8l_handler.py
# yolov8l_handler.py

"""
ModelHandler defines a custom model handler.
"""

import os
import torch
import rebel  # RBLN Runtime
import PIL.Image as Image
import numpy as np
import yaml
import io
from ultralytics.data.augment import LetterBox
from ultralytics.utils.ops import non_max_suppression as nms, scale_boxes
from ts.torch_handler.base_handler import BaseHandler


class YOLOv8_Handler(BaseHandler):
    """
    A custom model handler implementation.
    """

    def __init__(self):
        self._context = None
        self.initialized = False
        self.explain = False
        self.target = 0
        self.input_image = None

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time
        :param context: Initial context contains model server system properties.
        :return:
        """
        self._context = context
        # Load the compiled model referenced in the model archive manifest
        model_dir = context.system_properties.get("model_dir")
        serialized_file = context.manifest["model"].get("serializedFile")
        model_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_path):
            raise RuntimeError(
                f"[RBLN ERROR] File not found at the specified model_path({model_path})."
            )

        self.module = rebel.Runtime(model_path)
        self.initialized = True

    def preprocess(self, data):
        """
        Transform raw input into model input data.
        :param data: list of raw requests, should match batch size
        :return: list of preprocessed model input data
        """
        # Take the input data and make it inference ready
        preprocessed_data = data[0].get("data")
        if preprocessed_data is None:
            preprocessed_data = data[0].get("body")

            image = Image.open(io.BytesIO(preprocessed_data)).convert("RGB")
            image = np.array(image)

            preprocessed_data = LetterBox(new_shape=(640, 640))(image=image)
            preprocessed_data = preprocessed_data.transpose((2, 0, 1))[::-1]
            preprocessed_data = np.ascontiguousarray(
                preprocessed_data, dtype=np.float32
            )
            preprocessed_data = preprocessed_data[None]
            preprocessed_data /= 255
            self.input_image = preprocessed_data

        return preprocessed_data

    def inference(self, model_input):
        """
        Internal inference methods
        :param model_input: transformed model input data
        :return: list of inference output in NDArray
        """
        # Do some inference call to engine here and return output
        model_output = self.module.run(model_input)
        return model_output

    def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: list of inference output
        :return: list of predict results
        """
        # Take output from network and post-process to desired format
        postprocess_output = inference_output

        pred = nms(
            torch.from_numpy(postprocess_output), 0.25, 0.45, None, False, max_det=1000
        )[0]
        pred[:, :4] = scale_boxes(
            self.input_image.shape[2:], pred[:, :4], self.input_image.shape
        )
        yaml_path = "./coco128.yaml"

        postprocess_output = []
        with open(yaml_path) as f:
            data = yaml.safe_load(f)
        names = list(data["names"].values())
        for *xyxy, conf, cls in reversed(pred):
            xyxy_str = f"{xyxy[0]}, {xyxy[1]}, {xyxy[2]}, {xyxy[3]}"
            postprocess_output.append(
                f"xyxy : {xyxy_str}, conf : {conf}, cls : {names[int(cls)]}"
            )

        return postprocess_output

    def handle(self, data, context):
        """
        Invoked by TorchServe for a prediction request.
        Performs pre-processing of the input data, runs inference with the model, and post-processes the prediction output
        :param data: Input data for prediction
        :param context: Initial context contains model server system properties.
        :return: prediction output
        """
        model_input = self.preprocess(data)
        model_output = self.inference(model_input)
        return [{"result": self.postprocess(model_output[0])}]

Write the Model Configuration

Create a config.properties file as follows. In this example, max_request_size and max_response_size are set to 100 MB to accommodate large input images.

config.properties
max_request_size=104857600
max_response_size=104857600

models={\
  "yolov8l": {\
    "1.0": {\
        "marName": "yolov8l.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "responseTimeout": 120\
    }\
  }\
}

Model Archiving with torch-model-archiver

Create a model_store directory to hold the .mar files to be served, including the YOLOv8 model archive used in this tutorial.

$ mkdir model_store 

Now that the setup for model archiving is complete, run the torch-model-archiver command to create the model archive file. The model_store folder, where the generated yolov8l.mar archive file will be located, is passed as a parameter when TorchServe starts.

$ torch-model-archiver --model-name yolov8l \
        --version 1.0 \
        --serialized-file ./yolov8l.rbln \
        --handler ./yolov8l_handler.py \
        --extra-files ./coco128.yaml \
        --export-path ./model_store

The options passed to torch-model-archiver are as follows.

  • --model-name: Specifies the name of the model to be served, set as yolov8l.
  • --version: Defines the version of the model to be served with TorchServe.
  • --serialized-file: Specifies the weight file of the compiled model. Set to yolov8l.rbln.
  • --handler: Specifies the handler script for the model, set as yolov8l_handler.py.
  • --extra-files: Specifies additional files to include in the archive, set as coco128.yaml.
  • --export-path: Specifies the output directory for the archived file. The previously created model_store folder is set as the destination.

After executing the command, the yolov8l.mar file is generated in the model_store directory specified by --export-path.

+-- (YOUR_PATH)/
|   +-- model_store/
|   |   +-- yolov8l.mar
|   +-- yolov8l.rbln
|   +-- yolov8l_handler.py
|   +-- coco128.yaml
|   +-- config.properties

Run TorchServe

TorchServe can be started using the following command. For a simple test where token authentication is not required, you can use the --disable-token-auth option.

$ torchserve --start --ncs --ts-config ./config.properties --model-store ./model_store --models yolov8l.mar --disable-token-auth
  • --start: Starts the TorchServe service.
  • --ncs: Disables the snapshot feature.
  • --ts-config: Specifies the TorchServe configuration file.
  • --model-store: Specifies the directory containing the model archive (.mar) files.
  • --models: Specifies the model(s) to serve. If all is specified, every model in the model_store directory is served.
  • --disable-token-auth: Disables token authentication.

When TorchServe is successfully started, it operates in the background. The command to stop TorchServe is shown below:

$ torchserve --stop

TorchServe provides the Management API on port 8081 and the Inference API on port 8080 by default.
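
As a quick check that the Inference API is responding, you can call its health-check endpoint. The sketch below uses Python and assumes the requests package is installed; the same check can be made with curl against http://localhost:8080/ping.

# Health check against the TorchServe Inference API (minimal sketch;
# assumes the `requests` package is installed).
import requests

resp = requests.get("http://localhost:8080/ping")
print(resp.status_code, resp.text)  # a healthy server returns HTTP 200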

You can check the list of models currently being served using the following Management API.

$ curl -X GET "http://localhost:8081/models"

If the operation is successful, you can verify that the YOLOv8 model is being served.

{
  "models": [
    {
      "modelName": "yolov8l",
      "modelUrl": "yolov8l.mar"
    }
  ]
}

Inference Request with TorchServe Inference API

Now we can send a prediction request through the TorchServe Inference API to test the YOLOv8 model served with TorchServe.

Download a sample image for the YOLOv8 inference request.

$ wget https://rbln-public.s3.ap-northeast-2.amazonaws.com/images/people4.jpg

Make an inference request with curl using the TorchServe Inference API.

$ curl -X POST "http://127.0.0.1:8080/predictions/yolov8l" -H "Content-Type: application/octet-stream" --data-binary @./people4.jpg

If the inference request is successful, the following response is returned.

{
  "result": [
    "xyxy : 1.5238770246505737, 0.10898438096046448, 1.8791016340255737, 1.0, conf : 0.91015625, cls : person",
    "xyxy : 0.6436523795127869, 0.2138671875, 0.968994140625, 1.0, conf : 0.916015625, cls : person",
    "xyxy : 0.90380859375, 0.29179689288139343, 1.296240210533142, 1.0, conf : 0.9296875, cls : person",
    "xyxy : 1.9107422828674316, 0.17695312201976776, 2.558789014816284, 1.0, conf : 0.943359375, cls : person"
  ]
}
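
The same prediction request can also be sent from Python. The sketch below assumes the requests package is installed and that people4.jpg is in the current directory.

# Prediction request to the served YOLOv8 model (minimal sketch;
# assumes the `requests` package is installed).
import requests

with open("people4.jpg", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8080/predictions/yolov8l",
        headers={"Content-Type": "application/octet-stream"},
        data=f.read(),
    )
print(resp.json())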