YOLOv8¶

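Overview¶

This page describes how to serve a YOLOv8 model compiled with the RBLN SDK on Ray Serve.

The tutorial proceeds as follows:

- Check the environment and the compiled model
- Define a Ray Serve deployment that uses RBLN NPUs
- Run the model serving application with the Serve CLI
- Send an inference request to verify the endpoint
- Apply batching and bucketing

For instructions on setting up a Ray Serve environment, see the Ray Serve overview first. For complete script-based examples of model compilation and deployment, see the Model Zoo.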

Environment Setup and Installation Check¶

Before you begin, make sure the system environment is configured correctly and all required packages are installed:

System requirements:
- Ubuntu 20.04 LTS (Debian bullseye) or higher
- A system equipped with RBLN NPUs (e.g., RBLN ATOM™)
Required packages:
- RBLN SDK (driver, compiler) (Driver ≥ 2.0.1, rebel-compiler ≥ 0.9.4)
- Ray Serve

Installation command:

pip install -U ray[serve] requests torch --extra-index-url https://download.pytorch.org/whl/cpu
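To confirm that the installed packages satisfy the minimum versions listed above, a small check along the following lines may help. This is a hypothetical helper rather than part of the RBLN SDK: it assumes the compiler is installed under the distribution name `rebel-compiler`, that version strings are plain dotted integers, and the function names are made up for illustration.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.9.4' into (0, 9, 4); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(name, minimum):
    """Describe whether a distribution is installed and recent enough."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    if meets_minimum(installed, minimum):
        return f"{name} {installed}: OK"
    return f"{name} {installed}: too old, need >= {minimum}"


if __name__ == "__main__":
    # Minimum taken from the requirements list above.
    print(check_package("rebel-compiler", "0.9.4"))
    # No minimum is stated for Ray, so this only checks that it is installed.
    print(check_package("ray", "0"))
```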

Note

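This tutorial assumes that you are already familiar with compiling models and running inference with the RBLN SDK. If you are not, see the PyTorch/TensorFlow tutorials and the Python API pages.

Prerequisites¶

Compiled YOLOv8 Model¶

Note

Prepare a compiled model artifact in advance (for example, yolov8l.rbln generated by Model Zoo – Ray Serve YOLOv8). The COCO label file (coco128.yaml) must be placed in the same location as the binary. The steps below focus on serving this bundle with Ray Serve.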
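Because the deployment expects both files side by side, a quick pre-flight check can catch a missing artifact before the application starts. The sketch below is an assumption-laden illustration: the helper name `missing_artifacts` is hypothetical, and it assumes the two files keep the names mentioned above.

```python
from pathlib import Path

# File names taken from the note above.
REQUIRED_FILES = ("yolov8l.rbln", "coco128.yaml")


def missing_artifacts(directory="."):
    """Return the required files that are not present in the given directory."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required artifacts found.")
```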

Deployment¶

Deployment Overview¶

Step	Description
1. Implement the deployment	Configure Ray to use RBLN NPUs, then define a Ray Serve deployment that loads the compiled model, initializes the runtime, and exposes an endpoint.
2. Run	Start the deployment with the Ray Serve CLI (`serve run`). The application name, device set, or a remote Ray cluster can be configured through options.
3. Send an inference request	Send an HTTP request to the Serve endpoint and inspect the response to verify the deployment.

The sections below walk through these steps in order.

1.1 Resource Allocation¶

In Ray Serve, NPU resources are allocated per serving unit (Actor or Deployment) through the resources parameter.

The Actor below requests one RBLN NPU with @ray.remote(resources={"RBLN": 1}). Adjust the value to match the number of NPUs the deployment needs. For details, see Using RBLN NPUs with Ray.

@ray.remote(resources={"RBLN": 1})
class RBLNActor:
    def getDeviceId(self):
        return ray.get_runtime_context().get_accelerator_ids()["RBLN"]

1.2 Implementing the Deployment¶

A Ray Serve deployment is defined by applying the @serve.deployment decorator to a class. The decorator registers the class as a Ray Serve service endpoint, and Ray Serve then manages the lifecycle of each Deployment (deploying, scaling, updating, and so on).

yolov8.py
# File name: yolov8.py
import io
import json
import os

import numpy as np
import ray
import rebel
import torch
import yaml
from PIL import Image
from ray import serve
from starlette.requests import Request
from ultralytics.data.augment import LetterBox
from ultralytics.utils.ops import non_max_suppression as nms
from ultralytics.utils.ops import scale_boxes

ray.init(resources={"RBLN": 1})


@ray.remote(resources={"RBLN": 1})
class RBLNActor:
    def getDeviceId(self):
        return ray.get_runtime_context().get_accelerator_ids()["RBLN"]


@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 4})
class Yolov8:
    def __init__(self, rbln_actor: RBLNActor):
        self.initialized = False
        self.rbln_actor = rbln_actor
        self.ids = ray.get(rbln_actor.getDeviceId.remote())
        self.rbln_devices()
        self.initialize()

    def initialize(self):
        """
        Load the compiled model and create the RBLN runtime.
        Called once when the replica starts.
        """
        model_path = "./yolov8l.rbln"
        if not os.path.isfile(model_path):
            raise RuntimeError(
                f"[RBLN ERROR] File not found at the specified model_path({model_path})."
            )
        self.module = rebel.Runtime(
            model_path, tensor_type="pt", device=int(self.ids[0])
        )
        self.initialized = True

    def rbln_devices(self):
        """
        Redefine the environment variables to be passed to the RBLN runtime
        """
        if not self.ids:
            # No NPU assigned: clear the variable if present and stop here
            os.environ.pop("RBLN_DEVICES", None)
            return
        os.environ["RBLN_DEVICES"] = ",".join(self.ids)

    def preprocess(self, input_data):
        """
        Transform the raw request body into a model input tensor.
        :param input_data: raw image bytes from the request
        :return: preprocessed tensor of shape (1, 3, 640, 640)
        """
        if input_data is None:
            raise ValueError("[RBLN][ERROR] Data not found with client request.")
        if not isinstance(input_data, (bytes, bytearray)):
            raise ValueError("[RBLN][ERROR] Preprocessed data is not binary data.")

        try:
            image = Image.open(io.BytesIO(input_data)).convert("RGB")
        except Exception as e:
            raise ValueError(f"[RBLN][ERROR]Invalid image data: {e}") from e
        image = np.array(image)

        # Letterbox to 640x640, HWC -> CHW, scale to [0, 1], add batch dimension
        preprocessed_data = LetterBox(new_shape=(640, 640))(image=image)
        preprocessed_data = preprocessed_data.transpose((2, 0, 1))[::-1]
        preprocessed_data = np.ascontiguousarray(preprocessed_data, dtype=np.float32)
        preprocessed_data = preprocessed_data[None]
        preprocessed_data /= 255

        return torch.from_numpy(preprocessed_data)

    def inference(self, model_input):
        """
        Run the compiled model on the NPU.
        :param model_input: preprocessed model input tensor
        :return: model outputs
        """
        return self.module.run(model_input)

    def postprocess(self, inference_output, input_shape):
        """
        Convert raw predictions into readable detection strings.
        :param inference_output: prediction tensor from the model
        :param input_shape: shape of the preprocessed input tensor
        :return: list of detection strings
        """
        pred = nms(inference_output, 0.25, 0.45, None, False, max_det=1000)[0]
        pred[:, :4] = scale_boxes(input_shape[2:], pred[:, :4], input_shape)

        # Class names come from the COCO label file placed next to the model
        with open("./coco128.yaml") as f:
            data = yaml.safe_load(f)
        names = list(data["names"].values())

        postprocess_output = []
        for *xyxy, conf, cls in reversed(pred):
            xyxy_str = f"{xyxy[0]}, {xyxy[1]}, {xyxy[2]}, {xyxy[3]}"
            postprocess_output.append(
                f"xyxy : {xyxy_str}, conf : {conf}, cls : {names[int(cls)]}"
            )
        return postprocess_output

    def handle(self, data):
        """
        Preprocess the request, run inference, and postprocess the output.
        :param data: raw image bytes
        :return: JSON string with the detection results
        """
        model_input = self.preprocess(data)
        model_output = self.inference(model_input)
        detections = self.postprocess(model_output[0], model_input.shape)

        return json.dumps({"result": detections})

    async def __call__(self, http_request: Request) -> str:
        image_byte = await http_request.body()
        return self.handle(image_byte)


rbln_actor = RBLNActor.remote()
app = Yolov8.bind(rbln_actor)

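2. Run¶

Run the application with the Ray Serve CLI command serve run. Its argument takes the form module:application, where module is the Python file name without the .py extension and application is the name of the Serve entry-point object.

In this example, yolov8.py defines the app object, so the deployment can be started with the command below. Additional options can be combined with it, such as connecting to a remote Ray cluster or selecting specific cards with RBLN_DEVICES.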

$ serve run yolov8:app --name "yolov8"

Example output:

Application 'yolov8' is ready at http://127.0.0.1:8000/.
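The `module:application` form described above can be illustrated with a few lines of Python. This is only a sketch of the naming rule stated in this section; it is not how Ray Serve parses the argument internally, and `split_import_path` is a hypothetical name.

```python
def split_import_path(import_path):
    """Split 'module:application' into its two parts and validate the form."""
    module, separator, application = import_path.partition(":")
    if not separator or not module or not application:
        raise ValueError(f"expected 'module:application', got {import_path!r}")
    if module.endswith(".py"):
        # The module part is the file name without its .py extension.
        raise ValueError("use the module name without the .py extension")
    return module, application


if __name__ == "__main__":
    module, application = split_import_path("yolov8:app")
    print(f"file: {module}.py, entry point: {application}")
```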

3. Inference Request¶

Download a sample image for the YOLOv8 inference request, then verify the endpoint by sending the image with an HTTP POST via curl.

# Download a sample image
$ wget https://rbln-public.s3.ap-northeast-2.amazonaws.com/images/people4.jpg

# Send an inference request
$ curl -X POST http://127.0.0.1:8000/ --header "Content-Type: image/jpeg" --data-binary @./people4.jpg | jq .

Example output:

{
    "result": [
        "xyxy : 1.5246094465255737, 0.111328125, 1.8785157203674316, 1.0, conf : 0.91015625, cls : person",
        "xyxy : 0.643359363079071, 0.21562500298023224, 0.9691406488418579, 1.0, conf : 0.9169921875, cls : person",
        "xyxy : 0.9041016101837158, 0.29414063692092896, 1.2966797351837158, 1.0, conf : 0.9306640625, cls : person",
        "xyxy : 1.9113281965255737, 0.17812500894069672, 2.560546875, 1.0, conf : 0.9423828125, cls : person"
    ]
}
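The same request can also be sent from Python, and the result strings can be turned into structured records. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the parser assumes every entry follows the `xyxy : ..., conf : ..., cls : ...` layout shown in the example output.

```python
import json
import re
import urllib.request

# Layout of one result entry, as shown in the example output above.
DETECTION_PATTERN = re.compile(
    r"xyxy : (?P<box>.+), conf : (?P<conf>[^,]+), cls : (?P<cls>.+)"
)


def build_request(url, image_bytes):
    """Build the POST request that the curl command above sends."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        headers={"Content-Type": "image/jpeg"},
        method="POST",
    )


def infer(url, image_path):
    """Send an image file to the endpoint and decode the JSON response."""
    with open(image_path, "rb") as f:
        request = build_request(url, f.read())
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def parse_detection(entry):
    """Convert one result string into a dictionary with box, score, and class."""
    match = DETECTION_PATTERN.fullmatch(entry)
    if match is None:
        raise ValueError(f"unexpected detection format: {entry!r}")
    box = [float(value) for value in match.group("box").split(", ")]
    return {
        "xyxy": box,
        "conf": float(match.group("conf")),
        "cls": match.group("cls"),
    }
```

With the application running, `infer("http://127.0.0.1:8000/", "./people4.jpg")` returns the decoded response, and each string in its `"result"` list can be passed to `parse_detection`.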

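Advanced Features¶

Batch Inference¶

Ray Serve can collect multiple requests and process them in a single call. The entry function must be defined with async def, must accept the requests as a List, and must be decorated with @serve.batch. The following options let you balance throughput against latency:

- max_batch_size: the maximum number of requests processed in one batch.
- batch_wait_timeout_s: the maximum time to wait for additional requests before running a batch that has not yet reached max_batch_size.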
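To make the interaction between the two options concrete, the sketch below imitates the behavior with plain asyncio. It is a simplified stand-in written for illustration, not Ray Serve's implementation: `MicroBatcher` and `demo` are hypothetical names, and error handling is omitted.

```python
import asyncio


class MicroBatcher:
    """Collects single items and passes them to `handler` as lists."""

    def __init__(self, handler, max_batch_size, batch_wait_timeout_s):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.queue = None
        self.worker = None

    async def submit(self, item):
        # Start the background worker lazily, inside the running event loop.
        if self.worker is None:
            self.queue = asyncio.Queue()
            self.worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request of the next batch arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.batch_wait_timeout_s
            # Keep collecting until the batch is full or the wait time is over.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            results = await self.handler([item for item, _ in batch])
            # Hand each caller the result that belongs to its own request.
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo():
    seen_batch_sizes = []

    async def handler(items):
        seen_batch_sizes.append(len(items))
        return [item * 2 for item in items]

    batcher = MicroBatcher(handler, max_batch_size=4, batch_wait_timeout_s=0.5)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    return results, seen_batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Six requests submitted at once are handled as one full batch of four, followed by a batch of two that runs after the wait time expires. A larger max_batch_size favors throughput, while a shorter timeout favors latency.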

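Model Compilation¶

The RBLN compiler provides a "bucketing" feature that compiles a single model to support multiple input shapes efficiently. With bucketing, a single deployment can handle several batch sizes without recompilation. For details, see the bucketing tutorial.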

bucketing_compile.py
import os
import sys

import rebel
import torch

# Make the local ultralytics checkout importable
sys.path.append(os.path.join(sys.path[0], "ultralytics"))
from ultralytics import YOLO


def main():
    model_name = "yolov8l"
    # Batch sizes to compile as buckets
    batches = [1, 2, 3, 4]

    model = YOLO(model_name + ".pt").model
    model.eval()

    # One input_info entry per batch size
    input_infos = []
    for batch in batches:
        input_info = [
            ("input_np", [batch, 3, 640, 640], torch.float32),
        ]
        input_infos.append(input_info)

    # Compile torch model for ATOM™
    compiled_model = rebel.compile_from_torch(model, input_info=input_infos)

    # Save compiled results to disk under the name loaded by yolov8_batch.py
    compiled_model.save(f"{model_name}_bucketing.rbln")


if __name__ == "__main__":
    main()
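Since @serve.batch may deliver any number of requests from 1 up to max_batch_size, it is worth checking that the compiled batch sizes cover that whole range. The sketch below assumes the runtime needs an input whose batch dimension matches one of the compiled buckets exactly; `uncovered_batch_sizes` is a hypothetical helper, not an RBLN SDK function.

```python
def uncovered_batch_sizes(compiled_batches, max_batch_size):
    """Return batch sizes in 1..max_batch_size that have no compiled bucket."""
    compiled = set(compiled_batches)
    return [size for size in range(1, max_batch_size + 1) if size not in compiled]


if __name__ == "__main__":
    # Buckets from the compile script above, checked against max_batch_size=4.
    gaps = uncovered_batch_sizes([1, 2, 3, 4], max_batch_size=4)
    print("uncovered batch sizes:", gaps)
```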

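Deploying a Batched Model with Bucketing¶

A model compiled with bucketing can be combined with Ray Serve's @serve.batch decorator to serve dynamic batch sizes efficiently. Running the application and sending requests work the same way as in the Deployment section, so the same CLI workflow and HTTP request apply.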

yolov8_batch.py
# File name: yolov8_batch.py
import io
import json
import os
from typing import List

import numpy as np
import ray
import rebel
import torch
import yaml
from PIL import Image
from ray import serve
from starlette.requests import Request
from ultralytics.data.augment import LetterBox
from ultralytics.utils.ops import non_max_suppression as nms
from ultralytics.utils.ops import scale_boxes

ray.init(resources={"RBLN": 1})


@ray.remote(resources={"RBLN": 1})
class RBLNActor:
    def getDeviceId(self):
        return ray.get_runtime_context().get_accelerator_ids()["RBLN"]


@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 4})
class Yolov8:
    async def __init__(self, rbln_actor: RBLNActor):
        self.initialized = False
        self.rbln_actor = rbln_actor
        self.ids = ray.get(rbln_actor.getDeviceId.remote())
        self.input_images = []
        # Set for each incoming batch in __call__
        self.batch_size = None
        await self.rbln_devices()
        await self.initialize()

    async def initialize(self):
        """
        Initialize model. This will be called during model loading time
        :return:
        """
        model_path = "./yolov8l_bucketing.rbln"
        if not os.path.isfile(model_path):
            raise RuntimeError(
                f"[RBLN ERROR] File not found at the specified model_path({model_path})."
            )
        compiled_model = rebel.RBLNCompiledModel(model_path)
        self.module = rebel.AsyncRuntime(
            compiled_model, tensor_type="pt", device=int(self.ids[0])
        )
        self.initialized = True

    async def rbln_devices(self):
        """
        Redefine the environment variables to be passed to the RBLN runtime
        :return:
        """
        if not self.ids:
            # No NPU assigned: clear the variable if present and stop here
            os.environ.pop("RBLN_DEVICES", None)
            return
        os.environ["RBLN_DEVICES"] = ",".join(self.ids)

    async def preprocess(self, input_data_list):
        """
        Transform raw input into model input data.
        :param input_data_list: list of raw requests, should match batch size
        :return: list of preprocessed model input data
        """
        self.input_images.clear()

        for input_data in input_data_list:
            if input_data is None:
                raise ValueError("[RBLN][ERROR] Data not found with client request.")
            if not isinstance(input_data, (bytes, bytearray)):
                raise ValueError("[RBLN][ERROR] Preprocessed data is not binary data.")

            try:
                image = Image.open(io.BytesIO(input_data)).convert("RGB")
            except Exception as e:
                raise ValueError(f"[RBLN][ERROR]Invalid image data: {e}") from e
            image = np.array(image)

            preprocessed_data = LetterBox(new_shape=(640, 640))(image=image)
            preprocessed_data = preprocessed_data.transpose((2, 0, 1))[::-1]
            preprocessed_data = np.ascontiguousarray(
                preprocessed_data, dtype=np.float32
            )
            preprocessed_data = preprocessed_data[None]
            preprocessed_data /= 255
            self.input_images.append(preprocessed_data)

        preprocessed_datas = np.concatenate(self.input_images, axis=0).copy()

        return torch.from_numpy(preprocessed_datas)

    async def inference(self, model_input):
        """
        Internal inference methods
        :param model_input: transformed model input data (batch)
        :return: list of inference output in NDArray
        """
        task = self.module.run(model_input)
        return task.wait()

    async def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: batch of inference output
        :return: list of predict results
        """
        chunky_batched_result = np.array_split(
            inference_output, self.batch_size, axis=0
        )

        postprocess_outputs = []
        for idx, result in enumerate(chunky_batched_result):
            nms_result = nms(result, 0.25, 0.45, None, False, max_det=1000)
            pred = nms_result[0]
            pred[:, :4] = scale_boxes(
                self.input_images[idx].shape[2:],
                pred[:, :4],
                self.input_images[idx].shape,
            )
            yaml_path = "./coco128.yaml"

            postprocess_output = []
            with open(yaml_path) as f:
                data = yaml.safe_load(f)
            names = list(data["names"].values())
            for *xyxy, conf, cls in reversed(pred):
                xyxy_str = f"{xyxy[0]}, {xyxy[1]}, {xyxy[2]}, {xyxy[3]}"
                postprocess_output.append(
                    f"xyxy : {xyxy_str}, conf : {conf}, cls : {names[int(cls)]}"
                )
            postprocess_outputs.append(
                [{f"result[{len(postprocess_outputs)}]": postprocess_output}]
            )

        return postprocess_outputs

    @serve.batch(max_batch_size=4, batch_wait_timeout_s=0.5)
    async def __call__(self, http_requests: List[Request]) -> List[str]:
        """
        Handle batch of HTTP requests
        :param http_requests: List of HTTP requests
        :return: List of JSON string results
        """
        self.batch_size = len(http_requests)
        image_bytes_list = []
        for request in http_requests:
            image_byte = await request.body()
            image_bytes_list.append(image_byte)

        # Process batch
        model_input = await self.preprocess(image_bytes_list)
        model_output = await self.inference(model_input)
        result_list = await self.postprocess(model_output[0])

        # Return list of JSON strings for each prediction
        results = []
        for idx, result in enumerate(result_list):
            results.append(json.dumps({f"{idx}": result}))

        return results


rbln_actor = RBLNActor.remote()
app = Yolov8.bind(rbln_actor)

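Run and Inference¶

The steps for running the batched bucketing deployment and sending inference requests are the same as in the Deployment section. Only the module name passed to serve run changes to match the file name (yolov8_batch in this example).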
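Batching only takes effect when several requests arrive within the wait window, so a sequential curl loop may never fill a batch. One way to exercise it is to send requests in parallel, as in the sketch below. The helper names are hypothetical, and the URL in the usage note assumes the default local endpoint used earlier on this page.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_image(url, image_bytes):
    """POST one image to the endpoint and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=image_bytes,
        headers={"Content-Type": "image/jpeg"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def send_concurrently(send, payloads, max_workers=4):
    """Call `send` on every payload in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))
```

For example, after reading `people4.jpg` into `image_bytes`, calling `send_concurrently(lambda data: post_image("http://127.0.0.1:8000/", data), [image_bytes] * 4)` submits four requests at roughly the same time.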