YOLOv8

In this tutorial, we will guide you through the steps required to integrate RBLN SDK with Nvidia Triton Inference Server using a precompiled YOLOv8 model.

You can check out the actual commands required to compile the model and initialize Nvidia Triton Python Backend on our model-zoo.
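
For reference, compiling YOLOv8l for the RBLN NPU roughly follows the pattern below. This is a simplified sketch, not the exact model-zoo script: the yolov8l.pt checkpoint name and the save() call are assumptions here, and the input shape matches the 1x3x640x640 tensor used throughout this tutorial.

import rebel
from ultralytics import YOLO

# Load the underlying PyTorch module from the Ultralytics checkpoint
torch_model = YOLO("yolov8l.pt").model.eval()

# Single 640x640 RGB image, matching the preprocessing used by client.py below
input_info = [("input_np", [1, 3, 640, 640], "float32")]

compiled_model = rebel.compile_from_torch(torch_model, input_info=input_info)
compiled_model.save("yolov8l.rbln")  # assumed save() API; see the model-zoo for the exact commands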

Note

This tutorial is written with the assumption that the reader already has a good understanding of how to compile and infer models using RBLN SDK. If you are not familiar with RBLN SDK, please refer to the PyTorch/TensorFlow tutorials and the API page.

Prerequisites

Before we start, please make sure the following prerequisites are prepared on your system:

  • RBLN SDK (rebel-compiler) and the RBLN NPU driver
  • The precompiled yolov8l.rbln model
  • The coco128.yaml label file (coco label)
  • Docker (for running the Triton Inference Server container on-premise)

Quick Start with Triton Inference Server Container

If you are not running from Backend.AI, skip to Step 1.

Step 0. Starting session with Triton server image

When starting your session via Backend.AI, select Triton Server (NGC) as your environment. This will automatically set up the environment to rebellions/tritonserver:24.12-vllm-python-py3.

Step 1. Prepare the Nvidia Triton python_backend

First, clone the Nvidia Triton Inference Server python_backend repository using the following command:

$ git clone https://github.com/triton-inference-server/python_backend -b r24.12

Before proceeding to the next step, place the precompiled yolov8l.rbln and coco128.yaml files (see Prerequisites - coco label) in the python_backend/examples/rbln/yolov8l/1 directory:

$ mkdir -p python_backend/examples/rbln/yolov8l/1
$ mv yolov8l.rbln python_backend/examples/rbln/yolov8l/1/
$ mv coco128.yaml python_backend/examples/rbln/yolov8l/1/

Step 2. Write your own TritonPythonModel using RBLN SDK

The Triton python_backend requires users to write a TritonPythonModel class with the following member methods:

  • auto_complete_config()
  • initialize()
  • execute()
  • finalize()

Please refer to the official Triton python_backend repository for more detailed information about each function.

Below is a simple model.py, where we define the initialize() and execute() functions for loading the model and performing inference, respectively. Save this code as python_backend/examples/rbln/yolov8l/1/model.py, alongside the yolov8l.rbln file.

model.py
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#
# model.py
#
import json
import os
import rebel  # RBLN Runtime
import triton_python_backend_utils as pb_utils

# Number of devices to allocate.
# Available device numbers can be found through `rbln-stat` command.
NUM_OF_DEVICES = 1


class TritonPythonModel:
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_instance_name: A string containing model instance name in form of <model_name>_<instance_group_id>_<instance_id>
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

        self.model_config = model_config = json.loads(args["model_config"])
        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config["data_type"]
        )

        # Path to rbln compiled model file
        rbln_path = os.path.join(
            args["model_repository"],
            args["model_version"],
            f"{args['model_name']}.rbln",
        )

        # Create rbln runtime module
        self.module = rebel.Runtime(rbln_path)

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []

        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT__0")

            # Run inference
            result = self.module.run(in_0.as_numpy())
            out_tensor_0 = pb_utils.Tensor("OUTPUT__0", result[0].astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0]
            )
            responses.append(inference_response)

        return responses

The following config.pbtxt file must be saved as python_backend/examples/rbln/yolov8l/config.pbtxt.

config.pbtxt
name: "yolov8l"
backend: "python"

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1, 84, 8400 ]
  },
  {
    name: "OUTPUT__1"
    data_type: TYPE_FP32
    dims: [ 1, 144, 80, 80 ]
  },
  {
    name: "OUTPUT__2"
    data_type: TYPE_FP32
    dims: [ 1, 144, 80, 80 ]
  },
  {
    name: "OUTPUT__3"
    data_type: TYPE_FP32
    dims: [ 1, 144, 40, 40 ]
  }
]

# Configure instance group
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

max_batch_size: 1

If you have successfully completed the steps so far, you will have the following directory structure:

+-- yolov8l
|    +-- config.pbtxt
|    +-- 1/
|    |   +-- coco128.yaml
|    |   +-- model.py
|    |   +-- yolov8l.rbln

Step 3. Run the inference server in the container

We are now ready to run the inference server. If you are using Backend.AI, please refer to the Backend.AI section. If you are not a Backend.AI user, proceed to the On-premise server section.

Backend.AI

Start within Backend.AI docker

Install the RBLN SDK:

$ pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler

Start the Triton server:

$ tritonserver --model-repository /opt/tritonserver/python_backend/examples/rbln

You will see the following messages, indicating that the server has started successfully:

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002

On-premise server

If you are not using Backend.AI, follow these steps to start the inference server in the Docker container. (Backend.AI users can skip to Step 4.)

To access the RBLN NPU devices, the inference server container must run either in privileged mode or with the required RBLN NPU devices mounted. In this tutorial, we run the container with /dev/rbln0 mounted, along with the cloned python_backend repository.

Use the following command to execute the container:

$ docker run --device /dev/rbln0  --shm-size=1g --ulimit memlock=-1 \
   -v /PATH/TO/YOUR/python_backend:/opt/tritonserver/python_backend \
   -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:24.12-py3

Install the RBLN SDK in the container:

$ pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler

Start the Triton Server in the container:

$ tritonserver --model-repository /opt/tritonserver/python_backend/examples/rbln

You will see the following messages, indicating that the server has started successfully:

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
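
Optionally, before sending inference requests you can confirm that the server and the yolov8l model are ready. Below is a minimal check using the Triton HTTP client, assuming the default port mapping above:

import tritonclient.http as httpclient

# Connect to the HTTP endpoint exposed on port 8000
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("yolov8l"))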

Step 4. Requesting inference via HTTP API

Before proceeding, install the required dependencies:

$ pip3 install tritonclient==2.41.1 gevent geventhttpclient
$ pip3 install fire opencv-python nvidia-pytriton

Install the packages required for testing YOLOv8 with client.py:

$ pip3 install --extra-index-url https://download.pytorch.org/whl/cpu "matplotlib>=3.8.0" "scipy>=1.10.1" "torch<=2.5.1" "torchvision<=0.20.1" "opencv-python-headless==4.10.0.84" "pyyaml>=6.0.1" "requests==2.32.3" "ultralytics==8.0.145"

Below is a sample client.py for making a YOLOv8 inference request:

client.py
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import os
import urllib.request

import cv2
import fire
import torch
import yaml
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

from ultralytics.data.augment import LetterBox
from ultralytics.yolo.utils.ops import non_max_suppression as nms, scale_boxes
from ultralytics.yolo.utils.plotting import Annotator

DEFAULT_URL = "localhost:8000"
DEFAULT_REQUESTS = 10
MODEL_NAME = "yolov8l"


def preprocess(image):
    preprocess_input = LetterBox(new_shape=(640, 640))(image=image)
    preprocess_input = preprocess_input.transpose((2, 0, 1))[::-1]
    preprocess_input = np.ascontiguousarray(preprocess_input, dtype=np.float32)
    preprocess_input = preprocess_input[None]
    preprocess_input /= 255

    return preprocess_input


def postprocess(outputs, input_image, origin_image):
    pred = nms(torch.from_numpy(outputs), 0.25, 0.45, None, False, max_det=1000)[0]
    pred[:, :4] = scale_boxes(input_image.shape[2:], pred[:, :4], origin_image.shape)
    annotator = Annotator(origin_image, line_width=3)
    yaml_path = os.path.abspath(os.path.dirname(__file__)) + "/coco128.yaml"
    with open(yaml_path) as f:
        data = yaml.safe_load(f)
    names = list(data["names"].values())
    for *xyxy, conf, cls in reversed(pred):
        c = int(cls)
        label = f"{names[c]} {conf:.2f}"
        annotator.box_label(xyxy, label=label)

    return annotator.result()


def infer(
    url: str = DEFAULT_URL,
    requests: int = DEFAULT_REQUESTS,
    verbose: bool = False,
):
    # Prepare input image
    img_url = "https://rbln-public.s3.ap-northeast-2.amazonaws.com/images/people4.jpg"
    img_path = "./people.jpg"
    with urllib.request.urlopen(img_url) as response, open(img_path, "wb") as f:
        f.write(response.read())
    img = cv2.imread(img_path)
    batch = preprocess(img)

    # configure httpclient
    with httpclient.InferenceServerClient(url=url, verbose=verbose) as client:
        inputs = []
        inputs.append(
            httpclient.InferInput(
                "INPUT__0", batch.shape, np_to_triton_dtype(batch.dtype)
            )
        )
        inputs[0].set_data_from_numpy(batch)
        outputs = [
            httpclient.InferRequestedOutput("OUTPUT__0"),
        ]
        responses = []
        # inference
        for i in range(requests):
            responses.append(
                client.infer(MODEL_NAME, inputs, request_id=str(i), outputs=outputs)
            )
        # check result
        for i, response in enumerate(responses):
            rebel_result = response.as_numpy("OUTPUT__0")
            rebel_post_output = postprocess(rebel_result, batch, img)
            ret = cv2.imwrite(f"people_{MODEL_NAME}_{i}.jpg", rebel_post_output)
            assert ret  # cv2.imwrite returns False on failure


if __name__ == "__main__":
    fire.Fire(infer)
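
With the server running, you can issue the requests by executing the script. The defaults target localhost:8000 and send 10 requests; both can be overridden via the fire CLI flags (e.g. --url, --requests).

$ python3 client.py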

If the requests are processed successfully, the annotated output images for all 10 requests will be saved:

people_yolov8l_0.jpg
people_yolov8l_1.jpg
people_yolov8l_2.jpg
people_yolov8l_3.jpg
people_yolov8l_4.jpg
people_yolov8l_5.jpg
people_yolov8l_6.jpg
people_yolov8l_7.jpg
people_yolov8l_8.jpg
people_yolov8l_9.jpg

Your output should resemble the following: [Output Image]

Advanced features

Multiple model instances with multiple RBLN NPU devices

You can configure multiple model instances to distribute inference workloads across multiple RBLN NPU devices. The number of instances is set via the count value of the instance_group field in the config.pbtxt file.

For example, to create two execution instances of the YOLOv8 model from the previous steps, increase the count value to 2 in python_backend/examples/rbln/yolov8l/config.pbtxt as follows:

...

instance_group [
  {
    count: 2  # number of instances
    kind: KIND_MODEL
  }
]

...

Inside model.py, set the device index on which each runtime instance runs using the device parameter:

#
# model.py
#

def initialize(self, args):
    # .......
    self.module = rebel.Runtime(rbln_path, device=instance_idx)

The instance_idx value represents the instance number within the instance_group. You can map this index to an RBLN NPU device based on your hardware configuration. Here is a simple example that maps instance_idx to the corresponding RBLN NPU device:

model.py
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#
# model.py
#
import json
import os
import rebel
import triton_python_backend_utils as pb_utils

# Number of devices to allocate.
# Available device numbers can be found through `rbln-stat` command.
NUM_OF_DEVICES = 2

class TritonPythonModel:
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_instance_name: A string containing model instance name in form of <model_name>_<instance_group_id>_<instance_id>
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

        self.model_config = model_config = json.loads(args["model_config"])
        instance_group_config = model_config["instance_group"][0]
        instance_count = instance_group_config["count"]
        instance_idx = 0
        # Get `instance_idx` for multiple instances.
        # instance_group's count should be bigger than 1 in config.pbtxt.
        if instance_count > 1:
            instance_name_parts = args["model_instance_name"].split("_")
            if not instance_name_parts[-1].isnumeric():
                raise pb_utils.TritonModelException(
                    "model instance name should end with '_<instance_idx>', got {}".format(
                        args["model_instance_name"]
                    )
                )
            instance_idx = int(instance_name_parts[-1])

        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config["data_type"]
        )
        rbln_path = os.path.join(
            args["model_repository"],
            args["model_version"],
            f"{args['model_name']}.rbln",
        )
        # Allocate instance to device.
        # Simple example of round-robin assignment to multiple devices.
        self.module = rebel.Runtime(rbln_path, device=instance_idx % NUM_OF_DEVICES)

    def execute(self, requests):
        # ... Same as previous ...
        return responses
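
While the server is running with count: 2, you can use the rbln-stat command mentioned in the comments above to confirm that both NPU devices have been allocated:

$ rbln-stat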

Dynamic Batching

Triton offers dynamic batching, enabling the server to automatically group multiple incoming inference requests into batches for processing. Instead of handling each request individually, the server dynamically combines relevant inference requests to create batches on the fly.

To enable dynamic batching with RBLN SDK, the model must be compiled for every batch size it may receive. Below is an example covering batch sizes 1 through 4:

size = 640  # Width and height of the input image
batches = [1, 2, 3, 4]  # Supported batch sizes
input_infos = []

# Create input information for each batch size
for batch in batches:
    input_info = [("input_np", [batch, 3, size, size], "float32")]
    input_infos.append(input_info)

# Compile the PyTorch model (`model`) with the pre-defined input information
compiled_model = rebel.compile_from_torch(model, input_info=input_infos)
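
The multi-shape compiled model then needs to be written into the model repository as yolov8l.rbln so that the model.py below can load it; the save() method name is assumed here, following the usual RBLN compile workflow:

compiled_model.save("python_backend/examples/rbln/yolov8l/1/yolov8l.rbln")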

In the config.pbtxt file, specify the maximum batch size for the compiled model as shown below:

#
# config.pbtxt
#

max_batch_size: 4
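
Dynamic batching also has to be switched on explicitly in Triton by adding the dynamic_batching field to config.pbtxt. Below is a minimal, illustrative setting; the queue delay value is an assumption and should be tuned for your workload:

#
# config.pbtxt
#

dynamic_batching {
  max_queue_delay_microseconds: 100
}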

In model.py, create a runtime for each supported batch size and run inference with the runtime that matches the batch size of the incoming requests.

model.py
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import json
import os

import numpy as np
import rebel

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def _get_index(self, name):
        parts = name.split("__")
        return int(parts[1])

    def initialize(self, args):
        self.model_config = model_config = json.loads(args["model_config"])

        # Configure input_dict
        self.input_dict = {}
        for config_input in model_config["input"]:
            index = self._get_index(config_input["name"])
            self.input_dict[index] = [
                config_input["name"],
                config_input["data_type"],
                config_input["dims"],
            ]

        # Configure output_dict
        self.output_dict = {}
        for config_output in model_config["output"]:
            index = self._get_index(config_output["name"])
            self.output_dict[index] = [
                config_output["name"],
                config_output["data_type"],
                config_output["dims"],
            ]

        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config["data_type"]
        )
        rbln_path = os.path.join(
            args["model_repository"],
            args["model_version"],
            f"{args['model_name']}.rbln",
        )
        self.model_name = args["model_name"]

        # Load compiled model
        compiled_model = rebel.RBLNCompiledModel(
            rbln_path
        )

        # Create runners for each batch size
        self.runners = []
        for i in range(model_config["max_batch_size"]):
            self.runners.append(
                compiled_model.create_runtime(input_info_index=i)
            )

    def execute(self, requests):
        if len(requests) > 1:
            print("dynamic batch size : ", len(requests))

        # Preprocess the input data
        responses = []
        inputs = []
        num_requests = len(requests)
        request_batch_sizes = []
        for i in self.input_dict.keys():
            name, dt, _ = self.input_dict[i]
            first_tensor = pb_utils.get_input_tensor_by_name(
                requests[0], name
            ).as_numpy()
            request_batch_sizes.append(first_tensor.shape[0])
            batched_tensor = first_tensor
            for j in range(1, num_requests):
                tensor = pb_utils.get_input_tensor_by_name(requests[j], name).as_numpy()
                request_batch_sizes.append(request_batch_sizes[-1] + tensor.shape[0])
                batched_tensor = np.concatenate((batched_tensor, tensor), axis=0)

            inputs.append(batched_tensor)

        batch_size = batched_tensor.shape[0]
        if batch_size > 1:
            print(f"running inference with batch size : {batch_size}")

        # Run inference on the RBLN model
        batched_results = self.runners[batch_size - 1](batched_tensor)

        # Postprocess the output data
        chunky_batched_results = []
        for i in self.output_dict.keys():
            batch = (
                batched_results[i]
                if (isinstance(batched_results, tuple) or
                    isinstance(batched_results, list))
                else batched_results
            )
            chunky_batched_results.append(
                np.array_split(batch, request_batch_sizes, axis=0)
            )

        # Send response
        for i in range(num_requests):
            output_tensors = []
            for j in self.output_dict.keys():
                name, dt, _ = self.output_dict[j]
                result = chunky_batched_results[j][i]
                output_tensor = pb_utils.Tensor(
                    name, result.astype(pb_utils.triton_string_to_numpy(dt))
                )
                output_tensors.append(output_tensor)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=output_tensors
            )
            responses.append(inference_response)

        return responses

    def finalize(self):
        print("Cleaning up...")