ResNet50
In this tutorial, we will guide you through the steps required to integrate RBLN SDK with TorchServe using a precompiled ResNet50
model. For instructions on setting up the TorchServe environment, please refer to TorchServe.
You can find the actual commands required to compile the model and serve it with TorchServe in our model zoo.
Note
This tutorial assumes that you are familiar with compiling and running inference using the RBLN SDK. If you are not familiar with RBLN SDK, refer to PyTorch/TensorFlow tutorials and the API Documentation.
Prerequisites
Before we start, please make sure you have prepared the following prerequisites in your system:
Quick Start with TorchServe
In TorchServe, models are served as Model Archive (.mar) units, which contain all the information necessary for serving the model. The following guide explains how to create a .mar file and use it for model serving.
Write the Model Request Handler
Below is a simple handler that inherits from TorchServe's BaseHandler to serve ResNet50 inference requests. This handler defines initialize(), preprocess(), inference(), postprocess(), and handle() for model serving. The initialize() method is called when the model is loaded from the model_store directory, and the handle() method is invoked for each prediction request made through the TorchServe Inference API.
resnet50_handler.py

# resnet50_handler.py
import io
import os

import PIL.Image as Image
import torch
from torchvision.models import ResNet50_Weights

import rebel  # RBLN Runtime
from ts.torch_handler.base_handler import BaseHandler


class Resnet50Handler(BaseHandler):
    def __init__(self):
        self._context = None
        self.initialized = False
        self.model = None
        self.weights = None

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time.
        :param context: Initial context contains model server system properties.
        :return:
        """
        self._context = context
        # Locate the compiled model file inside the model archive
        model_dir = context.system_properties.get("model_dir")
        serialized_file = context.manifest["model"].get("serializedFile")
        model_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_path):
            raise RuntimeError(
                f"[RBLN ERROR] File not found at the specified model_path({model_path})."
            )
        self.module = rebel.Runtime(model_path, tensor_type="pt")
        self.weights = ResNet50_Weights.DEFAULT
        self.initialized = True

    def preprocess(self, data):
        """
        Transform raw input into model input data.
        :param data: list of raw requests, should match batch size
        :return: preprocessed model input tensor
        """
        input_data = data[0].get("data")
        if input_data is None:
            input_data = data[0].get("body")
        if input_data is None:
            raise ValueError("[RBLN][ERROR] Data not found with client request.")
        if not isinstance(input_data, (bytes, bytearray)):
            raise ValueError("[RBLN][ERROR] Preprocessed data is not binary data.")
        try:
            image = Image.open(io.BytesIO(input_data))
        except Exception as e:
            raise ValueError(f"[RBLN][ERROR] Invalid image data: {e}")
        # Apply the torchvision preprocessing pipeline and add a batch dimension
        prep = self.weights.transforms()
        return prep(image).unsqueeze(0)

    def inference(self, model_input):
        """
        Internal inference method.
        :param model_input: transformed model input data
        :return: inference output tensor
        """
        model_output = self.module.run(model_input)
        return model_output

    def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: inference output tensor
        :return: predicted category name
        """
        score, class_id = torch.topk(inference_output, 1, dim=1)
        category_name = self.weights.meta["categories"][class_id]
        return category_name

    def handle(self, data, context):
        """
        Invoked by TorchServe for prediction requests.
        Performs pre-processing of the data, prediction using the model,
        and post-processing of the prediction output.
        :param data: Input data for prediction
        :param context: Initial context contains model server system properties.
        :return: prediction output
        """
        model_input = self.preprocess(data)
        model_output = self.inference(model_input)
        category_name = self.postprocess(model_output)
        print("[RBLN][INFO] Top1 category: ", category_name)
        return [{"result": category_name}]
Write the Model Configuration
Create the config.properties file as shown below. This file contains the information necessary for serving the model. In this tutorial, to limit the number of workers to a single instance, set default_workers_per_model to 1.
config.properties

default_workers_per_model=1
models={\
    "resnet50":{\
        "1.0":{\
            "marName": "resnet50.mar",\
            "responseTimeout": 120\
        }\
    }\
}
Model Archiving with torch-model-archiver
The model_store directory stores the .mar files used for serving, including the ResNet50 model archive created in this tutorial.
Once the model archiving setup is complete, run the torch-model-archiver command to create the model archive file. The model_store folder, where the generated resnet50.mar archive file is located, will be passed as a parameter when TorchServe starts.
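If the model_store directory does not exist yet, create it first (the name only needs to match the --export-path option used below):

$ mkdir -p ./model_store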
$ torch-model-archiver \
    --model-name resnet50 \
    --version 1.0 \
    --serialized-file ./resnet50.rbln \
    --handler ./resnet50_handler.py \
    --export-path ./model_store/
The options passed to torch-model-archiver are as follows:

--model-name: Specifies the name of the model to be served, set to resnet50.
--version: Defines the version of the model to be served with TorchServe.
--serialized-file: Specifies the weight file, set to ./resnet50.rbln.
--handler: Specifies the handler script for the model, set to ./resnet50_handler.py.
--export-path: Specifies the output directory for the archive file. The previously created model_store folder is set as the destination.

After executing the command, the resnet50.mar file is generated in the model_store directory specified by --export-path.
+--(YOUR_PATH)/
|  +--model_store/
|  |  +--resnet50.mar
|  +--resnet50.rbln
|  +--resnet50_handler.py
|  +--config.properties
Run torchserve
TorchServe can be started using the following command. For a simple test where token authentication is not required, you can use the --disable-token-auth option.
$ torchserve --start --ncs \
    --ts-config ./config.properties \
    --model-store ./model_store \
    --models resnet50=resnet50.mar \
    --disable-token-auth
--start: Starts the TorchServe service.
--ncs: Disables the snapshot feature.
--ts-config: Specifies the configuration file for torchserve, set to ./config.properties.
--model-store: Specifies the directory containing the model archive (.mar) files.
--models: Specifies the model(s) to serve. If all is specified, all models in the model_store directory are designated as serving models.
--disable-token-auth: Disables token authentication.
When TorchServe is successfully started, it operates in the background. The command to stop TorchServe is shown below:
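$ torchserve --stop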
TorchServe provides the Management API on port 8081 and the Inference API on port 8080 by default.
You can check the list of models currently being served using the following Management API.
$ curl -X GET "http://localhost:8081/models"
If the operation is successful, you can verify that the resnet50 model is being served.
{
  "models": [
    {
      "modelName": "resnet50",
      "modelUrl": "resnet50.mar"
    }
  ]
}
Inference Request with TorchServe Inference API
Now, we can send an inference request using the Prediction API of the TorchServe Inference API to test the ResNet50 model served with TorchServe.
Download a sample image for the ResNet50 inference request.
$ wget https://rbln-public.s3.ap-northeast-2.amazonaws.com/images/tabby.jpg
Make an inference request using the TorchServe Inference API with curl.
$ curl -X POST "http://127.0.0.1:8080/predictions/resnet50" -H "Content-Type: application/octet-stream" --data-binary @./tabby.jpg
If the inference request is successful, the following response is returned.
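Based on the handler above, which returns [{"result": category_name}], the response should look similar to the following (the exact label depends on the model output; for the sample image the expected top-1 category is "tabby"):

{
  "result": "tabby"
}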
Advanced Features
Batch Inference in TorchServe
TorchServe supports Batch Inference, a method of grouping multiple inference requests together and processing them all at once.
Batch Inference Configuration
To use Batch Inference in TorchServe, the model configuration must include the following two required settings:

batchSize: The maximum batch size that the model can handle.
maxBatchDelay: The maximum wait time (in milliseconds) that TorchServe will hold requests to reach the defined batchSize. If the number of received requests does not reach the maximum batch size within the specified delay, all currently received requests are sent to the handler for processing.
In the config.properties file, specify the batch settings using batchSize and maxBatchDelay as shown below.
config_b4.properties

default_workers_per_model=1
models={\
    "resnet50":{\
        "1.0":{\
            "marName": "resnet50.mar",\
            "batchSize": 4,\
            "maxBatchDelay": 100,\
            "responseTimeout": 120\
        }\
    }\
}
Model Compilation
Bucketing is the process of compiling a model multiple times with different target input shapes to create optimized bucketed models. The RBLN Compiler supports bucketing by compiling models for various input shapes, enhancing Batch Inference and improving memory efficiency.
Below is an example code snippet demonstrating how to define a bucketed model that supports batch sizes ranging from 1 to 4:
import rebel  # RBLN Compiler
from torchvision.models import resnet50, ResNet50_Weights

# Load the pretrained ResNet50 model in evaluation mode
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()

size = 224  # Width and height of image
batches = [1, 2, 3, 4]  # Supported batch sizes
input_infos = []

# Create input information for each batch size
for batch in batches:
    input_info = [("x", [batch, 3, size, size], "float32")]
    input_infos.append(input_info)

# Compile the model with the pre-defined input information
compiled_model = rebel.compile_from_torch(model, input_info=input_infos)

# Save the compiled model
compiled_model.save("resnet50.rbln")
When saving the compiled model, the file name must match the --serialized-file parameter specified in torch-model-archiver so that it can be correctly loaded by the model handler.
Model Handler
The model handler creates a runtime for each supported batch size and uses the one that matches the number of incoming requests to perform inference on the provided input data.
resnet50_batch_handler.py

# resnet50_batch_handler.py
import io
import os

import numpy as np
import PIL.Image as Image
import torch
from torchvision.models import ResNet50_Weights

import rebel  # RBLN Runtime
from ts.torch_handler.base_handler import BaseHandler


class Resnet50Handler(BaseHandler):
    def __init__(self):
        self._context = None
        self.initialized = False
        self.model = None
        self.weights = None
        self.prep = None
        self.batch_size = None
        self.max_batch_size = None

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time.
        :param context: Initial context contains model server system properties.
        :return:
        """
        self._context = context
        model_dir = context.system_properties.get("model_dir")
        serialized_file = context.manifest["model"].get("serializedFile")
        self.max_batch_size = context.system_properties["batch_size"]
        model_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_path):
            raise RuntimeError(
                f"[RBLN ERROR] File not found at the specified model_path({model_path})."
            )
        # Create one runtime per bucketed batch size
        # (input_info_index i corresponds to batch size i + 1)
        self.modules = []
        compiled_model = rebel.RBLNCompiledModel(model_path)
        for i in range(self.max_batch_size):
            self.modules.append(
                compiled_model.create_runtime(input_info_index=i, tensor_type="pt")
            )
        self.weights = ResNet50_Weights.DEFAULT
        self.prep = self.weights.transforms()
        self.initialized = True

    def preprocess(self, data):
        """
        Transform raw input into model input data.
        :param data: list of raw requests, should match batch size
        :return: preprocessed model input tensor
        """
        # Take the input data and make it inference ready
        self.batch_size = num_requests = len(data)
        if self.batch_size > self.max_batch_size:
            raise ValueError(
                f"[RBLN][ERROR] Number of batched inputs ({self.batch_size})"
                f" exceeds the batchSize ({self.max_batch_size}) in the configuration."
            )
        images = []
        for i in range(num_requests):
            input_data = data[i].get("data")
            if input_data is None:
                input_data = data[i].get("body")
            if input_data is None:
                raise ValueError("[RBLN][ERROR] Data not found with client request.")
            if not isinstance(input_data, (bytes, bytearray)):
                raise ValueError("[RBLN][ERROR] Preprocessed data is not binary data.")
            try:
                image = Image.open(io.BytesIO(input_data))
            except Exception as e:
                raise ValueError(f"[RBLN][ERROR] Invalid image data: {e}")
            batch = self.prep(image).unsqueeze(0)
            images.append(batch.numpy())
        # Stack the per-request tensors into a single batched input
        preprocessed_data = np.concatenate(images, axis=0).copy()
        return torch.from_numpy(preprocessed_data)

    def inference(self, model_input):
        """
        Internal inference method.
        :param model_input: transformed model input data
        :return: inference output tensor
        """
        # Select the runtime that matches the actual batch size of this request group
        model_output = self.modules[self.batch_size - 1].run(model_input)
        return model_output

    def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: inference output tensor
        :return: list of predicted category names
        """
        category_names = []
        # Split the batched output back into one result per request
        for result in torch.split(inference_output, 1, dim=0):
            score, class_id = torch.topk(result, 1, dim=1)
            category_names.append(self.weights.meta["categories"][class_id])
        return category_names

    def handle(self, data, context):
        """
        Invoked by TorchServe for prediction requests.
        Performs pre-processing of the data, prediction using the model,
        and post-processing of the prediction output.
        :param data: Input data for prediction
        :param context: Initial context contains model server system properties.
        :return: prediction output
        """
        model_input = self.preprocess(data)
        model_output = self.inference(model_input)
        category_names = self.postprocess(model_output)
        results = []
        for idx, category_name in enumerate(category_names):
            print("[RBLN][INFO][", idx, "] Top1 category: ", category_name)
            results.append(f"result[{idx}] : {category_name}")
        return results
Model Serving
Using the previously created configuration, model, and model handler, start model serving by following the steps in "Model Archiving with torch-model-archiver" and "Run torchserve".
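As a sketch (assuming the file layout from the earlier sections, with resnet50_batch_handler.py as the handler and config_b4.properties as the configuration file), the archiving and serving commands would look like this:

$ torch-model-archiver \
    --model-name resnet50 \
    --version 1.0 \
    --serialized-file ./resnet50.rbln \
    --handler ./resnet50_batch_handler.py \
    --export-path ./model_store/ \
    --force
$ torchserve --start --ncs \
    --ts-config ./config_b4.properties \
    --model-store ./model_store \
    --models resnet50=resnet50.mar \
    --disable-token-auth

The --force option overwrites the resnet50.mar file generated earlier in this tutorial.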
You can verify whether the configuration has been applied correctly by using the following Management API command:
$ curl -X GET "http://localhost:8081/models/resnet50"
Check whether batchSize and maxBatchDelay are set to the specified values in the response.
[
  {
    "modelName": "resnet50",
    "modelVersion": "1.0",
    "modelUrl": "resnet50.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 4,
    "maxBatchDelay": 100,
    ...
    "workers": [
      {
        ...
      }
    ],
    ...
  }
]
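To see batch inference in action, you can send several requests at nearly the same time, for example with a simple shell loop using the sample image downloaded earlier (this is a sketch; any client that issues concurrent requests will do):

$ for i in 1 2 3 4; do
    curl -X POST "http://127.0.0.1:8080/predictions/resnet50" \
        -H "Content-Type: application/octet-stream" \
        --data-binary @./tabby.jpg &
done; wait

Requests that arrive within the configured maxBatchDelay (100 ms) are grouped into a single batch of up to batchSize (4) and passed to the handler together.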