
Llama3-8B with Continuous Batching

TorchServe provides a vLLM Handler, a custom handler that supports the vLLM engine. With this handler, vllm-rbln can be leveraged to serve LLM models efficiently using continuous batching. This tutorial guides you through serving the Llama3-8B model using TorchServe's vLLM Handler with vllm-rbln.

For instructions on setting up the TorchServe environment, refer to TorchServe. To see the YAML files, model compilation script, and TorchServe configuration introduced on this page, visit the Model Zoo.

Note

This tutorial assumes that you are familiar with compiling and running inference using the RBLN SDK. If you are not familiar with RBLN SDK, refer to PyTorch/TensorFlow tutorials and the API Documentation.

Prerequisites

The following prerequisites should be prepared for this tutorial.

Note

To use the Llama3-8B model, 4 RBLN NPUs are required. You can refer to the recommended number of RBLN NPUs for each model in Optimum RBLN Multi-NPUs Supported Models.

The vllm-rbln package does not depend on vllm, and installing both may cause operational issues. If you installed vllm after vllm-rbln, please reinstall vllm-rbln to ensure proper functionality.

Compile Llama3-8B

To prepare the model for serving, create the rbln_model folder and navigate into it.

$ mkdir rbln_model
$ cd rbln_model

Compile the Llama3-8B model using optimum-rbln. This code is based on the Rebellions Model Zoo.

get_model.py
import os

from optimum.rbln import RBLNLlamaForCausalLM

def main():
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

    # Compile and export
    model = RBLNLlamaForCausalLM.from_pretrained(
        model_id=model_id,
        export=True,  # export a PyTorch model to RBLN model with optimum
        rbln_batch_size=4,
        rbln_max_seq_len=8192,  # default "max_position_embeddings"
        rbln_tensor_parallel_size=4,
    )

    # Save compiled results to disk
    model.save_pretrained(os.path.basename(model_id))

if __name__ == "__main__":
    main()

You need to select an appropriate batch size. In this case, it is set to 4, which corresponds to the max_num_seqs value in the vLLM engine configuration used later in this tutorial.
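
Before moving on to serving, you can optionally sanity-check the compiled artifacts by reloading them with optimum-rbln. The short sketch below is not part of the tutorial flow; it assumes the Meta-Llama-3-8B-Instruct directory created by get_model.py inside rbln_model and a machine with the required RBLN NPUs available.

check_compiled.py
from optimum.rbln import RBLNLlamaForCausalLM

def main():
    # Reload the compiled artifacts saved by get_model.py.
    # export=False loads the pre-compiled RBLN model instead of recompiling it.
    model = RBLNLlamaForCausalLM.from_pretrained(
        "Meta-Llama-3-8B-Instruct",
        export=False,
    )
    print("Compiled model reloaded:", type(model).__name__)

if __name__ == "__main__":
    main()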

Quick Start with TorchServe

In TorchServe, models are served as Model Archive (.mar) units, which contain all necessary information for serving the model. The following guide explains how to create a .mar file and use it for model serving.

RBLN vLLM Handler

TorchServe provides a vLLM Handler to utilize the vLLM Engine. Because the stock handler code may have dependency issues with the installed vLLM version, we suggest using the RBLN vLLM Handler, which is compatible with the latest version of vllm-rbln, as shown below:

rbln_vllm_handler.py
import asyncio
import logging
import os
import pathlib
import time
from unittest.mock import MagicMock

from ts.handler_utils.utils import send_intermediate_predict_response
from ts.service import PredictionException
from ts.torch_handler.base_handler import BaseHandler
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest, ErrorResponse
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_models import BaseModelPath, OpenAIServingModels

from vllm import AsyncEngineArgs, AsyncLLMEngine

logger = logging.getLogger(__name__)


class RBLN_VLLMHandler(BaseHandler):
    def __init__(self):
        super().__init__()

        self.vllm_engine = None
        self.model_name = None
        self.model_dir = None
        self.adapters = None
        self.openai_serving_models = None
        self.chat_completion_service = None
        self.completion_service = None
        self.raw_request = None
        self.initialized = False

    def initialize(self, ctx):
        self.model_dir = ctx.system_properties.get("model_dir")
        vllm_engine_config = self._get_vllm_engine_config(ctx.model_yaml_config.get("handler", {}))

        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

        self.vllm_engine = AsyncLLMEngine.from_engine_args(vllm_engine_config)

        if vllm_engine_config.served_model_name:
            served_model_names = vllm_engine_config.served_model_name
        else:
            served_model_names = [vllm_engine_config.model]

        chat_template = ctx.model_yaml_config.get("handler", {}).get("chat_template", None)

        loop = asyncio.get_event_loop()
        model_config = loop.run_until_complete(self.vllm_engine.get_model_config())

        base_model_paths = [
            BaseModelPath(name=name, model_path=self.model_dir) for name in served_model_names
        ]

        self.openai_serving_models = OpenAIServingModels(
            engine_client=self.vllm_engine,
            model_config=model_config,
            base_model_paths=base_model_paths,
        )

        self.completion_service = OpenAIServingCompletion(
            self.vllm_engine,
            model_config,
            self.openai_serving_models,
            request_logger=None,
        )

        self.chat_completion_service = OpenAIServingChat(
            self.vllm_engine,
            model_config,
            self.openai_serving_models,
            "assistant",
            request_logger=None,
            chat_template=chat_template,
        )

        async def isd():
            return False

        self.raw_request = MagicMock()
        self.raw_request.headers = {}
        self.raw_request.is_disconnected = isd

        self.initialized = True

    async def handle(self, data, context):
        start_time = time.time()

        metrics = context.metrics

        data_preprocess = await self.preprocess(data, context)
        output = await self.inference(data_preprocess, context)
        output = await self.postprocess(output)

        stop_time = time.time()
        metrics.add_time("HandlerTime", round((stop_time - start_time) * 1000, 2), None, "ms")
        return output

    async def preprocess(self, requests, context):
        assert len(requests) == 1, "Expecting batch_size = 1"
        req_data = requests[0]
        data = req_data.get("data") or req_data.get("body")
        if isinstance(data, (bytes, bytearray)):
            data = data.decode("utf-8")

        return [data]

    async def inference(self, input_batch, context):
        url_path = context.get_request_header(0, "url_path")

        if url_path == "v1/models":
            models = await self.chat_completion_service.show_available_models()
            return [models.model_dump()]

        directory = {
            "v1/completions": (
                CompletionRequest,
                self.completion_service,
                "create_completion",
            ),
            "v1/chat/completions": (
                ChatCompletionRequest,
                self.chat_completion_service,
                "create_chat_completion",
            ),
        }

        RequestType, service, func = directory.get(url_path, (None, None, None))

        if RequestType is None:
            raise PredictionException(f"Unknown API endpoint: {url_path}", 404)

        request = RequestType.model_validate(input_batch[0])
        g = await getattr(service, func)(
            request,
            self.raw_request,
        )

        if isinstance(g, ErrorResponse):
            return [g.model_dump()]
        if request.stream:
            async for response in g:
                if response != "data: [DONE]\n\n":
                    send_intermediate_predict_response(
                        [response], context.request_ids, "Result", 200, context
                    )
            return [response]
        else:
            return [g.model_dump()]

    async def postprocess(self, inference_outputs):
        return inference_outputs

    def _get_vllm_engine_config(self, handler_config: dict):
        vllm_engine_params = handler_config.get("vllm_engine_config", {})
        model = vllm_engine_params.get("model", {})
        if len(model) == 0:
            model_path = handler_config.get("model_path", {})
            assert (
                len(model_path) > 0
            ), "please define model in vllm_engine_config or model_path in handler"
            model = pathlib.Path(self.model_dir).joinpath(model_path)
            if not model.exists():
                logger.debug(
                    f"Model path ({model}) does not exist locally."
                    " Trying to give without model_dir as prefix."
                )
                model = model_path
            else:
                model = model.as_posix()
        logger.debug(f"EngineArgs model: {model}")
        vllm_engine_config = AsyncEngineArgs(model=model)
        self._set_attr_value(vllm_engine_config, vllm_engine_params)
        return vllm_engine_config

    def _set_attr_value(self, obj, config: dict):
        items = vars(obj)
        for k, v in config.items():
            if k in items:
                setattr(obj, k, v)

Write the Model Configuration

Let’s create a model_config.yaml file to configure the number of workers and TorchServe frontend parameters for serving the Llama3-8B model. This yaml file contains the vLLM engine settings for LLM serving.

For more details, refer to TorchServe Document - Advanced configuration.

To utilize RBLN NPUs, set device: "rbln" in the vllm_engine_config field. Additionally, the model field must exactly match the directory path where the compiled model is stored for serving.

model_config.yaml
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1             # Set the number of workers to create a single model instance
maxBatchDelay: 100
startupTimeout: 1200      # (in seconds) Give the worker time to load the model weights
asyncCommunication: true  # This ensures we can communicate asynchronously with the worker

# Handler parameters
handler:
    # model_path can be a model identifier for Hugging Face hub or a local path
    vllm_engine_config: # vLLM configuration which gets fed into AsyncVLLMEngine
        max_num_seqs: 4
        max_model_len: 4096
        max_num_batched_tokens: 4096
        device: "rbln"
        model: "Meta-Llama-3-8B-Instruct" # Can be a model identifier for Hugging Face hub or a local path
        served_model_name:
            - "llama3-8b"

Model Archiving with torch-model-archiver

The model_store directory stores .mar files, including the Llama3-8B model archive used in this tutorial, for serving.

$ mkdir model_store

Now that the setup is complete, run the torch-model-archiver command to create the model archive file.

$ torch-model-archiver \
        --model-name llama3-8b \
        --version 1.0 \
        --handler ./rbln_vllm_handler.py \
        --config-file ./model_config.yaml \
        --archive-format no-archive \
        --export-path model_store/ \
        --extra-files rbln_model/

The options passed to torch-model-archiver are as follows.

  • --model-name: Specifies the name of the model to be served, set as llama3-8b.
  • --version: Specifies the version of the model to be served with TorchServe.
  • --handler: Specifies the handler script for the model, set as rbln_vllm_handler.py.
  • --config-file: Specifies the yaml configuration file for the model, set as model_config.yaml.
  • --archive-format: Specifies the archive format, set as no-archive.
  • --export-path: Specifies the directory where the archived model will be stored, set to the model_store folder created earlier.
  • --extra-files: Specifies a list of additional dependency files to include in the archive. Multiple files or directories can be specified, separated by commas (,). The internal folder structure of the specified directories is preserved in the archive.

Once the archiving process using torch-model-archiver is complete, a folder named llama3-8b will be created in model_store, where the model will be served. Since the no-archive option was used, the archive’s internal files will be stored in this folder instead of being packaged into a .mar file.

+--(YOUR_PATH)/
|   +-- model_store/
|   |   +-- llama3-8b
|   |   |   +-- MAR-INF
|   |   |   |   +-- MANIFEST.json
|   |   |   +-- Meta-Llama-3-8B-Instruct
|   |   |   |   +-- prefill.rbln
|   |   |   |   +-- decoder.rbln
|   |   |   |   +-- config.json
|   |   |   |   +-- (else model files)
|   |   |   +-- model_config.yaml

Run torchserve

TorchServe can be started by running the following command. For a simple test where token authentication is not required, you can use the --disable-token-auth option.

$ torchserve --start --ncs --model-store model_store --models llama3-8b --disable-token-auth
  • --start: Starts the TorchServe service.
  • --ncs: Disable snapshot feature.
  • --model-store: Specifies the directory containing models.
  • --models: Specifies the model(s) to load, here llama3-8b. Setting this to all loads every model available in the model_store directory.
  • --disable-token-auth: Disables authentication for management API endpoints, simplifying testing.
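
Once the server is up, you can quickly confirm that it is responding before sending inference requests. The sketch below is a simple health check, assuming TorchServe's default inference port 8080 and that the requests package is installed.

check_health.py
import requests

# TorchServe's Inference API exposes a health endpoint on port 8080 by default.
resp = requests.get("http://localhost:8080/ping", timeout=5)
print(resp.status_code, resp.json())  # expects {"status": "Healthy"} when the server is up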

When TorchServe starts successfully, it runs in the background. The command to stop TorchServe is as follows:

$ torchserve --stop

The Management API of TorchServe receives requests on port 8081 by default.

You can check the list of models currently being served using the following Management API.

$ curl -X GET "http://localhost:8081/models"

If the operation is successful, you can verify that the Llama3-8B model is being served.

{
  "models": [
    {
      "modelName": "llama3-8b",
      "modelUrl": "llama3-8b"
    }
  ]
}
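
You can also query the Management API for details about a specific model, such as its worker status. The following sketch uses the requests package against the default management port 8081; the exact fields in the response may vary by TorchServe version.

describe_model.py
import requests

# GET /models/{model_name} on the Management API returns per-worker status
# for the served model.
resp = requests.get("http://localhost:8081/models/llama3-8b", timeout=5)
print(resp.json())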

Inference Request with TorchServe Inference API

Simple Request with curl

Now, we can send an inference request using the Prediction API from the TorchServe Inference API to test the Llama3-8B model served with TorchServe.

The Inference API of TorchServe receives requests on port 8080 by default.

Make an inference request using the TorchServe Inference API with curl.

$ echo '{
  "model": "llama3-8b",
  "prompt": "A robot may not injure a human being",
  "stream": 0
}' | curl --header "Content-Type: application/json"   --request POST --data-binary @-   http://localhost:8080/predictions/llama3-8b/1.0/v1/completions

If the inference request is successful, the following response is returned.

{
  "id": "cmpl-2826f5c0dc164a5d91d3ae0d4b71a480",
  "object": "text_completion",
  "created": 1737599538,
  "model": "llama3-8b",
  "choices": [
    {
      "index": 0,
      "text": " or, through inaction, allow a human being to come to harm.\nA",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 26,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  }
}
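
The same request can be sent from Python instead of curl. This sketch mirrors the payload above and assumes the requests package is installed.

completion_request.py
import requests

url = "http://localhost:8080/predictions/llama3-8b/1.0/v1/completions"
payload = {
    "model": "llama3-8b",
    "prompt": "A robot may not injure a human being",
    "stream": 0,
}

# Non-streaming text completion request against the served model.
resp = requests.post(url, json=payload, timeout=120)
print(resp.json()["choices"][0]["text"])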

TorchServe Repository Example Test

The TorchServe repository provides examples of model serving with vLLM. For more details, refer to the README.md file in the TorchServe GitHub repository.

For testing the Llama3-8B model, we will use two methods: OpenAI’s Text Completion and Chat Interface.

$ git clone https://github.com/pytorch/serve
$ cd serve/examples/large_models/vllm/llama3
Text Completion
  • Testing command

    $ python3 ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@prompt.json" --prompt-json --openai-api
    
  • Result

    OUTPUT

    Tasks are completed
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    A robot must obey the orders of human beings except where such orders would conflict with the First Law.
    A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    
    These three laws, developed by science fiction author Isaac Asimov, are a cornerstone of the robot rights movement. They provide a framework for robots to operate within, ensuring that they prioritize human well-being and safety above all else.
    
    In this world, robots have become an integral part of our daily lives. They work alongside humans, helping with everything from men
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    A robot must obey the orders of human beings except where such orders would conflict with the First Law.
    A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    
    As I held the small, sleek device in my hand, I couldn't help but feel a sense of excitement and trepidation. This was it, the moment I had been waiting for. The moment when I would finally be able to see if my theories were correct. If my robot, my creation, was truly the first of its kind.
    
    I
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law.
    A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    — The Three Laws of Robotics, developed by science fiction author Isaac Asimov
    
    When artificial intelligence (AI) is compared to multiple human actors working together, it can be frustrating and challenging to manage the complexity of their relationships. With AI, we don't have a single "actor" with its own motivations, goals, and
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    
    These rules, devised by Dr. Isaac Asimov, are the foundation of the Three Laws of Robotics. They were first introduced in his 1942 science fiction short story "Runaround" and have since become a cornerstone of the science fiction genre.
    
    The Three Laws are designed to ensure that robots behave in a way that is safe and
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    Robotics and artificial intelligence (AI) are transforming industries and revolutionizing the way we live and work. However, with the increasing development of autonomous machines, there is a growing need to consider the ethics and moral implications of these technologies.
    
    One of the most important ethical principles in robotics and AI is the Three Laws of Robotics, which were first proposed by science fiction author Isaac Asimov in the 1940s. The three laws are:
    
    1. A robot may not injure a human being or, through inaction, allow a human being to come to
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    A robot must obey the orders of human beings except where such orders would conflict with the First Law.
    A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    These three laws were written by Dr. Isaac Asimov, a renowned science fiction author, in his 1942 short story "Runaround." They are a fundamental part of the science fiction genre and have been widely adopted as a framework for exploring the ethics of artificial intelligence. Today, we'll be exploring the first law: A robot may not inj
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output=, or, through inaction, allow a human being to come to harm. A robot must protect its own existence as long as such protection does not conflict with the First Law. A robot must follow the instructions of its human masters, except where such instructions conflict with the First or Second Law. A robot must not have a significant negative influence on human existence.
    
    The Three Laws of Robotics, as formulated by science fiction author Isaac Asimov, are a set of rules designed to govern the behavior of robots and artificial intelligence (AI) in a way that prioritizes human safety and well-being. The laws are often seen as a way to
    
    payload={'prompt': 'A robot may not injure a human being', 'temperature': 0.8, 'logprobs': 1, 'max_tokens': 128, 'model': 'llama3-8b'}
    , output= or, through inaction, allow a human being to come to harm.
    To prevent harm, a robot must also obey the following three laws:
    
    1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
    2. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law.
    3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    
    These laws were formulated by Dr. Isaac Asimov in his science fiction stories, and they have since become a standard
    
Chat Interface
  • Testing command

    $ python3 ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@chat.json" --prompt-json --openai-api --demo-streaming --api-endpoint "v1/chat/completions"
    
  • Result

    OUTPUT

    payload={'model': 'llama3-8b', 'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}, {'role': 'assistant', 'content': 'The Los Angeles Dodgers won the World Series in 2020.'}, {'role': 'user', 'content': 'Where was it played?'}], 'temperature': 0.0, 'max_tokens': 50, 'stream': True}
    , output=
    The 2020 World Series was played at Globe Life Field in Arlington, Texas, which is the home stadium of the Texas Rangers. However, the series was played with no fans in attendance due to the COVID-19 pandemic.Tasks are completed
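
A chat request can also be sent directly without the test script. The sketch below mirrors the payload shown in the output above, with streaming disabled so the full response body is returned at once; the endpoint path follows the same pattern as the earlier completion example and is an assumption on our part, and the response fields follow the OpenAI chat schema.

chat_request.py
import requests

url = "http://localhost:8080/predictions/llama3-8b/1.0/v1/chat/completions"
payload = {
    "model": "llama3-8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
    "temperature": 0.0,
    "max_tokens": 50,
    "stream": False,
}

# Non-streaming chat completion request; the handler routes v1/chat/completions
# to vLLM's OpenAIServingChat.
resp = requests.post(url, json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])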