
Serving Large Language Model with Continuous Batching

To serve Large Language Models (LLMs) with maximum utilization, a serving optimization technique known as continuous batching is essential.

This tutorial will guide you through implementing continuous batching with vllm-rbln to reduce LLM serving costs. vllm-rbln is an extension of the well-known LLM serving library vLLM, modified to make vLLM work with optimum-rbln.

Prerequisites

Install the required dependencies

First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You can find the latest versions of these packages in the Release Notes.

$ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.5.10" "optimum-rbln>=0.1.11" "vllm-rbln>=0.0.7"

Compile Llama2-7B

You need to compile the Llama2-7B model using optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM

# Export the HuggingFace PyTorch Llama2 model to an RBLN compiled model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4,  # number of ATOM+ for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # batch_size > 1 is recommended for continuous batching
)

compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

Choose an appropriate batch size for your serving needs. Here, it is set to 4.

This document introduces three ways to use the compiled model:

  1. Directly using the vLLM API,
  2. Using the Triton Inference Server with the vLLM backend,
  3. Using the vLLM-provided OpenAI-compatible API server.

vLLM API

You can use the compiled model with the vLLM API. The following code shows how to initialize the vLLM engine with the compiled model and run inference with the engine.

vllm_api_example.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


# Please make sure the engine configurations match the parameters used when compiling.
model_id = "meta-llama/Llama-2-7b-chat-hf"
max_seq_len = 4096
batch_size = 4

engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=max_seq_len,
  compiled_model_dir="rbln-Llama-2-7b-chat-hf",
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)


async def run_multi(chats):
  tasks = [asyncio.create_task(run_single(chat, str(i))) for (i, chat) in enumerate(chats)]
  return [await task for task in tasks]

# Runs multiple inferences in parallel
conversations = [
  [{"role": "user", "content": "What is the first letter of English alphabets?"}],
  [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
print(results)

Please refer to the vLLM doc for more information on the vLLM API.

Triton Inference Server with vLLM Backend

vLLM also provides a backend for the Triton Inference Server.

If you are using Backend.AI, refer to Step 1. If you are using an on-premise server, skip Step 1 and proceed directly to Step 2.

Step 1. Setting Up the Backend.AI Environment

  1. Start a session via Backend.AI.
  2. Select Triton Server (ngc-triton) as your environment. The version should read 24.01 / vllm / x86_64 / python-py3.

Step 2. Prepare Nvidia Triton vllm_backend and Modify Model Configurations for Llama2-7B

A. Clone the Nvidia Triton Inference Server vllm_backend repository:

$ git clone https://github.com/triton-inference-server/vllm_backend.git -b r24.01

B. Place the precompiled rbln-Llama-2-7b-chat-hf directory into the cloned vllm_backend/samples/model_repository/vllm_model/1 directory:

$ cp -R /PATH/TO/YOUR/rbln-Llama-2-7b-chat-hf /PATH/TO/YOUR/CLONED/vllm_backend/samples/model_repository/vllm_model/1

Your directory should look like the following at this point:

+-- vllm_backend/
|   +-- samples/
|   |   +-- model_repository/
|   |   |   +-- vllm_model/
|   |   |   |   +-- config.pbtxt
|   |   |   |   +-- 1/
|   |   |   |   |   +-- model.json
|   |   |   |   |   +-- rbln-Llama-2-7b-chat-hf/
|   |   |   |   |   |   +-- compiled_model.rbln
|   |   |   |   |   |   +-- config.json
|   |   |   |   |   |   +-- (and others)
|   |   +-- (and others)
|   +-- (and others)

C. Modify model.json

Modify vllm_backend/samples/model_repository/vllm_model/1/model.json.

{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "device": "rbln",
    "max_num_seqs": 4,
    "compiled_model_dir": "/ABSOLUTE/PATH/TO/rbln-Llama-2-7b-chat-hf",
    "max_num_batched_tokens": 4096,
    "max_model_len": 4096,
    "block_size": 4096
}
  • model: The name or path of a HuggingFace Transformers model. You may want to use an absolute path if your model is not hosted on the HuggingFace Hub.
  • device: Device type for vLLM execution. Set this to rbln.
  • max_num_seqs: Maximum number of sequences per iteration. This MUST match the batch_size used at compile time.
  • compiled_model_dir: Absolute path to the compiled model directory for RBLN (optimum-rbln).
  • When targeting an RBLN device, max_model_len, block_size, and max_num_batched_tokens should all be set to the maximum sequence length (4096 here).

Step 3. Run the Inference Server

We are now ready to run the inference server. If you are using Backend.AI, please refer to the A. Backend.AI section. If you are not a Backend.AI user, proceed to the B. On-premise server section.

A. Backend.AI

Before proceeding, install the required dependencies:

$ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.5.2" "optimum-rbln>=0.1.4" vllm-rbln

Start the Triton Server:

$ tritonserver --model-repository PATH/TO/YOUR/vllm_backend/samples/model_repository

You will see the following messages indicating that the server started successfully:

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002

B. On-premise server

If you are not using Backend.AI, follow these steps to start the inference server in a Docker container. (Backend.AI users can skip to Step 4.)

To access the RBLN NPU devices, the inference server container must be run in privileged mode. Add a mount option for the cloned vllm_backend repository as below:

sudo docker run --privileged --shm-size=1g --ulimit memlock=-1 \
   -v /PATH/TO/YOUR/vllm_backend:/opt/tritonserver/vllm_backend \
   -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3

Install the required dependencies inside the container:

$ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.5.2" "optimum-rbln>=0.1.4" vllm-rbln

Start the Triton Server inside the container:

$ tritonserver --model-repository /opt/tritonserver/vllm_backend/samples/model_repository

You will see the following messages indicating that the server started successfully:

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
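
Optionally, you can check that the server is ready before sending inference requests. Triton exposes a standard health endpoint; the port below assumes the default HTTP port 8000 shown in the log above:

$ curl -v http://localhost:8000/v2/health/ready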

Step 4. Requesting Inference via gRPC API

vLLM provides its own model.py, whereas we defined our own model.py in the previous tutorial. In our model.py, the input parameter was named INPUT__0 and the output OUTPUT__0, but vLLM's model.py names the input text_input and the output text_output. The client must be modified accordingly. Please refer to vLLM's model.py for more detail.

The following shows the client code for the vLLM backend. This client requires the tritonclient and grpcio packages.

simple_vllm_client.py
import asyncio
import numpy as np
import tritonclient.grpc.aio as grpcclient

async def try_request():
  url = "<host and port number of the triton inference server>"  # e.g. "localhost:8001"
  client = grpcclient.InferenceServerClient(url=url, verbose=False)

  model_name = "vllm_model"

  def create_request(prompt, request_id):
    input = grpcclient.InferInput("text_input", [1], "BYTES")
    prompt_data = np.array([prompt.encode("utf-8")])
    input.set_data_from_numpy(prompt_data)

    stream_setting = grpcclient.InferInput("stream", [1], "BOOL")
    stream_setting.set_data_from_numpy(np.array([True]))

    inputs = [input, stream_setting]

    output = grpcclient.InferRequestedOutput("text_output")
    outputs = [output]

    return {
      "model_name": model_name,
      "inputs": inputs,
      "outputs": outputs,
      "request_id": request_id
    }

  prompt = "What is the first letter of English alphabets?"

  async def requests_gen():
    yield create_request(prompt, "req-0")

  response_stream = client.stream_infer(requests_gen())

  async for response in response_stream:
    result, error = response
    if error:
      print("Error occurred!")
    else:
      output = result.as_numpy("text_output")
      for i in output:
          decoded = i.decode("utf-8")
          if decoded.startswith(prompt):
              decoded = decoded[len(prompt):]
          print(decoded, end="", flush=True)

asyncio.run(try_request())

If you need to change other sampling parameters (such as temperature, top_p, top_k, max_tokens, early_stopping, ...), please refer to vLLM's Python client.
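
The Triton vLLM backend also accepts sampling parameters as a serialized JSON string in an optional sampling_parameters input tensor. Below is a minimal sketch of how create_request above could be extended, assuming the sampling_parameters tensor supported by the r24.01 vllm_backend sample client; the parameter names follow vLLM's SamplingParams and the values shown are illustrative.

import json
import numpy as np
import tritonclient.grpc.aio as grpcclient

# Illustrative sampling parameters, serialized to JSON for the backend.
sampling_parameters = {"temperature": "0.7", "top_p": "0.9", "max_tokens": "256"}

def create_request(prompt, request_id, model_name="vllm_model"):
  text_input = grpcclient.InferInput("text_input", [1], "BYTES")
  text_input.set_data_from_numpy(np.array([prompt.encode("utf-8")]))

  stream_setting = grpcclient.InferInput("stream", [1], "BOOL")
  stream_setting.set_data_from_numpy(np.array([True]))

  # Optional tensor carrying the sampling parameters as a JSON string.
  params = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
  params.set_data_from_numpy(np.array([json.dumps(sampling_parameters).encode("utf-8")]))

  return {
    "model_name": model_name,
    "inputs": [text_input, stream_setting, params],
    "outputs": [grpcclient.InferRequestedOutput("text_output")],
    "request_id": request_id,
  }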

OpenAI Compatible API Server

vLLM provides an OpenAI API compatible server. The server implements the completions and chat APIs.

First, make sure that vllm-rbln is installed. Then you can start the API server by running the vllm.entrypoints.openai.api_server module as shown below.

$ python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --device rbln \
  --max-num-seqs 4 \
  --compiled-model-dir </ABSOLUTE/PATH/TO/rbln-Llama-2-7b-chat-hf> \
  --max-num-batched-tokens 4096 \
  --max-model-len 4096 \
  --block-size 4096

Note that these command line arguments carry the same settings as the model.json defined in the previous section.

You may want to add --api-key <Random string to be used as API key> to enable authentication.

Once your API server is running, you can call it using the OpenAI Python or Node.js clients, or a curl command like the following.

$ curl http://<host and port number of the server>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API key, if specified when running the server>" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
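
The same request can be sent with the OpenAI Python client. The sketch below assumes the server is reachable at localhost:8000 and that an API key was set with --api-key; adjust both to your setup.

from openai import OpenAI

# Point the client at the vLLM API server instead of api.openai.com.
client = OpenAI(
  base_url="http://localhost:8000/v1",
  api_key="<API key, if specified when running the server>",
)

stream = client.chat.completions.create(
  model="meta-llama/Llama-2-7b-chat-hf",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
  ],
  stream=True,
)

for chunk in stream:
  # Each streamed chunk may carry a piece of the generated text.
  print(chunk.choices[0].delta.content or "", end="", flush=True)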

Please refer to the OpenAI Docs for more detail.