Serving Large Language Models with Continuous Batching
Continuous batching is a widely used serving optimization that keeps hardware utilization high when serving Large Language Models (LLMs).
This tutorial guides you through using continuous batching with vllm-rbln to reduce LLM serving costs.
vllm-rbln is an extension of the well-known LLM serving library vLLM, modified so that vLLM works with optimum-rbln.
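To make the idea concrete, the toy simulation below compares static batching with continuous batching. It is only a sketch under simplifying assumptions: each request is modelled by the number of decode iterations it needs (at least 1), and the function names `static_batching_steps` and `continuous_batching_steps` are illustrative, not part of vLLM or vllm-rbln.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total decode iterations when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining decode iterations of each in-flight request
    steps = 0
    while waiting or running:
        # Fill every free slot before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One decode iteration: each running request emits one token,
        # and requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print("static:", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With this workload, static batching needs 16 iterations while continuous batching needs 10, because short requests no longer wait for the longest request in their batch to finish.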
Install the required dependencies
First, make sure you have the latest versions of the required packages: rebel-compiler, optimum-rbln, and vllm-rbln. You can find the latest package versions in the Release Note.
| $ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.6.1" "optimum-rbln>=0.1.13" "vllm-rbln>=0.1.2"
|
Compile Llama2-7B
You need to compile the Llama2-7B model using optimum-rbln.
| from optimum.rbln import RBLNLlamaForCausalLM
# Export the HuggingFace PyTorch Llama2 model to an RBLN compiled model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4,  # number of ATOM+ for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,  # batch_size > 1 is recommended for continuous batching
)
compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")
|
Choose an appropriate batch size for your serving needs. Here, it is set to 4.
This document introduces three ways to use the compiled model:
- Directly through the vLLM API,
- Through Triton Inference Server with the vLLM backend,
- Through the OpenAI-compatible API server provided by vLLM.
vLLM API
You can use the compiled model with vLLM APIs.
The following code shows how to initialize the vLLM engine with the compiled model and run the inference with the engine.
vllm_api_example.py |
---|
| import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Please make sure the engine configurations match the parameters used when compiling.
model_id = "meta-llama/Llama-2-7b-chat-hf"
max_seq_len = 4096
batch_size = 4

engine_args = AsyncEngineArgs(
    model=model_id,
    device="rbln",
    max_num_seqs=batch_size,
    max_num_batched_tokens=max_seq_len,
    max_model_len=max_seq_len,
    block_size=max_seq_len,
    compiled_model_dir="rbln-Llama-2-7b-chat-hf",
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
    # Stop on EOS and, if the tokenizer defines it, on "<|eot_id|>" as well.
    eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
    if eot_id is not None:
        return [tokenizer.eos_token_id, eot_id]
    else:
        return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
    temperature=0.0,
    skip_special_tokens=True,
    stop_token_ids=stop_tokens(),
)

# Runs a single inference for an example
async def run_single(chat, request_id):
    results_generator = engine.generate(chat, sampling_params, request_id=request_id)
    final_result = None
    async for result in results_generator:
        # You can use the intermediate `result` here, if needed.
        final_result = result
    return final_result

conversation = [{"role": "user", "content": "What is the first letter of English alphabets?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)

# Runs multiple inferences in parallel
async def run_multi(chats):
    # Request IDs are passed as strings, consistent with run_single(chat, "123") above.
    tasks = [asyncio.create_task(run_single(chat, str(i))) for (i, chat) in enumerate(chats)]
    return [await task for task in tasks]

conversations = [
    [{"role": "user", "content": "What is the first letter of English alphabets?"}],
    [{"role": "user", "content": "What is the last letter of English alphabets?"}],
]
chats = [
    tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    for conversation in conversations
]
results = asyncio.run(run_multi(chats))
print(results)
|
Please refer to the vLLM doc for more information on the vLLM API.
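The intermediate `result` objects can be used to stream text while it is being generated. The sketch below assumes that each intermediate result carries the full text generated so far (in vLLM this is typically read from `result.outputs[0].text`); `text_deltas` is a hypothetical helper for illustration, not a vLLM API.

```python
def text_deltas(cumulative_texts):
    """Yield only the newly generated part of each cumulative text snapshot."""
    previous = ""
    for text in cumulative_texts:
        if text.startswith(previous):
            delta = text[len(previous):]
        else:
            # The snapshot does not extend the previous one; emit it in full.
            delta = text
        previous = text
        if delta:
            yield delta


if __name__ == "__main__":
    snapshots = ["The", "The first", "The first letter", "The first letter is A."]
    for piece in text_deltas(snapshots):
        print(piece, end="", flush=True)
    print()
```

Inside `run_single`, the same logic could be applied to the text of each intermediate `result` to print tokens as they arrive instead of waiting for the final result.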
Triton Inference Server with vLLM enabled
vLLM can also be used as a backend for the Triton Inference Server.
If you are using Backend.AI, refer to Step 1. If you are using an on-premise server, skip Step 1 and proceed directly to Step 2.
Step 1. Setting Up the Backend.AI Environment
- Start a session via Backend.AI.
- Select Triton Server (ngc-triton) as your environment. The version is shown as 24.01 / vllm / x86_64 / python-py3.
Step 2. Prepare NVIDIA Triton vllm_backend and Modify Model Configurations for Llama2-7B
A. Clone the NVIDIA Triton Inference Server vllm_backend repository:
| $ git clone https://github.com/triton-inference-server/vllm_backend.git -b r24.01
|
B. Place the precompiled rbln-Llama-2-7b-chat-hf directory into the cloned vllm_backend/samples/model_repository/vllm_model/1 directory:
| $ cp -R /PATH/TO/YOUR/rbln-Llama-2-7b-chat-hf /PATH/TO/YOUR/CLONED/vllm_backend/samples/model_repository/vllm_model/1
|
Your directory should look like the following at this point:
| +-- vllm_backend/
| +-- samples/
| | +-- model_repository/
| | | +-- vllm_model/
| | | | +-- config.pbtxt
| | | | +-- 1/
| | | | | +-- model.json
| | | | | +-- rbln-Llama-2-7b-chat-hf/
| | | | | | +-- compiled_model.rbln
| | | | | | +-- config.json
| | | | | | +-- (and others)
| | +-- (and others)
| +-- (and others)
|
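Before starting the server, it can help to confirm that the layout above is in place. The following is a minimal sketch that checks only the entries shown in the tree; `missing_entries` is a hypothetical helper, and the path passed to it is whatever location you cloned vllm_backend into.

```python
import tempfile
from pathlib import Path


def missing_entries(model_dir, compiled_name="rbln-Llama-2-7b-chat-hf"):
    """Return the expected paths that do not exist under the vllm_model directory."""
    root = Path(model_dir)
    expected = [
        root / "config.pbtxt",
        root / "1" / "model.json",
        root / "1" / compiled_name / "compiled_model.rbln",
        root / "1" / compiled_name / "config.json",
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that only contains config.pbtxt.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.pbtxt").write_text("")
        for entry in missing_entries(tmp):
            print("missing:", entry)
```

In practice you would call `missing_entries("/PATH/TO/YOUR/CLONED/vllm_backend/samples/model_repository/vllm_model")` and expect an empty list.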
C. Modify model.json
Modify vllm_backend/samples/model_repository/vllm_model/1/model.json as shown below:
| {
    "model": "/ABSOLUTE/PATH/TO/rbln-Llama-2-7b-chat-hf",
    "device": "rbln",
    "max_num_seqs": 4,
    "max_num_batched_tokens": 4096,
    "max_model_len": 4096,
    "block_size": 4096
}
|
- model: The name or path of a HuggingFace transformers model. You may want to use the absolute path to your model if your model is not from HuggingFace (Reference).
- device: Device type for vLLM execution. Please set this to rbln.
- max_num_seqs: Maximum number of sequences per iteration. This MUST match the compiled batch_size.
- max_model_len, block_size, and max_num_batched_tokens: When targeting an RBLN device, these fields should all be set to the same value as the max sequence length.
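Because these fields must stay consistent with the compile-time settings, one option is to derive model.json from those values instead of editing it by hand. The sketch below is illustrative only: `build_model_json` is a hypothetical helper, and it encodes just the rules stated above.

```python
import json


def build_model_json(model_path, batch_size, max_seq_len):
    """Build the model.json fields from the values used at compile time."""
    if batch_size < 1 or max_seq_len < 1:
        raise ValueError("batch_size and max_seq_len must be positive")
    return {
        "model": model_path,
        "device": "rbln",
        # Must equal the rbln_batch_size used when compiling.
        "max_num_seqs": batch_size,
        # The next three fields all equal the rbln_max_seq_len used when compiling.
        "max_num_batched_tokens": max_seq_len,
        "max_model_len": max_seq_len,
        "block_size": max_seq_len,
    }


if __name__ == "__main__":
    config = build_model_json("/ABSOLUTE/PATH/TO/rbln-Llama-2-7b-chat-hf", 4, 4096)
    print(json.dumps(config, indent=4))
```

Writing the printed JSON to vllm_model/1/model.json reproduces the configuration shown above for a batch size of 4 and a max sequence length of 4096.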
Step 3. Run the Inference Server
We are now ready to run the inference server. If you are using Backend.AI, please refer to the A. Backend.AI section. If you are not a Backend.AI user, proceed to the B. On-premise server section.
A. Backend.AI
Before proceeding, install the required dependencies:
| $ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.6.1" "optimum-rbln>=0.1.13" "vllm-rbln>=0.1.2"
|
Start the Triton Server:
| $ tritonserver --model-repository PATH/TO/YOUR/vllm_backend/samples/model_repository
|
You will see the following messages, which indicate that the server started successfully:
| Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
|
B. On-premise server
If you are not using Backend.AI, follow these steps to start the inference server in a Docker container. (Backend.AI users can skip to Step 4.)
To access the RBLN NPU devices, the inference server container must run in privileged mode. Also mount the cloned vllm_backend repository, as shown below:
| sudo docker run --privileged --shm-size=1g --ulimit memlock=-1 \
-v /PATH/TO/YOUR/vllm_backend:/opt/tritonserver/vllm_backend \
-p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3
|
Install the required dependencies inside the container:
| $ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.6.1" "optimum-rbln>=0.1.13" "vllm-rbln>=0.1.2"
|
Start the Triton Server inside the container:
| $ tritonserver --model-repository /opt/tritonserver/vllm_backend/samples/model_repository
|
You will see the following messages, which indicate that the server started successfully:
| Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
|
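Besides watching the log, readiness can be checked programmatically. The sketch below assumes the HTTP service is reachable on port 8000 as in the log above and uses Triton's standard `/v2/health/ready` endpoint; `http_status` and `is_server_ready` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def is_server_ready(base_url, get_status=http_status):
    """Return True if the readiness endpoint answers with HTTP 200."""
    return get_status(base_url.rstrip("/") + "/v2/health/ready") == 200
```

For example, `is_server_ready("http://localhost:8000")` should return True once the server has started; the `get_status` parameter exists only so the check can be exercised without a running server.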
Step 4. Requesting Inference via gRPC API
The vLLM backend has its own model.py, which differs from the model.py we defined in the previous section. In our model.py, the input was named INPUT__0 and the output OUTPUT__0, whereas the vLLM backend names its input text_input and its output text_output. The client must be modified accordingly. Please refer to the vLLM model.py for more detail.
The following shows the client code for the vLLM backend. This client also requires the tritonclient and grpcio packages.
simple_vllm_client.py |
---|
| import asyncio
import numpy as np
import tritonclient.grpc.aio as grpcclient

async def try_request():
    url = "<host and port number of the triton inference server>"  # e.g. "localhost:8001"
    client = grpcclient.InferenceServerClient(url=url, verbose=False)
    model_name = "vllm_model"

    def create_request(prompt, request_id):
        # The vLLM backend expects the prompt in "text_input" and a "stream" flag.
        text_input = grpcclient.InferInput("text_input", [1], "BYTES")
        prompt_data = np.array([prompt.encode("utf-8")])
        text_input.set_data_from_numpy(prompt_data)
        stream_setting = grpcclient.InferInput("stream", [1], "BOOL")
        stream_setting.set_data_from_numpy(np.array([True]))
        inputs = [text_input, stream_setting]
        output = grpcclient.InferRequestedOutput("text_output")
        outputs = [output]
        return {
            "model_name": model_name,
            "inputs": inputs,
            "outputs": outputs,
            "request_id": request_id,
        }

    prompt = "What is the first letter of English alphabets?"

    async def requests_gen():
        yield create_request(prompt, "req-0")

    response_stream = client.stream_infer(requests_gen())
    async for response in response_stream:
        result, error = response
        if error:
            print("Error occurred!", error)
        else:
            output = result.as_numpy("text_output")
            for i in output:
                decoded = i.decode("utf-8")
                # Strip the echoed prompt, if present, so only new text is printed.
                if decoded.startswith(prompt):
                    decoded = decoded[len(prompt):]
                print(decoded, end="", flush=True)

asyncio.run(try_request())
|
If you need to change other sampling parameters (such as temperature, top_p, top_k, max_tokens, early_stopping, ...), please refer to vLLM's Python client.
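As a rough illustration of how such parameters might be passed, the sketch below serializes them to JSON bytes. It assumes the vLLM backend accepts an optional BYTES input named `sampling_parameters` holding a JSON object, which is how the backend's sample client appears to send them; please verify this against the backend version you use. `encode_sampling_parameters` is a hypothetical helper.

```python
import json


def encode_sampling_parameters(**params):
    """Serialize sampling parameters to UTF-8 encoded JSON bytes."""
    # Values are sent as strings here (an assumption mirroring the sample client's style).
    # Boolean options are deliberately left out of this example.
    return json.dumps({name: str(value) for name, value in params.items()}).encode("utf-8")


if __name__ == "__main__":
    payload = encode_sampling_parameters(temperature=0.0, top_p=0.95, max_tokens=128)
    print(payload)
    # Inside create_request(), the payload could then be attached roughly like this:
    #   sampling = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
    #   sampling.set_data_from_numpy(np.array([payload], dtype=np.object_))
    #   inputs.append(sampling)
```

Keeping the serialization in a separate function makes it easy to inspect exactly what is sent before involving the server.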
OpenAI Compatible API Server
vLLM provides an OpenAI-compatible API server that implements the completions and chat APIs.
First, make sure that vllm-rbln is installed. Then start the API server by running the vllm.entrypoints.openai.api_server module as shown below.
| $ python -m vllm.entrypoints.openai.api_server \
--model <PATH/TO/rbln-Llama-2-7b-chat-hf> \
--device rbln \
--max-num-seqs 4 \
--max-num-batched-tokens 4096 \
--max-model-len 4096 \
--block-size 4096
|
Note that the command-line arguments mirror the fields of the model.json defined in the previous section.
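The correspondence can be sketched in a few lines. The example assumes each flag name is the model.json field name with underscores replaced by hyphens, which holds for the six fields used in this tutorial; `model_json_to_cli_args` is a hypothetical helper, not part of vLLM.

```python
def model_json_to_cli_args(config):
    """Convert model.json-style fields into api_server command-line arguments."""
    args = []
    for key, value in config.items():
        args.append("--" + key.replace("_", "-"))
        args.append(str(value))
    return args


if __name__ == "__main__":
    config = {
        "model": "<PATH/TO/rbln-Llama-2-7b-chat-hf>",
        "device": "rbln",
        "max_num_seqs": 4,
        "max_num_batched_tokens": 4096,
        "max_model_len": 4096,
        "block_size": 4096,
    }
    print(" ".join(model_json_to_cli_args(config)))
```

Generating both the model.json file and the command line from one dictionary is one way to keep the Triton and OpenAI-compatible deployments consistent.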
You may want to add --api-key <Random string to be used as API key> to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js clients or a curl command like the following.
| $ curl http://<host and port number of the server>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <API key, if specified when running the server>" \
-d '{
"model": "<PATH/TO/rbln-Llama-2-7b-chat-hf>",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}'
|
Please refer to the OpenAI Docs for more detail.
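With "stream": true, the response arrives as a sequence of `data:` lines, each carrying a JSON chunk, and ends with `data: [DONE]`. The sketch below shows one way to reassemble the generated text from such lines. It assumes the OpenAI chat streaming chunk layout (`choices[].delta.content`), and `collect_stream_text` is a hypothetical helper.

```python
import json


def collect_stream_text(lines):
    """Concatenate the content deltas found in OpenAI-style streaming response lines."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)


if __name__ == "__main__":
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        'data: {"choices": [{"delta": {"content": "lo!"}}]}',
        "data: [DONE]",
    ]
    print(collect_stream_text(sample))
```

The sample lines are hand-written for illustration, not captured server output; in practice the OpenAI Python client performs this parsing for you.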