vLLM Native API

With vllm-rbln, you can use the vLLM API to serve large language models (LLMs).

How to install

First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.

$ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.7.1" "optimum-rbln>=0.2.0" "vllm-rbln>=0.2.0"
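
If you want to confirm the installation before compiling, a quick import check like the following can help. This is a minimal sketch, not part of the original workflow; it only verifies that vllm and optimum-rbln are importable in your environment.

import vllm
from optimum.rbln import RBLNLlamaForCausalLM  # verifies optimum-rbln is importable

print(vllm.__version__)  # should match the version you installed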

Compile Llama2-7B

You need to compile the Llama2-7B model using optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM

# Export the Hugging Face PyTorch Llama2 model to an RBLN compiled model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4,  # number of ATOM+ devices for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # batch_size > 1 is recommended for continuous batching
)

compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

Choose an appropriate batch size for your serving needs. Here, it is set to 4.
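
Once compilation finishes, you can load the saved artifacts back with optimum-rbln before serving. The following is a minimal sketch, assuming that passing export=False to from_pretrained loads an already compiled model from the local directory instead of recompiling it.

from optimum.rbln import RBLNLlamaForCausalLM

# Load the precompiled model from the directory saved above.
# Assumption: export=False tells optimum-rbln to load compiled artifacts.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    "rbln-Llama-2-7b-chat-hf",
    export=False,
)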

vLLM API

You can use the compiled model with the vLLM API. The following code shows how to initialize the vLLM engine with the compiled model and run inference with it.

vllm_api_example.py
import asyncio
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


# Please make sure the engine configurations match the parameters used when compiling.
model_id = "rbln-Llama-2-7b-chat-hf"
max_seq_len = 4096
batch_size = 4

engine_args = AsyncEngineArgs(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=max_seq_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def stop_tokens():
  # Stop on the tokenizer's EOS token; also stop on "<|eot_id|>" if the tokenizer
  # defines it as an added token (used by some chat templates).
  eot_id = next((k for k, t in tokenizer.added_tokens_decoder.items() if t.content == "<|eot_id|>"), None)
  if eot_id is not None:
    return [tokenizer.eos_token_id, eot_id]
  else:
    return [tokenizer.eos_token_id]

sampling_params = SamplingParams(
  temperature=0.0,
  skip_special_tokens=True,
  stop_token_ids=stop_tokens(),
)


# Runs a single inference for an example
async def run_single(chat, request_id):
  results_generator = engine.generate(chat, sampling_params, request_id=request_id)
  final_result = None
  async for result in results_generator:
    # You can use the intermediate `result` here, if needed.
    final_result = result
  return final_result


conversation = [{"role": "user", "content": "What is the first letter of the English alphabet?"}]
chat = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
result = asyncio.run(run_single(chat, "123"))
print(result)


# Runs multiple inferences in parallel
async def run_multi(chats):
  tasks = [asyncio.create_task(run_single(chat, str(i))) for i, chat in enumerate(chats)]
  return [await task for task in tasks]


conversations = [
  [{"role": "user", "content": "What is the first letter of the English alphabet?"}],
  [{"role": "user", "content": "What is the last letter of the English alphabet?"}],
]
chats = [
  tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
  for conversation in conversations
]
results = asyncio.run(run_multi(chats))
print(results)
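
If you prefer offline batch inference over the asynchronous engine, vLLM's synchronous entry point can be used with the same engine options. The sketch below is an assumption-based example: it assumes vllm-rbln also supports the LLM class with the device and engine parameters shown above.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "rbln-Llama-2-7b-chat-hf"
max_seq_len = 4096
batch_size = 4

# Assumption: the synchronous LLM entry point accepts the same engine options
# as AsyncEngineArgs in the example above.
llm = LLM(
  model=model_id,
  device="rbln",
  max_num_seqs=batch_size,
  max_num_batched_tokens=max_seq_len,
  max_model_len=max_seq_len,
  block_size=max_seq_len,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(temperature=0.0, skip_special_tokens=True)

conversation = [{"role": "user", "content": "What is the first letter of the English alphabet?"}]
prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# generate() blocks until all prompts are finished and returns a list of RequestOutput objects.
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)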

You can find more vLLM API usage examples for encoder-decoder models and multi-modal models in the RBLN Model Zoo.

Please refer to the vLLM documentation for more information on the vLLM API.