OpenAI Compatible Server

vLLM provides an OpenAI compatible HTTP server that implements OpenAI's completions API and chat API.

Please refer to the vLLM documentation for more details about the OpenAI compatible server.

How to install

First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Notes.

$ pip3 install -i https://pypi.rbln.ai/simple/ "rebel-compiler>=0.7.1" "optimum-rbln>=0.2.0" "vllm-rbln>=0.2.0"
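To verify the installation, you can print the installed package versions. This is a minimal check using importlib.metadata from the Python standard library; the distribution names are the same ones passed to pip above.

from importlib.metadata import version

# Distribution names as used in the pip install command above.
for pkg in ("rebel-compiler", "optimum-rbln", "vllm-rbln"):
    print(f"{pkg}: {version(pkg)}")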

Compile Llama2-7B

You need to compile the Llama2-7B model using optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM

# Export the Hugging Face PyTorch Llama 2 model to an RBLN compiled model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4,  # number of ATOM+ devices for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # batch_size > 1 is recommended for continuous batching
)

compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

Choose an appropriate batch size for your serving needs. Here, it is set to 4.
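Once saved, the compiled model can be reloaded in a later session without recompiling. Below is a minimal sketch, assuming the usual optimum-rbln convention that from_pretrained with export=False restores a precompiled model from a local directory.

from optimum.rbln import RBLNLlamaForCausalLM

# Reload the precompiled artifacts from the directory written by
# save_pretrained() above; export=False skips recompilation.
model = RBLNLlamaForCausalLM.from_pretrained(
    "rbln-Llama-2-7b-chat-hf",
    export=False,
)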

Run OpenAI API server

First, make sure that vllm-rbln is installed. Then you can start the API server by running the vllm.entrypoints.openai.api_server module as shown below.

$ python -m vllm.entrypoints.openai.api_server \
  --model <PATH/TO/rbln-Llama-2-7b-chat-hf> \
  --device rbln \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --max-model-len 4096 \
  --block-size 4096

  • model: Absolute path to the compiled model.

  • device: Device type for vLLM execution. Please set this to rbln.

  • max_num_seqs: Maximum number of sequences per iteration. This MUST match the batch size used at compile time (rbln_batch_size, 4 in this example).

  • When targeting an RBLN device, the max_model_len, block_size, and max_num_batched_tokens options should all be set to the same value as the rbln_max_seq_len used at compile time (4096 in this example).

  • You may want to add --api-key <Random string to be used as API key> to enable authentication.
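Before sending requests, you may want to wait until the server is ready. vLLM's OpenAI server exposes a /health endpoint; below is a small polling sketch, assuming the server listens on the default address localhost:8000.

import time
import urllib.request

# Poll the /health endpoint until the server responds with HTTP 200.
url = "http://localhost:8000/health"
for _ in range(60):
    try:
        if urllib.request.urlopen(url).status == 200:
            print("Server is ready.")
            break
    except OSError:
        time.sleep(5)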

Once your API server is running, you can call it with the OpenAI Python or Node.js client libraries, or with a curl command like the following.

$ curl http://<host and port number of the server>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API key, if specified when running the server>" \
  -d '{
    "model": "<PATH/TO/rbln-Llama-2-7b-chat-hf>",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
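
For reference, the same streaming request can be issued with the official openai Python client. This is a sketch; the base_url, api_key, and model path are placeholders to replace with your own values.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # host and port of your server
    api_key="EMPTY",                      # or the value passed via --api-key
)

stream = client.chat.completions.create(
    model="<PATH/TO/rbln-Llama-2-7b-chat-hf>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)

# Print tokens as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)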

Please refer to the OpenAI Docs for more details.