OpenAI Compatible Server

vLLM provides an OpenAI compatible HTTP server that implements OpenAI's Completions and Chat APIs. Please refer to the vLLM documentation for more details about the OpenAI compatible server. In this tutorial, we will guide you through setting up an OpenAI compatible server using the Llama3-8B and Llama3.1-8B models with Eager Attention and Flash Attention, respectively. You'll learn how to deploy these models to create your own OpenAI API server.

How to install

First, make sure you have the latest versions of the required packages, including rebel-compiler, optimum-rbln, and vllm-rbln. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.

$ pip3 install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.7.3" "optimum-rbln>=0.7.3.post2" "vllm-rbln>=0.7.3"

Standard Model Example: Llama3-8B

Step 1: Compile Llama3-8B

You need to compile the Llama3-8B model using optimum-rbln.

from optimum.rbln import RBLNLlamaForCausalLM
import os

# Export HuggingFace PyTorch Llama3 model to RBLN compiled model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,  # number of ATOM+ for Rebellions Scalable Design (RSD)
    rbln_batch_size=4,            # batch_size > 1 is recommended for continuous batching
)

# Save compiled results to disk
compiled_model.save_pretrained(os.path.basename(model_id))

Note

You can choose an appropriate batch size for your serving needs. Here, it is set to 4.

Step 2: Run OpenAI API server

First, make sure that vllm-rbln is installed. Then you can start the API server by running the vllm.entrypoints.openai.api_server module as shown below.

$ python3 -m vllm.entrypoints.openai.api_server \
  --model <PATH/TO/Meta-Llama-3-8B-Instruct> \
  --device rbln \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --max-model-len 8192 \
  --block-size 8192

  • model: Absolute path of the compiled model.
  • device: Device type for vLLM execution. Please set this to rbln.
  • max_num_seqs: Maximum number of sequences per iteration. This MUST match the batch_size used at compile time.
  • block_size: Set this to the same value as max_model_len. (When applying Flash Attention, it must be set differently; please refer to the example below.)
  • When targeting an RBLN device in Eager Attention mode, the block_size and max_num_batched_tokens fields should be set to the same value as max_model_len.
  • You may want to add --api-key <Random string to be used as API key> to enable authentication.

Once your API server is running, you can call it using the OpenAI Python and Node.js clients or a curl command like the following.

$ curl http://<host and port number of the server>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API key, if specified when running the server>" \
  -d '{
    "model": "<PATH/TO/Meta-Llama-3-8B-Instruct>",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
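
You can issue the same request with the official openai Python client. The following is a minimal sketch; the base_url (host and port), api_key, and model path are placeholders that you should replace with the values used when starting your server.

from openai import OpenAI

# Placeholders: adjust the base URL, API key, and model path to your server setup.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # use the value passed to --api-key, if any
)

stream = client.chat.completions.create(
    model="<PATH/TO/Meta-Llama-3-8B-Instruct>",  # must match the --model value
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)

# Print the streamed tokens as they arrive
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()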

Note

When running an API server, the --model value is used as the unique ID for that API server. Therefore, the "model" value in the curl command should be exactly the same as the --model value used when starting the API server.

Please refer to the OpenAI Docs for more information.

Advanced Example: Llama3.1-8B with Flash Attention

Flash Attention enables efficient handling of long contexts in models like Llama3.1-8B by reducing memory usage and improving throughput. When working with optimum-rbln, Flash Attention can be enabled by adding the rbln_kvcache_partition_len parameter when compiling.

Step 1: Compile Llama3.1-8B

from optimum.rbln import RBLNLlamaForCausalLM
import os

# Export HuggingFace PyTorch Llama3.1 model to RBLN compiled model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # Export a PyTorch model to RBLN model with Optimum
    rbln_batch_size=1, # Batch size
    rbln_max_seq_len=131_072,  # Maximum sequence length
    rbln_tensor_parallel_size=8,  # Tensor parallelism
    rbln_kvcache_partition_len=16_384,  # Length of KV cache partitions for Flash Attention
)

# Save compiled results to disk
model.save_pretrained(os.path.basename(model_id))

Note

You can choose an appropriate batch size for your serving needs. Here, it is set to 1.

Step 2: Run OpenAI API server

First, make sure that vllm-rbln is installed. Then you can start the API server by running the vllm.entrypoints.openai.api_server module as shown below.

$ python3 -m vllm.entrypoints.openai.api_server \
  --model <PATH/TO/Llama-3.1-8B-Instruct> \
  --device rbln \
  --max-num-seqs 1 \
  --max-num-batched-tokens 131072 \
  --max-model-len 131072 \
  --block-size 16384

  • model: Absolute path of the compiled model.
  • device: Device type for vLLM execution. Please set this to rbln.
  • max_num_seqs: Maximum number of sequences per iteration. This MUST match the batch_size used at compile time.
  • block_size: The block size for Paged Attention. When using Flash Attention, this must be equal to rbln_kvcache_partition_len.
  • The max_num_batched_tokens field should be set to the same value as max_model_len.
  • You may want to add --api-key <Random string to be used as API key> to enable authentication.

Once your API server is running, you can call it using the OpenAI Python and Node.js clients or a curl command like the following.

$ curl http://<host and port number of the server>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API key, if specified when running the server>" \
  -d '{
    "model": "<PATH/TO/Llama-3.1-8B-Instruct>",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
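
As a sketch, the same request can also be sent with the openai Python client; because this configuration supports a 131,072-token context, much longer prompts can be passed. The base_url, api_key, model path, and the long_document.txt file below are placeholders for illustration.

from openai import OpenAI

# Placeholders: adjust the base URL, API key, and model path to your server setup.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # use the value passed to --api-key, if any
)

# Hypothetical long input; the 131,072-token max-model-len of this
# configuration allows very long prompts to be processed.
with open("long_document.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="<PATH/TO/Llama-3.1-8B-Instruct>",  # must match the --model value
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the following document:\n\n" + document},
    ],
)
print(response.choices[0].message.content)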

Note

When running an API server, the --model value is used as the unique ID for that API server. Therefore, the "model" value in the curl command should be exactly the same as the --model value used when starting the API server.

Please refer to the OpenAI Docs for more information.