OpenAI Compatible Server¶
vLLM provides an OpenAI compatible HTTP server that implements OpenAI's completions API and chat API. Please refer to the vLLM documentation for more details about the OpenAI compatible server. In this tutorial, we will guide you through setting up an OpenAI compatible server using the Llama3-8B and Llama3.1-8B models with Eager and Flash Attention, respectively. You'll learn how to deploy these models to create your own OpenAI API server.
How to install¶
First, make sure you have the latest versions of the required packages, including `rebel-compiler`, `optimum-rbln`, and `vllm-rbln`. You need access rights to Rebellions' private PyPI server. Please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.
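As a rough sketch, the packages can be installed with pip once you have access to the private index; the index URL below is a placeholder and should be replaced with the one given in the Installation Guide.

```bash
# Placeholder index URL -- replace with Rebellions' private PyPI URL
# from the Installation Guide.
pip3 install --extra-index-url https://<private-pypi-host>/simple \
    rebel-compiler optimum-rbln vllm-rbln
```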
Standard Model Example: Llama3-8B¶
Step1: Compile Llama3-8B¶
You need to compile the Llama3-8B model using `optimum-rbln`, as shown in the sketch below.
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 4.
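The following is a minimal compile sketch, assuming the Hugging Face model ID `meta-llama/Meta-Llama-3-8B-Instruct` and illustrative values for the maximum sequence length and tensor parallel size; check the optimum-rbln documentation for the exact arguments supported by your version.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile Llama3-8B for RBLN NPUs. The rbln_* values below are illustrative;
# the batch size of 4 matches the note above and must also match max_num_seqs
# when the vLLM server is launched in Step 2.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,                  # compile instead of loading an already compiled model
    rbln_batch_size=4,            # serving batch size
    rbln_max_seq_len=8192,        # maximum context length (illustrative)
    rbln_tensor_parallel_size=4,  # number of RBLN devices to shard across (illustrative)
)

# Save the compiled artifacts; pass this directory's absolute path to vLLM's --model.
compiled_model.save_pretrained("Meta-Llama-3-8B-Instruct")
```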
Step2: Run OpenAI API server¶
First, make sure that `vllm-rbln` is installed. Then you can start the API server by running the `vllm.entrypoints.openai.api_server` module as shown below.
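The following is a sketch of the launch command, assuming the compiled model directory from Step 1 and the illustrative 8192-token maximum sequence length used there; the flags are explained below.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /absolute/path/to/Meta-Llama-3-8B-Instruct \
    --device rbln \
    --max-num-seqs 4 \
    --max-model-len 8192 \
    --block-size 8192 \
    --max-num-batched-tokens 8192
```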
- `model`: Absolute path of the compiled model.
- `device`: Device type for vLLM execution. Please set this to `rbln`.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled argument `batch_size`.
- `block_size`: This should be set to the same value as `max_model_len`. (When applying Flash Attention, this needs to be set differently; please refer to the example.)
- When targeting an RBLN device with Eager Attention mode, the `block_size` and `max_num_batched_tokens` fields should be set to the same value as `max_model_len`.
- You may want to add `--api-key <Random string to be used as API key>` to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js clients or a curl command like the following.
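For example, a chat completion request with curl might look like this (the model path and prompt are placeholders):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/absolute/path/to/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```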
Note
When running an API server, the `--model` value is used as the unique ID for that API server. Therefore, the `"model"` value in the curl command should be exactly the same as the `--model` value used when starting the API server.
Please refer to the OpenAI Docs for more information.
Advanced Example: Llama3.1-8B with Flash Attention¶
Flash Attention enables efficient handling of long contexts in models like Llama3.1-8B
by reducing memory usage and improving throughput. When working with optimum-rbln
, Flash Attention can be enabled by adding rbln_kvcache_partition_len
parameter when compiling.
Step1: Compile Llama3.1-8B¶
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 1.
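The following is a minimal compile sketch, assuming the Hugging Face model ID `meta-llama/Llama-3.1-8B-Instruct` and illustrative values for the context length, KV-cache partition length, and tensor parallel size; check the optimum-rbln documentation for the exact arguments supported by your version.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile Llama3.1-8B with Flash Attention. Passing rbln_kvcache_partition_len
# is what enables Flash Attention; the values below are illustrative.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,                       # compile instead of loading an already compiled model
    rbln_batch_size=1,                 # matches the note above
    rbln_max_seq_len=131072,           # long context length (illustrative)
    rbln_kvcache_partition_len=16384,  # enables Flash Attention; must equal vLLM's block_size
    rbln_tensor_parallel_size=8,       # number of RBLN devices to shard across (illustrative)
)

# Save the compiled artifacts; pass this directory's absolute path to vLLM's --model.
compiled_model.save_pretrained("Llama-3.1-8B-Instruct")
```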
Step2: Run OpenAI API server¶
First, make sure that `vllm-rbln` is installed. Then you can start the API server by running the `vllm.entrypoints.openai.api_server` module as shown below.
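The following is a sketch of the launch command, assuming the compiled model directory from Step 1 and the illustrative values used there (note that `--block-size` equals `rbln_kvcache_partition_len`, not `max_model_len`); the flags are explained below.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /absolute/path/to/Llama-3.1-8B-Instruct \
    --device rbln \
    --max-num-seqs 1 \
    --max-model-len 131072 \
    --block-size 16384 \
    --max-num-batched-tokens 131072
```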
- `model`: Absolute path of the compiled model.
- `device`: Device type for vLLM execution. Please set this to `rbln`.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled argument `batch_size`.
- `block_size`: The size of the block for Paged Attention. When using Flash Attention, the block size must be equal to `rbln_kvcache_partition_len`.
- The `max_num_batched_tokens` field should be set to the same value as `max_model_len`.
- You may want to add `--api-key <Random string to be used as API key>` to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js clients or curl commands like the following.
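For example, with the OpenAI Python client (the base URL, API key, model path, and prompt below are placeholders):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # replace with the value passed to --api-key, if authentication is enabled
)

response = client.chat.completions.create(
    model="/absolute/path/to/Llama-3.1-8B-Instruct",  # must match the --model value exactly
    messages=[{"role": "user", "content": "Explain Flash Attention in one sentence."}],
)
print(response.choices[0].message.content)
```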
Note
When running an API server, the `--model` value is used as the unique ID for that API server. Therefore, the `"model"` value in the request should be exactly the same as the `--model` value used when starting the API server.
Please refer to the OpenAI Docs for more information.