OpenAI Compatible Server¶
vLLM provides an OpenAI compatible HTTP server that implements OpenAI's completions API and chat API. Please refer to the vLLM documentation for more details about the OpenAI compatible server. In this tutorial, we will guide you through setting up an OpenAI compatible server using the Llama3-8B and Llama3.1-8B models with Eager and Flash Attention, respectively. You'll learn how to deploy these models to create your own OpenAI API server.
How to install¶
First, make sure you have the latest versions of the required packages, including `rebel-compiler`, `optimum-rbln`, and `vllm-rbln`. You need access rights to Rebellions' private PyPI server. Please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.
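As a rough sketch, the packages can be installed with pip once you have access to the private index; the index URL below is a placeholder and should be replaced with the one given in the Installation Guide.

```bash
# Placeholder index URL -- replace with Rebellions' private PyPI URL
# from the Installation Guide.
pip3 install --extra-index-url https://<private-pypi-host>/simple \
    rebel-compiler optimum-rbln vllm-rbln
```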
Standard Model Example: Llama3-8B¶
Step1: Compile Llama3-8B¶
You need to compile the Llama3-8B model using `optimum-rbln`, as shown in the sketch below.
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 4.
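The following is a minimal compile sketch, assuming the Hugging Face model ID `meta-llama/Meta-Llama-3-8B-Instruct` and illustrative values for the maximum sequence length and tensor parallel size; check the optimum-rbln documentation for the exact arguments supported by your version.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile Llama3-8B for RBLN NPUs. The rbln_* values below are illustrative;
# the batch size of 4 matches the note above and must also match max_num_seqs
# when the vLLM server is launched in Step 2.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,                  # compile instead of loading an already compiled model
    rbln_batch_size=4,            # serving batch size
    rbln_max_seq_len=8192,        # maximum context length (illustrative)
    rbln_tensor_parallel_size=4,  # number of RBLN devices to shard across (illustrative)
)

# Save the compiled artifacts; pass this directory's absolute path to vLLM's --model.
compiled_model.save_pretrained("Meta-Llama-3-8B-Instruct")
```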
Step2: Run OpenAI API server¶
First, make sure that `vllm-rbln` is installed. Then you can start the API server by running the `vllm.entrypoints.openai.api_server` module as shown below.
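The following is a sketch of the launch command, assuming the compiled model directory from Step 1 and the illustrative 8192-token maximum sequence length used there; the flags are explained below.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /absolute/path/to/Meta-Llama-3-8B-Instruct \
    --device rbln \
    --max-num-seqs 4 \
    --max-model-len 8192 \
    --block-size 8192 \
    --max-num-batched-tokens 8192
```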
- `model`: Absolute path of the compiled model.
- `device`: Device type for vLLM execution. Please set this to `rbln`.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled argument `batch_size`.
- `block_size`: This should be set to the same value as `max_model_len`. (When applying Flash Attention, this needs to be set differently; please refer to the example.)
- When targeting an RBLN device with Eager Attention mode, the `block_size` and `max_num_batched_tokens` fields should be set to the same value as `max_model_len`.
- You may want to add `--api-key <Random string to be used as API key>` to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js clients or a curl command like the following.
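For example, a chat completion request with curl might look like this (the model path and prompt are placeholders):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/absolute/path/to/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```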
Note
When running an API server, the `--model` value is used as the unique ID for that API server. Therefore, the `"model"` value in the curl command should be exactly the same as the `--model` value used when starting the API server.
Please refer to the OpenAI Docs for more information.
Advanced Example: Llama3.1-8B with Flash Attention¶
Flash Attention enables efficient handling of long contexts in models like Llama3.1-8B
by reducing memory usage and improving throughput. When working with optimum-rbln
, Flash Attention can be enabled by adding rbln_kvcache_partition_len
parameter when compiling.
Step1: Compile Llama3.1-8B¶
Note
You can choose an appropriate batch size for your serving needs. Here, it is set to 1.
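The following is a minimal compile sketch, assuming the Hugging Face model ID `meta-llama/Llama-3.1-8B-Instruct` and illustrative values for the context length, KV-cache partition length, and tensor parallel size; check the optimum-rbln documentation for the exact arguments supported by your version.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile Llama3.1-8B with Flash Attention. Passing rbln_kvcache_partition_len
# is what enables Flash Attention; the values below are illustrative.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,                       # compile instead of loading an already compiled model
    rbln_batch_size=1,                 # matches the note above
    rbln_max_seq_len=131072,           # long context length (illustrative)
    rbln_kvcache_partition_len=16384,  # enables Flash Attention; must equal vLLM's block_size
    rbln_tensor_parallel_size=8,       # number of RBLN devices to shard across (illustrative)
)

# Save the compiled artifacts; pass this directory's absolute path to vLLM's --model.
compiled_model.save_pretrained("Llama-3.1-8B-Instruct")
```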
Step2: Run OpenAI API server¶
First, make sure that `vllm-rbln` is installed. Then you can start the API server by running the `vllm.entrypoints.openai.api_server` module as shown below.
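The following is a sketch of the launch command, assuming the compiled model directory from Step 1 and the illustrative values used there (note that `--block-size` equals `rbln_kvcache_partition_len`, not `max_model_len`); the flags are explained below.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /absolute/path/to/Llama-3.1-8B-Instruct \
    --device rbln \
    --max-num-seqs 1 \
    --max-model-len 131072 \
    --block-size 16384 \
    --max-num-batched-tokens 131072
```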
- `model`: Absolute path of the compiled model.
- `device`: Device type for vLLM execution. Please set this to `rbln`.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled argument `batch_size`.
- `block_size`: The size of the block for Paged Attention. When using Flash Attention, the block size must be equal to `rbln_kvcache_partition_len`.
- The `max_num_batched_tokens` field should be set to the same value as `max_model_len`.
- You may want to add `--api-key <Random string to be used as API key>` to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js clients or curl commands like the following.
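For example, with the OpenAI Python client (the base URL, API key, model path, and prompt below are placeholders):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # replace with the value passed to --api-key, if authentication is enabled
)

response = client.chat.completions.create(
    model="/absolute/path/to/Llama-3.1-8B-Instruct",  # must match the --model value exactly
    messages=[{"role": "user", "content": "Explain Flash Attention in one sentence."}],
)
print(response.choices[0].message.content)
```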
Note
When running an API server, the `--model` value is used as the unique ID for that API server. Therefore, the `"model"` value in the request should be exactly the same as the `--model` value used when starting the API server.
Please refer to the OpenAI Docs for more information.