OpenAI Compatible Server
vLLM provides an OpenAI-compatible HTTP server that implements OpenAI's Completions and Chat APIs. Please refer to the vLLM documentation for more details about the OpenAI compatible server.
How to install
First, make sure you have the latest versions of the required packages, including `rebel-compiler`, `optimum-rbln`, and `vllm-rbln`. You need access rights to Rebellions' private PyPI server; please refer to the Installation Guide for more information. You can find the latest versions of the packages in the Release Note.
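Assuming your access to the private PyPI server is already set up, the installation is a single pip command along these lines. The index URL below is a placeholder; use the actual one given in the Installation Guide.

```bash
# Placeholder index URL; substitute the one from the Installation Guide.
pip3 install -i https://<REBELLIONS_PRIVATE_PYPI>/simple \
    rebel-compiler optimum-rbln vllm-rbln
```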
Compile Llama2-7B
You need to compile the Llama2-7B model using `optimum-rbln`. Choose an appropriate batch size for your serving needs; here it is set to 4.
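A minimal compile script could look like the following sketch. The Hugging Face model ID, the `max_seq_len` of 4096, and the tensor parallel size of 4 are illustrative assumptions; only the batch size of 4 is taken from this guide.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile Llama2-7B for RBLN NPUs; export=True triggers compilation.
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # assumed model ID; use your own checkpoint
    export=True,
    rbln_batch_size=4,            # must match max_num_seqs at serving time
    rbln_max_seq_len=4096,        # assumed maximum sequence length
    rbln_tensor_parallel_size=4,  # assumed number of NPUs to shard across
)

# Save the compiled model; pass this directory to vLLM's --model flag.
model.save_pretrained("rbln-Llama-2-7b-chat-hf")
```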
Run OpenAI API server
First, make sure that `vllm-rbln` is installed. Then you can start the API server by running the `vllm.entrypoints.openai.api_server` module as shown below.
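For example, with the model compiled above (the model path is a placeholder, and the values mirror the compile-time settings assumed earlier):

```bash
# max-num-seqs must equal the compiled batch_size (4 here); max-model-len,
# block-size, and max-num-batched-tokens must equal the compiled max_seq_len.
python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/rbln-Llama-2-7b-chat-hf \
    --device rbln \
    --max-num-seqs 4 \
    --max-model-len 4096 \
    --block-size 4096 \
    --max-num-batched-tokens 4096
```

Each of these options is described below.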
- `model`: Absolute path of the compiled model.
- `device`: Device type for vLLM execution. Please set this to `rbln`.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled `batch_size`.
- When targeting the RBLN device, the `max_model_len`, `block_size`, and `max_num_batched_tokens` fields should all be set to the same value as the `max_seq_len` used at compile time.
. -
You may want to add
--api-key <Random string to be used as API key>
to enable authentication.
Once your API server is running, you can call it using the OpenAI Python and Node.js client libraries, or with a curl command like the following. Please refer to the OpenAI Docs for more detail.
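For instance, here is a minimal sketch with the OpenAI Python client, assuming the server runs on vLLM's default port 8000 and was started without `--api-key`:

```python
from openai import OpenAI

# Point the client at the local vLLM server rather than api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # any string works if the server has no --api-key set
)

completion = client.chat.completions.create(
    # The model name is the same path that was passed to --model.
    model="/path/to/rbln-Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(completion.choices[0].message.content)
```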