OpenAI-Compatible Server

Overview

vLLM provides an OpenAI-compatible HTTP server that implements OpenAI's completions API, chat API, and others. For more details on the OpenAI-compatible server, see the vLLM documentation. This tutorial walks through setting up an OpenAI-compatible server with the Llama3-8B model using Eager and Flash Attention, respectively. By deploying this model, you will learn how to build an OpenAI API server for a model of your choice.

Environment Setup and Installation Check

Before you begin, make sure your system environment is configured correctly and that all required packages are installed. This includes the following:

System requirements:
- Python: 3.9–3.12
- RBLN Driver
Required packages:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln (vllm is installed automatically alongside it)

Installation commands:

pip install "optimum-rbln>=0.9.4" "vllm-rbln>=0.9.4"
pip install --extra-index-url https://pypi.rbln.ai/simple/ "rebel-compiler>=0.9.4"

Note

Note that an RBLN Portal account is required to use rebel-compiler.
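To confirm that the packages above are actually present in the active environment, a small check along the following lines may help. This is a minimal sketch using only the Python standard library. The distribution names are taken from the install commands, and the helper name installed_versions is hypothetical. It only reports versions and does not enforce the >=0.9.4 constraint.

```python
from importlib import metadata

# Distribution names taken from the install commands above.
PACKAGES = ["optimum-rbln", "vllm-rbln", "rebel-compiler", "vllm"]


def installed_versions(packages):
    """Return {name: version string, or None if the distribution is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so missing installs are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```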

Example (Llama3-8B)

Compiling the Model

First, import the RBLNLlamaForCausalLM class from optimum-rbln. Its from_pretrained() method downloads the Llama 3 model from the HuggingFace Hub and compiles it with the RBLN SDK. When exporting the model, specify the following parameters:

export: Must be set to True to compile the model.
rbln_batch_size: Defines the batch size for compilation.
rbln_max_seq_len: Defines the maximum sequence length.
rbln_tensor_parallel_size: Defines the number of NPUs used for inference.

After compilation, use the save_pretrained() method to save the model artifacts to disk. This creates a directory containing the compiled model (for example, rbln-Llama-3-8B-Instruct).

from optimum.rbln import RBLNLlamaForCausalLM

# Define the HuggingFace model ID
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile the model for 4 RBLN NPUs
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=4,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
)

compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")
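Before pointing the server at the output directory, it can be useful to confirm that save_pretrained() actually wrote something there. The following is a minimal sketch using only the standard library. The helper name list_compiled_artifacts is hypothetical, and no particular artifact file names are assumed since they depend on the SDK version.

```python
from pathlib import Path


def list_compiled_artifacts(model_dir):
    """Return the sorted relative paths of all files under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"compiled model directory not found: {root}")
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


# Example usage after compilation (directory name from the tutorial):
# for name in list_compiled_artifacts("rbln-Llama-3-8B-Instruct"):
#     print(name)
```

An empty list or a FileNotFoundError here suggests that the compile step did not complete or that the path differs from the one passed to save_pretrained().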

Running the OpenAI API Server

Running the vllm.entrypoints.openai.api_server module as follows starts the API server.

python3 -m vllm.entrypoints.openai.api_server \
    --model rbln-Llama-3-8B-Instruct

Once the API server is running, you can call the API with OpenAI's Python or Node.js clients, or with a curl command as shown below.

$ curl http://<host and port number of the server>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <API key, if specified when running the server>" \
-d '{
    "model": "rbln-Llama-3-8B-Instruct",
    "messages": [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "Hello!"
    }
    ],
    "stream": true
}'
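The same request can be assembled in Python. The following is a minimal sketch that uses only the standard library rather than the OpenAI client, so it runs without extra dependencies. The helper name build_chat_request is hypothetical, and the host and port (localhost:8000) are assumptions that should be replaced with the address the server actually listens on.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None, stream=True):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed if the server was started with an API key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000",  # assumed host and port
    model="rbln-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

# To send it against a running server and print the raw stream:
# with urllib.request.urlopen(request) as response:
#     for raw_line in response:
#         print(raw_line.decode("utf-8").rstrip())
```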

Note

The --model value used when launching the API server is used as the ID for that API server. The "model" value in the curl command must therefore exactly match the --model value used at launch.
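If you are unsure which ID the server expects, the OpenAI API's model listing endpoint (GET /v1/models) returns it. The sketch below only parses such a response body. The helper name served_model_ids is hypothetical, and the sample JSON is an illustrative payload in the OpenAI list format, not output captured from a real server.

```python
import json


def served_model_ids(models_response_text):
    """Extract the model IDs from the JSON body of a GET /v1/models response."""
    payload = json.loads(models_response_text)
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative payload only; a real response carries additional fields.
sample = '{"object": "list", "data": [{"id": "rbln-Llama-3-8B-Instruct", "object": "model"}]}'
print(served_model_ids(sample))
```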

Example output:

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"Hello","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"!","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" It","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"'s","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" nice","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" to","tool_calls":[]}}]}

…

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" to","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":" discuss","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":".","tool_calls":[]}}]}

data: {"id":"chatcmpl-<ID>","object":"chat.completion.chunk","created":<time>,"model":"rbln-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"","tool_calls":[]},"finish_reason":"stop"}]}

data: [DONE]
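With "stream": true, each data: line carries one chunk whose delta holds the next piece of text, and the stream ends with data: [DONE]. The following minimal sketch shows one way to reassemble the reply from such lines. The helper name collect_stream_text is hypothetical, and the sample lines are trimmed-down stand-ins for the output above (the <ID> and <time> placeholders are omitted so that each line is valid JSON).

```python
import json


def collect_stream_text(lines):
    """Concatenate the delta "content" fields from chat.completion.chunk events."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"!"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))  # prints: Hello!
```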