Llama3 8B

Overview

This tutorial explains how to compile and deploy HuggingFace's Llama 3 model on multiple RBLN NPUs. The guide uses the meta-llama/Meta-Llama-3-8B-Instruct model.

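Environment Setup and Installation Check

Before you begin, make sure your system environment is configured correctly and that all required packages are installed. This includes:

System requirements:
- Python: 3.10–3.13
- RBLN driver
Required packages:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln (vllm is installed automatically as a dependency)

Installation commands: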

pip install \
  --extra-index-url https://pypi.rbln.ai/simple \
  rebel-compiler==0.10.4.post1
pip install \
  --extra-index-url https://wheels.vllm.ai/0.18.0/cpu \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  vllm-rbln==0.10.4

Note

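Note that using rebel-compiler requires an RBLN Portal account.
The commands above assume the typical pip-based installation procedure on a Debian-based Linux distribution such as Ubuntu. For other operating systems or environments, check the installation guide for the supported installation combinations and the applicable commands.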
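
As a quick sanity check after installation, the requirements listed above can be verified from Python. The sketch below is only an illustration: it assumes the distribution names match the ones used in the package list and install commands (for example, that the RBLN Compiler is installed as rebel-compiler), and the helper names are hypothetical.

```python
import sys
from importlib import metadata

# Distribution names taken from the package list and install commands above.
REQUIRED = [
    "torch",
    "transformers",
    "numpy",
    "rebel-compiler",
    "optimum-rbln",
    "huggingface_hub",
    "vllm-rbln",
    "vllm",
]


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter is within the 3.10 to 3.13 range."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 13)


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<16} {version or 'NOT INSTALLED'}")
```

This only confirms that the packages are present. It does not check the RBLN driver or whether the NPUs are visible to the system.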

Note

The meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace is gated. Once you have been granted access, log in with the hf (huggingface-cli) command as shown below:

$ hf auth login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
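
Before starting a long compilation, it can help to confirm that a HuggingFace token is available to the process. The following sketch assumes the default huggingface_hub conventions (the HF_TOKEN environment variable, or a token file under HF_HOME, which defaults to ~/.cache/huggingface); the function name is hypothetical, and the check does not verify that the token actually grants access to the gated model.

```python
import os
from pathlib import Path


def hf_token_available(env=os.environ):
    """Best-effort check for a stored HuggingFace token.

    Assumes default huggingface_hub locations; custom setups may differ.
    """
    # A token passed through the environment takes effect without a login.
    if env.get("HF_TOKEN"):
        return True
    # Otherwise look for the token file written by the login command.
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    token_path = Path(env.get("HF_TOKEN_PATH", hf_home / "token")).expanduser()
    return token_path.is_file() and token_path.read_text().strip() != ""


if __name__ == "__main__":
    print("HuggingFace token found:", hf_token_available())
```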

1. Running with Pre-compilation

This section explains how to serve the meta-llama/Meta-Llama-3-8B-Instruct model with vLLM on multiple RBLN NPUs.

1.1. Compile the Model

First, import the RBLNLlamaForCausalLM class from optimum-rbln. Its from_pretrained() method downloads the Llama 3 model from the HuggingFace Hub and compiles it with the RBLN Compiler. When exporting the model, specify the following parameters:

rbln_batch_size: the batch size to compile for.
rbln_max_seq_len: the maximum sequence length.
rbln_tensor_parallel_size: the number of NPUs to use for inference.

After compilation, save the compiled model to disk with the save_pretrained() method. This creates a directory containing the compiled model (for example, rbln-Llama-3-8B-Instruct).

Note

Choose a batch size appropriate for the model size and NPU specifications. vllm-rbln also supports dynamic batching to improve throughput and resource utilization. See Dynamic Batching for details.
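
To get a feel for how batch size and sequence length interact, a rough KV cache estimate can be useful. The sketch below is a back-of-the-envelope calculation, not a description of how the RBLN Compiler allocates memory: it assumes the published Llama 3 8B configuration (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache values, and the function name is hypothetical.

```python
def kv_cache_bytes(
    batch_size,
    max_seq_len,
    num_layers=32,      # assumed from the published Llama 3 8B config
    num_kv_heads=8,     # grouped-query attention: 8 key/value heads
    head_dim=128,
    bytes_per_value=2,  # assumes 16-bit cache entries
):
    """Estimate total KV cache size in bytes for a full batch at max length."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * max_seq_len * batch_size


if __name__ == "__main__":
    for batch in (1, 4, 8):
        gib = kv_cache_bytes(batch, 8192) / 1024**3
        print(f"batch_size={batch}: about {gib:.1f} GiB of KV cache")
```

Under these assumptions, the settings used in this guide (batch size 4, sequence length 8192) correspond to about 4 GiB of KV cache in total, on top of the model weights. How that memory is split across NPUs under tensor parallelism depends on the compiler and is not covered here.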

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=4,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
)

# Save compiled results to disk
model.save_pretrained("rbln-Llama-3-8B-Instruct")
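
Because compilation can take a while, a script may want to skip it when the output directory already exists. The following is a minimal sketch of that pattern, using the same from_pretrained() and save_pretrained() calls shown above. The helper names are hypothetical, and a non-empty directory is only a heuristic: it does not prove that a previous compilation finished or that it used the same parameters.

```python
from pathlib import Path


def is_compiled(output_dir):
    """Heuristic: treat an existing, non-empty directory as already compiled."""
    path = Path(output_dir)
    return path.is_dir() and any(path.iterdir())


def compile_if_needed(model_id, output_dir, **rbln_kwargs):
    """Compile and save the model unless output_dir already has content.

    Returns True if a compilation was run, False if it was skipped.
    """
    if is_compiled(output_dir):
        return False
    # Imported lazily so the directory check works without optimum-rbln.
    from optimum.rbln import RBLNLlamaForCausalLM

    model = RBLNLlamaForCausalLM.from_pretrained(
        model_id=model_id,
        export=True,
        **rbln_kwargs,
    )
    model.save_pretrained(output_dir)
    return True
```

A call such as compile_if_needed("meta-llama/Meta-Llama-3-8B-Instruct", "rbln-Llama-3-8B-Instruct", rbln_batch_size=4, rbln_max_seq_len=8192, rbln_tensor_parallel_size=4) would then mirror the example above. Delete the directory when changing any of the compile parameters.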

1.2. Inference with vLLM

The compiled model can be used with vLLM. The example below sets up a vLLM engine with the compiled model and runs inference.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

def main():
    model_id = "rbln-Llama-3-8B-Instruct"
    llm = LLM(model=model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.0,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
    )

    conversation = [
        {
            "role": "user",
            "content": "What is the first letter of English alphabets?"
        }
    ]

    chat = tokenizer.apply_chat_template(
        conversation, 
        add_generation_prompt=True,
        tokenize=False
    )

    # generate() returns one result object per prompt
    outputs = llm.generate(chat, sampling_params)
    for output in outputs:
        generated_text = output.outputs[0].text
        print(generated_text)

if __name__ == "__main__":
    main()

Example output:

The first letter of the English alphabet is "A".
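
To make the role of apply_chat_template() in the example above more concrete, the sketch below renders a conversation by hand in the published Llama 3 Instruct prompt format. It is an illustration only: the function name is hypothetical, and the template bundled with the tokenizer remains the authoritative version and should be used in real code.

```python
def render_llama3_chat(conversation, add_generation_prompt=True,
                       bos_token="<|begin_of_text|>"):
    """Render messages in the Llama 3 Instruct prompt format (illustrative)."""
    parts = []
    for index, message in enumerate(conversation):
        # Each turn is a role header, a blank line, the text, then <|eot_id|>.
        turn = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        # The beginning-of-text token is prepended to the first turn only.
        parts.append(bos_token + turn if index == 0 else turn)
    if add_generation_prompt:
        # An open assistant header asks the model to produce the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        {"role": "user",
         "content": "What is the first letter of English alphabets?"}
    ]
    print(render_llama3_chat(conversation))
```

The string passed to llm.generate() is therefore plain text with special tokens embedded, which is why the example sets tokenize=False.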

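2. Running without Pre-compilation (Beta)

Info

This feature is in beta. Starting with vLLM RBLN 0.10.4, you can serve a model without a separate pre-compilation script. Pass the vLLM engine parameters directly to LLM(), and vLLM RBLN automatically handles the optimum-rbln compilation step when the engine starts.

2.1. Compilation and Inference with vLLM

Set the tensor parallel size with the VLLM_RBLN_TP_SIZE environment variable, and pass block_size, max_model_len, and max_num_seqs directly to LLM().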

Note

Compiled models are stored under $VLLM_CACHE_ROOT/compiled_models. VLLM_CACHE_ROOT defaults to ~/.cache/vllm and can be pointed at a different location through the environment variable.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import os

def main():
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    os.environ["VLLM_RBLN_TP_SIZE"] = "4"
    llm = LLM(
        model=model_id,
        block_size=8192,
        max_model_len=8192,
        max_num_seqs=4,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # ... (the rest of main() is the same as in section 1.2)

if __name__ == "__main__":
    main()
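
The cache location described in the note above can be resolved from Python, for example to inspect or clean up compiled artifacts. This is a small sketch based only on the path rule stated in this guide; the function name is hypothetical, and the layout of entries inside the directory is not assumed.

```python
import os
from pathlib import Path


def compiled_models_dir(env=os.environ):
    """Resolve $VLLM_CACHE_ROOT/compiled_models, defaulting to ~/.cache/vllm."""
    root = env.get("VLLM_CACHE_ROOT") or "~/.cache/vllm"
    return Path(root).expanduser() / "compiled_models"


if __name__ == "__main__":
    target = compiled_models_dir()
    print("Compiled models directory:", target)
    if target.is_dir():
        # List whatever is there without assuming a particular layout.
        for entry in sorted(target.iterdir()):
            print("  ", entry.name)
    else:
        print("  (directory does not exist yet)")
```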
