Llama2-7B (Chatbot)

This tutorial shows how to compile and deploy the Llama2 model on multiple RBLN NPUs. The guide uses the Hugging Face meta-llama/Llama-2-7b-chat-hf model, which performs chatbot tasks.

This tutorial consists of two steps:

How to compile the PyTorch Llama2-7b model for multiple RBLN NPUs and save the compiled model.
How to deploy the compiled model in a runtime-based inference environment.

Note

The multi-NPU feature is supported only on ATOM+ (RBLN-CA12). You can check which type of NPU you are currently using with the rbln-stat command.
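As a minimal sketch, the check above can be scripted from Python. The helper name `run_rbln_stat` is hypothetical, and nothing is assumed about the output format of rbln-stat: the text is simply printed so that you can look for RBLN-CA12 yourself.

```python
import shutil
import subprocess


def run_rbln_stat(command="rbln-stat"):
    """Return the command's text output, or None if it is not on PATH."""
    if shutil.which(command) is None:
        return None
    # check=False: a non-zero exit code is reported through the output, not raised
    result = subprocess.run([command], capture_output=True, text=True, check=False)
    return result.stdout


output = run_rbln_stat()
if output is None:
    print("rbln-stat was not found on PATH")
else:
    print(output)  # inspect the listing manually for RBLN-CA12
```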

Prerequisites

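To compile and deploy the Llama2-7b model efficiently, you need four RBLN NPUs (ATOM+, RBLN-CA12). For details on the number of NPUs required to run a model, see the optimum-rbln documentation. Before you begin, make sure the required Python packages are installed on your system, including optimum-rbln and transformers, both of which are imported by the code in this tutorial.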
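One way to verify the installation before compiling is to look the modules up without importing them fully. This is a sketch only: the module list below is an assumption inferred from the imports used later in this tutorial (`optimum.rbln`, `transformers`) plus `torch`, which `return_tensors="pt"` relies on. It is not the official requirements list.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "optimum") is absent or broken
            found = False
        if not found:
            missing.append(name)
    return missing


# Assumed module list, inferred from this tutorial's imports
REQUIRED = ["torch", "transformers", "optimum.rbln"]
print("missing:", missing_modules(REQUIRED))
```

An empty list means every module was found; any name that is printed still needs to be installed.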

Access to the Hugging Face meta-llama/Llama-2-7b-chat-hf model is restricted. To gain access, accept the license and fill out the access request form on the model card. Once access has been granted, you can log in with the huggingface-cli command as shown below:

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****

Note

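If you want to skip the details and compile and deploy the model on RBLN NPUs right away, you can jump directly to the Summary section of this tutorial. The code in that section includes every step needed to compile and deploy the model, so you can get your own project started quickly.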

Step 1. How to Compile

Compile the Hugging Face Llama2-7b model for four RBLN NPUs.

Multi-NPU Compilation

First, import the RBLNLlamaForCausalLM class from the optimum-rbln library. This class provides the RBLNLlamaForCausalLM.from_pretrained() method, which downloads the Llama2 model from the Hugging Face Hub and compiles it with the RBLN SDK. Pass the following parameters when calling this method:

export : Must be set to True to compile the model with the RBLN SDK. When set to False, a precompiled model is loaded instead.
rbln_batch_size : Batch size
rbln_max_seq_len : Maximum sequence length, in tokens
rbln_tensor_parallel_size : Number of NPUs to use for inference

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4, # compile for 4 RBLN NPUs
)
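To get a feel for what rbln_max_seq_len and rbln_tensor_parallel_size imply, the following back-of-the-envelope estimate may help. It is a rough sketch under loudly stated assumptions: 16-bit values, a nominal 7 billion parameters, the published Llama-2-7B shape (32 layers, hidden size 4096), and an even split across NPUs. It ignores activations and compiler overhead and says nothing about how the RBLN SDK actually allocates memory.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bytes_per_param=2):
    """Memory taken by the weights alone (assumes 16-bit parameters)."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers, hidden_size, max_seq_len, batch_size, bytes_per_value=2):
    """Key/value cache size; the factor 2 covers one key and one value tensor per layer."""
    return 2 * num_layers * hidden_size * max_seq_len * batch_size * bytes_per_value


tensor_parallel_size = 4  # matches rbln_tensor_parallel_size above
weights = weight_bytes(7_000_000_000)  # nominal parameter count (assumption)
cache = kv_cache_bytes(num_layers=32, hidden_size=4096, max_seq_len=4096, batch_size=1)

print(f"weights:  {weights / GIB:.1f} GiB total, "
      f"{weights / tensor_parallel_size / GIB:.1f} GiB per NPU if split evenly")
print(f"KV cache: {cache / GIB:.1f} GiB total at max_seq_len=4096, batch_size=1")
```

The cache term grows linearly with both rbln_max_seq_len and rbln_batch_size, so these two parameters are worth revisiting first if a configuration does not fit.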

Saving the Compiled Model

Once the model is compiled, you can generate text directly with the compiled_model.generate() method. You can also save the compiled model to disk with the compiled_model.save_pretrained() method:

compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

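Step 2. How to Deploy

This section shows how to load the compiled model and generate text.

Loading the Compiled RBLN Model

First, load the compiled RBLN model with the RBLNLlamaForCausalLM.from_pretrained() method. Pass the path of the compiled model as an argument and set the export parameter to False.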

from optimum.rbln import RBLNLlamaForCausalLM

compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-2-7b-chat-hf",
    export=False
)

Preparing the Input Data

Use the LlamaTokenizer provided by the transformers library to tokenize and preprocess the input text. Set the padding_side parameter to left so that batched text generation works correctly.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
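Why left padding matters for batched generation can be shown with plain lists. The sketch below is illustrative only (the helper `pad_batch` is hypothetical and is not part of transformers): with left padding, every row ends in a real token, so newly generated tokens are appended directly after the prompt instead of after padding.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        pad_mask = [0] * len(pad)   # 0 marks padding positions
        real_mask = [1] * len(seq)  # 1 marks real tokens
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)   # the last column holds only real tokens
print(mask)
```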

Generating Text

Use the generate() method to generate text.

output_sequence = compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=4096,
)

tokenizer.batch_decode(output_sequence, skip_special_tokens=True, clean_up_tokenization_spaces=True)

The generated text will be similar to the following.

Hello! I'm just an AI, I'm not conscious in the way that humans are. However, I'm here to help you with any questions or tasks you may have. I'm a large language model trained on a wide range of texts and can understand and respond to natural language inputs. I'm not capable of experiencing emotions or consciousness like a human, but I'm here to assist you in any way I can. Is there something specific you'd like to talk about or ask?
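Note that max_length in generate() counts the prompt tokens as well as the newly generated ones, so the room left for the answer shrinks as the prompt grows. A minimal sketch of that arithmetic follows; the helper name `generation_budget` is hypothetical, and raising on an over-long prompt is a design choice of this sketch, not the library's behavior.

```python
def generation_budget(prompt_len, max_length=4096):
    """Number of new tokens that still fit once the prompt is accounted for."""
    if prompt_len >= max_length:
        raise ValueError(
            f"prompt of {prompt_len} tokens leaves no room within max_length={max_length}"
        )
    return max_length - prompt_len


print(generation_budget(prompt_len=30))
```

With the inputs prepared above, the prompt length would be `inputs.input_ids.shape[1]`.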

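Streaming Text

During text generation, you can use BatchTextIteratorStreamer to retrieve newly generated text in real time at each decoding step. This class extends TextIteratorStreamer from the transformers library to support batched text generation. In addition to the parameters accepted by TextIteratorStreamer, you must also pass the batch_size keyword argument.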

from threading import Thread
from optimum.rbln import BatchTextIteratorStreamer

# Instantiate BatchTextIteratorStreamer
batch_size = 1
streamer = BatchTextIteratorStreamer(
    tokenizer=tokenizer,        # tokenizer used to decode generated tokens into text
    batch_size=batch_size,      # batch size
    skip_special_tokens=True,   # omit special tokens
    skip_prompt=True,           # omit the prompt
)

# Generate text
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    do_sample=False,
    max_length=4096,
)
thread = Thread(target=compiled_model.generate, kwargs=generation_kwargs)
thread.start()

# Fetch text as soon as it is generated and decoded
for new_text in streamer:
    for i in range(batch_size):
        print(new_text[i], end="", flush=True)
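The thread-plus-iterator pattern above can be understood with a toy model built only from the standard library. This is a sketch of the general producer/consumer idea, not the implementation of BatchTextIteratorStreamer; `ToyStreamer` and `fake_generate` are hypothetical names.

```python
import queue
from threading import Thread


class ToyStreamer:
    """Hands text from a producer thread to a consuming for-loop through a queue."""

    _END = object()  # sentinel that marks the end of generation

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer delivers something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer, pieces):
    """Stand-in for generate(): pushes each piece, then signals completion."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


streamer = ToyStreamer()
worker = Thread(target=fake_generate, args=(streamer, ["Hello", ", ", "world"]))
worker.start()
received = [chunk for chunk in streamer]  # consumes while the worker produces
worker.join()
print("".join(received))
```

Running generation in a separate thread is what lets the main thread print text while decoding is still in progress; calling generate() directly would block until the whole sequence is finished.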

Summary

The code for compiling the Hugging Face Llama2-7B model is as follows:

from optimum.rbln import RBLNLlamaForCausalLM

# Compile and export the model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=4096, # default max_position_embeddings
    rbln_tensor_parallel_size=4, # using multiple NPUs
)
# Save the compiled model to disk (the same directory that the loading code reads from)
compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

The code for generating text with the compiled Llama2-7b model is as follows:

from threading import Thread
from transformers import LlamaTokenizer
from optimum.rbln import BatchTextIteratorStreamer, RBLNLlamaForCausalLM

# Load the compiled model
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-2-7b-chat-hf",
    export=False,
)

# Prepare the input
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Prepare the streamer for batch streaming
batch_size = 1
streamer = BatchTextIteratorStreamer(
    tokenizer=tokenizer,
    batch_size=batch_size,
    skip_special_tokens=True,
    skip_prompt=True,
)

# Generate text
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    do_sample=False,
    max_length=4096,
)
thread = Thread(target=compiled_model.generate, kwargs=generation_kwargs)
thread.start()

# Fetch text as soon as it is generated and decoded
for new_text in streamer:
    for i in range(batch_size):
        print(new_text[i], end="", flush=True)