
Llama2-7B (Chatbot)

This tutorial shows how to compile and deploy Llama2 models from HuggingFace using multiple RBLN NPUs. For this guide, we will use the HuggingFace meta-llama/Llama-2-7b-chat-hf model, which serves as a chatbot.

The tutorial consists of two main steps:

  1. How to compile the PyTorch Llama2-7b model for multiple RBLN NPUs and save the compiled model.
  2. How to deploy the compiled model in a runtime-based inference environment.

Note

Rebellions Scalable Design (RSD) is only available on ATOM+ (RBLN-CA12). You can check the type of your current RBLN NPU using the rbln-stat command.

Note

Llama2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Prerequisites

To compile and deploy the Llama2-7b model efficiently, 4 RBLN NPUs (ATOM+, RBLN-CA12) are required. You can find details on the required number of NPUs on the optimum-rbln page. Before we proceed, please make sure the required Python packages, including optimum-rbln and transformers (used in the code below), are installed on your system.

Please note that the HuggingFace meta-llama/Llama-2-7b-chat-hf model is gated. To obtain access, please acknowledge the license and complete the form on the model card. Once you have been granted access, you can authenticate with the following huggingface-cli command:

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****
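
Alternatively, if you prefer to authenticate from Python rather than the CLI, you can use the login() helper from the huggingface_hub package. The token value below is a hypothetical placeholder; use a token generated from your own HuggingFace account.

from huggingface_hub import login

# Authenticate programmatically; replace the placeholder with your own token
# generated at https://huggingface.co/settings/tokens .
login(token="hf_xxx")  # hypothetical placeholder token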

Note

If you want to skip the details and quickly compile and deploy the model on RBLN NPUs, jump directly to the summary. It provides the complete code for all the necessary steps, serving as a quick starting point for your own project.

Step 1. How to compile

This step demonstrates how to compile the Llama2-7b model to run on 4 RBLN NPUs.

Compile Llama2-7b for multiple NPUs

To begin, we import the RBLNLlamaForCausalLM class from optimum-rbln. This class provides the RBLNLlamaForCausalLM.from_pretrained() method, which downloads the Llama2 model from the HuggingFace Hub and compiles it using the RBLN SDK. When exporting the model with this method, specify the following parameters:

  • export: Set to True if you want to compile the model using RBLN SDK.
  • rbln_batch_size: Defines the batch size.
  • rbln_max_seq_len: Defines the maximum sequence length.
  • rbln_tensor_parallel_size: Defines how many NPUs will be used for inference.
from optimum.rbln import RBLNLlamaForCausalLM

# Export the HuggingFace PyTorch Llama2 model to an RBLN compiled model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=4096,
    rbln_tensor_parallel_size=4, # target 4 RBLN NPUs
)

Save the compiled model

The compiled model can be saved to disk using the compiled_model.save_pretrained() method.

compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

Step 2. How to deploy

This section demonstrates how to load the compiled model and generate texts.

Load the compiled RBLN model

First, you can load the compiled RBLN model using the RBLNLlamaForCausalLM.from_pretrained() method. Pass the saved directory path as an input argument and set the export parameter to False.

from optimum.rbln import RBLNLlamaForCausalLM

# Load the compiled RBLN model from the saved directory
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-2-7b-chat-hf",
    export=False
)

Prepare the inputs

We use the LlamaTokenizer from the transformers package to tokenize the input sequences. Set the padding_side argument to "left" for proper batched text generation (a batched-input sketch follows the code below).

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
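
To see why left padding matters, here is a minimal sketch (not part of the original example) that tokenizes a batch of two conversations at once. The second conversation is a hypothetical prompt, and actually running a batch of two would also require the model to have been compiled with rbln_batch_size=2 rather than 1.

conversations = [
    [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}],
    [{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}],  # hypothetical prompt
]
texts = [
    tokenizer.apply_chat_template(c, add_generation_prompt=True, tokenize=False)
    for c in conversations
]
# With padding_side="left", shorter prompts are padded on the left so that the
# last position of every row holds a real token, which generation relies on.
batch_inputs = tokenizer(texts, return_tensors="pt", padding=True)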

Generate the texts

You can generate text using the generate() method.

output_sequence = compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=4096,
)

tokenizer.batch_decode(output_sequence, skip_special_tokens=True, clean_up_tokenization_spaces=True)

Here is the generated text:

Hello! I'm just an AI, I'm not conscious in the way that humans are. However, I'm here to help you with any questions or tasks you may have. I'm a large language model trained on a wide range of texts and can understand and respond to natural language inputs. I'm not capable of experiencing emotions or consciousness like a human, but I'm here to assist you in any way I can. Is there something specific you'd like to talk about or ask?
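
The output of generate() contains the prompt tokens followed by the newly generated tokens. If you only want the generated portion, a minimal sketch (assuming the standard HuggingFace output layout) is to slice off the prompt length before decoding:

# Decode only the newly generated tokens by dropping the echoed prompt tokens.
prompt_len = inputs.input_ids.shape[1]
generated_texts = tokenizer.batch_decode(
    output_sequence[:, prompt_len:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
print(generated_texts[0])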

Streaming the texts

At each decoding step in the generation process, you can retrieve the individual words as they are formed using the BatchTextIteratorStreamer. It extends TextIteratorStreamer from the transformers library to support streaming batched text generation, and takes an additional batch_size argument on top of the arguments used by TextIteratorStreamer.

from threading import Thread
from optimum.rbln import BatchTextIteratorStreamer

# Initialize an instance of BatchTextIteratorStreamer.
batch_size = 1
streamer = BatchTextIteratorStreamer(
    tokenizer=tokenizer,          # The tokenizer used to convert text into tokens.
    batch_size=batch_size,        # The batch size for processing; here it processes one text at a time.
    skip_special_tokens=True,     # Skips special tokens (e.g., [PAD], [CLS]) in the output.
    skip_prompt=True,             # Skips any prompt text, only processing and outputting the results.
)

# Generate the texts
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    do_sample=False,
    max_length=4096,
)
thread = Thread(target=compiled_model.generate, kwargs=generation_kwargs)
thread.start()

# Retrieve the words as they are formed
for new_text in streamer:
    for i in range(batch_size):
        print(new_text[i], end="", flush=True)
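
Since generation runs in a background thread, you may want to wait for it to finish once the streamer is exhausted, for example:

# Block until the background generation thread has completed.
thread.join()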

Summary

The complete code for Llama2-7B model compilation:

from optimum.rbln import RBLNLlamaForCausalLM

# Compile and export the model
model_id = "meta-llama/Llama-2-7b-chat-hf"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=4096, # default: the model's max_position_embeddings
    rbln_tensor_parallel_size=4, # using multiple NPUs
)
# Save the compiled model to disk
compiled_model.save_pretrained("rbln-Llama-2-7b-chat-hf")

Below is the complete code for deploying a compiled Llama2-7B model:

from threading import Thread
from transformers import LlamaTokenizer
from optimum.rbln import BatchTextIteratorStreamer, RBLNLlamaForCausalLM

# Load the compiled model
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-2-7b-chat-hf",
    export=False,
)

# Prepare the inputs
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Prepare the streamer for streaming
batch_size = 1
streamer = BatchTextIteratorStreamer(
    tokenizer=tokenizer,
    batch_size=batch_size,
    skip_special_tokens=True,
    skip_prompt=True,
)

# Generate the texts
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    do_sample=False,
    max_length=4096,
)
thread = Thread(target=compiled_model.generate, kwargs=generation_kwargs)
thread.start()

# Fetch the words as they are formed
for new_text in streamer:
    for i in range(batch_size):
        print(new_text[i], end="", flush=True)