Llama3-8B (Chatbot)
This tutorial introduces how to compile and deploy the Llama3 models from HuggingFace using multiple RBLN NPUs. For this guide, we will use the meta-llama/Meta-Llama-3-8B-Instruct model available on HuggingFace.
The tutorial consists of two main steps:
- How to compile the PyTorch Llama3-8B model targeting multiple RBLN NPUs and save the compiled model.
- How to deploy the compiled model in a runtime-based inference environment.
Note
Rebellions Scalable Design (RSD) is only available on ATOM+ (RBLN-CA12). You can check the type of your current RBLN NPU with the rbln-stat command, as shown below.
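For example, running the command on a host with RBLN devices lists each device and its type; the exact output format depends on your RBLN SDK version:
| $ rbln-stat
|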
Note
Llama3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Prerequisites
To efficiently compile and deploy the Llama3-8B model, you need 4 RBLN NPUs (ATOM+, RBLN-CA12). For more details about the required number of NPUs and multi-NPU capabilities, please refer to the optimum-rbln page.
Before we proceed, please ensure the following Python packages are installed on your system:
- optimum-rbln
- transformers
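If either package is missing, both can typically be installed with pip. The package names below assume the publicly released distributions; your environment may instead require Rebellions' own package index:
| $ pip3 install optimum-rbln transformers
|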
Please note that the meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in with the huggingface-cli command as shown below:
| $ huggingface-cli login
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****
|
Note
If you want to skip the details and quickly compile and deploy the model on RBLN NPUs, jump directly to the Summary section of this tutorial. It provides the complete code with all the necessary steps to compile and deploy the model, as a quick starting point for your own project.
Step 1. How to compile
This step demonstrates how to compile the Llama3-8B model to run on 4 RBLN NPUs.
Compile Llama3-8B for multiple NPUs
To begin, we import the RBLNLlamaForCausalLM class from optimum-rbln. This class provides the RBLNLlamaForCausalLM.from_pretrained() method, which downloads the Llama3 models from the HuggingFace Hub and compiles them with the RBLN SDK. When exporting the model with this method, specify the following parameters:
- export: Set to True to compile the model with the RBLN SDK.
- rbln_batch_size: Defines the batch size.
- rbln_max_seq_len: Defines the maximum sequence length.
- rbln_tensor_parallel_size: Defines how many NPUs are used for inference.
| from optimum.rbln import RBLNLlamaForCausalLM

# Export the HuggingFace PyTorch Llama3 model to an RBLN compiled model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,  # target 4 RBLN NPUs
)
|
Save the compiled model
The compiled model can be saved to disk using the compiled_model.save_pretrained() method.
| compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")
|
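As a quick sanity check, you can list the files written to the output directory. The exact file names depend on your optimum-rbln version, so treat this as a sketch rather than a fixed layout:
| from pathlib import Path

# List the artifacts written by save_pretrained() (file names vary by optimum-rbln version)
for path in sorted(Path("rbln-Llama-3-8B-Instruct").iterdir()):
    print(path.name)
|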
Step 2. How to deploy
This section demonstrates how to load the compiled model and generate text.
Load the compiled RBLN model
First, load the compiled RBLN model using the RBLNLlamaForCausalLM.from_pretrained() method. Pass the saved directory path as the model_id argument and set the export parameter to False.
| from optimum.rbln import RBLNLlamaForCausalLM
# Load the compiled RBLN model from the saved directory
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-3-8B-Instruct",
    export=False,
)
|
We need to employ AutoTokenizer from the transformers package to tokenize the input sequences.
| from transformers import AutoTokenizer

# Load the tokenizer and build the chat-formatted prompt from the user message
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
|
Generate the text
You can generate text using the generate() method.
| output_sequence = compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=8192,
)
out = tokenizer.batch_decode(output_sequence, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(out[0])
|
Here is the generated text:
| Hello! I'm an AI, which means I'm a computer program designed to simulate conversation and answer questions to the best of my ability. I don't have consciousness in the way that humans do, but I'm designed to be very responsive and interactive.
I can understand and respond to language, and I can even learn and improve over time based on the conversations I have with users like you. So, in a sense, I'm "awake" and ready to chat with you!
What would you like to talk about? Do you have a specific question or topic in mind, or do you just want to chat about something random? I'm here to listen and help if I can!
|
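For a chatbot-style experience you may want tokens printed as they are generated instead of after the full sequence completes. The sketch below uses TextStreamer from transformers and assumes the compiled model's generate() accepts the standard streamer argument; support may vary across optimum-rbln versions:
| from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated
# (assumes the standard HuggingFace `streamer` argument is supported by the RBLN model)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=8192,
    streamer=streamer,
)
|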
Summary
The complete code for Llama3-8B model compilation:
| from optimum.rbln import RBLNLlamaForCausalLM

# Compile and export the model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_save_dir = "rbln-Llama-3-8B-Instruct"
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=8192,  # the model's default max_position_embeddings
    rbln_tensor_parallel_size=4,  # use multiple NPUs
)

# Save the compiled model to disk
compiled_model.save_pretrained(model_save_dir)
|
Below is the complete code for deploying a compiled Llama3-8B model:
| from transformers import AutoTokenizer
from optimum.rbln import RBLNLlamaForCausalLM

# Load the compiled model
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-3-8B-Instruct",
    export=False,
)

# Prepare the inputs
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Generate sequences of token ids
output_sequence = compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=8192,
)

# Decode the generated tokens and print the text
out = tokenizer.batch_decode(output_sequence, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(out[0])
|