Llama-3-8B (Chatbot)

Overview

This tutorial explains how to compile and deploy the Llama 3 model from HuggingFace across multiple RBLN NPUs. This guide uses the `meta-llama/Meta-Llama-3-8B-Instruct` model.
Note
Rebellions Scalable Design (RSD) is available on ATOM™+ (`RBLN-CA12` and `RBLN-CA22`) and ATOM™-Max (`RBLN-CA25`). You can check your RBLN NPU type with the `rbln-stat` command.
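For example (the exact output format may vary across SDK versions):

```bash
# Lists the installed RBLN devices; the device name column shows
# the NPU type (e.g., RBLN-CA12).
rbln-stat
```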
Note
Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - optimum-rbln
    - rebel-compiler
- Installation Command: see the sketch below.
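A minimal installation sketch, assuming packages are installed with pip. Note that `rebel-compiler` is distributed through the RBLN Portal rather than the public index, so follow the portal's instructions for its exact package source:

```bash
# Install the HuggingFace integration for RBLN NPUs.
pip install optimum-rbln

# rebel-compiler is distributed via the RBLN Portal and requires an
# account; install it following the instructions provided there.
```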
Note
Please note that `rebel-compiler` requires an RBLN Portal account.
Note
Please note that the `meta-llama/Meta-Llama-3-8B-Instruct` model on HuggingFace has restricted access. Once access is granted, you can log in using the `huggingface-cli` command as shown below:
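```bash
# Log in with a HuggingFace access token that has been granted access
# to the meta-llama/Meta-Llama-3-8B-Instruct repository.
huggingface-cli login
```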
Using the Optimum RBLN Library
Model Compilation for Multiple NPUs
To begin, import the `RBLNLlamaForCausalLM` class from `optimum-rbln`. This class's `from_pretrained()` method downloads the Llama 3 model from the HuggingFace Hub and compiles it with the RBLN Compiler. When exporting the model, specify the following parameters:

- `export`: Must be `True` to compile the model.
- `rbln_batch_size`: Defines the batch size for compilation.
- `rbln_max_seq_len`: Defines the maximum sequence length.
- `rbln_tensor_parallel_size`: Defines the number of NPUs used for inference.
After compilation, save the model artifacts to disk using the `save_pretrained()` method. This creates a directory (e.g., `rbln-Llama-3-8B-Instruct`) containing the compiled model.
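A minimal compilation sketch is shown below. The batch size and sequence length are illustrative; adjust them to your workload and hardware (this tutorial targets 4 NPUs):

```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download the model from the HuggingFace Hub and compile it for RBLN NPUs.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                  # compile the model
    rbln_batch_size=1,            # batch size for compilation (illustrative)
    rbln_max_seq_len=8192,        # maximum sequence length (illustrative)
    rbln_tensor_parallel_size=4,  # number of NPUs used for inference
)

# Save the compiled artifacts to disk.
compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")
```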
Load the Compiled RBLN Model
Load the compiled RBLN model with `RBLNLlamaForCausalLM.from_pretrained()`. Pass the saved directory path as `model_id` and set the `export` parameter to `False`.
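A short sketch, reusing the directory name from the compilation step:

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Load the compiled model artifacts from the saved directory.
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-3-8B-Instruct",  # directory created by save_pretrained()
    export=False,                         # load the compiled model; do not recompile
)
```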
Prepare Inputs
Use the `AutoTokenizer` from the `transformers` library to tokenize the input sequences. For instruction-tuned models like Llama 3, it is important to apply the chat template so the prompt matches the format the model was trained on.
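A sketch of input preparation; the conversation contents are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# An example conversation (contents are illustrative).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# apply_chat_template() wraps the messages in Llama 3's chat format and,
# with add_generation_prompt=True, appends the assistant header so the
# model continues the conversation as the assistant.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
```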
Model Inference
You can now generate a response using the `generate()` method.
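A minimal sketch, assuming the HuggingFace-style `generate()` interface that `optimum-rbln` models expose; the generation settings are illustrative:

```python
# Generate a continuation of the chat prompt.
output = model.generate(
    input_ids,
    max_new_tokens=256,
)

# Decode only the newly generated tokens (skip the prompt).
response = tokenizer.decode(
    output[0][input_ids.shape[-1]:],
    skip_special_tokens=True,
)
print(response)
```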
The exact output varies with the model version and generation settings, but the decoded text should be a coherent assistant reply to the prompt.
Summary and References
This tutorial demonstrated how to compile the `meta-llama/Meta-Llama-3-8B-Instruct` model to run on 4 RBLN NPUs, and how to load the compiled model for efficient chatbot-style inference.