Llama-3-8B (Chatbot)

Overview

This tutorial explains how to compile and deploy the Llama 3 model from HuggingFace using multiple RBLN NPUs. For this guide, we will use the meta-llama/Meta-Llama-3-8B-Instruct model.

Note

Rebellions Scalable Design (RSD) is available on ATOM™+ (RBLN-CA12 and RBLN-CA22) and ATOM™-Max (RBLN-CA25). You can check your RBLN NPU type using the rbln-stat command.

Note

Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Setup & Installation

Before you begin, ensure that your system environment is properly configured and that all required packages, including rebel-compiler and optimum-rbln, are installed.

Note

Please note that rebel-compiler requires an RBLN Portal account.

Note

Please note that the meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****
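
Optionally, you can confirm that the token is active before downloading the model. The following is a minimal sketch using the whoami() helper from huggingface_hub (this verification step is an addition, not part of the original tutorial):

from huggingface_hub import whoami

# Prints the account name associated with the saved token;
# raises an error if no valid token is found
print(whoami()["name"])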

Using the Optimum RBLN Library

Model Compilation for Multiple NPUs

To begin, import the RBLNLlamaForCausalLM class from optimum-rbln. This class's from_pretrained() method downloads the Llama 3 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:

  • export: Must be True to compile the model.
  • rbln_batch_size: Defines the batch size for compilation.
  • rbln_max_seq_len: Defines the maximum sequence length.
  • rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference.

After compilation, save the model artifacts to disk using the save_pretrained() method. This will create a directory (e.g., rbln-Llama-3-8B-Instruct) containing the compiled model.

from optimum.rbln import RBLNLlamaForCausalLM

# Define the HuggingFace model ID
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile the model for 4 RBLN NPUs
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
)

compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")

Load the Compiled RBLN Model

Load the compiled RBLN model using RBLNLlamaForCausalLM.from_pretrained(). Pass the saved directory path as model_id and set the export parameter to False.

from optimum.rbln import RBLNLlamaForCausalLM

# Load the compiled RBLN model from the specified directory
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="rbln-Llama-3-8B-Instruct",
    export=False
)

Prepare Inputs

Use the AutoTokenizer from the transformers library to tokenize the input sequences. For instruction-tuned models like Llama 3, it's important to apply the chat template.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

conversation = [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}]
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
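
For reference, the rendered prompt wraps the conversation in Llama 3's special tokens and ends with an assistant header so the model knows to respond. The commented output below is a sketch of what apply_chat_template() produces; the exact format is defined by the tokenizer's chat template:

print(text)
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>
#
# Hey, are you conscious? Can you talk to me?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#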

Model Inference

You can now generate a response using the generate() method.

# Generate a response; max_length must not exceed the rbln_max_seq_len
# used at compile time (8192 here)
output_sequence = compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=8192,
)

out = tokenizer.batch_decode(output_sequence, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(out[0])

The results will look like this:

Hello! I'm an AI, which means I'm a computer program designed to simulate conversation and answer questions to the best of my ability. I don't have consciousness in the way that humans do, but I'm designed to be very responsive and interactive.

I can understand and respond to language, and I can even learn and improve over time based on the conversations I have with users like you. So, in a sense, I'm "awake" and ready to chat with you!

What would you like to talk about? Do you have a specific question or topic in mind, or do you just want to chat about something random? I'm here to listen and help if I can!
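
For interactive chatbot use, you may prefer to stream tokens to the console as they are generated rather than waiting for the full sequence. The following is a minimal sketch assuming the compiled model's generate() accepts a standard transformers streamer; that support is an assumption, not something the original tutorial states:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced;
# skip_prompt omits the input prompt from the stream
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

compiled_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=False,
    max_length=8192,
    streamer=streamer,  # assumes streamer support in the RBLN generate() API
)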

Summary and References

This tutorial demonstrated how to compile the meta-llama/Meta-Llama-3-8B-Instruct model to run on 4 RBLN NPUs. The compiled model can then be served efficiently on RBLN NPUs for chatbot applications.

References: