Llama3 8B¶
Overview¶
This tutorial explains how to run a large language model on vLLM using multiple RBLN NPUs. For this guide, we use the meta-llama/Meta-Llama-3-8B-Instruct model.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - torch
    - transformers
    - numpy
    - RBLN Compiler (rebel-compiler)
    - optimum-rbln
    - huggingface_hub[cli]
    - vllm-rbln (automatically installs vLLM)
- Installation Command:
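The packages can be installed with pip. The command below is a sketch: the public dependencies come from PyPI, while the RBLN packages are served from RBLN's own package index, so the index URL shown here is an assumption; use the one provided through your RBLN Portal account.

```bash
# Public dependencies from PyPI
pip3 install torch transformers numpy "huggingface_hub[cli]"

# RBLN packages (index URL is illustrative; use the one from your RBLN Portal account)
pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler optimum-rbln vllm-rbln
```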
Note
Please note that rebel-compiler requires an RBLN Portal account.
Note
Please note that the meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
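The command prompts for a Hugging Face access token:

```bash
huggingface-cli login
```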
Execution¶
Model Compilation¶
To begin, import the RBLNLlamaForCausalLM class from optimum-rbln. This class's from_pretrained() method downloads the Llama 3 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:
- export: Must be True to compile the model.
- rbln_batch_size: Defines the batch size for compilation.
- rbln_max_seq_len: Defines the maximum sequence length.
- rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference.
After compilation, save the model artifacts to disk using save_pretrained(), creating a directory (e.g., rbln-Llama-3-8B-Instruct) with the compiled model.
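A minimal sketch of the compilation step, assuming illustrative values of batch size 4, a maximum sequence length of 8192 (the Llama 3 context window), and 4 NPUs:

```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download the model from the HuggingFace Hub and compile it with the
# RBLN Compiler. The rbln_* values are illustrative; tune them for
# your workload and hardware.
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                  # must be True to trigger compilation
    rbln_batch_size=4,            # batch size for compilation
    rbln_max_seq_len=8192,        # maximum sequence length
    rbln_tensor_parallel_size=4,  # number of NPUs used for inference
)

# Save the compiled artifacts to disk for later serving.
model.save_pretrained("rbln-Llama-3-8B-Instruct")
```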
Note
Select the batch size based on the model size and NPU specifications. In addition, vllm-rbln supports Dynamic Batching to improve throughput and resource utilization; see Dynamic Batching for details.
Inference using vLLM¶
You can serve the compiled model with vLLM. The example below shows how to set up the vLLM engine with the compiled model and run inference.
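Below is a minimal sketch using the standard vLLM Python API. The engine arguments shown are assumptions chosen to mirror the compile-time settings above; check the vllm-rbln documentation for any RBLN-specific engine options.

```python
from vllm import LLM, SamplingParams

# Load the compiled model directory produced by save_pretrained().
# max_num_seqs and max_model_len are illustrative and should match
# the rbln_batch_size and rbln_max_seq_len used at compile time.
llm = LLM(
    model="rbln-Llama-3-8B-Instruct",
    max_num_seqs=4,
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["What is the capital of France?"]
outputs = llm.generate(prompts, sampling_params)

# Print the generated completion for each prompt.
for output in outputs:
    print(output.outputs[0].text)
```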
Example Output:

Running the script prints the generated completion for each prompt; the exact text varies with the model version and sampling parameters.