Llama3-8B
Overview
This tutorial explains how to run Llama3-8B on vLLM using multiple RBLN NPUs. This guide uses the meta-llama/Meta-Llama-3-8B-Instruct model.
Setup & Installation
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - torch
    - transformers
    - numpy
    - RBLN Compiler
    - optimum-rbln
    - huggingface_hub[cli]
    - vllm-rbln (automatically installs vLLM)
- Installation Command:
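A typical installation sequence is sketched below, covering the packages listed above. The RBLN package index URL is an assumption here; use the index and credentials provided with your RBLN Portal account, and install the remaining packages from PyPI.

```bash
# Install the RBLN Compiler (rebel-compiler) from the RBLN package index.
# The index URL is an assumption; use the one provided with your RBLN Portal account.
pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler

# Install optimum-rbln, vllm-rbln (which pulls in vLLM), and the Hugging Face CLI.
pip3 install optimum-rbln vllm-rbln "huggingface_hub[cli]"
```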
Note
The rebel-compiler package requires an RBLN Portal account.
Note
The meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
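```bash
# Log in with a Hugging Face access token for the account that was granted access.
huggingface-cli login
```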
Execution
Model Compilation
To begin, import the RBLNLlamaForCausalLM
class from optimum-rbln
. This class's from_pretrained()
method downloads the Llama 3
model from the HuggingFace Hub and compiles it using the RBLN Compiler
. When exporting the model, specify the following parameters:
- export: Must be True to compile the model.
- rbln_batch_size: Defines the batch size for compilation.
- rbln_max_seq_len: Defines the maximum sequence length.
- rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference.
After compilation, save the model artifacts to disk using save_pretrained()
, creating a directory (e.g., rbln-Llama-3-8B-Instruct
) with the compiled model.
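A minimal compilation sketch is shown below. The parameter values (batch size 4, maximum sequence length 8192, 4 NPUs) are illustrative assumptions; adjust them to your workload and hardware.

```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download the model from the HuggingFace Hub and compile it with the RBLN Compiler.
# The rbln_* values below are illustrative assumptions.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                  # must be True to compile the model
    rbln_batch_size=4,            # batch size used for compilation
    rbln_max_seq_len=8192,        # maximum sequence length
    rbln_tensor_parallel_size=4,  # number of RBLN NPUs used for inference
)

# Save the compiled artifacts to disk.
compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")
```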
Note
Select the batch size based on the model size and NPU specifications. In addition, vllm-rbln supports Dynamic Batching to ensure optimal throughput and resource utilization. See Dynamic Batching for details.
Inference using vLLM
You can use the compiled model with vLLM
. The example below shows how to set up the vLLM engine
using a compiled model and run inference.
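The following is a minimal sketch using the vLLM Python API, assuming the compiled model was saved to rbln-Llama-3-8B-Instruct with the compile-time values used above and that vllm-rbln routes execution to the RBLN backend. The engine arguments shown (max_num_seqs, max_model_len, block_size) are assumptions chosen to mirror the compile-time settings.

```python
from vllm import LLM, SamplingParams

# Mirror the compile-time settings (illustrative assumptions).
max_seq_len = 8192
batch_size = 4

llm = LLM(
    model="rbln-Llama-3-8B-Instruct",  # directory containing the compiled model
    max_num_seqs=batch_size,
    max_model_len=max_seq_len,
    block_size=max_seq_len,  # for regular Paged Attention, match max_seq_len (see Note below)
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["What is the capital of France?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```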
Note
When using regular Paged Attention (as opposed to Flash Attention), block_size should match max_seq_len.
Example Output: