Llama3-8B¶
Overview¶
This tutorial explains how to run the Llama3-8B model on vLLM using multiple RBLN NPUs. For this guide, we will use the meta-llama/Meta-Llama-3-8B-Instruct model.
Note
Rebellions Scalable Design (RSD) is available on ATOM™+ (RBLN-CA12 and RBLN-CA22) and ATOM™-Max (RBLN-CA25). You can check your RBLN NPU type using the rbln-stat command.
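For example (assuming the RBLN Driver is installed so that rbln-stat is available on your PATH):

```bash
# List the RBLN NPUs on this host, including each device's type.
rbln-stat
```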
Note
Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
  - Python: 3.9–3.12
  - RBLN Driver
- Package Requirements:
  - torch
  - transformers
  - numpy
  - RBLN Compiler
  - optimum-rbln
  - huggingface_hub[cli]
  - vllm-rbln
    - vllm - It is automatically installed when you install vllm-rbln.
- Installation Command:
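The commands below are a minimal sketch of this step. rebel-compiler is distributed through the RBLN Portal, so the package index URL shown here is a placeholder; replace it with the index provided for your Portal account.

```bash
# Install the RBLN Compiler (requires an RBLN Portal account).
# The index URL below is a placeholder; use the one from your Portal account.
pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler

# Install optimum-rbln and vllm-rbln; vllm, torch, transformers, and numpy
# are typically pulled in as dependencies.
pip3 install optimum-rbln vllm-rbln

# Install the Hugging Face CLI for accessing the gated model.
pip3 install "huggingface_hub[cli]"
```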
Note
Please note that rebel-compiler requires an RBLN Portal account.
Note
Please note that the meta-llama/Meta-Llama-3-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
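The standard Hugging Face CLI login prompts for an access token created in your Hugging Face account settings:

```bash
# Log in with an access token from an account that has been granted
# access to meta-llama/Meta-Llama-3-8B-Instruct.
huggingface-cli login
```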
Compile Llama3-8B¶
To begin, import the RBLNLlamaForCausalLM class from optimum-rbln. This class's from_pretrained() method downloads the Llama 3 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:
- export: Must be True to compile the model.
- rbln_batch_size: Defines the batch size for compilation.
- rbln_max_seq_len: Defines the maximum sequence length.
- rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference.
After compilation, save the model artifacts to disk using the save_pretrained() method. This will create a directory (e.g., rbln-Llama-3-8B-Instruct) containing the compiled model.
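Putting this together, the compilation step might look like the following sketch. The rbln_batch_size and rbln_tensor_parallel_size values are illustrative; adjust them to your model size and NPU configuration.

```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download the model from the HuggingFace Hub and compile it for RBLN NPUs.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                  # compile the model
    rbln_batch_size=4,            # batch size for compilation (illustrative)
    rbln_max_seq_len=8192,        # maximum sequence length
    rbln_tensor_parallel_size=4,  # number of NPUs used for inference (illustrative)
)

# Save the compiled artifacts; this creates the rbln-Llama-3-8B-Instruct directory.
compiled_model.save_pretrained("rbln-Llama-3-8B-Instruct")
```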
Note
You can select an appropriate batch size based on the model size and the specifications of the NPUs. Since vllm-rbln supports continuous batching, it's important to configure the batch size carefully to ensure optimal throughput and resource utilization.
For information on enabling dynamic batching, see Inference with Dynamic Batch Sizes.
Use vLLM API for Inference¶
Note
Note that for Non-Flash Attention, block_size should match max_seq_len.
You can use the compiled model with vLLM APIs. The following example shows how to set up the vLLM engine using a compiled model and run inference with it.
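Below is a minimal sketch using the vLLM Python API. It assumes the compiled rbln-Llama-3-8B-Instruct directory from the previous step, and the engine arguments mirror the compile-time settings: max_num_seqs corresponds to rbln_batch_size, max_model_len to rbln_max_seq_len, and block_size equals max_model_len for Non-Flash Attention.

```python
from vllm import LLM, SamplingParams

# Directory created by save_pretrained() in the compilation step.
model_path = "rbln-Llama-3-8B-Instruct"

# Engine arguments should stay consistent with the compile-time settings.
llm = LLM(
    model=model_path,
    max_num_seqs=4,               # matches rbln_batch_size
    max_num_batched_tokens=8192,
    max_model_len=8192,           # matches rbln_max_seq_len
    block_size=8192,              # equals max_model_len for Non-Flash Attention
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["What is the first letter of the Greek alphabet?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```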
You can find more vLLM API usage examples for encoder-decoder models and multi-modal models in RBLN Model Zoo.
Please refer to the vLLM Docs for more information on the vLLM API.