# Llama3.1-8B with Flash Attention

## Overview
This tutorial shows how to run a model with Flash Attention on vLLM, using the meta-llama/Llama-3.1-8B-Instruct model.

Flash Attention improves memory efficiency and throughput, enabling better performance for models that handle long contexts. Flash Attention mode is activated by adding the `rbln_kvcache_partition_len` parameter during compilation.
## Setup & Installation
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:

- System Requirements:
    - Python: 3.9–3.12
    - RBLN Driver
- Package Requirements:
    - torch
    - transformers
    - numpy
    - RBLN Compiler
    - optimum-rbln
    - huggingface_hub[cli]
    - vllm-rbln (automatically installs vLLM)
- Installation Command (see the sketch below this list):
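The original installation command is not reproduced here; the following is a minimal sketch, assuming the publicly available `optimum-rbln`, `vllm-rbln`, and `huggingface_hub` packages. The RBLN Compiler (`rebel-compiler`) is distributed through the RBLN Portal, so refer to the instructions provided with your Portal account for that package.

```bash
# Sketch only: package sources and versions may differ in your environment.
pip install optimum-rbln vllm-rbln "huggingface_hub[cli]"

# rebel-compiler is distributed via the RBLN Portal; install it following
# the instructions provided with your RBLN Portal account.
```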
Note
Please note that `rebel-compiler` requires an RBLN Portal account.
Note
Please note that the meta-llama/Llama-3.1-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the `huggingface-cli` command as shown below:
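```bash
# Log in with the Hugging Face access token of an account that has been
# granted access to meta-llama/Llama-3.1-8B-Instruct.
huggingface-cli login
```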
## Execution

### Model Compilation
To begin, import the `RBLNLlamaForCausalLM` class from `optimum-rbln`. This class's `from_pretrained()` method downloads the Llama 3.1 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:

- `export`: Must be `True` to compile the model.
- `rbln_batch_size`: Defines the batch size for compilation.
- `rbln_max_seq_len`: Defines the maximum sequence length.
- `rbln_tensor_parallel_size`: Defines the number of NPUs to be used for inference.
- `rbln_kvcache_partition_len`: Defines the length of the KV cache partitions for Flash Attention. `rbln_max_seq_len` must be a multiple of `rbln_kvcache_partition_len` and larger than `rbln_kvcache_partition_len`.
After compilation, save the model artifacts to disk using the `save_pretrained()` method. This will create a directory (e.g., `rbln-Llama-3-1-8B-Instruct`) containing the compiled model, as shown in the sketch below.
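The following is a minimal compilation sketch; the parameter values (batch size, sequence length, tensor parallel size, and partition length) are illustrative assumptions and should be adjusted to your model and NPU configuration.

```python
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Illustrative values: rbln_max_seq_len must be a multiple of
# rbln_kvcache_partition_len (131072 = 8 x 16384 here) and larger than it.
compiled_model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                       # compile the model for RBLN NPUs
    rbln_batch_size=1,                 # batch size used for compilation
    rbln_max_seq_len=131072,           # maximum sequence length
    rbln_tensor_parallel_size=8,       # number of NPUs used for inference
    rbln_kvcache_partition_len=16384,  # KV cache partition length (Flash Attention)
)

# Save the compiled artifacts to a local directory.
compiled_model.save_pretrained("rbln-Llama-3-1-8B-Instruct")
```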
Note
Select the batch size based on the model size and NPU specs. Moreover, `vllm-rbln` supports Dynamic Batching to ensure optimal throughput and resource utilization. See Dynamic Batching for details.
### Inference using vLLM

You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.
Note
Note that for Flash Attention, `block_size` should match `rbln_kvcache_partition_len`.
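A minimal inference sketch follows. The engine arguments mirror the illustrative compile-time values used above and should be kept consistent with your own compilation settings; the prompt and sampling parameters are arbitrary, and the `device` argument may not be required depending on your vllm-rbln version.

```python
from vllm import LLM, SamplingParams

# Engine arguments should be consistent with the compile-time settings
# (values below mirror the illustrative compilation sketch above).
llm = LLM(
    model="rbln-Llama-3-1-8B-Instruct",  # directory created by save_pretrained()
    device="rbln",                       # may be optional depending on vllm-rbln version
    max_num_seqs=1,                      # match rbln_batch_size
    max_model_len=131072,                # match rbln_max_seq_len
    block_size=16384,                    # match rbln_kvcache_partition_len
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompt = "Explain the benefits of Flash Attention for long-context inference."
outputs = llm.generate(prompt, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```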
Example Output: