Llama3.1 8B¶
Overview¶
This tutorial guides how to run the model with Flash Attention on vLLM. For this guide, we will use meta-llama/Llama-3.1-8B-Instruct model.
Flash Attention improves memory efficiency and throughput, enabling better performance for models handling long contexts. Flash Attention mode is activated by adding the rbln_kvcache_partition_len parameter during compilation.
Setup & Installation¶
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
- Python: 3.10–3.13
- RBLN Driver
- Packages Requirements:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln – automatically installs
vLLM
- Installation Command:
Note
- Please note that
rebel-compilerrequires an RBLN Portal account. - The commands above are intended for a default pip install on Debian-based Linux such as Ubuntu. For all other configurations, refer to the Installation Guide for the supported install matrix and the applicable commands.
Note
Please note that the meta-llama/Llama-3.1-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the hf (huggingface-cli) command as shown below:
1. Execution: With Pre-Compilation¶
This step demonstrates how to use vLLM to serve the meta-llama/Llama-3.1-8B-Instruct model on multiple RBLN NPUs.
1.1. Model Compilation¶
To begin, import the RBLNLlamaForCausalLM class from optimum-rbln. This class's from_pretrained() method downloads the Llama 3.1 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:
-
rbln_batch_size: Defines the batch size for compilation. -
rbln_max_seq_len: Defines the maximum sequence length. -
rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference. -
rbln_kvcache_partition_len: Defines the length of KV cache partitions for flash attention.rbln_max_seq_lenmust be a multiple ofrbln_kvcache_partition_lenand greater thanrbln_kvcache_partition_len.
After compilation, save the model artifacts to disk using the save_pretrained() method. This will create a directory (e.g., rbln-Llama-3-1-8B-Instruct) containing the compiled model.
Note
Select batch size based on model size and NPU specs. Moreover, vllm-rbln supports Dynamic Batching to ensure optimal throughput and resource utilization. See Dynamic Batching for details.
1.2. Inference using vLLM¶
You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.
Example Output:
2. Execution: Without Pre-Compilation (Beta)¶
Info
This is a beta feature. From vLLM RBLN 0.10.4, you can serve the model without running a separate pre-compile script. Pass vLLM engine parameters directly to LLM(), and vLLM RBLN runs the optimum-rbln compile step automatically at engine startup.
2.1. Compilation and Inference using vLLM¶
Set the tensor-parallel size with the VLLM_RBLN_TP_SIZE environment variable, then pass block_size (matching rbln_kvcache_partition_len), max_model_len, and max_num_seqs directly to LLM().
Note
vllm-rbln currently requires block_size to be set explicitly. See the Configuration Guide for details.
Note
Compiled artifacts are saved under $VLLM_CACHE_ROOT/compiled_models. VLLM_CACHE_ROOT defaults to ~/.cache/vllm; set the environment variable to override the location.