Llama3.1 8B

Overview

This tutorial shows how to run a model with Flash Attention on vLLM, using the meta-llama/Llama-3.1-8B-Instruct model as the example.

Flash Attention improves memory efficiency and throughput, enabling better performance for models handling long contexts. Flash Attention mode is activated by adding the rbln_kvcache_partition_len parameter during compilation.

Setup & Installation

Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:

System Requirements:
- Python: 3.10–3.13
- RBLN Driver

Package Requirements:
- torch
- transformers
- numpy
- RBLN Compiler
- optimum-rbln
- huggingface_hub[cli]
- vllm-rbln – automatically installs vLLM

Installation Command:

pip install \
  --extra-index-url https://pypi.rbln.ai/simple \
  rebel-compiler==0.10.4.post1
pip install \
  --extra-index-url https://wheels.vllm.ai/0.18.0/cpu \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  vllm-rbln==0.10.4

Note

Please note that rebel-compiler requires an RBLN Portal account.
The commands above are intended for a default pip install on Debian-based Linux such as Ubuntu. For all other configurations, refer to the Installation Guide for the supported install matrix and the applicable commands.
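After installation, a quick preflight check can confirm that the environment matches the requirements listed above. The following is a minimal sketch, not part of the official tooling: the helper names `python_supported` and `installed_versions` are hypothetical, and the package list simply reuses the distribution names from this guide (with `rebel-compiler` taken from the installation command as the pip name of the RBLN Compiler).

```python
import sys
from importlib import metadata

# Distribution names as they appear in this guide's requirements and
# installation command; adjust the list if your setup differs.
REQUIRED_PACKAGES = [
    "torch",
    "transformers",
    "numpy",
    "rebel-compiler",
    "optimum-rbln",
    "huggingface_hub",
    "vllm-rbln",
]


def python_supported(version=None):
    """Return True if the (major, minor) version lies within 3.10 to 3.13."""
    if version is None:
        version = sys.version_info
    return (3, 10) <= (version[0], version[1]) <= (3, 13)


def installed_versions(packages=REQUIRED_PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    for name, version in installed_versions().items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

The check only reads package metadata, so it does not import any of the packages and does not require an NPU to be present.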

Note

Please note that the meta-llama/Llama-3.1-8B-Instruct model on HuggingFace has restricted access. Once access is granted, you can log in using the hf (huggingface-cli) command as shown below:

$ hf auth login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):

1. Execution: With Pre-Compilation

This step demonstrates how to use vLLM to serve the meta-llama/Llama-3.1-8B-Instruct model on multiple RBLN NPUs.

1.1. Model Compilation

To begin, import the RBLNLlamaForCausalLM class from optimum-rbln. This class's from_pretrained() method downloads the Llama 3.1 model from the HuggingFace Hub and compiles it using the RBLN Compiler. When exporting the model, specify the following parameters:

- rbln_batch_size: Defines the batch size for compilation.
- rbln_max_seq_len: Defines the maximum sequence length.
- rbln_tensor_parallel_size: Defines the number of NPUs to be used for inference.
- rbln_kvcache_partition_len: Defines the length of KV cache partitions for flash attention. rbln_max_seq_len must be a multiple of rbln_kvcache_partition_len and greater than rbln_kvcache_partition_len.
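
The constraint between rbln_max_seq_len and rbln_kvcache_partition_len can be checked before starting a long compilation. The sketch below is illustrative only: `kvcache_partition_count` is a hypothetical helper, not an optimum-rbln function, and it encodes just the two rules stated above (a strict multiple, and strictly greater).

```python
def kvcache_partition_count(max_seq_len: int, partition_len: int) -> int:
    """Validate the flash-attention length rule and return the partition count.

    Rules taken from this guide: max_seq_len must be a multiple of
    partition_len and must be greater than partition_len.
    """
    if partition_len <= 0:
        raise ValueError("partition_len must be a positive integer")
    if max_seq_len <= partition_len:
        raise ValueError("max_seq_len must be greater than partition_len")
    if max_seq_len % partition_len != 0:
        raise ValueError("max_seq_len must be a multiple of partition_len")
    return max_seq_len // partition_len


# Values used in this tutorial: 131,072 tokens split into 16,384-token partitions.
print(kvcache_partition_count(131_072, 16_384))  # 8
```

With the values used in this tutorial, the maximum sequence length spans eight KV cache partitions.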

After compilation, save the model artifacts to disk using the save_pretrained() method. This will create a directory (e.g., rbln-Llama-3-1-8B-Instruct) containing the compiled model.

Note

Choose the batch size based on the model size and the NPU specifications. vllm-rbln also supports Dynamic Batching to ensure optimal throughput and resource utilization. See Dynamic Batching for details.

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=131_072,
    rbln_tensor_parallel_size=8,
    rbln_kvcache_partition_len=16_384,
)

# Save compiled results to disk
model.save_pretrained("rbln-Llama-3-1-8B-Instruct")

1.2. Inference using vLLM

You can use the compiled model with vLLM. The example below shows how to set up the vLLM engine using a compiled model and run inference.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

def main():
    model_id = "rbln-Llama-3-1-8B-Instruct"
    llm = LLM(model=model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.0,
        skip_special_tokens=True,
        stop_token_ids=[tokenizer.eos_token_id],
        max_tokens=100
    )

    conversation = [
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]

    chat = tokenizer.apply_chat_template(
        conversation, 
        add_generation_prompt=True,
        tokenize=False
    )

    outputs = llm.generate(chat, sampling_params)
    # Print the generated text of the first candidate for each request
    for output in outputs:
        generated_text = output.outputs[0].text
        print(generated_text)

if __name__ == "__main__":
    main()

Example Output:

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."

2. Execution: Without Pre-Compilation (Beta)

Info

This is a beta feature. Starting with vllm-rbln 0.10.4, you can serve the model without running a separate pre-compilation script. Pass the vLLM engine parameters directly to LLM(), and vllm-rbln runs the optimum-rbln compilation step automatically at engine startup.

2.1. Compilation and Inference using vLLM

Set the tensor-parallel size with the VLLM_RBLN_TP_SIZE environment variable, then pass block_size (matching rbln_kvcache_partition_len), max_model_len, and max_num_seqs directly to LLM().
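
For readers moving from the pre-compilation flow, the correspondence between the two sets of parameters can be written down as a small helper. This is a sketch under assumptions: only the block_size and rbln_kvcache_partition_len pairing is stated explicitly in this guide; the other pairings are inferred from the two examples using the same values, and `vllm_engine_args` and `tp_size_env` are hypothetical names.

```python
def vllm_engine_args(rbln_batch_size, rbln_max_seq_len, rbln_kvcache_partition_len):
    """Translate optimum-rbln compile parameters into LLM() keyword arguments.

    block_size <-> rbln_kvcache_partition_len is stated in this guide.
    The other two pairings are assumptions inferred from the examples.
    """
    return {
        "max_num_seqs": rbln_batch_size,           # assumed pairing
        "max_model_len": rbln_max_seq_len,         # assumed pairing
        "block_size": rbln_kvcache_partition_len,  # stated in the guide
    }


def tp_size_env(rbln_tensor_parallel_size):
    """Return the environment variable that carries the tensor-parallel size."""
    return {"VLLM_RBLN_TP_SIZE": str(rbln_tensor_parallel_size)}


# Values from the pre-compilation example in section 1.1.
print(vllm_engine_args(1, 131_072, 16_384))
print(tp_size_env(8))
```

The resulting dictionary could be unpacked into the constructor, for example `LLM(model=model_id, **engine_args)`, after the environment variable has been set.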

Note

vllm-rbln currently requires block_size to be set explicitly. See the Configuration Guide for details.

Note

Compiled artifacts are saved under $VLLM_CACHE_ROOT/compiled_models. VLLM_CACHE_ROOT defaults to ~/.cache/vllm; set the environment variable to override the location.
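
To locate or clean up the compiled artifacts, the directory can be resolved with the rule described in the note above. The following is a minimal sketch that mirrors that rule as written; `compiled_models_dir` is a hypothetical helper, and the path vllm-rbln actually resolves at runtime may differ if other settings are involved.

```python
import os
from pathlib import Path


def compiled_models_dir(environ=None) -> Path:
    """Resolve $VLLM_CACHE_ROOT/compiled_models, defaulting to ~/.cache/vllm."""
    env = os.environ if environ is None else environ
    # Fall back to the default stated in this guide when the variable is unset or empty.
    root = env.get("VLLM_CACHE_ROOT") or "~/.cache/vllm"
    return Path(root).expanduser() / "compiled_models"


if __name__ == "__main__":
    path = compiled_models_dir()
    print(f"Compiled models directory: {path} (exists: {path.is_dir()})")
```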

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import os

def main():
    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    os.environ["VLLM_RBLN_TP_SIZE"] = "8"
    llm = LLM(
        model=model_id,
        block_size=16_384,
        max_model_len=131_072,
        max_num_seqs=1,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # ... (rest of main() is unchanged from the example in section 1.2)

if __name__ == "__main__":
    main()
