
Attention Modes

Overview

vLLM RBLN supports two attention modes: Naive Attention and Flash Attention.

The mode is specified at model compilation time using the rbln_attn_impl and rbln_kvcache_partition_len parameters. During inference, vLLM RBLN automatically applies the selected mode based on the attn_impl field in the rbln_config.json file of the compiled model.

Thus, users only need to specify the attention mode once at compilation.
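
For example, you can check which mode was recorded by inspecting rbln_config.json in the compiled model directory. This is a minimal sketch: the directory path is illustrative, and the exact layout of the file may vary across optimum-rbln versions.

import json
from pathlib import Path

# Path to a compiled model directory (illustrative; use your own output directory)
compiled_dir = Path("rbln-Llama-3-8B-Instruct")

with open(compiled_dir / "rbln_config.json") as f:
    rbln_config = json.load(f)

# vLLM RBLN reads the attention mode from this field at inference time,
# so no attention-related flags need to be passed when serving the model.
print(rbln_config.get("attn_impl"))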

Naive Attention

Naive Attention runs without any additional optimizations applied to the attention mechanism.

To compile and run the model with Naive Attention, set rbln_attn_impl="eager".

Because "eager" is the default value of rbln_attn_impl, the parameter can also be omitted and Naive Attention will be applied automatically. rbln_kvcache_partition_len does not need to be specified for Naive Attention.

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_attn_impl="eager",
    rbln_batch_size=4,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
)

# Save compiled results to disk
model.save_pretrained("rbln-Llama-3-8B-Instruct")

For a detailed example, refer to the Llama3 8B tutorial.

Flash Attention

Flash Attention improves memory efficiency and throughput, enabling better performance for models handling long contexts.

Flash Attention mode is activated by setting rbln_attn_impl="flash_attn" and specifying the rbln_kvcache_partition_len parameter during compilation.

If rbln_kvcache_partition_len is provided and differs from rbln_max_seq_len, rbln_attn_impl is set to "flash_attn" automatically, even when it is not specified explicitly.
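
As a sketch of this implicit activation (mirroring the model and parameter values from the example below), the following call omits rbln_attn_impl; because rbln_kvcache_partition_len (16_384) differs from rbln_max_seq_len (131_072), Flash Attention is still selected:

from optimum.rbln import RBLNLlamaForCausalLM

# rbln_attn_impl is omitted; Flash Attention is inferred because
# rbln_kvcache_partition_len != rbln_max_seq_len.
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=131_072,
    rbln_tensor_parallel_size=8,
    rbln_kvcache_partition_len=16_384,
)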

To achieve the best performance with Flash Attention, it is recommended to use the parameter values provided in the RBLN Model Zoo.

from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Compile and export
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_attn_impl="flash_attn",
    rbln_batch_size=1,
    rbln_max_seq_len=131_072,
    rbln_tensor_parallel_size=8,
    rbln_kvcache_partition_len=16_384,
)

# Save compiled results to disk
model.save_pretrained("rbln-Llama-3-1-8B-Instruct")

For a detailed example, refer to the Llama3.1 8B tutorial.