Attention Modes¶
Overview¶
vLLM RBLN supports two attention modes: Naive Attention and Flash Attention.
The mode is specified at model compilation time through the rbln_attn_impl and rbln_kvcache_partition_len parameters. During inference, vLLM RBLN automatically applies the selected mode based on the attn_impl field in the rbln_config.json file of the compiled model.
Thus, users only need to specify the attention mode once, at compilation.
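The snippet below is a minimal sketch of how the recorded mode can be inspected after compilation. The directory name is hypothetical, and the exact layout of rbln_config.json may vary between releases; only the attn_impl field is taken from this page.

```python
import json
from pathlib import Path

# Hypothetical path to a compiled model directory; adjust to your setup.
compiled_model_dir = Path("./llama3-8b-rbln")

# The compiler records the selected attention mode in rbln_config.json;
# vLLM RBLN reads this file at load time, so no extra inference-time flag
# is needed.
with open(compiled_model_dir / "rbln_config.json") as f:
    rbln_config = json.load(f)

print(rbln_config.get("attn_impl"))  # e.g. "eager" or "flash_attn"
```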
Naive Attention¶
Naive Attention runs without any additional optimizations applied to the attention mechanism.
To compile and run the model with Naive Attention, set rbln_attn_impl="eager". Since the default value of rbln_attn_impl is "eager", you can omit this parameter and it will be applied automatically.
For Naive Attention, rbln_kvcache_partition_len does not need to be specified.
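The following is a minimal compilation sketch using optimum-rbln's RBLNLlamaForCausalLM. The model name, sequence length, tensor-parallel size, and output directory are illustrative assumptions, not values from this page.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile with Naive Attention. rbln_attn_impl="eager" is the default,
# so the argument could be omitted; it is written out here for clarity.
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    export=True,                            # trigger RBLN compilation
    rbln_attn_impl="eager",                 # Naive Attention (default)
    rbln_max_seq_len=8192,                  # example value; adjust to your model
    rbln_tensor_parallel_size=4,            # example value; adjust to your setup
    rbln_batch_size=1,
)
model.save_pretrained("llama3-8b-rbln")     # directory later served by vLLM RBLN
```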
For a detailed example, refer to the Llama3 8B tutorial.
Flash Attention¶
Flash Attention improves memory efficiency and throughput, enabling better performance for models handling long contexts.
Flash Attention mode is activated by setting rbln_attn_impl="flash_attn" and specifying the rbln_kvcache_partition_len parameter during compilation.
If rbln_kvcache_partition_len is provided and differs from rbln_max_seq_len, rbln_attn_impl is automatically set to flash_attn, even without explicitly setting rbln_attn_impl="flash_attn".
To achieve the best performance with Flash Attention, it is recommended to follow the parameter settings used in the RBLN Model Zoo.
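Below is a minimal Flash Attention compilation sketch, again using optimum-rbln's RBLNLlamaForCausalLM. The model name and the specific sequence-length, partition-length, and tensor-parallel values are illustrative assumptions; consult the RBLN Model Zoo for recommended settings.

```python
from optimum.rbln import RBLNLlamaForCausalLM

# Compile with Flash Attention by setting rbln_attn_impl="flash_attn" and a
# KV-cache partition length. If rbln_kvcache_partition_len differs from
# rbln_max_seq_len, flash_attn is selected automatically even without the
# explicit rbln_attn_impl argument.
model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # example long-context model
    export=True,                          # trigger RBLN compilation
    rbln_attn_impl="flash_attn",          # Flash Attention
    rbln_max_seq_len=131072,              # example long-context length
    rbln_kvcache_partition_len=16384,     # example partition length
    rbln_tensor_parallel_size=8,          # example value; adjust to your setup
    rbln_batch_size=1,
)
model.save_pretrained("llama3.1-8b-rbln")
```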
For a detailed example, refer to the Llama3.1 8B tutorial.