Qwen2

Qwen2 is a family of open-source large language models developed by Alibaba Cloud. RBLN NPUs can accelerate Qwen2 model inference using Optimum RBLN.

Key Classes

  • RBLNQwen2ForCausalLM: The model class for compiling and running Qwen2ForCausalLM checkpoints on RBLN NPUs.
  • RBLNQwen2ForCausalLMConfig: The configuration class that controls compilation and runtime options for RBLNQwen2ForCausalLM.

API Reference

Classes

RBLNQwen2ForCausalLM

Bases: RBLNDecoderOnlyModelForCausalLM

The Qwen2 Model transformer with a language modeling head (linear layer) on top. This model inherits from [RBLNDecoderOnlyModelForCausalLM]. Check the superclass documentation for the generic methods the library implements for all its models.

A class to convert and run a pre-trained transformers-based Qwen2ForCausalLM model on RBLN devices. It implements the methods to convert a pre-trained transformers Qwen2ForCausalLM model into an RBLN transformer model by:

  • transferring the checkpoint weights of the original model into an optimized RBLN graph,
  • compiling the resulting graph using the RBLN compiler.

Configuration: This model uses [RBLNQwen2ForCausalLMConfig] for configuration. When calling methods like from_pretrained or from_model, the rbln_config parameter should be an instance of [RBLNQwen2ForCausalLMConfig] or a dictionary conforming to its structure.

See the [RBLNQwen2ForCausalLMConfig] class for all available configuration options.

Examples:

from optimum.rbln import RBLNQwen2ForCausalLM

# Simple usage using rbln_* arguments
# `max_seq_len` is automatically inferred from the model config
model = RBLNQwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    export=True,
    rbln_batch_size=1,
    rbln_tensor_parallel_size=4,
)


# Using a config dictionary
rbln_config = {
    "batch_size": 1,
    "max_seq_len": 4096,
    "tensor_parallel_size": 4,
}
model = RBLNQwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    export=True,
    rbln_config=rbln_config
)


# Using a RBLNQwen2ForCausalLMConfig instance (recommended for type checking)
from optimum.rbln import RBLNQwen2ForCausalLMConfig

config = RBLNQwen2ForCausalLMConfig(
    batch_size=1,
    max_seq_len=4096,
    tensor_parallel_size=4
)
model = RBLNQwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    export=True,
    rbln_config=config
)

Functions

from_pretrained(model_id, export=False, rbln_config=None, **kwargs) classmethod

The from_pretrained() function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to load a pre-trained model from the HuggingFace hub or a local path and convert it into an RBLN model to be run on RBLN NPUs.

Parameters:

  • model_id (Union[str, Path], required): The model id of the pre-trained model to be loaded. This can be a model id on the HuggingFace model hub, a local path, or the model id of a model compiled with the RBLN Compiler.
  • export (bool, default: False): A boolean flag to indicate whether the model should be compiled.
  • rbln_config (Optional[Union[Dict, RBLNModelConfig]], default: None): Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., RBLNLlamaForCausalLMConfig for Llama models). For detailed configuration options, see the specific model's configuration class documentation.
  • kwargs (Dict[str, Any], default: {}): Additional keyword arguments. Arguments with the prefix 'rbln_' are passed to rbln_config, while the remaining arguments are passed to the HuggingFace library.

Returns:

  • Self: An RBLN model instance ready for inference on RBLN NPU devices.
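
For illustration, a minimal sketch of reloading an already compiled model; the local directory name is hypothetical, and export defaults to False so no recompilation is performed:

from optimum.rbln import RBLNQwen2ForCausalLM

# Reload a previously compiled model from a local directory (hypothetical path).
model = RBLNQwen2ForCausalLM.from_pretrained("./qwen2-7b-instruct-rbln")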

from_model(model, *, rbln_config=None, **kwargs) classmethod

Converts and compiles a pre-trained HuggingFace library model into an RBLN model. This method performs the actual model conversion and compilation process.

Parameters:

  • model (PreTrainedModel, required): The PyTorch model to be compiled. The object must be an instance of the HuggingFace transformers PreTrainedModel class.
  • rbln_config (Optional[Union[Dict, RBLNModelConfig]], default: None): Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., RBLNLlamaForCausalLMConfig for Llama models). For detailed configuration options, see the specific model's configuration class documentation.
  • kwargs (Dict[str, Any], default: {}): Additional keyword arguments. Arguments with the prefix 'rbln_' are passed to rbln_config, while the remaining arguments are passed to the HuggingFace library.

The method performs the following steps:

  1. Compiles the PyTorch model into an optimized RBLN graph
  2. Configures the model for the specified NPU device
  3. Creates the necessary runtime objects if requested
  4. Saves the compiled model and configurations

Returns:

  • Self: An RBLN model instance ready for inference on RBLN NPU devices.
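
As a minimal sketch, a PyTorch Qwen2 model loaded with the HuggingFace transformers library can be passed directly to from_model (assuming the original checkpoint fits in host memory):

from transformers import AutoModelForCausalLM
from optimum.rbln import RBLNQwen2ForCausalLM

# Load the original PyTorch checkpoint with the HuggingFace library.
hf_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")

# Convert and compile it for RBLN NPUs; rbln_config takes the same form as in from_pretrained.
model = RBLNQwen2ForCausalLM.from_model(
    hf_model,
    rbln_config={"batch_size": 1, "max_seq_len": 4096, "tensor_parallel_size": 4},
)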

save_pretrained(save_directory)

Saves a model and its configuration file to a directory, so that it can be re-loaded using the [from_pretrained] class method.

Parameters:

  • save_directory (Union[str, PathLike], required): The directory to save the model and its configuration files. Will be created if it doesn't exist.
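
A minimal sketch of the save/reload round trip; the directory name is hypothetical:

# Persist the compiled model and its configuration files (hypothetical directory name).
model.save_pretrained("./qwen2-7b-instruct-rbln")

# The saved artifacts can later be reloaded without recompiling.
model = RBLNQwen2ForCausalLM.from_pretrained("./qwen2-7b-instruct-rbln")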

generate(input_ids, attention_mask=None, max_length=None)

The generate function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to generate text from the model.

Parameters:

  • input_ids (LongTensor, required): The sequence used as a prompt for the generation.
  • attention_mask (Optional[Tensor], default: None): The attention mask to apply on the sequence.
  • max_length (Optional[int], default: None): The maximum length of the sequence to be generated.
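
For illustration, a minimal sketch using the standard transformers tokenizer workflow; the prompt and max_length are arbitrary, and model is assumed to be an RBLNQwen2ForCausalLM loaded as shown above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
inputs = tokenizer("What is an NPU?", return_tensors="pt")

# generate follows the standard HuggingFace transformers interface.
output_ids = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))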

Classes

RBLNQwen2ForCausalLMConfig

Bases: RBLNDecoderOnlyModelForCausalLMConfig

Configuration class for RBLN Qwen2 models.

This class is an alias of RBLNDecoderOnlyModelForCausalLMConfig.

Example usage:

from optimum.rbln import RBLNQwen2ForCausalLM, RBLNQwen2ForCausalLMConfig

# Create a configuration object
config = RBLNQwen2ForCausalLMConfig(
    batch_size=1,
    max_seq_len=4096,
    tensor_parallel_size=4
)

# Use the configuration with from_pretrained
model = RBLNQwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",
    export=True,
    rbln_config=config
)

Functions

__init__(batch_size=None, max_seq_len=None, use_inputs_embeds=None, use_attention_mask=None, attn_impl=None, kvcache_partition_len=None, kvcache_block_size=None, quantization=None, prefill_chunk_size=None, kvcache_num_blocks=None, **kwargs)

Parameters:

  • batch_size (Optional[int], default: None): The batch size for inference. Defaults to 1.
  • max_seq_len (Optional[int], default: None): The maximum sequence length supported by the model. If not provided, it is inferred from the model's configuration (max_position_embeddings or n_positions) when possible. Must be specified if not available in the model config.
  • use_inputs_embeds (Optional[bool], default: None): Whether to use input embeddings (inputs_embeds) directly instead of input_ids. Defaults to False. Requires the model to be compiled with this option enabled.
  • use_attention_mask (Optional[bool], default: None): Whether the model requires attention masks during inference. This is typically determined based on the target device and model architecture. Defaults are often set automatically based on the model and RBLN NPU.
  • attn_impl (Optional[str], default: None): Specifies the attention implementation to use. See the "Attention Implementation" section below for details.
  • kvcache_partition_len (Optional[int], default: None): Defines the partition length for the KV cache when using "flash_attn". See the "KV Cache Partition Length" section below for details.
  • kvcache_block_size (Optional[int], default: None): Sets the size (in number of tokens) of each block in the PagedAttention KV cache. See the "KV Cache Number of Blocks" section below for details.
  • quantization (Optional[Dict[str, Any]], default: None): Configuration dictionary for applying model quantization. Specifies format, group sizes, etc.
  • prefill_chunk_size (Optional[int], default: None): The chunk size used during the prefill phase for processing input sequences. Defaults to 128. Must be a positive integer divisible by 64. Affects prefill performance and memory usage.
  • kvcache_num_blocks (Optional[int], default: None): The total number of blocks to allocate for the PagedAttention KV cache. See the "KV Cache Number of Blocks" section below for details.
  • **kwargs (Dict[str, Any], default: {}): Additional keyword arguments passed to the parent RBLNModelConfig.

Raises:

  • ValueError: If batch_size is not a positive integer.
  • ValueError: If prefill_chunk_size is not a positive integer divisible by 64.
  • ValueError: If max_seq_len cannot be determined and is required.
  • ValueError: If attention parameter constraints are violated (e.g., max_seq_len vs kvcache_partition_len for flash attention).
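
As an illustrative sketch (parameter values are arbitrary but chosen to respect the constraints above), several of these options can be combined in a single configuration:

from optimum.rbln import RBLNQwen2ForCausalLMConfig

# prefill_chunk_size must be a positive integer divisible by 64.
config = RBLNQwen2ForCausalLMConfig(
    batch_size=2,
    max_seq_len=4096,
    use_inputs_embeds=False,
    prefill_chunk_size=256,
    tensor_parallel_size=4,
)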

Attention Implementation

attn_impl determines the underlying attention mechanism used by the model.

  • "eager" (Default if kvcache_partition_len is not set): Uses the standard PyTorch attention implementation. Suitable for sequences up to a certain limit (e.g., 32,768 tokens).
  • "flash_attn": Utilizes an optimized Flash Attention implementation, beneficial for longer sequences and potentially faster execution. Requires max_seq_len to be at least 8,192. If kvcache_partition_len is specified, attn_impl automatically defaults to "flash_attn". When using "flash_attn", kvcache_block_size must equal kvcache_partition_len.

The choice impacts performance and memory usage, especially for long sequences. Constraints related to max_seq_len and kvcache_partition_len apply when using "flash_attn".
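
For illustration, a minimal sketch of selecting Flash Attention explicitly (values are illustrative; with max_seq_len=32768 the default kvcache_partition_len of 16,384 satisfies the constraints described in the next section):

from optimum.rbln import RBLNQwen2ForCausalLMConfig

# Illustrative long-context configuration using Flash Attention.
# max_seq_len must be at least 8,192 when attn_impl is "flash_attn".
config = RBLNQwen2ForCausalLMConfig(
    max_seq_len=32768,
    attn_impl="flash_attn",
    tensor_parallel_size=4,
)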

KV Cache Partition Length

kvcache_partition_len is relevant only when attn_impl is "flash_attn"; a configuration sketch follows the list below.

  • It defines the length (number of tokens) of each partition within the Key-Value (KV) cache.
  • Must be between 4,096 and 32,768 (inclusive).
  • When using "flash_attn", max_seq_len must be a multiple of kvcache_partition_len and at least twice its value (max_seq_len >= 2 * kvcache_partition_len).
  • If attn_impl is "flash_attn" and kvcache_partition_len is None, it defaults to 16,384.
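
A minimal sketch with hypothetical values that satisfy these constraints (32,768 is a multiple of 8,192 and at least twice 8,192):

from optimum.rbln import RBLNQwen2ForCausalLMConfig

# Specifying kvcache_partition_len implies attn_impl="flash_attn".
config = RBLNQwen2ForCausalLMConfig(
    max_seq_len=32768,           # a multiple of kvcache_partition_len and >= 2x its value
    kvcache_partition_len=8192,  # must be between 4,096 and 32,768
    tensor_parallel_size=4,
)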

KV Cache Number of Blocks

kvcache_num_blocks controls the total number of memory blocks allocated for the PagedAttention KV cache. Each block holds kvcache_block_size tokens of Key and Value states.

  • Automatic Estimation (Default): If kvcache_num_blocks is None, the system estimates the maximum number of blocks that can fit into the available RBLN device memory. This calculation considers the model size (kernel memory), required buffer memory, the number of layers and heads, kvcache_block_size, tensor parallelism, and available RBLN NPU DRAM. This aims to maximize cache capacity for potentially better performance with long sequences or larger batches without manual tuning.
  • Manual Setting: You can explicitly set the number of blocks. This provides finer control but requires careful consideration of memory limits. Setting it too high may lead to compilation errors if it exceeds available memory. The system will issue warnings if your setting exceeds the estimated maximum.
  • Performance Impact: A larger number of blocks reduces the likelihood of cache eviction, which is beneficial for tasks involving many long sequences or large batch sizes, enabling higher throughput. However, allocating more blocks consumes more memory.
  • Minimum Requirement: The system requires a minimum number of blocks to function, calculated based on max_seq_len, kvcache_block_size, and batch_size. The number of allocated blocks must be sufficient to hold at least one full sequence length per item in the batch concurrently. The system will log warnings or raise errors if constraints are violated (e.g., if kvcache_num_blocks is less than batch_size when using Flash Attention).

The optimal value depends on the specific model, task, hardware, and desired trade-off between performance and memory usage. The automatic estimation provides a robust starting point.
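
For illustration only, a sketch of setting the block count manually (hypothetical values; the safe maximum depends on the model, tensor parallelism, and available NPU memory):

from optimum.rbln import RBLNQwen2ForCausalLMConfig

# 64 blocks of 4,096 tokens each hold up to 262,144 cached tokens in total,
# above the minimum of (max_seq_len / kvcache_block_size) * batch_size = 16 blocks.
config = RBLNQwen2ForCausalLMConfig(
    batch_size=2,
    max_seq_len=32768,
    attn_impl="flash_attn",
    kvcache_partition_len=4096,
    kvcache_block_size=4096,   # must equal kvcache_partition_len with flash_attn
    kvcache_num_blocks=64,
    tensor_parallel_size=4,
)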