Qwen2.5-VL¶
The Qwen2.5-VL model is a vision-language model designed for tasks like Visual Question Answering (VQA), image captioning, and video understanding. It can process both image and video inputs alongside text, making it highly versatile for multimodal applications. Qwen2.5-VL inference can be accelerated on RBLN NPUs using Optimum RBLN.
Classes¶
RBLNQwen2_5_VLForConditionalGeneration¶
Bases: RBLNDecoderOnlyModelForCausalLM
RBLNQwen2_5_VLForConditionalGeneration is a multi-modal model that integrates vision and language processing capabilities, optimized for RBLN NPUs. It is designed for conditional generation tasks that involve both image and text inputs.
This model inherits from [RBLNDecoderOnlyModelForCausalLM]. Check the superclass documentation for the generic methods the library implements for all its models.
Important Note
This model includes a Large Language Model (LLM). For optimal performance, it is highly recommended to use tensor parallelism for the language model. This can be achieved by using the rbln_config parameter in the from_pretrained method. Refer to the from_pretrained documentation and the RBLNQwen2_5_VLForConditionalGenerationConfig class for details.
Examples:
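The example code is not included on this page, so the following is a minimal sketch of the intended usage rather than an official sample. It assumes optimum-rbln is installed; the model id, the tensor_parallel_size key and value, and the max_seq_lens value are illustrative assumptions.

```python
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

# Compile the HuggingFace checkpoint into an RBLN model (export=True) and
# enable tensor parallelism for the language model through rbln_config.
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",        # illustrative model id
    export=True,
    rbln_config={
        "tensor_parallel_size": 8,         # assumed key; splits the LLM across 8 NPUs
        "visual": {"max_seq_lens": 6400},  # vision encoder patch budget (see config class below)
    },
)

# Persist the compiled artifacts so later runs can skip compilation.
model.save_pretrained("qwen2_5_vl_rbln")
```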
Functions¶
from_pretrained(model_id, export=False, rbln_config=None, **kwargs) classmethod¶
The from_pretrained() function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to load a pre-trained model from the HuggingFace model hub or a local path and convert it into an RBLN model that runs on RBLN NPUs.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model_id` | `Union[str, Path]` | The model id of the pre-trained model to be loaded. It can be a model id from the HuggingFace model hub, a local path, or the model id of a model compiled with the RBLN Compiler. | required |
`export` | `bool` | A boolean flag indicating whether the model should be compiled. | `False` |
`rbln_config` | `Optional[Union[Dict, RBLNModelConfig]]` | Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., `RBLNQwen2_5_VLForConditionalGenerationConfig` for this model). | `None` |
`kwargs` | `Dict[str, Any]` | Additional keyword arguments. Arguments with the prefix `rbln_` are passed to `rbln_config`, while the remaining arguments are passed to the HuggingFace library. | `{}` |
Returns:

Type | Description |
---|---|
`Self` | An RBLN model instance ready for inference on RBLN NPU devices. |
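As noted in the kwargs row above, compile-time options can also be passed as flat keyword arguments with an rbln_ prefix instead of a rbln_config dictionary. A hedged sketch follows; the tensor_parallel_size key is an assumption, not a value taken from this page.

```python
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

# Arguments prefixed with "rbln_" are routed into rbln_config; this is
# equivalent to passing rbln_config={"tensor_parallel_size": 8}.
model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    export=True,
    rbln_tensor_parallel_size=8,
)
```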
from_model(model, *, rbln_config=None, **kwargs) classmethod¶
Converts and compiles a pre-trained HuggingFace library model into an RBLN model. This method performs the actual model conversion and compilation process.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model` | `PreTrainedModel` | The PyTorch model to be compiled. The object must be an instance of the HuggingFace transformers PreTrainedModel class. | required |
`rbln_config` | `Optional[Union[Dict, RBLNModelConfig]]` | Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., `RBLNQwen2_5_VLForConditionalGenerationConfig` for this model). | `None` |
`kwargs` | `Dict[str, Any]` | Additional keyword arguments. Arguments with the prefix `rbln_` are passed to `rbln_config`, while the remaining arguments are passed to the HuggingFace library. | `{}` |
The method performs the following steps:
- Compiles the PyTorch model into an optimized RBLN graph
- Configures the model for the specified NPU device
- Creates the necessary runtime objects if requested
- Saves the compiled model and configurations
Returns:

Type | Description |
---|---|
`Self` | An RBLN model instance ready for inference on RBLN NPU devices. |
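A hedged sketch of this conversion path, assuming the original checkpoint is first loaded with transformers. The Qwen2_5_VLForConditionalGeneration class is the upstream transformers model; the rbln_config key and values are illustrative assumptions.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

# Load the original PyTorch model with transformers...
torch_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)

# ...then convert and compile it into an RBLN model.
rbln_model = RBLNQwen2_5_VLForConditionalGeneration.from_model(
    torch_model,
    rbln_config={"tensor_parallel_size": 8},  # assumed key; illustrative value
)
```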
save_pretrained(save_directory)¶
Saves a model and its configuration file to a directory so that it can be re-loaded using the [from_pretrained] class method.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`save_directory` | `Union[str, PathLike]` | The directory to save the model and its configuration files. Will be created if it doesn't exist. | required |
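A brief hedged sketch, continuing from the compilation example above; the directory name is hypothetical.

```python
from optimum.rbln import RBLNQwen2_5_VLForConditionalGeneration

# `model` is a compiled instance such as the one produced in the earlier sketch.
# Save the compiled model and its configuration files...
model.save_pretrained("qwen2_5_vl_rbln")

# ...and re-load them later without recompiling (export stays at its default False).
reloaded = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained("qwen2_5_vl_rbln")
```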
generate(input_ids, attention_mask=None, max_length=None)¶
The generate function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to generate text from the model.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_ids` | `LongTensor` | The sequence used as a prompt for the generation. | required |
`attention_mask` | `Optional[Tensor]` | The attention mask to apply on the sequence. | `None` |
`max_length` | `Optional[int]` | The maximum length of the sequence to be generated. | `None` |
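A hedged sketch of multimodal generation, assuming the standard HuggingFace processing flow for Qwen2.5-VL. The processor id, image path, and prompt are illustrative, and `model` is a compiled instance such as the one in the earlier sketches.

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
image = Image.open("example.jpg")  # hypothetical local image

# Build a chat-style prompt with an image placeholder, then tokenize it
# together with the pixel values.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")

# Generate text conditioned on both the prompt and the image.
output_ids = model.generate(**inputs, max_length=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```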
Classes¶
RBLNQwen2_5_VisionTransformerPretrainedModelConfig¶
Bases: RBLNModelConfig
Functions¶
__init__(max_seq_lens=None, **kwargs)¶
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`max_seq_lens` | `int` | The maximum sequence length for the Vision Transformer, representing the number of patches in a sequence derived from an image or a video frame. See the "Max Seq Lens" section below for details. | `None` |
`**kwargs` | `Dict[str, Any]` | Additional keyword arguments passed to the parent class. | `{}` |
Raises:

Type | Description |
---|---|
`ValueError` | If |
`ValueError` | If |
`ValueError` | If |
`ValueError` | If |
Max Seq Lens
Since Qwen2_5_VLForConditionalGeneration performs inference on a per-image or per-frame basis, max_seq_lens should be set based on the maximum expected resolution of the input images or video frames, according to the following guidelines:

- Minimum Value: max_seq_lens must be greater than or equal to the number of patches generated from the input image. For example, a 224x224 image with a patch size of 14 results in (224 / 14) * (224 / 14) = 256 patches, so max_seq_lens must be at least 256.
- Alignment Requirement: max_seq_lens must be a multiple of (window_size / patch_size)^2 due to the requirements of the window-based attention mechanism. For instance, if window_size is 112 and patch_size is 14, then (112 / 14)^2 = 64, meaning valid values for max_seq_lens include 64, 128, 192, 256, etc. (see the illustrative helper after this list).
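To make the two rules concrete, here is a purely illustrative helper (not part of the library) that computes the smallest valid max_seq_lens for a given image size, using the patch size (14) and window size (112) from the examples above:

```python
def smallest_valid_max_seq_lens(height: int, width: int,
                                patch_size: int = 14, window_size: int = 112) -> int:
    """Smallest max_seq_lens covering the image while satisfying the alignment rule."""
    num_patches = (height // patch_size) * (width // patch_size)
    align = (window_size // patch_size) ** 2        # (112 / 14)^2 = 64
    # Round the patch count up to the next multiple of the alignment unit.
    return ((num_patches + align - 1) // align) * align

print(smallest_valid_max_seq_lens(224, 224))  # 256 patches -> 256
print(smallest_valid_max_seq_lens(448, 644))  # 1472 patches -> 1472
```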
RBLNQwen2_5_VLForConditionalGenerationConfig¶
Bases: RBLNDecoderOnlyModelForCausalLMConfig
Functions¶
__init__(visual=None, **kwargs)¶
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`visual` | `Optional[RBLNModelConfig]` | Configuration for the vision encoder component. | `None` |
`**kwargs` | `Dict[str, Any]` | Additional keyword arguments passed to the parent class. | `{}` |
Raises:

Type | Description |
---|---|
`ValueError` | If the |
`ValueError` | If |
`ValueError` | If any inherited parameters violate constraints defined in the parent class. |
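Putting the two configuration classes together, a hedged sketch follows. The values are illustrative, and the tensor_parallel_size keyword is an assumption about the parent RBLNDecoderOnlyModelForCausalLMConfig rather than a parameter documented on this page.

```python
from optimum.rbln import (
    RBLNQwen2_5_VisionTransformerPretrainedModelConfig,
    RBLNQwen2_5_VLForConditionalGeneration,
    RBLNQwen2_5_VLForConditionalGenerationConfig,
)

# Vision encoder budget: must be a multiple of (window_size / patch_size)^2,
# e.g. a multiple of 64 for window_size=112 and patch_size=14.
visual_config = RBLNQwen2_5_VisionTransformerPretrainedModelConfig(max_seq_lens=6400)

config = RBLNQwen2_5_VLForConditionalGenerationConfig(
    visual=visual_config,
    tensor_parallel_size=8,  # assumed keyword; handled by the parent config class
)

model = RBLNQwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    export=True,
    rbln_config=config,
)
```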