GPT-2¶
GPT-2 (Generative Pre-trained Transformer 2) is an autoregressive language model that uses a transformer-based neural network architecture to generate text. It was trained on a diverse dataset of internet text with the objective of predicting the next word in a sequence. RBLN NPUs can accelerate GPT-2 model inference using Optimum RBLN.
Key Classes¶
- `RBLNGPT2LMHeadModel`: GPT-2 model implementation for causal language modeling on RBLN NPUs
- `RBLNGPT2LMHeadModelConfig`: Configuration class for the GPT-2 causal language model
API Reference¶
Classes¶
RBLNGPT2LMHeadModel¶
Bases: RBLNDecoderOnlyModelForCausalLM
GPT-2 model for causal language modeling optimized for RBLN NPU.
Functions¶
from_pretrained(model_id, export=False, rbln_config=None, **kwargs) classmethod¶
The `from_pretrained()` function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to load a pre-trained model from the HuggingFace hub and convert it to an RBLN model that runs on RBLN NPUs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `Union[str, Path]` | The model id of the pre-trained model to be loaded. It can be a model id on the HuggingFace model hub, a local path, or the model id of a model compiled with the RBLN Compiler. | required |
| `export` | `bool` | A boolean flag to indicate whether the model should be compiled. | `False` |
| `rbln_config` | `Optional[Union[Dict, RBLNModelConfig]]` | Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., `RBLNGPT2LMHeadModelConfig`). | `None` |
| `kwargs` | `Dict[str, Any]` | Additional keyword arguments. Arguments with the prefix `rbln_` are passed to `rbln_config`, while the remaining arguments are passed to the HuggingFace library. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Self` | An RBLN model instance ready for inference on RBLN NPU devices. |
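As a minimal sketch of both usage modes (assuming `RBLNGPT2LMHeadModel` is importable from `optimum.rbln`; the `"gpt2"` model id, the local path, and the `rbln_`-prefixed values are illustrative, not prescribed):

```python
from optimum.rbln import RBLNGPT2LMHeadModel

# Compile the HuggingFace "gpt2" checkpoint for RBLN NPUs.
# Keyword arguments prefixed with "rbln_" are routed into rbln_config.
model = RBLNGPT2LMHeadModel.from_pretrained(
    "gpt2",
    export=True,            # compile the model
    rbln_batch_size=1,      # becomes batch_size=1 in rbln_config
    rbln_max_seq_len=1024,  # becomes max_seq_len=1024 in rbln_config
)

# Loading an already-compiled model skips compilation.
compiled = RBLNGPT2LMHeadModel.from_pretrained("path/to/compiled-gpt2", export=False)
```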
from_model(model, *, rbln_config=None, **kwargs) classmethod¶
Converts and compiles a pre-trained HuggingFace library model into an RBLN model. This method performs the actual model conversion and compilation process.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `PreTrainedModel` | The PyTorch model to be compiled. The object must be an instance of the HuggingFace transformers `PreTrainedModel` class. | required |
| `rbln_config` | `Optional[Union[Dict, RBLNModelConfig]]` | Configuration for RBLN model compilation and runtime. This can be provided as a dictionary or an instance of the model's configuration class (e.g., `RBLNGPT2LMHeadModelConfig`). | `None` |
| `kwargs` | `Dict[str, Any]` | Additional keyword arguments. Arguments with the prefix `rbln_` are passed to `rbln_config`, while the remaining arguments are passed to the HuggingFace library. | `{}` |
The method performs the following steps:
- Compiles the PyTorch model into an optimized RBLN graph
- Configures the model for the specified NPU device
- Creates the necessary runtime objects if requested
- Saves the compiled model and configurations
Returns:

| Type | Description |
|---|---|
| `Self` | An RBLN model instance ready for inference on RBLN NPU devices. |
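A hedged sketch of this conversion path, assuming both classes are importable from `optimum.rbln` and using illustrative configuration values:

```python
from transformers import GPT2LMHeadModel
from optimum.rbln import RBLNGPT2LMHeadModel, RBLNGPT2LMHeadModelConfig

# Load the original PyTorch model with the HuggingFace transformers library.
torch_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Convert and compile it into an RBLN model with an explicit rbln_config.
rbln_model = RBLNGPT2LMHeadModel.from_model(
    torch_model,
    rbln_config=RBLNGPT2LMHeadModelConfig(batch_size=1, max_seq_len=1024),
)
```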
save_pretrained(save_directory)¶
Saves a model and its configuration file to a directory so that it can be re-loaded using the `from_pretrained` class method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `Union[str, PathLike]` | The directory to save the model and its configuration files. Will be created if it doesn't exist. | required |
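A small sketch of the save/reload round trip (the model id and directory name are illustrative):

```python
from optimum.rbln import RBLNGPT2LMHeadModel

# Compile once, then persist the compiled model and its configuration.
model = RBLNGPT2LMHeadModel.from_pretrained("gpt2", export=True)
model.save_pretrained("gpt2-rbln")

# Later, reload the compiled artifacts without recompiling.
reloaded = RBLNGPT2LMHeadModel.from_pretrained("gpt2-rbln", export=False)
```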
generate(input_ids, attention_mask=None, max_length=None)¶
The `generate` function is used in its standard form, as in the HuggingFace transformers library. Users can use this function to generate text from the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_ids` | `LongTensor` | The sequence used as a prompt for the generation. | required |
| `attention_mask` | `Optional[Tensor]` | The attention mask to apply on the sequence. | `None` |
| `max_length` | `Optional[int]` | The maximum length of the sequence to be generated. | `None` |
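A minimal generation sketch, assuming a standard HuggingFace tokenizer and the import path used above; the prompt and `max_length` value are illustrative:

```python
from transformers import AutoTokenizer
from optimum.rbln import RBLNGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = RBLNGPT2LMHeadModel.from_pretrained("gpt2", export=True)

# Tokenize a prompt and generate a continuation on the RBLN NPU.
inputs = tokenizer("RBLN NPUs accelerate", return_tensors="pt")
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```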
Classes¶
RBLNGPT2LMHeadModelConfig¶
Bases: RBLNDecoderOnlyModelForCausalLMConfig
Configuration class for GPT-2 causal language model. Inherits from RBLNDecoderOnlyModelForCausalLMConfig with no additional parameters.
Functions¶
__init__(batch_size=None, max_seq_len=None, use_inputs_embeds=None, use_attention_mask=None, attn_impl=None, kvcache_partition_len=None, kvcache_block_size=None, quantization=None, prefill_chunk_size=None, kvcache_num_blocks=None, **kwargs)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `Optional[int]` | The batch size for inference. Defaults to 1. | `None` |
| `max_seq_len` | `Optional[int]` | The maximum sequence length supported by the model. If not provided, it is inferred from the model's configuration when possible. | `None` |
| `use_inputs_embeds` | `Optional[bool]` | Whether to use input embeddings (`inputs_embeds`) as model inputs instead of token IDs. | `None` |
| `use_attention_mask` | `Optional[bool]` | Whether the model requires attention masks during inference. This is typically determined based on the target device and model architecture. Defaults are often set automatically based on the model and RBLN NPU. | `None` |
| `attn_impl` | `Optional[str]` | Specifies the attention implementation to use. See the "Attention Implementation" section below for details. | `None` |
| `kvcache_partition_len` | `Optional[int]` | Defines the partition length for the KV cache when using `"flash_attn"`. See the "KV Cache Partition Length" section below for details. | `None` |
| `kvcache_block_size` | `Optional[int]` | Sets the size (in number of tokens) of each block in the PagedAttention KV cache. When using `"flash_attn"`, this must equal `kvcache_partition_len`. | `None` |
| `quantization` | `Optional[Dict[str, Any]]` | Configuration dictionary for applying model quantization. Specifies format, group sizes, etc. | `None` |
| `prefill_chunk_size` | `Optional[int]` | The chunk size used during the prefill phase for processing input sequences. Defaults to 128. Must be a positive integer divisible by 64. Affects prefill performance and memory usage. | `None` |
| `kvcache_num_blocks` | `Optional[int]` | The total number of blocks to allocate for the PagedAttention KV cache. See the "KV Cache Number of Blocks" section below for details. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional keyword arguments passed to the parent class constructor. | `{}` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `prefill_chunk_size` is not a positive integer divisible by 64. |
| `ValueError` | If `kvcache_partition_len` is outside the supported range when using `"flash_attn"`. |
| `ValueError` | If `max_seq_len` is incompatible with `kvcache_partition_len` when using `"flash_attn"`. |
| `ValueError` | If attention parameter constraints are violated (e.g., `kvcache_block_size` does not equal `kvcache_partition_len` when using `"flash_attn"`). |
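A hedged example of building the configuration explicitly and passing it to `from_pretrained` (assuming both classes are importable from `optimum.rbln`; the values are illustrative, not recommendations):

```python
from optimum.rbln import RBLNGPT2LMHeadModel, RBLNGPT2LMHeadModelConfig

# Explicit compile/runtime configuration for the GPT-2 causal LM.
config = RBLNGPT2LMHeadModelConfig(
    batch_size=1,
    max_seq_len=1024,
    prefill_chunk_size=128,  # must be a positive integer divisible by 64
)

model = RBLNGPT2LMHeadModel.from_pretrained("gpt2", export=True, rbln_config=config)
```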
Attention Implementation

`attn_impl` determines the underlying attention mechanism used by the model.

- `"eager"` (default if `kvcache_partition_len` is not set): Uses the standard PyTorch attention implementation. Suitable for sequences up to a certain limit (e.g., 32,768 tokens).
- `"flash_attn"`: Utilizes an optimized Flash Attention implementation, beneficial for longer sequences and potentially faster execution. Requires `max_seq_len` to be at least 8,192. If `kvcache_partition_len` is specified, `attn_impl` automatically defaults to `"flash_attn"`. When using `"flash_attn"`, `kvcache_block_size` must equal `kvcache_partition_len`.

The choice impacts performance and memory usage, especially for long sequences. Constraints related to `max_seq_len` and `kvcache_partition_len` apply when using `"flash_attn"`.
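For illustration only, here is a configuration that satisfies the `"flash_attn"` constraints above (the numbers are chosen to meet the stated rules, not as practical GPT-2 settings):

```python
from optimum.rbln import RBLNGPT2LMHeadModelConfig

# Flash Attention: kvcache_block_size must equal kvcache_partition_len,
# and max_seq_len must be a multiple of the partition length and >= 2x its value.
flash_config = RBLNGPT2LMHeadModelConfig(
    attn_impl="flash_attn",
    max_seq_len=32768,            # 2 * 16384: a multiple of and twice the partition length
    kvcache_partition_len=16384,
    kvcache_block_size=16384,     # must equal kvcache_partition_len
)
```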
KV Cache Partition Length

`kvcache_partition_len` is relevant only when `attn_impl` is `"flash_attn"`.

- It defines the length (number of tokens) of each partition within the Key-Value (KV) cache.
- Must be between 4,096 and 32,768 (inclusive).
- When using `"flash_attn"`, `max_seq_len` must be a multiple of `kvcache_partition_len` and at least twice its value (`max_seq_len >= 2 * kvcache_partition_len`).
- If `attn_impl` is `"flash_attn"` and `kvcache_partition_len` is `None`, it defaults to 16,384.
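These rules can be summarized as a small, purely illustrative check; the helper name and logic below are my own sketch of the documented constraints, not part of the library:

```python
from typing import Optional

def check_flash_attn_partitioning(max_seq_len: int, kvcache_partition_len: Optional[int]) -> int:
    """Mirror the documented "flash_attn" partition-length rules (illustrative only)."""
    partition_len = 16384 if kvcache_partition_len is None else kvcache_partition_len
    if not 4096 <= partition_len <= 32768:
        raise ValueError("kvcache_partition_len must be between 4,096 and 32,768")
    if max_seq_len % partition_len != 0 or max_seq_len < 2 * partition_len:
        raise ValueError("max_seq_len must be a multiple of kvcache_partition_len "
                         "and at least twice its value")
    return partition_len

print(check_flash_attn_partitioning(32768, None))  # -> 16384 (default partition length)
```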
KV Cache Number of Blocks

`kvcache_num_blocks` controls the total number of memory blocks allocated for the PagedAttention KV cache. Each block holds `kvcache_block_size` tokens of Key and Value states.

- Automatic Estimation (Default): If `kvcache_num_blocks` is `None`, the system estimates the maximum number of blocks that can fit into the available RBLN device memory. This calculation considers the model size (kernel memory), required buffer memory, the number of layers and heads, `kvcache_block_size`, tensor parallelism, and available RBLN NPU DRAM. This aims to maximize cache capacity for potentially better performance with long sequences or larger batches without manual tuning.
- Manual Setting: You can explicitly set the number of blocks. This provides finer control but requires careful consideration of memory limits. Setting it too high may lead to compilation errors if it exceeds available memory. The system will issue warnings if your setting exceeds the estimated maximum.
- Performance Impact: A larger number of blocks reduces the likelihood of cache eviction, which is beneficial for tasks involving many long sequences or large batch sizes, enabling higher throughput. However, allocating more blocks consumes more memory.
- Minimum Requirement: The system requires a minimum number of blocks to function, calculated from `max_seq_len`, `kvcache_block_size`, and `batch_size`. The number of allocated blocks must be sufficient to hold at least one full sequence length per item in the batch concurrently. The system will log warnings or raise errors if constraints are violated (e.g., if `kvcache_num_blocks` is less than `batch_size` when using Flash Attention); see the sketch after this list.

The optimal value depends on the specific model, task, hardware, and desired trade-off between performance and memory usage. The automatic estimation provides a robust starting point.
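As an arithmetic illustration of the minimum requirement above (my own sketch of the lower bound, not the library's exact estimator):

```python
import math

def min_kvcache_num_blocks(max_seq_len: int, kvcache_block_size: int, batch_size: int) -> int:
    """Enough blocks to hold one full sequence length per batch item concurrently."""
    return batch_size * math.ceil(max_seq_len / kvcache_block_size)

# E.g., max_seq_len=32768, kvcache_block_size=16384, batch_size=2
# -> 2 * ceil(32768 / 16384) = 4 blocks minimum.
print(min_kvcache_num_blocks(32768, 16384, 2))  # -> 4
```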