Stable Diffusion 3¶
Stable Diffusion 3 (SD3) is the latest generation of text-to-image models from Stability AI, featuring the Multimodal Diffusion Transformer (MMDiT) architecture. It excels at handling complex prompts involving multiple subjects, spatial relationships, and varied text styles. SD3 uses three distinct text encoders (CLIP-L, OpenCLIP-G, and T5-XXL) for improved prompt interpretation. Stable Diffusion 3 pipelines can be accelerated on RBLN NPUs using Optimum RBLN.
Supported Pipelines¶
Optimum RBLN supports several Stable Diffusion 3 pipelines:
- Text-to-Image: Generate high-quality images from text prompts.
- Image-to-Image: Modify existing images guided by text prompts.
- Inpainting: Fill masked regions of an image according to text prompts.
Key Classes¶
- RBLNStableDiffusion3Pipeline: Text-to-image pipeline for Stable Diffusion 3.
- RBLNStableDiffusion3PipelineConfig: Configuration for the text-to-image pipeline.
- RBLNStableDiffusion3Img2ImgPipeline: Image-to-image pipeline for Stable Diffusion 3.
- RBLNStableDiffusion3Img2ImgPipelineConfig: Configuration for the image-to-image pipeline.
- RBLNStableDiffusion3InpaintPipeline: Inpainting pipeline for Stable Diffusion 3.
- RBLNStableDiffusion3InpaintPipelineConfig: Configuration for the inpainting pipeline.
Important: Batch Size Configuration and Guidance Scale¶
Batch Size and Guidance Scale
When Stable Diffusion 3 is used with a guidance scale > 1.0 (the default is typically 5.0-7.0), classifier-free guidance doubles the effective batch size of the MMDiT transformer at runtime.
Because RBLN NPUs use static graph compilation, the transformer's batch size at compile time must match the batch size used at runtime; otherwise, an error occurs during inference.
Default Behavior¶
If you do not explicitly specify the transformer's batch size, Optimum RBLN behaves as follows:
- It assumes a guidance scale > 1.0 will be used.
- It automatically sets the transformer's batch size to twice the pipeline batch size.
If you plan to use the default guidance scale (a value greater than 1.0), this automatic configuration works correctly. However, if you use a different guidance scale or need more control, you should configure the transformer's batch size explicitly.
Example: Explicitly Setting the Transformer Batch Size (guidance_scale > 1.0)¶
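The following sketch shows one way to do this, assuming a pipeline batch size of 1 and the nested per-submodule `rbln_config` settings described in the API reference below; the checkpoint ID is only an example.

```python
from optimum.rbln import RBLNStableDiffusion3Pipeline

# Pipeline batch size is 1, but classifier-free guidance doubles the
# transformer's effective batch size at runtime, so compile it with 2.
pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # example checkpoint
    export=True,
    rbln_config={
        "batch_size": 1,
        "transformer": {"batch_size": 2},
    },
)
pipe.save_pretrained("sd3-compiled")
```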
Example: Disabling Guidance (guidance_scale = 0.0)¶
If you plan to use a guidance_scale of exactly 0.0 (disabling classifier-free guidance), you must explicitly set the transformer batch size to match the inference batch size:
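A minimal sketch of this case, under the same assumptions as the previous example (nested `rbln_config` submodule settings, example checkpoint ID):

```python
from optimum.rbln import RBLNStableDiffusion3Pipeline

# With guidance disabled, the transformer sees the same batch size as the
# pipeline, so both are compiled with batch_size=1.
pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    export=True,
    rbln_config={
        "batch_size": 1,
        "transformer": {"batch_size": 1},
    },
)

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    guidance_scale=0.0,  # must stay at 0.0 to match the compiled batch size
).images[0]
```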
Usage Example (Text-to-Image)¶
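A minimal end-to-end sketch: compile the model once with export=True, save the artifacts, and generate an image. The checkpoint ID, prompt, and sampling arguments are illustrative; the generation call follows the standard diffusers StableDiffusion3Pipeline interface.

```python
from optimum.rbln import RBLNStableDiffusion3Pipeline

# One-time compilation for RBLN NPUs.
pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    export=True,
)
pipe.save_pretrained("sd3-compiled")

# Text-to-image generation with the compiled pipeline.
image = pipe(
    "a photo of an astronaut riding a horse on the moon",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("astronaut.png")
```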
API Reference¶
Classes¶
RBLNStableDiffusion3Pipeline¶
Bases: RBLNDiffusionMixin, StableDiffusion3Pipeline
Functions¶
from_pretrained(model_id, *, export=False, model_save_dir=None, rbln_config={}, lora_ids=None, lora_weights_names=None, lora_scales=None, **kwargs) classmethod¶
Load a pretrained diffusion pipeline from a model checkpoint, with optional compilation for RBLN NPUs.
This method has two distinct operating modes:
- When export=True: Takes a PyTorch-based diffusion model, compiles it for RBLN NPUs, and loads the compiled model.
- When export=False: Loads an already compiled RBLN model from model_id without recompilation.
It supports various diffusion pipelines including Stable Diffusion, Kandinsky, ControlNet, and other diffusers-based models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The model ID or path to the pretrained model to load. Can be either a model ID on the Hugging Face Hub or a path to a local directory containing the model. | required |
| `export` | `bool` | If True, takes a PyTorch model from `model_id` and compiles it for RBLN NPUs. If False, loads an already compiled RBLN model without recompilation. | `False` |
| `model_save_dir` | `Optional[PathLike]` | Directory to save the compiled model artifacts. Only used when `export=True`. | `None` |
| `rbln_config` | `Dict[str, Any]` | Configuration options for RBLN compilation. Can include settings for specific submodules such as `transformer`, `text_encoder`, and `vae`. | `{}` |
| `lora_ids` | `Optional[Union[str, List[str]]]` | LoRA adapter ID(s) to load and apply before compilation. LoRA weights are fused into the model weights during compilation. Only used when `export=True`. | `None` |
| `lora_weights_names` | `Optional[Union[str, List[str]]]` | Names of specific LoRA weight files to load, corresponding to `lora_ids`. Only used when `export=True`. | `None` |
| `lora_scales` | `Optional[Union[float, List[float]]]` | Scaling factor(s) to apply to the LoRA adapter(s). Only used when `export=True`. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments to pass to the underlying diffusion pipeline constructor or the RBLN compilation process. These may include parameters specific to individual submodules or the particular diffusion pipeline being used. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Self` | A compiled diffusion pipeline that can be used for inference on RBLN NPU. The returned object is an instance of the class that called this method, inheriting from RBLNDiffusionMixin. |
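For reference, a short sketch of the export=False path: reloading a pipeline that was previously compiled and saved. The directory name matches the earlier examples and is only an assumption.

```python
from optimum.rbln import RBLNStableDiffusion3Pipeline

# Load a previously compiled pipeline; export defaults to False,
# so nothing is recompiled.
pipe = RBLNStableDiffusion3Pipeline.from_pretrained("sd3-compiled")

image = pipe("a cozy cabin in a snowy forest").images[0]
```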
RBLNStableDiffusion3InpaintPipeline¶
Bases: RBLNDiffusionMixin, StableDiffusion3InpaintPipeline
Functions¶
from_pretrained(model_id, *, export=False, model_save_dir=None, rbln_config={}, lora_ids=None, lora_weights_names=None, lora_scales=None, **kwargs) classmethod¶
Load a pretrained diffusion pipeline from a model checkpoint, with optional compilation for RBLN NPUs.
This method has two distinct operating modes:
- When export=True: Takes a PyTorch-based diffusion model, compiles it for RBLN NPUs, and loads the compiled model.
- When export=False: Loads an already compiled RBLN model from model_id without recompilation.
It supports various diffusion pipelines including Stable Diffusion, Kandinsky, ControlNet, and other diffusers-based models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The model ID or path to the pretrained model to load. Can be either a model ID on the Hugging Face Hub or a path to a local directory containing the model. | required |
| `export` | `bool` | If True, takes a PyTorch model from `model_id` and compiles it for RBLN NPUs. If False, loads an already compiled RBLN model without recompilation. | `False` |
| `model_save_dir` | `Optional[PathLike]` | Directory to save the compiled model artifacts. Only used when `export=True`. | `None` |
| `rbln_config` | `Dict[str, Any]` | Configuration options for RBLN compilation. Can include settings for specific submodules such as `transformer`, `text_encoder`, and `vae`. | `{}` |
| `lora_ids` | `Optional[Union[str, List[str]]]` | LoRA adapter ID(s) to load and apply before compilation. LoRA weights are fused into the model weights during compilation. Only used when `export=True`. | `None` |
| `lora_weights_names` | `Optional[Union[str, List[str]]]` | Names of specific LoRA weight files to load, corresponding to `lora_ids`. Only used when `export=True`. | `None` |
| `lora_scales` | `Optional[Union[float, List[float]]]` | Scaling factor(s) to apply to the LoRA adapter(s). Only used when `export=True`. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments to pass to the underlying diffusion pipeline constructor or the RBLN compilation process. These may include parameters specific to individual submodules or the particular diffusion pipeline being used. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Self` | A compiled diffusion pipeline that can be used for inference on RBLN NPU. The returned object is an instance of the class that called this method, inheriting from RBLNDiffusionMixin. |
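As a usage illustration for this class, here is a hedged sketch of compiling and running the inpainting pipeline. The checkpoint ID, file paths, and top-level rbln_config keys (mirroring the pipeline config parameters documented below) are assumptions; the generation call follows the underlying diffusers StableDiffusion3InpaintPipeline interface.

```python
from diffusers.utils import load_image
from optimum.rbln import RBLNStableDiffusion3InpaintPipeline

pipe = RBLNStableDiffusion3InpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    export=True,
    rbln_config={"img_height": 1024, "img_width": 1024},
)

# Inputs should match the resolution the pipeline was compiled for.
init_image = load_image("path/to/source.png").resize((1024, 1024))
mask_image = load_image("path/to/mask.png").resize((1024, 1024))

image = pipe(
    "a modern armchair",
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
).images[0]
image.save("inpainted.png")
```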
RBLNStableDiffusion3Img2ImgPipeline¶
Bases: RBLNDiffusionMixin, StableDiffusion3Img2ImgPipeline
Functions¶
from_pretrained(model_id, *, export=False, model_save_dir=None, rbln_config={}, lora_ids=None, lora_weights_names=None, lora_scales=None, **kwargs) classmethod¶
Load a pretrained diffusion pipeline from a model checkpoint, with optional compilation for RBLN NPUs.
This method has two distinct operating modes:
- When export=True: Takes a PyTorch-based diffusion model, compiles it for RBLN NPUs, and loads the compiled model.
- When export=False: Loads an already compiled RBLN model from model_id without recompilation.
It supports various diffusion pipelines including Stable Diffusion, Kandinsky, ControlNet, and other diffusers-based models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The model ID or path to the pretrained model to load. Can be either a model ID on the Hugging Face Hub or a path to a local directory containing the model. | required |
| `export` | `bool` | If True, takes a PyTorch model from `model_id` and compiles it for RBLN NPUs. If False, loads an already compiled RBLN model without recompilation. | `False` |
| `model_save_dir` | `Optional[PathLike]` | Directory to save the compiled model artifacts. Only used when `export=True`. | `None` |
| `rbln_config` | `Dict[str, Any]` | Configuration options for RBLN compilation. Can include settings for specific submodules such as `transformer`, `text_encoder`, and `vae`. | `{}` |
| `lora_ids` | `Optional[Union[str, List[str]]]` | LoRA adapter ID(s) to load and apply before compilation. LoRA weights are fused into the model weights during compilation. Only used when `export=True`. | `None` |
| `lora_weights_names` | `Optional[Union[str, List[str]]]` | Names of specific LoRA weight files to load, corresponding to `lora_ids`. Only used when `export=True`. | `None` |
| `lora_scales` | `Optional[Union[float, List[float]]]` | Scaling factor(s) to apply to the LoRA adapter(s). Only used when `export=True`. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments to pass to the underlying diffusion pipeline constructor or the RBLN compilation process. These may include parameters specific to individual submodules or the particular diffusion pipeline being used. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Self` | A compiled diffusion pipeline that can be used for inference on RBLN NPU. The returned object is an instance of the class that called this method, inheriting from RBLNDiffusionMixin. |
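Similarly, a hedged sketch for the image-to-image pipeline, under the same assumptions about checkpoint ID, file paths, and rbln_config keys; the call follows the diffusers StableDiffusion3Img2ImgPipeline interface.

```python
from diffusers.utils import load_image
from optimum.rbln import RBLNStableDiffusion3Img2ImgPipeline

pipe = RBLNStableDiffusion3Img2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    export=True,
    rbln_config={"img_height": 1024, "img_width": 1024},
)

# The input image should match the compiled resolution.
init_image = load_image("path/to/sketch.png").resize((1024, 1024))

image = pipe(
    "a detailed oil painting based on this sketch",
    image=init_image,
    strength=0.6,
).images[0]
image.save("img2img.png")
```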
Classes¶
RBLNStableDiffusion3PipelineBaseConfig¶
Bases: RBLNModelConfig
Functions¶
__init__(transformer=None, text_encoder=None, text_encoder_2=None, text_encoder_3=None, vae=None, *, max_seq_len=None, sample_size=None, image_size=None, batch_size=None, img_height=None, img_width=None, guidance_scale=None, **kwargs)¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `transformer` | `Optional[RBLNSD3Transformer2DModelConfig]` | Configuration for the transformer model component. Initialized as RBLNSD3Transformer2DModelConfig if not provided. | `None` |
| `text_encoder` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the primary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_2` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the secondary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_3` | `Optional[RBLNT5EncoderModelConfig]` | Configuration for the tertiary text encoder. Initialized as RBLNT5EncoderModelConfig if not provided. | `None` |
| `vae` | `Optional[RBLNAutoencoderKLConfig]` | Configuration for the VAE model component. Initialized as RBLNAutoencoderKLConfig if not provided. | `None` |
| `max_seq_len` | `Optional[int]` | Maximum sequence length for text inputs. Defaults to 256. | `None` |
| `sample_size` | `Optional[Tuple[int, int]]` | Spatial dimensions for the transformer model. | `None` |
| `image_size` | `Optional[Tuple[int, int]]` | Dimensions for the generated images. Cannot be used together with img_height/img_width. | `None` |
| `batch_size` | `Optional[int]` | Batch size for inference, applied to all submodules. | `None` |
| `img_height` | `Optional[int]` | Height of the generated images. | `None` |
| `img_width` | `Optional[int]` | Width of the generated images. | `None` |
| `guidance_scale` | `Optional[float]` | Scale for classifier-free guidance. Deprecated parameter. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments passed to the parent RBLNModelConfig. | `{}` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If both image_size and img_height/img_width are provided. |
Note
When guidance_scale > 1.0, the transformer batch size is automatically doubled to accommodate classifier-free guidance.
RBLNStableDiffusion3PipelineConfig¶
Bases: RBLNStableDiffusion3PipelineBaseConfig
Functions¶
__init__(transformer=None, text_encoder=None, text_encoder_2=None, text_encoder_3=None, vae=None, *, max_seq_len=None, sample_size=None, image_size=None, batch_size=None, img_height=None, img_width=None, guidance_scale=None, **kwargs)¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `transformer` | `Optional[RBLNSD3Transformer2DModelConfig]` | Configuration for the transformer model component. Initialized as RBLNSD3Transformer2DModelConfig if not provided. | `None` |
| `text_encoder` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the primary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_2` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the secondary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_3` | `Optional[RBLNT5EncoderModelConfig]` | Configuration for the tertiary text encoder. Initialized as RBLNT5EncoderModelConfig if not provided. | `None` |
| `vae` | `Optional[RBLNAutoencoderKLConfig]` | Configuration for the VAE model component. Initialized as RBLNAutoencoderKLConfig if not provided. | `None` |
| `max_seq_len` | `Optional[int]` | Maximum sequence length for text inputs. Defaults to 256. | `None` |
| `sample_size` | `Optional[Tuple[int, int]]` | Spatial dimensions for the transformer model. | `None` |
| `image_size` | `Optional[Tuple[int, int]]` | Dimensions for the generated images. Cannot be used together with img_height/img_width. | `None` |
| `batch_size` | `Optional[int]` | Batch size for inference, applied to all submodules. | `None` |
| `img_height` | `Optional[int]` | Height of the generated images. | `None` |
| `img_width` | `Optional[int]` | Width of the generated images. | `None` |
| `guidance_scale` | `Optional[float]` | Scale for classifier-free guidance. Deprecated parameter. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments passed to the parent RBLNModelConfig. | `{}` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If both image_size and img_height/img_width are provided. |
Note
When guidance_scale > 1.0, the transformer batch size is automatically doubled to accommodate classifier-free guidance.
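A short sketch of constructing this configuration class directly with illustrative values. Passing the config instance to from_pretrained through rbln_config (rather than the dictionary form shown earlier) is an assumption and may depend on the optimum-rbln version.

```python
from optimum.rbln import (
    RBLNStableDiffusion3Pipeline,
    RBLNStableDiffusion3PipelineConfig,
)

# Illustrative values: 1024x1024 output at batch size 1.
config = RBLNStableDiffusion3PipelineConfig(
    batch_size=1,
    img_height=1024,
    img_width=1024,
)

# Assumption: from_pretrained accepts a config instance in place of a dict.
pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    export=True,
    rbln_config=config,
)
```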
RBLNStableDiffusion3Img2ImgPipelineConfig¶
Bases: RBLNStableDiffusion3PipelineBaseConfig
Functions¶
__init__(transformer=None, text_encoder=None, text_encoder_2=None, text_encoder_3=None, vae=None, *, max_seq_len=None, sample_size=None, image_size=None, batch_size=None, img_height=None, img_width=None, guidance_scale=None, **kwargs)¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `transformer` | `Optional[RBLNSD3Transformer2DModelConfig]` | Configuration for the transformer model component. Initialized as RBLNSD3Transformer2DModelConfig if not provided. | `None` |
| `text_encoder` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the primary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_2` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the secondary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_3` | `Optional[RBLNT5EncoderModelConfig]` | Configuration for the tertiary text encoder. Initialized as RBLNT5EncoderModelConfig if not provided. | `None` |
| `vae` | `Optional[RBLNAutoencoderKLConfig]` | Configuration for the VAE model component. Initialized as RBLNAutoencoderKLConfig if not provided. | `None` |
| `max_seq_len` | `Optional[int]` | Maximum sequence length for text inputs. Defaults to 256. | `None` |
| `sample_size` | `Optional[Tuple[int, int]]` | Spatial dimensions for the transformer model. | `None` |
| `image_size` | `Optional[Tuple[int, int]]` | Dimensions for the generated images. Cannot be used together with img_height/img_width. | `None` |
| `batch_size` | `Optional[int]` | Batch size for inference, applied to all submodules. | `None` |
| `img_height` | `Optional[int]` | Height of the generated images. | `None` |
| `img_width` | `Optional[int]` | Width of the generated images. | `None` |
| `guidance_scale` | `Optional[float]` | Scale for classifier-free guidance. Deprecated parameter. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments passed to the parent RBLNModelConfig. | `{}` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If both image_size and img_height/img_width are provided. |
Note
When guidance_scale > 1.0, the transformer batch size is automatically doubled to accommodate classifier-free guidance.
RBLNStableDiffusion3InpaintPipelineConfig¶
Bases: RBLNStableDiffusion3PipelineBaseConfig
Functions¶
__init__(transformer=None, text_encoder=None, text_encoder_2=None, text_encoder_3=None, vae=None, *, max_seq_len=None, sample_size=None, image_size=None, batch_size=None, img_height=None, img_width=None, guidance_scale=None, **kwargs)¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `transformer` | `Optional[RBLNSD3Transformer2DModelConfig]` | Configuration for the transformer model component. Initialized as RBLNSD3Transformer2DModelConfig if not provided. | `None` |
| `text_encoder` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the primary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_2` | `Optional[RBLNCLIPTextModelWithProjectionConfig]` | Configuration for the secondary text encoder. Initialized as RBLNCLIPTextModelWithProjectionConfig if not provided. | `None` |
| `text_encoder_3` | `Optional[RBLNT5EncoderModelConfig]` | Configuration for the tertiary text encoder. Initialized as RBLNT5EncoderModelConfig if not provided. | `None` |
| `vae` | `Optional[RBLNAutoencoderKLConfig]` | Configuration for the VAE model component. Initialized as RBLNAutoencoderKLConfig if not provided. | `None` |
| `max_seq_len` | `Optional[int]` | Maximum sequence length for text inputs. Defaults to 256. | `None` |
| `sample_size` | `Optional[Tuple[int, int]]` | Spatial dimensions for the transformer model. | `None` |
| `image_size` | `Optional[Tuple[int, int]]` | Dimensions for the generated images. Cannot be used together with img_height/img_width. | `None` |
| `batch_size` | `Optional[int]` | Batch size for inference, applied to all submodules. | `None` |
| `img_height` | `Optional[int]` | Height of the generated images. | `None` |
| `img_width` | `Optional[int]` | Width of the generated images. | `None` |
| `guidance_scale` | `Optional[float]` | Scale for classifier-free guidance. Deprecated parameter. | `None` |
| `**kwargs` | `Dict[str, Any]` | Additional arguments passed to the parent RBLNModelConfig. | `{}` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If both image_size and img_height/img_width are provided. |
Note
When guidance_scale > 1.0, the transformer batch size is automatically doubled to accommodate classifier-free guidance.