Stable Diffusion 3 (Image Generation)¶
Stable Diffusion 3 (SD3) is a Multimodal Diffusion Transformer (MMDiT) model, which outperforms various state-of-the-art text-to-image generation models.
Some models, such as SD3, consist of multiple submodules, and loading all of them onto a single device may exceed that device's memory capacity. To reduce per-device memory usage, the submodules can be distributed across devices, with each device assigned to the runtime created for a given submodule.
The RBLN Compiler and Profiler support multi-device compilation and the generation of profiled data for models with multiple submodules. For example, the Profiler can generate profiled data files for each submodule in SD3, such as the text encoders, the diffusion transformer, and the VAE (Variational AutoEncoder), as well as a single profiled data file that covers all submodules of SD3.
Preliminaries¶
Installation¶
Compile the Model and Extract Profiled Data¶
Alternatively, set "activate_profiler": True to activate the RBLN Profiler in each module.
You can also profile only the submodules you want to analyze. To do so, set "activate_profiler": False for the submodules you do not want to profile.
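As a minimal sketch of profiling only a subset of submodules, the helper below builds a per-submodule configuration dictionary. The helper name build_profiler_config and the submodule keys transformer and vae are assumptions made for illustration; only "activate_profiler" and the three text encoder names come from this page, so the keys should be checked against the actual compile script before use.

```python
# Submodule names: the three text encoders are named on this page;
# "transformer" and "vae" are ASSUMED keys used only for illustration.
SD3_SUBMODULES = [
    "text_encoder",
    "text_encoder_2",
    "text_encoder_3",
    "transformer",
    "vae",
]


def build_profiler_config(profiled):
    """Return a per-submodule dict that enables profiling only for `profiled`.

    Hypothetical helper: it only builds a plain dictionary and does not call
    any RBLN API.
    """
    unknown = set(profiled) - set(SD3_SUBMODULES)
    if unknown:
        raise ValueError(f"Unknown submodules: {sorted(unknown)}")
    return {
        name: {"activate_profiler": name in profiled}
        for name in SD3_SUBMODULES
    }


# Profile only the diffusion transformer; every other submodule is disabled.
config = build_profiler_config({"transformer"})
for name, options in config.items():
    print(name, options)
```

Keeping the selection in one place makes it easy to switch which submodules are profiled between runs without editing each submodule entry by hand.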
Recommendation
The Profiler captures all iterations of the diffusion process, which may result in extended profiling times. To obtain quicker profiling results, you can set num_inference_steps=1 in inference.py.
Analysis of Profiled Data from SD3 with RBLN Profiler¶
Profiling result¶
SD3 has three text encoders (text_encoder, text_encoder_2, and text_encoder_3). According to the encode_prompt function in the pipeline_stable_diffusion_3.py code, each text encoder is called twice: once for the prompt and once for the negative prompt. Since the RBLN Profiler generates a single *.pb file each time a text encoder is called, the encode_prompt process results in six *.pb files. Additionally, since the diffusion transformer is called for a specified number of iterations (num_inference_steps), further *.pb files are generated for each iteration. Finally, when the VAE decoder produces the final image, a single *.pb file is generated. A full log *.pb file is also generated to represent the entire tracing information, displaying the sequential profiling results of all submodules.
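The file counts described above can be worked through with a short calculation. This is a sketch under one stated assumption: that the diffusion transformer produces exactly one *.pb file per iteration (the files_per_step parameter), which this page does not confirm. The function name expected_pb_files is hypothetical.

```python
def expected_pb_files(num_inference_steps, num_text_encoders=3,
                      calls_per_encoder=2, files_per_step=1):
    """Estimate how many *.pb files one profiled SD3 run produces.

    files_per_step=1 is an ASSUMPTION; adjust it if the transformer emits
    more than one file per iteration in your setup.
    """
    counts = {
        # Each text encoder is called for the prompt and the negative prompt.
        "text_encoders": num_text_encoders * calls_per_encoder,
        # The diffusion transformer runs once per inference step.
        "transformer": num_inference_steps * files_per_step,
        # The VAE decoder runs once to produce the final image.
        "vae_decoder": 1,
        # One full log file covers the whole trace.
        "full_log": 1,
    }
    counts["total"] = sum(counts.values())
    return counts


quick = expected_pb_files(num_inference_steps=1)
longer = expected_pb_files(num_inference_steps=28)
print(quick)
print(longer["total"])
```

Under that assumption, a run with num_inference_steps=1 yields 6 + 1 + 1 + 1 = 9 files, which illustrates why reducing the step count shortens profiling.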
As mentioned in inference.py, loading all submodules of SD3 on a single device may exceed the device's memory capacity. To address this, text_encoder_3 is executed on a separate device (ATOM 1). Furthermore, since the multi-device setup is mounted on a single host, Host commands are generated only on the root device (ATOM 0).
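The device split described above can be sketched as a simple mapping from submodule to device index. The placement of text_encoder_3 on device 1 follows the text; placing every other submodule on device 0, the keys transformer and vae, and the helper names are assumptions for illustration only.

```python
# ASSUMED layout: only text_encoder_3 -> device 1 is stated on this page.
DEVICE_MAP = {
    "text_encoder": 0,
    "text_encoder_2": 0,
    "text_encoder_3": 1,  # moved to a separate device to save memory
    "transformer": 0,
    "vae": 0,
}


def submodules_by_device(device_map):
    """Group submodule names by the device index they are assigned to."""
    grouped = {}
    for name, device in device_map.items():
        grouped.setdefault(device, []).append(name)
    return grouped


def root_device(device_map):
    """Return the lowest device index, treated here as the root device."""
    return min(device_map.values())


grouped = submodules_by_device(DEVICE_MAP)
for device, names in sorted(grouped.items()):
    print(f"ATOM {device}: {', '.join(names)}")
print("Host commands expected on ATOM", root_device(DEVICE_MAP))
```

Grouping by device this way gives a quick check of which submodules should appear under each ATOM row when reading the profiling result.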
Tiling large operations¶
The RBLN Compiler optimizes large operations by applying tiling, which splits one large operation into several smaller ones. As seen in the profiling results above, the tiled operations share the same name but have different indices.
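To illustrate the naming pattern, here is a generic, pure-Python sketch of row tiling for a matrix multiplication. It is not the RBLN Compiler's actual tiling strategy; the function names, the tile size, and the (name, index) trace format are all assumptions chosen to show how one operation can appear as several entries with the same name and different indices.

```python
def tile_ranges(total, tile):
    """Split range(total) into consecutive [start, stop) chunks of size `tile`."""
    return [(start, min(start + tile, total)) for start in range(0, total, tile)]


def tiled_matmul(a, b, tile_rows, name="matmul"):
    """Multiply a (m x k) by b (k x n) one row tile at a time.

    Returns the product and a trace of (name, tile_index) pairs, mimicking
    how tiled operations share a name but differ by index.
    """
    columns = list(zip(*b))  # columns of b, computed once
    result, trace = [], []
    for index, (start, stop) in enumerate(tile_ranges(len(a), tile_rows)):
        trace.append((name, index))
        for row in a[start:stop]:
            result.append([sum(x * y for x, y in zip(row, col)) for col in columns])
    return result, trace


a = [[1, 2], [3, 4], [5, 6]]
identity = [[1, 0], [0, 1]]
product, trace = tiled_matmul(a, identity, tile_rows=2)
print(product)
print(trace)
```

With three rows and a tile size of two, the single multiplication is recorded as two tiles with the same name, which is the pattern to look for in the profiler view.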