
Stable Diffusion 3 (Image Generation)

Stable Diffusion 3 (SD3) is a Multimodal Diffusion Transformer (MMDiT) model, which outperforms various state-of-the-art text-to-image generation models.

Some models, such as SD3, consist of multiple submodules, and loading all submodules onto a single device may occasionally exceed the device's memory capacity. To optimize memory utilization, the available devices can be divided up and individually assigned to the runtimes created for each submodule.

The RBLN Compiler and Profiler support multi-device compilation and the generation of profiled data for models with multiple submodules. For example, the Profiler can generate multiple profiled data files for the submodules in SD3, such as the text encoders, the diffusion transformer, and the VAE (Variational AutoEncoder), as well as a single profiled data file that includes all submodules of SD3.

Preliminaries

Installation

$ cd RBLN_MODEL_ZOO_PATH/rbln-model-zoo/huggingface/stable-diffusion/stable_diffusion_3_t2i
$ pip install -r requirements.txt

Compile the Model and Extract Profiled Data

# Sample Text: "a photo of a cat holding a sign that says hello world"
$ python3 compile.py
$ RBLN_PROFILER=1 python3 inference.py

Alternatively, set "activate_profiler": True to activate the RBLN Profiler for each submodule.

# rbln_model_zoo/huggingface/stable-diffusion/stable_diffusion_3_t2i/inference.py
...
    pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
        ...
        rbln_config={
            ...
            "text_encoder": {"device": 0, "activate_profiler": True},
            "text_encoder_2": {"device": 0, "activate_profiler": True},
            "text_encoder_3": {"device": 1, "activate_profiler": True},
            "transformer": {"device": 0, "activate_profiler": True},
            "vae": {"device": 0, "activate_profiler": True},
        },
    )
...

You can also profile only the subset of submodules that you want to analyze. To do so, set "activate_profiler": False for the submodules you don't want to profile, as in the sketch below.
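For example, here is a minimal sketch of such a configuration (with the model path and other arguments elided, as in the snippet above) that profiles only the diffusion transformer and the VAE while skipping the text encoders:

# Hypothetical variant of inference.py: profile only the transformer and the VAE,
# keeping the text encoders out of the profiled data.
    pipe = RBLNStableDiffusion3Pipeline.from_pretrained(
        ...
        rbln_config={
            ...
            "text_encoder": {"device": 0, "activate_profiler": False},
            "text_encoder_2": {"device": 0, "activate_profiler": False},
            "text_encoder_3": {"device": 1, "activate_profiler": False},
            "transformer": {"device": 0, "activate_profiler": True},
            "vae": {"device": 0, "activate_profiler": True},
        },
    )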

Recommendation

The Profiler captures all iterations of the diffusion process, which may result in extended profiling times. To obtain quicker profiling results, you can set num_inference_steps=1 in inference.py.

# rbln_model_zoo/huggingface/stable-diffusion/stable_diffusion_3_t2i/inference.py
...
# Generate image
image = pipe(
    # prompt, num_inference_steps=28, height=1024, width=1024, guidance_scale=7.0
    prompt, num_inference_steps=1, height=1024, width=1024, guidance_scale=7.0
).images[0]
...

Analysis of Profiled Data from SD3 with RBLN Profiler

Profiling result

# https://github.com/huggingface/diffusers/blob/78bc824729f76a14ff2f211fc7f9a31e5500a41e/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L343
class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin):
...
    def encode_prompt(
        ...
        ):
      ...
      prompt_embed, pooled_prompt_embed = self._get_clip_prompt_embeds(
          prompt=prompt,
          ...
      )
      prompt_2_embed, pooled_prompt_2_embed = self._get_clip_prompt_embeds(
          prompt=prompt_2,
          ...
      )
      ...
      t5_prompt_embed = self._get_t5_prompt_embeds(
          prompt=prompt_3,
          ...
      )
      ...
      negative_prompt_embed, negative_pooled_prompt_embed = self._get_clip_prompt_embeds(
          negative_prompt,
          ...
      )
      negative_prompt_2_embed, negative_pooled_prompt_2_embed = self._get_clip_prompt_embeds(
          negative_prompt_2,
          ...
      )
      t5_negative_prompt_embed = self._get_t5_prompt_embeds(
          prompt=negative_prompt_3,
          ...
      )
      ...
      return prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds

SD3 has three text encoders (text_encoder, text_encoder_2, and text_encoder_3). According to the encode_prompt function in pipeline_stable_diffusion_3.py, each text encoder is called twice, once for the prompt and once for the negative prompt. Since the RBLN Profiler generates a single *.pb file each time a text encoder is called, the encode_prompt process results in six *.pb files. Additionally, since the diffusion transformer is called for a specified number of iterations (num_inference_steps), multiple *.pb files are generated, one for each iteration. Finally, when the VAE decoder produces the final image, a single *.pb file is generated. A full log *.pb file is also generated to represent the entire tracing information, displaying the sequential profiling results of all submodules.
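Putting these counts together, the total number of *.pb files produced for a single image generation can be sanity-checked with a small sketch (the function below is hypothetical; the counts simply follow the description above):

# Sketch of the *.pb file count implied by the description above (not a Profiler API).
def expected_pb_files(num_inference_steps: int) -> int:
    text_encoder_files = 3 * 2               # three text encoders, each called for a prompt and a negative prompt
    transformer_files = num_inference_steps  # one *.pb file per denoising iteration
    vae_decoder_files = 1                    # final image decoding
    full_log_file = 1                        # single *.pb file covering all submodules
    return text_encoder_files + transformer_files + vae_decoder_files + full_log_file

print(expected_pb_files(28))  # 36 with the default num_inference_steps=28
print(expected_pb_files(1))   # 9 with the quick-profiling setting num_inference_steps=1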

As shown in inference.py above, loading all submodules of SD3 onto a single device may exceed the device's memory capacity. To address this, text_encoder_3 is executed on a separate device (ATOM 1). Furthermore, since the multi-device setup is mounted on a single host, Host commands are generated only on the root device (ATOM 0).

Tiling large operations

The RBLN Compiler optimizes large operations efficiently by applying the tiling method. As seen in the profiling results above, the tiled operations share the same name but have different indices.