
Llama3-8B (Text Generation)

Llama3-8B is an open-source LLM (Large Language Model) developed by Meta that achieves human-level or better performance on various natural language processing tasks. Its large number of parameters and substantial memory requirements make it challenging to deploy the model on a single device.

To address this issue, Rebellions supports LLMs through tensor parallelism, referred to as RSD (Rebellions Scalable Design). Running a model on multiple devices simultaneously introduces patterns that do not appear in single-device execution. This page introduces several distinct scenarios that arise when running an LLM with RSD.

Preliminaries

Installation

$ cd RBLN_MODEL_ZOO_PATH/huggingface/text2text-generation/llama/llama3-8b
$ pip install -r requirements.txt

Compile the Model and Extract Profiled Data

# Sample Text: "Hey, are you conscious? Can you talk to me?"
$ python3 compile.py
$ RBLN_PROFILER=1 python3 inference.py

Alternatively, set rbln_activate_profiler=True when loading the compiled model to activate the RBLN Profiler.

import os
from optimum.rbln import RBLNLlamaForCausalLM

# Load the compiled model with profiling enabled
# (model_id is the Hugging Face model id used at compile time)
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=os.path.basename(model_id),
    export=False,
    rbln_activate_profiler=True,
)
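
With the profiler activated, inference itself runs as usual. The following is a minimal sketch of how the loaded model might be run on the sample prompt; the tokenizer setup and generation arguments are illustrative assumptions rather than the exact contents of inference.py.

from transformers import AutoTokenizer

# Assumption: the tokenizer is loaded from the same model id used at compile time
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Sample text from the example above
inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")

# Every prefill and decode step executed here is captured by the RBLN Profiler
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))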

Analysis of Profiled Data from Llama3-8B with RBLN Profiler

This analysis explains how layer normalization in the LlamaDecoderLayer, along with the q_proj and o_proj layers of self-attention that are split across devices by compiler sharding, handles synchronization and input/output transfers. In addition, an attention postfix is appended to the names of several commands to make the execution of the attention mechanism easier to follow.

LlamaDecoderLayer
# https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/llama/modeling_llama.py#L332
class LlamaDecoderLayer(nn.Module):
    ...
    def forward(...) -> ...:
        ...

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights = self.self_attn(
            hidden_states=hidden_states,
            ...
        )
        ...
LlamaAttention
# https://github.com/huggingface/transformers/blob/5fa35344755d8d9c29610b57d175efd03776ae9e/src/transformers/models/llama/modeling_llama.py#L269
class LlamaAttention(nn.Module):
    ...
    def forward(
        self,
        hidden_states: torch.Tensor,
        ...
    ) -> ...:
        ...
        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        ...
        attn_output = self.o_proj(attn_output)
        ...
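
These two projections are the ones split across devices by compiler sharding. The snippet below is a conceptual sketch of that split: q_proj can be partitioned along its output dimension so each device computes a slice of the query heads, while o_proj can be partitioned along its input dimension so each device produces only a partial result that must be summed afterwards. The tensor sizes and the two-way split are illustrative assumptions, not the compiler's actual partitioning.

import torch

hidden, num_devices = 1024, 2          # illustrative sizes, not the real RSD configuration
x = torch.randn(1, 8, hidden)          # [batch, seq, hidden]
q_proj = torch.nn.Linear(hidden, hidden, bias=False)
o_proj = torch.nn.Linear(hidden, hidden, bias=False)

# Column-parallel q_proj: each device computes a slice of the query heads
q_shards = q_proj.weight.chunk(num_devices, dim=0)             # split the output dim
q_parts = [x @ w.T for w in q_shards]                          # one matmul per device
assert torch.allclose(torch.cat(q_parts, dim=-1), q_proj(x), atol=1e-4)

# Row-parallel o_proj: each device consumes a slice of the attention output and
# produces a partial sum, which is why leaf outputs must be gathered to the root
attn_out = torch.randn(1, 8, hidden)
w_shards = o_proj.weight.chunk(num_devices, dim=1)             # split the input dim
in_shards = attn_out.chunk(num_devices, dim=-1)
o_parts = [xi @ wi.T for xi, wi in zip(in_shards, w_shards)]
assert torch.allclose(sum(o_parts), o_proj(attn_out), atol=1e-4)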

Layernorm Execution in Root Node

The 0_causal_lm.model.0.input_layernorm layer is small enough to run on a single device. It is therefore processed on a single device and its output is transferred to the other devices. While the root ATOM™ device processes input_layernorm, the other devices use Neural DMA to load weight parameters that will be needed later. After input_layernorm is computed, the output tensor is transferred to the leaf ATOM™ devices with Device HDMA commands.

Send Output Data and Synchronization

The output tensor of input_layernorm is transferred to the DRAM of ATOM™ 0 by Task DMA. Because the leaf ATOM™ devices need the same tensor, it is also transferred to them by Device HDMA. These transfers occur with almost no delay, allowing the leaf nodes to proceed quickly to the next layer. When tensor data is transferred by a Device HDMA command, the receiving (leaf) ATOM™ device executes a Device Sync command to guarantee data integrity. Once this command completes, each device loads the data just as it would load data transferred by External HDMA. Note that the Device HDMA command is executed on the sender, not the receiver.
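
As a rough analogy in PyTorch terms, this send-and-sync pattern resembles computing a result on one rank and broadcasting it to the others, with the receivers blocking until the data arrives. The sketch below uses torch.distributed with the gloo backend purely as an illustration; RSD itself executes compiler-generated DMA and sync commands rather than runtime collective calls.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=rank, world_size=world_size
    )
    hidden = torch.empty(1, 8, 64)
    if rank == 0:
        # Root device: compute the small input_layernorm locally
        hidden = torch.nn.functional.layer_norm(torch.randn(1, 8, 64), (64,))
    # else: in RSD, the leaf devices would prefetch upcoming weights (Neural DMA) here

    # The root sends the layernorm output to every leaf (Device HDMA in RSD terms);
    # the receivers blocking until the data arrives corresponds to Device Sync
    dist.broadcast(hidden, src=0)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)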

Gather Data to Root Node

The Device HDMA command is executed not only when an output tensor computed on the root ATOM™ device is scattered, but also when output tensors computed on leaf ATOM™ devices are gathered. The example above shows how an output tensor computed on a leaf ATOM™ device is transferred to the root ATOM™ device. Each output tensor transferred by Device HDMA is moved into the internal Shared Memory of the root ATOM™ device by Task DMA, and the final Task DMA command that retrieves the output tensor from a leaf ATOM™ device depends on the Neural Engine Clusters command that gathers the output.
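
The reverse direction can be pictured the same way: each leaf holds a partial result (for example, the partial sums produced by a row-parallel o_proj, as in the earlier sketch), and the root combines them before continuing. The snippet below, again only a torch.distributed analogy, reduces the partial outputs onto rank 0; whether leaf results are summed or concatenated in practice depends on how the layer was sharded.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size
    )
    # Each device holds its own partial output
    partial = torch.full((1, 8, 64), float(rank + 1))

    # Leaf partials are sent to the root (Device HDMA) and combined there; the copy
    # cannot start before the leaf finishes computing, mirroring the dependency of
    # the final Task DMA on the Neural Engine Clusters command
    dist.reduce(partial, dst=0, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(partial[0, 0, 0].item())  # 1.0 + 2.0 = 3.0 for a two-device example
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)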

Naming Policy for Attention

In transformer-based networks, it is important to understand how the attention mechanism works. However, because existing frameworks mainly construct models from modules, it is hard to identify when and how attention is executed. To help users, the RBLN Profiler appends an attention postfix to the commands that make up an attention operation.