
Executing a Transformers model with PyTorch RBLN: Llama3.2-1B

Overview

This page demonstrates how to run the Llama3.2-1B model from HuggingFace Transformers on an RBLN device using torch-rbln.

Note

Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Setup & Installation

Note

torch-rbln requires an RBLN Portal account.

Note

The meta-llama/Llama-3.2-1B model on the HuggingFace Hub is gated. Once your access request is granted, log in with the huggingface-cli command as shown below:

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: *****
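
If you prefer to authenticate from Python instead of the CLI (for example, inside a notebook), the huggingface_hub package also provides a login() helper. The snippet below is a minimal sketch; the token string is a placeholder, not a real token:

from huggingface_hub import login

# Placeholder token -- replace with your own access token
# generated at https://huggingface.co/settings/tokens.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")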

Llama3.2-1B Example

The following llama.py is an example of running the Llama3.2-1B model downloaded from the HuggingFace Hub. The script is the same code you would run on a GPU or CPU; the only difference is that the device is set to rbln instead of cuda or cpu.

llama.py
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-1B"
device = "rbln"  # use "cuda" or "cpu" to run the same script on a GPU or CPU

# Load the tokenizer and make sure a pad token is defined for generation.
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model weights in float16 and move the model to the RBLN device.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=None,
)
model.to(device)

# Tokenize the prompt and move the input tensors to the same device as the model.
prompt = "Is Sun bigger than Earth?"
inputs = tokenizer(prompt, return_tensors="pt")

input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Greedy decoding: sampling is disabled, so the output is deterministic.
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=4,
    num_return_sequences=1,
    do_sample=False,
    top_p=None,
    temperature=None,
)

# Drop the prompt tokens from the output and decode only the newly generated text.
prompt_length_tokens = input_ids.shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length_tokens:], skip_special_tokens=True).strip()
# Remove any literal "[duplicate]" markers from the decoded text.
generated_text = re.sub(r"\[duplicate\]\n?", "", generated_text)

print(f"Q: {prompt}")
print(f"A: {generated_text}")

If you run the above script with either cuda or rbln as the device, you should see the same result:

$ python3 llama.py
Q: Is Sun bigger than Earth?
A: The answer is yes.
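
Because nothing else in the script changes between backends, you can make the device a runtime choice and reuse the same file on CPU, GPU, and RBLN. The sketch below is illustrative only; the --device flag and parse_device helper are not part of torch-rbln, and the returned string is handed to .to() exactly as in llama.py above:

import argparse

# Hypothetical helper: pick the device string ("cpu", "cuda", or "rbln")
# from the command line and reuse it for model.to(...) and tensor.to(...).
def parse_device() -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="rbln", choices=["cpu", "cuda", "rbln"])
    return parser.parse_args().device

device = parse_device()
print(f"Running on device: {device}")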

Note

You can execute the model using the same code as GPU eager mode; the only change is setting the device to "rbln". torch-rbln compiles and executes each operation using rebel_compiler. When an op is executed again with an input tensor of the same shape, the cached executable is reused, avoiding recompilation. In this Llama model, however, the input sequence length changes at every token-generation step, so each step triggers a new compilation, and the compilation overhead affects the overall execution speed. The model still functions correctly, but it may run slower than it would with fixed-shape inputs.
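
To observe the shape-based executable cache described above, you can time two identical generate() calls: the second call repeats exactly the same tensor shapes as the first, so the cached executables can be reused. This is a minimal sketch intended to be appended to llama.py, and it assumes the compilation cache persists across calls within the same process; the printed timings are illustrative, not measured results:

import time

# First call: each op is compiled by rebel_compiler as it is executed.
start = time.perf_counter()
model.generate(input_ids, attention_mask=attention_mask,
               pad_token_id=tokenizer.pad_token_id, max_new_tokens=4, do_sample=False)
first = time.perf_counter() - start

# Second call with the same prompt: tensor shapes repeat, so cached
# executables can be reused and recompilation is skipped.
start = time.perf_counter()
model.generate(input_ids, attention_mask=attention_mask,
               pad_token_id=tokenizer.pad_token_id, max_new_tokens=4, do_sample=False)
second = time.perf_counter() - start

print(f"first run:  {first:.2f} s (includes compilation)")
print(f"second run: {second:.2f} s (cached executables)")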