Executing transformers model with PyTorch RBLN: Llama3.2-1B

Overview

This tutorial demonstrates how to run the Llama3.2-1B model from HuggingFace Transformers.

Note

Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Setup & Installation

$ pip install \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  torch-rbln==0.1.8
$ pip install \
  --extra-index-url https://pypi.rbln.ai/simple \
  rebel-compiler==0.10.2

Note

rebel-compiler requires an RBLN Portal account.

Note

The meta-llama/Llama-3.2-1B model on the HuggingFace Hub has restricted access. Once access is granted, you can log in using the hf (HuggingFace CLI) command as shown below:

$ hf auth login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):

Llama3.2-1B Example

The following llama.py example runs the Llama3.2-1B model downloaded from the HuggingFace Hub. The code is identical to what you would run on a GPU or CPU, except that the device string is rbln instead of cuda or cpu.

llama.py
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
device = "rbln"  # the only change from a CUDA/CPU script

# Llama has no pad token by default; reuse the EOS token for padding.
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=None,  # place the model manually via .to(device) below
)
model.to(device)

prompt = "Is Sun bigger than Earth?"
inputs = tokenizer(prompt, return_tensors="pt")

# Move the input tensors to the RBLN device, just as you would with "cuda".
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=4,
    num_return_sequences=1,
    do_sample=False,  # greedy decoding; top_p/temperature are unused
    top_p=None,
    temperature=None,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length_tokens = input_ids.shape[1]
generated_text = tokenizer.decode(
    outputs[0][prompt_length_tokens:], skip_special_tokens=True
).strip()
generated_text = re.sub(r"\[duplicate\]\n?", "", generated_text)  # strip stray "[duplicate]" markers

print(f"Q: {prompt}")
print(f"A: {generated_text}")

Running the script (with device set to rbln, or cuda for comparison) should produce output similar to the following:

$ python3 llama.py
Q: Is Sun bigger than Earth?
A: The answer is yes.

Note

Eager mode on RBLN NPUs: Treat this like a normal GPU eager script; only the device argument changes to "rbln".

Shape-based compile cache: Whenever a PyTorch operator runs on the rbln device, torch-rbln calls rebel-compiler to compile that operator into an executable for RBLN NPUs. If the same operator is invoked again with identical input tensor shapes, the cached executable is reused instead of compiling from scratch each time.

Decode in this Llama example: Autoregressive generation appends one token per step, so sequence length, and often tensor shapes, change every step. The cache hits less often, recompilation happens more frequently, and wall-clock time increases even though answers are correct. Fixed-length batches amortize compile cost far better.
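The shape churn during decode, and why fixed-length batches help, can be sketched as follows. The bucket size is a hypothetical choice for illustration:

```python
# Each autoregressive decode step appends one token, so the sequence
# length (and hence the operator input shape) changes every step.
prompt_len = 7
decode_steps = 4
shapes_per_step = [(1, prompt_len + step) for step in range(decode_steps)]
print(shapes_per_step)            # [(1, 7), (1, 8), (1, 9), (1, 10)]
print(len(set(shapes_per_step)))  # 4 distinct shapes -> 4 compiles

# Padding every step to a fixed-length bucket keeps shapes stable,
# so one compiled executable is reused (hypothetical mitigation).
BUCKET = 16
padded_shapes = [(1, BUCKET) for _ in range(decode_steps)]
print(len(set(padded_shapes)))    # 1 distinct shape -> 1 compile, 3 reuses
```

Padding trades some wasted computation on pad tokens for far fewer compilations, which is why fixed-length batches amortize compile cost better.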