Executing transformers model with PyTorch RBLN: Llama3.2-1B¶
Overview¶
This tutorial demonstrates how to run the Llama3.2-1B model from HuggingFace Transformers on RBLN NPUs using torch-rbln.
Note
Llama 3 is licensed under the LLAMA Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Setup & Installation¶
- System requirements
- Python: 3.10–3.13
- RBLN Driver
- Major package requirements
- PyTorch
- transformers: >=4.49.0,<4.50.0
- RBLN Compiler
- torch-rbln
- HuggingFace Hub CLI (huggingface_hub[cli])
- Installation
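A minimal installation sketch is shown below. The exact package names and any private package index are assumptions here; consult the RBLN documentation and your RBLN Portal account for the authoritative install commands.

```shell
# Hypothetical sketch -- package names and index URL may differ;
# rebel-compiler requires an RBLN Portal account to download.
pip install torch-rbln rebel-compiler
pip install "transformers>=4.49.0,<4.50.0" "huggingface_hub[cli]"
```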
Note
rebel-compiler requires an RBLN Portal account.
Note
The meta-llama/Llama-3.2-1B model on HuggingFace Hub has restricted access. Once access is granted, log in using the hf (HuggingFace CLI) command as shown below:
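For example, with a recent huggingface_hub installed, you can authenticate interactively with an access token that has been granted access to the gated model:

```shell
# Prompts for a HuggingFace access token and stores it locally.
# The token's account must already have access to meta-llama/Llama-3.2-1B.
hf auth login
```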
Llama3.2-1B Example¶
The following llama.py example runs the Llama3.2-1B model downloaded from HuggingFace Hub. The code is identical to what you would run on a GPU or CPU, except that the device string is rbln instead of cuda or cpu.
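A sketch of what such a llama.py might look like is below. The torch_rbln import name and the exact generation parameters are assumptions; only the device string "rbln" is specific to RBLN NPUs, and the rest is standard HuggingFace Transformers usage.

```python
# llama.py -- hypothetical sketch; exact torch-rbln usage may differ.
import torch
import torch_rbln  # assumed import name; registers the "rbln" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
device = "rbln"  # swap in "cuda" or "cpu" to run the same script elsewhere

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to(device).eval()

inputs = tokenizer("The capital of France is", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that the first run compiles each operator for the shapes it encounters, so it is slower than subsequent runs that hit the shape-based compile cache.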
If you run the above script with cuda or rbln, you should see output similar to the following:
Note
Eager mode on RBLN NPUs: Treat this like a normal GPU eager script; only the device argument changes to "rbln".
Shape-based compile cache: Whenever a PyTorch operator runs on the rbln device, torch-rbln calls rebel-compiler to compile that operator into an executable for RBLN NPUs. If the same operator is invoked again with identical input tensor shapes, the cached executable is reused instead of compiling from scratch each time.
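The caching behavior described above can be illustrated with a plain Python sketch. The function names below are illustrative stand-ins, not the real torch-rbln internals; the point is that the cache key includes the input shape, so a repeated (operator, shape) pair skips compilation.

```python
# Illustrative model of a shape-keyed compile cache (not real torch-rbln code).
compile_cache = {}
compile_count = 0

def compile_for_shape(op_name, shape):
    """Stand-in for rebel-compiler: builds an executable for one shape."""
    global compile_count
    compile_count += 1
    return f"{op_name}-executable-{shape}"

def run_op(op_name, shape):
    """Reuse the cached executable when the (op, shape) pair was seen before."""
    key = (op_name, shape)
    if key not in compile_cache:
        compile_cache[key] = compile_for_shape(op_name, shape)
    return compile_cache[key]

run_op("matmul", (1, 128))
run_op("matmul", (1, 128))  # same shape: cache hit, no recompile
run_op("matmul", (1, 129))  # new shape: compiles again
print(compile_count)  # 2 compilations for 3 invocations
```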
Decode in this Llama example: Autoregressive generation appends one token per step, so the sequence length, and with it many tensor shapes, changes at every step. The cache hits less often, recompilation happens more frequently, and wall-clock time increases even though the outputs are correct. Fixed-length (padded) batches amortize compile cost far better.
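To make the trade-off concrete, the sketch below counts how many distinct shapes (and therefore compilations, under the shape-keyed cache model) a 16-step decode produces when the sequence grows by one token per step, versus when every step is padded to a fixed bucket length. The bucket size of 32 is an arbitrary illustrative choice.

```python
# Illustrative comparison: growing decode shapes vs. a fixed padded shape.
def count_compiles(shapes):
    """Each previously unseen shape triggers one compilation."""
    seen = set()
    compiles = 0
    for shape in shapes:
        if shape not in seen:
            seen.add(shape)
            compiles += 1
    return compiles

# Naive decode: the sequence grows by one token each of 16 steps,
# so every step presents a brand-new shape.
growing = [(1, 16 + step) for step in range(16)]

# Fixed-length approach: pad every step to a bucket of 32 tokens,
# so all 16 steps share a single compiled executable.
padded = [(1, 32)] * 16

print(count_compiles(growing))  # 16 compilations
print(count_compiles(padded))   # 1 compilation
```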