Executing a Transformers model with PyTorch RBLN: Llama3.2-1B¶
Overview¶
This page demonstrates how to run the Llama3.2-1B model from HuggingFace Transformers.
Note
Llama 3.2 is licensed under the LLAMA 3.2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Setup & Installation¶
- System Requirements:
    - Python: 3.9-3.12
    - RBLN Driver
- Major Package Requirements:
    - torch
    - transformers: >=4.49.0,<4.50.0
    - RBLN Compiler
    - torch-rbln
    - huggingface_hub[cli]
- Installation Command:
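A minimal sketch of the installation command is shown below. The public dependencies can be installed directly with pip; the exact index and command for the RBLN Compiler and torch-rbln are provided through your RBLN Portal account, so that part is left as a placeholder.

```bash
# Public dependencies from PyPI
pip3 install torch "transformers>=4.49.0,<4.50.0" "huggingface_hub[cli]"

# The RBLN Compiler and torch-rbln are distributed through the RBLN Portal;
# use the install command provided with your Portal account (see the note below).
```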
Note
Please note that torch-rbln requires an RBLN Portal account.
Note
Please note that the meta-llama/Llama-3.2-1B model on HuggingFace Hub has restricted access. Once access is granted, you can log in using the huggingface-cli command as shown below:
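For example (the token placeholder below is illustrative; use your own access token):

```bash
huggingface-cli login --token <YOUR_HF_ACCESS_TOKEN>
```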
Llama3.2-1B Example¶
The following llama.py is an example of executing the Llama3.2-1B model downloaded from the HuggingFace Hub. As you can see, the example uses the same code as on a GPU or CPU, except that the device is set to rbln instead of cuda or cpu.
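Below is a minimal sketch of such a llama.py. It follows the standard HuggingFace Transformers generation flow; the torch_rbln import used to make the rbln device available is an assumption, and the prompt and generation parameters are illustrative.

```python
import torch
import torch_rbln  # assumption: importing torch-rbln registers the "rbln" eager device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
device = "rbln"  # change to "cuda" or "cpu" to run the identical script on GPU/CPU

# Load the tokenizer and model, then move the model to the target device.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

# Tokenize a prompt and move the input tensors to the same device.
prompt = "The first human to walk on the moon was"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding; each generated token extends the input sequence length.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```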
If you run the above script with cuda or rbln, both are expected to produce the same result.
Note
You can execute the model using the same code as GPU eager mode, except for changing the device to "rbln". torch-rbln compiles and executes each operation using rebel_compiler. When an op is executed a second time with an input tensor of the same shape, the cached executable is reused, avoiding recompilation time. However, in this Llama model the input sequence length changes at each token-generation step, so every step requires a new compilation, and this compilation overhead affects the overall execution speed. The model still functions correctly, but this behavior may result in slower performance compared to fixed-length inputs.