
PyTorch NLP BERT-base

This tutorial shows how to compile and deploy BERT, a natural language processing model provided by Hugging Face, for a masked language modeling task. The model predicts the most probable word to fill the masked position in a given sentence.

The tutorial consists of two main steps:

  1. How to compile the PyTorch BERT-base model and save it to local storage
  2. How to deploy the compiled model in the runtime-based inference environment

Prerequisite

Before we proceed, please make sure the pip packages used in this tutorial are installed on your system: torch, transformers, and the RBLN SDK.

Note

If you want to skip the details and quickly compile and deploy the models on RBLN NPU, you can directly jump to the summary section in this tutorial. The code summarized in this section includes all the necessary steps required to compile and deploy the model so it can be used as a quick starting point for your own project.

Native RBLN API

Step 1. How to compile

We will demonstrate how to compile the Hugging Face BERT-base model.

Prepare the model

To begin, we will import the BertForMaskedLM model from the transformers library.

import torch
from transformers import BertForMaskedLM
import rebel  # RBLN Compiler

# Instantiate HuggingFace PyTorch BERT-base model
bert_model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=False)
bert_model.eval()

Compile the model

Once the model (a torch.nn.Module) is instantiated, we can compile it with the rebel.compile_from_torch() method.

# Compile the model
MAX_SEQ_LEN = 128
input_info = [
    ("input_ids", [1, MAX_SEQ_LEN], "int64"),
    ("attention_mask", [1, MAX_SEQ_LEN], "int64"),
    ("token_type_ids", [1, MAX_SEQ_LEN], "int64"),
]
compiled_model = rebel.compile_from_torch(
    bert_model,
    input_info,
    # If the NPU is installed on your host machine, you can omit the `npu` argument.
    # The function will automatically detect and use the installed NPU.
    npu="RBLN-CA12",
)

If an NPU is installed on your host machine, you can omit the npu argument; rebel.compile_from_torch() will automatically detect and use the installed device. If no NPU is installed on the host, you must specify the target NPU with the npu argument to avoid errors.

Currently, two NPU names are supported: RBLN-CA02 and RBLN-CA12. If you are unsure of your target NPU's name, you can check it by running the rbln-stat command in a shell on the host machine where the NPU is installed.
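
For reference, if the target NPU is already installed on the compilation host, the call above can be shortened as in the following sketch, letting the compiler detect the device automatically:

# When an RBLN NPU is present on the host, `npu` may be omitted;
# the compiler detects and uses the installed device automatically.
compiled_model = rebel.compile_from_torch(bert_model, input_info)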

Save the compiled model

To save the compiled model to local storage, we can use the compiled_model.save() method as below.

# Save the compiled model to local storage
compiled_model.save("bert_base.rbln")

Step 2. How to deploy

In this section, we will learn how to load the compiled model, run inference on RBLN NPU, and check results inferred from the model.

Prepare the input

First, we need to preprocess the input text sequence for the masked language modeling task. We will use BertTokenizer from the transformers library to tokenize the input sequence.

import torch
from transformers import BertTokenizer, pipeline
import rebel  # RBLN Runtime

# Prepare the input
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "the color of rose is [MASK]." 
MAX_SEQ_LEN = 128
inputs = tokenizer(text, return_tensors="pt", padding="max_length", max_length=MAX_SEQ_LEN)
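
Note that the compiled model expects fixed input shapes of [1, MAX_SEQ_LEN], which is why the tokenizer pads every input to exactly MAX_SEQ_LEN tokens. As a quick sanity check (a minimal sketch using only the objects defined above), you can print the tensor shapes before running inference:

# Each tensor should match the compiled input shape [1, MAX_SEQ_LEN]
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))  # expected: (1, 128)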

Run inference

The RBLN Runtime module rebel.Runtime() is used to load the compiled model by passing the path of the saved model as an input argument. The tensor_type argument in rebel.Runtime() specifies the type of tensor to be used for input and output data. It can be set to either "pt" for PyTorch tensors or "np" for NumPy arrays.

We can use the run() method of the instantiated runtime module to run inference. The forward() method and the __call__ magic method can also be used, maintaining compatibility with PyTorch's interface.

# Load the compiled model
module = rebel.Runtime("bert_base.rbln", tensor_type="pt")

# Run inference
out = module(**inputs)

You can view basic information about the runtime module, such as the input/output shapes and the compiled model size, by calling print(module).
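
For reference, the three invocation styles mentioned above are interchangeable; the following sketch shows them side by side using the module and inputs defined earlier:

# Equivalent ways to run inference on the loaded runtime module
out = module(**inputs)          # __call__, PyTorch-style
out = module.forward(**inputs)  # forward(), PyTorch-style
out = module.run(**inputs)      # run(), native RBLN Runtime method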

Check results

To decode the final logits into text, we will use the fill-mask pipeline from the transformers library. It returns the most probable words to fill the [MASK] position in the given sentence.

# Check results
unmasker = pipeline("fill-mask", model="bert-base-uncased", framework="pt")
print(unmasker.postprocess({"input_ids": inputs.input_ids, "logits": out}))

The results will look like:

[
    {'score': 0.23562490940093994, 'token': 2317, 'token_str': 'white', 'sequence': 'the color of rose is white.'},
    {'score': 0.10957575589418411, 'token': 2417, 'token_str': 'red', 'sequence': 'the color of rose is red.'},
    {'score': 0.08016733080148697, 'token': 2304, 'token_str': 'black', 'sequence': 'the color of rose is black.'},
    {'score': 0.07074742764234543, 'token': 3756, 'token_str': 'yellow', 'sequence': 'the color of rose is yellow.'},
    {'score': 0.05175992473959923, 'token': 2630, 'token_str': 'blue', 'sequence': 'the color of rose is blue.'},
]
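
If you prefer not to use the pipeline helper, you can also decode the prediction directly from the logits at the masked position, in the same way as the torch.compile() section below. This sketch assumes out is the logits tensor returned by the runtime, as implied by the postprocess call above:

# Decode the top prediction at the [MASK] position directly from the logits
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = out[0, mask_token_index].argmax(axis=-1)
print(f"Predicted word: {tokenizer.decode(predicted_token_id)}")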

Summary

The complete code for model compilation is:

import torch
from transformers import BertForMaskedLM
import rebel  # RBLN Compiler

# Instantiate HuggingFace PyTorch BERT-base model
bert_model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=False)
bert_model.eval()

# Compile the model
MAX_SEQ_LEN = 128
input_info = [
    ("input_ids", [1, MAX_SEQ_LEN], "int64"),
    ("attention_mask", [1, MAX_SEQ_LEN], "int64"),
    ("token_type_ids", [1, MAX_SEQ_LEN], "int64"),
]
compiled_model = rebel.compile_from_torch(
    bert_model,
    input_info,
    # If the NPU is installed on your host machine, you can omit the `npu` argument.
    # The function will automatically detect and use the installed NPU.
    npu="RBLN-CA12",
)

# Save the compiled model to local storage
compiled_model.save("bert_base.rbln")

The complete code for running inference with the compiled model is as follows:

import torch
from transformers import BertTokenizer, pipeline
import rebel  # RBLN Runtime

# Prepare the input
MAX_SEQ_LEN = 128
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "the color of rose is [MASK]." 
inputs = tokenizer(text, return_tensors="pt", padding="max_length", max_length=MAX_SEQ_LEN)

# Load the compiled model
module = rebel.Runtime("bert_base.rbln", tensor_type="pt")

# Run inference
out = module(**inputs)

# Check results
unmasker = pipeline("fill-mask", model="bert-base-uncased", framework="pt")
print(unmasker.postprocess({"input_ids": inputs.input_ids, "logits": out}))

torch.compile() API

The RBLN SDK not only offers its native API but also supports PyTorch's torch.compile feature. This integration allows developers to harness the power of PyTorch's just-in-time (JIT) compilation for optimized model execution directly within the RBLN SDK. By incorporating RBLN's custom backend into any workflow that utilizes torch.compile, you can achieve enhanced performance while maintaining full compatibility with RBLN's native features.

This guide demonstrates how to compile and run the Hugging Face BERT-base model using the torch.compile() API.

Prepare the Model

To begin, import the BertForMaskedLM model from the Hugging Face transformers library and instantiate it. This step mirrors the process used with the native RBLN API.

import torch
from transformers import BertForMaskedLM
import rebel  # Import RBLN Compiler

if torch.__version__ >= "2.5.0":
    torch._dynamo.config.inline_inbuilt_nn_modules = False

# Instantiate the Hugging Face BERT-base model
bert_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
bert_model.eval()

Prepare the Input

Next, prepare the input data by tokenizing the input text sequence for the masked language modeling task. This process is also identical to using the native API.

from transformers import BertTokenizer

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define the input text
text = "The color of a rose is [MASK]."

# Tokenize and prepare the input
MAX_SEQ_LEN = 128
inputs = tokenizer(text, return_tensors="pt", padding="max_length", max_length=MAX_SEQ_LEN)

Compile and Run the Model

With the model and input prepared, compile and run the model using torch.compile(). This step enables JIT compilation at runtime during the first forward pass.

# Compile the model using the RBLN backend
compiled_model = torch.compile(bert_model, 
                               backend="rbln",  # Specify the RBLN backend
                               options={"cache_dir": "./rbln_cache_dir"},  # Cache directory for compiled artifacts
                               dynamic=False)  # Disable dynamic shapes (not supported by RBLN backend)

# Run inference using the compiled model
logits = compiled_model(**inputs).logits

# Decode the predicted token
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
print(f"Predicted word: {tokenizer.decode(predicted_token_id)}")

Understanding torch.compile() Parameters

backend="rbln":

  • Description: Specifies the backend to use for model compilation.
  • Purpose: By setting this to "rbln", you direct the compilation process to utilize the RBLN SDK’s custom backend, which is optimized for performance within the RBLN environment.

options={"cache_dir": "PATH/TO/rbln_cache_dir", "npu": "TARGET_NPU_DEVICE"}:

  • Description: Provides additional options for the compilation process.
  • Purpose:

    • cache_dir : Specifies the directory where compiled artifacts are stored.
      • Usage: This is similar to using compiled_model.save("bert_base.rbln") in the native API, creating an RBLN artifact at the specified path.
      • Caching: If a compiled model already exists in the specified directory, the RBLN backend uses the cached version instead of recompiling the model. This reduces compilation time and overhead when the model is reused.
    • npu : The identifier of the target NPU for compilation. Refer to the npu option in the native API documentation for details on specifying the device identifier (a usage example is shown in the sketch after this list).

dynamic=False:

  • Description: Indicates whether the model should support dynamic input shapes.
  • Purpose:
    • Setting dynamic to False is recommended for the RBLN backend because it currently does not support dynamic shapes.
    • Behavior: With this option set to False, the model assumes fixed input shapes, and any inputs with different shapes will trigger a recompilation. This ensures that the compilation is optimized for the specific shapes used in inference but means that you may need to recompile if the input shapes change.
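
Putting these options together, a compile call that pins both the cache directory and the target NPU might look like the following sketch (RBLN-CA12 is simply the example device name used earlier in this tutorial; substitute your own target NPU):

# Specify both the cache directory and the target NPU through `options`.
# If a compiled artifact already exists in the cache directory,
# the RBLN backend reuses it instead of recompiling.
compiled_model = torch.compile(
    bert_model,
    backend="rbln",
    options={"cache_dir": "./rbln_cache_dir", "npu": "RBLN-CA12"},
    dynamic=False,
)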

Summary

Below is a complete example that includes argument parsing to select between BERT-base and BERT-large models.

import argparse
import torch
from transformers import BertForMaskedLM, BertTokenizer
import rebel  # Needed to use the "rbln" backend with torch.compile

if torch.__version__ >= "2.5.0":
    torch._dynamo.config.inline_inbuilt_nn_modules = False

def parse_arguments():
    parser = argparse.ArgumentParser(description="Run BERT model using torch.compile with RBLN backend.")
    parser.add_argument("--model_name", type=str, choices=["base", "large"], default="base", help="Size of BERT model: 'base' or 'large'.")
    return parser.parse_args()

def main():
    args = parse_arguments()
    model_name = f"bert-{args.model_name}-uncased"
    MAX_SEQ_LEN = 128

    # Instantiate and compile the model
    model = BertForMaskedLM.from_pretrained(model_name)
    compiled_model = torch.compile(model, backend="rbln", dynamic=False, options={"cache_dir": "./rbln_cache_dir"})

    # Prepare input text for masked language modeling
    tokenizer = BertTokenizer.from_pretrained(model_name)
    text = "The color of a rose is [MASK]."
    inputs = tokenizer(text, return_tensors="pt", padding="max_length", max_length=MAX_SEQ_LEN)

    # Run inference using the compiled model
    logits = compiled_model(**inputs).logits

    # Decode and print the predicted word
    mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
    print(f"Predicted word: {tokenizer.decode(predicted_token_id)}")

if __name__ == "__main__":
    main()