TensorFlow NLP BERT-base

This tutorial shows how to compile and deploy BERT, a natural language processing model provided by Hugging Face, for a masked language modeling task. The model predicts the most probable word for the masked position in a given sentence.

This tutorial consists of the following steps:

  1. How to compile the TensorFlow BERT-base model and save the compiled model
  2. How to deploy the compiled model in the runtime-based inference environment

Prerequisite

Before we begin, please make sure the following pip packages are installed on your system: tensorflow, transformers, and the RBLN SDK (the rebel module used in the code below).

Note

If you want to skip the details and quickly compile and deploy the model on an RBLN NPU, you can jump directly to the summary section of this tutorial. It contains the complete code for all of the steps below, so it can serve as a quick starting point for your own project.

Step 1. How to compile

In this section, we will demonstrate how to compile the Hugging Face BERT-base model.

Prepare the model

To start, we will import the TFBertForMaskedLM model from the transformers library and convert it to a tf.function object.

from transformers import TFBertForMaskedLM
import tensorflow as tf
import rebel  # RBLN Compiler

# Instantiate HuggingFace TensorFlow BERT-base model
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")
func = tf.function(
    lambda input_ids, attention_mask, token_type_ids: model(
        input_ids, attention_mask, token_type_ids
    )
)
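
Optionally, you can trace the wrapped function with a fixed input signature to confirm it runs before compiling. This is a minimal sketch; the [1, 128] shapes mirror the input_info defined in the next step:

# Optional: trace the function with the fixed input signature
# (shape [1, 128] matches the input_info used for compilation below)
spec = tf.TensorSpec([1, 128], tf.int64)
concrete_func = func.get_concrete_function(spec, spec, spec)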

Compile the model

Once the tf.function is instantiated, we can compile it with the rebel.compile_from_tf_function() method.

# Compile the model
MAX_SEQ_LEN = 128
input_info = [
    ("input_ids", [1, MAX_SEQ_LEN], tf.int64),
    ("attention_mask", [1, MAX_SEQ_LEN], tf.int64),
    ("token_type_ids", [1, MAX_SEQ_LEN], tf.int64),
]
compiled_model = rebel.compile_from_tf_function(
    func,
    input_info,
    # If the NPU is installed on your host machine, you can omit the `npu` argument.
    # The function will automatically detect and use the installed NPU.
    npu="RBLN-CA12",
)

If the NPU is installed on your host machine, you can omit the npu argument in rebel.compile_from_tf_function(); the function will automatically detect and use the installed NPU. If the NPU is not installed on your host machine, you need to specify the target NPU with the npu argument to avoid errors.

Currently, two NPU names are supported: RBLN-CA02 and RBLN-CA12. If you are unsure of your target NPU's name, you can check it by running the rbln-stat command in a shell on the host machine where the NPU is installed.
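
For example, on a host machine with an NPU installed, the same compilation can be written without the npu argument:

# The installed NPU is detected and used automatically
compiled_model = rebel.compile_from_tf_function(func, input_info)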

Save the compiled model

We can use the compiled_model.save() method to save the compiled model to disk.

# Save the compiled model to disk
compiled_model.save("bert_base.rbln")

Step 2. How to deploy

In this section, we will learn how to load the compiled model, run inference, and check the results.

Prepare the input

First, we need to prepare the input data for the target task, masked language modeling. We will use the pre-trained BertTokenizer from the transformers library to tokenize the input sequence.

from transformers import BertTokenizer
import tensorflow as tf
import rebel  # RBLN Runtime


# Prepare the input
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "The capital of Korea is [MASK]."
MAX_SEQ_LEN = 128
inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=MAX_SEQ_LEN)
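
Optionally, you can verify that the tokenized inputs match the [1, MAX_SEQ_LEN] shapes the model was compiled with:

# Each input should have shape (1, 128), matching the compiled input_info
for name in ("input_ids", "attention_mask", "token_type_ids"):
    print(name, inputs[name].shape)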

Run inference

The RBLN Runtime module rebel.Runtime() is used to load the compiled model. We can use the run() method of the instantiated runtime module to run inference on the given sentence. Alternatively, the __call__ magic method can also be used to run inference.

# Load the compiled model
module = rebel.Runtime("bert_base.rbln")

# Run inference
out = module.run(**inputs)

You can see basic information about the runtime module, such as its input/output shapes and the compiled model size, by printing it.
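
For example:

# Print input/output shapes and the compiled model size
print(module)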

Check results

To decode the final logits into text, we first locate the index of the masked token in the input sequence. Then, we select the corresponding logits from the model's output. Finally, we use the tokenizer to decode the token id with the highest score.

# Check results
# Locate the index of the masked token in the input sequence
mask_token_index = tf.where((inputs.input_ids == tokenizer.mask_token_id)[0])
# Select the logits corresponding to the masked position
selected_logits = tf.gather_nd(out[0], indices=mask_token_index)
# Decode the token id with the highest score
predicted_token_id = tf.math.argmax(selected_logits, axis=-1)
print("Masked word is [", tokenizer.decode(predicted_token_id), "].")

The results will look like:

Masked word is [ seoul ].
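
If you want to inspect more than the single best candidate, the decoding step can be extended with TensorFlow's tf.math.top_k. The following sketch prints the five highest-scoring words for the masked position:

# Retrieve the top-5 candidate token ids for the masked position
top_k = tf.math.top_k(selected_logits, k=5)
for token_id in top_k.indices[0]:
    print(tokenizer.decode([int(token_id)]))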

Summary

The complete code for the model compilation is:

from transformers import TFBertForMaskedLM
import tensorflow as tf
import rebel  # RBLN Compiler

# Instantiate HuggingFace TensorFlow BERT-base model
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")
func = tf.function(
    lambda input_ids, attention_mask, token_type_ids: model(
        input_ids, attention_mask, token_type_ids
    )
)

# Compile the model
MAX_SEQ_LEN = 128
input_info = [
    ("input_ids", [1, MAX_SEQ_LEN], tf.int64),
    ("attention_mask", [1, MAX_SEQ_LEN], tf.int64),
    ("token_type_ids", [1, MAX_SEQ_LEN], tf.int64),
]
compiled_model = rebel.compile_from_tf_function(
    func,
    input_info,
    # If the NPU is installed on your host machine, you can omit the `npu` argument.
    # The function will automatically detect and use the installed NPU.
    npu="RBLN-CA12",
)

# Save the compiled model to disk
compiled_model.save("bert_base.rbln")

The complete code for deploying the compiled model is:

from transformers import BertTokenizer
import tensorflow as tf
import rebel  # RBLN Runtime

# Prepare the input
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "The capital of Korea is [MASK]."
MAX_SEQ_LEN = 128
inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=MAX_SEQ_LEN)

# Load the compiled model
module = rebel.Runtime("bert_base.rbln")

# Run inference
out = module.run(**inputs)

# Check results
# Locate the index of the masked token in the input sequence
mask_token_index = tf.where((inputs.input_ids == tokenizer.mask_token_id)[0])
# Select the logits corresponding to the masked position
selected_logits = tf.gather_nd(out[0], indices=mask_token_index)
# Decode the token id with the highest score
predicted_token_id = tf.math.argmax(selected_logits, axis=-1)
print("Masked word is [", tokenizer.decode(predicted_token_id), "].")