Text Generation
Overview
This tutorial explains how to run inference with a Llama3-8b model using the RBLN SDK C/C++ Runtime API.
The model is compiled using the RBLN SDK Python API, and the resulting *.rbln file is used for inference with the
RBLN SDK C/C++ Runtime API. This approach combines the ease of model preparation in Python with the performance
benefits of C/C++ for inference.
The complete code used in this tutorial is available in the RBLN Model Zoo.
Setup & Installation
Before you begin, ensure that your system environment is properly configured and that all required packages are installed. This includes:
- System Requirements:
- Package Requirements:
- Installation Command:
| pip install optimum-rbln huggingface_hub cmake
pip install \
--extra-index-url https://pypi.rbln.ai/simple \
rebel-compiler==0.10.2
|
Note
- Please note that rebel-compiler requires an RBLN Portal account.
- The commands above are intended for a default pip install on Debian-based Linux such as Ubuntu. For all other configurations, refer to the Installation Guide for the supported install matrix and the applicable commands.
Compilation with RBLN Python API
The RBLN Python API is used to compile the Llama3-8b model.
After compilation, the model files (e.g., prefill.rbln and decoder_batch_1.rbln) are saved in the
Meta-Llama-3-8B-Instruct directory.
Compile the Model
Import the RBLNLlamaForCausalLM class from optimum-rbln and use from_pretrained() to download
and compile the model. The compiled model is saved to disk using model.save_pretrained().
| import os
from optimum.rbln import RBLNLlamaForCausalLM
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = RBLNLlamaForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,
    rbln_batch_size=1,
    rbln_max_seq_len=8192,
    rbln_tensor_parallel_size=4,
)
# Save the compiled model
model.save_pretrained(os.path.basename(model_id))
|
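Because the C/C++ code later loads the compiled files by path, it can help to confirm which *.rbln files were actually written. The following is a minimal sketch using only the Python standard library; `list_compiled_artifacts` is a hypothetical helper name, not part of optimum-rbln, and the exact file names may differ between SDK versions.

```python
from pathlib import Path


def list_compiled_artifacts(model_dir):
    """Return the sorted names of all *.rbln files directly under model_dir.

    Hypothetical helper for illustration; returns an empty list when the
    directory does not exist.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.rbln"))


# "Meta-Llama-3-8B-Instruct" is the directory written by save_pretrained().
for name in list_compiled_artifacts("Meta-Llama-3-8B-Instruct"):
    print(name)
```

The printed names are the ones that the paths in the C/C++ configuration need to match.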
Tokenize the input using AutoTokenizer from the transformers library.
The resulting token IDs are saved to disk as the binary file c_input_ids.bin.
Note
This tutorial demonstrates how to use the RBLN SDK C/C++ Runtime API for Llama3-8b.
Because the example focuses on C/C++-based inference, pre- and post-processing (tokenization and decoding of token IDs) are handled by Python APIs.
| from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
input_text = "Hey, are you conscious? Can you talk to me?"
batch_size = 1
# Prepare inputs
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
conversation = [[{"role": "user", "content": input_text}]] * batch_size
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.numpy()
# Save the generated input binary file
input_ids.tofile("c_input_ids.bin")
|
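The file written by `tofile()` contains only raw int64 values, with no shape or dtype header, and the C++ application allocates its input tensor with a fixed shape. A quick check of the token count can therefore catch a mismatch early. The sketch below uses only the standard library and assumes the file was written on the same machine (native byte order); `read_int64_tokens` is a hypothetical helper name.

```python
import array
import os


def read_int64_tokens(path):
    """Read a raw binary file of native-endian int64 values into a list."""
    tokens = array.array("q")  # "q" = signed 64-bit integer
    size = os.path.getsize(path)
    if size % tokens.itemsize != 0:
        raise ValueError(
            f"{path}: size {size} is not a multiple of {tokens.itemsize} bytes"
        )
    with open(path, "rb") as f:
        tokens.fromfile(f, size // tokens.itemsize)
    return tokens.tolist()
```

Calling `len(read_int64_tokens("c_input_ids.bin"))` gives the number of prompt tokens, which should equal the column count of the input tensor created in llama_main.cc.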
Inference with RBLN SDK C/C++ Runtime API
In this step, the compiled model is used for inference with the RBLN SDK C/C++ Runtime API.
The process involves model loading, running inference, and decoding the output.
Prepare CMake Build Script
Define dependencies and linking for the example application.
Note
${YOUR_SAMPLE_PATH} refers to the directory containing the CMake file and inference code.
| cmake_minimum_required(VERSION 3.26)
# Declare the project; this enables the default C and C++ toolchains
project(text_generation)
# Collect all source files
file(GLOB SOURCE_FILES "*.cc")
# Define executable (llama_main.cc is already matched by the *.cc glob above)
add_executable(text_generation ${SOURCE_FILES})
# Link RBLN runtime
find_package(rbln CONFIG REQUIRED)
target_link_libraries(text_generation rbln::rbln_runtime)
# Add header files directory
target_include_directories(text_generation PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
|
Prepare Code for Inference
Several source files are used for inference, including:
- llama_main.cc: Inference application for model loading and execution.
- llama_class_example.hpp/cc: Inference wrapper and workflow management.
- llama_tensor_example.hpp/cc: Inference wrapper using RBLN SDK C/C++ Runtime API, including:
  - Prefill and decode stage management of Llama3-8b
  - Input/output buffer handling
  - Model execution flow control
- llama_tensor_op_example.hpp: Tensor manipulation operations.
llama_main.cc
| #include "llama_class_example.hpp"
int main() {
  LLamaClass llama_cls;
  // Init Model configuration
  llama_cls.InitConfig();
  // Create Model & Runtime
  llama_cls.Prepare();
  // Init LLamaClass
  llama_cls.Init();
  // Shape (1, 23) must match the number of token IDs stored in c_input_ids.bin
  auto input_ids = Tensor<int64_t>(1, 23);
  // Call LoadBinary outside of assert() so it still runs when NDEBUG is defined
  if (!LoadBinary<int64_t>(llama_cls.GetIdsPath(), input_ids)) {
    llama_cls.DeInit();
    return 1;
  }
  auto past_cached_length = Tensor<int32_t>();
  llama_cls.PrepareInput(input_ids, past_cached_length);
  // Process of Prefill phase
  llama_cls.DoPrefill();
  // Process of Decode phase
  llama_cls.DoDecode();
  // Generate c_text2text_generation_gen_id.bin
  llama_cls.GenerateBinary();
  // Reset LLamaClass for iteration
  llama_cls.Reset();
  // Deinit LLamaClass
  llama_cls.DeInit();
  return 0;
}
|
llama_class_example.hpp
| #ifndef RBLN_LLAMA_H
#define RBLN_LLAMA_H
#include <assert.h>
#include <rbln/rbln.h>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>
#include "llama_tensor_example.hpp"
#include "llama_tensor_op_example.hpp"
constexpr uint32_t kDecodeInputCount = 3;
constexpr uint32_t kPrefillInputCount = 4;
template <typename T>
bool LoadBinary(const std::string &filename, Tensor<T> &data) {
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {
std::cout << "Could not open file: " + filename << std::endl;
return false;
}
file.seekg(0, std::ios::end);
const size_t fileSize = file.tellg();
file.seekg(0, std::ios::beg);
if (fileSize % sizeof(T) != 0) {
std::cout << "File size(" << fileSize << ") is not a multiple of data type size("
<< sizeof(T) << ")" << std::endl;
return false;
}
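// Guard against overflowing the tensor buffer before reading into it
if (fileSize > data.GetSize() * sizeof(T)) {
std::cout << "File size(" << fileSize << ") exceeds tensor buffer size("
<< data.GetSize() * sizeof(T) << ")" << std::endl;
return false;
}
file.read(static_cast<char *>(data.GetData()), fileSize);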
if (file.fail()) {
std::cout << "Failed to read file: " << filename << std::endl;
return false;
}
return true;
}
int WriteToFile(const std::string &filePath, const void *data,
uint32_t data_len);
class LLamaClass {
public:
LLamaClass() = default;
~LLamaClass() = default;
// Init Model configuration
void InitConfig() {
prefill_id_ = "${YOUR_SAMPLE_PATH}/Meta-Llama-3-8B-Instruct/prefill.rbln";
dec_id_ = "${YOUR_SAMPLE_PATH}/Meta-Llama-3-8B-Instruct/decoder_batch_1.rbln";
input_ids_path_ = "${YOUR_SAMPLE_PATH}/c_input_ids.bin";
batch_size_ = 1;
max_seq_len_ = 8192;
prefill_chunk_size_ = 128;
}
// Init LLamaClass
void Init();
// Reset LLamaClass for iteration
void Reset();
// Deinit LLamaClass
void DeInit();
// Create Model & Runtime
void Prepare();
// Process of Prefill phase
void DoPrefill();
// Process of Decode phase
void DoDecode();
// Generate c_text2text_generation_gen_id.bin
void GenerateBinary();
template <typename T0, typename T1>
void PrepareInput(Tensor<T0> &input_ids, Tensor<T1> &v0) {
if (!v0.GetSize()) {
auto input_tensors = input_ids;
auto batch_size = input_tensors.GetRows();
std::vector<Tensor<int64_t>> l_input_tensors;
std::vector<Tensor<int32_t>> cache_positions;
auto past_cached_length = Tensor<int32_t>(batch_size, 1);
for (int i = 0; i < batch_size; i++) {
auto input_tensor =
tensor_ops::Reshape(input_tensors, input_tensors.GetCols());
auto valid_len = input_tensor.GetCols();
auto cache_position = Tensor<int32_t>();
tensor_ops::Arange(cache_position, 0, valid_len);
// Reshape returns a new tensor, so keep the result
cache_position = tensor_ops::Reshape(cache_position, 1, valid_len);
past_cached_length[i] = valid_len;
l_input_tensors.emplace_back(tensor_ops::UnSqueeze(input_tensor));
cache_positions.emplace_back(tensor_ops::UnSqueeze(cache_position));
}
mdl_input_ = ModelInput{l_input_tensors[0], cache_positions[0],
past_cached_length};
} else {
auto input_tensor = tensor_ops::SelectLastColumn(input_ids);
auto cache_positions = v0;
auto past_cached_length = v0 + 1;
mdl_input_ =
ModelInput{input_tensor, cache_positions, past_cached_length};
}
}
const std::string &GetIdsPath() { return input_ids_path_; }
RBLNModel *prefill_mdl_;
RBLNModel *dec_mdl_;
RBLNRuntime *prefill_rt_;
RBLNRuntime *dec_rt_;
private:
void ForwardPrefill();
void ForwardDecode();
typedef struct {
Tensor<int64_t> input_ids;
Tensor<int32_t> cache_position;
Tensor<int32_t> past_cached_length;
} ModelInput;
ModelInput mdl_input_;
int max_seq_len_;
int batch_size_;
int prefill_chunk_size_;
bool unfinished_sequences_;
std::string prefill_id_;
std::string dec_id_;
std::string input_ids_path_;
Tensor<float> output_logits_;
Tensor<int64_t> input_ids_;
};
#endif
|
llama_class_example.cc
| #include "llama_class_example.hpp"
#include <iostream>
int WriteToFile(const std::string &filePath, const void *data,
uint32_t data_len) {
std::ofstream fout;
fout.open(filePath, std::ios::out | std::ios::binary);
if (fout.is_open()) {
fout.write((const char *)data, data_len);
fout.close();
return 1;
}
return 0;
}
// Prefill forward method
void LLamaClass::ForwardPrefill() {
// Get input tensors and cache position
auto input_tensors = mdl_input_.input_ids;
auto cache_position = mdl_input_.cache_position;
// Get query length (number of tokens in the input sequence)
int query_length = input_tensors.GetCols();
// Process input in chunks (divided into chunks of size prefill_chunk_size)
for (auto step = 0; step < query_length; step += prefill_chunk_size_) {
// If the last chunk is incomplete (remaining tokens less than chunk size)
if ((step + prefill_chunk_size_) > query_length) {
// Calculate and add necessary padding to the input tensor
int padding_needed = step + prefill_chunk_size_ - query_length;
input_tensors = tensor_ops::Pad(input_tensors, 0, padding_needed);
// Extend cache positions (concatenate current cache positions with additional range)
auto new_cache_position = tensor_ops::ConcatenateWithRange(
cache_position, query_length, step + prefill_chunk_size_);
// Slice input tensors and cache positions for the current chunk
auto sliced_input_tensors = tensor_ops::VerticalSlicing(
input_tensors, step, step + prefill_chunk_size_);
auto sliced_cache_positions = tensor_ops::VerticalSlicing(
new_cache_position, step, step + prefill_chunk_size_);
// Create query position and empty block tables(with value 0) for KV-cache management
Tensor<int16_t> query_position(query_length % prefill_chunk_size_ - 1);
Tensor<int16_t> block_tables(0);
// Check if prefill input count exceeds expected limit
if (rbln_get_num_inputs(prefill_rt_) > kPrefillInputCount) {
throw std::runtime_error(
"You appear to be running on ATOM(RBLN-CA02). RSD is only "
"available on ATOM+(RBLN-CA12). Check your NPU type with "
"'rbln-stat' command.");
}
// Set inputs for the model runtime
rbln_set_input(prefill_rt_, 0, sliced_input_tensors.GetData());
rbln_set_input(prefill_rt_, 1, sliced_cache_positions.GetData());
rbln_set_input(prefill_rt_, 2, block_tables.GetData());
rbln_set_input(prefill_rt_, 3, query_position.GetData());
// Run the model
rbln_run(prefill_rt_);
// Get output logits and convert to tensor
float *logit = static_cast<float *>(rbln_get_output(prefill_rt_, 0));
auto layout = rbln_get_output_layout(prefill_rt_, 0);
output_logits_ = Tensor<float>(logit, layout->shape[1], layout->shape[2]);
}
}
// Predict the next token from logits using Argmax
auto next_tokens = tensor_ops::GetArgmax<float, int64_t>(output_logits_);
// Concatenate existing input IDs with the predicted next token
input_ids_ = tensor_ops::Concatenate(mdl_input_.input_ids, next_tokens);
}
// Decoder forward method
void LLamaClass::ForwardDecode() {
// Get input tensors for decoding from prefill step
auto dec_input_tensors = mdl_input_.input_ids;
// Get batch size from the number of rows in input tensors
auto dec_batch_size = dec_input_tensors.GetRows();
// Get cache positions for decoding from prefill step
auto dec_cache_position = mdl_input_.cache_position;
// For each item in the batch
for (auto b_idx = 0; b_idx < dec_batch_size; b_idx++) {
// Get the current decoding step
auto decoding_step = dec_cache_position[b_idx];
}
// Initialize block tables for KV-cache management with shape of (batch_size, 1)
Tensor<int16_t> block_tables(batch_size_, 1);
// Check if decoder input count exceeds expected limit
if (rbln_get_num_inputs(dec_rt_) > kDecodeInputCount) {
throw std::runtime_error(
"You appear to be running on ATOM(RBLN-CA02). RSD is only available on "
"ATOM+(RBLN-CA12). Check your NPU type with 'rbln-stat' command.");
}
// Set inputs for decoder runtime
rbln_set_input(dec_rt_, 0, dec_input_tensors.GetData());
rbln_set_input(dec_rt_, 1, dec_cache_position.GetData());
rbln_set_input(dec_rt_, 2, block_tables.GetData());
// Run the decoder
rbln_run(dec_rt_);
// Get output logits from the decoder
float *dec_logit = static_cast<float *>(rbln_get_output(dec_rt_, 0));
auto dec_layout = rbln_get_output_layout(dec_rt_, 0);
// Convert output to tensor format
output_logits_ =
Tensor<float>(dec_logit, dec_layout->shape[1], dec_layout->shape[2]);
}
// Prefill forward wrapper
void LLamaClass::DoPrefill() {
// Run the prefill phase to process the input sequence and generate the next token
ForwardPrefill();
}
// Decoder forward wrapper
void LLamaClass::DoDecode() {
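// Stop at EOS, or once the sequence reaches the compiled maximum length
while (unfinished_sequences_ &&
input_ids_.GetCols() < static_cast<size_t>(max_seq_len_)) {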
// Prepare input for the model with current token IDs and past cache info
PrepareInput(input_ids_, mdl_input_.past_cached_length);
// Run the decoder to get the next token logits
ForwardDecode();
// Get the next token using Argmax
auto dec_next_tokens =
tensor_ops::GetArgmax<float, int64_t>(output_logits_);
// Append/Concatenate the new tokens to the existing sequence
input_ids_ = tensor_ops::Concatenate(input_ids_, dec_next_tokens);
auto stopping_criteria = [](const auto &array) -> bool {
const int32_t eos_token_id = 128009;
// Stop generation if EOS token is found at the last position
if (array(0, array.GetCols() - 1) == eos_token_id)
return false;
return true;
};
unfinished_sequences_ = stopping_criteria(input_ids_);
}
}
void LLamaClass::Init() {
unfinished_sequences_ = true;
}
void LLamaClass::Reset() {
output_logits_ = Tensor<float>();
input_ids_ = Tensor<int64_t>();
}
void LLamaClass::DeInit() {
// Destroy runtime
rbln_destroy_runtime(prefill_rt_);
rbln_destroy_runtime(dec_rt_);
// Destroy model
rbln_destroy_model(prefill_mdl_);
rbln_destroy_model(dec_mdl_);
}
void LLamaClass::Prepare() {
// Create prefill/decoder model
prefill_mdl_ = rbln_create_model(prefill_id_.c_str());
dec_mdl_ = rbln_create_model(dec_id_.c_str());
// Create prefill/decoder runtime
prefill_rt_ = rbln_create_runtime(prefill_mdl_, nullptr, 0, 0);
dec_rt_ = rbln_create_runtime(dec_mdl_, nullptr, 0, 0);
}
void LLamaClass::GenerateBinary() {
if(!WriteToFile("c_text2text_generation_gen_id.bin", input_ids_.GetData(),
input_ids_.GetSize() * sizeof(int64_t))) {
std::cout << "Fail to save c_text2text_generation_gen_id.bin" << std::endl;
}
}
|
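The padding and query-position arithmetic in ForwardPrefill() can be easier to follow in isolation. The sketch below reproduces it in plain Python for illustration only; `plan_prefill_chunks` is a hypothetical helper, not an RBLN SDK function, and it lists every chunk, whereas the C++ example above only executes the model for a final, partially filled chunk.

```python
def plan_prefill_chunks(query_length, chunk_size):
    """Return (start, end, padding, query_position) for each prefill chunk.

    padding is the number of pad tokens appended to fill the chunk, and
    query_position is the index of the last real token inside the chunk.
    """
    chunks = []
    for start in range(0, query_length, chunk_size):
        end = start + chunk_size
        valid = min(end, query_length) - start  # real tokens in this chunk
        chunks.append((start, end, chunk_size - valid, valid - 1))
    return chunks


# 23 prompt tokens with prefill_chunk_size_ = 128, as configured in InitConfig()
print(plan_prefill_chunks(23, 128))  # prints [(0, 128, 105, 22)]
```

For the 23-token example prompt this gives a single chunk with 105 padding tokens and a query position of 22, matching `padding_needed` and `query_length % prefill_chunk_size_ - 1` in the C++ code.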
llama_tensor_example.hpp
| #ifndef RBLN_TENSOR_H
#define RBLN_TENSOR_H
#include <algorithm>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
template <typename T> class Tensor {
public:
Tensor() : depth_(1), rows_(0), cols_(0) { array_.reserve(0); }
Tensor(T val) : depth_(1), rows_(0), cols_(1) { array_.resize(1, T{val}); }
Tensor(size_t row, size_t col) : depth_(1), rows_(row), cols_(col) {
array_.resize(GetCapacity(), T{});
}
Tensor(const void *data, size_t row, size_t col)
: depth_(1), rows_(row), cols_(col) {
const T *ptr = static_cast<const T *>(data);
array_.assign(ptr, ptr + GetCapacity());
}
Tensor(size_t depth, size_t row, size_t col)
: depth_(depth), rows_(row), cols_(col) {
array_.resize(GetCapacity(), T{});
}
~Tensor() = default;
Tensor(const Tensor &other) {
array_ = other.array_;
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
Tensor(Tensor &&other) {
array_ = std::move(other.array_);
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
T &operator[](size_t i) { return array_[i]; }
T operator[](size_t i) const { return array_[i]; }
Tensor &operator=(const Tensor &other) {
if (this != &other) {
array_ = other.array_;
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
return *this;
}
T &operator()(size_t r_idx, size_t c_idx) {
if (r_idx >= rows_ || c_idx >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[cols_ * r_idx + c_idx];
}
T &operator()(size_t col) {
if (col >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[col];
}
T operator()(size_t r_idx, size_t c_idx) const {
if (r_idx >= rows_ || c_idx >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[cols_ * r_idx + c_idx];
}
Tensor operator+(T val) {
Tensor ret(rows_, cols_);
for (auto r = 0; r < rows_; r++) {
for (auto c = 0; c < cols_; c++) {
ret(r, c) = array_[r * cols_ + c] + val;
}
}
return ret;
}
void *GetData() { return array_.data(); }
size_t GetRows() const { return rows_; }
size_t GetCols() const { return cols_; }
size_t GetDepth() const { return depth_; }
size_t GetSize() const { return array_.size(); }
void Ones() { std::fill(array_.begin(), array_.end(), T{1}); }
void Zeros() { std::fill(array_.begin(), array_.end(), T{0}); }
size_t GetCapacity(size_t r, size_t c) const {
return std::max<size_t>(1, r) * std::max<size_t>(1, c);
}
size_t GetCapacity() const {
return GetCapacity(rows_, cols_) * std::max<size_t>(1, depth_);
}
void Resize(size_t row, size_t col) {
rows_ = row;
cols_ = col;
array_.resize(GetCapacity(row, col));
}
private:
std::vector<T> array_;
size_t depth_;
size_t rows_;
size_t cols_;
};
#endif
|
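The Tensor class stores its elements in one flat, row-major buffer and treats a dimension of 0 as 1 when computing the buffer size, which is how it represents 1-D tensors with `rows_ == 0`. A small Python sketch of those two rules follows; the function names are illustrative only.

```python
def capacity(rows, cols, depth=1):
    """Number of stored elements; a zero dimension counts as 1, as in GetCapacity()."""
    return max(1, rows) * max(1, cols) * max(1, depth)


def flat_index(r_idx, c_idx, cols):
    """Row-major offset, as computed by Tensor::operator()(r_idx, c_idx)."""
    return cols * r_idx + c_idx


print(capacity(0, 23))      # prints 23
print(flat_index(1, 2, 4))  # prints 6
```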
llama_tensor_op_example.hpp
| #ifndef RBLN_LLAMA_OPS_H
#define RBLN_LLAMA_OPS_H
#include <stdexcept>
#include <vector>
#include "llama_tensor_example.hpp"
// Tensor operations implementation
namespace tensor_ops {
template <typename T>
Tensor<T> Reshape(const Tensor<T> &tensor, int row, int col) {
if (tensor.GetCapacity(row, col) != tensor.GetCapacity()) {
throw std::runtime_error("Cannot reshape: total size must remain the same");
}
Tensor<T> ret(tensor);
std::vector<T> temp(tensor.GetSize());
for (size_t i = 0; i < row; ++i) {
for (size_t j = 0; j < col; ++j) {
size_t new_idx = i * col + j;
size_t old_idx = (new_idx / col) * tensor.GetCols() + (new_idx % col);
temp[new_idx] = tensor[old_idx];
}
}
for (size_t i = 0; i < temp.size(); ++i) {
ret[i] = temp[i];
}
ret.Resize(row, col);
return ret;
}
template <typename T> Tensor<T> Reshape(const Tensor<T> &tensor, int col) {
if (col != tensor.GetCapacity()) {
throw std::runtime_error("Cannot reshape: total size must remain the same");
}
Tensor<T> ret(tensor);
ret.Resize(0, col);
return ret;
}
template <typename T> void Arange(Tensor<T> &tensor, int start, int stop) {
tensor.Resize(0, stop - start);
for (size_t i = 0; i < stop - start; ++i) {
tensor[i] = static_cast<T>(start + i);
}
}
template <typename T> Tensor<T> UnSqueeze(const Tensor<T> &tensor) {
Tensor<T> ret(1, tensor.GetSize());
for (size_t i = 0; i < tensor.GetSize(); ++i) {
ret(0, i) = tensor[i];
}
return ret;
}
template <typename T> Tensor<T> SelectLastColumn(const Tensor<T> &tensor) {
Tensor<T> result(tensor.GetRows(), 1);
size_t last_col = tensor.GetCols() - 1;
for (size_t i = 0; i < tensor.GetRows(); ++i) {
result(i, 0) = tensor(i, last_col);
}
return result;
}
template <typename T>
Tensor<T> Pad(const Tensor<T> &tensor, size_t start_pos, size_t end_pos) {
Tensor<T> padded(tensor.GetRows(), tensor.GetCols() + start_pos + end_pos);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
for (size_t j = 0; j < tensor.GetCols(); ++j) {
padded(i, start_pos + j) = tensor(i, j);
}
}
return padded;
}
template <typename T>
Tensor<T> VerticalSlicing(Tensor<T> &tensor, size_t start_pos, size_t end_pos) {
Tensor<T> ret(tensor);
std::vector<T> temp(ret.GetCapacity(ret.GetRows(), (end_pos - start_pos)));
for (size_t i = 0; i < ret.GetRows(); ++i) {
for (size_t j = start_pos; j < end_pos; ++j) {
temp[i * (end_pos - start_pos) + (j - start_pos)] = ret(i, j);
}
}
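for (size_t i = 0; i < temp.size(); ++i) {
ret[i] = temp[i];
}
// Shrink the copy so its shape matches the slice that was extracted
ret.Resize(ret.GetRows(), end_pos - start_pos);
return ret;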
}
template <typename T>
void SetCausalMask(Tensor<T> &tensor, const Tensor<T> &mask_tensor,
size_t start_pos, size_t end_pos) {
if (end_pos > tensor.GetCols()) {
throw std::out_of_range("Index range out of bounds");
}
for (size_t d = 0; d < tensor.GetDepth(); ++d) {
for (size_t r = 0; r < tensor.GetRows(); ++r) {
size_t base_idx = (d * tensor.GetRows() + r) * tensor.GetCols();
for (size_t idx = start_pos; idx < end_pos; ++idx) {
tensor[base_idx + idx] = mask_tensor(r, idx - start_pos);
}
}
}
}
template <typename T>
Tensor<T> ConcatenateWithRange(const Tensor<T> &tensor, size_t start_pos,
size_t end_pos) {
Tensor<T> range;
tensor_ops::Arange(range, start_pos, end_pos);
size_t total_cols = tensor.GetCols() + range.GetSize();
Tensor<T> result(1, total_cols);
for (size_t i = 0; i < tensor.GetCols(); ++i) {
result(0, i) = tensor[i];
}
for (size_t i = 0; i < range.GetSize(); ++i) {
result(0, tensor.GetCols() + i) = range[i];
}
return result;
}
template <typename T, typename T1>
Tensor<T1> GetArgmax(const Tensor<T> &tensor) {
Tensor<T1> next_tokens(tensor.GetRows(), 1);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
size_t max_idx = 0;
T max_val = tensor(i, 0);
for (size_t j = 1; j < tensor.GetCols(); ++j) {
if (tensor(i, j) > max_val) {
max_val = tensor(i, j);
max_idx = j;
}
}
next_tokens(i, 0) = static_cast<T1>(max_idx);
}
return next_tokens;
}
template <typename T>
Tensor<T> Concatenate(const Tensor<T> &tensor, const Tensor<T> &other) {
Tensor<T> result(tensor.GetRows(), tensor.GetCols() + 1);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
for (size_t j = 0; j < tensor.GetCols(); ++j) {
result(i, j) = tensor(i, j);
}
result(i, tensor.GetCols()) = other(i, 0);
}
return result;
}
template <typename T>
void SetMaskAtPos(Tensor<T> &tensor, size_t pos, T value) {
if (pos >= tensor.GetCols()) {
throw std::out_of_range("Index out of bounds");
}
for (size_t i = 0; i < tensor.GetRows(); ++i) {
tensor(i, pos) = value;
}
}
template <typename T>
void SetMaskUpToPos(Tensor<T> &tensor, size_t batch_idx, size_t pos, T value) {
if (pos > tensor.GetCols()) {
throw std::out_of_range("Index out of bounds");
}
for (size_t r = 0; r < tensor.GetRows(); ++r) {
for (size_t i = 0; i < pos; ++i) {
tensor(r, i) = value;
}
}
}
template <typename T> Tensor<T> TriuMask(size_t row, size_t col) {
Tensor<T> mask(row, col);
mask.Ones();
for (size_t i = 0; i < row; ++i) {
for (size_t j = i + 1; j < col; ++j) {
mask(i, j) = 0;
}
}
return mask;
}
template <typename T>
Tensor<T> FilterByMask(const Tensor<T> &tensor, const Tensor<T> &mask,
size_t i) {
size_t count = 0;
for (size_t j = 0; j < mask.GetCols(); ++j) {
if (mask(i, j) == 1) {
count++;
}
}
Tensor<T> result(1, count);
size_t idx = 0;
for (size_t j = 0; j < mask.GetCols(); ++j) {
if (mask(i, j) == 1) {
result[idx++] = tensor[j];
}
}
return result;
}
} // namespace tensor_ops
#endif
|
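GetArgmax and Concatenate together implement greedy decoding: at each step the highest-scoring token is appended to the sequence until the EOS token appears. The loop can be sketched in plain Python as below. The `toy_logits` function is a made-up stand-in for the compiled model, and `max_new_tokens` is an assumed safety bound added for this sketch.

```python
def argmax(row):
    """Index of the largest value; the first one wins on ties, like GetArgmax."""
    best = 0
    for j in range(1, len(row)):
        if row[j] > row[best]:
            best = j
    return best


def greedy_decode(prompt_ids, next_logits, eos_token_id, max_new_tokens):
    """Append argmax tokens until EOS is produced or the budget is exhausted."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = argmax(next_logits(ids))
        ids.append(token)
        if token == eos_token_id:
            break
    return ids


def toy_logits(ids):
    """Toy model: always favours (last token + 1) in a vocabulary of 5."""
    target = (ids[-1] + 1) % 5
    return [1.0 if v == target else 0.0 for v in range(5)]


print(greedy_decode([0], toy_logits, eos_token_id=3, max_new_tokens=10))
# prints [0, 1, 2, 3]
```

As in the C++ example, the returned sequence contains the prompt followed by the generated tokens, including the final EOS token.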
Build with CMake and Run the Executable
Create a build directory, run cmake, compile the code, and execute the binary.
The execution generates a binary file containing the token ID sequence.
| mkdir ${YOUR_SAMPLE_PATH}/build
cd ${YOUR_SAMPLE_PATH}/build
cmake ..
make
# Run the executable
./text_generation
|
Generate Text from Output Data
Decode the output token ID sequence generated by the C/C++ executable into recognizable text using a Python script.
| import numpy as np
import torch
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
input_text = "Hey, are you conscious? Can you talk to me?"
batch_size = 1
# Prepare inputs
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
conversation = [[{"role": "user", "content": input_text}]] * batch_size
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
input_ids = inputs.input_ids
input_len = inputs.input_ids.shape[-1]
# Data decoding
output_sequence = torch.tensor(np.fromfile("c_text2text_generation_gen_id.bin", dtype=np.int64), dtype=torch.int64)
generated_texts = tokenizer.decode(output_sequence[input_len:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print("--- input text ---")
print(input_text)
print("--- Decoded C Result ---")
print(generated_texts)
|
Example Output:
| --- input text ---
Hey, are you conscious? Can you talk to me?
--- Decoded C Result ---
Hello! I'm an AI, which means I'm a computer program designed to simulate conversation and answer questions to the best of my ability. I don't have consciousness in the way that humans do, but I'm designed to be very responsive and interactive.
I can understand and respond to language, and I can even learn and improve over time based on the conversations I have with users like you. So, in a sense, I'm "awake" and ready to chat with you!
What would you like to talk about? Do you have a specific question or topic in mind, or do you just want to chat about something random? I'm here to listen and help if I can!
|
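The slice `output_sequence[input_len:]` in the decoding script works because the C++ program writes the prompt tokens followed by the generated tokens into one file. The sketch below makes that split explicit and also drops a trailing EOS token. `split_generated` is a hypothetical helper, and the EOS ID is the constant used in DoDecode().

```python
EOS_TOKEN_ID = 128009  # same constant as eos_token_id in DoDecode()


def split_generated(tokens, input_len, eos_token_id=EOS_TOKEN_ID):
    """Split a full sequence into (prompt, generated) and drop a trailing EOS."""
    prompt, generated = tokens[:input_len], tokens[input_len:]
    if generated and generated[-1] == eos_token_id:
        generated = generated[:-1]
    return prompt, generated


print(split_generated([1, 2, 3, 9, 8, 128009], input_len=3))
# prints ([1, 2, 3], [9, 8])
```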
References