Text Generation
This tutorial shows how to deploy the Llama3-8b model using the RBLN SDK C/C++ Runtime API. The model is compiled with the RBLN SDK Python API, and the resulting *.rbln files can then be used to run inference with the RBLN SDK C/C++ Runtime API. This approach combines the convenient model preparation of the Python API with the fast inference performance of C/C++.
This tutorial is divided into two parts:
- How to compile Llama3-8b with the Python API
- How to run inference on the compiled model with the C/C++ Runtime API
Prerequisites
Before you begin, make sure the following are installed on your system:
- RBLN SDK Python API (for model compilation)
- RBLN SDK C/C++ Runtime API
Step 1. Compile the Model
The RBLN Python API supports both compilation and inference, whereas the RBLN SDK C/C++ Runtime API supports inference only. In this tutorial, the RBLN Python API is used to compile the model and the RBLN SDK C/C++ Runtime API is used to run inference.
Model Compilation
Use the RBLNLlamaForCausalLM class from the optimum-rbln library. Its RBLNLlamaForCausalLM.from_pretrained() method downloads the Llama3-8b model from the Hugging Face Hub and compiles it. The compiled model can be saved to disk with the model.save_pretrained() method. This produces the compiled model files prefill.rbln and decoder.rbln in the Meta-Llama-3-8B-Instruct directory.
| # compile.py
import os
from optimum.rbln import RBLNLlamaForCausalLM
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = RBLNLlamaForCausalLM.from_pretrained(
model_id=model_id,
export=True,
rbln_batch_size=1,
rbln_max_seq_len=8192,
rbln_tensor_parallel_size=4,
)
# Save the compiled model
model.save_pretrained(os.path.basename(model_id))
|
Generate the Input Data
Use AutoTokenizer from the transformers library to tokenize the input.
Note
This tutorial shows how to use the RBLN SDK C/C++ Runtime API with Llama3-8b. Because the example focuses on C/C++-based inference, pre- and post-processing (tokenization) are handled with the Python API.
The input binary file c_input_ids.bin can be generated with the Python script below.
| # pre_process.py
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
input_text = "Hey, are you conscious? Can you talk to me?"
batch_size = 1
# Prepare inputs
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
conversation = [[{"role": "user", "content": input_text}]] * batch_size
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.numpy()
# Save the generated input binary files
input_ids.tofile("c_input_ids.bin")
|
Once the input data has been generated successfully, you will find the input binary file c_input_ids.bin.
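In llama_main.cc below, the prefill input tensor is created with a fixed size of 23 tokens, which matches this prompt after the chat template is applied. If you change the prompt, that size changes too; the standalone sketch below (a hypothetical helper, not part of the tutorial sources) prints how many int64 token IDs the generated binary contains so the tensor size can be adjusted accordingly.
| // check_input_ids.cc - hypothetical helper that reports the token count in c_input_ids.bin
#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
  // Open the binary produced by pre_process.py and measure its size.
  std::ifstream file("c_input_ids.bin", std::ios::binary | std::ios::ate);
  if (!file.is_open()) {
    std::cerr << "Could not open c_input_ids.bin" << std::endl;
    return 1;
  }
  const auto file_size = static_cast<size_t>(file.tellg());
  // Each token ID is stored as a 64-bit integer by input_ids.tofile().
  std::cout << "token count: " << file_size / sizeof(int64_t) << std::endl;
  return 0;
}
|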
Step 2. Run Inference with the RBLN SDK C/C++ Runtime API
Now you can deploy the model with the RBLN SDK C/C++ Runtime API. This involves loading the model, running inference, and decoding the output.
Note
${YOUR_SAMPLE_PATH} refers to the directory containing the CMake file and the inference code.
Prepare the CMake Build Script
The following CMake script declares the dependencies on external packages and shows how to link them against the example application code.
| # CMakeLists.txt
cmake_minimum_required(VERSION 3.26)
project(llama_binding)
# Collect all source files
file(GLOB SOURCE_FILES "*.cc")
# Define executable
add_executable(llama_binding llama_main.cc ${SOURCE_FILES})
# Link RBLN runtime
find_package(rbln CONFIG REQUIRED)
target_link_libraries(llama_binding rbln::rbln_runtime)
# Add header files directory
target_include_directories(llama_binding PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
|
Prepare the Inference Code
The main implementation files for inference are described below; a minimal sketch of the runtime call pattern they share follows this list:
- llama_main.cc: inference application that drives the model loading and execution workflow
- llama_tensor_example.hpp: custom tensor operators and data structure implementation
- llama_class_example.hpp/cc: inference wrapper built on the RBLN SDK C/C++ Runtime API, covering:
  - management of the prefill and decode phases of Llama3-8b
  - input/output buffer handling
  - control of the model execution flow
- llama_tensor_op_example.hpp: a set of tensor manipulation operations
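Before reading the full sources, the short sketch below (a hypothetical file, not part of the tutorial sources) shows the model/runtime lifetime pattern that llama_class_example.cc follows in Prepare() and DeInit(). The model path is a placeholder, and the actual input binding and execution calls are only summarized in comments.
| // lifetime_sketch.cc - minimal sketch of the create/run/destroy pattern (hypothetical file)
#include <rbln/rbln.h>

int main() {
  // Load a compiled *.rbln artifact and create a runtime bound to it.
  RBLNModel *mdl = rbln_create_model("${YOUR_SAMPLE_PATH}/Meta-Llama-3-8B-Instruct/prefill.rbln");
  RBLNRuntime *rt = rbln_create_runtime(mdl, nullptr, 0, 0);

  // Inference goes here: bind each input with rbln_set_input(rt, index, ptr),
  // execute with rbln_run(rt), then read results via rbln_get_output(rt, 0) and
  // rbln_get_output_layout(rt, 0) -- exactly as ForwardPrefill()/ForwardDecode() do below.

  // Release the runtime first, then the model.
  rbln_destroy_runtime(rt);
  rbln_destroy_model(mdl);
  return 0;
}
|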
llama_main.cc
| #include "llama_class_example.hpp"
int main() {
LLamaClass llama_cls;
// Init Model configuration
llama_cls.InitConfig();
// Create Model & Runtime
llama_cls.Prepare();
// Init LLamaClass
llama_cls.Init();
auto input_ids = Tensor<int64_t>(1, 23);
assert(LoadBinary<int64_t>(llama_cls.GetIdsPath(), input_ids) == true);
auto past_cached_length = Tensor<int32_t>();
llama_cls.PrepareInput(input_ids, past_cached_length);
// Process of Prefill phase
llama_cls.DoPrefill();
// Process of Decode phase
llama_cls.DoDecode();
// Generate c_text2text_generation_gen_id.bin
llama_cls.GenerateBinary();
// Reset LLamaClass for iteration
llama_cls.Reset();
// Deinit LLamaClass
llama_cls.DeInit();
return 0;
}
|
llama_class_example.hpp
| #ifndef RBLN_LLAMA_H
#define RBLN_LLAMA_H
#include <assert.h>
#include <rbln/rbln.h>
#include <cstring>
#include <fstream>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>
#include "llama_tensor_example.hpp"
#include "llama_tensor_op_example.hpp"
constexpr uint32_t kDecodeInputCount = 3;
constexpr uint32_t kPrefillInputCount = 4;
template <typename T>
bool LoadBinary(const std::string &filename, Tensor<T> &data) {
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {
std::cout << "Could not open file: " + filename << std::endl;
return false;
}
file.seekg(0, std::ios::end);
const size_t fileSize = file.tellg();
file.seekg(0, std::ios::beg);
if (fileSize % sizeof(T) != 0) {
  std::cout << "File size(" << fileSize << ") is not a multiple of data type size("
            << sizeof(T) << ")" << std::endl;
  return false;
}
// Guard against overflowing the caller-provided tensor buffer.
if (fileSize != data.GetSize() * sizeof(T)) {
  std::cout << "File size(" << fileSize << ") does not match the tensor buffer size("
            << data.GetSize() * sizeof(T) << ")" << std::endl;
  return false;
}
file.read(static_cast<char *>(data.GetData()), fileSize);
if (file.fail()) {
std::cout << "Failed to read file: " << filename << std::endl;
return false;
}
return true;
}
int WriteToFile(const std::string &filePath, const void *data,
uint32_t data_len);
class LLamaClass {
public:
LLamaClass() = default;
~LLamaClass() = default;
// Init Model configuration
void InitConfig() {
prefill_id_ = "${YOUR_SAMPLE_PATH}/Meta-Llama-3-8B-Instruct/prefill.rbln";
dec_id_ = "${YOUR_SAMPLE_PATH}/Meta-Llama-3-8B-Instruct/decoder.rbln";
input_ids_path_ = "${YOUR_SAMPLE_PATH}/c_input_ids.bin";
batch_size_ = 1;
max_seq_len_ = 8192;
prefill_chunk_size_ = 128;
}
// Init LLamaClass
void Init();
// Reset LLamaClass for iteration
void Reset();
// Deinit LLamaClass
void DeInit();
// Create Model & Runtime
void Prepare();
// Process of Prefill phase
void DoPrefill();
// Process of Decode phase
void DoDecode();
// Generate c_text2text_generation_gen_id.bin
void GenerateBinary();
template <typename T0, typename T1>
void PrepareInput(Tensor<T0> &input_ids, Tensor<T1> &v0) {
if (!v0.GetSize()) {
auto input_tensors = input_ids;
auto batch_size = input_tensors.GetRows();
std::vector<Tensor<int64_t>> l_input_tensors;
std::vector<Tensor<int32_t>> cache_positions;
auto past_cached_length = Tensor<int32_t>(batch_size, 1);
for (int i = 0; i < batch_size; i++) {
auto input_tensor =
tensor_ops::Reshape(input_tensors, input_tensors.GetCols());
auto valid_len = input_tensor.GetCols();
auto cache_position = Tensor<int32_t>();
tensor_ops::Arange(cache_position, 0, valid_len);
tensor_ops::Reshape(cache_position, 1, valid_len);
past_cached_length[i] = valid_len;
l_input_tensors.emplace_back(tensor_ops::UnSqueeze(input_tensor));
cache_positions.emplace_back(tensor_ops::UnSqueeze(cache_position));
}
mdl_input_ = ModelInput{l_input_tensors[0], cache_positions[0],
past_cached_length};
} else {
auto input_tensor = tensor_ops::SelectLastColumn(input_ids);
auto cache_positions = v0;
auto past_cached_length = v0 + 1;
mdl_input_ =
ModelInput{input_tensor, cache_positions, past_cached_length};
}
}
const std::string &GetIdsPath() { return input_ids_path_; }
RBLNModel *prefill_mdl_;
RBLNModel *dec_mdl_;
RBLNRuntime *prefill_rt_;
RBLNRuntime *dec_rt_;
private:
void ForwardPrefill();
void ForwardDecode();
typedef struct {
Tensor<int64_t> input_ids;
Tensor<int32_t> cache_position;
Tensor<int32_t> past_cached_length;
} ModelInput;
ModelInput mdl_input_;
int max_seq_len_;
int batch_size_;
int prefill_chunk_size_;
bool unfinished_sequences_;
std::string prefill_id_;
std::string dec_id_;
std::string input_ids_path_;
Tensor<float> output_logits_;
Tensor<int64_t> input_ids_;
};
#endif
|
llama_class_example.cc
| #include "llama_class_example.hpp"
#include <iostream>
int WriteToFile(const std::string &filePath, const void *data,
uint32_t data_len) {
std::ofstream fout;
fout.open(filePath, std::ios::out | std::ios::binary);
if (fout.is_open()) {
fout.write((const char *)data, data_len);
fout.close();
return 1;
}
return 0;
}
// Prefill forward method
void LLamaClass::ForwardPrefill() {
// Get input tensors and cache position
auto input_tensors = mdl_input_.input_ids;
auto cache_position = mdl_input_.cache_position;
// Get query length (number of tokens in the input sequence)
int query_length = input_tensors.GetCols();
// Process input in chunks (divided into chunks of size prefill_chunk_size)
for (auto step = 0; step < query_length; step += prefill_chunk_size_) {
// If the last chunk is incomplete (remaining tokens less than chunk size)
if ((step + prefill_chunk_size_) > query_length) {
// Calculate and add necessary padding to the input tensor
int padding_needed = step + prefill_chunk_size_ - query_length;
input_tensors = tensor_ops::Pad(input_tensors, 0, padding_needed);
// Extend cache positions (concatenate current cache positions with additional range)
auto new_cache_position = tensor_ops::ConcatenateWithRange(
cache_position, query_length, step + prefill_chunk_size_);
// Slice input tensors and cache positions for the current chunk
auto sliced_input_tensors = tensor_ops::VerticalSlicing(
input_tensors, step, step + prefill_chunk_size_);
auto sliced_cache_positions = tensor_ops::VerticalSlicing(
new_cache_position, step, step + prefill_chunk_size_);
// Create query index and empty block tables(with value 0) for KV-cache management
Tensor<int> query_idx(query_length % prefill_chunk_size_ - 1);
Tensor<int16_t> block_tables(0);
// Check if prefill input count exceeds expected limit
if (rbln_get_num_inputs(prefill_rt_) > kPrefillInputCount) {
throw std::runtime_error(
"You appear to be running on ATOM(RBLN-CA02). RSD is only "
"available on ATOM+(RBLN-CA12). Check your NPU type with "
"'rbln-stat' command.");
}
// Set inputs for the model runtime
rbln_set_input(prefill_rt_, 0, sliced_input_tensors.GetData());
rbln_set_input(prefill_rt_, 1, sliced_cache_positions.GetData());
rbln_set_input(prefill_rt_, 2, query_idx.GetData());
rbln_set_input(prefill_rt_, 3, block_tables.GetData());
// Run the model
rbln_run(prefill_rt_);
// Get output logits and convert to tensor
void *logit = static_cast<float *>(rbln_get_output(prefill_rt_, 0));
auto layout = rbln_get_output_layout(prefill_rt_, 0);
output_logits_ = Tensor<float>(logit, layout->shape[1], layout->shape[2]);
}
}
// Predict the next token from logits using Argmax
auto next_tokens = tensor_ops::GetArgmax<float, int64_t>(output_logits_);
// Concatenate existing input IDs with the predicted next token
input_ids_ = tensor_ops::Concatenate(mdl_input_.input_ids, next_tokens);
}
// Decoder forward method
void LLamaClass::ForwardDecode() {
// Get input tensors for decoding from prefill step
auto dec_input_tensors = mdl_input_.input_ids;
// Get batch size from the number of rows in input tensors
auto dec_batch_size = dec_input_tensors.GetRows();
// Get cache positions for decoding from prefill step
auto dec_cache_position = mdl_input_.cache_position;
// For each item in the batch
for (auto b_idx = 0; b_idx < dec_batch_size; b_idx++) {
// Get the current decoding step
auto decoding_step = dec_cache_position[b_idx];
}
// Initialize block tables for KV-cache management with shape of (batch_size, 1)
Tensor<int16_t> block_tables(batch_size_, 1);
// Check if decoder input count exceeds expected limit
if (rbln_get_num_inputs(dec_rt_) > kDecodeInputCount) {
throw std::runtime_error(
"You appear to be running on ATOM(RBLN-CA02). RSD is only available on "
"ATOM+(RBLN-CA12). Check your NPU type with 'rbln-stat' command.");
}
// Set inputs for decoder runtime
rbln_set_input(dec_rt_, 0, dec_input_tensors.GetData());
rbln_set_input(dec_rt_, 1, dec_cache_position.GetData());
rbln_set_input(dec_rt_, 2, block_tables.GetData());
// Run the decoder
rbln_run(dec_rt_);
// Get output logits from the decoder
float *dec_logit = static_cast<float *>(rbln_get_output(dec_rt_, 0));
auto dec_layout = rbln_get_output_layout(dec_rt_, 0);
// Convert output to tensor format
output_logits_ =
Tensor<float>(dec_logit, dec_layout->shape[1], dec_layout->shape[2]);
}
// Prefill forward wrapper
void LLamaClass::DoPrefill() {
// Run the prefill phase to process the input sequence and generate the next token
ForwardPrefill();
}
// Decoder forward wrapper
void LLamaClass::DoDecode() {
while (unfinished_sequences_) {
// Prepare input for the model with current token IDs and past cache info
PrepareInput(input_ids_, mdl_input_.past_cached_length);
// Run the decoder to get the next token logits
ForwardDecode();
// Get the next token using Argmax
auto dec_next_tokens =
tensor_ops::GetArgmax<float, int64_t>(output_logits_);
// Append/Concatenate the new tokens to the existing sequence
input_ids_ = tensor_ops::Concatenate(input_ids_, dec_next_tokens);
auto stopping_criteria = [](const auto &array) -> bool {
const int32_t eos_token_id = 128009;
// Stop generation if EOS token is found at the last position
if (array(0, array.GetCols() - 1) == eos_token_id)
return false;
return true;
};
unfinished_sequences_ = stopping_criteria(input_ids_);
}
}
void LLamaClass::Init() {
unfinished_sequences_ = true;
}
void LLamaClass::Reset() {
output_logits_ = Tensor<float>();
input_ids_ = Tensor<int64_t>();
}
void LLamaClass::DeInit() {
// Destroy runtime
rbln_destroy_runtime(prefill_rt_);
rbln_destroy_runtime(dec_rt_);
// Destroy model
rbln_destroy_model(prefill_mdl_);
rbln_destroy_model(dec_mdl_);
}
void LLamaClass::Prepare() {
// Create prefill/decoder model
prefill_mdl_ = rbln_create_model(prefill_id_.c_str());
dec_mdl_ = rbln_create_model(dec_id_.c_str());
// Create prefill/decoder runtime
prefill_rt_ = rbln_create_runtime(prefill_mdl_, nullptr, 0, 0);
dec_rt_ = rbln_create_runtime(dec_mdl_, nullptr, 0, 0);
}
void LLamaClass::GenerateBinary() {
if(!WriteToFile("c_text2text_generation_gen_id.bin", input_ids_.GetData(),
input_ids_.GetSize() * sizeof(int64_t))) {
std::cout << "Fail to save c_text2text_generation_gen_id.bin" << std::endl;
}
}
|
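To make the prefill chunking in ForwardPrefill() concrete: with the 23-token prompt from llama_main.cc and prefill_chunk_size_ = 128, the loop runs exactly once, pads the input up to one full 128-token chunk, and derives the query index from the remainder. The small sketch below (not part of the tutorial sources) reproduces that arithmetic:
| // prefill_chunking_example.cc - worked numbers for ForwardPrefill() with this tutorial's input
#include <iostream>

int main() {
  const int query_length = 23;         // tokens in c_input_ids.bin (see llama_main.cc)
  const int prefill_chunk_size = 128;  // prefill_chunk_size_ set in InitConfig()

  for (int step = 0; step < query_length; step += prefill_chunk_size) {
    if (step + prefill_chunk_size > query_length) {
      const int padding_needed = step + prefill_chunk_size - query_length;  // 0 + 128 - 23 = 105
      const int query_idx = query_length % prefill_chunk_size - 1;          // 23 % 128 - 1 = 22
      std::cout << "step=" << step << " padding_needed=" << padding_needed
                << " query_idx=" << query_idx << std::endl;
    }
  }
  return 0;
}
|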
llama_tensor_example.hpp
| #ifndef RBLN_TENSOR_H
#define RBLN_TENSOR_H
#include <algorithm>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>
template <typename T> class Tensor {
public:
Tensor() : depth_(1), rows_(0), cols_(0) { array_.reserve(0); }
Tensor(T val) : depth_(1), rows_(0), cols_(1) { array_.resize(1, T{val}); }
Tensor(size_t row, size_t col) : depth_(1), rows_(row), cols_(col) {
array_.resize(GetCapacity(), T{});
}
Tensor(const void *data, size_t row, size_t col)
: depth_(1), rows_(row), cols_(col) {
const T *ptr = static_cast<const T *>(data);
array_.assign(ptr, ptr + GetCapacity());
}
Tensor(size_t depth, size_t row, size_t col)
: depth_(depth), rows_(row), cols_(col) {
array_.resize(GetCapacity(), T{});
}
~Tensor() = default;
Tensor(const Tensor &other) {
array_ = other.array_;
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
Tensor(Tensor &&other) {
array_ = std::move(other.array_);
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
T &operator[](size_t i) { return array_[i]; }
T operator[](size_t i) const { return array_[i]; }
Tensor operator=(const Tensor &other) {
if (this != &other) {
array_ = other.array_;
depth_ = other.depth_;
rows_ = other.rows_;
cols_ = other.cols_;
}
return *this;
}
T &operator()(size_t r_idx, size_t c_idx) {
if (r_idx >= rows_ || c_idx >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[cols_ * r_idx + c_idx];
}
T &operator()(size_t col) {
if (col >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[col];
}
T operator()(size_t r_idx, size_t c_idx) const {
if (r_idx >= rows_ || c_idx >= cols_) {
throw std::out_of_range("Index out of bounds");
}
return array_[cols_ * r_idx + c_idx];
}
Tensor operator+(T val) {
Tensor ret(rows_, cols_);
for (auto r = 0; r < rows_; r++) {
for (auto c = 0; c < cols_; c++) {
ret(r, c) = array_[r * cols_ + c] + val;
}
}
return ret;
}
void *GetData() { return array_.data(); }
size_t GetRows() const { return rows_; }
size_t GetCols() const { return cols_; }
size_t GetDepth() const { return depth_; }
size_t GetSize() const { return array_.size(); }
void Ones() { std::fill(array_.begin(), array_.end(), T{1}); }
void Zeros() { std::fill(array_.begin(), array_.end(), T{0}); }
size_t GetCapacity(size_t r, size_t c) const {
return std::max(1UL, r) * std::max(1UL, c);
}
size_t GetCapacity() const {
return GetCapacity(rows_, cols_) * std::max(1UL, depth_);
}
void Resize(size_t row, size_t col) {
rows_ = row;
cols_ = col;
array_.resize(GetCapacity(row, col));
}
private:
std::vector<T> array_;
size_t depth_;
size_t rows_;
size_t cols_;
};
#endif
|
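As a quick illustration of the Tensor class above, the snippet below (a hypothetical standalone file, assuming the header is saved as llama_tensor_example.hpp) builds a small 2x3 tensor, writes a few elements through the bounds-checked operator(), and uses the element-wise addition operator:
| // tensor_usage_example.cc - small usage sketch for the Tensor class (hypothetical file)
#include <cstdint>
#include <iostream>
#include "llama_tensor_example.hpp"

int main() {
  // 2x3 tensor, zero-initialized by the (row, col) constructor.
  Tensor<int32_t> t(2, 3);
  t(0, 2) = 7;                 // bounds-checked element access
  t(1, 0) = 5;

  Tensor<int32_t> u = t + 1;   // element-wise addition returns a new tensor

  std::cout << "rows=" << t.GetRows() << " cols=" << t.GetCols()
            << " size=" << t.GetSize() << std::endl;  // rows=2 cols=3 size=6
  std::cout << "u(0,2)=" << u(0, 2) << std::endl;     // 8
  std::cout << "u(1,0)=" << u(1, 0) << std::endl;     // 6
  return 0;
}
|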
llama_tensor_op_example.hpp
| #ifndef RBLN_LLAMA_OPS_H
#define RBLN_LLAMA_OPS_H
#include <stdexcept>
#include <vector>
#include "llama_tensor_example.hpp"
// Tensor operations implementation
namespace tensor_ops {
template <typename T>
Tensor<T> Reshape(const Tensor<T> &tensor, int row, int col) {
if (tensor.GetCapacity(row, col) != tensor.GetCapacity()) {
throw std::runtime_error("Cannot reshape: total size must remain the same");
}
Tensor<T> ret(tensor);
std::vector<T> temp(tensor.GetSize());
for (size_t i = 0; i < row; ++i) {
for (size_t j = 0; j < col; ++j) {
size_t new_idx = i * col + j;
size_t old_idx = (new_idx / col) * tensor.GetCols() + (new_idx % col);
temp[new_idx] = tensor[old_idx];
}
}
for (size_t i = 0; i < temp.size(); ++i) {
ret[i] = temp[i];
}
ret.Resize(row, col);
return ret;
}
template <typename T> Tensor<T> Reshape(const Tensor<T> &tensor, int col) {
if (col != tensor.GetCapacity()) {
throw std::runtime_error("Cannot reshape: total size must remain the same");
}
Tensor<T> ret(tensor);
ret.Resize(0, col);
return ret;
}
template <typename T> void Arange(Tensor<T> &tensor, int start, int stop) {
tensor.Resize(0, stop - start);
for (size_t i = 0; i < stop - start; ++i) {
tensor[i] = static_cast<T>(start + i);
}
}
template <typename T> Tensor<T> UnSqueeze(const Tensor<T> &tensor) {
Tensor<T> ret(1, tensor.GetSize());
for (size_t i = 0; i < tensor.GetSize(); ++i) {
ret(0, i) = tensor[i];
}
return ret;
}
template <typename T> Tensor<T> SelectLastColumn(const Tensor<T> &tensor) {
Tensor<T> result(tensor.GetRows(), 1);
size_t last_col = tensor.GetCols() - 1;
for (size_t i = 0; i < tensor.GetRows(); ++i) {
result(i, 0) = tensor(i, last_col);
}
return result;
}
template <typename T>
Tensor<T> Pad(const Tensor<T> &tensor, size_t start_pos, size_t end_pos) {
Tensor<T> padded(tensor.GetRows(), tensor.GetCols() + start_pos + end_pos);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
for (size_t j = 0; j < tensor.GetCols(); ++j) {
padded(i, start_pos + j) = tensor(i, j);
}
}
return padded;
}
template <typename T>
Tensor<T> VerticalSlicing(Tensor<T> &tensor, size_t start_pos, size_t end_pos) {
Tensor<T> ret(tensor);
std::vector<T> temp(ret.GetCapacity(ret.GetRows(), (end_pos - start_pos)));
for (size_t i = 0; i < ret.GetRows(); ++i) {
for (size_t j = start_pos; j < end_pos; ++j) {
temp[i * (end_pos - start_pos) + (j - start_pos)] = ret(i, j);
}
}
for (size_t i = 0; i < temp.size(); ++i) {
ret[i] = temp[i];
}
return ret;
}
template <typename T>
void SetCausalMask(Tensor<T> &tensor, const Tensor<T> &mask_tensor,
size_t start_pos, size_t end_pos) {
if (end_pos > tensor.GetCols()) {
throw std::out_of_range("Index range out of bounds");
}
for (size_t d = 0; d < tensor.GetDepth(); ++d) {
for (size_t r = 0; r < tensor.GetRows(); ++r) {
size_t base_idx = (d * tensor.GetRows() + r) * tensor.GetCols();
for (size_t idx = start_pos; idx < end_pos; ++idx) {
tensor[base_idx + idx] = mask_tensor(r, idx - start_pos);
}
}
}
}
template <typename T>
Tensor<T> ConcatenateWithRange(const Tensor<T> &tensor, size_t start_pos,
size_t end_pos) {
Tensor<T> range;
tensor_ops::Arange(range, start_pos, end_pos);
size_t total_cols = tensor.GetCols() + range.GetSize();
Tensor<T> result(1, total_cols);
for (size_t i = 0; i < tensor.GetCols(); ++i) {
result(0, i) = tensor[i];
}
for (size_t i = 0; i < range.GetSize(); ++i) {
result(0, tensor.GetCols() + i) = range[i];
}
return result;
}
template <typename T, typename T1>
Tensor<T1> GetArgmax(const Tensor<T> &tensor) {
Tensor<T1> next_tokens(tensor.GetRows(), 1);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
size_t max_idx = 0;
T max_val = tensor(i, 0);
for (size_t j = 1; j < tensor.GetCols(); ++j) {
if (tensor(i, j) > max_val) {
max_val = tensor(i, j);
max_idx = j;
}
}
next_tokens(i, 0) = static_cast<T1>(max_idx);
}
return next_tokens;
}
template <typename T>
Tensor<T> Concatenate(const Tensor<T> &tensor, const Tensor<T> &other) {
Tensor<T> result(tensor.GetRows(), tensor.GetCols() + 1);
for (size_t i = 0; i < tensor.GetRows(); ++i) {
for (size_t j = 0; j < tensor.GetCols(); ++j) {
result(i, j) = tensor(i, j);
}
result(i, tensor.GetCols()) = other(i, 0);
}
return result;
}
template <typename T>
void SetMaskAtPos(Tensor<T> &tensor, size_t pos, T value) {
if (pos >= tensor.GetCols()) {
throw std::out_of_range("Index out of bounds");
}
for (size_t i = 0; i < tensor.GetRows(); ++i) {
tensor(i, pos) = value;
}
}
template <typename T>
void SetMaskUpToPos(Tensor<T> &tensor, size_t batch_idx, size_t pos, T value) {
if (pos > tensor.GetCols()) {
throw std::out_of_range("Index out of bounds");
}
for (size_t r = 0; r < tensor.GetRows(); ++r) {
for (size_t i = 0; i < pos; ++i) {
tensor(r, i) = value;
}
}
}
template <typename T> Tensor<T> TriuMask(size_t row, size_t col) {
Tensor<T> mask(row, col);
mask.Ones();
for (size_t i = 0; i < row; ++i) {
for (size_t j = i + 1; j < col; ++j) {
mask(i, j) = 0;
}
}
return mask;
}
template <typename T>
Tensor<T> FilterByMask(const Tensor<T> &tensor, const Tensor<T> &mask,
size_t i) {
size_t count = 0;
for (size_t j = 0; j < mask.GetCols(); ++j) {
if (mask(i, j) == 1) {
count++;
}
}
Tensor<T> result(1, count);
size_t idx = 0;
for (size_t j = 0; j < mask.GetCols(); ++j) {
if (mask(i, j) == 1) {
result[idx++] = tensor[j];
}
}
return result;
}
} // namespace tensor_ops
#endif
|
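Finally, a short usage sketch for a few of the operations above (Arange, UnSqueeze, GetArgmax, Concatenate), assuming both headers are available under the names used in this tutorial:
| // tensor_ops_usage_example.cc - small usage sketch for tensor_ops (hypothetical file)
#include <cstdint>
#include <iostream>
#include "llama_tensor_example.hpp"
#include "llama_tensor_op_example.hpp"

int main() {
  // Arange + UnSqueeze: build a 1x4 row of cache positions [0, 1, 2, 3].
  Tensor<int32_t> pos;
  tensor_ops::Arange(pos, 0, 4);
  auto cache_position = tensor_ops::UnSqueeze(pos);
  std::cout << "cache_position cols: " << cache_position.GetCols() << std::endl;  // 4

  // GetArgmax: pick the highest-scoring index per row of a logits tensor.
  Tensor<float> logits(1, 4);
  logits(0, 0) = 0.1f; logits(0, 1) = 0.9f; logits(0, 2) = 2.5f; logits(0, 3) = -1.0f;
  auto next_token = tensor_ops::GetArgmax<float, int64_t>(logits);
  std::cout << "argmax index: " << next_token(0, 0) << std::endl;                 // 2

  // Concatenate: append the predicted token to an existing ID sequence.
  Tensor<int64_t> ids(1, 3);
  ids(0, 0) = 10; ids(0, 1) = 20; ids(0, 2) = 30;
  auto extended = tensor_ops::Concatenate(ids, next_token);
  std::cout << "sequence length: " << extended.GetCols() << std::endl;            // 4
  return 0;
}
|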
Build with CMake
| mkdir ${YOUR_SAMPLE_PATH}/build
cd ${YOUR_SAMPLE_PATH}/build
cmake ..
make
|
Run the Executable
After following all of the steps above, the compiled executable llama_binding will be located in the build directory.
Running the executable creates a c_text2text_generation_gen_id.bin file in local storage. This file contains the token ID sequence generated by the Llama3-8b decoder. You can decode it into human-readable text with the Python code below.
Generate Text from the Output Data
| # post_process.py
import numpy as np
import torch
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
input_text = "Hey, are you conscious? Can you talk to me?"
batch_size = 1
# Prepare inputs
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
conversation = [[{"role": "user", "content": input_text}]] * batch_size
text = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", padding=True)
input_ids = inputs.input_ids
input_len = inputs.input_ids.shape[-1]
# Decode the output data
output_sequence = torch.tensor(np.fromfile("c_text2text_generation_gen_id.bin", dtype=np.int64), dtype=torch.int64)
generated_texts = tokenizer.decode(
output_sequence[input_len:], skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print("--- input text ---")
print(input_text)
print("--- Decoded C Result ---")
print(generated_texts)
|
The result looks like this:
| --- input text ---
Hey, are you conscious? Can you talk to me?
--- Decoded C Result ---
Hello! I'm an AI, which means I'm a computer program designed to simulate conversation and answer questions to the best of my ability. I don't have consciousness in the way that humans do, but I'm designed to be very responsive and interactive.
I can understand and respond to language, and I can even learn and improve over time based on the conversations I have with users like you. So, in a sense, I'm "awake" and ready to chat with you!
What would you like to talk about? Do you have a specific question or topic in mind, or do you just want to chat about something random? I'm here to listen and help if I can!
|