Llama3.1-8B with Flash Attention¶
TorchServe provides a vLLM Handler, a custom handler that supports the vLLM engine. With this handler, `vllm-rbln` can be leveraged to serve LLM models efficiently. This tutorial guides you through serving the Llama3.1-8B model with Flash Attention using TorchServe's vLLM Handler and `vllm-rbln`.
For instructions on setting up the TorchServe environment, refer to TorchServe. For the YAML files, model compilation, and TorchServe configuration introduced on this page, see the Model Zoo.
Note
This tutorial is written with the assumption that the reader already has a good understanding of how to compile and infer models using RBLN SDK. If you are not familiar with RBLN SDK, please refer to the Tutorials.
Prerequisites¶
The following prerequisites are required for this tutorial.
- Ubuntu 20.04 LTS (Debian bullseye) or higher
- RBLN NPUs equipped (e.g., RBLN ATOM+ NPU)
- Python (supports 3.9 - 3.12)
- RBLN SDK (driver, compiler) (RBLN SDK Driver >= 1.2.92, rebel-compiler >= 0.7.3)
- TorchServe
- vllm-rbln >= 0.7.3.post2
- optimum-rbln >= 0.7.3
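As a reference, the Python packages above can typically be installed with pip. This is only a sketch: the RBLN packages may be distributed through a dedicated package index, so follow the RBLN SDK and TorchServe installation guides for the exact sources and versions.

```bash
# Sketch only: package index and pinned versions depend on your RBLN SDK setup.
pip install torchserve torch-model-archiver            # TorchServe and its archiving tool
pip install "optimum-rbln>=0.7.3" "vllm-rbln>=0.7.3.post2"
```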
Note
To use the `Llama3.1-8B` model, 8 RBLN NPUs are required. You can refer to the recommended number of NPUs for each model in Optimum RBLN Multi-NPUs Supported Models.
Note
The `vllm-rbln` package does not depend on `vllm`, and installing both may cause operational issues. If you installed `vllm` after `vllm-rbln`, please reinstall `vllm-rbln` to ensure proper functionality.
Compile Llama3.1-8B¶
To prepare the model for serving, create the `rbln_model` folder and navigate into it. Then compile the `Llama3.1-8B` model using `optimum-rbln`.
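The snippet below is a minimal sketch of the compilation step with `optimum-rbln`. The model ID, sequence length, KV-cache partition length, and the Flash Attention argument name are assumptions for illustration; check the optimum-rbln documentation and the Model Zoo for the exact arguments for your environment.

```python
# compile_llama3.py -- minimal sketch; argument names and values are assumptions
# and should be verified against your installed optimum-rbln version.
from optimum.rbln import RBLNLlamaForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed Hugging Face model ID

model = RBLNLlamaForCausalLM.from_pretrained(
    model_id,
    export=True,                        # compile the model for RBLN NPUs
    rbln_batch_size=1,                  # must match max_num_seqs in model_config.yaml
    rbln_max_seq_len=131072,            # assumed maximum sequence length
    rbln_tensor_parallel_size=8,        # 8 RBLN NPUs, as noted in the prerequisites
    rbln_attn_impl="flash_attn",        # enable Flash Attention (assumed argument name)
    rbln_kvcache_partition_len=16384,   # must equal block_size in model_config.yaml
)

# Save the compiled artifacts inside the rbln_model folder created above.
model.save_pretrained("Llama-3.1-8B-Instruct")
```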
Note
You need to select an appropriate batch size. In this case, it is set to 1.
Quick Start with TorchServe¶
In TorchServe, models are served as Model Archive (`.mar`) units, which contain all the information necessary for serving the model. The following guide explains how to create a `.mar` file and use it for model serving.
RBLN vLLM Handler¶
TorchServe provides a vLLM Handler to utilize the vLLM engine. Because the handler code may have a dependency issue with the installed vLLM version, we suggest using the RBLN vLLM Handler, which is compatible with the latest version of `vllm-rbln`, as shown below:
`rbln_vllm_handler.py` (full handler source, approximately 182 lines, not reproduced here)
Write the Model Configuration¶
Let's create a `model_config.yaml` file to configure the number of workers and the TorchServe frontend parameters for serving the `Llama3.1-8B` model. This YAML file also contains the vLLM engine settings for LLM serving. For more details, refer to TorchServe Document - Advanced configuration. The key fields are described below, followed by a sketch of the configuration file.
- `max_num_seqs`: Maximum number of sequences per iteration. This MUST match the compiled `batch_size`.
- `block_size`: The block size for Paged Attention. When using Flash Attention, the block size must be equal to `rbln_kvcache_partition_len`.
- `device`: Device type for vLLM execution. Set it to `rbln`.
- `model`: Absolute path of the compiled model.
- `served_model_name`: The name of the model to be served.
- `max_num_batched_tokens`: Should be set to the same value as `max_model_len` for RBLN devices.
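Below is a minimal sketch of `model_config.yaml`. The worker settings, paths, and numeric values (sequence length, block size, and so on) are assumptions for illustration and must be adjusted to match your compiled model:

```yaml
# model_config.yaml -- minimal sketch; paths and values are assumptions.
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1              # one worker per model instance
startupTimeout: 1200       # allow time for the compiled model to load
asyncCommunication: true   # communicate asynchronously with the vLLM worker

# Handler / vLLM engine parameters
handler:
    vllm_engine_config:
        max_num_seqs: 1                    # must match the compiled batch_size
        max_model_len: 131072              # assumed maximum context length
        max_num_batched_tokens: 131072     # same value as max_model_len for RBLN
        block_size: 16384                  # must equal rbln_kvcache_partition_len
        device: "rbln"
        model: "/path/to/rbln_model/Llama-3.1-8B-Instruct"   # absolute path to the compiled model
        served_model_name: "llama3.1-8b"
```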
Model Archiving with torch-model-archiver¶
The `model_store` directory stores `.mar` files for serving, including the `Llama3.1-8B` model archive used in this tutorial. Now that the setup is complete, run the `torch-model-archiver` command to create the model archive file; an example command is shown after the option list below. The options passed to `torch-model-archiver` are as follows.
- `--model-name`: Sets the name of the model to be served, `llama3.1-8b`.
- `--version`: Specifies the version of the model to be served with TorchServe.
- `--handler`: Specifies the handler script for the model, set to `rbln_vllm_handler.py`.
- `--config-file`: Specifies the YAML configuration file for the model, set to `model_config.yaml`.
- `--archive-format`: Specifies the archiving format. Set to `no-archive`.
- `--export-path`: Specifies the directory where the archived model will be stored, set to the `model_store` folder created earlier.
- `--extra-files`: Specifies a list of additional dependency files to include in the archive. Multiple files or directories can be specified, separated by commas (`,`). The internal folder structure of the specified directories is preserved in the archive.
Once the archiving process using `torch-model-archiver` is complete, a folder named `llama3.1-8b` will be created in `model_store`, from which the model will be served. Since the `no-archive` option was used, the archive's internal files are stored in this folder instead of being packaged into a `.mar` file.
Run torchserve¶
TorchServe can be started by running the following command. For a simple test where token authentication is not required, you can use the `--disable-token-auth` option.
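A minimal start command might look like the following; the model name must match the folder created in `model_store`:

```bash
torchserve --start --ncs \
    --model-store model_store \
    --models llama3.1-8b \
    --disable-token-auth
```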
- `--start`: Starts the TorchServe service.
- `--ncs`: Disables the snapshot feature.
- `--model-store`: Specifies the directory containing models.
- `--models`: Specifies the model(s) to load; if set to `all`, all models available in the `model_store` directory are loaded.
- `--disable-token-auth`: Disables authentication for the management API endpoints, simplifying testing.
When TorchServe starts successfully, it runs in the background. The command to stop TorchServe is as follows:
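For reference, stopping the service is done with the standard TorchServe stop command:

```bash
torchserve --stop
```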
The Management API of TorchServe receives requests on port 8081 by default.
You can check the list of models currently being served using the following Management API.
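For example, using the default management port 8081:

```bash
# List the models currently registered with TorchServe.
curl -X GET "http://localhost:8081/models"
```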
If the operation is successful, you can verify that the `Llama3.1-8B` model is being served.
Inference Request with TorchServe Inference API¶
Simple Request with curl¶
Now, we can send an inference request using the Prediction API from the TorchServe Inference API to test the Llama3.1-8B model served with TorchServe.
The Inference API of TorchServe receives requests on port 8080 by default.
Make an inference request using the TorchServe Inference API with curl.
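The request below is a sketch. The exact payload format depends on the handler implementation; here an OpenAI-style completion body is assumed, and the prompt and token limit are placeholders:

```bash
# Send a prompt to the served model via the default inference port 8080.
curl -X POST "http://localhost:8080/predictions/llama3.1-8b" \
    -H "Content-Type: application/json" \
    -d '{
          "prompt": "What is the capital of France?",
          "max_tokens": 64
        }'
```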
If the inference request is successful, the generated text is returned in the response.