Llama3.1-8B with Flash Attention¶
Flash Attention enables efficient handling of long contexts in models like Llama3.1-8B by reducing memory usage and improving throughput. This tutorial shows how to serve Llama3.1-8B with Flash Attention using vllm-rbln and the Nvidia Triton Inference Server.
You can find the actual commands required to compile the model and initialize the Triton vllm_backend in our Model Zoo.
Note
This tutorial is written with the assumption that the reader already has a good understanding of how to compile and infer models using RBLN SDK. If you are not familiar with RBLN SDK, please refer to the Tutorials.
Prerequisites¶
- Ubuntu 20.04 LTS (Debian bullseye) or higher
- RBLN NPUs equipped (RBLN ATOM+)
- Python (supports 3.9 - 3.12)
- Docker
- RBLN SDK Driver >= 1.2.92
- rebel-compiler >= 0.7.3
- optimum-rbln >= 0.7.3.post2
- vllm-rbln >= 0.7.3
Note
To use the Llama3.1-8B model, 8 RBLN NPUs are required. You can find the recommended number of NPUs for each model in Optimum RBLN Multi-NPUs Supported Models.
Note
Since the vllm-rbln package does not depend on the vllm package, duplicate installations may cause operational issues. If you installed the vllm package after vllm-rbln, please reinstall the vllm-rbln package to ensure proper functionality.
Compile Llama3.1-8B¶
You need to compile the Llama3.1-8B model using optimum-rbln.
Note
Choose an appropriate batch size for your serving needs. Here, it is set to 1.
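A minimal compilation sketch is shown below. The Hugging Face model ID, output directory name, and the values for rbln_max_seq_len, rbln_tensor_parallel_size, and rbln_kvcache_partition_len are assumptions for illustration; use the exact arguments from the Model Zoo example for your setup.

```python
# Sketch only: compile Llama3.1-8B with Flash Attention via optimum-rbln.
# All argument values below are illustrative assumptions.
from optimum.rbln import RBLNLlamaForCausalLM

model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed Hugging Face model ID
    export=True,                         # compile for RBLN NPUs
    rbln_batch_size=1,                   # must match max_num_seqs in model.json
    rbln_max_seq_len=131072,             # assumed max context length
    rbln_tensor_parallel_size=8,         # 8 RBLN NPUs, per the note above
    rbln_kvcache_partition_len=16384,    # Flash Attention KV-cache partition length (assumed)
)
model.save_pretrained("Llama-3.1-8B-Instruct")  # directory used in Step 2
```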
Triton Inference Server with vLLM enabled¶
The Nvidia Triton Inference Server provides a vLLM backend.
If you are using Backend.AI, refer to Step 1. If you are using an on-premise server, skip Step 1 and proceed directly to Step 2.
Step 1. Setting Up the Backend.AI Environment¶
- Start a session via Backend.AI.
- Select Triton Server (ngc-triton) as your environment. You should see the version 24.12 / vllm / x86_64 / python-py3.
Step 2. Prepare Nvidia Triton vllm_backend and Modify Model Configurations for Llama3.1-8B¶
A. Clone the Nvidia Triton Inference Server vllm_backend repository:
B. Place the precompiled Llama-3.1-8B-Instruct directory into the cloned vllm_backend/samples/model_repository/vllm_model/1 directory. Your directory should look like the following at this point:
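(A rough sketch of the expected layout; it assumes the compiled output was saved as Llama-3.1-8B-Instruct and omits the files inside that directory.)

```
vllm_backend/samples/model_repository/
└── vllm_model/
    ├── config.pbtxt
    └── 1/
        ├── model.json
        └── Llama-3.1-8B-Instruct/
```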
Note
- The vLLM backend for the Nvidia Triton Server doesn't need a model.py file, unlike other vision model backends. All model processing logic is pre-included in the Docker container at backends/vllm/model.py, so you only need model.json for configuration.
- You can either use the default config.pbtxt from the repository or create a new one. Note that the input and output formats must match exactly what the vLLM backend expects (see Step 4: Send a Chat Completion Request).
C. Modify model.json
Modify vllm_backend/samples/model_repository/vllm_model/1/model.json. The key fields are listed below; a minimal example is sketched after this list.
- model: Absolute path of the compiled model.
- max_num_seqs: Maximum number of sequences per iteration. This MUST match the compiled batch_size.
- block_size: Block size for Paged Attention. When using Flash Attention, the block size must be equal to rbln_kvcache_partition_len.
- device: Device type for vLLM execution. Set this to rbln.
- max_num_batched_tokens: Set this to the same value as max_model_len for RBLN devices.
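A minimal sketch of writing model.json, expressed in Python for convenience; the model path, sequence length, and block size are assumptions that must match your compiled model.

```python
# Sketch only: generate model.json for the Triton vLLM backend.
# The path and numeric values are illustrative assumptions.
import json

config = {
    "model": "/absolute/path/to/Llama-3.1-8B-Instruct",  # absolute path of the compiled model
    "device": "rbln",                  # run vLLM on RBLN NPUs
    "max_num_seqs": 1,                 # must match the compiled batch_size
    "block_size": 16384,               # must equal rbln_kvcache_partition_len for Flash Attention
    "max_model_len": 131072,           # assumed compiled max sequence length
    "max_num_batched_tokens": 131072,  # same value as max_model_len for RBLN devices
}

with open("vllm_backend/samples/model_repository/vllm_model/1/model.json", "w") as f:
    json.dump(config, f, indent=4)
```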
Step 3. Run the Inference Server¶
We are now ready to run the inference server. If you are using Backend.AI, please refer to the A. Backend.AI section. If you are not a Backend.AI user, proceed to the B. On-premise server section.
A. Backend.AI¶
Before proceeding, install the required dependencies:
You will see the following messages that indicate successful initiation of the server:
B. On-premise server¶
If you are not using Backend.AI, follow these steps to start the inference server in the Docker container. (Backend.AI users can skip to Step 4.)
To access the RBLN NPU devices, the inference server container must be run in privileged mode or with the required devices mounted. In this tutorial, we will run the container in privileged mode with a mount option for the cloned vllm_backend repository. Use the following command to execute the container:
Install the required dependencies in the container:
Start the Triton Server in the container:
You will see the following messages that indicate successful initiation of the server:
Step 4. Send a Chat Completion Request¶
The vLLM backend of the Triton Inference Server defines its own model.py, whose input/output signature differs from the model.py used in the Resnet50 tutorial. There, the input was named INPUT__0 and the output OUTPUT__0; the vLLM backend instead names its input text_input and its output text_output, so the client must be adjusted accordingly. Please refer to vLLM model.py for more details.
The client for the vLLM backend requires the tritonclient and grpcio packages.
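Below is a minimal, non-streaming client sketch. It assumes the gRPC endpoint localhost:8001, the model name vllm_model, and a non-decoupled config.pbtxt; the chat-template handling via the transformers tokenizer and the prompt and sampling values are illustrative (for streaming or decoupled serving, use the gRPC stream API instead).

```python
# Sketch only: gRPC client for the Triton vLLM backend.
# Endpoint, model name, prompt, and sampling values are illustrative assumptions.
import json
import numpy as np
import tritonclient.grpc as grpcclient
from transformers import AutoTokenizer

# Build a chat-formatted prompt with system/user roles and a generation prompt.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the first letter of the English alphabet?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Sampling parameters are sent as a JSON string via the sampling_parameters input.
sampling_parameters = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 256}

client = grpcclient.InferenceServerClient(url="localhost:8001")

text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))

params = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
params.set_data_from_numpy(
    np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
)

result = client.infer(model_name="vllm_model", inputs=[text_input, params])
print(result.as_numpy("text_output")[0].decode("utf-8"))
```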
Note
You need to apply a chat template appropriately for proper functioning. The chat template must be formatted with the system, user, and assistant roles. Additionally, you must configure the sampling parameters, which are passed to the server as a JSON string through the sampling_parameters input.
If the request works properly, the client prints the model's generated response.
If you need to change other sampling parameters (such as temperature, top_p, top_k, max_tokens, early_stopping, ...), please refer to vLLM's Python client.