Nvidia Triton Inference Server
Nvidia Triton Inference Server is open-source software designed for efficient and scalable deployment of machine learning models. You can serve AI models on Rebellions' NPUs with the Nvidia Triton Inference Server. For LLMs in particular, the vLLM backend integrated with vllm-rbln leverages continuous batching to maximize throughput and minimize latency.
Getting started¶
With the Nvidia Triton Inference Server, you can serve models on Rebellions NPUs using either the Python backend or the vLLM backend. To use the RBLN NPU, the rebel-compiler package must be installed.
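As a minimal sketch, installing the package with pip might look like the following; this assumes rebel-compiler is available from the Python package index configured in your environment:

```bash
# Install the compiler package required for RBLN NPU support
# (assumes your pip index is configured to provide rebel-compiler)
pip install rebel-compiler
```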
If you plan to serve with the Nvidia Triton Inference Server in a Docker environment, you can use the rebellions/tritonserver Docker image.
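As a rough sketch, launching the container might look like the following; the image tag, exposed ports, mounted model repository path, and the NPU device flag are illustrative assumptions rather than exact values from this page:

```bash
# Pull the Triton image provided by Rebellions (tag is illustrative)
docker pull rebellions/tritonserver:latest

# Run the server with a local Triton model repository mounted.
# The --device flag exposing the NPU is an assumption and may differ
# on your system; see the on-premise tutorial for the exact options.
docker run --rm -it \
  --device /dev/rbln0 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  rebellions/tritonserver:latest \
  tritonserver --model-repository=/models
```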
For detailed information, please refer to the B. On-premise server section of the Llama2-7B with Continuous Batching tutorial.
Tutorials¶
We provide two comprehensive tutorials that show how to serve models with the Nvidia Triton Inference Server:
- Resnet50 shows how to serve a precompiled Resnet50 model with the Nvidia Triton Inference Server
- Llama2-7B with Continuous Batching shows how to serve Llama2-7B with the vLLM-enabled Nvidia Triton Inference Server; a minimal model-repository sketch for this setup follows below
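As a rough illustration of the vLLM-backed setup, the sketch below creates a minimal Triton model repository. The model directory name, the model path in model.json, and the specific engine arguments are placeholder assumptions; follow the Llama2-7B tutorial for the exact configuration expected by vllm-rbln.

```bash
# Minimal Triton model repository for the vLLM backend (illustrative).
mkdir -p model_repository/llama2-7b/1

# model.json holds the vLLM engine arguments; the model path and the
# argument values below are placeholders, not values from this page.
cat > model_repository/llama2-7b/1/model.json <<'EOF'
{
  "model": "/models/llama2-7b-rbln",
  "max_num_seqs": 4,
  "block_size": 16
}
EOF

# config.pbtxt selects the vLLM backend for this model
cat > model_repository/llama2-7b/config.pbtxt <<'EOF'
backend: "vllm"
instance_group [{ count: 1, kind: KIND_MODEL }]
EOF
```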