Nvidia Triton Inference Server
Nvidia Triton Inference Server is open-source software designed for efficient and scalable deployment of machine learning models. You can serve AI models on Rebellions' NPUs with the Nvidia Triton Inference Server. For LLMs in particular, the vLLM backend integrated with vllm-rbln leverages continuous batching to maximize throughput and minimize latency.
Getting started¶
With the Nvidia Triton Inference Server, you can serve models on Rebellions NPUs using either the Python backend or the vLLM backend. To use the RBLN NPU, the rebel-compiler package must be installed.
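As a minimal sketch, installing the package with pip might look like the following; this assumes rebel-compiler is available from the Python package index configured in your environment:

```bash
# Install the compiler package required for RBLN NPU support
# (assumes your pip index is configured to provide rebel-compiler)
pip install rebel-compiler
```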
If you plan to serve with the Nvidia Triton Inference Server in a Docker environment, you can use the rebellions/tritonserver Docker image.
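As a rough sketch, launching the container might look like the following; the image tag, exposed ports, mounted model repository path, and the NPU device flag are illustrative assumptions rather than exact values from this page:

```bash
# Pull the Triton image provided by Rebellions (tag is illustrative)
docker pull rebellions/tritonserver:latest

# Run the server with a local Triton model repository mounted.
# The --device flag exposing the NPU is an assumption and may differ
# on your system; see the on-premise tutorial for the exact options.
docker run --rm -it \
  --device /dev/rbln0 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  rebellions/tritonserver:latest \
  tritonserver --model-repository=/models
```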
For detailed information, please refer to the B. On-premise server section of the Llama2-7B with Continuous Batching tutorial.
Tutorials¶
We provide two comprehensive tutorials that show how to serve models with the Nvidia Triton Inference Server:
- Resnet50 shows how to serve a precompiled Resnet50 model with the Nvidia Triton Inference Server
- Llama2-7B with Continuous Batching shows how to serve Llama2-7B with the vLLM-enabled Nvidia Triton Inference Server; a minimal model-repository sketch for this setup follows below
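As a rough illustration of the vLLM-backed setup, the sketch below creates a minimal Triton model repository. The model directory name, the model path in model.json, and the specific engine arguments are placeholder assumptions; follow the Llama2-7B tutorial for the exact configuration expected by vllm-rbln.

```bash
# Minimal Triton model repository for the vLLM backend (illustrative).
mkdir -p model_repository/llama2-7b/1

# model.json holds the vLLM engine arguments; the model path and the
# argument values below are placeholders, not values from this page.
cat > model_repository/llama2-7b/1/model.json <<'EOF'
{
  "model": "/models/llama2-7b-rbln",
  "max_num_seqs": 4,
  "block_size": 16
}
EOF

# config.pbtxt selects the vLLM backend for this model
cat > model_repository/llama2-7b/config.pbtxt <<'EOF'
backend: "vllm"
instance_group [{ count: 1, kind: KIND_MODEL }]
EOF
```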