Nvidia Triton Inference Server

Nvidia Triton Inference Server is open-source software for efficient, scalable deployment of machine learning models. You can serve AI models on Rebellions' NPUs through the Nvidia Triton Inference Server. For LLMs in particular, the vLLM backend integrated with vllm-rbln uses continuous batching to maximize throughput and minimize latency.
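
As a minimal sketch, the snippet below sends a text generation request to a Triton server hosting a vLLM-backend model via the HTTP generate endpoint. The model name `llama2-7b` and the field names (`text_input`, `text_output`, the sampling parameters) are assumptions for illustration; the exact names depend on your model repository and backend configuration.

```python
# Minimal sketch: query a Triton vLLM-backend model via the HTTP generate endpoint.
# The model name "llama2-7b" and the field names below are assumptions for
# illustration; check your deployment for the exact names it exposes.
import requests

TRITON_URL = "http://localhost:8000"   # assumed default HTTP port
MODEL_NAME = "llama2-7b"               # hypothetical model name

payload = {
    "text_input": "What is continuous batching?",  # assumed input field name
    "parameters": {
        "max_tokens": 128,
        "temperature": 0.7,
    },
}

resp = requests.post(
    f"{TRITON_URL}/v2/models/{MODEL_NAME}/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("text_output"))  # assumed output field name
```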

Getting started

With the Nvidia Triton Inference Server, you can serve models on RBLN NPUs through either the Python backend or the vLLM backend. To use the RBLN NPU, the rebel-compiler package must be installed.
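
For the Python backend case, a hedged model.py sketch is shown below. The TritonPythonModel interface and triton_python_backend_utils come from Triton's Python backend; the rebel.Runtime call, the artifact path, and the tensor names INPUT__0/OUTPUT__0 are assumptions for illustration and should be adapted to your compiled model's actual configuration.

```python
# Minimal sketch of a Triton Python-backend model.py that runs a model
# compiled with rebel-compiler on an RBLN NPU. The rebel.Runtime usage, the
# artifact path, and the tensor names "INPUT__0"/"OUTPUT__0" are assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils
import rebel  # provided by the rebel-compiler package


class TritonPythonModel:
    def initialize(self, args):
        # Load a precompiled *.rbln artifact onto the NPU (path is hypothetical).
        self.runtime = rebel.Runtime("/models/resnet50/1/resnet50.rbln")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Pull the input tensor declared in config.pbtxt.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            input_array = input_tensor.as_numpy()

            # Run inference on the RBLN NPU.
            output_array = self.runtime.run(input_array)

            # Wrap the result in a Triton output tensor.
            output_tensor = pb_utils.Tensor("OUTPUT__0", np.asarray(output_array))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses

    def finalize(self):
        self.runtime = None
```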

If you plan to run the Nvidia Triton Inference Server in a Docker environment, you can use the rebellions/tritonserver Docker image. For details, refer to B. On-premise server in the Llama2-7B with Continuous Batching tutorial.

Tutorials

We provide two comprehensive tutorials that show how to serve models with the Nvidia Triton Inference Server: