Ray Serve
Ray Serve is a scalable model-serving library built on the Ray distributed framework. It provides key features such as dynamic batching, multi-node deployment, and response streaming.
This section explains how to use RBLN NPUs with Ray Serve and provides examples.
Getting Started
Ray Serve Installation
Ray Serve is distributed as an optional extra of the ray package and can be installed with pip.
For more detailed installation instructions, refer to the official Ray documentation.
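The pip installation mentioned above can be sketched from Python as follows. This is a minimal sketch rather than an official installer: the helper names `ray_is_installed` and `install_ray_serve` are hypothetical and introduced only for illustration, while `ray[serve]` is the package spec that pulls in Ray together with its Serve extras.

```python
import importlib.util
import subprocess
import sys

# "ray[serve]" is the package spec that installs Ray with the Serve extras.
# Shell equivalent: pip install "ray[serve]"
PIP_INSTALL_CMD = [sys.executable, "-m", "pip", "install", "ray[serve]"]


def ray_is_installed() -> bool:
    """Return True if the ray package can be found in this environment.

    This only checks for the base package, not for the Serve extras.
    """
    return importlib.util.find_spec("ray") is not None


def install_ray_serve() -> None:
    """Run pip for the current interpreter to install Ray with Serve."""
    subprocess.check_call(PIP_INSTALL_CMD)


print("ray installed:", ray_is_installed())
```

Calling `install_ray_serve()` performs the installation in the environment of the running interpreter, which avoids installing into a different Python than the one that will later import Ray.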
RBLN NPU with Ray
The RBLN NPU is one of the accelerators that Ray Core supports as a predefined resource type. To run tasks or actors on an RBLN NPU, request the resource when you define the task or actor.
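A resource request for a task, an actor, and a Ray Serve deployment might look like the sketch below. It assumes that Ray registers the NPUs under the resource name "RBLN" and that one NPU per worker is enough; the `Echo` deployment and the function names are hypothetical placeholders, not part of any RBLN or Ray API.

```python
# Assumption: Ray exposes RBLN NPUs under the resource name "RBLN".
RBLN_RESOURCES = {"RBLN": 1}  # reserve one NPU per task, actor, or replica


def run_task_and_actor() -> None:
    """Schedule one task and one actor that each reserve a single RBLN NPU."""
    import ray  # imported lazily so this module loads without Ray installed

    ray.init()

    @ray.remote(resources=RBLN_RESOURCES)
    def rbln_task() -> str:
        return "task scheduled with one RBLN NPU reserved"

    @ray.remote(resources=RBLN_RESOURCES)
    class RBLNActor:
        def ping(self) -> str:
            return "actor scheduled with one RBLN NPU reserved"

    # These calls wait until the cluster has a free "RBLN" resource.
    print(ray.get(rbln_task.remote()))
    actor = RBLNActor.remote()
    print(ray.get(actor.ping.remote()))
    ray.shutdown()


def build_serve_app():
    """Return a Ray Serve application whose replica reserves one RBLN NPU."""
    from ray import serve

    @serve.deployment(ray_actor_options={"resources": RBLN_RESOURCES})
    class Echo:  # hypothetical placeholder deployment, not a real model
        async def __call__(self, request) -> str:
            return "served from a replica with one RBLN NPU reserved"

    return Echo.bind()
```

On a node with RBLN devices, `run_task_and_actor()` exercises the Ray Core path and `serve.run(build_serve_app())` starts the deployment. If no node advertises the requested resource, the task, actor, or replica stays pending instead of failing immediately.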
Optionally, you can set the RBLN_DEVICES environment variable before launching Ray or Ray Serve to control which Rebellions RBLN devices are exposed to Ray.
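As an illustration, the variable can be set from Python before Ray starts in the same process. This sketch assumes that RBLN_DEVICES takes a comma-separated list of device indices, in the style of other device-visibility variables; the helper `expose_rbln_devices` is hypothetical.

```python
import os


def expose_rbln_devices(device_ids) -> str:
    """Set RBLN_DEVICES so that only the given device indices are exposed.

    Assumption: the value is a comma-separated list of indices, e.g. "0,1".
    """
    value = ",".join(str(device_id) for device_id in device_ids)
    os.environ["RBLN_DEVICES"] = value
    return value


# Expose only the first two devices; this must run before Ray is started.
expose_rbln_devices([0, 1])
print("RBLN_DEVICES =", os.environ["RBLN_DEVICES"])
```

The variable has to be in place before Ray is launched, so when Ray is started from a shell it should be exported in that shell rather than set from a driver script that connects to an already running cluster.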
Tutorial
The following tutorials help you get started with model serving on Ray Serve:
- Resnet50 explains how to serve an image classification model with Ray Serve
- YOLOv8 demonstrates how to serve an object detection model with Ray Serve
- Llama3-8B explains how to serve an LLM using the Ray Serve LLM API
- Llama3-8B with Flash Attention provides a guide on how to enable Flash Attention with Ray Serve