Ray Serve
Ray is an open-source unified framework for high-performance distributed computing in Python. Ray Serve is a scalable machine learning model serving platform built on top of Ray. It provides key features such as dynamic batching, multi-node deployment, and response streaming, while supporting a wide range of machine learning frameworks.
Within Ray Serve, Rebellions’ high-performance NPU can be seamlessly integrated to deliver efficient and scalable inference serving.
Getting Started
Ray Serve Installation
You can install Ray Serve using pip, for example with `pip install "ray[serve]"`.
For more detailed installation instructions, please refer to the official Ray documentation.
RBLN NPU with Ray
RBLN NPUs are officially supported in Ray Core.
To run tasks or actors on an RBLN NPU, specify the required number of NPUs as a resource requirement when you define the task or actor.
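The resource request can be sketched as below. This is a minimal illustration, not the reference setup: it assumes Ray exposes the NPUs under the resource name "RBLN", which this section does not state, so confirm the name with `ray.cluster_resources()` on your cluster. The names `rbln_options`, `run_task_on_npu`, and `npu_task` are hypothetical.

```python
from typing import Any, Dict

# Assumption: Ray registers Rebellions NPUs under the resource name "RBLN".
RBLN_RESOURCE = "RBLN"


def rbln_options(num_npus: int = 1) -> Dict[str, Any]:
    """Return Ray options that reserve `num_npus` RBLN NPUs for a task or actor."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    return {"resources": {RBLN_RESOURCE: num_npus}}


def run_task_on_npu() -> str:
    """Start Ray and run one task that is scheduled only where an NPU is free."""
    import ray  # imported lazily so rbln_options() works without Ray installed

    ray.init(ignore_reinit_error=True)

    @ray.remote(**rbln_options(1))
    def npu_task() -> str:
        # Load and run a compiled RBLN model here.
        return "ran on a node with an RBLN NPU"

    return ray.get(npu_task.remote())
```

Calling `run_task_on_npu()` requires a machine where Ray is installed and detects at least one NPU; otherwise the task stays pending. Under the same assumption, the options can likely be reused for a Serve deployment through `@serve.deployment(ray_actor_options=rbln_options(1))`.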
Optionally, you can set the RBLN_DEVICES environment variable before launching Ray or Ray Serve to control which Rebellions RBLN devices are exposed to Ray.
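As a rough sketch, the variable can be set from Python before Ray starts in the same process. The value format used here, a comma-separated list of device indices, is an assumption and is not confirmed by this section; `expose_rbln_devices` is a hypothetical helper.

```python
import os
from typing import Iterable


def expose_rbln_devices(device_ids: Iterable[int]) -> str:
    """Set RBLN_DEVICES so that only the given device indices are exposed to Ray.

    Assumption: the variable takes a comma-separated list of indices, e.g. "0,1".
    """
    value = ",".join(str(device_id) for device_id in device_ids)
    os.environ["RBLN_DEVICES"] = value
    return value


# Call this before ray.init() or serve.run() in the same process.
expose_rbln_devices([0, 1])
```

Setting the variable this way only affects a Ray instance started from this process. When the cluster is launched from a shell, export the variable in that shell instead.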
Tutorial
We provide tutorials to help users get started with model serving using Ray Serve.
- ResNet50 explains how to serve an image classification model with Ray Serve
- YOLOv8 demonstrates how to serve an object detection model with Ray Serve
- Llama3-8B explains how to serve an LLM using the Ray Serve LLM API
- Llama3-8B with Flash Attention provides a guide on how to enable Flash Attention with Ray Serve