Ray Serve

Ray Serve is a scalable model-serving library built on the Ray distributed framework. It provides key features such as dynamic batching, multi-node deployment, and response streaming.

This section explains how to use RBLN NPUs with Ray Serve and provides examples.

Getting Started

Ray Serve Installation

Install Ray Serve and the packages used in the examples with pip:

`$ pip3 install "ray[serve]" transformers requests torch --extra-index-url https://download.pytorch.org/whl/cpu`
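After installation, it can help to confirm that the packages named in the command are importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and is not part of Ray or the RBLN SDK.

```python
from importlib.util import find_spec

# Top-level import names for the packages installed by the pip command above.
REQUIRED = ["ray", "transformers", "requests", "torch"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing packages:", missing if missing else "none")
```

An empty result only shows that the packages can be found; it does not verify versions or NPU availability.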

For more detailed installation instructions, refer to the official Ray documentation.

RBLN NPU with Ray

The RBLN NPU is one of the accelerators that Ray Core supports as a predefined resource type. To run tasks or actors on an RBLN NPU, request the `RBLN` resource as shown below:

`@ray.remote(resources={"RBLN":1})`
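The same resource request can be attached to a Ray Serve deployment through `ray_actor_options`. The sketch below assumes one NPU per replica; `rbln_actor_options`, `Echo`, and `build_app` are illustrative names introduced here, and the deployment body is a placeholder rather than a real model.

```python
def rbln_actor_options(num_npus=1):
    """Build a ray_actor_options dict that requests `num_npus` RBLN NPUs."""
    if num_npus <= 0:
        raise ValueError("num_npus must be positive")
    return {"resources": {"RBLN": num_npus}}


def build_app():
    """Create a Serve application whose replicas each reserve one RBLN NPU."""
    # Imported lazily so the helper above can be used without Ray installed.
    from ray import serve

    @serve.deployment(ray_actor_options=rbln_actor_options(1))
    class Echo:
        async def __call__(self, request):
            # Placeholder handler: a real deployment would run a compiled model here.
            return "ok"

    return Echo.bind()


if __name__ == "__main__":
    from ray import serve

    serve.run(build_app())
```

Keeping the resource dictionary in a small helper makes it easy to reuse the same request across several deployments.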

Optionally, you can set the RBLN_DEVICES environment variable before launching Ray or Ray Serve to control which Rebellions RBLN devices are exposed to Ray.

`$ RBLN_DEVICES=1,2,3,4 serve run llama3-8b:app`
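To make the format of `RBLN_DEVICES` concrete, the sketch below parses a device list like the one in the command above. It assumes the value is a comma-separated list of integer device IDs; `parse_rbln_devices` is a hypothetical helper for illustration and does not reproduce how Ray handles the variable internally.

```python
import os


def parse_rbln_devices(value):
    """Parse a string such as "1,2,3,4" into a list of integer device IDs."""
    if value is None or value.strip() == "":
        # Unset or empty: no explicit device restriction was given.
        return []
    return [int(part) for part in value.split(",")]


if __name__ == "__main__":
    print(parse_rbln_devices(os.environ.get("RBLN_DEVICES")))
```

Run with `RBLN_DEVICES=1,2,3,4` set, the script prints `[1, 2, 3, 4]`.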

Tutorial

We provide tutorials to help users get started with model serving using Ray Serve.

Resnet50 explains how to serve an image classification model with Ray Serve.
YOLOv8 demonstrates how to serve an object detection model with Ray Serve.
Llama3-8B explains how to serve an LLM using the Ray Serve LLM API.
Llama3-8B with Flash Attention provides a guide on how to enable Flash Attention with Ray Serve.