
RBLNSampler

Overview

As the vocabulary size of large language models (LLMs) grows, the sampling stage occupies an increasing share of overall inference time. Among the common sampling methods, the widely used Top-p (nucleus) sampling requires sorting the probabilities of the entire vocabulary, which incurs O(V log V) time for a vocabulary of size V.

To address this, the FlashInfer project introduced an algorithm that performs Top-p sampling without a sorting step. vLLM RBLN leverages this algorithm to execute Top-p sampling directly on RBLN devices, further improving inference throughput.
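The sorting-free idea can be illustrated with rejection sampling: draw a token from the full distribution and accept it only if it falls inside the nucleus, a check that needs only a mass comparison rather than a full sort. Below is a minimal Python sketch of this idea; it is a simplified illustration under the assumption of a normalized probability vector, not the actual FlashInfer or vLLM RBLN kernel.

```python
import numpy as np

def top_p_sample_no_sort(probs: np.ndarray, p: float,
                         rng: np.random.Generator) -> int:
    """Sample from the renormalized top-p (nucleus) distribution
    without sorting, via rejection sampling (illustrative sketch)."""
    while True:
        # Draw a candidate token from the full distribution.
        i = int(rng.choice(len(probs), p=probs))
        # Token i is inside the nucleus iff the total mass of strictly
        # higher-probability tokens is still below p. Conditioned on
        # acceptance, the result follows the renormalized top-p law.
        if probs[probs > probs[i]].sum() < p:
            return i

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])
samples = {top_p_sample_no_sort(probs, 0.6, rng) for _ in range(200)}
# With p = 0.6, the nucleus is tokens {0, 1}, so only those are returned.
```

Each iteration costs O(V) with no sort; in practice FlashInfer refines this with pivot-based pruning so that rejected rounds shrink the candidate set quickly.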

Enabling RBLNSampler

This feature is enabled by default. To disable it and use the original vLLM sampler, set the environment variable VLLM_RBLN_SAMPLER=0.

During vLLM initialization, the sampler is compiled in the warm-up phase. If you set VLLM_RBLN_ENABLE_WARM_UP=0, the warm-up step is skipped and compilation occurs lazily at runtime instead. This can degrade initial inference speed, so keeping the default setting VLLM_RBLN_ENABLE_WARM_UP=1 is recommended.
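For example, to launch a server that falls back to the stock vLLM sampler while keeping the recommended warm-up behavior (the model name below is illustrative):

```shell
# Disable RBLNSampler and use the original vLLM sampler
export VLLM_RBLN_SAMPLER=0
# Keep the default warm-up so the model is compiled before serving
export VLLM_RBLN_ENABLE_WARM_UP=1

vllm serve meta-llama/Llama-3.1-8B-Instruct
```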

Limits

RBLNSampler currently supports only Top-p sampling. It will be extended to run additional sampling algorithms efficiently on RBLN devices in the future.