
RBLNSampler

Overview

As the vocabulary size of large language models (LLMs) grows, the sampling stage occupies an increasing share of overall inference time. Among the common sampling methods, the widely used Top-p (nucleus) sampling requires sorting the probabilities of the entire vocabulary, which incurs O(V log V) time for a vocabulary of size V.

To address this, the FlashInfer project introduced an algorithm that performs Top-p sampling without a sorting step. vLLM RBLN leverages this algorithm to execute Top-p sampling directly on RBLN devices, further improving inference throughput.
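The sorting-free idea can be illustrated with rejection sampling: draw a token from the full distribution and accept it only if it falls inside the nucleus, a check that needs only a mass comparison rather than a full sort. Below is a minimal Python sketch of this idea; it is a simplified illustration under the assumption of a normalized probability vector, not the actual FlashInfer or vLLM RBLN kernel.

```python
import numpy as np

def top_p_sample_no_sort(probs: np.ndarray, p: float,
                         rng: np.random.Generator) -> int:
    """Sample from the renormalized top-p (nucleus) distribution
    without sorting, via rejection sampling (illustrative sketch)."""
    while True:
        # Draw a candidate token from the full distribution.
        i = int(rng.choice(len(probs), p=probs))
        # Token i is inside the nucleus iff the total mass of strictly
        # higher-probability tokens is still below p. Conditioned on
        # acceptance, the result follows the renormalized top-p law.
        if probs[probs > probs[i]].sum() < p:
            return i

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])
samples = {top_p_sample_no_sort(probs, 0.6, rng) for _ in range(200)}
# With p = 0.6, the nucleus is tokens {0, 1}, so only those are returned.
```

Each iteration costs O(V) with no sort; in practice FlashInfer refines this with pivot-based pruning so that rejected rounds shrink the candidate set quickly.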

Enabling RBLNSampler

This feature is enabled by default. To disable it and use the original vLLM sampler, set the environment variable VLLM_RBLN_SAMPLER=0.

During vLLM initialization, the sampler is compiled in the warm-up phase. If you set VLLM_RBLN_ENABLE_WARM_UP=0, the warm-up step is skipped and compilation occurs lazily at runtime instead. This can degrade initial inference speed, so keeping the default setting VLLM_RBLN_ENABLE_WARM_UP=1 is recommended.
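For example, to launch a server that falls back to the stock vLLM sampler while keeping the recommended warm-up behavior (the model name below is illustrative):

```shell
# Disable RBLNSampler and use the original vLLM sampler
export VLLM_RBLN_SAMPLER=0
# Keep the default warm-up so the model is compiled before serving
export VLLM_RBLN_ENABLE_WARM_UP=1

vllm serve meta-llama/Llama-3.1-8B-Instruct
```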

Limits

RBLNSampler currently supports only Top-p sampling. It will be extended to run additional sampling algorithms efficiently on RBLN devices in the future.