Sampler
As the vocabulary size of large language models (LLMs) grows, the sampling stage accounts for an increasing share of overall inference time. Among the various sampling methods, the widely used Top-p (nucleus) sampling requires sorting the entire vocabulary, which costs O(V log V) time for a vocabulary of size V.
To address this, FlashInfer introduced a rejection-based method that performs Top-p sampling without a sorting step. vLLM RBLN leverages this algorithm to execute Top-p sampling on RBLN devices, further improving inference throughput.
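The idea behind sorting-free Top-p sampling can be illustrated with a small rejection-sampling sketch. This is an illustrative NumPy reference only, not FlashInfer's actual kernel; the function name and tie handling are assumptions made here. A candidate token is drawn from the full distribution; it is accepted if the probability mass of strictly more likely tokens is below `top_p`, and otherwise the candidate and every token no more likely than it are discarded before resampling.

```python
import numpy as np


def top_p_sample(probs: np.ndarray, top_p: float, rng: np.random.Generator) -> int:
    """Sorting-free Top-p sampling via rejection (illustrative sketch)."""
    alive = np.ones(len(probs), dtype=bool)  # tokens still eligible for sampling
    while True:
        # Sample a candidate from the surviving tokens, renormalized.
        q = np.where(alive, probs, 0.0)
        i = rng.choice(len(probs), p=q / q.sum())
        # Mass of tokens strictly more probable than the candidate,
        # measured on the original distribution. If it is below top_p,
        # the candidate belongs to the Top-p set.
        if probs[probs > probs[i]].sum() < top_p:
            return i
        # Otherwise the candidate (and every token no more likely than it)
        # lies outside the Top-p set; drop them and resample.
        alive &= probs > probs[i]
```

Each rejection strictly raises the minimum surviving probability, so the loop terminates, and no sort over the vocabulary is ever performed.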
This feature is enabled by default. To disable it and use the original vLLM sampler, set the environment variable VLLM_RBLN_SAMPLER=0.
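For example, disabling it before launching the server might look like this (a minimal sketch; adapt to however you launch vLLM):

```shell
# Fall back to the original vLLM sampler instead of the RBLN sampler.
export VLLM_RBLN_SAMPLER=0
```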
During vLLM initialization, the sampler is compiled in the warm-up phase.
If you set VLLM_RBLN_ENABLE_WARM_UP=0, the warm-up step is skipped and compilation occurs at runtime instead. Because this can slow the first inference requests, keeping the default setting VLLM_RBLN_ENABLE_WARM_UP=1 is recommended.
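If you nevertheless want to defer compilation, for instance to shorten startup time during development, the variable can be set the same way (a sketch; adapt to your launch method):

```shell
# Skip sampler compilation during warm-up; it will happen at runtime instead,
# slowing the first inference requests.
export VLLM_RBLN_ENABLE_WARM_UP=0
```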
In the future, the RBLN Sampler will be extended with additional sampling algorithms that run efficiently on RBLN devices.