
Model deployment

Model deployment prepares a model to serve inference requests. Before deploying a model, confirm that the required RBLN ServingRuntime and HardwareProfile are registered in the OpenShift AI dashboard under Settings.

  1. Create a project

    1. In the left menu, click Projects, then select Create project.
    2. Enter a name and description, then confirm. All subsequent resources—connections, deployments, and permissions—are scoped to this project.


  2. Prepare storage and create a data connection

    1. Provision a storage backend for your model artifacts. Supported options include an S3-compatible object store, a URI-based repository, or a PersistentVolumeClaim (PVC).
    2. Inside the project, open the Connections tab and select Create connection.
    3. Choose the connection type that matches your storage backend (for example, S3-compatible object storage or URI - v1), fill in the required fields, and save.


  3. Open the deployment wizard

    Inside the project, open the Deployment tab and select Deploy model. The Deploy a model wizard opens.


    • Step 1 — Model details

      • Under Model location, select Existing connection and choose the connection created in step 2.
      • Select the Model type: Predictive AI for classical ML models, or Generative AI for large language models (LLMs).
    • Step 2 — Model deployment settings

      • Model deployment name: Enter a unique identifier for the deployment.
      • Hardware profile: Select the profile that matches the target hardware (for example, the Rebellions NPU profile), and specify the number of accelerators.
      • Serving runtime: Select the custom runtime configured for your hardware and model format.
    • Step 3 — Advanced settings

      • Enable Model route to expose an external inference endpoint. Optionally, enable Token authentication to restrict access to the endpoint.
      • In the Configuration parameters section, add any runtime arguments or environment variables required by the selected serving runtime. These settings apply only to this deployment and do not affect the global runtime configuration.

    The deployment wizard internally creates an InferenceService custom resource (CR) for KServe in Red Hat OpenShift AI. Once the CR is created, the KServe controller detects it and generates the resources required for model serving, such as Deployments, Services, and Pods. The InferenceService CR is explained in detail below.

  4. Deploy and verify

    1. Select Deploy to submit the deployment.
    2. Monitor the status on the Deployments tab. When the status becomes Available, the inference endpoint URL appears in the Inference endpoints column.
    3. Optionally, open the pod terminal in the OpenShift console and run rbln-smi to confirm that the Rebellions NPU is recognized inside the serving container.
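The project-scoped resources created through the dashboard in steps 1–2 can also be defined as manifests. The sketch below shows an S3-compatible data connection, which OpenShift AI stores as a Kubernetes Secret in the project namespace; the connection name, namespace, endpoint, bucket, and credential values are placeholders, and the label/annotation keys follow the dashboard's conventions.

```yaml
# Hedged sketch: a data connection is a Secret labeled for the
# OpenShift AI dashboard. All values below are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: my-model-store           # connection name shown in the dashboard
  namespace: test-vllm           # your project namespace
  labels:
    opendatahub.io/dashboard: "true"
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: my-model-store
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-key>
  AWS_S3_ENDPOINT: https://s3.example.com
  AWS_S3_BUCKET: models
  AWS_DEFAULT_REGION: us-east-1
```

A Secret created this way appears in the project's Connections tab and can be selected as the Existing connection in the deployment wizard.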

InferenceService

An InferenceService defines a model-serving deployment: it routes requests to the selected runtime, applies the configured hardware profile, and exposes REST or gRPC endpoints. It typically specifies:

  • Model location and format
  • The ServingRuntime to use
  • The HardwareProfile to be applied
  • Whether passthrough routing is enabled for REST or gRPC
  • HTTP or gRPC access settings for the deployed model

There are two ways to create an InferenceService:

  • Using the OpenShift AI UI: In a project, open the Deployment tab and select Deploy a model. The wizard automatically creates an InferenceService CR.
  • Using YAML: Apply a manifest directly:
oc apply -f <inferenceservice.yaml>

The following example shows an RBLN InferenceService that deploys a Qwen3 model with the vLLM-based runtime.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen3-06b
  namespace: test-vllm
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    opendatahub.io/hardware-profile-name: rebellions-npu
    opendatahub.io/hardware-profile-namespace: redhat-ods-applications
    opendatahub.io/model-type: generative
    security.opendatahub.io/enable-auth: "false"
    haproxy.router.openshift.io/timeout: 3000s
    serving.kserve.io/stop: "false"
  labels:
    opendatahub.io/dashboard: "true"
    networking.kserve.io/visibility: "exposed"
spec:
  predictor:
    automountServiceAccountToken: false
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      runtime: qwen3-06b
      storageUri: hf://Qwen/Qwen3-0.6B
      args:
        - "--block-size=4096"
        - "--enable-chunked-prefill"
        - "--max-num-batched-tokens=128"
        - "--max-num-seqs=1"
      env:
        - name: OMP_NUM_THREADS
          value: "2"
        - name: VLLM_RBLN_TP_SIZE
          value: "1"
      resources:
        requests:
          cpu: 6
          memory: 100Gi
          rebellions.ai/ATOM: 1
        limits:
          cpu: 6
          memory: 100Gi
          rebellions.ai/ATOM: 1

After the InferenceService CR is deployed, you can verify that the model serves requests correctly with the following command.

curl https://qwen3-06b-test-vllm.<your-domain>/v1/chat/completions -v -k  \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-06b",
    "messages": [{"role":"user","content":"Hello, who are you?"}],
    "max_tokens": 100
  }'

Notes

  • Compilation and warm-up are required for model serving in vLLM RBLN. These run automatically at engine startup, when the first request is processed (for example, when LLM(...) is initialized).
  • The first startup is slower because graph compilation and warm-up are performed.
  • Subsequent runs reuse cached binaries by default (under $VLLM_CACHE_ROOT/rbln, typically ~/.cache/vllm/rbln).
  • Changing model or runtime shape parameters (for example, max_num_seqs, max_num_batched_tokens, or block_size) may trigger additional compilation.
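Because the cache lives inside the container filesystem by default, a pod restart discards the compiled binaries and triggers recompilation. One way to avoid this, sketched below under the assumptions that the runtime honors VLLM_CACHE_ROOT and that a PersistentVolumeClaim (here named rbln-cache, an illustrative name) is available in the project namespace, is to redirect the cache onto persistent storage in the InferenceService predictor spec:

```yaml
# Sketch only: persist the vLLM RBLN compile cache across pod restarts.
# The PVC name and mount path are illustrative, not required values.
spec:
  predictor:
    model:
      env:
        - name: VLLM_CACHE_ROOT      # cached binaries land under $VLLM_CACHE_ROOT/rbln
          value: /cache/vllm
      volumeMounts:
        - name: rbln-cache
          mountPath: /cache/vllm
    volumes:
      - name: rbln-cache
        persistentVolumeClaim:
          claimName: rbln-cache
```

With this in place, only the first deployment pays the compilation cost; later pods reuse the cached binaries as long as the shape parameters above stay unchanged.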