Installation

This section describes how to install and validate the software stack for Red Hat AI Enterprise with Rebellions. Install components and perform verification in the following order:

  1. Red Hat OpenShift cluster — For cluster installation and lifecycle tasks, see OpenShift support.

  2. RBLN NPU Operator — After the cluster is ready, install the operator. It deploys and manages the Rebellions components required to provision RBLN NPUs on OpenShift. See RBLN NPU Operator.

  3. Verification — When the operator is running, Rebellions NPU devices are exposed to cluster workloads. Perform the checks below to confirm resource registration, pod scheduling, and in-container device access.

    1. Node capacity (resource registration)

      Query node capacity (example):

      kubectl get nodes -o json | jq '.items[].status.capacity'
      

      A successful registration includes a rebellions.ai/ATOM entry in the JSON, for example:

      {
        "cpu": "128",
        "memory": "1056858492Ki",
        "pods": "250",
        "rebellions.ai/ATOM": "1"
      }
      

      That entry shows the node advertises the accelerator and can schedule pods that request rebellions.ai/ATOM.
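      The capacity check can be scripted so it fails fast when the resource is missing. The sketch below reuses jq (already used above) on the sample capacity JSON; in practice you would pipe the live kubectl output instead of the echoed sample.

      ```shell
      # Capacity JSON for one node (the sample values from above), piped straight to jq.
      # jq -e sets a non-zero exit status when the key is absent or null, so this
      # line can gate a CI or install script.
      echo '{"cpu":"128","memory":"1056858492Ki","pods":"250","rebellions.ai/ATOM":"1"}' \
        | jq -e '."rebellions.ai/ATOM"'
      ```

      A non-zero exit status here means the node does not advertise the accelerator.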

    2. Device visibility inside the pod

      To use the ATOM™ accelerator, declare rebellions.ai/ATOM under both requests and limits in the pod specification—for example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: rbln-workload
      spec:
        containers:
          - name: rbln-container
            image: <container_image>
            resources:
              requests:
                rebellions.ai/ATOM: "1"
              limits:
                rebellions.ai/ATOM: "1"
      

      Kubernetes requires extended resources such as rebellions.ai/ATOM to have equal requests and limits; setting both lets the scheduler bind the pod to a node that exposes the device.
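      Because mismatched requests and limits are rejected for extended resources, the invariant can be checked in a script before applying the manifest. A sketch over the resources stanza of the pod above (the jsonpath command in the comment is one way to obtain it from a live pod):

      ```shell
      # resources stanza of the pod above, e.g. obtained with:
      #   kubectl get pod rbln-workload -o jsonpath='{.spec.containers[0].resources}'
      resources='{"requests":{"rebellions.ai/ATOM":"1"},"limits":{"rebellions.ai/ATOM":"1"}}'

      # Compare the two counts; jq -e exits non-zero when they differ
      echo "$resources" | jq -e '.requests["rebellions.ai/ATOM"] == .limits["rebellions.ai/ATOM"]'
      ```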

      With the pod running, execute rbln-smi inside the container:

      kubectl exec -it rbln-workload -- rbln-smi
      

      Expected output includes device identifiers, memory, and utilization, which confirms the container can access the ATOM™ accelerator.

      Reference: sample rbln-smi output

      > rbln-smi
      
      +-------------------------------------------------------------------------------------------------+
      |                                Device Information KMD ver: 3.0.2                                |
      +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
      | NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
      +=====+===========+=========+===============+======+=========+======+=====================+=======+
      | 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  36C |  56.4W  | P14  |    0.0B / 15.7GiB   |   0.0 |
      +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
      |                                       Context Information                                       |
      +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
      | NPU | Process             |     PID      |    CTX    | Priority | PTID |      Memalloc | Status |
      +=====+=====================+==============+===========+==========+======+===============+========+
      | N/A | N/A                 |     N/A      |    N/A    |   N/A    | N/A  |           N/A |  N/A   |
      +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
      rbln-smi success, skipping rbln-stat
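      When automating this check, the table can be parsed with standard tools. A sketch that pulls the device name out of the sample row above (the column positions follow this rbln-smi layout and may change between versions):

      ```shell
      # One device row from the rbln-smi table above, captured as a string
      row='| 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  36C |  56.4W  | P14  |    0.0B / 15.7GiB   |   0.0 |'

      # Split on '|' and print the device-name column with spaces stripped; prints RBLN-CA25
      echo "$row" | awk -F'|' '{ gsub(/ /, "", $3); print $3 }'
      ```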
      

    If a check fails or command output differs materially from the patterns above (including the sample rbln-smi listing), contact customer support for assistance.

  4. Red Hat OpenShift AI — Install the product using the procedure in the Red Hat OpenShift AI self-managed installation guide.

    Red Hat AI Enterprise with Rebellions NPU has been validated with Red Hat OpenShift AI 3.3, which uses a KServe-oriented model-serving workflow, hardware profiles, and runtime auto-selection so deployments can be matched to the intended compute resources.

    Key components and roles

    Model serving on Red Hat OpenShift AI 3.3 relies on two primary custom resources:

    • ServingRuntime — Container image, entrypoint, environment, and supported model formats for the inference stack.
    • HardwareProfile — CPU, memory, accelerator, toleration, and node-selector constraints so the scheduler can place workloads on suitable nodes.

    Rebellions provides reference manifests for RBLN ServingRuntime and RBLN HardwareProfile objects that follow this workflow.

    RBLN ServingRuntime

    A ServingRuntime defines how model-serving pods are built: which image to run, how models are supplied, and which protocols and formats are supported. Register one from the dashboard under Settings > Model resources and operations > Serving runtimes, then choose Add serving runtime. The following dialog is shown.

    [Screenshot: Add serving runtime dialog]

    Select an API protocol (REST or gRPC). This example uses REST and a generative model type. Paste the following ServingRuntime YAML into the editor.

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        opendatahub.io/apiProtocol: REST
        opendatahub.io/model-type: '["generative"]'
        opendatahub.io/modelServingSupport: '["single"]'
        opendatahub.io/recommended-accelerators: '["rebellions.ai/ATOM"]'
        opendatahub.io/runtime-version: v0.10.1.1
        openshift.io/display-name: vLLM RBLN ATOM ServingRuntime for RedHat
      labels:
        opendatahub.io/dashboard: "true"
      name: qwen3-06b
      namespace: test-vllm
    spec:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
      containers:
        - args:
            - --port=8080
            - --model=/mnt/models
            - --served-model-name=qwen3-06b
            - --enable-prefix-caching
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          env:
            - name: HOME
              value: /workspace
            - name: OMP_NUM_THREADS
              value: "2"
            - name: VLLM_RBLN_USE_VLLM_MODEL
              value: "1"
            - name: VLLM_LOGGING_LEVEL
              value: WARNING
          image: repo.rebellions.ai/rebellions/vllm-rbln-rhel9:3.3
          name: kserve-container
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            failureThreshold: 30
            initialDelaySeconds: 80
            periodSeconds: 20
            tcpSocket:
              port: 8080
            timeoutSeconds: 5
          volumeMounts:
            - mountPath: /workspace
              name: workspace-volume
      imagePullSecrets:
        - name: repo-rebellions-sw
      multiModel: false
      supportedModelFormats:
        - autoSelect: true
          name: vLLM
      volumes:
        - emptyDir:
            sizeLimit: 100G
          name: workspace-volume
    

    The ServingRuntime above references an image pull secret named repo-rebellions-sw. If your registry credentials differ, either add them to OpenShift's global pull secret or create your own secret and reference it in the ServingRuntime as follows:

    spec:
      imagePullSecrets:
        - name: <your-image-pull-secret>
    

    Apply the manifest with the following command:

    oc apply -f <serving-runtime.yaml>
    

    RBLN Hardware profile

    Hardware profiles are custom resources for targeted scheduling. Use them to declare CPU, memory, accelerator, toleration, and node-selector requirements for workloads such as workbenches and model serving. Create an RBLN profile from Settings > Environment setup > Hardware profiles, then select Create hardware profile.

    [Screenshots: Create hardware profile form]

    The form exposes fields for:

    • Hardware identifiers
    • Explicit resource limits (CPU, memory, accelerators)
    • Tolerations
    • Node selectors

    The following defines a recommended RBLN hardware profile for Rebellions NPU ATOM™-MAX (RBLN-CA25).

    kind: HardwareProfile
    metadata:
      annotations:
        opendatahub.io/dashboard-feature-visibility: '[]'
        opendatahub.io/disabled: "false"
        opendatahub.io/display-name: Rebellions NPU (ATOM)
      name: rebellions-npu
      namespace: redhat-ods-applications
    spec:
      identifiers:
      - resourceType: CPU
        identifier: cpu
        displayName: CPU
        defaultCount: "6"
        minCount: "6"
      - resourceType: Memory
        identifier: memory
        displayName: Memory
        defaultCount: "50Gi"
        minCount: "30Gi"
      - resourceType: Accelerator
        identifier: rebellions.ai/ATOM
        displayName: rebellions-atom
        defaultCount: "1"
        minCount: "1"
    

    Apply the manifest with the following command:

    oc apply -f <hardware-profile.yaml>
    

    For more information, see Red Hat’s guide Working with hardware profiles.

  5. Local testing (Podman or Docker) — Before deploying model serving on OpenShift, verify that a container instance of the RHOAI vLLM RBLN image starts correctly on a host and that the AI model serves requests before deploying it in a Kubernetes pod. Follow the steps below.

    1. Log in to the registry

      Authenticate to the container registry:

      podman login repo.rebellions.ai
      
    2. Run the container

      podman run \
        -d \
        --name vllm-rbln \
        --device /dev/rbln0 \
        --device /dev/rsd0 \
        -v ~/.cache/huggingface:/root/.cache/huggingface:Z \
        -e HF_TOKEN=<hf_token> \
        -e HOME=/workspace \
        -e VLLM_RBLN_USE_VLLM_MODEL=1 \
        -p 8000:8000 \
        --ipc=host \
        repo.rebellions.ai/rebellions/vllm-rbln-rhel9:3.3 \
        Qwen/Qwen3-0.6B \
        --block-size 4096 \
        --max-num-seqs 1 \
        --max-model-len 4096 \
        --max-num-batched-tokens 128
      

    3. Send a test request

      Send a sample chat completion request:

      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "Qwen/Qwen3-0.6B",
          "messages": [{"role":"user","content":"Hello, who are you?"}],
          "max_tokens": 100
        }'
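      The reply follows the OpenAI chat-completions schema, so the generated text sits under choices[0].message.content and can be extracted with jq. A sketch over an illustrative response body (not captured from a real run):

      ```shell
      # Illustrative response body in the OpenAI chat-completions shape
      response='{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen."}}]}'

      # Extract just the generated text
      echo "$response" | jq -r '.choices[0].message.content'
      ```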
      

    When local validation succeeds, continue with model deployment and inference on Red Hat OpenShift AI.