Installation

This section describes how to install and validate the software stack for Red Hat AI Enterprise with Rebellions. Install components and perform verification in the following order:

  1. Red Hat OpenShift cluster — For cluster installation and lifecycle tasks, see OpenShift support.

  2. RBLN NPU Operator — After the cluster is ready, install the operator. It deploys and manages the Rebellions components required to provision RBLN NPUs on OpenShift. See RBLN NPU Operator.

  3. Verification — When the operator is running, Rebellions NPU devices are exposed to cluster workloads. Perform the checks below to confirm resource registration, pod scheduling, and in-container device access.

    1. Node capacity (resource registration)

      Query node capacity (example):

      kubectl get nodes -o json | jq '.items[].status.capacity'
      

      A successful registration includes a rebellions.ai/ATOM entry in the JSON, for example:

      {
        "cpu": "128",
        "memory": "1056858492Ki",
        "pods": "250",
        "rebellions.ai/ATOM": "1"
      }
      

      That entry shows the node advertises the accelerator and can schedule pods that request rebellions.ai/ATOM.
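      The capacity check can be scripted so it fails fast when the resource is missing. The sketch below reuses jq (already used above) on the sample capacity JSON; in practice you would pipe the live kubectl output instead of the echoed sample.

      ```shell
      # Capacity JSON for one node (the sample values from above), piped straight to jq.
      # jq -e sets a non-zero exit status when the key is absent or null, so this
      # line can gate a CI or install script.
      echo '{"cpu":"128","memory":"1056858492Ki","pods":"250","rebellions.ai/ATOM":"1"}' \
        | jq -e '."rebellions.ai/ATOM"'
      ```

      A non-zero exit status here means the node does not advertise the accelerator.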

    2. Device visibility inside the pod

      To use the ATOM™ accelerator, declare rebellions.ai/ATOM under both requests and limits in the pod specification—for example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: rbln-workload
      spec:
        containers:
          - name: rbln-container
            image: <container_image>
            resources:
              requests:
                rebellions.ai/ATOM: "1"
              limits:
                rebellions.ai/ATOM: "1"
      

      Kubernetes requires extended resources such as rebellions.ai/ATOM to have equal requests and limits; setting both lets the scheduler bind the pod to a node that exposes the device.
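      Because mismatched requests and limits are rejected for extended resources, the invariant can be checked in a script before applying the manifest. A sketch over the resources stanza of the pod above (the jsonpath command in the comment is one way to obtain it from a live pod):

      ```shell
      # resources stanza of the pod above, e.g. obtained with:
      #   kubectl get pod rbln-workload -o jsonpath='{.spec.containers[0].resources}'
      resources='{"requests":{"rebellions.ai/ATOM":"1"},"limits":{"rebellions.ai/ATOM":"1"}}'

      # Compare the two counts; jq -e exits non-zero when they differ
      echo "$resources" | jq -e '.requests["rebellions.ai/ATOM"] == .limits["rebellions.ai/ATOM"]'
      ```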

      With the pod running, execute rbln-smi inside the container:

      kubectl exec -it rbln-workload -- rbln-smi
      

      Expected output includes device identifiers, memory, and utilization, which confirms the container can access the ATOM™ accelerator.

      Reference: sample rbln-smi output

      > rbln-smi
      
      +-------------------------------------------------------------------------------------------------+
      |                                Device Information KMD ver: 3.0.2                                |
      +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
      | NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
      +=====+===========+=========+===============+======+=========+======+=====================+=======+
      | 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  36C |  56.4W  | P14  |    0.0B / 15.7GiB   |   0.0 |
      +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
      |                                       Context Information                                       |
      +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
      | NPU | Process             |     PID      |    CTX    | Priority | PTID |      Memalloc | Status |
      +=====+=====================+==============+===========+==========+======+===============+========+
      | N/A | N/A                 |     N/A      |    N/A    |   N/A    | N/A  |           N/A |  N/A   |
      +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
      rbln-smi success, skipping rbln-stat
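      When automating this check, the table can be parsed with standard tools. A sketch that pulls the device name out of the sample row above (the column positions follow this rbln-smi layout and may change between versions):

      ```shell
      # One device row from the rbln-smi table above, captured as a string
      row='| 0   | RBLN-CA25 | rbln0   |  0000:05:00.0 |  36C |  56.4W  | P14  |    0.0B / 15.7GiB   |   0.0 |'

      # Split on '|' and print the device-name column with spaces stripped; prints RBLN-CA25
      echo "$row" | awk -F'|' '{ gsub(/ /, "", $3); print $3 }'
      ```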
      

    If a check fails or command output differs materially from the patterns above (including the sample rbln-smi listing), contact customer support for assistance.

  4. Red Hat OpenShift AI — Install the product using the procedure in the Red Hat OpenShift AI self-managed installation guide.

    Red Hat AI Enterprise with Rebellions NPU has been validated with Red Hat OpenShift AI 3.3, which uses a KServe-oriented model-serving workflow, hardware profiles, and runtime auto-selection so deployments can be matched to the intended compute resources.

    Key components and roles

    Model serving on Red Hat OpenShift AI 3.3 relies on two primary custom resources:

    • ServingRuntime — Container image, entrypoint, environment, and supported model formats for the inference stack.
    • HardwareProfile — CPU, memory, accelerator, toleration, and node-selector constraints so the scheduler can place workloads on suitable nodes.

    Rebellions provides reference manifests for RBLN ServingRuntime and RBLN HardwareProfile objects that follow this workflow.

    RBLN ServingRuntime

    A ServingRuntime defines how model-serving pods are built: which image to run, how models are supplied, and which protocols and formats are supported. Register one from the dashboard under Settings > Model resources and operations > Serving runtimes, then choose Add serving runtime. The following dialog is shown.

    [Screenshot: Add serving runtime dialog]

    Select an API protocol (REST or gRPC). This example uses REST and a generative model type. Paste the following ServingRuntime YAML into the editor.

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        opendatahub.io/apiProtocol: REST
        opendatahub.io/model-type: '["generative"]'
        opendatahub.io/modelServingSupport: '["single"]'
        opendatahub.io/recommended-accelerators: '["rebellions.ai/ATOM"]'
        opendatahub.io/runtime-version: v0.10.1.1
        openshift.io/display-name: vLLM RBLN ATOM ServingRuntime for RedHat
      labels:
        opendatahub.io/dashboard: "true"
      name: qwen3-06b
      namespace: test-vllm
    spec:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
      containers:
        - args:
            - --port=8080
            - --model=/mnt/models
            - --served-model-name=qwen3-06b
            - --enable-prefix-caching
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          env:
            - name: HOME
              value: /workspace
            - name: OMP_NUM_THREADS
              value: "2"
            - name: VLLM_RBLN_USE_VLLM_MODEL
              value: "1"
            - name: VLLM_LOGGING_LEVEL
              value: WARNING
          image: repo.rebellions.ai/rebellions/vllm-rbln-rhel9:3.3
          name: kserve-container
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            failureThreshold: 30
            initialDelaySeconds: 80
            periodSeconds: 20
            tcpSocket:
              port: 8080
            timeoutSeconds: 5
          volumeMounts:
            - mountPath: /workspace
              name: workspace-volume
      imagePullSecrets:
        - name: repo-rebellions-sw
      multiModel: false
      supportedModelFormats:
        - autoSelect: true
          name: vLLM
      volumes:
        - emptyDir:
            sizeLimit: 100G
          name: workspace-volume
    

    The ServingRuntime above references an image pull secret named repo-rebellions-sw. If your registry credentials differ, either add them to OpenShift's global pull secret or create your own secret and reference it in the ServingRuntime as follows:

    spec:
      imagePullSecrets:
        - name: <your-image-pull-secret>
    

    Apply the manifest with the following command:

    oc apply -f <serving-runtime.yaml>
    

    RBLN Hardware profile

    Hardware profiles are custom resources for targeted scheduling. Use them to declare CPU, memory, accelerator, toleration, and node-selector requirements for workloads such as workbenches and model serving. Create an RBLN profile from Settings > Environment setup > Hardware profiles, then select Create hardware profile.

    [Screenshots: Create hardware profile form]

    The form exposes fields for:

    • Hardware identifiers
    • Explicit resource limits (CPU, memory, accelerators)
    • Tolerations
    • Node selectors

    The following defines a recommended RBLN hardware profile for Rebellions NPU ATOM™-MAX (RBLN-CA25).

    kind: HardwareProfile
    metadata:
      annotations:
        opendatahub.io/dashboard-feature-visibility: '[]'
        opendatahub.io/disabled: "false"
        opendatahub.io/display-name: Rebellions NPU (ATOM)
      name: rebellions-npu
      namespace: redhat-ods-applications
    spec:
      identifiers:
      - resourceType: CPU
        identifier: cpu
        displayName: CPU
        defaultCount: "6"
        minCount: "6"
      - resourceType: Memory
        identifier: memory
        displayName: Memory
        defaultCount: "50Gi"
        minCount: "30Gi"
      - resourceType: Accelerator
        identifier: rebellions.ai/ATOM
        displayName: rebellions-atom
        defaultCount: "1"
        minCount: "1"
    

    Apply the manifest with the following command:

    oc apply -f <hardware-profile.yaml>
    

    For more information, see Red Hat’s guide Working with hardware profiles.

  5. Local testing (Podman or Docker) — Before deploying model serving on OpenShift, verify that a container instance of the RHOAI vLLM RBLN image starts correctly on a host and that the AI model serves requests before deploying it in a Kubernetes pod. Follow the steps below.

    1. Log in to the registry

      Authenticate to the container registry:

      podman login repo.rebellions.ai
      
    2. Run the container

      podman run \
        -d \
        --name vllm-rbln \
        --device /dev/rbln0 \
        --device /dev/rsd0 \
        -v ~/.cache/huggingface:/root/.cache/huggingface:Z \
        -e HF_TOKEN=<hf_token> \
        -e HOME=/workspace \
        -e VLLM_RBLN_USE_VLLM_MODEL=1 \
        -p 8000:8000 \
        --ipc=host \
        repo.rebellions.ai/rebellions/vllm-rbln-rhel9:3.3 \
        Qwen/Qwen3-0.6B \
        --block-size 4096 \
        --max-num-seqs 1 \
        --max-model-len 4096 \
        --max-num-batched-tokens 128
      

    3. Send a test request

      Send a sample chat completion request:

      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "Qwen/Qwen3-0.6B",
          "messages": [{"role":"user","content":"Hello, who are you?"}],
          "max_tokens": 100
        }'
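      The reply follows the OpenAI chat-completions schema, so the generated text sits under choices[0].message.content and can be extracted with jq. A sketch over an illustrative response body (not captured from a real run):

      ```shell
      # Illustrative response body in the OpenAI chat-completions shape
      response='{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen."}}]}'

      # Extract just the generated text
      echo "$response" | jq -r '.choices[0].message.content'
      ```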
      

    When local validation succeeds, continue with model deployment and inference on Red Hat OpenShift AI.