Configure Kernel Parameters for RBLN NPUs on OpenShift (Supermicro AMD)

This guide describes how to configure kernel boot parameters required for RBLN NPUs to function correctly under the RBLN NPU Operator on OpenShift.

These parameters are required for stable device operation and must be applied to nodes where NPU workloads are scheduled. Configuration is performed using the OpenShift MachineConfig Operator (MCO).

Without these settings, NPUs may operate incorrectly.

Target Hardware

This guide has been validated on the following configuration:

Item	Value
Vendor	Supermicro
CPU	AMD
NPU	RBLN-CA25
Driver	RBLN Driver v3.0.0

Kernel Parameters

The following kernel parameters are required for optimal NPU operation:

Parameter	Description
`transparent_hugepage=madvise`	Sets THP to madvise mode. THP is applied only to regions explicitly requested by the application via `madvise(MADV_HUGEPAGE)`.
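`pcie_aspm=force`	Enables PCIe Active State Power Management (ASPM) even on devices that do not advertise support for it.
`pci=pcie_bus_perf`	Tunes PCIe payload sizes for performance: sets each device's Max Payload Size (MPS) to the largest value its parent bus allows and raises the Max Read Request Size (MRRS) accordingly.
`pci=bfsort`	Sorts PCI devices in breadth-first order during enumeration.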
`iommu.strict=1`	Enables IOMMU strict mode (immediate IOTLB flush on DMA unmap).

Prerequisites

OpenShift Container Platform 4.x cluster
cluster-admin privileges
OpenShift CLI (oc) installed and logged in
(Optional) Node Feature Discovery Operator v0.16+ (needed only for automatic hardware identification)

Hardware Identification

Before applying MachineConfig, verify that the target node is a Supermicro AMD server.

Manual Verification

# Check system manufacturer
oc debug node/<NODE_NAME> -- chroot /host dmidecode -s system-manufacturer

# Check CPU model
oc debug node/<NODE_NAME> -- chroot /host bash -c "grep 'model name' /proc/cpuinfo | head -1"

Expected output:

Supermicro
model name : AMD EPYC ...
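The two checks above can be combined into a single yes/no test. The following is a minimal sketch, not part of the validated procedure: `is_smci_amd` is a hypothetical helper name, and the case-insensitive substring match is an assumption, since exact DMI and CPU strings vary between boards.

```bash
# Hypothetical helper: succeed only when the two values gathered above
# describe a Supermicro server with an AMD CPU.
#   $1: output of `dmidecode -s system-manufacturer`
#   $2: the "model name" line from /proc/cpuinfo
is_smci_amd() {
  local vendor cpu
  # Lowercase both inputs so the comparison is case-insensitive.
  vendor=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  cpu=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$vendor" in
    *supermicro*) ;;
    *) return 1 ;;
  esac
  case "$cpu" in
    *amd*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, with the outputs of the two commands above stored in variables:
#   is_smci_amd "$VENDOR" "$CPU_LINE" && echo "target hardware confirmed"
```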

(Optional) Automatic Identification via NFD NodeFeatureRule

With NFD v0.16+, the system.dmiid.sys_vendor feature can be used to read DMI vendor information from nodes. A NodeFeatureRule can be used to automatically assign custom labels to matching nodes.

Note

NFD does not automatically create labels for system.dmiid. A NodeFeatureRule must be defined.

Create a NodeFeatureRule

apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: smci-amd-detection
spec:
  rules:
    - name: "detect-supermicro"
      labels:
        hardware.io/vendor-supermicro: "true"
      matchFeatures:
        - feature: system.dmiid
          matchExpressions:
            sys_vendor: {op: In, value: ["Supermicro"]}
    - name: "detect-amd-cpu"
      labels:
        hardware.io/cpu-amd: "true"
      matchFeatures:
        - feature: cpu.model
          matchExpressions:
            vendor_id: {op: In, value: ["AMD"]}

oc apply -f <NFR_YAML_FILE>

Verify Labels

After applying, confirm that the following labels are automatically assigned to the target node:

oc get node <NODE_NAME> -o jsonpath='{.metadata.labels}' | python3 -m json.tool | grep hardware.io

Expected output:

"hardware.io/vendor-supermicro": "true",
"hardware.io/cpu-amd": "true",

Use NFD Labels as MCP nodeSelector

Once NFD labels are active, they can be used directly in the MCP nodeSelector without manual node-role labeling. In that case, update the nodeSelector in the MCP YAML in Create Custom MachineConfigPool as follows:

  nodeSelector:
    matchLabels:
      hardware.io/vendor-supermicro: "true"
      hardware.io/cpu-amd: "true"

With this configuration, Supermicro AMD nodes added to the cluster are automatically included in the MCP, and kernel parameters are applied without manual labeling.
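Because every node that matches joins the pool automatically, it may be worth previewing which nodes currently carry both labels before switching the nodeSelector. A minimal sketch, where `list_smci_amd_nodes` is a hypothetical helper name and the label keys are the ones defined in the NodeFeatureRule above:

```bash
# Hypothetical helper: list the nodes that carry both NFD-generated labels
# and would therefore be selected by the MCP nodeSelector shown above.
list_smci_amd_nodes() {
  oc get nodes \
    -l 'hardware.io/vendor-supermicro=true,hardware.io/cpu-amd=true' \
    -o name
}

# Usage:
#   list_smci_amd_nodes
```

An empty result suggests the NodeFeatureRule has not matched any node yet, in which case the pool would stay empty.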

Applying Kernel Arguments

Warning

If the MachineConfig machineconfiguration.openshift.io/role label is set to worker, it will be applied to the default worker MCP. This means the configuration will be applied to all worker nodes in the cluster, triggering node reboots.

To limit the scope, create a custom MCP that targets only specific nodes (for example, Supermicro AMD servers), and set the MachineConfig role label to that MCP name.

In the steps below, replace __MCP_NAME__ with a name appropriate for your environment (e.g., smci-ca25-amd).
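The placeholder substitution can be scripted rather than edited by hand. The sketch below is one possible approach: `render_mcp_name` is a hypothetical helper, and the name check (lowercase letters, digits, and hyphens only) is an assumption of this sketch chosen because the name ends up in label keys and object names.

```bash
# Hypothetical helper: read a manifest on stdin and replace every
# __MCP_NAME__ placeholder with the pool name given as $1.
render_mcp_name() {
  local LC_ALL=C
  local name="$1"
  # Reject empty names, names with characters outside [a-z0-9-], and
  # names that start or end with a hyphen (assumption of this sketch).
  case "$name" in
    ''|*[!a-z0-9-]*|-*|*-)
      echo "invalid pool name: '$name'" >&2
      return 1
      ;;
  esac
  sed "s/__MCP_NAME__/${name}/g"
}

# Usage (the file name is a placeholder for your own manifest):
#   render_mcp_name smci-ca25-amd < <MCP_YAML_FILE> | oc apply -f -
```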

Pre-check

# Check node status
oc get nodes

# Check target node labels
oc get node <NODE_NAME> --show-labels

# List existing MCPs
oc get mcp

# Check current kernel parameters (record baseline before applying)
oc debug node/<NODE_NAME> -- chroot /host cat /proc/cmdline

Create Custom MachineConfigPool

A custom MCP allows kernel parameters to be applied only to selected nodes.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: __MCP_NAME__
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
          - worker
          - __MCP_NAME__
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/__MCP_NAME__: ""

Note

If using NFD-based node identification, you can replace the nodeSelector with NFD-generated labels (see Use NFD Labels as MCP nodeSelector).

# Create MCP
oc apply -f <MCP_YAML_FILE>

# Assign role label to target node (skip if using NFD automatic identification)
oc label node <NODE_NAME> node-role.kubernetes.io/__MCP_NAME__=""

Verify MCP status:

oc get mcp __MCP_NAME__
# Confirm UPDATED=True, DEGRADED=False

Apply MachineConfig

MachineConfig Name Convention (`99-` prefix)

MCO sorts MachineConfigs lexicographically by name and merges them sequentially to produce the final rendered config. The numeric prefix determines merge priority:

Prefix	Purpose
`00-`	OpenShift defaults (e.g., `00-worker`)
`01-`	Base runtime/kubelet settings
`50-`	Operational custom settings
`99-`	User custom settings (merged last, overrides existing settings)

Using the 99- prefix ensures that kernel parameters are merged after OpenShift default MCs and are reliably reflected in the final config.
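To see where a given name would fall in that ordering, the names can be sorted in plain byte order. This is only an approximation for illustration: `mc_merge_order` is a hypothetical helper, `50-example` is a made-up name, and MCO's own ordering remains authoritative.

```bash
# Hypothetical helper: read MachineConfig names (one per line) on stdin
# and print them in byte-wise lexicographic order.
mc_merge_order() {
  LC_ALL=C sort
}

# Usage with illustrative names:
#   printf '%s\n' 99-grub-kargs-amd-smci-ca25 00-worker 50-example 01-worker-kubelet \
#     | mc_merge_order
#   # 00-worker
#   # 01-worker-kubelet
#   # 50-example
#   # 99-grub-kargs-amd-smci-ca25
#
# Usage against a live cluster:
#   oc get mc -o custom-columns=NAME:.metadata.name --no-headers | mc_merge_order
```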

Idempotency

MCO operates declaratively when applying MachineConfigs. If the same MC is re-applied or if a MC with identical kernel parameters already exists:

MCO compares the current rendered config with the new rendered config.
No node reboot occurs if there are no changes.
The MCP status remains UPDATED=True.

Re-applying the same MC when the parameters are already in place is therefore a no-op and triggers no reboot.

Create the MachineConfig with the following manifest:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-grub-kargs-amd-smci-ca25
  labels:
    machineconfiguration.openshift.io/role: __MCP_NAME__
spec:
  kernelArguments:
    - transparent_hugepage=madvise
    - pcie_aspm=force
    - pci=pcie_bus_perf
    - pci=bfsort
    - iommu.strict=1

oc apply -f <MC_YAML_FILE>

Monitor Rollout

After applying MachineConfig, MCO automatically cordons → drains → reboots the target node.

# Watch MCP status (wait until UPDATING=True → False)
watch oc get mcp __MCP_NAME__

# Watch node status (SchedulingDisabled → Ready)
watch oc get nodes

# Check rendered MC
oc get mc | grep rendered-__MCP_NAME__
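For scripted rollouts, `oc wait` can block on the pool's Updated condition instead of polling with `watch`. The sketch below assumes a hypothetical helper name, `wait_for_mcp`, and an arbitrary 30-minute default timeout. If it runs before MCO has picked up the new MachineConfig, the pool may still report Updated and the command can return immediately, so it is best run once UPDATING=True has been observed.

```bash
# Hypothetical helper: block until the pool reports the Updated condition,
# or fail when the timeout expires.
#   $1: pool name
#   $2: timeout (defaults to 30m, an arbitrary choice for this sketch)
wait_for_mcp() {
  local pool="$1" timeout="${2:-30m}"
  oc wait "mcp/${pool}" --for=condition=Updated --timeout="${timeout}"
}

# Usage:
#   wait_for_mcp __MCP_NAME__ 45m
```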

Verify

Verify Kernel Parameters

oc debug node/<NODE_NAME> -- chroot /host cat /proc/cmdline

Confirm that the following parameters are present in the cmdline output:

transparent_hugepage=madvise
pcie_aspm=force
pci=pcie_bus_perf
pci=bfsort
iommu.strict=1
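Checking five arguments by eye is error-prone, so the comparison can be automated. A minimal sketch, assuming a hypothetical helper named `missing_kargs` that takes the cmdline text as its first argument and matches whole, space-separated arguments only:

```bash
# Hypothetical helper: print every required kernel argument that is
# missing from the given command line. Prints nothing and returns 0
# when all five are present; returns 1 otherwise.
missing_kargs() {
  local cmdline="$1" arg missing=0
  for arg in \
    transparent_hugepage=madvise \
    pcie_aspm=force \
    pci=pcie_bus_perf \
    pci=bfsort \
    iommu.strict=1
  do
    # Pad with spaces so that only whole arguments match.
    case " ${cmdline} " in
      *" ${arg} "*) ;;
      *) echo "${arg}"; missing=1 ;;
    esac
  done
  return "${missing}"
}

# Usage:
#   missing_kargs "$(oc debug node/<NODE_NAME> -- chroot /host cat /proc/cmdline)"
```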

Verify THP Status

oc debug node/<NODE_NAME> -- chroot /host cat /sys/kernel/mm/transparent_hugepage/enabled

Expected output:

always [madvise] never
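The active mode is the one shown in square brackets. For scripted checks, it can be extracted as follows; `thp_active_mode` is a hypothetical helper name used only in this sketch.

```bash
# Hypothetical helper: read the contents of
# /sys/kernel/mm/transparent_hugepage/enabled on stdin and print the
# active mode, which is the bracketed word.
thp_active_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage:
#   oc debug node/<NODE_NAME> -- chroot /host \
#     cat /sys/kernel/mm/transparent_hugepage/enabled | thp_active_mode
```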

Verify Final MCP Status

oc get mcp __MCP_NAME__

Field	Expected
UPDATED	True
UPDATING	False
DEGRADED	False
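The three expected values can be checked in one step. The sketch below assumes the default `oc get mcp` column order (NAME, CONFIG, UPDATED, UPDATING, DEGRADED, then the machine counts) and a pool that already has a rendered config, so that the three status fields sit in columns 3 to 5. `mcp_is_healthy` is a hypothetical helper name.

```bash
# Hypothetical helper: read one `oc get mcp --no-headers` row on stdin and
# succeed only when UPDATED=True, UPDATING=False and DEGRADED=False.
# Assumes the default column order, with those fields in columns 3-5.
mcp_is_healthy() {
  awk '{ ok = ($3 == "True" && $4 == "False" && $5 == "False") }
       END { exit ok ? 0 : 1 }'
}

# Usage:
#   oc get mcp __MCP_NAME__ --no-headers | mcp_is_healthy && echo "pool healthy"
```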

Rollback

If issues arise, deleting the MachineConfig restores the original kernel parameters.

Warning

Follow the deletion order below strictly. Incorrect order may leave nodes in a Degraded state. (Recovery: see Degraded Recovery)

Rollback Procedure

Deletion order:

Delete MC → Wait for reboot to complete → Remove node label → Delete MCP

Step 1. Delete MachineConfig

oc delete mc 99-grub-kargs-amd-smci-ca25

Step 2. Wait for rollout to complete

The MCO re-renders the configuration without the kernel arguments and reboots the node.

Wait until the rollout is fully complete before proceeding:

watch oc get mcp __MCP_NAME__
# Confirm UPDATED=True, UPDATING=False, DEGRADED=False

Step 3. Confirm kernel parameters are removed

Confirm that the previously applied kernel parameters are no longer present:

oc debug node/<NODE_NAME> -- chroot /host cat /proc/cmdline
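If the baseline cmdline recorded during the pre-check was saved, the comparison can be scripted. The following is a sketch under that assumption: `kargs_added` is a hypothetical helper that prints arguments present now but absent from the baseline. Boot-path tokens such as `BOOT_IMAGE=` or `ostree=` may legitimately differ between boots and can be ignored.

```bash
# Hypothetical helper: print arguments that appear in the current kernel
# command line ($2) but not in the baseline ($1), one per line.
# Empty output means no extra arguments remain.
kargs_added() {
  local baseline="$1" current="$2" arg
  printf '%s\n' "$current" | tr -s ' ' '\n' | while IFS= read -r arg; do
    # Skip empty tokens produced by leading or trailing spaces.
    [ -n "$arg" ] || continue
    # Pad with spaces so that only whole arguments match.
    case " ${baseline} " in
      *" ${arg} "*) ;;
      *) printf '%s\n' "$arg" ;;
    esac
  done
}

# Usage, with the pre-check output saved in BASELINE:
#   kargs_added "$BASELINE" \
#     "$(oc debug node/<NODE_NAME> -- chroot /host cat /proc/cmdline)"
```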

Cleanup

After rollback is complete, remove the node label and delete the MCP.

Step 4. Remove node label

oc label node <NODE_NAME> node-role.kubernetes.io/__MCP_NAME__-

Step 5. Delete MCP

oc delete mcp __MCP_NAME__

Troubleshooting

Degraded Recovery

Symptom: If the MCP or node label is deleted before the MC, the node's currentConfig annotation references an already-deleted rendered MachineConfig, leaving the node in a Degraded state.

Node <NODE_NAME> is reporting: "missing MachineConfig rendered-<MCP>-xxxxx"

Cause: When deleting an MC, MCO only re-renders for nodes belonging to that MCP. If the node has left the MCP before the MC is deleted, MCO has no target node to process, and the kernel parameters remain in place.

Recovery Procedure:

# 1. Recreate MCP
oc apply -f <MCP_YAML_FILE>

# 2. Re-assign label to node
oc label node <NODE_NAME> node-role.kubernetes.io/__MCP_NAME__=""

# 3. Get new rendered config name
oc get mcp __MCP_NAME__ -o jsonpath='{.spec.configuration.name}'

# 4. Force-update node's currentConfig annotation to new rendered config
oc annotate node <NODE_NAME> \
  machineconfiguration.openshift.io/currentConfig=<NEW_RENDERED_CONFIG> \
  machineconfiguration.openshift.io/desiredConfig=<NEW_RENDERED_CONFIG> \
  --overwrite

# 5. Restart MCD pod to reset stale state
# Select the daemon pod by label and node name instead of grepping text output
MCD_POD=$(oc get pods -n openshift-machine-config-operator \
  -l k8s-app=machine-config-daemon \
  --field-selector spec.nodeName=<NODE_NAME> -o name)
oc delete "$MCD_POD" -n openshift-machine-config-operator

# 6. Confirm MCP is healthy
watch oc get mcp __MCP_NAME__
# Confirm UPDATED=True, DEGRADED=False

# 7. After recovery, clean up in correct order (Rollback Procedure → Cleanup)
