Configure Kernel Parameters for RBLN NPUs on OpenShift (Supermicro AMD)¶
This guide describes how to configure kernel boot parameters required for RBLN NPUs to function correctly under the RBLN NPU Operator on OpenShift.
These parameters are required for stable device operation and must be applied to nodes where NPU workloads are scheduled. Configuration is performed using the OpenShift MachineConfig Operator (MCO).
Without these settings, NPUs may operate incorrectly.
Target Hardware¶
This guide has been validated on the following configuration:
| Item | Value |
|---|---|
| Vendor | Supermicro |
| CPU | AMD |
| NPU | RBLN-CA25 |
| Driver | RBLN Driver v3.0.0 |
Kernel Parameters¶
The following kernel parameters are required for optimal NPU operation:
| Parameter | Description |
|---|---|
| transparent_hugepage=madvise | Sets Transparent Huge Pages (THP) to madvise mode. THP is applied only to regions explicitly requested by the application via madvise(MADV_HUGEPAGE). |
| pcie_aspm=force | Force-enables PCIe Active State Power Management. |
| pci=pcie_bus_perf | Sets PCIe bus performance optimization mode. |
| pci=bfsort | Enables BFS (Breadth-First Search) sorting for PCI devices. |
| iommu.strict=1 | Enables IOMMU strict mode (immediate IOTLB flush on DMA unmap). |
Prerequisites¶
- OpenShift Container Platform 4.x cluster
- cluster-admin privileges
- OpenShift CLI (oc) installed and logged in
- (Optional) Node Feature Discovery Operator v0.16+, required only for automatic hardware identification
Hardware Identification¶
Before applying MachineConfig, verify that the target node is a Supermicro AMD server.
Manual Verification¶
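The command for this step is not shown here, so the following is a minimal sketch rather than the guide's own procedure. It assumes the node's DMI vendor string contains "Supermicro" and that AMD CPUs report the vendor ID AuthenticAMD. The helper name is_supermicro_amd is hypothetical.

```shell
# Hypothetical helper: decide whether a node matches the validated platform,
# given its DMI system vendor string and its CPU vendor_id line.
is_supermicro_amd() {
  sys_vendor="$1"   # contents of /sys/class/dmi/id/sys_vendor
  cpu_vendor="$2"   # vendor_id line (or value) from /proc/cpuinfo
  case "$sys_vendor" in
    *Supermicro*) ;;          # assumed vendor string
    *) return 1 ;;
  esac
  case "$cpu_vendor" in
    *AuthenticAMD*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example against a live node (NODE is a placeholder; not executed here):
#   v=$(oc debug node/"$NODE" -- chroot /host cat /sys/class/dmi/id/sys_vendor 2>/dev/null)
#   c=$(oc debug node/"$NODE" -- chroot /host grep -m1 vendor_id /proc/cpuinfo 2>/dev/null)
#   is_supermicro_amd "$v" "$c" && echo "Supermicro AMD node"
```

The function only classifies strings, so it can be checked offline before being pointed at a cluster.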
Expected output:
(Optional) Automatic Identification via NFD NodeFeatureRule¶
With NFD v0.16+, the system.dmiid.sys_vendor feature can be used to read DMI vendor information from nodes. A NodeFeatureRule can be used to automatically assign custom labels to matching nodes.
Note
NFD does not automatically create labels for system.dmiid. A NodeFeatureRule must be defined.
Create a NodeFeatureRule¶
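As an illustration only, a rule of this kind might look like the sketch below. The rule name, the label key, the matched vendor strings, and the API group are assumptions: nfd.openshift.io/v1alpha1 is the group used by the OpenShift NFD Operator, while upstream NFD uses nfd.k8s-sigs.io/v1alpha1. The helper render_nfr is hypothetical and prints the manifest so it can be reviewed before applying.

```shell
# Hypothetical helper: print a NodeFeatureRule that labels Supermicro AMD nodes.
# The label key defaults to an illustrative value; pass your own as $1.
render_nfr() {
  label_key="${1:-feature.node.kubernetes.io/smci-amd}"
  cat <<EOF
apiVersion: nfd.openshift.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: smci-amd-rule
spec:
  rules:
    - name: "supermicro-amd"
      labels:
        "${label_key}": "true"
      matchFeatures:
        - feature: system.dmiid
          matchExpressions:
            sys_vendor: {op: In, value: ["Supermicro"]}
        - feature: cpu.model
          matchExpressions:
            vendor_id: {op: In, value: ["AuthenticAMD"]}
EOF
}

# Review the output, then apply it (not executed here):
#   render_nfr | oc apply -f -
```

Both matchFeatures entries must match for the label to be assigned, which keeps the rule from labeling Supermicro nodes with other CPU vendors.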
Verify Labels¶
After applying, confirm that the following labels are automatically assigned to the target node:
Expected output:
Use NFD Labels as MCP nodeSelector¶
Once NFD labels are active, they can be used directly in the MCP nodeSelector without manual node-role labeling. In that case, update the nodeSelector in the MCP YAML in Create Custom MachineConfigPool as follows:
With this configuration, Supermicro AMD nodes added to the cluster are automatically included in the MCP, and kernel parameters are applied without manual labeling.
Applying Kernel Arguments¶
Warning
If the MachineConfig's machineconfiguration.openshift.io/role label is set to worker, the MachineConfig is picked up by the default worker MCP. The configuration is then applied to every worker node in the cluster, and each of those nodes reboots.
To limit the scope, create a custom MCP that targets only specific nodes (for example, Supermicro AMD servers), and set the MachineConfig role label to that MCP name.
In the steps below, replace __MCP_NAME__ with a name appropriate for your environment (e.g., smci-ca25-amd).
Pre-check¶
Create Custom MachineConfigPool¶
A custom MCP allows kernel parameters to be applied only to selected nodes.
Note
If using NFD-based node identification, you can replace the nodeSelector with NFD-generated labels (see Use NFD Labels as MCP nodeSelector).
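A custom pool definition could be sketched as follows. This follows the common custom-pool pattern in which the pool inherits worker MachineConfigs and adds its own. The helper render_mcp and the default node-role label are assumptions; the second and third arguments let you substitute a different node label, such as one generated by NFD.

```shell
# Hypothetical helper: print a MachineConfigPool manifest.
#   $1 = pool name (the __MCP_NAME__ value)
#   $2 = node label key   (default: node-role.kubernetes.io/<pool name>)
#   $3 = node label value (default: empty string)
render_mcp() {
  mcp_name="$1"
  node_label="${2:-node-role.kubernetes.io/${mcp_name}}"
  node_value="${3:-}"
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: ${mcp_name}
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, ${mcp_name}]
  nodeSelector:
    matchLabels:
      ${node_label}: "${node_value}"
EOF
}

# Manual labeling plus pool creation (NODE is a placeholder; not executed here):
#   oc label node "$NODE" node-role.kubernetes.io/smci-ca25-amd=
#   render_mcp smci-ca25-amd | oc apply -f -
```

Listing worker in machineConfigSelector keeps the pool's nodes on the standard worker configuration, with the pool-specific MachineConfigs layered on top.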
Verify MCP status:
Apply MachineConfig¶
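A MachineConfig carrying the five kernel parameters from this guide might look like the sketch below. The object name 99-<pool>-rbln-kargs and the helper render_kargs_mc are hypothetical; the role label must match the name of the MCP that should receive the configuration.

```shell
# Hypothetical helper: print a MachineConfig that adds the kernel arguments
# listed in this guide. $1 = MCP name (the __MCP_NAME__ value).
render_kargs_mc() {
  mcp_name="$1"
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-${mcp_name}-rbln-kargs
  labels:
    machineconfiguration.openshift.io/role: ${mcp_name}
spec:
  kernelArguments:
    - transparent_hugepage=madvise
    - pcie_aspm=force
    - pci=pcie_bus_perf
    - pci=bfsort
    - iommu.strict=1
EOF
}

# Review the output, then apply it (not executed here):
#   render_kargs_mc smci-ca25-amd | oc apply -f -
```

Applying the manifest triggers a rollout on every node in the targeted pool, so confirm the role label before running oc apply.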
MachineConfig Name Convention (99- prefix)¶
MCO sorts MachineConfigs lexicographically by name and merges them sequentially to produce the final rendered config. The numeric prefix determines merge priority:
| Prefix | Purpose |
|---|---|
| 00- | OpenShift defaults (e.g., 00-worker) |
| 01- | Base runtime/kubelet settings |
| 50- | Operational custom settings |
| 99- | User custom settings (merged last, overrides existing settings) |
Using the 99- prefix ensures that kernel parameters are merged after OpenShift default MCs and are reliably reflected in the final config.
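To see where a name falls in the merge order, the names can be sorted locally. This sketch assumes that a byte-wise sort (LC_ALL=C) is a fair stand-in for the name ordering described above; the helper mc_merge_order and the example names are illustrative.

```shell
# Hypothetical helper: print MachineConfig names in lexicographic (byte-wise) order.
mc_merge_order() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

# The last name printed is the one merged last.
mc_merge_order 99-smci-ca25-amd-rbln-kargs 01-worker-kubelet 50-example-custom 00-worker

# Against a cluster (not executed here):
#   mc_merge_order $(oc get mc -o name | sed 's|.*/||')
```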
Idempotency¶
MCO operates declaratively when applying MachineConfigs. If the same MC is re-applied, or if an MC with identical kernel parameters already exists:
- MCO compares the current rendered config with the new rendered config.
- No node reboot occurs if there are no changes.
- The MCP status remains UPDATED=True.
This means re-applying the same MC when the parameters are already in place is safely ignored without unnecessary reboots.
Monitor Rollout¶
After applying MachineConfig, MCO automatically cordons → drains → reboots the target node.
Verify¶
Verify Kernel Parameters¶
Confirm that the following parameters are present in the cmdline output:
- transparent_hugepage=madvise
- pcie_aspm=force
- pci=pcie_bus_perf
- pci=bfsort
- iommu.strict=1
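The check can also be scripted. The sketch below is an assumption about how one might automate it, not part of the validated procedure: the hypothetical helper missing_kargs takes a kernel command line as a string and prints each required parameter that is absent.

```shell
# Hypothetical helper: print every required parameter missing from the
# kernel command line passed as $1. Prints nothing when all are present.
missing_kargs() {
  cmdline="$1"
  for p in transparent_hugepage=madvise pcie_aspm=force \
           pci=pcie_bus_perf pci=bfsort iommu.strict=1; do
    case " $cmdline " in
      *" $p "*) ;;                 # whole-word match, parameter present
      *) printf '%s\n' "$p" ;;     # parameter missing
    esac
  done
}

# Against a live node (NODE is a placeholder; not executed here):
#   missing_kargs "$(oc debug node/"$NODE" -- chroot /host cat /proc/cmdline 2>/dev/null)"
```

Padding both strings with spaces makes the match whole-word, so a value such as pcie_aspm=off is never mistaken for pcie_aspm=force.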
Verify THP Status¶
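On Linux, the active THP mode is the bracketed token in /sys/kernel/mm/transparent_hugepage/enabled, for example always [madvise] never. The following sketch, with the hypothetical helper thp_mode, extracts that token so it can be compared against madvise.

```shell
# Hypothetical helper: extract the active THP mode (the bracketed token)
# from the contents of /sys/kernel/mm/transparent_hugepage/enabled.
thp_mode() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Against a live node (NODE is a placeholder; not executed here):
#   thp_mode "$(oc debug node/"$NODE" -- chroot /host \
#     cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null)"
```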
Expected output:
Verify Final MCP Status¶
| Field | Expected |
|---|---|
| UPDATED | True |
| UPDATING | False |
| DEGRADED | False |
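The three expected values in the table can be checked in one step. This is a sketch under the assumption that the MCP exposes status conditions named Updated, Updating, and Degraded; the helper mcp_settled is hypothetical and only compares strings.

```shell
# Hypothetical helper: succeed only when the pool has settled, that is
# UPDATED=True, UPDATING=False, DEGRADED=False.
mcp_settled() {
  [ "$1" = "True" ] && [ "$2" = "False" ] && [ "$3" = "False" ]
}

# Against a cluster (MCP is a placeholder; not executed here):
#   cond() { oc get mcp "$MCP" -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}"; }
#   mcp_settled "$(cond Updated)" "$(cond Updating)" "$(cond Degraded)" && echo "rollout complete"
```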
Rollback¶
If issues arise, deleting the MachineConfig restores the original kernel parameters.
Warning
Follow the deletion order below strictly. Incorrect order may leave nodes in a Degraded state. (Recovery: see Degraded Recovery)
Rollback Procedure¶
Deletion order:
Step 1. Delete MachineConfig
Step 2. Wait for rollout to complete
The MCO re-renders the configuration without the kernel arguments and reboots the node.
Wait until the rollout is fully complete before proceeding:
Step 3. Confirm kernel parameters are removed
Confirm that the previously applied kernel parameters are no longer present:
Step 5. Delete MCP
Troubleshooting¶
Degraded Recovery¶
Symptom: If the MCP or node label is deleted before the MC, the node's currentConfig annotation references an already-deleted rendered MachineConfig, leaving the node in a Degraded state.
Cause: When deleting an MC, MCO only re-renders for nodes belonging to that MCP. If the node has left the MCP before the MC is deleted, MCO has no target node to process, and the kernel parameters remain in place.
Recovery Procedure: