Cross-vendor IPMI quirks we learned the hard way.
Operating IPMI across a fleet of mixed-vendor hardware is a compatibility minefield. Here are the specific footguns we have hit running monitoring on seven boxes spanning four vendors, six motherboard generations, and three operating system families.
The Intelligent Platform Management Interface (IPMI) has been the dominant out-of-band management standard on commodity server hardware for over two decades. In theory, it gives a single interface to sensor data, system event logs, and remote control across every vendor. In practice, "single interface" describes a target the vendors collectively achieve about 80% of the time, and the remaining 20% is where you spend your hours.
We have spent some hours. Our validation fleet at Glassmkr runs seven boxes across four vendors: Supermicro, Gigabyte, ASUS, and ASRockRack. The motherboard generations span 2017 to 2024. We operate them on a mix of Debian, Ubuntu, Rocky Linux, Alma Linux, and Proxmox. Every interaction with IPMI on this fleet has eventually surfaced a vendor-specific or distro-specific quirk that needed handling.
Here are six of them, in roughly the order they bit us.
1. The two-confounding-issues case: missing tool AND disabled config
One of our boxes, our own production services-1, was reporting IPMI data as empty for weeks. Sensors empty. SEL count zero. The ipmi.available field set to false.
The natural diagnostic sequence on Linux for IPMI is: check kernel modules, check the device node, check the userspace tool, check the agent. We did all four:
ipmi_si, ipmi_devintf, ipmi_msghandler, ipmi_ssif: loaded
/dev/ipmi0: present, root:root, 240,0
ipmitool: bash: ipmitool: command not found Found it. The ipmitool binary was missing from the box. Standard fix: apt install ipmitool.
After installation we restarted the agent. The data was still empty. We had not, in fact, found it.
The second issue surfaced when we read the agent's config file. Buried in the collection section was a single line: ipmi: false. Someone (probably one of us, months earlier, during initial provisioning) had explicitly disabled IPMI collection. The tool's absence had masked the config flag; the config flag had masked the tool's absence. Each was independently sufficient to break IPMI collection. Together they made the diagnostic less obvious than either would have been alone.
Two fixes (install tool, flip flag). Both small. The lesson is structural: when a system reports nothing, the cause might be plural. Check every layer independently before declaring the first layer the culprit.
2. The alias-vs-DMI divergence
Each of our validation boxes has a human-readable alias. The aliases were assigned during provisioning based on a quick scan of the hardware. One of them was labelled "supermicro-x12qch" because the BMC interface said Supermicro. We trusted the BMC and moved on.
Months later, monitoring started enriching server metadata with DMI data from /sys/class/dmi/id/. The DMI report for "supermicro-x12qch" came back: vendor GIGABYTE, product R292-4S1-00.
The box is a Gigabyte. Not a Supermicro. The BMC firmware was lying because it had been re-flashed by the previous owner, or because the BMC vendor SKU does not match the motherboard vendor, or for some other reason we have not chased to ground. From the customer's perspective, the alias was wrong by 100%, but every monitoring action we had taken against the box had still worked, because the underlying IPMI surface is vendor-neutral enough at the protocol level.
The fix on our side was to surface the DMI-reported vendor and product in the dashboard alongside the alias, so the discrepancy becomes visible. Customers naming a box wrong (or being lied to by a re-flashed BMC) is now flagged with a one-glance read.
3. The malformed SEL timestamp
System Event Log timestamps should be ISO-8601. Real-world ipmitool output is not always ISO-8601. We have observed shapes like:
24-11-14T16:16:02 UTCZ Two-digit year, day order varies, the trailing "Z" with a leading "UTC" string, and the time field can be 12-hour or 24-hour depending on BMC firmware version. Our agent was emitting these strings unparsed because the BMC was emitting them unparsed.
The downstream consequence was a rule that used SEL event ages to filter old events. The rule could not parse the timestamps, so it could not filter by age, so old events kept firing forever. The fix was on the agent side: parse the timestamps into proper ISO-8601 (or sentinel) before emitting. The parsing layer had to handle several shapes we did not initially document because we did not know they existed.
Cross-vendor: if you are parsing SEL output, do not trust the timestamp format. Build a layer specifically for normalising it.
4. The amd64-microcode non-free-firmware footgun
Multiple AMD-based boxes in our fleet had a kernel_vulnerabilities alert pointing at a missing microcode update. Our documented guidance said: install the appropriate microcode package, reboot.
On Debian-based boxes, this is apt install amd64-microcode. We ran it. The output was:
E: Package 'amd64-microcode' has no installation candidate The package exists in the Debian archive. It just lives in the non-free-firmware component, which is not enabled in a stock Debian 12 or 13 sources file. A customer following our guidance hits the error and has no obvious next step.
The fix on our side was a documentation update: when our guidance recommends amd64-microcode, include a note that non-free-firmware may need to be enabled in /etc/apt/sources.list. The fix on the customer side is a one-line addition to sources.list followed by apt update. Small. But invisible if our docs do not mention it.
This is the kind of distro-specific knowledge that does not surface unless someone actually runs the fix path on the affected distro. Vendor documentation, in our experience, never mentions it.
5. The firewalld vs ufw asymmetry
Our no_firewall rule shipped with documented commands using ufw, which is the default firewall management tool on Debian, Ubuntu, and derivatives. Customers running Rocky Linux, Alma Linux, or Fedora got the same alert with the same ufw-centric instructions. ufw is not the firewall on those distros. firewalld is.
We had not added a RHEL-family branch to our documentation. The fix was straightforward (firewall-cmd commands for enabling firewalld and the SSH service zone), but the documentation gap had been live for months. The customers most affected by it were exactly the customers least likely to know they were affected: operators who know firewalld would naturally adapt our ufw commands, but operators who are newer might just stop.
The cross-vendor monitoring lesson here: any documentation that recommends a tool needs to handle the multi-distro case. We had handled it for kernel_needs_reboot (separate apt and dnf commands) but missed it for the firewall rule. The audit pass after this discovery surfaced two other rules with similar gaps.
6. The Gigabyte DTS firmware offset
Some Gigabyte AMD motherboards report CPU package temperature via a sensor named CPU<N>_DTS. The reported value is consistently about 30 °C above the actual die temperature, because the sensor is using an internal reference offset that has not been zeroed. This is firmware behaviour, not something the operating system can fix.
A naive temperature alert built on this sensor would fire at "85 °C package temperature critical" while the actual die was at 55 °C, well within normal range. Our default behaviour now is to deprioritise the DTS sensors specifically on these boards, falling back to the per-core sensors which report accurate values. The list of board families needing this treatment grows as we discover more of them.
If you are building temperature monitoring across mixed hardware, never trust a sensor named "DTS" without cross-checking it against another sensor on the same chip. The IPMI protocol provides the data faithfully; the firmware producing the data may not.
The pattern
These six are not the only cases, just the ones we have hit recently enough to remember in detail. Each one was small in isolation. Collectively they are the reason why cross-vendor monitoring is harder than monitoring a homogeneous fleet, and why "support every vendor" is a multi-quarter commitment rather than a feature flag.
The pattern across all six is that the underlying protocol is consistent, but each layer above the protocol introduces vendor or distro variability. The kernel module loads consistently. The /dev/ipmi0 device is consistent. The ipmitool binary's flags are consistent. But the BMC firmware emits whatever it wants. The vendor's documentation may or may not exist. The distro's package manager may or may not have your package. The board's sensor naming may or may not be accurate.
If you operate a single-vendor fleet, this is not your problem. If you operate two or three vendors, you can adapt manually. If you operate four or more, you need automation that handles each vendor's quirks as data, not as exceptions in code.
Our agent does this with vendor profiles that select sensor mapping rules, BMC quirk overrides, and per-vendor sensor downranking. The list of overrides grows monthly. We do not expect this work to ever finish, because each new motherboard generation introduces its own quirks. The honest framing of our cross-vendor support is: we have caught most of the common ones, we will catch more over time, and we will surface failures via validation tests rather than waiting for customers to hit them.
If you are building cross-vendor IPMI tooling, expect the same trajectory. Plan for the audit, not just the build.
Read more about how Glassmkr handles bare-metal monitoring: the documentation, or why bare metal monitoring is different.