A terminal A/B of two ticket drafts for the same alert: the template returns in 3 milliseconds, names the fault and keeps the remote-hands option; Gemma returns in 22 seconds, opens with a generic 'a likely hardware fault' and drops a line. Shipped: template; Gemma left off behind a flag.

The most honest AI feature we shipped has no AI in it

We built the self-hosted-Gemma path for our ticket-draft feature, A/B'd it against a plain template on a live degraded-array alert, and shipped the template. The model was fine. Fine did not justify 22 seconds and a hallucination surface. What we kept, what we cut, and why the value was never the prose.

A monitoring timeline for a Gigabyte MC12-LE0 showing a cluster of multi-hour reporting gaps in mid-May beside an uptime counter that keeps climbing through every gap, proving the box never rebooted.

Would you have caught my VRM degradation?

A customer whose Ryzen 9 5950X lost its VRM asked if Glassmkr would have caught it. The honest answer was no: voltage alone cannot watch a DVFS core rail. What we built instead (a behavioral signal and a variance-aware voltage-drift signal), what happened when we backtested it on our own MC12-LE0, and the uptime check that stopped us claiming a win we had not earned.

A vitest run showing four passing tests tagged by cluster (security, evaluator, tenancy, parser), the agent and dashboard test counts, and a marketing-site line reading zero tests on purpose.

What our test suite looks like, and why

Four tests from the code that runs Glassmkr, and the incident that put each one there: a suppressed security alert, a temperature threshold that means different things on different boards, a 404 that has to stay a 404, and a power-supply name captured from a lying BMC. A test suite as a map of what has hurt you.

A terminal table validating GPU rules across NVIDIA L4, RTX A4000 and A16: L4 stayed quiet at 67C of 90C under load, the A16 reported 8 dies with correct per-die dedup, and ECC/XID/NVLink were fixture-tested rather than induced.

Validating GPU monitoring across three NVIDIA cards: L4, RTX A4000, A16

Three cards, eight rules. We tested three of the rules live on real hardware (power-cap throttling on a multi-die A16, thermal load on an L4, driver drift across all three cards) and deliberately did not induce the other five, because a real uncorrected-ECC or XID fault needs a power cycle to clear. The honest map of what 'GPU monitoring, tested' means.

A monitoring dashboard showing a RHEL host marked healthy while security patches sit pending and unapplied behind a download-only dnf-automatic timer.

We found a security false-negative in our own monitoring

On RHEL-family hosts, download-only dnf-automatic timers were treated as 'auto-updates configured,' silently suppressing the pending_security_updates alert while Critical patches sat unapplied. What the bug was, how dogfooding caught it, and the fix in Crucible 0.13.6.

Multi-vendor IPMI compatibility matrix showing Supermicro, Gigabyte, ASUS, and ASRockRack boards alongside their SEL timestamp formats, sensor quirks, and required workarounds.

Cross-vendor IPMI quirks we learned the hard way

Six specific footguns from running monitoring across Supermicro, Gigabyte, ASUS and ASRockRack on Debian, Ubuntu, Rocky, Alma and Proxmox. SEL timestamp shapes, BMC firmware that lies about vendor, distro-specific package gaps, and the Gigabyte DTS +30 °C offset.

Timeline of a stale no_firewall alert: customer enables ufw at 21:35, dashboard still shows the alert through 21:45, and it finally clears at 21:54. Post-fix, the latency is 5 minutes.

When your monitoring tool punishes customers for doing the right thing

60 minutes of stale alerts after a legitimate fix. A kernel reboot that fired its own critical alert. Two distinct bugs surfaced when we actually applied our own remediation guidance end to end on the validation fleet.

A two-column terminal table comparing HYPOTHESIS against ACTUAL: vendor allowlist vs no allowlist exists, detection is wrong vs detection is correct, fix the detection vs fix the rendering and emission.

When a Phase 1 audit changed our hypothesis

A server reported 'IPMI: Not detected' while showing ECC counts on the same screen. The spec said our vendor detection was wrong. An hour-long audit said detection was correct and three other things were broken. Map the problem before you write the code.

Furnace AI assistant: reads alert, looks at evidence, suggests fix. Self-hosted Gemma 4 26B, no third-party LLM APIs.

Introducing Furnace: the AI assistant that helps you fix alerts

Furnace reads your alerts, looks at the evidence, and suggests remediation. Self-hosted Gemma 4 26B in Amsterdam. Conservative, hedging, willing to say 'I don't know'. The AI in your monitoring shouldn't be the headline.

Terminal probe output highlighting three alert-docs gap patterns surfaced by a constrained AI run

We used an AI as a controlled probe of our alert documentation

We forbade an AI from using its training data and made it resolve real infrastructure alerts using only the guidance our own dashboard produces. Three gap patterns surfaced. All three fixed in the same week.

Training next to Gemma: top output showing python train.py at 2200% CPU alongside llama-server at 9% CPU on l4-ams-01

Training a drive-failure model on a GPU server's CPU

We retrained our drive-failure predictor on 2 years of Backblaze data (222M drive-days) on the CPU of our L4 inference server. Gemma stayed resident in VRAM. 59 minutes, no new compute, 5.8% inference overhead. Plus the feature-importance surprise: SMART 197 beat SMART 187.

Glassmkr terminal preview: crucible fleet --status showing 3 servers, 62 rules evaluated, all healthy

Introducing Glassmkr: bare metal monitoring built by operators

Two pieces, one philosophy: Crucible (the open-source agent) plus Dashboard (the optional SaaS). Built by operators with a decade of bare metal experience across 60+ global locations.

Qwen3.6 vs Gemma 4 benchmark: thinking mode cost 7x latency for no material quality gain

We benchmarked Qwen3.6 against our production Gemma 4 on an L4. Here's what actually mattered.

Three-way benchmark of Gemma 4 26B-A4B, Qwen3.6 35B-A3B no-think, and Qwen3.6 35B-A3B thinking on a production infrastructure health analysis prompt. Real wall-clock numbers, VRAM footprints, and the quality-latency tradeoff that matters for narration.

ipmitool sensor output showing CPU2 at 89C critical, FAN1 at 0 RPM, mixed PSU status

IPMI diagnostics for bare metal: what to monitor and how to read it

A practical guide to monitoring IPMI sensors, SEL logs, and BMC health on Dell, Supermicro, and HPE servers. Covers kipmi0 CPU issues, vendor quirks, and what to alert on.

Gemma 4 26B-A4B (3.8B active) shipped; dense models 8B, 32B, and 70B did not make it

What We Learned Running Gemma 4 on an L4 GPU for Production Server Analysis

How we deployed Gemma 4 26B on an NVIDIA L4 for AI health analysis of bare metal servers. Covers model selection, why vLLM failed, quantization choices, and prompting for structured infrastructure output.

Priority-ordered hardware alerts: P1 SMART failing, P2 RAID degraded, P3 reallocated sectors rising, P4 ECC errors

IPMI, SMART, and RAID: The Hardware Layer Your Cloud Monitoring Tool Ignores

Most monitoring tools stop at the OS. Below it sits an entire hardware layer: disk firmware predicting its own failure, fans at 0 RPM, ECC memory correcting silent errors. Here is what to monitor and why.

Bare metal failure modes across storage, hardware, network, and OS, none of which a cloud APM sees

Why bare metal monitoring is different

Cloud monitoring tools were built for ephemeral workloads. They track HTTP latency and container restarts. But when you run physical servers, the failure modes are fundamentally different: drives wear out, DIMM slots develop bit errors, fans fail silently, and RAID arrays degrade without anyone noticing.