Introducing Crucible: 38 alert rules for bare metal
We built an open-source monitoring agent that covers the failure modes that actually matter when running physical servers. IPMI sensors, SMART attributes, RAID health, network errors, and OS-level alerts, all in a single binary with zero configuration.
The problem
General-purpose monitoring tools like Datadog, New Relic, and Grafana Cloud were built for cloud workloads. They track HTTP latency, container restarts, and application metrics. When you try to use them for bare metal, you end up writing custom checks for everything that matters: drive health, IPMI thresholds, RAID degradation, ECC memory errors.
Traditional tools like Nagios and Zabbix can do hardware monitoring, but they require hours of configuration. You need to define every check, every threshold, every notification rule. Most operators end up with a fraction of the coverage they need.
Our approach
Crucible takes a different path. Instead of being configurable for every possible scenario, it is opinionated about what matters. We spent a decade operating bare metal infrastructure across 60+ locations. We know what breaks. We encoded that knowledge into 38 alert rules that cover the real failure modes:
- OS (9 rules): RAM pressure, CPU utilization, CPU iowait, OOM kills, load average vs core count, clock drift, swap usage, NTP sync, unexpected reboots
- Storage (8 rules): SMART health and reallocated sectors, NVMe wear leveling, RAID degradation, disk space, disk I/O errors, disk latency, filesystem going read-only, inode exhaustion
- Network (5 rules): Interface error rates, link speed mismatches, interface saturation, conntrack table exhaustion, bond slave down
- Hardware/IPMI (5 rules): CPU temperature, fan failures, PSU redundancy loss, ECC memory errors (correctable and uncorrectable), critical SEL events
- ZFS (2 rules): Pool health, scrub errors
- Security (6 rules): SSH root password login, missing firewall, pending security updates, kernel vulnerability mitigations, kernel reboot needed, unattended upgrades disabled
- Service Health (3 rules): Failed systemd services, file descriptor exhaustion, server unreachable
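The opinionated-rules idea above can be sketched as a pure check over a snapshot. The rule shape, field names, and thresholds below are illustrative assumptions, not Crucible's actual API:

```javascript
// Hypothetical sketch of one opinionated rule: load average vs core count.
// The rule shape and snapshot fields are assumptions, not Crucible's API.
const loadVsCores = {
  id: 'os.load_vs_cores',
  priority: 'P3',
  // Fires when the 5-minute load average exceeds the logical core count by 50%.
  evaluate(snapshot) {
    const ratio = snapshot.load5 / snapshot.cpuCores;
    return ratio > 1.5
      ? { firing: true, detail: `load5 ${snapshot.load5} on ${snapshot.cpuCores} cores` }
      : { firing: false };
  },
};

// A 16-core machine at load 30 fires; the same machine at load 4 does not.
const hot = loadVsCores.evaluate({ load5: 30, cpuCores: 16 });
const calm = loadVsCores.evaluate({ load5: 4, cpuCores: 16 });
```

Because every rule ships with a baked-in threshold like this, there is nothing for the operator to define up front.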
Per-core CPU monitoring
Since version 0.3.0, Crucible reports per-core CPU utilization: user, system, iowait, idle, irq, and softirq for every logical core. A single core pegged at 90% looks like 3% aggregate on a 32-thread machine. Without per-core data, you would never catch a stuck process, an IRQ affinity issue, or a single-threaded bottleneck. Forge renders this as an expandable per-core chart alongside the aggregate view.
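The arithmetic behind that claim is easy to see with a toy example: 31 idle cores and one stuck at 90% busy average out to under 3%.

```javascript
// Why one pegged core vanishes in the aggregate: 32 logical cores,
// 31 idle and one stuck at 90% busy.
const cores = Array(32).fill(0);
cores[7] = 90;

const aggregate = cores.reduce((sum, pct) => sum + pct, 0) / cores.length;
const hottest = Math.max(...cores);

console.log(aggregate); // 2.8125 -> reads as ~3% on an aggregate-only chart
console.log(hottest);   // 90 -> obvious in a per-core view
```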
How it works
Crucible runs as a systemd service. Every 5 minutes it collects a complete snapshot of your server's health and pushes it to Forge. One command to install, zero configuration files to maintain.
curl -sf https://forge.glassmkr.com/install | bash

The agent is lightweight (a single Node.js process, ~90MB RSS, varying with hardware; servers with more IPMI sensors use more memory) and non-intrusive. It reads from /proc, /sys, smartctl, ipmitool, and mdadm. No kernel modules, no eBPF, no root-level hooks into your application stack.
Each snapshot includes: aggregate and per-core CPU, memory and swap, disk space with inode counts and mount options, SMART attributes (model, temperature, power-on hours, reallocated sectors), network interface stats, RAID array status, IPMI sensor readings, and security posture (SSH config, firewall, kernel vulns, pending updates).
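Put together, a snapshot might look roughly like the object below. The field names and nesting are assumptions for the sake of the example, not the agent's actual wire format:

```javascript
// Illustrative shape of one snapshot; field names are assumptions,
// not Crucible's real output format.
const snapshot = {
  cpu: { aggregate: { user: 12.4, system: 3.1, iowait: 0.6, idle: 83.9 }, perCore: [] },
  memory: { totalMb: 64210, usedMb: 31876, swapUsedMb: 0 },
  disks: [{ mount: '/', usedPct: 41, inodesUsedPct: 3, options: 'rw,noatime' }],
  smart: [{ device: '/dev/sda', model: 'ST4000NM000A', tempC: 34, powerOnHours: 18230, reallocatedSectors: 0 }],
  network: [{ iface: 'eth0', rxErrors: 0, txErrors: 0, speedMbps: 10000 }],
  raid: [{ array: 'md0', state: 'clean', degraded: false }],
  ipmi: [{ sensor: 'CPU1 Temp', value: 52, unit: 'C', status: 'ok' }],
  security: { sshRootPasswordLogin: false, firewall: 'nftables', pendingSecurityUpdates: 2 },
};
```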
Alert management
Forge evaluates all 38 rules on every snapshot and assigns priority levels: P1 Urgent (data loss imminent), P2 High (service-impacting), P3 Medium (degrading), P4 Low (informational). Each alert card shows evidence links to the relevant dashboard section and copy-pasteable diagnostic commands.
Alerts can be acknowledged (silences notifications for that occurrence) or the underlying rule can be muted per-server (stops it from firing entirely until unmuted). This lets operators handle known issues without noise. Alert tabs let you filter between Active, Acknowledged, and All views.
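The acknowledge-versus-mute distinction boils down to two lookups. The state model below is an assumption sketched for clarity, not Forge's implementation:

```javascript
// Sketch of acknowledge vs mute semantics; the state shape is an assumption.
function shouldNotify(alert, state) {
  // Muted rule: stays silent on that server until unmuted, even for new occurrences.
  if (state.mutedRules.has(`${alert.server}:${alert.rule}`)) return false;
  // Acknowledged: only that specific occurrence is silenced.
  if (state.acknowledged.has(alert.occurrenceId)) return false;
  return true;
}

const state = {
  mutedRules: new Set(['db-01:storage.smart_health']),
  acknowledged: new Set(['occ-42']),
};

// The muted server stays quiet for a brand-new occurrence;
// the same rule on a different server still notifies.
const muted = shouldNotify({ server: 'db-01', rule: 'storage.smart_health', occurrenceId: 'occ-99' }, state);
const other = shouldNotify({ server: 'web-02', rule: 'storage.smart_health', occurrenceId: 'occ-99' }, state);
```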
Open source
Crucible is MIT licensed and available on GitHub. You can run it standalone, pipe the output to your own systems, or pair it with Forge for history, fleet views, and managed alerting.
We believe the monitoring agent should be free and open. You should be able to inspect exactly what data leaves your server. The value of Forge is in the fleet-level dashboard, AI analysis, and notification infrastructure, not in locking down the collector.
What comes next
We are working on expanding Crucible's coverage. GPU monitoring (NVIDIA SMI), storage controller support beyond mdadm, deeper NVMe health reporting, and process-level monitoring are on the roadmap. Every addition follows the same principle: only ship alerts for failure modes we have actually encountered in production.