Alert rules
Glassmkr ships 60 alert rules tuned for bare-metal infrastructure. Each rule has a title, summary, priority, and category here; per-alert remediation guidance (command to run, what to verify, rollback notes) is rendered inside the dashboard on the alert detail page.
For AI agents: the machine-readable catalog is at /llms-full.txt.
Storage
disk_io_errors | P1 | Disk I/O errors
Kernel logged I/O errors against one or more block devices. Indicates failing storage hardware, a flaky cable/controller, or filesystem corruption. Investigate immediately to prevent data loss.
disk_latency_high | P3 | Disk latency high
Disk's average read or write latency exceeds the threshold under non-trivial IOPS load. Indicates a struggling drive, saturated I/O queue, in-progress RAID rebuild, or noisy-neighbor workload.
nvme_critical_warning | P1 | NVMe critical warning byte non-zero
An NVMe device's Critical Warning byte (NVM Express §5.21) is non-zero. Per spec, any non-zero bit is a vendor-recommended immediate-action signal: temperature threshold exceeded, available spare below threshold, reliability degraded, read-only mode, volatile memory backup failed, or persistent memory region read-only.
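As an illustration of how the warning byte breaks down into the conditions listed above, here is a minimal sketch. The bit positions are the ones commonly documented for the NVMe SMART / Health Information log and should be verified against the spec revision you target; the function name `decode_critical_warning` is hypothetical and is not part of Glassmkr or Crucible.

```python
# Assumed bit layout of the NVMe Critical Warning byte (verify against the spec).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature threshold exceeded",
    2: "reliability degraded",
    3: "read-only mode",
    4: "volatile memory backup failed",
    5: "persistent memory region read-only",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the conditions whose bits are set in the warning byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("critical warning is a single byte")
    return [
        name
        for bit, name in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]
```

An empty list corresponds to a healthy device; any non-empty result would match the condition this rule describes.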
nvme_wear_high | P2 | NVMe wear high
NVMe drive's percentage-used indicator is at or above the configured threshold. Plan replacement before the drive enters read-only protection mode at 100%.
raid_degraded | P1 | RAID array degraded
One or more disks have failed in an mdadm software array or a hardware RAID controller (Dell PERC, LSI/Broadcom MegaRAID, HPE Smart Array, Adaptec). One more failure may cause data loss.
smart_failing | P1 | Drive failing per SMART
SMART data indicates imminent drive failure (reallocated sectors, pending sectors, or aggregate health != PASSED). Back up data and replace the drive.
ZFS
zfs_pool_unhealthy | P1 | ZFS pool unhealthy
ZFS pool in non-OPTIMAL state. Severity scales with vdev redundancy class (Crucible v0.10.4+): SUSPENDED pools and FAULTED top-level vdevs page critical; DEGRADED on single/raidz1/mirror_2way pages critical; raidz3/mirror_3way+ pages warning. L2ARC failures emit at info severity (no data loss). SLOG faults handled by zfs_slog_faulted.
zfs_scrub_errors | P1 | ZFS scrub found errors
ZFS pool's most recent scrub detected checksum or repair errors, or the pool has not been scrubbed in over 30 days. Errors suggest failing disks or silent corruption; a missed scrub is a preventive-maintenance gap.
zfs_slog_faulted | P1 | ZFS SLOG vdev faulted
A ZIL log vdev (SLOG) is FAULTED or REMOVED. Sync-write durability for the pool is compromised until the SLOG is replaced.
Filesystem
disk_fill_projection | P1 | Disk fill projection imminent
Linear projection on filesystem available_bytes indicates exhaustion within 24h (P1) or 7d (P2). Companion to disk_space_high (absolute %).
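The linear projection described above can be sketched as a least-squares fit over recent samples. This is an illustrative approximation only: the sample format, the function names `hours_until_full` and `projection_priority`, and the choice of an ordinary least-squares fit are assumptions, not the product's actual implementation.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Fit a line through (unix_seconds, available_bytes) samples.

    Returns the hours from the last sample until the fitted line reaches
    zero available bytes, or None if free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope >= 0:
        return None  # flat or growing free space: no exhaustion projected
    intercept = mean_b - slope * mean_t
    t_zero = -intercept / slope
    return max(0.0, (t_zero - samples[-1][0]) / 3600.0)


def projection_priority(hours: Optional[float]) -> Optional[str]:
    """Map the projection to the bands named in the rule: 24h -> P1, 7d -> P2."""
    if hours is None:
        return None
    if hours <= 24:
        return "P1"
    if hours <= 24 * 7:
        return "P2"
    return None
```

For example, a filesystem losing 10 GB per hour with 80 GB left at the last sample projects to exhaustion in about 8 hours, which falls in the P1 band.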
disk_space_high | P2 | Disk space high
Filesystem usage at or above the configured threshold (default 85%). At 100%, services that write to this filesystem will fail; at >=95%, the remaining headroom is hours, not days.
fd_exhaustion | P1 | File descriptor exhaustion
Host-wide file descriptor usage at or above 80% of fs.file-max OR a single process at or above 80% of its RLIMIT_NOFILE soft limit. The per-process path activates with Crucible v0.11.0+; older agents only emit on the host-wide path.
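The two paths in this rule can be sketched as follows. The host-wide ratio here is derived from the text of `/proc/sys/fs/file-nr` (three fields: allocated handles, unused handles, and the maximum, which mirrors `fs.file-max`); the function names and the way the per-process figures are passed in are assumptions made for illustration.

```python
def host_fd_ratio(file_nr_text: str) -> float:
    """Parse /proc/sys/fs/file-nr text and return allocated / maximum."""
    allocated, _unused, maximum = (int(x) for x in file_nr_text.split())
    return allocated / maximum


def fd_exhaustion(
    host_ratio: float,
    proc_open_fds: int,
    proc_soft_limit: int,
    threshold: float = 0.80,
) -> bool:
    """True if either the host-wide or the per-process path crosses 80%."""
    per_process = proc_open_fds / proc_soft_limit
    return host_ratio >= threshold or per_process >= threshold
```

On a live host, the per-process inputs could come from counting entries in `/proc/<pid>/fd` and reading the soft limit from `/proc/<pid>/limits`.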
filesystem_readonly | P1 | Filesystem remounted read-only
Kernel forced a filesystem to read-only mode, usually due to I/O errors. Services that write to this mount will fail. Data already on the filesystem is readable; new writes are not.
inode_high | P2 | Inode usage high
Filesystem has many small files; inode usage at or above 85% of the table. At 100%, file creation fails with ENOSPC even though `df -h` shows free space.
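To see why `df -h` can look healthy while file creation fails, it helps to read the inode counters directly. A minimal sketch using Python's standard `os.statvfs` follows; the function name `inode_usage_percent` is hypothetical.

```python
import os
from typing import Optional


def inode_usage_percent(path: str) -> Optional[float]:
    """Return inode usage for the filesystem containing `path`, in percent.

    Returns None when the filesystem reports zero total inodes, which some
    filesystems do because they allocate inodes dynamically.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files
```

A result at or above 85 would correspond to the threshold named in this rule; `df -i` reports the same counters from the shell.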
lvm_thinpool_metadata_high | P1 | LVM thin pool metadata near full
LVM thin pool metadata volume at or above 80%. Metadata exhaustion is silent and catastrophic: writes across all thin volumes in the pool start failing in unpredictable ways at 100%. Extend the metadata volume before it fills.
Memory & CPU
cpu_high | P2 | CPU usage high
Aggregate CPU utilization at or above 90% (idle at or below 10%). Critical at >=98% (idle <=2%). Either a runaway process or workload exceeding capacity.
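One plausible way to derive aggregate utilization is from two readings of the first `cpu` line in `/proc/stat`. The sketch below treats only the `idle` field as idle time, which is an assumption (the agent may account for iowait differently); the function names are hypothetical.

```python
from typing import Optional, Tuple


def cpu_times(stat_line: str) -> Tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    fields = [int(x) for x in stat_line.split()[1:]]
    idle = fields[3]
    # Sum user..steal only: guest time is already included in user/nice.
    total = sum(fields[:8])
    return idle, total


def utilization_percent(before: str, after: str) -> float:
    """Utilization between two samples of the aggregate 'cpu' line."""
    idle0, total0 = cpu_times(before)
    idle1, total1 = cpu_times(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1 - (idle1 - idle0) / elapsed)


def cpu_severity(utilization: float) -> Optional[str]:
    """Apply the bands from the rule: >=98% critical, >=90% warning."""
    if utilization >= 98:
        return "critical"
    if utilization >= 90:
        return "warning"
    return None
```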
cpu_iowait_high | P2 | CPU I/O wait high
CPU is spending 20%+ of its time waiting on disk I/O. Indicates a storage bottleneck: either an overwhelmed device or runaway I/O from one process.
cpu_pressure_high | P2 | CPU pressure stall sustained
PSI reports CPU contention persistently above threshold. Aggregate signal across the host; subordinates cpu_high and load_high to this incident when it fires.
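For readers unfamiliar with PSI, the kernel exposes it as text under `/proc/pressure/` (for example `/proc/pressure/cpu`). The sketch below parses that format and applies a simple "sustained" check; the threshold value and the choice of comparing both the 10-second and 60-second averages are assumptions for illustration, not the rule's documented logic.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as 'some avg10=1.23 avg60=0.50 avg300=0.10 total=42'."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def pressure_sustained(text: str, threshold: float) -> bool:
    """True when both the 10s and 60s 'some' averages exceed the threshold."""
    some = parse_psi(text)["some"]
    return some["avg10"] > threshold and some["avg60"] > threshold
```

Requiring both windows to exceed the threshold is one way to avoid firing on a brief spike that only moves the 10-second average.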
load_high | P3 | Load average high
1-minute load average exceeds 2x the CPU core count for several minutes. Usually indicates an I/O bottleneck (high D-state processes) rather than pure CPU saturation.
mem_pressure_high | P1 | Memory pressure sustained
PSI reports memory contention with active paging or rapid MemAvailable decline. Real pressure signal, not used% noise.
oom_kills | P1 | OOM killer recently fired
Kernel out-of-memory killer terminated one or more processes in the recent window. Severe memory pressure or a memory leak. Killed services may be down.
ram_high | P3 | RAM usage high
RAM usage is high on the host. Warning at 90% used, critical at 95%. Sustained pressure leads to swap thrashing and OOM kills.
swap_high | P2 | Swap usage high
Swap usage at or above 50%. Swap I/O is 10-100x slower than RAM; sustained swap activity means thrashing. Critical band (>=80%) indicates imminent service degradation.
Network
accept_backlog_or_syn_flood | P1 | Accept backlog or SYN flood
Two or more of conntrack_exhaustion / listen_overflow / tcp_retrans_high are active on the same host within 5 minutes. Indicates accept-queue buildup or SYN flood.
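The correlation described above can be sketched as a check over the most recent firing time of each contributing rule on one host. The data shape (a mapping from rule id to last-seen time) and the function name `composite_fires` are assumptions for illustration; "within 5 minutes" is read here as "last seen no more than 5 minutes before now".

```python
from datetime import datetime, timedelta
from typing import Dict

CONTRIBUTING_RULES = {"conntrack_exhaustion", "listen_overflow", "tcp_retrans_high"}


def composite_fires(
    last_seen: Dict[str, datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    minimum: int = 2,
) -> bool:
    """True when at least `minimum` contributing rules fired inside the window."""
    recent = {
        rule
        for rule, seen in last_seen.items()
        if rule in CONTRIBUTING_RULES and timedelta(0) <= now - seen <= window
    }
    return len(recent) >= minimum
```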
bond_slave_down | P1 | Bond slave interface down
A slave NIC in a bonded interface has MII status down. The bond is running with reduced redundancy; one more failure breaks the bond entirely.
conntrack_exhaustion | P1 | Conntrack table near full
Linux nf_conntrack table is at or above 75% capacity. At 100%, new connections are silently dropped; services appear to work but new clients can't connect. Critical band (>=90%) means dropping is imminent.
interface_errors | P2 | Interface errors high
Network interface reports elevated CRC / frame / carrier errors (physical layer) OR elevated packet drops (software ring/softirq layer). Tier red = critical (cable swap or kernel tuning urgent); tier yellow = warning.
interface_saturation | P3 | Interface near saturation
Network interface utilization above the configured threshold (default 90% of negotiated speed). Plan a bandwidth upgrade or traffic shaping; growing queue depth tends to precede out-of-memory failures in connection-handling daemons.
lacp_partner_lost | P1 | LACP partner lost
Bond MII layer reports up but the LACP partner is unsynchronized. The bond appears functional while traffic is dropped by the switch. Also emits a warning when the active aggregator has fewer ports than configured (redundancy reduced).
link_speed_mismatch | P2 | Link speed mismatch
Network interface negotiated a speed below 1 Gbps despite supporting higher. Almost always a physical-layer or autoneg issue; rarely a real config decision.
listen_overflow | P2 | TCP listen-queue dropping connections
/proc/net/netstat TcpExt ListenOverflows or ListenDrops is incrementing; the kernel is dropping arriving connections at accept-queue level. Either the application can't accept() fast enough or net.core.somaxconn is too small for the offered load.
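`/proc/net/netstat` stores each table as a pair of lines: a header line of counter names followed by a line of values, both prefixed with the table name. A minimal sketch of detecting an incrementing counter between two readings follows; the function names are hypothetical and the sample layout in the comments is abbreviated.

```python
from typing import Dict


def parse_netstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse header/value line pairs such as
    'TcpExt: ListenOverflows ListenDrops' / 'TcpExt: 3 5'."""
    lines = text.strip().splitlines()
    tables: Dict[str, Dict[str, int]] = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        tables[prefix.strip()] = dict(
            zip(names.split(), (int(n) for n in numbers.split()))
        )
    return tables


def listen_queue_dropping(previous: str, current: str) -> bool:
    """True if ListenOverflows or ListenDrops grew between two readings."""
    before = parse_netstat(previous)["TcpExt"]
    after = parse_netstat(current)["TcpExt"]
    return any(
        after[name] > before[name]
        for name in ("ListenOverflows", "ListenDrops")
    )
```

The same parser also fits `/proc/net/snmp`, which uses the identical header/value layout.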
softnet_drops | P1 | Kernel softnet dropping packets
/proc/net/softnet_stat reports kernel input-queue drops at a sustained rate (>1 pkt/s). The NET_RX softirq backlog is filling faster than the kernel can process; packets are being silently discarded. Often correlated with conntrack pressure or CPU pressure.
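`/proc/net/softnet_stat` has one row per CPU with hexadecimal columns; the second column is the per-CPU dropped-packet counter. The sketch below sums that column and turns two readings into a rate. The function names and the idea of comparing exactly two snapshots are assumptions for illustration.

```python
def total_dropped(softnet_text: str) -> int:
    """Sum the 'dropped' column (second field, hex) across all CPUs."""
    return sum(
        int(line.split()[1], 16)
        for line in softnet_text.strip().splitlines()
    )


def drop_rate(previous: str, current: str, interval_seconds: float) -> float:
    """Packets dropped per second between two readings of softnet_stat."""
    return (total_dropped(current) - total_dropped(previous)) / interval_seconds
```

A rate above 1.0 sustained over several intervals would correspond to the condition this rule describes.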
tcp_retrans_high | P2 | TCP retransmit rate elevated
TCP retransmit ratio (retransmits / segments sent) over the most recent snapshot interval exceeds 2%. Above 1% commonly impacts performance; above 5% significantly degrades throughput. Indicates network reliability or remote-peer problems.
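The ratio above can be computed from counter deltas between two snapshots. The sketch assumes the counters are the `OutSegs` and `RetransSegs` fields of the `Tcp` table in `/proc/net/snmp`; the source does not name the counters, so treat that mapping and the function names as assumptions.

```python
from typing import Dict, Optional


def retrans_ratio(previous: Dict[str, int], current: Dict[str, int]) -> Optional[float]:
    """Retransmitted segments divided by segments sent over one interval.

    Returns None when nothing was sent, since the ratio is undefined.
    """
    sent = current["OutSegs"] - previous["OutSegs"]
    retransmitted = current["RetransSegs"] - previous["RetransSegs"]
    if sent <= 0:
        return None
    return retransmitted / sent


def retrans_high(ratio: Optional[float], threshold: float = 0.02) -> bool:
    """Apply the 2% threshold from the rule."""
    return ratio is not None and ratio > threshold
```

Working from deltas rather than absolute counters matters here: the kernel counters are cumulative since boot, so a long-running host would otherwise mask a recent spike.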
Hardware (BMC/IPMI)
cpu_temperature_high | P1 | CPU temperature high
CPU thermal reading at or above the warning threshold (default 80°C; critical 90°C). At critical, thermal throttling kicks in and silicon damage risk climbs.
ecc_errors | P1 | ECC memory errors
Memory controller reported one or more uncorrectable ECC errors. Data corruption has occurred; the DIMM is failing. Replace immediately.
ipmi_fan_failure | P1 | IPMI fan failure
BMC reports one or more chassis fans in critical state or at 0 RPM. Cooling capacity is reduced; CPU temperatures may climb and trigger thermal throttling or emergency shutdown.
ipmi_sel_critical | P1 | IPMI SEL critical events
BMC System Event Log contains one or more critical-severity asserted events in the last N days (default 30). Critical events indicate real hardware faults: DIMM, PSU, fan, voltage, or temperature.
mce_uncorrected | P0 | Uncorrected machine check exception
EDAC reports an uncorrected memory error. Replace the affected DIMM.
psu_redundancy_loss | P1 | PSU redundancy lost
One or more PSUs are in fault, absent, or degraded state. A single power failure now risks a full server outage. Dell BMCs report this via an aggregate sensor; other vendors via per-PSU sensors.
GPU
gpu_corrected_ecc_storm | P3 | GPU corrected-ECC level high
GPU corrected-ECC counter is high or single-bit retired pages are non-zero. SBE storms typically precede DBE faults; this rule gives operators time to plan preventive replacement before uncorrected ECC fires.
gpu_driver_or_firmware_drift | P3 | GPU vbios drift within host
Multiple GPUs of the same model on this host report different vbios versions. Within-host vbios drift typically indicates a failed firmware update or mixed-batch installation.
gpu_pcie_link_degraded | P2 | GPU PCIe link degraded
GPU's current PCIe gen or width is below the GPU's advertised maximum. Host-to-GPU bandwidth is capped below the GPU's capability; meaningful for large-model loading and PCIe-attached weights, catastrophic for training-style workloads.
gpu_power_cap_throttling | P2 | GPU power-cap throttling
GPU is being throttled by software power cap (sw_power_cap) or hardware power brake (hw_power_brake). May be intentional (operator-configured limit) or unexpected (PSU sizing, chassis power policy).
gpu_thermal_critical | P1 | GPU thermal critical
GPU die temperature at or above HW slowdown threshold, or kernel reports thermal throttle engaged. Sustained operation at thermal limits accelerates wear and reduces inference throughput. A 300s boot grace period allows sensors to stabilise after boot.
gpu_uncorrected_ecc | P0 | GPU uncorrected ECC or DBE retired pages
GPU reports uncorrected ECC errors, double-bit ECC retired pages, or pending retirements. Uncorrected ECC means error correction could not recover; in-flight data may have been corrupted. Pending retirements require a reboot.
gpu_xid_critical | P0 | GPU XID critical event
NVIDIA XID error classified as critical per NVIDIA's published XID severity table. Hardware-witnessed fault on the GPU; data may be at risk and the workload likely degraded.
nvlink_link_down | P1 | NVLink link down
An NVLink on a multi-GPU host is in the down state. Multi-GPU bandwidth is reduced; if the GPU participates in NCCL collectives the entire training/inference job's latency degrades.
Time & Services
clock_drift | P2 | Clock drift
System clock is at least 5 seconds off from upstream NTP. Critical at >=60s; TLS validation, log correlation, database replication, and cron all break.
ntp_not_synced | P2 | NTP not synced
Either the kernel clock is unsynchronized (critical; drift in progress) OR the NTP daemon has stopped while the clock is still synced (warning; drift will start once kernel state expires).
service_flapping | P1 | systemd service flapping
A systemd unit has hit its start-limit (systemd stopped restarting it) OR has restarted 5+ times. A service that can't stabilise consumes resources without delivering value; investigate before bumping restart limits.
systemd_service_failed | P1 | systemd service failed
One or more systemd units are in the failed state. The service is not running; dependent functionality is offline. Crucible v0.9.2+ also ships the last 5 journal lines per failed unit in evidence so root cause is one click away.
systemd_service_oom_killed | P1 | systemd service killed by OOM
systemd reports a failed unit with Result=oom-kill. The kernel OOM killer terminated the service; pair with the host-level oom_kills emission to find the underlying memory pressure source.
unexpected_reboot | P1 | Unexpected reboot
Server rebooted without an operator-acknowledged planned reboot. Possible causes: kernel panic, hardware fault (PSU brownout, thermal shutdown, watchdog), power outage, or remote reboot via BMC.
Security & Patching
kernel_needs_reboot | P2 | Reboot required (newer kernel installed)
A newer kernel package is installed on disk but the running kernel is older. Security patches in the new kernel are not active until reboot.
kernel_vulnerabilities | P2 | Kernel vulnerability mitigations missing
One or more CPU vulnerability mitigations (Spectre, Meltdown, MDS, etc.) report unmitigated or partial coverage in /sys/devices/system/cpu/vulnerabilities/. Update kernel + CPU microcode to apply.
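Each file under `/sys/devices/system/cpu/vulnerabilities/` holds a one-line status, typically starting with `Not affected`, `Mitigation:`, or `Vulnerable`. The sketch below classifies those strings. How the rule itself defines "partial coverage" is not stated in this document, so treating a `Mitigation:` line that still mentions "vulnerable" as partial is an assumption, as are the function names.

```python
from pathlib import Path
from typing import Dict

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def classify(status: str) -> str:
    """Map one sysfs status line to ok / partial / unmitigated / unknown."""
    text = status.strip()
    if text.startswith("Not affected"):
        return "ok"
    if text.startswith("Vulnerable"):
        return "unmitigated"
    if text.startswith("Mitigation"):
        # Assumption: a mitigation line that still says "vulnerable"
        # (for example "...; SMT vulnerable") counts as partial coverage.
        return "partial" if "vulnerable" in text.lower() else "ok"
    return "unknown"


def scan(directory: str = VULN_DIR) -> Dict[str, str]:
    """Classify every vulnerability file; empty dict if the directory is absent."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    return {
        entry.name: classify(entry.read_text())
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }
```

Any entry classified as `partial` or `unmitigated` would be a candidate for the kernel and microcode update this rule recommends.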
no_firewall | P1 | No host firewall active
No active firewall rules detected. All listening ports are reachable from any network the host is connected to, unless protected by network-level ACLs (VPC, cloud SG, on-prem ACL).
pending_security_updates | P2 | Pending security updates
Package manager reports one or more security updates available AND auto-updates are not configured. Manual patching is required; counterpart to unattended_upgrades_disabled which fires when the auto-update mechanism itself is missing.
server_unreachable | P1 | Server unreachable
Dashboard has not received a snapshot from this server in 2x the configured collection interval (default 10 minutes). Either the Crucible agent stopped reporting, the network is down, or the server is offline. Alert auto-resolves on next successful snapshot.
ssh_root_password | P1 | SSH allows root password login
sshd allows root login with a password, which can be brute-forced from the network. Switch to key-only root login (key-based operations keep working); ideally disable root SSH entirely and use a sudo-equipped operator account.
unattended_upgrades_disabled | P3 | Unattended security upgrades disabled
No automatic security update mechanism is configured. Patch cadence depends entirely on the operator; if patches are pending, the counterpart pending_security_updates rule will fire.