Alert rules

Glassmkr ships 65 alert rules tuned for bare-metal infrastructure. Each rule is listed here with its title, summary, priority, and category. Per-alert remediation guidance (the command to run, what to verify, rollback notes) is rendered in the dashboard on the alert detail page.

For AI agents: the machine-readable catalog is at /llms-full.txt.

Storage

disk_io_errors P1

Disk I/O errors

Kernel logged I/O errors against one or more block devices. Indicates failing storage hardware, a flaky cable/controller, or filesystem corruption. Investigate immediately to prevent data loss.

disk_latency_high P3

Disk latency high

A disk's average read or write latency exceeds the threshold under non-trivial IOPS load. Indicates a struggling drive, a saturated I/O queue, an in-progress RAID rebuild, or a noisy-neighbor workload.
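
As a rough illustration of the "latency above threshold, but only under non-trivial load" condition, the sketch below computes average latency from two cumulative counter samples, such as the operations-completed and milliseconds-spent fields of /proc/diskstats. The names `IoSample` and `latency_alert` and the threshold values are hypothetical; this is not Glassmkr's actual evaluation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoSample:
    """Cumulative counters for one direction (read or write) of one disk."""
    ops: int      # operations completed since boot
    busy_ms: int  # total milliseconds spent on those operations


def latency_alert(prev: IoSample, curr: IoSample, interval_s: float,
                  threshold_ms: float, min_iops: float) -> bool:
    """Return True when average latency exceeds threshold_ms under real load.

    threshold_ms and min_iops are illustrative parameters, not product defaults.
    """
    delta_ops = curr.ops - prev.ops
    delta_ms = curr.busy_ms - prev.busy_ms
    if interval_s <= 0 or delta_ops <= 0:
        return False  # no completed I/O in the window, nothing to average
    if delta_ops / interval_s < min_iops:
        return False  # too little traffic for the average to be meaningful
    return delta_ms / delta_ops > threshold_ms


if __name__ == "__main__":
    before = IoSample(ops=1000, busy_ms=5000)
    after = IoSample(ops=4000, busy_ms=95000)
    # 3000 ops in 10 s (300 IOPS) at an average of 30 ms per operation.
    print(latency_alert(before, after, interval_s=10.0,
                        threshold_ms=20.0, min_iops=50.0))
```

The IOPS floor keeps a nearly idle disk from alerting on one or two slow requests.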

nvme_critical_warning P1

NVMe critical warning byte non-zero

An NVMe device's Critical Warning byte (NVM Express §5.21) is non-zero. Per spec, any non-zero bit is a vendor-recommended immediate-action signal: temperature threshold exceeded, available spare below threshold, reliability degraded, read-only mode, volatile memory backup failed, or persistent memory region read-only.
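
To make the bit meanings concrete, here is a minimal sketch that decodes a Critical Warning byte into the conditions listed above. The bit positions follow the SMART / Health Information log page layout; the function name and label strings are illustrative and are not taken from Glassmkr.

```python
# Bit positions of the Critical Warning field in the NVMe SMART / Health
# Information log page. The label strings are illustrative only.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature threshold exceeded",
    2: "reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup failed",
    5: "persistent memory region read-only",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return a label for every bit that is set in the warning byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("critical warning is a single byte (0-255)")
    labels = []
    for bit in range(8):
        if value & (1 << bit):
            labels.append(CRITICAL_WARNING_BITS.get(bit, f"reserved bit {bit}"))
    return labels


if __name__ == "__main__":
    # 0x05 has bits 0 and 2 set.
    for label in decode_critical_warning(0x05):
        print(label)
```

An empty list corresponds to a zero byte, which is the only value that does not trigger this rule.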

nvme_wear_high P2

NVMe wear high

An NVMe drive's percentage-used indicator is at or above the configured threshold. Plan replacement before the drive enters read-only protection mode at 100%.

raid_degraded P1

RAID array degraded

One or more disks have failed in an mdadm software array or a hardware RAID controller (Dell PERC, LSI/Broadcom MegaRAID, HPE Smart Array, Adaptec). One more failure may cause data loss.
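
For the mdadm case only, a degraded array shows up in /proc/mdstat as a status such as `[3/2] [UU_]` (three members expected, two active). The sketch below parses that text; it assumes the standard mdstat layout, does not cover hardware RAID controllers, and `degraded_arrays` is a hypothetical helper rather than Glassmkr's collector.

```python
import re

# "md0 : active raid1 sdb1[1] sda1[0]" -> array name "md0"
ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# "... [3/2] [UU_]" -> expected members, active members, per-member flags
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> dict[str, tuple[int, int]]:
    """Map each degraded array name to (expected members, active members)."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
            current = None  # status consumed; wait for the next array header
    return degraded


if __name__ == "__main__":
    with open("/proc/mdstat", encoding="ascii") as handle:
        print(degraded_arrays(handle.read()))
```

Arrays without a member-count status line (for example RAID0) are skipped, since they have no redundancy to lose.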

smart_failing P1

Drive failing per SMART

SMART data indicates imminent drive failure (reallocated sectors, pending sectors, or aggregate health != PASSED). Back up data and replace the drive.
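
The three signals named above can be read from the JSON that `smartctl --json -a <device>` prints for ATA drives. The sketch below is an assumption-heavy illustration: it treats any non-zero raw value for attribute 5 (reallocated sectors) or 197 (pending sectors) as a failure signal, which may be stricter or looser than the thresholds Glassmkr applies, and `smart_failure_reasons` is a hypothetical name.

```python
import json

# ATA SMART attribute IDs watched here; the non-zero cutoff is an assumption.
WATCHED_ATTRIBUTE_IDS = {5: "reallocated sectors", 197: "pending sectors"}


def smart_failure_reasons(report: dict) -> list[str]:
    """Return the reasons a parsed smartctl JSON report looks unhealthy."""
    reasons = []
    if report.get("smart_status", {}).get("passed") is False:
        reasons.append("overall health is not PASSED")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        label = WATCHED_ATTRIBUTE_IDS.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if label and raw > 0:
            reasons.append(f"{label}: {raw}")
    return reasons


if __name__ == "__main__":
    # Trimmed-down example of the report structure, not real drive output.
    text = (
        '{"smart_status": {"passed": true}, '
        '"ata_smart_attributes": {"table": '
        '[{"id": 5, "raw": {"value": 12}}, {"id": 197, "raw": {"value": 0}}]}}'
    )
    print(smart_failure_reasons(json.loads(text)))
```

NVMe drives report health through a different JSON section, so this sketch applies to ATA/SATA devices only.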

ZFS

ZFS pool unhealthy

ZFS scrub found errors

ZFS SLOG vdev faulted

Filesystem

Disk fill projection imminent

Disk space high

File descriptor exhaustion

Filesystem remounted read-only

Inode usage high

LVM thin pool metadata near full

Memory & CPU

CPU usage high

CPU I/O wait high

CPU pressure stall sustained

I/O pressure sustained

Load average high

Memory pressure sustained

OOM killer recently fired

RAM usage high

Swap usage high

Network

Accept backlog or SYN flood

Bond slave interface down

Conntrack table near full

Interface errors high

Interface near saturation

LACP partner lost

Link speed mismatch

TCP listen-queue dropping connections

Kernel softnet dropping packets

TCP retransmit rate elevated

Hardware (BMC/IPMI)

CMOS battery low

CPU temperature high

ECC memory errors

IPMI fan failure

IPMI SEL critical events

Uncorrected machine check exception

Memory channels under-populated

PSU redundancy lost

GPU

GPU corrected-ECC level high

GPU vbios drift within host

GPU will not survive a reboot

GPU PCIe link degraded

GPU power-cap throttling

GPU thermal critical

GPU uncorrected ECC or DBE retired pages

GPU XID critical event

NVLink link down

Time & Services

Clock drift

NTP not synced

systemd service flapping

systemd service failed

systemd service killed by OOM

Unexpected reboot

Security & Patching

Reboot required (newer kernel installed)

Kernel vulnerability mitigations missing

No host firewall active

Pending security updates

Server unreachable

SSH config changed but not applied

SSH allows root password login

Unattended security upgrades disabled