# Glassmkr Documentation Corpus (machine-readable)

Generated: 2026-05-20

This file is the LLM-friendly corpus for Glassmkr's documentation.
It contains the public catalog metadata for all 60 alert rules.
For human-readable documentation, see https://glassmkr.com/docs and
https://glassmkr.com/trust. Per-alert remediation guidance is rendered
inside the dashboard at https://app.glassmkr.com.
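
Every rule entry below uses the same layout: a `### Rule:` heading followed by `URL:`, `Priority:`, `Title:`, and `Summary:` lines, grouped under `## Category:` headings. A minimal Python sketch of loading the catalog into dictionaries, assuming that layout holds for the whole file (the `parse_rules` helper is illustrative and not part of any Glassmkr tooling):

```python
import re

FIELD_RE = re.compile(r"^(URL|Priority|Title|Summary): (.*)$")

def parse_rules(text):
    """Return one dict per '### Rule:' entry found in the corpus text."""
    rules = []
    category = None
    current = None
    for line in text.splitlines():
        if line.startswith("## Category: "):
            category = line[len("## Category: "):].strip()
        elif line.startswith("### Rule: "):
            current = {"id": line[len("### Rule: "):].strip(), "category": category}
            rules.append(current)
        elif current is not None:
            match = FIELD_RE.match(line)
            if match:
                # Store fields under lower-case keys: url, priority, title, summary.
                current[match.group(1).lower()] = match.group(2).strip()
    return rules
```

Filtering the result by `priority` or `category` then gives, for example, the list of P1 rules in a single category.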

## Index

- Homepage: https://glassmkr.com
- Pricing: https://glassmkr.com/#pricing
- Documentation: https://glassmkr.com/docs
- Trust + security: https://glassmkr.com/trust
- Per-rule pages: https://glassmkr.com/docs/rules/<rule_id>
- Crucible agent source (MIT): https://github.com/glassmkr/crucible

## Glassmkr in one paragraph

Glassmkr is bare-metal infrastructure monitoring. 60 alert rules
tuned for real failure modes ship enabled out of the box, covering
storage, ZFS, filesystem, memory + CPU, network, hardware (BMC/IPMI),
NVIDIA GPU, time + services, and security + patching. The Crucible agent
is MIT-licensed and published on npm, runs as a non-root user, and ships
only metrics and alert state, never logs or command output. Pricing is $3
per node per month after 3 free nodes. The AI character Furnace runs on a
self-hosted Gemma 4 26B model on a single NVIDIA L4 GPU in Amsterdam and
uses no third-party LLM APIs. Data is stored in the EU by a GDPR-aligned
operator (Czech sole trader).

## Category: Storage

### Rule: disk_io_errors

URL: https://glassmkr.com/docs/rules/disk_io_errors
Priority: P1
Title: Disk I/O errors
Summary: Kernel logged I/O errors against one or more block devices. Indicates failing storage hardware, a flaky cable/controller, or filesystem corruption. Investigate immediately to prevent data loss.

---

### Rule: disk_latency_high

URL: https://glassmkr.com/docs/rules/disk_latency_high
Priority: P3
Title: Disk latency high
Summary: Disk's average read or write latency exceeds the threshold under non-trivial IOPS load. Indicates a struggling drive, saturated I/O queue, in-progress RAID rebuild, or noisy-neighbor workload.

---

### Rule: nvme_critical_warning

URL: https://glassmkr.com/docs/rules/nvme_critical_warning
Priority: P1
Title: NVMe critical warning byte non-zero
Summary: An NVMe device's Critical Warning byte (NVM Express §5.21) is non-zero. Per spec, any non-zero bit is a vendor-recommended immediate-action signal: temperature threshold exceeded, available spare below threshold, reliability degraded, read-only mode, volatile memory backup failed, or persistent memory region read-only.
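
As an illustration of what a non-zero Critical Warning byte means, the sketch below decodes it into the conditions listed above. The bit positions follow the SMART / Health Information log layout in the NVMe base specification; the function name is hypothetical and this is not Crucible's code.

```python
# Bit index -> condition, per the NVMe SMART / Health Information log page.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature threshold exceeded",
    2: "reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup failed",
    5: "persistent memory region read-only or unreliable",
}

def decode_critical_warning(value):
    """Return the descriptions of every bit set in the Critical Warning byte."""
    return [
        description
        for bit, description in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]
```

A value of `0x05`, for example, has bits 0 and 2 set, so it decodes to the spare and reliability conditions.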

---

### Rule: nvme_wear_high

URL: https://glassmkr.com/docs/rules/nvme_wear_high
Priority: P2
Title: NVMe wear high
Summary: NVMe drive's percentage-used indicator is at or above the configured threshold. Plan replacement before the drive enters read-only protection mode at 100%.

---

### Rule: raid_degraded

URL: https://glassmkr.com/docs/rules/raid_degraded
Priority: P1
Title: RAID array degraded
Summary: One or more disks have failed in an mdadm software array or a hardware RAID controller (Dell PERC, LSI/Broadcom MegaRAID, HPE Smart Array, Adaptec). One more failure may cause data loss.

---

### Rule: smart_failing

URL: https://glassmkr.com/docs/rules/smart_failing
Priority: P1
Title: Drive failing per SMART
Summary: SMART data indicates imminent drive failure (reallocated sectors, pending sectors, or aggregate health != PASSED). Back up data and replace the drive.

---

## Category: ZFS

### Rule: zfs_pool_unhealthy

URL: https://glassmkr.com/docs/rules/zfs_pool_unhealthy
Priority: P1
Title: ZFS pool unhealthy
Summary: ZFS pool in non-OPTIMAL state. Severity scales with vdev redundancy class (Crucible v0.10.4+): SUSPENDED pools and FAULTED top-level vdevs page critical; DEGRADED on single/raidz1/mirror_2way pages critical; raidz3/mirror_3way+ pages warning. L2ARC failures emit at info severity (no data loss). SLOG faults handled by zfs_slog_faulted.

---

### Rule: zfs_scrub_errors

URL: https://glassmkr.com/docs/rules/zfs_scrub_errors
Priority: P1
Title: ZFS scrub found errors
Summary: The ZFS pool's most recent scrub detected checksum or repair errors, or the pool has not been scrubbed in over 30 days. Errors suggest failing disks or silent corruption; a missing scrub is a preventive-maintenance gap.

---

### Rule: zfs_slog_faulted

URL: https://glassmkr.com/docs/rules/zfs_slog_faulted
Priority: P1
Title: ZFS SLOG vdev faulted
Summary: A ZIL log vdev (SLOG) is FAULTED or REMOVED. Sync-write durability for the pool is compromised until the SLOG is replaced.

---

## Category: Filesystem

### Rule: disk_fill_projection

URL: https://glassmkr.com/docs/rules/disk_fill_projection
Priority: P1
Title: Disk fill projection imminent
Summary: Linear projection on filesystem available_bytes indicates exhaustion within 24h (P1) or 7d (P2). Companion to disk_space_high (absolute %).
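
A minimal sketch of such a projection, assuming an ordinary least-squares fit over recent `(timestamp, available_bytes)` samples. The summary only says "linear projection", so the fitting method, the function names, and the sample window are assumptions; only the 24h and 7d bands come from the rule.

```python
def hours_until_exhausted(samples):
    """samples: list of (unix_seconds, available_bytes), oldest first.

    Returns projected hours until available_bytes reaches zero,
    or None if free space is not shrinking or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope in bytes per second.
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope >= 0:
        return None
    latest_t = samples[-1][0]
    fitted_now = mean_b + slope * (latest_t - mean_t)
    if fitted_now <= 0:
        return 0.0
    return fitted_now / -slope / 3600.0

def projection_priority(hours):
    """Map a projection to the bands named in the rule summary."""
    if hours is None:
        return None
    if hours <= 24:
        return "P1"
    if hours <= 24 * 7:
        return "P2"
    return None
```

Fitting over several samples rather than using only the first and last point makes the projection less sensitive to a single burst of writes.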

---

### Rule: disk_space_high

URL: https://glassmkr.com/docs/rules/disk_space_high
Priority: P2
Title: Disk space high
Summary: Filesystem usage at or above the configured threshold (default 85%). At 100% services that write to this filesystem will fail; at >=95% the buffer is hours-not-days.

---

### Rule: fd_exhaustion

URL: https://glassmkr.com/docs/rules/fd_exhaustion
Priority: P1
Title: File descriptor exhaustion
Summary: Host-wide file descriptor usage at or above 80% of fs.file-max OR a single process at or above 80% of its RLIMIT_NOFILE soft limit. The per-process path activates with Crucible v0.11.0+; older agents only emit on the host-wide path.
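
The host-wide path can be sketched from `/proc/sys/fs/file-nr`, whose three fields are allocated handles, unused handles, and the `fs.file-max` limit. The helper names below are hypothetical; the same 80% check is reused for the per-process case, where `used` would be the number of entries under `/proc/<pid>/fd` and `limit` the process's soft `RLIMIT_NOFILE`.

```python
def host_fd_usage(file_nr_text):
    """Parse the contents of /proc/sys/fs/file-nr and return allocated / max."""
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split())
    return allocated / maximum

def fd_exhausted(used, limit, threshold=0.80):
    """True when usage is at or above the threshold (80% in the rule summary)."""
    return limit > 0 and used / limit >= threshold
```

On a Linux host the input would typically come from `open("/proc/sys/fs/file-nr").read()`.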

---

### Rule: filesystem_readonly

URL: https://glassmkr.com/docs/rules/filesystem_readonly
Priority: P1
Title: Filesystem remounted read-only
Summary: Kernel forced a filesystem to read-only mode, usually due to I/O errors. Services that write to this mount will fail. Data already on the filesystem is readable; new writes are not.

---

### Rule: inode_high

URL: https://glassmkr.com/docs/rules/inode_high
Priority: P2
Title: Inode usage high
Summary: Filesystem has many small files; inode usage at or above 85% of the table. At 100%, file creation fails with ENOSPC even though `df -h` shows free space.
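
For illustration, inode usage can be read with `os.statvfs`, which exposes the total (`f_files`) and free (`f_ffree`) inode counts for a mount point. This is a sketch rather than the agent's implementation; the helper names are hypothetical and the 85% default mirrors the summary.

```python
import os

def inode_usage(st):
    """st: an os.statvfs()-style result. Returns the used fraction, or None."""
    if st.f_files == 0:
        # Some filesystems do not report a fixed inode table.
        return None
    return (st.f_files - st.f_ffree) / st.f_files

def inode_high(path, threshold=0.85):
    """True when the filesystem containing path is at or above the threshold."""
    usage = inode_usage(os.statvfs(path))
    return usage is not None and usage >= threshold
```

This is the same figure that `df -i` reports, which is why `df -h` can show free space while file creation still fails.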

---

### Rule: lvm_thinpool_metadata_high

URL: https://glassmkr.com/docs/rules/lvm_thinpool_metadata_high
Priority: P1
Title: LVM thin pool metadata near full
Summary: LVM thin pool metadata volume at or above 80%. Metadata exhaustion is silent and catastrophic: writes across all thin volumes in the pool start failing in unpredictable ways at 100%. Extend the metadata volume before it fills.

---

## Category: Memory & CPU

### Rule: cpu_high

URL: https://glassmkr.com/docs/rules/cpu_high
Priority: P2
Title: CPU usage high
Summary: Aggregate CPU utilization at or above 90% (idle at or below 10%). Critical at >=98% (idle <=2%). Either a runaway process or workload exceeding capacity.
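
A sketch of deriving aggregate utilization from two readings of the first `cpu` line in `/proc/stat`. It assumes utilization is everything except the `idle` counter, so I/O wait counts as busy time; the rule summary does not say which definition the agent uses, and the function names are hypothetical.

```python
def cpu_fractions(prev_line, curr_line):
    """Each argument is the aggregate 'cpu ...' line from /proc/stat.

    Returns idle and iowait as fractions of elapsed CPU time, or None.
    """
    # Fields 1-8: user nice system idle iowait irq softirq steal.
    prev = [int(x) for x in prev_line.split()[1:9]]
    curr = [int(x) for x in curr_line.split()[1:9]]
    delta = [c - p for c, p in zip(curr, prev)]
    total = sum(delta)
    if total <= 0:
        return None
    return {"idle": delta[3] / total, "iowait": delta[4] / total}

def cpu_severity(utilization):
    """Apply the 90% / 98% bands from the rule summary."""
    if utilization >= 0.98:
        return "critical"
    if utilization >= 0.90:
        return "warning"
    return None
```

The `iowait` fraction from the same calculation is the quantity that the `cpu_iowait_high` rule compares against its 20% threshold.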

---

### Rule: cpu_iowait_high

URL: https://glassmkr.com/docs/rules/cpu_iowait_high
Priority: P2
Title: CPU I/O wait high
Summary: CPU is spending 20%+ of its time waiting on disk I/O. Indicates storage bottleneck; either an overwhelmed device or runaway I/O from one process.

---

### Rule: cpu_pressure_high

URL: https://glassmkr.com/docs/rules/cpu_pressure_high
Priority: P2
Title: CPU pressure stall sustained
Summary: PSI reports CPU contention persistently above threshold. Aggregate signal across the host; subordinates cpu_high and load_high to this incident when it fires.
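
PSI figures are exposed by the kernel under `/proc/pressure/` (for example `/proc/pressure/cpu` and `/proc/pressure/memory`) as lines of `key=value` pairs. The sketch below parses that format; the choice of the `avg60` window and any threshold value are assumptions, since the summary does not state them.

```python
def parse_psi(text):
    """Parse a /proc/pressure/* file into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind = parts[0]  # 'some' or 'full'
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in parts[1:])
        }
    return result

def pressure_sustained(psi, threshold_pct):
    """Hypothetical check: 60-second 'some' average at or above a threshold."""
    return psi["some"]["avg60"] >= threshold_pct
```

The averages are percentages of wall-clock time during which at least one task (`some`) or all non-idle tasks (`full`) were stalled on the resource.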

---

### Rule: load_high

URL: https://glassmkr.com/docs/rules/load_high
Priority: P3
Title: Load average high
Summary: 1-minute load average exceeds 2x the CPU core count for several minutes. Usually indicates an I/O bottleneck (high D-state processes) rather than pure CPU saturation.

---

### Rule: mem_pressure_high

URL: https://glassmkr.com/docs/rules/mem_pressure_high
Priority: P1
Title: Memory pressure sustained
Summary: PSI reports memory contention with active paging or rapid MemAvailable decline. Real pressure signal, not used% noise.

---

### Rule: oom_kills

URL: https://glassmkr.com/docs/rules/oom_kills
Priority: P1
Title: OOM killer recently fired
Summary: Kernel out-of-memory killer terminated one or more processes in the recent window. Severe memory pressure or a memory leak. Killed services may be down.

---

### Rule: ram_high

URL: https://glassmkr.com/docs/rules/ram_high
Priority: P3
Title: RAM usage high
Summary: RAM usage on the host is high: warning at 90%, critical at 95%. Sustained high usage leads to swap thrashing and OOM kills.

---

### Rule: swap_high

URL: https://glassmkr.com/docs/rules/swap_high
Priority: P2
Title: Swap usage high
Summary: Swap usage at or above 50%. Swap I/O is 10-100x slower than RAM; sustained swap activity means thrashing. Critical band (>=80%) indicates imminent service degradation.

---

## Category: Network

### Rule: accept_backlog_or_syn_flood

URL: https://glassmkr.com/docs/rules/accept_backlog_or_syn_flood
Priority: P1
Title: Accept backlog or SYN flood
Summary: 2 or more of conntrack_exhaustion / listen_overflow / tcp_retrans_high are active on the same host within 5 minutes. Indicates accept-queue buildup or SYN flood.
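
The correlation can be sketched as a count of component rules seen active inside a five-minute window. This reads "within 5 minutes" as "each was last active no more than 300 seconds ago", which is an assumption; the data structure and function name are hypothetical.

```python
COMPONENT_RULES = ("conntrack_exhaustion", "listen_overflow", "tcp_retrans_high")
WINDOW_SECONDS = 5 * 60

def backlog_or_flood(last_active, now):
    """last_active maps rule id -> unix time the rule was last active on the host."""
    recent = [
        rule
        for rule in COMPONENT_RULES
        if rule in last_active and now - last_active[rule] <= WINDOW_SECONDS
    ]
    return len(recent) >= 2
```

Requiring two independent signals keeps a single noisy counter from paging on its own.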

---

### Rule: bond_slave_down

URL: https://glassmkr.com/docs/rules/bond_slave_down
Priority: P1
Title: Bond slave interface down
Summary: A slave NIC in a bonded interface has MII status down. The bond is running with reduced redundancy; one more failure breaks the bond entirely.

---

### Rule: conntrack_exhaustion

URL: https://glassmkr.com/docs/rules/conntrack_exhaustion
Priority: P1
Title: Conntrack table near full
Summary: Linux nf_conntrack table is at or above 75% capacity. At 100%, new connections are silently dropped; services appear to work but new clients can't connect. Critical band (>=90%) means dropping is imminent.
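
As a sketch, table usage can be computed from the kernel's `nf_conntrack_count` and `nf_conntrack_max` values under `/proc/sys/net/netfilter/`. The helper names are hypothetical; the 75% and 90% bands are the ones given in the summary.

```python
def read_int(path):
    """Read a single integer from a procfs file."""
    with open(path) as handle:
        return int(handle.read().strip())

def conntrack_severity(count, maximum):
    """Classify conntrack table usage against the 75% / 90% bands."""
    if maximum <= 0:
        return None
    usage = count / maximum
    if usage >= 0.90:
        return "critical"
    if usage >= 0.75:
        return "warning"
    return None
```

On a host with the conntrack module loaded, `count` and `maximum` would come from `read_int("/proc/sys/net/netfilter/nf_conntrack_count")` and `read_int("/proc/sys/net/netfilter/nf_conntrack_max")`.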

---

### Rule: interface_errors

URL: https://glassmkr.com/docs/rules/interface_errors
Priority: P2
Title: Interface errors high
Summary: Network interface reports elevated CRC / frame / carrier errors (physical layer) OR elevated packet drops (software ring/softirq layer). Tier red = critical (cable swap or kernel tuning urgent); tier yellow = warning.

---

### Rule: interface_saturation

URL: https://glassmkr.com/docs/rules/interface_saturation
Priority: P3
Title: Interface near saturation
Summary: Network interface utilization above the configured threshold (default 90% of negotiated speed). Plan a bandwidth upgrade or traffic shaping; growing queue depth often precedes out-of-memory failures in connection-handling daemons.

---

### Rule: lacp_partner_lost

URL: https://glassmkr.com/docs/rules/lacp_partner_lost
Priority: P1
Title: LACP partner lost
Summary: Bond MII layer reports up but the LACP partner is unsynchronized. The bond appears functional while traffic is dropped by the switch. Also emits a warning when the active aggregator has fewer ports than configured (redundancy reduced).

---

### Rule: link_speed_mismatch

URL: https://glassmkr.com/docs/rules/link_speed_mismatch
Priority: P2
Title: Link speed mismatch
Summary: Network interface negotiated a speed below 1 Gbps despite supporting higher. Almost always a physical-layer or autoneg issue; rarely a real config decision.

---

### Rule: listen_overflow

URL: https://glassmkr.com/docs/rules/listen_overflow
Priority: P2
Title: TCP listen-queue dropping connections
Summary: /proc/net/netstat TcpExt ListenOverflows or ListenDrops is incrementing; the kernel is dropping arriving connections at accept-queue level. Either the application can't accept() fast enough or net.core.somaxconn is too small for the offered load.
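
`/proc/net/netstat` stores each group as a header line of counter names followed by a line of values with the same prefix. A minimal sketch of parsing it and checking whether either counter grew between two snapshots (function names are hypothetical):

```python
def parse_netstat(text):
    """Parse /proc/net/netstat into {'TcpExt': {name: value, ...}, ...}."""
    stats = {}
    lines = text.splitlines()
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, _, names = header.partition(":")
        value_prefix, _, numbers = values.partition(":")
        if prefix != value_prefix:
            continue
        stats[prefix] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return stats

def listen_queue_dropping(prev, curr):
    """True if ListenOverflows or ListenDrops increased between snapshots."""
    return any(
        curr["TcpExt"][name] > prev["TcpExt"][name]
        for name in ("ListenOverflows", "ListenDrops")
    )
```

The counters are cumulative since boot, so only the difference between two snapshots is meaningful.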

---

### Rule: softnet_drops

URL: https://glassmkr.com/docs/rules/softnet_drops
Priority: P1
Title: Kernel softnet dropping packets
Summary: /proc/net/softnet_stat reports kernel input-queue drops at sustained rate (>1 pkt/s). The NET_RX softirq backlog is filling faster than the kernel can process; packets are being silently discarded. Often correlated with conntrack pressure or CPU pressure.
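
In `/proc/net/softnet_stat` each row describes one CPU and every column is hexadecimal; the second column is the dropped-packet counter. The sketch below sums it across CPUs and converts two snapshots into a rate for comparison with the 1 pkt/s threshold (helper names are hypothetical):

```python
def softnet_dropped(text):
    """Sum the second (dropped) hex column across all per-CPU rows."""
    return sum(
        int(line.split()[1], 16)
        for line in text.splitlines()
        if line.strip()
    )

def drop_rate(prev_total, curr_total, interval_seconds):
    """Dropped packets per second between two snapshots."""
    return (curr_total - prev_total) / interval_seconds
```

A rate above 1.0 sustained across several snapshots corresponds to the condition described in the summary.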

---

### Rule: tcp_retrans_high

URL: https://glassmkr.com/docs/rules/tcp_retrans_high
Priority: P2
Title: TCP retransmit rate elevated
Summary: TCP retransmit ratio (retransmits / segments sent) over the most recent snapshot interval exceeds 2%. Above 1% commonly impacts performance; above 5% significantly degrades throughput. Indicates network reliability or remote-peer problems.
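
The ratio can be sketched from the `OutSegs` and `RetransSegs` counters on the `Tcp` line of `/proc/net/snmp`, taking the difference between two snapshots. Whether the agent reads these exact counters is an assumption, and the function name is hypothetical.

```python
RETRANS_THRESHOLD = 0.02  # 2%, from the rule summary

def retrans_ratio(prev, curr):
    """prev/curr: dicts holding cumulative 'OutSegs' and 'RetransSegs' counters."""
    sent = curr["OutSegs"] - prev["OutSegs"]
    retrans = curr["RetransSegs"] - prev["RetransSegs"]
    if sent <= 0:
        return None  # no traffic in the interval, ratio undefined
    return retrans / sent
```

Returning `None` for an idle interval avoids reporting a misleading ratio when only a handful of segments were sent.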

---

## Category: Hardware (BMC/IPMI)

### Rule: cpu_temperature_high

URL: https://glassmkr.com/docs/rules/cpu_temperature_high
Priority: P1
Title: CPU temperature high
Summary: CPU thermal reading at or above the warning threshold (default 80°C; critical 90°C). At critical, thermal throttling kicks in and silicon damage risk climbs.

---

### Rule: ecc_errors

URL: https://glassmkr.com/docs/rules/ecc_errors
Priority: P1
Title: ECC memory errors
Summary: Memory controller reported one or more uncorrectable ECC errors. Data corruption has occurred; the DIMM is failing. Replace immediately.

---

### Rule: ipmi_fan_failure

URL: https://glassmkr.com/docs/rules/ipmi_fan_failure
Priority: P1
Title: IPMI fan failure
Summary: BMC reports one or more chassis fans in critical state or at 0 RPM. Cooling capacity is reduced; CPU temperatures may climb and trigger thermal throttling or emergency shutdown.

---

### Rule: ipmi_sel_critical

URL: https://glassmkr.com/docs/rules/ipmi_sel_critical
Priority: P1
Title: IPMI SEL critical events
Summary: BMC System Event Log contains one or more critical-severity asserted events in the last N days (default 30). Critical events indicate real hardware faults: DIMM, PSU, fan, voltage, or temperature.

---

### Rule: mce_uncorrected

URL: https://glassmkr.com/docs/rules/mce_uncorrected
Priority: P0
Title: Uncorrected machine check exception
Summary: EDAC reports an uncorrected memory error. Replace the affected DIMM.

---

### Rule: psu_redundancy_loss

URL: https://glassmkr.com/docs/rules/psu_redundancy_loss
Priority: P1
Title: PSU redundancy lost
Summary: One or more PSUs are in fault, absent, or degraded state. Single power failure now risks full server outage. Dell BMCs report this via an aggregate sensor; other vendors via per-PSU sensors.

---

## Category: GPU

### Rule: gpu_corrected_ecc_storm

URL: https://glassmkr.com/docs/rules/gpu_corrected_ecc_storm
Priority: P3
Title: GPU corrected-ECC level high
Summary: GPU corrected-ECC counter is high or single-bit retired pages are non-zero. SBE storms typically precede DBE faults; this rule gives operators time to plan preventive replacement before uncorrected ECC fires.

---

### Rule: gpu_driver_or_firmware_drift

URL: https://glassmkr.com/docs/rules/gpu_driver_or_firmware_drift
Priority: P3
Title: GPU vbios drift within host
Summary: Multiple GPUs of the same model on this host report different vbios versions. Within-host vbios drift typically indicates a failed firmware update or mixed-batch installation.

---

### Rule: gpu_pcie_link_degraded

URL: https://glassmkr.com/docs/rules/gpu_pcie_link_degraded
Priority: P2
Title: GPU PCIe link degraded
Summary: GPU's current PCIe gen or width is below the GPU's advertised maximum. Host-to-GPU bandwidth is capped below the GPU's capability; meaningful for large-model loading and PCIe-attached weights, catastrophic for training-style workloads.

---

### Rule: gpu_power_cap_throttling

URL: https://glassmkr.com/docs/rules/gpu_power_cap_throttling
Priority: P2
Title: GPU power-cap throttling
Summary: GPU is being throttled by software power cap (sw_power_cap) or hardware power brake (hw_power_brake). May be intentional (operator-configured limit) or unexpected (PSU sizing, chassis power policy).

---

### Rule: gpu_thermal_critical

URL: https://glassmkr.com/docs/rules/gpu_thermal_critical
Priority: P1
Title: GPU thermal critical
Summary: GPU die temperature at or above HW slowdown threshold, or kernel reports thermal throttle engaged. Sustained operation at thermal limits accelerates wear and reduces inference throughput. Boot grace 300s for post-boot sensor stabilisation.

---

### Rule: gpu_uncorrected_ecc

URL: https://glassmkr.com/docs/rules/gpu_uncorrected_ecc
Priority: P0
Title: GPU uncorrected ECC or DBE retired pages
Summary: GPU reports uncorrected ECC errors, double-bit ECC retired pages, or pending retirements. Uncorrected ECC means error correction could not recover; in-flight data may have been corrupted. Pending retirements require a reboot.

---

### Rule: gpu_xid_critical

URL: https://glassmkr.com/docs/rules/gpu_xid_critical
Priority: P0
Title: GPU XID critical event
Summary: NVIDIA XID error classified as critical per NVIDIA's published XID severity table. Hardware-witnessed fault on the GPU; data may be at risk and the workload likely degraded.

---

### Rule: nvlink_link_down

URL: https://glassmkr.com/docs/rules/nvlink_link_down
Priority: P1
Title: NVLink link down
Summary: An NVLink on a multi-GPU host is in the down state. Multi-GPU bandwidth is reduced; if the GPU participates in NCCL collectives the entire training/inference job's latency degrades.

---

## Category: Time & Services

### Rule: clock_drift

URL: https://glassmkr.com/docs/rules/clock_drift
Priority: P2
Title: Clock drift
Summary: System clock is at least 5 seconds off from upstream NTP. Critical at >=60s; TLS validation, log correlation, database replication, and cron all break.

---

### Rule: ntp_not_synced

URL: https://glassmkr.com/docs/rules/ntp_not_synced
Priority: P2
Title: NTP not synced
Summary: Either the kernel clock is unsynchronized (critical; drift in progress) OR the NTP daemon has stopped while the clock is still synced (warning; drift will start once kernel state expires).

---

### Rule: service_flapping

URL: https://glassmkr.com/docs/rules/service_flapping
Priority: P1
Title: systemd service flapping
Summary: A systemd unit has hit its start-limit (systemd stopped restarting it) OR has restarted 5+ times. A service that can't stabilise consumes resources without delivering value; investigate before bumping restart limits.

---

### Rule: systemd_service_failed

URL: https://glassmkr.com/docs/rules/systemd_service_failed
Priority: P1
Title: systemd service failed
Summary: One or more systemd units are in the failed state. The service is not running; dependent functionality is offline. Crucible v0.9.2+ also ships the last 5 journal lines per failed unit in evidence so root cause is one click away.

---

### Rule: systemd_service_oom_killed

URL: https://glassmkr.com/docs/rules/systemd_service_oom_killed
Priority: P1
Title: systemd service killed by OOM
Summary: systemd reports a failed unit with Result=oom-kill. The kernel OOM killer terminated the service; pair with the host-level oom_kills emission to find the underlying memory pressure source.

---

### Rule: unexpected_reboot

URL: https://glassmkr.com/docs/rules/unexpected_reboot
Priority: P1
Title: Unexpected reboot
Summary: Server rebooted without an operator-acknowledged planned reboot. Possible causes: kernel panic, hardware fault (PSU brownout, thermal shutdown, watchdog), power outage, or remote reboot via BMC.

---

## Category: Security & Patching

### Rule: kernel_needs_reboot

URL: https://glassmkr.com/docs/rules/kernel_needs_reboot
Priority: P2
Title: Reboot required (newer kernel installed)
Summary: A newer kernel package is installed on disk but the running kernel is older. Security patches in the new kernel are not active until reboot.

---

### Rule: kernel_vulnerabilities

URL: https://glassmkr.com/docs/rules/kernel_vulnerabilities
Priority: P2
Title: Kernel vulnerability mitigations missing
Summary: One or more CPU vulnerability mitigations (Spectre, Meltdown, MDS, etc.) report unmitigated or partial coverage in /sys/devices/system/cpu/vulnerabilities/. Update kernel + CPU microcode to apply.
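
Each file in that sysfs directory holds a one-line status that starts with `Not affected`, `Mitigation:`, or `Vulnerable`. A minimal sketch that flags the fully unmitigated entries; how the rule detects partial coverage is not stated in the summary, so that case is left out, and the function names are hypothetical.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

def read_vulnerabilities(directory=VULN_DIR):
    """Return {vulnerability_name: status_line} for every file in the directory."""
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }

def unmitigated(statuses):
    """Names whose status line begins with 'Vulnerable'."""
    return sorted(
        name for name, status in statuses.items() if status.startswith("Vulnerable")
    )
```

Calling `unmitigated(read_vulnerabilities())` on a Linux host lists the entries that still need a kernel or microcode update.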

---

### Rule: no_firewall

URL: https://glassmkr.com/docs/rules/no_firewall
Priority: P1
Title: No host firewall active
Summary: No active firewall rules detected. All listening ports are reachable from any network the host is connected to, unless protected by network-level ACLs (VPC, cloud SG, on-prem ACL).

---

### Rule: pending_security_updates

URL: https://glassmkr.com/docs/rules/pending_security_updates
Priority: P2
Title: Pending security updates
Summary: Package manager reports one or more security updates available AND auto-updates are not configured. Manual patching is required; counterpart to unattended_upgrades_disabled which fires when the auto-update mechanism itself is missing.

---

### Rule: server_unreachable

URL: https://glassmkr.com/docs/rules/server_unreachable
Priority: P1
Title: Server unreachable
Summary: Dashboard has not received a snapshot from this server in 2x the configured collection interval (default 10 minutes). Either the Crucible agent stopped reporting, the network is down, or the server is offline. Alert auto-resolves on next successful snapshot.

---

### Rule: ssh_root_password

URL: https://glassmkr.com/docs/rules/ssh_root_password
Priority: P1
Title: SSH allows root password login
Summary: sshd allows root login with a password, which can be brute-forced over the network. Switch to key-only root login (key-based operations keep working); ideally disable root SSH entirely and use a sudo-equipped operator account.
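
One way to check the effective setting is to inspect the output of `sshd -T`, which prints the resolved configuration as lower-case keywords followed by their values (it usually needs root). The sketch below is a simplification that looks only at `permitrootlogin` and `passwordauthentication`; it ignores `Match` blocks and keyboard-interactive authentication, and the function name is hypothetical.

```python
def root_password_login_allowed(sshd_t_output):
    """sshd_t_output: text printed by `sshd -T`."""
    settings = {}
    for line in sshd_t_output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            settings[key.lower()] = value.strip().lower()
    return (
        settings.get("permitrootlogin") == "yes"
        and settings.get("passwordauthentication") == "yes"
    )
```

Setting `PermitRootLogin prohibit-password` in `sshd_config` keeps key-based root access while making this check return `False`.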

---

### Rule: unattended_upgrades_disabled

URL: https://glassmkr.com/docs/rules/unattended_upgrades_disabled
Priority: P3
Title: Unattended security upgrades disabled
Summary: No automatic security update mechanism is configured. Patch cadence depends entirely on the operator; if patches are pending, the counterpart pending_security_updates rule will fire.

---