# Glassmkr Documentation Corpus (machine-readable)

Generated: 2026-05-20

This file is the LLM-friendly corpus for Glassmkr's documentation. It contains the public catalog metadata for all 60 alert rules. For human-readable documentation, see https://glassmkr.com/docs and https://glassmkr.com/trust. Per-alert remediation guidance is rendered inside the dashboard at https://app.glassmkr.com.

## Index

- Homepage: https://glassmkr.com
- Pricing: https://glassmkr.com/#pricing
- Documentation: https://glassmkr.com/docs
- Trust + security: https://glassmkr.com/trust
- Per-rule pages: https://glassmkr.com/docs/rules/
- Crucible agent source (MIT): https://github.com/glassmkr/crucible

## Glassmkr in one paragraph

Glassmkr is bare-metal infrastructure monitoring. 60 alert rules tuned for real failure modes ship enabled out of the box, covering storage, ZFS, filesystem, memory + CPU, network, hardware (BMC/IPMI), NVIDIA GPU, time + services, and security + patching. The Crucible agent is MIT-licensed and published on npm, runs as a non-root user, and ships only metrics and alert state, never logs or command output. Pricing is $3 per node per month after 3 free nodes. The AI character Furnace runs on self-hosted Gemma 4 26B on a single NVIDIA L4 GPU in Amsterdam; no third-party LLM APIs are used. Data is stored in the EU; the operator is a GDPR-aligned Czech sole trader.

## Category: Storage

### Rule: disk_io_errors

URL: https://glassmkr.com/docs/rules/disk_io_errors
Priority: P1
Title: Disk I/O errors
Summary: Kernel logged I/O errors against one or more block devices. Indicates failing storage hardware, a flaky cable/controller, or filesystem corruption. Investigate immediately to prevent data loss.

---

### Rule: disk_latency_high

URL: https://glassmkr.com/docs/rules/disk_latency_high
Priority: P3
Title: Disk latency high
Summary: Disk's average read or write latency exceeds the threshold under non-trivial IOPS load.
Indicates a struggling drive, saturated I/O queue, in-progress RAID rebuild, or noisy-neighbor workload.

---

### Rule: nvme_critical_warning

URL: https://glassmkr.com/docs/rules/nvme_critical_warning
Priority: P1
Title: NVMe critical warning byte non-zero
Summary: An NVMe device's Critical Warning byte (NVM Express §5.21) is non-zero. Per spec, any non-zero bit is a vendor-recommended immediate-action signal: temperature threshold exceeded, available spare below threshold, reliability degraded, read-only mode, volatile memory backup failed, or persistent memory region read-only.

---

### Rule: nvme_wear_high

URL: https://glassmkr.com/docs/rules/nvme_wear_high
Priority: P2
Title: NVMe wear high
Summary: NVMe drive's percentage-used indicator is at or above the configured threshold. Plan replacement before the drive enters read-only protection mode at 100%.

---

### Rule: raid_degraded

URL: https://glassmkr.com/docs/rules/raid_degraded
Priority: P1
Title: RAID array degraded
Summary: One or more disks have failed in an mdadm software array or a hardware RAID controller (Dell PERC, LSI/Broadcom MegaRAID, HPE Smart Array, Adaptec). One more failure may cause data loss.

---

### Rule: smart_failing

URL: https://glassmkr.com/docs/rules/smart_failing
Priority: P1
Title: Drive failing per SMART
Summary: SMART data indicates imminent drive failure (reallocated sectors, pending sectors, or aggregate health != PASSED). Back up data and replace the drive.

---

## Category: ZFS

### Rule: zfs_pool_unhealthy

URL: https://glassmkr.com/docs/rules/zfs_pool_unhealthy
Priority: P1
Title: ZFS pool unhealthy
Summary: ZFS pool in non-OPTIMAL state. Severity scales with vdev redundancy class (Crucible v0.10.4+): SUSPENDED pools and FAULTED top-level vdevs page critical; DEGRADED on single/raidz1/mirror_2way pages critical; raidz3/mirror_3way+ pages warning. L2ARC failures emit at info severity (no data loss). SLOG faults handled by zfs_slog_faulted.
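
The severity scaling in the zfs_pool_unhealthy summary can be read as a small decision table. The sketch below is illustrative only and is not Crucible's implementation: the function name, its parameters, and the choice of which vdev states count as an L2ARC "failure" are assumptions. Redundancy classes the summary does not mention (for example raidz2) return `None` rather than a guessed severity.

```python
from typing import Optional

# Redundancy classes named in the zfs_pool_unhealthy summary.
CRITICAL_WHEN_DEGRADED = {"single", "raidz1", "mirror_2way"}
WARNING_WHEN_DEGRADED = {"raidz3", "mirror_3way+"}


def zfs_severity(pool_state: str, vdev_state: str, redundancy: str,
                 is_l2arc: bool = False) -> Optional[str]:
    """Map pool/vdev state to 'critical', 'warning', 'info', or None.

    Hypothetical helper: None means the summary does not specify a severity.
    """
    if pool_state == "SUSPENDED":
        return "critical"
    if is_l2arc:
        # Assumption: FAULTED or DEGRADED is what counts as an L2ARC failure.
        return "info" if vdev_state in {"FAULTED", "DEGRADED"} else None
    if vdev_state == "FAULTED":
        return "critical"  # FAULTED top-level vdev
    if vdev_state == "DEGRADED":
        if redundancy in CRITICAL_WHEN_DEGRADED:
            return "critical"
        if redundancy in WARNING_WHEN_DEGRADED:
            return "warning"
    return None  # not covered by the summary, so not guessed here
```

For example, `zfs_severity("DEGRADED", "DEGRADED", "raidz1")` yields `"critical"`, while the same state on `"raidz3"` yields `"warning"`.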

---

### Rule: zfs_scrub_errors

URL: https://glassmkr.com/docs/rules/zfs_scrub_errors
Priority: P1
Title: ZFS scrub found errors
Summary: ZFS pool's most recent scrub detected checksum or repair errors, or the pool has not been scrubbed in over 30 days. Errors suggest failing disks or silent corruption; a missing scrub is a preventive-maintenance gap.

---

### Rule: zfs_slog_faulted

URL: https://glassmkr.com/docs/rules/zfs_slog_faulted
Priority: P1
Title: ZFS SLOG vdev faulted
Summary: A ZIL log vdev (SLOG) is FAULTED or REMOVED. Sync-write durability for the pool is compromised until the SLOG is replaced.

---

## Category: Filesystem

### Rule: disk_fill_projection

URL: https://glassmkr.com/docs/rules/disk_fill_projection
Priority: P1
Title: Disk fill projection imminent
Summary: Linear projection on filesystem available_bytes indicates exhaustion within 24h (P1) or 7d (P2). Companion to disk_space_high (absolute %).

---

### Rule: disk_space_high

URL: https://glassmkr.com/docs/rules/disk_space_high
Priority: P2
Title: Disk space high
Summary: Filesystem usage at or above the configured threshold (default 85%). At 100% services that write to this filesystem will fail; at >=95% the buffer is hours-not-days.

---

### Rule: fd_exhaustion

URL: https://glassmkr.com/docs/rules/fd_exhaustion
Priority: P1
Title: File descriptor exhaustion
Summary: Host-wide file descriptor usage at or above 80% of fs.file-max OR a single process at or above 80% of its RLIMIT_NOFILE soft limit. The per-process path activates with Crucible v0.11.0+; older agents only emit on the host-wide path.

---

### Rule: filesystem_readonly

URL: https://glassmkr.com/docs/rules/filesystem_readonly
Priority: P1
Title: Filesystem remounted read-only
Summary: Kernel forced a filesystem to read-only mode, usually due to I/O errors. Services that write to this mount will fail. Data already on the filesystem is readable; new writes are not.
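
The disk_fill_projection rule earlier in this category describes a linear projection on available_bytes. One way such a projection could work is sketched below. It assumes an ordinary least-squares fit over `(unix_seconds, available_bytes)` samples; the source does not say which fitting method, window, or boundary handling Glassmkr uses, and both function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Fit a least-squares line through (unix_seconds, available_bytes).

    Returns the projected hours from the last sample until available_bytes
    reaches zero, or None when free space is not shrinking or the samples
    cannot support a fit.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_b - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line crosses zero
    return max(0.0, (t_zero - samples[-1][0]) / 3600.0)


def fill_priority(hours: Optional[float]) -> Optional[str]:
    """Apply the 24h (P1) and 7d (P2) windows from the rule summary."""
    if hours is None:
        return None
    if hours <= 24:  # assumption: boundaries are inclusive
        return "P1"
    if hours <= 24 * 7:
        return "P2"
    return None
```

A filesystem losing 10 GB per hour with 80 GB left projects to roughly 8 hours, which falls inside the P1 window.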

---

### Rule: inode_high

URL: https://glassmkr.com/docs/rules/inode_high
Priority: P2
Title: Inode usage high
Summary: Filesystem has many small files; inode usage at or above 85% of the table. At 100%, file creation fails with ENOSPC even though `df -h` shows free space.

---

### Rule: lvm_thinpool_metadata_high

URL: https://glassmkr.com/docs/rules/lvm_thinpool_metadata_high
Priority: P1
Title: LVM thin pool metadata near full
Summary: LVM thin pool metadata volume at or above 80%. Metadata exhaustion is silent and catastrophic: writes across all thin volumes in the pool start failing in unpredictable ways at 100%. Extend the metadata volume before it fills.

---

## Category: Memory & CPU

### Rule: cpu_high

URL: https://glassmkr.com/docs/rules/cpu_high
Priority: P2
Title: CPU usage high
Summary: Aggregate CPU utilization at or above 90% (idle below 10%). Critical at >=98% (idle <2%). Either a runaway process or workload exceeding capacity.

---

### Rule: cpu_iowait_high

URL: https://glassmkr.com/docs/rules/cpu_iowait_high
Priority: P2
Title: CPU I/O wait high
Summary: CPU is spending 20%+ of its time waiting on disk I/O. Indicates a storage bottleneck: either an overwhelmed device or runaway I/O from one process.

---

### Rule: cpu_pressure_high

URL: https://glassmkr.com/docs/rules/cpu_pressure_high
Priority: P2
Title: CPU pressure stall sustained
Summary: PSI reports CPU contention persistently above threshold. Aggregate signal across the host; subordinates cpu_high and load_high to this incident when it fires.

---

### Rule: load_high

URL: https://glassmkr.com/docs/rules/load_high
Priority: P3
Title: Load average high
Summary: 1-minute load average exceeds 2x the CPU core count for several minutes. Usually indicates an I/O bottleneck (high D-state processes) rather than pure CPU saturation.
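
The load_high condition compares the 1-minute load average against twice the core count over a sustained period. A minimal sketch of that check follows. It assumes the 1-minute value is read from the first field of `/proc/loadavg` and that "several minutes" means a fixed number of consecutive samples; `min_samples=3` is a placeholder and not a documented Glassmkr value, and both function names are hypothetical.

```python
from typing import Sequence


def parse_loadavg_1m(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])


def load_high(samples_1m: Sequence[float], cpu_cores: int,
              multiplier: float = 2.0, min_samples: int = 3) -> bool:
    """True when the last min_samples readings all exceed multiplier * cores.

    min_samples is an illustrative stand-in for "several minutes".
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    if len(samples_1m) < min_samples:
        return False  # not enough history to call it sustained
    threshold = multiplier * cpu_cores
    return all(v > threshold for v in samples_1m[-min_samples:])
```

On an 8-core host the threshold is 16, so readings of 17.0, 18.5, and 16.2 trigger the condition, while a single dip to 15.9 clears it.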

---

### Rule: mem_pressure_high

URL: https://glassmkr.com/docs/rules/mem_pressure_high
Priority: P1
Title: Memory pressure sustained
Summary: PSI reports memory contention with active paging or rapid MemAvailable decline. Real pressure signal, not used% noise.

---

### Rule: oom_kills

URL: https://glassmkr.com/docs/rules/oom_kills
Priority: P1
Title: OOM killer recently fired
Summary: Kernel out-of-memory killer terminated one or more processes in the recent window. Severe memory pressure or a memory leak. Killed services may be down.

---

### Rule: ram_high

URL: https://glassmkr.com/docs/rules/ram_high
Priority: P3
Title: RAM usage high
Summary: Memory pressure on the host. Warning at 90%, critical at 95%. Sustained pressure leads to swap thrashing and OOM kills.

---

### Rule: swap_high

URL: https://glassmkr.com/docs/rules/swap_high
Priority: P2
Title: Swap usage high
Summary: Swap usage at or above 50%. Swap I/O is 10-100x slower than RAM; sustained swap = thrashing. Critical band (>=80%) indicates imminent service degradation.

---

## Category: Network

### Rule: accept_backlog_or_syn_flood

URL: https://glassmkr.com/docs/rules/accept_backlog_or_syn_flood
Priority: P1
Title: Accept backlog or SYN flood
Summary: 2 or more of conntrack_exhaustion / listen_overflow / tcp_retrans_high are active on the same host within 5 minutes. Indicates accept-queue buildup or SYN flood.

---

### Rule: bond_slave_down

URL: https://glassmkr.com/docs/rules/bond_slave_down
Priority: P1
Title: Bond slave interface down
Summary: A slave NIC in a bonded interface has MII status down. The bond is running with reduced redundancy; one more failure breaks the bond entirely.

---

### Rule: conntrack_exhaustion

URL: https://glassmkr.com/docs/rules/conntrack_exhaustion
Priority: P1
Title: Conntrack table near full
Summary: Linux nf_conntrack table is at or above 75% capacity. At 100%, new connections are silently dropped; services appear to work but new clients can't connect.
Critical band (>=90%) means dropping is imminent.

---

### Rule: interface_errors

URL: https://glassmkr.com/docs/rules/interface_errors
Priority: P2
Title: Interface errors high
Summary: Network interface reports elevated CRC / frame / carrier errors (physical layer) OR elevated packet drops (software ring/softirq layer). Tier red = critical (cable swap or kernel tuning urgent); tier yellow = warning.

---

### Rule: interface_saturation

URL: https://glassmkr.com/docs/rules/interface_saturation
Priority: P3
Title: Interface near saturation
Summary: Network interface utilization above the configured threshold (default 90% of negotiated speed). Plan bandwidth upgrade or traffic shaping; queue depth growth predicts the next OOMing connection-handling daemon.

---

### Rule: lacp_partner_lost

URL: https://glassmkr.com/docs/rules/lacp_partner_lost
Priority: P1
Title: LACP partner lost
Summary: Bond MII layer reports up but the LACP partner is unsynchronized. The bond appears functional while traffic is dropped by the switch. Also emits a warning when the active aggregator has fewer ports than configured (redundancy reduced).

---

### Rule: link_speed_mismatch

URL: https://glassmkr.com/docs/rules/link_speed_mismatch
Priority: P2
Title: Link speed mismatch
Summary: Network interface negotiated a speed below 1 Gbps despite supporting higher. Almost always a physical-layer or autoneg issue; rarely a real config decision.

---

### Rule: listen_overflow

URL: https://glassmkr.com/docs/rules/listen_overflow
Priority: P2
Title: TCP listen-queue dropping connections
Summary: /proc/net/netstat TcpExt ListenOverflows or ListenDrops is incrementing; the kernel is dropping arriving connections at accept-queue level. Either the application can't accept() fast enough or net.core.somaxconn is too small for the offered load.
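
To make the listen_overflow signal concrete, the sketch below parses the `TcpExt` header/value line pair from `/proc/net/netstat` text and checks whether either counter grew between two snapshots. The function names are hypothetical, and the comparison is a simplified illustration that ignores counter resets; it is not a description of how Crucible samples these counters.

```python
from typing import Dict


def parse_tcpext(netstat_text: str) -> Dict[str, int]:
    """Parse the TcpExt counters from /proc/net/netstat contents.

    The file holds pairs of lines per protocol: one with counter names,
    followed by one with the matching values.
    """
    rows = [ln.split() for ln in netstat_text.splitlines()
            if ln.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def listen_queue_dropping(prev: Dict[str, int], curr: Dict[str, int]) -> bool:
    """True when ListenOverflows or ListenDrops grew between snapshots."""
    return any(curr.get(key, 0) > prev.get(key, 0)
               for key in ("ListenOverflows", "ListenDrops"))
```

Feeding two consecutive reads of the file through `parse_tcpext` and then `listen_queue_dropping` gives a yes/no answer for the interval between them.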

---

### Rule: softnet_drops

URL: https://glassmkr.com/docs/rules/softnet_drops
Priority: P1
Title: Kernel softnet dropping packets
Summary: /proc/net/softnet_stat reports kernel input-queue drops at sustained rate (>1 pkt/s). The NET_RX softirq backlog is filling faster than the kernel can process; packets are being silently discarded. Often correlated with conntrack pressure or CPU pressure.

---

### Rule: tcp_retrans_high

URL: https://glassmkr.com/docs/rules/tcp_retrans_high
Priority: P2
Title: TCP retransmit rate elevated
Summary: TCP retransmit ratio (retransmits / segments sent) over the most recent snapshot interval exceeds 2%. Above 1% commonly impacts performance; above 5% significantly degrades throughput. Indicates network reliability or remote-peer problems.

---

## Category: Hardware (BMC/IPMI)

### Rule: cpu_temperature_high

URL: https://glassmkr.com/docs/rules/cpu_temperature_high
Priority: P1
Title: CPU temperature high
Summary: CPU thermal reading at or above the warning threshold (default 80°C; critical 90°C). At critical, thermal throttling kicks in and silicon damage risk climbs.

---

### Rule: ecc_errors

URL: https://glassmkr.com/docs/rules/ecc_errors
Priority: P1
Title: ECC memory errors
Summary: Memory controller reported one or more uncorrectable ECC errors. Data corruption has occurred; the DIMM is failing. Replace immediately.

---

### Rule: ipmi_fan_failure

URL: https://glassmkr.com/docs/rules/ipmi_fan_failure
Priority: P1
Title: IPMI fan failure
Summary: BMC reports one or more chassis fans in critical state or at 0 RPM. Cooling capacity is reduced; CPU temperatures may climb and trigger thermal throttling or emergency shutdown.

---

### Rule: ipmi_sel_critical

URL: https://glassmkr.com/docs/rules/ipmi_sel_critical
Priority: P1
Title: IPMI SEL critical events
Summary: BMC System Event Log contains one or more critical-severity asserted events in the last N days (default 30).
Critical events indicate real hardware faults: DIMM, PSU, fan, voltage, or temperature.

---

### Rule: mce_uncorrected

URL: https://glassmkr.com/docs/rules/mce_uncorrected
Priority: P0
Title: Uncorrected machine check exception
Summary: EDAC reports an uncorrected memory error. Replace the affected DIMM.

---

### Rule: psu_redundancy_loss

URL: https://glassmkr.com/docs/rules/psu_redundancy_loss
Priority: P1
Title: PSU redundancy lost
Summary: One or more PSUs are in fault, absent, or degraded state. Single power failure now risks full server outage. Dell BMCs report this via an aggregate sensor; other vendors via per-PSU sensors.

---

## Category: GPU

### Rule: gpu_corrected_ecc_storm

URL: https://glassmkr.com/docs/rules/gpu_corrected_ecc_storm
Priority: P3
Title: GPU corrected-ECC level high
Summary: GPU corrected-ECC counter is high or single-bit retired pages are non-zero. SBE storms typically precede DBE faults; this rule gives operators time to plan preventive replacement before uncorrected ECC fires.

---

### Rule: gpu_driver_or_firmware_drift

URL: https://glassmkr.com/docs/rules/gpu_driver_or_firmware_drift
Priority: P3
Title: GPU vbios drift within host
Summary: Multiple GPUs of the same model on this host report different vbios versions. Within-host vbios drift typically indicates a failed firmware update or mixed-batch installation.

---

### Rule: gpu_pcie_link_degraded

URL: https://glassmkr.com/docs/rules/gpu_pcie_link_degraded
Priority: P2
Title: GPU PCIe link degraded
Summary: GPU's current PCIe gen or width is below the GPU's advertised maximum. Host-to-GPU bandwidth is capped below the GPU's capability; meaningful for large-model loading and PCIe-attached weights, catastrophic for training-style workloads.

---

### Rule: gpu_power_cap_throttling

URL: https://glassmkr.com/docs/rules/gpu_power_cap_throttling
Priority: P2
Title: GPU power-cap throttling
Summary: GPU is being throttled by software power cap (sw_power_cap) or hardware power brake (hw_power_brake). May be intentional (operator-configured limit) or unexpected (PSU sizing, chassis power policy).

---

### Rule: gpu_thermal_critical

URL: https://glassmkr.com/docs/rules/gpu_thermal_critical
Priority: P1
Title: GPU thermal critical
Summary: GPU die temperature at or above HW slowdown threshold, or kernel reports thermal throttle engaged. Sustained operation at thermal limits accelerates wear and reduces inference throughput. Boot grace 300s for post-boot sensor stabilisation.

---

### Rule: gpu_uncorrected_ecc

URL: https://glassmkr.com/docs/rules/gpu_uncorrected_ecc
Priority: P0
Title: GPU uncorrected ECC or DBE retired pages
Summary: GPU reports uncorrected ECC errors, double-bit ECC retired pages, or pending retirements. Uncorrected ECC means error correction could not recover; in-flight data may have been corrupted. Pending retirements require a reboot.

---

### Rule: gpu_xid_critical

URL: https://glassmkr.com/docs/rules/gpu_xid_critical
Priority: P0
Title: GPU XID critical event
Summary: NVIDIA XID error classified as critical per NVIDIA's published XID severity table. Hardware-witnessed fault on the GPU; data may be at risk and the workload likely degraded.

---

### Rule: nvlink_link_down

URL: https://glassmkr.com/docs/rules/nvlink_link_down
Priority: P1
Title: NVLink link down
Summary: An NVLink on a multi-GPU host is in the down state. Multi-GPU bandwidth is reduced; if the GPU participates in NCCL collectives the entire training/inference job's latency degrades.

---

## Category: Time & Services

### Rule: clock_drift

URL: https://glassmkr.com/docs/rules/clock_drift
Priority: P2
Title: Clock drift
Summary: System clock is at least 5 seconds off from upstream NTP.
Critical at >=60s; TLS validation, log correlation, database replication, and cron all break.

---

### Rule: ntp_not_synced

URL: https://glassmkr.com/docs/rules/ntp_not_synced
Priority: P2
Title: NTP not synced
Summary: Either the kernel clock is unsynchronized (critical; drift in progress) OR the NTP daemon has stopped while the clock is still synced (warning; drift will start once kernel state expires).

---

### Rule: service_flapping

URL: https://glassmkr.com/docs/rules/service_flapping
Priority: P1
Title: systemd service flapping
Summary: A systemd unit has hit its start-limit (systemd stopped restarting it) OR has restarted 5+ times. A service that can't stabilise consumes resources without delivering value; investigate before bumping restart limits.

---

### Rule: systemd_service_failed

URL: https://glassmkr.com/docs/rules/systemd_service_failed
Priority: P1
Title: systemd service failed
Summary: One or more systemd units are in the failed state. The service is not running; dependent functionality is offline. Crucible 0.9.2+ also ships the last 5 journal lines per failed unit in evidence so root cause is one click away.

---

### Rule: systemd_service_oom_killed

URL: https://glassmkr.com/docs/rules/systemd_service_oom_killed
Priority: P1
Title: systemd service killed by OOM
Summary: systemd reports a failed unit with Result=oom-kill. The kernel OOM killer terminated the service; pair with the host-level oom_kills emission to find the underlying memory pressure source.

---

### Rule: unexpected_reboot

URL: https://glassmkr.com/docs/rules/unexpected_reboot
Priority: P1
Title: Unexpected reboot
Summary: Server rebooted without an operator-acknowledged planned reboot. Possible causes: kernel panic, hardware fault (PSU brownout, thermal shutdown, watchdog), power outage, or remote reboot via BMC.
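
Several rules in this corpus use a two-band threshold, and clock_drift in this category is a compact instance: warning from 5 seconds, critical from 60 seconds. The sketch below shows that banding under the assumption that drift is measured as the absolute offset, so a clock running behind counts the same as one running ahead. The function name and signature are hypothetical.

```python
from typing import Optional


def clock_drift_severity(offset_seconds: float,
                         warn_at: float = 5.0,
                         crit_at: float = 60.0) -> Optional[str]:
    """Classify a clock offset against the 5s and 60s bands from clock_drift.

    Assumption: the sign of the offset is ignored.
    """
    drift = abs(offset_seconds)
    if drift >= crit_at:
        return "critical"
    if drift >= warn_at:
        return "warning"
    return None
```

An offset of -5.0 seconds therefore classifies as `"warning"`, and anything at or beyond 60 seconds in either direction as `"critical"`.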
--- ## Category: Security & Patching ### Rule: kernel_needs_reboot URL: https://glassmkr.com/docs/rules/kernel_needs_reboot Priority: P2 Title: Reboot required (newer kernel installed) Summary: A newer kernel package is installed on disk but the running kernel is older. Security patches in the new kernel are not active until reboot. --- ### Rule: kernel_vulnerabilities URL: https://glassmkr.com/docs/rules/kernel_vulnerabilities Priority: P2 Title: Kernel vulnerability mitigations missing Summary: One or more CPU vulnerability mitigations (Spectre, Meltdown, MDS, etc.) report unmitigated or partial coverage in /sys/devices/system/cpu/vulnerabilities/. Update kernel + CPU microcode to apply. --- ### Rule: no_firewall URL: https://glassmkr.com/docs/rules/no_firewall Priority: P1 Title: No host firewall active Summary: No active firewall rules detected. All listening ports are reachable from any network the host is connected to, unless protected by network-level ACLs (VPC, cloud SG, on-prem ACL). --- ### Rule: pending_security_updates URL: https://glassmkr.com/docs/rules/pending_security_updates Priority: P2 Title: Pending security updates Summary: Package manager reports one or more security updates available AND auto-updates are not configured. Manual patching is required; counterpart to unattended_upgrades_disabled which fires when the auto-update mechanism itself is missing. --- ### Rule: server_unreachable URL: https://glassmkr.com/docs/rules/server_unreachable Priority: P1 Title: Server unreachable Summary: Dashboard has not received a snapshot from this server in 2x the configured collection interval (default 10 minutes). Either the Crucible agent stopped reporting, the network is down, or the server is offline. Alert auto-resolves on next successful snapshot. --- ### Rule: ssh_root_password URL: https://glassmkr.com/docs/rules/ssh_root_password Priority: P1 Title: SSH allows root password login Summary: sshd allows root login with password. 
Brute-force-able from the network. Switch to key-only root login (still works for key-based ops); ideally disable root SSH entirely and use a sudo-equipped operator account.

---

### Rule: unattended_upgrades_disabled

URL: https://glassmkr.com/docs/rules/unattended_upgrades_disabled
Priority: P3
Title: Unattended security upgrades disabled
Summary: No automatic security update mechanism is configured. The host is at the operator's mercy for patch cadence; if patches are pending, the counterpart pending_security_updates rule will fire.

---
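
Because this corpus is meant to be machine-readable, a consumer may want the rule entries as structured records. The sketch below is one possible approach and not an official Glassmkr parser: it assumes every entry follows the `### Rule:` / `URL:` / `Priority:` / `Title:` / `Summary:` field order used throughout this file and ends at a `---` separator. The names `RULE_RE` and `parse_rules` are hypothetical. Whitespace between fields may be spaces or newlines.

```python
import re
from typing import Dict, List

# Matches one rule entry; fields may be separated by spaces or newlines.
RULE_RE = re.compile(
    r"###\s+Rule:\s+(?P<id>\S+)\s+"
    r"URL:\s+(?P<url>\S+)\s+"
    r"Priority:\s+(?P<priority>P\d)\s+"
    r"Title:\s+(?P<title>.+?)\s+"
    r"Summary:\s+(?P<summary>.+?)\s*(?=---|\Z)",
    re.DOTALL,
)


def parse_rules(corpus: str) -> List[Dict[str, str]]:
    """Extract rule records, collapsing internal whitespace in each field."""
    return [
        {key: " ".join(value.split()) for key, value in match.groupdict().items()}
        for match in RULE_RE.finditer(corpus)
    ]
```

Each returned dictionary has the keys `id`, `url`, `priority`, `title`, and `summary`, which makes it straightforward to filter, for example, all P0 rules.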