Glassmkr Documentation
From zero to monitoring in about a minute. One agent, opinionated alert rules, no inbound ports.
#Getting started
Install the agent, see your first alert, route notifications where your team lives, and understand the billing model.
#What Glassmkr monitors
Glassmkr is a monitoring agent for bare metal and dedicated servers. The agent collects hardware and OS metrics every 60 seconds and pushes them to the Glassmkr dashboard, where a library of alert rules evaluates each snapshot automatically.
Hardware
IPMI sensors (temperature, fan speed, voltage, power draw), IPMI SEL event log, ECC memory errors, PSU redundancy status
Storage
SMART health and wear level, disk space and inodes, RAID array status, ZFS pool health and scrub errors, filesystem read-only detection, I/O errors and latency
Network
Interface errors and drops, link speed negotiation, bandwidth saturation, bond slave status, conntrack table usage
OS
CPU per-core utilization and iowait, load averages, RAM and swap, OOM kills, clock drift, NTP sync, systemd failed units, file descriptor exhaustion, unexpected reboots
Security
SSH root password authentication, firewall status, pending security updates, kernel vulnerabilities, reboot required flag, unattended upgrades configuration
The full rule library evaluates on every collection cycle. All rules are included on every plan, including Free.
#Installation
Docker (recommended)
# 1. Create config directory
sudo mkdir -p /etc/glassmkr
# 2. Add your collector key (get it from glassmkr.com after signing up)
sudo tee /etc/glassmkr/crucible.yaml << 'EOF'
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
interval: 300
EOF
# 3. Download and start
curl -O https://raw.githubusercontent.com/glassmkr/crucible/main/docker-compose.yml
docker compose up -d
# 4. Verify
docker compose logs glassmkr-crucible
The container runs with --privileged and network_mode: host for IPMI, SMART, and bond monitoring. See Trust for details.
npm alternative
npm install -g @glassmkr/crucible
sudo glassmkr-crucible --config /etc/glassmkr/crucible.yaml
Requires Node.js 24+. The system packages smartmontools, ipmitool, and dmidecode are needed for full hardware monitoring.
Your server appears in the dashboard within 1 minute.
Sign up free to get your collector key.
#First alert
After install, the agent pushes its first snapshot within 1 minute and the dashboard evaluates the rule library against it. On bare-metal hardware, something is usually mildly degraded: a disk that's been running for years, kernel updates pending, a fan running near its threshold. Expect 0-3 alerts on first contact, most P3 or warning-level.
Each alert opens to a detail page with remediation guidance (the command to run, what to verify after) and a Furnace assistant annotation. See Furnace for how that works.
#Notification channels
Email
Alerts delivered from [email protected].
Telegram
Bot messages with alert details and direct links.
Slack
Block Kit formatted messages with severity colors.
Discord
Rich embeds posted to a channel via incoming webhook.
PagerDuty
Events API v2; maps priority to PagerDuty severity.
Webhooks
POST JSON to any URL you configure.
All six channels are on every plan, free or Pro, with no cap on how many you configure. Every channel supports per-priority filtering (P1 to P4). Agent update notifications send major version alerts to everyone; patch notifications are opt-in. Routing detail under Multi-channel alerting.
#Pricing and billing
Free
- Up to 3 nodes
- All 62 alert rules
- All six notification channels, unlimited
- Full read+write API
- Predictive trend warnings
- 7 days data retention
- One trial AI analysis per server
- No credit card required
Pro $3/node/month
Everything in Free, with three limits lifted, plus email support:
- More than 3 nodes (first 3 free, then $3/node/month)
- 90 days data retention (audit log 365 days)
- Unlimited AI health analysis
- Email support
Enterprise
Custom pricing and configuration. Contact [email protected].
#Concepts
Glassmkr's vocabulary: what a node is, what an alert rule is, what Furnace does, what trend warnings are, and how notifications get routed.
#Nodes and servers
One Linux server is one node. Billing, alerts, and notifications all attach to nodes. A server that stops reporting for more than 2 minutes triggers the server_unreachable alert and stays in your fleet count until you remove it from the dashboard. See server_unreachable.
#Alert rules
An alert rule is an evaluator function that runs on every snapshot. The shipped rules cover hardware faults, capacity, security posture, and service health. Each rule emits structured evidence (the metric values that caused the fire) and resolves automatically when the underlying condition clears. The full reference catalog is at /docs/rules. Per-alert remediation guidance is rendered inside the dashboard on the alert detail page.
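To make the evaluator shape concrete, here is a minimal Python sketch of a disk_space_high-style rule. The Finding type, the function signature, and the evidence field names are assumptions for illustration, not Glassmkr's actual schema; only the 85% and 95% defaults come from the rule reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    rule: str
    severity: str   # "warning" or "critical"
    evidence: dict  # the metric values that caused the rule to fire


def disk_space_high(mount: str, used_percent: float,
                    warn: float = 85.0, crit: float = 95.0) -> Optional[Finding]:
    """Evaluate one mount from one snapshot (hypothetical signature).

    Returning None on a later snapshot is what would let the alert
    resolve automatically once the condition clears.
    """
    if used_percent >= crit:
        severity = "critical"
    elif used_percent >= warn:
        severity = "warning"
    else:
        return None
    return Finding("disk_space_high", severity,
                   {"mount": mount, "disk.used_percent": used_percent})
```

Because the function is stateless, running it on every snapshot gives both the fire and the auto-resolve behaviour without extra bookkeeping.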
#Furnace (AI assistance)
Furnace is the AI assistant that annotates alert detail pages. It reads the alert's evidence and the rule's remediation steps and produces context-specific notes. It runs on a self-hosted Gemma 4 26B model on a single NVIDIA L4 GPU in Amsterdam; no third-party LLM APIs (no OpenAI, no Anthropic, no Google). Your alert data does not leave EU jurisdiction.
Furnace is conservative by design: it hedges interpretive claims, doesn't autocomplete shell commands, and says "I don't know" when it doesn't know. Full philosophy in the Furnace introduction blog post.
#Trend warnings
A predictive early-warning feature, on every plan, that surfaces warnings when metrics show degradation trends before the alert thresholds fire. Separate from the snapshot-driven alert rules (which fire on current state).
Trend warnings run in a 6-hour batch process on Glassmkr's backend. They analyze up to 90 days of metric history per server, apply correlation rules that require two independent signals, and optionally consult an internal ranking model trained on Backblaze's public drive failure dataset. The result is a small number of high-confidence warnings per server, not noisy anomaly detection on every metric.
What gets monitored
| Signal | What we watch for | Example warning |
|---|---|---|
| SMART reallocated sectors (5) | Growth over 7-30 days | Drive /dev/sda SMART 5 grew from 0 to 14 over 30 days |
| SMART reported uncorrectable (187) | Any appearance above zero | Drive /dev/sda: SMART 187 is now 3 |
| SMART command timeouts (188) | Repeated growth | Drive /dev/sda: 4 command timeouts in last 7 days |
| SMART pending/offline uncorrectable (197, 198) | Step change from zero | Drive /dev/sda: pending sectors appeared 3 days ago |
| SMART high fly writes (189) | Burst patterns | Drive /dev/sda: 8 high fly write events in 24h |
| NVMe critical_warning | Any bit set | NVMe /dev/nvme0n1: critical_warning bit 2 set (reliability degraded) |
| NVMe available_spare | Approaching or below threshold | NVMe /dev/nvme0n1: available_spare now 12%, threshold is 10% |
| NVMe media_errors | Growing rapidly | NVMe /dev/nvme0n1: media_errors increased from 0 to 4 in 7 days |
| NVMe p99 latency (planned, v1.1) | Sustained drift without IO volume change | NVMe /dev/nvme0n1: p99 read latency sustained 2.3x above baseline |
| Disk space per partition | Projected fill via linear regression | /data partition projected to hit 85% in 12 days at current growth |
| ECC correctable errors (planned, v1.1) | Bursts per DIMM location | DIMM CPU1_DIMM_A2: 15 correctable errors in 24h |
| PSU rail voltages | Drift 2-3% from nominal | PSU 1: 12V rail at 11.62V (drift 3.2%) |
| Fan RPM | Decline paired with temp rise in same zone | Fan SYS_FAN2: RPM dropped 25% and chassis zone temp rising |
| NIC errors | CRC/frame errors (TCP retransmit correlation planned, v1.1) | eth0: 47 CRC errors |
| ZFS checksum/read errors | Paired with matching SMART signal on same device | Drive /dev/sda: ZFS reported 7 checksum errors corroborating SMART 5 growth |
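The disk space row above relies on a linear-regression projection. The following Python sketch shows one way such a projection could be computed with ordinary least squares; the function name, the sampling in days, and the 85% target are illustrative assumptions rather than Glassmkr's implementation.

```python
from typing import Optional, Sequence


def days_until_threshold(days: Sequence[float], used_pct: Sequence[float],
                         threshold: float = 85.0) -> Optional[float]:
    """Fit used_pct = slope * day + intercept by least squares and return the
    number of days from the last sample until the fit reaches `threshold`.

    Returns None when usage is flat or shrinking, 0.0 when the fitted line
    is already at or above the threshold.
    """
    n = len(days)
    if n < 2 or n != len(used_pct):
        raise ValueError("need at least two paired samples")
    mean_x = sum(days) / n
    mean_y = sum(used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_pct))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

For a partition growing one percentage point per day from 70% (samples on days 0 to 3), the fit crosses 85% on day 15, which is 12 days after the last sample.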
Severity tiers
- Imminent (red): projected failure within 7 days, or critical pattern (SMART 187 appearance, NVMe critical_warning). Push notification immediately.
- Soon (orange): projected within 30 days, or high-severity evidence. Push notification once.
- Scheduled (blue): projected within 90 days, or medium-severity. Dashboard only.
- Watch (grey): low confidence or more than 90 days out. Dashboard collapsed.
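The time-horizon part of these tiers can be sketched as a small Python function. The function name, the inclusive boundaries, and the critical_pattern flag are assumptions; the confidence and evidence-severity inputs mentioned above are left out for brevity.

```python
# Delivery behaviour per tier, paraphrased from the list above.
TIER_DELIVERY = {
    "imminent": "push immediately",
    "soon": "push once",
    "scheduled": "dashboard only",
    "watch": "dashboard, collapsed",
}


def trend_tier(projected_days: float, critical_pattern: bool = False) -> str:
    """Map a projected time-to-failure (in days) to a severity tier.

    critical_pattern stands in for signals such as a SMART 187 appearance
    or an NVMe critical_warning bit, which escalate regardless of horizon.
    """
    if critical_pattern or projected_days <= 7:
        return "imminent"
    if projected_days <= 30:
        return "soon"
    if projected_days <= 90:
        return "scheduled"
    return "watch"
```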
Correlation requirement
Where two signals exist on the same device, correlation is required before a notification. Several v1 categories fire on a single high-confidence signal because the underlying source is itself authoritative (a SMART step-change from zero, NVMe critical_warning bits, a PSU rail at 11.62V). The asymmetry is deliberate.
Multi-signal categories shipped in v1:
| Signals (on the same device) | Diagnosis |
|---|---|
| Drive SMART signal and ZFS errors | Storage device degradation |
| Fan RPM decline and chassis temp rise (same zone) | Cooling failure |
Multi-signal categories planned for v1.1 once the underlying collector work lands:
| Signals (on the same device) | Diagnosis |
|---|---|
| NVMe health signal and p99 latency inflation | NVMe pre-failure (fail-slow) |
| NIC CRC errors and TCP retransmits (same interface) | NIC hardware failure |
| ECC burst and MCE entries (same DIMM) | DIMM pre-failure |
This approach trades some recall (failures that only show one signal) for high precision. Google's FAST 2007 study found roughly 40-50% of drive failures in the field show no SMART-visible warning, so trend warnings are a meaningful reduction of surprise failures, not a guarantee that every failure becomes predictable.
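The asymmetry between correlated and single-signal categories can be illustrated with a short Python sketch. The signal names and the diagnose function are hypothetical, and the input is assumed to be the set of signals already grouped for one device.

```python
from typing import Optional, Set

# Hypothetical signal names; single signals treated as authoritative on their own.
AUTHORITATIVE = {"smart_step_change", "nvme_critical_warning", "psu_rail_drift"}

# The two multi-signal categories shipped in v1.
CORRELATED_PAIRS = {
    frozenset({"smart_signal", "zfs_errors"}): "Storage device degradation",
    frozenset({"fan_rpm_decline", "zone_temp_rise"}): "Cooling failure",
}


def diagnose(signals: Set[str]) -> Optional[str]:
    """Return a diagnosis only when a full pair or an authoritative signal is present."""
    for pair, diagnosis in CORRELATED_PAIRS.items():
        if pair <= signals:  # both halves of the pair observed
            return diagnosis
    single = signals & AUTHORITATIVE
    if single:
        return "single-signal: " + sorted(single)[0]
    return None  # one half of a pair alone does not notify
```

A lone fan RPM decline returns None here, which is the recall the design gives up in exchange for precision.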
What we explicitly don't do
- No general-purpose anomaly detection on every metric. Netdata's own docs demote their anomaly ML to "investigation aid, not alert source." We agree.
- No per-customer model training. With 3-50 servers per account, customer-specific models are base-rate-dominated. We use global thresholds plus an offline-trained ranker on Backblaze's public dataset.
- No LLM-based trend classification. Linear regression, CUSUM, and first-differences do this job better and cheaper. We use AI only to narrate deterministic findings in plain English.
- No confident failure predictions. We say "likely within 7-14 days", never "will fail on Tuesday." The underlying signals carry real uncertainty and we surface it.
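For readers unfamiliar with CUSUM, the textbook one-sided form is sketched below in Python. The document names the technique but not its formulation or tuning, so the function name and the target, slack, and limit parameters are illustrative assumptions.

```python
from typing import Optional, Sequence


def cusum_upward(values: Sequence[float], target: float,
                 slack: float, limit: float) -> Optional[int]:
    """One-sided (upward) CUSUM.

    Accumulates how far each sample exceeds target + slack, never letting the
    sum drop below zero, and returns the index of the first sample at which
    the sum exceeds `limit`. Returns None if that never happens.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > limit:
            return i
    return None
```

With target 0, slack 0.5, and limit 1.0, the series 0, 0, 0, 1, 1, 1, 1 accumulates 0.5, 1.0, 1.5 from the fourth sample on and trips at index 5, while a series that stays at 0.2 never accumulates anything.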
Data requirements
Trend warnings are on every plan. What differs by plan is how much metric history each signal can see, which is set by your retention window (Free keeps 7 days, Pro keeps 90):
- The longer-horizon signals (SMART, NVMe, ECC, cooling, PSU, NIC) draw on up to 90 days of history, so they reach full sensitivity at Pro's 90-day retention.
- Disk space projection works on 7-day data, so it runs the same on Free.
- A server needs at least 3 days of contiguous data to receive any warnings. Freshly added servers are in an observation period.
Self-audit
The dashboard shows the feature's own track record: how many warnings were sent in the last 90 days, how many users confirmed as valuable, how many were dismissed, and how many were followed by a matching alert firing within 30 days. No other monitoring tool surfaces this, and it exists so you can audit whether trend warnings are actually earning their keep for your fleet.
#Multi-channel alerting
Each alert routes to channels based on the rules in your dashboard. Group by team, by server, by severity. Suppress during planned maintenance windows. The alerting layer is unopinionated; route alerts wherever your team already pays attention.
Channel types in Notification channels. Per-priority filter (P1-P4) and per-rule mutes operate independently of channel selection.
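How the per-priority filter, per-rule mutes, and maintenance windows combine can be sketched in Python as follows. The Channel type and the channels_for function are hypothetical; real routing rules live in the dashboard and may support more dimensions (team, server).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Channel:
    name: str
    priorities: Set[str]  # priorities this channel accepts, e.g. {"P1", "P2"}


def channels_for(alert_priority: str, rule: str, channels: List[Channel],
                 muted_rules: Set[str], in_maintenance: bool = False) -> List[str]:
    """Return the names of channels that should receive this alert."""
    # Mutes and maintenance windows suppress delivery regardless of channel.
    if in_maintenance or rule in muted_rules:
        return []
    # Otherwise each channel's own priority filter decides.
    return [c.name for c in channels if alert_priority in c.priorities]
```

In this sketch a P3 alert never reaches a channel that accepts only P1, even though the alert still exists in the dashboard.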
#Alert rules
62 rules across 9 categories, tuned for bare-metal failure modes. Per-rule catalog pages at /docs/rules show the title, summary, priority, and category, plus the quick-check command + verdict prior (recoverable / investigation / vendor-side) for each rule. Per-alert remediation guidance (full FIX content: prerequisites, safe-mode diagnostic, fix command, validation, rollback, blast-radius impact) lives in the dashboard on the alert detail page. 20 of the 62 rules ship with deep FIX content; 30+ are verified end-to-end on real hardware. The summary tables below group rules by category.
Storage (8 rules)
| Rule | Trigger | Severity |
|---|---|---|
| disk_space_high | ≥ 85% warning, ≥ 95% critical. Configurable. | Warning / Critical |
| disk_fill_projection | Trend warning: projected to fill within N days (cross-snapshot) | Warning |
| smart_failing | Reallocated/pending sectors or health != PASSED | Critical |
| nvme_wear_high | ≥ 85% wear warning, ≥ 95% critical. NVMe Critical Warning bits also decoded. | Warning / Critical |
| raid_degraded | Any degraded or failed RAID array (mdadm + hardware RAID via storcli/perccli/ssacli/arcconf) | Critical |
| disk_latency_high | Average latency > 100ms | Warning |
| disk_io_errors | I/O errors detected in dmesg (structured event match) | Critical |
| inode_high | ≥ 90% inodes used | Warning |
ZFS (3 rules)
| Rule | Trigger | Severity |
|---|---|---|
| zfs_pool_unhealthy | Pool state != ONLINE; severity matrix by vdev redundancy class | Warning / Critical |
| zfs_scrub_errors | Scrub detected errors, or pool has never been scrubbed (fresh-pool reminder) | Warning |
| zfs_slog_faulted | SLOG vdev faulted (write-cache reliability impact) | Critical |
Filesystem (4 rules)
| Rule | Trigger | Severity |
|---|---|---|
| filesystem_readonly | Mounted filesystem remounted read-only (kernel I/O error path) | Critical |
| fd_exhaustion | > 80% of system or per-process file descriptors used | Warning |
| lvm_thinpool_metadata_high | LVM thin-pool data or metadata > 80% used | Warning / Critical |
| systemd_service_failed | Any systemd unit in failed state; classified by Result code | Warning |
Memory & CPU (9 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ram_high | ≥ 90% used, ≥ 95% critical. Configurable. | Warning / Critical |
| swap_high | > 50% swap used | Warning |
| oom_kills | Any OOM kill detected | Critical |
| cpu_high | ≥ 90% utilization, ≥ 98% critical | Warning / Critical |
| load_high | Load average > 1x core count warning, > 2x critical | Warning / Critical |
| cpu_iowait_high | ≥ 20% iowait. Configurable. | Warning |
| cpu_pressure_high | PSI cpu.some / cpu.full stall > threshold (kernel ≥ 4.20) | Warning |
| mem_pressure_high | PSI memory.some / memory.full stall > threshold | Warning |
| io_pressure_high | PSI io.full stall > threshold (companion to cpu_iowait_high) | Warning |
Network (10 rules)
| Rule | Trigger | Severity |
|---|---|---|
| interface_errors | Hardware errors > 0 per interval, drops > 500 | Warning |
| link_speed_mismatch | Interface negotiated ≥ 2x below highest advertised mode | Warning |
| interface_saturation | ≥ 90% of negotiated link speed sustained | Warning |
| bond_slave_down | A bond member interface is down | Critical |
| lacp_partner_lost | LACP partner state lost on a bond member | Warning |
| conntrack_exhaustion | > 80% of conntrack table used, or insert_failed rate spiking | Warning |
| listen_overflow | Listening socket backlog overflows detected | Warning |
| accept_backlog_or_syn_flood | Accept backlog or SYN-flood pattern (cross-snapshot) | Warning |
| softnet_drops | Per-CPU softnet queue drops | Warning |
| tcp_retrans_high | TCP retransmit rate above threshold | Warning |
Hardware / BMC (7 rules)
| Rule | Trigger | Severity |
|---|---|---|
| cpu_temperature_high | > 80°C warning, > 90°C critical | Warning / Critical |
| ecc_errors | Correctable > 0 warning, uncorrectable > 0 critical. EDAC + IPMI SEL sources. | Warning / Critical |
| psu_redundancy_loss | PSU redundancy state degraded or lost | Critical |
| ipmi_sel_critical | Critical SEL entries detected. Vendor parsers: Dell/Supermicro/HPE fleet-tested; Lenovo/Cisco/OpenBMC parser_quality stub. | Critical |
| ipmi_fan_failure | Fan speed below minimum threshold or fan failure SEL event | Critical |
| cmos_battery_low | CMOS / RTC battery voltage below threshold (clock drift and BIOS reset risk) | Warning |
| service_flapping | Cross-snapshot: same systemd unit restarting repeatedly | Warning |
GPU (8 rules; NVIDIA)
| Rule | Trigger | Severity |
|---|---|---|
| gpu_xid_critical | Critical NVIDIA XID event (e.g. XID 79 fall-off-the-bus) | Critical |
| gpu_thermal_critical | Temperature ≥ 90°C, or hw_thermal_slowdown / sw_thermal_slowdown active. Note: not reachable on healthy L4 cooling under synthetic load; fires on real cooling-system issues. | Critical |
| gpu_uncorrected_ecc | Uncorrected ECC error on GPU memory | Critical |
| gpu_corrected_ecc_storm | Corrected ECC errors above rate threshold | Warning |
| gpu_power_cap_throttling | Sustained power-cap throttling event | Warning |
| gpu_pcie_link_degraded | PCIe link width or generation below advertised; cross-checked against ASPM idle state | Warning |
| nvlink_link_down | NVLink peer link down (multi-GPU systems) | Critical |
| gpu_driver_drift | NVIDIA driver version drift across the fleet | Info |
Time & services (4 rules)
| Rule | Trigger | Severity |
|---|---|---|
| clock_drift | Offset > 1 second | Warning |
| ntp_not_synced | NTP daemon not running or clock not synced | Warning |
| unexpected_reboot | Server restarted unexpectedly; reboot evidence (pstore / kdump / wtmp) classifies cause | Event |
| server_unreachable | Server missed 2+ check-ins (server-side watchdog) | Critical |
Security & patching (9 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ssh_root_password | Root login with password enabled | Warning |
| no_firewall | No active firewall detected | Warning |
| pending_security_updates | > 0 security updates pending | Info |
| kernel_vulnerabilities | Active kernel vulnerabilities. Severity demotes to info when kernel software mitigation is engaged ("Clear CPU buffers attempted"). | Info / Warning |
| kernel_needs_reboot | Kernel update requires reboot | Info |
| unattended_upgrades_disabled | Auto-updates not configured | Info |
| tls_certificate_expiring | TLS cert expiring within 30 days | Warning |
| weak_root_password_policy | Root password policy weak or absent | Warning |
| cve_critical_unpatched | Critical CVE detected as unpatched on the host's package versions | Warning |
State alerts auto-resolve when the condition clears. Event alerts (unexpected_reboot) stack occurrences and have a Resolve button. Acknowledged alerts still auto-resolve.
#Operations
Day-to-day tasks: managing nodes, tuning thresholds when a rule is too sensitive or not sensitive enough, scheduling maintenance windows, acknowledging alerts, and confirming or dismissing trend warnings so the ranker learns from your fleet.
#Managing nodes
Add a node by generating a collector key in the dashboard and pasting it into /etc/glassmkr/crucible.yaml on the target server (legacy installs: /etc/glassmkr/collector.yaml; the agent reads either). The new server reports within 1 minute. Remove a node from the dashboard's Servers page; the slot is released for billing on the next proration cycle. A server that stops reporting is not auto-removed; it surfaces a server_unreachable alert instead so unintentional silence is visible.
#Tuning thresholds (config reference)
The agent's full configuration lives in /etc/glassmkr/crucible.yaml (legacy installs: /etc/glassmkr/collector.yaml; the agent reads either, and glassmkr-crucible init migrates the file in place). Most fleets only ever set collector_key. The other fields exist for hostname overrides, faster collection on short-window debugging, or disabling a module when the underlying tool isn't present.
# Required
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
# Optional
interval: 60 # Collection interval in seconds (default: 60)
# hostname: my-server # Override auto-detected hostname
# modules: # Disable specific collection modules
#   ipmi: false
#   smart: false
#   zfs: false
#   security: false
- server_url
- The Glassmkr ingest endpoint. Always https://app.glassmkr.com for the hosted service.
- collector_key
- Your server's authentication token. Generated when you add a server in the dashboard. Prefixed with gmk_cru_live_ (older keys may still use the legacy col_ prefix until rotated).
- interval
- How often (in seconds) the agent collects and pushes a snapshot. Default is 60 seconds. Minimum is 60.
- hostname
- Override the auto-detected hostname. Useful when the system hostname is generic or changes between reboots.
- modules
- Disable individual collection modules. Set any module to false to skip it. The agent will not attempt to read sensors for disabled modules.
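The field constraints above can be checked before deploying a config. This Python sketch validates an already-parsed config dictionary; check_config is a hypothetical helper, not part of the agent, and how the agent itself reacts to an interval below 60 (reject or clamp) is not stated here.

```python
from typing import Any, Dict, List

# Current prefix, plus the legacy prefix read here as "col_" (an assumption).
KEY_PREFIXES = ("gmk_cru_live_", "col_")
MIN_INTERVAL = 60  # seconds; documented minimum and default


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in ("server_url", "collector_key"):
        if not cfg.get(field):
            problems.append(f"missing required field: {field}")
    key = cfg.get("collector_key", "")
    if key and not key.startswith(KEY_PREFIXES):
        problems.append("collector_key has an unexpected prefix")
    interval = cfg.get("interval", MIN_INTERVAL)
    if not isinstance(interval, int) or interval < MIN_INTERVAL:
        problems.append(f"interval must be an integer >= {MIN_INTERVAL}")
    return problems
```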
Per-rule numeric thresholds (the percentage at which disk_space_high fires, the iowait floor for cpu_iowait_high, and similar) live in the dashboard, not in the agent's YAML. Open a rule's settings page to adjust them. Defaults are chosen for bare-metal fleets; tune up for noisy storage or down for capacity-tight servers.
#Maintenance windows
Schedule a planned-reboot or service window from a server's detail page. Alerts that fire during the window are suppressed at the notification layer (they still appear in the dashboard for audit). The unexpected_reboot event is treated as expected during a planned-reboot window. Windows accept a duration or an explicit end time.
#Acknowledging alerts
Click Acknowledge on an alert detail page to silence further notifications for that alert while you are working on it. State alerts still auto-resolve when the condition clears. Event alerts (like unexpected_reboot) stack subsequent occurrences under the acknowledged alert and expose a Resolve button when you are done.
#Trend warning feedback
Each trend warning has Confirm and Dismiss buttons. Confirm marks the warning as a true positive (typically because the underlying part was replaced); dismiss marks it as a false positive. The dashboard surfaces the feature's running track record under the trend warnings self-audit (see Trend warnings). Feedback also flows back into the ranker as labelled training signal across the fleet.
#API
Glassmkr exposes a small REST API for server, alert, and notification-channel management. Full machine-readable corpus at /llms-full.txt; LLM-first index at /llms.txt.
#Authentication
API calls authenticate with an account token. Generate one in the dashboard under Account > API tokens; tokens are prefixed gmk_acct_live_. Pass it as Authorization: Bearer <token>. Tokens carry the full permissions of the account; rotate any token that has been exposed.
curl -H "Authorization: Bearer gmk_acct_live_YOUR_TOKEN" \
  https://app.glassmkr.com/api/v1/servers
#Servers
GET /api/v1/servers lists all servers in the account with last-seen timestamps, hardware identifiers, and the most recent snapshot's headline metrics. GET /api/v1/servers/{id} returns a single server's full latest snapshot. DELETE /api/v1/servers/{id} removes a server. Adding a server is done from the dashboard so that the collector key can be issued and displayed once.
#Alerts
GET /api/v1/alerts returns currently open alerts. Filter with ?status=open|acknowledged|resolved and ?server_id=<id>. POST /api/v1/alerts/{id}/acknowledge and POST /api/v1/alerts/{id}/resolve are the two mutating endpoints. Webhook deliveries (configured per-channel) carry the same payload shape.
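The same calls can be built from Python's standard library. This is a sketch under assumptions: api_request is a hypothetical helper, SERVER_ID and ALERT_ID are placeholders, and combining both filters in one query is assumed to be allowed. The requests are only constructed here, not sent.

```python
import urllib.request
from urllib.parse import urlencode

BASE = "https://app.glassmkr.com/api/v1"


def api_request(token: str, path: str, method: str = "GET",
                **params: str) -> urllib.request.Request:
    """Build an authenticated request; query parameters go in **params."""
    url = f"{BASE}{path}"
    if params:
        url += "?" + urlencode(params)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"})


# Open alerts for one server, and an acknowledge call for one alert.
open_alerts = api_request("gmk_acct_live_YOUR_TOKEN", "/alerts",
                          status="open", server_id="SERVER_ID")
ack = api_request("gmk_acct_live_YOUR_TOKEN", "/alerts/ALERT_ID/acknowledge",
                  method="POST")

# To send: urllib.request.urlopen(open_alerts)
# The response schema is documented in /llms-full.txt.
```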
#Notification channels
GET /api/v1/channels lists configured channels; POST /api/v1/channels creates one; PATCH /api/v1/channels/{id} updates routing rules. The full schema, including per-priority filter syntax and webhook payload shape, is documented in /llms-full.txt.
#Reference
System requirements, architecture, a vocabulary glossary, the per-metric definitions you'll see in alert evidence, and Crucible's release history.
#System requirements
- Operating system
- Linux with systemd. Tested on Debian 11/12, Ubuntu 20.04 to 24.04, Rocky 8/9, AlmaLinux 8/9.
- Runtime
- Docker (recommended) or Node.js 24+.
- Privileges
- Root access required for IPMI, SMART, and /proc system reads.
- Network
- Outbound HTTPS on port 443 to app.glassmkr.com. No inbound ports needed.
- Resource usage
- Around 110 MB resident memory (RSS), under 1% of host RAM on every host we tested. Measured on Crucible 0.13.6 across all 10 validation hosts at steady state: median 108 MB, range 81 to 116 MB (varies mainly with the bundled Node version). Effectively 0% CPU at the default 60-second snapshot interval. Random-read I/O delta under 1.5% under fio saturation.
- Optional packages (npm install only)
- smartmontools, ipmitool, dmidecode for full hardware monitoring. Missing packages are silently skipped.
#Architecture
[Architecture diagram: the agent pushes a snapshot to the dashboard every 60s.]
- The agent is MIT open source: github.com/glassmkr/crucible
- Agent pushes outbound only, opens no inbound ports
- Snapshots contain hardware metrics only, no user data
- Dashboard runs on EU dedicated servers, no cloud providers
- AI analysis runs on a self-hosted GPU, no external AI providers
#Glossary
- Agent / Crucible
- The collection process that runs on each monitored server and pushes snapshots to the dashboard. MIT-licensed at github.com/glassmkr/crucible.
- Snapshot
- One push from the agent. Contains hardware sensor readings, OS counters, SMART data, and software state at a point in time.
- Alert rule
- An evaluator function that runs against every snapshot and fires when the rule's condition is met. The full catalog is at /docs/rules.
- Trend warning
- A feature on every plan that surfaces metric degradation trends before a threshold-based rule would fire.
- Furnace
- The self-hosted AI assistant that annotates alert detail pages. Gemma 4 26B on an NVIDIA L4 in Amsterdam.
- Dashboard
- The SaaS surface at app.glassmkr.com. Hosts alert evaluation, notification routing, billing, the API, and Furnace.
#Metric definitions
Alert evidence references metric names that map to specific source files and counters. Headline definitions:
- cpu.utilization_percent
- From /proc/stat, computed as (1 - idle_delta / total_delta) over the collection interval. Excludes iowait.
- cpu.iowait_percent
- From /proc/stat, the iowait counter delta over total delta.
- memory.used_percent
- From /proc/meminfo, computed as (MemTotal - MemAvailable) / MemTotal.
- load.avg_1m / avg_5m / avg_15m
- From /proc/loadavg.
- disk.used_percent
- Per-mount from statvfs(). Excluded mounts (tmpfs, snap squashfs) are not collected.
- smart.attr.<id>
- Vendor-attribute values from smartctl -A, keyed by attribute number (5 = reallocated_sector_ct, 187 = reported_uncorrect, etc.).
- ipmi.sensor.<name>
- Numeric readings from ipmitool sdr elist with status flags. Includes fan RPM, voltage rails, temperature zones, PSU watts.
The complete metric inventory is in /llms-full.txt.
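The CPU and memory formulas can be worked through in a few lines of Python. The inputs are assumed to be cumulative /proc/stat counters keyed by field name, sampled at the start and end of an interval. To honour "excludes iowait", this sketch counts iowait time as idle; that reading is an assumption, not a statement of how the agent computes it.

```python
from typing import Dict, Tuple


def cpu_percentages(prev: Dict[str, int], curr: Dict[str, int]) -> Tuple[float, float]:
    """Return (utilization_percent, iowait_percent) between two samples."""
    deltas = {name: curr[name] - prev[name] for name in curr}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0, 0.0  # no time elapsed between samples
    # Assumption: iowait is folded into idle so utilization excludes it.
    idle = deltas["idle"] + deltas["iowait"]
    utilization = 100.0 * (1 - idle / total)
    iowait = 100.0 * deltas["iowait"] / total
    return utilization, iowait


def memory_used_percent(mem_total_kb: int, mem_available_kb: int) -> float:
    """(MemTotal - MemAvailable) / MemTotal, as a percentage."""
    return 100.0 * (mem_total_kb - mem_available_kb) / mem_total_kb
```

For an interval with 50 user, 20 system, 20 idle, and 10 iowait ticks, this gives roughly 70% utilization and 10% iowait.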
#Release history
Crucible release notes are tagged on GitHub: github.com/glassmkr/crucible/releases. The npm dist-tag latest always points at the version recommended for new installs; the dashboard's install snippet pulls from it. Major version notifications go to all configured channels by default; patch notifications are opt-in.
#Troubleshooting
Common failure modes during install and operation, and what to check first.
#Agent not reporting
If the server stops appearing in the dashboard, check in this order:
1. docker compose ps (or systemctl status glassmkr-crucible) confirms the process is running.
2. docker compose logs --tail=100 glassmkr-crucible shows the most recent push attempt. Look for HTTP status codes other than 200.
3. Outbound connectivity to app.glassmkr.com:443 from the host: curl -I https://app.glassmkr.com.
4. The collector_key in /etc/glassmkr/crucible.yaml (legacy installs: /etc/glassmkr/collector.yaml) matches the key shown for the server in the dashboard. A rotated key invalidates the old one.
#Alerts firing too often
If a rule's threshold doesn't suit your fleet (a storage server that legitimately runs at 92% disk consistently, a database under sustained 85% memory pressure by design), adjust the threshold on the rule's settings page. See Tuning thresholds. If the noise is from a single host with known degraded hardware, an acknowledge plus a planned-maintenance window is usually a better fit than disabling the rule fleet-wide.
#Notifications not arriving
An alert that appears in the dashboard but never reaches a notification channel almost always points at the channel configuration:
- Email: check the spam folder; [email protected] sets SPF, DKIM, and DMARC, but corporate filters sometimes still hold first contact.
- Telegram: confirm the bot is still a member of the chat and that the chat ID matches the value stored in the channel.
- Slack: rotate the incoming-webhook URL if it has been revoked, and verify the channel still exists.
- Webhooks: open the channel's delivery history to see HTTP response codes from your endpoint. 4xx and 5xx responses are retried with backoff but not indefinitely.
- Per-priority filtering: a P3 alert routed to a P1-only channel will not be delivered; this is by design.
#Installation issues
The docker install path is the most predictable. If you are on the npm path, the agent silently skips modules whose backing tool is missing; install smartmontools, ipmitool, and dmidecode to enable the full hardware path. If IPMI sensors don't appear after install, run ipmitool sdr as root: if that errors, the BMC is unreachable from the host and Glassmkr cannot help until that is fixed. For anything else, mail [email protected] with the most recent 100 lines of agent logs.
#FAQ
Do I need to open any inbound ports?
No. The agent initiates all connections outbound over HTTPS (port 443). Your firewall rules do not need to change.
Does the agent work without IPMI?
Yes. If ipmitool is not installed or the BMC is not reachable, the IPMI module is silently skipped. All other monitoring continues normally.
What happens if connectivity is lost?
The server_unreachable rule fires after the server misses 2 consecutive check-ins, about 2 minutes at the default 60-second interval. When connectivity resumes, the agent continues pushing snapshots.
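The timing works out as in this Python sketch. The is_unreachable function is hypothetical; the actual check runs in Glassmkr's server-side watchdog.

```python
def is_unreachable(seconds_since_last_checkin: float,
                   interval: int = 60, missed: int = 2) -> bool:
    """True once more than `missed` collection intervals have passed silently."""
    return seconds_since_last_checkin > interval * missed
```

At the default 60-second interval the cutoff is 120 seconds. If the watchdog scales with the configured interval, as this sketch assumes, a longer interval delays the alert proportionally.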
Can I self-host the dashboard?
The agent is MIT-licensed and fully open source. The dashboard and alert evaluation engine are SaaS-only.
How does pricing work mid-month?
Proration. Add a server mid-month and you are charged proportionally for the remaining days. Remove a server and the next bill reflects the change.
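As a worked example of the arithmetic, the Python sketch below computes a full-month bill and a prorated first charge. Calendar-month billing, day-level granularity, and counting the day the node was added are all assumptions; the $3 price and the three free nodes come from the pricing section.

```python
import calendar
from datetime import date

PRICE_PER_NODE = 3.00  # USD per node per month on Pro
FREE_NODES = 3         # the first three nodes are free


def monthly_cost(nodes: int) -> float:
    """Full-month cost for a fleet of `nodes` servers."""
    return max(0, nodes - FREE_NODES) * PRICE_PER_NODE


def prorated_charge(added_on: date) -> float:
    """Charge for one billable node added part-way through a calendar month."""
    days_in_month = calendar.monthrange(added_on.year, added_on.month)[1]
    remaining = days_in_month - added_on.day + 1  # includes the day it was added
    return round(PRICE_PER_NODE * remaining / days_in_month, 2)
```

Under these assumptions, a 10-node fleet costs $21 per month, and a node added on 16 June is charged for 15 of 30 days, or $1.50.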
Is my data stored in the EU?
Yes. All infrastructure, including the database servers and AI GPU, runs on dedicated servers in EU data centers.