Glassmkr Documentation

From zero to monitoring in about a minute. One agent, opinionated alert rules, no inbound ports.

#Getting started

Install the agent, see your first alert, route notifications where your team lives, and understand the billing model.

#What Glassmkr monitors

Glassmkr is a monitoring agent for bare metal and dedicated servers. The agent collects hardware and OS metrics every 60 seconds and pushes them to the Glassmkr dashboard, where a library of alert rules evaluates each snapshot automatically.

Hardware

IPMI sensors (temperature, fan speed, voltage, power draw), IPMI SEL event log, ECC memory errors, PSU redundancy status

Storage

SMART health and wear level, disk space and inodes, RAID array status, ZFS pool health and scrub errors, filesystem read-only detection, I/O errors and latency

Network

Interface errors and drops, link speed negotiation, bandwidth saturation, bond slave status, conntrack table usage

OS

CPU per-core utilization and iowait, load averages, RAM and swap, OOM kills, clock drift, NTP sync, systemd failed units, file descriptor exhaustion, unexpected reboots

Security

SSH root password authentication, firewall status, pending security updates, kernel vulnerabilities, reboot required flag, unattended upgrades configuration

The full rule library evaluates on every collection cycle. All rules are included on every plan, including Free.

#Installation

Docker (recommended)

# 1. Create config directory
sudo mkdir -p /etc/glassmkr

# 2. Add your collector key (get it from glassmkr.com after signing up)
sudo tee /etc/glassmkr/crucible.yaml << 'EOF'
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
interval: 60
EOF

# 3. Download and start
curl -O https://raw.githubusercontent.com/glassmkr/crucible/main/docker-compose.yml
docker compose up -d

# 4. Verify
docker compose logs glassmkr-crucible

The container runs with --privileged and network_mode: host for IPMI, SMART, and bond monitoring. See Trust for details.

npm alternative

npm install -g @glassmkr/crucible
sudo glassmkr-crucible --config /etc/glassmkr/crucible.yaml

Requires Node.js 24+. The system packages smartmontools, ipmitool, and dmidecode are needed for full hardware monitoring.

Your server appears in the dashboard within 1 minute.

Sign up at glassmkr.com to get your collector key.

#First alert

After install, the agent pushes its first snapshot within 1 minute and the dashboard evaluates the rule library against it. On bare-metal hardware, something is usually mildly degraded: a disk that's been running for years, kernel updates pending, a fan running near its threshold. Expect 0-3 alerts on first contact, most P3 or warning-level.

Each alert opens to a detail page with remediation guidance (the command to run, what to verify after) and a Furnace assistant annotation. See Furnace for how that works.

#Notification channels

Email

Alerts delivered from [email protected].

Telegram

Bot messages with alert details and direct links.

Slack

Block Kit formatted messages with severity colors.

Discord

Rich embeds posted to a channel via incoming webhook.

PagerDuty

Events API v2; maps priority to PagerDuty severity.

Webhooks

POST JSON to any URL you configure.

All six channels are on every plan, Free or Pro, with no cap on how many you configure. Every channel supports per-priority filtering (P1 to P4). Major-version agent update notifications go to every configured channel; patch notifications are opt-in. Routing details are under Multi-channel alerting.

#Pricing and billing

Free

  • Up to 3 nodes
  • All 62 alert rules
  • All six notification channels, unlimited
  • Full read+write API
  • Predictive trend warnings
  • 7 days data retention
  • One trial AI analysis per server
  • No credit card required

Pro $3/node/month

Everything in Free. Pro lifts exactly three limits (node count, data retention, AI analysis) and adds email support:

  • More than 3 nodes (first 3 free, then $3/node/month)
  • 90 days data retention (audit log 365 days)
  • Unlimited AI health analysis
  • Email support
Example totals: 3 nodes $0/mo, 10 nodes $21/mo, 25 nodes $66/mo, 50 nodes $141/mo.
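The per-node totals above follow from "first 3 free, then $3/node/month". A minimal sketch of that arithmetic, where the function and constant names are illustrative rather than part of any Glassmkr API:

```python
FREE_NODES = 3          # first 3 nodes are free on every plan
PRICE_PER_NODE_USD = 3  # Pro price for each node beyond the free ones

def monthly_cost_usd(nodes: int) -> int:
    """Return the monthly bill in USD for a fleet of `nodes` servers."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    billable = max(0, nodes - FREE_NODES)
    return billable * PRICE_PER_NODE_USD

for n in (3, 10, 25, 50):
    print(f"{n} nodes: ${monthly_cost_usd(n)}/mo")
```

This covers full months only; mid-month changes are prorated, as described in the FAQ.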

Enterprise

Custom pricing and configuration. Contact [email protected].

#Concepts

Glassmkr's vocabulary: what a node is, what an alert rule is, what Furnace does, what trend warnings are, and how notifications get routed.

#Nodes and servers

One Linux server is one node. Billing, alerts, and notifications all attach to nodes. A server that stops reporting for over 2 minutes triggers the server_unreachable alert and stays in your fleet count until you remove it from the dashboard. See server_unreachable.

#Alert rules

An alert rule is an evaluator function that runs on every snapshot. The shipped rules cover hardware faults, capacity, security posture, and service health. Each rule emits structured evidence (the metric values that caused the fire) and resolves automatically when the underlying condition clears. The full reference catalog is at /docs/rules. Per-alert remediation guidance is rendered inside the dashboard on the alert detail page.
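To make "an evaluator function that runs on every snapshot" concrete, here is a minimal sketch of what such a rule could look like. It uses the documented disk_space_high defaults (85% warning, 95% critical); the snapshot field name disk_used_percent and the Alert shape are assumptions for illustration, not the real snapshot schema or engine code.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    rule: str
    severity: str
    evidence: dict  # the metric values that caused the rule to fire

def disk_space_high(snapshot, warning=85.0, critical=95.0):
    """Evaluate one snapshot and return an Alert per mount over a threshold."""
    alerts = []
    # "disk_used_percent" is a hypothetical field: {mount point: percent used}.
    for mount, used in sorted(snapshot.get("disk_used_percent", {}).items()):
        if used >= critical:
            severity, threshold = "critical", critical
        elif used >= warning:
            severity, threshold = "warning", warning
        else:
            continue  # condition is clear, so nothing fires for this mount
        evidence = {"mount": mount, "used_percent": used, "threshold": threshold}
        alerts.append(Alert("disk_space_high", severity, evidence))
    return alerts

snapshot = {"disk_used_percent": {"/": 52.0, "/data": 91.3, "/var": 97.0}}
for alert in disk_space_high(snapshot):
    print(alert.severity, alert.evidence)
```

Because the function is re-run on every snapshot, a mount that drops back under the threshold simply stops producing an alert, which is one way to picture automatic resolution.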

#Furnace (AI assistance)

Furnace is the AI assistant that annotates alert detail pages. It reads the alert's evidence and the rule's remediation steps and produces context-specific notes. It runs on a self-hosted Gemma 4 26B model on a single NVIDIA L4 GPU in Amsterdam; no third-party LLM APIs (no OpenAI, no Anthropic, no Google). Your alert data does not leave EU jurisdiction.

Furnace is conservative by design: it hedges interpretive claims, doesn't autocomplete shell commands, and says "I don't know" when it doesn't. Full philosophy in the Furnace introduction blog post.

#Trend warnings

A predictive early-warning feature, on every plan, that surfaces warnings when metrics show degradation trends before the alert thresholds fire. Separate from the snapshot-driven alert rules (which fire on current state).

Trend warnings run in a 6-hour batch process on Glassmkr's backend. They analyze up to 90 days of metric history per server, apply correlation rules that require two independent signals, and optionally consult an internal ranking model trained on Backblaze's public drive failure dataset. The result is a small number of high-confidence warnings per server, not noisy anomaly detection on every metric.

What gets monitored

Signal | What we watch for | Example warning
SMART reallocated sectors (5) | Growth over 7-30 days | Drive /dev/sda SMART 5 grew from 0 to 14 over 30 days
SMART reported uncorrectable (187) | Any appearance above zero | Drive /dev/sda: SMART 187 is now 3
SMART command timeouts (188) | Repeated growth | Drive /dev/sda: 4 command timeouts in last 7 days
SMART pending/offline uncorrectable (197, 198) | Step change from zero | Drive /dev/sda: pending sectors appeared 3 days ago
SMART high fly writes (189) | Burst patterns | Drive /dev/sda: 8 high fly write events in 24h
NVMe critical_warning | Any bit set | NVMe /dev/nvme0n1: critical_warning bit 2 set (reliability degraded)
NVMe available_spare | Approaching or below threshold | NVMe /dev/nvme0n1: available_spare now 12%, threshold is 10%
NVMe media_errors | Growing rapidly | NVMe /dev/nvme0n1: media_errors increased from 0 to 4 in 7 days
NVMe p99 latency (planned, v1.1) | Sustained drift without IO volume change | NVMe /dev/nvme0n1: p99 read latency sustained 2.3x above baseline
Disk space per partition | Projected fill via linear regression | /data partition projected to hit 85% in 12 days at current growth
ECC correctable errors (planned, v1.1) | Bursts per DIMM location | DIMM CPU1_DIMM_A2: 15 correctable errors in 24h
PSU rail voltages | Drift 2-3% from nominal | PSU 1: 12V rail at 11.62V (drift 3.2%)
Fan RPM | Decline paired with temp rise in same zone | Fan SYS_FAN2: RPM dropped 25% and chassis zone temp rising
NIC errors | CRC/frame errors (TCP retransmit correlation planned, v1.1) | eth0: 47 CRC errors
ZFS checksum/read errors | Paired with matching SMART signal on same device | Drive /dev/sda: ZFS reported 7 checksum errors corroborating SMART 5 growth
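The disk-space row in the table above projects a fill date by linear regression. A minimal sketch of that idea using ordinary least squares over daily samples; the function name, the sample format, and the 85% target are illustrative assumptions, and the production batch job may weight or filter data differently.

```python
def days_until_threshold(samples, threshold=85.0):
    """Fit used% = slope * day + intercept and project the threshold crossing.

    samples: list of (day, used_percent) pairs, oldest first.
    Returns days from the last sample to the crossing, or None when there is
    too little data or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # no growth, so no projected fill
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])

# Seven daily samples growing 0.5 points per day from 70%.
week = [(day, 70.0 + 0.5 * day) for day in range(7)]
print(days_until_threshold(week))
```

A 7-day window is enough for this fit, which is consistent with disk space projection behaving the same on Free retention.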

Severity tiers

  • Imminent (red): projected failure within 7 days, or critical pattern (SMART 187 appearance, NVMe critical_warning). Push notification immediately.
  • Soon (orange): projected within 30 days, or high-severity evidence. Push notification once.
  • Scheduled (blue): projected within 90 days, or medium-severity. Dashboard only.
  • Watch (grey): low confidence or more than 90 days out. Dashboard collapsed.
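The tier boundaries above can be sketched as a small classifier. This is an illustration under stated assumptions: it covers only the projected-days and critical-pattern criteria (not the "high-severity" and "medium-severity" evidence paths), the precedence of a critical pattern over low confidence is a guess, and all names are hypothetical.

```python
NOTIFICATION = {
    "imminent": "push immediately",
    "soon": "push once",
    "scheduled": "dashboard only",
    "watch": "dashboard, collapsed",
}

def trend_tier(days_to_failure, critical_pattern=False, low_confidence=False):
    """Map a projection to a severity tier name."""
    if critical_pattern:
        # e.g. SMART 187 appearance or an NVMe critical_warning bit
        return "imminent"
    if low_confidence or days_to_failure is None:
        return "watch"
    if days_to_failure <= 7:
        return "imminent"
    if days_to_failure <= 30:
        return "soon"
    if days_to_failure <= 90:
        return "scheduled"
    return "watch"

for days in (3, 12, 60, 200):
    tier = trend_tier(days)
    print(days, tier, "->", NOTIFICATION[tier])
```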

Correlation requirement

Where two signals exist on the same device, correlation is required before a notification. Several v1 categories fire on a single high-confidence signal because the underlying source is itself authoritative (a SMART step-change from zero, NVMe critical_warning bits, a PSU rail at 11.62V). The asymmetry is deliberate.

Multi-signal categories shipped in v1:

Signals (on the same device) | Diagnosis
Drive SMART signal and ZFS errors | Storage device degradation
Fan RPM decline and chassis temp rise (same zone) | Cooling failure

Multi-signal categories planned for v1.1 once the underlying collector work lands:

Signals (on the same device) | Diagnosis
NVMe health signal and p99 latency inflation | NVMe pre-failure (fail-slow)
NIC CRC errors and TCP retransmits (same interface) | NIC hardware failure
ECC burst and MCE entries (same DIMM) | DIMM pre-failure

This approach trades some recall (failures that only show one signal) for high precision. Google's FAST 2007 study (Pinheiro et al., "Failure Trends in a Large Disk Drive Population") found that over 56% of failed drives showed no count in any of the four strongest SMART signals, and over 36% showed no SMART error signal at all, so trend warnings are a meaningful reduction of surprise failures, not a guarantee that every failure becomes predictable.
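The correlation requirement described in this section can be sketched as a gate that notifies either on one authoritative signal or on a shipped pair seen on the same device. The signal identifiers below are hypothetical labels invented for this example; only the pairings and diagnoses come from the v1 tables above.

```python
# Single signals treated as authoritative on their own (hypothetical names).
AUTHORITATIVE_SIGNALS = {
    "smart_step_change_from_zero",
    "nvme_critical_warning",
    "psu_rail_drift",
}

# v1 multi-signal categories: both signals must be present on one device.
CORRELATED_PAIRS = {
    frozenset({"smart_signal", "zfs_errors"}): "Storage device degradation",
    frozenset({"fan_rpm_decline", "chassis_temp_rise"}): "Cooling failure",
}

def notification_reason(signals):
    """Return why a device should notify, or None to stay silent."""
    for signal in sorted(signals):
        if signal in AUTHORITATIVE_SIGNALS:
            return f"single authoritative signal: {signal}"
    for pair, diagnosis in CORRELATED_PAIRS.items():
        if pair <= signals:  # both halves of the pair are present
            return diagnosis
    return None  # one uncorroborated signal is not enough

print(notification_reason({"smart_signal", "zfs_errors"}))
print(notification_reason({"fan_rpm_decline"}))
```

The second call returns None, which is the recall being traded away: a lone fan decline stays on the dashboard without a notification.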

What we explicitly don't do

  • No general-purpose anomaly detection on every metric. Netdata's own docs demote their anomaly ML to "investigation aid, not alert source." We agree.
  • No per-customer model training. With 3-50 servers per account, customer-specific models are base-rate-dominated. We use global thresholds plus an offline-trained ranker on Backblaze's public dataset.
  • No LLM-based trend classification. Linear regression, CUSUM, and first-differences do this job better and cheaper. We use AI only to narrate deterministic findings in plain English.
  • No confident failure predictions. We say "likely within 7-14 days", never "will fail on Tuesday." The underlying signals carry real uncertainty and we surface it.
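Of the deterministic methods named above, CUSUM is probably the least familiar. A minimal one-sided CUSUM sketch follows; the target, slack, and threshold values are illustrative parameters chosen for the example, not Glassmkr's tuning.

```python
def cusum_upper(values, target, slack, threshold):
    """One-sided (upper) CUSUM change detector.

    Accumulates how far each sample exceeds target + slack, resetting to zero
    whenever the running sum would go negative. Returns the index of the first
    sample where the sum exceeds threshold, or None if it never does.
    """
    running = 0.0
    for index, value in enumerate(values):
        running = max(0.0, running + (value - target - slack))
        if running > threshold:
            return index
    return None

# A reading that sits near 40 and then starts drifting upward.
readings = [40, 41, 40, 39, 44, 45, 46, 47]
print(cusum_upper(readings, target=40, slack=1, threshold=8))
```

Small fluctuations around the target never accumulate, while a sustained drift does, which is why this suits slow degradation better than a fixed per-sample threshold.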

Data requirements

Trend warnings are on every plan. What differs by plan is how much metric history each signal can see, which is set by your retention window (Free keeps 7 days, Pro keeps 90):

  • The longer-horizon signals (SMART, NVMe, ECC, cooling, PSU, NIC) draw on up to 90 days of history, so they reach full sensitivity at Pro's 90-day retention.
  • Disk space projection works on 7-day data, so it runs the same on Free.
  • A server needs at least 3 days of contiguous data to receive any warnings. Freshly added servers are in an observation period.

Self-audit

The dashboard shows the feature's own track record: how many warnings were sent in the last 90 days, how many users confirmed as valuable, how many were dismissed, and how many were followed by a matching alert firing within 30 days. No other monitoring tool surfaces this, and it exists so you can audit whether trend warnings are actually earning their keep for your fleet.

#Multi-channel alerting

Each alert routes to channels based on the rules in your dashboard. Group by team, by server, by severity. Suppress during planned maintenance windows. The alerting layer is unopinionated; route alerts wherever your team already pays attention.

Channel types are listed under Notification channels. The per-priority filter (P1-P4) and per-rule mutes operate independently of channel selection.
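How the per-priority filter, per-rule mutes, and maintenance windows combine can be sketched as follows. The Channel type and function are hypothetical; the real routing rules live in the dashboard and may support more dimensions (team, server) than shown here.

```python
from dataclasses import dataclass, field

ALL_PRIORITIES = ("P1", "P2", "P3", "P4")

@dataclass
class Channel:
    name: str
    # Priorities this channel accepts; defaults to all four.
    priorities: set = field(default_factory=lambda: set(ALL_PRIORITIES))

def channels_for(priority, rule, channels, muted_rules=frozenset(), in_maintenance=False):
    """Return the names of channels that should receive this alert."""
    if in_maintenance or rule in muted_rules:
        return []  # suppressed before any channel filter is consulted
    return [c.name for c in channels if priority in c.priorities]

channels = [Channel("pagerduty", {"P1"}), Channel("slack")]
print(channels_for("P3", "disk_space_high", channels))  # slack only
print(channels_for("P1", "raid_degraded", channels))    # both channels
```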

#Alert rules

62 rules across 9 categories, tuned for bare-metal failure modes. Per-rule catalog pages at /docs/rules show the title, summary, priority, and category, plus the quick-check command + verdict prior (recoverable / investigation / vendor-side) for each rule. Per-alert remediation guidance (full FIX content: prerequisites, safe-mode diagnostic, fix command, validation, rollback, blast-radius impact) lives in the dashboard on the alert detail page. 20 of the 62 rules ship with deep FIX content; 30+ are verified end-to-end on real hardware. The summary tables below group rules by category.

Storage (8 rules)

Rule | Trigger | Severity
disk_space_high | ≥ 85% warning, ≥ 95% critical. Configurable. | Warning / Critical
disk_fill_projection | Trend warning: projected to fill within N days (cross-snapshot) | Warning
smart_failing | Reallocated/pending sectors or health != PASSED | Critical
nvme_wear_high | ≥ 85% wear warning, ≥ 95% critical. NVMe Critical Warning bits also decoded. | Warning / Critical
raid_degraded | Any degraded or failed RAID array (mdadm + hardware RAID via storcli/perccli/ssacli/arcconf) | Critical
disk_latency_high | Average latency > 100ms | Warning
disk_io_errors | I/O errors detected in dmesg (structured event match) | Critical
inode_high | ≥ 90% inodes used | Warning

ZFS (3 rules)

Rule | Trigger | Severity
zfs_pool_unhealthy | Pool state != ONLINE; severity matrix by vdev redundancy class | Warning / Critical
zfs_scrub_errors | Scrub detected errors, or pool has never been scrubbed (fresh-pool reminder) | Warning
zfs_slog_faulted | SLOG vdev faulted (write-cache reliability impact) | Critical

Filesystem (4 rules)

Rule | Trigger | Severity
filesystem_readonly | Mounted filesystem remounted read-only (kernel I/O error path) | Critical
fd_exhaustion | > 80% of system or per-process file descriptors used | Warning
lvm_thinpool_metadata_high | LVM thin-pool data or metadata > 80% used | Warning / Critical
systemd_service_failed | Any systemd unit in failed state; classified by Result code | Warning

Memory & CPU (9 rules)

Rule | Trigger | Severity
ram_high | ≥ 90% used, ≥ 95% critical. Configurable. | Warning / Critical
swap_high | > 50% swap used | Warning
oom_kills | Any OOM kill detected | Critical
cpu_high | ≥ 90% utilization, ≥ 98% critical | Warning / Critical
load_high | Load average > 1x core count warning, > 2x critical | Warning / Critical
cpu_iowait_high | ≥ 20% iowait. Configurable. | Warning
cpu_pressure_high | PSI cpu.some / cpu.full stall > threshold (kernel ≥ 4.20) | Warning
mem_pressure_high | PSI memory.some / memory.full stall > threshold | Warning
io_pressure_high | PSI io.full stall > threshold (companion to cpu_iowait_high) | Warning

Network (10 rules)

Rule | Trigger | Severity
interface_errors | Hardware errors > 0 per interval, drops > 500 | Warning
link_speed_mismatch | Interface negotiated ≥ 2x below highest advertised mode | Warning
interface_saturation | ≥ 90% of negotiated link speed sustained | Warning
bond_slave_down | A bond member interface is down | Critical
lacp_partner_lost | LACP partner state lost on a bond member | Warning
conntrack_exhaustion | > 80% of conntrack table used, or insert_failed rate spiking | Warning
listen_overflow | Listening socket backlog overflows detected | Warning
accept_backlog_or_syn_flood | Accept backlog or SYN-flood pattern (cross-snapshot) | Warning
softnet_drops | Per-CPU softnet queue drops | Warning
tcp_retrans_high | TCP retransmit rate above threshold | Warning

Hardware / BMC (7 rules)

Rule | Trigger | Severity
cpu_temperature_high | > 80°C warning, > 90°C critical | Warning / Critical
ecc_errors | Correctable > 0 warning, uncorrectable > 0 critical. EDAC + IPMI SEL sources. | Warning / Critical
psu_redundancy_loss | PSU redundancy state degraded or lost | Critical
ipmi_sel_critical | Critical SEL entries detected. Vendor parsers: Dell/Supermicro/HPE fleet-tested; Lenovo/Cisco/OpenBMC parser_quality stub. | Critical
ipmi_fan_failure | Fan speed below minimum threshold or fan failure SEL event | Critical
cmos_battery_low | CMOS / RTC battery voltage below threshold (clock drift and BIOS reset risk) | Warning
service_flapping | Cross-snapshot: same systemd unit restarting repeatedly | Warning

GPU (8 rules; NVIDIA)

Rule | Trigger | Severity
gpu_xid_critical | Critical NVIDIA XID event (e.g. XID 79 fall-off-the-bus) | Critical
gpu_thermal_critical | Temperature ≥ 90°C, or hw_thermal_slowdown / sw_thermal_slowdown active. Note: not reachable on healthy L4 cooling under synthetic load; fires on real cooling-system issues. | Critical
gpu_uncorrected_ecc | Uncorrected ECC error on GPU memory | Critical
gpu_corrected_ecc_storm | Corrected ECC errors above rate threshold | Warning
gpu_power_cap_throttling | Sustained power-cap throttling event | Warning
gpu_pcie_link_degraded | PCIe link width or generation below advertised; cross-checked against ASPM idle state | Warning
nvlink_link_down | NVLink peer link down (multi-GPU systems) | Critical
gpu_driver_drift | NVIDIA driver version drift across the fleet | Info

Time & services (4 rules)

Rule | Trigger | Severity
clock_drift | Offset > 1 second | Warning
ntp_not_synced | NTP daemon not running or clock not synced | Warning
unexpected_reboot | Server restarted unexpectedly; reboot evidence (pstore / kdump / wtmp) classifies cause | Event
server_unreachable | Server missed 2+ check-ins (server-side watchdog) | Critical

Security & patching (9 rules)

Rule | Trigger | Severity
ssh_root_password | Root login with password enabled | Warning
no_firewall | No active firewall detected | Warning
pending_security_updates | > 0 security updates pending | Info
kernel_vulnerabilities | Active kernel vulnerabilities. Severity demotes to info when kernel software mitigation is engaged ("Clear CPU buffers attempted"). | Info / Warning
kernel_needs_reboot | Kernel update requires reboot | Info
unattended_upgrades_disabled | Auto-updates not configured | Info
tls_certificate_expiring | TLS cert expiring within 30 days | Warning
weak_root_password_policy | Root password policy weak or absent | Warning
cve_critical_unpatched | Critical CVE detected as unpatched on the host's package versions | Warning

State alerts auto-resolve when the condition clears. Event alerts (unexpected_reboot) stack occurrences and have a Resolve button. Acknowledged alerts still auto-resolve.

#Operations

Day-to-day tasks: managing nodes, tuning thresholds when a rule is too sensitive or not sensitive enough, scheduling maintenance windows, acknowledging alerts, and confirming or dismissing trend warnings so the ranker learns from your fleet.

#Managing nodes

Add a node by generating a collector key in the dashboard and pasting it into /etc/glassmkr/crucible.yaml on the target server (legacy installs: /etc/glassmkr/collector.yaml; the agent reads either). The new server reports within 1 minute. Remove a node from the dashboard's Servers page; the slot is released for billing on the next proration cycle. A server that stops reporting is not auto-removed; it surfaces a server_unreachable alert instead so unintentional silence is visible.

#Tuning thresholds (config reference)

The agent's full configuration lives in /etc/glassmkr/crucible.yaml (legacy installs: /etc/glassmkr/collector.yaml; the agent reads either, and glassmkr-crucible init migrates the file in place). Most fleets only ever set collector_key. The other fields exist for hostname overrides, changing the collection interval, or disabling a module when the underlying tool isn't present.

# Required
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE

# Optional
interval: 60           # Collection interval in seconds (default: 60)
# hostname: my-server  # Override auto-detected hostname
# modules:             # Disable specific collection modules
#   ipmi: false
#   smart: false
#   zfs: false
#   security: false
server_url
The Glassmkr ingest endpoint. Always https://app.glassmkr.com for the hosted service.
collector_key
Your server's authentication token. Generated when you add a server in the dashboard. Prefixed with gmk_cru_live_ (older keys may still use the legacy col_ prefix until rotated).
interval
How often (in seconds) the agent collects and pushes a snapshot. Default is 60 seconds. Minimum is 60.
hostname
Override the auto-detected hostname. Useful when the system hostname is generic or changes between reboots.
modules
Disable individual collection modules. Set any module to false to skip it. The agent will not attempt to read sensors for disabled modules.
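The field rules above (two required fields, a 60-second default and minimum interval, modules enabled unless set to false) can be sketched as a validator over an already-parsed config. This is not the agent's code: the function names are hypothetical, and rejecting a too-small interval rather than clamping it is an assumption.

```python
DEFAULT_INTERVAL = 60
MIN_INTERVAL = 60
KEY_PREFIXES = ("gmk_cru_live_", "col_")  # current and legacy prefixes

def normalize_config(raw):
    """Validate a parsed crucible.yaml mapping and fill in defaults."""
    for name in ("server_url", "collector_key"):
        if not raw.get(name):
            raise ValueError(f"missing required field: {name}")
    if not raw["collector_key"].startswith(KEY_PREFIXES):
        raise ValueError("collector_key has an unexpected prefix")
    interval = raw.get("interval", DEFAULT_INTERVAL)
    if isinstance(interval, bool) or not isinstance(interval, int) or interval < MIN_INTERVAL:
        raise ValueError(f"interval must be an integer >= {MIN_INTERVAL}")
    return {
        "server_url": raw["server_url"],
        "collector_key": raw["collector_key"],
        "interval": interval,
        "hostname": raw.get("hostname"),  # None means auto-detect
        "modules": dict(raw.get("modules") or {}),
    }

def module_enabled(config, name):
    """Modules are on unless explicitly set to false."""
    return config["modules"].get(name, True)

config = normalize_config({
    "server_url": "https://app.glassmkr.com",
    "collector_key": "gmk_cru_live_YOUR_KEY_HERE",
    "modules": {"ipmi": False},
})
print(config["interval"], module_enabled(config, "ipmi"), module_enabled(config, "smart"))
```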

Per-rule numeric thresholds (the percentage at which disk_space_high fires, the iowait floor for cpu_iowait_high, and similar) live in the dashboard, not in the agent's YAML. Open a rule's settings page to adjust them. Defaults are chosen for bare-metal fleets; tune up for noisy storage or down for capacity-tight servers.

#Maintenance windows

Schedule a planned-reboot or service window from a server's detail page. Alerts that fire during the window are suppressed at the notification layer (they still appear in the dashboard for audit). The unexpected_reboot event is treated as expected during a planned-reboot window. Windows accept a duration or an explicit end time.

#Acknowledging alerts

Click Acknowledge on an alert detail page to silence further notifications for that alert while you are working on it. State alerts still auto-resolve when the condition clears. Event alerts (like unexpected_reboot) stack subsequent occurrences under the acknowledged alert and expose a Resolve button when you are done.

#Trend warning feedback

Each trend warning has Confirm and Dismiss buttons. Confirm marks the warning as a true positive (typically because the underlying part was replaced); dismiss marks it as a false positive. The dashboard surfaces the feature's running track record under the trend warnings self-audit (see Trend warnings). Feedback also flows back into the ranker as labelled training signal across the fleet.

#API

Glassmkr exposes a small REST API for server, alert, and notification-channel management. Full machine-readable corpus at /llms-full.txt; LLM-first index at /llms.txt.

#Authentication

API calls authenticate with an account token. Generate one in the dashboard under Account > API tokens; tokens are prefixed gmk_acct_live_. Pass it as Authorization: Bearer <token>. Tokens carry the full permissions of the account; rotate any token that has been exposed.

curl -H "Authorization: Bearer gmk_acct_live_YOUR_TOKEN" \
  https://app.glassmkr.com/api/v1/servers

#Servers

GET /api/v1/servers lists all servers in the account with last-seen timestamps, hardware identifiers, and the most recent snapshot's headline metrics. GET /api/v1/servers/{id} returns a single server's full latest snapshot. DELETE /api/v1/servers/{id} removes a server. Adding a server is done from the dashboard so that the collector key can be issued and displayed once.

#Alerts

GET /api/v1/alerts returns currently open alerts by default. Filter with ?status=open|acknowledged|resolved and ?server_id=<id>. POST /api/v1/alerts/{id}/acknowledge and POST /api/v1/alerts/{id}/resolve are the two mutating endpoints. Webhook deliveries (configured per-channel) carry the same payload shape.
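For scripted use, the alert endpoints above can be called with the Python standard library. A minimal sketch: the helper names are illustrative, the response is assumed to be JSON, and its schema is not shown here (see /llms-full.txt), so the code does not inspect individual fields.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://app.glassmkr.com/api/v1"

def build_request(path, token, params=None, method="GET"):
    """Build an authenticated request for an API path such as "/alerts"."""
    url = f"{BASE_URL}{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, method=method, headers=headers)

def fetch_json(request, timeout=10):
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Example usage (needs a real token and alert id, so it is left commented out):
# token = "gmk_acct_live_YOUR_TOKEN"
# open_alerts = fetch_json(build_request("/alerts", token, {"status": "open"}))
# fetch_json(build_request("/alerts/ALERT_ID/acknowledge", token, method="POST"))
```

Keeping request construction separate from sending makes it easy to log or unit-test the exact URL and headers without touching the network.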

#Notification channels

GET /api/v1/channels lists configured channels; POST /api/v1/channels creates one; PATCH /api/v1/channels/{id} updates routing rules. The full schema, including per-priority filter syntax and webhook payload shape, is documented in /llms-full.txt.

#Reference

System requirements, architecture, a vocabulary glossary, the per-metric definitions you'll see in alert evidence, and Crucible's release history.

#System requirements

Operating system
Linux with systemd. Tested on Debian 11/12, Ubuntu 20.04 to 24.04, Rocky 8/9, AlmaLinux 8/9.
Runtime
Docker (recommended) or Node.js 24+.
Privileges
Root access required for IPMI, SMART, and /proc system reads.
Network
Outbound HTTPS on port 443 to app.glassmkr.com. No inbound ports needed.
Resource usage
Around 110 MB resident memory (RSS), under 1% of host RAM on every host we tested. Measured on Crucible 0.13.6 across all 10 validation hosts at steady state: median 108 MB, range 81 to 116 MB (varies mainly with the bundled Node version). Effectively 0% CPU at the default 60-second snapshot interval. Random-read I/O delta under 1.5% under fio saturation.
Optional packages (npm install only)
smartmontools, ipmitool, dmidecode for full hardware monitoring. Modules whose backing package is missing are silently skipped.

#Architecture

Your server: the agent reads /proc, /sys, smartctl, and ipmitool (CPU, RAM, disk, SMART, IPMI, network, ZFS, security)
  -> HTTPS / TLS, every 60s ->
Glassmkr: dashboard, rule library, notifications, AI analysis (PostgreSQL + ClickHouse on EU dedicated servers)

  • The agent is MIT open source: github.com/glassmkr/crucible
  • Agent pushes outbound only, opens no inbound ports
  • Snapshots contain hardware metrics only, no user data
  • Dashboard runs on EU dedicated servers, no cloud providers
  • AI analysis runs on a self-hosted GPU, no external AI providers

#Glossary

Agent / Crucible
The collection process that runs on each monitored server and pushes snapshots to the dashboard. MIT-licensed at github.com/glassmkr/crucible.
Snapshot
One push from the agent. Contains hardware sensor readings, OS counters, SMART data, and software state at a point in time.
Alert rule
An evaluator function that runs against every snapshot and fires when the rule's condition is met. The full catalog is at /docs/rules.
Trend warning
A feature on every plan that surfaces metric degradation trends before a threshold-based rule would fire.
Furnace
The self-hosted AI assistant that annotates alert detail pages. Gemma 4 26B on an NVIDIA L4 in Amsterdam.
Dashboard
The SaaS surface at app.glassmkr.com. Hosts alert evaluation, notification routing, billing, the API, and Furnace.

#Metric definitions

Alert evidence references metric names that map to specific source files and counters. Headline definitions:

cpu.utilization_percent
From /proc/stat, computed as (1 - idle_delta / total_delta) over the collection interval. Excludes iowait.
cpu.iowait_percent
From /proc/stat, the iowait counter delta over total delta.
memory.used_percent
From /proc/meminfo, computed as (MemTotal - MemAvailable) / MemTotal.
load.avg_1m / avg_5m / avg_15m
From /proc/loadavg.
disk.used_percent
Per-mount from statvfs(). Excluded mounts (tmpfs, snap squashfs) are not collected.
smart.attr.<id>
Vendor-attribute values from smartctl -A, keyed by attribute number (5 = reallocated_sector_ct, 187 = reported_uncorrect, etc.).
ipmi.sensor.<name>
Numeric readings from ipmitool sdr elist with status flags. Includes fan RPM, voltage rails, temperature zones, PSU watts.

The complete metric inventory is in /llms-full.txt.
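The CPU and memory definitions above can be reproduced from two /proc/stat samples and one /proc/meminfo read. A minimal sketch, not the agent's implementation: to stay consistent with "Excludes iowait", it assumes iowait time is counted together with idle, and it sums the user through steal columns as the total. Both choices are assumptions.

```python
def parse_cpu_line(line):
    """Return the first 8 counters of the aggregate "cpu" line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    return [int(value) for value in line.split()[1:9]]

def cpu_percentages(prev_line, curr_line):
    """Return (utilization_percent, iowait_percent) between two samples."""
    prev, curr = parse_cpu_line(prev_line), parse_cpu_line(curr_line)
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        return None  # no time elapsed or counters reset
    idle, iowait = deltas[3], deltas[4]
    # Assumption: iowait is grouped with idle so utilization excludes it.
    utilization = 100.0 * (1 - (idle + iowait) / total)
    return utilization, 100.0 * iowait / total

def memory_used_percent(meminfo_text):
    """Compute (MemTotal - MemAvailable) / MemTotal as a percentage."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are in kB
    total, available = fields["MemTotal"], fields["MemAvailable"]
    return 100.0 * (total - available) / total

# Synthetic samples taken one collection interval apart.
before = "cpu 100 0 100 700 100 0 0 0 0 0"
after = "cpu 200 0 150 1400 150 0 0 0 0 0"
print(cpu_percentages(before, after))
print(memory_used_percent("MemTotal: 16000000 kB\nMemAvailable: 4000000 kB\n"))
```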

#Release history

Crucible release notes are tagged on GitHub: github.com/glassmkr/crucible/releases. The npm dist-tag latest always points at the version recommended for new installs; the dashboard's install snippet pulls from it. Major version notifications go to all configured channels by default; patch notifications are opt-in.

#Troubleshooting

Common failure modes during install and operation, and what to check first.

#Agent not reporting

If the server stops appearing in the dashboard, check in this order:

  1. docker compose ps (or systemctl status glassmkr-crucible) confirms the process is running.
  2. docker compose logs --tail=100 glassmkr-crucible shows the most recent push attempt. Look for HTTP status codes other than 200.
  3. Outbound connectivity to app.glassmkr.com:443 from the host: curl -I https://app.glassmkr.com.
  4. The collector_key in /etc/glassmkr/crucible.yaml (legacy installs: /etc/glassmkr/collector.yaml) matches the key shown for the server in the dashboard. A rotated key invalidates the old one.

#Alerts firing too often

If a rule's threshold doesn't suit your fleet (a storage server that legitimately runs at 92% disk consistently, a database under sustained 85% memory pressure by design), adjust the threshold on the rule's settings page. See Tuning thresholds. If the noise is from a single host with known degraded hardware, an acknowledge plus a planned-maintenance window is usually a better fit than disabling the rule fleet-wide.

#Notifications not arriving

An alert that appears in the dashboard but never reaches a notification channel almost always points at the channel configuration:

  • Email: check the spam folder; [email protected] sets SPF, DKIM, and DMARC, but corporate filters sometimes still hold first contact.
  • Telegram: confirm the bot is still a member of the chat and that the chat ID matches the value stored in the channel.
  • Slack: rotate the incoming-webhook URL if it has been revoked, and verify the channel still exists.
  • Webhooks: open the channel's delivery history to see HTTP response codes from your endpoint. 4xx and 5xx responses are retried with backoff but not indefinitely.
  • Per-priority filtering: by design, a P3 alert is not delivered to a P1-only channel.

#Installation issues

The docker install path is the most predictable. If you are on the npm path, the agent silently skips modules whose backing tool is missing; install smartmontools, ipmitool, and dmidecode to enable the full hardware path. If IPMI sensors don't appear after install, run ipmitool sdr as root: if that errors, the BMC is unreachable from the host and Glassmkr cannot help until that is fixed. For anything else, mail [email protected] with the most recent 100 lines of agent logs.

#FAQ

Do I need to open any inbound ports?

No. The agent initiates all connections outbound over HTTPS (port 443). Your firewall rules do not need to change.

Does the agent work without IPMI?

Yes. If ipmitool is not installed or the BMC is not reachable, the IPMI module is silently skipped. All other monitoring continues normally.

What happens if connectivity is lost?

The server_unreachable rule fires after the server misses 2 consecutive check-ins, about 2 minutes at the default 60-second interval. When connectivity resumes, the agent continues pushing snapshots.

Can I self-host the dashboard?

No. The agent is MIT-licensed and fully open source, but the dashboard and alert evaluation engine are SaaS-only.

How does pricing work mid-month?

Proration. Add a server mid-month and you are charged proportionally for the remaining days. Remove a server and the next bill reflects the change.
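As a rough illustration of "charged proportionally for the remaining days", the sketch below prorates the $3 node price over a calendar month. The exact billing rules are not spelled out here, so treat every detail as an assumption: that the cycle is the calendar month, that the day the node is added counts as billable, and that amounts round to cents.

```python
import calendar
from datetime import date

def prorated_charge_usd(added_on, monthly_price=3.0):
    """Estimate the first partial-month charge for one billable node."""
    days_in_month = calendar.monthrange(added_on.year, added_on.month)[1]
    # Assumption: the day the node is added is counted as a billable day.
    remaining_days = days_in_month - added_on.day + 1
    return round(monthly_price * remaining_days / days_in_month, 2)

print(prorated_charge_usd(date(2025, 6, 16)))  # half of a 30-day month
```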

Is my data stored in the EU?

Yes. All infrastructure, including the database servers and AI GPU, runs on dedicated servers in EU data centers.