Troubleshooting
Common issues with the Crucible agent and the Glassmkr Dashboard, with step-by-step solutions.
#Topic pages
- IPMI: how Crucible detects IPMI, why "Not detected" can be correct behavior, using glassmkr-crucible doctor ipmi, per-vendor notes.
#Crucible service fails to start
Symptom: systemctl status glassmkr-crucible shows failed or inactive (dead).
- Check the service logs:
    journalctl -u glassmkr-crucible --no-pager -n 50
- If you see a YAML parse error, re-run the init wizard with the same key to rewrite the config from scratch:
    sudo glassmkr-crucible init --api-key <your_collector_key>
  The wizard validates the key against the Dashboard before writing the config, so a typo surfaces immediately. Common YAML mistakes include tabs instead of spaces, missing quotes around strings with special characters, and incorrect indentation.
- If you see permission denied, ensure the configuration file is readable:
    ls -la /etc/glassmkr/collector.yaml
  The file should be owned by root with mode 0600.
- If you see bind: address already in use, another instance may be running:
    pgrep -a glassmkr-crucible
  Kill the stale process and try again.
#Server shows "offline" in the dashboard
Symptom: The server card shows a gray status indicator and "last seen" is more than 2 minutes ago (the agent pushes every 60 seconds by default; the server_unreachable rule fires after 2 missed check-ins).
- Check that Crucible is running:
    systemctl status glassmkr-crucible
- Check network connectivity to the API:
    curl -s -o /dev/null -w "%{http_code}" https://app.glassmkr.com/api/v1/health
  You should get 200. If not, check DNS resolution, firewall rules, and proxy settings.
- Check whether the collector key is valid:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
  If you see auth error: 401, rotate the key in the Dashboard and update /etc/glassmkr/collector.yaml.
- Check for network-level blocks:
    nc -zv app.glassmkr.com 443
- If you are behind a proxy, configure it in collector.yaml:
    proxy:
      https: http://proxy.internal:3128
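The health check prints only a status code, so it helps to map that code to a next step. This is a rough sketch with a hypothetical helper name (interpret_health_code). It relies on two general facts: curl prints 000 for %{http_code} when it received no HTTP response at all, and 407 is the standard "Proxy Authentication Required" status. Everything else is lumped into a catch-all.

```shell
# Map the HTTP status printed by the curl health check to a next step.
interpret_health_code() {
  case "$1" in
    200) echo "ok: API reachable" ;;
    000) echo "no response: check DNS resolution, firewall rules, and proxy settings" ;;
    407) echo "proxy: the proxy requires authentication" ;;
    *)   echo "unexpected status $1: check proxy settings" ;;
  esac
}

# Usage (needs network access):
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://app.glassmkr.com/api/v1/health)
#   interpret_health_code "$code"
```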
#Metrics are delayed or missing
Symptom: The dashboard shows gaps in charts or data arrives minutes late.
- Check the agent's push timing:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
  The "Last push" value should be close to the configured interval (default 60 seconds).
- If pushes are slow, check the agent log for timeout errors:
    grep -i "timeout\|retry" /var/log/glassmkr/crucible.log | tail -20
- If the server's clock is significantly off, snapshots may be dropped. Verify NTP is working:
    timedatectl status
  If not synchronized:
    sudo timedatectl set-ntp true
- If specific collectors are slow (e.g., SMART queries on many disks), they can delay the entire push. Inspect collector timing:
    sudo journalctl -u glassmkr-crucible -f
  Consider increasing the interval or disabling slow collectors.
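Two of the checks above lend themselves to scripting. The sketch below uses hypothetical helper names (count_push_errors, clock_sync_hint). It counts timeout and retry lines with an extended regex, because the \| alternation in a basic regex is a GNU extension, and it interprets the NTPSynchronized property that timedatectl show reports on systemd hosts.

```shell
# Count log lines that mention a timeout or retry (case-insensitive).
# grep -c exits non-zero when the count is 0, so the status is masked.
count_push_errors() {
  grep -ciE 'timeout|retry' "$1" || true
}

# Interpret the value of timedatectl's NTPSynchronized property.
clock_sync_hint() {
  case "$1" in
    yes) echo "clock synchronized" ;;
    no)  echo "clock not synchronized: run sudo timedatectl set-ntp true" ;;
    *)   echo "unknown sync state: inspect timedatectl status" ;;
  esac
}

# Usage on a systemd host:
#   count_push_errors /var/log/glassmkr/crucible.log
#   clock_sync_hint "$(timedatectl show -p NTPSynchronized --value)"
```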
#SMART data is not appearing
Symptom: The Disk tab in the dashboard shows no SMART information.
- Ensure smartmontools is installed:
    # Debian / Ubuntu
    sudo apt install smartmontools
    # RHEL / Rocky / Alma
    sudo dnf install smartmontools
- Verify that smartctl can read your drives:
    sudo smartctl -a /dev/sda
  If this fails with a permission error, Crucible's glassmkr service user needs read access (the default install handles this via udev rules).
- For hardware RAID controllers, drives behind the controller are not visible to smartctl without the -d flag:
    sudo smartctl -a /dev/sda -d megaraid,0
- Verify the SMART collector is enabled:
    collectors:
      smart:
        enabled: true
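If you are unsure which -d value a drive needs, smartctl --scan lists each device together with the type it detected. The helper below is an illustrative sketch (scan_to_commands is a hypothetical name) that turns that listing into ready-to-run commands. It assumes the usual scan line layout of device, -d, type, then a trailing comment.

```shell
# Turn `smartctl --scan` output into smartctl commands, one per device.
# Expected line layout: /dev/sda -d scsi # /dev/sda, SCSI device
scan_to_commands() {
  while read -r dev flag type _rest; do
    # Skip blank lines and anything that does not follow the layout.
    [ "$flag" = "-d" ] || continue
    echo "smartctl -a $dev -d $type"
  done
}

# Usage:
#   sudo smartctl --scan | scan_to_commands
```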
#IPMI, thermal, or fan data is missing
Symptom: The Hardware tab shows no temperature, fan, or PSU data.
- Install lm-sensors for hwmon data:
    # Debian / Ubuntu
    sudo apt install lm-sensors
    sudo sensors-detect --auto
- For IPMI data, install ipmitool and verify it works:
    sudo apt install ipmitool
    sudo ipmitool sdr list
- Run the IPMI self-diagnostic:
    sudo glassmkr-crucible doctor ipmi
  See the IPMI troubleshooting page for the full per-reason fix guide.
- If IPMI is not available (common on consumer hardware, cloud VMs without passthrough, laptops, Raspberry Pi), Crucible reads thermal data from hwmon directly.
- Confirm the thermal collector is not disabled:
    collectors:
      thermal:
        enabled: true
        source: auto
#ZFS module not loaded
Symptom: the Storage tab shows no ZFS pools even though zpool list works on the host, or the zfs_* rules never fire.
- Check that the ZFS kernel module is loaded:
    lsmod | grep zfs
  On many distributions the module is loaded on-demand by the first zpool or zfs call. If Crucible starts before that happens, it sees no ZFS surface.
- Force-load the module at boot:
    echo zfs | sudo tee /etc/modules-load.d/zfs.conf
    sudo systemctl restart glassmkr-crucible
- If lsmod | grep zfs shows nothing and you expected ZFS, install the package set for your distribution (zfsutils-linux on Debian/Ubuntu, zfs on Rocky/Alma with EPEL).
- If you have a kernel update pending, ZFS DKMS sometimes lags behind the running kernel; reboot or rebuild the module against the new kernel before assuming Crucible is at fault.
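Entries in /etc/modules-load.d are read at boot, so on a running host the module may still need to be loaded by hand with modprobe before restarting Crucible. The sketch below uses a hypothetical helper, module_loaded, that performs an exact-name check against /proc/modules. This avoids the partial matches a plain grep zfs can produce.

```shell
# Succeed if the named module appears in a /proc/modules-style listing.
# The optional second argument (default /proc/modules) exists for testing.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage: load ZFS now if it is missing, then restart the agent.
#   module_loaded zfs || sudo modprobe zfs
#   sudo systemctl restart glassmkr-crucible
```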
#GPU tier-1 (nvidia-smi) unavailable
Symptom: a server with NVIDIA GPUs reports no GPU data even though nvidia-smi works interactively.
Crucible's GPU collector probes three tiers in order: nvidia-smi (most common), DCGM exporter (preferred when present), and Redfish OEM stub (BMC-side, vendor-dependent). Validated on L4, A4000, and A16 in the validation fleet.
- Confirm nvidia-smi is on the PATH that systemd sees:
    sudo systemd-run --pty --uid=glassmkr nvidia-smi
  Some distributions install nvidia-smi to /usr/lib/nvidia/current/ rather than /usr/bin/; the systemd unit's PATH may differ from your interactive shell.
- If the binary is found but exits non-zero, check the driver state:
    nvidia-smi --query-gpu=name,driver_version,pstate --format=csv
  A driver loaded against a different kernel than the running one will fail here.
- If DCGM is installed and you want the richer dataset, ensure the exporter is running:
    systemctl status nvidia-dcgm
- For BMC-side Redfish GPU telemetry (rare; vendor-specific OEM extension), confirm the BMC has the GPU sensor model populated:
    curl -k -u user:pass https://<bmc>/redfish/v1/Systems/1/Oem/
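To see why a binary that works interactively is invisible to the service, it can help to resolve it against a specific PATH string instead of your shell's. The following is a small sketch with a hypothetical helper, find_in_path. The PATH value in the usage comment is a typical default for system services and is an assumption; your unit may set something different.

```shell
# Print the first executable named $1 found in the colon-separated PATH string $2.
# Runs in a subshell so the IFS and globbing changes do not leak to the caller.
find_in_path() (
  set -f      # no filename expansion while splitting the PATH string
  IFS=:
  for dir in $2; do
    candidate="${dir:-.}/$1"
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      exit 0
    fi
  done
  exit 1
)

# Usage (the PATH string here is an assumed service default):
#   find_in_path nvidia-smi "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin" \
#     || echo "nvidia-smi is not on the service PATH"
```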
#Telegram notifications are not arriving
Symptom: Alerts fire in the dashboard but no Telegram messages are received.
- Test the channel from the dashboard or API:
    curl -X POST https://app.glassmkr.com/api/v1/channels/CHANNEL_ID/test \
      -H "Authorization: Bearer YOUR_TOKEN"
- If the test fails with 401 Unauthorized, the bot token is invalid. Re-create the bot via BotFather or regenerate the token.
- If the test fails with 400 Bad Request: chat not found, the chat ID is wrong. Common mistakes: missing the -100 prefix for supergroups, the bot was removed from the group, the bot never received any message in the chat (send a message to the bot first).
- If the test succeeds but real alerts do not arrive, check the channel routing. Go to Settings → Alert Defaults and confirm your Telegram channel is listed.
- Check the alert cooldown. By default, Glassmkr sends one notification per active alert per hour. Acknowledged or recently-notified alerts are suppressed.
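A quick sanity check on the chat ID can catch the missing -100 prefix before you test the channel. This is a sketch with a hypothetical helper name, classify_chat_id, based on the general Telegram Bot API convention that supergroup and channel IDs start with -100, basic groups are other negative numbers, and private chats are positive.

```shell
# Classify a Telegram chat ID by its sign and prefix.
classify_chat_id() {
  case "$1" in
    ''|-|*[!0-9-]*|?*-*) echo "invalid" ;;              # empty, non-digits, or a stray hyphen
    -100[0-9]*)          echo "supergroup or channel" ;;
    -[0-9]*)             echo "basic group" ;;
    *)                   echo "private chat" ;;
  esac
}

# Usage:
#   classify_chat_id "-1001234567890"
```

If a group you know to be a supergroup is classified as a private chat or basic group, the ID was probably copied without its prefix.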
#Email notifications go to spam
Symptom: Test emails arrive in the spam folder.
- Check the spam folder and mark messages as "not spam" to train your provider.
- Add [email protected] to your contacts or safe senders list.
- If you control the recipient domain, allow Glassmkr's SPF record. Contact support for the current IP ranges.
- For better deliverability, route through a custom SMTP server in your own domain. See the Channels page for setup.
#High CPU usage by Crucible
Symptom: the Crucible process uses more than 1-2% CPU consistently.
For reference, the validation-fleet measurement on 2026-05-21 across 7 hosts shows a median RSS of 91 MB idle, ~0% CPU, and fio delta under 1.5%; RSS ranged 65 MB to 103 MB. Sustained higher usage is unusual.
- Check which collectors are running:
    sudo journalctl -u glassmkr-crucible -f
- SMART queries on many disks can be expensive. If you have more than 20 disks, narrow the device list or increase the interval:
    collectors:
      smart:
        devices:
          - /dev/sda
          - /dev/sdb
- Per-core CPU metrics on machines with 64+ cores generate a lot of data. Disable per-core reporting if you do not need it:
    collectors:
      cpu:
        per_core: false
- If the collection interval is set very low (e.g., 10 seconds), increase it:
    collectors:
      interval: 60
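To check whether a host crosses the 20-disk mark mentioned above, you can count whole-disk block devices. The helper below is an illustrative sketch (count_disks is a hypothetical name). It expects the two-column NAME TYPE listing that lsblk prints with the options shown in the usage comment.

```shell
# Count rows whose second column is "disk" in an lsblk NAME,TYPE listing.
count_disks() {
  awk '$2 == "disk" { n++ } END { print n + 0 }'
}

# Usage:
#   lsblk -dn -o NAME,TYPE | count_disks
```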
#Registration fails with "server limit reached"
Symptom: + Add Server returns an error about the server limit.
- The Free plan allows 3 servers. Pro is $3/node/month with the first 3 nodes free.
- If you have decommissioned servers still registered, delete them from the dashboard to free up slots.
- To upgrade your plan, go to Settings → Billing.
#My servers are disabled (lock icon, "no payment method on file")
Symptom: some server tiles show a lock-icon overlay and "Manage in Settings". Notifications stopped firing for those servers.
Why: on the Pro plan, servers beyond the 3-server free quota are disabled at the end of the billing period (or 30 days after account creation, whichever is later) when no payment method is on file. The first 3 servers always stay active. Disabled servers continue to ingest snapshots so historical data is preserved; they just stop firing notifications.
- Add a payment method: Settings → Billing → Add card (opens the Stripe portal).
- Restore in bulk: Settings → Disabled servers → Restore all. Restoration is instant once a card is on file.
- If you would rather drop into the free quota than pay, delete individual servers from the same screen.
Glassmkr sends warning emails before disable: when the payment method is removed, 3 days before disable, 1 day before disable, and at the moment of disable. If you do not see these, check your spam folder and confirm the account email is correct.
#Configuration changes are not taking effect
Symptom: you edited collector.yaml but Crucible still uses the old settings.
- Restart the service after any configuration change:
    sudo systemctl restart glassmkr-crucible
- Verify the running config by inspecting the agent's startup banner:
    sudo journalctl -u glassmkr-crucible --since "1 min ago" --no-pager
  The first lines after restart print the resolved interval, enabled collectors, and Dashboard URL.
- Check that you edited the correct file. The systemd unit may pin a non-default config path:
    systemctl show glassmkr-crucible -p Environment
- Environment variables override the config file. Check for any GLASSMKR_* or CRUCIBLE_* variables in the systemd unit or shell environment.
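Because environment variables take precedence, it is worth listing any that match the two prefixes. This is a minimal sketch with a hypothetical helper name, list_overrides. It filters NAME=value lines and makes no assumption about which specific variables Crucible reads.

```shell
# Keep only NAME=value lines whose name starts with GLASSMKR_ or CRUCIBLE_.
# grep exits non-zero when nothing matches, so the status is masked.
list_overrides() {
  grep -E '^(GLASSMKR|CRUCIBLE)_[A-Za-z0-9_]*=' || true
}

# Usage, shell environment:
#   env | list_overrides
# Usage, systemd unit (assumes values contain no spaces):
#   systemctl show glassmkr-crucible -p Environment --value | tr ' ' '\n' | list_overrides
```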
#Per-core CPU data is not showing
Symptom: the per-core CPU chart does not appear, or per-core data is missing from AI analysis.
- Per-core monitoring requires Crucible 0.3.0 or later. Check:
    glassmkr-crucible --version
- Enable per-core in the config:
    collectors:
      cpu:
        per_core: true
- Restart Crucible:
    sudo systemctl restart glassmkr-crucible
- Wait for the next collection interval (default 60 seconds) for data to appear.
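Comparing version strings by eye is error-prone (0.13.3 is newer than 0.3.0 even though it sorts earlier as text). The sketch below uses a hypothetical helper, version_ge, that compares dotted numeric versions field by field. The exact output format of glassmkr-crucible --version is not assumed here, so pass the bare version number yourself.

```shell
# Succeed if dotted version $1 is greater than or equal to $2.
# Fields must be plain numbers; strip any leading "v" before calling.
version_ge() (
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    [ "${x:-0}" -gt "${y:-0}" ] && exit 0
    [ "${x:-0}" -lt "${y:-0}" ] && exit 1
    # Drop the field just compared; a missing field counts as 0.
    case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
    case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
  done
  exit 0
)

# Usage:
#   version_ge 0.13.3 0.3.0 && echo "per-core monitoring is supported"
```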
#Muted rules are still firing
Symptom: you muted a rule but it continues to fire alerts or send notifications.
- Muting takes effect on the next ingest cycle. Wait at least one collection interval after muting.
- If you muted via the configuration file, restart Crucible:
    sudo systemctl restart glassmkr-crucible
- If you muted via the dashboard, no restart is needed; the change applies on the next push from that server.
- Verify the rule is muted in the dashboard under the server's Alerts tab. Muted rules show a mute icon.
#Getting help
If your issue is not covered here:
- Capture an hour of agent logs:
    sudo journalctl -u glassmkr-crucible --since "1 hour ago" --no-pager > crucible.log
  Attach it when contacting support.
- Email [email protected] with your server ID and a description of the issue.
Last verified: 2026-05-22 against Crucible v0.13.3. Resource footprint figures are from a 7-host validation-fleet measurement on 2026-05-21.