FOR GPU + ML INFRASTRUCTURE

Monitoring for GPU and ML infrastructure.

NVIDIA SMART, PSU redundancy at scale, thermal patterns under sustained load. The boring infrastructure layer that keeps your training runs going.

Install in 2 minutes | View Crucible on GitHub (MIT)

3 nodes free. Test on a single GPU box before deploying fleet-wide.

curl -fsSL https://app.glassmkr.com/install.sh | bash

Crucible v0.13.20 on npm

The problem

GPU infrastructure breaks differently from general-purpose servers. Datacenter-tier GPU boxes have power supplies that fail, PCIe links that degrade, thermal anomalies that appear under sustained load, and ECC errors at a scale nobody sees on consumer hardware.

The monitoring tools built for cloud workloads don’t know what to do with this. Datadog can chart GPU utilization, but it can’t tell you when your H100 is throttling because PSU 2 of 4 just dropped redundancy. Prometheus + DCGM-exporter gets you the metrics, but you still have to write the alert rules yourself, and most teams don’t have the infrastructure SRE bandwidth.

If you’re running GPU infrastructure, you need monitoring that understands GPU infrastructure.

How Glassmkr fits

GPU-aware alert rules.

Glassmkr’s 65 rules include the specific things that go wrong with GPU hardware: PSU redundancy loss (critical on 8-PSU H100/B200 chassis), ECC errors trending up, IPMI sensor critical from BMC, thermal anomalies, power draw outside expected envelope.

Multi-GPU server-level metrics, not just per-GPU.

The agent collects metrics at the server level (PSU state, chassis temperature, BMC SEL entries) and at the per-GPU level (driver version mismatches, ECC error count, memory bandwidth). Aggregate views show the whole rack.

Open-source agent, datacenter-friendly install.

Crucible is MIT licensed. Install per box with one bash command. Deploys cleanly across 10, 100, or 1000 nodes via configuration management. Air-gapped variants supported on request.

Pricing predictable enough for capacity planning.

$3 per node per month, with the first 3 nodes free. A deployment of 100 GPU boxes is $291/month (97 billable nodes at $3 each). No telemetry volume billing means your monitoring bill stays flat as your training runs intensify.

Every GPU, read at a glance

Per-GPU temperature, power draw against limit, VRAM, ECC counters, PCIe link width and the XID event log. Tier 1 (nvidia-smi) covers the eight GPU rules; DCGM and Redfish add depth where present.

app.glassmkr.com/server/gpu-ams-a16-01

2 devices | Tier 1 | Driver 550.163.01

GPU 0: NVIDIA A16
Temp: 36°C
Power: 12.25 / 62.5 W
VRAM: 12 MiB / 15.0 GB
ECC: on (corrected 0, uncorrected 0)
PCIe: Gen 4 x16

GPU 1: NVIDIA A16
Temp: 38°C
Power: 12.3 / 62.5 W
VRAM: 12 MiB / 15.0 GB
ECC: on (corrected 0, uncorrected 0)
PCIe: Gen 4 x16

XID event log: no events in 30 days
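As an illustration of where Tier 1 readings like these can come from, here is a minimal Python sketch that parses the CSV output of nvidia-smi's --query-gpu mode. The query field names are real nvidia-smi fields, but the GpuReading record, the parse_gpu_csv helper, and the sample text (including the 15356 MiB total) are hypothetical and are not Crucible's actual collector or schema.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Real nvidia-smi query fields; the selection and order are this sketch's choice.
FIELDS = [
    "index", "name", "temperature.gpu", "power.draw", "power.limit",
    "memory.used", "memory.total",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
    "pcie.link.gen.current", "pcie.link.width.current",
]

# Command a collector could run to produce the CSV parsed below.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(FIELDS),
    "--format=csv,noheader,nounits",
]


@dataclass
class GpuReading:  # hypothetical record type, not Crucible's schema
    index: int
    name: str
    temp_c: Optional[float]
    power_w: Optional[float]
    power_limit_w: Optional[float]
    vram_used_mib: Optional[float]
    vram_total_mib: Optional[float]
    ecc_corrected: Optional[int]
    ecc_uncorrected: Optional[int]
    pcie_gen: Optional[int]
    pcie_width: Optional[int]


def _num(value, cast=float):
    """Cast one field, mapping placeholders such as "[N/A]" to None."""
    value = value.strip()
    return None if value.startswith("[") else cast(value)


def parse_gpu_csv(text: str) -> List[GpuReading]:
    """Parse `--format=csv,noheader,nounits` output, one GPU per line."""
    readings = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        readings.append(GpuReading(
            index=int(row[0]),
            name=row[1].strip(),
            temp_c=_num(row[2]),
            power_w=_num(row[3]),
            power_limit_w=_num(row[4]),
            vram_used_mib=_num(row[5]),
            vram_total_mib=_num(row[6]),
            ecc_corrected=_num(row[7], int),
            ecc_uncorrected=_num(row[8], int),
            pcie_gen=_num(row[9], int),
            pcie_width=_num(row[10], int),
        ))
    return readings


# Hypothetical sample shaped like the two-GPU A16 box shown above.
SAMPLE = """\
0, NVIDIA A16, 36, 12.25, 62.50, 12, 15356, 0, 0, 4, 16
1, NVIDIA A16, 38, 12.30, 62.50, 12, 15356, 0, 0, 4, 16
"""

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(f"GPU {gpu.index} {gpu.name}: {gpu.temp_c:.0f} C, "
              f"{gpu.power_w} / {gpu.power_limit_w} W")
```

On a live box, the same parser could be fed the stdout of subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True). GPUs without ECC report placeholder values for the ECC fields, which is why the numeric fields are optional.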

Specific alert rules that matter

GPU and ML infrastructure operators care most about:

PSU state: redundancy loss on multi-PSU chassis (H100 boxes typically have 4-8 PSUs; losing one is degraded, losing two is at-risk)
IPMI sensors: fan failures, temperature thresholds exceeded, SEL critical entries from the BMC indicating hardware-level issues
ECC errors: correctable error rate trending up (often a leading indicator of impending DIMM or GPU memory failure)
NVMe wear: high write amplification from checkpoint-heavy training workloads
Network state: interface errors on InfiniBand or 100/400G Ethernet links (a training run halts when the interconnect degrades)
Disk I/O patterns: sustained latency anomalies indicating storage degradation under shuffle/dataloader pressure
Memory pressure: OOM kills on the host (data loader processes dying)
Service health: DCGM exporter unhealthy, NVIDIA driver mismatches across boxes in a multi-node training cluster
Driver survives a reboot: on an NVIDIA box where nouveau was never blacklisted, the next reboot binds nouveau first and the GPU never comes back. On a marketplace, this silently de-lists the host. Glassmkr reads the loaded modules and the blacklist state and warns while the box is still up, so you can fix it in a window you choose instead of finding out from lost earnings.
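Two of the rules above are simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not Glassmkr's implementation: the function names are hypothetical, the PSU thresholds follow the "losing one is degraded, losing two is at-risk" wording, and the nouveau check looks only at plain "blacklist nouveau" lines, ignoring other ways to disable a module (install overrides, kernel command-line options, initramfs contents).

```python
def psu_state(total: int, healthy: int) -> str:
    """Classify PSU redundancy from the count of healthy supplies."""
    lost = total - healthy
    if lost <= 0:
        return "ok"
    if lost == 1:
        return "degraded"
    return "at-risk"  # two or more supplies lost


def loaded_modules(proc_modules_text: str) -> set:
    """Module names from /proc/modules content (first column of each line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_blacklisted(module: str, modprobe_conf_text: str) -> bool:
    """True if a `blacklist <module>` line appears in modprobe.d-style text."""
    for line in modprobe_conf_text.splitlines():
        tokens = line.split("#", 1)[0].split()  # drop trailing comments
        if tokens[:2] == ["blacklist", module]:
            return True
    return False


def nouveau_reboot_risk(proc_modules_text: str, modprobe_conf_text: str) -> bool:
    """Warn when the nvidia driver is loaded but nouveau is not blacklisted."""
    nvidia_loaded = "nvidia" in loaded_modules(proc_modules_text)
    return nvidia_loaded and not is_blacklisted("nouveau", modprobe_conf_text)


if __name__ == "__main__":
    # Hypothetical inputs; on a real host these would be read from
    # /proc/modules and the files under /etc/modprobe.d/.
    modules = "nvidia 56000000 10 nvidia_uvm, Live 0x0\n"
    print(psu_state(total=4, healthy=3))
    print(nouveau_reboot_risk(modules, modprobe_conf_text=""))
```

Both checks take plain values and text rather than reading hardware directly, which keeps the rule logic testable on a machine with no GPU or BMC.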

The agent you can read

If you’re running ML infrastructure for research, you almost certainly have a compliance review process. Crucible passes most such reviews out of the box: it is MIT licensed, the source is published, it collects no telemetry, it runs as a non-root user, and its releases on npm are signed.

For environments where outbound HTTPS is restricted, the agent supports HTTP proxies and air-gapped deployment with manual metric upload. Email [email protected] if you need an air-gapped variant; we’ll help.

Pricing reminder

$3 per node per month. First 3 nodes free.

For a small GPU cluster of 10 boxes: $21/month
For a medium cluster of 50 boxes (e.g. 50 eight-GPU H100 servers, ~400 GPUs): $141/month
For a large deployment of 200 boxes: $591/month

Pricing is per server, not per GPU. An H100 box with 8 GPUs counts as one node.
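The figures above are consistent with billing only the nodes beyond the first three. A minimal Python sketch of that arithmetic, assuming this reading of the pricing (the function and constant names are hypothetical):

```python
PRICE_PER_NODE_USD = 3  # per node per month, from the pricing above
FREE_NODES = 3          # first 3 nodes are free


def monthly_cost_usd(nodes: int) -> int:
    """Monthly bill for a fleet, counting servers rather than GPUs."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    billable = max(0, nodes - FREE_NODES)
    return billable * PRICE_PER_NODE_USD


if __name__ == "__main__":
    for fleet in (10, 50, 200):
        print(f"{fleet} boxes: ${monthly_cost_usd(fleet)}/month")
```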

Install now | Talk to us about your deployment

Install on a single GPU box first.

Start with one box to verify the integration. If any of the GPU-specific signals are degraded, the default rules will fire on your hardware within minutes.

curl -fsSL https://app.glassmkr.com/install.sh | bash

For fleet deployments or air-gapped environments, email [email protected]. We’re a small team and we work directly with the people who run the infrastructure.