FOR GPU + ML INFRASTRUCTURE
Monitoring for GPU and ML infrastructure.
NVIDIA SMART, PSU redundancy at scale, thermal patterns under sustained load. The boring infrastructure layer that keeps your training runs going.
3 nodes free. Test on a single GPU box before deploying fleet-wide.
curl -fsSL https://app.glassmkr.com/install.sh | bash Crucible v0.13.0 on npm
The problem
GPU infrastructure breaks differently than general-purpose servers. Datacenter-tier GPU boxes have power supplies that fail, PCIe links that degrade, thermal anomalies that appear under sustained load, ECC errors at scales nobody experiences with consumer hardware.
The monitoring tools built for cloud workloads don’t know what to do with this. Datadog can chart GPU utilisation, but it can’t tell you when your H100 is throttling because PSU 2 of 4 just dropped redundancy. Prometheus + DCGM-exporter gets you the metrics, but you still have to write the alert rules yourself, and most teams don’t have the infrastructure SRE bandwidth.
If you’re running GPU infrastructure, you need monitoring that understands GPU infrastructure.
How Glassmkr fits
GPU-aware alert rules.
Glassmkr’s 60 rules include the specific things that go wrong with GPU hardware: PSU redundancy loss (critical on 8-PSU H100/B200 chassis), ECC errors trending up, IPMI sensor critical from BMC, thermal anomalies, power draw outside expected envelope.
Multi-GPU server-level metrics, not just per-GPU.
The agent collects metrics at the server level (PSU state, chassis temperature, BMC SEL entries) and at the per-GPU level (driver version mismatches, ECC error count, memory bandwidth). Aggregate views show the whole rack.
Open-source agent, datacenter-friendly install.
Crucible is MIT licensed. Install per box with one bash command. Deploys cleanly across 10, 100, or 1000 nodes via configuration management. Air-gapped variants supported on request.
Pricing predictable enough for capacity planning.
$3 per node per month. A 100-GPU-box deployment is $291/month. No telemetry volume billing means your monitoring bill stays flat as your training runs intensify.
Specific alert rules that matter
GPU and ML infrastructure operators care most about:
- PSU state: redundancy loss on multi-PSU chassis (H100 boxes typically have 4-8 PSUs; losing one is degraded, losing two is at-risk)
- IPMI sensors: fan failures, temperature thresholds exceeded, SEL critical entries from the BMC indicating hardware-level issues
- ECC errors: correctable error rate trending up (often a leading indicator of impending DIMM or GPU memory failure)
- NVMe wear: high write amplification from checkpoint-heavy training workloads
- Network state: interface errors on InfiniBand or 100/400G Ethernet links (training run halts when interconnect degrades)
- Disk I/O patterns: sustained latency anomalies indicating storage degradation under shuffle/dataloader pressure
- Memory pressure: OOM kills on the host (data loader processes dying)
- Service health: DCGM exporter unhealthy, NVIDIA driver mismatches across boxes in a multi-node training cluster
The agent you can read
If you’re running ML infrastructure for research, you almost certainly have a compliance review process. Crucible passes most of them out of the box: MIT licensed, source published, no telemetry collection, non-root user, signed releases on npm.
For environments where outbound HTTPS is restricted, the agent supports HTTP proxies and air-gapped deployment with manual metric upload. Email [email protected] if you need an air-gapped variant; we’ll help.
Pricing reminder
$3 per node per month. First 3 nodes free.
- For a small GPU cluster of 10 boxes: $21/month
- For a medium cluster of 50 boxes (e.g. 50× H100, ~400 GPUs): $141/month
- For a large deployment of 200 boxes: $591/month
Pricing is per server, not per GPU. An H100 box with 8 GPUs counts as one node.
Install on a single GPU box first.
Install on a single GPU box first to verify integration. The default rules will fire on your hardware within minutes if any of the GPU-specific signals are degraded.
curl -fsSL https://app.glassmkr.com/install.sh | bashFor fleet deployments or air-gapped environments, email [email protected]. We’re a small team and we work directly with the people who run the infrastructure.