What our test suite looks like, and why.
Here are four tests from the code that runs Glassmkr, and the specific thing that went wrong to put each one there. None of them exist to hit a coverage number. Each is here because something broke, usually on our own hardware, and we did not want it to break the same way twice.
1. The test that exists because we hid real security alerts
This one is short, and it exists because we shipped a bug.
// crucible/src/collect/__tests__/security.test.ts
it("download-only timer => configured: false", async () => {
  // dnf-automatic.timer is enabled, but /etc/dnf/automatic.conf has
  // apply_updates = no: it fetches patches and never installs them.
  const r = await checkAutoUpdates(fakeRun({ downloadTimer: true, applyYes: false }));
  expect(r.auto_updates.mechanism).toBe("dnf-automatic");
  expect(r.auto_updates.configured).toBe(false); // un-suppresses pending_security_updates
});
For a while, the agent treated any enabled dnf-automatic timer as "automatic updates are configured". That sounds reasonable until you know that dnf-automatic ships timers that only ever download updates and never apply them. A RHEL-family box running one of those is not patching itself; it is filling a cache. We counted it as patched anyway, which means we suppressed the pending_security_updates alert on exactly the hosts that most needed it.
A Rocky Linux 9.6 box on our own validation fleet sat with 26 pending security updates, one of them rated Critical, while our dashboard showed it as healthy. We found it by dogfooding, not because a customer complained, and we wrote the whole thing up in a separate post. The fix shipped in agent 0.13.6. The test above is what is left of it: given a download-only timer, configured must be false, which un-suppresses the alert. It fails loudly if anyone ever reintroduces the original assumption.
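The rule this test pins can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's actual code: `parseApplyUpdates`, `classifyDnfAutomatic` and the `DnfAutomaticState` shape are hypothetical names, and the sketch only looks at the `apply_updates` key discussed above.

```typescript
// Hypothetical shapes for illustration; not the agent's real types.
interface DnfAutomaticState {
  timerEnabled: boolean; // a dnf-automatic timer is enabled
  confText: string;      // contents of /etc/dnf/automatic.conf
}

interface AutoUpdateStatus {
  mechanism: "dnf-automatic" | "none";
  configured: boolean;
}

// True only when an uncommented apply_updates line carries a truthy value.
// Assumption: yes/true/1 count as truthy; a missing key counts as false.
function parseApplyUpdates(confText: string): boolean {
  const match = /^\s*apply_updates\s*=\s*(\S+)\s*$/im.exec(confText);
  const value = match?.[1]?.toLowerCase();
  return value === "yes" || value === "true" || value === "1";
}

// An enabled timer names the mechanism, but it does not prove patching.
// Only a timer that also applies updates counts as "configured".
function classifyDnfAutomatic(state: DnfAutomaticState): AutoUpdateStatus {
  if (!state.timerEnabled) {
    return { mechanism: "none", configured: false };
  }
  return {
    mechanism: "dnf-automatic",
    configured: parseApplyUpdates(state.confText),
  };
}
```

The point of splitting `mechanism` from `configured` is that the two answers can disagree: a host can have the mechanism installed and enabled and still never install a patch.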
2. The test where 81 °C is critical but 83 °C is fine
A monitoring rule is not just a threshold. It is a threshold and a severity, and both have to be right. This pair pins a case that looks, at first glance, like a contradiction:
// apps/dashboard/.../alerts/__tests__/evaluator.test.ts
// warning = upper_critical - 15 ; critical = upper_critical - 5
it("81C with uc=85: critical (uc - 5 = 80)", () => {
  s.ipmi.sensors = [cpu(81, /* upper_critical */ 85)];
  expect(alertsOf("cpu_temperature_high", s)[0].severity).toBe("critical");
});
it("83C without upper_critical: warning (fallback 80C)", () => {
  s.ipmi.sensors = [cpu(83)];
  expect(alertsOf("cpu_temperature_high", s)[0].severity).toBe("warning");
});
81 °C trips critical; 83 °C, two degrees hotter, is only a warning. That is correct. The first board's BMC reports its own upper-critical limit at 85 °C, so 81 °C is four degrees from the edge, inside the five-degree critical band, and genuinely alarming. The second board does not report a limit at all, so we fall back to a fixed 80 °C warning threshold and a softer reading. A flat "alert at 85 °C everywhere" would have been wrong in both directions: too jumpy on the board that runs hot by design, too quiet on the one that stays silent about its limits.
The rule derives its thresholds from each board's BMC-reported upper_critical (warning at the limit minus 15, critical at the limit minus 5). The tests exist because an earlier version did not do that. Worse, it matched sensors by name, which swept in voltage rails: a rail like CPU_VDDCR0, with an upper-critical of 1.6 V, went through the temperature formula, which subtracted 5, arrived at a "critical" threshold of -3.4, and fired on essentially any reading. The fix both filtered sensors by type and made the thresholds relative; these cases lock the behaviour in so neither half regresses.
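As a rough sketch of the rule described above, assuming a simplified sensor shape: the `Sensor` type, the function name and the `>=` comparisons are illustrative choices, and `FALLBACK_CRITICAL_C` in particular is an invented placeholder, since only the 80 °C fallback warning is stated here.

```typescript
// Hypothetical sensor shape; the real evaluator's types are not shown in this post.
type SensorType = "temperature" | "voltage" | "fan";
type Severity = "warning" | "critical";

interface Sensor {
  name: string;
  type: SensorType;
  value: number;
  upperCritical?: number; // BMC-reported limit, when the board provides one
}

const FALLBACK_WARNING_C = 80;  // stated above: used when the BMC reports no limit
const FALLBACK_CRITICAL_C = 90; // ASSUMPTION: placeholder value for this sketch only

function cpuTemperatureSeverity(sensor: Sensor): Severity | null {
  // Filter by type first, so a voltage rail named CPU_* never reaches the formula.
  if (sensor.type !== "temperature") return null;

  const uc = sensor.upperCritical;
  const warningAt = uc !== undefined ? uc - 15 : FALLBACK_WARNING_C;
  const criticalAt = uc !== undefined ? uc - 5 : FALLBACK_CRITICAL_C;

  if (sensor.value >= criticalAt) return "critical";
  if (sensor.value >= warningAt) return "warning";
  return null;
}
```

With these definitions, 81 °C against a reported limit of 85 °C is critical, 83 °C with no reported limit is a warning, and a 1.2 V reading on CPU_VDDCR0 produces no temperature alert at all.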
3. The test that insists on 404, not 403
// apps/dashboard/.../servers/[id]/__tests__/security.test.ts
it("returns 404 (not 403) when the server belongs to another customer", async () => {
  setQueries([{ rows: [] }]); // ownership check finds nothing for this customer
  const err = await getServer("srv_belonging_to_someone_else");
  expect(err.status).toBe(404); // a 403 would confirm the row exists
});
The important character here is the status code. When customer A asks for customer B's server by ID, the obvious response is 403 Forbidden. We return 404 Not Found, and there is a test that fails if anyone "tidies" it into a 403. A 403 confirms the row exists; it tells a prober they have guessed a real server ID belonging to someone else. A 404 says nothing at all. On a multi-tenant boundary, the distance between "wrong" and "leaks" is one status code.
An honest admission: this suite overlaps with our auth-middleware tests on purpose. The middleware is already supposed to enforce ownership, and we test the endpoints again anyway. On a boundary where a bug is a breach rather than a wrong number, we trust belt and braces over trusting ourselves.
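One way to get this behaviour almost for free is to scope the lookup by owner, so "does not exist" and "belongs to someone else" take the same code path. The sketch below uses an in-memory array in place of the database, and `ServerRow`, `lookupServer` and the response shape are hypothetical names for illustration, not the dashboard's real handler.

```typescript
// Hypothetical row and response shapes, for illustration only.
interface ServerRow {
  id: string;
  customerId: string;
  hostname: string;
}

type LookupResult =
  | { status: 200; body: ServerRow }
  | { status: 404; body: { error: string } };

function lookupServer(
  customerId: string,
  serverId: string,
  rows: readonly ServerRow[],
): LookupResult {
  // Match on id AND owner in a single step. There is no separate
  // "found it, but it is not yours" branch that could turn into a 403.
  const row = rows.find((r) => r.id === serverId && r.customerId === customerId);
  if (row === undefined) {
    return { status: 404, body: { error: "not found" } };
  }
  return { status: 200, body: row };
}
```

Because the two failure cases return an identical response, a prober cannot use the response to tell a real ID from a made-up one.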
4. The test that just writes down what a real BMC said
Most of our tests are not dramatic. The majority look like this:
// crucible/src/lib/__tests__/vendor-sensors.test.ts
it("matches PS<N> with space OR underscore, across vendors", () => {
  // Fleet data: Supermicro H12SST and Dell iDRAC emit "PS1 Status";
  // Gigabyte boards emit "PS1_Status".
  expect(isPsuSensor("PS1 Status", "supermicro")).toBe(true);
  expect(isPsuSensor("PS1_Status", "supermicro")).toBe(true); // Gigabyte BMC on a Supermicro-DMI box
  expect(isPsuSensor("PS Redundancy", "dell")).toBe(false); // not an individual PSU
});
There is no clever logic here. It is a record of the exact strings real power-supply sensors emit on real boards, captured from the fleet. Supermicro and Dell write PS1 Status with a space; Gigabyte writes PS1_Status with an underscore. An earlier matcher only accepted the underscore form for Dell, which means we would have silently missed a failed PSU on the others.
The comment on one line is the whole reason cross-vendor monitoring is hard: Gigabyte BMC on a Supermicro-DMI box. That machine's chassis reports its vendor as Supermicro, but the BMC firmware inside is Gigabyte, so the sensor names follow Gigabyte's convention while every other signal says Supermicro. We told that story at length in cross-vendor IPMI quirks; this test is how we make sure we never regress on it.
These are the ugly tests. There are dozens of them, each pinned to a literal string some board produced once, and they will never be finished, because every new motherboard generation invents a fresh way to format the same fact. We have made peace with that.
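For the three strings in the test above, a matcher can be as small as one regular expression. This is a minimal sketch that covers only those cases: `isPsuSensorSketch` is a hypothetical stand-in for the real `isPsuSensor`, which presumably handles many more vendor formats, and the case-sensitive match is an assumption.

```typescript
// "PS" + a PSU number + space OR underscore + "Status".
// Requiring the digit is what excludes aggregate sensors like "PS Redundancy".
const INDIVIDUAL_PSU = /^PS\d+[ _]Status$/;

// The vendor argument is deliberately ignored in this sketch: the DMI vendor
// and the BMC firmware can disagree, so the separator is accepted either way.
function isPsuSensorSketch(name: string, _dmiVendor: string): boolean {
  return INDIVIDUAL_PSU.test(name);
}

// The three fleet strings from the test above.
const results = [
  isPsuSensorSketch("PS1 Status", "supermicro"),
  isPsuSensorSketch("PS1_Status", "supermicro"),
  isPsuSensorSketch("PS Redundancy", "dell"),
];
```

Not trusting the reported vendor to pick the separator is the design choice that matters here; the pattern itself will keep growing as new boards show up.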
What the shape tells you
Step back from the individual tests and a structure appears. The dashboard has 79 test files and roughly 16,000 lines of test code; 21 of those files exist only for the alert evaluator. The agent has more than 400 tests across 26 files, and the large majority are parsers fed real captured output. Sorted by what they actually protect, almost everything falls into three piles:
Parsing input we do not control. The agent reads ipmitool, smartctl, nvidia-smi, /proc, dmesg and systemctl, across vendors, firmware revisions and six Linux distributions. A wrong parse is a wrong reading is a wrong alert. This is the biggest pile by far.
The correctness of a decision. Every alert rule is pinned to its threshold and its severity, because a false negative is a missed outage and a false positive is alert fatigue, and both quietly cost you the customer's trust.
The security boundary. Multi-tenancy, where a bug is a breach rather than a glitch. Belt and braces, as above.
And an admission to go with the count: some of these are slow. The integration tests stand up real Postgres and ClickHouse instances, and we have looked at deleting them more than once to speed up CI. We keep them, because the last few times we trusted a mocked query over a real one, the bug was hiding in the part the mock pretended away.
Where we have no tests at all
This website, the one you are reading, has zero tests. Around 17,000 lines of Svelte and not a single test file. That is also deliberate.
The worst-case failure on the marketing site is that it looks wrong for an hour while we fix it. The worst-case failure on the alert engine is that a customer's outage does not fire, or one customer sees another's. Those are not the same kind of risk, so they do not get the same kind of investment. A test suite is a budget, and we spend it where being wrong is expensive.
Tests are scar tissue
Scar tissue is not decorative, and it is not a trophy. It is the body's way of making sure the same injury does no further damage the second time. That is exactly what a regression test is: the functional residue of something that already went wrong, left in place so it cannot go wrong the same way again.
Which is why the shape of a test suite tells you more than the shape of the code. The code describes what a system is meant to do. The tests describe what has actually happened to it. You can read someone's test suite and tell what their incidents looked like.
Here is ours. If you would rather see the result than the tests, the documentation covers what we monitor, and there is a live demo running against an anonymised sample fleet.