We publish six numbers on every engagement. This post walks the dashboard line by line.
The six are deliberately not a composite. Each tells you about a different failure. Read them together; never average them. The set is latency, cost per useful task, refusal rate, override rate, falsifier status, and evidence-class mix.
1. Latency
What it is. Wall-clock time from request submitted to usable answer in the operator's hands. Not the model's generation time — the end-to-end number the human experiences.
How we measure it. Per task class, as a distribution: median, 90th percentile, and the worst day in the window. Averages hide the bad tail; the bad tail is where users decide they hate the tool.
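A minimal sketch of that summary, assuming latency samples arrive as `(day, seconds)` pairs for a single task class. The function name, the input shape, and the definition of "worst day" as the day with the highest median are illustrative assumptions, not a description of our pipeline.

```python
from collections import defaultdict
from statistics import median, quantiles


def latency_summary(samples):
    """Summarize end-to-end latency for one task class.

    samples: iterable of (day, seconds) pairs; needs at least two samples.
    """
    by_day = defaultdict(list)
    for day, seconds in samples:
        by_day[day].append(seconds)

    all_seconds = [s for values in by_day.values() for s in values]

    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    p90 = quantiles(all_seconds, n=10, method="inclusive")[-1]

    # Assumption: the "worst day" is the day with the highest median latency.
    worst_day = max(by_day, key=lambda d: median(by_day[d]))

    return {
        "median": median(all_seconds),
        "p90": p90,
        "worst_day": worst_day,
        "worst_day_median": median(by_day[worst_day]),
    }
```

Note that no mean is computed anywhere; the output is a distribution summary only.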
Failure mode it catches. Vendors quote "ms per token" because it looks small. The number that matters is the slow tail on a real task on a bad day. (Class C — observed in our own integration logs.)
2. Cost per useful task
What it is. Dollars spent per task the human actually shipped. The denominator is outputs kept, not outputs generated.
How we measure it. Total inference and orchestration cost over the window, divided by outputs the operator marked "used." We publish both per-task cost and the gap between cost-per-generation and cost-per-useful-task. That gap is its own diagnostic.
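The arithmetic can be sketched as below. The function and field names are hypothetical, and treating a window with zero used outputs as infinite cost is an assumption made for illustration.

```python
import math


def cost_report(total_cost, generated, used):
    """Return cost per generation, cost per useful task, and the gap.

    total_cost: inference plus orchestration spend over the window.
    generated:  number of outputs produced.
    used:       number of outputs the operator marked "used".
    """
    if generated <= 0:
        raise ValueError("no outputs were generated in this window")
    if used > generated:
        raise ValueError("used outputs cannot exceed generated outputs")

    per_generation = total_cost / generated
    # Assumption: a window with nothing kept has unbounded cost per useful task.
    per_useful = total_cost / used if used else math.inf

    return {
        "cost_per_generation": per_generation,
        "cost_per_useful_task": per_useful,
        "gap": per_useful - per_generation,
    }
```

With $120 of spend, 400 generations, and 100 outputs kept, the per-generation figure is $0.30 while the per-useful-task figure is $1.20.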
Failure mode it catches. Token pricing implies that spend scales linearly with the work. It does not. A system that generates cheaply but needs three retries and a human rewrite is more expensive than one that costs more per call and gets it right the first time. When this line moves, ask whether generation cost rose or the "useful" count fell; the two look identical on this line and require different fixes. (Class C.)
3. Refusal rate
What it is. Share of in-scope prompts the system declined or deflected, broken out by reason. The reason codes matter more than the headline number.
How we measure it. We bucket refusals — policy guardrail, capability ceiling, missing context, ambiguous prompt, upstream timeout — and publish the rate per bucket. A 12 percent rate that is all "missing context" is a documentation problem. The same 12 percent that is all "policy guardrail" is a configuration problem. Same headline, different work.
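A rough sketch of the bucketing, assuming each in-scope prompt is logged as either `None` (answered) or one reason code. The snake_case codes are hypothetical spellings of the five buckets named above.

```python
from collections import Counter

# Hypothetical reason codes for the five buckets.
REASONS = (
    "policy_guardrail",
    "capability_ceiling",
    "missing_context",
    "ambiguous_prompt",
    "upstream_timeout",
)


def refusal_rates(outcomes):
    """outcomes: one entry per in-scope prompt; None if answered, else a reason code."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no in-scope prompts in this window")

    counts = Counter(o for o in outcomes if o is not None)
    unknown = set(counts) - set(REASONS)
    if unknown:
        raise ValueError(f"unrecognized reason codes: {sorted(unknown)}")

    return {
        "headline": sum(counts.values()) / total,
        # Every bucket shares the same denominator, so the rates sum to the headline.
        "per_bucket": {reason: counts[reason] / total for reason in REASONS},
    }
```

Two windows can return the same `headline` value while their `per_bucket` entries point at entirely different work.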
Failure mode it catches. Refusals that look "safe" but are the model ducking hard requests; refusals that disappear after a vendor update and leave a quiet drop in output quality behind them. If the rate drops, check whether override rate rose to absorb it. (Class C.)
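The cross-check between refusal rate and override rate could be sketched like this. The window dictionaries, the function name, and the 2-point threshold are all assumptions chosen for illustration; the right threshold depends on how noisy the two lines are for a given engagement.

```python
def refusal_absorbed_by_override(prev, curr, min_shift=0.02):
    """Flag a refusal drop that coincides with an override rise.

    prev, curr: dicts with "refusal_rate" and "override_rate" as fractions (0 to 1).
    min_shift:  hypothetical minimum change treated as a real move.
    """
    refusal_drop = prev["refusal_rate"] - curr["refusal_rate"]
    override_rise = curr["override_rate"] - prev["override_rate"]
    return refusal_drop >= min_shift and override_rise >= min_shift
```

A `True` result is a prompt to read the underlying outputs by hand; it does not, on its own, show that quality fell.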
4. Override rate
What it is. Share of outputs the human rewrote, replaced, or discarded before use. The single most diagnostic line on the dashboard.
How we measure it. Every output passes through an operator surface where the person marks one of four states: kept, edited, rewrote, discarded. The override rate is the share in the last three states (edited, rewrote, discarded), with the breakdown published.
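As a sketch, assuming the operator surface exports one state label per output; the function name and the return shape are illustrative.

```python
from collections import Counter

STATES = ("kept", "edited", "rewrote", "discarded")


def override_report(marks):
    """marks: one state label per output, as recorded by the operator."""
    total = len(marks)
    if total == 0:
        raise ValueError("no outputs were marked in this window")

    counts = Counter(marks)
    unknown = set(counts) - set(STATES)
    if unknown:
        raise ValueError(f"unrecognized states: {sorted(unknown)}")

    # Everything except "kept" counts as an override.
    overridden = sum(counts[state] for state in STATES if state != "kept")

    return {
        "override_rate": overridden / total,
        "breakdown": {state: counts[state] / total for state in STATES},
    }
```

Publishing `breakdown` next to `override_rate` keeps a light edit distinguishable from a full discard.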
Failure mode it catches. The "saving hours per week" claim that is secretly the team doing the work twice — once to prompt and once to fix. Override rate almost never moves alone; if it does, the prompt scaffolding changed underneath the team. (Class C — the line we have caught the most quiet failures on.)
5. Falsifier status
What it is. For every claim the engagement makes, a written test that would force a retraction. Plus the current status: pass, fail, or not yet run.
How we measure it. Each claim on the engagement page has a falsifier attached at publication. If it has not yet been run, the claim is shown as evidence class U. When a falsifier fails, the page changes the same day — the claim is retracted, not explained away.
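One way to model that rule is sketched below. The `Claim` fields, the `page_entry` helper, and the bracketed tag format are hypothetical, and the code assumes a claim shows its own evidence class only once its falsifier has passed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_RUN = "not yet run"


@dataclass
class Claim:
    text: str
    falsifier: str       # the written test that would force a retraction
    evidence_class: str  # assumption: the class shown once the falsifier passes
    status: Status = Status.NOT_RUN

    def __post_init__(self):
        # A claim without a named falsifier cannot be published at all.
        if not self.falsifier.strip():
            raise ValueError("a claim cannot be published without a falsifier")


def page_entry(claim):
    """Render how the claim appears on the engagement page."""
    if claim.status is Status.FAIL:
        return f"[RETRACTED] {claim.text}"
    if claim.status is Status.NOT_RUN:
        return f"[U] {claim.text}"
    return f"[{claim.evidence_class}] {claim.text}"
```

There is deliberately no branch that keeps a failed claim visible with an explanation attached.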
Failure mode it catches. Marketing language. "It reduces errors" is not a claim until the test that would disprove it is named. Most vendor sheets cannot survive a falsifier audit because they were never built to be auditable. (Class E — the falsifier discipline traces back to standard philosophy of science, not anything we invented.)
6. Evidence-class mix
What it is. The share of the wins on the engagement page that are empirical-in-session (A), code or inspection (B), configuration or integration (C), expert citation (E), or unverified (U).
How we measure it. Every claim on a deliverable carries its class as a small tag. The mix is the rolled-up share. A deliverable that is mostly A and C is doing real work; one that is mostly E with a few Us is a literature review with a bill attached.
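The roll-up itself is a simple count, sketched here under the assumption that each claim's tag is available as a single letter. The function name is illustrative.

```python
from collections import Counter

CLASSES = ("A", "B", "C", "E", "U")


def evidence_mix(tags):
    """tags: one evidence-class letter per claim on the deliverable."""
    total = len(tags)
    if total == 0:
        raise ValueError("the deliverable carries no tagged claims")

    counts = Counter(tags)
    unknown = set(counts) - set(CLASSES)
    if unknown:
        raise ValueError(f"unrecognized evidence classes: {sorted(unknown)}")

    # Classes with no claims are reported as 0.0 so gaps stay visible.
    return {cls: counts[cls] / total for cls in CLASSES}
```

For eight claims tagged A, A, B, C, C, C, E, U, the mix is 25 percent A, 12.5 percent B, 37.5 percent C, and 12.5 percent each for E and U.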
Failure mode it catches. Pages that read confident but are mostly unverified. The mix forces every author — including us — to admit how much of what they are saying is shown rather than claimed. (Class C.)
Reading the six together
No single line tells you the system is working. The dashboard works when the six disagree productively — when a latency win shows up as an override loss and forces the conversation. The discipline is to never quiet that disagreement with a composite.
For the broader posture this practice sits inside, see the pillar piece on measurement honesty. The wider architecture that makes these receipts auditable is described in the transparency architecture overview. To apply this discipline to your own engagement, the entry point is the workshop.
