A vendor told us their model scored ninety-two. Ninety-two what, on which task, measured how, refused how often, overridden by humans how often, falsifiable against which test? That conversation is the whole reason this post exists.
One number is a sales pitch. Six numbers are a receipt.
When a vendor hands you one composite score, they have already chosen what to hide. Composites collapse trade-offs that buyers need to see. A model can look great on a leaderboard and still cost more per useful task than the staff it was supposed to relieve, refuse half the prompts that actually mattered, and produce outputs your team silently rewrites before sending. None of that shows up in "ninety-two."
So on every SolutionWright engagement we publish six raw series, not a composite. Every client gets the same six. Every reader of this site can ask to see ours for our own systems. The list:
- Latency — wall-clock time from request submitted to usable answer in hand, per task class, distribution not average. (Class A, measured in session.)
- Cost per useful task — dollars per task that the human actually shipped, not dollars per token and not dollars per generation. The denominator is the number of outputs the team kept, including the rewrites. (Class A.)
- Refusal rate — share of in-scope prompts the system declined or deflected, broken out by reason. A high refusal rate is not automatically bad; an unexplained one always is. (Class A.)
- Override rate — share of outputs the human rewrote, replaced, or discarded before use. This is the single most diagnostic number we collect. (Class A.)
- Falsifier status — for every claim the system or the vendor is making, a specific test that would disprove it, plus the current state of that test (pass / fail / not yet run). If there is no falsifier, the claim is not allowed on the dashboard. (Class F.)
- Evidence-class mix — what share of our reported wins are empirical-in-session (A), what share are code or inspection (B), what share are configuration or integration (C), what share are expert citation (E), what share are unverified (U). The mix is the honesty signal. A deliverable that is all U is not a deliverable. (Class C.)
That is the entire dashboard. There is no seventh number we are hiding behind the chart.
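To make the definitions concrete, here is a minimal sketch of how the four session-measured series (latency, cost per useful task, refusal rate, override rate) could be computed from a task log. The record fields, outcome labels, and sample numbers are all hypothetical, and two choices are assumptions rather than anything stated above: refused prompts still count toward spend, and latency is reported only for prompts that produced an output.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median

# Hypothetical outcome labels, chosen for this sketch only.
KEPT = {"kept_as_is", "rewritten"}                   # outputs the team shipped
OVERRIDDEN = {"rewritten", "replaced", "discarded"}  # outputs a human changed or dropped

@dataclass
class TaskRecord:
    latency_s: float          # wall-clock seconds, request to usable answer
    cost_usd: float           # spend attributed to this prompt
    outcome: str              # a KEPT or OVERRIDDEN label, or "refused"
    refusal_reason: str = ""  # filled in only when outcome == "refused"

def summarize(records):
    """Return the four session-measured series for one task class."""
    outputs = [r for r in records if r.outcome != "refused"]
    refusals = [r for r in records if r.outcome == "refused"]
    kept = [r for r in outputs if r.outcome in KEPT]
    overridden = [r for r in outputs if r.outcome in OVERRIDDEN]
    latencies = sorted(r.latency_s for r in outputs)
    total_spend = sum(r.cost_usd for r in records)  # assumption: refusals still cost money
    return {
        "latency_s": latencies,  # the full distribution, not just an average
        "latency_median_s": median(latencies) if latencies else None,
        "cost_per_useful_task": total_spend / len(kept) if kept else None,
        "refusal_rate": len(refusals) / len(records) if records else None,
        "refusals_by_reason": Counter(r.refusal_reason for r in refusals),
        "override_rate": len(overridden) / len(outputs) if outputs else None,
    }

# Made-up log: eight in-scope prompts at $0.25 each.
log = [
    TaskRecord(2.0, 0.25, "kept_as_is"),
    TaskRecord(3.0, 0.25, "kept_as_is"),
    TaskRecord(4.0, 0.25, "rewritten"),
    TaskRecord(5.0, 0.25, "rewritten"),
    TaskRecord(6.0, 0.25, "discarded"),
    TaskRecord(9.0, 0.25, "discarded"),
    TaskRecord(1.0, 0.25, "refused", "policy"),
    TaskRecord(1.0, 0.25, "refused", "missing_context"),
]
report = summarize(log)
```

Note that a rewritten output counts in both sets: it shipped, so it belongs in the cost denominator, and a human changed it, so it belongs in the override numerator. On this made-up log the spend is $0.25 per generation but $0.50 per useful task.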
Why a composite score lies, mechanically
Composites are weighted sums of things that often move in opposite directions. If you push latency down by switching to a smaller model, override rate often goes up, because the smaller model produces weaker drafts that need more rewriting. If you push refusal rate down by loosening guardrails, override rate often goes up too, because the human starts catching things the system should have caught. A composite hides these trade-offs by averaging them into a single number that nobody can interrogate. (Class B: observed across our own pipelines in the last twelve months of integration work.)
The fix is not a better composite. The fix is to publish the raw series and let the buyer weigh them against their own use. That is what receipts mean.
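A toy illustration of the averaging problem, with weights and scores invented purely for the arithmetic: two systems that differ sharply on latency and override rate can land on the same composite.

```python
# Hypothetical weights and per-metric scores (0 to 100, higher is better).
WEIGHTS = {"latency": 0.5, "override": 0.5}

def composite(scores):
    """Weighted sum of per-metric scores."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())

large_model = {"latency": 60, "override": 90}  # slower, few rewrites
small_model = {"latency": 90, "override": 60}  # faster, many rewrites

print(composite(large_model), composite(small_model))  # prints 75.0 75.0
```

A buyer who cares mostly about rewrites and a buyer who cares mostly about speed would choose differently here, and the single number gives neither of them anything to choose with.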
Override rate is the diagnostic line
Of the six, override rate is the one we watch first when something is wrong. It is the closest proxy for whether the system is actually doing the work or whether the team is doing it twice — once to prompt the system and once to fix its answer. We have seen organizations where the official metric said the model was saving hours per week, and the override-rate series said the analysts were quietly rewriting most of what came out. The official metric was not lying about the generation count. It was lying by omission about what happened next. (Class A — observed in two recent integrations.)
A high override rate is not always a verdict against the system. Sometimes the task is genuinely too brittle for current models and the human has to finish it. Sometimes the prompt scaffolding is wrong. Sometimes the guardrails are too tight and the model is hedging into uselessness. The override series tells you that something needs attention. The other five series tell you what.
The cost line, specifically
Cost-per-useful-task is the line most vendors do not want printed. Token pricing implies that spend scales linearly with the work done. It does not. The real denominator is the number of outputs the human kept, and we have seen that denominator be a third or a quarter of the generation count on systems that the leaderboard called excellent. (Class A.)
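The arithmetic is short enough to show. In this sketch the spend and generation count are hypothetical; only the "a third or a quarter" keep ratios come from the paragraph above.

```python
def cost_per_generation(total_spend_usd, generations):
    """What token-style pricing reports: spend divided by everything generated."""
    return total_spend_usd / generations

def cost_per_useful_task(total_spend_usd, kept_outputs):
    """Spend divided only by the outputs the team actually kept."""
    return total_spend_usd / kept_outputs

# Hypothetical figures: $120 of spend across 1,200 generations.
spend, generations = 120.0, 1200
per_generation = cost_per_generation(spend, generations)
kept_a_third = cost_per_useful_task(spend, generations // 3)    # 400 kept
kept_a_quarter = cost_per_useful_task(spend, generations // 4)  # 300 kept

print(per_generation, kept_a_third, kept_a_quarter)  # prints 0.1 0.3 0.4
```

The same invoice reads as ten cents a generation or as thirty to forty cents a useful task, depending on which denominator is printed.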
Themesis published an honest piece on the broader version of this problem — how much of the field is overspending on one architectural bet while underspending on alternatives, and why the human-verification step is the one nobody is allowed to skip. We frame it our own way: receipts beat spectacle, and the verify-the-output discipline is the agency's actual product. (Reference: Maren on the cost question.)
Cost honesty also undermines the "you must spend giant compute to do anything" pitch. A small kit that solves the task in seconds is more informative than a giant kit that solves it in a press release. Themesis documented one such result this spring — the SeedIQ system clearing ARC-AGI 3 instances on a laptop in about nine seconds at roughly twenty watts, against foundation-model runs that burned through thousands of GPU-hours. Our one-line frame: the small efficient kit wins on the receipt sheet, and the receipt sheet is the only sheet that matters to a buyer. (Reference: Maren on SeedIQ and ARC-AGI 3.)
Falsifier status is the line vendors avoid
Most "AI claims" you read are not falsifiable as stated. "It improves productivity" — measured how, on whose work, against what baseline, at what override rate? "It reduces errors" — by what definition of error, observed by whom, over what window?
We require, before we even publish a claim about our own systems, a written falsifier — the specific observation that would force us to retract. If we cannot write the falsifier, the claim is unverified (Class U) and gets labelled that way on the page. If we can write the falsifier but have not yet run it, the claim is also Class U until we have. The day a falsifier fails is the day we change the page. (Class F — this is the discipline we operationalize.)
Buyers should ask every vendor for the falsifier sheet. If the vendor cannot produce one, the answer is not "their claim is wrong." The answer is "their claim is not yet a claim."
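The labelling rule can be written down as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, "fail" is read as the claim failing its disproof test, and the label strings for the pass and fail cases are placeholders, since the rule above only fixes the Class U cases.

```python
from enum import Enum
from typing import Optional

class FalsifierState(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_RUN = "not yet run"

def claim_label(falsifier: Optional[str], state: Optional[FalsifierState]) -> str:
    """Map a claim's falsifier (if any) and its test state to a page label."""
    if not falsifier:
        return "unverified (Class U): no falsifier written"
    if state is None or state is FalsifierState.NOT_RUN:
        return "unverified (Class U): falsifier not yet run"
    if state is FalsifierState.FAIL:
        return "retract: change the page"        # placeholder wording
    return "stands: falsifier run and survived"  # placeholder wording

# Hypothetical claim and falsifier text, for illustration only.
falsifier = "override rate above 30% over four consecutive weeks"
print(claim_label(None, None))
print(claim_label(falsifier, FalsifierState.NOT_RUN))
print(claim_label(falsifier, FalsifierState.PASS))
```

The useful property is that there is no branch where a claim without a run falsifier gets a better label than "unverified".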
What a buyer should demand
If you are buying anything in this space, the minimum due-diligence packet is the six numbers above, in their raw distributions, on your actual task distribution, over a window long enough to see the bad days. Not a synthetic benchmark. Not the vendor's curated examples. Your work, your override calls, your refusals, your falsifier sheet.
We will publish ours. We expect any vendor we recommend to publish theirs. We will tell you when we cannot get the numbers, and we will not pretend the silence is a result.
This is the receipt-first posture. It is dull on purpose. The interesting work happens after the dashboards stop lying.
If you want to see the live versions of the six numbers for our own systems, they live at /six-receipts. The broader honesty posture is at /transparency. The working hypothesis behind the practice is described in plain language at /science. If you want to apply this discipline to your own engagement, the entry point is /workshop.
