A single score is a story the vendor tells. Six raw lines are a story your eyes can read for themselves.

The reason we will not roll up the six receipts

Buyers ask us for one number. They ask kindly, and they ask often, and the answer is still no. The six receipts — latency, cost per useful task, refusal rate, override rate, falsifier status, evidence-class mix — are kept as six separate raw series because rolling them up destroys the only information a buyer actually needs: how each one moved, when, and against which of the others.

A composite hides drift by construction. Two numbers can move in opposite directions and the composite sits still. The system can get cheaper and worse at the same time and the dashboard reads as flat. The team can rewrite half the outputs and the score barely twitches, because the cost term improved (a smaller cost divided by a slightly larger task count) and that improvement cancels the rewriting on paper. That is not a measurement problem. That is the composite doing its job, which is to average the bad news into the good news until nothing is legible. (Class C: observed across our integration work; the receipts are kept raw on the SWU dashboard for exactly this reason.)
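A minimal sketch of that cancellation, assuming two receipts normalized to a 0-to-1 scale where lower is better and an equal-weight average as the composite. The readings, the normalization, and the weighting are all illustrative assumptions, not data from an engagement and not any vendor's formula.

```python
# Hypothetical weekly readings, normalized so 0.0 is best and 1.0 is worst.
# Both the numbers and the equal weighting are assumptions for illustration.
cost_per_task = [0.60, 0.55, 0.50, 0.45, 0.40]  # getting cheaper
override_rate = [0.20, 0.25, 0.30, 0.35, 0.40]  # getting worse


def composite(*series):
    """Equal-weight average of the given series, week by week."""
    return [round(sum(week) / len(week), 4) for week in zip(*series)]


def net_change(series):
    """Last reading minus first reading."""
    return round(series[-1] - series[0], 4)


score = composite(cost_per_task, override_rate)

print("composite by week:", score)
print("composite change: ", net_change(score))
print("cost change:      ", net_change(cost_per_task))
print("override change:  ", net_change(override_rate))
```

The composite does not move across the five weeks, while each underlying line has shifted by a fifth of its scale. Any weighting scheme has some pair of opposite moves that cancels the same way; equal weights just make the arithmetic easy to see.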

What twelve weeks of raw series actually look like

In an honest 12-week engagement the lines do not move in a single direction. Latency drops in week three when we switch model routing for the high-volume class, then climbs back partway in week five when we add a verification hop. Cost per useful task is jagged for the first month and only stabilizes once override rate has dropped enough that the denominator stops shifting under us. Refusal rate has a step change in week seven when the guardrails are tightened after a near-miss, and the override rate notches up in the same week because the team is now catching what the system declines to do.
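The link between override rate and the cost denominator can be sketched in a few lines. This assumes a task counts as useful when it was not overridden by a human; the article does not define the denominator, so treat that as one plausible reading. The function name, spend, and volumes are hypothetical.

```python
def cost_per_useful_task(total_cost, tasks_attempted, override_rate):
    """Spend divided by the tasks that did not need a human override.

    Assumption: 'useful' means 'not overridden'. This definition is
    illustrative, not taken from the source.
    """
    useful_tasks = tasks_attempted * (1.0 - override_rate)
    if useful_tasks <= 0:
        raise ValueError("no useful tasks in this window")
    return total_cost / useful_tasks


# Hypothetical weeks: spend and volume are held flat on purpose, so every
# wobble in the result comes from the override rate alone.
weekly_override_rate = [0.50, 0.30, 0.40, 0.10, 0.10]
weekly_cost = [
    round(cost_per_useful_task(1000.0, 500, rate), 2)
    for rate in weekly_override_rate
]

print(weekly_cost)
```

With spend and volume fixed, the cost line is jagged for as long as the override rate is moving and goes flat in the last two weeks once it settles. That is why the two lines need to be read together.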

A composite would have smoothed all of that into a vaguely rising line. The raw series instead let the buyer ask "that bump in week seven: what caused it?" and get a specific answer. That conversation is the product. The smoothed line is the absence of the conversation.

The trade-off pattern, by line

Latency vs override. Push latency down with a smaller model and the override rate usually rises, because the smaller model needs more human rewriting.
Refusal vs override. Loosen the refusals to be more "helpful" and the override rate often rises too, because the human starts catching the things the system used to decline.
Cost vs evidence-class mix. Cheaper generations are tempting until you notice the share of outputs you trust without a human in the loop (your Class A share) is shrinking faster than the price.

These pairings are the reason we publish the lines side by side. A composite would average each pair into a single moving number and the trade-off would vanish. The raw view forces the conversation about which trade-off the engagement is actually making and whether the buyer agrees with it.
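One way to surface those pairings from raw series is to flag the weeks in which one line improved while its partner got worse. The sketch below is illustrative only: the series names, the readings, and the "lower is better" table are assumptions, and the Class A share is treated as the one line where higher is better.

```python
# Which direction counts as improvement for each (hypothetical) series.
LOWER_IS_BETTER = {
    "latency_ms": True,
    "override_rate": True,
    "cost_per_useful_task": True,
    "class_a_share": False,  # share trusted without a human: higher is better
}


def improvement(name, before, after):
    """Positive when the series got better, negative when it got worse."""
    delta = after - before
    return -delta if LOWER_IS_BETTER[name] else delta


def tradeoff_weeks(series, first, second):
    """0-based week indexes where one line improved and the other worsened."""
    flagged = []
    for week in range(1, len(series[first])):
        a = improvement(first, series[first][week - 1], series[first][week])
        b = improvement(second, series[second][week - 1], series[second][week])
        if a * b < 0:  # opposite signs: a trade was made this week
            flagged.append(week)
    return flagged


# Made-up readings: a smaller, cheaper model is swapped in at index 2.
raw = {
    "latency_ms": [900, 880, 600, 600],
    "override_rate": [0.10, 0.10, 0.18, 0.18],
    "cost_per_useful_task": [4.0, 4.0, 2.5, 2.5],
    "class_a_share": [0.70, 0.70, 0.55, 0.55],
}

print(tradeoff_weeks(raw, "latency_ms", "override_rate"))
print(tradeoff_weeks(raw, "cost_per_useful_task", "class_a_share"))
```

A flagged week does not say the trade was wrong. It only marks where the question "did we agree to this?" should be asked, which is the conversation an averaged score never prompts.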

The "small kit, big result" lesson

The other reason we resist composites is that they reward spectacle. A giant model with a so-so override rate looks impressive on a leaderboard score. A small model that ships in seconds at a fraction of the cost looks unimpressive on a leaderboard score and dominant on a receipt sheet. Themesis documented a clean example of this gap earlier in the year — a small system clearing the ARC-AGI 3 challenge on a laptop in roughly nine seconds at about twenty watts, against foundation-model runs that consumed thousands of GPU-hours for similar instances. Our one-line frame: the receipt sheet is the only sheet that survives contact with a real budget. (Reference: Maren on the SeedIQ result. Class E.)

What to demand instead

When a vendor offers you a composite, the right move is not to argue with the formula. The right move is to ask for the raw series the formula was built from, over your task distribution, over a long enough window to see the bad days. If the vendor cannot produce the raw series, the composite is not a measurement — it is a marketing artifact. Walk.

If the vendor can produce the raw series but bundles them behind a proprietary index, ask them to publish the unbundled lines. If they decline, ask them why. The answers we have received over the years fall into two groups: "we have not built the unbundled view yet" (fixable) and "the unbundled view would look worse" (your answer).

Where to go from here

See the six receipts and why we keep them separate in the six receipts explained.
For the broader measurement posture, the pillar piece is measurement honesty for AI projects.
If you want this discipline applied to your own engagement, the entry point is /workshop.

Why Composite Scores Lie (And What To Demand Instead)