The dashboard had a bad week. We published it anyway, on the same page, on the same day. This is what that looked like, and why the bad week is exactly the week the receipts have to be visible.

The week, redacted

A live SWU engagement runs a small batch of tasks through a model-plus-glue pipeline that we operate for the client. Numbers and identifiers are redacted; the shape is real. (Class C — drawn from our own integration logs over the engagement window.)

Going into the week, the six lines on the dashboard looked normal. Latency median was where it had been. Cost per useful task was tracking flat. Refusal rate was inside the band. Override rate had been creeping down for three weeks, which we had flagged but not yet explained. Falsifier status was green on the two open claims. Evidence-class mix was healthy — mostly A and C, with one E.
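For readers who want to picture the six lines as data, here is a minimal sketch of one week's snapshot. The field names, units, band limits, and every value are our assumptions for illustration; the engagement's real figures are redacted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyReceipts:
    """One week's snapshot of the six dashboard lines (hypothetical field names)."""
    latency_median_s: float
    cost_per_useful_task: float
    refusal_rate: float        # share of tasks the pipeline declined
    override_rate: float       # share of outputs the operator marked "rewrote"
    falsifier_status: dict     # claim id -> "green", "red", or "unverified"
    evidence_class_mix: dict   # evidence class letter -> number of claims


def lines_outside_band(week, bands):
    """Return the names of numeric lines whose value falls outside its (low, high) band."""
    return [
        name
        for name, (low, high) in bands.items()
        if not low <= getattr(week, name) <= high
    ]


# Placeholder values only; they are not the engagement's numbers.
normal_week = WeeklyReceipts(
    latency_median_s=2.0,
    cost_per_useful_task=0.10,
    refusal_rate=0.05,
    override_rate=0.20,
    falsifier_status={"claim-1": "green", "claim-2": "green"},
    evidence_class_mix={"A": 3, "C": 3, "E": 1},
)
bands = {"refusal_rate": (0.02, 0.08), "override_rate": (0.10, 0.30)}
print(lines_outside_band(normal_week, bands))  # prints []
```

Keeping each line as its own field, rather than folding them into one number, is what lets a single line leave its band without the others hiding it.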

Then, on a Tuesday, the override-rate line jumped. Not a small move. The share of outputs the operator marked "rewrote" roughly doubled inside one day, and stayed there.

What the dashboard caught before anyone asked

The reflex on a vendor scorecard is to wait, see if it settles, and explain it in the next monthly report. The six-receipts discipline does not allow that. The override line moved, so by the next morning the dashboard had to show the move and we had to publish the reason, or publish that we did not yet know it. (Class C: this is the operating rule, not an aspiration.)
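The rule is simple enough to write down as a check. The sketch below assumes day-level granularity for "by the next morning" and uses an arbitrary date; the function names and the placeholder note are hypothetical, not our production tooling.

```python
from datetime import date, timedelta

NOT_YET_KNOWN = "reason not yet known"  # hypothetical placeholder note


def publication_deadline(move_day: date) -> date:
    """A move seen on move_day must be explained, or marked unknown, by the next day."""
    return move_day + timedelta(days=1)


def rule_met(move_day: date, published_day, note: str) -> bool:
    """True if some note (a reason or NOT_YET_KNOWN) went out by the deadline."""
    if published_day is None or not note.strip():
        return False
    return published_day <= publication_deadline(move_day)


tuesday = date(2024, 1, 2)  # an arbitrary Tuesday, not the engagement's real date
print(rule_met(tuesday, date(2024, 1, 3), NOT_YET_KNOWN))      # True
print(rule_met(tuesday, date(2024, 1, 5), "upstream change"))  # False
```

Note that "we do not know yet" satisfies the rule; silence does not.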

What we found, by pulling the four lines that move together:

Latency 90th percentile had drifted up the previous Thursday. Not enough to trip an alarm; enough to push some tasks past the operator's patience threshold.
Refusal rate had dropped in the same window — the "missing context" bucket specifically. That looked like a win in isolation.
Cost per useful task was unchanged on the headline, but the gap between cost per generation and cost per useful task had widened.
Override rate jumped because the operator was rewriting, rather than keeping, the outputs the system now produced on prompts it had previously refused.
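The cost observation is easier to see with arithmetic. In the sketch below every number is invented to reproduce the pattern, and we assume a "useful task" is an output the operator kept without rewriting: the headline cost per useful task holds still while the gap to cost per generation widens.

```python
def cost_per_generation(total_cost, generations):
    """Spend divided by every output produced, kept or not."""
    return total_cost / generations


def cost_per_useful_task(total_cost, useful_tasks):
    """Spend divided by only the outputs the operator kept."""
    return total_cost / useful_tasks


def cost_gap(total_cost, generations, useful_tasks):
    """How much more a useful task costs than a raw generation."""
    return (cost_per_useful_task(total_cost, useful_tasks)
            - cost_per_generation(total_cost, generations))


# Invented figures: same spend and same number of kept outputs,
# but more generations once the system stops refusing.
before = cost_gap(total_cost=100.0, generations=100, useful_tasks=80)
after = cost_gap(total_cost=100.0, generations=125, useful_tasks=80)

print(cost_per_useful_task(100.0, 80))     # 1.25 in both weeks
print(round(before, 2), round(after, 2))   # 0.25 0.45
```

A widening gap means more generations are being paid for per output that survives the operator.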

Read together, the four lines told one story: an upstream change had made the model less likely to decline ambiguous prompts, and the outputs it now produced on those prompts were the ones the human was rewriting. The "refusal-rate win" was the override-rate loss in a different costume.

None of that would have been visible on a single composite score. It was visible because the six lines are kept separate, and because they are allowed — required, actually — to disagree.
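A toy illustration of why the composite hides this. The 0 to 100 scoring scale, the equal weighting, and all values are invented; the point is only that offsetting moves cancel in a mean and stay visible line by line.

```python
def composite(scores):
    """Unweighted mean of per-line scores (0-100, higher is better)."""
    return sum(scores.values()) / len(scores)


def per_line_deltas(before, after):
    """Change in each line's score, kept separate."""
    return {name: after[name] - before[name] for name in before}


# Invented scores: the refusal line improves, the override line degrades.
before = {"latency": 70, "cost": 70, "refusal": 70, "override": 70}
after = {"latency": 65, "cost": 70, "refusal": 90, "override": 55}

print(composite(before), composite(after))  # 70.0 70.0
print(per_line_deltas(before, after))
# {'latency': -5, 'cost': 0, 'refusal': 20, 'override': -15}
```

The composite reports no change. The separate lines show a refusal gain paired with an override loss, which is the pairing that prompted the investigation.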

What we changed

We did three things, in this order, on the same engagement page:

1. Marked the affected claim unverified for the duration of the investigation. The claim was "the pipeline reduces operator rewrite load on this task class." We could not stand behind that this week, so the page said so.
2. Rolled the prompt scaffolding back to the previous week's configuration for the affected task class, and ran the falsifier, a written test of whether the prompts in use produce a measurable reduction in operator rewrites versus baseline. The falsifier passed on the rollback.
3. Re-published the engagement page with the override-rate spike still visible in the history, a short note on what we found, and the falsifier status restored to green on the rolled-back configuration. (Class C.)
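The status logic behind those steps can be sketched as follows. The threshold, the rates, and the "red" status name are assumptions for illustration; the real falsifier is a written test specific to the engagement.

```python
def falsifier_passes(baseline_rewrite_rate, observed_rewrite_rate, min_reduction=0.05):
    """Pass only if rewrites fell by at least min_reduction (absolute) versus baseline."""
    return baseline_rewrite_rate - observed_rewrite_rate >= min_reduction


def claim_status(under_investigation, baseline_rewrite_rate, observed_rewrite_rate):
    """Status shown on the page for the rewrite-load claim."""
    if under_investigation:
        return "unverified"  # we do not stand behind the claim while investigating
    if falsifier_passes(baseline_rewrite_rate, observed_rewrite_rate):
        return "green"
    return "red"  # hypothetical name for a failed falsifier


# Invented rates: baseline 0.40, spike week 0.40, rolled-back configuration 0.20.
print(claim_status(True, 0.40, 0.40))   # unverified
print(claim_status(False, 0.40, 0.40))  # red
print(claim_status(False, 0.40, 0.20))  # green
```

The order matters: the claim goes to unverified first, and only a passing falsifier on the configuration actually in use moves it back to green.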

The history line on the dashboard still shows the spike. We did not smooth it. The bad week is part of the record because the practice is worthless if the only weeks visible are the good ones.

Why publish the bad week at all

A receipts page that only shows good weeks is a marketing page. The whole point of publishing the six lines is that they have to be allowed to move against you — otherwise the client has no way to tell whether the dashboard is measuring the system or measuring the vendor's mood.

There is a parallel conversation happening in the wider field about over-investing in scaling without enough investment in verifying the outputs. Dr. Alianna Maren has written about that anxiety in her tummy-churning piece; the verify-the-output discipline she points at is exactly what an honest receipts page operationalizes for a working engagement. (Class E.)

The bad week is the week the discipline earns its keep. Anyone can publish a green dashboard. The test is whether the page still tells the truth on a Tuesday when one of the lines moves the wrong way.

If you want the broader posture this sits inside, see the pillar piece on measurement honesty. The wider architecture that makes a bad-week publication auditable is described in the transparency architecture overview. If you want to apply this discipline to your own engagement — including the parts that publish the bad weeks — the entry point is the workshop.

The Six Receipts On A Bad Week — And Why We Publish Those Too