If you have hired anyone to "do AI" for you in the last three years, you already know the pattern: a deck full of arrows, a demo that worked once on a Tuesday, an invoice, and a quiet hope that nobody asks what changed.

We do the opposite. This post defines exactly what "receipts" means inside a SolutionWright engagement — what gets written down, what gets tagged, what gets tested, and what you, the client, can open and inspect on any given day. If we ever drift from this standard, this page is the stick you hit us with.

The central claim, said plainly

Every non-trivial claim we make about your project is tagged with an evidence class and paired with a falsifier — a concrete observation that, if it occurred, would prove the claim wrong. Both live in a single append-only file called the ledger, which is the source of truth for the engagement (Class C).

Not a slide. Not an email thread. One file. Versioned. Yours to read.

If a claim cannot be evidence-classed and falsified, we either reduce it until it can be, or we say out loud that we believe it but cannot yet show it (Class U — unverified). We will not dress an opinion up as a result.

The six evidence classes

These are the labels we attach to every claim, in the ledger and in every artifact we hand you:

Class A — empirical-in-session. Something we ran, in front of you or on a recorded run, that produced the output we claim. The receipt is the run log.
Class B — code or inspection. A claim grounded in code that exists and that you can read. The receipt is a file path and a commit hash (Class B).
Class C — configuration or integration. A claim about how systems are wired — a webhook, a permission, a deployment target. The receipt is the configuration file or the platform's own panel (Class C).
Class E — expert citation. A claim that rests on someone else's published work. The receipt is the citation, with a URL and the relevant section.
Class F — falsifier present. A claim accompanied by the test that would refute it. The receipt is the test itself (Class F).
Class U — unverified. A claim we have not yet grounded. We mark it so neither of us forgets.

Every line in the ledger carries one of these tags. There is no untagged sentence.
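The "no untagged sentence" rule is mechanical enough to check by machine. What follows is a minimal sketch of one way to do that, not a description of any real tooling: the names EvidenceClass, tags_in and untagged_lines are hypothetical, and the "(Class X)" tag format is an assumption borrowed from how tags are written in this post.

```python
import re
from enum import Enum


class EvidenceClass(Enum):
    """The six evidence classes, keyed by their letter."""
    A = "empirical-in-session"
    B = "code or inspection"
    C = "configuration or integration"
    E = "expert citation"
    F = "falsifier present"
    U = "unverified"


# Assumption: tags look like "(Class C)" or "(Class U ...)", as in this post.
TAG_PATTERN = re.compile(r"\(Class ([A-Z])\b")


def tags_in(line: str) -> list[EvidenceClass]:
    """Return every recognised evidence class tagged in a line."""
    return [
        EvidenceClass[letter]
        for letter in TAG_PATTERN.findall(line)
        if letter in EvidenceClass.__members__
    ]


def untagged_lines(lines: list[str]) -> list[str]:
    """Return the non-blank lines that carry no recognised tag."""
    return [line for line in lines if line.strip() and not tags_in(line)]
```

A letter outside the six, such as "(Class D)", is treated as no tag at all, so a typo in a tag surfaces as an untagged line rather than passing silently.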

The falsifier rule

A falsifier is not optional. For each claim we make about your system — "the lead form delivers to your CRM", "the dashboard reflects the live database", "this prompt is robust to user typos" — we write the observation that would prove us wrong, and then we go look for it.

"The form delivers to your CRM" has a falsifier: submit a test row with a known marker and confirm it lands in the CRM with that marker present. If it does not land, the claim is dead and we say so the same day.

This sounds obvious. It is rare. It is rare because once you commit to falsifiers in writing, you lose the ability to spin a half-working system into a finished one. That loss is the point.
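The CRM falsifier described above could be sketched roughly like this. It is an illustration under stated assumptions: submit and fetch_rows are hypothetical stand-ins for whatever the real form and CRM expose, the "note" field is invented for the example, and a real check would likely need to wait or retry if delivery is asynchronous.

```python
import uuid
from typing import Callable, Iterable


def form_delivers_to_crm(
    submit: Callable[[dict], None],
    fetch_rows: Callable[[], Iterable[dict]],
) -> bool:
    """Submit a row with a unique marker and report whether it landed."""
    # A fresh marker per run means an old test row cannot cause a false pass.
    marker = f"receipt-test-{uuid.uuid4().hex}"
    submit({"email": "test@example.invalid", "note": marker})
    return any(row.get("note") == marker for row in fetch_rows())


# Hypothetical stand-ins: an in-memory "CRM" and two forms wired to it.
crm_rows: list[dict] = []
working_form = crm_rows.append      # delivers the row
broken_form = lambda row: None      # silently drops the row

print(form_delivers_to_crm(working_form, lambda: crm_rows))  # True
print(form_delivers_to_crm(broken_form, lambda: crm_rows))   # False
```

The second call is the interesting one: the form accepts the submission without error, and only the marker check shows that nothing arrived.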

What the ledger contains

The ledger is a plain text file, one entry per day per workstream, with this shape:

a one-line claim
the evidence class (A/B/C/E/F/U)
the receipt (file path, commit, URL, run log, or named test)
the falsifier
the date and the human who wrote the entry

It is boring to read. That is the feature. Boring is what unspun work looks like.
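For readers who think in code, the entry shape listed above might look like the sketch below. Everything beyond the five fields is an assumption made for illustration: the JSON Lines storage format, the ISO date string, the names LedgerEntry, append_entry and read_entries, and the rule that only Class U entries may omit a receipt. The sample entry is made up.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

VALID_CLASSES = {"A", "B", "C", "E", "F", "U"}


@dataclass(frozen=True)
class LedgerEntry:
    claim: str           # the one-line claim
    evidence_class: str  # one of A/B/C/E/F/U
    receipt: str         # file path, commit, URL, run log, or named test
    falsifier: str       # the observation that would prove the claim wrong
    entry_date: str      # ISO date string (format is an assumption)
    author: str          # the human who wrote the entry

    def __post_init__(self) -> None:
        if self.evidence_class not in VALID_CLASSES:
            raise ValueError(f"unknown evidence class: {self.evidence_class!r}")
        if "\n" in self.claim:
            raise ValueError("the claim must fit on one line")
        if not self.falsifier.strip():
            raise ValueError("every entry needs a falsifier")
        # Assumption: only Class U (unverified) may omit a receipt.
        if self.evidence_class != "U" and not self.receipt.strip():
            raise ValueError("only Class U entries may lack a receipt")


def append_entry(ledger_path: Path, entry: LedgerEntry) -> None:
    """Add one entry to the end of the file; existing lines are not rewritten."""
    with ledger_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(entry)) + "\n")


def read_entries(ledger_path: Path) -> list[LedgerEntry]:
    """Load every entry, re-running the validation on each one."""
    with ledger_path.open(encoding="utf-8") as handle:
        return [LedgerEntry(**json.loads(line)) for line in handle if line.strip()]


# Usage with a throwaway file and a made-up entry.
with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"
    append_entry(ledger, LedgerEntry(
        claim="The lead form delivers to the CRM",
        evidence_class="F",
        receipt="tests/test_form_to_crm.py",  # hypothetical path
        falsifier="A test row with a known marker does not land in the CRM",
        entry_date="2025-01-15",
        author="J. Doe",
    ))
    print(read_entries(ledger)[0].evidence_class)  # F
```

Opening the file in append mode is only a convention, not an enforcement mechanism. In practice the versioning mentioned earlier is what would make a rewritten entry visible.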

What you get at week 1, week 4, week 12

We tell you up front what is reasonable to see at each milestone. We do not promise anything beyond what falsifiers can confirm.

Week 1. The ledger exists and has its first entries. The scope is written down in your words, not ours. We have read your existing systems and recorded what is actually there (Class B) rather than what was supposed to be there. At least one falsifier has run — usually a smoke test on the integration that the engagement most depends on.

Week 4. A working slice is live end-to-end. Not a demo — the actual production path, behind your domain or inside your tooling, with the falsifier rerun and passing. The ledger now contains the configuration receipts (Class C) for the live wiring. Where the slice does not yet work, that is also written down. We are not in the business of hiding red entries.

Week 12. The engagement either has compounding receipts — a body of A/B/C entries you can point to without us in the room — or it does not, and we have already told you it does not. Twelve weeks is long enough that drift between what we said and what is true would be visible. The ledger is the place where that drift cannot hide.

If at any milestone you cannot find the receipt for a claim, you have caught us, and the right move is to escalate to a refund or a rewrite of the claim. That is the deal.
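The milestone check itself can be run by anyone with read access to the ledger. As a rough sketch, assuming entries have already been loaded as dictionaries with hypothetical keys "claim", "evidence_class" and "receipt", and assuming that a claim openly marked Class U counts as disclosed rather than caught:

```python
def claims_without_receipts(entries: list[dict]) -> list[str]:
    """Return the claims that assert an evidence class but show no receipt."""
    missing = []
    for entry in entries:
        # Assumption: Class U is an honest "not yet shown", so it is skipped.
        disclosed_unverified = entry.get("evidence_class") == "U"
        has_receipt = bool((entry.get("receipt") or "").strip())
        if not disclosed_unverified and not has_receipt:
            missing.append(entry["claim"])
    return missing


# Made-up entries for illustration.
sample = [
    {"claim": "Dashboard reflects the live database", "evidence_class": "A", "receipt": "runs/dashboard.log"},
    {"claim": "Webhook points at production", "evidence_class": "C", "receipt": ""},
    {"claim": "Prompt is robust to user typos", "evidence_class": "U", "receipt": ""},
]
print(claims_without_receipts(sample))  # ['Webhook points at production']
```

An empty result does not prove the receipts are good, only that each one is present. Opening them is still a job for a human.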

Why we are this strict

The wider AI services market right now leans hard on under-verified promises. Alicia Maren has written about the unease this creates in her piece "AI Is Tummy-Churning", which makes the case that human verification of AI output is non-optional and that the field has been pricing it as optional. We agree, and the ledger is how we operationalize that for clients.

There is also a sharper edge to the same problem: training and prompting choices that look small can compound into behavior that nobody signed up for. Maren's summary of Anthropic's recent emergent-misalignment work, "AI Misalignment: Anthropic's Studies and More", makes the operational case for governance gates and explicit approvals on the systems you ship. We treat that as a baseline, not a stretch goal.

What this is not

This is not a guarantee that every claim we make will turn out true. It is a guarantee that every claim we make will be testable, that the test will be written down before we report on it, and that the result — pass or fail — will be recorded (Class F).

We will get things wrong. We will write them down when we get them wrong. That is the standard.

The Trust Receipts Standard: How An Honest AI Engagement Actually Looks