Science. The Lab Protocol

One cure at a time. Gates written first.

This is the discipline we hold the harder work to. Every change is tested against a matched control where one variable moves. Every test names its pass line and its falsifier before the run. An adversarial review defaults to reject, and the math has to survive before any code is written. Verdicts are PASS, PARTIAL, FAIL, or WITHHELD, never a percentage, never spun. The honest part is the method, in the open.

The first rules

Five rules, each one a brake on overclaiming.

These exist so a result means what it says. They make claims slower to earn and harder to fake.

01
One cure at a time
A second change is not deployed until the first has a recorded verdict. If two things changed at once, the result is unattributable and may not be claimed.
02
Paired by default
Same code, same world, same body, same shape. The only difference between the two arms is the one thing under test. That is the whole point of a clean result.
03
Gates written first
Every cure registers its gates before the run. PASS requires every gate on a named list. The cure is falsified, and recorded as FAIL, if a named event occurs. The falsifier is the cost of entry.
04
Continuous evidence
Time-series, not single snapshots, collected outside the model session so they survive interruptions. Behaviour is confirmed by an independent authority, not the body's own report.
05
Receipts on every claim
Each number points to a commit, a saved artifact, and a log line that reproduces it. No receipt, no claim.
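The five rules can be read as checks that run before any result is claimed. What follows is a minimal sketch under stated assumptions, not the lab's actual tooling: the names `Cure`, `Ledger`, `Receipt`, `is_paired`, and `claim` are hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt: the three pointers rule 05 asks for."""
    commit: str
    artifact: str
    log_line: str


@dataclass
class Cure:
    """Hypothetical record of one change under test."""
    name: str
    pass_gates: Tuple[str, ...]      # rule 03: the named list PASS requires
    falsifier: str                   # rule 03: the named event that means FAIL
    verdict: Optional[str] = None    # filled in only after the run


class Ledger:
    """Rule 01: refuse a second cure until the first has a recorded verdict."""

    def __init__(self) -> None:
        self.cures = []

    def register(self, cure: Cure) -> None:
        if not cure.pass_gates or not cure.falsifier:
            raise ValueError("gates and falsifier must be written before the run")
        if any(c.verdict is None for c in self.cures):
            raise RuntimeError("an earlier cure has no recorded verdict yet")
        self.cures.append(cure)


def is_paired(control: dict, treatment: dict, variable: str) -> bool:
    """Rule 02: the two arms may differ in exactly the one variable under test."""
    keys = control.keys() | treatment.keys()
    differing = sorted(k for k in keys if control.get(k) != treatment.get(k))
    return differing == [variable]


def claim(value: float, receipt: Optional[Receipt]) -> tuple:
    """Rule 05: no receipt, no claim."""
    if receipt is None or not all((receipt.commit, receipt.artifact, receipt.log_line)):
        raise ValueError("no receipt, no claim")
    return (value, receipt)
```

Rule 04 (continuous evidence) is not modelled here, since it concerns where and how the time-series is collected rather than a check on a single record.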

The adversarial review

Five roles. Reject is the default.

Each role reviews alone first, with no cross-contamination. The math-breaker speaks first and tries to refute the term outright. Only if the math survives does anyone discuss how to build it. A theorist merges the verdicts into one call.

Math-breaker

Defaults to reject. Locates the term in the math, demands its derivation, its units, its limit as counts grow, and a world where it could be gamed. Proves no per-action reward leaked.

Systems reviewer

Checks that the change is pure-language, additive, and gated behind an off-by-default switch, with a byte-identity test that ships in the same change. No foreign computation layer is allowed.
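One way to picture the byte-identity requirement: with the switch off, the new code path must produce output that is byte-for-byte the same as the old one. The sketch below is illustrative only. The functions `baseline_step` and `step`, the `enable_term` switch, and the JSON serialisation are all assumptions made for the example, not the project's real interfaces.

```python
import json


def baseline_step(state: dict) -> dict:
    """Hypothetical pre-change behaviour."""
    return {"energy": state["energy"] - 1}


def step(state: dict, *, enable_term: bool = False) -> dict:
    """Hypothetical post-change behaviour: additive, and off by default."""
    out = {"energy": state["energy"] - 1}
    if enable_term:
        out["bonus"] = 0.1  # the new term only appears when switched on
    return out


def serialize(obj: dict) -> bytes:
    """Canonical bytes: sorted keys and fixed separators, so equality is meaningful."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def byte_identical_when_off(states) -> bool:
    """The test that ships with the change: switch off means identical bytes."""
    return all(serialize(step(s)) == serialize(baseline_step(s)) for s in states)
```

A canonical serialisation matters here: without sorted keys and fixed separators, two equal results could still differ as bytes and the test would fail for the wrong reason.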

Experiment designer

A paired, pre-registered test with a named pass line and a named falsifier. Refuses single-seed storytelling: minimum sample sizes are fixed in advance.

Embodiment designer

Asks whether a drive is a real internal instability or a preference dressed up as one, and demands the action-clone test that proves no per-action scalar leaked.

Theorist, the merger

Merges the four verdicts into one: SIGN, SIGN-WITH-CHANGES, REJECT, or WITHHELD. If the math-breaker rejects, no implementation rescues it. Contradiction escalates to a human.
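The merge step can be sketched as a small function. This is a hedged illustration, not the theorist's actual procedure: the role keys and the function name `merge` are hypothetical, and the rule used to detect a contradiction (any REJECT from a later reviewer when the math-breaker did not reject) is an assumption, since the text does not define contradiction precisely.

```python
from typing import Dict

# Hypothetical role keys for the four reviewers whose verdicts are merged.
ROLES = {"math-breaker", "systems", "experiment", "embodiment"}
REVIEW_VERDICTS = {"SIGN", "SIGN-WITH-CHANGES", "REJECT"}


def merge(verdicts: Dict[str, str]) -> str:
    """Merge four independent reviews into one call."""
    if set(verdicts) != ROLES:
        raise ValueError("exactly one verdict per reviewing role is required")
    if not set(verdicts.values()) <= REVIEW_VERDICTS:
        raise ValueError("unknown verdict")

    # If the math-breaker rejects, no implementation rescues the proposal.
    if verdicts["math-breaker"] == "REJECT":
        return "REJECT"

    values = set(verdicts.values())
    # Assumption: the math survived but another reviewer rejects.
    # Treat that split as a contradiction and escalate to a human.
    if "REJECT" in values:
        return "WITHHELD"
    if "SIGN-WITH-CHANGES" in values:
        return "SIGN-WITH-CHANGES"
    return "SIGN"


example = {
    "math-breaker": "SIGN",
    "systems": "SIGN-WITH-CHANGES",
    "experiment": "SIGN",
    "embodiment": "SIGN",
}
print(merge(example))  # SIGN-WITH-CHANGES
```

The math-breaker check comes first on purpose, mirroring the ordering in the review: nothing downstream is consulted if the math does not survive.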

Why the review actually bites

Three rules for the reviewers themselves.

Name the math before the metaphor: Locate a proposal in the model first. This blocks curiosity, need, or awareness language from hiding an undefined number.
Demand the falsifier before the cure: Every reviewer states the condition that would reject the proposal before suggesting any fix.
Force typed artifacts, not prose approval: An accepted change ships a typed spec, property-test validators, a paired test design, and a short report. Approval is never just a nod.

The verdict classes

Four honest verdicts. One of them can lose.

PASS
The gate fired
Consistent, and attributable to the single variable under test.
PARTIAL
A sub-claim held
The full claim needs more, named and scheduled. Recorded, never spun into the larger win.
FAIL
The falsifier fired
The registered no-go was observed. The map updates.
WITHHELD
The reviewers disagreed
A contradiction in the verdicts, escalated to a human rather than resolved quietly.

A real PARTIAL, recorded in full: a curiosity term suppressed a hoarding pattern in the test arm but did not, on its own, break the deeper behavioural plateau. The information-gain advantage decayed exactly as the math says it must once the model is well-fit, which preserves the no-reward guarantee. We logged it as PARTIAL, with the receipt, and named the next cure to try. We did not spin hoard-suppression into a plateau-break.
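The four classes, and the PARTIAL case just described, can be sketched as a single classification function. The function name `classify`, the gate names, and the order in which the conditions are checked are assumptions made for illustration. The text does not say which class applies when no gate holds and no falsifier fires, so the sketch refuses to answer in that case rather than guess.

```python
from typing import Dict


def classify(gates: Dict[str, bool], falsifier_observed: bool,
             reviewers_agree: bool = True) -> str:
    """Map one run's pre-registered gate outcomes to a verdict class."""
    if not gates:
        raise ValueError("no gates registered; nothing can be claimed")
    if not reviewers_agree:
        return "WITHHELD"   # contradiction, escalated to a human
    if falsifier_observed:
        return "FAIL"       # the registered no-go was observed
    if all(gates.values()):
        return "PASS"       # every named gate held
    if any(gates.values()):
        return "PARTIAL"    # a sub-claim held; the full claim needs more
    raise ValueError("no gate held and no falsifier fired; no class is defined here")


# Hypothetical gate names for the curiosity-term run described above.
run = {"hoard_suppressed": True, "plateau_broken": False}
print(classify(run, falsifier_observed=False))  # PARTIAL
```

Note that the function returns a class label only. It never returns a percentage, which is the point of having classes at all.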

The honesty fences

The claim fence, carried in full.

01
On their own, operational behavioural and organisational measures are necessary-but-not-sufficient substrates and carry ZERO evidential weight for awareness, consciousness, or life. Passing a gate demonstrates the named behaviour, never experience.
02
The UNI preprint (DOI 10.5281/zenodo.19785799) is an unrefereed working preprint. Peer review is pending.
03
The pure core is implementation-complete, but the live colony adapter is a documented bridge. We state which parts are demonstrated, which are specified, and which are aspirational.


The same discipline we run in the lab is the discipline we carry into your engagement.

