Science. Cell Lab

Built to be proven wrong.

This is the strongest artifact we have, because it is the one most able to fail in public. A hidden 216-state service cell is hit by disturbances it never announces. Controllers see only noisy telemetry and must keep the cell viable. The claims, the pass lines, and the places where the active-inference controller loses are all written down before the run, and reported as plainly as the wins.

216 hidden states · 7 disturbance families · pre-registered and seeded · observation-only agent

The claims and their pass lines

Five statements, each with a line it must clear.

A claim with no pass line is a slogan. Each of these has a falsification criterion and a command you can run. The status next to each is the result from the committed cache.

C1
It beats a random controller at keeping the cell viable
Pass line. Significantly beats random in at least 4 of 7 disturbance families.
Recorded. Beats random in 7 of 7 families, significant in 6 of 7.
C2
Its expected free energy is a usable planning signal
Pass line. The committed action is among the lowest-energy policies in at least 80% of decisions.
Recorded. 24 of 24 decisions in the probe.
C3
Variational free energy upper-bounds surprisal
Pass line. The bound holds for every sampled belief, with equality at the posterior.
Recorded. Holds. Toy anchor: F([0.5, 0.5]) = 0.945 nats against a surprisal of 0.371 nats.
C4
The agent is observation-only. The Markov blanket does not leak
Pass line. No code path reads the hidden state. Distinct observation tuples stay below 216.
Recorded. 33 distinct most-likely tuples for 216 hidden states.
C5
Results reproduce, and the public engine matches the canonical code
Pass line. Seeded trajectories are bit-identical. The inline engine mirrors the reference kernel.
Recorded. Holds.
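
The C3 bound can be checked by hand on a two-state model. The sketch below is illustrative only: the prior and likelihood values are assumptions, not taken from the benchmark, chosen because they land close to the quoted toy anchor. It computes F(q) = sum over s of q(s) [ln q(s) - ln p(o, s)] in nats and compares it with the surprisal -ln p(o).

```python
import math

# Hypothetical two-state model (assumed values, not from the benchmark).
PRIOR = [0.7, 0.3]        # p(s)
LIKELIHOOD = [0.9, 0.2]   # p(o | s) for the one outcome that was observed

# Joint p(o, s) for the observed outcome, and the evidence p(o).
JOINT = [p * l for p, l in zip(PRIOR, LIKELIHOOD)]
EVIDENCE = sum(JOINT)


def surprisal():
    """Surprisal -ln p(o), in nats."""
    return -math.log(EVIDENCE)


def free_energy(q):
    """Variational free energy F(q) = E_q[ln q(s) - ln p(o, s)], in nats."""
    return sum(qi * (math.log(qi) - math.log(ji))
               for qi, ji in zip(q, JOINT) if qi > 0.0)


def posterior():
    """Exact posterior p(s | o), where the bound becomes an equality."""
    return [j / EVIDENCE for j in JOINT]


# Sweep a grid of beliefs q = [k/100, 1 - k/100] and check the bound on each.
GRID = [[k / 100, 1 - k / 100] for k in range(1, 100)]
BOUND_HOLDS = all(free_energy(q) >= surprisal() - 1e-12 for q in GRID)

print(round(free_energy([0.5, 0.5]), 3))   # 0.945
print(round(surprisal(), 3))               # 0.371
print(BOUND_HOLDS)                         # True
```

The gap F(q) minus surprisal is the KL divergence from q to the exact posterior, which is why it is never negative and vanishes only when q equals the posterior.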

The disconfirmation table

Where it loses, recorded as plainly as where it wins.

RecoveryScore is the fraction of ticks spent inside the viable set, weighted by excursion depth. Higher is better. Scores come from the committed cache: depth 2, six seeds, 80 ticks. UNI is the active-inference controller. The leading score in each row is marked with an asterisk, and three of those leaders are not the active-inference controller.

Disturbance family	UNI (active inference)	rule-based	random	neural
traffic_spike	0.969*	0.960	0.880	0.924
memory_leak	0.740	0.731	0.676	0.810*
bad_deploy	0.937*	0.895	0.621	0.704
database_flaky	0.759	0.803*	0.675	0.694
cache_down	0.664*	0.641	0.518	0.588
cpu_noisy_neighbor	0.749	0.740	0.703	0.824*
observability_loss	0.992*	0.988	0.961	0.974

Honest reading. The active-inference controller beats random in 7 of 7 families (significantly in 6 of 7), beats the rule-based controller in 6 of 7, and beats the neural baseline in 5 of 7. It loses in three places, all shown above. A single active-inference controller is not universally best here. That is exactly the disconfirmation the benchmark is designed to surface, not bury.
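
The tallies in the honest reading can be recomputed from the table itself. The sketch below transcribes the published scores and counts raw wins; the function names are hypothetical, and it compares mean scores only, so it says nothing about statistical significance.

```python
# RecoveryScore per family, transcribed from the disconfirmation table.
# "UNI" is the active-inference controller.
SCORES = {
    "traffic_spike":      {"UNI": 0.969, "rule-based": 0.960, "random": 0.880, "neural": 0.924},
    "memory_leak":        {"UNI": 0.740, "rule-based": 0.731, "random": 0.676, "neural": 0.810},
    "bad_deploy":         {"UNI": 0.937, "rule-based": 0.895, "random": 0.621, "neural": 0.704},
    "database_flaky":     {"UNI": 0.759, "rule-based": 0.803, "random": 0.675, "neural": 0.694},
    "cache_down":         {"UNI": 0.664, "rule-based": 0.641, "random": 0.518, "neural": 0.588},
    "cpu_noisy_neighbor": {"UNI": 0.749, "rule-based": 0.740, "random": 0.703, "neural": 0.824},
    "observability_loss": {"UNI": 0.992, "rule-based": 0.988, "random": 0.961, "neural": 0.974},
}


def wins_against(agent, rival):
    """Number of families where `agent` scores strictly higher than `rival`."""
    return sum(row[agent] > row[rival] for row in SCORES.values())


def leaders():
    """Map each family to the controller with the highest score in that row."""
    return {family: max(row, key=row.get) for family, row in SCORES.items()}


# Families where the row leader is not the active-inference controller.
LOSSES = {family: who for family, who in leaders().items() if who != "UNI"}

print(wins_against("UNI", "random"))      # 7
print(wins_against("UNI", "rule-based"))  # 6
print(wins_against("UNI", "neural"))      # 5
print(LOSSES)
```

The three entries in LOSSES are memory_leak and cpu_noisy_neighbor (neural baseline) and database_flaky (rule-based controller), matching the losses listed in the next section.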

The three losses, in plain words

Three families where another controller wins.

database_flaky

Loses to the rule-based controller.

0.803 against 0.759

The rule-based controller's broken-dependency-to-failover heuristic is very effective on its home turf. A simple rule wins here, and we record it.

memory_leak

Loses to the neural baseline.

0.810 against 0.740

A trained MLP reads the slow-leak signature more directly than a single planner does.

cpu_noisy_neighbor

Loses to the neural baseline.

0.824 against 0.749

The neural baseline is ahead. And the active-inference margin over random here is not statistically significant, so we do not claim a win against random either.

The honesty fences

Published bounds we are working to close.

These are not a badge. They are the edges of what this benchmark can honestly say today. We name them so you can hold us to them, and so the next version has a clear thing to improve.

01
Autopoiesis here means viable-set maintenance, keeping the cell inside its operating band. It is not a claim about life, agency, or self-production.
02
No consciousness claim. The agent is a belief vector and a softmax over policies. Nothing here is sentient or aware.
03
No general-intelligence claim. This is a narrow controller on a toy 216-state world.
04
The world process is separate from the agent model. The agent never sees the hidden state.
05
Variational free energy here is an information-theoretic quantity from Bayesian inference, measured in nats. It is not a thermodynamic quantity measured in joules.

The framing follows Mikkilineni (2022) and is cited, never claimed as ours. The structural controller in the live lab is our own work and does not reproduce that method.

How to prove it wrong

Beat it, or break it. Then show your work.

Run a controller that achieves a higher median RecoveryScore than the active-inference one across seeds. The database_flaky family already shows a heuristic can win.
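
As a minimal sketch of that comparison, the snippet below checks whether a challenger's median score across seeds exceeds the incumbent's. The function name and the per-seed numbers are placeholders invented for illustration; they are not results from the benchmark.

```python
from statistics import median


def beats_on_median(challenger_scores, incumbent_scores):
    """True if the challenger's median RecoveryScore across seeds is strictly higher."""
    return median(challenger_scores) > median(incumbent_scores)


# Placeholder per-seed scores (made up for illustration, one value per seed).
CHALLENGER = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82]
INCUMBENT = [0.76, 0.75, 0.77, 0.74, 0.78, 0.76]

print(beats_on_median(CHALLENGER, INCUMBENT))
```

A median is less sensitive than a mean to one unusually good or bad seed, which is presumably why the challenge is phrased in terms of it.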

Show a regime where the expected free energy does not predict the better action, or exhibit a belief where the bound fails. Either would mean a math error.

Find a leak, where a controller recovers the hidden state from a single observation. Or show a seed that does not reproduce. Every run carries its seed, its step count, and its per-tick log, so anyone can reproduce it or refute it.
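
The counting argument behind the leak check can be sketched on a toy world. Everything below is an assumption made for illustration: the 6 x 6 x 6 factorisation of the 216 hidden states and the coarse sensor are hypothetical and do not describe the real cell. The point is only that when distinct most-likely observation tuples number fewer than hidden states, at least two states share a tuple, so one observation cannot identify every state.

```python
from itertools import product

# Hypothetical toy world: 216 hidden states as three factors with six levels each.
HIDDEN_STATES = list(product(range(6), repeat=3))


def most_likely_observation(state):
    """Hypothetical coarse sensor: each factor is reported as 0, 1, or 2."""
    return tuple(level // 2 for level in state)


def distinct_observation_tuples(states, observe):
    """Count the distinct observation tuples produced across all hidden states."""
    return len({observe(state) for state in states})


N_STATES = len(HIDDEN_STATES)
N_TUPLES = distinct_observation_tuples(HIDDEN_STATES, most_likely_observation)

# Fewer tuples than states: by the pigeonhole principle some states are
# indistinguishable from a single observation.
SINGLE_SHOT_IDENTIFIABLE = N_TUPLES >= N_STATES

print(N_STATES, N_TUPLES, SINGLE_SHOT_IDENTIFIABLE)   # 216 27 False
```

A sensor that mapped every hidden state to its own tuple would flip the flag to True, which is the kind of leak the challenge asks you to find.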

Open the live Cell Lab · How to falsify our work

The paper behind this is a preprint, with expert review pending. We present it that way on purpose.

Read the preprint · See the other labs
