Science. Cell Lab
Built to be proven wrong.
This is the strongest artifact we have, because it is the one most able to fail in public. A hidden 216-state service cell is hit by disturbances it never announces. Controllers see only noisy telemetry and must keep the cell viable. The claims, the pass lines, and the places where the active-inference controller loses are all written down before the run, and reported as plainly as the wins.
The claims and their pass lines
Five statements, each with a line it must clear.
A claim with no pass line is a slogan. Each of these has a falsification criterion and a command you can run. The status next to each is the result from the committed cache.
- C1
It beats a random controller at keeping the cell viable
Pass line. Significantly beats random in at least 4 of 7 disturbance families.
Recorded. Beats random in 7 of 7 families, significant in 6 of 7.
- C2
Its expected free energy is a usable planning signal
Pass line. The committed action is among the lowest-energy policies in at least 80% of decisions.
Recorded. 24 of 24 decisions in the probe.
- C3
Variational free energy upper-bounds surprisal
Pass line. The bound holds for every sampled belief, with equality at the posterior.
Recorded. Holds. Toy anchor F([0.5,0.5]) = 0.945, surprisal 0.371.
- C4
The agent is observation-only. The Markov blanket does not leak
Pass line. No code path reads the hidden state. Distinct observation tuples stay below 216.
Recorded. 33 distinct most-likely tuples for 216 hidden states.
- C5
Results reproduce, and the public engine matches the canonical code
Pass line. Seeded trajectories are bit-identical. The inline engine mirrors the reference kernel.
Recorded. Holds.
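The bound in C3 can be checked by hand on a two-state toy. The sketch below is an illustration, not the committed probe. The prior [0.7, 0.3] and likelihood [0.9, 0.2] are assumed values, chosen only because they reproduce the recorded anchor figures, and every function name is hypothetical. It computes F(q) = sum over s of q(s) (ln q(s) - ln p(o, s)) in nats and compares it with the surprisal -ln p(o).

```python
import math
import random

# Assumed toy model (not taken from the repository): two hidden states
# and one observed outcome o. These values reproduce the recorded anchor.
PRIOR = [0.7, 0.3]        # p(s)
LIKELIHOOD = [0.9, 0.2]   # p(o | s) for the observed o


def joint():
    """p(o, s) for each hidden state s."""
    return [p * l for p, l in zip(PRIOR, LIKELIHOOD)]


def surprisal():
    """-ln p(o), in nats."""
    return -math.log(sum(joint()))


def posterior():
    """Exact posterior p(s | o)."""
    j = joint()
    z = sum(j)
    return [x / z for x in j]


def free_energy(q):
    """F(q) = sum_s q(s) * (ln q(s) - ln p(o, s)), in nats."""
    return sum(
        qs * (math.log(qs) - math.log(js))
        for qs, js in zip(q, joint())
        if qs > 0.0  # 0 * ln 0 is taken as 0
    )


def bound_holds(samples=1000, seed=0, tol=1e-12):
    """Check F(q) >= surprisal on randomly sampled two-state beliefs."""
    rng = random.Random(seed)
    s = surprisal()
    for _ in range(samples):
        a = rng.random()
        if free_energy([a, 1.0 - a]) < s - tol:
            return False
    return True


if __name__ == "__main__":
    print(round(free_energy([0.5, 0.5]), 3))   # 0.945
    print(round(surprisal(), 3))               # 0.371
    print(bound_holds())                       # True
```

The gap between F(q) and the surprisal is the KL divergence from q to the exact posterior, which is why the bound is tight only at `posterior()`.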
The disconfirmation table
Where it loses, recorded as plainly as where it wins.
RecoveryScore is the fraction of ticks inside the viable set, weighted by excursion depth. Higher is better. Scores come from the committed cache: depth 2, six seeds, 80 ticks. The UNI column is the active-inference controller. The leading score in each row is highlighted, and three of those leaders are not the active-inference controller.
| Disturbance family | UNI | rule-based | random | neural |
|---|---|---|---|---|
| traffic_spike | **0.969** | 0.960 | 0.880 | 0.924 |
| memory_leak | 0.740 | 0.731 | 0.676 | **0.810** |
| bad_deploy | **0.937** | 0.895 | 0.621 | 0.704 |
| database_flaky | 0.759 | **0.803** | 0.675 | 0.694 |
| cache_down | **0.664** | 0.641 | 0.518 | 0.588 |
| cpu_noisy_neighbor | 0.749 | 0.740 | 0.703 | **0.824** |
| observability_loss | **0.992** | 0.988 | 0.961 | 0.974 |
Honest reading. The active-inference controller beats random in 7 of 7 families (significant in 6 of 7), beats the rule-based controller in 6 of 7, and beats the neural baseline in 5 of 7. It loses in three places, all shown above. A single active-inference controller is not universally best here. That is exactly the disconfirmation the benchmark is designed to surface, not bury.
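The counts in this reading can be recomputed directly from the table. The sketch below copies the scores as printed and tallies raw wins. It assumes the UNI column is the active-inference controller, and it compares mean scores only, so it says nothing about statistical significance. The function names are illustrative.

```python
# Scores copied from the disconfirmation table (committed cache).
SCORES = {
    "traffic_spike":      {"UNI": 0.969, "rule-based": 0.960, "random": 0.880, "neural": 0.924},
    "memory_leak":        {"UNI": 0.740, "rule-based": 0.731, "random": 0.676, "neural": 0.810},
    "bad_deploy":         {"UNI": 0.937, "rule-based": 0.895, "random": 0.621, "neural": 0.704},
    "database_flaky":     {"UNI": 0.759, "rule-based": 0.803, "random": 0.675, "neural": 0.694},
    "cache_down":         {"UNI": 0.664, "rule-based": 0.641, "random": 0.518, "neural": 0.588},
    "cpu_noisy_neighbor": {"UNI": 0.749, "rule-based": 0.740, "random": 0.703, "neural": 0.824},
    "observability_loss": {"UNI": 0.992, "rule-based": 0.988, "random": 0.961, "neural": 0.974},
}


def wins_against(baseline, controller="UNI"):
    """Number of families where `controller` scores above `baseline`."""
    return sum(1 for row in SCORES.values() if row[controller] > row[baseline])


def leaders():
    """Map each family to the controller with the highest score."""
    return {family: max(row, key=row.get) for family, row in SCORES.items()}


if __name__ == "__main__":
    for baseline in ("random", "rule-based", "neural"):
        print(baseline, wins_against(baseline), "of", len(SCORES))
    lost = {f: c for f, c in leaders().items() if c != "UNI"}
    print(lost)
```

The three entries in `lost` are the families where another controller leads: database_flaky, memory_leak, and cpu_noisy_neighbor.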
The three losses, in plain words
Three families where another controller wins.
database_flaky. Loses to the rule-based controller.
0.803 against 0.759
The rule-based controller's broken-dependency-to-failover heuristic is very effective on its home turf. A simple rule wins here, and we record it.
memory_leak. Loses to the neural baseline.
0.810 against 0.740
A trained MLP reads the slow-leak signature more directly than a single planner does.
cpu_noisy_neighbor. Loses to the neural baseline.
0.824 against 0.749
The neural baseline is ahead, and the active-inference margin over random here is not statistically significant, so we do not claim a win against random either.
The honesty fences
Published bounds we are working to close.
These are not a badge. They are the edges of what this benchmark can honestly say today. We name them so you can hold us to them, and so the next version has a clear thing to improve.
- 01
Autopoiesis here means viable-set maintenance, keeping the cell inside its operating band. It is not a claim about life, agency, or self-production.
- 02
No consciousness claim. The agent is a belief vector and a softmax over policies. Nothing here is sentient or aware.
- 03
No general-intelligence claim. This is a narrow controller on a toy 216-state world.
- 04
The world process is separate from the agent model. The agent never sees the hidden state.
- 05
Variational free energy here is the information-theoretic quantity from Bayesian inference, measured in nats. It is not a thermodynamic quantity in joules.
The framing follows Mikkilineni (2022). The structural controller in the live lab is our own work and does not reproduce that method. The framing is cited, never claimed as ours.
How to prove it wrong
Beat it, or break it. Then show your work.
Run a controller that achieves a higher median RecoveryScore than the active-inference one across seeds. The database_flaky family already shows a heuristic can win.
Show a regime where the expected free energy does not predict the better action, or exhibit a belief where the bound fails. Either would mean a math error.
Find a leak, where a controller recovers the hidden state from a single observation. Or show a seed that does not reproduce. Every run carries its seed, its step count, and its per-tick log, so anyone can reproduce it or refute it.
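A reproducibility check of this kind can be as small as hashing the per-tick log of two runs with the same seed. The sketch below is an assumption about shape, not the project's harness: `run_episode` is a hypothetical stand-in for a real seeded run, and the 80-tick default simply mirrors the cache settings quoted above.

```python
import hashlib
import random
import struct


def run_episode(seed, ticks=80):
    """Hypothetical stand-in for a seeded run; returns a per-tick log."""
    rng = random.Random(seed)  # local generator, no shared global state
    level = 0.5
    log = []
    for tick in range(ticks):
        # Bounded random walk standing in for the cell's telemetry.
        level = min(1.0, max(0.0, level + rng.uniform(-0.1, 0.1)))
        log.append((tick, level))
    return log


def digest(log):
    """SHA-256 over the exact bytes of every (tick, value) pair."""
    h = hashlib.sha256()
    for tick, level in log:
        h.update(struct.pack("<id", tick, level))
    return h.hexdigest()


def reproduces(seed, ticks=80):
    """True if two runs with the same seed are bit-identical."""
    return digest(run_episode(seed, ticks)) == digest(run_episode(seed, ticks))


if __name__ == "__main__":
    print(reproduces(7))
```

Hashing the packed float bytes rather than rounded text means any drift, however small, changes the digest. A refutation would be a seed for which `reproduces` returns False on the real run function.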
