This document traces how the theory of Attribution Collapse in Adaptive Systems was built, broken, rebuilt, and sharpened over multiple rounds of peer review by Claude, Codex, ChatGPT, Gemini, and Perplexity.
It is not a polished narrative. It is an honest log of what each stage of pressure removed, and what remained.
Phase 0 Original version — entirely removed
Observer / Selfhood / Consciousness Theory
The original document (Operational Embedded Agency Theory) centered on observer definitions, F-coalgebraic selfhood fixed points, AQFT ontology via von Neumann algebras, and connections to IIT.
observer (operational definition)
self / selfhood (F-coalgebraic)
consciousness / qualia
AQFT ontology
IIT connection
High-PE Paradox (initial)
Removed — reason
These concepts are not amenable to peer review in the standard engineering/ML sense and are not falsifiable within the MOAT benchmark framework. The multi-AI review confirmed that the engineering failure mode does not require phenomenological framing. Removing them made the surviving core more precise, not less.
Phase 1 theory.html — engineering failure formulation
Attribution Collapse as External Residual Indistinguishability
After removing observer/consciousness framing, the surviving core became: an adaptive system under partial observability can, through its own modified policy, distort the statistical structure of future residuals, recursively destroying its own future identifiability.
theory.html — Abstract Claim
An update to the wrong latent channel distorts the policy; the distorted policy contaminates future trajectory evidence; and trajectory-level distinguishability is recursively degraded (Recursive Attribution Poisoning).
PE ⊥ Attribution Separability
Directional Collapse metric
MOAT v5g design
Contamination Jacobian (sketch)
Hidden Confounder analysis
High-PE Paradox (formally retracted)
Strong "Recursive Attribution Poisoning" empirical claim
The benchmark code (moat_v5g.py) was written to validate this. Code review identified the structural gap: wrong_strength was externally injected, not generated by the agent. Stage 1 was geometry validation, not a demonstration of the closed loop.
First experiment
Stage 1 External geometry validation
Stage 1: Directional Depletion → AUC Collapse
External wrong_strength reduces DE_B while holding PE and total energy fixed. AUC_residual collapses below 0.60.
✓Directional depletion alone, independent of PE or energy, is sufficient to destroy residual trajectory distinguishability.
×Does not demonstrate the closed loop. Depletion is externally supplied.
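The Stage 1 geometry can be pictured with a small numerical sketch. It is not taken from moat_v5g.py. It assumes a 2-D action covariance Σ, takes DE_B to be the quadratic form v_Bᵀ Σ v_B, treats the smallest eigenvalue of Σ as the PE level, and treats the trace as total energy. The source does not define any of these, and the eigenvalues 1.7 and 0.3 are illustrative. Rotating the major axis away from v_B stands in for the externally injected wrong_strength; the function names are hypothetical.

```python
import math

V_B = (1.0, 0.0)  # assumed unit direction of the B channel


def excitation_cov(theta, lam_major=1.7, lam_minor=0.3):
    """2x2 covariance with fixed eigenvalues; major axis at angle theta from v_B."""
    c, s = math.cos(theta), math.sin(theta)
    off_diag = (lam_major - lam_minor) * c * s
    return [
        [lam_major * c * c + lam_minor * s * s, off_diag],
        [off_diag, lam_major * s * s + lam_minor * c * c],
    ]


def directional_energy(cov, v):
    """Quadratic form v^T cov v for a unit vector v (assumed meaning of DE_B)."""
    return sum(v[i] * cov[i][j] * v[j] for i in range(2) for j in range(2))


def min_eigenvalue(cov):
    """Smallest eigenvalue of a symmetric 2x2 matrix (assumed PE level)."""
    half_tr = 0.5 * (cov[0][0] + cov[1][1])
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return half_tr - math.sqrt(max(half_tr * half_tr - det, 0.0))


def depletion_table(angles_deg=(0, 30, 60, 90)):
    """DE_B, PE and total energy as the major axis rotates away from v_B."""
    rows = []
    for deg in angles_deg:
        cov = excitation_cov(math.radians(deg))
        rows.append({
            "angle_deg": deg,
            "DE_B": directional_energy(cov, V_B),
            "PE": min_eigenvalue(cov),
            "energy": cov[0][0] + cov[1][1],
        })
    return rows


if __name__ == "__main__":
    for row in depletion_table():
        print(row)
```

In this sketch DE_B falls from 1.7 to 0.3 while PE and total energy do not move, which is the sense in which the depletion is directional rather than energetic.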
Stage 1: Geometry Validation (term coined)
Stage 2: Endogenous loop — still missing
Codex / ChatGPT: “wrong_strength must die”
Stage 2a Endogenous directional depletion
Stage 2a: SRAAgent Generates Depletion Without External Injection
wrong_strength is removed. SRAAgent applies a least-squares (LS) update to the B channel only and has no Q-burst model.
✓DE_B (H_B) = 0.826. DE_B (H_Q) = 0.406. Contrast = 0.420.
✓PE (H_Q) = 0.300 ≥ threshold. Energy = 2.000. Collapse is directional, not energetic.
?AUC_residual late = 0.761. Not collapsed. Why is it high?
Stage 2a PASS: endogenous directional depletion
AUC mystery: why 0.761?
Stage 2b: looking inside the agent
Stage 2b Attribution angle — primary internal evidence
Stage 2b: The Agent Is Pointing the Wrong Way
Dominant direction of B_est − I compared to v_B and v_Q at episode end.
✓H_B correct rate 0.983. Mean angle to v_B: 6.9°. Agent correctly learns B-drift.
✓H_Q error rate 0.888. Mean angle to v_Q: 15.3°. In 88.8% of episodes, the agent’s estimated drift direction is closer to v_Q (the Q-burst direction) than to v_B.
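A minimal sketch of an attribution-angle diagnostic of this kind follows. The source does not say how the "dominant direction" is extracted, so this version assumes it is the top left-singular vector of B_est − I, found by power iteration, and that angles are sign-invariant. All function names and the example matrix are hypothetical.

```python
import math


def mat_vec(M, x):
    return [sum(m * c for m, c in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def unit(x):
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def dominant_direction(M, iters=100):
    """Top left-singular vector of M via power iteration on M M^T.

    Assumes M is not the zero matrix (i.e. B_est differs from I).
    """
    Mt = transpose(M)
    x = unit([1.0 + 0.1 * i for i in range(len(M))])  # fixed, generic start
    for _ in range(iters):
        x = unit(mat_vec(M, mat_vec(Mt, x)))
    return x


def angle_deg(u, v):
    """Sign-invariant angle between two directions, in degrees."""
    cos = abs(sum(a * b for a, b in zip(unit(u), unit(v))))
    return math.degrees(math.acos(min(1.0, cos)))


def attribution(B_est, v_B, v_Q):
    """Compare the dominant direction of B_est - I with v_B and v_Q."""
    n = len(B_est)
    drift = [[B_est[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    d = dominant_direction(drift)
    a_B, a_Q = angle_deg(d, v_B), angle_deg(d, v_Q)
    return {"angle_to_v_B": a_B, "angle_to_v_Q": a_Q,
            "attributed_to": "B" if a_B <= a_Q else "Q"}


# Hypothetical example: estimated drift 0.5 * u u^T, with u 10 degrees off v_B.
v_B, v_Q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
u = [math.cos(math.radians(10)), math.sin(math.radians(10)), 0.0]
B_est = [[(1.0 if i == j else 0.0) + 0.5 * u[i] * u[j] for j in range(3)]
         for i in range(3)]
result = attribution(B_est, v_B, v_Q)
print(result)
```

An episode counts as misattributed in this sketch when an agent facing a B-drift environment returns "Q", or the reverse. The error rate is then the fraction of such episodes.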
Stage 2b PASS: agent-internal misattribution
New core concept: external ≠ internal distinguishability
AUC still 0.761. Why?
Emerging realization
The high AUC may not be evidence against internal misattribution. It may be evidence that external observers benefit from comparing two differently-adapted agents — a comparison the agents themselves never make.
Stage 2c: killing the mystery
Stage 2c Policy-matched replay
Stage 2c: The High AUC Was Policy Geometry All Along
Action sequences from the H_Q SRAAgent are replayed into both environments. Residuals use a neutral B̂ = I throughout.
✓SRA adaptive AUC: 0.762 → Replay AUC: 0.553. Drop: 0.209.
✓Action-only AUC: 0.550 ≈ chance. Classifier was not reading action labels.
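The replay procedure can be sketched on a toy problem. Everything here is an assumption made for illustration: the 2-D environments, the gain and noise levels, the per-episode feature (mean projection of the neutral residual onto v_B), and the rank-based AUC. Because that feature ignores burst variance, the sketch is not expected to reproduce the reported AUC values. It only shows that when one action sequence is replayed into both environments, the resulting AUC depends on where those actions point.

```python
import random

V_B, V_Q = (1.0, 0.0), (0.0, 1.0)  # assumed orthonormal directions


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def step(env, action, rng, gain=1.0, sigma=0.5, burst=2.0):
    """One observation y = B_env a + noise in a toy 2-D environment."""
    noise = [rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)]
    if env == "H_B":
        # Structural drift along v_B: B = I + gain * v_B v_B^T.
        drift = gain * dot(V_B, action)
        return [action[i] + drift * V_B[i] + noise[i] for i in range(2)]
    # H_Q: no drift (B = I), extra burst noise along v_Q.
    extra = rng.gauss(0.0, burst * sigma)
    return [action[i] + extra * V_Q[i] + noise[i] for i in range(2)]


def episode_feature(env, actions, rng):
    """Mean projection onto v_B of the neutral residual r = y - I a."""
    total = 0.0
    for a in actions:
        y = step(env, a, rng)
        residual = [y[i] - a[i] for i in range(2)]
        total += dot(residual, V_B)
    return total / len(actions)


def auc(pos, neg):
    """Rank-based AUC: P(pos > neg) + 0.5 * P(tie)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def replay_auc(actions, episodes=200, seed=0):
    """Replay the same action sequence into both environments."""
    rng = random.Random(seed)
    feats_B = [episode_feature("H_B", actions, rng) for _ in range(episodes)]
    feats_Q = [episode_feature("H_Q", actions, rng) for _ in range(episodes)]
    return auc(feats_B, feats_Q)


T = 50
auc_along_vB = replay_auc([V_B] * T)  # actions that excite the B drift
auc_along_vQ = replay_auc([V_Q] * T)  # actions orthogonal to the B drift
print(auc_along_vB, auc_along_vQ)
```

In this toy, actions along v_B expose the drift and the two environments separate almost perfectly, while actions along v_Q leave the chosen feature identically distributed in both, so the AUC sits near chance.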
Stage 2c PASS: AUC source decomposed
High AUC = divergent policy geometries, not environment distinguishability
Answer to the AUC mystery
The H_B agent adapted toward v_B. The H_Q agent adapted toward v_Q. Their different action directions created different residual structures that an external classifier could read. But the H_Q agent was internally pointing the wrong way. The forensic artifact of two differently adapted agents is distinguishable. The H_Q agent itself is not diagnosing correctly.
Stage 2d: confirming the geometry
Stage 2d Multi-directional replay table
Stage 2d: AUC Is a Function of Action Direction
Six action sources replayed into both environments.
| Action source | AUC | Note |
| v_B policy / v_B-oracle | 0.640–0.650 | B-discriminative direction |
| H_B agent actions | 0.639 | v_B-aligned (correct adaptation) |
| v_Q policy | 0.598 | Burst direction |
| H_Q agent actions | 0.537 | v_Q-aligned (misattributed) |
| Isotropic probe | 0.518 | No directional bias |
✓v_B > v_Q confirmed. H_B actions > H_Q actions confirmed.
×v_Q < isotropic was not confirmed (0.598 > 0.518). Criterion corrected: v_Q excites Q-burst variance, raising AUC above isotropic. Claim is v_B > v_Q, not v_Q < isotropic.
×“discriminative oracle” label removed: implementation was ≡ v_B policy, not independently optimized. Renamed to v_B-oracle.
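The pass criteria can be written out against the table values. The AUC numbers below are copied from the Stage 2d table, with 0.640 as the lower end of the reported v_B range. The dictionary keys and the function name are hypothetical.

```python
# AUC values copied from the Stage 2d table.
auc_table = {
    "v_B_policy": 0.640,   # lower end of the reported 0.640-0.650 range
    "H_B_actions": 0.639,
    "v_Q_policy": 0.598,
    "H_Q_actions": 0.537,
    "isotropic": 0.518,
}


def check_criteria(t):
    """Evaluate the two retained criteria and the one that was dropped."""
    return {
        "v_B > v_Q": t["v_B_policy"] > t["v_Q_policy"],
        "H_B actions > H_Q actions": t["H_B_actions"] > t["H_Q_actions"],
        "v_Q < isotropic (dropped)": t["v_Q_policy"] < t["isotropic"],
    }


criteria = check_criteria(auc_table)
for name, passed in criteria.items():
    print(f"{name}: {passed}")
```

The first two checks hold on the reported numbers and the third does not, which is why the claim was narrowed to v_B > v_Q.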
Stage 2d PASS: directional AUC table
Wrong criterion v_Q < isotropic
“discriminative oracle” label → v_B-oracle (≡ v_B)
Theory revision
Current theory2.html — revised paper
What the Experiments Changed in the Theory
Old central claim (theory.html)
Misattribution causes external residual indistinguishability. The policy contaminates future trajectory evidence; trajectory-level distinguishability is recursively degraded.
New central claim (theory2.html)
External evidence remains classifiable, but the adaptive agent maps it into the wrong structural update channel. The apparent residual separability is constructed by policy-induced trajectory geometry. External classifier distinguishability and agent-internal attribution correctness are independent.
| Claim | Status | Determined by |
| PE ⊥ Attribution Separability | ✓ defensible | Formal; preserved |
| Endogenous DE_B depletion | ✓ Stage 2a | SRAAgent without wrong_strength |
| Internal misattribution (88.8% H_Q) | ✓ Stage 2b | Attribution angle |
| AUC drop under replay (0.209) | ✓ Stage 2c | Policy-matched replay |
| AUC is function of action direction | ✓ Stage 2d | Multi-directional table |
| Residual AUC collapse in adaptive loop | ? Stage 2e open | Not yet |
| SRA distinct from ABHT | not yet | Failure-mode benchmark |
| High-PE Paradox | retracted | No formal support |
PE ⊥ Attribution Separability
Directional Collapse metric
MOAT v5g architecture
Hidden Confounder as known limit
Internal Attribution Collapse (revised definition)
D_ext ≠ AC (external ≠ internal)
Attribution angle diagnostic
Policy-matched replay method
Directional AUC table
Recursive Attribution Poisoning as empirical claim
High-PE Paradox
External residual indistinguishability claim
Open question
Stage 2e Open — not yet attempted
Stage 2e: Can the Same Loop Also Collapse External AUC?
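Replay AUC = 0.553 is below the Stage 1 collapse threshold (0.60), but only under policy-matched replay. In the adaptive loop itself the residual AUC stays above 0.76, so closed-loop collapse is not confirmed.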
If Stage 2e succeeds: the full recursive poisoning claim closes. If Stage 2e fails: the reframing holds, and the negative result confirms that ABHT-family policies cover this geometry — a valuable benchmark finding either way.
Stage 2e: open — two valid outcomes
What the theory is now
The theory that remained after everything that could be removed was removed is this:
An adaptive system can produce residual trajectories that are classifiable by an external observer while its internal attribution map is systematically wrong. The external classifier’s success is a forensic artifact of divergent policies — not a sign that any agent has correctly attributed the cause. The dissociation between external evidence quality and internal attribution fidelity is measurable, reproducible, and not addressed by PE conditions or standard closed-loop identification theory.
This is smaller than the original claim. It is also more precisely true.