How the Theory Evolved

This document traces how the theory of Attribution Collapse in Adaptive Systems was built, broken, rebuilt, and sharpened over multiple rounds of peer review by Claude, Codex, ChatGPT, Gemini, and Perplexity.

It is not a polished narrative. It is an honest log of what each stage of pressure removed, and what remained.

Phase 0 Original version — entirely removed

Observer / Selfhood / Consciousness Theory

The original document (Operational Embedded Agency Theory) centered on observer definitions, F-coalgebraic selfhood fixed points, AQFT ontology via von Neumann algebras, and connections to IIT.

observer (operational definition) · self / selfhood (F-coalgebraic) · consciousness / qualia · AQFT ontology · IIT connection · High-PE Paradox (initial)

Removed — reason

These concepts are not amenable to peer review in the standard engineering/ML sense and are not falsifiable within the MOAT benchmark framework. The multi-AI review confirmed that the engineering failure mode does not require phenomenological framing. Removing them made the surviving core more precise, not less.

Phase 1 theory.html — engineering failure formulation

Attribution Collapse as External Residual Indistinguishability

After removing observer/consciousness framing, the surviving core became: an adaptive system under partial observability can, through its own modified policy, distort the statistical structure of future residuals, recursively destroying its own future identifiability.

theory.html — Abstract Claim

An update to the wrong latent channel distorts the policy; the distorted policy contaminates future trajectory evidence; and trajectory-level distinguishability is recursively degraded (Recursive Attribution Poisoning).

PE ⊥ Attribution Separability · Directional Collapse metric · MOAT v5g design · Contamination Jacobian (sketch) · Hidden Confounder analysis · High-PE Paradox (formally retracted) · Strong "Recursive Attribution Poisoning" empirical claim

The benchmark code (moat_v5g.py) was written to validate this. Code review identified the structural gap: wrong_strength was externally injected, not generated by the agent. Stage 1 was geometry validation, not a demonstration of the closed loop.

First experiment

Stage 1 External geometry validation

Stage 1: Directional Depletion → AUC Collapse

External wrong_strength reduces DE_B while holding PE and total energy fixed. AUC_residual collapses below 0.60.

✓Directional depletion alone, independent of PE or energy, is sufficient to destroy residual trajectory distinguishability.

×Does not demonstrate the closed loop. Depletion is externally supplied.

Stage 1: Geometry Validation (term coined) · Stage 2: Endogenous loop (still missing)
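The Stage 1 finding, that direction can be depleted while PE and energy stay fixed, can be illustrated with a small geometric sketch. It rests on assumptions that this log does not state: a 2-D action covariance, PE taken as the smallest eigenvalue, energy taken as the trace, and DE_B taken as the share of energy along v_B. The function names and the eigenvalues (1.7, 0.3) are hypothetical and are not taken from moat_v5g.py.

```python
import numpy as np

def rotated_covariance(eigvals, theta):
    """Covariance with fixed eigenvalues whose principal axes are rotated by theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return rot @ np.diag(eigvals) @ rot.T

def summarize(cov, v_b):
    """Return (PE, energy, DE_B) under the assumed definitions above."""
    pe = float(np.linalg.eigvalsh(cov)[0])    # smallest eigenvalue (eigvalsh sorts ascending)
    energy = float(np.trace(cov))             # total action energy
    de_b = float(v_b @ cov @ v_b) / energy    # share of energy along v_B
    return pe, energy, de_b

v_b = np.array([1.0, 0.0])   # hypothetical B-drift direction
eigvals = [1.7, 0.3]         # assumed spectrum: PE = 0.3, energy = 2.0

aligned = summarize(rotated_covariance(eigvals, 0.0), v_b)
depleted = summarize(rotated_covariance(eigvals, np.pi / 2), v_b)
print(aligned, depleted)
```

Under these assumptions both configurations report PE = 0.3 and energy = 2.0, while the share of energy along v_B falls from 0.85 to 0.15. Rotation changes the direction of excitation without changing its spectrum, which is the sense in which depletion here is directional rather than energetic.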

Codex / ChatGPT: “wrong_strength must die”

Stage 2a Endogenous directional depletion

Stage 2a: SRAAgent Generates Depletion Without External Injection

wrong_strength removed. SRAAgent uses LS update on B channel only, no Q-burst model.

✓DE_B (H_B) = 0.826. DE_B (H_Q) = 0.406. Contrast = 0.420.

✓PE (H_Q) = 0.300 ≥ threshold. Energy = 2.000. Collapse is directional, not energetic.

?AUC_residual late = 0.761. Not collapsed. Why is it high?

Stage 2a PASS: endogenous directional depletion · AUC mystery: why 0.761?

Stage 2b: looking inside the agent

Stage 2b Attribution angle — primary internal evidence

Stage 2b: The Agent Is Pointing the Wrong Way

Dominant direction of B_est − I compared to v_B and v_Q at episode end.

✓H_B correct rate 0.983. Mean angle to v_B: 6.9°. Agent correctly learns B-drift.

✓H_Q error rate 0.888. Mean angle to v_Q: 15.3°. In 88.8% of episodes, agent’s estimated drift direction is closer to the Q-burst direction.

Stage 2b PASS: agent-internal misattribution · New core concept: external ≠ internal distinguishability · AUC still 0.761. Why?
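The attribution-angle diagnostic can be sketched as follows. This is a minimal illustration, assuming the dominant direction is the leading left singular vector of B_est − I and that the angle is sign-invariant; the log does not say how the benchmark extracts the direction. The helper names, the orthogonal v_B and v_Q, and the example estimate are all hypothetical.

```python
import numpy as np

def dominant_direction(b_est):
    """Leading left singular vector of (B_est - I); an assumed choice of 'dominant direction'."""
    delta = b_est - np.eye(b_est.shape[0])
    u, _, _ = np.linalg.svd(delta)
    return u[:, 0]

def angle_deg(u, v):
    """Sign-invariant angle between two directions, in degrees."""
    cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

def attributed_channel(b_est, v_b, v_q):
    """Label the episode by whichever reference direction the estimate is closer to."""
    d = dominant_direction(b_est)
    return "B" if angle_deg(d, v_b) <= angle_deg(d, v_q) else "Q"

# Hypothetical example: the estimated drift sits 10 degrees away from v_Q.
v_b = np.array([1.0, 0.0])
v_q = np.array([0.0, 1.0])
d_true = np.array([np.sin(np.radians(10.0)), np.cos(np.radians(10.0))])
b_est = np.eye(2) + 0.5 * np.outer(d_true, d_true)

channel = attributed_channel(b_est, v_b, v_q)
print(channel, angle_deg(dominant_direction(b_est), v_q))
```

In this toy case the estimate is labeled "Q" because its dominant direction lies 10° from v_Q and 80° from v_B. An error rate like the one reported for H_Q would then be the fraction of episodes receiving the wrong label.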

Emerging realization

The high AUC may not be evidence against internal misattribution. It may be evidence that external observers benefit from comparing two differently-adapted agents — a comparison the agents themselves never make.

Stage 2c: killing the mystery

Stage 2c Policy-matched replay

Stage 2c: The High AUC Was Policy Geometry All Along

H_Q SRAAgent action sequences replayed into both environments. Residuals use neutral B̂ = I throughout.

✓SRA adaptive AUC: 0.762 → Replay AUC: 0.553. Drop: 0.209.

✓Action-only AUC: 0.550 ≈ chance. Classifier was not reading action labels.

Stage 2c PASS: AUC source decomposed · High AUC = divergent policy geometries, not environment distinguishability
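The replay idea can be sketched in a few lines: the same recorded action sequences are fed to both environments, and residuals are computed against the neutral model B̂ = I. Everything concrete here is an assumption made for illustration: the linear observation model y = B a + noise, the drift and burst magnitudes, the random actions standing in for recorded agent actions, the per-episode score, and the rank-based AUC used in place of a trained classifier. None of it is the MOAT implementation.

```python
import numpy as np

def rank_auc(pos, neg):
    """AUC as P(pos > neg) + 0.5 * P(pos == neg), taken over all pairs."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def make_env(b_true, noise_cov):
    """Hypothetical linear environment: y = B a + Gaussian noise."""
    chol = np.linalg.cholesky(noise_cov)
    def step(a, rng):
        return b_true @ a + chol @ rng.standard_normal(a.shape[0])
    return step

def replay_scores(episodes, env_step, rng):
    """Replay fixed action sequences; score each episode by its mean squared residual."""
    scores = []
    for actions in episodes:
        # Residual against the neutral model B_hat = I, i.e. r = y - a.
        residuals = np.array([env_step(a, rng) - a for a in actions])
        scores.append(float(np.mean(np.sum(residuals ** 2, axis=1))))
    return np.array(scores)

rng = np.random.default_rng(0)
v_b, v_q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
env_b = make_env(np.eye(2) + 0.3 * np.outer(v_b, v_b), 0.1 * np.eye(2))   # B-drift
env_q = make_env(np.eye(2), 0.1 * np.eye(2) + 0.3 * np.outer(v_q, v_q))   # Q-burst

# Stand-in for recorded actions: 50 episodes, 20 steps each, 2-D actions.
episodes = rng.standard_normal((50, 20, 2))
replay_auc = rank_auc(replay_scores(episodes, env_b, rng),
                      replay_scores(episodes, env_q, rng))
print(replay_auc)
```

Because both environments receive identical actions, any separability that remains comes from the environments themselves and not from differences between two adapted policies. That is the comparison the replay result of 0.553 is meant to isolate.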

Answer to the AUC mystery

The H_B agent adapted toward v_B. The H_Q agent adapted toward v_Q. Their different action directions created different residual structures that an external classifier could read. But the H_Q agent was internally pointing the wrong way. The trajectories of two differently adapted agents are distinguishable as a forensic artifact, yet the H_Q agent itself is not diagnosing correctly.

Stage 2d: confirming the geometry

Stage 2d Multi-directional replay table

Stage 2d: AUC Is a Function of Action Direction

Six action sources replayed into both environments.

Action source	AUC	Note
v_B policy / v_B-oracle	0.640–0.650	B-discriminative direction
H_B agent actions	0.639	v_B-aligned (correct adaptation)
v_Q policy	0.598	Burst direction
H_Q agent actions	0.537	v_Q-aligned (misattributed)
Isotropic probe	0.518	No directional bias

✓v_B > v_Q confirmed. H_B actions > H_Q actions confirmed.

×v_Q < isotropic was not confirmed (0.598 > 0.518). Criterion corrected: v_Q excites Q-burst variance, raising AUC above isotropic. Claim is v_B > v_Q, not v_Q < isotropic.

×“discriminative oracle” label removed: the implementation was identical to the v_B policy (≡ v_B), not independently optimized. Renamed to v_B-oracle.

Stage 2d PASS: directional AUC table · Wrong criterion: v_Q < isotropic · “discriminative oracle” label → v_B-oracle (≡ v_B)
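The Stage 2d criteria can be restated as direct comparisons over the table values. The sketch below uses only the AUCs reported in the table, taking the lower end of the 0.640–0.650 range for the v_B row; the dictionary layout and names are illustrative.

```python
# AUC values copied from the Stage 2d table.
auc_by_source = {
    "v_B policy / v_B-oracle": 0.640,   # lower end of the reported 0.640–0.650 range
    "H_B agent actions": 0.639,
    "v_Q policy": 0.598,
    "H_Q agent actions": 0.537,
    "Isotropic probe": 0.518,
}

checks = {
    "v_B > v_Q": auc_by_source["v_B policy / v_B-oracle"] > auc_by_source["v_Q policy"],
    "H_B actions > H_Q actions": auc_by_source["H_B agent actions"] > auc_by_source["H_Q agent actions"],
    # The criterion that was dropped after this stage:
    "v_Q < isotropic": auc_by_source["v_Q policy"] < auc_by_source["Isotropic probe"],
}

for name, passed in checks.items():
    print(f"{name}: {'confirmed' if passed else 'not confirmed'}")
```

The first two comparisons hold and the third does not, which matches the correction recorded above: the claim is v_B > v_Q, not v_Q < isotropic.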

Theory revision

Current theory2.html — revised paper

What the Experiments Changed in the Theory

Old central claim (theory.html)

Misattribution causes external residual indistinguishability. The policy contaminates future trajectory evidence; trajectory-level distinguishability is recursively degraded.

New central claim (theory2.html)

External evidence remains classifiable, but the adaptive agent maps it into the wrong structural update channel. The apparent residual separability is constructed by policy-induced trajectory geometry. External classifier distinguishability and agent-internal attribution correctness are independent.

Claim	Status	Determined by
PE ⊥ Attribution Separability	✓ defensible	Formal; preserved
Endogenous DE_B depletion	✓ Stage 2a	SRAAgent without wrong_strength
Internal misattribution (88.8% H_Q)	✓ Stage 2b	Attribution angle
AUC drop under replay (0.209)	✓ Stage 2c	Policy-matched replay
AUC is function of action direction	✓ Stage 2d	Multi-directional table
Residual AUC collapse in adaptive loop	? Stage 2e open	Not yet
SRA distinct from ABHT	not yet	Failure-mode benchmark
High-PE Paradox	retracted	No formal support

PE ⊥ Attribution Separability · Directional Collapse metric · MOAT v5g architecture · Hidden Confounder as known limit · Internal Attribution Collapse (revised definition) · D_ext ≠ AC (external ≠ internal) · Attribution angle diagnostic · Policy-matched replay method · Directional AUC table · Recursive Attribution Poisoning as empirical claim · High-PE Paradox · External residual indistinguishability claim

Open question

Stage 2e Open — not yet attempted

Stage 2e: Can the Same Loop Also Collapse External AUC?

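Replay AUC = 0.553 is already below the Stage 1 collapse threshold (0.60), but only under policy-matched replay; the SRA adaptive AUC is still 0.762. Collapse inside the adaptive loop itself is not confirmed.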

If Stage 2e succeeds: the full recursive poisoning claim closes. If Stage 2e fails: the reframing holds, and the negative result confirms that ABHT-family policies cover this geometry — a valuable benchmark finding either way.

Stage 2e: open — two valid outcomes

What the theory is now

The theory that remained after everything that could be removed was removed is this:

An adaptive system can produce residual trajectories that are classifiable by an external observer while its internal attribution map is systematically wrong. The external classifier’s success is a forensic artifact of divergent policies — not a sign that any agent has correctly attributed the cause. The dissociation between external evidence quality and internal attribution fidelity is measurable, reproducible, and not addressed by PE conditions or standard closed-loop identification theory.

This is smaller than the original claim. It is also more precisely true.

How Attribution Collapse became Internal Attribution Collapse