How Attribution Collapse became Internal Attribution Collapse

A record of what was discarded, what survived, and what was discovered across the multi-AI peer-review relay and experimental chain Stage 1–2d.

This document traces how the theory of Attribution Collapse in Adaptive Systems was built, broken, rebuilt, and sharpened over multiple rounds of peer review by Claude, Codex, ChatGPT, Gemini, and Perplexity.

It is not a polished narrative. It is an honest log of what each stage of pressure removed, and what remained.

Phase 0 Original version — entirely removed

Observer / Selfhood / Consciousness Theory

The original document (Operational Embedded Agency Theory) centered on observer definitions, F-coalgebraic selfhood fixed points, AQFT ontology via von Neumann algebras, and connections to IIT.

observer (operational definition) · self / selfhood (F-coalgebraic) · consciousness / qualia · AQFT ontology · IIT connection · High-PE Paradox (initial)
Removed — reason
These concepts are not amenable to peer review in the standard engineering/ML sense and are not falsifiable within the MOAT benchmark framework. The multi-AI review confirmed that the engineering failure mode does not require phenomenological framing. Removing them made the surviving core more precise, not less.
Phase 1 theory.html — engineering failure formulation

Attribution Collapse as External Residual Indistinguishability

After removing observer/consciousness framing, the surviving core became: an adaptive system under partial observability can, through its own modified policy, distort the statistical structure of future residuals, recursively destroying its own future identifiability.

theory.html — Abstract Claim
An update to the wrong latent channel distorts the policy; the distorted policy contaminates future trajectory evidence; and trajectory-level distinguishability is recursively degraded (Recursive Attribution Poisoning).
PE ⊥ Attribution Separability · Directional Collapse metric · MOAT v5g design · Contamination Jacobian (sketch) · Hidden Confounder analysis · High-PE Paradox (formally retracted) · Strong "Recursive Attribution Poisoning" empirical claim

The benchmark code (moat_v5g.py) was written to validate this. Code review identified the structural gap: wrong_strength was externally injected, not generated by the agent. Stage 1 was geometry validation, not a demonstration of the closed loop.

First experiment
Stage 1 External geometry validation

Stage 1: Directional Depletion → AUC Collapse

External wrong_strength reduces DE_B while holding PE and total energy fixed. AUC_residual collapses below 0.60.

Directional depletion alone, independent of PE or energy, is sufficient to destroy residual trajectory distinguishability.
× Does not demonstrate the closed loop. Depletion is externally supplied.
Stage 1: Geometry Validation (term coined) · Stage 2: Endogenous loop, still missing
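
The Stage 1 manipulation can be illustrated with a minimal sketch. It assumes that DE_B is the fraction of total action energy lying along v_B (the log does not spell out the definition) and models wrong_strength as a norm-preserving rotation of each action away from v_B toward a second direction. The function names `directional_energy` and `deplete` are hypothetical and are not taken from moat_v5g.py.

```python
import numpy as np

def directional_energy(actions, v):
    """Fraction of total action energy along direction v (assumed DE definition)."""
    v = v / np.linalg.norm(v)
    along = actions @ v
    return float(np.sum(along ** 2) / np.sum(actions ** 2))

def deplete(actions, v_b, v_wrong, wrong_strength):
    """Rotate actions in the (v_b, v_wrong) plane; each action keeps its norm."""
    v_b = v_b / np.linalg.norm(v_b)
    # Orthogonalize v_wrong against v_b so the plane rotation is well defined.
    v_w = v_wrong - (v_wrong @ v_b) * v_b
    v_w = v_w / np.linalg.norm(v_w)
    theta = wrong_strength * np.pi / 2.0   # 0 = untouched, 1 = fully rotated
    a_b = actions @ v_b
    a_w = actions @ v_w
    rest = actions - np.outer(a_b, v_b) - np.outer(a_w, v_w)
    new_b = np.cos(theta) * a_b - np.sin(theta) * a_w
    new_w = np.sin(theta) * a_b + np.cos(theta) * a_w
    return rest + np.outer(new_b, v_b) + np.outer(new_w, v_w)

# Toy values, chosen only for illustration.
v_b = np.array([1.0, 0.0, 0.0])
v_q = np.array([0.0, 1.0, 0.0])
actions = np.tile(v_b, (50, 1))            # all energy starts along v_b

for w in (0.0, 0.5, 1.0):
    a = deplete(actions, v_b, v_q, w)
    print(w, directional_energy(a, v_b), np.sum(a ** 2))
```

Under these assumptions the total energy column stays constant while the directional fraction falls as wrong_strength grows, which is the sense in which the collapse is directional and not energetic.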
Codex / ChatGPT: “wrong_strength must die”
Stage 2a Endogenous directional depletion

Stage 2a: SRAAgent Generates Depletion Without External Injection

wrong_strength removed. SRAAgent uses a least-squares (LS) update on the B channel only, with no Q-burst model.

DE_B (H_B) = 0.826   DE_B (H_Q) = 0.406   contrast = 0.420
PE (H_Q) = 0.300 ≥ threshold. Energy = 2.000. Collapse is directional, not energetic.
? AUC_residual (late) = 0.761. Not collapsed. Why is it high?
Stage 2a PASS: endogenous directional depletion · AUC mystery: why 0.761?
Stage 2b: looking inside the agent
Stage 2b Attribution angle — primary internal evidence

Stage 2b: The Agent Is Pointing the Wrong Way

Dominant direction of B_est − I compared to v_B and v_Q at episode end.

H_B correct rate 0.983. Mean angle to v_B: 6.9°. Agent correctly learns B-drift.
H_Q error rate 0.888. Mean angle to v_Q: 15.3°. In 88.8% of H_Q episodes, the agent's estimated drift direction is closer to the Q-burst direction than to v_B.
Stage 2b PASS: agent-internal misattribution · New core concept: external ≠ internal distinguishability · AUC still 0.761. Why?
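
The attribution-angle diagnostic can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: "dominant direction" is taken to mean the leading left singular vector of B_est − I, angles ignore sign, and the helper names (`dominant_direction`, `angle_deg`, `attributed_channel`) are hypothetical.

```python
import numpy as np

def dominant_direction(b_est):
    """Leading left singular vector of (B_est - I); assumed meaning of 'dominant direction'."""
    delta = b_est - np.eye(b_est.shape[0])
    u, _, _ = np.linalg.svd(delta)
    return u[:, 0]

def angle_deg(u, v):
    """Unsigned angle in degrees between two directions (sign of either vector is ignored)."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))

def attributed_channel(b_est, v_b, v_q):
    """Return 'B' if the learned drift points closer to v_b, otherwise 'Q'."""
    d = dominant_direction(b_est)
    return "B" if angle_deg(d, v_b) <= angle_deg(d, v_q) else "Q"

# Toy case: a learned drift that sits 10 degrees away from v_q.
v_b = np.array([1.0, 0.0])
v_q = np.array([0.0, 1.0])                    # assumed orthogonal to v_b here
tilt = np.radians(10.0)
d = np.array([np.sin(tilt), np.cos(tilt)])
b_est = np.eye(2) + 0.3 * np.outer(d, d)

print(attributed_channel(b_est, v_b, v_q),
      angle_deg(dominant_direction(b_est), v_q))
```

Applying `attributed_channel` at the end of every episode and averaging would give a correct rate or error rate of the kind reported above, assuming the benchmark uses a comparable closest-direction rule.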
Emerging realization
The high AUC may not be evidence against internal misattribution. It may be evidence that external observers benefit from comparing two differently-adapted agents — a comparison the agents themselves never make.
Stage 2c: killing the mystery
Stage 2c Policy-matched replay

Stage 2c: The High AUC Was Policy Geometry All Along

The H_Q SRAAgent's action sequences are replayed into both environments. Residuals are computed with a neutral B̂ = I throughout.

SRA adaptive AUC: 0.762 → Replay AUC: 0.553. Drop: 0.209.
Action-only AUC: 0.550 ≈ chance. Classifier was not reading action labels.
Stage 2c PASS: AUC source decomposed · High AUC = divergent policy geometries, not environment distinguishability
Answer to the AUC mystery
The H_B agent adapted toward v_B. The H_Q agent adapted toward v_Q. Their different action directions created different residual structures that an external classifier could read. But the H_Q agent was internally pointing the wrong way. The forensic artifact of two differently adapted agents is distinguishable. The agents themselves are not diagnosing correctly.
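
Policy-matched replay can be sketched in a toy setting. Everything specific here is an assumption made for illustration: the dynamics x_next = x + B·a + noise, the residual x_next − (x + a) under the neutral model B̂ = I, the per-episode score (mean squared residual), and the rank-based AUC. The second environment is reduced to "no B drift" and burst noise is omitted, so the sketch shows only the geometric point that one fixed action sequence, fed to both environments, separates them only when it excites the direction in which they differ.

```python
import numpy as np

def rank_auc(scores_neg, scores_pos):
    """AUC as P(positive score > negative score), with ties counted as one half."""
    neg = np.asarray(scores_neg, dtype=float)[:, None]
    pos = np.asarray(scores_pos, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def replay_residuals(b_true, actions, rng, noise_std=0.1):
    """Replay a fixed action sequence; residuals use the neutral model B_hat = I."""
    x = np.zeros(actions.shape[1])
    residuals = []
    for a in actions:
        x_next = x + b_true @ a + noise_std * rng.standard_normal(x.shape)
        residuals.append(x_next - (x + a))     # neutral prediction is x + a
        x = x_next
    return np.array(residuals)

def replay_auc(actions, b_env0, b_env1, episodes=200, seed=0):
    """AUC of a per-episode residual-energy score between two environments."""
    rng = np.random.default_rng(seed)
    def score(b_true):
        return np.mean(replay_residuals(b_true, actions, rng) ** 2)
    s0 = [score(b_env0) for _ in range(episodes)]
    s1 = [score(b_env1) for _ in range(episodes)]
    return rank_auc(s0, s1)

v_b = np.array([1.0, 0.0])
v_q = np.array([0.0, 1.0])
b_drift = np.eye(2) + 0.3 * np.outer(v_b, v_b)   # drift only along v_b

for name, direction in (("v_B-aligned", v_b), ("v_Q-aligned", v_q)):
    actions = np.tile(direction, (20, 1))
    print(name, replay_auc(actions, np.eye(2), b_drift))
```

In this toy, actions along v_B make the drift visible in the residuals, while actions along v_Q leave both environments statistically identical. That mirrors the reading above: separability measured from the outside depends on which directions the policy happens to probe.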
Stage 2d: confirming the geometry
Stage 2d Multi-directional replay table

Stage 2d: AUC Is a Function of Action Direction

Six action sources replayed into both environments.

| Action source | AUC | Note |
|---|---|---|
| v_B policy / v_B-oracle | 0.640–0.650 | B-discriminative direction |
| H_B agent actions | 0.639 | v_B-aligned (correct adaptation) |
| v_Q policy | 0.598 | Burst direction |
| H_Q agent actions | 0.537 | v_Q-aligned (misattributed) |
| Isotropic probe | 0.518 | No directional bias |

v_B > v_Q confirmed. H_B actions > H_Q actions confirmed.
× v_Q < isotropic was not confirmed (0.598 > 0.518). Criterion corrected: v_Q excites Q-burst variance, raising AUC above isotropic. Claim is v_B > v_Q, not v_Q < isotropic.
× “discriminative oracle” label removed: implementation was ≡ v_B policy, not independently optimized. Renamed to v_B-oracle.
Stage 2d PASS: directional AUC table · Wrong criterion v_Q < isotropic · “discriminative oracle” label → v_B-oracle (≡ v_B)
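
The Stage 2d criteria can be checked mechanically against the reported table. The sketch below only restates the numbers above; the dictionary keys and criterion labels are hypothetical names, and the v_B entry uses the lower end of the reported 0.640 to 0.650 range as a conservative choice.

```python
# Reported replay AUCs from the Stage 2d table.
reported_auc = {
    "v_B policy": 0.640,        # lower end of the reported range
    "H_B agent actions": 0.639,
    "v_Q policy": 0.598,
    "H_Q agent actions": 0.537,
    "isotropic probe": 0.518,
}

criteria = {
    "v_B > v_Q": reported_auc["v_B policy"] > reported_auc["v_Q policy"],
    "H_B actions > H_Q actions":
        reported_auc["H_B agent actions"] > reported_auc["H_Q agent actions"],
    # Original, later corrected, criterion:
    "v_Q < isotropic": reported_auc["v_Q policy"] < reported_auc["isotropic probe"],
}

for name, holds in criteria.items():
    print(f"{name}: {'confirmed' if holds else 'not confirmed'}")
```

Only the third criterion fails, which matches the correction recorded above: the defensible ordering claim is v_B > v_Q, not v_Q < isotropic.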
Theory revision
Current theory2.html — revised paper

What the Experiments Changed in the Theory

Old central claim (theory.html)
Misattribution causes external residual indistinguishability. The policy contaminates future trajectory evidence; trajectory-level distinguishability is recursively degraded.
New central claim (theory2.html)
External evidence remains classifiable, but the adaptive agent maps it into the wrong structural update channel. The apparent residual separability is constructed by policy-induced trajectory geometry. External classifier distinguishability and agent-internal attribution correctness are independent.

| Claim | Status | Determined by |
|---|---|---|
| PE ⊥ Attribution Separability | ✓ defensible | Formal; preserved |
| Endogenous DE_B depletion | ✓ Stage 2a | SRAAgent without wrong_strength |
| Internal misattribution (88.8% H_Q) | ✓ Stage 2b | Attribution angle |
| AUC drop under replay (0.209) | ✓ Stage 2c | Policy-matched replay |
| AUC is function of action direction | ✓ Stage 2d | Multi-directional table |
| Residual AUC collapse in adaptive loop | ? Stage 2e open | Not yet |
| SRA distinct from ABHT | not yet | Failure-mode benchmark |
| High-PE Paradox | retracted | No formal support |

PE ⊥ Attribution Separability · Directional Collapse metric · MOAT v5g architecture · Hidden Confounder as known limit · Internal Attribution Collapse (revised definition) · D_ext ≠ AC (external ≠ internal) · Attribution angle diagnostic · Policy-matched replay method · Directional AUC table · Recursive Attribution Poisoning as empirical claim · High-PE Paradox · External residual indistinguishability claim
Open question
Stage 2e Open — not yet attempted

Stage 2e: Can the Same Loop Also Collapse External AUC?

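The replay AUC of 0.553 falls below the Stage 1 collapse threshold (0.60), but it was obtained under policy-matched replay, not inside the adaptive loop, where the SRA AUC was 0.762. Close, but collapse within the closed loop is not confirmed.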

If Stage 2e succeeds: the full recursive poisoning claim closes. If Stage 2e fails: the reframing holds, and the negative result confirms that ABHT-family policies cover this geometry — a valuable benchmark finding either way.

Stage 2e: open — two valid outcomes

What the theory is now

The theory that remained after everything that could be removed was removed is this:

An adaptive system can produce residual trajectories that are classifiable by an external observer while its internal attribution map is systematically wrong. The external classifier’s success is a forensic artifact of divergent policies — not a sign that any agent has correctly attributed the cause. The dissociation between external evidence quality and internal attribution fidelity is measurable, reproducible, and not addressed by PE conditions or standard closed-loop identification theory.

This is smaller than the original claim. It is also more precisely true.