theory.html · revised paper · stage 1–2d results incorporated

Internal Attribution Collapse in a Minimal Adaptive Agent:
Policy-Geometry Dependent Evidence and Structural Misattribution

A revised empirical account superseding the collapse definition in theory.html. The core proposition (PE ⊥ Attribution Separability) is preserved; the claim of external residual indistinguishability is retracted and replaced.

Multi-AI peer-review relay: Claude · Codex · ChatGPT · Gemini · Perplexity
Basis: theory.html · experimental chain Stage 1–2d
Status: draft · Stage 2e open · baseline experiments pending

§1 · Abstract

We define and empirically investigate Internal Attribution Collapse — a constructive failure mode in which a minimal adaptive agent under nonstationary partial observability structurally misattributes residuals to the wrong latent channel, generating policies that avoid the discriminative direction and making apparent residual evidence policy-geometry dependent. The central finding is a dissociation:

Central Result

External classifier distinguishability and agent-internal attribution correctness are independent. The SRAAgent can produce externally classifiable residual trajectories while its internal belief structure is systematically wrong about the true structural cause. This is a constructive counterexample, not a universal claim about adaptive agents.

The original claim of theory.html — that misattribution recursively degrades external trajectory-level distinguishability — was not confirmed experimentally. What was confirmed is stronger in a different direction: apparent residual separability is constructed by the policy-induced trajectory geometry, while the agent's internal attribution map points in the wrong structural direction.

We present MOAT v5g as a stress-test benchmark and SRAAgent as a constructive minimal counterexample showing that this failure mode can occur under a single-channel LS update with no Q-burst model.

§2 · What Changed from theory.html

theory.html framed Attribution Collapse as producing external residual indistinguishability via recursive contamination:

Retracted Claim (theory.html abstract)

An update to the wrong latent channel distorts the policy; the distorted policy contaminates future trajectory evidence; and trajectory-level distinguishability is recursively degraded (Recursive Attribution Poisoning).

Experiments show that adaptive residual AUC does not collapse; it remains high (0.762). The experimental chain instead reveals a decomposition of that high AUC:

Stage 2c: Policy-matched replay drops AUC 0.762 → 0.553. The high AUC was policy-geometry dependent.
Stage 2b: Attribution angle: 88.8% of H_Q episodes have $\angle(\hat{v}, v_Q) < \angle(\hat{v}, v_B)$. The agent internally misattributes.
Stage 2d: Directional AUC table: AUC is a function of action direction ($v_B$-aligned > $v_Q$-aligned).

The revised definition of Attribution Collapse:

Revised Definition

External evidence remains classifiable, but the adaptive agent maps it into the wrong structural update channel, generating policies that avoid the B-discriminative direction and making apparent residual separability policy-geometry dependent.

What survives from theory.html: the PE ⊥ Attribution Separability proposition (theory.html §3.3, restated here in §4.3); the Directional Collapse metric; the MOAT v5g measurement architecture (D_probe / D_oracle / D_policy separation); the Hidden Confounder analysis (theory.html §6); and the honest Limitations section. The Contamination Jacobian analysis ($\rho(J) > 1$) survives as a mechanism sketch but remains without constructive numerical demonstration.

§3 · The Core Distinction: External vs Internal

3.1 Two Independent Quantities

Let $\mathcal{H} = \{H_B, H_Q\}$ be competing structural hypotheses (B-drift vs Q-burst). Define:

External Distinguishability $$D_\text{ext}(\pi_t) \;=\; \mathrm{AUC}\!\bigl(\text{classifier on } \{e_{t:t+k}\} \text{ generated under policy } \pi_t\bigr)$$

Agent-Internal Attribution Correctness $$\mathrm{AC}_t \;=\; \mathbf{1}\!\bigl[\angle(\hat{v}_t,\, v_B) < \angle(\hat{v}_t,\, v_Q)\bigr], \quad \hat{v}_t = \mathrm{left\_svec}(\hat{B}_t - I)$$

The empirical result: under H_Q + SRAAgent, $D_\text{ext} = 0.762$ while $\mathrm{AC}_t = 0$ in 88.8% of episodes.
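The indicator $\mathrm{AC}_t$ can be sketched in a few lines. This is a minimal illustration, not the archive's implementation: the function names are hypothetical, and it assumes angles are measured between axes (sign-invariant), since a singular vector is only defined up to sign.

```python
import numpy as np

def leading_left_singular_vector(M):
    """Left singular vector of M belonging to its largest singular value."""
    U, _, _ = np.linalg.svd(M)
    return U[:, 0]

def axis_angle_deg(a, b):
    """Angle between the lines spanned by a and b, in [0, 90] degrees.

    Assumption: sign is ignored because v and -v describe the same axis.
    """
    c = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))

def attribution_correct(B_est, v_B, v_Q):
    """AC_t = 1 if v_hat is closer to v_B than to v_Q, else 0."""
    v_hat = leading_left_singular_vector(B_est - np.eye(B_est.shape[0]))
    return int(axis_angle_deg(v_hat, v_B) < axis_angle_deg(v_hat, v_Q))

# Illustrative geometry: v_Q rotated 60 degrees from v_B.
v_B = np.array([1.0, 0.0])
v_Q = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])
B_est_misattributed = np.eye(2) + 0.5 * np.outer(v_Q, v_Q)
ac = attribution_correct(B_est_misattributed, v_B, v_Q)  # estimate lies along v_Q
```

In this toy case the estimate $\hat{B} - I$ lies entirely along $v_Q$, so the indicator evaluates to 0 regardless of how well an external classifier separates the residual trajectories.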

3.2 Why the Gap Exists

The H_Q SRAAgent and H_B SRAAgent adapt toward different action directions ($v_Q$ and $v_B$ respectively). Their divergent policies generate structurally different residual trajectories that an external classifier can read. However, the H_Q agent does not know this; it has absorbed $v_Q$ as the learned B-drift direction.

The external classifier's success is a forensic artifact of the divergence between the two agents' adapted policies, not a sign that the H_Q agent has correctly attributed the cause. Removing the policy divergence (policy-matched replay) drops AUC to 0.553, approaching the single-step indistinguishability regime that the benchmark enforces by design.

3.3 Engineering Significance

This dissociation matters because standard adaptive system evaluation uses external performance metrics to infer internal attribution quality. If $D_\text{ext}$ is high, a system is deemed to be "correctly reading its environment." The SRAAgent result shows this inference fails when policies are adapted: the external observer benefits from a comparison the agent itself never makes.

§4 · SRA Framework (Condensed)

4.1 Residual Factorization

$$e_t \;=\; \underbrace{(B_\text{true} - \hat{B}_t)u_t}_{B\text{-drift}} \;+\; \underbrace{\Delta Q_t}_{Q\text{-burst}} \;+\; \underbrace{(A_\text{true}-\hat{A}_t)x_t}_{A\text{-drift}} \;+\; \underbrace{w_t + \eta_t}_{\text{noise}}$$

4.2 Attribution Separability

$$\theta(\mathcal{S}_B, \mathcal{S}_Q) \;=\; \arccos\!\left(\sigma_{\max}\bigl(\hat{S}_B^\top \hat{S}_Q\bigr)\right)$$

Attribution separability measures how geometrically distinguishable the residual subspaces of each cause are under the current policy. This angle provides the theoretical framing; in MOAT v5g, the operational diagnostic is $\mathrm{DE}_B$ depletion under preserved PE and energy, with internal misattribution confirmed via the attribution-angle diagnostic (§6.3).
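The angle $\theta(\mathcal{S}_B, \mathcal{S}_Q)$ is the smallest principal angle between two subspaces. A minimal sketch follows, assuming $\hat{S}_B$ and $\hat{S}_Q$ denote orthonormalized bases (obtained here by QR) and that the input matrices have full column rank; the function name is hypothetical.

```python
import numpy as np

def smallest_principal_angle(S_B, S_Q):
    """theta = arccos(sigma_max(Sb_hat^T Sq_hat)), in radians.

    S_B, S_Q: arrays whose columns span the two subspaces
    (assumed full column rank).
    """
    Qb, _ = np.linalg.qr(S_B)   # orthonormal basis of span(S_B)
    Qq, _ = np.linalg.qr(S_Q)   # orthonormal basis of span(S_Q)
    sigma = np.linalg.svd(Qb.T @ Qq, compute_uv=False)
    return float(np.arccos(np.clip(sigma.max(), 0.0, 1.0)))

# Two lines in R^3 that are 30 degrees apart.
S_B = np.array([[1.0], [0.0], [0.0]])
S_Q = np.array([[np.cos(np.pi / 6)], [np.sin(np.pi / 6)], [0.0]])
theta_deg = np.degrees(smallest_principal_angle(S_B, S_Q))
```

A value of $\theta$ near zero means the two residual subspaces share at least one direction, which is the regime the proposition in §4.3 is concerned with.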

4.3 Core Proposition (PE ⊥ Attribution Separability)

Defensible Central Proposition

$$\exists\, \pi,\, t \;:\; \mathrm{PE}(\pi_t) \geq \varepsilon_\text{PE} \;\wedge\; \theta(\mathcal{S}_B(\pi_t), \mathcal{S}_Q(\pi_t)) < \varepsilon_\theta$$ Persistent excitation aids parameter identifiability but does not guarantee attribution separability between competing structural hypotheses. These two conditions are independent.

4.4 Directional Collapse (Operationalized)

Directional Energy — B direction $$\mathrm{DE}_B(t) \;=\; \frac{v_B^\top \mathbb{E}[u_t u_t^\top]\, v_B}{\mathrm{tr}\,\mathbb{E}[u_t u_t^\top]}$$

Directional Collapse is declared when $\mathrm{DE}_B \downarrow$ while $\lambda_{\min}(\mathbb{E}[u_t u_t^\top]) \geq \varepsilon_\text{PE}$ and $\mathrm{tr}\,\mathbb{E}[u_t u_t^\top] \geq \varepsilon_E$. This distinguishes directional depletion from PE collapse and from energy starvation.
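The three conditions can be evaluated directly from the input second-moment matrix. The sketch below is illustrative only: the thresholds 0.15 and 1.0 are taken from the §6.2 table, the reference value 0.5 is the isotropic $\mathrm{DE}_B$ in 2D, and the concentration parameter `delta = 0.15` is an assumption chosen so that $\lambda_{\min} = E\delta = 0.300$ matches the reported PE; the source does not state $\delta$.

```python
import numpy as np

def directional_collapse_report(S, v_B, de_b_ref=0.5, eps_pe=0.15, eps_e=1.0):
    """Evaluate DE_B, PE and energy from a second-moment matrix S = E[u u^T]."""
    energy = float(np.trace(S))
    de_b = float(v_B @ S @ v_B) / energy
    lam_min = float(np.linalg.eigvalsh(S)[0])   # eigvalsh sorts ascending
    return {
        "DE_B": de_b,
        "lambda_min": lam_min,
        "energy": energy,
        "depleted": de_b < de_b_ref,    # DE_B below the isotropic baseline
        "pe_ok": lam_min >= eps_pe,     # persistent excitation preserved
        "energy_ok": energy >= eps_e,   # no energy starvation
    }

# Hypothetical policy concentrated along an axis 60 degrees away from v_B.
E, delta, theta = 2.0, 0.15, np.radians(60.0)
v_B = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])
v_perp = np.array([-v[1], v[0]])
S = E * (1 - delta) * np.outer(v, v) + E * delta * np.outer(v_perp, v_perp)
report = directional_collapse_report(S, v_B)
```

For this covariance, $\mathrm{DE}_B = (1-\delta)\cos^2\theta + \delta\sin^2\theta = 0.325$ while $\lambda_{\min} = 0.3$ and the trace stays at 2.0, so all three conditions for Directional Collapse hold at once.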

4.5 Identifiability Assumptions

SRA functions as a causal attribution proxy only under Spec-1 (process noise independent of action), Spec-2 (observation noise independent of action), Spec-3 (no hidden confounder), Spec-4 (PE condition), and Spec-5 (quasi-stationarity of $B_\text{true}$). Spec-3 violation is formally analyzed in theory.html §6; it produces spurious B-channel updates even when $B_\text{true}$ is unchanged.

§5 · MOAT v5g Benchmark

5.1 System

$$x_{t+1} = Ax_t + B_\text{true}u_t + w_t, \quad x_t, u_t \in \mathbb{R}^2$$
$$H_B:\; B_\text{true} = I + \delta_B v_B v_B^\top \qquad\qquad H_Q:\; Q_t = \sigma_w^2 I + \delta_Q\,\mathbf{1}_\text{burst}(t)\,v_Q v_Q^\top$$
$$v_B \sim \mathrm{Uniform}(S^1), \quad v_Q = R(\theta)v_B, \quad \theta \sim \mathrm{Uniform}(30^\circ,150^\circ)$$
$$\delta_B^2 \cdot \mathbb{E}[\|u_t\|^2] = \delta_Q \quad \text{(single-step indistinguishability)}$$
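One way to instantiate this geometry is sketched below, using the parameter values from §6 ($\delta_B = 0.9$, $\sigma_w = 0.25$, input energy 2.0). The function names are hypothetical, and the burst schedule $\mathbf{1}_\text{burst}(t)$ is left to the caller because the source does not specify it here.

```python
import numpy as np

def make_geometry(rng, delta_B=0.9, input_energy=2.0):
    """Sample v_B, v_Q and the matched burst magnitude delta_Q."""
    phi = rng.uniform(0.0, 2.0 * np.pi)
    v_B = np.array([np.cos(phi), np.sin(phi)])       # v_B ~ Uniform(S^1)
    theta = np.radians(rng.uniform(30.0, 150.0))
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    v_Q = R @ v_B
    delta_Q = delta_B**2 * input_energy              # single-step matching
    return v_B, v_Q, theta, delta_Q

def B_true_under_HB(v_B, delta_B=0.9):
    """Input matrix under the B-drift hypothesis."""
    return np.eye(2) + delta_B * np.outer(v_B, v_B)

def Q_under_HQ(v_Q, delta_Q, sigma_w=0.25, burst=True):
    """Process-noise covariance under the Q-burst hypothesis."""
    scale = delta_Q if burst else 0.0
    return sigma_w**2 * np.eye(2) + scale * np.outer(v_Q, v_Q)

rng = np.random.default_rng(0)
v_B, v_Q, theta, delta_Q = make_geometry(rng)
```

With the stated parameters the matching condition gives $\delta_Q = 0.9^2 \times 2.0 = 1.62$, so a burst inflates the variance along $v_Q$ from $\sigma_w^2 = 0.0625$ to $1.6825$.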

5.2 Measurement Architecture (Three Levels)

Metric	Level	Purpose
`D_probe(t)`	Diagnostic	AUC under fixed isotropic probe — environment identifiability check
`D_oracle(t)`	Diagnostic	AUC under correct-belief policy — counterfactual diagnostic
`DE_B(t)`	Diagnostic	Directional energy: $v_B^\top\mathbb{E}[uu^\top]v_B / \mathrm{tr}$
`Attribution angle`	Diagnostic	$\angle(\hat{v},v_B)$ vs $\angle(\hat{v},v_Q)$ — internal misattribution evidence
`AUC_residual(t)`	Performance	Classifier on $e_{t+3:t+3+k}$ — no ground-truth geometry
`AUC_action(t)`	Leakage	Classifier on $u_{t:t+k}$ — should remain near chance

5.3 SRAAgent (Constructive Minimal Counterexample)

Update:    B_est += lr * outer(e_t, u_t) / ‖u_t‖²
Policy:    v_est = left_singular_vector(B_est − I)
           cov_u = E·(1−δ)·v_est⊗v_est + E·δ·v_est⊥⊗v_est⊥
Under H_Q: e_t ≈ burst_noise along v_Q
           → B_est−I accumulates v_Q⊗v_Q component
           → policy concentrates along v_Q (positive feedback)

SRAAgent is a constructive minimal counterexample, not a claim about general adaptive agents. The relevant condition is a single-channel LS update with no Q-burst model.
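The update and policy rules above can be rendered as a short NumPy sketch. This is an illustration under stated assumptions rather than the archive's code: the class name is hypothetical, actions are assumed zero-mean Gaussian with the stated covariance, the policy is assumed isotropic while $\hat{B} = I$ (consistent with $\mathrm{DE}_B = 0.500$ at $t=0$ in §6.2), and `delta = 0.15` is an assumed concentration parameter.

```python
import numpy as np

class SRAAgentSketch:
    """Single-channel LS-style agent with no Q-burst model (illustrative)."""

    def __init__(self, lr=0.15, energy=2.0, delta=0.15, seed=0):
        self.B_est = np.eye(2)
        self.lr, self.energy, self.delta = lr, energy, delta
        self.rng = np.random.default_rng(seed)

    def update(self, e, u):
        """Normalised update: every residual is credited to the B channel."""
        self.B_est += self.lr * np.outer(e, u) / (u @ u)

    def action_cov(self):
        """Concentrate input energy along the estimated drift axis."""
        D = self.B_est - np.eye(2)
        if np.linalg.norm(D) < 1e-12:            # no estimate yet: isotropic
            return 0.5 * self.energy * np.eye(2)
        v = np.linalg.svd(D)[0][:, 0]            # leading left singular vector
        v_perp = np.array([-v[1], v[0]])
        return self.energy * ((1 - self.delta) * np.outer(v, v)
                              + self.delta * np.outer(v_perp, v_perp))

    def act(self):
        """Sample an action (assumption: zero-mean Gaussian)."""
        return self.rng.multivariate_normal(np.zeros(2), self.action_cov())

# One residual along v_Q is enough to tilt the policy toward v_Q.
agent = SRAAgentSketch()
v_Q = np.array([0.0, 1.0])
agent.update(e=v_Q, u=np.array([1.0, 1.0]))
C = agent.action_cov()
share_along_vQ = float(v_Q @ C @ v_Q / np.trace(C))
```

Because the outer product $e_t u_t^\top$ has column space spanned by $e_t$, a residual along $v_Q$ makes $v_Q$ the left singular vector of $\hat{B} - I$ irrespective of the action that was applied, which is the absorption step the feedback loop relies on.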

5.4 Collapse Detection Criteria

Conditions C1–C5, together with the attribution-angle and action-leakage checks, are established by Stages 1–2d. C6 (AUC_residual < 0.60) is the stronger Stage 2e claim: residual AUC collapse inside the endogenous adaptive loop has not yet been demonstrated.

C1  D_probe AUC          > 0.75   environment identifiable          ✓ Stage 1
C2  D_oracle AUC         > 0.75   correct belief preserves ident.   ✓ Stage 1
C3  PE_policy            ≥ ε_PE   sufficient input rank             ✓ Stage 2a
C4  InputEnergy          ≥ ε_E    sufficient input energy           ✓ Stage 2a
C5  DE_B                 ↓        directional depletion             ✓ Stage 2a
    Attribution angle    error rate ≥ 0.55 under H_Q               ✓ Stage 2b
    AUC_action           ≈ chance  leakage check                   ✓ Stage 2c
─── Stage 2e open stronger claim ──────────────────────────────────────────
C6  AUC_residual         < 0.60   residual collapse in adaptive loop   open
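The criteria can be expressed as a small checker. The sketch below is hypothetical: the dictionary keys are invented for illustration, the thresholds $\varepsilon_\text{PE} = 0.15$ and $\varepsilon_E = 1.0$ are read off the §6.2 table, and metrics the caller does not supply are reported as `None` rather than guessed. The leakage check is omitted because no numeric threshold for "≈ chance" is given.

```python
def collapse_criteria(m, eps_pe=0.15, eps_e=1.0):
    """Evaluate C1-C6 on a dict of metrics; missing metrics map to None."""
    def check(key, test):
        value = m.get(key)
        return None if value is None else test(value)

    de_late, de_init = m.get("de_b_late"), m.get("de_b_initial")
    return {
        "C1": check("d_probe_auc", lambda v: v > 0.75),
        "C2": check("d_oracle_auc", lambda v: v > 0.75),
        "C3": check("pe_policy", lambda v: v >= eps_pe),
        "C4": check("input_energy", lambda v: v >= eps_e),
        "C5": None if None in (de_late, de_init) else de_late < de_init,
        "attr_error": check("attr_error_rate", lambda v: v >= 0.55),
        "C6": check("auc_residual", lambda v: v < 0.60),
    }

# H_Q values reported in Stages 2a-2c; D_probe / D_oracle AUCs are not
# quoted numerically in this section, so they are left out.
status = collapse_criteria({
    "pe_policy": 0.300,
    "input_energy": 2.000,
    "de_b_initial": 0.500,
    "de_b_late": 0.406,
    "attr_error_rate": 0.888,
    "auc_residual": 0.762,
})
```

On these inputs C3, C4, C5 and the attribution-angle check pass, while C6 does not, matching the open status of Stage 2e.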

§6 · Experimental Results: Stage 1–2d

Parameters: $n=600$ episodes, $T=60$ steps, $\delta_B=0.9$, $\sigma_w=0.25$, input energy $=2.0$, lr $=0.15$, $\theta\in[30°,150°]$. Classifiers: Linear SVM + RFF-SVM. AUC $= \max(\text{auc},1-\text{auc})$.
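The folded AUC convention $\max(\text{auc}, 1-\text{auc})$ can be illustrated with a dependency-free rank estimate. This is a sketch of the convention only; the function names are hypothetical and the source's classifiers (Linear SVM, RFF-SVM) are not reproduced.

```python
def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of P(score_pos > score_neg); ties count 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def folded_auc(scores_pos, scores_neg):
    """Label-orientation-free AUC: max(auc, 1 - auc), always >= 0.5."""
    a = auc(scores_pos, scores_neg)
    return max(a, 1.0 - a)

inverted = auc([0.1, 0.2], [0.8, 0.9])           # perfectly inverted scores
folded = folded_auc([0.1, 0.2], [0.8, 0.9])      # folding restores separability
```

Because the folded value can never fall below 0.5, a classifier with no signal will on average score slightly above 0.5 on finite samples, which is worth keeping in mind when reading near-chance entries.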

6.1 Stage 1 — External Geometry Validation

Externally controlled wrong_strength reduces $\mathrm{DE}_B$ while holding PE and total energy fixed. Result: AUC_residual falls below the 0.60 collapse threshold in this externally controlled setting. Under externally controlled depletion, directional depletion can drive trajectory-level residual distinguishability below the collapse threshold while PE and total energy are preserved.

6.2 Stage 2a — Endogenous Directional Depletion

wrong_strength removed. SRAAgent runs from $\hat{B}_0 = I$.

Metric	H_B	H_Q	Status
DE_B (late, t=38–55)	0.826 ↑	0.406 ↓	contrast 0.420 ✓
PE (H_Q late)	—	0.300	≥ 0.15 ✓
Input Energy (H_Q late)	—	2.000	≥ 1.0 ✓
DE_B at t=0	0.500	0.500	both isotropic ✓

Contamination completes within t=2–3 and stabilises. The loop $e_t \to \hat{B}_t \to u_t \to e_{t+1}$ is self-reinforcing under H_Q.

6.3 Stage 2b — Attribution Angle (Primary Internal Evidence)

Condition	Rate	Mean $\angle(\hat{v},v_B)$	Mean $\angle(\hat{v},v_Q)$
H_B (correct)	0.983	6.9°	58.3°
H_Q (error)	0.888	55.2°	15.3°

In 88.8% of H_Q episodes, the agent's estimated drift direction is closer to $v_Q$ than $v_B$. The agent has structurally absorbed Q-burst evidence as B-channel drift.

6.4 Stage 2c — Policy-Matched Replay

Policy source	AUC	Interpretation
SRA adaptive (own policies)	0.762	Divergent policies create readable residuals
Action-only (leakage check)	0.550	Near chance; no action-label leakage
Replay: H_Q actions → both envs	0.553	Under equal actions: near-indistinct
AUC drop	0.209	Policy geometry was the AUC source
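The logic of policy-matched replay is that one action sequence is fed to both environments, so any remaining separability cannot come from differing action geometry. The sketch below is a simplified illustration, not the Stage 2c code: it assumes residuals are computed against a frozen nominal model $\hat{B} = I$, that $A$ is known, and that the burst is active on every step; the function name is hypothetical.

```python
import numpy as np

def replay_residuals(actions, hypothesis, v_B, v_Q, rng,
                     delta_B=0.9, delta_Q=1.62, sigma_w=0.25):
    """Residuals e_t = (B_true - I) u_t + w_t for a fixed action sequence."""
    out = []
    for u in actions:
        w = sigma_w * rng.standard_normal(2)          # baseline process noise
        if hypothesis == "H_B":
            e = delta_B * v_B * (v_B @ u) + w         # drift term depends on u
        else:
            # H_Q: burst noise along v_Q (assumed always on), independent of u
            e = w + np.sqrt(delta_Q) * rng.standard_normal() * v_Q
        out.append(e)
    return np.array(out)

rng = np.random.default_rng(0)
v_B, v_Q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
actions = np.array([[1.0, 0.0], [0.0, 1.0]])          # shared by both replays
e_B = replay_residuals(actions, "H_B", v_B, v_Q, rng, sigma_w=0.0)
e_Q = replay_residuals(actions, "H_Q", v_B, v_Q, rng, sigma_w=0.0)
```

In this simplified model the H_B residual carries signal only through the projection $v_B^\top u_t$, so a replayed action sequence with little energy along $v_B$ leaves the two residual streams hard to tell apart.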

6.5 Stage 2d — Multi-Directional Replay Table

Action source	AUC (mean of lin and RFF)	Linear SVM	RFF-SVM
SRA adaptive	0.761	0.524	0.998
Action-only (leakage)	0.550	0.543	0.556
Replay: H_B actions	0.639	0.533	0.745
Replay: H_Q actions	0.537	0.530	0.544
Replay: $v_B$ policy	0.650	0.506	0.794
Replay: $v_Q$ policy	0.598	0.586	0.609
Replay: isotropic	0.518	0.511	0.525
Replay: $v_B$-oracle (≡ $v_B$)	0.640	0.503	0.777

Pattern: $v_B$-aligned $\approx$ H_B-actions $>$ $v_Q$-aligned $>$ H_Q-actions $\approx$ isotropic. AUC is a function of action direction. Note: $v_Q > \text{isotropic}$ is expected ($v_Q$ excites Q-burst variance); the claim is $v_B > v_Q$, not $v_Q < \text{isotropic}$.

B-channel absorption ratio (supplementary proxy: $\|\hat{B}-I\|_F / \bar{e}^2$): H_B = 3.274, H_Q = 0.506. Both absorb signal into the B channel; H_Q's absorption is spurious and directed along $v_Q$.

§7 · Claim Hierarchy

Claim	Status
PE and attribution separability are independent conditions	✓ defensible
SRAAgent generates endogenous $\mathrm{DE}_B$ depletion without `wrong_strength`	✓ Stage 2a
$\hat{v} \approx v_Q$ in 88.8% of H_Q episodes (internal misattribution)	✓ Stage 2b
Adaptive AUC drop 0.762→0.553 under policy-matched replay	✓ Stage 2c
AUC varies directionally: $v_B > v_Q$ in replay table	✓ Stage 2d
$D_\text{ext} \neq \mathrm{AC}$: external distinguishability ≠ internal attribution	✓ empirically supported
Residual AUC collapse inside same adaptive loop (Stage 2e)	open stronger claim
SRA is a new theory distinct from ABHT	not yet — failure-mode benchmark
General causal ID under Spec-3 (hidden confounder)	not reached
High-PE accelerates collapse (High-PE Paradox)	retracted

§8 · Relationship to ABHT

Active Bayesian Hypothesis Testing (ABHT) selects actions to maximize the distinguishability $D(P(r \mid H_B,\pi) \| P(r \mid H_Q,\pi))$. It addresses: which actions increase identifiability?

SRA/MOAT addresses: how does wrong structural attribution make an agent avoid the identifiability-increasing direction? The two are complementary. ABHT prescribes the optimal policy; SRA/MOAT characterizes the failure mode when the agent does not hold the correct structural hypothesis.
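To make the "identifiability-increasing direction" concrete, the following sketch scans action directions for the one that maximizes a single-step divergence between the two hypotheses. It rests on several assumptions not fixed by the source: $D$ is taken to be the KL divergence, residuals are Gaussian and computed against a nominal $\hat{B} = I$, and the burst is active. The function names are hypothetical.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def single_step_divergence(u, v_B, v_Q, delta_B=0.9, delta_Q=1.62, sigma_w=0.25):
    """Divergence between residual laws under H_B and H_Q for action u."""
    S_B = sigma_w**2 * np.eye(2)                  # H_B: baseline noise only
    S_Q = S_B + delta_Q * np.outer(v_Q, v_Q)      # H_Q: burst along v_Q
    mu_B = delta_B * v_B * (v_B @ u)              # H_B: mean shift along v_B
    return kl_gauss(mu_B, S_B, np.zeros(2), S_Q)

v_B = np.array([1.0, 0.0])
v_Q = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])
angles = np.linspace(0.0, np.pi, 181)
values = [single_step_divergence(np.sqrt(2.0) * np.array([np.cos(a), np.sin(a)]),
                                 v_B, v_Q) for a in angles]
best_angle = angles[int(np.argmax(values))]
```

Under these assumptions only the mean-shift term depends on $u_t$, and it scales with $(v_B^\top u_t)^2$, so the divergence is maximized by actions aligned with $v_B$. That is the direction a misattributing agent steers away from when it concentrates its energy along $v_Q$.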

Safe Differential Claim

MOAT v5g does not stand outside ABHT as an independent theory. It presents a closed-loop failure mode — in which incorrect structural attribution shifts the policy away from the B-discriminative direction, making the apparent distinguishability used by the evaluator policy-geometry dependent — as a falsifiable benchmark directly comparable against ABHT baselines. If ABHT-family agents avoid Directional Collapse in the Stage 2e experiment, this is a valuable negative result: ABHT-style active sensing already covers this pathology.

A targeted question for the baseline experiments: can an ABHT-style agent avoid internal structural misattribution when apparent residual separability is policy-geometry dependent and attribution angle degrades toward $v_Q$?

§9 · Limitations

SRAAgent is a constructive minimal counterexample. The policy-concentration rule is designed to exhibit the failure. The claim is that this failure mode can occur endogenously, not that it occurs in all adaptive agents or under all update rules.

Action-only AUC = 0.550 is near the leakage threshold. Multi-seed confidence intervals are required before ruling out residual policy-signature leakage. The value is reported as near chance, not as evidence of zero leakage.

Contamination Jacobian lacks constructive proof. The conditions $\rho(J(\delta_t)) > 1$ for the linearized contamination system have not been numerically demonstrated in the 2D counterexample. This remains an analytic sketch.

Spec-3 violation (Hidden Confounder) breaks SRA. When a hidden confounder $c_t$ satisfies $\mathbb{E}[e_t u_t^\top \mid c_t] \neq 0$, $\hat{B}$ updates even when $B_\text{true}$ is constant. SRA is a causal attribution proxy only under Spec-1 to 5. Phase 5 of MOAT v5g tests this limit explicitly.

Stage 2e is open. Residual AUC collapse inside the same adaptive loop has not been demonstrated. Replay AUC = 0.553 falls below the 0.60 threshold, but replay is an off-loop intervention and does not constitute confirmation of collapse inside the loop.

§10 · Conclusions

The experimental chain Stage 1–2d establishes five results:

Under externally controlled depletion, directional energy depletion can drive trajectory-level residual distinguishability below the collapse threshold while PE and total energy are preserved (Stage 1).
A minimal adaptive agent generates this depletion endogenously through structural misattribution, without external injection (Stage 2a).
The agent's internal attribution map fails: estimated drift direction aligns with the Q-burst direction in 88.8% of adversarial episodes (Stage 2b).
The high apparent residual AUC (0.762) is policy-geometry dependent: it drops to 0.553 under policy-matched replay while action-only AUC remains near chance (Stage 2c).
AUC is a function of action direction; $v_B$-aligned actions yield higher separability than $v_Q$-aligned or H_Q-generated actions (Stage 2d).

The central contribution is the empirical demonstration that external classifier distinguishability and agent-internal attribution correctness are independent. Standard adaptive system evaluation uses external performance metrics to infer internal attribution quality; the SRAAgent result shows this inference fails when policies are adapted under structural misattribution.

SRA/MOAT is positioned not as a theory of external indistinguishability but as a benchmark for the dissociation between external evidence quality and internal attribution fidelity — a dissociation that ABHT and classical closed-loop identification do not directly target, and whose boundary MOAT v5g is designed to map.

Stage 2e (residual AUC collapse inside the adaptive loop) remains open. If Stage 2e succeeds, the recursive poisoning claim would be supported for this benchmark. If instead ABHT-family baselines avoid Directional Collapse in Stage 2e, the negative result indicates that ABHT-style active sensing already covers this geometry. Either outcome is a useful benchmark finding.