We define and empirically investigate Internal Attribution Collapse — a constructive failure mode in which a minimal adaptive agent under nonstationary partial observability structurally misattributes residuals to the wrong latent channel, generating policies that avoid the discriminative direction and making apparent residual evidence policy-geometry dependent. The central finding is a dissociation:
The original claim of theory.html — that misattribution recursively degrades external trajectory-level distinguishability — was not confirmed experimentally. What was confirmed is stronger in a different direction: apparent residual separability is constructed by the policy-induced trajectory geometry, while the agent's internal attribution map points in the wrong structural direction.
We present MOAT v5g as a stress-test benchmark and SRAAgent as a constructive minimal counterexample showing that this failure mode can occur under a single-channel LS update with no Q-burst model.
theory.html framed Attribution Collapse as producing external residual indistinguishability via recursive contamination:
Experiments show that adaptive residual AUC does not collapse; it remains high (0.762). The experimental chain instead reveals a decomposition of that high AUC:
The revised definition of Attribution Collapse:
What survives from theory.html: the PE ⊥ Attribution Separability proposition (§3.3); the Directional Collapse metric; the MOAT v5g measurement architecture (D_probe / D_oracle / D_policy separation); the Hidden Confounder analysis (§6); and the honest Limitations section. The Contamination Jacobian analysis ($\rho(J) > 1$) survives as a mechanism sketch but remains without constructive numerical demonstration.
Let $\mathcal{H} = \{H_B, H_Q\}$ be competing structural hypotheses (B-drift vs Q-burst). Define:
The empirical result: under $H_Q$ with SRAAgent, $D_\text{ext} = 0.762$ while $\mathrm{AC}_t = 0$ in 88.8% of episodes.
SRAAgent under $H_Q$ and SRAAgent under $H_B$ adapt toward different action directions ($v_Q$ and $v_B$ respectively). Their divergent policies generate structurally different residual trajectories that an external classifier can read. However, the $H_Q$ agent does not know this; it has absorbed $v_Q$ as the learned B-drift direction.
The external classifier's success is a forensic artifact of the divergence between two misattributing policies, not a sign that any agent has correctly attributed the cause. Removing the policy divergence (policy-matched replay) drops AUC to 0.553, approaching the single-step indistinguishability regime by design.
This dissociation matters because standard adaptive system evaluation uses external performance metrics to infer internal attribution quality. If $D_\text{ext}$ is high, a system is deemed to be "correctly reading its environment." The SRAAgent result shows this inference fails when policies are adapted: the external observer benefits from a comparison the agent itself never makes.
Attribution separability measures how geometrically distinguishable the residual subspaces of each cause are under the current policy. This angle provides the theoretical framing; in MOAT v5g, the operational diagnostic is $\mathrm{DE}_B$ depletion under preserved PE and energy, with internal misattribution confirmed via the attribution-angle diagnostic (§6.3).
Directional Collapse is declared when $\mathrm{DE}_B \downarrow$ while $\lambda_{\min}(\mathbb{E}[u_t u_t^\top]) \geq \varepsilon_\text{PE}$ and $\mathrm{tr}\,\mathbb{E}[u_t u_t^\top] \geq \varepsilon_E$. This distinguishes directional depletion from PE collapse and from energy starvation.
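The Directional Collapse test can be sketched numerically as follows. This is a minimal illustration rather than the MOAT v5g implementation: the function names, the `min_drop` margin used to operationalise "$\mathrm{DE}_B \downarrow$", and the 60° rotation are assumptions, while the thresholds 0.15 and 1.0 are the PE and energy floors quoted in the Stage 2a table. The example rotates the principal axis of the action covariance away from $v_B$, which lowers $\mathrm{DE}_B$ while leaving the eigenvalues (and therefore $\lambda_{\min}$ and the trace) unchanged.

```python
import numpy as np

def directional_diagnostics(actions, v_b):
    """Return (DE_B, lambda_min, trace) of the empirical second moment E[u u^T]."""
    u = np.asarray(actions, dtype=float)
    v = np.asarray(v_b, dtype=float)
    v = v / np.linalg.norm(v)
    second_moment = u.T @ u / len(u)                        # empirical E[u u^T]
    trace = float(np.trace(second_moment))
    de_b = float(v @ second_moment @ v) / trace             # v_B^T E[uu^T] v_B / tr
    lam_min = float(np.linalg.eigvalsh(second_moment)[0])   # eigvalsh sorts ascending
    return de_b, lam_min, trace

def directional_collapse(de_early, de_late, lam_min, trace,
                         eps_pe=0.15, eps_e=1.0, min_drop=0.05):
    """DE_B dropped by at least min_drop (assumed margin) while PE and energy hold."""
    return bool((de_early - de_late) >= min_drop and lam_min >= eps_pe and trace >= eps_e)

def rotated_cov(energy, delta, angle_deg):
    """Covariance with eigenvalues energy*(1-delta), energy*delta, main axis at angle_deg."""
    th = np.deg2rad(angle_deg)
    main = np.array([np.cos(th), np.sin(th)])
    perp = np.array([-np.sin(th), np.cos(th)])
    return energy * ((1 - delta) * np.outer(main, main) + delta * np.outer(perp, perp))

rng = np.random.default_rng(0)
v_b = np.array([1.0, 0.0])
early = rng.multivariate_normal(np.zeros(2), rotated_cov(2.0, 0.15, 0.0), size=50_000)
late = rng.multivariate_normal(np.zeros(2), rotated_cov(2.0, 0.15, 60.0), size=50_000)

de_early, _, _ = directional_diagnostics(early, v_b)
de_late, lam_min_late, trace_late = directional_diagnostics(late, v_b)
collapsed = directional_collapse(de_early, de_late, lam_min_late, trace_late)
```

In this toy setting $\mathrm{DE}_B$ falls from about 0.85 to about 0.33 while $\lambda_{\min} \approx 0.3$ and the trace stays near 2.0, so the drop cannot be explained by PE collapse or energy starvation.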
SRA functions as a causal attribution proxy only under Spec-1 (process noise independent of action), Spec-2 (observation noise independent of action), Spec-3 (no hidden confounder), Spec-4 (PE condition), and Spec-5 (quasi-stationarity of $B_\text{true}$). Spec-3 violation is formally analyzed in theory.html §6; it produces spurious B-channel updates even when $B_\text{true}$ is unchanged.
| Metric | Level | Purpose |
|---|---|---|
| D_probe(t) | Diagnostic | AUC under fixed isotropic probe; environment identifiability check |
| D_oracle(t) | Diagnostic | AUC under correct-belief policy; counterfactual diagnostic |
| DE_B(t) | Diagnostic | Directional energy: $v_B^\top\mathbb{E}[uu^\top]v_B / \mathrm{tr}\,\mathbb{E}[uu^\top]$ |
| Attribution angle | Diagnostic | $\angle(\hat{v},v_B)$ vs $\angle(\hat{v},v_Q)$; internal misattribution evidence |
| AUC_residual(t) | Performance | Classifier on $e_{t+3:t+3+k}$; no ground-truth geometry |
| AUC_action(t) | Leakage | Classifier on $u_{t:t+k}$; should remain near chance |
Update:     B_est += lr * outer(e_t, u_t) / ‖u_t‖²
Policy:     v_est = left_singular_vector(B_est − I)
            cov_u = E·(1−δ)·v_est⊗v_est + E·δ·v_est⊥⊗v_est⊥
Under H_Q:  e_t ≈ burst_noise along v_Q
            → B_est − I accumulates v_Q⊗v_Q component
            → policy concentrates along v_Q (positive feedback)
SRAAgent is a constructive minimal counterexample, not a claim about general adaptive agents. The relevant condition is a single-channel LS update with no Q-burst model.
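A minimal 2D sketch of the update and policy rules may help make the feedback loop concrete. It is an illustration under stated assumptions, not the benchmark code: the class name `SRAAgentSketch`, the value `delta=0.15`, the isotropic policy used while $\hat{B} = I$, the 70° angle, and the toy residual model (a scalar burst along $v_Q$, standing in for the environment dynamics, which are not specified here) are all hypothetical. Only `lr=0.15`, input energy 2.0, and $T=60$ come from the stated parameters.

```python
import numpy as np

class SRAAgentSketch:
    """2D single-channel LS agent: every residual is credited to the B channel."""

    def __init__(self, lr=0.15, energy=2.0, delta=0.15, tol=1e-9):
        self.b_est = np.eye(2)      # B_hat_0 = I
        self.lr = lr
        self.energy = energy        # trace of the action covariance
        self.delta = delta          # energy share kept off the estimated direction (assumed)
        self.tol = tol

    def update(self, e, u):
        # B_est += lr * outer(e_t, u_t) / ||u_t||^2
        self.b_est += self.lr * np.outer(e, u) / float(u @ u)

    def estimated_direction(self):
        # Leading left singular vector of (B_est - I); None while B_est is still I.
        left, sing, _ = np.linalg.svd(self.b_est - np.eye(2))
        return left[:, 0] if sing[0] > self.tol else None

    def action_cov(self):
        v = self.estimated_direction()
        if v is None:
            return 0.5 * self.energy * np.eye(2)    # isotropic start (assumption)
        along = np.outer(v, v)
        perp = np.eye(2) - along                    # equals v_perp (x) v_perp in 2D
        return self.energy * ((1 - self.delta) * along + self.delta * perp)

    def sample_action(self, rng):
        return rng.multivariate_normal(np.zeros(2), self.action_cov())


rng = np.random.default_rng(0)
theta = np.deg2rad(70.0)                            # illustrative angle for v_Q
v_q = np.array([np.cos(theta), np.sin(theta)])
agent = SRAAgentSketch()
cov_start = agent.action_cov()
for _ in range(60):                                 # T = 60 steps
    u = agent.sample_action(rng)
    e = rng.normal() * v_q                          # toy H_Q residual: burst along v_Q only
    agent.update(e, u)

alignment = abs(agent.estimated_direction() @ v_q)  # 1.0 means v_est is parallel to v_Q
cov_end = agent.action_cov()
```

Because every toy residual lies along $v_Q$, $\hat{B} - I$ is rank one with column space spanned by $v_Q$, so the estimated direction locks onto $v_Q$ and the policy places a $(1-\delta)$ share of its energy there. The agent has no Q-burst channel in which to file that evidence, which is the single-channel condition named above.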
Conditions C1–C5 are established by Stages 1–2d. C6 (AUC_residual < 0.60) is the stronger Stage 2e claim: residual AUC collapse inside the endogenous adaptive loop has not yet been demonstrated.
| ID | Criterion | Meaning | Status | Stage |
|---|---|---|---|---|
| C1 | D_probe AUC > 0.75 | environment identifiable | ✓ | Stage 1 |
| C2 | D_oracle AUC > 0.75 | correct belief preserves identifiability | ✓ | Stage 1 |
| C3 | PE_policy ≥ ε_PE | sufficient input rank | ✓ | Stage 2a |
| C4 | InputEnergy ≥ ε_E | sufficient input energy | ✓ | Stage 2a |
| C5 | DE_B ↓ | directional depletion | ✓ | Stage 2a |
| | Attribution angle error rate ≥ 0.55 | under H_Q | ✓ | Stage 2b |
| | AUC_action ≈ chance | leakage check | ✓ | Stage 2c |
| C6 | AUC_residual < 0.60 | residual collapse in adaptive loop | open | Stage 2e (open stronger claim) |
Parameters: $n=600$ episodes, $T=60$ steps, $\delta_B=0.9$, $\sigma_w=0.25$, input energy $=2.0$, lr $=0.15$, $\theta\in[30°,150°]$. Classifiers: Linear SVM + RFF-SVM. AUC $= \max(\text{auc},1-\text{auc})$.
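The folding rule $\text{AUC} = \max(\text{auc}, 1-\text{auc})$ can be illustrated with a dependency-free sketch. The helper names `auc_rank` and `folded_auc` are hypothetical, and the rank-based AUC below stands in for whatever routine the benchmark actually uses to score the SVM outputs.

```python
def auc_rank(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def folded_auc(scores, labels):
    """Label-orientation-free AUC: max(auc, 1 - auc), always in [0.5, 1.0]."""
    auc = auc_rank(scores, labels)
    return max(auc, 1.0 - auc)

# A classifier that ranks every positive below every negative is still fully informative.
scores = [0.1, 0.2, 0.3, 0.8]
labels = [1, 1, 0, 0]
raw = auc_rank(scores, labels)       # 0.0
folded = folded_auc(scores, labels)  # 1.0
```

Folding removes the dependence on which hypothesis is labelled positive, but it also floors the statistic at 0.5, so a finite-sample folded AUC at chance sits slightly above 0.5 on average. That is worth keeping in mind when reading near-chance values.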
Externally controlled wrong_strength reduces $\mathrm{DE}_B$ while holding PE and total energy fixed.
Result: AUC_residual falls below the 0.60 collapse threshold in this externally controlled setting.
Under externally controlled depletion, directional depletion can drive trajectory-level residual distinguishability below the collapse threshold while PE and total energy are preserved.
The external wrong_strength control is removed, and SRAAgent runs from $\hat{B}_0 = I$.
| Metric | H_B | H_Q | Status |
|---|---|---|---|
| DE_B (late, t=38–55) | 0.826 ↑ | 0.406 ↓ | contrast 0.420 ✓ |
| PE (H_Q late) | — | 0.300 | ≥ 0.15 ✓ |
| Input Energy (H_Q late) | — | 2.000 | ≥ 1.0 ✓ |
| DE_B at t=0 | 0.500 | 0.500 | both isotropic ✓ |
Contamination completes within t=2–3 and stabilises. The loop $e_t \to \hat{B}_t \to u_t \to e_{t+1}$ is self-reinforcing under $H_Q$.
| Condition | Rate | Mean $\angle(\hat{v},v_B)$ | Mean $\angle(\hat{v},v_Q)$ |
|---|---|---|---|
| H_B (correct) | 0.983 | 6.9° | 58.3° |
| H_Q (error) | 0.888 | 55.2° | 15.3° |
In 88.8% of $H_Q$ episodes, the agent's estimated drift direction is closer to $v_Q$ than to $v_B$. The agent has structurally absorbed Q-burst evidence as B-channel drift.
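The attribution-angle diagnostic can be sketched as a comparison of two axis angles. This is an illustrative reading, not the benchmark code: the function names are hypothetical, angles are taken modulo sign because a singular vector is only defined up to sign, and ties are assigned to the B channel by assumption. The example vectors are chosen to be close to the reported $H_Q$ mean angles.

```python
import math

def axis_angle_deg(a, b):
    """Angle in [0, 90] degrees between the lines spanned by a and b (sign-invariant)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_abs = min(1.0, abs(dot) / (norm_a * norm_b))   # clamp against rounding
    return math.degrees(math.acos(cos_abs))

def attributed_channel(v_hat, v_b, v_q):
    """Return ('B' or 'Q', angle to v_B, angle to v_Q); ties go to 'B' (assumption)."""
    ang_b = axis_angle_deg(v_hat, v_b)
    ang_q = axis_angle_deg(v_hat, v_q)
    return ("B" if ang_b <= ang_q else "Q"), ang_b, ang_q

def misattribution_rate(v_hats, v_b, v_q):
    """Fraction of episodes whose estimated direction is closer to v_Q than to v_B."""
    hits = sum(1 for v in v_hats if attributed_channel(v, v_b, v_q)[0] == "Q")
    return hits / len(v_hats)

def unit(deg):
    return (math.cos(math.radians(deg)), math.sin(math.radians(deg)))

v_b, v_q = unit(0.0), unit(70.0)          # illustrative geometry
channel, ang_b, ang_q = attributed_channel(unit(55.0), v_b, v_q)
rate = misattribution_rate([unit(55.0), unit(5.0), unit(60.0), unit(-110.0)], v_b, v_q)
```

Here the estimate at 55° is 55° from $v_B$ and 15° from $v_Q$, so it is counted as a Q-direction (misattributed) episode; the last test vector shows that a sign-flipped estimate is scored the same as its opposite.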
| Policy source | AUC | Interpretation |
|---|---|---|
| SRA adaptive (own policies) | 0.762 | Divergent policies create readable residuals |
| Action-only (leakage check) | 0.550 | Near chance; no action-label leakage |
| Replay: H_Q actions → both envs | 0.553 | Under equal actions: near-indistinct |
| AUC drop | 0.209 | Policy geometry was the AUC source |
| Action source | AUC (mean of lin, RFF) | Linear SVM | RFF-SVM |
|---|---|---|---|
| SRA adaptive | 0.761 | 0.524 | 0.998 |
| Action-only (leakage) | 0.550 | 0.543 | 0.556 |
| Replay: H_B actions | 0.639 | 0.533 | 0.745 |
| Replay: H_Q actions | 0.537 | 0.530 | 0.544 |
| Replay: $v_B$ policy | 0.650 | 0.506 | 0.794 |
| Replay: $v_Q$ policy | 0.598 | 0.586 | 0.609 |
| Replay: isotropic | 0.518 | 0.511 | 0.525 |
| Replay: $v_B$-oracle (≡ $v_B$) | 0.640 | 0.503 | 0.777 |
Pattern: $v_B$-aligned $\approx$ $H_B$-actions $>$ $v_Q$-aligned $>$ $H_Q$-actions $\approx$ isotropic. AUC is a function of action direction. Note: $v_Q > \text{isotropic}$ is expected ($v_Q$ excites Q-burst variance); the claim is $v_B > v_Q$, not $v_Q < \text{isotropic}$.
B-channel absorption ratio (supplementary proxy: $\|\hat{B}-I\|_F / \bar{e}^2$): $H_B = 3.274$, $H_Q = 0.506$. Both absorb signal into the B channel; under $H_Q$ the absorption is spurious and directed along $v_Q$.
| Claim | Status |
|---|---|
| PE and attribution separability are independent conditions | ✓ defensible |
| SRAAgent generates endogenous $\mathrm{DE}_B$ depletion without wrong_strength | ✓ Stage 2a |
| $\hat{v} \approx v_Q$ in 88.8% of $H_Q$ episodes (internal misattribution) | ✓ Stage 2b |
| Adaptive AUC drop 0.762→0.553 under policy-matched replay | ✓ Stage 2c |
| AUC varies directionally: $v_B > v_Q$ in replay table | ✓ Stage 2d |
| $D_\text{ext} \neq \mathrm{AC}$: external distinguishability ≠ internal attribution | ✓ empirically supported |
| Residual AUC collapse inside same adaptive loop (Stage 2e) | open stronger claim |
| SRA is a new theory distinct from ABHT | not yet — failure-mode benchmark |
| General causal ID under Spec-3 (hidden confounder) | not reached |
| High-PE accelerates collapse (High-PE Paradox) | retracted |
Active Bayesian Hypothesis Testing (ABHT) selects actions to maximize the distinguishability $D(P(r \mid H_B,\pi) \| P(r \mid H_Q,\pi))$. It addresses: which actions increase identifiability?
SRA/MOAT addresses: how does wrong structural attribution make an agent avoid the identifiability-increasing direction? The two are complementary. ABHT prescribes the optimal policy; SRA/MOAT characterizes the failure mode when the agent does not hold the correct structural hypothesis.
A targeted question for the baseline experiments: can an ABHT-style agent avoid internal structural misattribution when apparent residual separability is policy-geometry dependent and attribution angle degrades toward $v_Q$?
SRAAgent is a constructive minimal counterexample. The policy-concentration rule is designed to exhibit the failure. The claim is that this failure mode can occur endogenously, not that it occurs in all adaptive agents or under all update rules.
Action-only AUC = 0.550 is near the leakage threshold. Multi-seed confidence intervals are required before ruling out residual policy-signature leakage. The value is reported as near chance, not as evidence of zero leakage.
Contamination Jacobian lacks constructive proof. The conditions $\rho(J(\delta_t)) > 1$ for the linearized contamination system have not been numerically demonstrated in the 2D counterexample. This remains an analytic sketch.
Spec-3 violation (Hidden Confounder) breaks SRA. When a hidden confounder $c_t$ satisfies $\mathbb{E}[e_t u_t^\top \mid c_t] \neq 0$, $\hat{B}$ updates even when $B_\text{true}$ is constant. SRA is a causal attribution proxy only under Spec-1 through Spec-5. Phase 5 of MOAT v5g tests this limit explicitly.
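The Spec-3 failure can be illustrated with a toy simulation. Everything below is an assumption made for illustration and is not taken from Phase 5 of MOAT v5g: a scalar confounder `c` leaks into the action through a hypothetical loading `a` and into the residual through a hypothetical loading `b`, while the residual has no direct dependence on the action (so $B_\text{true}$ is effectively constant). The quantity computed is the average of the normalised outer product used in the LS update, that is, the expected per-step push on $\hat{B}$ before the learning rate is applied.

```python
import numpy as np

def mean_normalised_update(e, u):
    """Average of outer(e_t, u_t) / ||u_t||^2 over samples."""
    norm2 = np.sum(u * u, axis=1)
    return np.einsum("ti,tj->ij", e / norm2[:, None], u) / len(u)

rng = np.random.default_rng(0)
n = 100_000
a = np.array([1.0, 0.0])                  # confounder -> action loading (hypothetical)
b = np.array([0.0, 1.0])                  # confounder -> residual loading (hypothetical)

c = rng.standard_normal((n, 1))           # hidden confounder shared by u and e
c_indep = rng.standard_normal((n, 1))     # same distribution, independent of u (control)

u = c * a + 0.5 * rng.standard_normal((n, 2))
e_confounded = c * b + 0.25 * rng.standard_normal((n, 2))
e_control = c_indep * b + 0.25 * rng.standard_normal((n, 2))

# Frobenius norm of the mean update: near zero means B_est only random-walks,
# clearly nonzero means B_est drifts systematically although B_true never changed.
drift_confounded = np.linalg.norm(mean_normalised_update(e_confounded, u))
drift_control = np.linalg.norm(mean_normalised_update(e_control, u))
```

In the control run the residual has the same marginal distribution but is independent of the action, and the mean update stays close to zero. With the shared confounder the mean update is clearly nonzero, which is the spurious B-channel drift described above.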
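Stage 2e is open. Residual AUC collapse inside the same adaptive loop has not been demonstrated. Replay AUC = 0.553 falls below the 0.60 threshold, but it was measured under policy-matched replay rather than inside the endogenous adaptive loop, so it does not constitute confirmation.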
The experimental chain Stage 1–2d establishes five results:
The central contribution is the empirical demonstration that external classifier distinguishability and agent-internal attribution correctness can dissociate. Standard adaptive system evaluation uses external performance metrics to infer internal attribution quality; the SRAAgent result shows this inference fails when policies are adapted under structural misattribution.
SRA/MOAT is positioned not as a theory of external indistinguishability but as a benchmark for the dissociation between external evidence quality and internal attribution fidelity — a dissociation that ABHT and classical closed-loop identification do not directly target, and whose boundary MOAT v5g is designed to map.
Stage 2e (residual AUC collapse inside the adaptive loop) remains open. If Stage 2e succeeds, the full recursive poisoning claim is established. If Stage 2e fails, the negative result would indicate that ABHT-family policies cover this geometry. Either outcome is a valuable benchmark finding.