04 Results · Collatz finite-block diagnostics

Results, by test

Each test is reported in its own section, with the numbers taken verbatim from the corresponding report. Diagnostic results come first (Tests 1–2), then the generative attempts (Tests 3–4). The consolidated reading is deferred to §5.

4.1 Test 1 — block anomaly score (B3/B4) class B

In the focus state late_growth | tail_64_95 | even, the \(B_4\) score's actual median (\(-0.047\)) lies above the iid median (\(-0.060\)), and the weighted AUC separating high-scoring actual words from iid is \(0.719\). Within that state the lowest \(B_4\)-score decile has survival ratio \(S = 0.035\) and the highest decile \(S = 1.806\) — a clear monotonic trend in this sparsely sampled state.
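As a rough illustration of what a weighted AUC measures here, the sketch below computes the weighted probability that an actual word outscores an iid word, counting ties as one half. This is the standard pairwise definition and is only assumed to match the report's computation; the function name `weighted_auc` and the toy scores and weights are hypothetical, not values from the report.

```python
def weighted_auc(actual, iid):
    """Return P(actual score > iid score) over weighted pairs; ties count 1/2.

    Both arguments are lists of (score, weight) pairs.
    """
    num = 0.0
    den = 0.0
    for s_a, w_a in actual:
        for s_i, w_i in iid:
            pair_weight = w_a * w_i
            den += pair_weight
            if s_a > s_i:
                num += pair_weight
            elif s_a == s_i:
                num += 0.5 * pair_weight
    return num / den


# Hypothetical toy scores and weights, not values from the report.
toy_actual = [(-0.02, 2.0), (-0.05, 1.0), (-0.08, 1.0)]
toy_iid = [(-0.04, 1.0), (-0.06, 2.0), (-0.09, 1.0)]
toy_auc = weighted_auc(toy_actual, toy_iid)
print(toy_auc)  # 0.75 for this toy input
```

A value of 0.5 would mean the score does not separate the two samples at all, which is the sense in which the focus-state value of \(0.719\) indicates separation.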

Against the full logistic baseline, the block score adds a small but real amount of separation, and crucially does not explain away the structural covariates:

Table 4.1 — Test 1 regression (weighted AUC and selected coefficients). Baseline = \(x_K\), parity, bridge cluster, path-shape \(z\).
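| model | weighted AUC | Δ vs base | B4 coef | bridge abs. coef | parity coef |
|---|---|---|---|---|---|
| x_K + parity + bridge + z | 0.5023 | 0.0000 | 0.0000 | 1.1312 | 0.3682 |
| + B4 | 0.5260 | 0.0236 | 0.2330 | 1.1151 | 0.4268 |
| + B3 + B4 | 0.5246 | 0.0223 | 0.1695 | 1.1132 | 0.4423 |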

Adding \(B_4\) raises the weighted AUC by \(0.0236\). The bridge coefficient falls by only \(0.0162\), and the parity coefficient rises by \(0.0586\) rather than falling. Test 1 is therefore classified as B, "B4 score helps but bridge/parity remain strong", with the explicit caveat that this is a split-sample, sampled diagnostic, not an exact aggregation proof.

Takeaway. The block score improves discrimination, but the bridge/parity effects remain.

4.2 Test 2 — block-length renormalization (L = 3…6) class B

The discriminative signal increases monotonically with block length, by both the marginal-score AUC and the \(+\)score logistic AUC, while the logistic base AUC remains at chance level (\(0.5025\)):

Table 4.2 — Test 2 AUC by block length \(L\).
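| L | marginal score AUC | logistic base AUC | logistic + score AUC | Δ |
|---|---|---|---|---|
| 3 | 0.5611 | 0.5025 | 0.5363 | 0.0337 |
| 4 | 0.5768 | 0.5025 | 0.5436 | 0.0411 |
| 5 | 0.5966 | 0.5025 | 0.5536 | 0.0511 |
| 6 | 0.6198 | 0.5025 | 0.5643 | 0.0618 |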

The best \(+\)score logistic AUC is \(0.5643\) at \(L=6\); the gain from \(L=3\) to \(L=6\) is \(0.0280\). But the residual gap does not close: at \(L=6\) the bridge absolute residual is \(1.1011\) and the parity residual is \(0.3749\). In the focus state at \(L=6\), the lowest score decile has \(S = 0.117\), the highest decile with a finite survival ratio has \(S = 38.091\), and the top decile has zero iid mass, so the strongest effect is also observed in the sparsest region.

Overall, Test 2 is therefore classified as B — "AUC grows but residual remains": longer blocks discriminate better but do not remove the bridge/parity structure.
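The monotonicity claim and the \(L=3\) to \(L=6\) gain can be rechecked directly from the rounded entries of Table 4.2. The snippet below is only such an arithmetic check; because it works from four-decimal values, a recomputed Δ may differ from the reported one in the last digit.

```python
# (L, marginal score AUC, logistic base AUC, logistic + score AUC, reported delta)
TABLE_4_2 = [
    (3, 0.5611, 0.5025, 0.5363, 0.0337),
    (4, 0.5768, 0.5025, 0.5436, 0.0411),
    (5, 0.5966, 0.5025, 0.5536, 0.0511),
    (6, 0.6198, 0.5025, 0.5643, 0.0618),
]


def strictly_increasing(values):
    """True when every entry is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


marginal = [row[1] for row in TABLE_4_2]
plus_score = [row[3] for row in TABLE_4_2]
recomputed_delta = [row[3] - row[2] for row in TABLE_4_2]
gain_3_to_6 = plus_score[-1] - plus_score[0]

print(strictly_increasing(marginal), strictly_increasing(plus_score))
print([round(d, 4) for d in recomputed_delta])
print(round(gain_3_to_6, 4))
```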

Takeaway. Longer blocks consistently improve discrimination, but leave substantial bridge/parity structure unexplained.

4.3 Test 3 — finite-block reweighting class C

Reweighting iid words by \(2^{\alpha S_L}\) is the first generative attempt. The best full-state fit is obtained with mildly damped short blocks — \(L=3\), \(\alpha=0.25\), RMSE \(0.000440978\), JS \(0.000921815\). Longer or stronger reweighting is worse, not better: at \(L=6\) the best choice is \(\alpha = 0.0\) (no reweighting at all), RMSE \(0.000642649\).
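A minimal sketch of this kind of reweighting, under stated assumptions: each iid word mass is multiplied by \(2^{\alpha S_L}\) and the result is rescaled to the original total mass, then compared with a target distribution by RMSE and Jensen-Shannon divergence. The rescaling step, the base-2 logarithm in the JS divergence, the cell set used for the RMSE, and all names and toy values (`reweight`, `block_score`, `target`) are assumptions for illustration, not the report's implementation.

```python
import math


def reweight(iid_mass, block_score, alpha):
    """Tilt each iid word mass by 2**(alpha * S_L), then restore the total mass."""
    tilted = {w: m * 2.0 ** (alpha * block_score[w]) for w, m in iid_mass.items()}
    scale = sum(iid_mass.values()) / sum(tilted.values())
    return {w: m * scale for w, m in tilted.items()}


def rmse(p, q):
    """Root-mean-square difference over the union of cells."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys) / len(keys))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; p and q must each sum to 1."""
    keys = set(p) | set(q)
    mid = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Cells with zero mass contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / mid[k]) for k in a if a[k] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy words, masses, and block scores (not data from the report).
iid_mass = {"w0": 0.5, "w1": 0.3, "w2": 0.2}
block_score = {"w0": -1.0, "w1": 0.0, "w2": 2.0}
target = {"w0": 0.45, "w1": 0.3, "w2": 0.25}

for alpha in (0.0, 0.25, 0.5, 1.0):
    model = reweight(iid_mass, block_score, alpha)
    print(alpha, round(rmse(model, target), 6), round(js_divergence(model, target), 6))
```

With \(\alpha = 0\) the weights are all \(1\) and the iid distribution is returned unchanged, which is why \(\alpha = 0.0\) corresponds to no reweighting at all.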

The focus state shows the overcorrection directly. Its best reweighting (\(L=6\), \(\alpha=0.5\)) still under-predicts the mass — actual \(0.00092052\) vs predicted \(0.00029821\) — and increasing \(\alpha\) to \(1\) drives the modelled focus-state survival down to \(0.00416667\), against an actual survival of \(0.472461\). At the global best fit the bridge RMSE is \(0.000427051\) and the parity RMSE \(0.00262763\).

Taken together, these results correspond to C — "reweighting overcorrects or is not generative": the small improvement is entirely from damped short-block reweighting; longer raw reweighting is over-sharp — useful diagnostically, weak as a generative model.

Takeaway. Short, mildly damped reweighting can improve the aggregate fit, but stronger block reweighting overcorrects rather than generating the actual distribution.

4.4 Test 4 — maximum-entropy block projection class C

The second generative attempt replaces heuristic reweighting with an approximate regularized IPF that matches block marginals. It does not outperform the simpler baseline: the best maximum-entropy fit (\(L=3\), regularization \(0.75\)) has RMSE \(0.000493147\) and JS \(0.000868073\), worse than the best raw/damped reweighting RMSE of \(0.000440978\).
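For orientation, the following is a small sketch of iterative proportional fitting (IPF) on a two-way table, with one possible form of regularization. The report does not spell out its regularization scheme, so the version here is an assumption: the target marginals are blended with the baseline's own marginals, so that `reg = 0` is plain IPF and `reg = 1` returns the baseline unchanged. The function name `ipf_2d`, the two-way layout standing in for two block marginals, and the toy numbers are all hypothetical.

```python
def ipf_2d(base, row_target, col_target, reg=0.0, sweeps=200):
    """Fit a 2-D table to row/column marginals by alternating rescaling.

    reg in [0, 1] blends each target marginal with the baseline marginal
    (an assumed regularization; reg=1 leaves the baseline untouched).
    """
    n_rows, n_cols = len(base), len(base[0])
    base_rows = [sum(row) for row in base]
    base_cols = [sum(base[i][j] for i in range(n_rows)) for j in range(n_cols)]
    rows_t = [(1.0 - reg) * t + reg * b for t, b in zip(row_target, base_rows)]
    cols_t = [(1.0 - reg) * t + reg * b for t, b in zip(col_target, base_cols)]

    table = [row[:] for row in base]
    for _ in range(sweeps):
        # Rescale each row to its (blended) target total.
        for i in range(n_rows):
            s = sum(table[i])
            if s > 0.0:
                table[i] = [v * rows_t[i] / s for v in table[i]]
        # Rescale each column to its (blended) target total.
        for j in range(n_cols):
            s = sum(table[i][j] for i in range(n_rows))
            if s > 0.0:
                for i in range(n_rows):
                    table[i][j] *= cols_t[j] / s
    return table


# Hypothetical uniform baseline and target marginals.
base = [[0.25, 0.25], [0.25, 0.25]]
fitted = ipf_2d(base, row_target=[0.6, 0.4], col_target=[0.7, 0.3])
unchanged = ipf_2d(base, row_target=[0.6, 0.4], col_target=[0.7, 0.3], reg=1.0)
print(fitted)
print(unchanged)
```

Under this assumed scheme, heavier regularization moves the fitted table back toward the baseline by construction, which is the kind of behaviour to keep in mind when a residual shrinks only as regularization grows.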

[Figure 1 image: bar chart comparing best raw/damped RMSE 0.000440978 with best maxent RMSE 0.000493147.]
Figure 1. Best-fit state-mass RMSE: the maximum-entropy projection (right) does not improve on the best raw/damped reweighting (left). Lower is better.

Regularization does reduce the parity residual for the longer blocks — at regularization \(0\) the \(L=5\) and \(L=6\) parity residuals begin at high values and decrease toward the \(L=3/L=4\) level as regularization increases — but this reflects regularization pulling the projection toward the damped baseline, not a new structure being captured.

[Figure 2 image: line chart of parity residual RMSE versus regularization for L=3,4,5,6; L=5 and L=6 start high and decrease with regularization.]
Figure 2. Parity residual RMSE vs. regularization, by block length. Heavier regularization pulls the \(L=5,6\) projections back toward the low-residual region already occupied by \(L=3,4\).

On the focus state the projection's best survival is \(0.444001\) (\(L=5\), regularization \(0.9\)) against an actual survival of \(0.472461\) — close in this one state, but obtained at heavy regularization and not accompanied by a better global fit.

[Figure 3 image: line chart of focus-state predicted survival versus regularization for L=3,4,5,6.]
Figure 3. Focus-state predicted survival vs. regularization, by block length. The single-cell fit can be brought close to the actual value, but only under regularization that is not justified by the global RMSE.

Accordingly, Test 4 also falls into C — "maxent no better than raw/damped", with the explicit reminder that this is an approximate projection, not a full, exact IPF solution over all words.

Takeaway. The projection can fit the focus state under heavy regularization, but it does not improve the global state-mass fit.

4.5 Summary of the four verdicts

Table 4.3 — Headline numbers and self-classification per test.
| Test | Type | Headline | Self-class |
|---|---|---|---|
| 1 · anomaly B3/B4 | diagnostic | +0.0236 AUC; focus AUC 0.719 | B |
| 2 · length renorm | diagnostic | +score AUC 0.5363→0.5643 | B |
| 3 · reweighting | generative | best RMSE 0.000440978 (L=3, α=0.25) | C |
| 4 · maxent projection | generative | best RMSE 0.000493147 (worse) | C |

Diagnostic tests are classified as B, whereas the generative tests are classified as C. The next section reads these four verdicts together and explains the meaning of the letter grades.

4.6 Auxiliary analysis: the actual−iid Δ-map

As a descriptive statistic, this section reports the actual–iid mass difference

\[ \Delta(\cdot) \;=\; \mu_{\text{actual}}(\cdot) \;-\; \mu_{\text{iid}}(\cdot) \]

projected onto several coordinates. This is not a new generative model; it is a diagnostic quantity that shows where the discrepancy detected by Tests 1–4 is localized. We project Δ onto state, prefix cylinder, transition, and boundary / remaining_K coordinates.
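One way to write the projection explicitly, as a sketch: let \(\pi\) be a hypothetical coordinate map sending each word \(w\) to a cell \(c\) (a state, a prefix cylinder, a transition, or a remaining_K band). The ratio and L1-share expressions are assumed definitions, chosen to match the column names of Table 4.4 rather than taken from the report.

\[ \Delta_\pi(c) \;=\; \sum_{w:\,\pi(w)=c} \bigl( \mu_{\text{actual}}(w) - \mu_{\text{iid}}(w) \bigr), \qquad \text{ratio}_\pi(c) \;=\; \frac{\mu_{\text{actual}}(\pi^{-1}(c))}{\mu_{\text{iid}}(\pi^{-1}(c))}, \qquad \text{L1 share}_\pi(c) \;=\; \frac{|\Delta_\pi(c)|}{\sum_{c'} |\Delta_\pi(c')|} \]

Summing \(\Delta_\pi(c)\) over all cells gives the same total mass difference for every choice of \(\pi\), so the coordinates differ only in how concentrated \(|\Delta_\pi|\) is across cells.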

In state coordinates, the combination bridge_cluster + x_K_window + parity provides a sharp localization of Δ. In prefix cylinders the difference is visible from the early prefix on, but lengthening the window does not concentrate it in a single prefix. Viewed in terms of transition structure and prefix growth, Δ likewise does not collapse onto any single edge or branch.

In boundary coordinates, remaining_K provides the sharpest localization of Δ, and the largest \(|\Delta|\) appears at remaining_K = 32–63. The deficit of actual mass relative to iid extends into remaining_K = 64–95 and 96–127, but the largest absolute mass difference remains at 32–63; the 96–127 band has the lowest ratio, but its mass and L1 share are small.

Table 4.4 — Mass difference along the remaining_K boundary distance (ALL long words with a defined final state). \(\Delta = \) actual − iid.
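| remaining_K | actual | iid | \(\Delta\) | ratio | L1 share |
|---|---|---|---|---|---|
| 32–63 | 1.959532 | 2.139743 | −0.180211 | 0.916 | 38.57% |
| 64–95 | 0.266435 | 0.341662 | −0.075227 | 0.780 | 16.10% |
| 96–127 | 0.018644 | 0.031704 | −0.013059 | 0.588 | 2.80% |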
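The columns of Table 4.4 can be cross-checked from the rounded entries. The sketch below recomputes \(\Delta\) and the ratio as actual/iid, and backs out the total L1 mass implied by each row under the assumption that L1 share means \(|\Delta|\) divided by the total \(|\Delta|\) over all bands. That reading of "L1 share" is an inference from the column name, not something the report states.

```python
# (band, actual, iid, reported delta, reported ratio, reported L1 share)
TABLE_4_4 = [
    ("32-63", 1.959532, 2.139743, -0.180211, 0.916, 0.3857),
    ("64-95", 0.266435, 0.341662, -0.075227, 0.780, 0.1610),
    ("96-127", 0.018644, 0.031704, -0.013059, 0.588, 0.0280),
]

checks = []
for band, actual, iid, delta_rep, ratio_rep, share_rep in TABLE_4_4:
    delta = actual - iid
    ratio = actual / iid
    # Assumes share = |delta| / (total L1 over all bands).
    implied_total_l1 = abs(delta_rep) / share_rep
    checks.append((band, delta, ratio, implied_total_l1))
    print(band, round(delta, 6), round(ratio, 3), round(implied_total_l1, 4))
```

If the assumed definition is right, the three implied totals should agree up to rounding of the share column, since all rows share the same denominator.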

In the leading bands (32–63, 64–95, 96–127) the mass delta is negative, while the conditional downstream transition delta can be positive. For instance, 64–95 → 32–63 has mass delta \(-0.006059\) but conditional delta \(+0.007861\), and 32–63 → 16–31 has mass delta \(-0.009262\) but conditional delta \(+0.004618\). This separates two statements: actual carries less mass than iid in the band, yet conditional on being in the band, the share that moves downstream is not necessarily smaller. The pattern is therefore consistent with actual and iid differing in how mass is placed along the remaining_K chain, rather than with a simple local-transition malfunction.
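A toy calculation shows how the two deltas can have opposite signs. It assumes the mass delta of a transition is the difference of joint masses, and the conditional delta is the difference of the transition shares given the source band. These definitions, the function name `transition_deltas`, and every number below are hypothetical illustrations, not values from the report.

```python
def transition_deltas(actual_band, iid_band, actual_edge, iid_edge):
    """Return (mass delta, conditional delta) for one band-to-band transition.

    *_band: total mass sitting in the source band.
    *_edge: joint mass of being in the source band and moving downstream.
    """
    mass_delta = actual_edge - iid_edge
    cond_delta = actual_edge / actual_band - iid_edge / iid_band
    return mass_delta, cond_delta


# Hypothetical masses: actual is thinner than iid in the band and on the edge,
# but a larger fraction of the actual band mass moves downstream.
mass_delta, cond_delta = transition_deltas(
    actual_band=0.27, iid_band=0.34, actual_edge=0.090, iid_edge=0.100
)
print(round(mass_delta, 4), round(cond_delta, 4))
```

The sign flip arises whenever the source band loses proportionally more mass than the edge does, so a negative mass delta alone says nothing about the local transition rule.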

Table 4.5 — How Δ appears by coordinate (auxiliary analysis).
| coordinate | sharpness | main finding |
|---|---|---|
| block score | | diagnostic signal present, generation not reproduced (§4.1–4.4) |
| state | medium–high | localized by bridge_cluster + x_K_window + parity |
| prefix | low | visible early but not concentrated in a single prefix |
| transition | low | not collapsed onto a single edge / branch |
| boundary remaining_K | high | largest \(|\Delta|\) at 32–63 (the band with the largest observed discrepancy) |

Scope of this auxiliary analysis. The Δ-map is a descriptive statistic. It does not identify a new Collatz mechanism, and it does not establish that remaining_K is causal. remaining_K = 32–63 should be read as the band where the observed difference between finite-integer escape words and the iid reference is largest, not as a generating source of the discrepancy.

Takeaway. The residual is not concentrated in a single prefix or transition cylinder; it localizes more sharply in state coordinates and in the remaining_K boundary distance, with the band showing the largest observed discrepancy at remaining_K = 32–63.