paper/readiness.md



# Path to a citable, build-on-able preprint — status

Framing (locked 2026-06-19, per user correction): the axis is **expansive vs more-expansive**
(graded; for TRM both classes have λ₁>0), NOT settled-vs-chaotic. The phenomenon is the
**cleanness of a graded separation**; the **mechanism is explicitly OPEN** and is the natural seed
for follow-on projects. This preprint = rigorous phenomenology + precise characterization +
honest open-mechanism. Do NOT force a fixed-point / suppression-of-chaos framing (rejected).

## Tier 0 — measurement bulletproofing (others build on it)
- [x] **T0.1 estimator validation** — `paper/validation/`: QR/Benettin core recovers known spectra
  to <1e-3 (diagonal, symmetric, non-normal asymptotic) and Hénon λ₁ to 8e-5. PASS. Confirms the
  numerical core (orthonormalization cadence, log|diagR| bookkeeping, ordering, averaging).
- [ ] **T0.2 robustness reruns (GPU)** — λ stability vs t_ons, tangent-basis seed, k>8. Window
  dependence already covered offline (Char 3). Small queue; spec below.
- [ ] **T0.3 language/scope pass** — finite-time vs asymptotic, "expansive not chaotic" for HRM
  (negative λ), metric/coordinate-dependence caveat (Lohmiller–Slotine). Prose task.
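
The kind of check T0.1 describes can be sketched on the Hénon map. This is a minimal illustration, not the code in `paper/validation/`: the function name, step counts, and the per-step orthonormalization cadence are assumptions chosen for brevity.

```python
import numpy as np

def henon_lyapunov(a=1.4, b=0.3, n_steps=20000, n_transient=1000):
    """Estimate both Lyapunov exponents of the Henon map with a QR (Benettin) scheme."""
    x, y = 0.0, 0.0
    for _ in range(n_transient):               # discard the transient
        x, y = 1.0 - a * x * x + y, b * x
    Q = np.eye(2)                              # orthonormal tangent basis
    log_sums = np.zeros(2)
    for _ in range(n_steps):
        J = np.array([[-2.0 * a * x, 1.0],     # Jacobian at the current point
                      [b, 0.0]])
        Q, R = np.linalg.qr(J @ Q)             # re-orthonormalize every step (assumed cadence)
        log_sums += np.log(np.abs(np.diag(R))) # log|diag R| bookkeeping
        x, y = 1.0 - a * x * x + y, b * x      # advance the orbit
    return log_sums / n_steps                  # time-averaged exponents, largest first

lam = henon_lyapunov()
print(lam, lam.sum(), np.log(0.3))
```

Because |det J| = b at every point, the two exponents must sum to ln b, which gives an estimator-independent sanity check alongside the comparison of λ₁ with the literature value (about 0.42 for a=1.4, b=0.3).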

## Characterization (replaces the rejected Tier 2; describes WHAT, not WHY)
- [x] **Char 1 whole-spectrum** — separation is a ~rigid shift of the ENTIRE k=8 spectrum, not a
  single mode (per-exponent AUC uniformly 0.98–0.99; HRM gap ≈constant −0.16/exponent). Spectral
  MEAN separates ≥ λ₁ alone (AUC 0.991–0.995). CAVEAT: KS-proxy Σλ⁺ is the wrong aggregate for HRM
  (all-negative spectra → 0); use spectral mean for HRM.
- [x] **Char 2 shape** — two overlapping UNIMODAL classes with well-separated means, NOT two
  discrete clusters (within-class BC 0.26–0.40). Outcome is a moderately sharp threshold on the
  λ₁ continuum (25→75% transition spans 12–30% of the λ₁ spread).
- [x] **Char 3 integration-time scaling (the key descriptor)** — separation BUILDS monotonically
  with window H: Cohen's d 1.06→4.84 (TRM, H=2→16), 0.03→3.45 (HRM). Near-zero at H=2, near-perfect
  at the full 16-segment budget. The cleanness is an integration-time phenomenon. COHERENCE with
  E5: this accumulation tracks the unfolding of outcomes (more trajectories revealed by larger H),
  NOT anticipation — among undecided@H examples λ₁ still doesn't predict (E5). State both together.
- [x] **Char 4 effect size** — "clean" quantified: Cohen's d 3.4–4.8, distributional overlap
  <10% (TRM hist-overlap 0.049). Beyond AUC.
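
The three separation statistics quoted in Char 1–4 can be computed along these lines. The definitions are assumptions, since the notes do not spell them out: pooled-SD Cohen's d, AUC as the pairwise win probability, and histogram overlap as the sum of bin-wise minima over shared bins (the bin count is arbitrary here).

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference (b minus a) with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

def auc(neg, pos):
    """P(pos > neg) + 0.5 * P(tie), by direct pairwise comparison (O(n*m) memory)."""
    neg, pos = np.asarray(neg, float), np.asarray(pos, float)
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def hist_overlap(a, b, bins=30):
    """Shared probability mass of two samples on a common set of bins."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    return np.minimum(pa / pa.sum(), pb / pb.sum()).sum()

# Toy, fully separated samples (not project data)
a, b = [0.0, 1.0, 2.0], [3.0, 4.0, 5.0]
print(cohens_d(a, b), auc(a, b), hist_overlap(a, b))
```

Repeating these per window H (λ₁ computed over the first H segments) would give the integration-time curve of Char 3.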

## Tier 1 — causal content (the level-up from correlation)
- [ ] **T1 inference-side causal probe** — nudge a failing trajectory toward lower expansion (or
  toward the success-mean manifold) mid-rollout and measure outcome recovery; conversely inject
  expansion into a settling-correct trajectory. Tests settling⟹correct as causal, not correlational.
  Spec next. GPU.

## Open-mechanism (NOT this paper; the hook for follow-ons)
Why a graded (both-expansive) difference separates so cleanly. Char 1–4 bound the description;
the why is deferred. Candidate angles are the user's to pursue, not asserted here.

## Maze cross-task result + checkpoint evolution (2026-06-20)

**Deflationary finding stands and is now grounded:** the FTLE/CLV separation reduces to
convergence+confidence (λ₁, the full k=8 spectrum, AND leading-CLV geometry all reduce; partial
correlation →0 once drift and q_halt are controlled). The dynamical signal is a (redundant) convergence readout.
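
A residualization-based partial correlation illustrates the reduction test. This is a sketch under assumptions: linear control for the covariates, and synthetic data in which a stand-in for λ₁ is by construction only a noisy readout of drift. The variable names mirror the notes but the arrays are made up.

```python
import numpy as np

def partial_corr(x, y, covariates):
    """Correlation of x and y after linearly regressing the covariates out of both."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    Z = np.column_stack([np.ones(len(x)), covariates])   # intercept + controls
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]     # residual of x
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]     # residual of y
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Synthetic example (not project data): lam1 carries no information beyond drift
rng = np.random.default_rng(0)
n = 5000
drift = rng.normal(size=n)
q_halt = rng.normal(size=n)
outcome = drift + q_halt + 0.5 * rng.normal(size=n)
lam1 = drift + 0.5 * rng.normal(size=n)

raw = float(np.corrcoef(lam1, outcome)[0, 1])
controlled = partial_corr(lam1, outcome, np.column_stack([drift, q_halt]))
print(raw, controlled)   # raw is clearly nonzero, controlled is near zero
```

With a binary outcome label, a logistic model with and without the dynamical feature would be the more natural control; the linear version above is only the simplest instance.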

**Maze (TRM att, friend's run, all 10 ckpts, k=1):** separation WEAK (λ₁ Cohen's d 0.2–0.5 vs
Sudoku 3–5). Failures SETTLE (B/fail 0.81–0.98, D/fail 0.02–0.19) at ALL ckpts and are NEAR-MISSES
(token_acc ~0.97). This is the opposite of Sudoku, where failures wander and end far from correct (token_acc ~0.63).

**Checkpoint evolution (the key new result, offline):** wandering is a LATE-TRAINING property.
Sudoku HRM failures SETTLE early (B/fail ~0.9 at acc 2–15%) then flip to WANDER late
(D/fail ~1.0 at acc 50%), transition ~step 13–18k. So "failures wander" is learned, not intrinsic.
BUT matched-accuracy contrast cuts the other way: at acc≈0.76, Sudoku-TRM D/fail=1.00 vs
Maze-TRM D/fail=0.19 — same skill, opposite dynamics → TASK STRUCTURE also matters, not just maturity.
And early-Sudoku settling (token_acc 0.63, confidently wrong) ≠ Maze settling (token_acc 0.97, near-miss):
the two are not the same phenomenon. Fig: analysis_2x2/checkpoint_evolution_wander.png.

**Task structure (offline):** Maze solution path (median 113 cells) passes through ~76 branch
points (67% of path cells at deg≥3 junctions; 48% of open cells are junctions) → abundant
locally-coherent alternative paths = many STABLE WRONG ANSWERS available. Sudoku: unique
globally-coupled solution, a wrong cell violates constraints globally → no local near-miss
equilibrium. This structurally explains settle-to-near-miss (Maze) vs wander (Sudoku).
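
The junction statistics above could be computed as follows. This is a sketch with an assumed encoding: the maze is a 2D array with 1 for open cells and 0 for walls, the solution path is a list of (row, col) cells, and a junction is an open cell with at least three open 4-neighbours.

```python
import numpy as np

def open_degree(grid):
    """Number of open 4-neighbours of each cell (0 for wall cells)."""
    g = np.pad(np.asarray(grid, dtype=int), 1)            # wall border avoids edge cases
    deg = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
    return deg * g[1:-1, 1:-1]

def junction_stats(grid, path):
    """Fraction of path cells, and of all open cells, that are deg>=3 junctions."""
    is_junction = open_degree(grid) >= 3
    rows, cols = zip(*path)
    path_frac = is_junction[list(rows), list(cols)].mean()
    open_frac = is_junction.sum() / np.asarray(grid, dtype=bool).sum()
    return float(path_frac), float(open_frac)

# Toy T-shaped maze (not project data)
grid = [[1, 1, 1],
        [0, 1, 0],
        [0, 1, 0]]
path = [(0, 0), (0, 1), (1, 1), (2, 1)]
print(junction_stats(grid, path))   # (0.25, 0.2)
```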

**Unresolved confound (queued):** TRM-Maze never develops wandering, but task structure cannot yet be
separated from TRM-Maze SATURATION (Maze too easy for TRM). Queued before HRM-Maze:
(1) continue-train TRM-Maze from step_130200 (does acc climb toward ~1.0 = saturation, or plateau?);
(2) per-cell failure structure (are failure errors a connected detour = coherent stable wrong path,
or scattered?). Then HRM-Maze (harder model-task fit, more likely to be stressed into wandering).

## Solution-space test (2026-06-20) — refutes the measurement-artifact concern, strengthens task-structure
User asked: is weak Maze separation an artifact of analyzing the FULL latent (88% trivial copy)
instead of the SOLUTION space? Tested directly: per-step decoded-ANSWER Hamming drift over
solution cells (label!=input), Maze vs Sudoku control.
- MAZE: failures SETTLE in solution space too (late answer-drift median 0.00, 98.4% settled;
  AUC 0.30). Same conclusion as full-latent. NOT an artifact.
- SUDOKU control: failures DON'T settle in solution space (late drift median 8.5/step, 0% settled;
  AUC 0.99). Same as full-latent. Both spaces agree.
- Per-cell failure STRUCTURE (direct task-structure evidence): MAZE failures = CONNECTED DETOUR
  (97% have ≤2 error components, median 22 cells one blob) = a coherent stable wrong PATH.
  SUDOKU failures = SCATTERED (100% have ≥5 components, median 13) = no coherent wrong answer.
  Fig: analysis_2x2/maze_failure_detour.png. This is the mechanism-grounding for why Maze settles
  (stable wrong answers exist as detours) and Sudoku wanders (no stable wrong answer).
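
The per-cell structure measure (number and size of error components) can be sketched as a connected-component count on the error mask. The 4-connectivity choice and the grid encoding are assumptions; the notes do not state which neighbourhood was used.

```python
from collections import deque
import numpy as np

def error_components(pred, label):
    """Sizes (largest first) of the 4-connected components of the mask pred != label."""
    err = np.asarray(pred) != np.asarray(label)
    seen = np.zeros_like(err, dtype=bool)
    n_rows, n_cols = err.shape
    sizes = []
    for r0, c0 in zip(*np.nonzero(err)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, size = deque([(r0, c0)]), 0
        while queue:                                   # breadth-first flood fill
            r, c = queue.popleft()
            size += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols and err[rr, cc] and not seen[rr, cc]:
                    seen[rr, cc] = True
                    queue.append((rr, cc))
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Toy grids (not project data): one blob versus scattered errors
label = np.zeros((4, 4), dtype=int)
blob = label.copy()
blob[0, 0] = blob[0, 1] = blob[1, 1] = 1
scattered = label.copy()
for cell in [(0, 0), (0, 2), (2, 0), (2, 2), (3, 3)]:
    scattered[cell] = 1
print(error_components(blob, label), error_components(scattered, label))
```

A failure with one or two large components reads as a coherent detour; many size-1 components read as scattered errors.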

## CORRECTION (2026-06-20) — Maze exact-match labeling was the artifact; failure=more-chaotic HOLDS
The earlier "Maze dissociates / completeness≠correctness" reading was largely a LABELING ARTIFACT,
not a real dynamical dissociation. Maze exact-match marks VALID alternative solutions (incl.
equal-length valid shortest paths) as "failures"; 100% of exact-match "failures" are valid connected
paths (complete answers) → they settle, trivially. That is a benchmark-design flaw, not a result.
**Under the correct criterion (CONNECTIVITY = is it a valid complete path = is it actually solved):**
genuine failures (broken/disconnected) ARE more chaotic — AUC(-late_drift→connected) = 0.864 @step_13020
(15 broken), 0.895 pooled (18 broken); bootstrap 95% CI [0.80, 0.96], excludes 0.5. So
"failure = more chaotic" is TASK-GENERAL (Sudoku + Maze) once failure is defined by validity.
LIMITATION (now RESOLVED): trained Maze SATURATES before the first saved ckpt (step_13020 already
97% complete) → only n=18 broken from existing ckpts. FIX DONE: fresh early-save TRM-Maze run
(maze_earlysave_freshTRM, saved every 250 epochs) captured the broken-rich pre-saturation phase;
cheap forward dumps (drift_zH + connectivity, no JVP) on 8 early ckpts give **n=4096, 1835 broken**.
**Pooled: AUC(-latent drift_zH → connected/complete) = 0.834, bootstrap 95% CI [0.822, 0.846]**
(broken late-drift median 1.06 vs connected 0.56). Per-ckpt AUC rises with training 0.66→0.88
(mirrors Sudoku's separation-grows-with-training). So "genuine failure (incomplete) = more chaotic"
is now LARGE-N BULLETPROOF on Maze under validity labeling. Fig: maze_broken_morechaotic.png.
Honest detail: "more chaotic" is a LATENT-dynamics property (drift_zH AUC 0.834, λ₁ AUC 0.86);
the DECODED-answer drift does NOT separate (ans_drift AUC 0.38) — broken paths commit an incomplete
decoded answer while churning internally. Consistent with the FTLE/drift (latent) story.
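
A percentile bootstrap for the AUC confidence interval might look like the following. It is a sketch, not the analysis script: the resampling scheme (examples resampled with replacement, resamples missing a class redrawn), the number of resamples, and the synthetic drift values are all assumptions.

```python
import numpy as np

def auc(scores, labels):
    """P(score of a positive > score of a negative), ties counted as one half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def bootstrap_auc_ci(scores, labels, n_boot=500, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap (1 - alpha) interval."""
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    n, stats = len(scores), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)           # resample examples with replacement
        if labels[idx].all() or not labels[idx].any():
            continue                               # redraw if one class is missing
        stats.append(auc(scores[idx], labels[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(scores, labels), float(lo), float(hi)

# Synthetic drifts (not project data): broken examples drift more than connected ones
rng = np.random.default_rng(1)
drift = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 200)])
connected = np.concatenate([np.ones(200, bool), np.zeros(200, bool)])
point, lo, hi = bootstrap_auc_ci(-drift, connected)   # score = -drift, as in AUC(-drift → connected)
print(point, lo, hi)
```

If examples from the same checkpoint or puzzle are correlated, resampling at that group level instead of per example would give a more conservative interval.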

## Synthesis for the paper (current honest thesis, corrected)
Genuine failures (incomplete/invalid answers) are MORE CHAOTIC — measurable, task-general (Sudoku;
Maze under validity labeling). Mechanism: the dynamical signal detects answer completeness/convergence
(FTLE reducible to drift+q_halt). On unique-solution tasks completeness=correctness, so it predicts
correctness directly. On multi-solution tasks exact-match mislabels valid alternatives as failures;
use validity labeling. The phenomenon stands; the convergence-detection mechanism is the honest
interpretation, not a refutation.
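
The difference between exact-match and validity labeling can be made concrete with a connectivity check. This is a hypothetical reconstruction of such a criterion, not the project's labeling code: it assumes boolean masks for open cells and for the predicted path, and calls an answer valid when the goal is reachable from the start through predicted cells that are open.

```python
from collections import deque
import numpy as np

def is_connected(open_mask, pred_mask, start, goal):
    """True if goal is reachable from start via 4-steps over predicted, open cells."""
    usable = np.asarray(open_mask, bool) & np.asarray(pred_mask, bool)
    if not (usable[start] and usable[goal]):
        return False
    n_rows, n_cols = usable.shape
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols and usable[nxt] and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def mask_from_cells(cells, shape):
    """Helper for the toy example: boolean mask with the given cells set."""
    m = np.zeros(shape, dtype=bool)
    for cell in cells:
        m[cell] = True
    return m

# Toy 3x3 open maze (not project data) with two equal-length solutions
open_mask = np.ones((3, 3), dtype=bool)
label = mask_from_cells([(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)], (3, 3))
alt = mask_from_cells([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)], (3, 3))
broken = mask_from_cells([(0, 0), (1, 0), (2, 1), (2, 2)], (3, 3))

exact_match_alt = bool(np.array_equal(alt, label))             # False: counted as a failure
valid_alt = is_connected(open_mask, alt, (0, 0), (2, 2))       # True: a valid alternative
valid_broken = is_connected(open_mask, broken, (0, 0), (2, 2)) # False: a genuine failure
print(exact_match_alt, valid_alt, valid_broken)
```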

## Status
- Done: offline T0.1, Char 1–4, Maze evolution, task structure.
- Running: TRM CLV (done), HRM CLV (queued on card1), maze-followup queue (continue-train + per-cell, waiting for GPU).
- Remaining: T0.2/T0.3, T1, HRM-Maze (after saturation test).