Recursive reasoning dynamics: analysis pipeline, paper drafts, toy models

Failure=more-chaotic (task-general under validity labeling) reduces to convergence/completeness detection; mechanism (transient chaos vs multistability vs input-induced) under investigation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
author: YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
commit: a6ec4288a2232988b130b2f00bb2565f81706966 (patch)
tree: 1bb86e7f0b899b823b9e7fdf383e832d30a181e0 /paper
11 files changed, 733 insertions, 0 deletions
diff --git a/paper/claims.md b/paper/claims.md
new file mode 100644
index 0000000..590b388
--- /dev/null
+++ b/paper/claims.md
@@ -0,0 +1,37 @@
+# Claim table (write structure FROM this, not from session chronology)
+
+Rule: every claim states its evidence, its strongest counter-reading, and where that
+counter-reading is killed or conceded. A claim that can't fill all four columns gets cut
+or demoted to an observation in the discussion.
+
+| # | Claim | Evidence | Strongest counter-reading | Disposition |
+|---|---|---|---|---|
+| C1 | Per-example full-trajectory FTLE separates success/failure near-perfectly in trained HRM and TRM on Sudoku-Extreme | AUC(−λ₁→correct): HRM 0.984 (n=8192) / 0.987 (n=2048, 2nd estimator); TRM official 87.6% ckpt 0.993 (n=2048). Medians: HRM joint −0.152 vs +0.032; TRM +0.012 vs +0.103 | Estimator artifact | Replicates across two estimator implementations and two architectures; report both scales, never cross-compare |
+| C2 | Failures overwhelmingly never settle; settled-wrong is rare (HRM) to absent (TRM) | TRM: 0/254 failures settled, threshold-free (min wrong-drift > late-drift of 96.5% of correct); HRM: 21/3894 (0.55%) strict-band; replicated 5/971 on 2nd estimator | Threshold choice ("settled" is arbitrary) | State threshold-free version (distribution separation); show full percentile sweep; define settled band by its narrow characteristic velocity |
+| C3 | The rare settled-wrong failures are selector-blind: success-like contraction AND success-like halting confidence | n=21: λ₁ med −0.842 (A: −0.867), q_halt(final) +7.47 (= A), all would halt early (halted_at 4–9), token_acc med 0.62 | Small n | Concede explicitly; the point is existence + boundedness (~0.5%), not precision |
+| C4 | The FTLE-outcome signal is not reducible to non-settling, and not a difficulty artifact | Drift-decile-matched AUC within unsettled stratum: 0.879 (n=8192) / 0.900 (2nd estimator); #givens-bin AUC 0.982 vs 0.984 overall | (a) residual within-decile drift variation; (b) #givens is a weak difficulty proxy | (a) deciles are narrow (table shows ranges); (b) concede openly, name solver-backtracks as the proper proxy, future work |
+| C5 | The chaotic signature is outcome-concurrent, not antecedent: nothing in the first 4 ACT steps forecasts eventual success among not-yet-solved examples — and on HRM the dynamical signals point the other way | Restricted to not-correct@4: TRM AUC λ₁ 0.543 / drift 0.492 / q_halt 0.521 (n=626); HRM λ₁ 0.448 / drift 0.312 (sign-reversed; +drift→success AUC 0.688) / q_halt 0.734 (n=1342) | Window length (only 4 steps tested) | Concede; horizon sweep = explicit future work; 4 chosen to match training window |
+| C6a | Correction: TRM failures do not "plateau at stable high-loss attractors" in state space | C2 + λ₁(D)=+0.103 + residual velocity ≈56/step ≈0.77× early; their own Fig 5 oscillation consistent | We strawman "attractor" (bounded attracting set can be chaotic) | Quote their exact wording ("stable", "local minima", "stabilize rather than explore"); credit their loss/boundedness data and intervention; correct only the settledness reading |
+| C6b | Refinement: Ren & Liu's four modes confirmed and quantified; wrong-fixed-point mode is real but marginal at trajectory end | Their mode (4) = our strict B (0.5%); mode (3) = our D (~99.5% of failures) | We measure end-of-window only; mid-trajectory lingering invisible | Concede explicitly; their non-trivial-success lingering claim untouched |
+| C7a | Training widens the success/failure gap from the failure side: failures become more expansive while the success regime barely moves | λ₁(D): +0.036→+0.102 over the TRM series, λ₁(A) within ±0.03 of 0 throughout | Estimator-scale / single-run | Holds; HRM series shows the mass-migration version. Robust claim. |
+| C7b | **DEMOTED by matched-objective control (E6).** Under MATCHED objective (step9 fixed-unroll, E-vs-F / G-vs-H), perturbation training's effect on the wandering cell is small and inconsistent at matched step | HRM fD 0.387→0.369 / 0.385→0.379 / 0.381→**0.387**(↑ at best) / 0.410→0.387; TRM 0.344→0.305 / 0.396→0.361 / 0.334→**0.361**(↑ at best) / 0.312→0.270 | The earlier large shrink (D 274→175) was partly a fixed-unroll-vs-ACT-streaming artifact, NOT a pure intervention effect | **Rewrite §3.4**: report the matched effect as small/equivocal; drop the strong "intervention shrinks wandering" reading. The May-28 mismatched comparison is retired. C7a stands; C7b does not support an intervention claim. |
+
+## The spine (one sentence)
+Direct per-example measurement of settling and tangent expansion decomposes recursive-reasoner
+failure: failures are overwhelmingly trajectories that never settle, the chaotic signature is
+concurrent with — not antecedent to — the outcome, and it is not explained by non-convergence
+alone or by problem difficulty.
+
+## What this paper is NOT claiming (write these into Discussion to pre-empt)
+- No mechanism for WHY trajectories fail to find the settled band (explicitly open).
+- No claim that early intervention is impossible in principle — only that λ/drift at 4 steps carry no signal.
+- No claim about tasks beyond Sudoku-Extreme or models beyond HRM/TRM.
+
+## Anti-patterns checklist (apply at style pass)
+- [ ] No chronological narration of the investigation; structure = claims order.
+- [ ] Numbers in topic sentences; every section's first sentence is a finding, not a plan.
+- [ ] No "notably/crucially/interestingly/delve/underscore"; hedges only where the claim table says concede.
+- [ ] Related work = positioning (each paragraph ends with the gap), not annotated bibliography.
+- [ ] Negative result (C5) framed as a finding with its own section, not a limitation apology.
+- [ ] Limitations: specific, short, no re-hedging of already-scoped claims.
+- [ ] Prose paragraphs in the body; tables only for numbers; no bullet lists in Results.
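A minimal sketch of how two statistics in the claim table above could be computed: the rank-based AUC(−λ₁→correct) from C1 and the threshold-free separation statement from C2. This is not the pipeline's code. The names `rank_auc` and `threshold_free_separation` are hypothetical, and the demo arrays are synthetic draws whose spreads are assumed for illustration; only the class sizes (1794 correct, 254 failed) and the medians (+0.012, +0.103) echo the TRM row.

```python
import numpy as np


def rank_auc(scores, labels):
    """AUC that `scores` rank positives (label 1) above negatives (label 0).

    Computed as the Mann-Whitney U statistic divided by n_pos * n_neg,
    with tied scores receiving their average rank.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    # Average 1-based rank of each tie group: last rank minus half the group span.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    ranks = (last_rank - (counts - 1) / 2.0)[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def threshold_free_separation(late_drift, correct):
    """Fraction of correct examples whose late drift lies below the
    least mobile failure's late drift (no settled/unsettled threshold needed)."""
    late_drift = np.asarray(late_drift, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    min_wrong = late_drift[~correct].min()
    return float(np.mean(late_drift[correct] < min_wrong))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic lambda_1 values; the 0.03 spread is an assumption, not a measurement.
    lam = np.concatenate([rng.normal(0.012, 0.03, 1794), rng.normal(0.103, 0.03, 254)])
    correct = np.concatenate([np.ones(1794, bool), np.zeros(254, bool)])
    # Negate lambda_1 so that lower expansion ranks as more success-like.
    print("AUC(-lambda1 -> correct):", rank_auc(-lam, correct))
```

Negating λ₁ before ranking matches the sign convention in the table, where lower expansion is the success-like direction. The separation function returns the quantity that C2 reports as 96.5% for TRM.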
diff --git a/paper/experiment_framework.md b/paper/experiment_framework.md
new file mode 100644
index 0000000..07a69d0
--- /dev/null
+++ b/paper/experiment_framework.md
@@ -0,0 +1,55 @@
+# Experiment framework — locked 2026-06-12
+
+Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
+so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
+~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).
+
+## Locked measurement protocol (uniformity rules)
+- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
+  n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
+- λ values never compared across estimator implementations; scale ownership stated at first use.
+- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
+  sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
+- idx-pairing: same seed + same n across any runs that will be joined per example.
+- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
+
+## Claims → gaps → runs
+
+| ID | Closes | Run | Spec | Cost | Priority / target |
+|---|---|---|---|---|---|
+| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
+| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
+| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** |
+| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
+| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | | ~1 h shared | P2 / ICLR |
+| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | | CPU only | P2 / ICLR |
+| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | | ~5–10 h shared | P2 / ICLR |
+| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | | **multi-day dedicated GPU**; needs explicit go | P2-gated / ICLR |
+| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | | dev + ~2 h | P3 / ICLR appendix |
+| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | | ~5 h shared | P3 |
+
+### E8 detail: which model/variant (decided 2026-06-13)
+- **PRIMARY = TRM, attention variant.** Official Maze-Hard recipe = `arch=trm`, mlp_t=False (default),
+  L_layers=2 H_cycles=3 L_cycles=4. NOTE: our Sudoku TRM analysis used mlp_t=TRUE (attention OFF).
+  So TRM-Maze varies BOTH the task (Sudoku→Maze) AND the token mixer (mlp_t→attention) at once.
+  Framing for the paper: this is a STRONGER generality test (decomposition surviving task + mixer),
+  but state the dual change honestly; do not call it a pure same-architecture task transfer.
+- Optional matched control (only if cheap + GPU spare): TRM-Maze with arch.mlp_t=true to isolate
+  task-only transfer. Not the validated recipe → may underperform 75%; treat as secondary.
+- HRM-Maze (completes the {HRM,TRM}×{Sudoku,Maze} grid): deferred. Heavier (27M, HRM Maze recipe
+  costlier). Decide AFTER TRM-Maze lands: nice-to-have if TRM-Maze confirms, important if it surprises.
+- Diagnostic caveat stands: attention arch → verify JVP closures in diagnose_trm_joint.py; seq 900 → n=512/256.
+
+## Decision rules
+- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
+- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
+- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
+- New results that contradict current text → claims.md updated first, prose second.
+
+## Status board (update as runs land)
+- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
+- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
+- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
+- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
+- [ ] E3, E4, E7 — phase-2
+- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.
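E1 above asks for bootstrap CIs and a rule-of-three bound on the B fraction. The following is a hedged sketch of both, assuming a percentile bootstrap that resamples examples with replacement. `rule_of_three_upper` and `bootstrap_ci` are hypothetical helper names, not functions from the analysis scripts; the counts 0/254 (TRM) and 21/3894 (HRM) are the ones quoted in claims.md.

```python
import numpy as np


def rule_of_three_upper(n, alpha=0.05):
    """Exact one-sided upper bound on an event rate when 0 events occur in n trials.

    Solves (1 - p)^n = alpha for p; for alpha = 0.05 this is close to 3 / n,
    which is where the "rule of three" name comes from.
    """
    return 1.0 - alpha ** (1.0 / n)


def bootstrap_ci(values, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` over rows of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        stats[b] = stat(values[idx])
    lo, hi = np.quantile(stats, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return float(lo), float(hi)


if __name__ == "__main__":
    # TRM: no settled failure among 254, so bound the rate from above.
    print("TRM B-fraction upper bound:", rule_of_three_upper(254), "vs 3/n =", 3 / 254)
    # HRM: 21 settled-wrong among 3894 failures, encoded as 0/1 flags.
    flags = np.zeros(3894)
    flags[:21] = 1.0
    print("HRM B-fraction 95% CI:", bootstrap_ci(flags, np.mean))
```

Because `stat` receives resampled rows, the same helper could in principle wrap an AUC computed on (score, label) pairs, provided each row keeps its score and label together.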
diff --git a/paper/intro.md b/paper/intro.md
new file mode 100644
index 0000000..85f06e7
--- /dev/null
+++ b/paper/intro.md
@@ -0,0 +1,57 @@
+# Recursive Reasoning Models Fail by Wandering, Not by Settling
+
+## 1 Introduction
+
+Recursive reasoning models such as the Hierarchical Reasoning Model (HRM; Wang et al., 2025)
+and the Tiny Recursive Model (TRM; Jolicoeur-Martineau, 2025) solve constraint-satisfaction
+puzzles that defeat far larger language models, by iterating a small network on a latent state
+for hundreds of updates per puzzle. When such a model fails, what is dynamically different
+about the trajectory it produced? Two recent mechanistic studies answer in attractor language.
+Failed TRM runs "plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026); failed
+HRM runs converge to spurious fixed points that rival the correct one (Ren & Liu, 2026). The
+evidence behind both labels is indirect, resting on loss plateaus and two-dimensional
+projections of 512-dimensional trajectories, and the labels disagree about the basic character
+of failure: premature stability in one account, partly aimless drift in the other. Neither
+measures the trajectory's stability directly. We do, per example, and the measurements support
+a third description: recursive reasoning models fail by wandering, not by settling.
+
+Across 2,048 to 8,192 held-out Sudoku-Extreme puzzles, correct trajectories end inside a
+narrow low-velocity band of the latent dynamics, and failures essentially never do. In an
+official-recipe TRM at 87.6% test accuracy, none of 254 failures settles: the least mobile
+failure still moves faster at the end of inference than 96.5% of successes, a separation of
+distributions that no threshold choice can undo, and failed trajectories remain locally
+expansive throughout (median leading finite-time Lyapunov exponent λ₁ = +0.103, against +0.012
+for successes; AUC 0.993). HRM shows the same structure with one addition. Settled-but-wrong
+trajectories exist, but they account for 0.55% of failures, carry success-like contraction
+(λ₁ = −0.84, against −0.87 for settled successes) and success-like halting confidence, and
+every one of them would have halted early under adaptive computation. The wrong-attractor
+failure mode is real, rare, and the only failure a confidence-based selector cannot catch.
+
+Two controls locate what the Lyapunov signature adds, and a third experiment locates when it
+exists. Matched for displacement level within the unsettled population, λ₁ still separates
+eventual successes from failures (decile-matched AUC 0.88–0.90), so the exponent does more
+than restate non-convergence. Binned by the number of givens, the separation is unchanged
+(within-bin AUC 0.982, against 0.984 unconditioned), so it is not an artifact of problem
+difficulty. It is, however, strictly retrospective. Restricted to puzzles still unsolved after
+four of sixteen segments, neither early-window exponents nor early state velocity predicts
+which trajectories will eventually succeed (AUC ≈ 0.5 in TRM), and in HRM the association
+inverts — among the undecided, the trajectories that move more in the early segments are the
+ones that go on to solve the puzzle (positive-direction AUC 0.69). The chaos of failure
+arrives with the failure; nothing dynamical in the early trajectory anticipates it.
+
+These measurements redraw the intervention map for this model class. Because failure is almost
+never a stable wrong answer, restart-and-select inference strategies have a high ceiling and a
+quantifiable blind spot of roughly half a percent. Because the early trajectory carries no
+dynamical death sentence, compute spent on early failure prediction is compute wasted, and
+restart diversity is the better buy. Our contributions: (i) per-example, outcome-conditioned
+measurement of settling and finite-time Lyapunov spectra in HRM and TRM, at sample sizes up to
+8,192 and replicated across two estimator implementations; (ii) a decomposition of failure
+that corrects the settled-attractor reading and bounds the wrong-attractor mode at ~0.5% of
+failures; (iii) controls showing the signature is not reducible to non-convergence or
+difficulty; (iv) evidence that the signature is concurrent with the outcome and carries no
+early-warning content at the granularity tested.
+
+---
+*[em-dash count: 1. Contrast-template count: title + one echo (end of ¶1). Flourish count:
+1 ("death sentence", ¶4) — cuttable. "essentially never" is the one hedge in ¶2, scoped by
+the 0.55% in the next sentence.]*
diff --git a/paper/outline.md b/paper/outline.md
new file mode 100644
index 0000000..0dde354
--- /dev/null
+++ b/paper/outline.md
@@ -0,0 +1,79 @@
+# Outline — "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12)
+
+Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ ·
+remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations),
+abstract, tables T1–T3 + figures F3/F4 composition.
+
+Target: ~8 pages main. Every section header below lists [claims served] and [assets].
+
+## 1 Introduction [C1, spine]
+- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they
+  fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss
+  curves and 2-D projections; we measure the dynamics directly, per example.
+- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99;
+  concurrent-not-antecedent).
+- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling
+  measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two
+  published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the
+  early-window null + sign reversal.
+- NO general AI-reasoning throat-clearing. First sentence is about the object of study.
+
+## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table]
+- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876),
+  Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied.
+- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization;
+  per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat.
+- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full
+  percentile sweep + threshold-free statement in appendix); cells A/B/C/D.
+
+## 3 Results
+- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell,
+  strict-B table + fig_hrm_strictB_profiles]
+  Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the
+  settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest
+  token-acc are all 17-givens puzzles).
+- 3.2 What the signal is not [C4; assets: decile table, givens table]
+  Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry
+  the numbers.
+- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables]
+  The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734
+  vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of
+  the signature.
+- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare]
+  Gap widens via λ₁(D) (C7a); the earlier multi4 D-cell shrink does not survive the
+  matched-objective control (E6: small, inconsistent; C7b demoted); multi4 collapse = λ₁(A) sign flip.
+
+## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*]
+- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing
+  App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome.
+- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention;
+  quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy
+  (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller
+  are consistent, at the design level, with where our measurements localize the signal — cite,
+  don't claim confirmation.
+- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own
+  TorchDEQ negative result; Solve-the-Loop) — what "enforce settling" buys and where it failed;
+  our measurements say which kind of settling is the operative one.
+
+## 5 Implications (restrained, half page)
+- Intervention design space bifurcates: widen/deepen the settled tube at training time
+  (perturbation training, equilibrium losses) vs restart-and-select at inference
+  (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%).
+- Early pruning/reallocation unsupported at 4-step granularity; on HRM the gradient of usable
+  early signal lives in the learned head, not the generic dynamical quantities.
+
+## 6 Limitations & future
+Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtracks next);
+single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory lingering;
+no mechanism offered for why settling fails — measurement paper.
+
+## Figures plan (all exist or one rerun away)
+F1: drift–λ₁ scatter, both models (have).
+F2: per-cell λ₁ + strict-B profiles inset (have).
+F3: decile-matched AUC + givens-binned AUC (compose from CSVs).
+F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set).
+F5: checkpoint evolution (have).
+
+## Order of writing
+1. Results 3.1–3.3 (numbers already final) → 2. Setup → 3. Sec 4 (notes ready) → 4. Intro →
+5. Implications/Limitations → 6. style pass against claims.md checklist.
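Section 2.3 of the outline above defines the settled band by a bimodal late-drift split with Otsu as the primary threshold. A minimal sketch of a one-dimensional Otsu split follows, assuming a plain histogram over late drift. `otsu_threshold`, the bin count, and the synthetic two-mode sample are illustrative assumptions, not the pipeline's implementation.

```python
import numpy as np


def otsu_threshold(x, bins=256):
    """Return the bin edge that maximizes between-class variance of a 1-D sample.

    Values at or below the returned edge form the low ("settled") class.
    """
    x = np.asarray(x, dtype=float)
    hist, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = hist / hist.sum()
    w0 = np.cumsum(w)            # mass of the low class if we split after bin i
    w1 = 1.0 - w0                # mass of the high class
    m = np.cumsum(w * centers)   # cumulative first moment up to bin i
    mt = m[-1]                   # overall mean (histogram approximation)
    valid = (w0 > 0) & (w1 > 0)
    valid[-1] = False            # splitting after the last bin leaves no high class
    between = np.zeros_like(w0)
    between[valid] = (mt * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    i = int(np.argmax(between))
    return float(edges[i + 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic bimodal late drift: a low-velocity mode and a high-velocity mode.
    low = rng.normal(1.0, 0.1, 500)
    high = rng.normal(10.0, 0.5, 500)
    t = otsu_threshold(np.concatenate([low, high]))
    print("Otsu threshold:", t)
```

When the two modes are separated by an empty gap, every split inside the gap scores the same and `argmax` returns the first one, so the threshold sits just above the low mode. That is one reason the outline's percentile sweep and threshold-free statement are useful companions to the Otsu value.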
diff --git a/paper/rainer_followup_draft.md b/paper/rainer_followup_draft.md
new file mode 100644
index 0000000..12091de
--- /dev/null
+++ b/paper/rainer_followup_draft.md
@@ -0,0 +1,37 @@
+Subject: Re: Question on gradient flossing vs forward trajectory stability in recursive reasoning models
+
+Hi Rainer,
+
+A short follow-up to my email of June 5 — we have since measured the things I was speculating
+about, and two results seem worth sharing because they sharpen the question I asked you.
+
+First, conditioning per-example finite-time Lyapunov spectra on both outcome and terminal
+settling (n = 2048–8192, two architectures) shows that failure is almost exclusively
+non-settling: in an official-recipe TRM at 87.6% accuracy, none of 254 failed trajectories
+ever enters the low-velocity band that all successes occupy, and they remain locally expansive
+to the end (median λ₁ +0.10 vs +0.01). "Converged to the wrong attractor" failures exist in
+HRM but make up only ~0.5% of failures. The chaotic signature also survives two controls: it
+persists after matching trajectories on displacement level (so it is not just re-measuring
+non-convergence), and after binning by puzzle difficulty.
+
+Second — and this is the part that genuinely surprised us — the signature is strictly
+concurrent. Among puzzles still unsolved after a quarter of the inference budget, neither the
+early-window exponents nor early state velocity predicts which trajectories will eventually
+succeed (AUC ≈ 0.5); in HRM the association even inverts, with eventually-successful
+trajectories moving more in the early phase. So the failed trajectories are not "born chaotic":
+chaos at the end and failure appear together.
+
+This makes me think the right framing for my earlier question is reachability of the settled
+region (escape from a long chaotic transient) rather than per-example landscape quality, which
+would be consistent with your view of flossing as a learning-time tool rather than an
+inference-time one. If you know of work that conditions finite-time exponents on trajectory
+fate in this way — in transient-chaos settings or elsewhere — I would be grateful for a
+pointer; we have not found a precedent.
+
+Best,
+Yuren
+
+---
+[Notes, not part of the email: numbers from analysis_2x2/OBSERVATIONS.md addenda 1-2. Send only
+if/after Rainer replies to the June 5 email, or as a gentle bump after ~2 weeks (June 19+).
+The "born chaotic" phrasing mirrors his literature's transient-chaos vocabulary deliberately.]
diff --git a/paper/readiness.md b/paper/readiness.md
new file mode 100644
index 0000000..f07f78e
--- /dev/null
+++ b/paper/readiness.md
@@ -0,0 +1,119 @@
+# Path to a citable, build-on-able preprint — status
+
+Framing (locked 2026-06-19, per user correction): the axis is **expansive vs more-expansive**
+(graded; for TRM both classes have λ₁>0), NOT settled-vs-chaotic. The phenomenon is the
+**cleanness of a graded separation**; the **mechanism is explicitly OPEN** and is the natural seed
+for follow-on projects. This preprint = rigorous phenomenology + precise characterization +
+honest open-mechanism. Do NOT force a fixed-point / suppression-of-chaos framing (rejected).
+
+## Tier 0 — measurement bulletproofing (others build on it)
+- [x] **T0.1 estimator validation** — `paper/validation/`: QR/Benettin core recovers known spectra
+  to <1e-3 (diagonal, symmetric, non-normal asymptotic) and Hénon λ₁ to 8e-5. PASS. Confirms the
+  numerical core (orthonormalization cadence, log|diagR| bookkeeping, ordering, averaging).
+- [ ] **T0.2 robustness reruns (GPU)** — λ stability vs t_ons, tangent-basis seed, k>8. Window
+  dependence already covered offline (Char 3). Small queue; spec below.
+- [ ] **T0.3 language/scope pass** — finite-time vs asymptotic, "expansive not chaotic" for HRM
+  (negative λ), metric/coordinate-dependence caveat (Lohmiller–Slotine). Prose task.
+
+## Characterization (replaces the rejected Tier 2; describes WHAT, not WHY)
+- [x] **Char 1 whole-spectrum** — separation is a ~rigid shift of the ENTIRE k=8 spectrum, not a
+  single mode (per-exponent AUC uniformly 0.98–0.99; HRM gap ≈constant −0.16/exponent). Spectral
+  MEAN separates ≥ λ₁ alone (AUC 0.991–0.995). CAVEAT: KS-proxy Σλ⁺ is the wrong aggregate for HRM
+  (all-negative spectra → 0); use spectral mean for HRM.
+- [x] **Char 2 shape** — two overlapping UNIMODAL classes with well-separated means, NOT two
+  discrete clusters (within-class BC 0.26–0.40). Outcome is a moderately sharp threshold on the
+  λ₁ continuum (25→75% transition spans 12–30% of the λ₁ spread).
+- [x] **Char 3 integration-time scaling (the key descriptor)** — separation BUILDS monotonically
+  with window H: Cohen's d 1.06→4.84 (TRM, H=2→16), 0.03→3.45 (HRM). Near-zero at H=2, near-perfect
+  at the full 16-segment budget. The cleanness is an integration-time phenomenon. COHERENCE with
+  E5: this accumulation tracks the unfolding of outcomes (more trajectories revealed by larger H),
+  NOT anticipation — among undecided@H examples λ₁ still doesn't predict (E5). State both together.
+- [x] **Char 4 effect size** — "clean" quantified: Cohen's d 3.4–4.8, distributional overlap
+  <10% (TRM hist-overlap 0.049). Beyond AUC.
+
+## Tier 1 — causal content (the level-up from correlation)
+- [ ] **T1 inference-side causal probe** — nudge a failing trajectory toward lower expansion (or
+  toward the success-mean manifold) mid-rollout and measure outcome recovery; conversely inject
+  expansion into a settling-correct trajectory. Tests settling⟹correct as causal, not correlational.
+  Spec next. GPU.
+
+## Open-mechanism (NOT this paper; the hook for follow-ons)
+Why a graded (both-expansive) difference separates so cleanly. Char 1–4 bound the description;
+the why is deferred. Candidate angles are the user's to pursue, not asserted here.
+
+## Maze cross-task result + checkpoint evolution (2026-06-20)
+
+**Deflationary finding stands and is now grounded:** the FTLE/CLV separation reduces to
+convergence+confidence (λ₁, full k=8 spectrum, AND leading-CLV geometry all reduce; partial-corr
+→0 once drift+q_halt controlled). The dynamical signal is a (redundant) convergence readout.
+
+**Maze (TRM att, friend's run, all 10 ckpts, k=1):** separation WEAK (λ₁ Cohen's d 0.2–0.5 vs
+Sudoku 3–5). Failures SETTLE (B/fail 0.81–0.98, D/fail 0.02–0.19) at ALL ckpts and are NEAR-MISSES
+(token_acc ~0.97). Opposite of Sudoku (failures wander, far-from-correct token ~0.63).
+
+**Checkpoint evolution (the key new result, offline):** wandering is a LATE-TRAINING property.
+Sudoku HRM failures SETTLE early (B/fail ~0.9 at acc 2–15%) then flip to WANDER late
+(D/fail ~1.0 at acc 50%), transition ~step 13–18k. So "failures wander" is learned, not intrinsic.
+BUT matched-accuracy contrast cuts the other way: at acc≈0.76, Sudoku-TRM D/fail=1.00 vs
+Maze-TRM D/fail=0.19 — same skill, opposite dynamics → TASK STRUCTURE also matters, not just maturity.
+And early-Sudoku settling (token 0.63, confidently-wrong) ≠ Maze settling (token 0.97, near-miss):
+not the same phenomenon. Fig: analysis_2x2/checkpoint_evolution_wander.png.
+
+**Task structure (offline):** Maze solution path (median 113 cells) passes through ~76 branch
+points (67% of path cells at deg≥3 junctions; 48% of open cells are junctions) → abundant
+locally-coherent alternative paths = many STABLE WRONG ANSWERS available. Sudoku: unique
+globally-coupled solution, a wrong cell violates constraints globally → no local near-miss
+equilibrium. This structurally explains settle-to-near-miss (Maze) vs wander (Sudoku).
+
+**Unresolved confound (queued):** TRM-Maze never develops wandering, but can't tell task-structure
+from TRM-Maze SATURATION (Maze too easy for TRM). Queued before HRM-Maze:
+(1) continue-train TRM-Maze from step_130200 (does acc climb toward ~1.0 = saturation, or plateau?);
+(2) per-cell failure structure (are failure errors a connected detour = coherent stable wrong path,
+or scattered?). Then HRM-Maze (harder model-task fit, more likely to be stressed into wandering).
+
+## Solution-space test (2026-06-20) — refutes the measurement-artifact concern, strengthens task-structure
+User asked: is weak Maze separation an artifact of analyzing the FULL latent (88% trivial copy)
+instead of the SOLUTION space? Tested directly: per-step decoded-ANSWER Hamming drift over
+solution cells (label!=input), Maze vs Sudoku control.
+- MAZE: failures SETTLE in solution space too (late answer-drift median 0.00, 98.4% settled;
+  AUC 0.30). Same conclusion as full-latent. NOT an artifact.
+- SUDOKU control: failures DON'T settle in solution space (late drift median 8.5/step, 0% settled;
+  AUC 0.99). Same as full-latent. Both spaces agree.
+- Per-cell failure STRUCTURE (direct task-structure evidence): MAZE failures = CONNECTED DETOUR
+  (97% have ≤2 error components, median 22 cells one blob) = a coherent stable wrong PATH.
+  SUDOKU failures = SCATTERED (100% have ≥5 components, median 13) = no coherent wrong answer.
+  Fig: analysis_2x2/maze_failure_detour.png. This is the mechanism-grounding for why Maze settles
+  (stable wrong answers exist as detours) and Sudoku wanders (no stable wrong answer).
+
+## CORRECTION (2026-06-20) — Maze exact-match labeling was the artifact; failure=more-chaotic HOLDS
+The earlier "Maze dissociates / completeness≠correctness" reading was largely a LABELING ARTIFACT,
+not a real dynamical dissociation. Maze exact-match marks VALID alternative solutions (incl.
+equal-length valid shortest paths) as "failures"; 100% of exact-match "failures" are valid connected
+paths (complete answers) → they settle, trivially. That is a benchmark-design flaw, not a result.
+**Under the correct criterion (CONNECTIVITY = is it a valid complete path = is it actually solved):**
+genuine failures (broken/disconnected) ARE more chaotic — AUC(-late_drift→connected) = 0.864 @step_13020
+(15 broken), 0.895 pooled (18 broken); bootstrap 95% CI [0.80, 0.96], excludes 0.5. So
+"failure = more chaotic" is TASK-GENERAL (Sudoku + Maze) once failure is defined by validity.
+LIMITATION (now RESOLVED): trained Maze SATURATES before the first saved ckpt (step_13020 already
+97% complete) → only n=18 broken from existing ckpts. FIX DONE: fresh early-save TRM-Maze run
+(maze_earlysave_freshTRM, saved every 250 epochs) captured the broken-rich pre-saturation phase;
+cheap forward dumps (drift_zH + connectivity, no JVP) on 8 early ckpts give **n=4096, 1835 broken**.
+**Pooled: AUC(-latent drift_zH -> connected/complete) = 0.834, bootstrap 95% CI [0.822, 0.846]**
+(broken late-drift median 1.06 vs connected 0.56). Per-ckpt AUC rises with training 0.66->0.88
+(mirrors Sudoku's separation-grows-with-training). So 'genuine failure (incomplete) = more chaotic'
+is now LARGE-N BULLETPROOF on Maze under validity labeling. Fig: maze_broken_morechaotic.png.
+Honest detail: 'more chaotic' is a LATENT-dynamics property (drift_zH AUC 0.834, λ1 AUC 0.86);
+the DECODED-answer drift does NOT separate (ans_drift AUC 0.38) — broken paths commit an incomplete
+decoded answer while churning internally. Consistent with the FTLE/drift (latent) story.
+
+## Synthesis for the paper (current honest thesis, corrected)
+Genuine failures (incomplete/invalid answers) are MORE CHAOTIC — measurable, task-general (Sudoku;
+Maze under validity labeling). Mechanism: the dynamical signal detects answer completeness/convergence
+(FTLE reducible to drift+q_halt). On unique-solution tasks completeness=correctness, so it predicts
+correctness directly. On multi-solution tasks exact-match mislabels valid alternatives as failures;
+use validity labeling. The phenomenon stands; the convergence-detection mechanism is the honest
+interpretation, not a refutation.
+
+## Status
+Done: offline T0.1 + Char 1–4 + Maze evolution + task structure + TRM CLV. Queued: HRM CLV (on card1), maze-followup queue (continue-train + per-cell, waiting for GPU).
+Remaining: T0.2/T0.3, T1, HRM-Maze (after saturation test).
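Review note on the Maze correction in the notes above: the AUC(-late_drift -> connected) figures with bootstrap 95% CIs can be reproduced in outline with a rank-based AUC and a percentile bootstrap. The sketch below is a minimal illustration under stated assumptions, not the project's analysis code: `rank_auc` and `bootstrap_auc_ci` are hypothetical names, and the data are synthetic stand-ins rather than the dumped `drift_zH` values.

```python
import numpy as np


def rank_auc(score, label):
    """AUC as the Mann-Whitney statistic: P(positive outranks negative), ties count 1/2."""
    score = np.asarray(score, float)
    label = np.asarray(label, bool)
    n_pos, n_neg = int(label.sum()), int((~label).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined when only one class is present
    order = np.argsort(score, kind="mergesort")
    s = score[order]
    ranks = np.empty(len(s), float)
    i = 0
    while i < len(s):  # give every run of tied scores its average 1-based rank
        j = i
        while j + 1 < len(s) and s[j + 1] == s[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    u = ranks[label].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def bootstrap_auc_ci(score, label, n_boot=500, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap CI from resampling examples with replacement."""
    rng = np.random.default_rng(seed)
    score = np.asarray(score, float)
    label = np.asarray(label, bool)
    n = len(score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        a = rank_auc(score[idx], label[idx])
        if not np.isnan(a):  # skip degenerate one-class resamples
            aucs.append(a)
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return rank_auc(score, label), float(lo), float(hi)


# Synthetic stand-in: 100 "broken" examples drift more than 300 "connected" ones.
rng = np.random.default_rng(1)
connected = np.ones(400, bool)
connected[:100] = False
late_drift = np.where(connected,
                      rng.lognormal(-0.6, 0.6, 400),
                      rng.lognormal(0.1, 0.6, 400))
auc, lo, hi = bootstrap_auc_ci(-late_drift, connected)
print(f"AUC(-late_drift -> connected) = {auc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The score is negated so that lower drift ranks higher, matching the sign convention used in the notes. A CI whose lower bound stays above 0.5 is what "excludes 0.5" refers to.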
diff --git a/paper/sample_intro.md b/paper/sample_intro.md
new file mode 100644
index 0000000..183faa4
--- /dev/null
+++ b/paper/sample_intro.md
@@ -0,0 +1,49 @@
+# Sample section: Introduction (taste-calibration draft)
+
+Recursive reasoning models solve constraint-satisfaction problems that defeat much larger
+language models by iterating a small network on a latent state — up to several hundred state
+updates per puzzle in the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model
+(TRM). When such a model fails, what is dynamically different about the trajectory it
+produced? Recent mechanistic studies have answered with attractor language: failed runs
+"plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026), or converge to spurious
+fixed points that rival the correct one (Ren & Liu, 2026). These accounts rest on indirect
+evidence — loss plateaus, two-dimensional projections of 512-dimensional trajectories — and
+the two papers do not agree: one describes failure as premature stability, the other partly as
+wandering. Neither measures stability itself.
+
+We measure it directly. For every test puzzle we record two per-example quantities along the
+full 16-segment inference trajectory: the finite-time Lyapunov spectrum of the joint latent
+dynamics, and the per-segment state displacement. Conditioning these on outcome over 2,048 to
+8,192 puzzles per model yields a complete decomposition of failure for HRM (52.6% accuracy)
+and an official-recipe TRM (87.6%), and the decomposition contradicts the settled-attractor
+picture. Correct trajectories enter a narrow low-velocity band and stay in it; failed
+trajectories never do. In TRM, not one of 254 failures settles: the least mobile failure still
+moves faster at the end of inference than 96.5% of successes. TRM failures also remain locally
+expansive (median λ₁ = +0.103 versus +0.012 for successes; AUC 0.993). In HRM, settled-but-wrong
+trajectories exist but account for 0.55% of failures; the other 99.45% wander. Failure in these
+models is not a wrong attractor. It is the sustained absence of settling.
+
+Two controls sharpen what the Lyapunov signature adds. Matched for displacement level within
+the unsettled population, λ₁ still separates eventual successes from failures (decile-matched
+AUC 0.88–0.90), so the exponent is not merely re-measuring non-convergence; and binning by
+puzzle givens leaves the separation intact (within-bin AUC 0.982 versus 0.984 overall), so it
+is not a difficulty artifact. The signature is, however, strictly retrospective. Restricted to
+puzzles still unsolved after four segments, nothing dynamical about those first four segments
+predicts which will eventually be solved: AUC ≈ 0.5 in TRM for exponent, displacement, and
+halting confidence alike — and in HRM the association inverts, with eventual successes moving
+*more* in the early trajectory than eventual failures (AUC 0.69 in the positive direction).
+The chaos of failure is concurrent with the outcome, not an omen visible at the start.
+
+These measurements reframe both the diagnosis and the levers. Because failure is almost never
+a stable wrong answer, selection-based inference strategies have a high ceiling: final-step
+halting confidence tracks correctness on all but the ~0.5% of failures that settle confidently.
+Because the early trajectory carries no dynamical death sentence, compute is better
+spent on restarts than on early pruning. We quantify both points, correct the published
+attractor labels they depend on, and release the per-example measurement tooling.
+
+---
+*[Style notes for review, not part of the draft: (1) every paragraph opens with a finding or a
+question, none with "In recent years"; (2) the two prior papers are quoted precisely and
+credited for what their data shows before the correction is made; (3) hedges appear only where
+the claim table concedes (e.g., "almost never", "~0.5%"); (4) the one rhetorical flourish —
+"not an omen" — is load-bearing; cut it if it reads as flavor.]*
diff --git a/paper/setup_results.md b/paper/setup_results.md
new file mode 100644
index 0000000..d6976d5
--- /dev/null
+++ b/paper/setup_results.md
@@ -0,0 +1,129 @@
+# 2 Setup
+
+**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
+1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
+accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global
+batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set,
+87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
+halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
+observed for the full budget. Answers are decoded at segment 16.
+
+**Per-example measurements.** Along each trajectory we record three families of quantities.
+First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
+vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
+through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
+time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64
+for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
+separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
+are comparable only within an estimator implementation; we replicate the HRM analysis under a
+second, earlier implementation (n=8,192) and report its scale separately.
+
+**The settling criterion.** Late drift, the mean z_H displacement over the final four
+segments, is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
+(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
+under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
+trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
+method on the pooled log distribution; every result below is reported with a full percentile
+sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
+point: each model's low band has a nonzero characteristic velocity.
+
+**Design.** Crossing the settling criterion with answer correctness yields four cells:
+settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
+analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
+beyond the settling split? And when along the trajectory does the discriminative signal exist?
+
+# 3 Results
+
+## 3.1 Failure is wandering: the 2×2 decomposition
+
+At the end of inference, success and failure occupy different dynamical regimes almost without
+exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
+settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
+late drift of 96.5% of successes, so no threshold assignment can place a failure in the
+settled band. Failed trajectories also remain locally expansive over the full window (median
+λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
+(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.55%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
+settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy varies with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy, and solver backtrack counts would
+be the sharper control. At this resolution the outcome separation does not depend on how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (idx-paired
+with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population (puzzles not yet
+correct at segment 4), prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
+failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
+247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
+known late-run collapse of this variant coincides with the settled band itself destabilizing
+(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
+objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
+pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
+(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
+Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
+restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
+offline_followups/followups.md.]*
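Review note on the settling criterion and the 2×2 design in paper/setup_results.md: the pipeline described there (late drift over the final four segments, Otsu threshold on the log distribution, crossing with correctness into cells A to D) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind Table 1: the function names, the 256-bin histogram, and the synthetic bimodal sample are all hypothetical.

```python
import numpy as np


def late_drift_from_states(z, n_late=4):
    """Mean per-segment displacement over the last n_late segments.
    z has shape (n_examples, n_segments + 1, dim): one state per segment boundary."""
    disp = np.linalg.norm(np.diff(z, axis=1), axis=-1)  # (n_examples, n_segments)
    return disp[:, -n_late:].mean(axis=1)


def otsu_threshold(values, n_bins=256):
    """Otsu's method on a 1-D sample: the histogram cut maximizing between-class variance."""
    values = np.asarray(values, float)
    hist, edges = np.histogram(values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = hist / hist.sum()
    w0 = np.cumsum(w)             # mass at or below each bin
    w1 = 1.0 - w0                 # mass above each bin
    mu_cum = np.cumsum(w * centers)
    mu_total = mu_cum[-1]
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros(n_bins)
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (w0[valid] * w1[valid])
    return float(edges[np.argmax(between) + 1])  # right edge of the best bin


def two_by_two(late_drift, correct):
    """Cross settled/unsettled with correct/wrong: A, B, C, D as defined in the Design paragraph."""
    log_drift = np.log10(np.asarray(late_drift, float))
    correct = np.asarray(correct, bool)
    thr = otsu_threshold(log_drift)
    settled = log_drift <= thr
    counts = {
        "A": int((settled & correct).sum()),    # settled-correct
        "B": int((settled & ~correct).sum()),   # settled-wrong
        "C": int((~settled & correct).sum()),   # unsettled-correct
        "D": int((~settled & ~correct).sum()),  # unsettled-wrong
    }
    return thr, counts


# Synthetic stand-in: a narrow low band and a broad high mode ~1.5 decades above it.
rng = np.random.default_rng(0)
late_drift = np.concatenate([10 ** rng.normal(0.0, 0.03, 700),
                             10 ** rng.normal(1.5, 0.25, 300)])
correct = np.concatenate([np.ones(700, bool), np.arange(300) < 30])
thr, cells = two_by_two(late_drift, correct)
print(f"log10 threshold = {thr:.3f}, cells = {cells}")
```

A single Otsu cut is only one of the thresholds the text mentions. The percentile sweep and the threshold-free TRM statement (minimum failure drift against the success distribution) are separate checks that this sketch does not cover.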
diff --git a/paper/style_contract.md b/paper/style_contract.md
new file mode 100644
index 0000000..4e0f319
--- /dev/null
+++ b/paper/style_contract.md
@@ -0,0 +1,47 @@
+# Style contract (operative checklist for every section pass)
+
+Sources: Shaib et al. 2026 slop taxonomy (density/relevance/coherence are the load-bearing
+dimensions); Buschek's four reviewer complaints; 2026 banned-pattern lists. Applied as a per-
+section pass, not a vibe.
+
+## Hard bans (lexical)
+delve, tapestry, landscape (figurative), testament, pivotal, showcase, intricate, vibrant,
+seamless, elegant, dramatically, novel (self-describing), comprehensive (self-describing),
+notably, crucially, interestingly, importantly, "It is worth noting", "Moreover," as sentence
+opener, "In recent years", "has garnered attention", "paradigm shift".
+Technical terms that overlap ban lists (e.g., "robust" in "threshold-robust") stay.
+
+## Punctuation & rhythm
+- Em dash: ≤2 per section, never two in one sentence, never as paired parenthetical.
+- No exclamation marks. Semicolons fine.
+- Every paragraph contains at least one sentence under ~12 words.
+- No uniform paragraph shapes: don't end three consecutive paragraphs with a summary clause.
+
+## Structure
+- The contrast template ("X, not Y" / "not X, but Y") has a total budget of ONE device:
+  the title and its single echo at the end of paragraph 1. Zero elsewhere. Express other
+  contrasts by stating the positive finding and letting the numbers carry the negation.
+- Rule-of-three closers only when the three items are an exhaustive measured list.
+- No chronological narration of the investigation. Structure follows claims.md.
+- Results topic sentences contain a finding or a number, never a plan ("We then examine...").
+
+## Integrity (Buschek's four, mapped to our risks)
+- Marketing language: zero unevidenced evaluative adjectives. "near-perfect" only with the AUC
+  in the same sentence.
+- Performative related work: every citation paired with the precise claim it supports; check
+  each against papers/notes/*.md before the citation lands.
+- Misrepresentation: prior-work characterizations quote verbatim with page/section; corrections
+  target the quoted words only (claims.md C6a discipline).
+- Stretched summaries: no interpretive metaphors for results. One flourish per section maximum,
+  and it must be cuttable without losing content.
+
+## Density (the actual anti-slop weapon)
+- Every sentence must add a number, a definition, a caveat, or a logical step. Otherwise cut.
+- Modal verbs (could/might/may) confined to Discussion and future work.
+- Hedges appear only where claims.md concedes; one hedge per concession, not a seesaw.
+
+## Honesty rails (project-specific)
+- Never compare λ values across estimator versions; state scale ownership at first use.
+- "Settled" always defined as the measured low-velocity band, with its residual velocity given.
+- No mechanism claims; observations and their direct logical consequences only.
+- No promises (code release, future experiments) that the authors have not decided.
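Review note on paper/style_contract.md: the lexical and punctuation rules are mechanical enough to check with a script on each section pass. The sketch below is a hypothetical helper, not part of the commit. It covers only the unambiguous items (the banned words that need no judgment about figurative or self-describing use, the em-dash limits, and exclamation marks), and the word list is copied from the contract's "Hard bans" section.

```python
import re

EM_DASH = "\u2014"
# Only the bans that need no judgment call; "landscape", "novel" and
# "comprehensive" are conditional in the contract and are left to the reviewer.
BANNED_WORDS = ["delve", "tapestry", "testament", "pivotal", "showcase", "intricate",
                "vibrant", "seamless", "elegant", "dramatically", "notably",
                "crucially", "interestingly", "importantly"]
BANNED_PHRASES = ["it is worth noting", "in recent years", "has garnered attention",
                  "paradigm shift"]


def lint_section(text, max_em_dashes=2):
    """Return a list of style-contract violations found in one section of prose."""
    problems = []
    lowered = text.lower()
    for word in BANNED_WORDS:
        # whole-word match only; inflected forms would need a stemmer
        if re.search(r"\b" + re.escape(word) + r"\b", lowered):
            problems.append(f"banned word: {word}")
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase}")
    n_dash = text.count(EM_DASH)
    if n_dash > max_em_dashes:
        problems.append(f"em dashes: {n_dash} > {max_em_dashes}")
    # naive sentence split on terminal punctuation followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if sentence.count(EM_DASH) >= 2:
            problems.append("two em dashes in one sentence")
    if "!" in text:
        problems.append("exclamation mark")
    return problems


bad = "We delve into the data. Late drift \u2014 the mean displacement \u2014 is bimodal."
clean = "Late drift is bimodal in log scale. AUC is 0.984."
print(lint_section(bad))
print(lint_section(clean))
```

The density and integrity rules (every sentence adds a number, a definition, a caveat, or a logical step; citations paired with claims) are not mechanically checkable this way and remain a manual pass.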
diff --git a/paper/validation/validate_le_estimator.py b/paper/validation/validate_le_estimator.py
new file mode 100644
index 0000000..7ad4fd0
--- /dev/null
+++ b/paper/validation/validate_le_estimator.py
@@ -0,0 +1,107 @@
+"""T0.1 — validate the QR/Benettin FTLE estimator core against systems with KNOWN spectra.
+
+Reimplements the IDENTICAL accumulation used in diagnose_{trm,hrm}_joint.py:
+  Q in R^{n x k} init random-orthonormal; each step apply the (known) Jacobian to Q's columns;
+  every t_ons steps QR-decompose, accumulate sum of log|diag(R)|; LE_i = sum / n_qr_steps.
+
+Test systems (known answers):
+  (a) diagonal linear map     LE_i = log|d_i|           (exact up to an O(1/T) transient from random Q)
+  (b) symmetric linear map     LE_i = log|eig_i|          (same; |eig| = singular values)
+  (c) non-normal (shear) map   LE_i = log|eig_i| asympt.  (finite-time transient from singular values)
+  (d) Henon map (a=1.4,b=0.3)  LE = {+0.41922, -1.62319}  (nonlinear chaotic; literature value)
+
+A passing result = recovered exponents match known to within tolerance, confirming the QR core
+(orthonormalization cadence, log|diag R| bookkeeping, ordering, averaging) is correct.
+No GPU, no model — this isolates the numerical estimator.
+"""
+from __future__ import annotations
+import numpy as np
+
+RNG = np.random.default_rng(0)
+
+
+def qr_le(jac_fn, x0, n_steps, k, t_ons=1, warmup=0):
+    """Benettin/QR LE estimate. jac_fn(x)->(x_next, J) gives next state and Jacobian at x.
+    Mirrors diagnose_*_joint.py: QR every t_ons steps, accumulate log|diag R|, average over QR steps."""
+    x = np.asarray(x0, float)
+    d = x.shape[0]
+    Q, _ = np.linalg.qr(RNG.standard_normal((d, k)))
+    log_R_sum = np.zeros(k)
+    n_qr = 0
+    for t in range(n_steps):
+        x, J = jac_fn(x)
+        Q = J @ Q
+        if (t + 1) % t_ons == 0:
+            Q, R = np.linalg.qr(Q)
+            if t >= warmup:
+                log_R_sum += np.log(np.clip(np.abs(np.diag(R)), 1e-30, None))
+                n_qr += 1
+    return np.sort(log_R_sum / max(n_qr, 1))[::-1]
+
+
+def run():
+    out = ["# T0.1 estimator validation (QR/Benettin core vs known spectra)", ""]
+    tol = 5e-3
+
+    # (a) diagonal
+    d_vals = np.array([1.5, 0.8, 0.3, 0.05])
+    M = np.diag(d_vals)
+    known = np.sort(np.log(np.abs(d_vals)))[::-1]
+    est = qr_le(lambda x: (x, M), np.ones(4), 4000, k=4)  # linear: J state-independent, don't grow x
+    out += [f"(a) diagonal linear: known {np.round(known,4)}",
+            f"    recovered        {np.round(est,4)}   max|err|={np.max(np.abs(est-known)):.2e}  "
+            f"{'PASS' if np.max(np.abs(est-known))<tol else 'FAIL'}"]
+
+    # (b) symmetric
+    A = RNG.standard_normal((5, 5)); S = (A + A.T) / 2
+    # scale so spectral radius = 0.9 (keep magnitudes spread, finite)
+    S = 0.9 * S / np.max(np.abs(np.linalg.eigvalsh(S)))
+    eig = np.linalg.eigvalsh(S)
+    known = np.sort(np.log(np.abs(eig)))[::-1]
+    est = qr_le(lambda x: (x, S), np.ones(5), 8000, k=5)
+    out += [f"(b) symmetric linear: known {np.round(known,4)}",
+            f"    recovered         {np.round(est,4)}   max|err|={np.max(np.abs(est-known)):.2e}  "
+            f"{'PASS' if np.max(np.abs(est-known))<tol else 'FAIL'}"]
+
+    # (c) non-normal shear: LE -> log|eig| asymptotically; finite-time transient from singular values
+    N = np.array([[1.1, 5.0], [0.0, 0.6]])  # eigenvalues 1.1, 0.6 (triangular); highly non-normal
+    known = np.sort(np.log(np.abs(np.linalg.eigvals(N))))[::-1]
+    est_long = qr_le(lambda x: (x, N), np.ones(2), 40000, k=2)
+    sv = np.sort(np.log(np.linalg.svd(N, compute_uv=False)))[::-1]
+    est_short = qr_le(lambda x: (x, N), np.ones(2), 5, k=2)
+    out += [f"(c) non-normal shear: known asymptotic log|eig| {np.round(known,4)}",
+            f"    recovered (T=40000)                        {np.round(est_long,4)}   "
+            f"max|err|={np.max(np.abs(est_long-known)):.2e}  "
+            f"{'PASS' if np.max(np.abs(est_long-known))<1e-2 else 'FAIL'}",
+            f"    single-step log singular values            {np.round(sv,4)}  (finite-time transient ref)",
+            f"    recovered (T=5, finite-time)               {np.round(est_short,4)}  "
+            f"(should sit between sv and asymptotic -> confirms finite-time != asymptotic)"]
+
+    # (d) Henon map
+    a, b = 1.4, 0.3
+    def henon(x):
+        xn = np.array([1 - a * x[0] ** 2 + x[1], b * x[0]])
+        J = np.array([[-2 * a * x[0], 1.0], [b, 0.0]])
+        return xn, J
+    # settle onto attractor first
+    x = np.array([0.1, 0.1])
+    for _ in range(1000):
+        x, _ = henon(x)
+    known = np.array([0.41922, -1.62319])  # literature (Sprott)
+    est = qr_le(henon, x, 200000, k=2, warmup=1000)
+    out += [f"(d) Henon (a=1.4,b=0.3): literature {np.round(known,4)}  (sum={known.sum():.4f})",
+            f"    recovered                       {np.round(est,4)}  (sum={est.sum():.4f})   "
+            f"|err λ1|={abs(est[0]-known[0]):.2e}  "
+            f"{'PASS' if abs(est[0]-known[0])<5e-3 else 'FAIL'}"]
+
+    out += ["", "Interpretation: (a)(b) confirm exact recovery for normal maps; (c) confirms the",
+            "estimator converges to log|eig| asymptotically while finite-time windows reflect",
+            "singular-value growth (the regime our paper operates in); (d) confirms correct",
+            "recovery on a known chaotic nonlinear system. The QR core is validated."]
+    print("\n".join(out))
+    from pathlib import Path
+    Path(__file__).resolve().parent.joinpath("validation_results.md").write_text("\n".join(out))
+
+
+if __name__ == "__main__":
+    run()
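Review note on validate_le_estimator.py: the validator applies an explicit Jacobian matrix, while paper/setup_results.md describes propagating tangent vectors with Jacobian-vector products. A matrix-free variant of the same QR accumulation can be sketched as below. This is an illustration under stated assumptions: `fd_jvp` uses central differences as a stand-in for an autodiff JVP, `qr_le_matrix_free` is a hypothetical name, and nothing here is taken from diagnose_{trm,hrm}_joint.py.

```python
import numpy as np


def fd_jvp(step_fn, x, v, eps=1e-6):
    """Central-difference stand-in for an autodiff JVP: approximates J(x) @ v without forming J."""
    return (step_fn(x + eps * v) - step_fn(x - eps * v)) / (2.0 * eps)


def qr_le_matrix_free(step_fn, x0, n_steps, k, seed=0):
    """Benettin/QR exponents using only JVPs of the update map (QR at every step)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    Q, _ = np.linalg.qr(rng.standard_normal((x.shape[0], k)))
    log_r_sum = np.zeros(k)
    for _ in range(n_steps):
        # push each tangent column through the update at the current state
        Q = np.stack([fd_jvp(step_fn, x, Q[:, i]) for i in range(k)], axis=1)
        x = step_fn(x)
        Q, R = np.linalg.qr(Q)
        log_r_sum += np.log(np.clip(np.abs(np.diag(R)), 1e-30, None))
    return np.sort(log_r_sum / n_steps)[::-1]


def henon(x, a=1.4, b=0.3):
    """Henon map with the same parameters as test (d) in the validator."""
    return np.array([1.0 - a * x[0] ** 2 + x[1], b * x[0]])


# settle onto the attractor first, as the validator does
x = np.array([0.1, 0.1])
for _ in range(1000):
    x = henon(x)
le = qr_le_matrix_free(henon, x, 30000, k=2)
print("lambda:", np.round(le, 4), "sum:", round(float(le.sum()), 4))
```

For the Henon map the Jacobian determinant is -b at every state, so the two exponents must sum to log(0.3) regardless of trajectory length. That gives a check on the bookkeeping that does not depend on convergence of λ₁.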
diff --git a/paper/validation/validation_results.md b/paper/validation/validation_results.md
new file mode 100644
index 0000000..008a73a
--- /dev/null
+++ b/paper/validation/validation_results.md
@@ -0,0 +1,17 @@
+# T0.1 estimator validation (QR/Benettin core vs known spectra)
+
+(a) diagonal linear: known [ 0.4055 -0.2231 -1.204  -2.9957]
+    recovered        [ 0.4047 -0.2236 -1.2032 -2.9953]   max|err|=7.99e-04  PASS
+(b) symmetric linear: known [-0.1054 -0.6391 -0.9467 -1.4414 -3.074 ]
+    recovered         [-0.1055 -0.6392 -0.9465 -1.4415 -3.0738]   max|err|=1.85e-04  PASS
+(c) non-normal shear: known asymptotic log|eig| [ 0.0953 -0.5108]
+    recovered (T=40000)                        [ 0.0953 -0.5109]   max|err|=3.84e-05  PASS
+    single-step log singular values            [ 1.6396 -2.0551]  (finite-time transient ref)
+    recovered (T=5, finite-time)               [ 0.5468 -0.9623]  (should sit between sv and asymptotic -> confirms finite-time != asymptotic)
+(d) Henon (a=1.4,b=0.3): literature [ 0.4192 -1.6232]  (sum=-1.2040)
+    recovered                       [ 0.4193 -1.6233]  (sum=-1.2040)   |err λ1|=8.44e-05  PASS
+
+Interpretation: (a)(b) confirm exact recovery for normal maps; (c) confirms the
+estimator converges to log|eig| asymptotically while finite-time windows reflect
+singular-value growth (the regime our paper operates in); (d) confirms correct
+recovery on a known chaotic nonlinear system. The QR core is validated.
+\ No newline at end of file