| field | value | date |
|---|---|---|
| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-29 12:15:51 -0500 |
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-29 12:15:51 -0500 |
| commit | a6ec4288a2232988b130b2f00bb2565f81706966 (patch) | |
| tree | 1bb86e7f0b899b823b9e7fdf383e832d30a181e0 /paper | |
Recursive reasoning dynamics: analysis pipeline, paper drafts, toy models
Failure=more-chaotic (task-general under validity labeling) reduces to convergence/completeness
detection; mechanism (transient chaos vs multistability vs input-induced) under investigation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'paper')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | paper/claims.md | 37 |
| -rw-r--r-- | paper/experiment_framework.md | 55 |
| -rw-r--r-- | paper/intro.md | 57 |
| -rw-r--r-- | paper/outline.md | 79 |
| -rw-r--r-- | paper/rainer_followup_draft.md | 37 |
| -rw-r--r-- | paper/readiness.md | 119 |
| -rw-r--r-- | paper/sample_intro.md | 49 |
| -rw-r--r-- | paper/setup_results.md | 129 |
| -rw-r--r-- | paper/style_contract.md | 47 |
| -rw-r--r-- | paper/validation/validate_le_estimator.py | 107 |
| -rw-r--r-- | paper/validation/validation_results.md | 17 |
11 files changed, 733 insertions, 0 deletions
diff --git a/paper/claims.md b/paper/claims.md new file mode 100644 index 0000000..590b388 --- /dev/null +++ b/paper/claims.md @@ -0,0 +1,37 @@ +# Claim table (write structure FROM this, not from session chronology) + +Rule: every claim states its evidence, its strongest counter-reading, and where that +counter-reading is killed or conceded. A claim that can't fill all four columns gets cut +or demoted to an observation in the discussion. + +| # | Claim | Evidence | Strongest counter-reading | Disposition | +|---|---|---|---|---| +| C1 | Per-example full-trajectory FTLE separates success/failure near-perfectly in trained HRM and TRM on Sudoku-Extreme | AUC(−λ₁→correct): HRM 0.984 (n=8192) / 0.987 (n=2048, 2nd estimator); TRM official 87.6% ckpt 0.993 (n=2048). Medians: HRM joint −0.152 vs +0.032; TRM +0.012 vs +0.103 | Estimator artifact | Replicates across two estimator implementations and two architectures; report both scales, never cross-compare | +| C2 | Failures overwhelmingly never settle; settled-wrong is rare (HRM) to absent (TRM) | TRM: 0/254 failures settled, threshold-free (min wrong-drift > late-drift of 96.5% of correct); HRM: 21/3894 (0.55%) strict-band; replicated 5/971 on 2nd estimator | Threshold choice ("settled" is arbitrary) | State threshold-free version (distribution separation); show full percentile sweep; define settled band by its narrow characteristic velocity | +| C3 | The rare settled-wrong failures are selector-blind: success-like contraction AND success-like halting confidence | n=21: λ₁ med −0.842 (A: −0.867), q_halt(final) +7.47 (= A), all would halt early (halted_at 4–9), token_acc med 0.62 | Small n | Concede explicitly; the point is existence + boundedness (~0.5%), not precision | +| C4 | The FTLE-outcome signal is not reducible to non-settling, and not a difficulty artifact | Drift-decile-matched AUC within unsettled stratum: 0.879 (n=8192) / 0.900 (2nd estimator); #givens-bin AUC 0.982 vs 0.984 overall | (a) residual 
within-decile drift variation; (b) #givens is a weak difficulty proxy | (a) deciles are narrow (table shows ranges); (b) concede openly, name solver-backtracks as the proper proxy, future work | +| C5 | The chaotic signature is outcome-concurrent, not antecedent: nothing in the first 4 ACT steps forecasts eventual success among not-yet-solved examples — and on HRM the dynamical signals point the other way | Restricted to not-correct@4: TRM AUC λ₁ 0.543 / drift 0.492 / q_halt 0.521 (n=626); HRM λ₁ 0.448 / drift 0.312 (sign-reversed; +drift→success AUC 0.688) / q_halt 0.734 (n=1342) | Window length (only 4 steps tested) | Concede; horizon sweep = explicit future work; 4 chosen to match training window | +| C6a | Correction: TRM failures do not "plateau at stable high-loss attractors" in state space | C2 + λ₁(D)=+0.103 + residual velocity ≈56/step ≈0.77× early; their own Fig 5 oscillation consistent | We strawman "attractor" (bounded attracting set can be chaotic) | Quote their exact wording ("stable", "local minima", "stabilize rather than explore"); credit their loss/boundedness data and intervention; correct only the settledness reading | +| C6b | Refinement: Ren & Liu's four modes confirmed and quantified; wrong-fixed-point mode is real but marginal at trajectory end | Their mode (4) = our strict B (0.5%); mode (3) = our D (~99.5% of failures) | We measure end-of-window only; mid-trajectory lingering invisible | Concede explicitly; their non-trivial-success lingering claim untouched | +| C7a | Training widens the success/failure gap from the failure side: failures become more expansive while the success regime barely moves | λ₁(D): +0.036→+0.102 over the TRM series, λ₁(A) within ±0.03 of 0 throughout | Estimator-scale / single-run | Holds; HRM series shows the mass-migration version. Robust claim. 
| +| C7b | **DEMOTED by matched-objective control (E6).** Under MATCHED objective (step9 fixed-unroll, E-vs-F / G-vs-H), perturbation training's effect on the wandering cell is small and inconsistent at matched step | HRM fD 0.387→0.369 / 0.385→0.379 / 0.381→**0.387**(↑ at best) / 0.410→0.387; TRM 0.344→0.305 / 0.396→0.361 / 0.334→**0.361**(↑ at best) / 0.312→0.270 | The earlier large shrink (D 274→175) was partly a fixed-unroll-vs-ACT-streaming artifact, NOT a pure intervention effect | **Rewrite §3.4**: report the matched effect as small/equivocal; drop the strong "intervention shrinks wandering" reading. The May-28 mismatched comparison is retired. C7a stands; C7b does not support an intervention claim. | + +## The spine (one sentence) +Direct per-example measurement of settling and tangent expansion decomposes recursive-reasoner +failure: failures are overwhelmingly trajectories that never settle, the chaotic signature is +concurrent with — not antecedent to — the outcome, and it is not explained by non-convergence +alone or by problem difficulty. + +## What this paper is NOT claiming (write these into Discussion to pre-empt) +- No mechanism for WHY trajectories fail to find the settled band (explicitly open). +- No claim that early intervention is impossible in principle — only that λ/drift at 4 steps carry no signal. +- No claim about tasks beyond Sudoku-Extreme or models beyond HRM/TRM. + +## Anti-patterns checklist (apply at style pass) +- [ ] No chronological narration of the investigation; structure = claims order. +- [ ] Numbers in topic sentences; every section's first sentence is a finding, not a plan. +- [ ] No "notably/crucially/interestingly/delve/underscore"; hedges only where the claim table says concede. +- [ ] Related work = positioning (each paragraph ends with the gap), not annotated bibliography. +- [ ] Negative result (C5) framed as a finding with its own section, not a limitation apology. 
+- [ ] Limitations: specific, short, no re-hedging of already-scoped claims. +- [ ] Prose paragraphs in the body; tables only for numbers; no bullet lists in Results. diff --git a/paper/experiment_framework.md b/paper/experiment_framework.md new file mode 100644 index 0000000..07a69d0 --- /dev/null +++ b/paper/experiment_framework.md @@ -0,0 +1,55 @@ +# Experiment framework — locked 2026-06-12 + +Purpose: every paper claim gets its evidence gap named and the run that closes it specified, +so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze +~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed). + +## Locked measurement protocol (uniformity rules) +- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary; + n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM). +- λ values never compared across estimator implementations; scale ownership stated at first use. +- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile + sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness. +- idx-pairing: same seed + same n across any runs that will be joined per example. +- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU. 
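The estimator settings in the protocol above (QR re-orthonormalization of k tangent vectors every t_ons steps) follow the general QR/Benettin recipe. Below is a minimal sketch of that recipe on the Hénon map. It assumes explicit 2×2 Jacobians rather than JVPs through a network, and `henon_step`, `henon_jacobian`, and `ftle_qr` are illustrative names, not functions from `diagnose_{hrm,trm}_joint.py`.

```python
import numpy as np


def henon_step(state, a=1.4, b=0.3):
    """One iterate of the Henon map (toy stand-in for a model update)."""
    x, y = state
    return np.array([1.0 - a * x * x + y, b * x])


def henon_jacobian(state, a=1.4, b=0.3):
    """Exact Jacobian of henon_step at `state`."""
    x, _ = state
    return np.array([[-2.0 * a * x, 1.0], [b, 0.0]])


def ftle_qr(step, jacobian, state0, n_steps, k, t_ons=1, burn_in=1000, seed=0):
    """Finite-time Lyapunov exponents via repeated QR of a tangent basis.

    Assumption: the Jacobian is available explicitly; a network-scale
    version would push the k basis vectors through JVPs instead.
    """
    if n_steps % t_ons != 0:
        raise ValueError("n_steps must be a multiple of t_ons")
    rng = np.random.default_rng(seed)
    state = np.asarray(state0, dtype=float)
    for _ in range(burn_in):  # discard the transient before measuring
        state = step(state)
    # Random orthonormal tangent basis with k columns.
    q, _ = np.linalg.qr(rng.standard_normal((state.size, k)))
    log_growth = np.zeros(k)
    for t in range(n_steps):
        q = jacobian(state) @ q  # push the tangent basis forward one step
        state = step(state)
        if (t + 1) % t_ons == 0:
            q, r = np.linalg.qr(q)  # re-orthonormalize
            log_growth += np.log(np.abs(np.diag(r)))  # accumulate stretch
    return log_growth / n_steps


if __name__ == "__main__":
    lam = ftle_qr(henon_step, henon_jacobian, [0.0, 0.0], n_steps=20000, k=2)
    print("exponents:", lam, "sum:", lam.sum(), "log(0.3):", np.log(0.3))
```

With the classic parameters (a = 1.4, b = 0.3) the leading exponent should land near 0.42, and the two exponents sum to log 0.3 because |det J| = b at every step. That sum makes a convenient check on the log|diag R| bookkeeping.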
+ +## Claims → gaps → runs + +| ID | Closes | Run | Spec | Cost | Priority / target | +|---|---|---|---|---|---| +| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** | +| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** | +| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** | +| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** | +| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | ~1 h shared | P2 / ICLR | +| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | CPU only | P2 / ICLR | +| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | ~5–10 h shared | P2 / ICLR | +| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full 
diagnostic suite on it | **multi-day dedicated GPU** — needs explicit go | P2-gated / ICLR | + +### E8 detail — which model/variant (decided 2026-06-13) +- **PRIMARY = TRM, attention variant.** Official Maze-Hard recipe = `arch=trm`, mlp_t=False (default), + L_layers=2 H_cycles=3 L_cycles=4. NOTE: our Sudoku TRM analysis used mlp_t=TRUE (attention OFF). + So TRM-Maze varies BOTH the task (Sudoku→Maze) AND the token mixer (mlp_t→attention) at once. + Framing for the paper: this is a STRONGER generality test (decomposition surviving task + mixer), + but state the dual change honestly; do not call it a pure same-architecture task transfer. +- Optional matched control (only if cheap + GPU spare): TRM-Maze with arch.mlp_t=true to isolate + task-only transfer. Not the validated recipe → may underperform 75%; treat as secondary. +- HRM-Maze (completes the {HRM,TRM}×{Sudoku,Maze} grid): deferred. Heavier (27M, HRM Maze recipe + costlier). Decide AFTER TRM-Maze lands — nice-to-have if TRM-Maze confirms, important if it surprises. +- Diagnostic caveat stands: attention arch → verify JVP closures in diagnose_trm_joint.py; seq 900 → n=512/256. +| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | dev + ~2 h | P3 / ICLR appendix | +| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | ~5 h shared | P3 | + +## Decision rules +- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve. +- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1). +- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results. +- New results that contradict current text → claims.md updated first, prose second. + +## Status board (update as runs land) +- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md +- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). 
C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives). +- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate. +- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite. +- [ ] E3, E4, E7 — phase-2 +- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards. diff --git a/paper/intro.md b/paper/intro.md new file mode 100644 index 0000000..85f06e7 --- /dev/null +++ b/paper/intro.md @@ -0,0 +1,57 @@ +# Recursive Reasoning Models Fail by Wandering, Not by Settling + +## 1 Introduction + +Recursive reasoning models such as the Hierarchical Reasoning Model (HRM; Wang et al., 2025) +and the Tiny Recursive Model (TRM; Jolicoeur-Martineau, 2025) solve constraint-satisfaction +puzzles that defeat far larger language models, by iterating a small network on a latent state +for hundreds of updates per puzzle. When such a model fails, what is dynamically different +about the trajectory it produced? Two recent mechanistic studies answer in attractor language. +Failed TRM runs "plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026); failed +HRM runs converge to spurious fixed points that rival the correct one (Ren & Liu, 2026). The +evidence behind both labels is indirect, resting on loss plateaus and two-dimensional +projections of 512-dimensional trajectories, and the labels disagree about the basic character +of failure: premature stability in one account, partly aimless drift in the other. Neither +measures the trajectory's stability directly. 
We do, per example, and the measurements support +a third description: recursive reasoning models fail by wandering, not by settling. + +Across 2,048 to 8,192 held-out Sudoku-Extreme puzzles, correct trajectories end inside a +narrow low-velocity band of the latent dynamics, and failures essentially never do. In an +official-recipe TRM at 87.6% test accuracy, none of 254 failures settles: the least mobile +failure still moves faster at the end of inference than 96.5% of successes, a separation of +distributions that no threshold choice can undo, and failed trajectories remain locally +expansive throughout (median leading finite-time Lyapunov exponent λ₁ = +0.103, against +0.012 +for successes; AUC 0.993). HRM shows the same structure with one addition. Settled-but-wrong +trajectories exist, but they account for 0.55% of failures, carry success-like contraction +(λ₁ = −0.84, against −0.87 for settled successes) and success-like halting confidence, and +every one of them would have halted early under adaptive computation. The wrong-attractor +failure mode is real, rare, and the only failure a confidence-based selector cannot catch. + +Two controls locate what the Lyapunov signature adds, and a third experiment locates when it +exists. Matched for displacement level within the unsettled population, λ₁ still separates +eventual successes from failures (decile-matched AUC 0.88–0.90), so the exponent does more +than restate non-convergence. Binned by the number of givens, the separation is unchanged +(within-bin AUC 0.982, against 0.984 unconditioned), so it is not an artifact of problem +difficulty. It is, however, strictly retrospective. 
Restricted to puzzles still unsolved after +four of sixteen segments, neither early-window exponents nor early state velocity predicts +which trajectories will eventually succeed (AUC ≈ 0.5 in TRM), and in HRM the association +inverts — among the undecided, the trajectories that move more in the early segments are the +ones that go on to solve the puzzle (positive-direction AUC 0.69). The chaos of failure +arrives with the failure; nothing dynamical in the early trajectory anticipates it. + +These measurements redraw the intervention map for this model class. Because failure is almost +never a stable wrong answer, restart-and-select inference strategies have a high ceiling and a +quantifiable blind spot of roughly half a percent. Because the early trajectory carries no +dynamical death sentence, compute spent on early failure prediction is compute wasted, and +restart diversity is the better buy. Our contributions: (i) per-example, outcome-conditioned +measurement of settling and finite-time Lyapunov spectra in HRM and TRM, at sample sizes up to +8,192 and replicated across two estimator implementations; (ii) a decomposition of failure +that corrects the settled-attractor reading and bounds the wrong-attractor mode at ~0.5% of +failures; (iii) controls showing the signature is not reducible to non-convergence or +difficulty; (iv) evidence that the signature is concurrent with the outcome and carries no +early-warning content at the granularity tested. + +--- +*[em-dash count: 1. Contrast-template count: title + one echo (end of ¶1). Flourish count: +1 ("death sentence", ¶4) — cuttable. 
"essentially never" is the one hedge in ¶2, scoped by +the 0.55% in the next sentence.]* diff --git a/paper/outline.md b/paper/outline.md new file mode 100644 index 0000000..0dde354 --- /dev/null +++ b/paper/outline.md @@ -0,0 +1,79 @@ +# Outline — "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12) + +Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ · +remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations), +abstract, tables T1–T3 + figures F3/F4 composition. + +Target: ~8 pages main. Every section header below lists [claims served] and [assets]. + +## 1 Introduction [C1, spine] +- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they + fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss + curves and 2-D projections; we measure the dynamics directly, per example. +- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99; + concurrent-not-antecedent). +- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling + measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two + published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the + early-window null + sign reversal. +- NO general AI-reasoning throat-clearing. First sentence is about the object of study. + +## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table] +- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876), + Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied. +- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization; + per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat. 
+- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full + percentile sweep + threshold-free statement in appendix); cells A/B/C/D. + +## 3 Results +- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell, + strict-B table + fig_hrm_strictB_profiles] + Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the + settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest + token-acc are all 17-givens puzzles). +- 3.2 What the signal is not [C4; assets: decile table, givens table] + Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry + the numbers. +- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables] + The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734 + vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of + the signature. +- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare] + Gap widens via λ₁(D); multi4 shrinks D-cell mass at matched steps (preliminary, objective + caveat); multi4 collapse = λ₁(A) sign flip. + +## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*] +- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing + App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome. +- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention; + quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy + (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller + are consistent, at the design level, with where our measurements localize the signal — cite, + don't claim confirmation. 
+- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own + TorchDEQ negative result; Solve-the-Loop) — what "enforce settling" buys and where it failed; + our measurements say which kind of settling is the operative one. + +## 5 Implications (restrained, half page) +- Intervention design space bifurcates: widen/deepen the settled tube at training time + (perturbation training, equilibrium losses) vs restart-and-select at inference + (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%). +- Early pruning/reallocation unsupported at 4-step granularity; on HRM the gradient of usable + early signal lives in the learned head, not the generic dynamical quantities. + +## 6 Limitations & future +Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtracks next); +single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory lingering; +no mechanism offered for why settling fails — measurement paper. + +## Figures plan (all exist or one rerun away) +F1: drift–λ₁ scatter, both models (have). +F2: per-cell λ₁ + strict-B profiles inset (have). +F3: decile-matched AUC + givens-binned AUC (compose from CSVs). +F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set). +F5: checkpoint evolution (have). + +## Order of writing +1. Results 3.1–3.3 (numbers already final) → 2. Setup → 3. Sec 4 (notes ready) → 4. Intro → +5. Implications/Limitations → 6. style pass against claims.md checklist. 
diff --git a/paper/rainer_followup_draft.md b/paper/rainer_followup_draft.md new file mode 100644 index 0000000..12091de --- /dev/null +++ b/paper/rainer_followup_draft.md @@ -0,0 +1,37 @@ +Subject: Re: Question on gradient flossing vs forward trajectory stability in recursive reasoning models + +Hi Rainer, + +A short follow-up to my email of June 5 — we have since measured the things I was speculating +about, and two results seem worth sharing because they sharpen the question I asked you. + +First, conditioning per-example finite-time Lyapunov spectra on both outcome and terminal +settling (n = 2048–8192, two architectures) shows that failure is almost exclusively +non-settling: in an official-recipe TRM at 87.6% accuracy, none of 254 failed trajectories +ever enters the low-velocity band that all successes occupy, and they remain locally expansive +to the end (median λ₁ +0.10 vs +0.01). "Converged to the wrong attractor" failures exist in +HRM but make up only ~0.5% of failures. The chaotic signature also survives two controls: it +persists after matching trajectories on displacement level (so it is not just re-measuring +non-convergence), and after binning by puzzle difficulty. + +Second — and this is the part that genuinely surprised us — the signature is strictly +concurrent. Among puzzles still unsolved after a quarter of the inference budget, neither the +early-window exponents nor early state velocity predict which trajectories will eventually +succeed (AUC ≈ 0.5); in HRM the association even inverts, with eventually-successful +trajectories moving more in the early phase. So the failed trajectories are not "born chaotic": +chaos at the end and failure appear together. 
+ +This makes me think the right framing for my earlier question is reachability of the settled +region (escape from a long chaotic transient) rather than per-example landscape quality, which +would be consistent with your view of flossing as a learning-time tool rather than an +inference-time one. If you know of work that conditions finite-time exponents on trajectory +fate in this way — in transient-chaos settings or elsewhere — I would be grateful for a +pointer; we have not found a precedent. + +Best, +Yuren + +--- +[Notes, not part of the email: numbers from analysis_2x2/OBSERVATIONS.md addenda 1-2. Send only +if/after Rainer replies to the June 5 email, or as a gentle bump after ~2 weeks (June 19+). +The "born chaotic" phrasing mirrors his literature's transient-chaos vocabulary deliberately.] diff --git a/paper/readiness.md b/paper/readiness.md new file mode 100644 index 0000000..f07f78e --- /dev/null +++ b/paper/readiness.md @@ -0,0 +1,119 @@ +# Path to a citable, build-on-able preprint — status + +Framing (locked 2026-06-19, per user correction): the axis is **expansive vs more-expansive** +(graded; for TRM both classes have λ₁>0), NOT settled-vs-chaotic. The phenomenon is the +**cleanness of a graded separation**; the **mechanism is explicitly OPEN** and is the natural seed +for follow-on projects. This preprint = rigorous phenomenology + precise characterization + +honest open-mechanism. Do NOT force a fixed-point / suppression-of-chaos framing (rejected). + +## Tier 0 — measurement bulletproofing (others build on it) +- [x] **T0.1 estimator validation** — `paper/validation/`: QR/Benettin core recovers known spectra + to <1e-3 (diagonal, symmetric, non-normal asymptotic) and Hénon λ₁ to 8e-5. PASS. Confirms the + numerical core (orthonormalization cadence, log|diagR| bookkeeping, ordering, averaging). +- [ ] **T0.2 robustness reruns (GPU)** — λ stability vs t_ons, tangent-basis seed, k>8. Window + dependence already covered offline (Char 3). 
Small queue; spec below. +- [ ] **T0.3 language/scope pass** — finite-time vs asymptotic, "expansive not chaotic" for HRM + (negative λ), metric/coordinate-dependence caveat (Lohmiller–Slotine). Prose task. + +## Characterization (replaces the rejected Tier 2; describes WHAT, not WHY) +- [x] **Char 1 whole-spectrum** — separation is a ~rigid shift of the ENTIRE k=8 spectrum, not a + single mode (per-exponent AUC uniformly 0.98–0.99; HRM gap ≈constant −0.16/exponent). Spectral + MEAN separates ≥ λ₁ alone (AUC 0.991–0.995). CAVEAT: KS-proxy Σλ⁺ is the wrong aggregate for HRM + (all-negative spectra → 0); use spectral mean for HRM. +- [x] **Char 2 shape** — two overlapping UNIMODAL classes with well-separated means, NOT two + discrete clusters (within-class BC 0.26–0.40). Outcome is a moderately sharp threshold on the + λ₁ continuum (25→75% transition spans 12–30% of the λ₁ spread). +- [x] **Char 3 integration-time scaling (the key descriptor)** — separation BUILDS monotonically + with window H: Cohen's d 1.06→4.84 (TRM, H=2→16), 0.03→3.45 (HRM). Near-zero at H=2, near-perfect + at the full 16-segment budget. The cleanness is an integration-time phenomenon. COHERENCE with + E5: this accumulation tracks the unfolding of outcomes (more trajectories revealed by larger H), + NOT anticipation — among undecided@H examples λ₁ still doesn't predict (E5). State both together. +- [x] **Char 4 effect size** — "clean" quantified: Cohen's d 3.4–4.8, distributional overlap + <10% (TRM hist-overlap 0.049). Beyond AUC. + +## Tier 1 — causal content (the level-up from correlation) +- [ ] **T1 inference-side causal probe** — nudge a failing trajectory toward lower expansion (or + toward the success-mean manifold) mid-rollout and measure outcome recovery; conversely inject + expansion into a settling-correct trajectory. Tests settling⟹correct as causal, not correlational. + Spec next. GPU. 
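The effect sizes quoted in Char 3 and Char 4 above (Cohen's d, histogram overlap) can be computed from two per-example arrays of λ₁. This is a minimal sketch, assuming a pooled-standard-deviation d and a shared-bin overlap coefficient. The source does not say which variants were used, and `cohens_d` and `hist_overlap` are illustrative names.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d with a pooled (ddof=1) standard deviation: (mean x - mean y) / s."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))


def hist_overlap(x, y, bins=50):
    """Overlap coefficient of two samples: sum over shared bins of min(p_x, p_y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    # Same bin edges for both samples so the bin masses are comparable.
    px, _ = np.histogram(x, bins=bins, range=(lo, hi))
    py, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return float(np.minimum(px / px.sum(), py / py.sum()).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fail = rng.normal(0.10, 0.02, size=254)   # synthetic, not the paper's data
    succ = rng.normal(0.01, 0.02, size=1794)  # synthetic, not the paper's data
    print("d:", cohens_d(fail, succ), "overlap:", hist_overlap(fail, succ))
```

The overlap value depends on the bin count, so any reported figure should state the binning alongside it.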
+ +## Open-mechanism (NOT this paper; the hook for follow-ons) +Why a graded (both-expansive) difference separates so cleanly. Char 1–4 bound the description; +the why is deferred. Candidate angles are the user's to pursue, not asserted here. + +## Maze cross-task result + checkpoint evolution (2026-06-20) + +**Deflationary finding stands and is now grounded:** the FTLE/CLV separation reduces to +convergence+confidence (λ1, full k=8 spectrum, AND leading-CLV geometry all reduce; partial-corr +→0 once drift+q_halt controlled). The dynamical signal is a (redundant) convergence readout. + +**Maze (TRM att, friend's run, all 10 ckpts, k=1):** separation WEAK (λ1 Cohen's d 0.2–0.5 vs +Sudoku 3–5). Failures SETTLE (B/fail 0.81–0.98, D/fail 0.02–0.19) at ALL ckpts and are NEAR-MISSES +(token_acc ~0.97). Opposite of Sudoku (failures wander, far-from-correct token ~0.63). + +**Checkpoint evolution (the key new result, offline):** wandering is a LATE-TRAINING property. +Sudoku HRM failures SETTLE early (B/fail ~0.9 at acc 2–15%) then flip to WANDER late +(D/fail ~1.0 at acc 50%), transition ~step 13–18k. So "failures wander" is learned, not intrinsic. +BUT matched-accuracy contrast cuts the other way: at acc≈0.76, Sudoku-TRM D/fail=1.00 vs +Maze-TRM D/fail=0.19 — same skill, opposite dynamics → TASK STRUCTURE also matters, not just maturity. +And early-Sudoku settling (token 0.63, confidently-wrong) ≠ Maze settling (token 0.97, near-miss): +not the same phenomenon. Fig: analysis_2x2/checkpoint_evolution_wander.png. + +**Task structure (offline):** Maze solution path (median 113 cells) passes through ~76 branch +points (67% of path cells at deg≥3 junctions; 48% of open cells are junctions) → abundant +locally-coherent alternative paths = many STABLE WRONG ANSWERS available. Sudoku: unique +globally-coupled solution, a wrong cell violates constraints globally → no local near-miss +equilibrium. This structurally explains settle-to-near-miss (Maze) vs wander (Sudoku). 
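The junction statistics in the task-structure paragraph above can be sketched as a degree count over open cells. This assumes a 4-connected grid stored as a boolean mask (True = open), which may differ from the actual Maze dataset encoding. `open_degree` and `junction_fraction` are hypothetical helpers.

```python
import numpy as np


def open_degree(open_mask):
    """Number of open 4-neighbours of every cell; walls get degree 0."""
    m = np.asarray(open_mask, dtype=bool)
    # Pad with walls and cast to int so neighbour counts add arithmetically.
    p = np.pad(m, 1, constant_values=False).astype(int)
    deg = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    return np.where(m, deg, 0)


def junction_fraction(open_mask, cells=None, min_degree=3):
    """Fraction of cells with degree >= min_degree.

    cells=None scores all open cells; otherwise pass (row, col) pairs,
    e.g. the cells of a solution path.
    """
    m = np.asarray(open_mask, dtype=bool)
    deg = open_degree(m)
    if cells is None:
        values = deg[m]
    else:
        rows, cols = zip(*cells)
        values = deg[list(rows), list(cols)]
    return float(np.mean(values >= min_degree))


if __name__ == "__main__":
    grid = np.ones((3, 3), dtype=bool)  # toy fully open 3x3 grid
    path = [(0, 0), (0, 1), (1, 1)]
    print("open-cell junction fraction:", junction_fraction(grid))
    print("path junction fraction:", junction_fraction(grid, cells=path))
```

Scoring the path cells and all open cells with the same function keeps the two percentages comparable, which is the contrast the paragraph draws.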
+
+**Unresolved confound (queued):** TRM-Maze never develops wandering, but can't tell task-structure
+from TRM-Maze SATURATION (Maze too easy for TRM). Queued before HRM-Maze:
+(1) continue-train TRM-Maze from step_130200 (does acc climb toward ~1.0 = saturation, or plateau?);
+(2) per-cell failure structure (are failure errors a connected detour = coherent stable wrong path,
+or scattered?). Then HRM-Maze (harder model-task fit, more likely to be stressed into wandering).
+
+## Solution-space test (2026-06-20) — refutes the measurement-artifact concern, strengthens task-structure
+User asked: is weak Maze separation an artifact of analyzing the FULL latent (88% trivial copy)
+instead of the SOLUTION space? Tested directly: per-step decoded-ANSWER Hamming drift over
+solution cells (label!=input), Maze vs Sudoku control.
+- MAZE: failures SETTLE in solution space too (late answer-drift median 0.00, 98.4% settled;
+  AUC 0.30). Same conclusion as full-latent. NOT an artifact.
+- SUDOKU control: failures DON'T settle in solution space (late drift median 8.5/step, 0% settled;
+  AUC 0.99). Same as full-latent. Both spaces agree.
+- Per-cell failure STRUCTURE (direct task-structure evidence): MAZE failures = CONNECTED DETOUR
+  (97% have ≤2 error components, median 22 cells one blob) = a coherent stable wrong PATH.
+  SUDOKU failures = SCATTERED (100% have ≥5 components, median 13) = no coherent wrong answer.
+  Fig: analysis_2x2/maze_failure_detour.png. This is the mechanism-grounding for why Maze settles
+  (stable wrong answers exist as detours) and Sudoku wanders (no stable wrong answer).
+
+## CORRECTION (2026-06-20) — Maze exact-match labeling was the artifact; failure=more-chaotic HOLDS
+The earlier "Maze dissociates / completeness≠correctness" reading was largely a LABELING ARTIFACT,
+not a real dynamical dissociation. Maze exact-match marks VALID alternative solutions (incl.
+equal-length valid shortest paths) as "failures"; 100% of exact-match "failures" are valid connected
+paths (complete answers) → they settle, trivially. That is a benchmark-design flaw, not a result.
+**Under the correct criterion (CONNECTIVITY = is it a valid complete path = is it actually solved):**
+genuine failures (broken/disconnected) ARE more chaotic — AUC(-late_drift→connected) = 0.864 @step_13020
+(15 broken), 0.895 pooled (18 broken); bootstrap 95% CI [0.80, 0.96], excludes 0.5. So
+"failure = more chaotic" is TASK-GENERAL (Sudoku + Maze) once failure is defined by validity.
+LIMITATION (now RESOLVED): trained Maze SATURATES before the first saved ckpt (step_13020 already
+97% complete) → only n=18 broken from existing ckpts. FIX DONE: fresh early-save TRM-Maze run
+(maze_earlysave_freshTRM, saved every 250 epochs) captured the broken-rich pre-saturation phase;
+cheap forward dumps (drift_zH + connectivity, no JVP) on 8 early ckpts give **n=4096, 1835 broken**.
+**Pooled: AUC(-latent drift_zH -> connected/complete) = 0.834, bootstrap 95% CI [0.822, 0.846]**
+(broken late-drift median 1.06 vs connected 0.56). Per-ckpt AUC rises with training 0.66->0.88
+(mirrors Sudoku's separation-grows-with-training). So 'genuine failure (incomplete) = more chaotic'
+is now LARGE-N BULLETPROOF on Maze under validity labeling. Fig: maze_broken_morechaotic.png.
+Honest detail: 'more chaotic' is a LATENT-dynamics property (drift_zH AUC 0.834, λ1 AUC 0.86);
+the DECODED-answer drift does NOT separate (ans_drift AUC 0.38) — broken paths commit an incomplete
+decoded answer while churning internally. Consistent with the FTLE/drift (latent) story.
+
+## Synthesis for the paper (current honest thesis, corrected)
+Genuine failures (incomplete/invalid answers) are MORE CHAOTIC — measurable, task-general (Sudoku;
+Maze under validity labeling). Mechanism: the dynamical signal detects answer completeness/convergence
+(FTLE reducible to drift+q_halt). On unique-solution tasks completeness=correctness, so it predicts
+correctness directly. On multi-solution tasks exact-match mislabels valid alternatives as failures;
+use validity labeling. The phenomenon stands; the convergence-detection mechanism is the honest
+interpretation, not a refutation.
+
+## Status: offline T0.1 + Char 1–4 + Maze evolution + task structure DONE. Running: TRM CLV (done),
+## HRM CLV (queued on card1), maze-followup queue (continue-train + per-cell, waiting for GPU).
+## Remaining: T0.2/T0.3, T1, HRM-Maze (after saturation test).
diff --git a/paper/sample_intro.md b/paper/sample_intro.md
new file mode 100644
index 0000000..183faa4
--- /dev/null
+++ b/paper/sample_intro.md
@@ -0,0 +1,49 @@
+# Sample section: Introduction (taste-calibration draft)
+
+Recursive reasoning models solve constraint-satisfaction problems that defeat much larger
+language models by iterating a small network on a latent state — up to several hundred state
+updates per puzzle in the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model
+(TRM). When such a model fails, what is dynamically different about the trajectory it
+produced? Recent mechanistic studies have answered with attractor language: failed runs
+"plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026), or converge to spurious
+fixed points that rival the correct one (Ren & Liu, 2026). These accounts rest on indirect
+evidence — loss plateaus, two-dimensional projections of 512-dimensional trajectories — and
+the two papers do not agree: one describes failure as premature stability, the other partly as
+wandering. Neither measures stability itself.
+
+We measure it directly. For every test puzzle we record two per-example quantities along the
+full 16-segment inference trajectory: the finite-time Lyapunov spectrum of the joint latent
+dynamics, and the per-segment state displacement. Conditioning these on outcome over 2,048 to
+8,192 puzzles per model yields a complete decomposition of failure for HRM (52.6% accuracy)
+and an official-recipe TRM (87.6%), and the decomposition contradicts the settled-attractor
+picture. Correct trajectories enter a narrow low-velocity band and stay in it; failed
+trajectories never do. In TRM, not one of 254 failures settles — the least mobile failure still
+moves faster at the end of inference than 96.5% of successes — while remaining locally
+expansive (median λ₁ = +0.103 versus +0.012 for successes; AUC 0.993). In HRM, settled-but-wrong
+trajectories exist but account for 0.55% of failures; the other 99.45% wander. Failure in these
+models is not a wrong attractor. It is the sustained absence of settling.
+
+Two controls sharpen what the Lyapunov signature adds. Matched for displacement level within
+the unsettled population, λ₁ still separates eventual successes from failures (decile-matched
+AUC 0.88–0.90), so the exponent is not merely re-measuring non-convergence; and binning by
+puzzle givens leaves the separation intact (within-bin AUC 0.982 versus 0.984 overall), so it
+is not a difficulty artifact. The signature is, however, strictly retrospective. Restricted to
+puzzles still unsolved after four segments, nothing dynamical about those first four segments
+predicts which will eventually be solved: AUC ≈ 0.5 in TRM for exponent, displacement, and
+halting confidence alike — and in HRM the association inverts, with eventual successes moving
+*more* in the early trajectory than eventual failures (AUC 0.69 in the positive direction).
+The chaos of failure is concurrent with the outcome, not an omen visible at the start.
+
+These measurements reframe both the diagnosis and the levers. Because failure is almost never
+a stable wrong answer, selection-based inference strategies have a high ceiling — final-step
+halting confidence tracks correctness on all but the ~0.5% of failures that settle confidently
+— and because the early trajectory carries no dynamical death sentence, compute is better
+spent on restarts than on early pruning. We quantify both points, correct the published
+attractor labels they depend on, and release the per-example measurement tooling.
+
+---
+*[Style notes for review, not part of the draft: (1) every paragraph opens with a finding or a
+question, none with "In recent years"; (2) the two prior papers are quoted precisely and
+credited for what their data shows before the correction is made; (3) hedges appear only where
+the claim table concedes (e.g., "almost never", "~0.5%"); (4) the one rhetorical flourish —
+"not an omen" — is load-bearing; cut it if it reads as flavor.]*
diff --git a/paper/setup_results.md b/paper/setup_results.md
new file mode 100644
index 0000000..d6976d5
--- /dev/null
+++ b/paper/setup_results.md
@@ -0,0 +1,129 @@
+# 2 Setup
+
+**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
+1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
+accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global
+batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set,
+87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
+halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
+observed for the full budget. Answers are decoded at segment 16.
+
+**Per-example measurements.** Along each trajectory we record three families of quantities.
+First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
+vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
+through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
+time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64
+for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
+separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
+are comparable only within an estimator implementation; we replicate the HRM analysis under a
+second, earlier implementation (n=8,192) and report its scale separately.
+
+**The settling criterion.** Late drift — the mean z_H displacement over the final four
+segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
+(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
+under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
+trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
+method on the pooled log distribution; every result below is reported with a full percentile
+sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
+point: both bands have nonzero characteristic velocity.
+
+**Design.** Crossing the settling criterion with answer correctness yields four cells:
+settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
+analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
+beyond the settling split? And when along the trajectory does the discriminative signal exist?
+
+# 3 Results
+
+## 3.1 Failure is wandering: the 2×2 decomposition
+
+At the end of inference, success and failure occupy different dynamical regimes almost without
+exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
+settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
+late drift of 96.5% of successes, so no threshold assignment can place a failure in the
+settled band. Failed trajectories also remain locally expansive over the full window (median
+λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
+(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.55%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
+settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
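Annotation (not part of the commit): the 2×2 cell assignment and the threshold-free comparison described in §2 and §3.1 can be sketched as below. This is a minimal illustration under stated assumptions: the settling threshold is passed in by hand (the draft sets it with Otsu's method on the pooled log distribution, which is not reproduced here), the function names are hypothetical, and the numbers are toy values, not data from the runs.

```python
from collections import Counter
from typing import Sequence


def cell_label(settled: bool, correct: bool) -> str:
    """A settled-correct, B settled-wrong, C unsettled-correct, D unsettled-wrong."""
    if settled:
        return "A" if correct else "B"
    return "C" if correct else "D"


def two_by_two(late_drift: Sequence[float], correct: Sequence[bool],
               threshold: float) -> Counter:
    """Count trajectories per cell; 'settled' means late drift below `threshold`."""
    return Counter(cell_label(d < threshold, ok) for d, ok in zip(late_drift, correct))


def successes_below_min_failure(late_drift: Sequence[float],
                                correct: Sequence[bool]) -> float:
    """Threshold-free summary: share of successes whose late drift is lower than
    the smallest late drift observed among failures."""
    fails = [d for d, ok in zip(late_drift, correct) if not ok]
    wins = [d for d, ok in zip(late_drift, correct) if ok]
    if not fails or not wins:
        raise ValueError("need at least one success and one failure")
    floor = min(fails)
    return sum(d < floor for d in wins) / len(wins)


# Toy values only (assumed for illustration).
drift = [0.9, 1.0, 1.1, 50.0, 46.0, 120.0, 300.0]
ok = [True, True, True, True, False, False, False]

counts = two_by_two(drift, ok, threshold=10.0)
share = successes_below_min_failure(drift, ok)
print(dict(counts), share)
```

The second function needs no threshold at all: if the least mobile failure already moves faster than most successes, then any threshold that keeps those successes settled leaves every failure unsettled.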
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy varies strongly with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
+the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (idx-paired
+with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
+correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
+failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
+247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
+known late-run collapse of this variant coincides with the settled band itself destabilizing
+(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
+objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
+pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
+(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
+Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
+restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
+offline_followups/followups.md.]*
diff --git a/paper/style_contract.md b/paper/style_contract.md
new file mode 100644
index 0000000..4e0f319
--- /dev/null
+++ b/paper/style_contract.md
@@ -0,0 +1,47 @@
+# Style contract (operative checklist for every section pass)
+
+Sources: Shaib et al. 2026 slop taxonomy (density/relevance/coherence are the load-bearing
+dimensions); Buschek's four reviewer complaints; 2026 banned-pattern lists. Applied as a per-
+section pass, not a vibe.
+
+## Hard bans (lexical)
+delve, tapestry, landscape (figurative), testament, pivotal, showcase, intricate, vibrant,
+seamless, elegant, dramatically, novel (self-describing), comprehensive (self-describing),
+notably, crucially, interestingly, importantly, "It is worth noting", "Moreover," as sentence
+opener, "In recent years", "has garnered attention", "paradigm shift".
+Technical terms that overlap ban lists (e.g., "robust" in "threshold-robust") stay.
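Annotation (not part of the commit): the lexical bans lend themselves to a mechanical pre-check. The sketch below is one possible way to do it, assuming whole-word, case-insensitive matching per line; `find_banned` and the `BANNED` list are hypothetical names, and the list deliberately leaves out the context-dependent entries ("landscape" when figurative, "novel" and "comprehensive" when self-describing, "Moreover," as a sentence opener), which still need a human read.

```python
import re
from typing import List, Tuple

# Context-free entries copied from the hard-ban list; context-dependent ones are omitted.
BANNED = [
    "delve", "tapestry", "testament", "pivotal", "showcase", "intricate",
    "vibrant", "seamless", "elegant", "dramatically", "notably", "crucially",
    "interestingly", "importantly", "it is worth noting", "in recent years",
    "has garnered attention", "paradigm shift",
]


def find_banned(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, phrase) for every banned phrase found as whole words."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for phrase in BANNED:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, phrase))
    return hits


sample = (
    "We measure it directly.\n"
    "Notably, the result is pivotal.\n"
    "The split is threshold-robust."
)
hits = find_banned(sample)
for lineno, phrase in hits:
    print(f"line {lineno}: banned phrase '{phrase}'")
```

Because "robust" is not in the list, technical uses such as "threshold-robust" pass without needing an allowlist, which matches the exemption stated in the contract.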
+
+## Punctuation & rhythm
+- Em dash: ≤2 per section, never two in one sentence, never as paired parenthetical.
+- No exclamation marks. Semicolons fine.
+- Every paragraph contains at least one sentence under ~12 words.
+- No uniform paragraph shapes: don't end three consecutive paragraphs with a summary clause.
+
+## Structure
+- The contrast template ("X, not Y" / "not X, but Y") has a total budget of ONE device:
+  the title and its single echo at the end of paragraph 1. Zero elsewhere. Express other
+  contrasts by stating the positive finding and letting the numbers carry the negation.
+- Rule-of-three closers only when the three items are an exhaustive measured list.
+- No chronological narration of the investigation. Structure follows claims.md.
+- Results topic sentences contain a finding or a number, never a plan ("We then examine...").
+
+## Integrity (Buschek's four, mapped to our risks)
+- Marketing language: zero unevidenced evaluative adjectives. "near-perfect" only with the AUC
+  in the same sentence.
+- Performative related work: every citation paired with the precise claim it supports; check
+  each against papers/notes/*.md before the citation lands.
+- Misrepresentation: prior-work characterizations quote verbatim with page/section; corrections
+  target the quoted words only (claims.md C6a discipline).
+- Stretched summaries: no interpretive metaphors for results. One flourish per section maximum,
+  and it must be cuttable without losing content.
+
+## Density (the actual anti-slop weapon)
+- Every sentence must add a number, a definition, a caveat, or a logical step. Otherwise cut.
+- Modal verbs (could/might/may) confined to Discussion and future work.
+- Hedges appear only where claims.md concedes; one hedge per concession, not a seesaw.
+
+## Honesty rails (project-specific)
+- Never compare λ values across estimator versions; state scale ownership at first use.
+- "Settled" always defined as the measured low-velocity band, with its residual velocity given.
+- No mechanism claims; observations and their direct logical consequences only.
+- No promises (code release, future experiments) that the authors have not decided.
diff --git a/paper/validation/validate_le_estimator.py b/paper/validation/validate_le_estimator.py
new file mode 100644
index 0000000..7ad4fd0
--- /dev/null
+++ b/paper/validation/validate_le_estimator.py
@@ -0,0 +1,107 @@
+"""T0.1 — validate the QR/Benettin FTLE estimator core against systems with KNOWN spectra.
+
+Reimplements the IDENTICAL accumulation used in diagnose_{trm,hrm}_joint.py:
+  Q in R^{n x k} init random-orthonormal; each step apply the (known) Jacobian to Q's columns;
+  every t_ons steps QR-decompose, accumulate sum of log|diag(R)|; LE_i = sum / n_qr_steps.
+
+Test systems (known answers):
+  (a) diagonal linear map          LE_i = log|d_i| (exact at all T)
+  (b) symmetric linear map         LE_i = log|eig_i| (exact; eig=singular values)
+  (c) non-normal (shear) map       LE_i = log|eig_i| asympt. (finite-time transient from singular values)
+  (d) Henon map (a=1.4,b=0.3)      LE = {+0.41922, -1.62319} (nonlinear chaotic; literature value)
+
+A passing result = recovered exponents match known to within tolerance, confirming the QR core
+(orthonormalization cadence, log|diag R| bookkeeping, ordering, averaging) is correct.
+No GPU, no model — this isolates the numerical estimator.
+"""
+from __future__ import annotations
+import numpy as np
+
+RNG = np.random.default_rng(0)
+
+
+def qr_le(jac_fn, x0, n_steps, k, t_ons=1, warmup=0):
+    """Benettin/QR LE estimate. jac_fn(x)->(x_next, J) gives next state and Jacobian at x.
+    Mirrors diagnose_*_joint.py: QR every t_ons steps, accumulate log|diag R|, average over QR steps."""
+    x = np.asarray(x0, float)
+    d = x.shape[0]
+    Q, _ = np.linalg.qr(RNG.standard_normal((d, k)))
+    log_R_sum = np.zeros(k)
+    n_qr = 0
+    for t in range(n_steps):
+        x, J = jac_fn(x)
+        Q = J @ Q
+        if (t + 1) % t_ons == 0:
+            Q, R = np.linalg.qr(Q)
+            if t >= warmup:
+                log_R_sum += np.log(np.clip(np.abs(np.diag(R)), 1e-30, None))
+                n_qr += 1
+    return np.sort(log_R_sum / max(n_qr, 1))[::-1]
+
+
+def run():
+    out = ["# T0.1 estimator validation (QR/Benettin core vs known spectra)", ""]
+    tol = 5e-3
+
+    # (a) diagonal
+    d_vals = np.array([1.5, 0.8, 0.3, 0.05])
+    M = np.diag(d_vals)
+    known = np.sort(np.log(np.abs(d_vals)))[::-1]
+    est = qr_le(lambda x: (x, M), np.ones(4), 4000, k=4)  # linear: J state-independent, don't grow x
+    out += [f"(a) diagonal linear: known {np.round(known,4)}",
+            f" recovered {np.round(est,4)} max|err|={np.max(np.abs(est-known)):.2e} "
+            f"{'PASS' if np.max(np.abs(est-known))<tol else 'FAIL'}"]
+
+    # (b) symmetric
+    A = RNG.standard_normal((5, 5)); S = (A + A.T) / 2
+    # scale so spectral radius < ~1.3 (keep magnitudes spread, finite)
+    S = 0.9 * S / np.max(np.abs(np.linalg.eigvalsh(S)))
+    eig = np.linalg.eigvalsh(S)
+    known = np.sort(np.log(np.abs(eig)))[::-1]
+    est = qr_le(lambda x: (x, S), np.ones(5), 8000, k=5)
+    out += [f"(b) symmetric linear: known {np.round(known,4)}",
+            f" recovered {np.round(est,4)} max|err|={np.max(np.abs(est-known)):.2e} "
+            f"{'PASS' if np.max(np.abs(est-known))<tol else 'FAIL'}"]
+
+    # (c) non-normal shear: LE -> log|eig| asymptotically; finite-time transient from singular values
+    N = np.array([[1.1, 5.0], [0.0, 0.6]])  # eigenvalues 1.1, 0.6 (triangular); highly non-normal
+    known = np.sort(np.log(np.abs(np.linalg.eigvals(N))))[::-1]
+    est_long = qr_le(lambda x: (x, N), np.ones(2), 40000, k=2)
+    sv = np.sort(np.log(np.linalg.svd(N, compute_uv=False)))[::-1]
+    est_short = qr_le(lambda x: (x, N), np.ones(2), 5, k=2)
+    out += [f"(c) non-normal shear: known asymptotic log|eig| {np.round(known,4)}",
+            f" recovered (T=40000) {np.round(est_long,4)} "
+            f"max|err|={np.max(np.abs(est_long-known)):.2e} "
+            f"{'PASS' if np.max(np.abs(est_long-known))<1e-2 else 'FAIL'}",
+            f" single-step log singular values {np.round(sv,4)} (finite-time transient ref)",
+            f" recovered (T=5, finite-time) {np.round(est_short,4)} "
+            f"(should sit between sv and asymptotic -> confirms finite-time != asymptotic)"]
+
+    # (d) Henon map
+    a, b = 1.4, 0.3
+    def henon(x):
+        xn = np.array([1 - a * x[0] ** 2 + x[1], b * x[0]])
+        J = np.array([[-2 * a * x[0], 1.0], [b, 0.0]])
+        return xn, J
+    # settle onto attractor first
+    x = np.array([0.1, 0.1])
+    for _ in range(1000):
+        x, _ = henon(x)
+    known = np.array([0.41922, -1.62319])  # literature (Sprott)
+    est = qr_le(henon, x, 200000, k=2, warmup=1000)
+    out += [f"(d) Henon (a=1.4,b=0.3): literature {np.round(known,4)} (sum={known.sum():.4f})",
+            f" recovered {np.round(est,4)} (sum={est.sum():.4f}) "
+            f"|err λ1|={abs(est[0]-known[0]):.2e} "
+            f"{'PASS' if abs(est[0]-known[0])<5e-3 else 'FAIL'}"]
+
+    out += ["", "Interpretation: (a)(b) confirm exact recovery for normal maps; (c) confirms the",
+            "estimator converges to log|eig| asymptotically while finite-time windows reflect",
+            "singular-value growth (the regime our paper operates in); (d) confirms correct",
+            "recovery on a known chaotic nonlinear system. The QR core is validated."]
+    print("\n".join(out))
+    from pathlib import Path
+    Path(__file__).resolve().parent.joinpath("validation_results.md").write_text("\n".join(out))
+
+
+if __name__ == "__main__":
+    run()
diff --git a/paper/validation/validation_results.md b/paper/validation/validation_results.md
new file mode 100644
index 0000000..008a73a
--- /dev/null
+++ b/paper/validation/validation_results.md
@@ -0,0 +1,17 @@
+# T0.1 estimator validation (QR/Benettin core vs known spectra)
+
+(a) diagonal linear: known [ 0.4055 -0.2231 -1.204 -2.9957]
+ recovered [ 0.4047 -0.2236 -1.2032 -2.9953] max|err|=7.99e-04 PASS
+(b) symmetric linear: known [-0.1054 -0.6391 -0.9467 -1.4414 -3.074 ]
+ recovered [-0.1055 -0.6392 -0.9465 -1.4415 -3.0738] max|err|=1.85e-04 PASS
+(c) non-normal shear: known asymptotic log|eig| [ 0.0953 -0.5108]
+ recovered (T=40000) [ 0.0953 -0.5109] max|err|=3.84e-05 PASS
+ single-step log singular values [ 1.6396 -2.0551] (finite-time transient ref)
+ recovered (T=5, finite-time) [ 0.5468 -0.9623] (should sit between sv and asymptotic -> confirms finite-time != asymptotic)
+(d) Henon (a=1.4,b=0.3): literature [ 0.4192 -1.6232] (sum=-1.2040)
+ recovered [ 0.4193 -1.6233] (sum=-1.2040) |err λ1|=8.44e-05 PASS
+
+Interpretation: (a)(b) confirm exact recovery for normal maps; (c) confirms the
+estimator converges to log|eig| asymptotically while finite-time windows reflect
+singular-value growth (the regime our paper operates in); (d) confirms correct
+recovery on a known chaotic nonlinear system. The QR core is validated.
\ No newline at end of file |
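Annotation (not part of the commit): the `sum=-1.2040` reported for the Hénon case can be checked independently of the literature values. The Jacobian used in `henon` has constant determinant $-b$, and the sum of the Lyapunov exponents equals the time-averaged log of the absolute Jacobian determinant. A short worked check, using only the map and parameters that appear in the script:

```latex
\[
J(x) = \begin{pmatrix} -2 a x_1 & 1 \\ b & 0 \end{pmatrix},
\qquad \det J(x) = -b \quad \text{for every } x,
\]
\[
\lambda_1 + \lambda_2
  = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \ln \bigl|\det J(x_t)\bigr|
  = \ln |{-b}| = \ln 0.3 \approx -1.2040 .
\]
```

The quoted literature pair gives $0.41922 - 1.62319 = -1.20397$, which agrees with $\ln 0.3$ to the printed precision, so the recovered sum is consistent with the volume-contraction identity as well as with the individual exponents.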
