| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/paper | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/paper')
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | research/flossing/paper/claims.md | 37 |
| -rw-r--r-- | research/flossing/paper/experiment_framework.md | 43 |
| -rw-r--r-- | research/flossing/paper/intro.md | 57 |
| -rw-r--r-- | research/flossing/paper/outline.md | 79 |
| -rw-r--r-- | research/flossing/paper/rainer_followup_draft.md | 37 |
| -rw-r--r-- | research/flossing/paper/sample_intro.md | 49 |
| -rw-r--r-- | research/flossing/paper/setup_results.md | 129 |
| -rw-r--r-- | research/flossing/paper/style_contract.md | 47 |
8 files changed, 478 insertions, 0 deletions
diff --git a/research/flossing/paper/claims.md b/research/flossing/paper/claims.md
new file mode 100644
index 0000000..590b388
--- /dev/null
+++ b/research/flossing/paper/claims.md
@@ -0,0 +1,37 @@
+# Claim table (write structure FROM this, not from session chronology)
+
+Rule: every claim states its evidence, its strongest counter-reading, and where that
+counter-reading is killed or conceded. A claim that can't fill all four columns gets cut
+or demoted to an observation in the discussion.
+
+| # | Claim | Evidence | Strongest counter-reading | Disposition |
+|---|---|---|---|---|
+| C1 | Per-example full-trajectory FTLE separates success/failure near-perfectly in trained HRM and TRM on Sudoku-Extreme | AUC(−λ₁→correct): HRM 0.984 (n=8192) / 0.987 (n=2048, 2nd estimator); TRM official 87.6% ckpt 0.993 (n=2048). Medians: HRM joint −0.152 vs +0.032; TRM +0.012 vs +0.103 | Estimator artifact | Replicates across two estimator implementations and two architectures; report both scales, never cross-compare |
+| C2 | Failures overwhelmingly never settle; settled-wrong is rare (HRM) to absent (TRM) | TRM: 0/254 failures settled, threshold-free (min wrong-drift > late-drift of 96.5% of correct); HRM: 21/3894 (0.55%) strict-band; replicated 5/971 on 2nd estimator | Threshold choice ("settled" is arbitrary) | State threshold-free version (distribution separation); show full percentile sweep; define settled band by its narrow characteristic velocity |
+| C3 | The rare settled-wrong failures are selector-blind: success-like contraction AND success-like halting confidence | n=21: λ₁ med −0.842 (A: −0.867), q_halt(final) +7.47 (= A), all would halt early (halted_at 4–9), token_acc med 0.62 | Small n | Concede explicitly; the point is existence + boundedness (~0.5%), not precision |
+| C4 | The FTLE-outcome signal is not reducible to non-settling, and not a difficulty artifact | Drift-decile-matched AUC within unsettled stratum: 0.879 (n=8192) / 0.900 (2nd estimator); #givens-bin AUC 0.982 vs 0.984 overall | (a) residual within-decile drift variation; (b) #givens is a weak difficulty proxy | (a) deciles are narrow (table shows ranges); (b) concede openly, name solver-backtracks as the proper proxy, future work |
+| C5 | The chaotic signature is outcome-concurrent, not antecedent: nothing in the first 4 ACT steps forecasts eventual success among not-yet-solved examples — and on HRM the dynamical signals point the other way | Restricted to not-correct@4: TRM AUC λ₁ 0.543 / drift 0.492 / q_halt 0.521 (n=626); HRM λ₁ 0.448 / drift 0.312 (sign-reversed; +drift→success AUC 0.688) / q_halt 0.734 (n=1342) | Window length (only 4 steps tested) | Concede; horizon sweep = explicit future work; 4 chosen to match training window |
+| C6a | Correction: TRM failures do not "plateau at stable high-loss attractors" in state space | C2 + λ₁(D)=+0.103 + residual velocity ≈56/step ≈0.77× early; their own Fig 5 oscillation consistent | We strawman "attractor" (bounded attracting set can be chaotic) | Quote their exact wording ("stable", "local minima", "stabilize rather than explore"); credit their loss/boundedness data and intervention; correct only the settledness reading |
+| C6b | Refinement: Ren & Liu's four modes confirmed and quantified; wrong-fixed-point mode is real but marginal at trajectory end | Their mode (4) = our strict B (0.5%); mode (3) = our D (~99.5% of failures) | We measure end-of-window only; mid-trajectory lingering invisible | Concede explicitly; their non-trivial-success lingering claim untouched |
+| C7a | Training widens the success/failure gap from the failure side: failures become more expansive while the success regime barely moves | λ₁(D): +0.036→+0.102 over the TRM series, λ₁(A) within ±0.03 of 0 throughout | Estimator-scale / single-run | Holds; HRM series shows the mass-migration version. Robust claim. |
+| C7b | **DEMOTED by matched-objective control (E6).** Under MATCHED objective (step9 fixed-unroll, E-vs-F / G-vs-H), perturbation training's effect on the wandering cell is small and inconsistent at matched step | HRM fD 0.387→0.369 / 0.385→0.379 / 0.381→**0.387**(↑ at best) / 0.410→0.387; TRM 0.344→0.305 / 0.396→0.361 / 0.334→**0.361**(↑ at best) / 0.312→0.270 | The earlier large shrink (D 274→175) was partly a fixed-unroll-vs-ACT-streaming artifact, NOT a pure intervention effect | **Rewrite §3.4**: report the matched effect as small/equivocal; drop the strong "intervention shrinks wandering" reading. The May-28 mismatched comparison is retired. C7a stands; C7b does not support an intervention claim. |
+
+## The spine (one sentence)
+Direct per-example measurement of settling and tangent expansion decomposes recursive-reasoner
+failure: failures are overwhelmingly trajectories that never settle, the chaotic signature is
+concurrent with — not antecedent to — the outcome, and it is not explained by non-convergence
+alone or by problem difficulty.
+
+## What this paper is NOT claiming (write these into Discussion to pre-empt)
+- No mechanism for WHY trajectories fail to find the settled band (explicitly open).
+- No claim that early intervention is impossible in principle — only that λ/drift at 4 steps carry no signal.
+- No claim about tasks beyond Sudoku-Extreme or models beyond HRM/TRM.
+
+## Anti-patterns checklist (apply at style pass)
+- [ ] No chronological narration of the investigation; structure = claims order.
+- [ ] Numbers in topic sentences; every section's first sentence is a finding, not a plan.
+- [ ] No "notably/crucially/interestingly/delve/underscore"; hedges only where the claim table says concede.
+- [ ] Related work = positioning (each paragraph ends with the gap), not annotated bibliography.
+- [ ] Negative result (C5) framed as a finding with its own section, not a limitation apology.
+- [ ] Limitations: specific, short, no re-hedging of already-scoped claims.
+- [ ] Prose paragraphs in the body; tables only for numbers; no bullet lists in Results.
diff --git a/research/flossing/paper/experiment_framework.md b/research/flossing/paper/experiment_framework.md
new file mode 100644
index 0000000..945a9a3
--- /dev/null
+++ b/research/flossing/paper/experiment_framework.md
@@ -0,0 +1,43 @@
+# Experiment framework — locked 2026-06-12
+
+Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
+so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
+~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).
+
+## Locked measurement protocol (uniformity rules)
+- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
+  n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
+- λ values never compared across estimator implementations; scale ownership stated at first use.
+- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
+  sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
+- idx-pairing: same seed + same n across any runs that will be joined per example.
+- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
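The settling rule in the protocol above (late z_H drift as the mean of the final 4 segments, Otsu threshold as primary) could be sketched as follows. This is a minimal illustration on synthetic data, not the code in `diagnose_{hrm,trm}_joint.py`: the function names, the 256-bin histogram, and the synthetic drift values are all assumptions made for the example.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Return the cut that maximizes between-class variance of a 1-D sample."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()
    w0 = np.cumsum(weights)                 # mass at or below each bin
    w1 = 1.0 - w0                           # mass above each bin
    cum_mean = np.cumsum(weights * centers)
    total_mean = cum_mean[-1]
    valid = (w0 > 1e-12) & (w1 > 1e-12)     # skip cuts that leave a class empty
    between = np.zeros_like(w0)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (
        w0[valid] * w1[valid]
    )
    return edges[1:][np.argmax(between)]    # right edge of the best bin


def late_drift(drift_per_segment, last=4):
    """Mean displacement over the final `last` segments, one value per example."""
    return np.asarray(drift_per_segment)[:, -last:].mean(axis=1)


# Synthetic stand-in for per-segment z_H drift: 800 "settled" and 200
# "wandering" trajectories over 16 segments (values are hypothetical).
rng = np.random.default_rng(0)
low = rng.normal(0.0, 0.05, size=(800, 16))    # log10 drift near 0
high = rng.normal(1.7, 0.15, size=(200, 16))   # log10 drift near 1.7
drift = 10.0 ** np.vstack([low, high])

log_late = np.log10(late_drift(drift))
cut = otsu_threshold(log_late)
settled = log_late < cut
print(f"log10 threshold = {cut:.3f}, settled = {settled.sum()} / {settled.size}")
```

Thresholding in log scale matches the protocol's description of a low band; the percentile sweep and the threshold-free statement listed alongside Otsu would be separate checks and are not shown here.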
+
+## Claims → gaps → runs
+
+| ID | Closes | Run | Spec | Cost | Priority / target |
+|---|---|---|---|---|---|
+| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
+| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
+| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** |
+| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
+| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | ~1 h shared | P2 / ICLR |
+| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | CPU only | P2 / ICLR |
+| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | ~5–10 h shared | P2 / ICLR |
+| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | **multi-day dedicated GPU** — needs explicit go | P2-gated / ICLR |
+| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | dev + ~2 h | P3 / ICLR appendix |
+| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | ~5 h shared | P3 |
+
+## Decision rules
+- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
+- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
+- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
+- New results that contradict current text → claims.md updated first, prose second.
+
+## Status board (update as runs land)
+- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
+- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
+- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
+- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
+- [ ] E3, E4, E7 — phase-2
+- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.
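Row E1 lists a "rule-of-three bound" on the B-fraction. A minimal sketch of that arithmetic follows, assuming it is applied to the TRM count from C2 (0 settled failures out of 254). The function names are illustrative and not taken from the repository, and the exact zero-event bound is shown only for comparison.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on a rate when 0 events occur in n trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def rule_of_three(n):
    """Common approximation to the 95% bound above: roughly 3 / n."""
    return 3.0 / n


n_failures = 254  # TRM failures in the n=2048 sample; none settled (C2)
exact = zero_event_upper_bound(n_failures)
approx = rule_of_three(n_failures)
print(f"exact 95% upper bound on settled-wrong fraction: {exact:.4f}")
print(f"rule-of-three approximation:                     {approx:.4f}")
```

Under these assumptions both bounds come out near 1.2% of failures, so an observed zero is compatible with a small nonzero settled-wrong rate in TRM, in line with the ~0.5% measured for HRM.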
diff --git a/research/flossing/paper/intro.md b/research/flossing/paper/intro.md
new file mode 100644
index 0000000..85f06e7
--- /dev/null
+++ b/research/flossing/paper/intro.md
@@ -0,0 +1,57 @@
+# Recursive Reasoning Models Fail by Wandering, Not by Settling
+
+## 1 Introduction
+
+Recursive reasoning models such as the Hierarchical Reasoning Model (HRM; Wang et al., 2025)
+and the Tiny Recursive Model (TRM; Jolicoeur-Martineau, 2025) solve constraint-satisfaction
+puzzles that defeat far larger language models, by iterating a small network on a latent state
+for hundreds of updates per puzzle. When such a model fails, what is dynamically different
+about the trajectory it produced? Two recent mechanistic studies answer in attractor language.
+Failed TRM runs "plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026); failed
+HRM runs converge to spurious fixed points that rival the correct one (Ren & Liu, 2026). The
+evidence behind both labels is indirect, resting on loss plateaus and two-dimensional
+projections of 512-dimensional trajectories, and the labels disagree about the basic character
+of failure: premature stability in one account, partly aimless drift in the other. Neither
+measures the trajectory's stability directly. We do, per example, and the measurements support
+a third description: recursive reasoning models fail by wandering, not by settling.
+
+Across 2,048 to 8,192 held-out Sudoku-Extreme puzzles, correct trajectories end inside a
+narrow low-velocity band of the latent dynamics, and failures essentially never do. In an
+official-recipe TRM at 87.6% test accuracy, none of 254 failures settles: the least mobile
+failure still moves faster at the end of inference than 96.5% of successes, a separation of
+distributions that no threshold choice can undo, and failed trajectories remain locally
+expansive throughout (median leading finite-time Lyapunov exponent λ₁ = +0.103, against +0.012
+for successes; AUC 0.993). HRM shows the same structure with one addition. Settled-but-wrong
+trajectories exist, but they account for 0.55% of failures, carry success-like contraction
+(λ₁ = −0.84, against −0.87 for settled successes) and success-like halting confidence, and
+every one of them would have halted early under adaptive computation. The wrong-attractor
+failure mode is real, rare, and the only failure a confidence-based selector cannot catch.
+
+Two controls locate what the Lyapunov signature adds, and a third experiment locates when it
+exists. Matched for displacement level within the unsettled population, λ₁ still separates
+eventual successes from failures (decile-matched AUC 0.88–0.90), so the exponent does more
+than restate non-convergence. Binned by the number of givens, the separation is unchanged
+(within-bin AUC 0.982, against 0.984 unconditioned), so it is not an artifact of problem
+difficulty. It is, however, strictly retrospective. Restricted to puzzles still unsolved after
+four of sixteen segments, neither early-window exponents nor early state velocity predicts
+which trajectories will eventually succeed (AUC ≈ 0.5 in TRM), and in HRM the association
+inverts — among the undecided, the trajectories that move more in the early segments are the
+ones that go on to solve the puzzle (positive-direction AUC 0.69). The chaos of failure
+arrives with the failure; nothing dynamical in the early trajectory anticipates it.
+
+These measurements redraw the intervention map for this model class. Because failure is almost
+never a stable wrong answer, restart-and-select inference strategies have a high ceiling and a
+quantifiable blind spot of roughly half a percent. Because the early trajectory carries no
+dynamical death sentence, compute spent on early failure prediction is compute wasted, and
+restart diversity is the better buy. Our contributions: (i) per-example, outcome-conditioned
+measurement of settling and finite-time Lyapunov spectra in HRM and TRM, at sample sizes up to
+8,192 and replicated across two estimator implementations; (ii) a decomposition of failure
+that corrects the settled-attractor reading and bounds the wrong-attractor mode at ~0.5% of
+failures; (iii) controls showing the signature is not reducible to non-convergence or
+difficulty; (iv) evidence that the signature is concurrent with the outcome and carries no
+early-warning content at the granularity tested.
+
+---
+*[em-dash count: 1. Contrast-template count: title + one echo (end of ¶1). Flourish count:
+1 ("death sentence", ¶4) — cuttable. "essentially never" is the one hedge in ¶2, scoped by
+the 0.55% in the next sentence.]*
diff --git a/research/flossing/paper/outline.md b/research/flossing/paper/outline.md
new file mode 100644
index 0000000..0dde354
--- /dev/null
+++ b/research/flossing/paper/outline.md
@@ -0,0 +1,79 @@
+# Outline — "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12)
+
+Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ ·
+remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations),
+abstract, tables T1–T3 + figures F3/F4 composition.
+
+Target: ~8 pages main. Every section header below lists [claims served] and [assets].
+
+## 1 Introduction [C1, spine]
+- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they
+  fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss
+  curves and 2-D projections; we measure the dynamics directly, per example.
+- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99;
+  concurrent-not-antecedent).
+- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling
+  measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two
+  published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the
+  early-window null + sign reversal.
+- NO general AI-reasoning throat-clearing. First sentence is about the object of study.
+
+## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table]
+- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876),
+  Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied.
+- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization;
+  per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat.
+- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full
+  percentile sweep + threshold-free statement in appendix); cells A/B/C/D.
+
+## 3 Results
+- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell,
+  strict-B table + fig_hrm_strictB_profiles]
+  Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the
+  settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest
+  token-acc are all 17-givens puzzles).
+- 3.2 What the signal is not [C4; assets: decile table, givens table]
+  Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry
+  the numbers.
+- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables]
+  The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734
+  vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of
+  the signature.
+- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare]
+  Gap widens via λ₁(D); multi4 shrinks D-cell mass at matched steps (preliminary, objective
+  caveat); multi4 collapse = λ₁(A) sign flip.
+
+## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*]
+- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing
+  App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome.
+- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention;
+  quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy
+  (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller
+  are consistent, at the design level, with where our measurements localize the signal — cite,
+  don't claim confirmation.
+- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own
+  TorchDEQ negative result; Solve-the-Loop) — what "enforce settling" buys and where it failed;
+  our measurements say which kind of settling is the operative one.
+
+## 5 Implications (restrained, half page)
+- Intervention design space bifurcates: widen/deepen the settled tube at training time
+  (perturbation training, equilibrium losses) vs restart-and-select at inference
+  (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%).
+- Early pruning/reallocation unsupported at 4-step granularity; on HRM the gradient of usable
+  early signal lives in the learned head, not the generic dynamical quantities.
+
+## 6 Limitations & future
+Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtracks next);
+single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory lingering;
+no mechanism offered for why settling fails — measurement paper.
+
+## Figures plan (all exist or one rerun away)
+F1: drift–λ₁ scatter, both models (have).
+F2: per-cell λ₁ + strict-B profiles inset (have).
+F3: decile-matched AUC + givens-binned AUC (compose from CSVs).
+F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set).
+F5: checkpoint evolution (have).
+
+## Order of writing
+1. Results 3.1–3.3 (numbers already final) → 2. Setup → 3. Sec 4 (notes ready) → 4. Intro →
+5. Implications/Limitations → 6. style pass against claims.md checklist.
diff --git a/research/flossing/paper/rainer_followup_draft.md b/research/flossing/paper/rainer_followup_draft.md
new file mode 100644
index 0000000..12091de
--- /dev/null
+++ b/research/flossing/paper/rainer_followup_draft.md
@@ -0,0 +1,37 @@
+Subject: Re: Question on gradient flossing vs forward trajectory stability in recursive reasoning models
+
+Hi Rainer,
+
+A short follow-up to my email of June 5 — we have since measured the things I was speculating
+about, and two results seem worth sharing because they sharpen the question I asked you.
+
+First, conditioning per-example finite-time Lyapunov spectra on both outcome and terminal
+settling (n = 2048–8192, two architectures) shows that failure is almost exclusively
+non-settling: in an official-recipe TRM at 87.6% accuracy, none of 254 failed trajectories
+ever enters the low-velocity band that all successes occupy, and they remain locally expansive
+to the end (median λ₁ +0.10 vs +0.01). "Converged to the wrong attractor" failures exist in
+HRM but make up only ~0.5% of failures. The chaotic signature also survives two controls: it
+persists after matching trajectories on displacement level (so it is not just re-measuring
+non-convergence), and after binning by puzzle difficulty.
+
+Second — and this is the part that genuinely surprised us — the signature is strictly
+concurrent. Among puzzles still unsolved after a quarter of the inference budget, neither the
+early-window exponents nor early state velocity predict which trajectories will eventually
+succeed (AUC ≈ 0.5); in HRM the association even inverts, with eventually-successful
+trajectories moving more in the early phase. So the failed trajectories are not "born chaotic":
+chaos at the end and failure appear together.
+
+This makes me think the right framing for my earlier question is reachability of the settled
+region (escape from a long chaotic transient) rather than per-example landscape quality, which
+would be consistent with your view of flossing as a learning-time tool rather than an
+inference-time one. If you know of work that conditions finite-time exponents on trajectory
+fate in this way — in transient-chaos settings or elsewhere — I would be grateful for a
+pointer; we have not found a precedent.
+
+Best,
+Yuren
+
+---
+[Notes, not part of the email: numbers from analysis_2x2/OBSERVATIONS.md addenda 1-2. Send only
+if/after Rainer replies to the June 5 email, or as a gentle bump after ~2 weeks (June 19+).
+The "born chaotic" phrasing mirrors his literature's transient-chaos vocabulary deliberately.]
diff --git a/research/flossing/paper/sample_intro.md b/research/flossing/paper/sample_intro.md
new file mode 100644
index 0000000..183faa4
--- /dev/null
+++ b/research/flossing/paper/sample_intro.md
@@ -0,0 +1,49 @@
+# Sample section: Introduction (taste-calibration draft)
+
+Recursive reasoning models solve constraint-satisfaction problems that defeat much larger
+language models by iterating a small network on a latent state — up to several hundred state
+updates per puzzle in the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model
+(TRM). When such a model fails, what is dynamically different about the trajectory it
+produced? Recent mechanistic studies have answered with attractor language: failed runs
+"plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026), or converge to spurious
+fixed points that rival the correct one (Ren & Liu, 2026). These accounts rest on indirect
+evidence — loss plateaus, two-dimensional projections of 512-dimensional trajectories — and
+the two papers do not agree: one describes failure as premature stability, the other partly as
+wandering. Neither measures stability itself.
+
+We measure it directly. For every test puzzle we record two per-example quantities along the
+full 16-segment inference trajectory: the finite-time Lyapunov spectrum of the joint latent
+dynamics, and the per-segment state displacement. Conditioning these on outcome over 2,048 to
+8,192 puzzles per model yields a complete decomposition of failure for HRM (52.6% accuracy)
+and an official-recipe TRM (87.6%), and the decomposition contradicts the settled-attractor
+picture. Correct trajectories enter a narrow low-velocity band and stay in it; failed
+trajectories never do. In TRM, not one of 254 failures settles — the least mobile failure still
+moves faster at the end of inference than 96.5% of successes — while remaining locally
+expansive (median λ₁ = +0.103 versus +0.012 for successes; AUC 0.993). In HRM, settled-but-wrong
+trajectories exist but account for 0.55% of failures; the other 99.45% wander. Failure in these
+models is not a wrong attractor. It is the sustained absence of settling.
+
+Two controls sharpen what the Lyapunov signature adds. Matched for displacement level within
+the unsettled population, λ₁ still separates eventual successes from failures (decile-matched
+AUC 0.88–0.90), so the exponent is not merely re-measuring non-convergence; and binning by
+puzzle givens leaves the separation intact (within-bin AUC 0.982 versus 0.984 overall), so it
+is not a difficulty artifact. The signature is, however, strictly retrospective. Restricted to
+puzzles still unsolved after four segments, nothing dynamical about those first four segments
+predicts which will eventually be solved: AUC ≈ 0.5 in TRM for exponent, displacement, and
+halting confidence alike — and in HRM the association inverts, with eventual successes moving
+*more* in the early trajectory than eventual failures (AUC 0.69 in the positive direction).
+The chaos of failure is concurrent with the outcome, not an omen visible at the start.
+
+These measurements reframe both the diagnosis and the levers. Because failure is almost never
+a stable wrong answer, selection-based inference strategies have a high ceiling — final-step
+halting confidence tracks correctness on all but the ~0.5% of failures that settle confidently
+— and because the early trajectory carries no dynamical death sentence, compute is better
+spent on restarts than on early pruning. We quantify both points, correct the published
+attractor labels they depend on, and release the per-example measurement tooling.
+ +--- +*[Style notes for review, not part of the draft: (1) every paragraph opens with a finding or a +question, none with "In recent years"; (2) the two prior papers are quoted precisely and +credited for what their data shows before the correction is made; (3) hedges appear only where +the claim table concedes (e.g., "almost never", "~0.5%"); (4) the one rhetorical flourish — +"not an omen" — is load-bearing; cut it if it reads as flavor.]* diff --git a/research/flossing/paper/setup_results.md b/research/flossing/paper/setup_results.md new file mode 100644 index 0000000..d6976d5 --- /dev/null +++ b/research/flossing/paper/setup_results.md @@ -0,0 +1,129 @@ +# 2 Setup + +**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the +1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact +accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global +batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set, +87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation +halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is +observed for the full budget. Answers are decoded at segment 16. + +**Per-example measurements.** Along each trajectory we record three families of quantities. +First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent +vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products +through every state update and re-orthonormalized by QR at each sub-step; λ_i is the +time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64 +for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L +separately ("drift"). Third, q_halt, exact correctness, and token accuracy. 
Exponent values +are comparable only within an estimator implementation; we replicate the HRM analysis under a +second, earlier implementation (n=8,192) and report its scale separately. + +**The settling criterion.** Late drift — the mean z_H displacement over the final four +segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band +(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width +under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a +trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's +method on the pooled log distribution; every result below is reported with a full percentile +sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed +point: both bands have nonzero characteristic velocity. + +**Design.** Crossing the settling criterion with answer correctness yields four cells: +settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The +analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add +beyond the settling split? And when along the trajectory does the discriminative signal exist? + +# 3 Results + +## 3.1 Failure is wandering: the 2×2 decomposition + +At the end of inference, success and failure occupy different dynamical regimes almost without +exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is +settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the +late drift of 96.5% of successes, so no threshold assignment can place a failure in the +settled band. Failed trajectories also remain locally expansive over the full window (median +λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction +(+0.011). 
The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.55%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
+settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split.
Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy varies strongly with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
+the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (idx-paired
+with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
+correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence.
In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition.
HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
+failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
+247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
+known late-run collapse of this variant coincides with the settled band itself destabilizing
+(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
+objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
+pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
+(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
+Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
+restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
+offline_followups/followups.md.]*
diff --git a/research/flossing/paper/style_contract.md b/research/flossing/paper/style_contract.md
new file mode 100644
index 0000000..4e0f319
--- /dev/null
+++ b/research/flossing/paper/style_contract.md
@@ -0,0 +1,47 @@
+# Style contract (operative checklist for every section pass)
+
+Sources: Shaib et al. 2026 slop taxonomy (density/relevance/coherence are the load-bearing
+dimensions); Buschek's four reviewer complaints; 2026 banned-pattern lists. Applied as a per-
+section pass, not a vibe.
+
+## Hard bans (lexical)
+delve, tapestry, landscape (figurative), testament, pivotal, showcase, intricate, vibrant,
+seamless, elegant, dramatically, novel (self-describing), comprehensive (self-describing),
+notably, crucially, interestingly, importantly, "It is worth noting", "Moreover," as sentence
+opener, "In recent years", "has garnered attention", "paradigm shift".
+Technical terms that overlap ban lists (e.g., "robust" in "threshold-robust") stay.
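The lexical bans listed above could be checked mechanically during a section pass. The following is a minimal sketch under stated assumptions, not part of the commit: the names `BANNED_TERMS` and `find_banned` are hypothetical, and the term list is only the subset of the contract that can be matched without judgment. The qualified entries ("landscape" when figurative, "novel" and "comprehensive" when self-describing, "Moreover," as a sentence opener) are deliberately left to manual review.

```python
import re

# Hypothetical subset of the "Hard bans (lexical)" list: only the entries
# that need no human judgment about figurative or self-describing use.
BANNED_TERMS = [
    "delve", "tapestry", "testament", "pivotal", "showcase", "intricate",
    "vibrant", "seamless", "elegant", "dramatically", "notably", "crucially",
    "interestingly", "importantly", "it is worth noting", "in recent years",
    "has garnered attention", "paradigm shift",
]


def find_banned(text, banned=BANNED_TERMS):
    """Return sorted (line_number, term) pairs for whole-word, case-insensitive hits.

    Multi-word terms are matched across any whitespace, so a phrase split by a
    hard line wrap is still found and reported at the line where it starts.
    """
    hits = []
    for term in banned:
        pattern = re.compile(
            r"\b" + r"\s+".join(re.escape(word) for word in term.split()) + r"\b",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            # 1-indexed line of the first character of the match.
            lineno = text.count("\n", 0, match.start()) + 1
            hits.append((lineno, term))
    return sorted(hits)
```

Run over the text of a section draft, `find_banned` returns the line and term of each hit. Matching is whole-word only, so inflected forms such as "delves" are not caught, and a compound like "threshold-robust" passes because "robust" is not in the list.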
+
+## Punctuation & rhythm
+- Em dash: ≤2 per section, never two in one sentence, never as paired parenthetical.
+- No exclamation marks. Semicolons fine.
+- Every paragraph contains at least one sentence under ~12 words.
+- No uniform paragraph shapes: don't end three consecutive paragraphs with a summary clause.
+
+## Structure
+- The contrast template ("X, not Y" / "not X, but Y") has a total budget of ONE device:
+  the title and its single echo at the end of paragraph 1. Zero elsewhere. Express other
+  contrasts by stating the positive finding and letting the numbers carry the negation.
+- Rule-of-three closers only when the three items are an exhaustive measured list.
+- No chronological narration of the investigation. Structure follows claims.md.
+- Results topic sentences contain a finding or a number, never a plan ("We then examine...").
+
+## Integrity (Buschek's four, mapped to our risks)
+- Marketing language: zero unevidenced evaluative adjectives. "near-perfect" only with the AUC
+  in the same sentence.
+- Performative related work: every citation paired with the precise claim it supports; check
+  each against papers/notes/*.md before the citation lands.
+- Misrepresentation: prior-work characterizations quote verbatim with page/section; corrections
+  target the quoted words only (claims.md C6a discipline).
+- Stretched summaries: no interpretive metaphors for results. One flourish per section maximum,
+  and it must be cuttable without losing content.
+
+## Density (the actual anti-slop weapon)
+- Every sentence must add a number, a definition, a caveat, or a logical step. Otherwise cut.
+- Modal verbs (could/might/may) confined to Discussion and future work.
+- Hedges appear only where claims.md concedes; one hedge per concession, not a seesaw.
+
+## Honesty rails (project-specific)
+- Never compare λ values across estimator versions; state scale ownership at first use.
+- "Settled" always defined as the measured low-velocity band, with its residual velocity given.
+- No mechanism claims; observations and their direct logical consequences only.
+- No promises (code release, future experiments) that the authors have not decided. |
