Diffstat (limited to 'research/flossing/paper')
-rw-r--r-- research/flossing/paper/claims.md | 37
-rw-r--r-- research/flossing/paper/experiment_framework.md | 43
-rw-r--r-- research/flossing/paper/intro.md | 57
-rw-r--r-- research/flossing/paper/outline.md | 79
-rw-r--r-- research/flossing/paper/rainer_followup_draft.md | 37
-rw-r--r-- research/flossing/paper/sample_intro.md | 49
-rw-r--r-- research/flossing/paper/setup_results.md | 129
-rw-r--r-- research/flossing/paper/style_contract.md | 47
8 files changed, 478 insertions, 0 deletions
diff --git a/research/flossing/paper/claims.md b/research/flossing/paper/claims.md
new file mode 100644
index 0000000..590b388
--- /dev/null
+++ b/research/flossing/paper/claims.md
@@ -0,0 +1,37 @@
+# Claim table (write structure FROM this, not from session chronology)
+
+Rule: every claim states its evidence, its strongest counter-reading, and where that
+counter-reading is killed or conceded (the Disposition column). A claim that can't fill all
+four columns (Claim, Evidence, Counter-reading, Disposition) gets cut or demoted to an
+observation in the discussion.
+
+| # | Claim | Evidence | Strongest counter-reading | Disposition |
+|---|---|---|---|---|
+| C1 | Per-example full-trajectory FTLE separates success/failure near-perfectly in trained HRM and TRM on Sudoku-Extreme | AUC(−λ₁→correct): HRM 0.984 (n=8192) / 0.987 (n=2048, 2nd estimator); TRM official 87.6% ckpt 0.993 (n=2048). Medians: HRM joint −0.152 vs +0.032; TRM +0.012 vs +0.103 | Estimator artifact | Replicates across two estimator implementations and two architectures; report both scales, never cross-compare |
+| C2 | Failures overwhelmingly never settle; settled-wrong is rare (HRM) to absent (TRM) | TRM: 0/254 failures settled, threshold-free (min wrong-drift > late-drift of 96.5% of correct); HRM: 21/3894 (0.55%) strict-band; replicated 5/971 on 2nd estimator | Threshold choice ("settled" is arbitrary) | State threshold-free version (distribution separation); show full percentile sweep; define settled band by its narrow characteristic velocity |
+| C3 | The rare settled-wrong failures are selector-blind: success-like contraction AND success-like halting confidence | n=21: λ₁ med −0.842 (A: −0.867), q_halt(final) +7.47 (= A), all would halt early (halted_at 4–9), token_acc med 0.62 | Small n | Concede explicitly; the point is existence + boundedness (~0.5%), not precision |
+| C4 | The FTLE-outcome signal is not reducible to non-settling, and not a difficulty artifact | Drift-decile-matched AUC within unsettled stratum: 0.879 (n=8192) / 0.900 (2nd estimator); #givens-bin AUC 0.982 vs 0.984 overall | (a) residual within-decile drift variation; (b) #givens is a weak difficulty proxy | (a) deciles are narrow (table shows ranges); (b) concede openly, name solver-backtracks as the proper proxy, future work |
+| C5 | The chaotic signature is outcome-concurrent, not antecedent: nothing in the first 4 ACT steps forecasts eventual success among not-yet-solved examples — and on HRM the dynamical signals point the other way | Restricted to not-correct@4: TRM AUC λ₁ 0.543 / drift 0.492 / q_halt 0.521 (n=626); HRM λ₁ 0.448 / drift 0.312 (sign-reversed; +drift→success AUC 0.688) / q_halt 0.734 (n=1342) | Window length (only 4 steps tested) | Concede; horizon sweep = explicit future work; 4 chosen to match training window |
+| C6a | Correction: TRM failures do not "plateau at stable high-loss attractors" in state space | C2 + λ₁(D)=+0.103 + residual velocity ≈56/step ≈0.77× early; their own Fig 5 oscillation consistent | We strawman "attractor" (bounded attracting set can be chaotic) | Quote their exact wording ("stable", "local minima", "stabilize rather than explore"); credit their loss/boundedness data and intervention; correct only the settledness reading |
+| C6b | Refinement: Ren & Liu's four modes confirmed and quantified; wrong-fixed-point mode is real but marginal at trajectory end | Their mode (4) = our strict B (0.5%); mode (3) = our D (~99.5% of failures) | We measure end-of-window only; mid-trajectory lingering invisible | Concede explicitly; their non-trivial-success lingering claim untouched |
+| C7a | Training widens the success/failure gap from the failure side: failures become more expansive while the success regime barely moves | λ₁(D): +0.036→+0.102 over the TRM series, λ₁(A) within ±0.03 of 0 throughout | Estimator-scale / single-run | Holds; HRM series shows the mass-migration version. Robust claim. |
+| C7b | **DEMOTED by matched-objective control (E6).** Under MATCHED objective (step9 fixed-unroll, E-vs-F / G-vs-H), perturbation training's effect on the wandering cell is small and inconsistent at matched step | HRM fD 0.387→0.369 / 0.385→0.379 / 0.381→**0.387**(↑ at best) / 0.410→0.387; TRM 0.344→0.305 / 0.396→0.361 / 0.334→**0.361**(↑ at best) / 0.312→0.270 | The earlier large shrink (D 274→175) was partly a fixed-unroll-vs-ACT-streaming artifact, NOT a pure intervention effect | **Rewrite §3.4**: report the matched effect as small/equivocal; drop the strong "intervention shrinks wandering" reading. The May-28 mismatched comparison is retired. C7a stands; C7b does not support an intervention claim. |
+
+## The spine (one sentence)
+Direct per-example measurement of settling and tangent expansion decomposes recursive-reasoner
+failure: failures are overwhelmingly trajectories that never settle, the chaotic signature is
+concurrent with — not antecedent to — the outcome, and it is not explained by non-convergence
+alone or by problem difficulty.
+
+## What this paper is NOT claiming (write these into Discussion to pre-empt)
+- No mechanism for WHY trajectories fail to find the settled band (explicitly open).
+- No claim that early intervention is impossible in principle — only that λ/drift at 4 steps carry no signal.
+- No claim about tasks beyond Sudoku-Extreme or models beyond HRM/TRM.
+
+## Anti-patterns checklist (apply at style pass)
+- [ ] No chronological narration of the investigation; structure = claims order.
+- [ ] Numbers in topic sentences; every section's first sentence is a finding, not a plan.
+- [ ] No "notably/crucially/interestingly/delve/underscore"; hedges only where the claim table says concede.
+- [ ] Related work = positioning (each paragraph ends with the gap), not annotated bibliography.
+- [ ] Negative result (C5) framed as a finding with its own section, not a limitation apology.
+- [ ] Limitations: specific, short, no re-hedging of already-scoped claims.
+- [ ] Prose paragraphs in the body; tables only for numbers; no bullet lists in Results.
diff --git a/research/flossing/paper/experiment_framework.md b/research/flossing/paper/experiment_framework.md
new file mode 100644
index 0000000..945a9a3
--- /dev/null
+++ b/research/flossing/paper/experiment_framework.md
@@ -0,0 +1,43 @@
+# Experiment framework — locked 2026-06-12
+
+Purpose: every paper claim gets its evidence gap named and the run that closes it specified,
+so all remaining GPU/CPU work is execution against this table. Targets: workshop freeze
+~Jun 25 (COLM deadline Jul 12), ICLR freeze ~Sep 5 (deadline ~Sep 16, unconfirmed).
+
+## Locked measurement protocol (uniformity rules)
+- Estimator: current `diagnose_{hrm,trm}_joint.py` (joint JVP+QR, k=8, t_ons=1); seed 0 primary;
+ n=2048 headline, n=512 series/sweeps; batch 16 (TRM) / 32 (HRM).
+- λ values never compared across estimator implementations; scale ownership stated at first use.
+- Settled = low band of late z_H drift (mean of final 4 segments), Otsu primary + percentile
+ sweep + threshold-free statement; z_L and combined-drift definitions reported as robustness.
+- idx-pairing: same seed + same n across any runs that will be joined per example.
+- All new npz under analysis_2x2/ subdirs with queue logs; nothing hand-run on GPU.
+
+## Claims → gaps → runs
+
+| ID | Closes | Run | Spec | Cost | Priority / target |
+|---|---|---|---|---|---|
+| E1 | C1/C2 formal uncertainty; C7-TRM matched pair | OFFLINE: bootstrap CIs (cells, AUCs, B-fraction rule-of-three bound); z_L/combined settling robustness; 2×2 on existing TRM official multi4 npz (@35805 best, @65100 final) vs baseline @58590 | existing npz | CPU, now | **P0 / workshop** |
+| E2 | C1/C2 run-level replication (HRM has 1 training run) | diagnose step9_E (HRM fixed-unroll baseline 50k) best.pt + final.pt + 2 mid ckpts, n=512–2048 | loader risk: step9 .pt may need key remap; fix on first failure | ~1–2 h shared | **P1 / workshop** |
+| E5 | C5 single-horizon caveat → "when does fate become legible" curve | horizon sweep h∈{2,4,6,8,10,12} segments, both models, idx-paired to full-window retest (seed 0, n=2048) | env-var horizon variant of *_short.py | ~5 h shared | **P1 / workshop headline fig** |
+| E6 | C7 matched-objective intervention probe | (a) verify provenance of diag_hrm_multi4_* run; (b) diagnose step9 pairs E vs F (HRM), G vs H (TRM) at {12500, 25000, best, final}, n=512; compare at matched ACCURACY (not matched step) + best-vs-best | | ~2–3 h shared | **P1 / ICLR (workshop if cheap)** |
+| E3 | C3 "would have halted" + C-cell fate beyond window | 32-segment unroll restricted to B∪C idx sets (≤300 examples/model) + true-ACT inference on HRM's 21 B-cell idx | | ~1 h shared | P2 / ICLR |
+| E4 | C4 difficulty proxy weakness | CPU sudoku solver with backtrack counting on the 81-token inputs; redo difficulty control binned by backtracks | | CPU only | P2 / ICLR |
+| E7 | Sec-5 implication → measured result | restart-and-select on TRM official @58590: K∈{1,2,4,8,16,32} h_init Gaussian restarts; select by q_halt vs late-drift vs λ-proxy vs oracle; n=1024; quantify B-blind-spot empirically | | ~5–10 h shared | P2 / ICLR |
+| E8 | "Sudoku-only" limitation (biggest generalization win) | build Maze dataset; train TRM official-recipe Maze; full diagnostic suite on it | | **multi-day dedicated GPU**; needs explicit go | P2-gated / ICLR |
+| E9 | transverse/readout framing | logit-space FTLE (diagnose-script modification) | | dev + ~2 h | P3 / ICLR appendix |
+| E10 | sampling-seed robustness (if bootstrap deemed insufficient) | rerun headline diags seeds 1,2 | | ~5 h shared | P3 |
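
The rule-of-three bound named in row E1 can be worked through for the TRM count quoted in claims.md (0 settled failures out of 254). This is a minimal sketch assuming a binomial model with independent examples; the function names are illustrative and are not taken from the offline scripts.

```python
def rule_of_three_upper(n):
    """Approximate 95% upper bound on an event rate after 0 events in n trials."""
    return 3.0 / n

def exact_zero_event_upper(n, alpha=0.05):
    """Exact one-sided bound: solve (1 - p)^n = alpha for p."""
    return 1.0 - alpha ** (1.0 / n)

n_failures = 254  # TRM failures, none of which settled (from claims.md, C2)
approx = rule_of_three_upper(n_failures)
exact = exact_zero_event_upper(n_failures)
print(f"rule of three: {approx:.4f}, exact: {exact:.4f}")
```

Under these assumptions both versions put the upper bound on the TRM settled-wrong fraction near 1.2% of failures; the approximation is slightly more conservative than the exact form.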
+
+## Decision rules
+- Workshop version ships on E1+E2+E5 (+E6 if done): decomposition, controls, horizon curve.
+- ICLR adds E3/E4/E6/E7 (+E8 if approved in time: training must START by ~Jul 1).
+- Any FAILED queue job: fix loader/script, rerun; never substitute hand-run results.
+- New results that contradict current text → claims.md updated first, prose second.
+
+## Status board (update as runs land)
+- [x] E1 offline batch — done 2026-06-12, see offline_followups/phase1_e1.md
+- [x] E2 done — step9_E (2nd HRM run, fixed-unroll): B≈0 replicates (best 1/743=0.13%, final 3/832=0.36%). C2 STRENGTHENED (now 2 estimators × 2 archs × 2 training runs × 2 objectives).
+- [x] E5 done — horizon sweep {2,4,6,8,10,12}×2 models. C5 STRENGTHENED to a curve: dynamics never legible early (TRM AUC≈0.48–0.53 all H; HRM reversed/below 0.5 all H); only HRM q_halt predicts (0.81→0.64). See phase1/fig_E5_horizon_sweep.png. NEW: undecided-pool eventual-success drops to 4%(HRM)/19%(TRM) by H=12 yet dynamics still don't discriminate.
+- [x] E6 done — matched-objective. **C7b demoted** (effect small/inconsistent at matched step); C7a (λ₁(D) rises over training) stands. §3.4 needs rewrite.
+- [ ] E3, E4, E7 — phase-2
+- [ ] E8 Maze — GO (user has dedicated GPUs). Build dataset + package launch/diagnostic; train on user cards.
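
The restricted comparison behind C5 and the E5 sweep can be sketched as follows: at each horizon, keep only examples not yet correct, then ask whether an early signal ranks the eventual successes above the eventual failures. This is a hedged illustration with made-up numbers; the record layout (`signal`, `correct_at_h`, `correct_final`) and function names are assumptions, not the format of the queue outputs.

```python
def auc(pos, neg):
    """P(pos > neg) + 0.5 * P(tie), by direct pair counting."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def undecided_auc(signal, correct_at_h, correct_final):
    """AUC(signal -> eventual success) among examples still unsolved at horizon h."""
    pool = [i for i, solved in enumerate(correct_at_h) if not solved]
    pos = [signal[i] for i in pool if correct_final[i]]
    neg = [signal[i] for i in pool if not correct_final[i]]
    if not pos or not neg:
        return float("nan")  # undefined without both fates in the pool
    return auc(pos, neg)

# Synthetic records keyed by horizon (illustrative values only).
sweep = {
    2: dict(signal=[0.1, 0.4, 0.2, 0.3],
            correct_at_h=[False, False, False, False],
            correct_final=[True, False, False, True]),
    4: dict(signal=[0.5, 0.6, 0.9, 0.2],
            correct_at_h=[True, False, False, False],
            correct_final=[True, True, False, False]),
}
curve = {h: undecided_auc(**rec) for h, rec in sweep.items()}
```

An AUC near 0.5 on this restricted pool means the signal carries no information about eventual fate; values below 0.5 correspond to the sign reversal reported for HRM.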
diff --git a/research/flossing/paper/intro.md b/research/flossing/paper/intro.md
new file mode 100644
index 0000000..85f06e7
--- /dev/null
+++ b/research/flossing/paper/intro.md
@@ -0,0 +1,57 @@
+# Recursive Reasoning Models Fail by Wandering, Not by Settling
+
+## 1 Introduction
+
+Recursive reasoning models such as the Hierarchical Reasoning Model (HRM; Wang et al., 2025)
+and the Tiny Recursive Model (TRM; Jolicoeur-Martineau, 2025) solve constraint-satisfaction
+puzzles that defeat far larger language models, by iterating a small network on a latent state
+for hundreds of updates per puzzle. When such a model fails, what is dynamically different
+about the trajectory it produced? Two recent mechanistic studies answer in attractor language.
+Failed TRM runs "plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026); failed
+HRM runs converge to spurious fixed points that rival the correct one (Ren & Liu, 2026). The
+evidence behind both labels is indirect, resting on loss plateaus and two-dimensional
+projections of 512-dimensional trajectories, and the labels disagree about the basic character
+of failure: premature stability in one account, partly aimless drift in the other. Neither
+measures the trajectory's stability directly. We do, per example, and the measurements support
+a third description: recursive reasoning models fail by wandering, not by settling.
+
+Across 2,048 to 8,192 held-out Sudoku-Extreme puzzles, correct trajectories end inside a
+narrow low-velocity band of the latent dynamics, and failures essentially never do. In an
+official-recipe TRM at 87.6% test accuracy, none of 254 failures settles: the least mobile
+failure still moves faster at the end of inference than 96.5% of successes, a separation of
+distributions that no threshold choice can undo, and failed trajectories remain locally
+expansive throughout (median leading finite-time Lyapunov exponent λ₁ = +0.103, against +0.012
+for successes; AUC 0.993). HRM shows the same structure with one addition. Settled-but-wrong
+trajectories exist, but they account for 0.55% of failures, carry success-like contraction
+(λ₁ = −0.84, against −0.87 for settled successes) and success-like halting confidence, and
+every one of them would have halted early under adaptive computation. The wrong-attractor
+failure mode is real, rare, and the only failure a confidence-based selector cannot catch.
+
+Two controls locate what the Lyapunov signature adds, and a third experiment locates when it
+exists. Matched for displacement level within the unsettled population, λ₁ still separates
+eventual successes from failures (decile-matched AUC 0.88–0.90), so the exponent does more
+than restate non-convergence. Binned by the number of givens, the separation is unchanged
+(within-bin AUC 0.982, against 0.984 unconditioned), so it is not an artifact of problem
+difficulty. It is, however, strictly retrospective. Restricted to puzzles still unsolved after
+four of sixteen segments, neither early-window exponents nor early state velocity predicts
+which trajectories will eventually succeed (AUC ≈ 0.5 in TRM), and in HRM the association
+inverts — among the undecided, the trajectories that move more in the early segments are the
+ones that go on to solve the puzzle (positive-direction AUC 0.69). The chaos of failure
+arrives with the failure; nothing dynamical in the early trajectory anticipates it.
+
+These measurements redraw the intervention map for this model class. Because failure is almost
+never a stable wrong answer, restart-and-select inference strategies have a high ceiling and a
+quantifiable blind spot of roughly half a percent. Because the early trajectory carries no
+dynamical death sentence, compute spent on early failure prediction is compute wasted, and
+restart diversity is the better buy. Our contributions: (i) per-example, outcome-conditioned
+measurement of settling and finite-time Lyapunov spectra in HRM and TRM, at sample sizes up to
+8,192 and replicated across two estimator implementations; (ii) a decomposition of failure
+that corrects the settled-attractor reading and bounds the wrong-attractor mode at ~0.5% of
+failures; (iii) controls showing the signature is not reducible to non-convergence or
+difficulty; (iv) evidence that the signature is concurrent with the outcome and carries no
+early-warning content at the granularity tested.
+
+---
+*[em-dash count: 1. Contrast-template count: title + one echo (end of ¶1). Flourish count:
+1 ("death sentence", ¶4) — cuttable. "essentially never" is the one hedge in ¶2, scoped by
+the 0.55% in the next sentence.]*
diff --git a/research/flossing/paper/outline.md b/research/flossing/paper/outline.md
new file mode 100644
index 0000000..0dde354
--- /dev/null
+++ b/research/flossing/paper/outline.md
@@ -0,0 +1,79 @@
+# Outline — "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12)
+
+Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ ·
+remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations),
+abstract, tables T1–T3 + figures F3/F4 composition.
+
+Target: ~8 pages main. Every section header below lists [claims served] and [assets].
+
+## 1 Introduction [C1, spine]
+- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they
+ fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss
+ curves and 2-D projections; we measure the dynamics directly, per example.
+- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99;
+ concurrent-not-antecedent).
+- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling
+ measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two
+ published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the
+ early-window null + sign reversal.
+- NO general AI-reasoning throat-clearing. First sentence is about the object of study.
+
+## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table]
+- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876),
+ Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied.
+- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization;
+ per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat.
+- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full
+ percentile sweep + threshold-free statement in appendix); cells A/B/C/D.
+
+## 3 Results
+- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell,
+ strict-B table + fig_hrm_strictB_profiles]
+ Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the
+ settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest
+ token-acc are all 17-givens puzzles).
+- 3.2 What the signal is not [C4; assets: decile table, givens table]
+ Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry
+ the numbers.
+- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables]
+ The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734
+ vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of
+ the signature.
+- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare]
+  Gap widens via λ₁(D) (C7a); the earlier multi4 D-cell shrink is retired: the
+  matched-objective control (E6) finds the intervention effect small and inconsistent
+  (C7b demoted); multi4 collapse = λ₁(A) sign flip.
+
+## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*]
+- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing
+ App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome.
+- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention;
+ quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy
+ (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller
+ are consistent, at the design level, with where our measurements localize the signal — cite,
+ don't claim confirmation.
+- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own
+ TorchDEQ negative result; Solve-the-Loop) — what "enforce settling" buys and where it failed;
+ our measurements say which kind of settling is the operative one.
+
+## 5 Implications (restrained, half page)
+- Intervention design space bifurcates: widen/deepen the settled tube at training time
+ (perturbation training, equilibrium losses) vs restart-and-select at inference
+ (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%).
+- Early pruning/reallocation unsupported at 4-step granularity; on HRM the gradient of usable
+ early signal lives in the learned head, not the generic dynamical quantities.
+
+## 6 Limitations & future
+Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtracks next);
+single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory lingering;
+no mechanism offered for why settling fails — measurement paper.
+
+## Figures plan (all exist or one rerun away)
+F1: drift–λ₁ scatter, both models (have).
+F2: per-cell λ₁ + strict-B profiles inset (have).
+F3: decile-matched AUC + givens-binned AUC (compose from CSVs).
+F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set).
+F5: checkpoint evolution (have).
+
+## Order of writing
+1. Results 3.1–3.3 (numbers already final) → 2. Setup → 3. Sec 4 (notes ready) → 4. Intro →
+5. Implications/Limitations → 6. style pass against claims.md checklist.
diff --git a/research/flossing/paper/rainer_followup_draft.md b/research/flossing/paper/rainer_followup_draft.md
new file mode 100644
index 0000000..12091de
--- /dev/null
+++ b/research/flossing/paper/rainer_followup_draft.md
@@ -0,0 +1,37 @@
+Subject: Re: Question on gradient flossing vs forward trajectory stability in recursive reasoning models
+
+Hi Rainer,
+
+A short follow-up to my email of June 5 — we have since measured the things I was speculating
+about, and two results seem worth sharing because they sharpen the question I asked you.
+
+First, conditioning per-example finite-time Lyapunov spectra on both outcome and terminal
+settling (n = 2048–8192, two architectures) shows that failure is almost exclusively
+non-settling: in an official-recipe TRM at 87.6% accuracy, none of 254 failed trajectories
+ever enters the low-velocity band that all successes occupy, and they remain locally expansive
+to the end (median λ₁ +0.10 vs +0.01). "Converged to the wrong attractor" failures exist in
+HRM but make up only ~0.5% of failures. The chaotic signature also survives two controls: it
+persists after matching trajectories on displacement level (so it is not just re-measuring
+non-convergence), and after binning by puzzle difficulty.
+
+Second — and this is the part that genuinely surprised us — the signature is strictly
+concurrent. Among puzzles still unsolved after a quarter of the inference budget, neither the
+early-window exponents nor early state velocity predict which trajectories will eventually
+succeed (AUC ≈ 0.5); in HRM the association even inverts, with eventually-successful
+trajectories moving more in the early phase. So the failed trajectories are not "born chaotic":
+chaos at the end and failure appear together.
+
+This makes me think the right framing for my earlier question is reachability of the settled
+region (escape from a long chaotic transient) rather than per-example landscape quality, which
+would be consistent with your view of flossing as a learning-time tool rather than an
+inference-time one. If you know of work that conditions finite-time exponents on trajectory
+fate in this way — in transient-chaos settings or elsewhere — I would be grateful for a
+pointer; we have not found a precedent.
+
+Best,
+Yuren
+
+---
+[Notes, not part of the email: numbers from analysis_2x2/OBSERVATIONS.md addenda 1-2. Send only
+if/after Rainer replies to the June 5 email, or as a gentle bump after ~2 weeks (June 19+).
+The "born chaotic" phrasing mirrors his literature's transient-chaos vocabulary deliberately.]
diff --git a/research/flossing/paper/sample_intro.md b/research/flossing/paper/sample_intro.md
new file mode 100644
index 0000000..183faa4
--- /dev/null
+++ b/research/flossing/paper/sample_intro.md
@@ -0,0 +1,49 @@
+# Sample section: Introduction (taste-calibration draft)
+
+Recursive reasoning models solve constraint-satisfaction problems that defeat much larger
+language models by iterating a small network on a latent state — up to several hundred state
+updates per puzzle in the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model
+(TRM). When such a model fails, what is dynamically different about the trajectory it
+produced? Recent mechanistic studies have answered with attractor language: failed runs
+"plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026), or converge to spurious
+fixed points that rival the correct one (Ren & Liu, 2026). These accounts rest on indirect
+evidence — loss plateaus, two-dimensional projections of 512-dimensional trajectories — and
+the two papers do not agree: one describes failure as premature stability, the other partly as
+wandering. Neither measures stability itself.
+
+We measure it directly. For every test puzzle we record two per-example quantities along the
+full 16-segment inference trajectory: the finite-time Lyapunov spectrum of the joint latent
+dynamics, and the per-segment state displacement. Conditioning these on outcome over 2,048 to
+8,192 puzzles per model yields a complete decomposition of failure for HRM (52.6% accuracy)
+and an official-recipe TRM (87.6%), and the decomposition contradicts the settled-attractor
+picture. Correct trajectories enter a narrow low-velocity band and stay in it; failed
+trajectories never do. In TRM, not one of 254 failures settles — the least mobile failure still
+moves faster at the end of inference than 96.5% of successes — while remaining locally
+expansive (median λ₁ = +0.103 versus +0.012 for successes; AUC 0.993). In HRM, settled-but-wrong
+trajectories exist but account for 0.55% of failures; the other 99.45% wander. Failure in these
+models is not a wrong attractor. It is the sustained absence of settling.
+
+Two controls sharpen what the Lyapunov signature adds. Matched for displacement level within
+the unsettled population, λ₁ still separates eventual successes from failures (decile-matched
+AUC 0.88–0.90), so the exponent is not merely re-measuring non-convergence; and binning by
+puzzle givens leaves the separation intact (within-bin AUC 0.982 versus 0.984 overall), so it
+is not a difficulty artifact. The signature is, however, strictly retrospective. Restricted to
+puzzles still unsolved after four segments, nothing dynamical about those first four segments
+predicts which will eventually be solved: AUC ≈ 0.5 in TRM for exponent, displacement, and
+halting confidence alike — and in HRM the association inverts, with eventual successes moving
+*more* in the early trajectory than eventual failures (AUC 0.69 in the positive direction).
+The chaos of failure is concurrent with the outcome, not an omen visible at the start.
+
+These measurements reframe both the diagnosis and the levers. Because failure is almost never
+a stable wrong answer, selection-based inference strategies have a high ceiling — final-step
+halting confidence tracks correctness on all but the ~0.5% of failures that settle confidently
+— and because the early trajectory carries no dynamical death sentence, compute is better
+spent on restarts than on early pruning. We quantify both points, correct the published
+attractor labels they depend on, and release the per-example measurement tooling.
+
+---
+*[Style notes for review, not part of the draft: (1) every paragraph opens with a finding or a
+question, none with "In recent years"; (2) the two prior papers are quoted precisely and
+credited for what their data shows before the correction is made; (3) hedges appear only where
+the claim table concedes (e.g., "almost never", "~0.5%"); (4) the one rhetorical flourish —
+"not an omen" — is load-bearing; cut it if it reads as flavor.]*
diff --git a/research/flossing/paper/setup_results.md b/research/flossing/paper/setup_results.md
new file mode 100644
index 0000000..d6976d5
--- /dev/null
+++ b/research/flossing/paper/setup_results.md
@@ -0,0 +1,129 @@
+# 2 Setup
+
+**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
+1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
+accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global
+batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set,
+87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
+halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
+observed for the full budget. Answers are decoded at segment 16.
+
+**Per-example measurements.** Along each trajectory we record three families of quantities.
+First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
+vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
+through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
+time-average of log |R_ii| over the full trajectory (336 sub-updates for TRM, 64
+for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
+separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
+are comparable only within an estimator implementation; we replicate the HRM analysis under a
+second, earlier implementation (n=8,192) and report its scale separately.
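
The exponent estimator described in this paragraph can be sketched on a toy system. The block below is a minimal illustration of the JVP-plus-QR recipe, not the code in `diagnose_{hrm,trm}_joint.py`: it assumes a pure-Python Gram-Schmidt in place of a library QR, a hand-written linear map in place of the model update, and hypothetical names (`ftle`, `gram_schmidt_qr`, `linear_step`, `linear_jvp`). The toy map has exponents ln 2 and ln 0.5, which gives a closed-form check.

```python
import math

def gram_schmidt_qr(cols):
    """Modified Gram-Schmidt on column vectors; returns (Q columns, diag of R)."""
    q_cols, r_diag = [], []
    for v in cols:
        w = list(v)
        for q in q_cols:
            proj = sum(a * b for a, b in zip(q, w))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        q_cols.append([a / norm for a in w])
        r_diag.append(norm)  # Gram-Schmidt keeps the diagonal positive
    return q_cols, r_diag

def ftle(step, jvp, z0, k, n_steps):
    """Leading-k finite-time Lyapunov exponents along one trajectory.

    step(z) -> next state; jvp(z, v) -> Jacobian of step at z applied to v.
    """
    dim = len(z0)
    # Start from the first k standard basis vectors as tangent directions.
    tangents = [[1.0 if i == j else 0.0 for i in range(dim)] for j in range(k)]
    log_sums = [0.0] * k
    z = list(z0)
    for _ in range(n_steps):
        pushed = [jvp(z, v) for v in tangents]       # propagate tangents
        tangents, r_diag = gram_schmidt_qr(pushed)   # re-orthonormalize
        log_sums = [s + math.log(r) for s, r in zip(log_sums, r_diag)]
        z = step(z)
    return [s / n_steps for s in log_sums]           # time-average of log R_ii

# Toy linear map standing in for one state update (assumption for illustration).
A = [[2.0, 1.0], [0.0, 0.5]]

def linear_step(z):
    return [sum(A[i][j] * z[j] for j in range(2)) for i in range(2)]

def linear_jvp(z, v):
    return linear_step(v)  # the Jacobian of a linear map is the map itself

lams = ftle(linear_step, linear_jvp, z0=[1.0, 1.0], k=2, n_steps=64)
```

In a real model the `jvp` callable would come from an automatic-differentiation library, and the per-example exponents would be computed in batch; neither detail is shown here.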
+
+**The settling criterion.** Late drift — the mean z_H displacement over the final four
+segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
+(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
+under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
+trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
+method on the pooled log distribution; every result below is reported with a full percentile
+sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
+point: both bands have nonzero characteristic velocity.
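
The settling split can be illustrated with a small, self-contained sketch. It assumes an exact (unbinned) variant of Otsu's criterion, which picks the cut that maximizes between-class variance, applied to log₁₀ late drift; the paper's scripts may use a histogram version instead. The trajectories are synthetic and the function names are hypothetical.

```python
import math

def otsu_threshold(values):
    """Exact Otsu split on 1-D data: the cut maximizing between-class variance."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_score, best_cut = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cut between equal values
        w0, w1 = i / n, (n - i) / n
        m0, m1 = left_sum / i, (total - left_sum) / (n - i)
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_cut = score, 0.5 * (xs[i - 1] + xs[i])
    return best_cut

def late_drift(drift_per_segment, last=4):
    """Mean displacement over the final `last` segments."""
    tail = drift_per_segment[-last:]
    return sum(tail) / len(tail)

# Synthetic per-segment drift profiles (made-up numbers, not measurements).
trajs = [
    [50, 20, 5, 1.0, 0.9, 1.0, 0.95],  # slows into a low band
    [50, 30, 4, 0.9, 1.0, 1.0, 0.9],
    [60, 55, 48, 52, 47, 50, 49],      # keeps moving
    [70, 60, 58, 61, 55, 57, 60],
]
log_late = [math.log10(late_drift(t)) for t in trajs]
cut = otsu_threshold(log_late)
settled = [x < cut for x in log_late]
```

Because the split is made on the log scale, a band with nonzero residual velocity still counts as settled as long as it sits in the low mode.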
+
+**Design.** Crossing the settling criterion with answer correctness yields four cells:
+settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
+analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
+beyond the settling split? And when along the trajectory does the discriminative signal exist?
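
The four-cell assignment and the B-versus-D share of failures reduce to a few lines. The flags below are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

def cell(settled, correct):
    """Map (settled, correct) to the cell labels A/B/C/D used in the text."""
    if settled:
        return "A" if correct else "B"
    return "C" if correct else "D"

# Hypothetical per-example flags, for illustration only.
settled_flags = [True, True, False, False, True]
correct_flags = [True, False, True, False, True]

cells = [cell(s, c) for s, c in zip(settled_flags, correct_flags)]
counts = Counter(cells)
failures = counts["B"] + counts["D"]
b_fraction = counts["B"] / failures  # share of failures that end settled
```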
+
+# 3 Results
+
+## 3.1 Failure is wandering: the 2×2 decomposition
+
+At the end of inference, success and failure occupy different dynamical regimes almost without
+exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
+settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
+late drift of 96.5% of successes, so no threshold assignment can place a failure in the
+settled band. Failed trajectories also remain locally expansive over the full window (median
+λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
+(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
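
The threshold-free statement can be checked with a short computation: find the least mobile failure and count how many successes sit below it. If that share is high, any threshold loose enough to admit a failure must lie above nearly all successes. A minimal sketch on made-up values, with a hypothetical function name:

```python
def threshold_free_separation(late_drift_correct, late_drift_wrong):
    """Return (minimum failure drift, fraction of successes below that minimum)."""
    floor = min(late_drift_wrong)
    below = sum(1 for d in late_drift_correct if d < floor)
    return floor, below / len(late_drift_correct)

# Illustrative values only, not the measured distributions.
floor, frac_below = threshold_free_separation(
    late_drift_correct=[1.0, 2.0, 3.0, 50.0],
    late_drift_wrong=[40.0, 60.0],
)
```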
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.55%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as that
+of settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy varies with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
+the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (index-paired
+with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
+correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (AUC(−λ₁ → correct) 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the
+wandering-failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step
+20,832 and 247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more
+expansive. The known late-run collapse of this variant coincides with the settled band
+itself destabilizing (λ₁ of settled successes flipping to +0.04). The comparison baseline
+differs in training objective (ACT-streaming versus fixed unroll), so we report this as
+directional evidence pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 4 (§3.2, §3.3: one pair each). Contrast-template count: 1
+("learned, not dynamical", §3.3; budget spent in title/intro). Flourish count: 1 ("mirror
+curiosity", §3.1), cuttable. Tables referenced: T1 = 2×2 cells both models; T2 = decile +
+givens AUCs; T3 = early-window restricted AUCs. All numbers traceable to
+analysis_2x2/OBSERVATIONS.md (+ addenda) and offline_followups/followups.md.]*
diff --git a/research/flossing/paper/style_contract.md b/research/flossing/paper/style_contract.md
new file mode 100644
index 0000000..4e0f319
--- /dev/null
+++ b/research/flossing/paper/style_contract.md
@@ -0,0 +1,47 @@
+# Style contract (operative checklist for every section pass)
+
+Sources: Shaib et al. 2026 slop taxonomy (density/relevance/coherence are the load-bearing
+dimensions); Buschek's four reviewer complaints; 2026 banned-pattern lists. Applied as a
+per-section pass, not a vibe.
+
+## Hard bans (lexical)
+delve, tapestry, landscape (figurative), testament, pivotal, showcase, intricate, vibrant,
+seamless, elegant, dramatically, novel (self-describing), comprehensive (self-describing),
+notably, crucially, interestingly, importantly, "It is worth noting", "Moreover," as sentence
+opener, "In recent years", "has garnered attention", "paradigm shift".
+Technical terms that overlap ban lists (e.g., "robust" in "threshold-robust") stay.
+
+## Punctuation & rhythm
+- Em dash: ≤2 per section, never two in one sentence, never as paired parenthetical.
+- No exclamation marks. Semicolons fine.
+- Every paragraph contains at least one sentence under ~12 words.
+- No uniform paragraph shapes: don't end three consecutive paragraphs with a summary clause.
+
+## Structure
+- The contrast template ("X, not Y" / "not X, but Y") has a total budget of ONE device:
+ the title and its single echo at the end of paragraph 1. Zero elsewhere. Express other
+ contrasts by stating the positive finding and letting the numbers carry the negation.
+- Rule-of-three closers only when the three items are an exhaustive measured list.
+- No chronological narration of the investigation. Structure follows claims.md.
+- Results topic sentences contain a finding or a number, never a plan ("We then examine...").
+
+## Integrity (Buschek's four, mapped to our risks)
+- Marketing language: zero unevidenced evaluative adjectives. "near-perfect" only with the AUC
+ in the same sentence.
+- Performative related work: every citation paired with the precise claim it supports; check
+ each against papers/notes/*.md before the citation lands.
+- Misrepresentation: prior-work characterizations quote verbatim with page/section; corrections
+ target the quoted words only (claims.md C6a discipline).
+- Stretched summaries: no interpretive metaphors for results. One flourish per section maximum,
+ and it must be cuttable without losing content.
+
+## Density (the actual anti-slop weapon)
+- Every sentence must add a number, a definition, a caveat, or a logical step. Otherwise cut.
+- Modal verbs (could/might/may) confined to Discussion and future work.
+- Hedges appear only where claims.md concedes; one hedge per concession, not a seesaw.
+
+## Honesty rails (project-specific)
+- Never compare λ values across estimator versions; state scale ownership at first use.
+- "Settled" always defined as the measured low-velocity band, with its residual velocity given.
+- No mechanism claims; observations and their direct logical consequences only.
+- No promises (code release, future experiments) that the authors have not decided.