1 file changed, 79 insertions, 0 deletions
diff --git a/research/flossing/paper/outline.md b/research/flossing/paper/outline.md
new file mode 100644
index 0000000..0dde354
--- /dev/null
+++ b/research/flossing/paper/outline.md
@@ -0,0 +1,79 @@
+# Outline — "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12)
+
+Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ ·
+remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations),
+abstract, tables T1–T3 + figures F3/F4 composition.
+
+Target: ~8 pages main. Every section header below lists [claims served] and [assets].
+
+## 1 Introduction [C1, spine]
+- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they
+  fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss
+  curves and 2-D projections; we measure the dynamics directly, per example.
+- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99;
+  concurrent-not-antecedent).
+- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling
+  measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two
+  published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the
+  early-window null + sign reversal.
+- NO general AI-reasoning throat-clearing. First sentence is about the object of study.
+
+## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table]
+- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876),
+  Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied.
+- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization;
+  per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat.
+- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full
+  percentile sweep + threshold-free statement in appendix); cells A/B/C/D.
+
+## 3 Results
+- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell,
+  strict-B table + fig_hrm_strictB_profiles]
+  Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the
+  settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest
+  token-acc are all 17-givens puzzles).
+- 3.2 What the signal is not [C4; assets: decile table, givens table]
+  Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry
+  the numbers.
+- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables]
+  The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734
+  vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of
+  the signature.
+- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare]
+  Gap widens via λ₁(D); multi4 shrinks D-cell mass at matched steps (preliminary, objective
+  caveat); multi4 collapse = λ₁(A) sign flip.
+
+## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*]
+- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing
+  App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome.
+- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention;
+  quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy
+  (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller
+  are consistent, at the design level, with where our measurements localize the signal — cite,
+  don't claim confirmation.
+- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own
+  TorchDEQ negative result; Solve-the-Loop) — what "enforce settling" buys and where it failed;
+  our measurements say which kind of settling is the operative one.
+
+## 5 Implications (restrained, half page)
+- Intervention design space bifurcates: widen/deepen the settled tube at training time
+  (perturbation training, equilibrium losses) vs restart-and-select at inference
+  (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%).
+- Early pruning/reallocation unsupported at 4-step granularity; on HRM the usable early signal
+  is strongest in the learned head (q_halt), not in the generic dynamical quantities.
+
+## 6 Limitations & future
+Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtrack count is
+next to try); single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory
+lingering; no mechanism offered for why settling fails (this is a measurement paper).
+
+## Figures plan (all exist or are one rerun away)
+F1: drift–λ₁ scatter, both models (have).
+F2: per-cell λ₁ + strict-B profiles inset (have).
+F3: decile-matched AUC + givens-binned AUC (compose from CSVs).
+F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set).
+F5: checkpoint evolution (have).
+
+## Order of writing
+1. Results 3.1–3.3 (numbers already final) → 2. Setup → 3. Sec 4 (notes ready) → 4. Intro →
+5. Implications/Limitations → 6. style pass against claims.md checklist.