# Outline: "Recursive Reasoning Models Fail by Wandering, Not by Settling" (title FIXED 2026-06-12)

Status: intro.md ✅ (v2, audited) · setup_results.md ✅ (Secs 2–3) · style_contract.md ✅ · remaining: Sec 4 (relation to prior accounts), Sec 5 (implications), Sec 6 (limitations), abstract, tables T1–T3 + figures F3/F4 composition. Target: ~8 pages main.

Every section header below lists [claims served] and [assets].

## 1 Introduction [C1, spine]

- Para 1: recursive reasoners (HRM/TRM) solve hard puzzles by iterating a latent state; when they fail, what is dynamically different? Existing mechanistic accounts infer dynamics from loss curves and 2-D projections; we measure the dynamics directly, per example.
- Para 2: the answer, with numbers (settling × correctness decomposition; B≈0; AUC 0.99; concurrent-not-antecedent).
- Para 3: contributions (4 items, one line each): (i) per-example outcome-conditioned FTLE/settling measurement at n≤8192 across two architectures; (ii) failure-mode decomposition correcting two published labels; (iii) independence controls (drift-matched, difficulty-binned); (iv) the early-window null + sign reversal.
- NO general AI-reasoning throat-clearing. First sentence is about the object of study.

## 2 Setup [assets: estimator details from diagnose_trm_joint.py; OBSERVATIONS.md provenance table]

- 2.1 Models & task: HRM 27M @26040 (acc .526), TRM-MLP official recipe @58590 (acc .876), Sudoku-Extreme-1k-aug; fixed 16-step unroll, ACT recorded not applied.
- 2.2 Measurements: joint (z_H,z_L) tangent dynamics, JVP+QR, k=8, per-sub-update normalization; per-ACT-step state displacement (drift); q_halt; exact/token accuracy. Estimator-scale caveat.
- 2.3 The 2×2 design: settled band defined by bimodal late-drift split (Otsu primary, full percentile sweep + threshold-free statement in appendix); cells A/B/C/D.

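A rough prototype of the 2.3 split and cell assignment, for checking the design on synthetic data. This is a sketch under stated assumptions, not the code in diagnose_trm_joint.py: the function names, the log10 transform, the bin count, and the cell labeling (A = settled and correct, B = settled and wrong, C = unsettled and correct, D = unsettled and wrong) are all hypothetical choices made for illustration.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Otsu's method on a 1-D sample: pick the cut maximizing between-class variance."""
    values = np.asarray(values, dtype=float)
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)               # counts in the low class for each candidate cut
    w1 = w0[-1] - w0                   # counts in the high class
    s0 = np.cumsum(hist * centers)     # running sum of values in the low class
    s1 = s0[-1] - s0
    valid = (w0 > 0) & (w1 > 0)        # both classes must be non-empty
    mu0 = np.where(valid, s0 / np.maximum(w0, 1), 0.0)
    mu1 = np.where(valid, s1 / np.maximum(w1, 1), 0.0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -np.inf)
    return edges[np.argmax(between) + 1]  # upper edge of the last low-class bin


def assign_cells(late_drift, correct, eps=1e-12):
    """Label each example A/B/C/D (labeling is an assumption, see lead-in)."""
    log_drift = np.log10(np.asarray(late_drift, dtype=float) + eps)
    thr_log = otsu_threshold(log_drift)
    settled = log_drift < thr_log
    correct = np.asarray(correct, dtype=bool)
    cells = np.where(settled,
                     np.where(correct, "A", "B"),
                     np.where(correct, "C", "D"))
    return cells, 10.0 ** thr_log


# Synthetic stand-in for per-example late drift: two well-separated modes.
rng = np.random.default_rng(0)
late_drift = np.concatenate([rng.lognormal(-9, 0.3, 600),   # "settled" mode
                             rng.lognormal(-2, 0.3, 400)])  # "wandering" mode
correct = np.concatenate([np.ones(600, dtype=bool),         # settled: all correct
                          rng.random(400) < 0.3])           # wandering: mixed
cells, thr = assign_cells(late_drift, correct)
counts = {c: int((cells == c).sum()) for c in "ABCD"}
print(counts, thr)
```

The Otsu cut is only the primary criterion in the outline; the percentile sweep and threshold-free statement planned for the appendix would replace `otsu_threshold` with a loop over fixed quantiles of the same log-drift values.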
## 3 Results

- 3.1 Decomposition [C1, C2, C3; assets: cells tables, fig_*_scatter, fig_*_lyap_by_cell, strict-B table + fig_hrm_strictB_profiles]
  Lead: "Across 2048–8192 held-out puzzles, no TRM failure and 0.55% of HRM failures end in the settled band." Then per-cell λ₁; then the 21 selector-blind examples (their three lowest token-acc are all 17-givens puzzles).
- 3.2 What the signal is not [C4; assets: decile table, givens table]
  Drift-matched AUC 0.88–0.90; givens-binned AUC unchanged. One paragraph each, tables carry the numbers.
- 3.3 When the signal exists [C5; assets: early_pairing_{trm,hrm}.md tables]
  The early-window null; the HRM sign reversal (drift@4 +direction AUC 0.688); q_halt@4 0.734 vs TRM 0.521 (factual note: TRM removed the continue head). Frame as the temporal anatomy of the signature.
- 3.4 Training evolution [C7; assets: evolution_{trm,hrm}.png/csv; multi4 quick-compare]
  Gap widens via λ₁(D); multi4 shrinks D-cell mass at matched steps (preliminary, objective caveat); multi4 collapse = λ₁(A) sign flip.

## 4 Relation to prior accounts [C6a, C6b; assets: papers/notes/*]

- Para 1: network-level Lyapunov–performance work (Vogt 2022; AeLLE 2024; Engelken flossing App. D.3 trains-vs-fails at network level, opposite sign) → none condition per example on outcome.
- Para 2: the 2026 mechanistic trio. Efstathiou & Balwani: credit loss/boundedness/intervention; quote and correct the settledness reading (C6a). Ren & Liu: confirm + quantify their taxonomy (C6b). Es'kin & Smorkalov (CMM): their endpoint-stability losses + engineered early repeller are consistent, at the design level, with where our measurements localize the signal; cite, don't claim confirmation.
- Para 3: stability-by-construction line (monDEQ, Jacobian-reg DEQ, REN/Sandwich; TRM's own TorchDEQ negative result; Solve-the-Loop): what "enforce settling" buys and where it failed; our measurements say which kind of settling is the operative one.

## 5 Implications (restrained, half page)

- Intervention design space bifurcates: widen/deepen the settled tube at training time (perturbation training, equilibrium losses) vs restart-and-select at inference (q_halt tracks correctness at trajectory end; selector-blind ceiling ≈0.5%).
- Early pruning/reallocation unsupported at 4-step granularity; on HRM the gradient of usable early signal lives in the learned head, not the generic dynamical quantities.

## 6 Limitations & future

Sudoku-Extreme only; two models; #givens is a weak difficulty proxy (solver backtracks next); single early horizon (sweep queued); end-of-window criterion blind to mid-trajectory lingering; no mechanism offered for why settling fails (measurement paper).

## Figures plan (all exist or one rerun away)

- F1: drift–λ₁ scatter, both models (have).
- F2: per-cell λ₁ + strict-B profiles inset (have).
- F3: decile-matched AUC + givens-binned AUC (compose from CSVs).
- F4: early-window pairing summary (compose: 3 signals × 2 models, restricted set).
- F5: checkpoint evolution (have).

## Order of writing

1. Results 3.1–3.3 (numbers already final)
2. Setup
3. Sec 4 (notes ready)
4. Intro
5. Implications/Limitations
6. Style pass against claims.md checklist