1 files changed, 129 insertions, 0 deletions
diff --git a/research/flossing/paper/setup_results.md b/research/flossing/paper/setup_results.md
new file mode 100644
index 0000000..d6976d5
--- /dev/null
+++ b/research/flossing/paper/setup_results.md
@@ -0,0 +1,129 @@
+# 2 Setup
+
+**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
+1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
+accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M parameters;
+global batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test
+set, 87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
+halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
+observed for the full budget. Answers are decoded at segment 16.
+
+**Per-example measurements.** Along each trajectory we record three families of quantities.
+First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
+vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
+through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
+time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64
+for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
+separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
+are comparable only within an estimator implementation; we replicate the HRM analysis under a
+second, earlier implementation (n=8,192) and report its scale separately.
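The QR procedure in the measurements paragraph can be illustrated on a toy linear system. The sketch below is a minimal NumPy version under stated assumptions: explicit Jacobian matrices stand in for the Jacobian-vector products through the model, and `ftle_qr` and the diagonal toy map are hypothetical names and data, not the estimator behind the reported numbers.

```python
import numpy as np

def ftle_qr(jacobians, k, seed=0):
    """Leading-k finite-time Lyapunov exponents from a sequence of Jacobians.

    jacobians: list of (d, d) arrays, one per sub-update, in trajectory order.
    Returns exponents in units of "per sub-update".
    """
    jacobians = list(jacobians)
    d = jacobians[0].shape[0]
    rng = np.random.default_rng(seed)
    # Random orthonormal starting frame of k tangent vectors.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    log_r = np.zeros(k)
    for jac in jacobians:
        # Propagate the frame (a model-scale version would use k JVPs here),
        # then re-orthonormalize by QR.
        q, r = np.linalg.qr(jac @ q)
        # |diag(R)| is the stretch factor of each direction at this sub-update.
        log_r += np.log(np.abs(np.diag(r)))
    # Time-average of the log diagonal of R over the whole trajectory.
    return log_r / len(jacobians)

# Toy check (assumed data): a fixed diagonal map, repeated for 2000 sub-updates.
toy = [np.diag([2.0, 1.0, 0.5, 0.1])] * 2000
lam = ftle_qr(toy, k=2)
```

For a fixed diagonal map the exponents should approach the logs of the diagonal magnitudes, here roughly log 2 and 0, which gives a quick sanity check before an estimator of this kind is pointed at a real model.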
+
+**The settling criterion.** Late drift — the mean z_H displacement over the final four
+segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
+(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
+under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
+trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
+method on the pooled log distribution; every result below is reported with a full percentile
+sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
+point: in both models the low band has a nonzero characteristic velocity.
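As an illustration of the thresholding step, the following is a minimal sketch of Otsu's method applied to log late drift. The histogram resolution (256 bins), the function name `otsu_threshold`, and the synthetic two-mode sample are all assumptions for the example; the text does not specify these details.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: the histogram cut that maximizes between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)   # mass at or below each bin
    w1 = w0[-1] - w0                     # mass above each bin
    s0 = np.cumsum(hist * centers)       # running sum of values at or below each bin
    valid = (w0 > 0) & (w1 > 0)          # skip cuts that leave one class empty
    mu0 = s0[valid] / w0[valid]
    mu1 = (s0[-1] - s0[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (mu0 - mu1) ** 2
    # Place the cut at the upper edge of the best bin.
    return edges[1:][valid][np.argmax(between)]

rng = np.random.default_rng(0)
# Synthetic log10 late drift (assumed values): a narrow low band and a broad high mode.
low = rng.normal(np.log10(18.5), 0.02, size=1500)
high = rng.normal(3.0, 0.3, size=500)
log_drift = np.concatenate([low, high])

thr = otsu_threshold(log_drift)
settled = log_drift < thr
```

Because the cut is chosen on the pooled distribution, a percentile sweep around `thr` is a natural robustness check of the kind the text reports.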
+
+**Design.** Crossing the settling criterion with answer correctness yields four cells:
+settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
+analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
+beyond the settling split? And when along the trajectory does the discriminative signal exist?
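The four-cell assignment can be written down directly. This is a small sketch with hypothetical helper names (`cell_labels`, `cell_counts`) and toy inputs that are not the paper's data.

```python
import numpy as np

def cell_labels(settled, correct):
    """Map boolean arrays to cells: A settled-correct, B settled-wrong,
    C unsettled-correct, D unsettled-wrong."""
    settled = np.asarray(settled, dtype=bool)
    correct = np.asarray(correct, dtype=bool)
    return np.where(settled,
                    np.where(correct, "A", "B"),
                    np.where(correct, "C", "D"))

def cell_counts(settled, correct):
    """Count trajectories in each of the four cells."""
    labels = cell_labels(settled, correct)
    return {c: int(np.sum(labels == c)) for c in "ABCD"}

# Toy inputs (assumed, for illustration only).
settled = [True, True, False, False, False, True]
correct = [True, False, True, False, False, True]
counts = cell_counts(settled, correct)
```

The first of the three questions then reduces to comparing `counts["B"]` with `counts["D"]`.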
+
+# 3 Results
+
+## 3.1 Failure is wandering: the 2×2 decomposition
+
+At the end of inference, success and failure occupy different dynamical regimes almost without
+exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
+settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
+late drift of 96.5% of successes, so no threshold assignment can place a failure in the
+settled band. Failed trajectories also remain locally expansive over the full window (median
+λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
+(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.54%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
+settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
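The within-decile comparison can be sketched as follows, assuming AUC is computed as the pairwise ranking probability and bins are quantiles of late drift. The names `auc` and `within_bin_auc` and the synthetic data are hypothetical; the text does not state how the bins or weights were implemented.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def within_bin_auc(scores, labels, covariate, n_bins=10):
    """AUC inside quantile bins of a covariate, plus the size-weighted mean."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    covariate = np.asarray(covariate, dtype=float)
    edges = np.quantile(covariate, np.linspace(0, 1, n_bins + 1))
    # Interior edges only, so every sample lands in a bin 0..n_bins-1.
    bin_idx = np.digitize(covariate, edges[1:-1])
    aucs, sizes = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        a = auc(scores[mask], labels[mask])
        if not np.isnan(a):  # skip bins that contain a single outcome class
            aucs.append(a)
            sizes.append(int(mask.sum()))
    return np.array(aucs), float(np.average(aucs, weights=sizes))

# Synthetic example (assumed data): lower lambda_1 for eventual successes.
rng = np.random.default_rng(0)
n = 2000
log_drift = rng.normal(size=n)
correct = rng.random(n) < 0.5
lam1 = rng.normal(0.10, 0.02, size=n) - 0.08 * correct
per_bin, weighted = within_bin_auc(-lam1, correct, log_drift)
```

Scoring with `-lam1` matches the direction used in the text, where lower expansion predicts success; holding drift roughly fixed within each bin is what separates the exponent's contribution from the settling split.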
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy rises with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
+the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (paired by
+index with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
+correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
+failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
+247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
+known late-run collapse of this variant coincides with the settled band itself destabilizing
+(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
+objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
+pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
+(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
+Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
+restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
+offline_followups/followups.md.]*