| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-13 12:35:36 -0500 |
| commit | 66e0d8b9fd4d0f7a2231d689c055e26fdf1cf04a (patch) | |
| tree | c29cba61124018755a19b02c9d33e3ad5f2e05cc /research/flossing/paper/setup_results.md | |
Curated export for clone-and-run Maze training (2x A6000) + diagnostics.
trm/hrm pretrain.py carry trajectory-augmentation code (backward-compatible).
Heavy artifacts (checkpoints/wandb/npz) gitignored; see PROVENANCE.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'research/flossing/paper/setup_results.md')
| -rw-r--r-- | research/flossing/paper/setup_results.md | 129 |
1 files changed, 129 insertions, 0 deletions
diff --git a/research/flossing/paper/setup_results.md b/research/flossing/paper/setup_results.md
new file mode 100644
index 0000000..d6976d5
--- /dev/null
+++ b/research/flossing/paper/setup_results.md
@@ -0,0 +1,129 @@
+# 2 Setup
+
+**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
+1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
+accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global
+batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set,
+87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
+halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
+observed for the full budget. Answers are decoded at segment 16.
+
+**Per-example measurements.** Along each trajectory we record three families of quantities.
+First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
+vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
+through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
+time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64
+for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
+separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
+are comparable only within an estimator implementation; we replicate the HRM analysis under a
+second, earlier implementation (n=8,192) and report its scale separately.
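The exponent estimator described in the per-example measurements paragraph can be illustrated with a small, self-contained sketch. It is not the repository's implementation: `toy_update` is a hypothetical two-dimensional linear map standing in for a model state update, `jvp_fd` replaces an autodiff Jacobian-vector product with central differences, and the QR step is written as modified Gram-Schmidt so the example needs only the standard library. All function names here are assumptions made for illustration.

```python
import math

def jvp_fd(f, z, v, eps=1e-6):
    """Central-difference stand-in for an autodiff Jacobian-vector product (assumption)."""
    plus = f([zi + eps * vi for zi, vi in zip(z, v)])
    minus = f([zi - eps * vi for zi, vi in zip(z, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def gram_schmidt(vectors):
    """Modified Gram-Schmidt: returns an orthonormal basis and the diagonal of R."""
    basis, r_diag = [], []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        r_diag.append(norm)
        basis.append([wi / norm for wi in w])
    return basis, r_diag

def finite_time_lyapunov(f, z0, k, n_steps):
    """Leading k finite-time exponents: time-average of log diag(R) along the trajectory."""
    d = len(z0)
    tangents = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(k)]
    log_sums = [0.0] * k
    z = list(z0)
    for _ in range(n_steps):
        # Push every tangent vector through the Jacobian at the current state.
        pushed = [jvp_fd(f, z, v) for v in tangents]
        # Re-orthonormalize and accumulate the log stretch factors.
        tangents, r_diag = gram_schmidt(pushed)
        for i, r in enumerate(r_diag):
            log_sums[i] += math.log(r)
        z = f(z)
    return [s / n_steps for s in log_sums]

def toy_update(z):
    """Hypothetical contracting linear update with eigenvalues 0.5 and 0.25."""
    return [0.5 * z[0] + 0.1 * z[1], 0.25 * z[1]]

exponents = finite_time_lyapunov(toy_update, [1.0, 1.0], k=2, n_steps=64)
```

For this toy map the two estimates should match log 0.5 and log 0.25 up to finite-difference error, which gives a quick sanity check before swapping in a real update function and a true JVP.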
+
+**The settling criterion.** Late drift — the mean z_H displacement over the final four
+segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
+(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
+under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
+trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
+method on the pooled log distribution; every result below is reported with a full percentile
+sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
+point: both bands have nonzero characteristic velocity.
+
+**Design.** Crossing the settling criterion with answer correctness yields four cells:
+settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
+analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
+beyond the settling split? And when along the trajectory does the discriminative signal exist?
+
+# 3 Results
+
+## 3.1 Failure is wandering: the 2×2 decomposition
+
+At the end of inference, success and failure occupy different dynamical regimes almost without
+exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
+settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
+late drift of 96.5% of successes, so no threshold assignment can place a failure in the
+settled band. Failed trajectories also remain locally expansive over the full window (median
+λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
+(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
+series, shows the settled-wrong cell empty from 20% of training onward.
+
+HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
+failures (0.55%; n=8,192) end settled; the replication under the second estimator gives
+5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
+from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
+A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
+(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
+halting threshold between segments 4 and 9; under adaptive computation each would have stopped
+early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
+are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
+(2026), measured: it exists, it carries exactly the contraction signature their account
+predicts, and it is two orders of magnitude less common than wandering.
+
+The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
+segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
+settled successes. Their existence shows the decode head can read a correct answer off a
+moving state; we do not observe what happens to them past the window.
+
+## 3.2 What the exponent is not measuring
+
+The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
+where every trajectory is still moving, λ₁ ranks eventual successes above failures inside
+narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
+width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
+to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
+(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
+moving.
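The within-decile comparison in §3.2 can be sketched as follows, under stated assumptions: bins are equal-count quantiles of late drift, the score is −λ₁ (matching the AUC(−λ₁ → correct) convention used elsewhere in the section), and the weighted mean weights each bin by its trajectory count. The text does not specify the weighting, so that choice is a guess. The function names and the eight-trajectory toy data are hypothetical and are not taken from the analysis files.

```python
def auc(scores, labels):
    """P(random positive outscores random negative); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # AUC is undefined when a bin contains a single class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def within_bin_auc(drift, lam1, correct, n_bins=10):
    """Per-bin AUC of -lambda_1 for correctness, plus a count-weighted mean (assumed weighting)."""
    order = sorted(range(len(drift)), key=lambda i: drift[i])
    per_bin, total, weight = [], 0.0, 0
    for b in range(n_bins):
        # Equal-count quantile bin of the drift-sorted indices.
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        a = auc([-lam1[i] for i in idx], [correct[i] for i in idx])
        per_bin.append(a)
        if a is not None:
            total += a * len(idx)
            weight += len(idx)
    return per_bin, (total / weight if weight else None)

# Hypothetical toy data: eight trajectories, two drift bins.
drift = [0.1 * i for i in range(8)]
correct = [1, 0, 1, 0, 1, 0, 1, 0]
lam1 = [-0.9, 0.1, -0.8, 0.2, -0.5, -0.6, 0.0, 0.3]
per_bin, weighted = within_bin_auc(drift, lam1, correct, n_bins=2)
```

In the toy data the low-drift bin separates perfectly and the high-drift bin is at chance, which mirrors the direction of the reported trend without reproducing any of its numbers.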
+
+The separation is also not a difficulty artifact, at least not at the resolution of clue
+count. Accuracy varies strongly with the number of givens (Spearman +0.28), and λ₁ is itself
+difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
+removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
+against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
+the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
+the puzzle is.
+
+## 3.3 When the signal exists: concurrent, with no early warning
+
+The discriminative power of the dynamics is a property of the realized trajectory, and it is
+absent at the start. We re-measured both models over only the first four segments (idx-paired
+with the full-window runs, same sampling) and asked whether anything visible by segment 4
+forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
+0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
+TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
+correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
+succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
+confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
+eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
+higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
+trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
+
+One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
+segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
+removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
+without interpreting. Window length is the untested variable here: four segments matches the
+deep-supervision horizon, and we have not yet swept longer prefixes.
+
+## 3.4 Training widens the gap from the failure side
+
+Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
+rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
+zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
+over training because the failures become more expansive, while the success regime barely
+moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
+trajectories are low-drift and wrong (the model barely updates state), this cell drains
+through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
+wandering into the settled-correct band.
+
+A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
+with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
+failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
+247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
+known late-run collapse of this variant coincides with the settled band itself destabilizing
+(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
+objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
+pending a matched-objective control.
+
+---
+*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
+(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
+Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
+restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
+offline_followups/followups.md.]*
