# 2 Setup

**Models and task.** We study two trained recursive reasoners on Sudoku-Extreme with the
1k×1000-augmentation training set: HRM (27M parameters; checkpoint at step 26,040; 52.6% exact
accuracy on our evaluation samples) and TRM-MLP trained with the official recipe (5M; global
batch size 768; checkpoint at step 58,590, the best of its run; 86.9% on the full test set,
87.6% on our n=2,048 sample). Inference runs a fixed 16-segment unroll; the adaptive-computation
halting signal (q_halt) is recorded at every segment but not applied, so every trajectory is
observed for the full budget. Answers are decoded at segment 16.

**Per-example measurements.** Along each trajectory we record three families of quantities.
First, the leading k=8 finite-time Lyapunov exponents of the joint latent dynamics: tangent
vectors in the concatenated (z_H, z_L) space are propagated with Jacobian-vector products
through every state update and re-orthonormalized by QR at each sub-step; λ_i is the
time-average of the log diagonal of R over the full trajectory (336 sub-updates for TRM, 64
for HRM). Second, the per-segment state displacement ‖z^{(t)} − z^{(t−1)}‖ for z_H and z_L
separately ("drift"). Third, q_halt, exact correctness, and token accuracy. Exponent values
are comparable only within an estimator implementation; we replicate the HRM analysis under a
second, earlier implementation (n=8,192) and report its scale separately.
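
As a rough illustration of the QR-based estimator described above, the following sketch propagates k tangent vectors through a sequence of updates and averages the log diagonal of R. It is a minimal sketch, not the implementation used for the reported numbers: `lyapunov_spectrum`, `step_jvp`, and `toy_step` are hypothetical names, and a fixed linear map with known exponents stands in for the model, where the Jacobian-vector products would in practice come from automatic differentiation through each (z_H, z_L) update.

```python
import numpy as np

def lyapunov_spectrum(step_jvp, z0, n_steps, k, seed=0):
    """Leading-k finite-time Lyapunov exponents via QR re-orthonormalization.

    step_jvp(z, V) -> (z_next, J @ V), with J the Jacobian of one sub-update
    at z and V a (dim, k) matrix whose columns are tangent vectors.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal starting frame of k tangent vectors.
    Q, _ = np.linalg.qr(rng.standard_normal((z0.size, k)))
    z = z0
    log_r = np.zeros(k)
    for _ in range(n_steps):
        z, JV = step_jvp(z, Q)       # push state and tangents through one sub-update
        Q, R = np.linalg.qr(JV)      # re-orthonormalize at every sub-step
        log_r += np.log(np.abs(np.diag(R)))
    return log_r / n_steps           # time-average of log |diag(R)|

# Toy stand-in for the model (an assumption): a linear map with known exponents.
A = np.diag([2.0, 0.5, 0.1])
toy_step = lambda z, V: (A @ z, A @ V)
lam = lyapunov_spectrum(toy_step, np.ones(3), n_steps=400, k=2)
```

For the toy map the two leading exponents approach log 2 and log 0.5, which gives a quick sanity check on any estimator of this shape before it is pointed at a real model.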

**The settling criterion.** Late drift — the mean z_H displacement over the final four
segments — is bimodal in log scale for every checkpoint we examine: a narrow low-velocity band
(characteristic residual velocity 0.96 per segment for HRM, 18.5 for TRM, interquartile width
under 10%) separated from a high-velocity mode by one to two orders of magnitude. We call a
trajectory *settled* if its late drift falls in the low band. Thresholds are set by Otsu's
method on the pooled log distribution; every result below is reported with a full percentile
sweep, and the headline TRM result is threshold-free. Settled is a band property, not a fixed
point: both bands have nonzero characteristic velocity.
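
The criterion can be sketched in two steps: compute late drift from the z_H trajectory, then threshold the pooled log distribution with Otsu's method. The sketch below is written under assumptions: `late_drift` and `otsu_threshold` are hypothetical names, the histogram uses 256 bins, and the two modes are synthetic draws rather than measured drift values.

```python
import numpy as np

def late_drift(z_h, last=4):
    """Mean z_H displacement over the final `last` segments.

    z_h has shape (segments + 1, dim): one state per segment boundary.
    """
    steps = np.linalg.norm(np.diff(z_h, axis=0), axis=1)
    return steps[-last:].mean()

def otsu_threshold(x, bins=256):
    """Otsu's method on a 1-D sample: maximize between-class variance."""
    hist, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = hist / hist.sum()
    w0 = np.cumsum(p)                 # mass at or below each bin
    w1 = 1.0 - w0                     # mass above each bin
    cum_mean = np.cumsum(p * centers)
    total_mean = cum_mean[-1]
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros(bins)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (w0[valid] * w1[valid])
    # Upper edge of the best split bin serves as the threshold.
    return edges[np.argmax(between) + 1]

# Synthetic log10 late drift (illustrative values, not the paper's data).
rng = np.random.default_rng(0)
low = rng.normal(0.0, 0.02, 300)     # narrow low-velocity band
high = rng.normal(1.7, 0.15, 200)    # high-velocity mode
pooled = np.concatenate([low, high])
thr = otsu_threshold(pooled)
settled = pooled <= thr

# Constant-step trajectory: every segment moves by exactly 2.0.
z_demo = np.arange(17)[:, None] * np.array([[2.0, 0.0]])
drift_demo = late_drift(z_demo)
```

With modes this far apart the threshold lands in the gap, so the settled labels are insensitive to where exactly in the gap it falls, which is the situation the percentile sweep is meant to confirm.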

**Design.** Crossing the settling criterion with answer correctness yields four cells:
settled-correct (A), settled-wrong (B), unsettled-correct (C), unsettled-wrong (D). The
analysis asks three questions. How is failure distributed over B versus D? What does λ₁ add
beyond the settling split? And when along the trajectory does the discriminative signal exist?
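
The four-cell design amounts to a cross-tabulation of two boolean flags per trajectory. A minimal sketch follows; `assign_cells` and `cell_counts` are hypothetical helper names and the inputs are toy values.

```python
import numpy as np

def assign_cells(settled, correct):
    """Label each trajectory A (settled-correct), B (settled-wrong),
    C (unsettled-correct), or D (unsettled-wrong)."""
    settled = np.asarray(settled, dtype=bool)
    correct = np.asarray(correct, dtype=bool)
    return np.where(settled,
                    np.where(correct, "A", "B"),
                    np.where(correct, "C", "D"))

def cell_counts(cells):
    """Count trajectories per cell, keeping empty cells as zeros."""
    return {c: int(np.sum(cells == c)) for c in "ABCD"}

# Toy example: five trajectories.
cells = assign_cells([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
counts = cell_counts(cells)
```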

# 3 Results

## 3.1 Failure is wandering: the 2×2 decomposition

At the end of inference, success and failure occupy different dynamical regimes almost without
exception (Table 1). In TRM, 254 of 2,048 puzzles are answered incorrectly and none of them is
settled: the minimum late drift among failures (log₁₀ = 1.66, ≈46 per segment) exceeds the
late drift of 96.5% of successes, so no threshold assignment can place a failure in the
settled band. Failed trajectories also remain locally expansive over the full window (median
λ₁ = +0.103, IQR +0.094 to +0.111) while settled successes sit at the edge of contraction
(+0.011). The same decomposition at a mid-training checkpoint, and across a ten-checkpoint
series, shows the settled-wrong cell empty from 20% of training onward.
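
The threshold-free statement rests on a single comparison: the smallest late drift among failures against the distribution of late drift among successes. The sketch below shows that comparison on toy values; `separation_summary` is a hypothetical name and the arrays are illustrative, not the measured drifts.

```python
import numpy as np

def separation_summary(log_drift, correct):
    """Return the minimum failure drift and the fraction of successes below it.

    Any threshold at or below the minimum failure drift labels no failure
    as settled, whatever the fraction of successes it captures.
    """
    log_drift = np.asarray(log_drift, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    min_fail = log_drift[~correct].min()
    frac_below = float(np.mean(log_drift[correct] < min_fail))
    return min_fail, frac_below

# Toy log10 late drifts: four successes, two failures.
min_fail, frac_below = separation_summary(
    [1.20, 1.25, 1.30, 1.90, 1.70, 2.00],
    [True, True, True, True, False, False],
)

# Converting the reported log10 value back to a per-segment velocity.
velocity = 10 ** 1.66   # about 45.7, i.e. roughly 46 per segment
```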

HRM adds the one exception, and it is small. At the strict band threshold, 21 of 3,894
failures (0.54%; n=8,192) end settled; the replication under the second estimator gives
5 of 971 (0.5%; n=2,048). These settled-wrong trajectories are dynamically indistinguishable
from successes: λ₁ median −0.842 against −0.867 for settled-correct, drift profiles inside the
A band from segment ~4 onward (Figure 2), and final halting confidence identical to successes
(median q_halt +7.47 in both cells, against −9.6 for wandering failures). All 21 crossed the
halting threshold between segments 4 and 9; under adaptive computation each would have stopped
early, confident and wrong. Their token accuracy spans 0.41–0.88, and the three least accurate
are all 17-givens (minimum-clue) puzzles. This cell is the wrong-attractor mode of Ren & Liu
(2026), measured: it exists, it carries exactly the contraction signature their account
predicts, and it is two orders of magnitude less common than wandering.
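
Because q_halt is recorded at every segment without being applied, the segment at which adaptive computation would have stopped can be recovered offline. The sketch below assumes that "crossing the halting threshold" means q_halt first exceeding a fixed value, taken here as 0.0; both the threshold and the name `first_halt_segment` are assumptions for illustration, not details given in the text.

```python
import numpy as np

def first_halt_segment(q_halt, threshold=0.0):
    """1-indexed first segment at which q_halt exceeds `threshold`, else None.

    The threshold value is an assumption; substitute the model's own rule.
    """
    q = np.asarray(q_halt, dtype=float)
    hits = np.flatnonzero(q > threshold)
    return int(hits[0]) + 1 if hits.size else None

# Toy traces: one that crosses at segment 4, one that never crosses.
stop_a = first_halt_segment([-3.0, -1.0, -0.5, 0.2, 4.0, 7.0])
stop_b = first_halt_segment([-2.0, -1.0])
```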

The unsettled-correct cell (C) is the mirror curiosity: 3–7% of successes are still moving at
segment 16 (70 of 1,794 in TRM; 57 of 1,077 in HRM), with halting confidence as high as
settled successes. Their existence shows the decode head can read a correct answer off a
moving state; we do not observe what happens to them past the window.

## 3.2 What the exponent is not measuring

The λ₁ separation is not a restatement of the settling split. Within the unsettled population,
where every trajectory is still moving, λ₁ separates eventual successes from failures inside
narrow displacement bands: splitting unsettled HRM trajectories into late-drift deciles (decile
width ≤0.04 log units over most of the range) gives within-decile AUC from 0.97 at low drift
to 0.69 at the highest decile, weighted mean 0.879 (n=8,192); the second estimator gives 0.900
(n=2,048). A trajectory's expansion rate carries outcome information beyond how fast it is
moving.
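
A within-decile AUC of this kind can be sketched as follows: bin trajectories by late drift, compute the AUC of −λ₁ for correctness inside each bin, and average across bins. The sketch is written under assumptions: the weighting is by bin size (the text does not state the weights), `auc` and `within_bin_auc` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    if pos.size == 0 or neg.size == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def within_bin_auc(scores, labels, covariate, n_bins=10):
    """AUC inside quantile bins of `covariate`, plus a size-weighted mean."""
    scores, labels, covariate = map(np.asarray, (scores, labels, covariate))
    edges = np.quantile(covariate, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, covariate, side="right") - 1, 0, n_bins - 1)
    aucs, weights = [], []
    for b in range(n_bins):
        m = idx == b
        a = auc(scores[m], labels[m])
        if not np.isnan(a):          # skip bins missing one of the classes
            aucs.append(a)
            weights.append(int(m.sum()))
    return np.array(aucs), float(np.average(aucs, weights=weights))

# Synthetic check: outcome is fully determined by lambda_1 in every drift bin.
drift = np.arange(100, dtype=float)
correct = np.arange(100) % 2 == 0
lam1 = np.where(correct, -1.0, 1.0)
bin_aucs, weighted = within_bin_auc(-lam1, correct, drift)
pair_check = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The same function applies unchanged to other stratifying covariates, such as givens count, by passing that covariate in place of late drift.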

The separation is also not a difficulty artifact, at least not at the resolution of clue
count. Accuracy varies with the number of givens (Spearman +0.28), and λ₁ is itself
difficulty-correlated (−0.35 overall, −0.16/−0.18 within outcome classes), yet conditioning
removes nothing: within givens bins, AUC(−λ₁ → correct) is 0.976–0.987 (weighted 0.982)
against 0.984 unconditioned. Givens count is a coarse proxy — solver backtrack counts would be
the sharper control — but at this resolution the dynamical signature is orthogonal to how hard
the puzzle is.

## 3.3 When the signal exists: concurrent, with no early warning

The discriminative power of the dynamics is a property of the realized trajectory, and it is
absent at the start. We re-measured both models over only the first four segments (paired by
example index with the full-window runs, same sampling) and asked whether anything visible by segment 4
forecasts the final outcome. Unconditioned, early-window λ₁ appears predictive (AUC 0.89 TRM,
0.73 HRM), but the appearance is inherited from puzzles already solved by segment 4 (69% of
TRM successes, 34% of HRM's). Restricted to the decision-relevant population — puzzles not yet
correct at segment 4 — prediction collapses (Table 3). In TRM (n=626, of which 59% eventually
succeed), AUC is 0.543 for early λ₁, 0.492 for early drift, 0.521 for early halting
confidence. In HRM (n=1,342, 28% eventually succeed) the dynamical associations invert:
eventual successes have marginally higher early λ₁ (reverse AUC 0.448) and substantially
higher early displacement (AUC 0.688 in the positive direction). Among undecided HRM
trajectories, the ones still moving vigorously are the ones that go on to solve the puzzle.
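
The effect of the restriction can be seen on a toy example: when puzzles already solved by segment 4 also carry favorable early scores, they inflate the unconditioned AUC, and removing them exposes how little the score says about the undecided ones. The sketch below uses hand-built values; `auc` and `restricted_auc` are hypothetical names and none of the numbers are measurements.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def restricted_auc(early_score, eventual_correct, correct_at_4):
    """AUC of an early-window score, restricted to puzzles not yet correct at segment 4."""
    early_score = np.asarray(early_score, dtype=float)
    eventual_correct = np.asarray(eventual_correct, dtype=bool)
    undecided = ~np.asarray(correct_at_4, dtype=bool)
    return auc(early_score[undecided], eventual_correct[undecided])

# Eight toy puzzles: the first three are already solved by segment 4.
score        = [0.90, 0.80, 0.85, 0.20, 0.30, 0.40, 0.10, 0.35]
eventual     = [1, 1, 1, 1, 0, 1, 0, 0]
correct_at_4 = [1, 1, 1, 0, 0, 0, 0, 0]

auc_all = auc(score, eventual)                                 # 13 of 15 pairs ordered correctly
auc_undecided = restricted_auc(score, eventual, correct_at_4)  # 4 of 6 pairs ordered correctly
```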

One early signal does carry information, and it is learned, not dynamical: HRM's q_halt at
segment 4 predicts eventual success at AUC 0.734. TRM's does not (0.521); TRM's training
removes HRM's Q-learning continue head in favor of a binary halt loss, a difference we note
without interpreting. Window length is the untested variable here: four segments matches the
deep-supervision horizon, and we have not yet swept longer prefixes.

## 3.4 Training widens the gap from the failure side

Over the TRM training series (ten checkpoints, 512 puzzles each), λ₁ of wandering failures
rises monotonically from +0.036 to +0.102 while λ₁ of settled successes stays within ±0.03 of
zero; the settled-wrong cell empties by step 52k and stays empty. The outcome separation grows
over training because the failures become more expansive, while the success regime barely
moves. HRM's series shows a mass migration instead: at early checkpoints nearly all
trajectories are low-drift and wrong (the model barely updates state), this cell drains
through mid-training into high-drift wandering, and accuracy growth then tracks transfer from
wandering into the settled-correct band.

A preliminary intervention probe is consistent with the decomposition. HRM checkpoints trained
with multi-rollout initial-state perturbation (K=4, log-uniform noise) shrink the wandering-
failure cell at matched steps relative to an ordinary baseline (D: 274→175 at step 20,832 and
247→176 at 23,436, accuracy +0.20 and +0.15), with surviving failures more expansive, and the
known late-run collapse of this variant coincides with the settled band itself destabilizing
(λ₁ of settled successes flipping to +0.04). The comparison baseline differs in training
objective (ACT-streaming versus fixed unroll), so we report this as directional evidence
pending a matched-objective control.

---
*[Section pass notes: em-dash count 2 (§3.2, §3.3 one each). Contrast-template count: 0
(budget spent in title/intro). Flourish count: 1 ("mirror curiosity", §3.1) — cuttable.
Tables referenced: T1 = 2×2 cells both models; T2 = decile + givens AUCs; T3 = early-window
restricted AUCs. All numbers traceable to analysis_2x2/OBSERVATIONS.md (+ addenda) and
offline_followups/followups.md.]*