# Recursive Reasoning Models Fail by Wandering, Not by Settling

## 1 Introduction

Recursive reasoning models such as the Hierarchical Reasoning Model (HRM; Wang et al., 2025)
and the Tiny Recursive Model (TRM; Jolicoeur-Martineau, 2025) solve constraint-satisfaction
puzzles that defeat far larger language models, by iterating a small network on a latent state
for hundreds of updates per puzzle. When such a model fails, what is dynamically different
about the trajectory it produced? Two recent mechanistic studies answer in attractor language.
Failed TRM runs "plateau at stable high-loss attractors" (Efstathiou & Balwani, 2026); failed
HRM runs converge to spurious fixed points that rival the correct one (Ren & Liu, 2026). The
evidence behind both labels is indirect, resting on loss plateaus and two-dimensional
projections of 512-dimensional trajectories, and the labels disagree about the basic character
of failure: premature stability in one account, partly aimless drift in the other. Neither
measures the trajectory's stability directly. We do, per example, and the measurements support
a third description: recursive reasoning models fail by wandering, not by settling.

Across 2,048 to 8,192 held-out Sudoku-Extreme puzzles, correct trajectories end inside a
narrow low-velocity band of the latent dynamics, and failures essentially never do. In an
official-recipe TRM at 87.6% test accuracy, none of 254 failures settles: the least mobile
failure still moves faster at the end of inference than 96.5% of successes, a separation of
distributions that no threshold choice can undo, and failed trajectories remain locally
expansive throughout (median leading finite-time Lyapunov exponent λ₁ = +0.103, against +0.012
for successes; AUC 0.993). HRM shows the same structure with one addition. Settled-but-wrong
trajectories exist, but they account for 0.55% of failures, carry success-like contraction
(λ₁ = −0.84, against −0.87 for settled successes) and success-like halting confidence, and
every one of them would have halted early under adaptive computation. The wrong-attractor
failure mode is real, rare, and the only failure a confidence-based selector cannot catch.
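
The two quantities used above, end-of-inference state velocity and the leading finite-time Lyapunov exponent λ₁, can be illustrated with a minimal sketch. It assumes a Benettin-style two-trajectory estimator with renormalisation, which may differ from the estimators used in this paper. The names `step_velocity`, `leading_ftle`, `settle`, and `expand` are hypothetical, and the two linear maps are toy stand-ins for a recursive update, not HRM or TRM.

```python
import math

def step_velocity(f, h):
    """Size of one update, ||f(h) - h||: a simple notion of state velocity at h."""
    return math.dist(f(h), h)

def leading_ftle(f, h0, n_steps, eps=1e-6):
    """Estimate the leading finite-time Lyapunov exponent of the map f from h0.

    A shadow trajectory starts eps away from the main one. After every step
    the log growth of their separation is accumulated and the shadow is
    pulled back to distance eps along the current separation direction.
    """
    h = list(h0)
    g = [h[0] + eps] + h[1:]  # arbitrary initial perturbation direction (assumption)
    log_growth = 0.0
    for _ in range(n_steps):
        h, g = f(h), f(g)
        d = math.dist(h, g)
        log_growth += math.log(d / eps)
        g = [hi + eps * (gi - hi) / d for hi, gi in zip(h, g)]
    return log_growth / n_steps

# Toy stand-ins (assumptions): a contracting update and a mildly expansive one.
def settle(h):
    return [0.5 * x for x in h]

def expand(h):
    return [1.1 * x for x in h]

lam_settle = leading_ftle(settle, [1.0, 1.0], 40)  # close to log(0.5), negative
lam_expand = leading_ftle(expand, [1.0, 1.0], 40)  # close to log(1.1), positive
```

A negative estimate marks local contraction and a positive one marks local expansion, which is the sense in which the exponents reported above are read.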

Two controls locate what the Lyapunov signature adds, and a third experiment locates when it
exists. Matched for displacement level within the unsettled population, λ₁ still separates
eventual successes from failures (decile-matched AUC 0.88–0.90), so the exponent does more
than restate non-convergence. Binned by the number of givens, the separation is unchanged
(within-bin AUC 0.982, against 0.984 unconditioned), so it is not an artifact of problem
difficulty. It is, however, strictly retrospective. Restricted to puzzles still unsolved after
four of sixteen segments, neither early-window exponents nor early state velocity predicts
which trajectories will eventually succeed (AUC ≈ 0.5 in TRM), and in HRM the association
inverts: among the undecided, the trajectories that move more in the early segments are the
ones that go on to solve the puzzle (positive-direction AUC 0.69). The chaos of failure
arrives with the failure; no early instability in the trajectory anticipates it.
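
The displacement-matched control can be sketched as follows, under stated assumptions: trajectories are binned into displacement quantiles, the AUC of λ₁ for predicting failure is computed inside each bin, and the bins are averaged without weighting. The averaging rule, the choice of failure as the positive class, and the names `auc` and `matched_auc` are illustrative choices, not the procedure used for the numbers above.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def matched_auc(records, n_bins=10):
    """Average within-bin AUC of lam1 for failure.

    records: iterable of (displacement, lam1, failed) triples.
    Bins are equal-count displacement quantiles; bins lacking either
    outcome are skipped. Unweighted averaging is an assumption.
    """
    ranked = sorted(records, key=lambda r: r[0])
    n = len(ranked)
    aucs = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        fail = [lam for _, lam, failed in chunk if failed]
        ok = [lam for _, lam, failed in chunk if not failed]
        if fail and ok:
            aucs.append(auc(fail, ok))
    return sum(aucs) / len(aucs) if aucs else float("nan")
```

If λ₁ only restated displacement, the within-bin AUC would fall toward 0.5 once displacement is held roughly fixed; a value well above 0.5 indicates information beyond non-convergence.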

These measurements redraw the intervention map for this model class. Because failure is almost
never a stable wrong answer, restart-and-select inference strategies have a high ceiling and a
quantifiable blind spot of roughly half a percent of failures. Because the early trajectory
carries no dynamical signature of eventual failure, compute spent on early failure prediction
is poorly spent, and restart diversity is the better buy. Our contributions: (i) per-example,
outcome-conditioned
measurement of settling and finite-time Lyapunov spectra in HRM and TRM, at sample sizes up to
8,192 and replicated across two estimator implementations; (ii) a decomposition of failure
that corrects the settled-attractor reading and bounds the wrong-attractor mode at ~0.5% of
failures; (iii) controls showing the signature is not reducible to non-convergence or
difficulty; (iv) evidence that the signature is concurrent with the outcome and carries no
early-warning content at the granularity tested.
