Probabilistic Tiny Recursive Model
Amin Sghaier Ali Parviz Alexia Jolicoeur-Martineau
Mila – Quebec AI Institute Mila – Quebec AI Institute Independent
ILLS & ETS Montreal
{amin.sghaier, ali.parviz}@mila.quebec
alexia.jolicoeur-martineau@mail.mcgill.ca
arXiv:2605.19943v1 [cs.AI] 19 May 2026
Abstract
Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of
the parameters of modern large language models (LLMs) by iteratively refining a
latent state and final answer. While powerful, their deterministic recursion can lead
to convergence at suboptimal solutions, with no escape mechanism. A common
workaround relies on task-specific input perturbations at test time combined with
answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-
agnostic framework for test-time compute scaling that addresses this limitation
through stochastic exploration. PTRM injects Gaussian noise at each deep recursion
step, enabling parallel trajectories to explore diverse solution basins, and selects
among them using the model’s existing Q head (used for early stopping in the
original TRM). Without requiring retraining or task-specific augmentations, PTRM
enables substantial accuracy gains across benchmarks, including Sudoku-Extreme
(87.4% to 98.75%) and various puzzles from Pencil Puzzle Bench (62.6% to
91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs
(91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
[Figure 1 plot: two bar-chart panels of accuracy (%). Left panel: PPBench puzzles (sudoku, lightup, nurikabe, heyawake, and tapa). Right panel: Sudoku-Extreme. Legend: chain-of-thought pretrained LLMs; direct prediction; deterministic recursive prediction; probabilistic recursive prediction (ours); LLM ensemble (best of 7 strongest LLMs, assumes access to a perfect verifier).]
Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM
performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLM
by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36
points. On Sudoku-Extreme, PTRM reaches a state-of-the-art 98.75%.
1 Introduction
Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
models. Rather than autoregressively generating chains of token-level reasoning, they recursively
refine a latent state. This approach produces a single deterministic answer per input, fitting well with
tasks where the answer is unique.
Despite their strong performance, their deterministic inference does not make full use of their
capabilities. We show that many of TRM’s incorrect answers come from rollouts trapped in bad latent
space basins (i.e., regions of the latent space which decode to incorrect answers and from which the
deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work
on related models [5], suggests that TRM has the capability to solve significantly more problems
but is limited by its standard inference procedure.
Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This
is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique
answer. Because LLMs are non-deterministic, however, they can be sampled repeatedly to form different
trajectories (each comprising a chain of thought and a final answer). By then selecting a trajectory using
a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage
test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time
scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic
decoded answer, and selecting among the answers using the model’s own Q head.
TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is
conventionally used only at training time for adaptive computation (ACT) [7]. It carries valuable
information that the standard inference procedure discards.
We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a
new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian
noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different
latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head
is used to select the one most likely to be correct. PTRM requires no training changes and no
task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy
gains across diverse reasoning benchmarks.
2 Background: Tiny Recursive Model
Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a
question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion
consists of n updates to the latent state z followed by one update to the predicted answer y, all using
the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z).
fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T
latent recursions in sequence, with only the final one retaining gradients, allowing the model to
leverage a large effective depth while keeping training efficient.
Rather than doing one optimization step per sample, TRM is trained via deep supervision, which
consists of keeping the previous latent state z and answer y as initialization (after being detached from
the computational graph) for the next supervision step. This is done for up to Nsup supervision steps.
The loss at each step is calculated using cross entropy between the predicted answer logits fO (y)
(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively
refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more
steps than during training, providing a depth axis for test-time compute scaling (additional steps may
correct otherwise-incorrect answers).
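The recursion described in this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `f_theta` is a hypothetical element-wise stand-in for the shared two-layer network, the helper names (`add`, `latent_recursion`, `deep_recursion`) and the values of `n` and `T` are chosen only for illustration, and gradient handling is omitted.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def add(*vectors: Vector) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def f_theta(v: Vector) -> Vector:
    """Hypothetical stand-in for the shared two-layer network."""
    return [math.tanh(0.5 * value) for value in v]


def latent_recursion(
    x: Vector, y: Vector, z: Vector, n: int,
    f: Callable[[Vector], Vector] = f_theta,
) -> Tuple[Vector, Vector]:
    """n latent updates z <- f(x + y + z), then one answer update y <- f(y + z)."""
    for _ in range(n):
        z = f(add(x, y, z))
    y = f(add(y, z))
    return y, z


def deep_recursion(x: Vector, y: Vector, z: Vector, n: int, T: int) -> Tuple[Vector, Vector]:
    """Run T latent recursions in sequence, carrying (y, z) forward."""
    for _ in range(T):
        y, z = latent_recursion(x, y, z, n)
    return y, z


x = [1.0, -1.0, 0.5]
y0 = [0.0, 0.0, 0.0]
z0 = [0.0, 0.0, 0.0]
y1, z1 = deep_recursion(x, y0, z0, n=6, T=3)
```

Calling `deep_recursion` again on the returned `(y1, z1)` mimics unrolling additional supervision steps at inference, which is the depth axis mentioned above.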
[Figure 2 plots: three columns (quick success, delayed success, failure). Top row: PCA projection of y (PC 1 vs. PC 2). Bottom row: Q value (left axis) and cell accuracy (right axis) against supervision step (0 to 15). Legend: correct answer, incorrect answer, cell accuracy, start, end.]
Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
fade from light (early steps) to dark (later steps). A circle marks the start and a square marks the end.
Without a halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision
steps rather than being replaced after each one. To avoid wasting compute on already-solved samples,
an Adaptive Computational Time (ACT) halting mechanism is used. This is done by adding a binary
cross-entropy loss between a halting logit q̂ = fQ (y) (where fQ is a linear Q head) and the binary exact
correctness of the predicted answer ŷ = arg max fO (y):
Lstep = CE(fO (y), ytrue ) + BCE(q̂, 1[ŷ = ytrue ]).
The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5,
improving data efficiency. During inference, the Q head is not used, and the model performs Nsup
supervision steps to maximize answer correctness.
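As a worked illustration of the per-step loss Lstep, the sketch below evaluates the cross-entropy term and the halting term for a single toy example. The function names, the toy logits, and the choice to average the cross entropy over cells are assumptions made for illustration; a real implementation would operate on batched tensors.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(cell_logits: List[List[float]], target: List[int]) -> float:
    """Mean cross entropy over cells (averaging is an assumption)."""
    losses = [-math.log(softmax(row)[t]) for row, t in zip(cell_logits, target)]
    return sum(losses) / len(losses)


def binary_cross_entropy(q_logit: float, label: int) -> float:
    p = 1.0 / (1.0 + math.exp(-q_logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def step_loss(cell_logits: List[List[float]], target: List[int], q_logit: float) -> Tuple[float, int]:
    """Return (CE + BCE, exact-correctness label) for one sample."""
    prediction = [max(range(len(row)), key=row.__getitem__) for row in cell_logits]
    exact = int(prediction == target)  # 1 only if every cell is correct
    return cross_entropy(cell_logits, target) + binary_cross_entropy(q_logit, exact), exact


def should_halt(q_logit: float) -> bool:
    """Halting rule used during training: sigmoid(q) > 0.5, i.e. q > 0."""
    return 1.0 / (1.0 + math.exp(-q_logit)) > 0.5


logits = [[2.0, 0.0, -1.0], [0.1, 3.0, 0.2]]
loss_correct, exact_correct = step_loss(logits, [0, 1], q_logit=2.0)
loss_wrong, exact_wrong = step_loss(logits, [0, 2], q_logit=2.0)
```

In this toy case the same confident Q logit is penalized much more heavily when the decoded answer is not an exact match, which is what pushes the Q head toward acting as a correctness classifier.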
While TRM is powerful, it sometimes gets stuck in incorrect solutions. In the next section, we
investigate such failure cases in order to determine a way to remedy them.
3 Problem: When Does TRM Fail?
3.1 Analysis of failures and successes
We present observations about TRM that motivate our method. In this section, we train a TRM on
multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head
behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent
yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into
the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction
of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and
Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe:
Quick success: the trajectory transitions in a few steps from its starting location to a convergence
region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima
within the same few steps.
Delayed success: the trajectory initially oscillates around one region and remains there for multiple
supervision steps before sharply escaping to a different region where it converges. During the initial
phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell
accuracy spike together.
Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never reaches
near 100%, and the Q value stays negative for all supervision steps.
We refer to latent space regions that trajectories remain in across multiple supervision steps and
exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are
good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave
similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories,
when delayed successes find an escape to a good basin while failures remain stuck.
3.2 The Q head tracks trajectory quality
[Figure 3 plot: Q value (left axis) and cell accuracy (right axis) against supervision step (0 to 14). Legend: Incorrect (28), Correct (69), cell accuracy (right axis).]
Figure 3: Q value follows cell accuracy across reasoning. Mean Q value (solid, left axis) and mean
cell accuracy (dashed, right axis) over supervision steps, aggregated over 100 PPBench validation
puzzles, separated by final correctness (green: correct, red: incorrect).
Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
convergence, the Q logit sharply separates the two populations where q̂ ≈ +6 (sigmoid ≈ 1) for
correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
learned indicator of whether a trajectory has reached a good basin.
Given the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
can we leverage the Q head to identify better trajectories? The main challenge is that the standard
TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
problem. In the next section, we will show that by simply adding Gaussian noise to the latent state,
we can sample different parallel trajectories and leverage the Q head to pick the best one.
4 Method: Test-Time Compute Scaling via Stochastic Rollouts
We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
can be readily applied to any pretrained TRM model. Furthermore, it requires no task-specific
augmentations. PTRM works as follows: we run K rollouts in parallel and, at each supervision step,
add Gaussian noise (scaled by σ) to each rollout’s latent state input. After the final step, the Q head
fQ scores each rollout’s latent output, and the one with the highest Q value is selected and then
decoded using the model’s output head fO . The algorithm in Figure 4 (left) states this formally.
PTRM offers two complementary benefits: 1) it enables trajectories
to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
axis for test-time scaling.
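The procedure can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' code: `toy_init`, `toy_rec`, `toy_f_O`, and `toy_f_Q` are hypothetical stand-ins for the pretrained TRM's initialization, deep recursion step, output head, and Q head, and the rollouts run sequentially here rather than in parallel.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def decode(cell_logits: List[List[float]]) -> List[int]:
    """Arg max over classes for every cell."""
    return [max(range(len(row)), key=row.__getitem__) for row in cell_logits]


def ptrm_inference(
    x: Vector,
    init: Callable[[], Tuple[Vector, Vector]],
    rec: Callable[[Vector, Vector, Vector], Tuple[Vector, Vector]],
    f_O: Callable[[Vector], List[List[float]]],
    f_Q: Callable[[Vector], float],
    K: int,
    D: int,
    sigma: float,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Return the decoded answer of the rollout with the highest Q logit."""
    rng = random.Random(seed)
    best_answer: List[int] = []
    best_q = float("-inf")
    for _ in range(K):  # independent rollouts; sequential here for clarity
        z, y = init()
        for _ in range(D):
            z = [value + rng.gauss(0.0, sigma) for value in z]  # noise on the latent
            z, y = rec(x, z, y)  # one deep recursion step
        q = f_Q(y)
        if q > best_q:
            best_answer, best_q = decode(f_O(y)), q
    return best_answer, best_q


# Toy stand-ins (hypothetical, for illustration only).
def toy_init() -> Tuple[Vector, Vector]:
    return [0.0, 0.0], [0.0, 0.0]


def toy_rec(x: Vector, z: Vector, y: Vector) -> Tuple[Vector, Vector]:
    z_new = [math.tanh(a + b + c) for a, b, c in zip(x, y, z)]
    y_new = [math.tanh(a + b) for a, b in zip(y, z_new)]
    return z_new, y_new


def toy_f_O(y: Vector) -> List[List[float]]:
    return [[-value, value] for value in y]  # two classes per cell


def toy_f_Q(y: Vector) -> float:
    return min(abs(value) for value in y)  # confidence margin as a Q proxy


puzzle = [0.3, -0.2]
deterministic = ptrm_inference(puzzle, toy_init, toy_rec, toy_f_O, toy_f_Q, K=1, D=4, sigma=0.0)
stochastic = ptrm_inference(puzzle, toy_init, toy_rec, toy_f_O, toy_f_Q, K=20, D=4, sigma=0.2)
```

Setting `K=1` and `sigma=0.0` recovers the deterministic baseline, so the same function covers both inference modes.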
4.1 Escaping bad basins
PTRM Inference
Input: puzzle x, rollouts K, supervision steps D, noise scale σ
for k = 1, . . . , K in parallel do
    Initialize z_0^(k), y_0^(k)
    for t = 1, . . . , D do
        z_{t−1}^(k) += ϵ, ϵ ∼ N (0, σ² I)
        z_t^(k), y_t^(k) ← rec(x, z_{t−1}^(k), y_{t−1}^(k))
    end for
    ŷ^(k) ← arg max fO (y_D^(k))
    q̂^(k) ← fQ (y_D^(k))
end for
return ŷ^(k∗), k∗ = arg max_k q̂^(k)
[Figure 4 diagram: (a) Standard TRM (deterministic): puzzle, then D deep recursion steps along the depth axis, then answer. (b) PTRM (ours): K stochastic rollouts + Q-head selection. Each rollout k = 1, . . . , K along the width axis receives a Gaussian noise injection +ϵ at every deep recursion step, and arg max_k Q_k gives the final answer.]
Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
needed to occasionally find an escape trajectory.
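To make the per-rollout escape probability concrete, suppose (as an idealization that the text does not claim) that each rollout escapes independently with a fixed probability p. The chance that at least one of K rollouts escapes is then 1 − (1 − p)^K. The sketch below evaluates this with p = 0.08, the escape fraction observed at K = 100 for this single puzzle; the function name is hypothetical.

```python
def prob_at_least_one_escape(p: float, K: int) -> float:
    """P(at least one of K independent rollouts escapes), each with probability p."""
    return 1.0 - (1.0 - p) ** K


# p = 0.08 is the empirical fraction from the K = 100 experiment on one puzzle.
escape_curve = {K: prob_at_least_one_escape(0.08, K) for K in (1, 5, 25, 100)}
```

Under this assumption a handful of rollouts will often all fail while K = 100 almost always contains an escape, which is qualitatively consistent with the counts reported above.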
4.2 Width scaling
Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at
K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between
pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set.
By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same
range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct
solutions even when they are rare.
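The three metrics can be written down directly for a single puzzle. The sketch below is illustrative: the function names and the toy answers and Q logits are assumptions, and answers are represented as hashable values (for example, tuples of cell values).

```python
from collections import Counter
from typing import Hashable, Sequence


def pass_at_k(answers: Sequence[Hashable], truth: Hashable) -> bool:
    """Oracle upper bound: any rollout is correct."""
    return any(answer == truth for answer in answers)


def best_q_at_k(answers: Sequence[Hashable], q_values: Sequence[float], truth: Hashable) -> bool:
    """The rollout with the highest Q logit is correct."""
    best = max(range(len(answers)), key=q_values.__getitem__)
    return answers[best] == truth


def mode_at_k(answers: Sequence[Hashable], truth: Hashable) -> bool:
    """The most frequent answer across rollouts is correct."""
    most_common, _ = Counter(answers).most_common(1)[0]
    return most_common == truth


# Toy puzzle: three rollouts share a wrong answer, one rare rollout is right.
answers = ["wrong", "wrong", "wrong", "right"]
q_values = [-5.0, -4.0, -6.0, 5.5]
metrics = (
    pass_at_k(answers, "right"),
    best_q_at_k(answers, q_values, "right"),
    mode_at_k(answers, "right"),
)
```

In the toy case the correct answer is a minority, so majority voting fails while Q-based selection succeeds, which is the situation the paragraph above describes.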
Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which
consists of running more deep recursions (supervision steps) at inference than the Nsup the model
was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises
PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only
provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth
and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are
independent and parallelizable while extra depth is sequential, width is the more practical scaling
axis.
PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next
section evaluates the method across multiple benchmarks and against several baselines, including
frontier LLMs.
5 Experiments
This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against
the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs.
Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI-2 [4],
PTRM substantially boosts the performance of each pretrained TRM using only inference compute.
[Figure 5 plot: PC 1 (53% var) vs. PC 2 (34% var). Legend: Correct (8), Incorrect (92), Start, End.]
Figure 5: Stochastic rollouts escape bad basins. Principal plane projection of K = 100 independent
rollouts of the same failed puzzle as in Figure 2 (right). 92 rollouts remain caught in the bad basin
(red). 8 escape to a good basin and produce correct answers (green).
[Figure 6 plot: PPBench accuracy (%) against rollouts per puzzle K (log scale; K = 1, 5, 10, 25, 100). Curves: pass@K, best-Q@K, mode@K.]
Figure 6: Width scaling. pass@K, best-Q@K, and mode@K as K grows, averaged over 3 seeds on a
held-out PPBench validation set. The Q head is a strong verifier on the tested puzzles, consistently
outperforming selection of the most frequent answer.
5.1 Setup
Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
We use the validation set to track performance during training and select the final checkpoint. We
report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 datasets.
Models and inference. For each benchmark we use a standard TRM checkpoint. For Sudoku-
Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku),
and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running
D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step.
The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics
are averaged across three seeds.
Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone,
we report standard TRM performance (the same checkpoint as PTRM, run deterministically). For
each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and
ARC-AGI-2 we additionally report the published direct prediction and TRM baselines from [1].
Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s
wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9])
so that cost = $2.50 · tpuzzle /3600, where tpuzzle is the time (in seconds) to complete a puzzle.
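The conversion is a one-line formula; a small sketch makes the units explicit. The function name is hypothetical, and the 1.44-second timing is an illustrative value chosen only because it maps to a round dollar figure, not a measurement from the paper.

```python
H100_DOLLARS_PER_HOUR = 2.50  # hourly rate quoted in the text


def puzzle_cost(t_puzzle_seconds: float, dollars_per_hour: float = H100_DOLLARS_PER_HOUR) -> float:
    """Dollar cost of one puzzle given its wall-clock time in seconds."""
    return dollars_per_hour * t_puzzle_seconds / 3600.0


one_hour = puzzle_cost(3600.0)   # a full GPU-hour
example = puzzle_cost(1.44)      # hypothetical 1.44 s puzzle
```

Under this formula, a per-puzzle cost of about a tenth of a cent corresponds to roughly one and a half seconds of H100 time.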
5.2 Pencil Puzzle Bench
5.2.1 Per-puzzle accuracy
Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2
raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48)
gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes
from scaling width (stochastic rollouts). The largest improvements are on puzzle types where
the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to
97.8% and tapa from 40.0% to 80.0%.
Method (% accuracy)             # Params    sudoku   lightup  nurikabe  heyawake      tapa      agg.
Direct prediction               27M            0.0       0.0       0.0      14.3       0.0       2.0
TRM (K=1, D=16)                 7M            46.7      87.5      74.1      85.7      40.0      62.6
TRM (K=1, D=48)                 7M            57.8      87.5      74.1      85.7      40.0      66.0
PTRM, best-Q@K (K=100, D=16)    7M            93.3       100      88.9      85.7      80.0      89.8
PTRM, best-Q@K (K=100, D=48)    7M            97.8       100      88.9      85.7      80.0      91.2
Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as
the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points
over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises
accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer
trained on the same data.
5.2.2 Comparison with frontier LLMs on golden set
PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response
from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct
and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM
substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively
verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while
it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn
verifier (the LLM could still self-verify by writing verification code within the same response). We
additionally observe that the agentic strategy was applied selectively in the published PPBench data:
across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried
with the agentic strategy. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our
golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high,
claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in
each strategy block.
We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved
if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against
PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under
any strategy, the ensemble counts it as solved, even though in practice we would not have access
to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that
even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble
cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct
divides total cost by the number of puzzles the ensemble solved.
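One possible reading of this ensemble scoring is sketched below. It is an interpretation, not the authors' evaluation code: the data layout, the function name, and the toy numbers are all hypothetical, and cost-per-attempt is taken here as the mean over every (model, puzzle) attempt.

```python
from typing import Dict, Tuple

# results[model][puzzle] = (solved under any strategy, dollar cost of that model's attempts)
Results = Dict[str, Dict[str, Tuple[bool, float]]]


def ensemble_summary(results: Results) -> Tuple[float, float, float]:
    """Return (accuracy, cost per attempt, cost per correct) for an oracle ensemble.

    Assumes every model attempted every puzzle.
    """
    puzzles = sorted({p for per_model in results.values() for p in per_model})
    # Oracle rule: a puzzle counts as solved if any model solved it.
    solved = sum(any(results[m][p][0] for m in results) for p in puzzles)
    costs = [results[m][p][1] for m in results for p in puzzles]
    total_cost = sum(costs)
    accuracy = solved / len(puzzles)
    cost_per_attempt = total_cost / len(costs)
    cost_per_correct = total_cost / solved if solved else float("inf")
    return accuracy, cost_per_attempt, cost_per_correct


# Hypothetical toy data: two models, three puzzles.
toy: Results = {
    "model_a": {"p1": (True, 1.0), "p2": (False, 2.0), "p3": (False, 3.0)},
    "model_b": {"p1": (False, 1.0), "p2": (True, 2.0), "p3": (False, 3.0)},
}
accuracy, per_attempt, per_correct = ensemble_summary(toy)
```

Because every model's spending is charged to the ensemble while only the union of solved puzzles is credited, cost-per-correct grows quickly as more models are added.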
Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 56.5
points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%)
despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for
LLMs than PTRM.
5.3 Sudoku-Extreme, Maze-Hard, and ARC-AGI-2
For each benchmark we use the standard TRM checkpoint trained as described in [1] without
modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2).
Table 3 summarizes results on all three.
On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to
99.06% pass@K and 98.75% best-Q@K, achieving state of the art.
On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain
over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73%
(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows
that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on
the previous benchmarks.
% accuracy                              sudoku   lightup  nurikabe  heyawake      tapa      agg.    $/att.   $/corr.
Direct
  gemini-3.1-pro                           6.7      75.0      22.2       0.0      30.0      24.5     $0.40     $1.62
  gpt-5.2@xhigh                           20.0      50.0       0.0       0.0      50.0      24.5     $1.79     $7.29
  claude-opus-4-6@thinking                 0.0      87.5      44.4       0.0      60.0      34.7     $2.91     $8.40
Any strategy (direct or agentic)†
  gemini-3.1-pro                           6.7      87.5      33.3       0.0      40.0      30.6    $10.38    $33.91
  gpt-5.2@xhigh                           33.3      75.0       0.0       0.0      60.0      34.7     $3.09     $8.90
  claude-opus-4-6@thinking                 0.0      87.5      44.4       0.0      70.0      36.7     $4.38    $11.92
LLM ensemble†
  Any strategy (direct or agentic)        46.7       100      44.4       0.0      80.0      55.1     $2.66    $38.51
Ours, trained from scratch, 7M parameters
  PTRM, best-Q@K                          97.8       100      88.9      85.7      80.0      91.2    $0.001    $0.001
Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt /
per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100
wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate,
and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier.
On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them.
PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the
highest Q value within each augmentation, then vote across augmentations as in the standard pipeline.
With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to
15.97% over our deterministic TRM baseline, while matching it at pass@2.
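The two-stage selection can be sketched as follows. The function name and data layout are hypothetical, and the sketch assumes each rollout's answer has already been mapped back from its augmentation to the original puzzle frame so that answers from different augmentations are comparable.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def select_and_vote(
    rollouts_per_augmentation: Sequence[Sequence[Tuple[Hashable, float]]],
    top_k: int = 2,
) -> List[Hashable]:
    """Pick the highest-Q answer within each augmentation, then vote across augmentations."""
    winners = [
        max(rollouts, key=lambda rollout: rollout[1])[0]  # rollout = (answer, q_logit)
        for rollouts in rollouts_per_augmentation
    ]
    return [answer for answer, _ in Counter(winners).most_common(top_k)]


# Toy example: three augmentations with two stochastic rollouts each.
toy_rollouts = [
    [("A", 0.1), ("B", 2.0)],
    [("B", 1.0), ("C", -1.0)],
    [("C", 3.0), ("A", 0.0)],
]
top_two = select_and_vote(toy_rollouts, top_k=2)
```

Here the per-augmentation winners are B, B, and C, so the vote ranks B first and C second; the returned list plays the role of the top-k predictions scored by pass@k.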
                                             Sudoku-Extreme   Maze-Hard   ARC-AGI-2
Method                           # Params    Acc. (%)         Acc. (%)    pass@1   pass@2   pass@100
HRM                              27M         55.0             74.5        –        5.0      –
TRM                              5M / 7M†    87.4             85.3        –        7.8      –
Ours
Standard TRM, our reproduction   5M / 7M†    87.28            83.80       7.36     9.72     14.31
PTRM                             5M / 7M†    98.75            86.73       8.47     9.72     15.97
Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100,
D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2.
pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM
shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M
for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att).
5.4 Q head selection as σ grows
With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses.
For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises
pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset
we tested, it unlocks correct solutions well beyond the deterministic model’s reach.
TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku-
Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration
translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces
significantly more correct rollouts, but the existing Q head is not able to identify them, leaving
performance on the table. The gap between best-Q@K and pass@K represents headroom for a
stronger verifier, which is left for future work. Appendix B reports the full σ sweep.
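The verifier headroom can be read off directly from the numbers reported in Sec. 5.3. The short sketch below does the subtraction for the two benchmarks where both pass@K and best-Q@K are given; the variable names are illustrative.

```python
# (pass@K, best-Q@K) in percent, copied from Sec. 5.3.
reported = {
    "Sudoku-Extreme": (99.06, 98.75),
    "Maze-Hard": (95.63, 85.17),
}

# Headroom = accuracy an oracle selector would add over Q-head selection.
headroom = {
    name: round(pass_k - best_q, 2)
    for name, (pass_k, best_q) in reported.items()
}
```

The gap is about 0.3 points on Sudoku-Extreme but more than 10 points on Maze-Hard, which quantifies how much a stronger verifier could recover there.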
6 Related Work
A long line of work explores recursive computation for iterative reasoning and representation re-
finement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep
Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation
steps to improve reasoning performance. More recent work has introduced methods to substantially
accelerate TRM training [15], while TRM-style recursive architectures have also been extended to
language modeling tasks [16].
Building on this broader perspective of recursive computation, a growing body of work studies
latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous
“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively
generated and appended to the model context, enabling reasoning directly in latent space without
producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition
and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token
sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate
the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21].
Related to our work, Baek et al. [22] propose a generative version of TRM in which the hidden state
z is sampled rather than computed deterministically. This improves performance on multiple tasks but requires
retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute
method where they only apply noise in the initial hidden state z, while we apply noise at every
supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme
dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that
Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no
improvement in accuracy on two datasets).
Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who
identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates
these attractors through a combination of task-specific training data augmentation, inference-time
input perturbations, and model bootstrapping across training checkpoints, thereby effectively in-
creasing test-time compute. However, these interventions are comparatively less general and less
computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple
puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism:
injecting Gaussian noise into the latent state at each supervision step while using a single deterministic
checkpoint.
7 Conclusion
In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for
Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales
test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth
scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on
task-specific data augmentations, making it extremely easy to use and versatile.
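The width-scaling idea can be illustrated with a toy sketch. It assumes additive Gaussian noise of scale σ is applied to the latent before each of D steps, and that the final answer is taken from the rollout with the highest Q value; the exact placement of the noise is defined in Sec. 4, and `step_fn`, `q_fn`, and `decode_fn` are hypothetical stand-ins for the TRM recursion, Q head, and output head.

```python
import numpy as np

def ptrm_select(step_fn, q_fn, decode_fn, z0, K=8, D=4, sigma=0.3, seed=0):
    """Run K noisy rollouts from the same start and keep the best by Q.

    step_fn:   maps a (K, d) batch of latents to the next (K, d) latents
    q_fn:      maps (K, d) latents to (K,) scores
    decode_fn: maps one (d,) latent to an answer
    """
    rng = np.random.default_rng(seed)
    z = np.repeat(z0[None, :], K, axis=0)  # K identical starting latents
    for _ in range(D):
        # Assumed form of the perturbation: z <- z + sigma * xi, xi ~ N(0, I).
        z = z + sigma * rng.standard_normal(z.shape)
        z = step_fn(z)
    q = q_fn(z)
    best = int(np.argmax(q))
    return decode_fn(z[best]), float(q[best])
```

With sigma=0 all K rollouts coincide and the sketch reduces to a single deterministic rollout, which is why the checkpoint itself needs no retraining.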
By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide
selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, and Tapa puzzles), PTRM
obtains nearly twice the accuracy (91.2%; $0.001 cost) of an ensemble of SOTA LLMs (55.1%; $38.51
cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku-Extreme (from 87.4%
to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI-2 (from 7.8% to 8.47% pass@1).
Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test
on a subset of PPBench puzzles. We are limited to puzzles with small grid sizes due to limited
computational resources. It is not guaranteed that the method works as well for all types of problems
(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller).
Future work. It would be interesting to understand why some puzzles benefit from test-time scaling
more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less
from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones.
Developing stronger verifiers than the existing Q head is an interesting direction for future work.
References
[1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
preprint arXiv:2510.04871, 2025.
[2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
[3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
[4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
2025.
[5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic
analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026.
[6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti-
mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,
2024.
[7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
arXiv:1603.08983, 2016.
[8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv
preprint arXiv:2603.02119, 2026.
[9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026.
Accessed: 2026-05-01.
[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
versal transformers. arXiv preprint arXiv:1807.03819, 2018.
[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum,
and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with
recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021.
[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum,
and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation
without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242,
2022.
[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning
of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems,
37:97027–97052, 2024.
[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641,
2026.
[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie,
Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation.
In Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning,
2026.
[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
Tian. Training large language models to reason in a continuous latent space. arXiv preprint
arXiv:2412.06769, 2024.
[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
Advances in neural information processing systems, 35:24824–24837, 2022.
[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
arXiv:2505.12514, 2025.
[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny
Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring
faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul-
man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t
always say what they think. arXiv preprint arXiv:2505.05410, 2025.
[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive
reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026.
[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search:
Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Thinking –
Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id=kKps9W1K7n.
A Implementation Details
A.1 Compute
We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no
additional training cost over standard TRM since it operates entirely at inference time.
A.2 Models
All experiments use the standard TRM backbone [1] with the released architecture and training recipes.
Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme
and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench.
Layout and hyperparameters are unchanged from TRM.
A.3 PPBench dataset construction
Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM.
The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we
detail how we built our training, validation, and golden splits.
Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types.
Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8].
Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles
from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training
set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types
(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku
and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform
sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline
reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no
headroom to compare against PTRM).
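The padding step can be sketched as below. This is an assumption-laden illustration: the pad token id (`PAD_ID = 0`) and the placement of the 9×9 grid in the top-left corner are hypothetical choices, since the text only states that Sudoku grids are padded to 10×10 to give seq_len = 100.

```python
import numpy as np

PAD_ID = 0  # hypothetical pad token id; the real vocabulary may differ

def pad_to_sequence(grid, size=10, pad_id=PAD_ID):
    """Pad a square grid to size x size and flatten it row by row."""
    grid = np.asarray(grid)
    h, w = grid.shape
    out = np.full((size, size), pad_id, dtype=grid.dtype)
    out[:h, :w] = grid  # assumed placement: top-left corner
    return out.reshape(-1)
```

A 9×9 Sudoku grid then yields a length-100 sequence with 19 pad positions, the same length as a flattened 10×10 puzzle.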
Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1)
trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s
solution trajectory rather than always the empty initial grid, while the label is always the fully solved
grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among
the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the
label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining
9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not
augmented.
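The dihedral augmentation described above can be sketched with NumPy. The indexing convention (k in 0..7, reflection applied when k ≥ 4) and the function names are illustrative assumptions; the key property is that the same transformation is applied to both the input and the label.

```python
import numpy as np

def dihedral(grid, k):
    """Apply one of the 8 dihedral transforms of a square grid.

    k % 4 gives the number of quarter-turn rotations; k >= 4 adds a
    left-right reflection first (an assumed indexing convention).
    """
    g = np.asarray(grid)
    if k >= 4:
        g = np.fliplr(g)
    return np.rot90(g, k % 4)

def augment_pair(inp, label, rng):
    """Draw one random transform and apply it to input and label alike."""
    k = int(rng.integers(8))
    return dihedral(inp, k), dihedral(label, k)
```

Sharing k between input and label keeps each augmented pair consistent, so the transformed label remains the solution of the transformed puzzle.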
Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len =
100. Per-type sample counts are reported in Table 4.
puzzle type      train    val   golden
sudoku           7,810     97       15
lightup          9,504     65        8
nurikabe        15,180     55        9
heyawake        42,108     70        7
tapa             3,663     26       10
shakashaka∗     20,702     62       12
total           98,967    375       61
Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation.
∗ Shakashaka is included in training but excluded from per-puzzle accuracy reporting because
deterministic TRM already solves all evaluated shakashaka puzzles.
B Noise Ablation
We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and
D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the
test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ,
averaged over three random seeds.
[Figure 7 plot. Panels: Sudoku-Extreme, Maze-Hard, ARC-AGI-2 (within-aug). x-axis: σ from 0.0 to 1.0; y-axis: accuracy (%). Series: pass@K, best-Q@K, mode@K, K = 1 baseline.]
Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task,
increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to
a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely
on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region
represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has
the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation
to isolate the Q head’s verification abilities from the augmentation pipeline.
On Maze-Hard, pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then
plateaus. On Sudoku-Extreme, it is already near its ceiling at σ=0.1 and stays roughly flat across the
sweep. On ARC-AGI-2, it peaks near σ=0.6 before declining. Q head selection nearly matches the
ceiling (maximum pass@K) on Sudoku-Extreme, where best-Q@K peaks at 98.5% (within a point of
pass@K’s peak of 99.3%). In contrast, the gap between best-Q@K and maximum pass@K
is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close).
C Q-guided Langevin sampling
We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration
mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic
search by additionally steering each rollout (using the Q head gradient) toward regions of high Q
value. We ultimately found that the gain from this approach was entirely attributable to the Langevin
noise term, with the gradient component contributing nothing measurable on top of the equivalent
recurrent noise of Sec. 4. We document the approach here as a negative result.
Motivation. The Q head is trained as a correctness predictor over latent states. Let fQ (z) denote
the head’s scalar output. We treated E(z) = − log sigmoid(fQ (z)) as an energy function over latent
space. Empirical observations during early experiments suggested that regions of low E correspond
to good basins from which the decoded answer is likely correct. PCA visualizations of the latent
dynamics showed that ∇z fQ points toward the good-basin region from both good-basin (correct) and
bad-basin (incorrect) latents (Figure 8). This made ∇z fQ look like a valuable direction along which
to push latents.
Method. We sample from the target distribution $p(z) \propto e^{-E(z)} = \mathrm{sigmoid}(f_Q(z))$ via Langevin
dynamics, where at the end of each deep recursion step t = 1, . . . , D we apply N Langevin steps to
the latent:
$$z \leftarrow z - \eta \nabla_z E(z) + \sqrt{2\eta}\,\xi, \qquad \xi \sim \mathcal{N}(0, I).$$
The number of Langevin steps N is the additional scaling axis under this scheme.
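The Langevin update above can be sketched in NumPy. This is a toy illustration only: it assumes a linear stand-in f_Q(z) = w·z + b so that the gradient of E(z) = −log sigmoid(f_Q(z)) has a closed form, whereas the actual experiment backpropagates through the Q head. The function name, the `use_gradient` flag (which gives the noise-only variant), and the default step size are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def langevin_steps(z, w, b, eta=0.01, n_steps=4, use_gradient=True, rng=None):
    """Apply n_steps Langevin updates to a (K, d) batch of latents.

    Toy assumption: f_Q(z) = z @ w + b, so
    grad_z E(z) = -(1 - sigmoid(f_Q(z))) * w   for E = -log sigmoid(f_Q).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    for _ in range(n_steps):
        f = z @ w + b
        grad_E = -(1.0 - sigmoid(f))[..., None] * w
        # Drift toward low energy; zeroed for the noise-only ablation.
        drift = -eta * grad_E if use_gradient else 0.0
        z = z + drift + np.sqrt(2.0 * eta) * rng.standard_normal(z.shape)
    return z
```

Setting `use_gradient=False` keeps the same noise scale while removing the drift term, which is the kind of matched comparison needed to separate the effect of the noise from that of the gradient.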
[Figure 8 plot. Panels: t = 0, t = 5, t = 10, t = 15. Legend: Correct (21), Incorrect (4), Q.]
Figure 8: y latents and their ∇z fQ gradients projected into the principal plane at several recur-
sive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts
in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇z fQ . From both
good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization
motivated the Langevin sampling experiment described below.
Tractable gradient computation. TRM’s original Q head is a linear projection on a single token,
fQ (y) = w⊤ y[:, 0]+b, so its gradient with respect to this head’s input is a constant vector independent
of z. For ∇z fQ to be input-dependent, the gradient must flow back through the last latent recursion.
This works but requires backpropagating through a full latent recursion at every Langevin step, which
scales poorly with N . To make guidance tractable for large N , we replaced the linear Q head with
an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear
network. With this head, ∇z fQ can be computed by backpropagating through the head alone, which
is ∼8× faster per step and does not sacrifice accuracy.
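The observation that a linear head has an input-independent gradient can be checked numerically. The sketch below uses a hypothetical single-sample head `linear_q` that reads only token 0 of an (L, d) latent, and a finite-difference gradient; it is an illustration of the argument, not the TRM implementation.

```python
import numpy as np

def linear_q(y, w, b):
    """Linear head on the first token only: f_Q(y) = w . y[0] + b."""
    return y[0] @ w + b

def numerical_grad(f, y, eps=1e-6):
    """Central finite-difference gradient of scalar f at y."""
    g = np.zeros_like(y)
    for idx in np.ndindex(y.shape):
        d = np.zeros_like(y)
        d[idx] = eps
        g[idx] = (f(y + d) - f(y - d)) / (2 * eps)
    return g
```

Evaluating `numerical_grad` at two unrelated latents returns the same array: `w` in the row for token 0 and zeros elsewhere. A gradient that carries no information about the current latent cannot steer different rollouts differently, which is why the gradient must either flow back through the recursion or come from a nonlinear head.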
The gain came from the noise, not the gradient. Comparing Langevin sampling against a noise-
only ablation (with the same $\sqrt{2\eta}\,\xi$ term, but with the −η ∇z E(z) term zeroed out) produced essentially
identical accuracy at matched N . The gradient component contributed nothing measurable on
top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in
Sec. 4, which is preferable since it is: 1) significantly simpler (no retraining, no test-time
backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
D Per-puzzle accuracy on the PPBench validation set
The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct compara-
bility with the LLM evaluations from Waugh [8] who used that set. For a lower-variance complement,
Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for
golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift,
and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K
from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains are again on puzzles
where the deterministic baseline has the most headroom (tapa from 39.7% to 71.8%, sudoku from 68.7%
to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only
marginally.
% accuracy                       # Params   sudoku   lightup   nurikabe   heyawake   tapa   agg.
Direct prediction                27M           0.0      10.0        4.0       14.0    0.0    6.2
TRM (K=1, D=16)                  7M           68.7      83.3       76.0       96.7   39.7   76.4
TRM (K=1, D=48)                  7M           74.0      84.0       76.7       98.0   41.0   78.3
PTRM, best-Q@K (K=100, D=48)     7M           93.3      93.3       84.7      100     71.8   90.4
Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the
deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.