1 files changed, 906 insertions, 0 deletions
diff --git a/papers/txt/ptrm2025_probabilistic_trm.txt b/papers/txt/ptrm2025_probabilistic_trm.txt
new file mode 100644
index 0000000..a00f31c
--- /dev/null
+++ b/papers/txt/ptrm2025_probabilistic_trm.txt
@@ -0,0 +1,906 @@
+                                                                                            Probabilistic Tiny Recursive Model
+
+
+                                                              Amin Sghaier                                                                             Ali Parviz                                              Alexia Jolicoeur-Martineau
+                                                        Mila – Quebec AI Institute                                                              Mila – Quebec AI Institute                                             Independent
+                                                         ILLS & ETS Montreal
+
+                                                                                                             {amin.sghaier, ali.parviz}@mila.quebec
+arXiv:2605.19943v1 [cs.AI] 19 May 2026
+
+
+
+
+                                                                                                            alexia.jolicoeur-martineau@mail.mcgill.ca
+
+
+
+                                                                                                                                                             Abstract
+                                                                 Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of
+                                                                 the parameters of modern large language models (LLMs) by iteratively refining a
+                                                                 latent state and final answer. While powerful, their deterministic recursion can lead
+                                                                 to convergence at suboptimal solutions, without escape mechanism. A common
+                                                                 workaround relies on task-specific input perturbations at test time combined with
+                                                                 answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-
+                                                                 agnostic framework for test-time compute scaling that addresses this limitation
+                                                                 through stochastic exploration. PTRM injects Gaussian noise at each deep recursion
+                                                                 step, enabling parallel trajectories to explore diverse solution basins, and selects
+                                                                 among them using the model’s existing Q head (used for early stopping in the
+                                                                 original TRM). Without requiring retraining or task-specific augmentations, PTRM
+                                                                 enables substantial accuracy gains across benchmarks, including Sudoku-Extreme
+                                                                 (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to
+                                                                 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs
+                                                                 (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
+
+
+
+                                                                                                   PPBench Puzzles
+                                                                               sudoku, lightup, nurikabe, heyawake, and tapa                                                                                    Sudoku-Extreme
+                                                        100                                                                                                                100                                                                             98.75
+                                                                                                                                                              91.2                                                                                  87.4
+                                                         80                                                                                                                 80
+                                         Accuracy (%)
+
+
+
+
+                                                                                                                                                     62.6
+                                                         60                                                                           55.1                                  60                                                                55
+                                                         40                                                         34.7                                                    40
+                                                                                   24.5            24.5
+                                                         20                                                                                                                 20
+                                                                      2                                                                                                               0              0                0              0
+                                                          0                                                                                                                  0
+                                                                 Direct pred
+
+
+
+
+                                                                                                                                                                                                                                Direct pred
+                                                                                                   gemini-3.1-pro
+
+                                                                                                                    claude-opus-4-6
+
+
+
+                                                                                                                                                     TRM
+
+
+
+
+                                                                                                                                                                                 Deepseek R1
+
+
+
+                                                                                                                                                                                                                 o3-mini-high
+
+
+
+                                                                                                                                                                                                                                              HRM
+
+                                                                                                                                                                                                                                                    TRM
+                                                                                   gpt-5.2@xhigh
+
+
+
+
+                                                                                                                                                             PTRM (ours)
+
+
+
+
+                                                                                                                                                                                               Claude 3.7 8K
+
+
+
+
+                                                                                                                                                                                                                                                           PTRM (ours)
+                                                                                                                                      LLM ensemble
+
+
+
+
+                                                              Chain-of-thought, pretrained                                                       Direct prediction                                             Probabilistic recursive prediction (ours)
+                                                              LLM ensemble                                                                       Deterministic recursive prediction
+                                                                                                                          Best of 7 strongest LLMs. Assumes access to a perfect verifier.
+
+                                           Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM
+                                           performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLMs
+                                           by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36
+                                           points. On Sudoku-Extreme, PTRM reaches a state of the art 98.75%.
+1   Introduction
+Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
+orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
+tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
+Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
+models. Rather than autoregressively generating chains of token-level reasoning, they recursively
+refine a latent state. This approach produces a single deterministic answer per input, fitting well with
+tasks where the answer is unique.
+Despite their strong performance, their deterministic inference does not make full use of their
+capabilities. We show that many of TRM’s incorrect answers are from rollouts trapped in bad latent
+space basins (i.e., regions of the latent space which decode to incorrect answers and from which the
+deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work
+on related models [5], suggests that TRM has the capabilities to solve significantly more problems
+but is limited by its standard inference procedure.
+Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This
+is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique
+answer. However, being non-deterministic, LLMs can be randomly sampled in order to form different
+trajectories (including Chains of Thought and actual answer). By then selecting a trajectory using
+a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage
+test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time
+scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic
+decoded answer, and selecting among the answers using the model’s own Q head.
+TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is
+conventionally used only at training time for adaptive computation (ACT) [7]. It carries valuable
+information that the standard inference procedure discards.
+We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a
+new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian
+noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different
+latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head
+is used to select the one most likely to be correct. PTRM requires no training changes and no
+task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy
+gains across diverse reasoning benchmarks.
+
+2   Background: Tiny Recursive Model
+Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a
+question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion
+consists of n updates to the latent state z followed by one update to the predicted answer y, all using
+the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z).
+fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T
+latent recursions in sequence, with only the final one retaining gradients, allowing the model to
+leverage a large effective depth while keeping training efficient.
+Rather than doing one optimization step per sample, TRM is trained via deep supervision, which
+consists in keeping the previous latent state z and answer y as initialization (after being detached from
+the computational graph) for the next supervision step. This is done for up to Nsup supervision steps.
+The loss at each step is calculated using cross entropy between the predicted answer logits fO (y)
+(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively
+refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more
+steps than during training, providing a depth axis for test-time compute scaling (additional steps may
+correct otherwise-incorrect answers).
+Without halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision
+steps rather than being replaced after each one. To avoid wasting compute on already-solved samples,
+an Adaptive Computational Time (ACT) halting mechanism is used. This is done by adding a binary
+cross entropy loss between a halting logit q̂ = fQ (y) (where fQ is a linear Q head) and the binary exact
+
+
+                                                    2
+                                Correct answer           Incorrect answer                          Cell accuracy                       Start         End
+                               Quick success                                      Delayed success                                          Failure
+
+          PC 2 (15% var)
+
+
+
+
+                                                             PC 2 (36% var)
+
+
+
+
+                                                                                                                   PC 2 (8% var)
+                               PC 1 (84% var)                                       PC 1 (58% var)                                      PC 1 (85% var)
+
+
+                                                       1.0                                                   1.0                                                1.0
+          5.0                                              5.0                                                          5
+
+
+
+
+                                                                                                                                                                  Cell accuracy
+          2.5                                          0.9 2.5                                               0.9                                                0.8
+Q value
+
+
+
+
+                                                           0.0                                                          0
+          0.0                                          0.8 2.5                                               0.8
+                                                                                                                                                                0.6
+          2.5                                                                                                           5
+                                                           5.0
+                                                       0.7                                                   0.7
+          5.0
+                           0      5      10       15                          0       5       10        15                         0       5     10        15
+                               Supervision step                                    Supervision step                                    Supervision step
+ Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
+ accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
+ three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
+ projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
+ fade from light (early steps) to dark (later steps). Circle marks the start and square marks end.
+
+
+ correctness of the predicted answer ŷ = arg max fO (y): Lstep = CE(fO (y), ytrue ) + BCE(q̂, 1[ŷ =
+ ytrue ]). The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5,
+ improving data efficiency. During inference, the Q head is not used, and the model performs Nsup
+ supervision steps to maximize answer correctness.
+ While TRM is powerful, it sometimes gets stuck into incorrect solutions. In the next section, we will
+ investigate such failures cases in order to determine a way to remedy them.
+
+
+ 3        Problem: When Does TRM Fail?
+
+ 3.1           Analysis of failures and successes
+
+ We present observations about TRM that motivate our method. In this section, we train a TRM on
+ multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head
+ behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent
+ yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into
+ the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction
+ of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and
+ Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe:
+ Quick success: the trajectory transitions in a few steps from its starting location to a convergence
+ region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima
+ within the same few steps.
+ Delayed success: the trajectory initially oscillates around one region and remains there for multiple
+ supervision steps before sharply escaping to a different region where it converges. During the initial
+
+
+                                                                                          3
+phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell
+accuracy spike together.
+Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never reaches
+near 100%, and the Q value stays negative for all supervision steps.
+We refer to latent space regions that trajectories remain in across multiple supervision steps and
+exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are
+good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave
+similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories,
+when delayed successes find an escape to a good basin while failures remain stuck.
+
+3.2           The Q head tracks trajectory quality
+
+          6                                                                  1.00
+                                                                             0.95               Figure 3: Q value follows cell ac-
+          4
+                                                                             0.90               curacy across reasoning. Mean
+          2
+
+
+
+
+                                                                                Cell accuracy
+                                                  Incorrect (28)             0.85               Q value (solid, left axis) and mean
+Q value
+
+
+
+
+          0                                       Correct (69)               0.80               cell accuracy (dashed, right axis)
+                                                  Cell accuracy (right axis)                    over supervision steps, aggregated
+          2                                                                  0.75
+                                                                             0.70
+                                                                                                over 100 PPBench validation puz-
+          4                                                                                     zles, separated by final correctness
+                                                                             0.65
+          6                                                                                     (green: correct, red: incorrect).
+                                                                             0.60
+               0     2    4     6        8         10      12       14
+                               Supervision step
+Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
+value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
+aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
+correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
+accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
+convergence, the Q logit sharply separates the two populations where q̂ ≈ +6 (sigmoid ≈ 1) for
+correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
+learned indicator of whether a trajectory has reached a good basin.
+Given that the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
+can we leverage the Q head to identify better trajectories? The main challenge is that the standard
+TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
+problem. In the next section, we will show that by simply adding Gaussian noise to the latent state,
+we can sample different parallel trajectories and leverage the Q head to pick the best one.
+
+4         Method: Test-Time Compute Scaling via Stochastic Rollouts
+We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
+stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
+can be readily applied to any pretrained TRM model. Furthermore it requires no task-specific
+augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by
+σ) to the latent state input. The Q head fQ scores each candidate latent output, and the one with the
+highest Q value is selected and then decoded using the model’s output head fO . The algorithm in
+Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories
+to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
+axis for test-time scaling.
+
+4.1           Escaping bad basins
+
+In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
+latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
+for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
+K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
+
+
+                                                                   4
+PTRM Inference                                                                 (a) Standard TRM (deterministic)
+
+ 1: Input: puzzle x, rollouts K,                                                       puzzle                            ···                answer
+ 2: supervision steps D, noise scale σ
+ 3: for k = 1, . . . , K in parallel do                                                             depth axis: D deep recursion steps
+                   (k)   (k)
+ 4:     Initialize z0 , y0                                                     (b) PTRM (ours): K stochastic rollouts + Q-head selection
+ 5:     for t = 1, . . . , D do                                                                      +ϵ         +ϵ                 +ϵ
+                                                                                                                         ···
+
+
+
+
+                                                 width axis: K rollouts
+               (k)
+ 6:           zt−1 += ϵ, ϵ ∼ N (0, σ 2 I)                                                       1    +ϵ         +ϵ                 +ϵ
+                                                                                           k=
+               (k)   (k)            (k)    (k)
+ 7:           zt , yt ← rec(x, zt−1 , yt−1 )                                                                             ···
+                                                                              puzzle                                                             arg maxk Qk
+ 8:     end for                                                                            k=2
+                                                                                                      ·
+                                                                                                      ·
+                                                                                                                 ·
+                                                                                                                 ·
+                                                                                                                                    ·
+                                                                                                                                    ·
+                                (k)
+ 9:     ŷ (k) ← arg max fO (yD )                                                        k=
+                                                                                           K
+                                                                                                      ·
+                                                                                                     +ϵ
+                                                                                                                 ·
+                                                                                                                +ϵ
+                                                                                                                                    ·
+                                                                                                                                   +ϵ
+          (k)            (k)
+10:     q̂ ← fQ (yD )                                                                                                    ···                     final answer
+
+11: end for ∗
+12: return ŷ (k ) , k ∗ = arg maxk q̂ (k)                                             deep recursion step       +ϵ   Gaussian noise injection
+
+
+
+
+Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
+PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
+rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
+
+
+projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
+a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
+observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
+at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
+needed to occasionally find an escape trajectory.
+
+4.2   Width scaling
+
+Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
+number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
+rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
+metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
+Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
+Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
+validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at
+K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between
+pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set.
+By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same
+range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct
+solutions even when they are rare.
+Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which
+consists of running more deep recursions (supervision steps) at inference than the Nsup the model
+was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises
+PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only
+provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth
+and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are
+independent and parallelizable while extra depth is sequential, width is the more practical scaling
+axis.
+PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next
+section evaluates the method across multiple benchmarks and against several baselines, including
+frontier LLMs.
+
+5     Experiments
+This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against
+the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs.
+Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI 2 [4],
+PTRM substantially boosts the performance of each pretrained TRM using only inference compute.
+
+
+                                                                          5
+                                                            Correct (8)
+                 10                                         Incorrect (92)
+                                                            Start
+                  8                                         End                                         92.5       pass@K
+                                                                                                        90.0       best-Q@K
+                  6                                                                                                mode@K
+                                                                                                        87.5
+PC 2 (34% var)
+
+
+
+
+                                                                                 PPBench accuracy (%)
+                  4                                                                                     85.0
+
+                  2
+                                                                                                        82.5
+                                                                                                        80.0
+                  0
+                                                                                                        77.5
+                  2                                                                                     75.0
+                                                                                                        72.5
+                  4                                                                                            1                    5        10           25      100
+                       2.5    0.0   2.5      5.0     7.5   10.0    12.5                                                       Rollouts per puzzle K (log scale)
+                                       PC 1 (53% var)
+
+Figure 5: Stochastic rollouts escape bad                                         Figure 6: Width scaling. pass@K, best-Q@K,
+basins. Principal plane projection of K =                                        and mode@K as K grows, averaged over 3
+100 independent rollouts of the same failed                                      seeds on a held-out PPBench validation set. The
+puzzle as in Figure 2 (right). 92 rollouts                                       Q head is a strong verifier on the tested puzzles,
+remain caught in the bad basin (red). 8                                          consistently outperforming selection of the most
+escape to a good basin and produce correct                                       frequent answer.
+answers (green).
+
+
+ 5.1                  Setup
+
+Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
+(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
+selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
+fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
+and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
+nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
+We use the validation set to track performance during training and select the final checkpoint. We
+report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
+shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
+types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI 2 datasets.
+ Models and inference. For each benchmark we use a standard TRM checkpoint. For Sudoku-
+ Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku),
+ and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running
+ D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step.
+ The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics
+ are averaged across three seeds.
+ Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone,
+ we report standard TRM performance (the same checkpoint as PTRM ran deterministically). For
+ each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and
+ ARC2 we additionally report the published direct prediction and TRM baselines from [1].
+ Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s
+ wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9])
+ so that cost = $2.50 · tpuzzle /3600, where tpuzzle is the time (in seconds) to complete a puzzle.
+
+
+ 5.2                  Pencil Puzzle Bench
+
+ 5.2.1                 Per-puzzle accuracy
+
+ Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2
+ raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48)
+ gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes
+ from scaling width (stochastic rollouts). The largest improvements are on puzzle types where
+
+
+                                                                             6
+the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to
+97.8% and tapa from 40.0% to 80.0%.
+
+  % accuracy                           # Params sudoku lightup nurikabe heyawake          tapa   agg.
+  Direct prediction                      27M        0.0      0.0      0.0       14.3      0.0  2.0
+  TRM (K=1, D=16)                        7M        46.7     87.5     74.1       85.7     40.0 62.6
+  TRM (K=1, D=48)                        7M        57.8     87.5     74.1       85.7     40.0 66.0
+  PTRM, best-Q@K (K=100, D=16)           7M        93.3     100      88.9       85.7     80.0 89.8
+  PTRM, best-Q@K (K=100, D=48)           7M        97.8     100      88.9       85.7     80.0 91.2
+Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as
+the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points
+over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises
+accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer
+trained on the same data.
+
+
+
+5.2.2    Comparison with frontier LLMs on golden set
+PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response
+from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct
+and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM
+substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively
+verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while
+it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn
+verifier (the LLM could still self-verify by writing verification code within the same response). We
+additionally observe that the agentic strategy was applied selectively in the published PPBench data:
+across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried
+with agentic. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our
+golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high,
+claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in
+each strategy block.
+We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved
+if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against
+PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under
+any strategy, the ensemble counts it as solved, even though in practice we would not have access
+to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that
+even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble
+cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct
+divides total cost by the number of puzzles the ensemble solved.
+Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 57
+points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%)
+despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for
+LLMs than PTRM.
+
+5.3     Sudoku-Extreme, Maze-Hard, and ARC-AGI-2
+
+For each benchmark we use the standard TRM checkpoint trained as described in [1] without
+modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2).
+Table 3 summarizes results on all three.
+On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to
+99.06% pass@K and 98.75% best-Q@K, achieving state of the art.
+On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain
+over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73%
+(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows
+that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on
+the previous benchmarks.
+
+
+                                                   7
+ % accuracy                         sudoku lightup nurikabe heyawake               tapa   agg.   $/att.   $/corr.
+                                                      Direct
+ gemini-3.1-pro                        6.7     75.0      22.2         0.0          30.0   24.5   $0.40    $1.62
+ gpt-5.2@xhigh                        20.0     50.0      0.0          0.0          50.0   24.5   $1.79    $7.29
+ claude-opus-4-6@thinking              0.0     87.5      44.4         0.0          60.0   34.7   $2.91    $8.40
+                                         Any strategy (direct or agentic)†
+ gemini-3.1-pro                        6.7     87.5      33.3         0.0          40.0   30.6   $10.38   $33.91
+ gpt-5.2@xhigh                        33.3     75.0      0.0          0.0          60.0   34.7    $3.09    $8.90
+ claude-opus-4-6@thinking              0.0     87.5      44.4         0.0          70.0   36.7    $4.38   $11.92
+                                                 LLM ensemble†
+ Any strategy (direct or agentic)     46.7     100       44.4         0.0          80.0   55.1   $2.66    $38.51
+                                    Ours, trained from scratch, 7M parameters
+ PTRM, best-Q@K                      97.8      100       88.9        85.7          80.0 91.2 $0.001 $0.001
+Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt /
+per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100
+wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate,
+and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier.
+
+
+
+
+On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them.
+PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the
+highest Q value within each augmentation, then vote across augmentations as in the standard pipeline.
+With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to
+15.97% over our deterministic TRM baseline, while matching it at pass@2.
+
+
+                                                  Sudoku-Extreme Maze-Hard        ARC-AGI-2
+ Method                               # Params       Acc. (%)     Acc. (%) pass@1 pass@2 pass@100
+ HRM                                    27M             55.0                74.5           –       5.0       –
+ TRM                                  5M / 7M†          87.4                85.3           –       7.8       –
+ Ours
+ Standard TRM, our reproduction 5M / 7M†               87.28            83.80             7.36    9.72     14.31
+ PTRM                           5M / 7M†               98.75            86.73             8.47    9.72     15.97
+Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100,
+D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2.
+pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM
+shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M
+for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att).
+
+
+5.4   Q head selection as σ grows
+
+With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses.
+For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises
+pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset
+we tested, it unlocks correct solutions well beyond the deterministic model’s reach.
+TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku-
+Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration
+translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces
+significantly more correct rollouts, but the existing Q head is not able to identify them, leaving
+performance on the table. The gap between best-Q@K and pass@K represents headroom for a
+stronger verifier which is left for future work. Appendix B reports the full σ sweep.
+
+
+                                                        8
+6   Related Work
+
+A long line of work explores recursive computation for iterative reasoning and representation re-
+finement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep
+Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation
+steps to improve reasoning performance. More recent work has introduced methods to substantially
+accelerate TRM training [15], while TRM-style recursive architectures have also been extended to
+language modeling tasks [16].
+Building on this broader perspective of recursive computation, a growing body of work studies
+latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous
+“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively
+generated and appended to the model context, enabling reasoning directly in latent space without
+producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition
+and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token
+sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate
+the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21].
+Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state
+z is sampled instead of deterministic. This improves performance on multiple tasks, but requires
+retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute
+method where they only apply noise in the initial hidden state z, while we apply noise at every
+supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme
+dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that
+Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no
+improvement in accuracy on two datasets).
+Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who
+identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates
+these attractors through a combination of task-specific training data augmentation, inference-time
+input perturbations, and model bootstrapping across training checkpoints, thereby effectively in-
+creasing test-time compute. However, these interventions are comparatively less general and less
+computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple
+puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism:
+injecting Gaussian noise into the latent state at each supervision step while using a single deterministic
+checkpoint.
+
+
+7   Conclusion
+
+In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for
+Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales
+test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth
+scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on
+task-specific data augmentations making it extremely easy to use and versatile.
+By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide
+selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM
+nearly obtains twice the accuracy (91.2%; $0.001 cost) of ensemble of SOTA LLMs (55.1%; $38.51
+cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku (from 87.4%
+to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI (from 7.8% to 8.47% pass@1).
+Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test
+on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited
+computational resources. It is not guaranteed that the method works as well for all types of problems
+(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller).
+Future work. It would be interesting to understand why some puzzles benefit from test-time scaling
+more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less
+from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones.
+Developing stronger verifiers than the existing Q head is an interesting direction for future work.
+
+
+                                                    9
+References
+ [1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+     preprint arXiv:2510.04871, 2025.
+ [2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+ [3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+ [4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+     agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+     2025.
+ [5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic
+     analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026.
+ [6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti-
+     mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,
+     2024.
+ [7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+     arXiv:1603.08983, 2016.
+ [8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv
+     preprint arXiv:2603.02119, 2026.
+ [9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026.
+     Accessed: 2026-05-01.
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+     versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum,
+     and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with
+     recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021.
+[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum,
+     and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation
+     without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242,
+     2022.
+[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning
+     of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems,
+     37:97027–97052, 2024.
+[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641,
+     2026.
+[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie,
+     Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation.
+     In Workshop on Latent {\&} Implicit Thinking {\textendash} Going Beyond CoT Reasoning,
+     2026.
+[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+     arXiv:2412.06769, 2024.
+[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+     Advances in neural information processing systems, 35:24824–24837, 2022.
+
+
+                                                 10
+[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+     arXiv:2505.12514, 2025.
+[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny
+     Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring
+     faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
+[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul-
+     man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t
+     always say what they think. arXiv preprint arXiv:2505.05410, 2025.
+[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive
+     reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026.
+[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search:
+     Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Think-
+     ing – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id=
+     kKps9W1K7n.
+
+
+
+
+                                               11
+A     Implementation Details
+
+A.1   Compute
+
+We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no
+additional training cost over standard TRM since it operates entirely at inference time.
+
+A.2   Models
+
+All experiments use the standard TRM backbone [1] with the released architecture and training recipes.
+Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme
+and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench.
+Layout and hyperparameters are unchanged from TRM.
+
+A.3   PPBench dataset construction
+
+Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM.
+The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we
+detail how we built our training, validation, and golden splits.
+
+Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types.
+Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8].
+
+Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles
+from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training
+set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types
+(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku
+and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform
+sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline
+reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no
+headroom to compare against PTRM).
+
+Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1)
+trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s
+solution trajectory rather than always the empty initial grid, while the label is always the fully solved
+grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among
+the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the
+label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining
+9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not
+augmented.
+
+Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len =
+100. Per-type sample counts are reported in Table 4.
+
+                               puzzle type        train    val    golden
+                               sudoku           7,810       97        15
+                               lightup          9,504       65         8
+                               nurikabe        15,180       55         9
+                               heyawake        42,108       70         7
+                               tapa             3,663       26        10
+                               shakashaka∗     20,702       62        12
+                               total           98,967     375         61
+Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation.
+∗
+  Shakashaka is included in training but excluded from per-puzzle accuracy reporting because deter-
+ministic TRM already solves all evaluated shakashaka puzzles.
+
+
+
+                                                   12
+  B                  Noise Ablation
+
+  We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and
+  D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the
+  test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ,
+  averaged over three random seeds.
+
+                                                           pass@K      best-Q@K         mode@K         K = 1 baseline
+
+                              Sudoku-Extreme                                       Maze-Hard                                        ARC-AGI-2 (within-aug)
+               100
+                                                               96                                                 5.5
+               90
+                                                               94
+               80                                                                                                 5.0
+                                                               92
+accuracy (%)
+
+
+
+
+               70                                              90                                                 4.5
+               60                                              88
+               50                                              86                                                 4.0
+               40                                              84
+                                                                                                                  3.5
+               30                                              82
+                 0.0    0.2    0.4    0.6      0.8   1.0         0.0    0.2       0.4   0.6      0.8     1.0            0.0   0.2       0.4     0.6    0.8   1.0
+
+
+  Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task,
+  increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to
+  a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely
+  on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region
+  represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has
+  the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation
+  to isolate the Q head’s verification abilities from the augmentation pipeline.
+
+  On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then
+  plateaus. On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the
+  sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the
+  ceiling (maximum pass@K) on Sudoku-Extreme while best-Q@K peaks at 98.5% (within a point of
+  pass@K’s peak of 99.3%). On the other hand, the gap between best-Q@K and maximum pass@K
+  is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close).
+
+
+  C                  Q-guided Langevin sampling
+
+  We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration
+  mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic
+  search by additionally steering each rollout (using the Q head gradient) toward regions of high Q
+  value. We ultimately found that the gain from this approach was entirely attributable to the Langevin
+  noise term, with the gradient component contributing nothing measurable on top of the equivalent
+  recurrent noise of Sec. 4. We document the approach here as a negative result.
+
+  Motivation. The Q head is trained as a correctness predictor over latent states. Let fQ (z) denote
+  the head’s scalar output. We treated E(z) = − log sigmoid(fQ (z)) as an energy function over latent
+  space. Empirical observations during early experiments suggested that regions of low E correspond
+  to good basins from which the decoded answer is likely correct. PCA visualizations of the latent
+  dynamics showed that ∇z fQ points toward the good-basin region from both good-basin (correct) and
+  bad-basin (incorrect) latents (Figure 8). This made ∇z fQ look like a valuable direction along which
+  to push latents.
+
+  Method. We sample from the target distribution p(z) ∝ e−E(z) = sigmoid(fQ (z)) via Langevin
+  dynamics where at the end of each deep recursion step t = 1, . . . , D we apply N Langevin steps to
+  the latent,
+                                                 p
+                           z ← z − η ∇z E(z) + 2η ξ, ξ ∼ N (0, I),
+  The number of Langevin steps N is the additional scaling axis under this scheme.
+
+
+                                                                                   13
+                    t=0               t=5                       t = 10                    t = 15
+    Correct (21)
+    Incorrect (4)
+      Q
+
+
+
+
+Figure 8: y latents and their ∇z fQ gradients projected into the principal plane at several recur-
+sive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts
+in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇z fQ . From both
+good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization
+motivated the Langevin sampling experiment described below.
+
+
+Tractable gradient computation. TRM’s original Q head is a linear projection on a single token,
+fQ (y) = w⊤ y[:, 0]+b, so its gradient with respect to this head’s input is a constant vector independent
+of z. For ∇z fQ to be input-dependent, the gradient must flow back through the last latent recursion.
+This works but requires backpropagating through a full latent recursion at every Langevin step, which
+scales poorly with N . To make guidance tractable for large N , we replaced the linear Q head with
+an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear
+network. With this head, ∇z fQ can be computed by backpropagating through the head alone, which
+is ∼8× faster per step and does not sacrifice accuracy.
+
+The gain came from the noise,√ not the gradient. Comparing Langevin sampling against a noise-
+only ablation (with the same 2η ξ, but with the −η ∇z E(z) term zeroed out) produced essentially
+identical accuracy at matched N . The gradient component contributed nothing measurable on
+top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in
+Sec. 4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time
+backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
+
+D      Per-puzzle accuracy on the PPBench validation set
+The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct compara-
+bility with the LLM evaluations from Waugh [8] who used that set. For a lower-variance complement,
+Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for
+golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift,
+and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K
+from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains again are on puzzles
+where the deterministic baseline has the most headroom (tapa ∼ 40% to 71.8%, sudoku ∼ 69%
+to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only
+marginally.
+
+  % accuracy                            # Params sudoku lightup nurikabe heyawake           tapa   agg.
+  Direct prediction                         27M      0.0     10.0         4.0     14.0      0.0  6.2
+  TRM (K=1, D=16)                           7M      68.7     83.3        76.0     96.7     39.7 76.4
+  TRM (K=1, D=48)                           7M      74.0     84.0        76.7     98.0     41.0 78.3
+  PTRM, best-Q@K (K=100, D=48)              7M      93.3     93.3        84.7     100      71.8 90.4
+Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the
+deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.
+
+
+
+
+                                                   14
+
+\ No newline at end of file