Diffstat (limited to 'papers/ptrm_2605.19943.txt')
| -rw-r--r-- | papers/ptrm_2605.19943.txt | 1418 |
1 files changed, 1418 insertions, 0 deletions
diff --git a/papers/ptrm_2605.19943.txt b/papers/ptrm_2605.19943.txt
new file mode 100644
index 0000000..49213a0
--- /dev/null
+++ b/papers/ptrm_2605.19943.txt
@@ -0,0 +1,1418 @@
+Probabilistic Tiny Recursive Model
+
+Ali Parviz
+Mila – Quebec AI Institute
+
+Alexia Jolicoeur-Martineau
+Independent
+
+{amin.sghaier, ali.parviz}@mila.quebec
+alexia.jolicoeur-martineau@mail.mcgill.ca
+
+Abstract
+Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of
+the parameters of modern large language models (LLMs) by iteratively refining a
+latent state and final answer. While powerful, their deterministic recursion can
+converge to suboptimal solutions with no escape mechanism. A common
+workaround relies on task-specific input perturbations at test time combined with
+answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic
+framework for test-time compute scaling that addresses this limitation
+through stochastic exploration. PTRM injects Gaussian noise at each deep recursion
+step, enabling parallel trajectories to explore diverse solution basins, and selects
+among them using the model’s existing Q head (used for early stopping in the
+original TRM). Without requiring retraining or task-specific augmentations, PTRM
+enables substantial accuracy gains across benchmarks, including Sudoku-Extreme
+(87.4% to 98.75%) and various puzzles from Pencil Puzzle Bench (62.6% to
+91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs
+(91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
+
+arXiv:2605.19943v1 [cs.AI] 19 May 2026
+
+Amin Sghaier
+Mila – Quebec AI Institute
+ILLS & ETS Montreal
+
+[Figure 1 plot: bar charts of Accuracy (%) in two panels, "PPBench Puzzles" (sudoku, lightup, nurikabe,
+heyawake, and tapa) and "Sudoku-Extreme". Legend: Direct prediction; Deterministic recursive prediction;
+Probabilistic recursive prediction (ours); Chain-of-thought, pretrained; LLM ensemble (best of 7 strongest
+LLMs, assumes access to a perfect verifier).]
+
+Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM
+performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLM
+by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36
+points. On Sudoku-Extreme, PTRM reaches a state-of-the-art 98.75%.
+
+1 Introduction
+
+Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
+orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
+tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
+Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
+models. Rather than autoregressively generating chains of token-level reasoning, they recursively
+refine a latent state. This approach produces a single deterministic answer per input, fitting well with
+tasks where the answer is unique.
+Despite their strong performance, their deterministic inference does not make full use of their
+capabilities. 
We show that many of TRM’s incorrect answers are from rollouts trapped in bad latent +space basins (i.e., regions of the latent space which decode to incorrect answers and from which the +deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work +on related models [5], suggests that TRM has the capabilities to solve significantly more problems +but is limited by its standard inference procedure. +Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This +is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique +answer. However, being non-deterministic, LLMs can be randomly sampled in order to form different +trajectories (including Chains of Thought and actual answer). By then selecting a trajectory using +a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage +test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time +scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic +decoded answer, and selecting among the answers using the model’s own Q head. +TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is +conventionally used only at training time for adaptive computation (ACT) [7]. It carries valuable +information that the standard inference procedure discards. +We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a +new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian +noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different +latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head +is used to select the one most likely to be correct. 
PTRM requires no training changes and no +task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy +gains across diverse reasoning benchmarks. + +2 + +Background: Tiny Recursive Model + +Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a +question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion +consists of n updates to the latent state z followed by one update to the predicted answer y, all using +the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z). +fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T +latent recursions in sequence, with only the final one retaining gradients, allowing the model to +leverage a large effective depth while keeping training efficient. +Rather than doing one optimization step per sample, TRM is trained via deep supervision, which +consists in keeping the previous latent state z and answer y as initialization (after being detached from +the computational graph) for the next supervision step. This is done for up to Nsup supervision steps. +The loss at each step is calculated using cross entropy between the predicted answer logits fO (y) +(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively +refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more +steps than during training, providing a depth axis for test-time compute scaling (additional steps may +correct otherwise-incorrect answers). +Without halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision +steps rather than being replaced after each one. To avoid wasting compute on already-solved samples, +an Adaptive Computational Time (ACT) halting mechanism is used. 
This is done by adding a binary
+cross entropy loss between a halting logit $\hat{q} = f_Q(y)$ (where $f_Q$ is a linear Q head) and the binary exact
+correctness of the predicted answer $\hat{y} = \arg\max f_O(y)$:
+$$L_{\text{step}} = \mathrm{CE}(f_O(y), y_{\text{true}}) + \mathrm{BCE}(\hat{q}, \mathbb{1}[\hat{y} = y_{\text{true}}]).$$
+The Q head thus allows the supervision loop to halt early on samples where $\mathrm{sigmoid}(\hat{q}) > 0.5$,
+improving data efficiency. During inference, the Q head is not used, and the model performs $N_{\text{sup}}$
+supervision steps to maximize answer correctness.
+
+[Figure 2 plots: three columns (Quick success, Delayed success, Failure). Top row: PC 1 vs. PC 2 of the
+latent, with Start/End markers and Correct answer/Incorrect answer legend. Bottom row: Q value and Cell
+accuracy vs. Supervision step.]
+
+Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
+accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
+three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
+projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
+fade from light (early steps) to dark (later steps). A circle marks the start and a square marks the end.
+
+While TRM is powerful, it sometimes gets stuck in incorrect solutions. In the next section, we
+investigate such failure cases in order to determine a way to remedy them.
+
+3 Problem: When Does TRM Fail?
+
+3.1 Analysis of failures and successes
+
+We present observations about TRM that motivate our method. 
In this section, we train a TRM on +multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head +behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent +yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into +the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction +of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and +Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe: +Quick success: the trajectory transitions in a few steps from its starting location to a convergence +region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima +within the same few steps. +Delayed success: the trajectory initially oscillates around one region and remains there for multiple +supervision steps before sharply escaping to a different region where it converges. During the initial +3 + +phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell +accuracy spike together. +Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never reaches +near 100%, and the Q value stays negative for all supervision steps. +We refer to latent space regions that trajectories remain in across multiple supervision steps and +exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are +good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave +similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories, +when delayed successes find an escape to a good basin while failures remain stuck. 
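As a rough illustration of the three modes above, the following sketch labels a trajectory from its per-step Q logits. The function name `classify_trajectory` and the `quick_within` cutoff are hypothetical choices made for this example. The paper describes the modes qualitatively and does not give a numeric threshold, and this sketch treats any trajectory that ends with a non-positive Q logit as a failure.

```python
from typing import Sequence


def classify_trajectory(q_logits: Sequence[float], quick_within: int = 3) -> str:
    """Label a trajectory by when its Q logit turns positive for good.

    q_logits[t] is the Q logit recorded at supervision step t + 1.
    quick_within is an assumed cutoff, not a value taken from the paper.
    """
    # A trajectory that ends with a non-positive Q logit is treated as a failure.
    if not q_logits or q_logits[-1] <= 0:
        return "failure"
    # Walk backwards to find the first step of the final all-positive run.
    first_good = len(q_logits) - 1
    while first_good > 0 and q_logits[first_good - 1] > 0:
        first_good -= 1
    # first_good is a 0-based index, so the step number is first_good + 1.
    if first_good + 1 <= quick_within:
        return "quick success"
    return "delayed success"
```

A delayed success and a failure look the same in their early steps under this rule (both start with negative Q), which mirrors the observation that the two modes only diverge later in the trajectory.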
+3.2 The Q head tracks trajectory quality
+
+[Figure 3 plot: Q value (left axis) and Cell accuracy (right axis) vs. Supervision step. Legend:
+Incorrect (28), Correct (69), Cell accuracy (right axis).]
+
+Figure 3: Q value follows cell accuracy across reasoning. Mean Q value (solid, left axis) and mean
+cell accuracy (dashed, right axis) over supervision steps, aggregated over 100 PPBench validation
+puzzles, separated by final correctness (green: correct, red: incorrect).
+
+Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
+value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
+aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
+correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
+accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
+convergence, the Q logit sharply separates the two populations: q̂ ≈ +6 (sigmoid ≈ 1) for
+correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
+learned indicator of whether a trajectory has reached a good basin.
+Given the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
+can we leverage the Q head to identify better trajectories? The main challenge is that the standard
+TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
+problem. In the next section, we show that by simply adding Gaussian noise to the latent state,
+we can sample different parallel trajectories and leverage the Q head to pick the best one.
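The idea stated at the end of this section (add Gaussian noise to the latent, run several rollouts, keep the one with the highest Q logit) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `ptrm_infer` is a hypothetical name, the zero initialization of z and y is assumed, and `toy_rec`, `toy_q_head`, and `toy_decode` are toy stand-ins for the deep recursion step, the Q head f_Q, and the output head f_O. The toy dynamics have one bad basin near -1 and one good basin near x[0].

```python
import random
from typing import Callable, List, Tuple

Vec = List[float]


def ptrm_infer(
    x: Vec,
    rec: Callable[[Vec, Vec, Vec], Tuple[Vec, Vec]],
    q_head: Callable[[Vec], float],
    decode: Callable[[Vec], List[int]],
    num_rollouts: int,
    num_steps: int,
    sigma: float,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Run K noisy rollouts and return the answer with the highest Q logit."""
    rng = random.Random(seed)
    best_answer: List[int] = []
    best_q = float("-inf")
    for _ in range(num_rollouts):
        # Assumed initialization: all-zero latent and answer states.
        z = [0.0] * len(x)
        y = [0.0] * len(x)
        for _ in range(num_steps):
            # Inject Gaussian noise into the latent before every step.
            z = [zi + rng.gauss(0.0, sigma) for zi in z]
            z, y = rec(x, z, y)
        q = q_head(y)
        if q > best_q:
            best_answer, best_q = decode(y), q
    return best_answer, best_q


# Toy stand-ins (hypothetical, for illustration only).
def toy_rec(x: Vec, z: Vec, y: Vec) -> Tuple[Vec, Vec]:
    """Move halfway toward -1.0 (bad basin) or x[0] (good basin)."""
    target = x[0] if z[0] >= 0.5 else -1.0
    z_new = [z[0] + 0.5 * (target - z[0])]
    return z_new, list(z_new)


def toy_q_head(y: Vec) -> float:
    """Toy correctness logit: about +6 near the good basin at 2.0, about -6 near -1.0."""
    return 6.0 - 4.0 * abs(y[0] - 2.0)


def toy_decode(y: Vec) -> List[int]:
    """Toy decoder: 1 is the correct answer, 0 the incorrect one."""
    return [1 if y[0] > 0.5 else 0]
```

With `sigma=0.0` and one rollout, the toy trajectory slides into the bad basin and decodes the incorrect answer. With noise and many rollouts, some rollouts cross into the good basin, and selecting by Q logit returns one of them.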
+
+4 Method: Test-Time Compute Scaling via Stochastic Rollouts
+
+We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
+stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
+can be readily applied to any pretrained TRM model. Furthermore, it requires no task-specific
+augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by
+σ) to the latent state input. The Q head f_Q scores each candidate latent output, and the one with the
+highest Q value is selected and then decoded using the model’s output head f_O. The algorithm in
+Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories
+to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
+axis for test-time scaling.
+
+PTRM Inference
+  Input: puzzle x, rollouts K, supervision steps D, noise scale σ
+  for k = 1, ..., K in parallel do
+    Initialize z_0^{(k)}, y_0^{(k)}
+    for t = 1, ..., D do
+      z_{t-1}^{(k)} += ϵ,  ϵ ∼ N(0, σ^2 I)
+      z_t^{(k)}, y_t^{(k)} ← rec(x, z_{t-1}^{(k)}, y_{t-1}^{(k)})
+    end for
+    ŷ^{(k)} ← arg max f_O(y_D^{(k)})
+    q̂^{(k)} ← f_Q(y_D^{(k)})
+  end for
+  return ŷ^{(k*)}, k* = arg max_k q̂^{(k)}
+
+[Figure 4 diagram: (a) Standard TRM (deterministic): puzzle, then D deep recursion steps (depth axis),
+then answer. (b) PTRM (ours): K stochastic rollouts (width axis, k = 1, 2, ..., K) with Gaussian noise
+injection (+ϵ) at each deep recursion step, followed by Q-head selection (arg max_k Q_k) of the final answer.]
+
+Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
+PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
+rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
+
+4.1 Escaping bad basins
+
+In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
+latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
+for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
+K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
+projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
+a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
+observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
+at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
+needed to occasionally find an escape trajectory.
+
+4.2 Width scaling
+
+Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
+number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
+rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
+metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
+Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
+Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
+validation set (sudoku, nurikabe, tapa, lightup, and heyawake). 
Both metrics rise from 76.4% at +K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between +pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set. +By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same +range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct +solutions even when they are rare. +Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which +consists of running more deep recursions (supervision steps) at inference than the Nsup the model +was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises +PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only +provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth +and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are +independent and parallelizable while extra depth is sequential, width is the more practical scaling +axis. +PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next +section evaluates the method across multiple benchmarks and against several baselines, including +frontier LLMs. + +5 + +Experiments + +This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against +the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs. +Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI 2 [4], +PTRM substantially boosts the performance of each pretrained TRM using only inference compute. 
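The three width-scaling metrics from Sec. 4.2 (pass@K, best-Q@K, and mode@K) can be computed per puzzle as in the sketch below. The function name `width_metrics` and the use of strings as answer encodings are assumptions made for this example. Any hashable, comparable answer representation would work.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def width_metrics(
    answers: Sequence[Hashable],
    q_logits: Sequence[float],
    truth: Hashable,
) -> Tuple[bool, bool, bool]:
    """Return (pass@K, best-Q@K, mode@K) for one puzzle with K rollouts."""
    # pass@K: oracle metric, true if any rollout produced the correct answer.
    pass_at_k = any(answer == truth for answer in answers)
    # best-Q@K: correctness of the rollout with the highest Q logit.
    best_index = max(range(len(answers)), key=lambda i: q_logits[i])
    best_q_at_k = answers[best_index] == truth
    # mode@K: correctness of the most frequent answer across rollouts.
    mode_answer, _count = Counter(answers).most_common(1)[0]
    mode_at_k = mode_answer == truth
    return pass_at_k, best_q_at_k, mode_at_k
```

In the usage below, only one of four rollouts is correct. Majority voting picks the frequent wrong answer, while Q-based selection recovers the rare correct one, which is the situation the text describes when comparing mode@K with best-Q@K.

```python
answers = ["wrong", "wrong", "right", "wrong"]
q_logits = [-6.0, -5.5, 6.1, -6.0]
print(width_metrics(answers, q_logits, "right"))
```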
+
+[Figure 5 plot: PC 1 (53% var) vs. PC 2 (34% var). Legend: Correct (8), Incorrect (92), Start, End.]
+
+Figure 5: Stochastic rollouts escape bad basins. Principal plane projection of K = 100 independent
+rollouts of the same failed puzzle as in Figure 2 (right). 92 rollouts remain caught in the bad basin
+(red). 8 escape to a good basin and produce correct answers (green).
+
+[Figure 6 plot: PPBench accuracy (%) vs. Rollouts per puzzle K (log scale). Curves: pass@K,
+best-Q@K, mode@K.]
+
+Figure 6: Width scaling. pass@K, best-Q@K, and mode@K as K grows, averaged over 3
+seeds on a held-out PPBench validation set. The Q head is a strong verifier on the tested puzzles,
+consistently outperforming selection of the most frequent answer.
+
+5.1 Setup
+
+Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
+(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
+selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
+fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
+and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
+nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
+We use the validation set to track performance during training and select the final checkpoint. We
+report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
+shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
+types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI 2 datasets.
+Models and inference. For each benchmark we use a standard TRM checkpoint. 
For Sudoku-Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku),
+and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts, each running
+D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step.
+The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics
+are averaged across three seeds.
+Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone,
+we report standard TRM performance (the same checkpoint as PTRM, run deterministically). For
+each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and
+ARC-AGI-2 we additionally report the published direct prediction and TRM baselines from [1].
+Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s
+wall-clock time to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9])
+so that cost = $2.50 · t_puzzle / 3600, where t_puzzle is the time (in seconds) to complete a puzzle.
+
+5.2 Pencil Puzzle Bench
+
+5.2.1 Per-puzzle accuracy
+
+Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2
+raises aggregate accuracy from 62.6% (deterministic TRM) to 91.2% (best-Q@K). Increasing
+supervision depth alone (K=1, D=48) gives a small boost over the standard TRM baseline
+(K=1, D=16). Most of the gain comes from scaling width (stochastic rollouts). The largest
+improvements are on puzzle types where the deterministic baseline performed the worst (most
+headroom): sudoku improves from 46.7% to 97.8% and tapa from 40.0% to 80.0%.
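The cost conversion from Sec. 5.1 is simple arithmetic and can be written out as below. The function names and the example timings are hypothetical. The per-correct definition (total cost divided by the number of solved puzzles) follows the one the paper states for the LLM ensemble and is assumed here to apply to PTRM as well.

```python
from typing import Sequence

H100_DOLLARS_PER_HOUR = 2.50  # rate quoted in the text for a single H100


def cost_per_attempt(t_puzzle_seconds: float,
                     hourly_rate: float = H100_DOLLARS_PER_HOUR) -> float:
    """Dollar cost of one puzzle attempt: hourly_rate * t_puzzle / 3600."""
    return hourly_rate * t_puzzle_seconds / 3600.0


def cost_per_correct(times_seconds: Sequence[float],
                     solved: Sequence[bool],
                     hourly_rate: float = H100_DOLLARS_PER_HOUR) -> float:
    """Total cost over all attempts divided by the number of solved puzzles."""
    num_solved = sum(1 for s in solved if s)
    if num_solved == 0:
        raise ValueError("cost per correct is undefined when nothing is solved")
    total = sum(cost_per_attempt(t, hourly_rate) for t in times_seconds)
    return total / num_solved
```

Under this formula, a puzzle that takes 1.44 seconds of H100 time costs about $0.001. That timing is an illustrative value chosen to match the order of magnitude reported for PTRM, not a measurement from the paper.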
+
+% accuracy                      # Params  sudoku  lightup  nurikabe  heyawake  tapa  agg.
+Direct prediction               27M          0.0      0.0       0.0      14.3   0.0   2.0
+TRM (K=1, D=16)                 7M          46.7     87.5      74.1      85.7  40.0  62.6
+TRM (K=1, D=48)                 7M          57.8     87.5      74.1      85.7  40.0  66.0
+PTRM, best-Q@K (K=100, D=16)    7M          93.3    100        88.9      85.7  80.0  89.8
+PTRM, best-Q@K (K=100, D=48)    7M          97.8    100        88.9      85.7  80.0  91.2
+
+Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as
+the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points
+over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises
+accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer
+trained on the same data.
+
+5.2.2 Comparison with frontier LLMs on golden set
+
+PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response
+from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct
+and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM
+substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively
+verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while
+it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn
+verifier (the LLM could still self-verify by writing verification code within the same response). We
+additionally observe that the agentic strategy was applied selectively in the published PPBench data:
+across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried
+with agentic. 
We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our
+golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high,
+claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in
+each strategy block.
+We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved
+if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against
+PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under
+any strategy, the ensemble counts it as solved, even though in practice we would not have access
+to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that
+even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble
+cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct
+divides total cost by the number of puzzles the ensemble solved.
+Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 56.5
+points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%)
+despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for
+LLMs than for PTRM.
+
+5.3 Sudoku-Extreme, Maze-Hard, and ARC-AGI-2
+
+For each benchmark we use the standard TRM checkpoint trained as described in [1] without
+modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2).
+Table 3 summarizes results on all three.
+On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.28% to
+99.06% pass@K and 98.75% best-Q@K, achieving state of the art.
+On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain
+over the 83.8% deterministic baseline. 
mode@K gives the best PTRM accuracy here at 86.73%
+(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows
+that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on
+the previous benchmarks.
+
+% accuracy                          sudoku  lightup  nurikabe  heyawake  tapa  agg.  $/att.  $/corr.
+Direct
+  gemini-3.1-pro                       6.7     75.0      22.2       0.0  30.0  24.5   $0.40    $1.62
+  gpt-5.2@xhigh                       20.0     50.0       0.0       0.0  50.0  24.5   $1.79    $7.29
+  claude-opus-4-6@thinking             0.0     87.5      44.4       0.0  60.0  34.7   $2.91    $8.40
+Any strategy (direct or agentic)†
+  gemini-3.1-pro                       6.7     87.5      33.3       0.0  40.0  30.6  $10.38   $33.91
+  gpt-5.2@xhigh                       33.3     75.0       0.0       0.0  60.0  34.7   $3.09    $8.90
+  claude-opus-4-6@thinking             0.0     87.5      44.4       0.0  70.0  36.7   $4.38   $11.92
+LLM ensemble†
+  Any strategy (direct or agentic)    46.7    100        44.4       0.0  80.0  55.1   $2.66   $38.51
+Ours, trained from scratch, 7M parameters
+  PTRM, best-Q@K                      97.8    100        88.9      85.7  80.0  91.2  $0.001   $0.001
+
+Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt /
+per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100
+wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate,
+and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier.
+
+On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them.
+PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the
+highest Q value within each augmentation, then vote across augmentations as in the standard pipeline.
+With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to
+15.97% over our deterministic TRM baseline, while matching it at pass@2.
+
+                                             Sudoku-Extreme  Maze-Hard  ARC-AGI-2
+Method                            # Params   Acc. (%)        Acc. (%)   pass@1  pass@2  pass@100
+HRM                               27M        55.0            74.5       –       5.0     –
+TRM                               5M / 7M†   87.4            85.3       –       7.8     –
+Ours
+Standard TRM, our reproduction    5M / 7M†   87.28           83.80      7.36    9.72    14.31
+PTRM                              5M / 7M†   98.75           86.73      8.47    9.72    15.97
+
+Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100,
+D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2.
+pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM
+shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M
+for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att).
+
+5.4 Q head selection as σ grows
+
+With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses.
+For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises
+pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset
+we tested, it unlocks correct solutions well beyond the deterministic model’s reach.
+TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku-Extreme,
+best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration
+translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces
+significantly more correct rollouts, but the existing Q head is not able to identify them, leaving
+performance on the table. The gap between best-Q@K and pass@K represents headroom for a
+stronger verifier, which is left for future work. Appendix B reports the full σ sweep.
+
+6 Related Work
+
+A long line of work explores recursive computation for iterative reasoning and representation refinement. 
Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep +Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation +steps to improve reasoning performance. More recent work has introduced methods to substantially +accelerate TRM training [15], while TRM-style recursive architectures have also been extended to +language modeling tasks [16]. +Building on this broader perspective of recursive computation, a growing body of work studies +latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous +“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively +generated and appended to the model context, enabling reasoning directly in latent space without +producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition +and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token +sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate +the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21]. +Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state +z is sampled instead of deterministic. This improves performance on multiple tasks, but requires +retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute +method where they only apply noise in the initial hidden state z, while we apply noise at every +supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme +dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that +Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no +improvement in accuracy on two datasets). +Our observations in Sec. 
3 are consistent with the mechanistic analysis of Ren and Liu [5], who
+identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates
+these attractors through a combination of task-specific training data augmentation, inference-time
+input perturbations, and model bootstrapping across training checkpoints, thereby effectively
+increasing test-time compute. However, these interventions are comparatively less general and less
+computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple
+puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism:
+injecting Gaussian noise into the latent state at each supervision step while using a single deterministic
+checkpoint.
+
+7 Conclusion
+
+In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for
+Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales
+test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth
+scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on
+task-specific data augmentations, making it easy to use and versatile.
+By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide
+selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM
+obtains nearly twice the accuracy (91.2%; $0.001 cost) of an ensemble of SOTA LLMs (55.1%; $38.51
+cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku-Extreme (from 87.4%
+to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI-2 (from 7.36% to 8.47% pass@1).
+Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test
+on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited
+computational resources. 
It is not guaranteed that the method works as well for all types of problems +(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller). +Future work. It would be interesting to understand why some puzzles benefit from test-time scaling +more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less +from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones. +Developing stronger verifiers than the existing Q head is an interesting direction for future work. + +References +[1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv +preprint arXiv:2510.04871, 2025. +[2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and +Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025. +[3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. +[4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, +2025. +[5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic +analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026. +[6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, +2024. +[7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint +arXiv:1603.08983, 2016. +[8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv +preprint arXiv:2603.02119, 2026. +[9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026. +Accessed: 2026-05-01. +[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. 
arXiv preprint arXiv:1807.03819, 2018. +[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, +Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic +recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025. +[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, +and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with +recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021. +[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, +and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation +without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242, +2022. +[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning +of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems, +37:97027–97052, 2024. +[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641, +2026. +[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie, +Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation. +In Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning, +2026. +[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong +Tian. Training large language models to reason in a continuous latent space. arXiv preprint +arXiv:2412.06769, 2024. +[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, +Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. +Advances in neural information processing systems, 35:24824–24837, 2022. 
+[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning +by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint +arXiv:2505.12514, 2025. +[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny +Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring +faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023. +[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t +always say what they think. arXiv preprint arXiv:2505.05410, 2025. +[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive +reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026. +[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search: +Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id=kKps9W1K7n. + +A Implementation Details + +A.1 Compute + +We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no +additional training cost over standard TRM since it operates entirely at inference time. +A.2 Models + +All experiments use the standard TRM backbone [1] with the released architecture and training recipes. +Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme +and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench. +Layout and hyperparameters are unchanged from TRM. +A.3 PPBench dataset construction + +Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM. 
+The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we +detail how we built our training, validation, and golden splits. +Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types. +Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8]. +Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles +from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training +set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types +(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku +and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform +sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline +reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no +headroom to compare against PTRM). +Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1) +trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s +solution trajectory rather than always the empty initial grid, while the label is always the fully solved +grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among +the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the +label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining +9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not +augmented. +Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len = +100. Per-type sample counts are reported in Table 4. 
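As a rough illustration of the dihedral augmentation described above, the following pure-Python sketch enumerates the 8 transforms (4 rotations × {identity, reflection}) and applies the same randomly chosen transform to both the input grid and the label grid. The function names `dihedral_transform` and `augment_pair`, and the way the eight transforms are indexed, are assumptions made for illustration; they are not taken from the TRM or PTRM code.

```python
import random


def rot90(grid):
    """Rotate a square grid (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def reflect(grid):
    """Mirror a grid left-to-right."""
    return [list(row[::-1]) for row in grid]


def dihedral_transform(grid, k):
    """Apply the k-th dihedral transform, k in 0..7 (hypothetical indexing).

    k % 4 selects the number of quarter turns; k >= 4 adds a reflection.
    """
    if not 0 <= k < 8:
        raise ValueError("k must be in 0..7")
    out = [list(row) for row in grid]  # copy so the input is left untouched
    for _ in range(k % 4):
        out = rot90(out)
    return reflect(out) if k >= 4 else out


def augment_pair(inp, label, rng):
    """Apply one random dihedral transform to both the input and the label."""
    k = rng.randrange(8)
    return dihedral_transform(inp, k), dihedral_transform(label, k)


if __name__ == "__main__":
    puzzle = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    solved = [[v + 100 for v in row] for row in puzzle]
    aug_in, aug_label = augment_pair(puzzle, solved, random.Random(0))
    print(aug_in)
    print(aug_label)
```

Using a single index `k` for both grids is what keeps the (input, label) pair consistent: each cell of the transformed label still corresponds to the same cell of the transformed input.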
+puzzle type     train    val   golden
+sudoku          7,810     97       15
+lightup         9,504     65        8
+nurikabe       15,180     55        9
+heyawake       42,108     70        7
+tapa            3,663     26       10
+shakashaka∗    20,702     62       12
+total          98,967    375       61
+
+Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation. +∗ Shakashaka is included in training but excluded from per-puzzle accuracy reporting because deterministic TRM already solves all evaluated shakashaka puzzles.
+
+B Noise Ablation
+
+We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and +D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the +test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ, +averaged over three random seeds.
+
+[Figure 7 plot: three panels (Sudoku-Extreme, Maze-Hard, ARC-AGI-2 (within-aug)); x-axis: σ from 0.0 to 1.0; y-axis: accuracy (%); curves: pass@K, best-Q@K, mode@K, and the K = 1 baseline.]
+
+Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task, +increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to +a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely +on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region +represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has +the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation +to isolate the Q head’s verification abilities from the augmentation pipeline.
+On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then +plateaus. 
On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the +sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the +ceiling (maximum pass@K) on Sudoku-Extreme, where best-Q@K peaks at 98.5% (within a point of +pass@K’s peak of 99.3%). On the other hand, the gap between best-Q@K and maximum pass@K +is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close). + +C Q-guided Langevin sampling + +We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration +mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic +search by additionally steering each rollout (using the Q head gradient) toward regions of high Q +value. We ultimately found that the gain from this approach was entirely attributable to the Langevin +noise term, with the gradient component contributing nothing measurable on top of the equivalent +recurrent noise of Sec. 4. We document the approach here as a negative result. +Motivation. The Q head is trained as a correctness predictor over latent states. Let $f_Q(z)$ denote +the head’s scalar output. We treated $E(z) = -\log \mathrm{sigmoid}(f_Q(z))$ as an energy function over latent +space. Empirical observations during early experiments suggested that regions of low $E$ correspond +to good basins from which the decoded answer is likely correct. PCA visualizations of the latent +dynamics showed that $\nabla_z f_Q$ points toward the good-basin region from both good-basin (correct) and +bad-basin (incorrect) latents (Figure 8). This made $\nabla_z f_Q$ look like a valuable direction along which +to push latents. +Method. We sample from the target distribution $p(z) \propto e^{-E(z)} = \mathrm{sigmoid}(f_Q(z))$ via Langevin +dynamics where at the end of each deep recursion step t = 1, . . . 
, D we apply N Langevin steps to +the latent,
+$$z \leftarrow z - \eta \nabla_z E(z) + \sqrt{2\eta}\,\xi, \qquad \xi \sim \mathcal{N}(0, I).$$
+The number of Langevin steps N is the additional scaling axis under this scheme.
+
+[Figure 8 plot: four panels at t = 0, t = 5, t = 10, and t = 15; legend: Correct (21), Incorrect (4), Q.]
+
+Figure 8: y latents and their $\nabla_z f_Q$ gradients projected into the principal plane at several recursive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts +in green, incorrect in red). Arrows are drawn at each latent in the direction of $\nabla_z f_Q$. From both +good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization +motivated the Langevin sampling experiment described below.
+Tractable gradient computation. TRM’s original Q head is a linear projection on a single token, +$f_Q(y) = w^\top y[:, 0] + b$, so its gradient with respect to this head’s input is a constant vector independent +of z. For $\nabla_z f_Q$ to be input-dependent, the gradient must flow back through the last latent recursion. +This works but requires backpropagating through a full latent recursion at every Langevin step, which +scales poorly with N. To make guidance tractable for large N, we replaced the linear Q head with +an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear +network. With this head, $\nabla_z f_Q$ can be computed by backpropagating through the head alone, which +is ∼8× faster per step and does not sacrifice accuracy.
+The gain came from the noise, not the gradient. Comparing Langevin sampling against a noise-only ablation (with the same $\sqrt{2\eta}\,\xi$, but with the $-\eta \nabla_z E(z)$ term zeroed out) produced essentially +identical accuracy at matched N. The gradient component contributed nothing measurable on +top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in +Sec. 
4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time +backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
+
+D Per-puzzle accuracy on the PPBench validation set
+
+The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct comparability with the LLM evaluations from Waugh [8], who used that set. For a lower-variance complement, +Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for +golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift, +and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K +from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains again are on puzzles +where the deterministic baseline has the most headroom (tapa ∼ 40% to 71.8%, sudoku ∼ 69% +to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only +marginally.
+
+% accuracy                      # Params  sudoku  lightup  nurikabe  heyawake  tapa  agg.
+Direct prediction               27M          0.0     10.0       4.0      14.0   0.0   6.2
+TRM (K=1, D=16)                 7M          68.7     83.3      76.0      96.7  39.7  76.4
+TRM (K=1, D=48)                 7M          74.0     84.0      76.7      98.0  41.0  78.3
+PTRM, best-Q@K (K=100, D=48)    7M          93.3     93.3      84.7     100    71.8  90.4
+
+Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the +deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.
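The selection metrics named in Table 5 and Appendix B can be sketched as follows. This is a minimal illustration under stated assumptions: pass@K is taken to mean that at least one of the K rollouts is correct, best-Q@K that the rollout with the highest Q-head value is correct, and mode@K that the most frequent answer among the rollouts is correct. These definitions are an interpretation of the metric names, and the function names are hypothetical rather than taken from the PTRM code.

```python
from collections import Counter


def pass_at_k(answers, target):
    """True if any rollout produced the target answer (oracle ceiling)."""
    return any(a == target for a in answers)


def best_q_at_k(answers, q_values, target):
    """True if the rollout with the highest Q value produced the target."""
    best = max(range(len(answers)), key=lambda i: q_values[i])
    return answers[best] == target


def mode_at_k(answers, target):
    """True if the most frequent answer across rollouts is the target.

    Answers must be hashable (e.g. tuples of tokens). Ties are resolved in
    favour of the answer seen first, which is an arbitrary choice.
    """
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common == target


if __name__ == "__main__":
    # Three hypothetical rollouts for one puzzle, with made-up Q values.
    rollouts = [(1, 2), (1, 3), (1, 3)]
    q_values = [0.9, 0.2, 0.3]
    target = (1, 2)
    print(pass_at_k(rollouts, target))
    print(best_q_at_k(rollouts, q_values, target))
    print(mode_at_k(rollouts, target))
```

Under these definitions best-Q@K can never exceed pass@K on the same rollouts, so the difference between the two is the headroom that a stronger verifier could recover.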
