diff options
Diffstat (limited to 'papers/txt/ptrm2025_probabilistic_trm.txt')
| -rw-r--r-- | papers/txt/ptrm2025_probabilistic_trm.txt | 906 |
1 files changed, 906 insertions, 0 deletions
diff --git a/papers/txt/ptrm2025_probabilistic_trm.txt b/papers/txt/ptrm2025_probabilistic_trm.txt new file mode 100644 index 0000000..a00f31c --- /dev/null +++ b/papers/txt/ptrm2025_probabilistic_trm.txt @@ -0,0 +1,906 @@ + Probabilistic Tiny Recursive Model + + + Amin Sghaier Ali Parviz Alexia Jolicoeur-Martineau + Mila – Quebec AI Institute Mila – Quebec AI Institute Independent + ILLS & ETS Montreal + + {amin.sghaier, ali.parviz}@mila.quebec +arXiv:2605.19943v1 [cs.AI] 19 May 2026 + + + + + alexia.jolicoeur-martineau@mail.mcgill.ca + + + + Abstract + Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of + the parameters of modern large language models (LLMs) by iteratively refining a + latent state and final answer. While powerful, their deterministic recursion can lead + to convergence at suboptimal solutions, without escape mechanism. A common + workaround relies on task-specific input perturbations at test time combined with + answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task- + agnostic framework for test-time compute scaling that addresses this limitation + through stochastic exploration. PTRM injects Gaussian noise at each deep recursion + step, enabling parallel trajectories to explore diverse solution basins, and selects + among them using the model’s existing Q head (used for early stopping in the + original TRM). Without requiring retraining or task-specific augmentations, PTRM + enables substantial accuracy gains across benchmarks, including Sudoku-Extreme + (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to + 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs + (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters. + + + + PPBench Puzzles + sudoku, lightup, nurikabe, heyawake, and tapa Sudoku-Extreme + 100 100 98.75 + 91.2 87.4 + 80 80 + Accuracy (%) + + + + + 62.6 + 60 55.1 60 55 + 40 34.7 40 + 24.5 24.5 + 20 20 + 2 0 0 0 0 + 0 0 + Direct pred + + + + + Direct pred + gemini-3.1-pro + + claude-opus-4-6 + + + + TRM + + + + + Deepseek R1 + + + + o3-mini-high + + + + HRM + + TRM + gpt-5.2@xhigh + + + + + PTRM (ours) + + + + + Claude 3.7 8K + + + + + PTRM (ours) + LLM ensemble + + + + + Chain-of-thought, pretrained Direct prediction Probabilistic recursive prediction (ours) + LLM ensemble Deterministic recursive prediction + Best of 7 strongest LLMs. Assumes access to a perfect verifier. + + Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM + performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLMs + by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36 + points. On Sudoku-Extreme, PTRM reaches a state of the art 98.75%. +1 Introduction +Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with +orders of magnitude fewer parameters than the large language models (LLMs) they outperform on +tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning +Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning +models. Rather than autoregressively generating chains of token-level reasoning, they recursively +refine a latent state. This approach produces a single deterministic answer per input, fitting well with +tasks where the answer is unique. +Despite their strong performance, their deterministic inference does not make full use of their +capabilities. We show that many of TRM’s incorrect answers are from rollouts trapped in bad latent +space basins (i.e., regions of the latent space which decode to incorrect answers and from which the +deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work +on related models [5], suggests that TRM has the capabilities to solve significantly more problems +but is limited by its standard inference procedure. +Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This +is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique +answer. However, being non-deterministic, LLMs can be randomly sampled in order to form different +trajectories (including Chains of Thought and actual answer). By then selecting a trajectory using +a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage +test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time +scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic +decoded answer, and selecting among the answers using the model’s own Q head. +TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is +conventionally used only at training time for adaptive computation (ACT) [7]. It carries valuable +information that the standard inference procedure discards. +We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a +new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian +noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different +latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head +is used to select the one most likely to be correct. PTRM requires no training changes and no +task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy +gains across diverse reasoning benchmarks. + +2 Background: Tiny Recursive Model +Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a +question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion +consists of n updates to the latent state z followed by one update to the predicted answer y, all using +the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z). +fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T +latent recursions in sequence, with only the final one retaining gradients, allowing the model to +leverage a large effective depth while keeping training efficient. +Rather than doing one optimization step per sample, TRM is trained via deep supervision, which +consists in keeping the previous latent state z and answer y as initialization (after being detached from +the computational graph) for the next supervision step. This is done for up to Nsup supervision steps. +The loss at each step is calculated using cross entropy between the predicted answer logits fO (y) +(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively +refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more +steps than during training, providing a depth axis for test-time compute scaling (additional steps may +correct otherwise-incorrect answers). +Without halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision +steps rather than being replaced after each one. To avoid wasting compute on already-solved samples, +an Adaptive Computational Time (ACT) halting mechanism is used. This is done by adding a binary +cross entropy loss between a halting logit q̂ = fQ (y) (where fQ is a linear Q head) and the binary exact + + + 2 + Correct answer Incorrect answer Cell accuracy Start End + Quick success Delayed success Failure + + PC 2 (15% var) + + + + + PC 2 (36% var) + + + + + PC 2 (8% var) + PC 1 (84% var) PC 1 (58% var) PC 1 (85% var) + + + 1.0 1.0 1.0 + 5.0 5.0 5 + + + + + Cell accuracy + 2.5 0.9 2.5 0.9 0.8 +Q value + + + + + 0.0 0 + 0.0 0.8 2.5 0.8 + 0.6 + 2.5 5 + 5.0 + 0.7 0.7 + 5.0 + 0 5 10 15 0 5 10 15 0 5 10 15 + Supervision step Supervision step Supervision step + Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell + accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating + three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are + projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories + fade from light (early steps) to dark (later steps). Circle marks the start and square marks end. + + + correctness of the predicted answer ŷ = arg max fO (y): Lstep = CE(fO (y), ytrue ) + BCE(q̂, 1[ŷ = + ytrue ]). The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5, + improving data efficiency. During inference, the Q head is not used, and the model performs Nsup + supervision steps to maximize answer correctness. + While TRM is powerful, it sometimes gets stuck into incorrect solutions. In the next section, we will + investigate such failures cases in order to determine a way to remedy them. + + + 3 Problem: When Does TRM Fail? + + 3.1 Analysis of failures and successes + + We present observations about TRM that motivate our method. In this section, we train a TRM on + multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head + behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent + yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into + the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction + of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and + Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe: + Quick success: the trajectory transitions in a few steps from its starting location to a convergence + region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima + within the same few steps. + Delayed success: the trajectory initially oscillates around one region and remains there for multiple + supervision steps before sharply escaping to a different region where it converges. During the initial + + + 3 +phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell +accuracy spike together. +Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never reaches +near 100%, and the Q value stays negative for all supervision steps. +We refer to latent space regions that trajectories remain in across multiple supervision steps and +exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are +good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave +similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories, +when delayed successes find an escape to a good basin while failures remain stuck. + +3.2 The Q head tracks trajectory quality + + 6 1.00 + 0.95 Figure 3: Q value follows cell ac- + 4 + 0.90 curacy across reasoning. Mean + 2 + + + + + Cell accuracy + Incorrect (28) 0.85 Q value (solid, left axis) and mean +Q value + + + + + 0 Correct (69) 0.80 cell accuracy (dashed, right axis) + Cell accuracy (right axis) over supervision steps, aggregated + 2 0.75 + 0.70 + over 100 PPBench validation puz- + 4 zles, separated by final correctness + 0.65 + 6 (green: correct, red: incorrect). + 0.60 + 0 2 4 6 8 10 12 14 + Supervision step +Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s +value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3 +aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer +correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell +accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at +convergence, the Q logit sharply separates the two populations where q̂ ≈ +6 (sigmoid ≈ 1) for +correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable +learned indicator of whether a trajectory has reached a good basin. +Given that the Q head’s ability to distinguish good from bad trajectories, a natural question follows: +can we leverage the Q head to identify better trajectories? The main challenge is that the standard +TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given +problem. In the next section, we will show that by simply adding Gaussian noise to the latent state, +we can sample different parallel trajectories and leverage the Q head to pick the best one. + +4 Method: Test-Time Compute Scaling via Stochastic Rollouts +We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion +stochastic and selects the best of K resulting trajectories. PTRM requires no special training and +can be readily applied to any pretrained TRM model. Furthermore it requires no task-specific +augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by +σ) to the latent state input. The Q head fQ scores each candidate latent output, and the one with the +highest Q value is selected and then decoded using the model’s output head fO . The algorithm in +Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories +to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new +axis for test-time scaling. + +4.1 Escaping bad basins + +In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in +latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough +for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows +K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1), + + + 4 +PTRM Inference (a) Standard TRM (deterministic) + + 1: Input: puzzle x, rollouts K, puzzle ··· answer + 2: supervision steps D, noise scale σ + 3: for k = 1, . . . , K in parallel do depth axis: D deep recursion steps + (k) (k) + 4: Initialize z0 , y0 (b) PTRM (ours): K stochastic rollouts + Q-head selection + 5: for t = 1, . . . , D do +ϵ +ϵ +ϵ + ··· + + + + + width axis: K rollouts + (k) + 6: zt−1 += ϵ, ϵ ∼ N (0, σ 2 I) 1 +ϵ +ϵ +ϵ + k= + (k) (k) (k) (k) + 7: zt , yt ← rec(x, zt−1 , yt−1 ) ··· + puzzle arg maxk Qk + 8: end for k=2 + · + · + · + · + · + · + (k) + 9: ŷ (k) ← arg max fO (yD ) k= + K + · + +ϵ + · + +ϵ + · + +ϵ + (k) (k) +10: q̂ ← fQ (yD ) ··· final answer + +11: end for ∗ +12: return ŷ (k ) , k ∗ = arg maxk q̂ (k) deep recursion step +ϵ Gaussian noise injection + + + + +Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right: +PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent +rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer. + + +projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while +a minority (8%) escape to a distinct region in latent space and produce correct answers. We also +observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape, +at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity +needed to occasionally find an escape trajectory. + +4.2 Width scaling + +Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the +number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any +rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a +metric available at inference without a correctness oracle. The choice of Q as selector is motivated by +Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3). +Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench +validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at +K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between +pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set. +By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same +range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct +solutions even when they are rare. +Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which +consists of running more deep recursions (supervision steps) at inference than the Nsup the model +was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises +PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only +provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth +and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are +independent and parallelizable while extra depth is sequential, width is the more practical scaling +axis. +PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next +section evaluates the method across multiple benchmarks and against several baselines, including +frontier LLMs. + +5 Experiments +This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against +the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs. +Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI 2 [4], +PTRM substantially boosts the performance of each pretrained TRM using only inference compute. + + + 5 + Correct (8) + 10 Incorrect (92) + Start + 8 End 92.5 pass@K + 90.0 best-Q@K + 6 mode@K + 87.5 +PC 2 (34% var) + + + + + PPBench accuracy (%) + 4 85.0 + + 2 + 82.5 + 80.0 + 0 + 77.5 + 2 75.0 + 72.5 + 4 1 5 10 25 100 + 2.5 0.0 2.5 5.0 7.5 10.0 12.5 Rollouts per puzzle K (log scale) + PC 1 (53% var) + +Figure 5: Stochastic rollouts escape bad Figure 6: Width scaling. pass@K, best-Q@K, +basins. Principal plane projection of K = and mode@K as K grows, averaged over 3 +100 independent rollouts of the same failed seeds on a held-out PPBench validation set. The +puzzle as in Figure 2 (right). 92 rollouts Q head is a strong verifier on the tested puzzles, +remain caught in the bad basin (red). 8 consistently outperforming selection of the most +escape to a good basin and produce correct frequent answer. +answers (green). + + + 5.1 Setup + +Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles +(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types) +selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a +fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size), +and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup, +nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest. +We use the validation set to track performance during training and select the final checkpoint. We +report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on +shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across +types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI 2 datasets. + Models and inference. For each benchmark we use a standard TRM checkpoint. For Sudoku- + Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku), + and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running + D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step. + The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics + are averaged across three seeds. + Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone, + we report standard TRM performance (the same checkpoint as PTRM ran deterministically). For + each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and + ARC2 we additionally report the published direct prediction and TRM baselines from [1]. + Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s + wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9]) + so that cost = $2.50 · tpuzzle /3600, where tpuzzle is the time (in seconds) to complete a puzzle. + + + 5.2 Pencil Puzzle Bench + + 5.2.1 Per-puzzle accuracy + + Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2 + raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48) + gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes + from scaling width (stochastic rollouts). The largest improvements are on puzzle types where + + + 6 +the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to +97.8% and tapa from 40.0% to 80.0%. + + % accuracy # Params sudoku lightup nurikabe heyawake tapa agg. + Direct prediction 27M 0.0 0.0 0.0 14.3 0.0 2.0 + TRM (K=1, D=16) 7M 46.7 87.5 74.1 85.7 40.0 62.6 + TRM (K=1, D=48) 7M 57.8 87.5 74.1 85.7 40.0 66.0 + PTRM, best-Q@K (K=100, D=16) 7M 93.3 100 88.9 85.7 80.0 89.8 + PTRM, best-Q@K (K=100, D=48) 7M 97.8 100 88.9 85.7 80.0 91.2 +Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as +the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points +over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises +accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer +trained on the same data. + + + +5.2.2 Comparison with frontier LLMs on golden set +PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response +from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct +and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM +substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively +verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while +it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn +verifier (the LLM could still self-verify by writing verification code within the same response). We +additionally observe that the agentic strategy was applied selectively in the published PPBench data: +across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried +with agentic. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our +golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high, +claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in +each strategy block. +We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved +if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against +PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under +any strategy, the ensemble counts it as solved, even though in practice we would not have access +to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that +even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble +cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct +divides total cost by the number of puzzles the ensemble solved. +Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 57 +points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%) +despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for +LLMs than PTRM. + +5.3 Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 + +For each benchmark we use the standard TRM checkpoint trained as described in [1] without +modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2). +Table 3 summarizes results on all three. +On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to +99.06% pass@K and 98.75% best-Q@K, achieving state of the art. +On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain +over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73% +(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows +that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on +the previous benchmarks. + + + 7 + % accuracy sudoku lightup nurikabe heyawake tapa agg. $/att. $/corr. + Direct + gemini-3.1-pro 6.7 75.0 22.2 0.0 30.0 24.5 $0.40 $1.62 + gpt-5.2@xhigh 20.0 50.0 0.0 0.0 50.0 24.5 $1.79 $7.29 + claude-opus-4-6@thinking 0.0 87.5 44.4 0.0 60.0 34.7 $2.91 $8.40 + Any strategy (direct or agentic)† + gemini-3.1-pro 6.7 87.5 33.3 0.0 40.0 30.6 $10.38 $33.91 + gpt-5.2@xhigh 33.3 75.0 0.0 0.0 60.0 34.7 $3.09 $8.90 + claude-opus-4-6@thinking 0.0 87.5 44.4 0.0 70.0 36.7 $4.38 $11.92 + LLM ensemble† + Any strategy (direct or agentic) 46.7 100 44.4 0.0 80.0 55.1 $2.66 $38.51 + Ours, trained from scratch, 7M parameters + PTRM, best-Q@K 97.8 100 88.9 85.7 80.0 91.2 $0.001 $0.001 +Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt / +per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100 +wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate, +and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier. + + + + +On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them. +PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the +highest Q value within each augmentation, then vote across augmentations as in the standard pipeline. +With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to +15.97% over our deterministic TRM baseline, while matching it at pass@2. + + + Sudoku-Extreme Maze-Hard ARC-AGI-2 + Method # Params Acc. (%) Acc. (%) pass@1 pass@2 pass@100 + HRM 27M 55.0 74.5 – 5.0 – + TRM 5M / 7M† 87.4 85.3 – 7.8 – + Ours + Standard TRM, our reproduction 5M / 7M† 87.28 83.80 7.36 9.72 14.31 + PTRM 5M / 7M† 98.75 86.73 8.47 9.72 15.97 +Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100, +D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2. +pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM +shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M +for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att). + + +5.4 Q head selection as σ grows + +With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses. +For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises +pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset +we tested, it unlocks correct solutions well beyond the deterministic model’s reach. +TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku- +Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration +translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces +significantly more correct rollouts, but the existing Q head is not able to identify them, leaving +performance on the table. The gap between best-Q@K and pass@K represents headroom for a +stronger verifier which is left for future work. Appendix B reports the full σ sweep. + + + 8 +6 Related Work + +A long line of work explores recursive computation for iterative reasoning and representation re- +finement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep +Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation +steps to improve reasoning performance. More recent work has introduced methods to substantially +accelerate TRM training [15], while TRM-style recursive architectures have also been extended to +language modeling tasks [16]. +Building on this broader perspective of recursive computation, a growing body of work studies +latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous +“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively +generated and appended to the model context, enabling reasoning directly in latent space without +producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition +and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token +sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate +the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21]. +Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state +z is sampled instead of deterministic. This improves performance on multiple tasks, but requires +retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute +method where they only apply noise in the initial hidden state z, while we apply noise at every +supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme +dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that +Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no +improvement in accuracy on two datasets). +Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who +identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates +these attractors through a combination of task-specific training data augmentation, inference-time +input perturbations, and model bootstrapping across training checkpoints, thereby effectively in- +creasing test-time compute. However, these interventions are comparatively less general and less +computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple +puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism: +injecting Gaussian noise into the latent state at each supervision step while using a single deterministic +checkpoint. + + +7 Conclusion + +In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for +Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales +test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth +scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on +task-specific data augmentations making it extremely easy to use and versatile. +By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide +selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM +nearly obtains twice the accuracy (91.2%; $0.001 cost) of ensemble of SOTA LLMs (55.1%; $38.51 +cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku (from 87.4% +to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI (from 7.8% to 8.47% pass@1). +Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test +on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited +computational resources. It is not guaranteed that the method works as well for all types of problems +(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller). +Future work. It would be interesting to understand why some puzzles benefit from test-time scaling +more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less +from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones. +Developing stronger verifiers than the existing Q head is an interesting direction for future work. + + + 9 +References + [1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv + preprint arXiv:2510.04871, 2025. + [2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and + Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025. + [3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. + [4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. + [5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic + analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026. + [6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- + mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, + 2024. + [7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint + arXiv:1603.08983, 2016. + [8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv + preprint arXiv:2603.02119, 2026. + [9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026. + Accessed: 2026-05-01. +[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- + versal transformers. arXiv preprint arXiv:1807.03819, 2018. +[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, + Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic + recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025. +[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, + and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with + recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021. +[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, + and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation + without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242, + 2022. +[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning + of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems, + 37:97027–97052, 2024. +[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641, + 2026. +[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie, + Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation. + In Workshop on Latent {\&} Implicit Thinking {\textendash} Going Beyond CoT Reasoning, + 2026. +[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong + Tian. Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.06769, 2024. +[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, + Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. + Advances in neural information processing systems, 35:24824–24837, 2022. + + + 10 +[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning + by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint + arXiv:2505.12514, 2025. +[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny + Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring + faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023. +[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- + man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t + always say what they think. arXiv preprint arXiv:2505.05410, 2025. +[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive + reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026. +[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search: + Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Think- + ing – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id= + kKps9W1K7n. + + + + + 11 +A Implementation Details + +A.1 Compute + +We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no +additional training cost over standard TRM since it operates entirely at inference time. + +A.2 Models + +All experiments use the standard TRM backbone [1] with the released architecture and training recipes. +Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme +and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench. +Layout and hyperparameters are unchanged from TRM. + +A.3 PPBench dataset construction + +Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM. +The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we +detail how we built our training, validation, and golden splits. + +Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types. +Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8]. + +Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles +from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training +set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types +(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku +and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform +sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline +reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no +headroom to compare against PTRM). + +Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1) +trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s +solution trajectory rather than always the empty initial grid, while the label is always the fully solved +grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among +the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the +label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining +9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not +augmented. + +Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len = +100. Per-type sample counts are reported in Table 4. + + puzzle type train val golden + sudoku 7,810 97 15 + lightup 9,504 65 8 + nurikabe 15,180 55 9 + heyawake 42,108 70 7 + tapa 3,663 26 10 + shakashaka∗ 20,702 62 12 + total 98,967 375 61 +Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation. +∗ + Shakashaka is included in training but excluded from per-puzzle accuracy reporting because deter- +ministic TRM already solves all evaluated shakashaka puzzles. + + + + 12 + B Noise Ablation + + We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and + D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the + test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ, + averaged over three random seeds. + + pass@K best-Q@K mode@K K = 1 baseline + + Sudoku-Extreme Maze-Hard ARC-AGI-2 (within-aug) + 100 + 96 5.5 + 90 + 94 + 80 5.0 + 92 +accuracy (%) + + + + + 70 90 4.5 + 60 88 + 50 86 4.0 + 40 84 + 3.5 + 30 82 + 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 + + + Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task, + increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to + a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely + on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region + represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has + the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation + to isolate the Q head’s verification abilities from the augmentation pipeline. + + On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then + plateaus. On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the + sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the + ceiling (maximum pass@K) on Sudoku-Extreme while best-Q@K peaks at 98.5% (within a point of + pass@K’s peak of 99.3%). On the other hand, the gap between best-Q@K and maximum pass@K + is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close). + + + C Q-guided Langevin sampling + + We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration + mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic + search by additionally steering each rollout (using the Q head gradient) toward regions of high Q + value. We ultimately found that the gain from this approach was entirely attributable to the Langevin + noise term, with the gradient component contributing nothing measurable on top of the equivalent + recurrent noise of Sec. 4. We document the approach here as a negative result. + + Motivation. The Q head is trained as a correctness predictor over latent states. Let fQ (z) denote + the head’s scalar output. We treated E(z) = − log sigmoid(fQ (z)) as an energy function over latent + space. Empirical observations during early experiments suggested that regions of low E correspond + to good basins from which the decoded answer is likely correct. PCA visualizations of the latent + dynamics showed that ∇z fQ points toward the good-basin region from both good-basin (correct) and + bad-basin (incorrect) latents (Figure 8). This made ∇z fQ look like a valuable direction along which + to push latents. + + Method. We sample from the target distribution p(z) ∝ e−E(z) = sigmoid(fQ (z)) via Langevin + dynamics where at the end of each deep recursion step t = 1, . . . , D we apply N Langevin steps to + the latent, + p + z ← z − η ∇z E(z) + 2η ξ, ξ ∼ N (0, I), + The number of Langevin steps N is the additional scaling axis under this scheme. + + + 13 + t=0 t=5 t = 10 t = 15 + Correct (21) + Incorrect (4) + Q + + + + +Figure 8: y latents and their ∇z fQ gradients projected into the principal plane at several recur- +sive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts +in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇z fQ . From both +good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization +motivated the Langevin sampling experiment described below. + + +Tractable gradient computation. TRM’s original Q head is a linear projection on a single token, +fQ (y) = w⊤ y[:, 0]+b, so its gradient with respect to this head’s input is a constant vector independent +of z. For ∇z fQ to be input-dependent, the gradient must flow back through the last latent recursion. +This works but requires backpropagating through a full latent recursion at every Langevin step, which +scales poorly with N . To make guidance tractable for large N , we replaced the linear Q head with +an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear +network. With this head, ∇z fQ can be computed by backpropagating through the head alone, which +is ∼8× faster per step and does not sacrifice accuracy. + +The gain came from the noise,√ not the gradient. Comparing Langevin sampling against a noise- +only ablation (with the same 2η ξ, but with the −η ∇z E(z) term zeroed out) produced essentially +identical accuracy at matched N . The gradient component contributed nothing measurable on +top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in +Sec. 4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time +backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective. + +D Per-puzzle accuracy on the PPBench validation set +The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct compara- +bility with the LLM evaluations from Waugh [8] who used that set. For a lower-variance complement, +Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for +golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift, +and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K +from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains again are on puzzles +where the deterministic baseline has the most headroom (tapa ∼ 40% to 71.8%, sudoku ∼ 69% +to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only +marginally. + + % accuracy # Params sudoku lightup nurikabe heyawake tapa agg. + Direct prediction 27M 0.0 10.0 4.0 14.0 0.0 6.2 + TRM (K=1, D=16) 7M 68.7 83.3 76.0 96.7 39.7 76.4 + TRM (K=1, D=48) 7M 74.0 84.0 76.7 98.0 41.0 78.3 + PTRM, best-Q@K (K=100, D=48) 7M 93.3 93.3 84.7 100 71.8 90.4 +Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the +deterministic TRM. Results on the larger validation set follow the same trends as on the golden set. + + + + + 14 +
\ No newline at end of file |
