Generative Recursive Reasoning


Junyeob Baek1†∗   Mingyu Jo1†∗   Minsu Kim1,2   Mengye Ren3   Yoshua Bengio2,4   Sungjin Ahn1,3†

1 KAIST   2 Mila – Québec AI Institute   3 New York University   4 Université de Montréal

arXiv:2605.19376v2 [cs.AI] 20 May 2026


                                                                                          Abstract
                                                     How should future neural reasoning systems implement extended computation?
                                                     Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
                                                     sive sequence extension by performing iterative latent-state refinement with shared
                                                     transition functions. Yet existing RRMs are largely deterministic, following a
                                                     single latent trajectory and converging to a single prediction. We introduce Gen-
                                                     erative Recursive reAsoning Models (GRAM), a framework that turns recursive
                                                     latent reasoning into probabilistic multi-trajectory computation. GRAM models
                                                     reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
                                                     tive solution strategies, and inference-time scaling through both recursive depth
                                                     and parallel trajectory sampling. This yields a latent-variable generative model
                                                     supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
                                                     unconditional generation via pθ (x). Trained with amortized variational inference,
                                                     GRAM improves over deterministic recurrent and recursive baselines on structured
                                                     reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
                                                     unconditional generation capability. https://ahn-ml.github.io/gram-website


                                         1     Introduction
                                         A central question for future neural reasoning systems is how extended computation should be imple-
                                         mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
                                         process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
                                         plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
                                         Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
                                         than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
                                         because it decouples reasoning depth from both parameter scale and output length: a compact model
                                         can perform many steps of internal computation by repeatedly applying shared transition functions
                                         over time.
                                         Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
                                         potential of this approach in structured reasoning. Rather than producing a solution in a single
                                         feedforward pass, they perform extended computation through iterative latent-state refinement, deep
                                         supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
                                         latent dynamics. These features make them well suited to problems requiring constraint propagation,
                                         state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
                                             ∗ Equal contribution
                                            † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn

                                         (sungjin.ahn@kaist.ac.kr)


                                         Preprint.
[Figure 1 panels. Left: an input task with two valid solutions (Solution 1, Solution 2). Right: (a) Deterministic RRMs, (b) GRAM (Ours).]

Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions.
Right: Given three independent runs for latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g. HRM, TRM) are
deterministic—all runs collapse to an identical trajectory, converging to a single solution and failing to explore
alternatives, while (b) GRAM explores diverse trajectories that reach multiple valid solutions y1 and y2 ,
naturally enabling parallel inference-time scaling.

also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
depth without increasing parameter count. Together, these models suggest that reasoning capability
can emerge not only from scaling model size or generating longer traces, but also from the organization
of computation itself.
While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
exploring multiple latent trajectories in parallel.
Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
motivates the central question of our work: can recursive latent computation support probabilistic,
generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
conditioned on the input and the current reasoning state, rather than deterministically updating to a
single next state. Repeating this process defines a distribution over possible reasoning trajectories,
allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
marginalizing over latent reasoning trajectories, while the same recursive process can also define an
unconditional generative model pθ (x) when the input is fixed or absent.
We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
MNIST [15] probes the unconditional generative interpretation.
Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.


First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
inference-time scaling, enabling inference to scale not only with recursive depth but also with the
number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
yields the intended architectural advantages over deterministic recurrent and recursive baselines,
improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.

2     Generative Recursive Reasoning Models
In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
procedure in Section 2.2, with an architecture schematic shown in Figure 2.

2.1   Architecture

Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic latent
reasoning trajectories. Given an input x, GRAM first computes an embedding
                ex = fenc (x; θ),                       (1)
which is reused throughout the entire recursive computation. Starting from a fixed initial latent state z0 ,
the model evolves the latent state through learned stochastic transitions. The recursive computation is
organized into two nested levels: inner and outer loops.

[Figure 2 schematic: the encoder fenc maps x to the embedding; K low-level updates fL refine lt−1 into lt ; the high-level update fH maps ht−1 to ut ; the prior pθ (· | ut ) and the posterior qϕ (· | ut , y) over ϵt are matched by (B) a KL divergence; the decoder fdec reads ht and is trained against y with (A) a CE loss.]
Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation z =
(h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic proposal ut , to
which stochastic guidance ϵt is added: ht = ut + ϵt .

At the inner level, a latent transition samples a new latent state conditioned on the previous latent state
and the input embedding,
                                  zt ∼ pθ (zt | zt−1 , ex ),              t = 1, . . . , T.                                                   (2)
At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
gradients computed as described in Section 2.2.
At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
step serving as the initial state of the next, thereby forming the full recursive computation:
    $$ z_0^{(1)} \xrightarrow{T\ \text{transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T\ \text{transitions}} \cdots \xrightarrow{T\ \text{transitions}} z_T^{(N_{\mathrm{sup}})}, \tag{3} $$
where $z_t^{(n)}$ denotes the latent state at the t-th transition of the n-th supervision step, $z_0^{(1)}$ is the
fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
($z_0^{(n+1)} := z_T^{(n)}$). This abstract formulation can be instantiated with various recurrent Transformer
backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
as well as hierarchical designs such as HRM and TRM [8, 9].
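The nested loop structure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_recursion`, `toy_transition`, and `toy_decode` are hypothetical names, and the toy transition (a noisy pull toward the input embedding) merely stands in for the learned transition pθ (zt | zt−1 , ex ).

```python
import random
from typing import Callable, List, Tuple

State = List[float]


def toy_transition(z: State, e_x: State, rng: random.Random) -> State:
    """Hypothetical stand-in for one stochastic latent transition, Eq. (2)."""
    return [0.5 * zi + 0.5 * ei + rng.gauss(0.0, 0.1) for zi, ei in zip(z, e_x)]


def toy_decode(z: State) -> int:
    """Hypothetical decoder: index of the largest latent coordinate."""
    return max(range(len(z)), key=lambda i: z[i])


def run_recursion(
    z0: State,
    e_x: State,
    transition: Callable[[State, State, random.Random], State],
    decode: Callable[[State], int],
    T: int,
    n_sup: int,
    rng: random.Random,
) -> Tuple[State, List[int]]:
    """Run n_sup supervision steps of T transitions each, decoding after each step."""
    z = list(z0)
    predictions = []
    for _ in range(n_sup):              # outer loop: supervision steps
        for _ in range(T):              # inner loop: T latent transitions
            z = transition(z, e_x, rng)
        predictions.append(decode(z))   # decoder is invoked once per supervision step
        # z is carried over unchanged, i.e. z_0^{(n+1)} := z_T^{(n)}
    return z, predictions
```

Calling `run_recursion([0.0, 0.0, 0.0], [0.0, 5.0, 0.0], toy_transition, toy_decode, T=3, n_sup=4, rng=random.Random(0))` applies 12 transitions in total and returns four predictions, one per supervision step.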
Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
perturbation around a deterministic update: at each transition, the model first computes a deterministic
update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
and adds it to ut :
    $$ \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t)\, I\big), \tag{4} $$
    $$ z_t = u_t + \epsilon_t. \tag{5} $$


We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
and support robust exploration of the solution space without discarding the deterministic refinement
performed by ut .
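As a rough illustration of Equations (4) and (5), the sketch below samples the residual perturbation with the usual reparameterization ϵ = µ + σ·ξ, ξ ∼ N(0, I). The function `noise_heads` is a hypothetical stand-in for the learned µθ and σθ ; its particular form (a small tanh mean and a sigmoid-bounded scalar standard deviation) is an assumption made only so the example runs.

```python
import math
import random
from typing import List, Tuple


def noise_heads(u: List[float]) -> Tuple[List[float], float]:
    """Hypothetical stand-ins for mu_theta(u) and sigma_theta(u) (assumed forms)."""
    mu = [0.1 * math.tanh(ui) for ui in u]            # state-dependent direction
    mean_u = sum(u) / len(u)
    sigma = 0.05 + 0.05 / (1.0 + math.exp(-mean_u))   # strictly positive scalar std
    return mu, sigma


def sample_transition(u: List[float], rng: random.Random) -> List[float]:
    """Return z = u + eps with eps ~ N(mu(u), sigma(u)^2 I), as in Eqs. (4)-(5)."""
    mu, sigma = noise_heads(u)
    # Reparameterized sample: eps = mu + sigma * xi, with xi ~ N(0, 1) per coordinate.
    eps = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    return [ui + ei for ui, ei in zip(u, eps)]
```

Because the deterministic update u is kept and only a residual is sampled, averaging many samples recovers u + µ(u), while individual samples spread around it with scale σ(u).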
Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
state, while the low-level component l is updated K times within a single transition and carries
fine-grained intermediate computation. This decomposition separates the two roles across time scales,
with h accumulating slowly across transitions and l refined rapidly within each one.
With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
The low-level component is first refined for K updates, with the high-level component held fixed:
                              lt,k = fL (ht−1 , lt,k−1 , ex ; θ),            k = 1, . . . , K,                (6)
where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
component is then updated as a stochastic transition conditioned on the refined lt ,
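    $$ u_t = f_H(h_{t-1}, l_t; \theta), \tag{7} $$
    $$ \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t)\, I\big), \tag{8} $$
    $$ h_t = u_t + \epsilon_t, \tag{9} $$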
and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
more abstract component of the latent state, where it can steer the overall reasoning trajectory
across transitions.³ Under this instantiation, the decoder reads only the high-level component, i.e.,
fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
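One hierarchical transition, Equations (6) to (9), might be sketched as follows. The update functions `f_L` and `f_H` here are toy elementwise maps chosen for illustration, and the noise uses a fixed zero-mean scale `sigma` instead of the learned µθ (ut ) and σθ (ut ); both are assumptions, not the paper's modules.

```python
import math
import random
from typing import List, Tuple

Vec = List[float]


def f_L(h: Vec, l: Vec, e_x: Vec) -> Vec:
    """Toy low-level update (hypothetical stand-in for the learned f_L)."""
    return [math.tanh(0.5 * li + 0.3 * hi + 0.2 * ei) for hi, li, ei in zip(h, l, e_x)]


def f_H(h: Vec, l: Vec) -> Vec:
    """Toy high-level update (hypothetical stand-in for the learned f_H)."""
    return [math.tanh(0.7 * hi + 0.3 * li) for hi, li in zip(h, l)]


def hierarchical_transition(
    h_prev: Vec, l_prev: Vec, e_x: Vec, K: int, rng: random.Random, sigma: float = 0.1
) -> Tuple[Vec, Vec]:
    """Compute (h_t, l_t) from (h_{t-1}, l_{t-1})."""
    l = list(l_prev)
    for _ in range(K):                  # Eq. (6): K deterministic refinements, h held fixed
        l = f_L(h_prev, l, e_x)
    u = f_H(h_prev, l)                  # Eq. (7): deterministic high-level proposal
    # Eqs. (8)-(9), simplified: zero-mean noise with a fixed scale (an assumption).
    h = [ui + sigma * rng.gauss(0.0, 1.0) for ui in u]
    return h, l
```

Two calls with different random seeds return the same low-level state but different high-level states, which reflects the statement that stochasticity enters only through h.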
Modeling Unconditional Distribution. While the description so far focuses on the conditional
setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
generation tasks in Section 4.3.

2.2   Training

GRAM is trained to model the conditional distribution pθ (y | x), where each training example
consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
respect to the generative parameters θ and variational parameters ϕ.
Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
the full latent trajectory $\tau = (z_0 \to \cdots \to z_{T_{\mathrm{Total}}})$ consists of a sequence of latent variables, with
$T_{\mathrm{Total}} = T \times N_{\mathrm{sup}}$. The conditional likelihood is defined as
    $$ p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)\, d\tau, \tag{10} $$

where x denotes the input problem and y denotes the corresponding ground-truth output.
Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
              log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) .           (11)

During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
access to both the input problem x and the target output y. At inference time, where y is unavailable,
trajectories are instead generated from the learned prior pθ (· | x).
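For readers who want the intermediate step, the bound in Equation (11) follows from the standard importance-weighting argument plus Jensen's inequality. The derivation below is the textbook one, written in the paper's notation; it assumes that qϕ (τ | x, y) is positive wherever the integrand of Equation (10) is nonzero.

```latex
\begin{aligned}
\log p_\theta(y \mid x)
  &= \log \int q_\phi(\tau \mid x, y)\,
       \frac{p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)}{q_\phi(\tau \mid x, y)}\, d\tau \\
  &\ge \int q_\phi(\tau \mid x, y)\,
       \log \frac{p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)}{q_\phi(\tau \mid x, y)}\, d\tau
       \qquad \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{q_\phi(\tau \mid x, y)}\big[\log p_\theta(y \mid \tau, x)\big]
     - \mathrm{KL}\big(q_\phi(\tau \mid x, y) \,\|\, p_\theta(\tau \mid x)\big).
\end{aligned}
```

The gap between the two sides equals the KL divergence from qϕ (τ | x, y) to the exact posterior over trajectories, so the bound is tight when the variational posterior matches it.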
   ³ We also tried injecting noise into the low-level state, but found that it did not improve performance.


Both the prior and the posterior are modeled as conditional Markov processes over latent states:
    $$ p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} p_\theta(z_t \mid z_{t-1}, x), \qquad q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} q_\phi(z_t \mid z_{t-1}, x, y). \tag{12} $$
Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
whereas the prior uses pθ (ϵt | ut ).
Since the two processes share the same Markov structure and all stochasticity is introduced through
$\epsilon_{1:T_{\mathrm{Total}}}$, their trajectory distributions can be equivalently represented in noise space. Moreover,
since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
$p_\theta(y \mid \tau, x) = p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)$. Therefore, the full trajectory-level ELBO can be written as
    $$ \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)\big] - \sum_{t=1}^{T_{\mathrm{Total}}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)}\Big[\mathrm{KL}\big(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t)\big)\Big]. \tag{13} $$
Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
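When both noise distributions are Gaussian, each per-step KL term in Equation (13) has a closed form. The sketch below assumes diagonal Gaussians parameterized by per-coordinate means and standard deviations, which is an assumption about the parameterization rather than a detail stated in the text; `kl_diag_gaussians` and `trajectory_kl` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[Sequence[float], Sequence[float]]  # (means, standard deviations)


def kl_diag_gaussians(
    mu_q: Sequence[float], sigma_q: Sequence[float],
    mu_p: Sequence[float], sigma_p: Sequence[float],
) -> float:
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over coordinates."""
    if not (len(mu_q) == len(sigma_q) == len(mu_p) == len(sigma_p)):
        raise ValueError("all parameter vectors must have the same length")
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total


def trajectory_kl(steps: List[Tuple[Params, Params]]) -> float:
    """Sum per-step KL(q_t || p_t) along ONE sampled trajectory.

    This is a single-sample Monte Carlo estimate of the KL sum in Eq. (13):
    each step's parameters depend on the noise sampled at earlier steps.
    """
    return sum(kl_diag_gaussians(q[0], q[1], p[0], p[1]) for q, p in steps)
```

For example, `kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])` evaluates the KL between two unit-variance Gaussians whose means differ by one, and the result is zero whenever posterior and prior parameters coincide.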
Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
step is reused as the initial state of the next step.
Following standard practice for recurrent models with long computation chains, we apply truncated
gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
implementation, gradients are propagated only through the final transition of each supervision step,
$z_{T-1}^{(n)} \to z_T^{(n)}$. This gives the following surrogate objective for each supervision step:
    $$ \mathcal{L}^{(n)}_{\mathrm{GRAM}}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T^{(n)}, x)\big] - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big), \tag{14} $$
where $z_T^{(n)}$ is the terminal state of the current supervision step n, and gradients are stopped through
preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
are listed in Appendix B.2.

2.3   Inference-Time Scaling

GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
Appendix A.1). For width, the focus of this section, we draw $\{\tau^{(i)}\}_{i=1}^{N} \sim p_\theta(\tau \mid x)$ from the
learned prior and decode each terminal state into a candidate output $\hat{y}^{(i)} = f_{\mathrm{dec}}(z_T^{(i)})$, exploring
multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
chooses the candidate with the highest predicted terminal value. Details of LPRM training are
provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
parallel exploration, without increasing the sequential recursion length.
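The two selection rules can be sketched as follows. This is an illustrative outline only: `majority_vote` and `best_of_n` are hypothetical names, candidates are represented as hashable tuples, and the `values` passed to `best_of_n` stand in for the LPRM's predicted terminal values, whose actual computation is described in Appendix A.2.

```python
from collections import Counter
from typing import Hashable, Sequence, TypeVar

C = TypeVar("C", bound=Hashable)


def majority_vote(candidates: Sequence[C]) -> C:
    """Return the most frequent decoded prediction among the N candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return Counter(candidates).most_common(1)[0][0]


def best_of_n(candidates: Sequence[C], values: Sequence[float]) -> C:
    """Return the candidate whose predicted terminal value is highest."""
    if not candidates or len(candidates) != len(values):
        raise ValueError("candidates and values must be non-empty and equal in length")
    best_index = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best_index]
```

Majority voting needs no extra model but can only return an answer that several trajectories agree on, whereas value-guided selection can pick a prediction that appears only once among the samples.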

3     Related Work
Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous


[Figure 3 bar charts, accuracy (%). Sudoku: Looped TF 61.3, HRM 55.0, TRM 87.4, GRAM (Ours) 97.0. ARC-AGI-1: HRM 40.3, TRM 44.6, GRAM (Ours) 52.0, o3-mini-high 34.5, GPT 5.2 (low) 55.7, Grok-4-thinking 66.7. ARC-AGI-2: HRM 5.0, TRM 7.8, GRAM (Ours) 11.1, o3-mini-high 3.0, GPT 5.2 (low) 9.7, Grok-4-thinking 16.0. Legend: Recursive Models, GRAM (Ours), LLMs.]

Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
only as external reference points for benchmark difficulty.

representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
However, many remain organized around autoregressive sequence generation, where additional
computation is tied to generating more tokens, latent positions, or sequential reasoning states.
Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
recursive reasoning models show that increasing inference-time depth can outperform larger static
models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
enabling multi-path exploration and generative sampling.
Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
stochastic dynamics as computation rather than temporal observation modeling: latent transitions
define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
pθ (y | x) and unconditional pθ (x) generation.


4                Experiments

GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
purpose large language reasoning model whose training data, inference budgets, prompting strategies,
tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
choices (Section 4.4).

4.1                  Challenging Puzzle Tasks

[Figure 4 plots. Left: accuracy vs. iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and GRAM (Ours) with N = 1, 5, 10, 20, 50. Right: accuracy vs. number of solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]

 Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
 longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
 iteration corresponds to one supervision step, which for Looped TF means K× more flat iterations. (Right)
 Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
 a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
 performance.

Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract visual reasoning
through few-shot pattern recognition. We compare against direct prediction (Transformer [39]) and
recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
 model results [40] are included as external reference points for benchmark difficulty, rather than
 as controlled baselines, since their training and inference settings are not directly comparable to
 task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
 reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
 Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
 GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
 improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
 TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
 stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
 of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
 problem spaces more effectively than models constrained to a single sequential refinement process.
 Detailed experimental results, including additional state-of-the-art methods, are provided in Appendix D.1.
 Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
 number of parallel samples consistently improves performance across all iteration counts. Notably,
 GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
 including TRM (97.0% vs 90.5%), despite a comparable computational budget. While deterministic
 recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
 explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
 Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
 the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
 analysis on the ARC-AGI Challenge is provided in Appendix D.2.
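 The selection step described above can be sketched as a best-of-N procedure. This is only a minimal
 illustration: `rollout` and `score` are hypothetical stand-ins for a stochastic recursive model and an
 LPRM-like scorer, and the toy scorer below peeks at the target, which a learned reward model would not do.
 One way to read the budget comparison is that 20 samples × 16 iterations gives 320 sample-iterations,
 the same count as a single trajectory run for 320 iterations.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    rollout: Callable[[random.Random], List[int]],
    score: Callable[[List[int]], float],
    n_samples: int,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Sample n_samples candidates and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [rollout(rng) for _ in range(n_samples)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions for illustration only, not the paper's model).
TARGET = [1, 0, 1, 1, 0, 1, 0, 0]


def toy_rollout(rng: random.Random) -> List[int]:
    # Each position matches the target with probability 0.8.
    return [t if rng.random() < 0.8 else 1 - t for t in TARGET]


def toy_score(candidate: List[int]) -> float:
    # Hypothetical scorer: fraction of positions matching the target.
    # A learned reward model would estimate correctness without the target.
    return sum(c == t for c, t in zip(candidate, TARGET)) / len(TARGET)


if __name__ == "__main__":
    for n in (1, 5, 20):
        _, s = best_of_n(toy_rollout, toy_score, n_samples=n)
        print(f"N={n:2d}  best score={s:.3f}")
```

 Because the candidates are independent, the N rollouts can run in parallel, so wall-clock latency is
 governed by the recursion depth rather than by N.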

 4.2        Multi-solution Puzzle Tasks

Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
Appendix C.2.
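The three metrics above can be sketched as follows. The encodings are assumptions made for illustration
(a queen placement as a list `cols` with `cols[r]` the column of the queen in row `r`, and a coloring as a
list of color indices per vertex); the paper's actual data format is described in its Appendix C.2 and may
differ.

```python
from typing import Iterable, Sequence, Tuple


def is_valid_queens(cols: Sequence[int]) -> bool:
    """cols[r] is the column of the queen in row r (one queen per row)."""
    n = len(cols)
    if any(not 0 <= c < n for c in cols):
        return False
    return (
        len(set(cols)) == n                                 # distinct columns
        and len({r + c for r, c in enumerate(cols)}) == n   # distinct anti-diagonals
        and len({r - c for r, c in enumerate(cols)}) == n   # distinct diagonals
    )


def coverage(samples: Iterable[Sequence[int]], total_valid: int) -> float:
    """Unique valid solutions found, divided by the number of valid solutions."""
    found = {tuple(s) for s in samples if is_valid_queens(s)}
    return len(found) / total_valid


def conflict_edges(edges: Iterable[Tuple[int, int]], coloring: Sequence[int]) -> int:
    """Number of edges whose two endpoints share a color."""
    return sum(coloring[u] == coloring[v] for u, v in edges)
```

Duplicates among the samples do not raise coverage, which is why a model that always returns the same
valid answer can score high accuracy and low coverage at the same time.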
 Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
 models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
 Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
 deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
 consistent performance regardless of solution count. This confirms that deterministic latent updates


Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
solutions discovered with 20 samples (%).
                                                           N-Queens                                 Graph Coloring
                                                    8×8              10 × 10                 8-vertex            10-vertex
 Method                  Rec. Gen. # Params Accuracy Coverage Accuracy Coverage       Conflict↓ Coverage Conflict↓ Coverage
 Direct Pred (8 layers)   ✗    ✗     27M     40.4±1.1  13.7±1.1 13.6±0.5    1.6±0.2   179.3±4.0 19.9±0.2 198.7±5.0      6.7±0.1
 Direct Pred (32 layers) ✗     ✗     100M    40.2±1.3  13.6±1.1 13.1±0.4    1.6±0.2   174.0±18.0 19.1±1.7 227.7±34.5    6.5±1.9
 Looped TF                ✓    ✗       7M    68.4±3.7  23.6±1.9 50.0±7.6    6.2±3.2   136.0±16.1 20.5±1.5 157.3±9.0     7.2±0.7
 HRM                      ✓    ✗      27M    78.7±2.9  26.7±1.3 37.4±0.3    4.7±0.1   109.7±1.5 21.8±0.3 164.3±21.6     8.9±1.7
 TRM                      ✓    ✗       7M    66.8±5.7 36.1±22.5 17.5±11.2  2.0±1.3    109.3±3.1 22.3±0.6 170.7±17.9     6.8±0.3
 AR                       ✗    ✓    10.6M    96.3±1.0  84.8±0.8 90.0±2.2   53.2±0.8    19.0±11.3   83.0±0.7 61.3±8.3   40.0±0.3
 MDLM                     ✗    ✓    12.6M    96.1±1.5  87.2±0.6 74.3±6.6   47.4±2.2     2.7±0.6    84.5±4.0 12.0±7.0   48.2±1.4
 GRAM (Ours)              ✓    ✓     10M     99.7±0.3  90.3±1.9 89.7±2.7   57.5±3.4     2.7±2.1    85.8±0.5  3.3±1.5   51.3±2.8


cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
analysis is provided in Appendix D.3.
Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
MDLM) achieve high coverage, GRAM generally attains higher accuracy with comparable diversity.
On 8 × 8 N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
enables stricter constraint satisfaction than generative sampling alone.

4.3   Exploring GRAM as an Unconditional Generator

Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
with existing literature, FID is calculated using real samples from the original standard MNIST.

Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓). For iterative
models, a step corresponds to a supervision step for TRM and GRAM, and a denoising step for D3PM. FID is
calculated using real samples with original pixel values (0–255).

        Method                    IS (↑)    FID (↓)
        VAE                        1.70       86.28
        D3PM (1000 steps)          1.86       74.03
        TRM (16 steps)             1.00      303.29
        GRAM (Ours)
          8 steps                  1.85       84.08
          16 steps                 1.89       77.79
          32 steps                 1.91       76.65
          64 steps                 1.95       75.39
          128 steps                1.99       74.30
          256 steps                2.04       73.34

Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM achieves
higher validity than D3PM with substantially fewer parameters and steps.

Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
that use up to 55.1M parameters and 1000 denoising steps. Figure 7 shows qualitative examples,
illustrating that the model produces diverse, fully valid boards from empty inputs without any explicit
constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID
303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together,
these results indicate that GRAM’s stochastic latent

  [Figure 6 panels. (a) Generation Process: one row per model, D3PM at t = 0, 100, ..., 1000, and TRM and
  GRAM at t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16. (b) Samples: Sample 1 to Sample 4.]
  Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
  steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
  through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.

  Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
  components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
  recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential; removing
  either significantly degrades performance.

  (a) Architecture Ablation.

  Model variant                        Sudoku            N-Queens
  base (Looped TF)                      61.25              71.30
  + DS + HR (=HRM, TRM)             55.00 / 87.40      80.70 / 72.90
  + SG                                  65.64              86.30
  + DS + SG                             73.90             100.00
  + DS + HR + SG (=GRAM)                93.96              99.69

  (b) Mechanism Ablation.

  Model variant                        Sudoku            N-Queens
  GRAM (ours)                           93.96              99.69
  w/o stochastic guidance               82.87              72.91
  stochasticity only                    94.88              50.27
  guide only                             0.00               0.00
  w/ direct prediction                  63.43              61.44
  TRM w/ stochastic decoder             82.87              71.66
  TRM w/ random init.                   78.53              71.82


  transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
  emerging as a natural byproduct of the recursive generative process.
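  The validity metric used for Sudoku generation can be checked with a few set comparisons. The sketch
  below is a hypothetical evaluator, not the paper's code; it assumes a board is given as nine strings of
  nine digits, and the example grids are standard shift-pattern boards chosen for illustration.

```python
from typing import Iterable, List


def is_valid_sudoku(board: List[str]) -> bool:
    """board: nine strings, each with nine characters from '1' to '9'."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set("123456789")
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Every row, column, and 3x3 box must contain each digit exactly once.
    return all(group == digits for group in rows + cols + boxes)


def validity_rate(boards: Iterable[List[str]]) -> float:
    """Fraction of generated boards that satisfy all Sudoku constraints."""
    boards = list(boards)
    return sum(is_valid_sudoku(b) for b in boards) / len(boards)


VALID = [
    "123456789", "456789123", "789123456",
    "234567891", "567891234", "891234567",
    "345678912", "678912345", "912345678",
]
# Swapping the first two digits keeps row 1 valid but breaks columns 1 and 2.
INVALID = ["213456789"] + VALID[1:]
```

  A board counts as valid only if all 27 groups pass, so a single misplaced digit makes the whole sample
  invalid.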
  Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
  at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
  to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
  advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
  additional samples are in Section D.4.

  4.4        Ablation Study

  We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
  Table 3 summarizes the results.
  Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
  stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
  lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
  on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
  (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
  gains in every configuration, supporting our design of stochastic guidance as the core extension
  introduced by GRAM.
  Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
  learned distribution $\epsilon_t \sim \mathcal{N}(\mu_\theta, \sigma_\theta^2 I)$ in Equation (4). Removing
  guidance ($\mathcal{N}(0, \sigma_\theta^2 I)$) maintains Sudoku performance (94.88%), indicating that
  stochasticity alone can enable diverse reasoning paths. However, this variant collapses on N-Queens
  (50.27%), where structured guidance is necessary to navigate multi-solution spaces. Removing
  stochasticity ($\mathcal{N}(\mu_\theta, 0)$) fails completely (0.0% on both tasks), as deterministic
  guidance conditioned on the target leads to severe overfitting.
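  The three noise settings compared in this ablation differ only in which parameters of the Gaussian are
  kept. The sketch below makes that explicit; `mu` and `sigma` are plain lists standing in for the outputs
  of the learned network, and the variant names are labels chosen here for illustration.

```python
import random
from typing import List, Sequence


def sample_noise(
    mu: Sequence[float],
    sigma: Sequence[float],
    variant: str,
    rng: random.Random,
) -> List[float]:
    """Draw one noise vector under a given ablation variant."""
    if variant == "full":                 # N(mu, sigma^2 I): guidance + stochasticity
        return [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    if variant == "stochasticity_only":   # N(0, sigma^2 I): guidance removed
        return [rng.gauss(0.0, s) for s in sigma]
    if variant == "guide_only":           # N(mu, 0): stochasticity removed
        return list(mu)
    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = [2.0, -1.0], [1.0, 0.5]
    for name in ("full", "stochasticity_only", "guide_only"):
        print(name, sample_noise(mu, sigma, name, rng))
```

  In the guide-only case every call returns the same vector, so repeated sampling cannot produce distinct
  trajectories.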
  Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
  TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and


[Figure 7 panels. Sudoku boards during generation: D3PM at t = 0, 125, 250, 375, 500, 625, 750, 875, 1000,
and GRAM at iter = 0, 1, 2, 4, 6, 8, 10, 13, 16.]
Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
highlighted in red.


(2) random initialization, which samples $z_0$ from a Gaussian $\mathcal{N}(0, I)$ at each inference. Neither
improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
than mere randomness.
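For concreteness, the two naive baselines can be sketched as follows. These helpers are illustrative
assumptions, not TRM's actual code: `decode_sample` draws a token from the softmax over output logits
instead of taking the argmax, and `random_init` draws an initial latent vector from a standard Gaussian.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_argmax(logits: Sequence[float]) -> int:
    """Deterministic decoding: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def decode_sample(logits: Sequence[float], rng: random.Random) -> int:
    """Stochastic decoding: sample an index from the output distribution."""
    return rng.choices(range(len(logits)), weights=softmax(logits))[0]


def random_init(dim: int, rng: random.Random) -> List[float]:
    """Random initialization: z_0 ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

Both variants inject randomness only at the boundary of the computation (the output or the initial state),
whereas the recursive transitions themselves stay deterministic.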

5      Conclusions and Limitations
We introduced GRAM, a generative framework that transforms deterministic recursive architectures
into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
enables diverse solution discovery and improved exploration compared to deterministic counterparts.
Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose
extension that consistently improves the recursive architectures we tested, and that the gains stem
specifically from the variational framework rather than from mere randomness, as naive stochastic
alternatives applied to existing models yield no improvement.
Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
through recursion-based generation over inputs, with generation quality improving monotonically
with recursive depth even beyond training-time steps. This suggests new directions for generative
modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
toward larger foundation models.


Acknowledgment
This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
ICT (MSIT). This work was also supported by the Institute of Information & Communications
Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
members of the MLML for valuable discussions and feedback throughout this project.

Broader Impacts
GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining
multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific
problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
to improve reasoning through inference-time computation rather than parameter scaling alone. Its
generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
in downstream decision-making pipelines, and multi-sample inference may increase computational
and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
domain-specific safeguards.

References
 [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
     Advances in neural information processing systems, 35:24824–24837, 2022.

 [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
     Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
     vances in neural information processing systems, 36:11809–11822, 2023.

 [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
     Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
     of thoughts: Solving elaborate problems with large language models. In Proceedings of the
     AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.

 [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
     arXiv:2412.06769, 2024.

 [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
     arXiv:2505.12514, 2025.

 [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
     Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
     reasoning. arXiv preprint arXiv:2505.23648, 2025.

 [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
     are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.

 [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.

 [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
     preprint arXiv:2510.04871, 2025.


[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
     versal transformers. arXiv preprint arXiv:1807.03819, 2018.
[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
     agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
     2025.
[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
     recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
     recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
     preprint arXiv:1705.08209, 2017.
[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
     son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
     with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
     beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
     Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
     concept space. arXiv preprint arXiv:2505.15778, 2025.
[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
     hard truths. arXiv preprint arXiv:2509.19170, 2025.
[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
     Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
     arXiv:2502.21074, 2025.
[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
     Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
     preprint arXiv:2505.18454, 2025.
[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
     thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
     arXiv:2310.10845, 2023.
[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
     Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
     arXiv:2410.20672, 2024.
[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
     1735–1780, 1997.
[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
     Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
     decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.


[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
     Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
     preprint arXiv:1909.11942, 2019.
[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
     preprint arXiv:1910.10073, 2019.
[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
     arXiv:1603.08983, 2016.
[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
     Bengio. A recurrent latent variable model for sequential data, 2016. URL
     https://arxiv.org/abs/1506.02216.
[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
     models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL
     https://arxiv.org/abs/1511.05121.
[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
     James Davidson. Learning latent dynamics for planning from pixels, 2019. URL
     https://arxiv.org/abs/1811.04551.
[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
     discrete world models. arXiv preprint arXiv:2010.02193, 2020.
[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
     through world models. arXiv preprint arXiv:2301.04104, 2023.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
     processing systems, 30, 2017.
[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
     benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
     Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
     models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
     2018.
[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
     Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
     neural information processing systems, 30, 2017.
[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
     Structured denoising diffusion models in discrete state-spaces. Advances in neural information
     processing systems, 34:17981–17993, 2021.
[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
     arXiv:1312.6114, 2013.
[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
     Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
     discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
     of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.


[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
     applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
     convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
     function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
     on computer vision (ECCV), pages 3–19, 2018.
[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
     arXiv:1711.05101, 2017.
[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
     natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
     Advances in neural information processing systems, 33:12438–12448, 2020.
[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
     Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632.
     doi: 10.1007/BF02066689. URL https://doi.org/10.1007/BF02066689.
[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
     learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
     31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
     IEEE, 2019.
[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
     Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching.
     Commun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:13091446.


A     Additional Method Details
A.1   Adaptive Computation Time

GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
Halt head. The decoder includes an auxiliary head $q_\psi : \mathbb{R}^D \to \mathbb{R}^2$ that maps the
high-level state $h$ to two scalar values, $q_\psi(h) = (q^{\text{halt}}, q^{\text{continue}})$. These are
interpreted as estimated Q-values for the binary action of halting or continuing computation at the
current supervision step.
Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
After computing the latent state $z_T^{(n)}$ at the end of supervision step $n$, we form Q-learning targets:

       • $\hat{q}_n^{\mathrm{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct
         prediction.
       • $\hat{q}_n^{\mathrm{continue}} = \max\big(q_{n+1}^{\mathrm{halt}},\, q_{n+1}^{\mathrm{continue}}\big)$, the bootstrapped value of running one more supervision
         step.

The halt head is trained by regression to these targets:
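$$\mathcal{L}_{\mathrm{ACT}} = \sum_{n=1}^{N_{\mathrm{sup}}} \Big[ \big(q_n^{\mathrm{halt}} - \hat{q}_n^{\mathrm{halt}}\big)^2 + \big(q_n^{\mathrm{continue}} - \hat{q}_n^{\mathrm{continue}}\big)^2 \Big]. \tag{15}$$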

This auxiliary loss is added to the main training objective and contributes only through the halt head;
it does not propagate gradients into the recursive core.
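
As a concrete reading of the targets and Equation (15), the following minimal Python sketch computes the halt and continue targets for one trajectory and the resulting squared-error loss. The function names are illustrative and not taken from the released code. The handling of the last supervision step, where no step n+1 exists, is an assumption: the sketch falls back to the halt target there.

```python
def act_targets(correct, q_halt, q_continue):
    """Return (halt_targets, continue_targets) for supervision steps 1..N_sup.

    correct[n]    -- True if decoding the state after step n matches the label
    q_halt[n]     -- halt value predicted by the halt head at step n
    q_continue[n] -- continue value predicted by the halt head at step n
    """
    n_sup = len(correct)
    halt_targets = [1.0 if c else 0.0 for c in correct]
    continue_targets = []
    for n in range(n_sup):
        if n + 1 < n_sup:
            # Bootstrapped value of running one more supervision step.
            continue_targets.append(max(q_halt[n + 1], q_continue[n + 1]))
        else:
            # ASSUMPTION: no next step exists, so reuse the halt target.
            continue_targets.append(halt_targets[n])
    return halt_targets, continue_targets


def act_loss(q_halt, q_continue, halt_targets, continue_targets):
    """Sum of squared errors over supervision steps, as in Equation (15)."""
    return sum(
        (qh - th) ** 2 + (qc - tc) ** 2
        for qh, qc, th, tc in zip(q_halt, q_continue, halt_targets, continue_targets)
    )


# Toy trajectory with two supervision steps; only the second is correct.
q_halt, q_continue = [0.2, 0.9], [0.6, 0.1]
halt_t, cont_t = act_targets([False, True], q_halt, q_continue)
loss = act_loss(q_halt, q_continue, halt_t, cont_t)
```

In an autodiff framework the targets would typically be treated as constants; that choice is also an assumption of this sketch.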
Inference. At inference, computation proceeds one supervision step at a time. After each step
$n$, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\mathrm{halt}} > q_n^{\mathrm{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise,
computation continues to the next supervision step, up to a maximum budget of $N_{\mathrm{sup}}^{\max}$ steps. Different
trajectories sampled in parallel may therefore terminate at different depths, complementing the
parallel-sampling scheme described in Section 2.3. In practice, we found that using only $q^{\mathrm{halt}}$
(halting when $\sigma(q^{\mathrm{halt}}) > 0.5$, without the continue branch) performs comparably while simplifying
implementation; our released code uses this variant.
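
The inference procedure can be sketched as a short loop. This is a minimal illustration under stated assumptions: `step_fn` is a hypothetical callable standing in for one supervision step of the model, and `toy_step` is a made-up stand-in used only to exercise the loop. Both halting rules described above are shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_with_halting(step_fn, state, max_steps, use_continue_branch=False):
    """Run supervision steps until the halt rule fires or the budget is spent.

    step_fn(state) -> (new_state, prediction, q_halt, q_continue)  [hypothetical]
    Returns (prediction, number_of_steps_used).
    """
    prediction = None
    for n in range(1, max_steps + 1):
        state, prediction, q_halt, q_continue = step_fn(state)
        if use_continue_branch:
            halt = q_halt > q_continue          # two-branch rule
        else:
            halt = sigmoid(q_halt) > 0.5        # halt-only variant
        if halt:
            return prediction, n
    return prediction, max_steps


def toy_step(state):
    """Made-up step: q_halt grows by one per step and crosses zero at step 4."""
    new_state = state + 1
    return new_state, f"y{new_state}", new_state - 3.5, 0.0


halted = run_with_halting(toy_step, 0, max_steps=16)
truncated = run_with_halting(toy_step, 0, max_steps=2)
```

Because each trajectory carries its own state, running this loop independently per sampled trajectory lets different samples stop at different depths.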

A.2   Latent Process Reward Model (LPRM)

To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
with the main objective via a regression loss:
$$\mathcal{L}_{\mathrm{LPRM}} = \sum_{t=1}^{T} \big(v_\psi(z_t) - r\big)^2, \tag{16}$$

where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
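
A minimal Python sketch of Equation (16) for a single trajectory, together with one possible way to use the value head for candidate selection. The selection rule (taking the candidate with the highest value estimate at its final latent state) is an assumption made for illustration; the function names are hypothetical.

```python
def lprm_loss(values, reward):
    """Equation (16) for one trajectory: sum over t of (v(z_t) - r)^2.

    values -- value-head outputs v(z_1), ..., v(z_T) along the trajectory
    reward -- accuracy r in [0, 1] of the trajectory's final prediction
    """
    return sum((v - reward) ** 2 for v in values)


def select_best(candidates):
    """Pick the prediction with the highest value estimate.

    candidates -- list of (prediction, value_estimate) pairs
    ASSUMPTION: the value estimate is taken at each trajectory's final state.
    """
    return max(candidates, key=lambda pair: pair[1])[0]


loss = lprm_loss([0.5, 0.75, 1.0], reward=1.0)
best = select_best([("a", 0.2), ("b", 0.9), ("c", 0.5)])
```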

A.3   Empirical Validation of the Surrogate Objective

We further analyze the approximation introduced by the surrogate training objective LGRAM used in
Section 2.2, both qualitatively and empirically.
Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
the entire trajectory. To enable training with constant memory, we propagate gradients only through
the final transition of each supervision step. This is a standard practice in recurrent latent variable
models with long computation chains: ELBOs over truncated sequences are used, for example, in
VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
reasoning trajectories rather than temporal sequences.
Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
variational bound, we compute both quantities on the validation set throughout training. The full
ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
across all TTotal transitions; the surrogate objective is evaluated as the average of $\mathcal{L}_{\mathrm{GRAM}}^{(n)}$ over the
Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.

[Figure 8 plot. Panels: Sudoku-Extreme (left), N-Queens 8 × 8 (right). x-axis: Training Step; y-axis: ELBO. Curves: GRAM (training objective), ELBO (full).]
Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
the large dynamic range.

Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
variational bound. Since the negative ELBO upper-bounds the negative log-likelihood,
its consistent improvement provides evidence that GRAM improves a valid bound on the data
log-likelihood, even though training relies on the surrogate.
The gap between the two curves in Figure 8 reflects the structural difference between the two
quantities rather than an optimization failure: LELBO accumulates KL terms across all transitions,
while LGRAM evaluates only the final-step KL of each supervision step. This gap is consistent
with LGRAM being a biased but useful surrogate for LELBO.

B     Training and Architecture Details
B.1    Architecture Details

GRAM consists of three components: Encoder, Recursive Core, and Decoder.
Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally
concatenated with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings
(RoPE) [46]. The embeddings are scaled by √D and prepended with 16 puzzle embedding tokens [8].
Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we
use a [SwiGLU + SwiGLU] network for the Recursive Core instead of [Attention + SwiGLU],
following [9]. To initialize z0 = (h0 , l0 ), we sample once from the standard Gaussian distribution
N (0, I) and store the sampled value in the network checkpoint, so that z0 is a fixed constant
whenever the checkpoint is loaded rather than being resampled.
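
The nested update schedule of the Recursive Core can be sketched as follows. This is a minimal, deterministic illustration only: `toy_update` is a hypothetical element-wise stand-in for fL and fH (which are attention or SwiGLU stacks in the model), states are short Python lists rather than [B, L, D] tensors, and the stochastic transition parameterized by µ and σ is omitted.

```python
import math


def recursion_step(h, l, e_x, f_L, f_H, K, T):
    """Apply T outer steps, each with K low-level refinements and one high-level update."""
    for _ in range(T):
        for _ in range(K):
            injected = [hi + ei for hi, ei in zip(h, e_x)]  # h + e_x
            l = f_L(l, injected)                            # l <- f_L(l, h + e_x)
        h = f_H(h, l)                                       # h <- f_H(h, l)
    return h, l


def toy_update(state, context):
    """Hypothetical stand-in for f_L / f_H: a bounded element-wise mix."""
    return [math.tanh(0.5 * s + 0.5 * c) for s, c in zip(state, context)]


h0, l0 = [0.1, -0.2], [0.0, 0.3]   # fixed initial state z0 = (h0, l0)
e_x = [1.0, -1.0]                  # input embedding
h1, l1 = recursion_step(h0, l0, e_x, toy_update, toy_update, K=6, T=3)
```

With K = 6 and T = 3, one call evaluates the low-level function 18 times and the high-level function 3 times.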
Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
correctness values from the first token of h.


                                       Table 4: Architecture components.
                  Component         Module                   Description
                  Encoder           Token Embedding          vocab → D
                                    Puzzle Embedding         16 tokens (optional, for ARC)
                                    Position Encoding        RoPE or learned
                  Recursive Core    fL , fH                  [Attention + SwiGLU] × 2 layers
                                    Iterations               K low-level, T high-level steps
                                    µθ , σθ , µϕ , σϕ        SwiGLU MLP for each parameter
                  Decoder           LM Head                  Linear(D → vocab)
                                    Q Head                   Linear(D → 2) for halt
                                    V Head                   Linear(D → 1) for value


Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
construct a binarized dataset by normalizing the original discrete pixel values (0–255) to the
continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
convolutional patch encoder, following [48, 49].
The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
the normalized image passes through a shallow convolutional encoder. This encoder consists of
two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
feature map is divided into non-overlapping patches of size P × P and linearly projected to match
the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
are summarized in Table 5.
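
The binarization, rescaling, and patch-flattening steps can be illustrated with a small pure-Python sketch. The convolutional blocks and the final linear projection are omitted, so this shows only the data layout. The strict `>` comparison at the 0.5 threshold and all function names are assumptions.

```python
def binarize(image, threshold=0.5):
    """Map 0-255 pixels to {0, 1}. ASSUMPTION: strict '>' at the threshold."""
    return [[1 if px / 255.0 > threshold else 0 for px in row] for row in image]


def to_signed(binary):
    """Linearly rescale {0, 1} tokens to the range [-1, 1]."""
    return [[2.0 * b - 1.0 for b in row] for row in binary]


def patchify(grid, P):
    """Split an H x W grid into non-overlapping P x P patches, each flattened row-major."""
    H, W = len(grid), len(grid[0])
    if H % P != 0 or W % P != 0:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            patches.append([grid[top + i][left + j] for i in range(P) for j in range(P)])
    return patches


image = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
patches = patchify(to_signed(binarize(image)), P=2)
mnist_patches = patchify([[0] * 28 for _ in range(28)], P=2)
```

For a 28 × 28 image and P = 2, this yields 196 patches, matching the MNIST sequence length used in the paper.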

Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
channels, P patch size, Np the number of patches, and D the hidden dimension.
                        Stage           Layer / Operation              Output Dim.
                        1. Norm.        Input Tokens                   (B, C, H, W)
                                        Linear Scaling [−1, 1]         (B, C, H, W)
                        2. Conv         Conv2d 5 × 5 (p = 2)           (B, D/2, H, W)
                                        SiLU → GN(32)
                                        Conv2d 5 × 5 (p = 2)           (B, D/2, H, W)
                                        SiLU → GN(32)
                        3. Patch        Flatten Patches                (B, Np, P² · D/2)
                                        Linear Projection              (B, Np, D)


Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are
represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
backbone uses embedding dimension D = 512, Nhead = 8 attention heads, and FFN hidden dimension
Dh = 512. Within a recursion step, meaning a latent transition zt → zt+1 , we use K = 6 low-level (inner)
steps for Sudoku [8] and K = 4 for all other tasks, with T = 3 high-level (outer) steps.

B.2   Training Details

Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized
in Table 6).


        • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
          (0=pad, 1=blank, 2–10=digits).
        • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
          markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
          specific puzzle embeddings are prepended to distinguish different ARC tasks.
        • N-Queens flattens an N × N board row-by-row into N 2 tokens with vocabulary size 3
          (0=pad, 1=empty, 2=queen).
        • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
          using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
        • For image generation on MNIST [15], images are quantized and processed via CNN-based
          patchification [50, 49]: the encoder applies patchify and the decoder applies unpatchify. The
          patched input forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3
          (0=pad, 1=black, 2=white).


                                  Table 6: Task-specific configurations.
           Task              Seq. Len     Vocab   Puzzle Emb   Encoding
           Sudoku            81           11      ✗            9×9 grid, row-major
           ARC-AGI           900          12      ✓            30×30 padded canvas
           N-Queens          N²           3       ✗            N × N board
           Graph Coloring    n(n−1)/2     6       ✗            Strict adjacency upper triangle
           MNIST             196          3       ✗            14×14 patches
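
To make the token conventions concrete, here is a small sketch of the Sudoku and N-Queens encodings. The helper names are hypothetical. The sketch assumes a raw Sudoku grid stores 0 for a blank cell and 1 to 9 for digits, so that adding 1 gives the stated vocabulary (1=blank, 2–10=digits); that offset is an inference from the vocabulary description, not a detail confirmed by the source.

```python
def encode_sudoku(grid):
    """Flatten a 9x9 grid row-by-row into 81 tokens.

    ASSUMPTION: raw cells use 0 for blank and 1-9 for digits, so token = cell + 1
    gives 1=blank and 2-10=digits (token 0 is reserved for padding).
    """
    return [cell + 1 for row in grid for cell in row]


def encode_queens(board):
    """Flatten an N x N board of booleans row-by-row: 1=empty, 2=queen."""
    return [2 if has_queen else 1 for row in board for has_queen in row]


sudoku = [[0] * 9 for _ in range(9)]
sudoku[0][0] = 9
sudoku_tokens = encode_sudoku(sudoku)

board = [[False] * 8 for _ in range(8)]
board[0][3] = True
queens_tokens = encode_queens(board)
```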

Training Details. We train all models using AdamW [54] with learning rate 10−4 , weight decay
1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
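
For reference, the exponential moving average of parameters can be written in one line. The sketch below uses plain Python lists in place of model parameters and a hypothetical function name; it illustrates the standard EMA update rule, not the training code.

```python
def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * param, element-wise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Track a parameter that is held constant at 1.0, starting the average at 0.0.
ema = [0.0]
for _ in range(10_000):
    ema = ema_update(ema, [1.0])
```

With decay 0.9999, the average has an effective horizon of roughly 1/(1 − 0.9999) = 10,000 updates, which is why the toy average above has covered only about 63% of the distance to 1.0 after 10,000 steps.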

                      Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
                         Task                             Epochs   GPUs     Time
                         Sudoku                            50K      8        2h
                         ARC-AGI                          200K      8      5 days
                         N-Queens (8×8)                     3K      8        1h
                         N-Queens (10×10)                   1K      8        3h
                         Graph Coloring (8 nodes)           5K      8       1.5h
                         Graph Coloring (10 nodes)          5K      8        6h
                         MNIST                            1.8K      8       16h


C     Additional Details of Experiment Setup
C.1     Challenging Puzzle Tasks

C.1.1    Looped TF on ARC-AGI
We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
ARC-AGI scale, making a full Looped TF training run impractical.

C.2     Multi-solution Puzzle Tasks

C.2.1    N-Queens Problem

[Figure 9 panels: Input, Solution 1, Solution 2, Solution 3.]


Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
full board, leaving 3 queens. The model must find the positions of the removed queens. This configuration
admits exactly 3 valid solutions.

Data Generation Details. The N-Queens problem requires placing N queens on an N × N
chessboard such that no two queens attack each other—meaning no queens share the same row,
column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
solution, resulting in a puzzle with 3 distinct valid completions.
To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
N = 10. We then created puzzle instances by removing a specific number of queens, treating the
remaining partial configuration as the input and the original complete board as the target label. To
generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
generated dataset is shown in Figure 10.
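
The construction above can be sketched in a few lines of Python: enumerate all complete solutions by backtracking, blank out the queens in chosen rows, and count the complete solutions consistent with the remaining queens. The function names and the row-based puzzle representation (`None` marks a removed queen) are illustrative choices, and the sampling of which rows to remove is left to the caller.

```python
def all_solutions(n):
    """Enumerate complete N-Queens solutions as tuples where sol[row] = column."""
    solutions = []

    def place(row, cols, diag_down, diag_up, partial):
        if row == n:
            solutions.append(tuple(partial))
            return
        for c in range(n):
            if c in cols or (row - c) in diag_down or (row + c) in diag_up:
                continue  # square is attacked by an earlier queen
            place(row + 1, cols | {c}, diag_down | {row - c},
                  diag_up | {row + c}, partial + [c])

    place(0, frozenset(), frozenset(), frozenset(), [])
    return solutions


def make_puzzle(solution, removed_rows):
    """Remove the queens in the given rows; None marks a removed queen."""
    return tuple(None if r in removed_rows else c for r, c in enumerate(solution))


def completions(puzzle, solutions):
    """All complete solutions that agree with the queens left in the puzzle."""
    return [s for s in solutions
            if all(p is None or p == c for p, c in zip(puzzle, s))]


solutions_8 = all_solutions(8)
puzzle = make_puzzle(solutions_8[0], removed_rows={0, 1, 2, 3, 4})
valid = completions(puzzle, solutions_8)
```

The number of entries in `valid` is the solution count of the instance, i.e. the quantity whose distribution Figure 10 reports.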
For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
the model learns to reason rather than memorize, the split was performed based on unique input
configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
are flattened into discrete 1D sequences x ∈ {0, 1, 2}L , where L = N 2 , along with zero-padded
puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).

[Figure 10 plots. Panels: 8x8 N-Queens (left), 10x10 N-Queens (right). x-axis: Number of solutions; y-axis: Counts.]
Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.


C.2.2    Graph Coloring Problem
Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
node in a graph such that no two adjacent nodes share the same color. We consider graphs with
N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
nodes and k = 3 colors.
Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
3-colorable.
For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
192 test instances for N = 10.
Input and Output Representation. The input graph is represented by extracting the upper triangular
portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
output is a sequence of length N , where each position encodes the assigned color for the corresponding
node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
blue, and green respectively.
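
A compact sketch of the pipeline described above: sample an Erdős–Rényi graph, enumerate valid 3-colorings by brute force, collapse color permutations to a canonical form, and encode inputs and outputs as tokens. Two points are assumptions rather than details from the source: the canonical form relabels colors in order of first appearance, and the function names and the seed are hypothetical.

```python
import random
from itertools import product


def random_graph(n, p, rng):
    """Erdős–Rényi G(n, p): each pair i < j becomes an edge with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}


def canonical(coloring):
    """ASSUMPTION: relabel colors by order of first appearance (0, 1, 2, ...)."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in coloring)


def canonical_colorings(n, edges, k=3):
    """All valid k-colorings, deduplicated up to color permutation."""
    found = set()
    for coloring in product(range(k), repeat=n):
        if all(coloring[i] != coloring[j] for i, j in edges):
            found.add(canonical(coloring))
    return sorted(found)


def encode_input(n, edges):
    """Strict upper triangle, row-major: 1 = no edge, 2 = edge."""
    return [2 if (i, j) in edges else 1 for i in range(n) for j in range(i + 1, n)]


def encode_output(coloring):
    """Color tokens are 3 + color_id."""
    return [3 + c for c in coloring]


triangle = {(0, 1), (0, 2), (1, 2)}
path = {(0, 1), (1, 2)}
sampled = random_graph(8, 0.3, random.Random(0))
```

A graph is 3-colorable exactly when `canonical_colorings` returns a non-empty list, which is the retention criterion stated above; the length of that list is the per-instance solution count.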

[Figure 11 panels: Input, Solution 1, Solution 2, Solution 3, Solution 4.]


                                             Figure 11: Graph Coloring Example


[Figure 12 plots. Panels: Vertex 8 Graph Coloring (left), Vertex 10 Graph Coloring (right). x-axis: Number of solutions; y-axis: Counts.]
Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.


D     Additional Experiment Results
D.1   Additional Results on Challenging Puzzle Benchmarks

Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
observations complementing the main text.

GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consistently
outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
in both cases) further suggests that recursive computation is essential for these tasks: the single-pass
baseline fails even with as many parameters as HRM (27M). Together, these results indicate that GRAM’s gains
arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
increased model capacity.

Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
we view these numbers as benchmark-difficulty reference points rather than controlled baselines.

Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
recursive models. All recursive model scores were obtained at 16 supervision steps.
                         Method                   #Params   Sudoku   ARC-1   ARC-2
                         Large Reasoning Models
                           Deepseek-R1            671B      0.0      15.8    1.3
                           Claude 3.7 16k         N/A       0.0      28.6    0.7
                           o3-mini-high           N/A       0.0      34.5    3.0
                           GPT 5.2 (low)          N/A       –        55.7    9.7
                           Grok-4-thinking        1.7T      –        66.7    16.0
                           Gemini 3 Pro           N/A       –        75.0    31.1
                         Recursive Models
                           Direct Pred            27M       0.0      21.0    0.0
                           Looped TF              7M        61.3     –       –
                           HRM                    27M       55.0     40.3    5.0
                           TRM                    7M        87.4     44.6    7.8
                           GRAM (Ours)            10M       97.0     52.0    11.1
                         Human Results
                           Avg. Human             –         –        60.2    –
                           Best Human             –         –        98.0    100.0

D.2   Scaling with Parallel Sampling on the ARC-AGI Challenge

To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
by generating 1,000 augmentations for a single problem and performing majority voting over the
results. Because this augmentation process itself creates a wide variety of samples, we isolated
the specific effect of generative sampling by performing inference solely on the original problem
instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
rate, and the number of layers.
As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
observations in Section 4.2, suggesting that increased inference-time compute through width scaling
allows the model to explore more plausible reasoning trajectories and recover from initial errors,
eventually leading to more robust solution discovery.
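
Majority voting over sampled paths reduces to counting identical outputs. The sketch below is a minimal illustration with a hypothetical function name; predictions are represented as tuples of tokens so they can be counted, and ties are broken in favor of the output seen first, which is an implementation choice of this sketch rather than a detail from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction among sampled trajectories.

    predictions -- non-empty list of hashable outputs (e.g. tuples of tokens)
    Ties go to the prediction that appeared first in the list.
    """
    if not predictions:
        raise ValueError("need at least one sampled prediction")
    return Counter(predictions).most_common(1)[0][0]


samples = [(1, 2, 3), (1, 2, 4), (1, 2, 3), (9, 9, 9), (1, 2, 3)]
voted = majority_vote(samples)
```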

Interaction between Augmentation and Sampling. A natural question arises: why not combine
higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
ablation study examining the interaction between data augmentation and inference-time sampling.
Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
as the level of augmentation increases, the marginal benefit of additional sampling diminishes
substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
and sampling serve complementary rather than additive roles: both mechanisms enable the model
to capture solution diversity, but through different means. When training data is limited, parallel
sampling compensates by exploring varied reasoning paths at inference time. When training data
is abundant through augmentation, the model has already internalized sufficient diversity during
training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
beyond augmentation provides diminishing returns, justifying our experimental design choice to
evaluate these two scaling axes separately.


[Figure 13 plot. x-axis: Number of Samples (N), from 1 to 500; y-axis: Accuracy. Curves: TRM, GRAM (β=0.1), GRAM (β=0.05), GRAM (β=0.04).]
Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
as the number of parallel samples N increases, outperforming the TRM baseline.


[Figure 14 plot. x-axis: Number of Samples (N), from 1 to 500; y-axis: Accuracy. Curves: Aug=0, Aug=5, Aug=10, Aug=50.]
Figure 14: Effect of augmentation on sampling efficiency. Without augmentation (Aug=0), parallel
sampling provides consistent gains. As augmentation increases, sampling benefits diminish: at Aug=50,
performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
roles in capturing solution diversity.


D.3   Solution Coverage Analysis

We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
respect to the total number of valid ground-truth solutions.
As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
coverage as the number of possible solutions increases. Since these models are constrained to a single
fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
collapse in multi-solution landscapes.
In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
where multiple valid reasoning paths exist.
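
One plausible way to compute solution coverage for a single instance is the fraction of distinct ground-truth solutions that appear among the sampled outputs. This definition is an assumption made for illustration (the precise metric is defined in the main text), and the function name is hypothetical.

```python
def solution_coverage(samples, ground_truth):
    """Fraction of distinct valid solutions recovered by the samples.

    samples      -- list of sampled outputs (hashable, e.g. tuples of tokens)
    ground_truth -- collection of all valid solutions for the instance
    ASSUMPTION: coverage = |distinct valid samples| / |valid solutions|.
    """
    truth = set(ground_truth)
    if not truth:
        raise ValueError("instance has no valid solutions")
    return len(truth & set(samples)) / len(truth)


truth = [("a",), ("b",), ("c",)]
one_sample = solution_coverage([("a",)], truth)
many_samples = solution_coverage([("a",), ("a",), ("b",), ("x",)], truth)
```

Under this definition, a deterministic model that always returns the same output can cover at most one solution per instance, so its coverage is bounded by 1 divided by the number of valid solutions.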


D.4   Additional Generated Image Samples

In this section, we provide further qualitative results demonstrating GRAM’s capability in unconditional
image generation. Figure 16 presents a diverse set of samples generated on the binarized
MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.


[Figure 15 plots: (a) N-Queens 8 × 8 and (b) N-Queens 10 × 10. x-axis: Number of Solutions; y-axis: Coverage; curves: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20).]

 Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
 ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
 space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
 increases.


 As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
 from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
 A particularly compelling property of this process is the model’s ability to recover from initially
 ambiguous or incorrect formations.
 For instance, in the second row (generating the digit ‘2’) and the last row (generating the digit ‘1’),
 the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
 as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
 resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.


 [Figure 16 image grid: columns show recursive steps t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16; each row is one sample.]
 Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated
 unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample,
 visualized across its recursive refinement process.


D.5   Additional Experiment Results on Unconditional Sudoku Generation

In this section, we provide additional details on unconditional Sudoku generation. Unlike the
conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
boards. To check whether high validity comes from repeatedly producing the same board, we also
compute the fraction of unique boards among valid samples.
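
The validity criterion and the uniqueness check described above can be sketched as follows. This is an illustrative implementation under the stated criterion, not the authors' evaluation code; the names `is_valid_sudoku` and `validity_and_uniqueness` are hypothetical, and `VALID_SAMPLE` is the valid board transcribed from Figure 17.

```python
from typing import Iterable, Sequence, Tuple

Board = Sequence[Sequence[int]]
DIGITS = set(range(1, 10))


def is_valid_sudoku(board: Board) -> bool:
    """True if every row, column, and 3 x 3 box contains 1..9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Each group has 9 cells, so equality with {1..9} implies no repeats.
    return all(group == DIGITS for group in rows + cols + boxes)


def validity_and_uniqueness(boards: Iterable[Board]) -> Tuple[float, float]:
    """Return (valid fraction over all boards, unique fraction among valid boards)."""
    total = 0
    valid = []
    for board in boards:
        total += 1
        if is_valid_sudoku(board):
            # Exact board matching: hashable tuple-of-tuples form.
            valid.append(tuple(tuple(row) for row in board))
    validity = len(valid) / total if total else 0.0
    unique = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unique


# Valid sample transcribed from Figure 17.
VALID_SAMPLE = [
    [3, 6, 5, 4, 7, 8, 9, 2, 1],
    [9, 4, 1, 2, 5, 6, 8, 7, 3],
    [2, 7, 8, 9, 1, 3, 6, 5, 4],
    [5, 2, 9, 8, 4, 7, 3, 1, 6],
    [7, 3, 6, 1, 9, 2, 5, 4, 8],
    [1, 8, 4, 3, 6, 5, 2, 9, 7],
    [8, 9, 7, 5, 3, 4, 1, 6, 2],
    [4, 1, 3, 6, 2, 9, 7, 8, 5],
    [6, 5, 2, 7, 8, 1, 4, 3, 9],
]
```

Comparing boards as tuples of tuples corresponds to exact board matching, so two boards that differ only by a symmetry (for example a digit relabeling) count as distinct.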
For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
split, discard the clue patterns, and use an all-blank board as input with the complete solution as
the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
epochs with learning rate 10⁻⁴, EMA decay 0.999, and KL coefficient 0.05. The resulting model
contains 10.9M parameters and uses 16 inference steps.
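
As a rough illustration of the data construction described above, the sketch below pairs an all-blank input with each sampled solution and collects the stated hyperparameters in one place. The blank token `0`, the class `UnconditionalSudokuConfig`, and the function `build_unconditional_pairs` are assumptions made for illustration; the actual data pipeline may differ.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Board = Tuple[Tuple[int, ...], ...]
BLANK = 0  # assumed token for an empty cell; not specified in the text


@dataclass(frozen=True)
class UnconditionalSudokuConfig:
    """Hyperparameters as stated in the text; field names are hypothetical."""
    num_solutions: int = 50_000
    epochs: int = 200
    learning_rate: float = 1e-4
    ema_decay: float = 0.999
    kl_coefficient: float = 0.05
    inference_steps: int = 16


def build_unconditional_pairs(
    solutions: Sequence[Sequence[Sequence[int]]],
    num_solutions: int,
    seed: int = 0,
) -> List[Tuple[Board, Board]]:
    """Sample solutions without replacement and pair each with a blank input.

    Clue patterns are ignored entirely: the input is always the empty board
    and the target is the complete solution. No augmentation is applied.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(solutions), k=min(num_solutions, len(solutions)))
    blank: Board = tuple(tuple(BLANK for _ in range(9)) for _ in range(9))
    return [(blank, tuple(tuple(row) for row in sol)) for sol in chosen]
```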
For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes.
D3PM-Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
generate boards with 1000 denoising steps.
As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parameters
and 1000 denoising steps. In contrast, GRAM uses 10.9M parameters and only 16 inference steps.
In all cases, the valid samples are unique under exact board matching, indicating that the reported
validity is not due to simple repetition of a small set of boards. These results show that GRAM can
generate highly constrained symbolic structures from an empty input, supporting its potential as a
generator beyond conditional puzzle solving.
Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
the task is to generate complete boards, and validity is determined by whether the generated board
satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.

Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
                         Method                     #Params            Steps       Validity (%)
                         D3PM-Uniform (Big)         55.1M              1000           91.33
                         D3PM-Uniform (Small)       15.9M              1000           29.24
                         D3PM-Absorb (Big)          55.1M              1000           79.18
                         D3PM-Absorb (Small)        15.9M              1000           21.88
                         GRAM (Ours)                10.9M               16            99.05


           Empty input                                  Valid sample                              Invalid sample
                                            3   6   5    4   7   8     9   2   1        4     3   8   2   9   1   7   6   5
                                            9   4   1    2   5   6     8   7   3        5     2   1   7   3   6   4   9   8
                                            2   7   8    9   1   3     6   5   4        7     6   9   4   5   8   1   2   3
                                            5   2   9    8   4   7     3   1   6        9     1   3   4   4   7   5   8   6
                                            7   3   6    1   9   2     5   4   8        2     5   4   6   8   9   3   1   7
                                            1   8   4    3   6   5     2   9   7        6     8   7   5   1   3   2   4   9
                                            8   9   7    5   3   4     1   6   2        3     7   2   9   6   4   8   5   1
                                            4   1   3    6   2   9     7   8   5        1     9   5   3   2   8   6   7   4
                                            6   5   2    7   8   1     4   3   9        8     4   6   1   7   5   9   3   2


Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
sample indicate cells involved in constraint violations.


D.6   Visualizing Latent Recursion Process

To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
recursive computation. Specifically, we track the high-level state h at each supervision step throughout
the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
interpolate unobserved states via a K-D tree [60] to construct a continuous loss landscape.
Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
regions of latent space before converging. While some trajectories become trapped in local minima
(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
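
The visualization procedure described above (PCA projection followed by interpolation of the loss over unobserved states) might be sketched as below. This is an assumption-laden illustration: it assumes the interpolation is a nearest-neighbor lookup of the observed loss, and it uses a brute-force NumPy search in place of a K-D tree. A K-D tree (for instance `scipy.spatial.cKDTree`) would return the same nearest neighbors, only faster. The function names are hypothetical.

```python
import numpy as np


def pca_project(states, n_components=2):
    """Project latent states of shape (num_states, dim) onto the top principal components."""
    states = np.asarray(states, dtype=float)
    mean = states.mean(axis=0)
    centered = states - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components


def nearest_neighbor_landscape(points_2d, losses, grid_size=50):
    """Fill a regular 2D grid with the loss of the nearest observed state.

    Returns (xs, ys, landscape) where landscape[i, j] is the loss assigned to
    the grid point (xs[j], ys[i]).
    """
    points_2d = np.asarray(points_2d, dtype=float)
    losses = np.asarray(losses, dtype=float)
    lo, hi = points_2d.min(axis=0), points_2d.max(axis=0)
    xs = np.linspace(lo[0], hi[0], grid_size)
    ys = np.linspace(lo[1], hi[1], grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    # Squared distance from every grid point to every observed point (brute force).
    d2 = ((grid[:, None, :] - points_2d[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return xs, ys, losses[nearest].reshape(grid_size, grid_size)
```

In such a setup, the PCA basis would typically be fit once on the states pooled from all trajectories, so that every trajectory is drawn in the same 2D coordinates.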


Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h_0 and the green dot
indicates the final state h_T. Background color represents the loss landscape: bright yellow corresponds to
high-loss regions, while dark blue indicates low-loss (optimal) regions. TRM follows a single deterministic path with
no ability to escape suboptimal trajectories.


Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
of the latent space: while some trajectories converge to local minima (bottom right), others successfully reach
the global optimum (middle left), demonstrating how parallel sampling improves solution discovery.


Licenses
Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
are generated by the authors and are therefore not external assets.

 Asset                  Use in this paper               License / terms      Source link
 MNIST                  Binarized MNIST generation      Creative Commons     https://keras.io/api/datasets/mnist/
                        experiments                     Attribution-Share
                                                        Alike 3.0
 ARC-AGI-1 /            ARC-AGI reasoning benchmark     Apache License 2.0   https://github.com/fchollet/ARC-AGI
 original ARC
 ARC-AGI-2              ARC-AGI-2 reasoning benchmark   Apache License 2.0   https://github.com/arcprize/ARC-AGI-2
                        / reference results
 HRM repository         HRM baseline and                Apache License 2.0   https://github.com/sapientinc/HRM
                        Sudoku-Extreme-related
                        reference implementation
 TinyRecursiveModels /  TRM baseline and recursive      MIT License          https://github.com/SamsungSAILMontreal/TinyRecursiveModels
 TRM repository         reasoning reference
                        implementation
 MDLM repository        Masked diffusion baseline       Apache License 2.0   https://github.com/kuleshov-group/mdlm
                        reference implementation, if
                        public code is used
 Google Research D3PM   D3PM image-generation baseline  Apache License 2.0   https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
 implementation         reference implementation, if
                        public code is used
 Looped Transformer     Looped Transformer baseline     MIT License          https://github.com/Leiay/looped_transformer
 repository             reference implementation, if
                        public code is used
 N-Queens               Synthetic multi-solution        Not an external      N/A
                        constraint satisfaction task    asset
                        generated by the authors
 Graph Coloring         Synthetic multi-solution        Not an external      N/A
                        constraint satisfaction task    asset
                        generated by the authors

