summaryrefslogtreecommitdiff
path: root/papers/txt/gram2025_generative_recursive.txt
diff options
context:
space:
mode:
Diffstat (limited to 'papers/txt/gram2025_generative_recursive.txt')
-rw-r--r--papers/txt/gram2025_generative_recursive.txt1532
1 files changed, 1532 insertions, 0 deletions
diff --git a/papers/txt/gram2025_generative_recursive.txt b/papers/txt/gram2025_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2025_generative_recursive.txt
@@ -0,0 +1,1532 @@
+ Generative Recursive Reasoning
+
+
+ Junyeob Baek1†∗ Mingyu Jo1†∗ Minsu Kim1,2
+
+ Mengye Ren3 Yoshua Bengio2,4 Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+
+
+
+ 1
+ KAIST 2 Mila – Québec AI Institute
+ 3
+ New York University 4 Université de Montréal
+
+
+
+ Abstract
+ How should future neural reasoning systems implement extended computation?
+ Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+ sive sequence extension by performing iterative latent-state refinement with shared
+ transition functions. Yet existing RRMs are largely deterministic, following a
+ single latent trajectory and converging to a single prediction. We introduce Gen-
+ erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+ latent reasoning into probabilistic multi-trajectory computation. GRAM models
+ reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+ tive solution strategies, and inference-time scaling through both recursive depth
+ and parallel trajectory sampling. This yields a latent-variable generative model
+ supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+ unconditional generation via pθ (x). Trained with amortized variational inference,
+ GRAM improves over deterministic recurrent and recursive baselines on structured
+ reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+ unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+ 1 Introduction
+ A central question for future neural reasoning systems is how extended computation should be imple-
+ mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+ process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+ plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+ Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+ than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+ because it decouples reasoning depth from both parameter scale and output length: a compact model
+ can perform many steps of internal computation by repeatedly applying shared transition functions
+ over time.
+ Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+ potential of this approach in structured reasoning. Rather than producing a solution in a single
+ feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+ supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+ latent dynamics. These features make them well suited to problems requiring constraint propagation,
+ state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+ ∗ Equal contribution
+ † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn
+
+ (sungjin.ahn@kaist.ac.kr)
+
+
+ Preprint.
+ Solution 1,
+
+
+ Input Task,
+
+
+
+ Solution 2,
+
+
+
+
+ (a) Deterministic RRMs (b) GRAM (Ours)
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions.
+Right: Given three independent runs for latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g. HRM, TRM) are
+deterministic—all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories, producing diverse trajectories that reach multiple
+valid solutions y1 and y2 , while naturally enabling parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+
+
+ 2
+ First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2 Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1 Architecture
+
+Overview. GRAM models the conditional distri- 𝑦 (A) CE loss
+bution pθ (y | x) by marginalizing over stochas-
+ Prior (B) KL Div. Posterior Decoder
+tic latent reasoning trajectories. Given an input 𝑝𝜃 (⋅ |𝑢𝑡 ) 𝑞𝜙 (⋅ |𝑢𝑡 , 𝑦) ~ 𝜖𝑡 𝑓dec
+x, GRAM first computes an embedding
+ ℎ𝑡−1 𝑓𝐻 𝑢𝑡 ℎ𝑡
+ ex = fenc (x; θ), (1)
+ 𝐾 times
+which is reused throughout the entire recursive 𝑙𝑡−1 𝑓𝐿 𝑓𝐿 𝑙𝑡
+computation. Starting from a fixed initial la-
+ Encoder
+tent state z0 , the model evolves the latent state 𝑥
+ 𝑓enc
+through learned stochastic transitions. The re-
+cursive computation is organized into two nested Figure 2: GRAM Architecture. A single stochastic
+levels: inner and outer loops. latent transition in the hierarchical instantiation z =
+At the inner level, a latent transition samples (h, l). After K low-level refinements via fL , the high-
+ level update fH produces a deterministic proposal ut , to
+a new latent state conditioned on the previous which stochastic guidance ϵt is added: ht = ut + ϵt .
+latent state and the input embedding,
+ zt ∼ pθ (zt | zt−1 , ex ), t = 1, . . . , T. (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+ (1) T transitions (1) (2) T transitions T transitions (N )
+ z0 −−−−−−−→ zT = z0 −−−−−−−→ · · · −−−−−−−→ zT sup , (3)
+ (n) (1)
+where zt denotes the latent state at the t-th transition of the n-th supervision step, z0 is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+ (n+1) (n)
+(z0 := zT ). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+ ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut )I ,
+ 
+ (4)
+ zt = ut + ϵt . (5)
+
+
+ 3
+ We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+ lt,k = fL (ht−1 , lt,k−1 , ex ; θ), k = 1, . . . , K, (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+ ut = fH (ht−1 , lt ; θ), (7)
+ ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut ) I
+ 
+ , (8)
+ ht = ut + ϵt , (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions3 . Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2 Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory τ = (z0 → · · · → zTTotal ) consists of a sequence of latent variables, with
+TTotal = T × Nsup . The conditional likelihood is defined as
+ Z
+ pθ (y | x) = pθ (y | τ, x) pθ (τ | x) dτ, (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+ log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) . (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+ 3We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+
+
+
+ 4
+ Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+ TY
+ Total TY
+ Total
+
+ pθ (τ | x) = p(z0 ) pθ (zt | zt−1 , x), qϕ (τ | x, y) = p(z0 ) qϕ (zt | zt−1 , x, y). (12)
+ t=1 t=1
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ1:TTotal , their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+pθ (y | τ, x) = pθ (y | zTTotal , x). Therefore, the full trajectory-level ELBO can be written as
+   TX Total h i
+ LELBO = Eqϕ log pθ (y | zTTotal , x) − Eqϕ (ϵ<t |x,y) KL(qϕ (ϵt | ut , y) ∥ pθ (ϵt | ut )) . (13)
+ t=1
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (9). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+ (n) (n)
+zT −1 → zT . This gives the following surrogate objective for each supervision step:
+ (n)  (n)  (n) (n) (n) (n) 
+ LGRAM (x, y; θ, ϕ) = Eqϕ log pθ (y | zT , x) − KL qϕ (ϵT | uT , y) ∥ pθ (ϵT | uT ) , (14)
+ (n)
+where zT is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3 Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width — the focus of this section — we draw {τ (i) }N i=1 ∼ pθ (τ(i)| x) from the
+learned prior and decode each terminal state into a candidate output ŷ (i) = fdec (zT ), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
+
+3 Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+
+
+ 5
+ Sudoku ARC-AGI-1 ARC-AGI-2
+ 20
+ 100 97.0% 66.7%
+ 87.4% 16.0%
+ 80 60 52.0% 55.7% 15
+Accuracy (%) 61.3% 44.6% 11.1%
+ 60 55.0% 40.3% 9.7%
+ 40 34.5% 10 7.8%
+ 40 5.0%
+ 20 5 3.0%
+ 20
+ 0 0 0
+ Looped HRM TRM GRAM HRM TRM GRAM o3-mini-GPT 5.2 Grok-4- HRM TRM GRAM o3-mini-GPT 5.2 Grok-4-
+ TF (Ours) (Ours) high (low) thinking (Ours) high (low) thinking
+ Recursive Models GRAM (Ours) LLMs
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1) Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4 Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1 Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues re-
+quiring extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction (Trans-
+
+
+ 6
+ 1.0 N= 1.0
+ 1 5 10 20 50
+ 0.8
+ 0.9
+Accuracy Looped TF
+
+
+
+
+ Accuracy
+ Looped TF 0.6 HRM
+ HRM
+ 0.8 TRM TRM
+ GRAM (Ours) 0.4 GRAM (N=1)
+ 0.2
+ 0.6
+ 0.5 0.0
+ 8 16 32 128 320 2 4 6 8 10 12 14 16 18
+ Iterations Number of Solutions
+ Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+ longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+ iteration corresponds to a supervision step, while meaning K× more flat iterations in Looped TF. (Right)
+ Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+ a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+ performance.
+
+
+ former [39]), a flat recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+ model results [40] are included as external reference points for benchmark difficulty, rather than
+ as controlled baselines, since their training and inference settings are not directly comparable to
+ task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+ reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experiment results, including more state-of-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs 90.5%), despite comparable computational budget. While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+ 4.2 Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+
+
+ 7
+ Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+ N-Queens Graph Coloring
+ 8×8 10 × 10 8-vertex 10-vertex
+ Method Rec. Gen. # Params Accuracy Coverage Accuracy Coverage Conflict↓ Coverage Conflict↓ Coverage
+ Direct Pred (8 layers) ✗ ✗ 27M 40.4±1.1 13.7±1.1 13.6±0.5 1.6±0.2 179.3±4.0 19.9±0.2 198.7±5.0 6.7±0.1
+ Direct Pred (32 layers) ✗ ✗ 100M 40.2±1.3 13.6±1.1 13.1±0.4 1.6±0.2 174.0±18.0 19.1±1.7 227.7±34.5 6.5±1.9
+ Looped TF ✓ ✗ 7M 68.4±3.7 23.6±1.9 50.0±7.6 6.2±3.2 136.0±16.1 20.5±1.5 157.3±9.0 7.2±0.7
+ HRM ✓ ✗ 27M 78.7±2.9 26.7±1.3 37.4±0.3 4.7±0.1 109.7±1.5 21.8±0.3 164.3±21.6 8.9±1.7
+ TRM ✓ ✗ 7M 66.8±5.7 36.1±22.5 17.5±11.2 2.0±1.3 109.3±3.1 22.3±0.6 170.7±17.9 6.8±0.3
+ AR ✗ ✓ 10.6M 96.3±1.0 84.8±0.8 90.0±2.2 53.2±0.8 19.0±11.3 83.0±0.7 61.3±8.3 40.0±0.3
+ MDLM ✗ ✓ 12.6M 96.1±1.5 87.2±0.6 74.3±6.6 47.4±2.2 2.7±0.6 84.5±4.0 12.0±7.0 48.2±1.4
+ GRAM (Ours) ✓ ✓ 10M 99.7±0.3 90.3±1.9 89.7±2.7 57.5±3.4 2.7±2.1 85.8±0.5 3.3±1.5 51.3±2.8
+
+
+
+
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3 Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional gen- Table 2: Unconditional generation results on
+erative capability beyond conditional reasoning, we binarized MNIST. We report IS (↑) and FID (↓).
+evaluate generation in two domains: structured con- For iterative models, a step corresponds to a super-
+straint generation on Sudoku (from empty boards, vision step for TRM and GRAM, and a denoising
+ step for D3PM. FID is calculated using real sam-
+evaluated by the fraction of generated boards satis- ples with original pixel values (0–255).
+fying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresh- Method IS (↑) FID (↓)
+olded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by VAE 1.70 86.28
+ D3PM (1000 steps) 1.86 74.03
+an empty conditioning signal and the model samples TRM (16 steps) 1.00 303.29
+an output from its learned prior. Baselines include
+D3PM [44], a discrete diffusion model, on both tasks, GRAM (Ours)
+and additionally a VAE [45] trained with binary re- 8 steps 1.85 84.08
+ 16 steps 1.89 77.79
+construction loss on MNIST. To ensure a fair compar- 32 steps 1.91 76.65
+ison with existing literature, FID is calculated using 64 steps 1.95 75.39
+real samples from the original standard MNIST. 128 steps 1.99 74.30
+ 256 steps 2.04 73.34
+Generative Behavior Beyond Reasoning. GRAM
+extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku
+generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and
+16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denois-
+ing steps. Figure 7 shows qualitative examples, illus-
+trating that the model produces diverse, fully valid
+boards from empty inputs without any explicit con-
+straint checker. On MNIST (Table 2), the deter-
+ministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable dig- Figure 5: Unconditional Sudoku generation. Va-
+its with IS and FID comparable to D3PM. Together, lidity (%) of generated Sudoku puzzles. GRAM
+these results indicate that GRAM’s stochastic latent achieves higher validity than D3PM with substan-
+ tially fewer parameters and steps.
+
+ 8
+ D3PM
+ t=0 t=100 t=200 t=300 t=400 t=500 t=600 t=700 t=800 t=900 t=1000
+GRAM TRM
+ t=0 t=1 t=2 t=4 t=6 t=7 t=9 t=11 t=12 t=14 t=16
+
+
+
+ t=0 t=1 t=2 t=4 t=6 t=7 t=9 t=11 t=12 t=14 t=16 Sample 1 Sample 2 Sample 3 Sample 4
+ (a) Generation Process (b) Samples
+ Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+ steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+ through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+ Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+ Components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+ recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential—removing
+ either significantly degrades performance.
+ (a) Architecture Ablation. (b) Mechanism Ablation.
+
+ Model variant Sudoku N-Queens Model variant Sudoku N-Queens
+ base (Looped TF) 61.25 71.30 GRAM (ours) 93.96 99.69
+ + DS + HR (=HRM, TRM) 55.00 / 87.40 80.70 / 72.90
+ w/o stochastic guidance 82.87 72.91
+ + SG 65.64 86.30
+ stochasticity only 94.88 50.27
+ + DS + SG 73.90 100.00
+ guide only 0.00 0.00
+ + DS + HR + SG (=GRAM) 93.96 99.69 w/ direct prediction 63.43 61.44
+ TRM w/ stochastic decoder 82.87 71.66
+ TRM w/ random init. 78.53 71.82
+
+
+ transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+ emerging as a natural byproduct of the recursive generative process.
+ Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+ at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+ to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+ advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+ additional samples are in Section D.4.
+
+ 4.4 Ablation Study
+
+ We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+ Table 3 summarizes the results.
+ Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+ stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+ lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+ on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+ (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+ gains in every configuration, supporting our design of stochastic guidance as the core extension
+ introduced by GRAM.
+ Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+ learned distribution ϵt ∼ N (µθ , σθ2 I) in Equation (4). Removing guidance (N (0, σθ2 I)) maintains
+ Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+ However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+ navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+ tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
+ Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+ TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+
+ 9
+ 8 9 648 59 1 648 59 1 648259317 648259317 648259317 648259317
+ 7 79 5 79 68 5 79 68 415 79 683415 79 683415 792683415 792683415
+ 9 6 289 3 6 1289 3 6 1289 3 6 71289 356 71289 356471289 356471289
+
+
+
+D3PM
+ 657 4 8 657 4 831657 4 83165734 983165734 983165734 983165734
+ 3 9 4 3 9 5 4 3 9 5 4 389 56 42389 561 423897561 423897561 423897561
+ 7 8 7 8 2 7 8 2 7 648 2 5 73648 2 5 73648 2 5173648 2 517364892
+ 5 5 13 548 139548 2 139548626 139548626 139548626 139548626
+ 59 593 593 1 8 593 1 8 27593 148 275936148 275936148 275936148
+ 2 8 2 3 8 7 2 3 8 7 2 53 8 47 2953 8 4712953 864712953 864712953
+ t=0 t=125 t=250 t=375 t=500 t=625 t=750 t=875 t=1000
+ 717348652 716348952 716348952 716348952 716348952 716348952 716348952 716348952
+ 483927379 493527861 493527861 493527861 493527861 493527861 493527861 493527861
+ 523971734 528996734 528169734 528169734 528169734 528169734 528169734 528169734
+GRAM
+
+
+ 864235917 864251379 864215379 864215379 864215379 864215379 864215379 864215379
+ 359841126 359784126 359874126 359874126 359874126 359874126 359874126 359874126
+ 172496538 172936548 172936548 172936548 172936548 172936548 172936548 172936548
+ 937812445 937812615 937482615 937482615 937482615 937482615 937482615 937482615
+ 645779283 645179283 645791283 645791283 645791283 645791283 645791283 645791283
+ 281623497 281663497 281653497 281653497 281653497 281653497 281653497 281653497
+ iter=0 iter=1 iter=2 iter=4 iter=6 iter=8 iter=10 iter=13 iter=16
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+5 Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate GRAM can leverage width-based inference-time scaling as a complement
+to depth: by sampling multiple latent trajectories in parallel, bypassing the latency bottleneck of
+depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose exten-
+sion that consistently improves any recursive architecture, and that the gains stem specifically from
+the variational framework — not from mere randomness, as naive stochastic alternatives applied to
+existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+
+
+
+ 10
+ Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By main-
+taining multiple latent trajectories, it may benefit tasks such as constraint satisfaction, and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+ Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+ Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+ Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+ vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+ Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+ of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+ AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+ Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+ by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+ arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+ Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+ reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+ are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+ Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+ preprint arXiv:2510.04871, 2025.
+
+
+ 11
+ [10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+ versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+ recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+ recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+ preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+ son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+ with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+ beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+ Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+ concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+ hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+ Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+ arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+ Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+ preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+ Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+ recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+ thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+ arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+ Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+ arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+ 1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+ Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+ decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+
+
+ 12
+ [30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+ Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+ preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+ preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+ arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+ Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+ org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+ models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+ //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+ James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+ //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+ discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+ through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+ processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+ benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+ Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+ models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+ 2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+ Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+ neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+ Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+ processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+ arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+ Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+ discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+ of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+
+
+ 13
+ [50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+ applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+ convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+ function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+ on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+ arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+ natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+ Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+ Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+ BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+ learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+ 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+ IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+ Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+ mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+ 13091446.
+
+
+
+
+ 14
+ A Additional Method Details
+A.1 Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head qψ : RD → R2 that maps the high-level state h
+to two scalar values, qψ (h) = (q halt , q continue ). These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+ (n)
+After computing the latent state zT at the end of supervision step n, we form Q-learning targets:
+
+ • q̂nhalt = 1[ŷ (n) = y], indicating whether decoding the current state would yield a correct
+ prediction.
+ 
+ • q̂ncontinue = max qn+1halt continue
+ , qn+1 , the bootstrapped value of running one more supervision
+ step.
+
+The halt head is trained by regression to these targets:
+ Nsup h
+ X 2 2 i
+ LACT = qnhalt − q̂nhalt + qncontinue − q̂ncontinue . (15)
+ n=1
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+n, we evaluate qψ (h(n) ) and halt if qnhalt > qncontinue , returning ŷ (n) as the prediction. Otherwise,
+ max
+computation continues to the next supervision step, up to a maximum budget of Nsup steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only q halt
+(halting when σ(q halt ) > 0.5, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
+
+A.2 Latent Process Reward Model (LPRM).
+
+To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
+accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
+with the main objective via a regression loss:
+ T
+ X
+ LLPRM = (vψ (zt ) − r)2 , (16)
+ t=1
+
+where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
+
+A.3 Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+
+
+ 15
+ well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+ (n)
+across all TTotal transitions; the surrogate objective is evaluated as the average of LGRAM over the
+Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+ Sudoku-Extreme N-Queens 8 × 8
+ GRAM (training objective) GRAM (training objective)
+ 0.6 ELBO (full) 103 ELBO (full)
+
+
+ 0.5 102
+ ELBO
+
+
+
+
+ ELBO
+ 0.4 101
+
+ 100
+ 0.3
+ 10 1
+ 0.2
+ 10000 20000 30000 40000 50000 60000 0 10000 20000 30000 40000
+ Training Step Training Step
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since LELBO also serves as an indirect estimate of the negative log-likelihood,
+its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B Training and Architecture Details
+B.1 Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally con-
+catenated with puzzle embeddings (for ARC√ [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by D and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. In addition, as an exception, we use
+[SwiGLU + SwiGLU] network for the Recursive Core module instead of [Attention + SwiGLU] for
+Sudoku tasks, following [9]. For initialization of z0 = (h0 , l0 ), we sample once from the standard
+Gaussian distribution N (0, I), then save the value within the network checkpoint and load it again,
+meaning the initialized z0 has a fixed value.
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
+
+
+ 16
+ Table 4: Architecture components.
+ Component Module Description
+ Encoder
+ Token Embedding vocab → D
+ Puzzle Embedding 16 tokens (optional, for ARC)
+ Position Encoding RoPE or learned
+ Recursive Core
+ f L , fH [Attention + SwiGLU] × 2 layers
+ Iterations K low-level, T high-level steps
+ µθ , σ θ , µ ϕ , σ ϕ SwiGLU MLP for each parameter
+ Decoder
+ LM Head Linear(D → vocab)
+ Q Head Linear(D → 2) for halt
+ V Head Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0 ∼ 255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+ Stage Layer / Operation Output Dim.
+ Input Tokens (B, C, H, W )
+ 1. Norm.
+ Linear Scaling [−1, 1] (B, C, H, W )
+ Conv2d 5 × 5 (p = 2)
+ (B, D/2, H, W )
+ SiLU → GN(32)
+ 2. Conv
+ Conv2d 5 × 5 (p = 2)
+ (B, D/2, H, W )
+ SiLU → GN(32)
+ Flatten Patches (B, Np , P 2 · D
+ 2)
+ 3. Patch
+ Linear Projection (B, Np , D)
+
+
+Hyperparameters. Following Wang et al. [8], Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, attention heads Nhead =8 , and FFN hidden dimension
+Dh =512. Within a recursion step, meaning a latent transition zt → zt+1 , we use low-level (inner)
+steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2 Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (Summarized
+in Table 6).
+
+
+ 17
+ • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+ (0=pad, 1=blank, 2–10=digits).
+ • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+ markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+ specific puzzle embeddings are prepended to distinguish different ARC tasks.
+ • N-Queens flattens an N × N board row-by-row into N 2 tokens with vocabulary size 3
+ (0=pad, 1=empty, 2=queen).
+ • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+ using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+ • For image generation on MNIST [15], images are quantized and processed via CNN-based
+ patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. Then,
+ patched input forms 14 × 14 flattened sequence tokens with vocabulary size 3 (0=pad,
+ 1=black, 2=white).
+
+
+ Table 6: Task-specific configurations.
+ Task Seq. Len Vocab Puzzle Emb Encoding
+ Sudoku 81 11 ✗ 9×9 grid, row-major
+ ARC-AGI 900 12 ✓ 30×30 padded canvas
+ N-Queens N2 3 ✗ N × N board
+ n(n−1)
+ Graph Coloring 2
+ 6 ✗ Strict adjacency upper triangle
+ MNIST 196 3 ✗ 14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10−4 , weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+ Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+ Task Epochs GPUs Time
+ Sudoku 50K 8 2h
+ ARC-AGI 200K 8 5 days
+ N-Queens (8×8) 3K 8 1h
+ N-Queens (10×10) 1K 8 3h
+ Graph Coloring (8 nodes) 5K 8 1.5h
+ Graph Coloring (10 nodes) 5K 8 6h
+ MNIST 1.8K 8 16h
+
+
+
+C Additional Details of Experiment Setup
+C.1 Challenging Puzzle Tasks
+
+C.1.1 Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+
+
+ 18
+ TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2 Multi-solution Puzzle Tasks
+
+C.2.1 N-Queens Problem
+
+ Input Solution 1 Solution 2 Solution 3
+
+
+
+
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must find the positions of the remaining queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}L , where L = N 2 , along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+ 8x8 N-Queens 10x10 N-Queens
+ 3000 20000
+ 15000
+ Counts
+
+
+
+
+ Counts
+
+
+
+
+ 2000
+ 10000
+ 1000
+ 5000
+ 0 0
+ 3 6 9 12 15 18 0 20 40 60 80
+ Number of solutions Number of solutions
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2 Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+
+
+ 19
+ with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
+
+ Input Solution 1 Solution 2 Solution 3 Solution 4
+
+
+
+
+ Figure 11: Graph Coloring Example
+
+
+ Vertex 8 Graph Coloring Vertex 10 Graph Coloring
+ 2000
+ 1500
+ 1500
+ Counts
+
+
+
+
+ Counts
+
+
+
+
+ 1000
+ 1000
+ 500 500
+ 0 0
+ 3 6 9 12 15 18 0 20 40 60 80
+ Number of solutions Number of solutions
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D Additional Experiment Results
+D.1 Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+
+
+ 20
+ Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+ Method #Params Sudoku ARC-1 ARC-2
+ Large Reasoning Models
+ Deepseek-R1 671B 0.0 15.8 1.3
+ Claude 3.7 16k N/A 0.0 28.6 0.7
+ o3-mini-high N/A 0.0 34.5 3.0
+ GPT 5.2 (low) N/A – 55.7 9.7
+ Grok-4-thinking 1.7T – 66.7 16.0
+ Gemini 3 Pro N/A – 75.0 31.1
+ Recursive Models
+ Direct Pred 27M 0.0 21.0 0.0
+ Looped TF 7M 61.3 - -
+ HRM 27M 55.0 40.3 5.0
+ TRM 7M 87.4 44.6 7.8
+ GRAM (Ours) 10M 97.0 52.0 11.1
+ Human Results
+ Avg. Human – – 60.2 –
+ Best Human – – 98.0 100.0
+
+
+
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+D.2 Scales with Parallel Sampling on ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+
+ 21
+ 0.450
+ 0.425
+ 0.400
+ 0.375
+
+
+
+
+ Accuracy
+ 0.350
+ 0.325
+ 0.300 TRM
+ GRAM ( =0.1)
+ 0.275 GRAM ( =0.05)
+ GRAM ( =0.04)
+ 0.250 1 2 5 10 20 50 100 250 500
+ Number of Samples (N)
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+ 0.500
+ 0.475
+ 0.450
+ Accuracy
+
+
+
+
+ 0.425
+ 0.400
+ 0.375 Aug=0
+ Aug=5
+ 0.350 Aug=10
+ Aug=50
+ 0.325
+ 1 2 5 10 20 50 100 250 500
+ Number of Samples (N)
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish—at Aug= 50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3 Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
+
+
+D.4 Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+
+ 22
+ 1.0 HRM 1.0 HRM
+ TRM TRM
+ GRAM (N=1) GRAM (N=1)
+ GRAM (N=5) GRAM (N=5)
+ 0.8 GRAM (N=10) 0.8 GRAM (N=10)
+ GRAM (N=20) GRAM (N=20)
+
+
+ 0.6 0.6
+Coverage
+
+
+
+
+ Coverage
+ 0.4 0.4
+
+
+ 0.2 0.2
+
+
+ 0.0 0.0
+ 2 4 6 8 10 12 14 16 18 0 15 30 45 60 75 90
+ Number of Solutions Number of Solutions
+ (a) N-Queens 8 × 8 (b) N-Queens 10 × 10
+
+ Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+ space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+ increases.
+
+
+
+ As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+ from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+ A particularly compelling property of this process is the model’s ability to recover from initially
+ ambiguous or incorrect formations.
+ For instance, in the second row (generating the digit ’2’) and the last row (generating the digit ’1’),
+ the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+ as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+ resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+
+
+
+ t=0 t=1 t=2 t=4 t=6 t=7 t=9 t=11 t=12 t=14 t=16
+ Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated uncondi-
+ tionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its
+ recursive refinement process.
+
+
+
+
+ 23
+ D.5 Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10−4 , EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+ Method #Params Steps Validity(%)
+ D3PM-Uniform (Big) 55.1M 1000 91.33
+ D3PM-Uniform (Small) 15.9M 1000 29.24
+ D3PM-Absorb (Big) 55.1M 1000 79.18
+ D3PM-Absorb (Small) 15.9M 1000 21.88
+ GRAM (Ours) 10.9M 16 99.05
+
+
+ Empty input Valid sample Invalid sample
+ 3 6 5 4 7 8 9 2 1 4 3 8 2 9 1 7 6 5
+ 9 4 1 2 5 6 8 7 3 5 2 1 7 3 6 4 9 8
+ 2 7 8 9 1 3 6 5 4 7 6 9 4 5 8 1 2 3
+ 5 2 9 8 4 7 3 1 6 9 1 3 4 4 7 5 8 6
+ 7 3 6 1 9 2 5 4 8 2 5 4 6 8 9 3 1 7
+ 1 8 4 3 6 5 2 9 7 6 8 7 5 1 3 2 4 9
+ 8 9 7 5 3 4 1 6 2 3 7 2 9 6 4 8 5 1
+ 4 1 3 6 2 9 7 8 5 1 9 5 3 2 8 6 7 4
+ 6 5 2 7 8 1 4 3 9 8 4 6 1 7 5 9 3 2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+
+
+ 24
+ D.6 Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+ 25
+ Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (right bottom), others successfully reach
+the global optimum (left middle), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+ 26
+ Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+ Asset Use in this paper License / terms Source link
+ MNIST Binarized MNIST genera- Creative Com- https://keras.io/api/
+ tion experiments mons Attribution- datasets/mnist/
+ Share Alike 3.0
+ ARC-AGI-1 / origi- ARC-AGI reasoning Apache License https://github.com/fchollet/
+ nal ARC benchmark 2.0 ARC-AGI
+ ARC-AGI-2 ARC-AGI-2 reasoning Apache License https://github.com/arcprize/
+ benchmark / reference 2.0 ARC-AGI-2
+ results
+ HRM repository HRM baseline and Apache License https://github.com/
+ Sudoku-Extreme-related 2.0 sapientinc/HRM
+ reference implementation
+ TinyRecursiveModels TRM baseline and recur- MIT License https://github.com/
+ / TRM repository sive reasoning reference SamsungSAILMontreal/
+ implementation TinyRecursiveModels
+ MDLM repository Masked diffusion base- Apache License https://github.com/
+ line reference implemen- 2.0 kuleshov-group/mdlm
+ tation, if public code is
+ used
+ Google Research D3PM image-generation Apache License https://github.com/
+ D3PM implementa- baseline reference imple- 2.0 google-research/
+ tion mentation, if public code google-research/blob/master/
+ is used d3pm/images/diffusion_
+ categorical.py
+ Looped Trans- Looped Transformer base- MIT License https://github.com/Leiay/
+ former repository line reference implemen- looped_transformer
+ tation, if public code is
+ used
+ N-Queens Synthetic multi-solution Not an external N/A
+ constraint satisfaction asset
+ task generated by the
+ authors
+ Graph Coloring Synthetic multi-solution Not an external N/A
+ constraint satisfaction asset
+ task generated by the
+ authors
+
+
+
+
+ 27
+ \ No newline at end of file