1 files changed, 1532 insertions, 0 deletions
diff --git a/papers/txt/gram2025_generative_recursive.txt b/papers/txt/gram2025_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2025_generative_recursive.txt
@@ -0,0 +1,1532 @@
+                                                                    Generative Recursive Reasoning
+
+
+                                                                      Junyeob Baek1†∗ Mingyu Jo1†∗          Minsu Kim1,2
+
+                                                                     Mengye Ren3       Yoshua Bengio2,4     Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+
+
+
+                                                                              1
+                                                                               KAIST 2 Mila – Québec AI Institute
+                                                                       3
+                                                                           New York University 4 Université de Montréal
+
+
+
+                                                                                          Abstract
+                                                     How should future neural reasoning systems implement extended computation?
+                                                     Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+                                                     sive sequence extension by performing iterative latent-state refinement with shared
+                                                     transition functions. Yet existing RRMs are largely deterministic, following a
+                                                     single latent trajectory and converging to a single prediction. We introduce Gen-
+                                                     erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+                                                     latent reasoning into probabilistic multi-trajectory computation. GRAM models
+                                                     reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+                                                     tive solution strategies, and inference-time scaling through both recursive depth
+                                                     and parallel trajectory sampling. This yields a latent-variable generative model
+                                                     supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+                                                     unconditional generation via pθ (x). Trained with amortized variational inference,
+                                                     GRAM improves over deterministic recurrent and recursive baselines on structured
+                                                     reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+                                                     unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+                                         1     Introduction
+                                         A central question for future neural reasoning systems is how extended computation should be imple-
+                                         mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+                                         process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+                                         plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+                                         Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+                                         than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+                                         because it decouples reasoning depth from both parameter scale and output length: a compact model
+                                         can perform many steps of internal computation by repeatedly applying shared transition functions
+                                         over time.
+                                         Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+                                         potential of this approach in structured reasoning. Rather than producing a solution in a single
+                                         feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+                                         supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+                                         latent dynamics. These features make them well suited to problems requiring constraint propagation,
+                                         state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+                                             ∗ Equal contribution
+                                            † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn
+
+                                         (sungjin.ahn@kaist.ac.kr)
+
+
+                                         Preprint.
+                      Solution 1,
+
+
+     Input Task,
+
+
+
+                      Solution 2,
+
+
+
+
+                                              (a) Deterministic RRMs                (b) GRAM (Ours)
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions.
+Right: Given three independent runs for latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g. HRM, TRM) are
+deterministic—all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories, producing diverse trajectories that reach multiple
+valid solutions y1 and y2 , while naturally enabling parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+
+
+                                                        2
+First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2     Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1   Architecture
+
+Overview. GRAM models the conditional distri-                                                                      𝑦        (A) CE loss
+bution pθ (y | x) by marginalizing over stochas-
+                                                                            Prior          (B) KL Div.         Posterior                  Decoder
+tic latent reasoning trajectories. Given an input                          𝑝𝜃 (⋅ |𝑢𝑡 )                        𝑞𝜙 (⋅ |𝑢𝑡 , 𝑦) ~ 𝜖𝑡          𝑓dec
+x, GRAM first computes an embedding
+                                                              ℎ𝑡−1                                       𝑓𝐻        𝑢𝑡                       ℎ𝑡
+                ex = fenc (x; θ),                       (1)
+                                                                                         𝐾 times
+which is reused throughout the entire recursive 𝑙𝑡−1              𝑓𝐿         𝑓𝐿                        𝑙𝑡
+computation. Starting from a fixed initial la-
+                                                          Encoder
+tent state z0 , the model evolves the latent state   𝑥
+                                                            𝑓enc
+through learned stochastic transitions. The re-
+cursive computation is organized into two nested Figure 2: GRAM Architecture. A single stochastic
+levels: inner and outer loops.                     latent transition in the hierarchical instantiation z =
+At the inner level, a latent transition samples (h, l). After K low-level refinements via fL , the high-
+                                                level update fH produces a deterministic proposal ut , to
+a new latent state conditioned on the previous which stochastic  guidance ϵt is added: ht = ut + ϵt .
+latent state and the input embedding,
+                                  zt ∼ pθ (zt | zt−1 , ex ),              t = 1, . . . , T.                                                   (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+                 (1)   T transitions    (1)       (2)     T transitions                  T transitions          (N      )
+               z0      −−−−−−−→ zT = z0                  −−−−−−−→ · · · −−−−−−−→ zT sup ,                                                     (3)
+        (n)                                                                                                                         (1)
+where zt denotes the latent state at the t-th transition of the n-th supervision step, z0 is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+   (n+1)       (n)
+(z0      := zT ). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+                                ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut )I ,
+                                                                            
+                                                                                                     (4)
+                                 zt = ut + ϵt .                                                                                               (5)
+
+
+                                                              3
+We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+                              lt,k = fL (ht−1 , lt,k−1 , ex ; θ),            k = 1, . . . , K,                (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+                                    ut = fH (ht−1 , lt ; θ),                                                  (7)
+                                    ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut ) I
+                                                                                        
+                                                                                            ,                 (8)
+                                    ht = ut + ϵt ,                                                            (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions3 . Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2   Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory τ = (z0 → · · · → zTTotal ) consists of a sequence of latent variables, with
+TTotal = T × Nsup . The conditional likelihood is defined as
+                                            Z
+                               pθ (y | x) = pθ (y | τ, x) pθ (τ | x) dτ,                           (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+              log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) .           (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+   3We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+
+
+
+                                                              4
+Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+                     TY
+                      Total                                            TY
+                                                                        Total
+
+ pθ (τ | x) = p(z0 )        pθ (zt | zt−1 , x), qϕ (τ | x, y) = p(z0 )        qϕ (zt | zt−1 , x, y). (12)
+                        t=1                                                  t=1
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ1:TTotal , their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+pθ (y | τ, x) = pθ (y | zTTotal , x). Therefore, the full trajectory-level ELBO can be written as
+                                          TX Total               h                                    i
+ LELBO = Eqϕ log pθ (y | zTTotal , x) −              Eqϕ (ϵ<t |x,y) KL(qϕ (ϵt | ut , y) ∥ pθ (ϵt | ut )) . (13)
+                                               t=1
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (9). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+ (n)      (n)
+zT −1 → zT . This gives the following surrogate objective for each supervision step:
+     (n)                                 (n)               (n)   (n)           (n)   (n) 
+   LGRAM (x, y; θ, ϕ) = Eqϕ log pθ (y | zT , x) − KL qϕ (ϵT | uT , y) ∥ pθ (ϵT | uT ) , (14)
+        (n)
+where zT is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3   Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width — the focus of this section — we draw {τ (i) }N  i=1 ∼ pθ (τ(i)| x) from the
+learned prior and decode each terminal state into a candidate output ŷ (i) = fdec (zT ), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
+
+3     Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+
+
+                                                      5
+                              Sudoku                                  ARC-AGI-1                                              ARC-AGI-2
+                                                                                                          20
+               100                           97.0%                                                66.7%
+                                     87.4%                                                                                                         16.0%
+                80                                   60                   52.0%           55.7%           15
+Accuracy (%)         61.3%                                        44.6%                                                      11.1%
+                60           55.0%                        40.3%                                                                             9.7%
+                                                     40                           34.5%                   10          7.8%
+                40                                                                                             5.0%
+                                                     20                                                    5                         3.0%
+                20
+                 0                                   0                                                     0
+                     Looped HRM TRM GRAM                  HRM TRM GRAM o3-mini-GPT 5.2 Grok-4-                 HRM TRM GRAM o3-mini-GPT 5.2 Grok-4-
+                       TF           (Ours)                        (Ours) high (low) thinking                           (Ours) high (low) thinking
+                                                           Recursive Models                GRAM (Ours)          LLMs
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1) Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4                Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1                  Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues re-
+quiring extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction (Trans-
+
+
+                                                                                     6
+           1.0                     N=                                    1.0
+                                         1 5 10 20 50
+                                                                         0.8
+           0.9
+Accuracy                                                                                                  Looped TF
+
+
+
+
+                                                              Accuracy
+                                                Looped TF                0.6                              HRM
+                                                HRM
+           0.8                                  TRM                                                       TRM
+                                                GRAM (Ours)              0.4                              GRAM (N=1)
+                                                                         0.2
+           0.6
+           0.5                                                           0.0
+                 8   16    32             128          320                     2   4    6   8   10   12   14   16   18
+                            Iterations                                                 Number of Solutions
+ Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+ longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+ iteration corresponds to a supervision step, while meaning K× more flat iterations in Looped TF. (Right)
+ Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+ a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+ performance.
+
+
+ former [39]), a flat recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+ model results [40] are included as external reference points for benchmark difficulty, rather than
+ as controlled baselines, since their training and inference settings are not directly comparable to
+ task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+ reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experiment results, including more state-of-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs 90.5%), despite comparable computational budget. While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+ 4.2        Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+
+
+                                                              7
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+                                                           N-Queens                                 Graph Coloring
+                                                    8×8              10 × 10                 8-vertex            10-vertex
+ Method                  Rec. Gen. # Params Accuracy Coverage Accuracy Coverage       Conflict↓ Coverage Conflict↓ Coverage
+ Direct Pred (8 layers)   ✗    ✗     27M     40.4±1.1  13.7±1.1 13.6±0.5    1.6±0.2   179.3±4.0 19.9±0.2 198.7±5.0      6.7±0.1
+ Direct Pred (32 layers) ✗     ✗     100M    40.2±1.3  13.6±1.1 13.1±0.4    1.6±0.2   174.0±18.0 19.1±1.7 227.7±34.5    6.5±1.9
+ Looped TF                ✓    ✗       7M    68.4±3.7  23.6±1.9 50.0±7.6    6.2±3.2   136.0±16.1 20.5±1.5 157.3±9.0     7.2±0.7
+ HRM                      ✓    ✗      27M    78.7±2.9  26.7±1.3 37.4±0.3    4.7±0.1   109.7±1.5 21.8±0.3 164.3±21.6     8.9±1.7
+ TRM                      ✓    ✗       7M    66.8±5.7 36.1±22.5 17.5±11.2  2.0±1.3    109.3±3.1 22.3±0.6 170.7±17.9     6.8±0.3
+ AR                       ✗    ✓    10.6M    96.3±1.0  84.8±0.8 90.0±2.2   53.2±0.8    19.0±11.3   83.0±0.7 61.3±8.3   40.0±0.3
+ MDLM                     ✗    ✓    12.6M    96.1±1.5  87.2±0.6 74.3±6.6   47.4±2.2     2.7±0.6    84.5±4.0 12.0±7.0   48.2±1.4
+ GRAM (Ours)              ✓    ✓     10M     99.7±0.3  90.3±1.9 89.7±2.7   57.5±3.4     2.7±2.1    85.8±0.5  3.3±1.5   51.3±2.8
+
+
+
+
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3   Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional gen-                    Table 2: Unconditional generation results on
+erative capability beyond conditional reasoning, we                binarized MNIST. We report IS (↑) and FID (↓).
+evaluate generation in two domains: structured con-                For iterative models, a step corresponds to a super-
+straint generation on Sudoku (from empty boards,                   vision step for TRM and GRAM, and a denoising
+                                                                   step for D3PM. FID is calculated using real sam-
+evaluated by the fraction of generated boards satis-               ples with original pixel values (0–255).
+fying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresh-                       Method                    IS (↑)    FID (↓)
+olded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by                     VAE                        1.70       86.28
+                                                                           D3PM (1000 steps)          1.86       74.03
+an empty conditioning signal and the model samples                         TRM (16 steps)             1.00      303.29
+an output from its learned prior. Baselines include
+D3PM [44], a discrete diffusion model, on both tasks,                      GRAM (Ours)
+and additionally a VAE [45] trained with binary re-                         8 steps                   1.85       84.08
+                                                                            16 steps                  1.89       77.79
+construction loss on MNIST. To ensure a fair compar-                        32 steps                  1.91       76.65
+ison with existing literature, FID is calculated using                      64 steps                  1.95       75.39
+real samples from the original standard MNIST.                              128 steps                 1.99       74.30
+                                                                            256 steps                 2.04       73.34
+Generative Behavior Beyond Reasoning. GRAM
+extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku
+generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and
+16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denois-
+ing steps. Figure 7 shows qualitative examples, illus-
+trating that the model produces diverse, fully valid
+boards from empty inputs without any explicit con-
+straint checker. On MNIST (Table 2), the deter-
+ministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable dig- Figure 5: Unconditional Sudoku generation. Va-
+its with IS and FID comparable to D3PM. Together, lidity (%) of generated Sudoku puzzles. GRAM
+these results indicate that GRAM’s stochastic latent achieves higher validity than D3PM with substan-
+                                                                    tially fewer parameters and steps.
+
+                                                            8
+D3PM
+           t=0   t=100     t=200   t=300     t=400   t=500      t=600   t=700   t=800   t=900    t=1000
+GRAM TRM
+           t=0    t=1       t=2     t=4       t=6     t=7        t=9    t=11    t=12    t=14      t=16
+
+
+
+           t=0    t=1       t=2     t=4       t=6     t=7        t=9    t=11    t=12    t=14      t=16    Sample 1 Sample 2 Sample 3 Sample 4
+                                             (a) Generation Process                                                 (b) Samples
+  Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+  steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+  through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+  Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+  Components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+  recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential—removing
+  either significantly degrades performance.
+                        (a) Architecture Ablation.                                              (b) Mechanism Ablation.
+
+  Model variant                             Sudoku            N-Queens          Model variant                   Sudoku       N-Queens
+  base (Looped TF)                            61.25              71.30          GRAM (ours)                      93.96          99.69
+  + DS + HR (=HRM, TRM)                   55.00 / 87.40      80.70 / 72.90
+                                                                                w/o stochastic guidance          82.87          72.91
+  + SG                                        65.64              86.30
+                                                                                stochasticity only               94.88          50.27
+  + DS + SG                                   73.90             100.00
+                                                                                guide only                        0.00           0.00
+  + DS + HR + SG (=GRAM)                     93.96              99.69           w/ direct prediction             63.43          61.44
+                                                                                TRM w/ stochastic decoder        82.87          71.66
+                                                                                TRM w/ random init.              78.53          71.82
+
+
+  transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+  emerging as a natural byproduct of the recursive generative process.
+  Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+  at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+  to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+  advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+  additional samples are in Section D.4.
+
+  4.4        Ablation Study
+
+  We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+  Table 3 summarizes the results.
+  Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+  stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+  lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+  on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+  (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+  gains in every configuration, supporting our design of stochastic guidance as the core extension
+  introduced by GRAM.
+  Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+  learned distribution ϵt ∼ N (µθ , σθ2 I) in Equation (4). Removing guidance (N (0, σθ2 I)) maintains
+  Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+  However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+  navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+  tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
+  Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+  TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+
+                                                                         9
+                                     8    9       648 59 1        648 59 1     648259317   648259317   648259317   648259317
+                 7                 79         5   79 68       5   79 68 415    79 683415   79 683415   792683415   792683415
+                               9     6      289   3 6     1289    3 6   1289   3 6 71289   356 71289   356471289   356471289
+
+
+
+D3PM
+                                         657 4      8    657 4      831657 4    83165734   983165734   983165734   983165734
+                     3 9           4 3 9 5        4 3 9 5         4 389 56     42389 561   423897561   423897561   423897561
+                     7     8         7      8 2       7     8 2      7 648 2   5 73648 2   5 73648 2   5173648 2   517364892
+                       5               5          13 548          139548 2     139548626   139548626   139548626   139548626
+                     59              593              593 1 8        593 1 8   27593 148   275936148   275936148   275936148
+                         2         8      2   3   8     7 2   3   8   7 2 53   8 47 2953   8 4712953   864712953   864712953
+         t=0         t=125              t=250        t=375          t=500        t=625       t=750       t=875      t=1000
+                 717348652         716348952      716348952       716348952    716348952   716348952   716348952   716348952
+                 483927379         493527861      493527861       493527861    493527861   493527861   493527861   493527861
+                 523971734         528996734      528169734       528169734    528169734   528169734   528169734   528169734
+GRAM
+
+
+                 864235917         864251379      864215379       864215379    864215379   864215379   864215379   864215379
+                 359841126         359784126      359874126       359874126    359874126   359874126   359874126   359874126
+                 172496538         172936548      172936548       172936548    172936548   172936548   172936548   172936548
+                 937812445         937812615      937482615       937482615    937482615   937482615   937482615   937482615
+                 645779283         645179283      645791283       645791283    645791283   645791283   645791283   645791283
+                 281623497         281663497      281653497       281653497    281653497   281653497   281653497   281653497
+        iter=0       iter=1             iter=2       iter=4         iter=6       iter=8     iter=10     iter=13     iter=16
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+5      Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate GRAM can leverage width-based inference-time scaling as a complement
+to depth: by sampling multiple latent trajectories in parallel, bypassing the latency bottleneck of
+depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose exten-
+sion that consistently improves any recursive architecture, and that the gains stem specifically from
+the variational framework — not from mere randomness, as naive stochastic alternatives applied to
+existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+
+
+
+                                                                    10
+Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By main-
+taining multiple latent trajectories, it may benefit tasks such as constraint satisfaction, and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+     Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+     Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+     Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+     vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+     Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+     of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+     AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+     Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+     arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+     by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+     arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+     Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+     reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+     are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+     Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+     preprint arXiv:2510.04871, 2025.
+
+
+                                                   11
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+     versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+     agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+     2025.
+[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+     recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+     recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+     preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+     son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+     with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+     beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+     Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+     concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+     hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+     Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+     arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+     Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+     preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+     Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+     recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+     thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+     arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+     Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+     arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+     1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+     Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+     decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+
+
+                                                 12
+[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+     Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+     preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+     preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+     arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+     Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+     org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+     models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+     //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+     James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+     //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+     discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+     through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+     processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+     benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+     Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+     models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+     2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+     Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+     neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+     Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+     processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+     arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+     Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+     discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+     of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+
+
+                                                13
+[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+     applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+     convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+     function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+     on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+     arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+     natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+     Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+     Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+     BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+     learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+     31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+     IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+     Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+     mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+     13091446.
+
+
+
+
+                                               14
+A     Additional Method Details
+A.1   Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head qψ : RD → R2 that maps the high-level state h
+to two scalar values, qψ (h) = (q halt , q continue ). These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+                                  (n)
+After computing the latent state zT at the end of supervision step n, we form Q-learning targets:
+
+       • q̂nhalt = 1[ŷ (n) = y], indicating whether decoding the current state would yield a correct
+         prediction.
+                                                
+       • q̂ncontinue = max qn+1halt    continue
+                                    , qn+1        , the bootstrapped value of running one more supervision
+         step.
+
+The halt head is trained by regression to these targets:
+                               Nsup h
+                               X                           2                                2 i
+                     LACT =             qnhalt − q̂nhalt        + qncontinue − q̂ncontinue       .     (15)
+                               n=1
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+n, we evaluate qψ (h(n) ) and halt if qnhalt > qncontinue , returning ŷ (n) as the prediction. Otherwise,
+                                                                                       max
+computation continues to the next supervision step, up to a maximum budget of Nsup          steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only q halt
+(halting when σ(q halt ) > 0.5, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
+
+A.2   Latent Process Reward Model (LPRM).
+
+To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
+accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
+with the main objective via a regression loss:
+                                                    T
+                                                    X
+                                        LLPRM =       (vψ (zt ) − r)2 ,                                (16)
+                                                    t=1
+
+where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
+
+A.3   Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+
+
+                                                       15
+well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+                                                                                       (n)
+across all TTotal transitions; the surrogate objective is evaluated as the average of LGRAM over the
+Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+                              Sudoku-Extreme                                                  N-Queens 8 × 8
+                                        GRAM (training objective)                                      GRAM (training objective)
+             0.6                        ELBO (full)                         103                        ELBO (full)
+
+
+             0.5                                                            102
+      ELBO
+
+
+
+
+                                                                    ELBO
+             0.4                                                            101
+
+                                                                            100
+             0.3
+                                                                           10 1
+             0.2
+                   10000 20000 30000 40000 50000 60000                            0   10000    20000    30000      40000
+                               Training Step                                                  Training Step
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since LELBO also serves as an indirect estimate of the negative log-likelihood,
+its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B     Training and Architecture Details
+B.1    Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally con-
+catenated with puzzle embeddings (for ARC√ [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by D and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. In addition, as an exception, we use
+[SwiGLU + SwiGLU] network for the Recursive Core module instead of [Attention + SwiGLU] for
+Sudoku tasks, following [9]. For initialization of z0 = (h0 , l0 ), we sample once from the standard
+Gaussian distribution N (0, I), then save the value within the network checkpoint and load it again,
+meaning the initialized z0 has a fixed value.
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
+
+
+                                                                    16
+                                       Table 4: Architecture components.
+                  Component         Module                   Description
+                    Encoder
+                                    Token Embedding          vocab → D
+                                    Puzzle Embedding         16 tokens (optional, for ARC)
+                                    Position Encoding        RoPE or learned
+                    Recursive Core
+                                 f L , fH                    [Attention + SwiGLU] × 2 layers
+                                 Iterations                  K low-level, T high-level steps
+                                 µθ , σ θ , µ ϕ , σ ϕ        SwiGLU MLP for each parameter
+                    Decoder
+                                    LM Head                  Linear(D → vocab)
+                                    Q Head                   Linear(D → 2) for halt
+                                    V Head                   Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0 ∼ 255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+                        Stage           Layer / Operation              Output Dim.
+                                        Input Tokens                   (B, C, H, W )
+                        1. Norm.
+                                        Linear Scaling [−1, 1]         (B, C, H, W )
+                                        Conv2d 5 × 5 (p = 2)
+                                                                      (B, D/2, H, W )
+                                        SiLU → GN(32)
+                        2. Conv
+                                        Conv2d 5 × 5 (p = 2)
+                                                                      (B, D/2, H, W )
+                                        SiLU → GN(32)
+                                        Flatten Patches               (B, Np , P 2 · D
+                                                                                     2)
+                        3. Patch
+                                        Linear Projection               (B, Np , D)
+
+
+Hyperparameters. Following Wang et al. [8], Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, attention heads Nhead =8 , and FFN hidden dimension
+Dh =512. Within a recursion step, meaning a latent transition zt → zt+1 , we use low-level (inner)
+steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2   Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (Summarized
+in Table 6).
+
+
+                                                        17
+        • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+          (0=pad, 1=blank, 2–10=digits).
+        • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+          markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+          specific puzzle embeddings are prepended to distinguish different ARC tasks.
+        • N-Queens flattens an N × N board row-by-row into N 2 tokens with vocabulary size 3
+          (0=pad, 1=empty, 2=queen).
+        • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+          using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+        • For image generation on MNIST [15], images are quantized and processed via CNN-based
+          patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. Then,
+          patched input forms 14 × 14 flattened sequence tokens with vocabulary size 3 (0=pad,
+          1=black, 2=white).
+
+
+                                  Table 6: Task-specific configurations.
+           Task             Seq. Len    Vocab     Puzzle Emb                 Encoding
+           Sudoku                81       11              ✗             9×9 grid, row-major
+           ARC-AGI              900       12              ✓             30×30 padded canvas
+           N-Queens             N2        3               ✗                 N × N board
+                             n(n−1)
+           Graph Coloring       2
+                                          6               ✗        Strict adjacency upper triangle
+           MNIST                196        3              ✗                14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10−4 , weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+                      Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+                         Task                             Epochs   GPUs     Time
+                         Sudoku                            50K      8        2h
+                         ARC-AGI                          200K      8      5 days
+                         N-Queens (8×8)                     3K      8        1h
+                         N-Queens (10×10)                   1K      8        3h
+                         Graph Coloring (8 nodes)           5K      8       1.5h
+                         Graph Coloring (10 nodes)          5K      8        6h
+                         MNIST                            1.8K      8       16h
+
+
+
+C     Additional Details of Experiment Setup
+C.1     Challenging Puzzle Tasks
+
+C.1.1    Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+
+
+                                                     18
+TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2     Multi-solution Puzzle Tasks
+
+C.2.1    N-Queens Problem
+
+                          Input               Solution 1                Solution 2               Solution 3
+
+
+
+
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must find the positions of the remaining queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}L , where L = N 2 , along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+                                     8x8 N-Queens                                     10x10 N-Queens
+                   3000                                                20000
+                                                                       15000
+          Counts
+
+
+
+
+                                                              Counts
+
+
+
+
+                   2000
+                                                                       10000
+                   1000
+                                                                        5000
+                      0                                                    0
+                             3      6    9   12     15   18                    0      20    40       60       80
+                                  Number of solutions                                Number of solutions
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2    Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+
+
+                                                              19
+with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
+
+                      Input             Solution 1             Solution 2              Solution 3        Solution 4
+
+
+
+
+                                             Figure 11: Graph Coloring Example
+
+
+                              Vertex 8 Graph Coloring                                  Vertex 10 Graph Coloring
+                                                                            2000
+                   1500
+                                                                            1500
+          Counts
+
+
+
+
+                                                                   Counts
+
+
+
+
+                   1000
+                                                                            1000
+                    500                                                      500
+                      0                                                        0
+                              3     6    9     12    15   18                       0      20        40   60    80
+                                  Number of solutions                                    Number of solutions
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D     Additional Experiment Results
+D.1   Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+
+
+                                                                 20
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+                         Method                   #Params Sudoku ARC-1 ARC-2
+                         Large Reasoning Models
+                           Deepseek-R1             671B    0.0    15.8    1.3
+                           Claude 3.7 16k          N/A     0.0    28.6    0.7
+                           o3-mini-high            N/A     0.0    34.5    3.0
+                           GPT 5.2 (low)           N/A      –     55.7    9.7
+                           Grok-4-thinking         1.7T     –     66.7   16.0
+                           Gemini 3 Pro            N/A      –     75.0   31.1
+                         Recursive Models
+                           Direct Pred             27M      0.0   21.0   0.0
+                           Looped TF                7M     61.3    -      -
+                           HRM                     27M     55.0   40.3   5.0
+                           TRM                      7M     87.4   44.6   7.8
+                           GRAM (Ours)             10M     97.0   52.0   11.1
+                         Human Results
+                           Avg. Human                –      –     60.2     –
+                           Best Human                –      –     98.0   100.0
+
+
+
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+D.2   Scales with Parallel Sampling on ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+
+                                                    21
+                                       0.450
+                                       0.425
+                                       0.400
+                                       0.375
+
+
+
+
+                            Accuracy
+                                       0.350
+                                       0.325
+                                       0.300                                                       TRM
+                                                                                                   GRAM ( =0.1)
+                                       0.275                                                       GRAM ( =0.05)
+                                                                                                   GRAM ( =0.04)
+                                       0.250   1       2       5      10        20   50    100         250     500
+                                                                   Number of Samples (N)
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+                          0.500
+                          0.475
+                          0.450
+               Accuracy
+
+
+
+
+                          0.425
+                          0.400
+                          0.375                                                                                      Aug=0
+                                                                                                                     Aug=5
+                          0.350                                                                                      Aug=10
+                                                                                                                     Aug=50
+                          0.325
+                                         1         2       5         10         20    50         100         250       500
+                                                                   Number of Samples (N)
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish—at Aug= 50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3   Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
+
+
+D.4   Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+
+                                                                           22
+           1.0                                                 HRM                       1.0                                            HRM
+                                                               TRM                                                                      TRM
+                                                               GRAM (N=1)                                                               GRAM (N=1)
+                                                               GRAM (N=5)                                                               GRAM (N=5)
+           0.8                                                 GRAM (N=10)               0.8                                            GRAM (N=10)
+                                                               GRAM (N=20)                                                              GRAM (N=20)
+
+
+           0.6                                                                           0.6
+Coverage
+
+
+
+
+                                                                              Coverage
+           0.4                                                                           0.4
+
+
+           0.2                                                                           0.2
+
+
+           0.0                                                                           0.0
+                  2    4     6       8         10   12    14   16     18                       0     15      30      45     60     75        90
+                                 Number of Solutions                                                         Number of Solutions
+                           (a) N-Queens 8 × 8                                                             (b) N-Queens 10 × 10
+
+ Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+ space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+ increases.
+
+
+
+ As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+ from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+ A particularly compelling property of this process is the model’s ability to recover from initially
+ ambiguous or incorrect formations.
+ For instance, in the second row (generating the digit ’2’) and the last row (generating the digit ’1’),
+ the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+ as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+ resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+
+
+
+                 t=0       t=1           t=2        t=4        t=6           t=7               t=9    t=11        t=12    t=14     t=16
+ Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated uncondi-
+ tionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its
+ recursive refinement process.
+
+
+
+
+                                                                             23
+D.5   Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10−4 , EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+                         Method                     #Params            Steps       Validity(%)
+                         D3PM-Uniform (Big)         55.1M              1000           91.33
+                         D3PM-Uniform (Small)       15.9M              1000           29.24
+                         D3PM-Absorb (Big)          55.1M              1000           79.18
+                         D3PM-Absorb (Small)        15.9M              1000           21.88
+                         GRAM (Ours)                10.9M               16            99.05
+
+
+           Empty input                                  Valid sample                              Invalid sample
+                                            3   6   5    4   7   8     9   2   1        4     3   8   2   9   1   7   6   5
+                                            9   4   1    2   5   6     8   7   3        5     2   1   7   3   6   4   9   8
+                                            2   7   8    9   1   3     6   5   4        7     6   9   4   5   8   1   2   3
+                                            5   2   9    8   4   7     3   1   6        9     1   3   4   4   7   5   8   6
+                                            7   3   6    1   9   2     5   4   8        2     5   4   6   8   9   3   1   7
+                                            1   8   4    3   6   5     2   9   7        6     8   7   5   1   3   2   4   9
+                                            8   9   7    5   3   4     1   6   2        3     7   2   9   6   4   8   5   1
+                                            4   1   3    6   2   9     7   8   5        1     9   5   3   2   8   6   7   4
+                                            6   5   2    7   8   1     4   3   9        8     4   6   1   7   5   9   3   2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+
+
+                                                        24
+D.6   Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+                                                      25
+Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (right bottom), others successfully reach
+the global optimum (left middle), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+                                                      26
+Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+ Asset                 Use in this paper           License / terms     Source link
+ MNIST                 Binarized MNIST genera-     Creative Com-       https://keras.io/api/
+                       tion experiments            mons Attribution-   datasets/mnist/
+                                                   Share Alike 3.0
+ ARC-AGI-1 / origi-  ARC-AGI         reasoning     Apache License      https://github.com/fchollet/
+ nal ARC             benchmark                     2.0                 ARC-AGI
+ ARC-AGI-2           ARC-AGI-2 reasoning           Apache License      https://github.com/arcprize/
+                     benchmark / reference         2.0                 ARC-AGI-2
+                     results
+ HRM repository      HRM       baseline    and     Apache License      https://github.com/
+                     Sudoku-Extreme-related        2.0                 sapientinc/HRM
+                     reference implementation
+ TinyRecursiveModels TRM baseline and recur-       MIT License         https://github.com/
+ / TRM repository    sive reasoning reference                          SamsungSAILMontreal/
+                     implementation                                    TinyRecursiveModels
+ MDLM repository     Masked diffusion base-        Apache License      https://github.com/
+                     line reference implemen-      2.0                 kuleshov-group/mdlm
+                     tation, if public code is
+                     used
+ Google Research D3PM image-generation             Apache License      https://github.com/
+ D3PM implementa- baseline reference imple-        2.0                 google-research/
+ tion                mentation, if public code                         google-research/blob/master/
+                     is used                                           d3pm/images/diffusion_
+                                                                       categorical.py
+ Looped       Trans-   Looped Transformer base-    MIT License         https://github.com/Leiay/
+ former repository     line reference implemen-                        looped_transformer
+                       tation, if public code is
+                       used
+ N-Queens              Synthetic multi-solution    Not an external     N/A
+                       constraint satisfaction     asset
+                       task generated by the
+                       authors
+ Graph Coloring        Synthetic multi-solution    Not an external     N/A
+                       constraint satisfaction     asset
+                       task generated by the
+                       authors
+
+
+
+
+                                                    27
+
+\ No newline at end of file