Generative Recursive Reasoning Junyeob Baek1†∗ Mingyu Jo1†∗ Minsu Kim1,2 Mengye Ren3 Yoshua Bengio2,4 Sungjin Ahn1,3† arXiv:2605.19376v2 [cs.AI] 20 May 2026 1 KAIST 2 Mila – Québec AI Institute 3 New York University 4 Université de Montréal Abstract How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres- sive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Gen- erative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna- tive solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs, unconditional generation via pθ (x). Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website 1 Introduction A central question for future neural reasoning systems is how extended computation should be imple- mented. Large autoregressive models typically scale reasoning by extending a sequence-generation process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im- plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing because it decouples reasoning depth from both parameter scale and output length: a compact model can perform many steps of internal computation by repeatedly applying shared transition functions over time. Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the potential of this approach in structured reasoning. Rather than producing a solution in a single feedforward pass, they perform extended computation through iterative latent-state refinement, deep supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical latent dynamics. These features make them well suited to problems requiring constraint propagation, state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle ∗ Equal contribution † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn (sungjin.ahn@kaist.ac.kr) Preprint. Solution 1, Input Task, Solution 2, (a) Deterministic RRMs (b) GRAM (Ours) Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions. Right: Given three independent runs for latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g. HRM, TRM) are deterministic—all runs collapse to an identical trajectory, converging to a single solution and failing to explore alternatives, while (b) GRAM explores diverse trajectories, producing diverse trajectories that reach multiple valid solutions y1 and y2 , while naturally enabling parallel inference-time scaling. also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational depth without increasing parameter count. Together, these models suggest that reasoning capability can emerge not only from scaling model size or generating longer traces, but also from the organization of computation itself. While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning system should also be able to maintain uncertainty, consider alternative hypotheses, and explore multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity or multiple valid solutions are intrinsic, and more generally in problems where a single refinement path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and exploring multiple latent trajectories in parallel. Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and initialization, they follow a single latent trajectory and converge to a single prediction. This deter- ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This motivates the central question of our work: can recursive latent computation support probabilistic, generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models? In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition conditioned on the input and the current reasoning state, rather than deterministically updating to a single next state. Repeating this process defines a distribution over possible reasoning trajectories, allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale inference not only by increasing recursive depth but also by sampling trajectories in parallel. From a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by marginalizing over latent reasoning trajectories, while the same recursive process can also define an unconditional generative model pθ (x) when the input is fixed or absent. We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar- chitectural properties targeted by our formulation: recursive refinement, stochastic exploration, multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines, including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku- Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized MNIST [15] probes the unconditional generative interpretation. Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle for future recurrent and recursive reasoning architectures. Concretely, we make three contributions. 2 First, we formulate recursive reasoning as a latent-variable generative process, where solutions are obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based inference-time scaling, enabling inference to scale not only with recursive depth but also with the number of sampled latent trajectories. Third, we provide empirical evidence that this formulation yields the intended architectural advantages over deterministic recurrent and recursive baselines, improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation. 2 Generative Recursive Reasoning Models In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training procedure in Section 2.2, with an architecture schematic shown in Figure 2. 2.1 Architecture Overview. GRAM models the conditional distri- 𝑦 (A) CE loss bution pθ (y | x) by marginalizing over stochas- Prior (B) KL Div. Posterior Decoder tic latent reasoning trajectories. Given an input 𝑝𝜃 (⋅ |𝑢𝑡 ) 𝑞𝜙 (⋅ |𝑢𝑡 , 𝑦) ~ 𝜖𝑡 𝑓dec x, GRAM first computes an embedding ℎ𝑡−1 𝑓𝐻 𝑢𝑡 ℎ𝑡 ex = fenc (x; θ), (1) 𝐾 times which is reused throughout the entire recursive 𝑙𝑡−1 𝑓𝐿 𝑓𝐿 𝑙𝑡 computation. Starting from a fixed initial la- Encoder tent state z0 , the model evolves the latent state 𝑥 𝑓enc through learned stochastic transitions. The re- cursive computation is organized into two nested Figure 2: GRAM Architecture. A single stochastic levels: inner and outer loops. latent transition in the hierarchical instantiation z = At the inner level, a latent transition samples (h, l). After K low-level refinements via fL , the high- level update fH produces a deterministic proposal ut , to a new latent state conditioned on the previous which stochastic guidance ϵt is added: ht = ut + ϵt . latent state and the input embedding, zt ∼ pθ (zt | zt−1 , ex ), t = 1, . . . , T. (2) At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A supervision step is the unit at which the decoder is invoked, and the training objective is applied, with gradients computed as described in Section 2.2. At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision step serving as the initial state of the next, thereby forming the full recursive computation: (1) T transitions (1) (2) T transitions T transitions (N ) z0 −−−−−−−→ zT = z0 −−−−−−−→ · · · −−−−−−−→ zT sup , (3) (n) (1) where zt denotes the latent state at the t-th transition of the n-th supervision step, z0 is the fixed initial state, and the terminal state of one supervision step serves as the initial state of the next (n+1) (n) (z0 := zT ). This abstract formulation can be instantiated with various recurrent Transformer backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7], as well as hierarchical designs such as HRM and TRM [8, 9]. Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt | zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual perturbation around a deterministic update: at each transition, the model first computes a deterministic update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian, and adds it to ut : ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut )I ,  (4) zt = ut + ϵt . (5) 3 We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex- ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima, and support robust exploration of the solution space without discarding the deterministic refinement performed by ut . Hierarchical Instantiation. We instantiate the latent state with two interacting components, z = (h, l). The high-level component h is updated once per latent transition and carries abstract reasoning state, while the low-level component l is updated K times within a single transition and carries fine-grained intermediate computation. This decomposition separates the two roles across time scales, with h accumulating slowly across transitions and l refined rapidly within each one. With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows. The low-level component is first refined for K updates, with the high-level component held fixed: lt,k = fL (ht−1 , lt,k−1 , ex ; θ), k = 1, . . . , K, (6) where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level component is then updated as a stochastic transition conditioned on the refined lt , ut = fH (ht−1 , lt ; θ), (7) ϵt ∼ pθ (ϵt | ut ) := N µθ (ut ), σθ2 (ut ) I  , (8) ht = ut + ϵt , (9) and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low- level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower, more abstract component of the latent state, where it can steer the overall reasoning trajectory across transitions3 . Under this instantiation, the decoder reads only the high-level component, i.e., fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1. Modeling Unconditional Distribution. While the description so far focuses on the conditional setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for generation tasks in Section 4.3. 2.2 Training GRAM is trained to model the conditional distribution pθ (y | x), where each training example consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with respect to the generative parameters θ and variational parameters ϕ. Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where the full latent trajectory τ = (z0 → · · · → zTTotal ) consists of a sequence of latent variables, with TTotal = T × Nsup . The conditional likelihood is defined as Z pθ (y | x) = pθ (y | τ, x) pθ (τ | x) dτ, (10) where x denotes the input problem and y denotes the corresponding ground-truth output. Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the evidence lower bound (ELBO), jointly training θ and ϕ via variational inference: log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) . (11) During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has access to both the input problem x and the target output y. At inference time, where y is unavailable, trajectories are instead generated from the learned prior pθ (· | x). 3We also tried injecting noise into the low-level state, but found that it did not improve performance. 4 Both the prior and the posterior are modeled as conditional Markov processes over latent states: TY Total TY Total pθ (τ | x) = p(z0 ) pθ (zt | zt−1 , x), qϕ (τ | x, y) = p(z0 ) qϕ (zt | zt−1 , x, y). (12) t=1 t=1 Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y), whereas the prior uses pθ (ϵt | ut ). Since the two processes share the same Markov structure and all stochasticity is introduced through ϵ1:TTotal , their trajectory distributions can be equivalently represented in noise space. Moreover, since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies pθ (y | τ, x) = pθ (y | zTTotal , x). Therefore, the full trajectory-level ELBO can be written as   TX Total h i LELBO = Eqϕ log pθ (y | zTTotal , x) − Eqϕ (ϵ