| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-29 12:15:51 -0500 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-06-29 12:15:51 -0500 |
| commit | a6ec4288a2232988b130b2f00bb2565f81706966 (patch) | |
| tree | 1bb86e7f0b899b823b9e7fdf383e832d30a181e0 /papers | |
Recursive reasoning dynamics: analysis pipeline, paper drafts, toy models
Failure=more-chaotic (task-general under validity labeling) reduces to convergence/completeness
detection; mechanism (transient chaos vs multistability vs input-induced) under investigation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'papers')
| -rw-r--r-- | papers/txt/engelken2023_gradient_flossing.txt | 1879 | ||||
| -rw-r--r-- | papers/txt/gram2025_generative_recursive.txt | 1532 | ||||
| -rw-r--r-- | papers/txt/gram2026_generative_recursive.txt | 1532 | ||||
| -rw-r--r-- | papers/txt/hrm2025_hierarchical_reasoning.txt | 1302 | ||||
| -rw-r--r-- | papers/txt/ptrm2025_probabilistic_trm.txt | 906 | ||||
| -rw-r--r-- | papers/txt/trm2025_tiny_recursive.txt | 796 |
6 files changed, 7947 insertions, 0 deletions
diff --git a/papers/txt/engelken2023_gradient_flossing.txt b/papers/txt/engelken2023_gradient_flossing.txt
new file mode 100644
index 0000000..6d70608
--- /dev/null
+++ b/papers/txt/engelken2023_gradient_flossing.txt
@@ -0,0 +1,1879 @@
+Gradient Flossing: Improving Gradient Descent
+through Dynamic Control of Jacobians
+
+Rainer Engelken
+Zuckerman Mind Brain Behavior Institute
+Columbia University
+New York, USA
+re2365@columbia.edu
+
+arXiv:2312.17306v1 [cs.LG] 28 Dec 2023
+
+Abstract
+Training recurrent neural networks (RNNs) remains a challenge due to the instability of gradients
+across long time horizons, which can lead to exploding and vanishing gradients. Recent research
+has linked these problems to the values of Lyapunov exponents for the forward dynamics, which
+describe the growth or shrinkage of infinitesimal perturbations. Here, we propose gradient
+flossing, a novel approach to tackling gradient instability by pushing Lyapunov exponents of the
+forward dynamics toward zero during learning. We achieve this by regularizing Lyapunov exponents
+through backpropagation using differentiable linear algebra. This enables us to "floss" the
+gradients, stabilizing them and thus improving network training. We demonstrate that gradient
+flossing controls not only the gradient norm but also the condition number of the long-term
+Jacobian, facilitating multidimensional error feedback propagation. We find that applying gradient
+flossing prior to training enhances both the success rate and convergence speed for tasks involving
+long time horizons. For challenging tasks, we show that gradient flossing during training can
+further increase the time horizon that can be bridged by backpropagation through time. Moreover,
+we demonstrate the effectiveness of our approach on various RNN architectures and tasks of
+variable temporal complexity.
+Additionally, we provide a simple implementation of our gradient flossing algorithm that can be
+used in practice. Our results indicate that gradient flossing via regularizing Lyapunov exponents
+can significantly enhance the effectiveness of RNN training and mitigate the exploding and
+vanishing gradients problem.
+
+37th Conference on Neural Information Processing Systems (NeurIPS 2023).
+
+1 Introduction
+Recurrent neural networks are commonly used both in machine learning and computational
+neuroscience for tasks that involve input-to-output mappings over sequences and dynamic
+trajectories. Training is often achieved through gradient descent by the backpropagation of error
+information across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time
+and recursively applying the chain rule to calculate the gradient of the loss with respect to the
+network parameters. Mathematically, evaluating the product of Jacobians of the recurrent state
+update describes how error signals travel across time steps. When trained on tasks that have
+long-range temporal dependencies, recurrent neural networks are prone to exploding and vanishing
+gradients [5, 6, 7, 8]. These arise from the exponential amplification or attenuation of recursive
+derivatives of recurrent network states over many time steps. Intuitively, to evaluate how an
+output error depends on a small parameter change at a much earlier point in time, the error
+information has to be propagated through the recurrent network states iteratively. Mathematically,
+this corresponds to a product of Jacobians that describe how changes in one recurrent network
+state depend on changes in the previous network state. Together, this product forms the long-term
+Jacobian. The singular value spectrum of the long-term Jacobian regulates how well error signals
+can propagate backwards along multiple time steps, allowing temporal credit assignment.
+A close mathematical correspondence of these singular values and the Lyapunov exponents of the
+forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents characterize the
+asymptotic average rate of exponential divergence or convergence of nearby initial conditions and
+are a cornerstone of dynamical systems theory [13, 14]. We will use this link to improve the
+trainability of RNNs.
+Previous approaches that tackled the problem of exploding or vanishing gradients have suggested
+solutions at different levels. First, specialized units such as LSTM and GRU were introduced,
+which have additional latent variables that can be decoupled from the recurrent network states via
+multiplicative (gating) interactions. The gating interactions shield the latent memory state, which
+can therefore transport information across multiple time steps [5, 6, 15]. Second, exploding
+gradients can be avoided by gradient clipping, which re-scales the gradient norm [16] or their
+individual elements [17] if they become too large [18]. Third, normalization schemes like batch
+normalization prevent saturated nonlinearities that contribute to vanishing gradients [19]. Fourth,
+it was suggested that the problem of exploding/vanishing gradients can be ameliorated by
+specialized network architectures, for example, antisymmetric networks [20], orthogonal/unitary
+initializations [21, 22, 23], coupled oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent
+units [26], echo state networks [27, 28], (recurrent) highway networks [29, 30], and stable limit
+cycle neural networks [11, 31, 32]. Fifth, for large networks, a suitable choice of weights can
+guarantee a well-conditioned Jacobian at initialization [21, 33, 34, 35, 36, 37, 38, 39, 40, 41].
+These initializations are based on mean-field methods, which become exact only in the
+large-network limit. Such initialization schemes have also been suggested for gated networks [40].
However, even when initializing the network with well-behaved gradients, +gradients will typically not retain their stability during training once the network parameters have +changed. +Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a +technique that keeps gradients well-behaved throughout training. Gradient flossing is based on +a recently described link between the gradients of backpropagation through time and Lyapunov +exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian +[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close +to zero during training. This improves not only the error gradient norm but also the condition number +of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons. +We first demonstrate that the Lyapunov exponents can be controlled during training by including an +additional loss term. We then demonstrate that gradient flossing improves the gradient norm and +effective dimension of the gradient signal. We find empirically that gradient flossing improves test +accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally, +we find that gradient flossing during training further helps to bridge long-time horizons and show +that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as +dynamic mean-field theory for initialization, orthogonal initialization and gated units. +Our contributions include: + + • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in + recurrent neural networks based on regularization of Lyapunov exponents. + + • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov + exponents. + + • Empirical evidence that gradient flossing improves training on tasks that involve bridging + long time horizons. 
+
+2 RNN Gradients and Lyapunov Exponents
+
+We begin by revisiting the established mathematical relationship between the gradients of the loss
+function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how it
+relates to the problem of vanishing and exploding gradients. In backpropagation through time,
+network parameters $\theta$ are iteratively updated by stochastic gradient descent such that a loss
+$L_t$ is locally reduced [1, 2, 3, 4]. For RNN dynamics $h_{s+1} = f_\theta(h_s, x_{s+1})$, with
+recurrent network state $h$, external input $x$, and parameters $\theta$, the gradient of the loss
+$L_t$ with respect to $\theta$ is evaluated by unrolling the network dynamics in time. The
+resulting expression for the gradient is given by:
+
+$$\frac{\partial L_t}{\partial \theta} = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial \theta} = \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau) \frac{\partial h_\tau}{\partial \theta} \tag{1}$$
+
+where $T_t(h_\tau)$ is composed of a product of one-step Jacobians
+$D_s = \frac{\partial h_{s+1}}{\partial h_s}$:
+
+$$T_t(h_\tau) = \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} = \prod_{\tau'=\tau}^{t-1} D_{\tau'} \tag{2}$$
+
+Due to the chain of matrix multiplications in $T_t$, the gradients tend to vanish or explode
+exponentially with time. This complicates training, particularly when the task loss at time $t$
+depends on inputs $x$ or states $h$ from many time steps prior, which creates long temporal
+dependencies [5, 6, 7, 8]. How well error signals can propagate back in time is constrained by the
+tangent space dynamics along the trajectory $h_t$, which dictate how local perturbations around
+each point on the trajectory stretch, rotate, shear, or compress as the system evolves.
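To make the role of the Jacobian product in Eq 2 concrete, the following minimal sketch multiplies a list of one-step Jacobians in the order D_{t-1} ... D_tau and reports the Frobenius norm of the result. The helper names (`matmul`, `long_term_jacobian`, `frobenius_norm`) and the constant toy Jacobians are illustrative assumptions, not part of the paper's code; plain Python lists are used only to keep the example free of dependencies.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def long_term_jacobian(jacobians):
    """Return T = D_{t-1} ... D_tau for jacobians = [D_tau, ..., D_{t-1}] (Eq 2)."""
    n = len(jacobians[0])
    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for D in jacobians:
        T = matmul(D, T)  # later Jacobians multiply from the left
    return T

def frobenius_norm(A):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

# Toy assumption: the same one-step Jacobian at every time step.
contracting = [[0.5, 0.0], [0.0, 0.5]]
expanding = [[2.0, 0.0], [0.0, 2.0]]

for steps in (10, 50):
    shrink = frobenius_norm(long_term_jacobian([contracting] * steps))
    grow = frobenius_norm(long_term_jacobian([expanding] * steps))
    print(f"{steps} steps: contracting norm {shrink:.3e}, expanding norm {grow:.3e}")
```

In this toy setting the norm shrinks or grows geometrically with the number of steps, which is the vanishing and exploding behaviour described above. In a trained RNN the Jacobians differ from step to step, so the growth rate has to be measured along the trajectory.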
+The singular values of the Jacobian product $T_t$, which determine how quickly gradients vanish or
+explode during backpropagation through time, are directly related to the Lyapunov exponents of the
+forward dynamics [9, 12]: Lyapunov exponents $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ are
+defined as the asymptotic time-averaged logarithms of the singular values of the long-term
+Jacobian [13, 42, 43]
+
+$$\lambda_i = \lim_{t \to \infty} \frac{1}{t - \tau} \log(\sigma_{i,t}) \tag{3}$$
+
+where $\sigma_{i,t}$ denotes the $i$th singular value of $T_t(h_\tau)$ with
+$\sigma_{1,t} \ge \sigma_{2,t} \ge \cdots \ge \sigma_{N,t}$ (see Appendix I for details). This
+means that positive Lyapunov exponents in the forward dynamics correspond to exponentially
+exploding gradient modes, while negative Lyapunov exponents in the forward dynamics correspond to
+exponentially vanishing gradient modes.
+In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of
+infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the
+signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in
+the forward dynamics correspond to tangent space directions along which error signals are neither
+drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes
+in the tangent dynamics can propagate information reliably across many time steps.
+
+3 Gradient Flossing: Idea and Algorithm
+We now leverage the mathematical connection established between Lyapunov exponents and the
+prevalent issue of exploding and vanishing gradients for regularizing the singular values of the
+long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing
+gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the
+corresponding directions in tangent space grow and shrink on average only slowly. This leads to a
+better-conditioned long-term Jacobian $T_t(h_\tau)$.
+We achieve this by using the sum of the squares of the $k$ largest Lyapunov exponents
+$\lambda_1, \lambda_2, \ldots, \lambda_k$ as a loss function:
+
+$$L_{\text{flossing}} = \sum_{i=1}^{k} \lambda_i^2 \tag{4}$$
+
+and evaluate the gradient obtained from backpropagation through time:
+
+$$\frac{\partial L_{\text{flossing}}}{\partial \theta} = \sum_{i=1}^{k} \frac{\partial \lambda_i^2}{\partial \theta} \tag{5}$$
+
+This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its
+own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov
+exponents by directly evaluating the long-term Jacobian $T_t$ (Eq 2), we use an established
+iterative reorthonormalization method involving QR decomposition that avoids directly evaluating
+the ill-conditioned long-term Jacobian [12, 44].
+First, we evolve an initially orthonormal system $Q_s = [q_1^s, q_2^s, \ldots, q_k^s]$ in the
+tangent space along the trajectory using the Jacobian $D_s = \frac{\partial h_{s+1}}{\partial h_s}$.
+This means calculating
+
+$$\tilde{Q}_{s+1} = D_s Q_s \tag{6}$$
+
+at every time step. Second, we extract the exponential growth rates using the QR decomposition,
+
+$$\tilde{Q}_{s+1} = Q_{s+1} R_{s+1},$$
+
+which decomposes $\tilde{Q}_{s+1}$ uniquely into the product of an orthonormal matrix $Q_{s+1}$ of
+size $N \times k$, so that $Q_{s+1}^{\top} Q_{s+1} = \mathbb{1}_{k \times k}$, and an upper
+triangular matrix $R_{s+1}$ of size $k \times k$ with positive diagonal elements. Note that the QR
+decomposition does not have to be applied at every step, just sufficiently often, i.e., once every
+$t_{\text{ONS}}$ steps such that $\tilde{Q}$ does not become ill-conditioned.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of $R^s$
+[43, 44]:
+
+$$\lambda_i = \lim_{t \to \infty} \frac{1}{t} \log \prod_{s=1}^{t} R^s_{ii} = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \log R^s_{ii}. \tag{7}$$
+
+This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal
+elements of the $R^s$-matrix of a QR decomposition of the iterated Jacobian.
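The reorthonormalization procedure of Eqs 6 and 7, together with the loss of Eq 4, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a hand-written modified Gram-Schmidt routine stands in for a library QR decomposition, reorthonormalization happens at every step (t_ONS = 1), the time average is taken over a finite number of steps, and the constant toy Jacobian is hypothetical. None of the function names come from the paper.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def qr_gram_schmidt(A):
    """Thin QR of an n x k matrix via modified Gram-Schmidt.

    The diagonal of R is positive whenever A has full column rank.
    """
    n, k = len(A), len(A[0])
    q_cols = []
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = [A[i][j] for i in range(n)]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qa * va for qa, va in zip(q, v))
            v = [va - R[i][j] * qa for qa, va in zip(q, v)]
        R[j][j] = math.sqrt(sum(va * va for va in v))
        q_cols.append([va / R[j][j] for va in v])
    Q = [[q_cols[j][i] for j in range(k)] for i in range(n)]
    return Q, R

def lyapunov_exponents(jacobians, k):
    """Finite-time estimate of the k largest Lyapunov exponents (Eqs 6 and 7)."""
    n = len(jacobians[0])
    Q = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(n)]
    log_sums = [0.0] * k
    for D in jacobians:
        Q, R = qr_gram_schmidt(matmul(D, Q))  # evolve the basis, then reorthonormalize
        for i in range(k):
            log_sums[i] += math.log(R[i][i])
    return [s / len(jacobians) for s in log_sums]

def flossing_loss(exponents):
    """Sum of squared Lyapunov exponents (Eq 4)."""
    return sum(lam * lam for lam in exponents)

# Hypothetical constant Jacobian with eigenvalues 2 and 0.5.
D = [[2.0, 1.0], [0.0, 0.5]]
lams = lyapunov_exponents([D] * 50, k=2)
print(lams, flossing_loss(lams))
```

For this upper triangular toy matrix the two estimated exponents are log 2 and log 0.5, the logarithms of its diagonal entries. Working with per-step logarithms of the diagonal of R, instead of the product of Jacobians itself, is what keeps the computation well-conditioned over long horizons.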
+To propagate the gradient of the square of the Lyapunov exponents backward through time in
+gradient flossing, we used an analytical expression for the pullback of the QR decomposition [45].
+The backward pass of the QR decomposition is given by [45, 46, 47, 48]
+
+$$\bar{\tilde{Q}} = \left[ \bar{Q} + Q \, \mathrm{copyltu}(M) \right] R^{-T}, \tag{8}$$
+
+where $M = R \bar{R}^{T} - \bar{Q}^{T} Q$ and the copyltu function generates a symmetric matrix by
+copying the lower triangle of the input matrix to its upper triangle, with the element
+$[\mathrm{copyltu}(M)]_{ij} = M_{\max(i,j),\min(i,j)}$ [45, 46, 47, 48]. Here, a bar denotes an
+adjoint variable, $\bar{T} = \partial L / \partial T$. A simple implementation of this algorithm in
+pseudocode is:
+
+Algorithm 1 Algorithm for gradient flossing of k tangent space directions
+  initialize h, Q
+  for e = 1 → E do
+    γ_i ← 0
+    for t = 1 → T do
+      h ← f_θ(h, x)
+      D ← dh_t / dh_{t−1}
+      Q ← D · Q
+      if t ≡ 0 (mod t_ONS) then
+        Q, R ← qr(Q)
+        γ_i += log(R_ii)
+      end if
+    end for
+    λ_i = γ_i / T
+    θ_{e+1} ← θ_e − η ∂L_flossing/∂θ
+  end for
+
+For clarity, we described gradient flossing in terms of stochastic gradient descent, but we
+actually implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2. An
+example implementation is available here. Note that this algorithm also works for different
+recurrent network architectures. In this case, the Jacobian D has size n × n, where n is the
+number of dynamic variables of the recurrent network model. For example, in the case of a single
+recurrent network of N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix
+D can either be calculated analytically or obtained via automatic differentiation.
+
+4 Gradient Flossing: Control of Lyapunov Exponents
+In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a
+target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with
+different weight variances. The N units of the recurrent neural network follow the dynamics
+
+$$h_{s+1} = f(h_s, x_{s+1}) = W \phi(h_s) + V x_{s+1}. \tag{9}$$
+
+[Figure 1 panels: A) schematic of the long-term Jacobian; B) first Lyapunov exponent λ1 vs. epochs
+(target and actual); C) flossing loss vs. epochs for k = 1, 16, 32 flossed exponents; D) Lyapunov
+exponents λi vs. i/N for k = 1, 16, 32.]
+Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation
+A) Exploding and vanishing gradients in backpropagation through time arise from
+amplification/attenuation of the product of Jacobians that form the long-term Jacobian
+$T_t(h_\tau) = \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}}$.
+B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the mean
+squared error between estimated first Lyapunov exponent and target Lyapunov exponent
+λ1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent
+weights $W_{ij} \sim N(0, g^2/N)$ where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing
+minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla RNN
+after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note the
+variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32 with
+10 network realizations. Error bars in C indicate the 25% and 75% percentiles and solid line shows
+median.
+
+The initial entries of W are drawn independently from a Gaussian distribution with zero mean and
+variance $g^2/N$, where g is a gain parameter that controls the heterogeneity of weights. We here
+use the transfer function ϕ(x) = tanh(x) (see appendix B for gradient flossing with ReLU and LSTM
+units). The input $x_s$ is a stream of i.i.d. Gaussian values, $x_s \sim N(0, 1)$, and the entries
+of the input weight V are drawn from N(0, 1). Both W and V are trained during gradient flossing.
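The network of Eq 9 and its one-step Jacobian can be written down directly. The sketch below is only an illustration of the stated setup: it assumes a scalar input stream (so V is a vector), uses Python's `random` module for the Gaussian draws, and the function names `init_rnn`, `step`, and `jacobian` are hypothetical.

```python
import math
import random

def init_rnn(n, g, seed=0):
    """Draw W_ij ~ N(0, g^2 / n) and V_i ~ N(0, 1)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, g / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
    V = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return W, V

def step(W, V, h, x):
    """One update h_{s+1} = W tanh(h_s) + V x_{s+1} for a scalar input x (Eq 9)."""
    phi = [math.tanh(hj) for hj in h]
    return [sum(w * p for w, p in zip(row, phi)) + v * x for row, v in zip(W, V)]

def jacobian(W, h):
    """One-step Jacobian D_s = dh_{s+1}/dh_s = W diag(1 - tanh(h_s)^2)."""
    dphi = [1.0 - math.tanh(hj) ** 2 for hj in h]
    return [[w * d for w, d in zip(row, dphi)] for row in W]

W, V = init_rnn(n=4, g=1.0)
rng = random.Random(1)
h = [0.0] * 4
for _ in range(5):
    h = step(W, V, h, rng.gauss(0.0, 1.0))  # i.i.d. Gaussian input stream
D = jacobian(W, h)
print(len(D), len(D[0]))
```

The analytical Jacobian is the cheaper of the two options mentioned in the text; automatic differentiation would give the same matrix without requiring the derivative of the transfer function by hand.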
+In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by
+gradient flossing to match a desired target value. The networks were initialized with 10 different
+values of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing,
+they quickly approached three different target values of the first Lyapunov exponent,
+$\lambda_1^{\text{target}} \in \{-1, -0.5, 0\}$, within fewer than 100 training epochs with batch
+size B = 1. We note that gradient flossing with a positive target $\lambda_1^{\text{target}}$
+seems not to arrive at a positive Lyapunov exponent $\lambda_1$.
+Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during
+gradient descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in
+gradient flossing (see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has
+1, 16, or 32 Lyapunov exponents close to zero. We conclude that gradient flossing can selectively
+manipulate one, several, or all Lyapunov exponents before or during network training. Gradient
+flossing also works for RNNs of ReLU and LSTM units (see appendix B). Further, we find that the
+computational bottleneck of gradient flossing is the QR decomposition, which has a computational
+complexity of $O(N k^2)$, both in the forward pass and in the backward pass. Thus, gradient
+flossing of the entire Lyapunov spectrum is computationally expensive. However, as we will show,
+not all Lyapunov exponents need to be flossed and only short episodes of gradient flossing are
+sufficient for significantly improving the training performance.
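As a rough illustration of steering a Lyapunov exponent toward a target by gradient descent, the toy below flosses a one-unit network. It departs from the paper's method in ways that should be kept in mind: the network is scalar (so no QR decomposition is needed), plain gradient descent replaces ADAM, and a central finite-difference gradient stands in for backpropagation through the QR decomposition. All names and hyperparameters are assumptions chosen for the example.

```python
import math

def lyapunov_scalar(w, v, xs, h0=0.0):
    """Finite-time Lyapunov exponent of h_{s+1} = w * tanh(h_s) + v * x_{s+1}."""
    h, total = h0, 0.0
    for x in xs:
        total += math.log(abs(w * (1.0 - math.tanh(h) ** 2)))  # log |D_s|
        h = w * math.tanh(h) + v * x
    return total / len(xs)

def floss_scalar(w, v, xs, target=0.0, lr=0.1, epochs=200, eps=1e-5):
    """Gradient descent on (lambda - target)^2 using a finite-difference gradient."""
    def loss(w_):
        return (lyapunov_scalar(w_, v, xs) - target) ** 2

    for _ in range(epochs):
        grad = (loss(w + eps) - loss(w - eps)) / (2.0 * eps)
        w -= lr * grad
    return w

# Undriven toy case: with zero input and h0 = 0 the exponent is exactly log|w|.
xs = [0.0] * 50
w0 = 0.5
w_flossed = floss_scalar(w0, v=1.0, xs=xs, target=0.0)
print(lyapunov_scalar(w0, 1.0, xs), lyapunov_scalar(w_flossed, 1.0, xs))
```

In the undriven case the exponent starts at log 0.5 and is driven toward the target of zero, i.e. w moves toward 1. With a nonzero input stream the same functions apply, but the reachable range of exponents then depends on the input strength, in line with the observation above that positive targets are not attained.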
+
+5 Gradient Flossing: Condition Number of the Long-Term Jacobian
+[Figure 2 panels: A) condition number κ2(Tt) vs. time horizon t, initial and after flossing,
+theory and simulations; B) κ2(Tt) vs. tangent space dimension m, initial and after flossing,
+theory and simulations; C) κ2 (theory) vs. κ2 (numerical) for k = 5, 10, 15.]
+Figure 2: Gradient flossing reduces condition number of the long-term Jacobian A) Condition
+number κ2 of long-term Jacobian $T_t(h_\tau)$ as a function of time horizon t − τ at
+initialization (blue) and after gradient flossing (orange). Direct numerical simulations are done
+with arbitrary precision floating point arithmetic (transparent lines) with 256 bits per float,
+asymptotic theory based on Lyapunov exponents (dashed lines) (Eq 10). B) Condition number for
+different numbers of tangent space dimensions m. Simulations (dots) and Lyapunov exponent based
+theory (dashed lines) at initialization (blue) and after gradient flossing (orange). Gradient
+flossing increases the number of tangent space dimensions available for backpropagation for a
+given condition number (grey dotted line as a guide for the eye at κ2 = 10^5). The first 15
+Lyapunov exponents were flossed. C) Comparison of the condition number obtained via direct
+numerical simulations vs. the Lyapunov exponent-based estimate. Colors denote the number of flossed
+Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80, epochs = 500, T = 500, gradient
+flossing for Ef = 500 epochs. Input xs identical to delayed XOR task in Fig 3D.
+
+A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient
+flossing improves the condition number of the long-term Jacobian which constrains the error signal
+propagation across long time horizons in backpropagation (Fig 2).
+The condition number κ2 of a linear map A measures how close the map is to being singular and is
+given by the ratio of the largest singular value $\sigma_{\max}$ and the smallest singular value
+$\sigma_{\min}$, so $\kappa_2(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$. According to the rule of
+thumb given in [51], if $\kappa_2(A) = 10^p$, one can anticipate losing at least p digits of
+precision when solving the equation Ax = b. Note that the long-term Jacobian $T_t$ is composed of
+a product of Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the
+condition number numerically, we use arbitrary-precision arithmetic with 256 bits per float. We
+find numerically that the condition number of $T_t$ exponentially diverges with the number of time
+steps (Fig 2A). We compare the numerically measured condition number κ2 with an asymptotic
+approximation of the condition number based on Lyapunov exponents that are calculated in the
+forward pass and find a good match (Fig 2A).
+Our theoretical estimate of the condition number κ2 of an orthonormal system Q of size N × m that
+is temporally evolved by the long-term Jacobian $T_t$ is:
+
+$$\kappa_2(\tilde{Q}_{t+\tau}) = \kappa_2\left(T_t(h_\tau) Q_t\right) = \frac{\sigma_1(T_t(h_\tau))}{\sigma_m(T_t(h_\tau))} \approx \exp\left((\lambda_1 - \lambda_m)(t - \tau)\right), \tag{10}$$
+
+where $\sigma_1(T_t(h_\tau))$ and $\sigma_m(T_t(h_\tau))$ are the first and mth singular value of
+the long-term Jacobian. We note that this theoretical estimate of the condition number follows
+from the asymptotic definition of Lyapunov exponents and should be exact in the limit of long
+times. We find that gradient flossing reduces the condition number by a factor whose magnitude
+increases exponentially with time (orange in Fig 2A). Thus, we can expect that gradient flossing
+has a stronger effect on problems with a long time horizon to bridge. We will later confirm this
+numerically.
+
+Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available
+for the backpropagation of error signals.
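The estimate in Eq 10 and the resulting count of usable tangent space dimensions can be evaluated from a Lyapunov spectrum alone. The sketch below is an assumption-laden illustration: the two spectra are invented for the example, the function names are hypothetical, and the computation is done in log10 units to avoid overflow for long horizons.

```python
import math

def log10_condition_number(lyap, m, horizon):
    """log10 of exp((lambda_1 - lambda_m) * horizon), the estimate in Eq 10."""
    return (lyap[0] - lyap[m - 1]) * horizon / math.log(10.0)

def usable_dimensions(lyap, horizon, max_log10_kappa):
    """Largest m whose estimated condition number stays at or below 10**max_log10_kappa."""
    m = 0
    while m < len(lyap) and log10_condition_number(lyap, m + 1, horizon) <= max_log10_kappa:
        m += 1
    return m

# Hypothetical Lyapunov spectra, sorted in descending order.
initial = [0.0, -0.1, -0.5, -1.0]
flossed = [0.0, -0.01, -0.02, -1.0]
for name, spectrum in (("initial", initial), ("flossed", flossed)):
    print(name, usable_dimensions(spectrum, horizon=100, max_log10_kappa=5.0))
```

With a horizon of 100 steps and an acceptable condition number of 10^5, the invented initial spectrum admits 2 dimensions and the invented flossed spectrum admits 3, because flossing narrows the gap between the leading exponents.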
+Generally, the long-term Jacobian is ill-conditioned; however, the Lyapunov spectrum provides, for
+a given number of tangent space dimensions, an estimate of the condition number. This indicates
+how close to singular the gradient signal for a given number of tangent space dimensions is. Given
+a fixed acceptable condition number (determined, for example, by noise level or floating-point
+precision), we observe that gradient flossing increases the number of usable tangent space
+dimensions for backpropagation (Fig 2B).
+Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents
+can even predict differences in condition number that originate from finite network size N
+(Fig 2C). We emphasize that this goes beyond mean-field methods, which become exact only in the
+large-network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G).
+
+6 Initial Gradient Flossing Improves Trainability
+[Figure 3 panels: A) delayed copy task and B) delayed XOR task, test loss vs. epochs with and
+without gradient flossing; C) delayed copy task and D) delayed XOR task, test loss vs. task
+difficulty (delay d).]
+Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons A) Test
+error for Vanilla RNNs trained on delayed copy task $y_t = x_{t-d}$ for d = 40 with and without
+gradient flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed
+XOR task with $y_t = |x_{t-d/2} - x_{t-d}|$. C) Mean final test loss as a function of task
+difficulty (delay d) for delayed copy task. D) Mean final test loss as a function of task
+difficulty (delay d) for delayed XOR task. Parameters: g = 1, batch size b = 16, N = 80,
+epochs = 10^4, T = 300, gradient flossing for Ef = 500 epochs on k = 75 before training. Shaded
+regions in C and D indicate the 20% and 80% percentiles and solid line shows mean. Dots are
+individual runs. Task loss: MSE(y, ŷ).
+
+We next present numerical results on two tasks with variable spatial and temporal complexity,
+demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. In
+the following, we call gradient flossing before training preflossing. For preflossing, we first
+initialize the network randomly, then minimize $L_{\text{flossing}} = \sum_{i=1}^{k} \lambda_i^2$
+using the ADAM optimizer and subsequently train on the tasks. We deliberately do not use
+sequential MNIST or similar toy tasks commonly used to probe exploding/vanishing gradients,
+because we want a task where the structure of long-range dependencies in the data is transparent
+and can be varied as desired.
+First, we consider the delayed copy task, where a scalar stream of random input numbers x must be
+reproduced by the output y delayed by d time steps, i.e. $y_t = x_{t-d}$. Although the task itself
+is trivial and can be solved even by a linear network through a delay line (see appendix E), RNNs
+encounter vanishing gradients for large delays d during training even with 'critical'
+initialization with g = 1. Our experiments show that gradient flossing can substantially improve
+the performance of RNNs on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail
+to train reliably beyond d = 20, Vanilla RNNs with gradient flossing can be reliably trained for
+d = 40 (Fig 3C). Note that we flossed here k = 40 Lyapunov exponents before training. We will later
+investigate the role of the number of flossed Lyapunov exponents.
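The target sequences for the delayed copy task and for the delayed XOR task defined in the caption of Fig 3 can be generated as follows. This is a sketch under assumptions the text leaves open: targets are emitted only for time steps t ≥ d, the delay d is taken to be even so that d/2 is an integer, and the inputs are drawn uniformly from [0, 1) because the input distribution is not specified here. The function names are hypothetical.

```python
import random

def delayed_copy_targets(xs, d):
    """Pairs (t, y_t) with y_t = x_{t-d}; only t >= d is emitted (an assumption)."""
    return [(t, xs[t - d]) for t in range(d, len(xs))]

def delayed_xor_targets(xs, d):
    """Pairs (t, y_t) with y_t = |x_{t-d/2} - x_{t-d}| for even d."""
    if d % 2:
        raise ValueError("d must be even so that d/2 is an integer delay")
    return [(t, abs(xs[t - d // 2] - xs[t - d])) for t in range(d, len(xs))]

rng = random.Random(0)
xs = [rng.random() for _ in range(300)]  # T = 300 as in Fig 3; distribution assumed
copy_targets = delayed_copy_targets(xs, d=40)
xor_targets = delayed_xor_targets(xs, d=40)
print(len(copy_targets), len(xor_targets))
```

Because every target depends on inputs exactly d (and d/2) steps in the past, the delay d directly sets how far error signals must travel back in time, which is what makes the difficulty of these tasks easy to control.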
+Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear
+input-output computation on a sequential stream of scalar inputs, i.e.,
+$y_t = |x_{t-d/2} - x_{t-d}|$, where d denotes a time delay of d time steps (for details see
+appendix H). Fig 3D demonstrates that gradient flossing helps to train networks on a substantially
+longer delay d. We found similar improvements through gradient flossing for RNNs initialized with
+orthogonal weights (see appendix G).
+
+7 Gradient Flossing During Training
+[Figure 4 panels: A) delayed temporal XOR and B) delayed spatial XOR, test accuracy (%) vs. epochs
+for flossing during training, preflossing, and no flossing; C), D) test accuracy (%) vs.
+complexity (delay d) for the same three conditions.]
+Figure 4: Gradient flossing during training further improves trainability
+A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task
+$y_t = x_{t-d/2} \oplus x_{t-d}$ with gradient flossing during training (green), preflossing
+(gradient flossing before training) (orange), and with no gradient flossing (blue) for d = 70.
+Solid lines are means across 20 network realizations; individual network realizations are shown as
+transparent fine lines. B) Same as A for delayed spatial XOR task with
+$y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$. C) Test accuracy as a function of task
+difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a function of task
+difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size b = 16, N = 80,
+epochs = 10^4, T = 300, gradient flossing for Ef = 500 epochs on k = 75 before training and during
+training for green lines, and only before training for orange lines. Same plotting conventions as
+previous figure. Task loss: cross-entropy between y and ŷ.
+We next investigate the effects of gradient flossing during training and find that gradient
+flossing during training can further improve trainability. We trained RNNs on two more challenging
+tasks with variable temporal complexity and performed gradient flossing either both during and
+before training, only before training, or not at all.
+Fig 4A shows the test accuracy for Vanilla RNNs trained on the delayed temporal XOR task
+$y_t = x_{t-d/2} \oplus x_{t-d}$ with random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla
+RNNs falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the
+trainability can be improved, but still goes to chance level for d = 70. In contrast, for networks
+with gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case,
+we preflossed for 500 epochs before task training and again after 500 epochs of training on the
+task. In Fig 4B, D the networks have to perform the nonlinear XOR operation
+$y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$ on a three-dimensional binary input signal
+$x^1$, $x^2$, and $x^3$ and generate the correct output with a delay of d steps. While the
+solution of the task itself is not difficult and could even be implemented by hand (see appendix),
+the task is challenging for backpropagation through time because nonlinear temporal associations
+bridging long time horizons have to be formed. Again, we observe that gradient flossing before
+training improves the performance compared to baseline, but starts failing for long delays d > 60.
+In contrast, networks that are also flossed during training can solve even more difficult tasks
+(Fig 4D).
+We find that after gradient flossing, the norm of the error gradient with respect to initial
+conditions $h_0$ is amplified (appendix C). Interestingly, gradient flossing can also be
+detrimental to task performance if it is continued throughout all training epochs (appendix C).
+We note that merely regularizing the spectral radius of the recurrent weight matrix W or the
+individual one-step Jacobians $D_s$ numerically or based on mean-field theory does not yield such a
+training improvement. This suggests that taking the temporal correlations between Jacobians $D_s$
+into account is important for improving trainability.
+
+7.1 Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents
+
+We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in
+training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient
+flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d.
+We found that as the task becomes more difficult, networks where not enough Lyapunov exponents k
+are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring final
+test accuracy as a function of the number of flossed Lyapunov exponents, we observed that more
+Lyapunov exponents k have to be flossed to achieve 100% accuracy as the tasks become more
+difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function
+of both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same
+trend (Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents
+close to zero. We note that this might also partially be caused by the 'streaming' nature of the
+task: in our tasks, longer delays automatically imply that more values have to be stored, as at
+any moment all the values in the 'delay line' have to be remembered to successfully solve the
+tasks.
This is different from
+tasks where a single variable has to be stored and recalled after a long delay. It would be interesting
+to study tasks where the number of delay steps and the number of items in memory can be varied
+independently.
+Finally, we did the same analysis on networks with only preflossing (gradient flossing before training)
+and found the same trend (supplement Fig 7D). However, in that case, even when all N Lyapunov
+exponents were flossed (k = N), the networks were not able to solve the most difficult tasks. This seems
+to indicate that gradient flossing during training cannot be replaced by just gradient flossing more
+Lyapunov exponents before training.
+
+Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
+of flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number
+of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$
+for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing
+for $E_f$ = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75%
+percentiles, solid lines are means, transparent dots are individual simulations, task loss: cross-entropy
+between y and ŷ.
+8 Limitations
+The mathematical connection between Lyapunov exponents and backpropagation through time
+exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be
+interesting to extend our analysis to finite-time Lyapunov exponents.
+Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians
+of different time periods t − τ, but the Lyapunov exponent only considers the asymptotic longest
+product.
Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of
+the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside
+or towards the attractor.
+Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry
+of covariant Lyapunov vectors [53], which could be used to improve training performance, speed,
+and reliability. Additionally, it is important to investigate how sensitive the method is to the choice
+of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54].
+Finally, the computational cost of our method scales with $O(N k^2)$, where N is the network size
+and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we
+suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is
+not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also
+calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov
+spectrum and investigate more efficient gradient flossing schedules.
+
+
+9 Discussion
+
+We tackle the problem of gradient signal propagation in recurrent neural networks through a
+dynamical systems lens. We introduce a novel method called gradient flossing that addresses the
+problem of gradient instability during training. Our approach enhances gradient signal stability both
+before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian
+well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this,
+we combine established dynamical systems methods for calculating Lyapunov exponents with an
+analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability
+in a manner that is memory-efficient, numerically stable, and exact across long time horizons.
+Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55]. +Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For +difficult temporal credit assignment problems, gradient flossing throughout training further enhances +signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks +with controllable time-complexity and show that it can be combined with other approaches to tackle +exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal +initialization and specialized single units, such as LSTMs. +Prior research on exploding and vanishing gradients mainly focused on selecting network architectures +that are less prone to exploding/vanishing gradients or finding parameter initializations that provide +well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can +be seen as a complementary approach that can further enhance gradient stability throughout training. +Compared to the work on picking good parameter initializations based on random matrix theory [41] +and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field +theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient +flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory +and mean-field heuristics are usually confined to the limit of large networks [52], while gradient +flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients +of backpropagation through time has been described previously [9, 12] and has been spelled out +analytically and studied numerically [10, 11, 56, 57, 58]. 
In contrast, we use Lyapunov exponents
+here not only as a diagnostic tool for gradient stability but also to show that they can directly be part
+of the cure for exploding and vanishing gradients.
+Future investigations could delve further into the roles of the second to $N$th Lyapunov exponents
+in trainability, and how they are related to the task at hand, the rank of the parameter update, the
+dimensionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results
+suggest a trade-off between trainability across long time horizons and the nonlinear task demands
+that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent
+learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to
+feedforward networks, state-space models and transformers is a promising avenue for future research
+(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such
+as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks
+could be treated similarly. This could also provide a link between the stability of the network against
+adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that
+time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate that
+such input may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture
+search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other
+model parameters as well. For instance, gradients of Lyapunov exponents with respect to
+single-unit parameters could optimize the activation function and single-neuron biophysics in biologically
+plausible neuron models.
+
+Acknowledgments and Disclosure of Funding
+I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G.
+Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S.
+Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript.
+I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex
+Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for
+the Global Brain (542939SPI), and the Swartz Foundation (2021-6).
+
+References
+ [1] P. Werbos. Beyond Regression. PhD dissertation, Harvard University, 1974.
+ [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management
+    Science. MIT-Press, Cambridge, Mass, 8, 1985.
+ [3] Y. LeCun. Une procedure d'apprentissage pour reseau a seuil asymetrique. Proceedings of Cognitiva 85,
+    pages 599–604, 1985.
+ [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by
+    back-propagating errors. Nature, 323(6088):533, October 1986.
+ [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991.
+ [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–
+    1780, 1997.
+ [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
+    IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
+ [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural
+    Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063.
+ [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+    arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427.
+[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics
+    with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022.
+[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół.
Persistent learning signals and working
+    memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023.
+[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+    Physical Review Research, 5(4):043044, October 2023.
+[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics,
+    57(3):617–656, July 1985.
+[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics.
+    Cambridge University Press, Cambridge, February 2016.
+[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of
+    Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS
+    Bibcode: 2014arXiv1409.1259C Type: article.
+[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
+    networks. In Proceedings of the 30th International Conference on International Conference on Machine
+    Learning - Volume 28, ICML'13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org.
+[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of
+    Technology, Faculty of Information Technology, Brno, CZ, 2012.
+[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
+    Google-Books-ID: omivDQAAQBAJ.
+[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
+    Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine
+    Learning, pages 448–456. PMLR, June 2015.
+[20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View
+    on Recurrent Neural Networks. International Conference on Learning Representations, December 2018.
+
+[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli.
Exact solutions to the nonlinear dynamics of + learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv: + 1312.6120. + +[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In + Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June + 2016. + +[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley + Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978. + PMLR, July 2018. + +[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN): + An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page + arXiv:2010.00951, October 2020. + +[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. + Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January + 2021. + +[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and + Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023. + ADS Bibcode: 2023arXiv230306349O Type: article. + +[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an + erratum note. Bonn, Germany: German National Research Center for Information Technology GMD + Technical Report, 148:34, 2001. + +[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks. + Neural Computation, 19(1):111–138, January 2007. + +[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances + in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. 
+
+[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway
+    Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198.
+    PMLR, July 2017.
+
+[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle
+    Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887,
+    November 2019. ISSN: 2576-2303.
+
+[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook
+    University, May 2023.
+
+[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15,
+    pages 315–323, April 2011.
+
+[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential
+    expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June
+    2016. arXiv: 1606.05340.
+
+[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field
+    Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394
+    [cs, stat], August 2018. arXiv: 1806.05394.
+
+[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural
+    Networks. December 2018.
+
+[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information
+    Propagation. November 2016.
+
+[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In
+    Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
+[39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training.
+    Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article.
+[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H.
Chi, and Jeffrey + Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019. +[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls + phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv: + 2002.00025. +[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov, exponents of + dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968. +[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing + Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990. +[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic + Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of + them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980. +[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks. + Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph]. +[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application + in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode: + 2010arXiv1001.1654W Type: article. +[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient $WY$ Representation for Products of + Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January + 1989. +[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating + Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article. +[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. 
Resurrecting the sigmoid in deep learning
+    through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems,
+    volume 30. Curran Associates, Inc., 2017.
+[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in
+    Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979.
+[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007.
+    Google-Books-ID: ZUfVZELlrMEC.
+[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and
+    Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics,
+    11(1):501–528, 2020.
+[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant
+    Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007.
+[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D:
+    Nonlinear Phenomena, 118(3):167–198, July 1998.
+[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential
+    Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+    2018.
+[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents.
+    Master's thesis, RWTH Aachen, Aachen/Juelich, 2021.
+[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents
+    for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied
+    Mathematics and Statistics, 8, 2022.
+[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks.
+    Physical Review X, 12(1):011011, January 2022.
+[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes. Routledge, New York, 1st edition, March 1987.
+[60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus
+    of Variations and Optimal Control Theory. Princeton University Press, December 2011.
+
+[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms
+    for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320–
+    135:5353, January 2020.
+
+[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization. Technical
+    report, August 2019. ADS Bibcode: 2019arXiv190802729H Type: article.
+[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without
+    Normalization. Technical report, January 2019. ADS Bibcode: 2019arXiv190109321Z Type: article.
+
+[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical
+    Review Letters, 69(26):3717–3719, December 1992.
+
+[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent
+    neural networks. Physical Review E, 82(1):011903, July 2010.
+
+[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered
+    neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758.
+
+[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations
+    impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology,
+    18(12):e1010590, December 2022.
+
+[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002. Google-Books-ID:
+    PfXoAwAAQBAJ.
+[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems.
+    World Scientific Pub Co Inc, Hackensack, NJ, September 2009.
+
+A Backpropagation Through QR Decomposition
+The backward pass of the QR decomposition $A = QR$ is given by [45, 46, 47, 48]
+    $\bar{A} = \left[ \bar{Q} + Q \, \mathrm{copyltu}(M) \right] R^{-T}$    (11)
+where $M = R \bar{R}^T - \bar{Q}^T Q$ and the copyltu function generates a symmetric matrix by copying the lower
+triangle of the input matrix to its upper triangle, with the element $[\mathrm{copyltu}(M)]_{ij} = M_{\max(i,j),\min(i,j)}$
+[45, 46, 47, 48]. Adjoint variables are written here as $\bar{T} = \partial L / \partial T$.
+Using an analytical pullback is more memory-efficient and less computationally costly than directly doing
+automatic differentiation through the QR decomposition. Moreover, from a practical perspective, QR
+decomposition often relies on BLAS/LAPACK routines, which are not amenable to common differentiable
+programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient
+flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu, available here.
+
+B Further Details and Analysis of Gradient Flossing
+An example implementation of gradient flossing in Flux, a machine learning library in Julia, is available here.
+We are actively developing implementations for other widely used differentiable programming frameworks.
+
+B.1 Gradient Flossing for recurrent LSTM and ReLU networks
+
+[Figure 6 plot. Panels A (LSTM) and C (ReLU): first Lyapunov exponent λ1, target and actual, versus epochs. Panels B and D: λ1² on a logarithmic axis versus epochs.]
+
+Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks A) First
+Lyapunov exponent of LSTM network as a function of training epochs. Minimizing the mean
+squared error between estimated first Lyapunov exponent and target Lyapunov exponent $\lambda_1 = 0$
+by gradient descent.
First Lyapunov exponent of LSTM network (solid lines) converges to target
+value (thick dashed lines) in fewer than 100 epochs. 10 random LSTM RNNs were initialized with
+Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1).
+B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent
+LSTM networks over epochs. C) Same as A for recurrent ReLU network. Here networks were
+initialized with Gaussian recurrent weights $W_{ij} \sim \mathcal{N}(-0.1, g^2/N)$ where values of g were drawn
+g ∼ Unif(0, 1). D) Same as B for recurrent ReLU network. Parameters: network size N = 32 with 10
+network realizations. Shaded regions in B, D are 25% and 75% percentiles, solid line shows median.
+
+We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To
+this end, we generated random LSTM networks where the weights of all the different gates and biases were
+independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance.
+Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The
+dynamics of each of the N LSTM units follows the map [6]:
+    $f_t = \sigma_g(U_f h_{t-1} + W_f x_t + b_f)$    (12)
+    $o_t = \sigma_g(U_o h_{t-1} + W_o x_t + b_o)$    (13)
+    $i_t = \sigma_g(U_i h_{t-1} + W_i x_t + b_i)$    (14)
+    $\tilde{c}_t = \sigma_h(U_c h_{t-1} + W_c x_t + b_c)$    (15)
+    $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$    (16)
+    $h_t = o_t \odot \phi(c_t)$    (17)
+where ⊙ denotes the Hadamard product, $\sigma_g(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function, $\sigma_h(x) = \tanh(x)$ and
+entries of the matrices $U_x$ are drawn from $U_x \sim \mathcal{N}(0, g_x^2/N)$. For simplicity, the bias terms $b_x$ are scalars.
+Subscripts f, o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state.
+In each LSTM unit, there are two dynamic variables c and h, and three gates f, o, and i that control the flow
+of signals into and out of the cell c.
We set the values $g_{ih}$, $g_{ix}$, $g_{fx}$, $b_f$, $g_{ch}$, $g_{cx}$, $g_{cx}$, $g_{ox}$ to be uniformly
+distributed between 0 and 1 and initialize $b_i$, $g_{fh}$, $b_c$, $b_o$ as zero.
+During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close
+to the target Lyapunov exponent $\lambda_1^{\mathrm{target}} = 0$ in fewer than 100 epochs as shown in Fig 6A. Fig 6B shows that
+the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov
+exponent of $\lambda_1^{\mathrm{target}} = -1$ is achieved after 100 gradient flossing steps only for a subset of random network
+realizations (not shown). We speculate that this behavior is influenced by the gating structure of LSTM units,
+which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (see also
+[9, 12, 41, 58]).
+For the recurrent ReLU networks, we considered the same Vanilla RNN dynamics as in the main manuscript in
+Eq 9:
+    $h_{s+1} = f(h_s, x_{s+1}) = W \phi(h_s) + V x_{s+1}$.
+The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1
+and variance $g^2/N$, where g is a gain parameter that controls the heterogeneity of weights. We use the transfer
+function $\phi(x) = \max(x, 0)$. $x_s$ is a sequence of inputs and V is the input weight. $x_s$ is a stream of i.i.d.
+Gaussian input $x_s \sim \mathcal{N}(0, 1)$ and the input weights V are drawn from $\mathcal{N}(0, 1)$. Both W and V are trained during gradient
+flossing. We found that some ReLU networks initially had unstable dynamics with positive Lyapunov exponents
+(Fig 6C). However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that
+the squared Lyapunov exponents of ReLU networks converge towards zero.
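The ReLU network dynamics above, together with a QR-based estimate of the Lyapunov exponents that gradient flossing regularizes, can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' Julia code: the scalar input stream, the function names `lyapunov_exponents` and `flossing_loss`, re-orthonormalization at every step, and the small floor inside the logarithm are all choices made here. NumPy does not differentiate through the QR step, so the sketch only evaluates the loss; minimizing it would need an autodiff framework with a QR pullback as in appendix A.

```python
import numpy as np

def lyapunov_exponents(W, V, x, h0, k, seed=0):
    """Estimate the first k Lyapunov exponents of
    h_{s+1} = W relu(h_s) + V x_{s+1} along one trajectory."""
    N = W.shape[0]
    rng = np.random.default_rng(seed)
    # Random orthonormal basis spanning k tangent-space directions.
    Q, _ = np.linalg.qr(rng.standard_normal((N, k)))
    h = np.array(h0, dtype=float)
    log_r = np.zeros(k)
    for x_next in x:
        # One-step Jacobian dh_{s+1}/dh_s = W diag(relu'(h_s)).
        D = W * (h > 0.0)[None, :]
        h = W @ np.maximum(h, 0.0) + V * x_next
        # Evolve the basis and re-orthonormalize; |diag(R)| gives local growth.
        Q, R = np.linalg.qr(D @ Q)
        # Tiny floor avoids log(0) if too few units are active (assumption).
        log_r += np.log(np.abs(np.diag(R)) + 1e-300)
    return log_r / len(x)

def flossing_loss(lyap, target=0.0):
    """Sum of squared deviations of the exponents from the target value."""
    return np.sum((lyap - target) ** 2)
```

For a setup resembling the one described above, one could draw `W = -0.1 + g / np.sqrt(N) * rng.standard_normal((N, N))` with N = 32, a vector `V` with standard normal entries, and a standard normal input stream `x`, then track `flossing_loss(lyapunov_exponents(W, V, x, h0, k=1))` across parameter updates.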
+
+B.2 Additional Results for Different Numbers of Flossed Lyapunov Exponents
+In addition to Fig 5 in the main text, we did the same analysis on networks with only preflossing (gradient flossing
+before training) and found that more Lyapunov exponents k have to be flossed to achieve 100% accuracy as the
+tasks become more difficult (Fig 7D). However, in that case, even when all N Lyapunov exponents were flossed
+(k = N), the networks were not able to solve the most difficult tasks. This seems to indicate that gradient flossing during
+training cannot be replaced by just gradient flossing more Lyapunov exponents before training.
+
+
+C Additional Results on Gradient Flossing Throughout Training
+We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient
+flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time
+steps is boosted. Moreover, subordinate singular values of the error gradient with respect to the recurrent weights are also
+boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally,
+we show that if gradient flossing is continued throughout training, it can be detrimental to the accuracy. Finally,
+we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR
+task have a simple relationship to the delay d.
+
+
+D Gradient Flossing Boosts the Gradient Norm for Long Time Horizons
+In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is
+important to note that the complete error gradient of backpropagation through time is composed of a summation
+of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent
+dynamics before reaching its target.
Consequently, when the singular values of the long-term Jacobian are
+smaller than 1, the influence of the shorter loops typically dominates over that of the long-term Jacobian.
+
+[Figure 7 plot. Panel A: test accuracy (%) versus complexity (delay d) for different k. Panel B: test accuracy (%) versus number of flossed λ_i (k) for different delays. Panels C and D: test accuracy (%) over complexity (delay d) and number of flossed λ_i (k).]
+
+Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
+of flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number of
+flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function
+of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400}). D) Same as B for preflossing only. Parameters: g = 1, batch
+size b = 16, N = 80, epochs = $10^4$ for delayed temporal XOR, epochs = 5000 for delayed spatial
+XOR, T = 300, gradient flossing for $E_f$ = 500 epochs before training and during training for A, B,
+C, and only before training for D. Shaded areas are 25% and 75% percentiles, solid lines are means,
+transparent dots are individual simulations, task loss: cross-entropy between y and ŷ.
+
+
+In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop
+length of backpropagation through time is necessary for finding the correct association. We were moreover
+careful in our task design not to have any additional signals in our task that might help to bridge the long time
+scale.
In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops,
+even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate
+the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time
+horizons, we focus on the gradient with respect to the initial conditions $h_0$:
+    $\frac{\partial L_t}{\partial h_0} = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial h_0} = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \delta_{\tau 0} = \frac{\partial L_t}{\partial h_t} T_t(h_0)$    (18)
+
+We note that the sum conveniently drops out, as only the longest 'loop' contributes: the only nonvanishing summand
+is the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure
+that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we
+use the binary cross entropy loss, which makes the derivative $\partial L_t / \partial h_t$ trivial.
+In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we
+compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70.
+One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}, while the other is trained
+without gradient flossing.
+For the network without gradient flossing, the gradient norm $|dL/dh_0|$ diminishes to extremely small values
+($< 10^{-6}$) and remains small throughout training. In contrast, for the network trained with gradient flossing, each
+episode of gradient flossing causes the norm $|dL/dh_0|$ to spike, surpassing $10^{-2}$. These findings
+are direct evidence that gradient flossing boosts the gradient norm, which helps to bridge long time horizons in
+challenging temporal credit assignment tasks.
We observe that after several episodes of gradient flossing, the gradient norm $|dL/dh_0|$ of the networks stays around $10^{-4}$ and eventually rises to values around $10^{-2}$. Subsequently in training, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between gradient norm $|dL/dh_0|$ and training success consistently across numerous network realizations (Fig 8C and D). These findings suggest that the gradient norm $|dL/dh_0|$ can be a good predictor of learning success, sometimes hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual growth of gradient norm over epochs before accuracy surpasses chance level (Fig 9A). Analogously, when plotting the accuracy as a function of epoch aligned with the last epoch with $|dL/dh_0| < 0.001$, we observe for this task that the increase of gradient norm $|dL/dh_0|$ reliably precedes the epoch at which the accuracy surpasses the chance level (see Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector $dL/dh_0$ with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap around the training epoch where the accuracy surpasses the chance level, both in networks with and without gradient flossing. This does not come as a surprise, as the covariant Lyapunov vector measures the most unstable (or least stable) direction in the tangent space of a trajectory and perturbations of $h_0$ that have to travel over many epochs align

D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient
To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient and how it is changed by gradient flossing.
To this end, we decompose the recurrent weight gradient $dL/dW$ into a weighted sum of outer products, with weights given by its singular values $\sigma_i(dL/dW)$, using singular value decomposition (Fig 10).
As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term Jacobian $T_t(h_\tau)$, this allows us to directly link the effect of pushing Lyapunov exponents toward zero during gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked:

\[
\frac{\partial L_t}{\partial W}
= \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial W}
= \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau)\, \frac{\partial h_\tau}{\partial W} \tag{19}
\]

We again note that different 'loops' contribute to the total gradient expression and the Lyapunov exponents only characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only a few of the summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important signals about the task to be contaminated by summands of both shorter and longer chains, which contribute irrelevant fluctuations.

The singular values of the recurrent weight gradient $\sigma_i(dL/dW)$ as a function of training epoch reveal that the subordinate singular values $\sigma_{20}$ and $\sigma_{40}$ exhibit peaks at the times of gradient flossing, while the first singular value $\sigma_1$ only shows a slight peak (Fig 10A). This indicates that gradient flossing increases the effective rank of the recurrent weight gradient $dL/dW$. In other words, gradient flossing facilitates high-dimensional parameter updates. Our interpretation is that, as gradient flossing pushes Lyapunov exponents to zero, the different summands in the total gradient contribute more equitably, as long loops have neither a dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank.
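As an illustration of the decomposition described above, the following NumPy sketch splits a matrix standing in for the recurrent weight gradient into its SVD outer products and summarizes the spread of its singular values. It is not the authors' code: the function names, the synthetic matrices, and the use of a participation ratio as an effective-rank proxy are all assumptions made for this example, since the text does not state which effective-rank measure was used.

```python
import numpy as np

def svd_outer_products(G):
    """Return singular values s_i and the terms s_i * u_i v_i^T whose sum is G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    terms = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))]
    return s, terms

def participation_ratio(s):
    """Hypothetical effective-rank proxy: (sum s_i)^2 / sum s_i^2.

    Equals 1 when a single singular value dominates and len(s) when all
    singular values are equal.
    """
    s = np.asarray(s, dtype=float)
    return s.sum() ** 2 / (s ** 2).sum()

# Synthetic stand-ins for a gradient matrix (not data from the paper).
rng = np.random.default_rng(0)
N = 80
u, v = rng.standard_normal(N), rng.standard_normal(N)
low_rank = np.outer(u, v) + 1e-4 * rng.standard_normal((N, N))  # one dominant direction
spread = rng.standard_normal((N, N))                            # many comparable directions

for name, G in [("dominated by sigma_1", low_rank), ("spread out", spread)]:
    s, terms = svd_outer_products(G)
    print(name, "sigma_1 =", s[0], "sigma_20 =", s[19],
          "participation ratio =", participation_ratio(s))
```

In this picture, a gradient whose subordinate singular values are tiny corresponds to the first case, while a gradient with sizeable $\sigma_{20}$ and $\sigma_{40}$ moves toward the second.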
In contrast, without gradient flossing, the subordinate singular values (in Fig 10A $\sigma_{20}$ and $\sigma_{40}$) rapidly diminish to extremely small values over training epochs and remain very small throughout training. Note however that the leading singular values $\sigma_1$ are of comparable size irrespective of whether gradient flossing was performed or not.
We note that, similar to the gradient norm of the loss with respect to the initial condition $|dL/dh_0|$, the subordinate singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance level (Fig 10B). We confirmed this in multiple other network realizations and give here another example where the accuracy grows beyond chance only later during training (Fig 11).

D.2 Gradient Flossing Throughout Training Can Be Detrimental
We find that gradient flossing continued throughout all training epochs can be detrimental to performance (Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different conditions: we floss throughout training every 100 training epochs for 500 flossing epochs (red), we floss only early during training at training epochs e ∈ {0, 100, 200, 300, 400} (green), or we do not floss at all (blue).

[Figure 8 panels: A) $|dL/dh_0|$ (log scale) vs. epochs; B) accuracy (%) vs. epochs; C) $|dL/dh_0|$ (linear scale) vs. epochs; D) accuracy (%) vs. epochs. Legend: with flossing during training, without flossing.]

Figure 8: Gradient flossing boosts norm of long-term Jacobian A) Gradient norm $|dL/dh_0|$ as a function of training epochs for networks without flossing (blue) and networks with flossing during training (orange).
Error gradient norm is boosted after gradient flossing at epochs e ∈ {0, 100, 200, 300, 400, 500}. In networks without gradient flossing, the gradient norms $|dL/dh_0|$ are much smaller overall. One out of ten random network realizations is shown with a solid line, the other 9 with transparent lines. B) Accuracy as a function of epoch, same depiction and network realizations as in A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately when the gradient norm $|dL/dh_0|$ becomes macroscopically large. C) Same as A in linear scale. Mean final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are different network realizations with gradient flossing during training. Black lines are without any gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network realizations the moment where the gradient norm $|dL/dh_0|$ becomes macroscopically large coincides with the moment the accuracy rises beyond chance level. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$, T = 300, flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).

We observe that after every episode of gradient flossing, the accuracy drops close to the chance level of 50% (Fig 12A, red line). Between flossing episodes, the accuracy quickly recovers but never reaches 100%. Simultaneously, when the accuracy drops, the test error jumps up (Fig 12B). We also observed that the gradient norm $|dL/dh_0|$ is initially boosted by gradient flossing, but becomes nearly indistinguishable across conditions once the gradient norm $|dL/dh_0|$ becomes macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal propagation across long time horizons and the network picks up the relevant gradient signal, further gradient flossing can be harmful to the actual task execution.
We hypothesize that there might be (at least for the Vanilla networks considered here) a trade-off between the ability to bridge long time scales, which seems to require one or several Lyapunov exponents of the forward dynamics close to zero, and nonlinear task requirements, which require at least a fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′(x) < 1. It would be an interesting future research avenue to further investigate this potential trade-off also in other network architectures.

D.3 Lyapunov Exponents after Training With and Without Gradient Flossing
In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on the spatial delayed XOR task both with and without gradient flossing. We find for successful networks with gradient flossing a systematic relationship between the first Lyapunov exponent and the delay, which can be fitted approximately by $\lambda_1(d) = -0.2 \exp(-0.03\, d)$. Unsuccessful networks with accuracy at chance level have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent. In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an

[Figure 9 panels: A) $|dL/dh_0|$ (log scale) vs. epochs aligned to the last epoch at chance accuracy; B) accuracy (%) vs. epochs aligned to the last epoch with small gradient norm.]

Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level
A) Gradient norm $|dL/dh_0|$ as a function of training epochs for 20 network realizations with flossing during training. Epochs are aligned to the last epoch where accuracy is ≤ 50%. B) Same task and simulations as in A, but here accuracy as a function of epoch, for 20 network realizations with flossing during training. Epochs are aligned to the last epoch with $|dL/dh_0| < 0.001$. Different colors are different network realizations.
Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$, T = 300, flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents early during training at training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).

analytically tractable temporal correlation structure, which might partially explain the less conclusive results. It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in [57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a 2N × 2N Jacobian matrix that takes interactions between both memory and cell states into account is required [9, 12, 58].

E Gradient Flossing for Linear Network
We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task, but not on tasks that require a nonlinear input-output operation like the temporally delayed XOR task. A full analytical description of gradient flossing for linear networks would be a promising avenue for future research, as networks with linear dynamics can still have nonlinear learning dynamics [21]. However, this is beyond the scope of the presented work.

F Computational Complexity of Gradient Flossing
We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three main contributors to the computational cost (Table 1): First, the RNN step, which has a computational complexity of $O(N^2 b)$ per time step both in the forward and backward pass, where N is the dimension of the recurrent network state (which in the case of Vanilla networks equals the number of units) and b is the batch size. Second, the Jacobian step, which scales with $O(N^2 k)$ per time step, where k is the number of flossed Lyapunov exponents.
Third, the QR decomposition, which scales with $O(N k^2)$, where k is the number of Lyapunov exponents considered.
Together, this results in a total amortized cost of $O(N^2 b\, T)$ per training epoch, where T is the number of training time steps, and a total amortized cost per flossing epoch of $O(N^2 T_f (1 + k/t_{ONS} + k))$, where $T_f$ is the number of flossing time steps.

[Figure 10 panels: A) singular values $\sigma_i(dL/dW)$ (log scale) vs. epochs; B) accuracy (%) vs. epochs. Legend: early training flossing, without flossing.]

Figure 10: Gradient flossing decreases condition number of recurrent weight error gradient
A) Singular values of the recurrent weight gradient $\sigma_i(dL/dW)$ as a function of training epochs for singular values i ∈ {1, 20, 40} for networks without gradient flossing (blue) and early training gradient flossing (green). At epochs of gradient flossing, the subordinate singular values $\sigma_{20}$ and $\sigma_{40}$ are peaked, while the first singular value $\sigma_1$ has only a slight peak. This indicates that gradient flossing increases the effective rank of the recurrent weight gradient $dL/dW$. B) Accuracy as a function of training epochs. Note that accuracy of networks with gradient flossing grows beyond chance level approximately when the subordinate singular values $\sigma_{20}$ and $\sigma_{40}$ increase, which enables high-dimensional parameter updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^3$, T = 300, gradient flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).

In the case of preflossing, thus, the total computational cost scales with $O(N^2 [E\, b\, T + E_p T_f (1 + k/t_{ONS} + k)])$, where E is the number of training epochs and $E_p$ is the number of preflossing epochs.
For gradient flossing during training (assuming that there is also preflossing done), the amortized cost scales with $O(N^2 [E\, b\, T + E_p T_p + E_f T_f (1 + k/t_{ONS} + k)])$, where $E_f$ is the total number of flossing epochs during training.
Empirically, we find that both the number of preflossing epochs $E_p$ and the number of flossing episodes $E_f$ necessary for training success are much smaller than the total number of training epochs E. For example, the preflossing for 500 epochs in the numerical experiment of Fig 3 took ∼37 seconds, while the overall training on 10000 training epochs with batch size b = 16 took ∼1680 seconds. Thus only approximately 2.2% of the total training time was spent on gradient flossing. Moreover, $T_p$ can be smaller than T; it just has to be long enough such that the temporal correlations in the task can be bridged. In the case of the tasks discussed in the manuscript, this would be the delay d. It remains an important challenge to infer the suitable number of flossing time steps $T_f$ for tasks with unknown temporal correlation structure.
It would also be interesting to investigate how the CPU hours/wall-clock time/flops/Joule/CO2-emission spent on gradient flossing vs. on training networks with larger N trade off against each other. For this, we would suggest to first find the smallest network that on median successfully trains on a binary temporal XOR task for a fixed given delay d and measure the computational resources involved in training it, e.g. in terms of CPU hours, and then compare it to a network with gradient flossing. This would be a promising analysis but is beyond our current computational budget. We will start such experiments and might be able to provide results during the reviewer period.
[Figure 11 panels: A) singular values $\sigma_i(dL/dW)$ (log scale) vs. epochs; B) accuracy (%) vs. epochs. Legend: early training flossing, without flossing.]

Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient
Same as Fig 10 for different network realization.

G Additional controls

We also investigate the effects of gradient flossing during the training with orthogonal weight initializations and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal complexity and performed gradient flossing either both during and before training, only before training, or not at all.
Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal XOR task $y_t = x_{t-d/2} \oplus x_{t-d}$ with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks have to perform the nonlinear XOR operation $y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$ on a three-dimensional binary input signal $x^1$, $x^2$, and $x^3$ and generate the correct output with a delay of d steps identical to Fig 4 in the main text.
Again, we observe, similar to networks with Gaussian initialization, that flossing before training improves the performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and D, we trained the network only on 5000 epochs, compared to 10000 epochs in networks with random Gaussian initialization, because for 10000 epochs, both networks with gradient flossing only before training and with gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal initialization does seem to slightly improve performance for tasks with long time horizons to bridge, and that gradient flossing can additionally boost the performance. Thus orthogonal initialization and gradient flossing seem to go well together. It would be interesting to study if orthogonal initialization also reduces the number of gradient flossing steps necessary to improve performance.

H Additional Details on Training Tasks

In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3:

[Figure 12 panels: A) accuracy (%) vs. epochs; B) test error (log scale) vs. epochs; C) $|dL/dh_0|$ (log scale) vs. epochs. Legend: with flossing throughout training, early training flossing, without flossing.]

Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy as a function of training epochs for binary temporal delayed XOR task for gradient flossing throughout training every 100 training epochs (red). Accuracy drops down close to chance level every time after gradient flossing but recovers quickly between.
Same for only 5 episodes of gradient flossing at epochs e ∈ {0, 100, 200, 300, 400} (green) and no flossing at all (blue). B) Test error as a function of training epochs. C) Gradient norm $|dL/dh_0|$ as a function of training epochs for networks without gradient flossing (blue), networks with flossing throughout training (red), and early training gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks without gradient flossing, the gradient norms $|dL/dh_0|$ are much smaller overall. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$, T = 300, gradient flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).

H.1 Copy task
For the copy task, the target network readout at time t is $y_t = x_{t-d}$, where d denotes the delay. We chose the input to be sampled i.i.d. from a uniform distribution between 0 and 1.

H.2 Temporal XOR task
The temporal XOR task requires the target network readout $y_t$ at time t to be computed as follows:

\[
y_t = |x_{t-d/2} - x_{t-d}| \tag{20}
\]

where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output $y_t$ follows the truth table of the XOR digital logic gate (Table 2). Thus, the function $f(x_a, x_b) = |x_a - x_b|$ can be seen as an analytical representation of the XOR gate. It is important to note that f(x, 0) = x only for x ≥ 0, and that this task requires a nonlinearity. The implementation can easily be constructed analytically: for example, using two rectified linear units ϕ(x) = max(x, 0), the output can be constructed by

\[
f(x_a, x_b) = |x_a - x_b| = \phi(x_a - x_b) + \phi(x_b - x_a). \tag{21}
\]

Together with a delay line to transmit the signal $x_{t-d}$ over time, this can solve the task.
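The ReLU construction of Eq 21 and the target definition of Eq 20 can be checked with a few lines of plain Python. This is a minimal sketch, not the authors' task generator: the helper names are hypothetical, the delay d is assumed to be even, and targets are only produced for time steps t ≥ d where both delayed inputs exist.

```python
import random

def relu(x):
    """Rectified linear unit, phi(x) = max(x, 0)."""
    return max(x, 0.0)

def xor_relu(xa, xb):
    """|xa - xb| written as a sum of two rectified linear units (Eq 21)."""
    return relu(xa - xb) + relu(xb - xa)

def delayed_xor_targets(x, d):
    """Targets y_t = |x_{t-d/2} - x_{t-d}| for t = d, ..., len(x) - 1 (Eq 20).

    Assumes an even delay d; earlier time steps have no defined target here.
    """
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even delay d")
    half = d // 2
    return [xor_relu(x[t - half], x[t - d]) for t in range(d, len(x))]

# For binary inputs the function reproduces the XOR truth table.
for xa in (0, 1):
    for xb in (0, 1):
        print(xa, xb, xor_relu(xa, xb))

# Random Bernoulli input sequence with a hypothetical delay of d = 4.
random.seed(0)
x = [random.randint(0, 1) for _ in range(12)]
print(x, delayed_xor_targets(x, d=4))
```

The delay line itself is not modeled here; the sketch only indexes into the stored input sequence.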
[Figure 13 panels: A) $\lambda_1$ vs. complexity (delay d), no gradient flossing; B) $\lambda_1$ vs. complexity (delay d), gradient flossing during training; C) $\lambda_{10}$ vs. complexity (delay d), no gradient flossing; D) $\lambda_{10}$ vs. complexity (delay d), gradient flossing during training.]

Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First Lyapunov exponents $\lambda_1$ for Vanilla networks trained on spatial delayed XOR task as a function of the delay with no gradient flossing. Color-coded is the test accuracy at the end of training, where red corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully trained networks can be approximated by the empirical fit $\lambda_1(d) = -0.2 \exp(-0.03\, d)$. (Protocol for gradient flossing during training same as main text Fig 4B.) C) Same as A for tenth Lyapunov exponents $\lambda_{10}$. D) Same as B for tenth Lyapunov exponents $\lambda_{10}$. Same fit as in B also describes $\lambda_{10}$. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$, T = 300, gradient flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR, loss: cross entropy(y, ŷ).

I Additional Background on Lyapunov Exponents of RNNs

An autonomous dynamical system is usually defined by a set of ordinary differential equations $dh/dt = F(h)$, $h \in \mathbb{R}^N$ in the case of continuous-time dynamics, or as a map $h_{s+1} = f(h_s)$ in the case of discrete-time dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation, but everything directly extends to continuous-time systems [43]. Together with an initial condition $h_0$, the map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal perturbation $h'_0 = h_0 + \epsilon u_0$ evolves in time.
Chaotic systems are sensitive to initial conditions; almost all infinitesimal perturbations $\epsilon u_0$ of the initial condition grow exponentially with time, $|\epsilon u_t| \approx \exp(\lambda_1 t)\, |\epsilon u_0|$. Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov exponent $\lambda_1$ measures the average rate of exponential divergence or convergence of nearby initial conditions:

\[
\lambda_1(h_0) = \lim_{t \to \infty} \frac{1}{t} \lim_{\epsilon \to 0} \log \frac{\|\epsilon u_t\|}{\|\epsilon u_0\|} \tag{22}
\]

In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial conditions as long as the initial conditions are in the basins of attraction of the attractor. Note that it is crucial to first take the limit ϵ → 0 and then t → ∞, as $\lambda_1(h_0)$ would be trivially zero for a bounded attractor if the limits are exchanged, as $\lim_{t \to \infty} \log \frac{\|\epsilon u_t\|}{\|\epsilon u_0\|}$ is bounded for finite perturbations even if the system is chaotic. To measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal perturbations $u_s$ spanning the tangent space:

\[
u_{s+1} = D_s u_s \tag{23}
\]

                                                      forward pass                                              backward pass
RNN dynamics                                          O(N^2 b)                                                  "
Jacobian step                                         O(N^2 k)                                                  "
QR step                                               O(N k^2)                                                  "
total amortized costs per training epoch              O(N^2 b T)                                                "
total amortized costs per gradient flossing epoch     O(N^2 T_f (1 + k/t_ONS + k))                              "
total amortized costs of preflossing                  O(N^2 [E b T + E_p T_f (1 + k/t_ONS + k)])                "
total amortized costs flossing during training        O(N^2 [E b T + E_p T_p + E_f T_f (1 + k/t_ONS + k)])      "

Table 1: Computational cost for gradient flossing and training of RNNs
N denotes number of neurons, b is the batch size, T is the number of time steps in forward pass of training, $T_f$ is the number of time steps in forward pass of flossing, $t_{ONS}$ is the reorthonormalization interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, $E_p$ is the number of preflossing epochs, $E_f$ is the number of
flossing epochs during training. Empirically, we find that the necessary numbers of preflossing epochs $E_p$ and flossing episodes $E_f$ are much smaller than the total number of training epochs E. Moreover, $T_p$ can be smaller than T.

Table 2: XOR
input x_{t−d}    input x_{t−2d}    target output y_t
0                0                 0
0                1                 1
1                0                 1
1                1                 0

where the N × N Jacobian $D_s(h_s) = df(h_s)/dh$ characterizes the evolution of generic infinitesimal perturbations during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a fixed point, i.e., when $h_{s+1} = f(h_s) = h_s$.
We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian

\[
T_t(h_0) = D_{t-1}(h_{t-1}) \cdots D_1(h_1)\, D_0(h_0). \tag{24}
\]

Note that $T_t(h_0)$ is a product of generally noncommuting matrices. The Lyapunov exponents $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ are defined as the logarithms of the eigenvalues of the Oseledets matrix

\[
\Lambda(h_0) = \lim_{t \to \infty} \left[ T_t(h_0)^\top T_t(h_0) \right]^{\frac{1}{2t}}, \tag{25}
\]

where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term Jacobian $T_t(h_0)$. Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope spanned by the columns of $T_t(h_0)$. Thus, the exponential volume growth rate is given by the sum of the logarithms of its first k (sorted) eigenvalues. Oseledets' multiplicative ergodic theorem guarantees the existence of the Oseledets matrix $\Lambda(h_0)$ for almost all initial conditions $h_0$ [42]. In ergodic systems, the Lyapunov exponents $\lambda_i$ do not depend on the initial condition $h_0$. However, for a numerical calculation of the Lyapunov spectrum, Eq 25 cannot be used directly because the long-term Jacobian $T_t(h_0)$ quickly becomes ill-conditioned, i.e., the ratio between its largest and smallest singular value diverges exponentially with time.
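The ill-conditioning of the long-term Jacobian mentioned above can be seen in a small numerical experiment. The sketch below multiplies random Gaussian matrices that merely stand in for the one-step Jacobians $D_s$ (an assumption for illustration; they are not Jacobians of a trained network) and records the condition number of the running product of Eq 24. The function name and the sizes are hypothetical.

```python
import numpy as np

def product_condition_numbers(n=20, steps=40, seed=0):
    """Condition number of T_t = D_{t-1} ... D_1 D_0 for t = 1, ..., steps."""
    rng = np.random.default_rng(seed)
    T = np.eye(n)
    conds = []
    for _ in range(steps):
        # Stand-in one-step Jacobian with entries of variance 1/n.
        D = rng.standard_normal((n, n)) / np.sqrt(n)
        T = D @ T  # left-multiply by the newest Jacobian, as in Eq 24
        conds.append(np.linalg.cond(T))  # ratio of largest to smallest singular value
    return np.array(conds)

conds = product_condition_numbers()
for t in (1, 10, 20, 40):
    print(f"t = {t:2d}   log10(condition number) = {np.log10(conds[t - 1]):.1f}")
```

In double precision the reported value should stop being meaningful once it approaches the inverse of machine precision, which is itself a symptom of the problem: the small singular values of the product can no longer be resolved, which is why the spectrum is not computed from Eq 25 directly.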
J Algorithm for Calculating Lyapunov Spectrum of Rate Networks
For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of an m-dimensional infinitesimal volume element is given by $\lambda^{(m)} = \sum_{i=1}^{m} \lambda_i$. Therefore, $\lambda_1 = \lambda^{(1)}$, $\lambda_2 = \lambda^{(2)} - \lambda_1$, $\lambda_3 = \lambda^{(3)} - \lambda_1 - \lambda_2$, . . . [44]. The volume growth rates can be obtained via QR-decomposition.

[Figure 14 panels (orthogonal weight initialization): A) delayed temporal XOR, test accuracy (%) vs. epochs; B) delayed spatial XOR, test accuracy (%) vs. epochs; C, D) test accuracy (%) vs. complexity (delay d). Legend: flossing during training, preflossing, no flossing.]

Figure 14: Gradient flossing before and during training improves trainability for orthogonal nets
A) Test accuracy for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR task $y_t = x_{t-d/2} \oplus x_{t-d}$ with gradient flossing during training (green), preflossing (orange), and with no gradient flossing (blue) for d = 70. Solid lines are means, transparent thin lines are individual network realizations. B) Same as A for delayed spatial XOR task with $y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$. C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$ for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, flossing for $E_f$ = 500 epochs on k = 75 Lyapunov exponents before training and during training for green lines, and only before training for orange lines.
Shaded areas are 25% and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is cross-entropy between y and ŷ.

First, we evolve an initially orthonormal system $Q_s = [q^s_1, q^s_2, \ldots, q^s_m]$ in the tangent space along the trajectory using the Jacobian $D_s$:

\[
\tilde{Q}_{s+1} = D_s Q_s \tag{26}
\]

A continuous system can be transformed into a discrete system by considering a stroboscopic representation, where the trajectory is only considered at certain discrete time points. We use here the notation of discrete dynamical systems, where this corresponds to performing the product of Jacobians along the trajectory $\tilde{Q}_{s+1} = D_s Q_s$. We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time ∆t = 1. The notation can be readily extended to continuous systems [43].
Second, we extract the exponential growth rates using the QR-decomposition,

\[
\tilde{Q}_{s+1} = Q_{s+1} R^{s+1},
\]

which uniquely decomposes $\tilde{Q}_{s+1}$ into an orthonormal matrix $Q_{s+1}$ of size N × k, so that $Q_{s+1}^\top Q_{s+1} = 1_{m \times m}$, and an upper triangular matrix $R^{s+1}$ of size k × k with positive diagonal elements. Geometrically, $Q_{s+1}$ describes the rotation of $Q_s$ caused by $D_s$, and the diagonal entries of $R^{s+1}$ describe the stretching and shrinking of the columns of $Q_s$, while the off-diagonal elements represent the shearing. Fig 15 visualizes $D_s$ and the QR-decomposition for k = 2.
The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of $R^s$:

\[
\lambda_i = \lim_{t \to \infty} \frac{1}{t} \log \prod_{s=1}^{t} R^s_{ii} = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \log R^s_{ii}. \tag{27}
\]

[Figure 15 diagram: the Jacobian $D_s$ maps the orthonormal vectors $q^s_1$, $q^s_2$ to $\tilde{q}^{s+1}_1$, $\tilde{q}^{s+1}_2$; the QR step yields $q^{s+1}_1$, $q^{s+1}_2$ with stretching factors $R^{s+1}_{11}$ and $R^{s+1}_{22}$.]

Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix $Q_s = [q^s_1, q^s_2, \ldots, q^s_m]$, whose columns are the axes of a k-dimensional cube, is rotated and distorted by the Jacobian $D_s$ into a k-dimensional parallelotope $\tilde{Q}_{s+1} = D_s Q_s$ embedded in $\mathbb{R}^N$.
The figure illustrates this for k = 2, in which case the columns of $\tilde{Q}_{s+1}$ span a parallelogram, which can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition reorthonormalizes $\tilde{Q}_{s+1}$ by decomposing it into the product of an orthonormal matrix $Q_{s+1} = [q^{s+1}_1, q^{s+1}_2, \ldots, q^{s+1}_m]$ and the upper-triangular matrix $R^{s+1}$. $Q_{s+1}$ describes the rotation of $Q_s$ caused by $D_s$. The diagonal entries of $R^{s+1}$ give the stretching/shrinking along the columns of $Q_{s+1}$; thus the volume of the parallelotope formed by the first m columns of $\tilde{Q}_{s+1}$ is given by $V_m = \prod_{i=1}^{m} R^{s+1}_{ii}$. The time-averaged logarithms of the diagonal elements of $R^s$ give the Lyapunov spectrum: $\lambda_i = \lim_{t_{sim} \to \infty} \frac{1}{t_{sim}} \log \prod_{s=1}^{t_{sim}} R^s_{ii} = \lim_{t_{sim} \to \infty} \frac{1}{t_{sim}} \sum_{s=1}^{t_{sim}} \log R^s_{ii}$.

Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often, i.e., once every $s_{ONS}$ steps such that $\tilde{Q}_{s+s_{ONS}} = D_{s+s_{ONS}-1} \cdot D_{s+s_{ONS}-2} \cdots D_s \cdot Q_s$ remains well-conditioned [44]. An appropriate reorthonormalization interval $s_{ONS} = t_{ONS}/\Delta t$ thus depends on the condition number, the ratio of the largest and smallest singular value:

\[
\kappa_2(\tilde{Q}_{s+s_{ONS}}) = \kappa_2(R^{s+s_{ONS}}) = \frac{\sigma_1(R^{s+s_{ONS}})}{\sigma_m(R^{s+s_{ONS}})} = \frac{R^{s+s_{ONS}}_{11}}{R^{s+s_{ONS}}_{mm}}. \tag{28}
\]

An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25) [54].
A simple example of this algorithm in pseudocode is:
+
+Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum
+ initialize h, Q
+ evolve h until it is on attractor (avoid initial transient)
+ evolve Q until it converges to the eigenvectors of the backward Oseledets matrix
+ set γ_i = 0
+ for t = 1 → T do
+     h ← f(h)
+     D ← df/dh
+     Q ← D · Q
+     if t ≡ 0 (mod s_ONS) then
+         Q, R ← qr(Q)
+         γ_i += log(R_ii)
+     end if
+ end for
+ λ_i = γ_i / T
+
+
+It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge
+towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example
+of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization
+time interval t_ONS is to first get a rough estimate of the Lyapunov spectrum using a short simulation time t_sim
+and a small t_ONS, and then repeat with a longer simulation time and a t_ONS based on this rough estimate.
+Another strategy is to first iteratively adapt t_ONS on a short simulation run to get an acceptable condition
+number. It should be noted that there exists a diversity of other methods to estimate the Lyapunov spectrum
+[14, 43, 68, 69].
+
+
+K Convergence of Lyapunov Exponents of RNNs
+In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov
+exponents λ_i for i = 1, 20, 60, 80 for different initial conditions but identical network realization.
+
+[Figure 16 plot: λ_i (1/steps) versus steps.]
+
+Figure 16: Convergence of Lyapunov exponents. Convergence of selected Lyapunov exponents
+λ_i for ten identical network realizations with different initial conditions with simulation time (i =
+1, 20, 60, 80) for σ = 1 and g = 1. (Other parameters: N = 80, t_sim = 100 steps, t_ONS = 1).
+
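The Jacobian-based procedure of Algorithm 2 can be sketched in NumPy roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `lyapunov_spectrum`, its arguments (`step`, `jacobian`, `n_transient`, `s_ons`, `seed`), and the random tanh network used at the end are all assumptions made for the example. The Jacobian is evaluated at the current state before the state is advanced, and absolute values of the diagonal of R are used because `np.linalg.qr` does not guarantee positive diagonal entries.

```python
import numpy as np


def lyapunov_spectrum(step, jacobian, h0, k, n_steps, n_transient=100, s_ons=1, seed=0):
    """Estimate the k largest Lyapunov exponents of the map h -> step(h).

    All names here are illustrative assumptions, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    h = np.asarray(h0, dtype=float)
    # Random orthonormal N x k basis of the tangent subspace.
    Q, _ = np.linalg.qr(rng.standard_normal((h.size, k)))
    # Discard the transient: h relaxes towards the attractor, Q aligns with it.
    for _ in range(n_transient):
        Q, _ = np.linalg.qr(jacobian(h) @ Q)
        h = step(h)
    gamma = np.zeros(k)
    n_used = (n_steps // s_ons) * s_ons  # only complete reorthonormalization intervals
    for t in range(1, n_used + 1):
        Q = jacobian(h) @ Q  # Jacobian at the current state, then advance the state
        h = step(h)
        if t % s_ons == 0:
            Q, R = np.linalg.qr(Q)
            # NumPy does not fix the sign of diag(R), so take absolute values.
            gamma += np.log(np.abs(np.diag(R)))
    return gamma / n_used


# Illustrative use on a random tanh network h -> tanh(W h) (hypothetical setup).
N, g = 80, 1.0
W = g * np.random.default_rng(1).standard_normal((N, N)) / np.sqrt(N)
step = lambda h: np.tanh(W @ h)
jacobian = lambda h: (1.0 - np.tanh(W @ h) ** 2)[:, None] * W
lam = lyapunov_spectrum(step, jacobian, np.random.default_rng(2).standard_normal(N), k=5, n_steps=1000)
print(lam)
```

Choosing `s_ons > 1` trades fewer QR-decompositions against a worse-conditioned product of Jacobians, which is the trade-off governed by the condition number in Eq 28.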
\ No newline at end of file diff --git a/papers/txt/gram2025_generative_recursive.txt b/papers/txt/gram2025_generative_recursive.txt new file mode 100644 index 0000000..d5f299d --- /dev/null +++ b/papers/txt/gram2025_generative_recursive.txt @@ -0,0 +1,1532 @@ + Generative Recursive Reasoning + + + Junyeob Baek1†∗ Mingyu Jo1†∗ Minsu Kim1,2 + + Mengye Ren3 Yoshua Bengio2,4 Sungjin Ahn1,3† +arXiv:2605.19376v2 [cs.AI] 20 May 2026 + + + + + 1 + KAIST 2 Mila – Québec AI Institute + 3 + New York University 4 Université de Montréal + + + + Abstract + How should future neural reasoning systems implement extended computation? + Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres- + sive sequence extension by performing iterative latent-state refinement with shared + transition functions. Yet existing RRMs are largely deterministic, following a + single latent trajectory and converging to a single prediction. We introduce Gen- + erative Recursive reAsoning Models (GRAM), a framework that turns recursive + latent reasoning into probabilistic multi-trajectory computation. GRAM models + reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna- + tive solution strategies, and inference-time scaling through both recursive depth + and parallel trajectory sampling. This yields a latent-variable generative model + supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs, + unconditional generation via pθ (x). Trained with amortized variational inference, + GRAM improves over deterministic recurrent and recursive baselines on structured + reasoning and multi-solution constraint satisfaction tasks, while demonstrating an + unconditional generation capability. https://ahn-ml.github.io/gram-website + + + 1 Introduction + A central question for future neural reasoning systems is how extended computation should be imple- + mented. 
Large autoregressive models typically scale reasoning by extending a sequence-generation
+ process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+ plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+ Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+ than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+ because it decouples reasoning depth from both parameter scale and output length: a compact model
+ can perform many steps of internal computation by repeatedly applying shared transition functions
+ over time.
+ Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+ potential of this approach in structured reasoning. Rather than producing a solution in a single
+ feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+ supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+ latent dynamics. These features make them well suited to problems requiring constraint propagation,
+ state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+
+ ∗ Equal contribution
+ † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn
+ (sungjin.ahn@kaist.ac.kr)
+
+ Preprint.
+
+[Figure 1 panels: an input task with Solution 1 and Solution 2; (a) Deterministic RRMs, (b) GRAM (Ours).]
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions.
+Right: Given three independent runs for latent reasoning (τ1, τ2, τ3): (a) Prior RRMs (e.g., HRM, TRM) are
+deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories that reach multiple valid solutions y1 and y2,
+naturally enabling parallel inference-time scaling.
+
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm.
This +motivates the central question of our work: can recursive latent computation support probabilistic, +generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models? +In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns +recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning +process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition +conditioned on the input and the current reasoning state, rather than deterministically updating to a +single next state. Repeating this process defines a distribution over possible reasoning trajectories, +allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale +inference not only by increasing recursive depth but also by sampling trajectories in parallel. From +a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by +marginalizing over latent reasoning trajectories, while the same recursive process can also define an +unconditional generative model pθ (x) when the input is fixed or absent. +We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar- +chitectural properties targeted by our formulation: recursive refinement, stochastic exploration, +multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on +comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines, +including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs +whose training data, inference budgets, and external scaffolding are not directly comparable. 
Sudoku- +Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract +transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized +MNIST [15] probes the unconditional generative interpretation. +Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle +for future recurrent and recursive reasoning architectures. Concretely, we make three contributions. + + + 2 +First, we formulate recursive reasoning as a latent-variable generative process, where solutions are +obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based +inference-time scaling, enabling inference to scale not only with recursive depth but also with the +number of sampled latent trajectories. Third, we provide empirical evidence that this formulation +yields the intended architectural advantages over deterministic recurrent and recursive baselines, +improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation. + +2 Generative Recursive Reasoning Models +In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation +of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training +procedure in Section 2.2, with an architecture schematic shown in Figure 2. + +2.1 Architecture + +Overview. GRAM models the conditional distri- 𝑦 (A) CE loss +bution pθ (y | x) by marginalizing over stochas- + Prior (B) KL Div. Posterior Decoder +tic latent reasoning trajectories. Given an input 𝑝𝜃 (⋅ |𝑢𝑡 ) 𝑞𝜙 (⋅ |𝑢𝑡 , 𝑦) ~ 𝜖𝑡 𝑓dec +x, GRAM first computes an embedding + ℎ𝑡−1 𝑓𝐻 𝑢𝑡 ℎ𝑡 + ex = fenc (x; θ), (1) + 𝐾 times +which is reused throughout the entire recursive 𝑙𝑡−1 𝑓𝐿 𝑓𝐿 𝑙𝑡 +computation. Starting from a fixed initial la- + Encoder +tent state z0 , the model evolves the latent state 𝑥 + 𝑓enc +through learned stochastic transitions. 
The re- +cursive computation is organized into two nested Figure 2: GRAM Architecture. A single stochastic +levels: inner and outer loops. latent transition in the hierarchical instantiation z = +At the inner level, a latent transition samples (h, l). After K low-level refinements via fL , the high- + level update fH produces a deterministic proposal ut , to +a new latent state conditioned on the previous which stochastic guidance ϵt is added: ht = ut + ϵt . +latent state and the input embedding, + zt ∼ pθ (zt | zt−1 , ex ), t = 1, . . . , T. (2) +At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer +to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A +supervision step is the unit at which the decoder is invoked, and the training objective is applied, with +gradients computed as described in Section 2.2. +At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision +step serving as the initial state of the next, thereby forming the full recursive computation: + (1) T transitions (1) (2) T transitions T transitions (N ) + z0 −−−−−−−→ zT = z0 −−−−−−−→ · · · −−−−−−−→ zT sup , (3) + (n) (1) +where zt denotes the latent state at the t-th transition of the n-th supervision step, z0 is the +fixed initial state, and the terminal state of one supervision step serves as the initial state of the next + (n+1) (n) +(z0 := zT ). This abstract formulation can be instantiated with various recurrent Transformer +backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7], +as well as hierarchical designs such as HRM and TRM [8, 9]. +Stochastic Latent Transitions. 
Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines
+p_θ(z_t | z_{t−1}, e_x) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update u_t from z_{t−1} and e_x, then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to u_t:
+    ϵ_t ∼ p_θ(ϵ_t | u_t) := N(µ_θ(u_t), σ_θ^2(u_t) I),    (4)
+    z_t = u_t + ϵ_t.    (5)
+We refer to ϵ_t as the learnable stochastic guidance. The mean µ_θ(u_t) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σ_θ^2(u_t) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by u_t.
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition z_{t−1} → z_t is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+    l_{t,k} = f_L(h_{t−1}, l_{t,k−1}, e_x; θ),    k = 1, ..., K,    (6)
+where l_{t,0} := l_{t−1} and we write l_t := l_{t,K} for the refined low-level component.
The high-level
+component is then updated as a stochastic transition conditioned on the refined l_t,
+    u_t = f_H(h_{t−1}, l_t; θ),    (7)
+    ϵ_t ∼ p_θ(ϵ_t | u_t) := N(µ_θ(u_t), σ_θ^2(u_t) I),    (8)
+    h_t = u_t + ϵ_t,    (9)
+and we set z_t = (h_t, l_t). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵ_t acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions^3. Under this instantiation, the decoder reads only the high-level component, i.e.,
+f_dec(z_T) = f_dec(h_T). Additional architectural details are provided in Appendix B.1.
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting p_θ(y | x), the same recursive process can also be defined as an unconditional generative model
+p_θ(x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2 Training
+
+GRAM is trained to model the conditional distribution p_θ(y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model p_θ, where
+the full latent trajectory τ = (z_0 → ··· → z_{T_Total}) consists of a sequence of latent variables, with
+T_Total = T × N_sup. The conditional likelihood is defined as
+    p_θ(y | x) = \int p_θ(y | τ, x) p_θ(τ | x) dτ,    (10)
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log p_θ(y | x) is intractable due to the marginalization
+over latent trajectories.
We therefore introduce a variational posterior q_ϕ(τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+    \log p_θ(y | x) ≥ E_{q_ϕ(τ | x, y)}[\log p_θ(y | τ, x)] − KL(q_ϕ(τ | x, y) ∥ p_θ(τ | x)).    (11)
+
+During training, latent trajectories are sampled from the variational posterior q_ϕ(· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior p_θ(· | x).
+Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+    p_θ(τ | x) = p(z_0) \prod_{t=1}^{T_Total} p_θ(z_t | z_{t−1}, x),    q_ϕ(τ | x, y) = p(z_0) \prod_{t=1}^{T_Total} q_ϕ(z_t | z_{t−1}, x, y).    (12)
+Here, z_0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵ_t after a deterministic update u_t; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution q_ϕ(ϵ_t | u_t, y),
+whereas the prior uses p_θ(ϵ_t | u_t).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ_{1:T_Total}, their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+p_θ(y | τ, x) = p_θ(y | z_{T_Total}, x). Therefore, the full trajectory-level ELBO can be written as
+    L_ELBO = E_{q_ϕ}[\log p_θ(y | z_{T_Total}, x)] − \sum_{t=1}^{T_Total} E_{q_ϕ(ϵ_{<t} | x, y)}[KL(q_ϕ(ϵ_t | u_t, y) ∥ p_θ(ϵ_t | u_t))].    (13)
+Here, u_t = f_H(h_{t−1}, l_t) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since u_t depends on h_{t−1}, which is determined by the previously sampled noise
+variables ϵ_{<t} := (ϵ_1, ..., ϵ_{t−1}), the expectation averages over these ancestral samples.
+
+ ^3 We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over N_sup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after T_Total = T × N_sup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+z_{T−1}^{(n)} → z_T^{(n)}. This gives the following surrogate objective for each supervision step:
+    L_GRAM^{(n)}(x, y; θ, ϕ) = E_{q_ϕ}[\log p_θ(y | z_T^{(n)}, x)] − KL(q_ϕ(ϵ_T^{(n)} | u_T^{(n)}, y) ∥ p_θ(ϵ_T^{(n)} | u_T^{(n)})),    (14)
+where z_T^{(n)} is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, L_GRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3 Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1).
For width, the focus of this section, we draw {τ^{(i)}}_{i=1}^{N} ∼ p_θ(τ | x) from the
+learned prior and decode each terminal state into a candidate output ŷ^{(i)} = f_dec(z_T^{(i)}), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head v_ψ(z_t) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
+
+3 Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+
+[Figure 3 panels: bar charts of accuracy (%) on Sudoku, ARC-AGI-1, and ARC-AGI-2 for recursive models
+(Looped TF, HRM, TRM), GRAM (Ours), and LLMs (o3-mini-high, GPT 5.2 (low), Grok-4-thinking).]
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning.
GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+p_θ(y | x) and unconditional p_θ(x) generation.
+
+
+4 Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1 Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues re-
+quiring extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+model results [40] are included as external reference points for benchmark difficulty, rather than
+as controlled baselines, since their training and inference settings are not directly comparable to
+task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+
+[Figure 4 panels: (Left) Accuracy versus Iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and
+GRAM (Ours) with N = 1, 5, 10, 20, 50; (Right) Accuracy versus Number of Solutions (2 to 18) for Looped TF,
+HRM, TRM, and GRAM (N=1).]
+
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+iteration corresponds to a supervision step, which amounts to K× more flat iterations for Looped TF. (Right)
+Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+problem spaces more effectively than models constrained to a single sequential refinement process.
+Detailed experiment results, including more state-of-the-art methods, are provided in Appendix D.1.
+Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+number of parallel samples consistently improves performance across all iteration counts.
Notably,
+GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+including TRM (97.0% vs 90.5%), at a comparable computational budget. While deterministic
+recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+4.2 Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
+Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+consistent performance regardless of solution count. This confirms that deterministic latent updates
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+                                               N-Queens 8×8            N-Queens 10×10          Graph Coloring 8-vertex Graph Coloring 10-vertex
+Method                   Rec.  Gen.  # Params  Accuracy    Coverage    Accuracy    Coverage    Conflict↓   Coverage    Conflict↓   Coverage
+Direct Pred (8 layers)   ✗     ✗     27M       40.4±1.1    13.7±1.1    13.6±0.5    1.6±0.2     179.3±4.0   19.9±0.2    198.7±5.0   6.7±0.1
+Direct Pred (32 layers)  ✗     ✗     100M      40.2±1.3    13.6±1.1    13.1±0.4    1.6±0.2     174.0±18.0  19.1±1.7    227.7±34.5  6.5±1.9
+Looped TF                ✓     ✗     7M        68.4±3.7    23.6±1.9    50.0±7.6    6.2±3.2     136.0±16.1  20.5±1.5    157.3±9.0   7.2±0.7
+HRM                      ✓     ✗     27M       78.7±2.9    26.7±1.3    37.4±0.3    4.7±0.1     109.7±1.5   21.8±0.3    164.3±21.6  8.9±1.7
+TRM                      ✓     ✗     7M        66.8±5.7    36.1±22.5   17.5±11.2   2.0±1.3     109.3±3.1   22.3±0.6    170.7±17.9  6.8±0.3
+AR                       ✗     ✓     10.6M     96.3±1.0    84.8±0.8    90.0±2.2    53.2±0.8    19.0±11.3   83.0±0.7    61.3±8.3    40.0±0.3
+MDLM                     ✗     ✓     12.6M     96.1±1.5    87.2±0.6    74.3±6.6    47.4±2.2    2.7±0.6     84.5±4.0    12.0±7.0    48.2±1.4
+GRAM (Ours)              ✓     ✓     10M       99.7±0.3    90.3±1.9    89.7±2.7    57.5±3.4    2.7±2.1     85.8±0.5    3.3±1.5     51.3±2.8
+
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On 8 × 8 N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR.
This demonstrates that recursive refinement enables stricter constraint satisfaction than generative sampling alone.

4.3 Exploring GRAM as an Unconditional Generator

Setup. To investigate GRAM's unconditional generative capability beyond conditional reasoning, we evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards, evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42] and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks, and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison with existing literature, FID is calculated using real samples from the original standard MNIST.

Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓). For iterative models, a step corresponds to a supervision step for TRM and GRAM, and a denoising step for D3PM. FID is calculated using real samples with original pixel values (0–255).

  Method               IS (↑)   FID (↓)
  VAE                  1.70     86.28
  D3PM (1000 steps)    1.86     74.03
  TRM (16 steps)       1.00     303.29
  GRAM (Ours)
    8 steps            1.85     84.08
    16 steps           1.89     77.79
    32 steps           1.91     76.65
    64 steps           1.95     75.39
    128 steps          1.99     74.30
    256 steps          2.04     73.34

Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines that use up to 55.1M parameters and 1000 denoising steps.
Figure 7 shows qualitative examples, illustrating that the model produces diverse, fully valid boards from empty inputs without any explicit constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID 303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together, these results indicate that GRAM's stochastic latent transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction emerging as a natural byproduct of the recursive generative process.

Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM achieves higher validity than D3PM with substantially fewer parameters and steps.

[Figure 6 image. Panel (a) Generation Process: one row per model, with D3PM shown at t = 0, 100, 200, ..., 1000 and TRM and GRAM shown at t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16. Panel (b) Samples: Sample 1 to Sample 4.]

Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.

Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a), components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential: removing either significantly degrades performance.

  (a) Architecture Ablation.
  Model variant                Sudoku          N-Queens
  base (Looped TF)             61.25           71.30
  + DS + HR (=HRM, TRM)        55.00 / 87.40   80.70 / 72.90
  + SG                         65.64           86.30
  + DS + SG                    73.90           100.00
  + DS + HR + SG (=GRAM)       93.96           99.69

  (b) Mechanism Ablation.
  Model variant                Sudoku          N-Queens
  GRAM (ours)                  93.96           99.69
  w/o stochastic guidance      82.87           72.91
  stochasticity only           94.88           50.27
  guide only                   0.00            0.00
  w/ direct prediction         63.43           61.44
  TRM w/ stochastic decoder    82.87           71.66
  TRM w/ random init.          78.53           71.82

Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8 to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process; additional samples are in Section D.4.

4.4 Ablation Study

We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples. Table 3 summarizes the results.

Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100% on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent gains in every configuration, supporting our design of stochastic guidance as the core extension introduced by GRAM.

Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the learned distribution ϵ_t ∼ N(µ_θ, σ_θ² I) in Equation (4).
Removing guidance (N(0, σ_θ² I)) maintains Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths. However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to navigate multi-solution spaces. Removing stochasticity (N(µ_θ, 0)) fails completely (0.0% on both tasks), as deterministic guidance conditioned on the target leads to severe overfitting.

Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to TRM: (1) stochastic decoding, which samples from the output distribution instead of taking the argmax, and (2) random initialization, which samples z_0 from a Gaussian N(0, I) at each inference. Neither improves performance, demonstrating that GRAM's gains stem from the variational framework rather than mere randomness.

[Figure 7 image: 9 × 9 Sudoku boards, with D3PM shown at t = 0, 125, 250, 375, 500, 625, 750, 875, 1000 and GRAM shown at iter = 0, 1, 2, 4, 6, 8, 10, 13, 16.]

Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are highlighted in red.

5 Conclusions and Limitations

We introduced GRAM, a generative framework that transforms deterministic recursive architectures into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amortized variational inference. For reasoning problems, introducing stochasticity into latent transitions enables diverse solution discovery and improved exploration compared to deterministic counterparts. Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose extension that consistently improves every recursive architecture we tested, and that the gains stem specifically from the variational framework, not from mere randomness, as naive stochastic alternatives applied to existing models yield no improvement.

Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model through recursion-based generation over inputs, with generation quality improving monotonically with recursive depth even beyond training-time steps. This suggests new directions for generative modeling via hierarchical recursion.
Despite these strengths, the sequential nature of deep supervision limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM toward larger foundation models.

Acknowledgment

This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT). This work was also supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the members of the MLML for valuable discussions and feedback throughout this project.

Broader Impacts

GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way to improve reasoning through inference-time computation rather than parameter scaling alone. Its generality also entails risks: plausible but invalid generations may be mistaken for verified solutions in downstream decision-making pipelines, and multi-sample inference may increase computational and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and domain-specific safeguards.
+ +References + [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, + Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. + Advances in neural information processing systems, 35:24824–24837, 2022. + + [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik + Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad- + vances in neural information processing systems, 36:11809–11822, 2023. + + [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas + Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph + of thoughts: Solving elaborate problems with large language models. In Proceedings of the + AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024. + + [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong + Tian. Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.06769, 2024. + + [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning + by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint + arXiv:2505.12514, 2025. + + [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh + Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and + reasoning. arXiv preprint arXiv:2505.23648, 2025. + + [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers + are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023. + + [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and + Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025. + + [9] Alexia Jolicoeur-Martineau. 
Less is more: Recursive reasoning with tiny networks. arXiv + preprint arXiv:2510.04871, 2025. + + + 11 +[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- + versal transformers. arXiv preprint arXiv:1807.03819, 2018. +[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011. +[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017. +[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. +[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. +[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document + recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. +[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of + recurrent network trajectories. Neural computation, 2(4):490–501, 1990. +[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv + preprint arXiv:1705.08209, 2017. +[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold- + son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute + with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. +[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation + beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025. +[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong + Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous + concept space. arXiv preprint arXiv:2505.15778, 2025. +[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. 
Soft tokens, + hard truths. arXiv preprint arXiv:2509.19170, 2025. +[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: + Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint + arXiv:2502.21074, 2025. +[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu + Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv + preprint arXiv:2505.18454, 2025. +[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, + Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic + recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025. +[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of- + thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint + arXiv:2310.10845, 2023. +[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. + Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint + arXiv:2410.20672, 2024. +[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990. +[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): + 1735–1780, 1997. +[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, + Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder- + decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. + + + 12 +[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu + Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv + preprint arXiv:1909.11942, 2019. +[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. 
arXiv + preprint arXiv:1910.10073, 2019. +[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint + arXiv:1603.08983, 2016. +[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua + Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv. + org/abs/1506.02216. +[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural + models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571. +[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https: + //arxiv.org/abs/1511.05121. +[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and + James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https: + //arxiv.org/abs/1811.04551. +[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with + discrete world models. arXiv preprint arXiv:2010.02193, 2020. +[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains + through world models. arXiv preprint arXiv:2301.04104, 2023. +[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information + processing systems, 30, 2017. +[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI + benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22. +[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, + Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language + models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024. +[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, + 2018. 
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. + Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in + neural information processing systems, 30, 2017. +[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. + Structured denoising diffusion models in discrete state-spaces. Advances in neural information + processing systems, 34:17981–17993, 2021. +[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint + arXiv:1312.6114, 2013. +[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: + Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. +[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. +[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in + discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024. +[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings + of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. + + + 13 +[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning + applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002. +[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep + convolutional neural networks. Advances in neural information processing systems, 25, 2012. +[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network + function approximation in reinforcement learning. Neural networks, 107:3–11, 2018. +[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference + on computer vision (ECCV), pages 3–19, 2018. +[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 
arXiv preprint + arXiv:1711.05101, 2017. +[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity + natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. +[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. + Advances in neural information processing systems, 33:12438–12448, 2020. +[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica + Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/ + BF02066689. URL https://doi.org/10.1007/BF02066689. +[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep + learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE + 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. + IEEE, 2019. +[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum + Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141. +[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com- + mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID: + 13091446. + + + + + 14 +A Additional Method Details +A.1 Adaptive Computation Time + +GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each +trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision +steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9]. +Halt head. The decoder includes an auxiliary head qψ : RD → R2 that maps the high-level state h +to two scalar values, qψ (h) = (q halt , q continue ). These are interpreted as estimated Q-values for the +binary action of halting or continuing computation at the current supervision step. +Training. 
The halt head is trained jointly with the main objective via a temporal-difference loss. After computing the latent state $z_T^{(n)}$ at the end of supervision step n, we form Q-learning targets:

  • $\hat{q}_n^{\mathrm{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct prediction.

  • $\hat{q}_n^{\mathrm{continue}} = \max\big(q_{n+1}^{\mathrm{halt}},\, q_{n+1}^{\mathrm{continue}}\big)$, the bootstrapped value of running one more supervision step.

The halt head is trained by regression to these targets:

  \mathcal{L}_{\mathrm{ACT}} = \sum_{n=1}^{N_{\mathrm{sup}}} \Big[ \big(q_n^{\mathrm{halt}} - \hat{q}_n^{\mathrm{halt}}\big)^2 + \big(q_n^{\mathrm{continue}} - \hat{q}_n^{\mathrm{continue}}\big)^2 \Big].    (15)

This auxiliary loss is added to the main training objective and contributes only through the halt head; it does not propagate gradients into the recursive core.

Inference. At inference, computation proceeds one supervision step at a time. After each step n, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\mathrm{halt}} > q_n^{\mathrm{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise, computation continues to the next supervision step, up to a maximum budget of $N_{\mathrm{sup}}^{\max}$ steps. Different trajectories sampled in parallel may therefore terminate at different depths, complementing the parallel-sampling scheme described in Section 2.3. In practice, we found that using only $q^{\mathrm{halt}}$ (halting when $\sigma(q^{\mathrm{halt}}) > 0.5$, without the continue branch) performs comparably while simplifying implementation; our released code uses this variant.

A.2 Latent Process Reward Model (LPRM)

To rank or select among sampled candidates, we train a value head $v_\psi(z_t)$ to predict the expected accuracy of the final output, conditioned on the current latent state $z_t$. The LPRM is trained jointly with the main objective via a regression loss:

  \mathcal{L}_{\mathrm{LPRM}} = \sum_{t=1}^{T} \big(v_\psi(z_t) - r\big)^2,    (16)

where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
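
The two auxiliary objectives in this appendix (Equations (15) and (16)) and the halting rule can be sketched in a few lines of plain Python. This is a minimal illustration and not the released implementation: the function names are hypothetical, the Q-values and value-head outputs are passed in as plain lists of floats, and the handling of the last supervision step (where no step n+1 exists to bootstrap from) is an assumption, since the text does not specify it.

```python
from typing import List, Sequence, Tuple


def act_targets(correct: Sequence[bool],
                q_halt: Sequence[float],
                q_continue: Sequence[float]) -> List[Tuple[float, float]]:
    """Q-learning targets (halt, continue) for each supervision step."""
    n_sup = len(correct)
    targets = []
    for n in range(n_sup):
        # Halt target: 1 if decoding the current state is correct, else 0.
        halt_target = 1.0 if correct[n] else 0.0
        if n + 1 < n_sup:
            # Bootstrapped value of running one more supervision step.
            cont_target = max(q_halt[n + 1], q_continue[n + 1])
        else:
            # ASSUMPTION: at the last step there is nothing to bootstrap
            # from, so we reuse the halt target.
            cont_target = halt_target
        targets.append((halt_target, cont_target))
    return targets


def act_loss(correct: Sequence[bool],
             q_halt: Sequence[float],
             q_continue: Sequence[float]) -> float:
    """Sum of squared errors against the targets, as in Equation (15)."""
    total = 0.0
    for n, (h_t, c_t) in enumerate(act_targets(correct, q_halt, q_continue)):
        total += (q_halt[n] - h_t) ** 2 + (q_continue[n] - c_t) ** 2
    return total


def lprm_loss(values: Sequence[float], r: float) -> float:
    """Regress every per-step value estimate onto the final accuracy r (Eq. 16)."""
    return sum((v - r) ** 2 for v in values)


def halting_step(q_halt: Sequence[float], q_continue: Sequence[float]) -> int:
    """1-indexed step at which inference stops (budget = len(q_halt))."""
    for n, (qh, qc) in enumerate(zip(q_halt, q_continue), start=1):
        if qh > qc:
            return n
    return len(q_halt)


if __name__ == "__main__":
    qh, qc = [0.2, 0.9], [0.5, 0.1]
    print(act_targets([False, True], qh, qc))
    print(act_loss([False, True], qh, qc))
    print(lprm_loss([0.5, 0.75, 1.0], 1.0))
    print(halting_step(qh, qc))
```

In a real training loop the targets would be treated as constants (no gradient flows through them), consistent with the statement that the auxiliary loss contributes only through the halt head.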

A.3 Empirical Validation of the Surrogate Objective

We further analyze the approximation introduced by the surrogate training objective L_GRAM used in Section 2.2, both qualitatively and empirically.

Truncation as a gradient approximation. We frame L_GRAM as a gradient approximation rather than a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL terms across all T_Total transitions, and computing its exact gradient requires backpropagation through the entire trajectory. To enable training with constant memory, we propagate gradients only through the final transition of each supervision step. This is a standard practice in recurrent latent variable models with long computation chains: ELBOs over truncated sequences are used, for example, in VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore well-precedented; what is specific to GRAM is applying this approximation at the level of recursive reasoning trajectories rather than temporal sequences.

Empirical validation. To verify that optimizing L_GRAM effectively drives improvement in the full variational bound, we compute both quantities on the validation set throughout training. The full ELBO L_ELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions across all T_Total transitions; the surrogate objective is evaluated as the average of L_GRAM^(n) over the N_sup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.

[Figure 8 image: two panels, Sudoku-Extreme (left) and N-Queens 8 × 8 (right). x-axis: Training Step; y-axis: ELBO; curves: GRAM (training objective) and ELBO (full).]

Figure 8: Full ELBO L_ELBO and surrogate objective L_GRAM throughout training (plotted as −ELBO, smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease monotonically over training, indicating that gradient updates of L_GRAM consistently improve the full variational bound. The two curves do not coincide because L_ELBO sums KL contributions across all T_Total transitions while L_GRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to the large dynamic range.

Both L_ELBO and L_GRAM improve monotonically throughout training on both tasks. This indicates that, despite the truncation, gradient updates of L_GRAM effectively drive improvement in the full variational bound. Since L_ELBO also serves as an indirect estimate of the negative log-likelihood, its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood, even though training relies on the surrogate.

The gap between the two curves in Figure 8 reflects the structural difference between the two quantities (L_ELBO accumulates KL terms across all transitions while L_GRAM evaluates only the final-step KL of each supervision step) rather than an optimization failure. This gap is consistent with L_GRAM being a biased but useful surrogate for L_ELBO.

B Training and Architecture Details

B.1 Architecture Details

GRAM consists of three components: Encoder, Recursive Core, and Decoder.

Encoder.
Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings (RoPE) [46]. The embeddings are scaled by √D and prepended with 16 puzzle embedding tokens [8].

Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer step, the low-level state is refined K times via l ← f_L(l, h + e_x), injecting the input embedding at each iteration. The high-level state is then updated via h ← f_H(h, l). Both f_L and f_H share the same architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we use a [SwiGLU + SwiGLU] network in the Recursive Core instead of [Attention + SwiGLU], following [9]. To initialize z_0 = (h_0, l_0), we sample once from the standard Gaussian distribution N(0, I) and store the sample in the network checkpoint, so z_0 has the same fixed value whenever the checkpoint is loaded.

Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions) and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and correctness values from the first token of h.

Table 4: Architecture components.

  Component        Module               Description
  Encoder          Token Embedding      vocab → D
                   Puzzle Embedding     16 tokens (optional, for ARC)
                   Position Encoding    RoPE or learned
  Recursive Core   f_L, f_H             [Attention + SwiGLU] × 2 layers
                   Iterations           K low-level, T high-level steps
                   µ_θ, σ_θ, µ_ϕ, σ_ϕ   SwiGLU MLP for each parameter
  Decoder          LM Head              Linear(D → vocab)
                   Q Head               Linear(D → 2) for halt
                   V Head               Linear(D → 1) for value

Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first construct a binarized dataset by normalizing the original discrete pixel values (0–255) to the continuous range [0, 1] and applying a threshold at 0.5.
For the network architecture, we employ a convolutional patch encoder, following [48, 49].

The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification, the normalized image passes through a shallow convolutional encoder. This encoder consists of two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting feature map is divided into non-overlapping patches of size P × P and linearly projected to match the model's hidden dimension D. The detailed architectural specifications and dimension transitions are summarized in Table 5.

Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input channels, P patch size, N_p the number of patches, and D the hidden dimension.

  Stage      Layer / Operation                       Output Dim.
  1. Norm.   Input Tokens                            (B, C, H, W)
             Linear Scaling [−1, 1]                  (B, C, H, W)
  2. Conv    Conv2d 5 × 5 (p = 2), SiLU → GN(32)     (B, D/2, H, W)
             Conv2d 5 × 5 (p = 2), SiLU → GN(32)     (B, D/2, H, W)
  3. Patch   Flatten Patches                         (B, N_p, P² · D/2)
             Linear Projection                       (B, N_p, D)

Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are represented as sequences of shape [B, L], where B denotes the batch size and L the context length. Each input sequence includes 16 fixed puzzle embedding tokens. The latent states h_t and l_t, as well as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39] backbone uses embedding dimension D = 512, N_head = 8 attention heads, and FFN hidden dimension D_h = 512.
Within a recursion step, meaning a latent transition z_t → z_{t+1}, we use low-level (inner) steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2 Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized in Table 6).
+
+  • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11 (0=pad, 1=blank, 2–10=digits).
+  • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-specific puzzle embeddings are prepended to distinguish different ARC tasks.
+  • N-Queens flattens an N × N board row-by-row into N^2 tokens with vocabulary size 3 (0=pad, 1=empty, 2=queen).
+  • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens, using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+  • For image generation on MNIST [15], images are quantized and processed via CNN-based patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. The patched input then forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3 (0=pad, 1=black, 2=white).
+
+Table 6: Task-specific configurations.
+  Task            Seq. Len   Vocab  Puzzle Emb  Encoding
+  Sudoku          81         11     ✗           9×9 grid, row-major
+  ARC-AGI         900        12     ✓           30×30 padded canvas
+  N-Queens        N^2        3      ✗           N × N board
+  Graph Coloring  n(n−1)/2   6      ✗           Strict adjacency upper triangle
+  MNIST           196        3      ✗           14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10^-4, weight decay 1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56]. To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8.
The number of deep supervision steps is N_sup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1 (ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes), and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+  Task                       Epochs  GPUs  Time
+  Sudoku                     50K     8     2h
+  ARC-AGI                    200K    8     5 days
+  N-Queens (8×8)             3K      8     1h
+  N-Queens (10×10)           1K      8     3h
+  Graph Coloring (8 nodes)   5K      8     1.5h
+  Graph Coloring (10 nodes)  5K      8     6h
+  MNIST                      1.8K    8     16h
+
+C Additional Details of Experiment Setup
+C.1 Challenging Puzzle Tasks
+
+C.1.1 Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs, batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme already takes 19 hours, and extrapolating to ARC-AGI, which uses substantially longer sequences and a larger training set, suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM, TRM, and GRAM perform N_sup gradient updates per trajectory (one per segment), whereas Looped TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum recursion depth, increasing per-example sequential compute. Both inefficiencies compound at ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2 Multi-solution Puzzle Tasks
+
+C.2.1 N-Queens Problem
+
+[Figure 9 panels: Input, Solution 1, Solution 2, Solution 3]
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the full board, leaving 3 queens.
The model must find the positions of the remaining queens. This configuration admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N chessboard such that no two queens attack each other, meaning no two queens share the same row, column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8 solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and N = 10. We then created puzzle instances by removing a specific number of queens, treating the remaining partial configuration as the input and the original complete board as the target label. To generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8 setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure the model learns to reason rather than memorize, the split was performed based on unique input configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs are flattened into discrete 1D sequences x ∈ {0, 1, 2}^L, where L = N^2, along with zero-padded puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+[Figure 10 panels: 8x8 N-Queens, 10x10 N-Queens; x-axis: Number of solutions; y-axis: Counts]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset covers a wide range of solution counts, testing the model's ability to recover multiple valid outputs.
+
+C.2.2 Graph Coloring Problem
+Data Generation Details.
The Graph Coloring problem requires assigning one of k colors to each node in a graph such that no two adjacent nodes share the same color. We consider graphs with N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8 nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are 3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12. The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and 192 test instances for N = 10.
+Input and Output Representation. The input graph is represented by extracting the upper triangular portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The output is a sequence of length N, where each position encodes the assigned color for the corresponding node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red, blue, and green respectively.
+
+[Figure 11 panels: Input, Solution 1, Solution 2, Solution 3, Solution 4]
+Figure 11: Graph Coloring Example
+
+[Figure 12 panels: Vertex 8 Graph Coloring, Vertex 10 Graph Coloring; x-axis: Number of solutions; y-axis: Counts]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The dataset covers a wide range of solution counts, testing the model's ability to recover multiple valid outputs.
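The input encoding and the canonical-coloring enumeration described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the canonicalization rule (relabeling colors in order of first appearance) is an assumption, since the text only states that colorings equivalent under color permutation are collapsed. Brute-force enumeration is feasible here because 3^10 is only 59,049 assignments.

```python
from itertools import product


def encode_upper_triangle(adj):
    """Flatten the strict upper triangle row by row (1 = no edge, 2 = edge)."""
    n = len(adj)
    return [2 if adj[i][j] else 1 for i in range(n) for j in range(i + 1, n)]


def is_valid_coloring(adj, colors):
    """True if no edge joins two nodes that carry the same color."""
    n = len(adj)
    return all(
        not adj[i][j] or colors[i] != colors[j]
        for i in range(n)
        for j in range(i + 1, n)
    )


def canonical_form(colors):
    """Relabel colors by order of first appearance (assumed canonicalization)."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in colors)


def canonical_colorings(adj, k=3):
    """Enumerate all valid k-colorings and keep one representative per class."""
    n = len(adj)
    found = set()
    for colors in product(range(k), repeat=n):
        if is_valid_coloring(adj, colors):
            found.add(canonical_form(colors))
    return sorted(found)


def to_output_tokens(colors):
    """Map color ids 0..k-1 to output tokens 3 + color_id."""
    return [3 + c for c in colors]


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 3-node path graph 0-1-2
    print(encode_upper_triangle(path))
    for coloring in canonical_colorings(path):
        print(coloring, to_output_tokens(coloring))
```

A graph is 3-colorable exactly when `canonical_colorings` returns a non-empty list, so the same routine could also serve as the retention filter mentioned above.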
+
+
+D Additional Experiment Results
+D.1 Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consistently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0% in both cases) further confirms that recursive computation is essential for these tasks: single-pass models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM's gains arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including DeepSeek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI, more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale; we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior recursive models. All recursive model scores were obtained at 16 supervision steps.
+  Method            #Params  Sudoku  ARC-1  ARC-2
+  Large Reasoning Models
+  DeepSeek-R1       671B     0.0     15.8   1.3
+  Claude 3.7 16k    N/A      0.0     28.6   0.7
+  o3-mini-high      N/A      0.0     34.5   3.0
+  GPT 5.2 (low)     N/A      –       55.7   9.7
+  Grok-4-thinking   1.7T     –       66.7   16.0
+  Gemini 3 Pro      N/A      –       75.0   31.1
+  Recursive Models
+  Direct Pred       27M      0.0     21.0   0.0
+  Looped TF         7M       61.3    -      -
+  HRM               27M      55.0    40.3   5.0
+  TRM               7M       87.4    44.6   7.8
+  GRAM (Ours)       10M      97.0    52.0   11.1
+  Human Results
+  Avg. Human        –        –       60.2   –
+  Best Human        –        –       98.0   100.0
+
+D.2 Scaling with Parallel Sampling on the ARC-AGI Challenge
+
+To investigate the effect of GRAM's sampling on the ARC-AGI-1 benchmark, we measured performance without relying on external data augmentation. Typically, TRM achieves its reported accuracy by generating 1,000 augmentations for a single problem and performing majority voting over the results. Because this augmentation process itself creates a wide variety of samples, we isolated the specific effect of generative sampling by performing inference solely on the original problem instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe that accuracy consistently improves as the model generates more parallel samples. This trend mirrors observations in Section 4.2, suggesting that increased inference-time compute through width scaling allows the model to explore more plausible reasoning trajectories and recover from initial errors, eventually leading to more robust solution discovery.
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine higher levels of augmentation with extensive parallel sampling? To address this, we conducted an ablation study examining the interaction between data augmentation and inference-time sampling. Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50).
Without augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements, demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However, as the level of augmentation increases, the marginal benefit of additional sampling diminishes substantially. At Aug=50, performance saturates regardless of sample count: accuracy remains nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation and sampling serve complementary rather than additive roles: both mechanisms enable the model to capture solution diversity, but through different means. When training data is limited, parallel sampling compensates by exploring varied reasoning paths at inference time. When training data is abundant through augmentation, the model has already internalized sufficient diversity during training, rendering additional inference-time exploration redundant. Consequently, scaling sampling beyond augmentation provides diminishing returns, justifying our experimental design choice to evaluate these two scaling axes separately.
+
+[Figure 13 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; series: TRM and three GRAM settings (0.1, 0.05, 0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling as the number of parallel samples N increases, outperforming the TRM baseline.
+
+[Figure 14 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; series: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency.
With limited augmentation (Aug=0), parallel sampling provides consistent gains. As augmentation increases, sampling benefits diminish: at Aug=50, performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary roles in capturing solution diversity.
+
+D.3 Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to deterministic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in coverage as the number of possible solutions increases. Since these models are constrained to a single fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms that GRAM's stochastic guidance mechanism is essential for navigating complex problem spaces where multiple valid reasoning paths exist.
+
+D.4 Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM's capability in unconditional image generation. Figure 16 presents a diverse set of samples generated on the binarized MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+[Figure 15 panels: (a) N-Queens 8 × 8, (b) N-Queens 10 × 10; x-axis: Number of Solutions; y-axis: Coverage; series: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N increases.
+
+As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting from a black initialization, the model iteratively adds details and sharpens the structure of the digit. A particularly compelling property of this process is the model's ability to recover from initially ambiguous or incorrect formations.
+For instance, in the second row (generating the digit '2') and the last row (generating the digit '1'), the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However, as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors, resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+[Figure 16 columns: t=0, t=1, t=2, t=4, t=6, t=7, t=9, t=11, t=12, t=14, t=16]
+Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its recursive refinement process.
+
+D.5 Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation.
Unlike the conditional Sudoku-solving setting, where the input board contains given clues, the model receives an entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated boards. To check whether high validity comes from repeatedly producing the same board, we also compute the fraction of unique boards among valid samples.
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku benchmark used by HRM and TRM. We sample 50K complete solutions from the original training split, discard the clue patterns, and use an all-blank board as input with the complete solution as the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200 epochs with learning rate 10^-4, EMA decay 0.999, and KL coefficient 0.05. The resulting model contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parameters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps. In all cases, the valid samples are unique under exact board matching, indicating that the reported validity is not due to simple repetition of a small set of boards.
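The two evaluation quantities described above, the validity rate and the fraction of unique boards among valid samples, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' evaluation code: the function names are hypothetical, boards are assumed to be 9 × 9 nested lists of integers, and uniqueness is computed by exact board matching as the text describes.

```python
def is_valid_sudoku(board):
    """True if every row, column, and 3x3 box contains 1..9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Each group has 9 cells, so equality with {1..9} means "exactly once".
    return all(group == digits for group in rows + cols + boxes)


def validity_and_uniqueness(boards):
    """Return (valid fraction of all boards, unique fraction of valid boards)."""
    valid = [tuple(map(tuple, b)) for b in boards if is_valid_sudoku(b)]
    validity = len(valid) / len(boards) if boards else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness


if __name__ == "__main__":
    # A standard valid board built from a shifted-row pattern.
    good = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
    bad = [row[:] for row in good]
    bad[0][0] = bad[0][1]  # duplicate a digit in the first row
    print(validity_and_uniqueness([good, good, bad]))
```

Converting each board to a tuple of tuples makes it hashable, so duplicates among valid samples are detected with a plain set.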
These results show that GRAM can generate highly constrained symbolic structures from an empty input, supporting its potential as a generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board, the task is to generate complete boards, and validity is determined by whether the generated board satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+  Method                #Params  Steps  Validity(%)
+  D3PM-Uniform (Big)    55.1M    1000   91.33
+  D3PM-Uniform (Small)  15.9M    1000   29.24
+  D3PM-Absorb (Big)     55.1M    1000   79.18
+  D3PM-Absorb (Small)   15.9M    1000   21.88
+  GRAM (Ours)           10.9M    16     99.05
+
+[Figure 17 panels: Empty input (blank board), Valid sample, Invalid sample]
+  Valid sample         Invalid sample
+  3 6 5 4 7 8 9 2 1    4 3 8 2 9 1 7 6 5
+  9 4 1 2 5 6 8 7 3    5 2 1 7 3 6 4 9 8
+  2 7 8 9 1 3 6 5 4    7 6 9 4 5 8 1 2 3
+  5 2 9 8 4 7 3 1 6    9 1 3 4 4 7 5 8 6
+  7 3 6 1 9 2 5 4 8    2 5 4 6 8 9 3 1 7
+  1 8 4 3 6 5 2 9 7    6 8 7 5 1 3 2 4 9
+  8 9 7 5 3 4 1 6 2    3 7 2 9 6 4 8 5 1
+  4 1 3 6 2 9 7 8 5    1 9 5 3 2 8 6 7 4
+  6 5 2 7 8 1 4 3 9    8 4 6 1 7 5 9 3 2
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid sample indicate cells involved in constraint violations.
+
+D.6 Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during recursive computation. Specifically, we track the high-level state h at each supervision step throughout the recursion process.
For visualization, we project these latent vectors into 2D using PCA [59] and interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single deterministic path from initialization to solution, offering no mechanism to escape if the trajectory enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different regions of latent space before converging. While some trajectories become trapped in local minima (bright yellow regions), others successfully navigate toward the global optimum (dark blue regions). This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h_0 and the green dot indicates the final state h_T. Background color represents the loss landscape: bright yellow corresponds to high loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with no ability to escape suboptimal trajectories.
+
+Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration of the latent space: while some trajectories converge to local minima (bottom right), others successfully reach the global optimum (middle left), demonstrating how parallel sampling improves solution discovery.
+
+Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances are generated by the authors and are therefore not external assets.
+
+  Asset                                 Use in this paper                                                                License / terms                               Source link
+  MNIST                                 Binarized MNIST generation experiments                                           Creative Commons Attribution-Share Alike 3.0  https://keras.io/api/datasets/mnist/
+  ARC-AGI-1 / original ARC              ARC-AGI reasoning benchmark                                                      Apache License 2.0                            https://github.com/fchollet/ARC-AGI
+  ARC-AGI-2                             ARC-AGI-2 reasoning benchmark / reference results                                Apache License 2.0                            https://github.com/arcprize/ARC-AGI-2
+  HRM repository                        HRM baseline and Sudoku-Extreme-related reference implementation                 Apache License 2.0                            https://github.com/sapientinc/HRM
+  TinyRecursiveModels / TRM repository  TRM baseline and recursive reasoning reference implementation                    MIT License                                   https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+  MDLM repository                       Masked diffusion baseline reference implementation, if public code is used       Apache License 2.0                            https://github.com/kuleshov-group/mdlm
+  Google Research D3PM implementation   D3PM image-generation baseline reference implementation, if public code is used  Apache License 2.0                            https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+  Looped Transformer repository         Looped Transformer baseline reference implementation, if public code is used     MIT License                                   https://github.com/Leiay/looped_transformer
+  N-Queens                              Synthetic multi-solution constraint satisfaction task generated by the authors   Not an external asset                         N/A
+  Graph Coloring                        Synthetic multi-solution constraint satisfaction task generated by the authors   Not an external asset                         N/A
+
\ No newline at end of file
diff --git a/papers/txt/gram2026_generative_recursive.txt b/papers/txt/gram2026_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2026_generative_recursive.txt
@@ -0,0 +1,1532 @@
+Generative Recursive Reasoning
+
+Junyeob Baek1†∗  Mingyu Jo1†∗  Minsu Kim1,2  Mengye Ren3  Yoshua Bengio2,4  Sungjin Ahn1,3†
+1 KAIST  2 Mila – Québec AI Institute  3 New York University  4 Université de Montréal
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+Abstract
+How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via p_θ(y | x) and, with fixed or absent inputs, unconditional generation via p_θ(x). Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+1 Introduction
+A central question for future neural reasoning systems is how extended computation should be implemented.
Large autoregressive models typically scale reasoning by extending a sequence-generation process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or implicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing because it decouples reasoning depth from both parameter scale and output length: a compact model can perform many steps of internal computation by repeatedly applying shared transition functions over time.
+Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the potential of this approach in structured reasoning. Rather than producing a solution in a single feedforward pass, they perform extended computation through iterative latent-state refinement, deep supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical latent dynamics. These features make them well suited to problems requiring constraint propagation, state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational depth without increasing parameter count. Together, these models suggest that reasoning capability can emerge not only from scaling model size or generating longer traces, but also from the organization of computation itself.
+
+∗ Equal contribution
+† Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn (sungjin.ahn@kaist.ac.kr)
+Preprint.
+
+[Figure 1 panels: Input Task, Solution 1, Solution 2; (a) Deterministic RRMs, (b) GRAM (Ours)]
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens Example with two valid solutions. Right: Given three independent runs for latent reasoning (τ_1, τ_2, τ_3): (a) Prior RRMs (e.g. HRM, TRM) are deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore alternatives, whereas (b) GRAM explores diverse trajectories that reach multiple valid solutions y_1 and y_2, naturally enabling parallel inference-time scaling.
+
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning system should also be able to maintain uncertainty, consider alternative hypotheses, and explore multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity or multiple valid solutions are intrinsic, and more generally in problems where a single refinement path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and initialization, they follow a single latent trajectory and converge to a single prediction. This deterministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm.
This motivates the central question of our work: can recursive latent computation support probabilistic, generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition conditioned on the input and the current reasoning state, rather than deterministically updating to a single next state. Repeating this process defines a distribution over possible reasoning trajectories, allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale inference not only by increasing recursive depth but also by sampling trajectories in parallel. From a probabilistic perspective, GRAM is a latent-variable generative model: it models p_θ(y | x) by marginalizing over latent reasoning trajectories, while the same recursive process can also define an unconditional generative model p_θ(x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the architectural properties targeted by our formulation: recursive refinement, stochastic exploration, multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines, including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs whose training data, inference budgets, and external scaffolding are not directly comparable.
Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2 Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1 Architecture
+
+Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic
+latent reasoning trajectories. Given an input x, GRAM first computes an embedding
+    e_x = f_{enc}(x; \theta),    (1)
+which is reused throughout the entire recursive computation. Starting from a fixed initial latent
+state z0 , the model evolves the latent state through learned stochastic transitions. The recursive
+computation is organized into two nested levels: inner and outer loops.
+
+[Figure 2 schematic. Labeled components: encoder fenc ; low-level update fL (K times); high-level
+update fH ; prior pθ (· | ut ); posterior qϕ (· | ut , y); noise ϵt ; decoder fdec ; (A) CE loss; (B) KL divergence.]
+Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation
+z = (h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic
+proposal ut , to which stochastic guidance ϵt is added: ht = ut + ϵt .
+
+At the inner level, a latent transition samples a new latent state conditioned on the previous
+latent state and the input embedding,
+    z_t \sim p_\theta(z_t \mid z_{t-1}, e_x),    t = 1, \dots, T.    (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+    z_0^{(1)} \xrightarrow{T \text{ transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T \text{ transitions}} \cdots \xrightarrow{T \text{ transitions}} z_T^{(N_{sup})},    (3)
+where z_t^{(n)} denotes the latent state at the t-th transition of the n-th supervision step, z_0^{(1)} is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+(z_0^{(n+1)} := z_T^{(n)}). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
+Stochastic Latent Transitions. 
Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t), \sigma_\theta^2(u_t) I\big),    (4)
+    z_t = u_t + \epsilon_t.    (5)
+We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of exploration.
+This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+    l_{t,k} = f_L(h_{t-1}, l_{t,k-1}, e_x; \theta),    k = 1, \dots, K,    (6)
+where l_{t,0} := l_{t−1} and we write l_t := l_{t,K} for the refined low-level component. 
The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+    u_t = f_H(h_{t-1}, l_t; \theta),    (7)
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t), \sigma_\theta^2(u_t) I\big),    (8)
+    h_t = u_t + \epsilon_t,    (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-level
+refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions (footnote 3). Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2 Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory τ = (z_0 → · · · → z_{T_Total}) consists of a sequence of latent variables, with
+T_Total = T × Nsup . The conditional likelihood is defined as
+    p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x) \, p_\theta(\tau \mid x) \, d\tau,    (10)
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. 
We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+    \log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi(\tau \mid x, y)}[\log p_\theta(y \mid \tau, x)] - \mathrm{KL}(q_\phi(\tau \mid x, y) \,\|\, p_\theta(\tau \mid x)).    (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+Footnote 3: We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+    p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{Total}} p_\theta(z_t \mid z_{t-1}, x),    q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{Total}} q_\phi(z_t \mid z_{t-1}, x, y).    (12)
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ_{1:T_Total} , their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+pθ (y | τ, x) = pθ (y | z_{T_Total} , x). Therefore, the full trajectory-level ELBO can be written as
+    \mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi} \log p_\theta(y \mid z_{T_{Total}}, x) - \sum_{t=1}^{T_{Total}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)} \Big[ \mathrm{KL}\big(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t)\big) \Big].    (13)
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ_{<t} := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after T_Total = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+z_{T-1}^{(n)} → z_T^{(n)} . This gives the following surrogate objective for each supervision step:
+    \mathcal{L}_{GRAM}^{(n)}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi} \log p_\theta(y \mid z_T^{(n)}, x) - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big),    (14)
+where z_T^{(n)} is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, L_GRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3 Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). 
For width, the focus of this section, we draw \{\tau^{(i)}\}_{i=1}^{N} \sim p_\theta(\tau^{(i)} \mid x) from the
+learned prior and decode each terminal state into a candidate output ŷ^{(i)} = fdec (zT ), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
+
+3 Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representations
+can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal reasoning
+traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+
+[Figure 3 bar charts. Panels: Sudoku, ARC-AGI-1, ARC-AGI-2. Vertical axis: Accuracy (%). Bars: Looped TF,
+HRM, TRM, GRAM (Ours), o3-mini-high, GPT 5.2 (low), Grok-4-thinking. Legend: Recursive Models, GRAM (Ours), LLMs.]
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational inference
+[33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. 
GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4 Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-purpose
+large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-AGI
+(Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1 Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
+extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+model results [40] are included as external reference points for benchmark difficulty, rather than
+as controlled baselines, since their training and inference settings are not directly comparable to
+task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+
+[Figure 4 plots. Left: Accuracy vs. Iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and GRAM (Ours)
+with N = 1, 5, 10, 20, 50. Right: Accuracy vs. Number of Solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+iteration corresponds to a supervision step, which for Looped TF means K× more flat iterations. (Right)
+Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+problem spaces more effectively than models constrained to a single sequential refinement process.
+Detailed experiment results, including more state-of-the-art methods, are provided in Appendix D.1.
+Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+number of parallel samples consistently improves performance across all iteration counts. 
Notably, + GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations, + including TRM (97.0% vs 90.5%), despite comparable computational budget. While deterministic + recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to + explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process + Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses + the latency bottlenecks of depth-based scaling while achieving superior performance. Additional + analysis on the ARC-AGI Challenge is provided in Appendix D.2. + + 4.2 Multi-solution Puzzle Tasks + +Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8, +10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each +input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7], +HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For +N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found / +total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of +constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in +Appendix C.2. + Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive + models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks. + Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three + deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains + consistent performance regardless of solution count. 
This confirms that deterministic latent updates
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+
+                                              N-Queens 8×8          N-Queens 10×10        Graph Coloring 8-vertex   Graph Coloring 10-vertex
+Method                   Rec.  Gen.  # Params  Accuracy   Coverage   Accuracy   Coverage   Conflict↓    Coverage     Conflict↓    Coverage
+Direct Pred (8 layers)   ✗     ✗     27M       40.4±1.1   13.7±1.1   13.6±0.5   1.6±0.2    179.3±4.0    19.9±0.2     198.7±5.0    6.7±0.1
+Direct Pred (32 layers)  ✗     ✗     100M      40.2±1.3   13.6±1.1   13.1±0.4   1.6±0.2    174.0±18.0   19.1±1.7     227.7±34.5   6.5±1.9
+Looped TF                ✓     ✗     7M        68.4±3.7   23.6±1.9   50.0±7.6   6.2±3.2    136.0±16.1   20.5±1.5     157.3±9.0    7.2±0.7
+HRM                      ✓     ✗     27M       78.7±2.9   26.7±1.3   37.4±0.3   4.7±0.1    109.7±1.5    21.8±0.3     164.3±21.6   8.9±1.7
+TRM                      ✓     ✗     7M        66.8±5.7   36.1±22.5  17.5±11.2  2.0±1.3    109.3±3.1    22.3±0.6     170.7±17.9   6.8±0.3
+AR                       ✗     ✓     10.6M     96.3±1.0   84.8±0.8   90.0±2.2   53.2±0.8   19.0±11.3    83.0±0.7     61.3±8.3     40.0±0.3
+MDLM                     ✗     ✓     12.6M     96.1±1.5   87.2±0.6   74.3±6.6   47.4±2.2   2.7±0.6      84.5±4.0     12.0±7.0     48.2±1.4
+GRAM (Ours)              ✓     ✓     10M       99.7±0.3   90.3±1.9   89.7±2.7   57.5±3.4   2.7±2.1      85.8±0.5     3.3±1.5      51.3±2.8
+
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diversity.
+On 8 × 8 N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. 
This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3 Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
+evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
+evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
+an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
+and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
+with existing literature, FID is calculated using real samples from the original standard MNIST.
+
+Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓).
+For iterative models, a step corresponds to a supervision step for TRM and GRAM, and a denoising
+step for D3PM. FID is calculated using real samples with original pixel values (0–255).
+
+Method               IS (↑)   FID (↓)
+VAE                  1.70     86.28
+D3PM (1000 steps)    1.86     74.03
+TRM (16 steps)       1.00     303.29
+GRAM (Ours)
+  8 steps            1.85     84.08
+  16 steps           1.89     77.79
+  32 steps           1.91     76.65
+  64 steps           1.95     75.39
+  128 steps          1.99     74.30
+  256 steps          2.04     73.34
+
+Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denoising steps. 
Figure 7 shows qualitative examples, illustrating
+that the model produces diverse, fully valid boards from empty inputs without any explicit constraint
+checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together,
+these results indicate that GRAM’s stochastic latent transitions support generative modeling beyond
+symbolic reasoning, with constraint satisfaction emerging as a natural byproduct of the recursive
+generative process.
+
+Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM
+achieves higher validity than D3PM with substantially fewer parameters and steps.
+
+[Figure 6 image grid. Rows: D3PM (t = 0, 100, 200, ..., 1000), TRM and GRAM (t = 0, 1, 2, 4, 6, 7, 9, 11, 12, 14, 16).
+Panels: (a) Generation Process, (b) Samples (Sample 1 to Sample 4).]
+Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential: removing
+either significantly degrades performance.
+
+(a) Architecture Ablation.
+Model variant               Sudoku          N-Queens
+base (Looped TF)            61.25           71.30
++ DS + HR (=HRM, TRM)       55.00 / 87.40   80.70 / 72.90
++ SG                        65.64           86.30
++ DS + SG                   73.90           100.00
++ DS + HR + SG (=GRAM)      93.96           99.69
+
+(b) Mechanism Ablation.
+Model variant               Sudoku          N-Queens
+GRAM (ours)                 93.96           99.69
++ w/o stochastic guidance   82.87           72.91
++ stochasticity only        94.88           50.27
++ guide only                0.00            0.00
+w/ direct prediction        63.43           61.44
+TRM w/ stochastic decoder   82.87           71.66
+TRM w/ random init.         78.53           71.82
+
+Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+additional samples are in Section D.4.
+
+4.4 Ablation Study
+
+We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+Table 3 summarizes the results.
+Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+(93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+gains in every configuration, supporting our design of stochastic guidance as the core extension
+introduced by GRAM.
+Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+learned distribution ϵt ∼ N (µθ , σθ2 I) in Equation (4). 
Removing guidance (N (0, σθ2 I)) maintains
+Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
+Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+[Figure 7 board grids. Rows: D3PM at t = 0, 125, 250, 375, 500, 625, 750, 875, 1000 and GRAM at
+iter = 0, 1, 2, 4, 6, 8, 10, 13, 16.]
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+5 Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amortized
+variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
+to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
+depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose extension
+that consistently improves the recursive architectures we tested, and that the gains stem specifically from
+the variational framework rather than from mere randomness, as naive stochastic alternatives applied to
+existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. 
Despite these strengths, the sequential nature of deep supervision +limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM +toward larger foundation models. + + + + + 10 +Acknowledgment +This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and +the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165) +through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and +ICT (MSIT). This work was also supported by the Institute of Information & Communications +Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS- +2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research +Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We +are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the +members of the MLML for valuable discussions and feedback throughout this project. + +Broader Impacts +GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By main- +taining multiple latent trajectories, it may benefit tasks such as constraint satisfaction, and scientific +problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way +to improve reasoning through inference-time computation rather than parameter scaling alone. Its +generality also entails risks: plausible but invalid generations may be mistaken for verified solutions +in downstream decision-making pipelines, and multi-sample inference may increase computational +and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in +real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and +domain-specific safeguards. 
+ +References + [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, + Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. + Advances in neural information processing systems, 35:24824–24837, 2022. + + [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik + Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad- + vances in neural information processing systems, 36:11809–11822, 2023. + + [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas + Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph + of thoughts: Solving elaborate problems with large language models. In Proceedings of the + AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024. + + [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong + Tian. Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.06769, 2024. + + [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning + by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint + arXiv:2505.12514, 2025. + + [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh + Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and + reasoning. arXiv preprint arXiv:2505.23648, 2025. + + [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers + are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023. + + [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and + Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025. + + [9] Alexia Jolicoeur-Martineau. 
Less is more: Recursive reasoning with tiny networks. arXiv + preprint arXiv:2510.04871, 2025. + + + 11 +[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- + versal transformers. arXiv preprint arXiv:1807.03819, 2018. +[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011. +[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017. +[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. +[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. +[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document + recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. +[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of + recurrent network trajectories. Neural computation, 2(4):490–501, 1990. +[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv + preprint arXiv:1705.08209, 2017. +[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold- + son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute + with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. +[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation + beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025. +[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong + Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous + concept space. arXiv preprint arXiv:2505.15778, 2025. +[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. 
Soft tokens, + hard truths. arXiv preprint arXiv:2509.19170, 2025. +[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: + Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint + arXiv:2502.21074, 2025. +[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu + Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv + preprint arXiv:2505.18454, 2025. +[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, + Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic + recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025. +[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of- + thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint + arXiv:2310.10845, 2023. +[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. + Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint + arXiv:2410.20672, 2024. +[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990. +[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): + 1735–1780, 1997. +[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, + Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder- + decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. + + + 12 +[30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu + Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv + preprint arXiv:1909.11942, 2019. +[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. 
arXiv + preprint arXiv:1910.10073, 2019. +[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint + arXiv:1603.08983, 2016. +[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua + Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv. + org/abs/1506.02216. +[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural + models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571. +[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https: + //arxiv.org/abs/1511.05121. +[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and + James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https: + //arxiv.org/abs/1811.04551. +[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with + discrete world models. arXiv preprint arXiv:2010.02193, 2020. +[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains + through world models. arXiv preprint arXiv:2301.04104, 2023. +[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information + processing systems, 30, 2017. +[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI + benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22. +[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, + Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language + models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024. +[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, + 2018. 
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. + Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in + neural information processing systems, 30, 2017. +[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. + Structured denoising diffusion models in discrete state-spaces. Advances in neural information + processing systems, 34:17981–17993, 2021. +[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint + arXiv:1312.6114, 2013. +[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: + Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. +[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. +[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in + discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024. +[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings + of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. + + + 13 +[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning + applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002. +[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep + convolutional neural networks. Advances in neural information processing systems, 25, 2012. +[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network + function approximation in reinforcement learning. Neural networks, 107:3–11, 2018. +[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference + on computer vision (ECCV), pages 3–19, 2018. +[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 
arXiv preprint + arXiv:1711.05101, 2017. +[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity + natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. +[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. + Advances in neural information processing systems, 33:12438–12448, 2020. +[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica + Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/ + BF02066689. URL https://doi.org/10.1007/BF02066689. +[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep + learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE + 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. + IEEE, 2019. +[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum + Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141. +[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com- + mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID: + 13091446. + + + + + 14 +A Additional Method Details +A.1 Adaptive Computation Time + +GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each +trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision +steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9]. +Halt head. The decoder includes an auxiliary head qψ : RD → R2 that maps the high-level state h +to two scalar values, qψ (h) = (q halt , q continue ). These are interpreted as estimated Q-values for the +binary action of halting or continuing computation at the current supervision step. +Training. 
The halt head is trained jointly with the main objective via a temporal-difference loss. After computing the latent state $z_T^{(n)}$ at the end of supervision step $n$, we form Q-learning targets:
+
+ • $\hat{q}_n^{\text{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct prediction.
+
+ • $\hat{q}_n^{\text{continue}} = \max\big(q_{n+1}^{\text{halt}},\, q_{n+1}^{\text{continue}}\big)$, the bootstrapped value of running one more supervision step.
+
+The halt head is trained by regression to these targets:
+
+ $$\mathcal{L}_{\text{ACT}} = \sum_{n=1}^{N_{\text{sup}}} \Big[ \big(q_n^{\text{halt}} - \hat{q}_n^{\text{halt}}\big)^2 + \big(q_n^{\text{continue}} - \hat{q}_n^{\text{continue}}\big)^2 \Big]. \tag{15}$$
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head; it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step $n$, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\text{halt}} > q_n^{\text{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise, computation continues to the next supervision step, up to a maximum budget of $N_{\text{sup}}^{\max}$ steps. Different trajectories sampled in parallel may therefore terminate at different depths, complementing the parallel-sampling scheme described in Section 2.3. In practice, we found that using only $q^{\text{halt}}$ (halting when $\sigma(q^{\text{halt}}) > 0.5$, without the continue branch) performs comparably while simplifying implementation; our released code uses this variant.
+
+A.2 Latent Process Reward Model (LPRM)
+
+To rank or select among sampled candidates, we train a value head $v_\psi(z_t)$ to predict the expected accuracy of the final output, conditioned on the current latent state $z_t$. The LPRM is trained jointly with the main objective via a regression loss:
+
+ $$\mathcal{L}_{\text{LPRM}} = \sum_{t=1}^{T} \big(v_\psi(z_t) - r\big)^2, \tag{16}$$
+
+where $r \in [0, 1]$ denotes the accuracy of the final prediction for a given trajectory.
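As a reading aid, the halting targets and the two regression losses in Equations (15) and (16) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the predicted Q-values are plain floats rather than network outputs, and the text does not say how the continue target is formed at the last supervision step, so the sketch falls back to the halt target there.

```python
from typing import List, Sequence, Tuple


def act_targets(
    correct: Sequence[bool],
    q_halt: Sequence[float],
    q_continue: Sequence[float],
) -> List[Tuple[float, float]]:
    """Build (halt, continue) regression targets for each supervision step.

    correct[n] says whether decoding the state after step n matches the label.
    """
    n_steps = len(correct)
    targets = []
    for n in range(n_steps):
        halt_target = 1.0 if correct[n] else 0.0
        if n + 1 < n_steps:
            # Bootstrapped value of running one more supervision step.
            cont_target = max(q_halt[n + 1], q_continue[n + 1])
        else:
            # Assumption (not stated in the text): no next step exists,
            # so reuse the halt target at the final step.
            cont_target = halt_target
        targets.append((halt_target, cont_target))
    return targets


def act_loss(
    q_halt: Sequence[float],
    q_continue: Sequence[float],
    targets: Sequence[Tuple[float, float]],
) -> float:
    """Sum of squared errors over supervision steps, as in Equation (15)."""
    return sum(
        (qh - th) ** 2 + (qc - tc) ** 2
        for qh, qc, (th, tc) in zip(q_halt, q_continue, targets)
    )


def lprm_loss(values: Sequence[float], r: float) -> float:
    """Sum over transitions of (v(z_t) - r)^2, as in Equation (16)."""
    return sum((v - r) ** 2 for v in values)


def should_halt(q_halt_n: float, q_continue_n: float) -> bool:
    """Inference rule from the text: halt when q_halt exceeds q_continue."""
    return q_halt_n > q_continue_n
```

In an autodiff framework the targets would normally be treated as constants (no gradient flows through them), which is the usual Q-learning convention and is assumed here rather than taken from the text.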
+
+A.3 Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective $\mathcal{L}_{\text{GRAM}}$ used in Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame $\mathcal{L}_{\text{GRAM}}$ as a gradient approximation rather than a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL terms across all $T_{\text{Total}}$ transitions, and computing its exact gradient requires backpropagation through the entire trajectory. To enable training with constant memory, we propagate gradients only through the final transition of each supervision step. This is a standard practice in recurrent latent variable models with long computation chains: ELBOs over truncated sequences are used, for example, in VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore well-precedented; what is specific to GRAM is applying this approximation at the level of recursive reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing $\mathcal{L}_{\text{GRAM}}$ effectively drives improvement in the full variational bound, we compute both quantities on the validation set throughout training. The full ELBO $\mathcal{L}_{\text{ELBO}}$ is evaluated as in Equation (13), summing the reconstruction term and KL contributions across all $T_{\text{Total}}$ transitions; the surrogate objective is evaluated as the average of $\mathcal{L}_{\text{GRAM}}^{(n)}$ over the $N_{\text{sup}}$ supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+[Figure 8 image: two panels, Sudoku-Extreme (left) and N-Queens 8 × 8 (right, log-scale y-axis). Curves: GRAM (training objective) and ELBO (full). x-axis: Training Step; y-axis: ELBO.]
+Figure 8: Full ELBO $\mathcal{L}_{\text{ELBO}}$ and surrogate objective $\mathcal{L}_{\text{GRAM}}$ throughout training (plotted as −ELBO, smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease monotonically over training, indicating that gradient updates of $\mathcal{L}_{\text{GRAM}}$ consistently improve the full variational bound. The two curves do not coincide because $\mathcal{L}_{\text{ELBO}}$ sums KL contributions across all $T_{\text{Total}}$ transitions while $\mathcal{L}_{\text{GRAM}}$ evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to the large dynamic range.
+
+Both $\mathcal{L}_{\text{ELBO}}$ and $\mathcal{L}_{\text{GRAM}}$ improve monotonically throughout training on both tasks. This indicates that, despite the truncation, gradient updates of $\mathcal{L}_{\text{GRAM}}$ effectively drive improvement in the full variational bound. Since the ELBO is a bound on the data log-likelihood, its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood, even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two quantities rather than an optimization failure: $\mathcal{L}_{\text{ELBO}}$ accumulates KL terms across all transitions, while $\mathcal{L}_{\text{GRAM}}$ evaluates only the final-step KL of each supervision step. This gap is consistent with $\mathcal{L}_{\text{GRAM}}$ being a biased but useful surrogate for $\mathcal{L}_{\text{ELBO}}$.
+
+B Training and Architecture Details
+B.1 Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder.
Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings (RoPE) [46]. The embeddings are scaled by $\sqrt{D}$ and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer step, the low-level state is refined K times via $l \leftarrow f_L(l, h + e_x)$, injecting the input embedding at each iteration. The high-level state is then updated via $h \leftarrow f_H(h, l)$. Both $f_L$ and $f_H$ share the same architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we use a [SwiGLU + SwiGLU] network for the Recursive Core instead of [Attention + SwiGLU], following [9]. To initialize $z_0 = (h_0, l_0)$, we sample once from the standard Gaussian distribution $\mathcal{N}(0, I)$, save the sample in the network checkpoint, and reload it afterwards, so $z_0$ has a fixed value.
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions) and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and correctness values from the first token of h.
+
+Table 4: Architecture components.
+| Component | Module | Description |
+|---|---|---|
+| Encoder | Token Embedding | vocab → D |
+| Encoder | Puzzle Embedding | 16 tokens (optional, for ARC) |
+| Encoder | Position Encoding | RoPE or learned |
+| Recursive Core | $f_L$, $f_H$ | [Attention + SwiGLU] × 2 layers |
+| Recursive Core | Iterations | K low-level, T high-level steps |
+| Recursive Core | $\mu_\theta$, $\sigma_\theta$, $\mu_\phi$, $\sigma_\phi$ | SwiGLU MLP for each parameter |
+| Decoder | LM Head | Linear(D → vocab) |
+| Decoder | Q Head | Linear(D → 2) for halt |
+| Decoder | V Head | Linear(D → 1) for value |
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first construct a binarized dataset by normalizing the original discrete pixel values (0 to 255) to the continuous range [0, 1] and applying a threshold at 0.5.
For the network architecture, we employ a convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens $x \in \{0, 1\}$ are normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification, the normalized image passes through a shallow convolutional encoder. This encoder consists of two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting feature map is divided into non-overlapping patches of size P × P and linearly projected to match the model’s hidden dimension D. The detailed architectural specifications and dimension transitions are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input channels, P patch size, $N_p$ the number of patches, and D the hidden dimension.
+| Stage | Layer / Operation | Output Dim. |
+|---|---|---|
+| 1. Norm. | Input Tokens | (B, C, H, W) |
+| 1. Norm. | Linear Scaling [−1, 1] | (B, C, H, W) |
+| 2. Conv | Conv2d 5 × 5 (p = 2) → SiLU → GN(32) | (B, D/2, H, W) |
+| 2. Conv | Conv2d 5 × 5 (p = 2) → SiLU → GN(32) | (B, D/2, H, W) |
+| 3. Patch | Flatten Patches | (B, $N_p$, $P^2 \cdot D/2$) |
+| 3. Patch | Linear Projection | (B, $N_p$, D) |
+
+Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are represented as sequences of shape [B, L], where B denotes the batch size and L the context length. Each input sequence includes 16 fixed puzzle embedding tokens. The latent states $h_t$ and $l_t$, as well as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39] backbone uses embedding dimension D = 512, attention heads $N_{\text{head}} = 8$, and FFN hidden dimension $D_h = 512$.
Within a recursion step, meaning a latent transition $z_t \to z_{t+1}$, we use low-level (inner) steps K = 6 for Sudoku [8] and K = 4 for all other tasks, with high-level (outer) steps T = 3.
+
+B.2 Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized in Table 6).
+
+ • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11 (0=pad, 1=blank, 2–10=digits).
+ • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-specific puzzle embeddings are prepended to distinguish different ARC tasks.
+ • N-Queens flattens an N × N board row-by-row into $N^2$ tokens with vocabulary size 3 (0=pad, 1=empty, 2=queen).
+ • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens, using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+ • For image generation on MNIST [15], images are quantized and processed via CNN-based patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. The patched input then forms a flattened sequence of 14 × 14 tokens with vocabulary size 3 (0=pad, 1=black, 2=white).
+
+Table 6: Task-specific configurations.
+| Task | Seq. Len | Vocab | Puzzle Emb | Encoding |
+|---|---|---|---|---|
+| Sudoku | 81 | 11 | ✗ | 9×9 grid, row-major |
+| ARC-AGI | 900 | 12 | ✓ | 30×30 padded canvas |
+| N-Queens | $N^2$ | 3 | ✗ | N × N board |
+| Graph Coloring | n(n−1)/2 | 6 | ✗ | Strict adjacency upper triangle |
+| MNIST | 196 | 3 | ✗ | 14×14 patches |
+
+Training Details. We train all models using AdamW [54] with learning rate $10^{-4}$, weight decay 1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56]. To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8.
The number of deep supervision steps is $N_{\text{sup}} = 16$ for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1 (ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes), and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+| Task | Epochs | GPUs | Time |
+|---|---|---|---|
+| Sudoku | 50K | 8 | 2h |
+| ARC-AGI | 200K | 8 | 5 days |
+| N-Queens (8×8) | 3K | 8 | 1h |
+| N-Queens (10×10) | 1K | 8 | 3h |
+| Graph Coloring (8 nodes) | 5K | 8 | 1.5h |
+| Graph Coloring (10 nodes) | 5K | 8 | 6h |
+| MNIST | 1.8K | 8 | 16h |
+
+C Additional Details of Experiment Setup
+C.1 Challenging Puzzle Tasks
+
+C.1.1 Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs, batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme already takes 19 hours, and extrapolating to ARC-AGI, which uses substantially longer sequences and a larger training set, suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM, TRM, and GRAM perform $N_{\text{sup}}$ gradient updates per trajectory (one per segment), whereas Looped TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum recursion depth, increasing per-example sequential compute. Both inefficiencies compound at ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2 Multi-solution Puzzle Tasks
+
+C.2.1 N-Queens Problem
+
+[Figure 9 image: four boards labeled Input, Solution 1, Solution 2, Solution 3.]
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the full board, leaving 3 queens.
The model must find the positions of the 5 removed queens. This configuration admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N chessboard such that no two queens attack each other, that is, no two queens share the same row, column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8 solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and N = 10. We then created puzzle instances by removing a specific number of queens, treating the remaining partial configuration as the input and the original complete board as the target label. To generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8 setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure the model learns to reason rather than memorize, the split was performed based on unique input configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs are flattened into discrete 1D sequences $x \in \{0, 1, 2\}^L$, where $L = N^2$, along with zero-padded puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+[Figure 10 image: two histograms, 8x8 N-Queens (left) and 10x10 N-Queens (right). x-axis: Number of solutions; y-axis: Counts.]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+C.2.2 Graph Coloring Problem
+Data Generation Details.
The Graph Coloring problem requires assigning one of k colors to each node in a graph such that no two adjacent nodes share the same color. We consider graphs with N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8 nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are 3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12. The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and 192 test instances for N = 10.
+Input and Output Representation. The input graph is represented by extracting the upper triangular portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The output is a sequence of length N, where each position encodes the assigned color for the corresponding node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red, blue, and green respectively.
+
+[Figure 11 image: five graphs labeled Input, Solution 1, Solution 2, Solution 3, Solution 4.]
+Figure 11: Graph Coloring Example
+
+[Figure 12 image: two histograms, Vertex 8 Graph Coloring (left) and Vertex 10 Graph Coloring (right). x-axis: Number of solutions; y-axis: Counts.]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
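The graph coloring encoding and the canonical-form filter described in C.2.2 can be illustrated with a short Python sketch. It is a hedged illustration, not the authors' pipeline: the function names are hypothetical, and since the text does not say which canonical form is used, the sketch assumes "first-appearance" ordering, where node 0 always gets color 0 and each new color index appears only after all smaller ones. Token values (1 = no edge, 2 = edge, 3 + color_id for colors) follow the vocabulary stated above.

```python
from itertools import product
from typing import List, Sequence

PAD, NO_EDGE, EDGE, COLOR_OFFSET = 0, 1, 2, 3


def encode_graph(adj: Sequence[Sequence[int]]) -> List[int]:
    """Flatten the strict upper triangle of the adjacency matrix, row by row."""
    n = len(adj)
    return [EDGE if adj[i][j] else NO_EDGE
            for i in range(n) for j in range(i + 1, n)]


def is_valid(adj: Sequence[Sequence[int]], colors: Sequence[int]) -> bool:
    """A coloring is valid when no edge joins two nodes of the same color."""
    n = len(adj)
    return all(colors[i] != colors[j]
               for i in range(n) for j in range(i + 1, n) if adj[i][j])


def is_canonical(colors: Sequence[int]) -> bool:
    """Assumed canonical form: color indices appear in first-use order 0, 1, 2."""
    next_new = 0
    for c in colors:
        if c > next_new:
            return False  # skipped a smaller, still unused color index
        if c == next_new:
            next_new += 1
    return True


def canonical_colorings(adj: Sequence[Sequence[int]], k: int = 3) -> List[List[int]]:
    """Brute-force all k-colorings and keep one per color-permutation class."""
    n = len(adj)
    return [list(c) for c in product(range(k), repeat=n)
            if is_canonical(c) and is_valid(adj, c)]


def encode_solution(colors: Sequence[int]) -> List[int]:
    """Map color ids 0..k-1 to output tokens 3 + color_id."""
    return [COLOR_OFFSET + c for c in colors]
```

Brute-force enumeration is adequate at the sizes considered here (at most 3^10 candidate colorings per graph); a graph with no canonical coloring returns an empty list, which corresponds to the "retain only 3-colorable graphs" filter.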
+
+D Additional Experiment Results
+D.1 Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consistently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0% in both cases) further supports the view that recursive computation is essential for these tasks: the single-pass baseline solves none of the test instances. Together, these results indicate that GRAM’s gains arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI, more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale; we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior recursive models. All recursive model scores were obtained at 16 supervision steps.
+| Group | Method | #Params | Sudoku | ARC-1 | ARC-2 |
+|---|---|---|---|---|---|
+| Large Reasoning Models | Deepseek-R1 | 671B | 0.0 | 15.8 | 1.3 |
+| Large Reasoning Models | Claude 3.7 16k | N/A | 0.0 | 28.6 | 0.7 |
+| Large Reasoning Models | o3-mini-high | N/A | 0.0 | 34.5 | 3.0 |
+| Large Reasoning Models | GPT 5.2 (low) | N/A | – | 55.7 | 9.7 |
+| Large Reasoning Models | Grok-4-thinking | 1.7T | – | 66.7 | 16.0 |
+| Large Reasoning Models | Gemini 3 Pro | N/A | – | 75.0 | 31.1 |
+| Recursive Models | Direct Pred | 27M | 0.0 | 21.0 | 0.0 |
+| Recursive Models | Looped TF | 7M | 61.3 | – | – |
+| Recursive Models | HRM | 27M | 55.0 | 40.3 | 5.0 |
+| Recursive Models | TRM | 7M | 87.4 | 44.6 | 7.8 |
+| Recursive Models | GRAM (Ours) | 10M | 97.0 | 52.0 | 11.1 |
+| Human Results | Avg. Human | – | – | 60.2 | – |
+| Human Results | Best Human | – | – | 98.0 | 100.0 |
+
+D.2 Scales with Parallel Sampling on ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured performance without relying on external data augmentation. Typically, TRM achieves its reported accuracy by generating 1,000 augmentations for a single problem and performing majority voting over the results. Because this augmentation process itself creates a wide variety of samples, we isolated the specific effect of generative sampling by performing inference solely on the original problem instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe that accuracy consistently improves as the model generates more parallel samples. This trend mirrors observations in Section 4.2, suggesting that increased inference-time compute through width scaling allows the model to explore more plausible reasoning trajectories and recover from initial errors, eventually leading to more robust solution discovery.
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine higher levels of augmentation with extensive parallel sampling? To address this, we conducted an ablation study examining the interaction between data augmentation and inference-time sampling. Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50).
Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count: accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+[Figure 13 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; legend: TRM, GRAM ( =0.1), GRAM ( =0.05), GRAM ( =0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+[Figure 14 plot: Accuracy vs. Number of Samples (N), N from 1 to 500; legend: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency. 
With limited augmentation (Aug=0), parallel +sampling provides consistent gains. As augmentation increases, sampling benefits diminish—at Aug= 50, +performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary +roles in capturing solution diversity. + + + +D.3 Solution Coverage Analysis + +We analyze the ability of GRAM to capture the diversity of the solution space compared to determin- +istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with +respect to the total number of valid ground-truth solutions. +As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in +coverage as the number of possible solutions increases. Since these models are constrained to a single +fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode +collapse in multi-solution landscapes. +In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range +of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage +improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms +that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces +where multiple valid reasoning paths exist. + + +D.4 Additional Generated Image Samples + +In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon- +ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized +MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16. 
+
+[Figure 15 plots: (a) N-Queens 8 × 8, (b) N-Queens 10 × 10; x-axis: Number of Solutions; y-axis: Coverage; legend: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+increases.
+
+As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+A particularly compelling property of this process is the model’s ability to recover from initially
+ambiguous or incorrect formations.
+For instance, in the second row (generating the digit '2') and the last row (generating the digit '1'),
+the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+[Figure 16 columns: t=0, t=1, t=2, t=4, t=6, t=7, t=9, t=11, t=12, t=14, t=16]
+Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated
+unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample,
+visualized across its recursive refinement process.
+
+D.5 Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. 
Unlike the +conditional Sudoku-solving setting, where the input board contains given clues, the model receives an +entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate +each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box +must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated +boards. To check whether high validity comes from repeatedly producing the same board, we also +compute the fraction of unique boards among valid samples. +For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku +benchmark used by HRM and TRM. We sample 50K complete solutions from the original training +split, discard the clue patterns, and use an all-blank board as input with the complete solution as +the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200 +epochs with learning rate 10−4 , EMA decay 0.999, and KL coefficient 0.05. The resulting model +contains 10.9M parameters and uses 16 inference steps. +For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM- +Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M +parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention +heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and +generate boards with 1000 denoising steps. +As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The +strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame- +ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps. +In all cases, the valid samples are unique under exact board matching, indicating that the reported +validity is not due to simple repetition of a small set of boards. 
These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+
+  Method                 #Params   Steps   Validity (%)
+  D3PM-Uniform (Big)     55.1M     1000    91.33
+  D3PM-Uniform (Small)   15.9M     1000    29.24
+  D3PM-Absorb (Big)      55.1M     1000    79.18
+  D3PM-Absorb (Small)    15.9M     1000    21.88
+  GRAM (Ours)            10.9M     16      99.05
+
+[Figure 17 panels: Empty input (blank 9 × 9 board), Valid sample, Invalid sample]
+
+  Valid sample           Invalid sample
+  3 6 5 4 7 8 9 2 1      4 3 8 2 9 1 7 6 5
+  9 4 1 2 5 6 8 7 3      5 2 1 7 3 6 4 9 8
+  2 7 8 9 1 3 6 5 4      7 6 9 4 5 8 1 2 3
+  5 2 9 8 4 7 3 1 6      9 1 3 4 4 7 5 8 6
+  7 3 6 1 9 2 5 4 8      2 5 4 6 8 9 3 1 7
+  1 8 4 3 6 5 2 9 7      6 8 7 5 1 3 2 4 9
+  8 9 7 5 3 4 1 6 2      3 7 2 9 6 4 8 5 1
+  4 1 3 6 2 9 7 8 5      1 9 5 3 2 8 6 7 4
+  6 5 2 7 8 1 4 3 9      8 4 6 1 7 5 9 3 2
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+D.6 Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. 
For visualization, we project these latent vectors into 2D using PCA [59] and +interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape. +Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single +deterministic path from initialization to solution, offering no mechanism to escape if the trajectory +enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different +regions of latent space before converging. While some trajectories become trapped in local minima +(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions). +This diversity enables GRAM to discover valid solutions more reliably through parallel exploration. + + + + +Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot +indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high +loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with +no ability to escape suboptimal trajectories. + + + + + 25 +Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as +Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration +of the latent space: while some trajectories converge to local minima (right bottom), others successfully reach +the global optimum (left middle), demonstrating how parallel sampling improves solution discovery. + + + + + 26 +Licenses +Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public +reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances +are generated by the authors and are therefore not external assets. 
+
+  Asset | Use in this paper | License / terms | Source link
+  MNIST | Binarized MNIST generation experiments | Creative Commons Attribution-Share Alike 3.0 | https://keras.io/api/datasets/mnist/
+  ARC-AGI-1 / original ARC | ARC-AGI reasoning benchmark | Apache License 2.0 | https://github.com/fchollet/ARC-AGI
+  ARC-AGI-2 | ARC-AGI-2 reasoning benchmark / reference results | Apache License 2.0 | https://github.com/arcprize/ARC-AGI-2
+  HRM repository | HRM baseline and Sudoku-Extreme-related reference implementation | Apache License 2.0 | https://github.com/sapientinc/HRM
+  TinyRecursiveModels / TRM repository | TRM baseline and recursive reasoning reference implementation | MIT License | https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+  MDLM repository | Masked diffusion baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/kuleshov-group/mdlm
+  Google Research D3PM implementation | D3PM image-generation baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+  Looped Transformer repository | Looped Transformer baseline reference implementation, if public code is used | MIT License | https://github.com/Leiay/looped_transformer
+  N-Queens | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+  Graph Coloring | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+
\ No newline at end of file diff --git a/papers/txt/hrm2025_hierarchical_reasoning.txt b/papers/txt/hrm2025_hierarchical_reasoning.txt new file mode 100644 index 0000000..3cc1f2e --- /dev/null +++ b/papers/txt/hrm2025_hierarchical_reasoning.txt @@ -0,0 +1,1302 @@ + Hierarchical Reasoning Model + Guan Wang1,† , Jin Li1 , Yuhao Sun1 , Xing Chen1 , Changling Liu1 , + Yue Wu1 , Meng Lu1,† , Sen Song2,† , Yasin Abbasi Yadkori1,† + 1 + Sapient Intelligence, Singapore + + + + Abstract + + Reasoning, the process of devising and executing complex goal-oriented action sequences, +arXiv:2506.21734v3 [cs.AI] 4 Aug 2025 + + + + + remains a critical challenge in AI. Current large language models (LLMs) primarily employ + Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive + data requirements, and high latency. Inspired by the hierarchical and multi-timescale pro- + cessing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel + recurrent architecture that attains significant computational depth while maintaining both train- + ing stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass + without explicit supervision of the intermediate process, through two interdependent recurrent + modules: a high-level module responsible for slow, abstract planning, and a low-level mod- + ule handling rapid, detailed computations. With only 27 million parameters, HRM achieves + exceptional performance on complex reasoning tasks using only 1000 training samples. The + model operates without pre-training or CoT data, yet achieves nearly perfect performance on + challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. + Furthermore, HRM outperforms much larger models with significantly longer context windows + on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial + general intelligence capabilities. 
These results underscore HRM’s potential as a transformative
+ advancement toward universal computation and general-purpose reasoning systems.
+
+[Figure 1 (right) panels: ARC-AGI-1 (960 training examples), ARC-AGI-2 (1120 training examples), Sudoku-Extreme (9x9) (1000 training examples), Maze-Hard (30x30) (1000 training examples); y-axis: Accuracy %; bars: Deepseek R1, Claude 3.7 8K, o3-mini-high (chain-of-thought, pretrained) and Direct pred, HRM (direct prediction, small-sample learning)]
+ Figure 1: Left: HRM is inspired by hierarchical processing and temporal separation in the brain. It
+ has two recurrent networks operating at different timescales to collaboratively solve tasks. Right:
+ With only about 1000 training examples, the HRM (~27M parameters) surpasses state-of-the-art
+ CoT models on inductive benchmarks (ARC-AGI) and challenging symbolic tree-search puzzles
+ (Sudoku-Extreme, Maze-Hard) where CoT models failed completely. The HRM was randomly
+ initialized, and it solved the tasks directly from inputs without chain of thoughts.
+
+ 2 Tsinghua University † Corresponding author. Contact: research@sapient.inc.
+ Code available at: github.com/sapientinc/HRM
+
+1 Introduction
+Deep learning, as its name suggests, emerged from the idea of stacking more layers to achieve
+increased representation power and improved performance 1,2 . However, despite the remarkable
+success of large language models, their core architecture is paradoxically shallow 3 . 
This imposes
+a fundamental constraint on their most sought-after capability: reasoning. The fixed depth of
+standard Transformers places them in computational complexity classes such as AC^0 or TC^0 4 ,
+preventing them from solving problems that require polynomial time 5,6 . LLMs are not Turing-complete
+and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic
+reasoning that is necessary for deliberate planning or symbolic manipulation tasks 7,8 . For example,
+our results on the Sudoku task show that increasing Transformer model depth can improve
+performance (see footnote 1), but performance remains far from optimal even with very deep models
+(see Figure 2), which supports the conjectured limitations of the LLM scaling paradigm 9 .
+The LLM literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning 10 .
+CoT externalizes reasoning into token-level language by breaking down complex tasks into simpler
+intermediate steps, sequentially generating text using a shallow model 11 . However, CoT for
+reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions
+where a single misstep or a misordering of the steps can derail the reasoning process entirely 12,13 . This
+dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result,
+CoT reasoning often requires a significant amount of training data and generates a large number of
+tokens for complex reasoning tasks, resulting in slow response times. A more efficient approach is
+needed to minimize these data requirements 14 .
+Towards this goal, we explore “latent reasoning”, where the model conducts computations within
+its internal hidden state space 15,16 . 
This aligns with the understanding that language is a tool for +human communication, not the substrate of thought itself 17 ; the brain sustains lengthy, coherent +chains of reasoning with remarkable efficiency in a latent space, without constant translation back +to language. However, the power of latent reasoning is still fundamentally constrained by a model’s +effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gra- +dients, which plague training stability and effectiveness 1,18 . Recurrent architectures, a natural al- +ternative for sequential tasks, often suffer from early convergence, rendering subsequent computa- +tional steps inert, and rely on the biologically implausible, computationally expensive and memory +intensive Backpropagation Through Time (BPTT) for training 19 . +The human brain provides a compelling blueprint for achieving the effective computational depth +that contemporary artificial models lack. It organizes computation hierarchically across corti- +cal regions operating at different timescales, enabling deep, multi-stage reasoning 20,21,22 . Recur- +rent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to +guide, and fast, lower-level circuits to execute—subordinate processing while preserving global +coherence 23,24,25 . Notably, the brain achieves such depth without incurring the prohibitive credit- +assignment costs that typically hamper recurrent networks from backpropagation through time 19,26 . +Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierar- +chical Reasoning Model (HRM). HRM is designed to significantly increase the effective compu- +tational depth. It features two coupled recurrent modules: a high-level (H) module for abstract, +deliberate reasoning, and a low-level (L) module for fast, detailed computations. 
This structure
+avoids the rapid convergence of standard recurrent models through a process we term “hierarchical
+convergence.” The slow-updating H-module advances only after the fast-updating L-module
+has completed multiple computational steps and reached a local equilibrium, at which point the
+L-module is reset to begin a new computational phase.
+
+ 1 Simply increasing the model width does not improve performance here.
+
+[Figure 2 plots: left, Accuracy % vs. Parameters (27M to 872M) for "Scaling Width - 8 layers fixed" and "Scaling Depth - 512 hidden size fixed"; right, Accuracy % vs. Depth / Transformer layers computed (8 to 512) for Transformer, Recurrent Transformer, and HRM]
+Figure 2: The necessity of depth for complex reasoning. Left: On Sudoku-Extreme Full, which
+requires extensive tree-search and backtracking, increasing a Transformer’s width yields no
+performance gain, while increasing depth is critical. Right: Standard architectures saturate, failing to
+benefit from increased depth. HRM overcomes this fundamental limitation, effectively using its
+computational depth to achieve near-perfect accuracy.
+
+Furthermore, we propose a one-step gradient approximation for training HRM, which offers
+improved efficiency and eliminates the requirement for BPTT. This design maintains a constant
+memory footprint (O(1) compared to BPTT’s O(T) for T timesteps) throughout the backpropagation
+process, making it scalable and more biologically plausible.
+Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and
+backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervision,
+HRM learns to solve problems that are intractable for even the most advanced LLMs. For
+example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and
+optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0%
+accuracy). 
In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark +of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 exam- +ples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance +of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) +and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and con- +text lengths, as shown in Figure 1. This represents a promising direction toward the development +of next-generation AI reasoning systems with universal computational capabilities. + + +2 Hierarchical Reasoning Model +We present the HRM, inspired by three fundamental principles of neural computation observed in +the brain: +• Hierarchical processing: The brain processes information across a hierarchy of cortical ar- + eas. Higher-level areas integrate information over longer timescales and form abstract repre- + sentations, while lower-level areas handle more immediate, detailed sensory and motor process- + ing 20,22,21 . + + 3 +• Temporal Separation: These hierarchical levels in the brain operate at distinct intrinsic timescales, + reflected in neural rhythms (e.g., slow theta waves, 4–8 Hz and fast gamma waves, 30–100 + Hz) 30,31 . This separation allows for stable, high-level guidance of rapid, low-level computa- + tions 32,33 . +• Recurrent Connectivity: The brain features extensive recurrent connections. These feedback + loops enable iterative refinement, yielding more accurate and context-sensitive representations + at the cost of additional processing time. Additionally, the brain largely avoids the problematic + deep credit assignment problem associated with BPTT 19 . +The HRM model consists of four learnable components: an input network fI (·; θI ), a low-level re- +current module fL (·; θL ), a high-level recurrent module fH (·; θH ), and an output network fO (·; θO ). 
+The model’s dynamics unfold over N high-level cycles of T low-level timesteps each (see footnote 2).
+We index the total timesteps of one forward pass by i = 1, ..., N × T. The modules $f_L$ and $f_H$
+each keep a hidden state, $z_L^i$ for $f_L$ and $z_H^i$ for $f_H$, which are initialized with the
+vectors $z_L^0$ and $z_H^0$, respectively.
+The HRM maps an input vector x to an output prediction vector ŷ as follows. First, the input x is
+projected into a working representation x̃ by the input network:
+
+    \tilde{x} = f_I(x; \theta_I).
+
+At each timestep i, the L-module updates its state conditioned on its own previous state, the
+H-module’s current state (which remains fixed throughout the cycle), and the input representation.
+The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final
+state at the end of that cycle:
+
+    z_L^i = f_L(z_L^{i-1}, z_H^{i-1}, \tilde{x}; \theta_L),
+
+    z_H^i = \begin{cases}
+        f_H(z_H^{i-1}, z_L^{i-1}; \theta_H) & \text{if } i \equiv 0 \pmod{T}, \\
+        z_H^{i-1} & \text{otherwise}.
+    \end{cases}
+
+Finally, after N full cycles, a prediction ŷ is extracted from the hidden state of the H-module:
+
+    \hat{y} = f_O(z_H^{NT}; \theta_O).
+
+This entire NT-timestep process represents a single forward pass of the HRM. A halting mechanism
+(detailed later in this section) determines whether the model should terminate, in which case
+ŷ will be used as the final prediction, or continue with an additional forward pass.
+Hierarchical convergence Although convergence is crucial for recurrent networks, standard RNNs
+are fundamentally limited by their tendency to converge too early. As the hidden state settles toward
+a fixed point, update magnitudes shrink, effectively stalling subsequent computation and capping
+the network’s effective depth. To preserve computational power, we actually want convergence to
+proceed very slowly, but engineering that gradual approach is difficult, since pushing convergence
+too far edges the system toward instability. 
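The update schedule of a single forward pass (the L-module steps at every timestep, the H-module only when i ≡ 0 mod T) can be sketched as below. This is a minimal illustration rather than the authors' implementation: `hrm_forward` and the scalar toy modules `toy_f_L` and `toy_f_H` are hypothetical stand-ins for the learned networks, and both updates read the states from timestep i−1, following the equations as written.

```python
def hrm_forward(x_tilde, z_L0, z_H0, f_L, f_H, N, T):
    """Run one forward pass of N high-level cycles with T low-level steps each.

    Returns the final (z_L, z_H) and the timesteps at which H was updated.
    """
    z_L, z_H = z_L0, z_H0
    h_update_steps = []
    for i in range(1, N * T + 1):
        # Both updates read the states from timestep i-1, as in the equations.
        next_z_L = f_L(z_L, z_H, x_tilde)
        if i % T == 0:
            next_z_H = f_H(z_H, z_L)
            h_update_steps.append(i)
        else:
            next_z_H = z_H  # H is held fixed within a cycle
        z_L, z_H = next_z_L, next_z_H
    return z_L, z_H, h_update_steps


# Toy scalar modules (assumptions for illustration only).
def toy_f_L(z_L, z_H, x):
    return 0.5 * z_L + z_H + x  # contracts toward a fixed point set by z_H and x


def toy_f_H(z_H, z_L):
    return z_H + 1.0  # counts how many times the H-module fires


z_L, z_H, h_update_steps = hrm_forward(
    x_tilde=1.0, z_L0=0.0, z_H0=0.0, f_L=toy_f_L, f_H=toy_f_H, N=3, T=4
)
print(h_update_steps, z_H)
```

With N = 3 and T = 4 the H-module fires only at the end of each cycle, while the L-module takes all N × T steps. The pseudocode in Figure 4 appears to pass the just-updated L-state into the H-update instead; the two variants differ only in which L-state the H-module sees.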
+ 2 While inspired by temporal separation in the brain, our model’s “high-level” and “low-level” modules are
+conceptual abstractions and do not map directly to specific neural oscillation frequencies.
+
+[Figure 3 plots: top row, Forward residual vs. Step Index # for HRM (H and L modules) and for Recurrent Neural Net, and vs. Layer Index # for Deep Neural Net; bottom row, PCA trajectories (Principal Components) colored by Step Index # or Layer Index #]
+Figure 3: Comparison of forward residuals and PCA trajectories. HRM shows hierarchical
+convergence: the H-module steadily converges, while the L-module repeatedly converges within cycles
+before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid
+convergence with residuals quickly approaching zero. In contrast, the deep neural network
+experiences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
+
+HRM is explicitly designed to counteract this premature convergence through a process we term
+hierarchical convergence. During each cycle, the L-module (an RNN) exhibits stable convergence
+to a local equilibrium. This equilibrium, however, depends on the high-level state zH supplied
+during that cycle. After completing the T steps, the H-module incorporates the sub-computation’s
+outcome (the final state zL ) and performs its own update. This zH update establishes a fresh context
+for the L-module, essentially “restarting” its computational path and initiating a new convergence
+phase toward a different local equilibrium.
+This process allows the HRM to perform a sequence of distinct, stable, nested computations, where
+the H-module directs the overall problem-solving strategy and the L-module executes the intensive
+search or refinement required for each step. 
Although a standard RNN may approach convergence +within T iterations, the hierarchical convergence benefits from an enhanced effective depth of N T +steps. As empirically shown in Figure 3, this mechanism allows HRM both to maintain high +computational activity (forward residual) over many steps (in contrast to a standard RNN, whose +activity rapidly decays) and to enjoy stable convergence. This translates into better performance at +any computation depth, as illustrated in Figure 2. +Approximate gradient Recurrent models typically use BPTT to compute gradients. However, +BPTT requires storing the hidden states from the forward pass and then combining them with +gradients during the backward pass, which demands O(T ) memory for T timesteps. This heavy +memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large- +scale networks. Additionally, because retaining the full history trace through time is biologically +implausible, it is unlikely that the brain implements BPTT 19 . +Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state +sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a +mechanism could plausibly be implemented in the brain using only local learning rules 34,35 . Based + + 5 +on this finding, we propose a one-step approximation of the HRM gradient–using the gradient of +the last state of each module and treating other states as constant. The gradient path is, therefore, + + Output head → final state of the H-module → final state of the L-module → input embedding + +The above method needs O(1) memory, does not require unrolling through time, and can be easily +implemented with an autograd framework such as PyTorch, as shown in Figure 4. 
Given that
+each module only needs to back-propagate errors through its most recent local synaptic activity,
+this approach aligns well with the perspective that cortical credit assignment relies on short-range,
+temporally local mechanisms rather than on a global replay of activity patterns.

 [Figure 4, top: diagram of HRM with approximate gradient.]

    def hrm(z, x, N=2, T=2):
        x = input_embedding(x)
        zH, zL = z
        with torch.no_grad():
            for _i in range(N * T - 1):
                zL = L_net(zL, zH, x)
                if (_i + 1) % T == 0:
                    zH = H_net(zH, zL)
        # 1-step grad
        zL = L_net(zL, zH, x)
        zH = H_net(zH, zL)
        return (zH, zL), output_head(zH)

    # Deep Supervision
    for x, y_true in train_dataloader:
        z = z_init
        for step in range(N_supervision):
            z, y_hat = hrm(z, x)
            loss = softmax_cross_entropy(y_hat, y_true)
            z = z.detach()
            loss.backward()
            opt.step()
            opt.zero_grad()

+Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep
+supervision training in PyTorch.

+The one-step gradient approximation is theoretically grounded in the mathematics of Deep
+Equilibrium Models (DEQ) 36, which employ the Implicit Function Theorem (IFT) to bypass BPTT, as
+detailed next. Consider an idealized HRM behavior where, during high-level cycle $k$, the L-module
+repeatedly updates until its state $z_L$ converges to a local fixed point $z_L^\star$. This fixed
+point, given the current high-level state $z_H^{k-1}$, can be expressed as

    $$ z_L^\star = f_L(z_L^\star, z_H^{k-1}, \tilde{x}; \theta_L) . $$

+The H-module then performs a single update using this converged L-state:

    $$ z_H^{k} = f_H(z_H^{k-1}, z_L^\star; \theta_H) . $$

+With a proper mapping $\mathcal{F}$, the updates to the high-level state can be written in a more
+compact form as $z_H^{k} = \mathcal{F}(z_H^{k-1}; \tilde{x}, \theta)$, where
+$\theta = (\theta_I, \theta_L)$, and the fixed point can be written as
+$z_H^\star = \mathcal{F}(z_H^\star; \tilde{x}, \theta)$. Let
+$J_{\mathcal{F}} = \partial \mathcal{F} / \partial z_H$ be the Jacobian of $\mathcal{F}$, and assume
+that the matrix $I - J_{\mathcal{F}}$ is invertible at $z_H^\star$ and that the mapping $\mathcal{F}$
+is continuously differentiable. The Implicit Function Theorem then allows us to calculate the exact
+gradient of the fixed point $z_H^\star$ with respect to the parameters $\theta$ without explicit
+backpropagation:

    $$ \frac{\partial z_H^\star}{\partial \theta} = \left( I - J_{\mathcal{F}} \big|_{z_H^\star} \right)^{-1} \frac{\partial \mathcal{F}}{\partial \theta} \bigg|_{z_H^\star} . \qquad (1) $$

+Calculating the above gradient requires evaluating and inverting the matrix $(I - J_{\mathcal{F}})$,
+which can be computationally expensive. Given the Neumann series expansion,

    $$ (I - J_{\mathcal{F}})^{-1} = I + J_{\mathcal{F}} + J_{\mathcal{F}}^2 + J_{\mathcal{F}}^3 + \dots , $$

+the so-called 1-step gradient 37 approximates the series by considering only its first term, i.e.
+$(I - J_{\mathcal{F}})^{-1} \approx I$, and leads to the following approximation of Equation (1):

    $$ \frac{\partial z_H^\star}{\partial \theta_H} \approx \frac{\partial f_H}{\partial \theta_H} , \qquad
       \frac{\partial z_H^\star}{\partial \theta_L} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_L} , \qquad
       \frac{\partial z_H^\star}{\partial \theta_I} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_I} . \qquad (2) $$

+The gradients of the low-level fixed point, $\partial z_L^\star / \partial \theta_L$ and
+$\partial z_L^\star / \partial \theta_I$, can also be approximated using another application of the
+1-step gradient:

    $$ \frac{\partial z_L^\star}{\partial \theta_L} \approx \frac{\partial f_L}{\partial \theta_L} , \qquad
       \frac{\partial z_L^\star}{\partial \theta_I} \approx \frac{\partial f_L}{\partial \theta_I} . \qquad (3) $$

+By substituting Equation (3) back into Equation (2), we arrive at the final simplified gradients.
+Before defining our loss function, we must first introduce two key elements of our proposed
+method: deep supervision and adaptive computational time.
+Deep supervision Inspired by the principle that periodic neural oscillations regulate when learning
+occurs in the brain 38, we incorporate a deep supervision mechanism into HRM, as detailed next.
+Given a data sample $(x, y)$, we run multiple forward passes of the HRM model, each of which we
+refer to as a segment. Let $M$ denote the total number of segments executed before termination.
+For each segment $m \in \{1, \dots, M\}$, let $z^m = (z_H^{mNT}, z_L^{mNT})$ represent the hidden
+state at the conclusion of segment $m$, encompassing both high-level and low-level state components.
+At each segment $m$, we apply a deep supervision step as follows:
 1. Given the state $z^{m-1}$ from the previous segment, compute the next state $z^m$ and its associated
    output $\hat{y}^m$ through a forward pass in the HRM model:

    $$ (z^m, \hat{y}^m) \leftarrow \mathrm{HRM}(z^{m-1}, x; \theta) $$

 2.
Compute the loss for the current segment:

    $$ L^m \leftarrow \mathrm{LOSS}(\hat{y}^m, y) $$

 3. Update parameters:

    $$ \theta \leftarrow \mathrm{OPTIMIZERSTEP}(\theta, \nabla_\theta L^m) $$

+The crucial aspect of this procedure is that the hidden state $z^m$ is “detached” from the
+computation graph before being used as the input state for the next segment. Consequently, gradients from
+segment $m + 1$ do not propagate back through segment $m$, effectively creating a 1-step
+approximation of the gradient of the recursive deep supervision process 39,40. This approach provides more
+frequent feedback to the H-module and serves as a regularization mechanism, demonstrating superior
+empirical performance and enhanced stability in deep equilibrium models when compared to
+more complex, Jacobian-based regularization techniques 39,41. Figure 4 shows pseudocode of deep
+supervision training.
+Adaptive computational time (ACT) The brain dynamically alternates between automatic thinking
+(“System 1”) and deliberate reasoning (“System 2”) 42. Neuroscientific evidence shows that
+these cognitive modes share overlapping neural circuits, particularly within regions such as the
+prefrontal cortex and the default mode network 43,44. This indicates that the brain dynamically
+modulates the “runtime” of these circuits according to task complexity and potential rewards 45,46.
+Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that
+enables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning
+algorithm 47 to adaptively determine the number of segments. A Q-head uses the final state of the
+H-module to predict the Q-values $\hat{Q}^m = (\hat{Q}^m_{\text{halt}}, \hat{Q}^m_{\text{continue}})$
+of the “halt” and “continue” actions:

    $$ \hat{Q}^m = \sigma(\theta_Q^\top z_H^{mNT}) , $$

+where $\sigma$ denotes the sigmoid function applied element-wise. The halt or continue action is chosen
+using a randomized strategy as detailed next.
Let $M_{\max}$ denote the maximum number of segments
+(a fixed hyperparameter) and $M_{\min}$ denote the minimum number of segments (a random variable).
+The value of $M_{\min}$ is determined stochastically: with probability $\varepsilon$, it is sampled
+uniformly from the set $\{2, \dots, M_{\max}\}$ (to encourage longer thinking), and with probability
+$1 - \varepsilon$, it is set to 1. The halt action is selected under two conditions: when the segment
+count surpasses the maximum threshold $M_{\max}$, or when the estimated halt value
+$\hat{Q}_{\text{halt}}$ exceeds the estimated continue value $\hat{Q}_{\text{continue}}$ and the
+segment count has reached at least the minimum threshold $M_{\min}$.
+The Q-head is updated through a Q-learning algorithm, which is defined on the following episodic
+Markov Decision Process (MDP). The state of the MDP at segment $m$ is $z^m$, and the action space
+is {halt, continue}. Choosing the action “halt” terminates the episode and returns a binary reward
+indicating prediction correctness, i.e., $\mathbb{1}\{\hat{y}^m = y\}$. Choosing “continue” yields a
+reward of 0 and the state transitions to $z^{m+1}$. Thus, the Q-learning targets for the two actions
+$\hat{G}^m = (\hat{G}^m_{\text{halt}}, \hat{G}^m_{\text{continue}})$ are given by

    $$ \hat{G}^m_{\text{halt}} = \mathbb{1}\{\hat{y}^m = y\} , $$

    $$ \hat{G}^m_{\text{continue}} =
       \begin{cases}
         \hat{Q}^{m+1}_{\text{halt}} , & \text{if } m \ge M_{\max} , \\
         \max(\hat{Q}^{m+1}_{\text{halt}}, \hat{Q}^{m+1}_{\text{continue}}) , & \text{otherwise} .
       \end{cases} $$

+We can now define the loss function of our learning procedure. The overall loss for each supervision
+segment combines both the Q-head loss and the sequence-to-sequence loss:

    $$ L^m_{\text{ACT}} = \mathrm{LOSS}(\hat{y}^m, y) + \mathrm{BINARYCROSSENTROPY}(\hat{Q}^m, \hat{G}^m) . $$

+Minimizing the above loss enables both accurate predictions and nearly optimal stopping decisions.
+Selecting the “halt” action ends the supervision loop. In practice, sequences are processed in
+batches, which can be easily handled by substituting any halted sample in the batch with a fresh
+sample from the dataloader.
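+The halting rule and the Q-learning targets described above can be sketched in plain Python. This
+is an illustrative sketch under stated assumptions, not the authors' code: the function names are
+hypothetical, the value of epsilon in the usage lines is made up, the threshold in the "continue"
+target is read as the segment limit Mmax, and "surpasses the maximum threshold" is implemented as
+"has reached Mmax".

```python
import random


def sample_min_segments(m_max, epsilon, rng):
    """Draw M_min: uniform on {2, ..., m_max} with probability epsilon, else 1."""
    if rng.random() < epsilon:
        return rng.randint(2, m_max)  # randint is inclusive on both ends
    return 1


def should_halt(m, q_halt, q_continue, m_min, m_max):
    """Halt at the segment limit, or when halting looks better and m >= M_min."""
    return m >= m_max or (q_halt > q_continue and m >= m_min)


def q_targets(m, prediction_correct, next_q_halt, next_q_continue, m_max):
    """Return the targets (G_halt, G_continue) for segment m."""
    g_halt = 1.0 if prediction_correct else 0.0       # binary correctness reward
    if m >= m_max:
        g_continue = next_q_halt                      # only halting remains possible
    else:
        g_continue = max(next_q_halt, next_q_continue)
    return g_halt, g_continue


# Usage with made-up numbers.
rng = random.Random(0)
m_min = sample_min_segments(m_max=8, epsilon=0.1, rng=rng)
print("sampled M_min:", m_min)
print("halt at m=1?", should_halt(m=1, q_halt=0.9, q_continue=0.2, m_min=2, m_max=8))
print("targets:", q_targets(m=3, prediction_correct=True,
                            next_q_halt=0.4, next_q_continue=0.7, m_max=8))
```

+Keeping the decision and the targets as separate pure functions makes it easy to check each branch
+of the rule in isolation before wiring it into a batched training loop.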

+Figure 5 presents a performance comparison between two HRM variants: one incorporating ACT
+and another employing a fixed computational step count equivalent to ACT’s Mmax parameter. It
+shows that ACT effectively adapts its computational resources based on task complexity, achieving
+significant computational savings with minimal impact on performance.
+Inference-time scaling An effective neural model should exploit additional computational
+resources during inference to enhance performance. As illustrated in Figure 5-(c), HRM seamlessly
+achieves inference-time scaling by simply increasing the computational limit parameter, Mmax,
+without requiring further training or architectural modifications.
+Additional compute is especially effective for tasks that demand deeper reasoning. On Sudoku,
+a problem that often requires long-term planning, HRM exhibits strong inference-time scaling.
+On the other hand, we find that extra computational resources yield minimal gains on the ARC-AGI
+challenge, as solutions generally require only a few transformations.

 [Figure 5 panels: (a) ACT Compute Spent: mean compute steps versus M (fixed) or Mmax (ACT); (b) ACT Performance: accuracy (%) versus M (fixed) or Mmax (ACT); (c) Inference-time scaling: accuracy (%) versus inference Mmax for models trained with Mmax = 2, 4, and 8.]

+Figure 5: Effectiveness of Adaptive Computation Time (ACT) on Sudoku-Extreme-Full. (a)
+Mean compute steps used by models with ACT versus models with a fixed number of compute steps
+(M ). ACT maintains a low and stable number of average compute steps even as the maximum limit
+(Mmax ) increases. (b) Accuracy comparison. The ACT model achieves performance comparable
+to the fixed-compute model while utilizing substantially fewer computational steps on average.
(c)
+Inference-time scalability. Models trained with a specific $M_{\max}$ can generalize to higher
+computational limits during inference, leading to improved accuracy. For example, a model trained with
+$M_{\max} = 8$ continues to see accuracy gains when run with $M_{\max} = 16$ during inference.

+Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is
+known to be prone to instability, often requiring stabilization techniques such as replay buffers
+and target networks 48, which are absent in our design. Our approach, however, achieves stability
+through the intrinsic properties of our model and training procedure. Recent theoretical work by
+Gallici et al. 49 shows that Q-learning can achieve convergence if network parameters are bounded,
+weight decay is incorporated during training, and post-normalization layers are implemented. Our
+model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a
+layer normalization variant) and the AdamW optimizer. AdamW has been shown to solve an
+$L_\infty$-constrained optimization problem, ensuring that model parameters remain bounded by
+$1/\lambda$ 50.
+Architectural details We employ a sequence-to-sequence architecture for HRM. Both input and
+output are represented as token sequences: $x = (x_1, \dots, x_l)$ and $y = (y_1, \dots, y_{l'})$
+respectively. The model includes an embedding layer $f_I$ that converts discrete tokens into vector
+representations, and an output head $f_O(z; \theta_O) = \mathrm{softmax}(\theta_O z)$ that transforms
+hidden states into token probability distributions $\hat{y}$. For small-sample experiments, we replace
+softmax with stablemax 51 to improve generalization performance. The sequence-to-sequence loss is
+the negative log-likelihood averaged over all tokens,

    $$ \mathrm{LOSS}(\hat{y}, y) = -\frac{1}{l'} \sum_{i=1}^{l'} \log p(y_i) , $$

+where $p(y_i)$ is the probability that distribution $\hat{y}_i$ assigns to token $y_i$.
The initial hidden states $z^0$ are initialized by sampling from a truncated normal distribution with
+standard deviation of 1, truncation of 2, and kept fixed throughout training.
+Both the low-level and high-level recurrent modules fL and fH are implemented using
+encoder-only Transformer 52 blocks with identical architectures and dimensions. These modules take
+multiple inputs, and we use straightforward element-wise addition to combine them; more
+sophisticated merging techniques, such as gating mechanisms, could potentially improve performance
+and are left for future work. For all Transformer blocks in this work, including those in
+the baseline models, we incorporate the enhancements found in modern LLMs (based on Llama 53
+architectures). These improvements include Rotary Positional Encoding 54, Gated Linear Units 55,
+RMSNorm 56, and the removal of bias terms from linear layers.
+Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture
+with weights initialized via truncated LeCun Normal initialization 57,58,59, while the scale and bias
+parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2
+optimizer 60, a scale-invariant variant of Adam 61, combined with a constant learning rate that includes
+linear warm-up.

 [Figure 6 panels: (a) ARC-AGI, (b) Sudoku-Hard, (c) Maze navigation, (d) Sudoku-Extreme subset difficulty.]

+Figure 6: Left: Visualization of benchmark tasks. Right: Difficulty of Sudoku-Extreme examples.


+3 Results
+This section begins by describing the ARC-AGI, Sudoku, and Maze benchmarks, followed by an
+overview of the baseline models and their results.
Figure 6-(a,b,c) presents a visual representa- +tion of the three benchmark tasks, which are selected to evaluate various reasoning abilities in AI +models. + +3.1 Benchmarks +ARC-AGI Challenge The ARC-AGI benchmark evaluates general fluid intelligence through IQ- +test-like puzzles that require inductive reasoning 27 . The initial version, ARC-AGI-1, presents chal- +lenges as input-label grid pairs that force AI systems to extract and generalize abstract rules from +just a few examples. Each task provides a few input–output demonstration pairs (usually 2–3) and +a test input. An AI model has two attempts to produce the correct output grid. Although some be- +lieve that mastering ARC-AGI would signal true artificial general intelligence, its primary purpose +is to expose the current roadblocks in AGI progress. In fact, both conventional deep learning meth- +ods and CoT techniques have faced significant challenges with ARC-AGI-1, primarily because it +requires the ability to generalize to entirely new tasks 28 . +Addressing the limitations identified in ARC-AGI-1, ARC-AGI-2 significantly expands the bench- +mark by providing a more comprehensive and carefully refined collection of tasks. These new +tasks emphasize deeper compositional reasoning, multi-step logic, contextual rule application, and +symbolic abstraction. Human calibration studies show these tasks are challenging but doable for +people, while being much harder for current AI systems, offering a clearer measure of general +reasoning abilities 29 . + + + 10 +Sudoku-Extreme Sudoku is a 9×9 logic puzzle, requiring each row, column, and 3×3 block to +contain the digits 1–9 exactly once. A prediction is considered correct if it exactly matches the +puzzle’s unique solution. Sudoku’s complex logical structure makes it a popular benchmark for +evaluating logical reasoning in machine learning 62,63,64 . 
+The most frequently used Sudoku dataset in research, namely the Kaggle dataset 65 , can be fully +solved using elementary single-digit techniques 66 . The minimal 17-clue puzzles 62 , another widely- +used collection, might seem more challenging due to its small number of clues. However, this +perception is misleading—since 17 represents the minimum number of clues required to guarantee +a unique Sudoku solution, these hints need to be highly orthogonal to each other. This orthogonal +arrangement leads to many direct, easily-resolved solution paths 67 . +We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforemen- +tioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally +difficult for human players: +• Easy puzzles compiled from Kaggle, 17-clue, plus unbiased samples from the Sudoku puzzle + distribution 67 : totaling 1 149 158 puzzles. +• Challenging puzzles compiled from Magictour 1465, Forum-Hard and Forum-Extreme subsets: + totaling 3 104 157 puzzles. +The compiled data then undergo a strict 90/10 train-test split, ensuring that the test set puzzles +cannot be derived through equivalent transformations of any training samples. Sudoku-Extreme is +a down-sampled subset of this data containing 1000 training examples. We use Sudoku-Extreme in +our main experiments (Figure 1), which focuses on small-sample learning scenarios. To guarantee +convergence and control overfitting effects in our analysis experiments (Figures 2, 3 and 5), we use +the complete training data, Sudoku-Extreme-Full, containing 3 831 994 examples. +We measure puzzle difficulty by counting the number of search backtracks (“guesses”) required +by a smart Sudoku solver program tdoku, which uses propositional logic to reduce the number of +guesses 67 . 
Our Sudoku-Extreme dataset exhibits a mean difficulty of 22 backtracks per puzzle, sig- +nificantly higher than existing datasets, including recent handmade puzzles Sudoku-Bench 68 which +average just 0.45 backtracks per puzzle. These subset complexity levels are shown in Figure 6-(d). +Maze-Hard This task involves finding the optimal path in a 30×30 maze, making it interpretable +and frequently used for training LLMs in search tasks 69,70,71 . We adopt the instance generation +procedure of Lehnert et al. 71 , but introduce an additional filter to retain only those instances whose +difficulty exceeds 110. Here, “difficulty” is defined as the length of the shortest path, which aligns +with the linear time complexity of the wavefront breadth-first search algorithm on GPUs 72 . A path +is considered correct if it is valid and optimal—that is, the shortest route from the start to the goal. +The training and test set both include 1000 examples. + +3.2 Evaluation Details +For all benchmarks, HRM models were initialized with random weights and trained in the sequence- +to-sequence setup using the input-output pairs. The two-dimensional input and output grids were +flattened and then padded to the maximum sequence length. The resulting performance is shown in +Figure 1. Remarkably, HRM attains these results with just ~1000 training examples per task—and +without pretraining or CoT labels. + + + 11 +For ARC-AGI challenge, we start with (1) all demonstration and test input-label pairs from the +training set, and (2) all demonstration pairs along with test inputs from the evaluation set. The +dataset is augmented by applying translations, rotations, flips, and color permutations to the puz- +zles. Each task example is prepended with a learnable special token that represents the puzzle it +belongs to. 
At test time, we proceed as follows for each test input in the evaluation set: (1) Gener- +ate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to +obtain a prediction. (2) Choose the two most popular predictions as the final outputs.3 All reported +results are obtained by comparing the outputs with the withheld test labels from the evaluation set. +We augment Sudoku puzzles by applying band and digit permutations, while data augmentation is +disabled for Maze tasks. Both tasks undergo only a single inference pass. +For ARC-AGI, the scores of the CoT models are taken from the official leaderboard 29 , while for +Sudoku and Maze, the scores are obtained by evaluating through the corresponding API. +In Figure 1, the baselines are grouped based on whether they are pre-trained and use CoT, or neither. +The “Direct pred” baseline means using “direct prediction without CoT and pre-training”, which +retains the exact training setup of HRM but swaps in a Transformer architecture. Interestingly, on +ARC-AGI-1, “Direct pred” matches the performance of Liao and Gu 73 , who built a carefully de- +signed, domain-specific equivariant network for learning the ARC-AGI task from scratch, without +pre-training. By substituting the Transformer architecture with HRM’s hierarchical framework and +implementing ACT, we achieve more than a twofold performance improvement. +On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and the +baseline methods is significant, as the baselines almost never manage to solve the tasks. These +benchmarks that demand lengthy reasoning traces are particularly difficult for CoT-based methods. +With only 1000 training examples, the “Direct pred” baseline—which employs an 8-layer Trans- +former identical in size to HRM—fails entirely on these challenging reasoning problems. 
When
+trained on the larger Sudoku-Extreme-Full dataset, however, “Direct pred” can solve some easy
+Sudoku puzzles and reaches 16.9% accuracy (see Figure 2). Lehnert et al. 71 showed that a large
+vanilla Transformer model with 175M parameters, trained on 1 million examples across multiple
+trials, achieved only marginal success on 30×30 Maze tasks, with accuracy below 20% using the
+pass@64 evaluation metric.

+3.3 Visualization of intermediate timesteps
+Although HRM demonstrates strong performance on complex reasoning tasks, it raises an
+intriguing question: what underlying reasoning algorithms does the HRM neural network actually
+implement? Addressing this question is important for enhancing model interpretability and developing a
+deeper understanding of the HRM solution space.
+While a definitive answer lies beyond our current scope, we begin our investigation by analyzing
+state trajectories and their corresponding solution evolution. More specifically, at each timestep
+$i$ and given the low-level and high-level state pair ($z_L^i$ and $z_H^i$) we perform a preliminary
+forward pass through the H-module to obtain $\bar{z}^i = f_H(z_H^i, z_L^i; \theta_H)$ and its
+corresponding decoded prediction $\bar{y}^i = f_O(\bar{z}^i; \theta_O)$. The prediction $\bar{y}^i$
+is then visualized in Figure 7.

 3 The ARC-AGI allows two attempts for each test input.

+In the Maze task, HRM appears to initially explore several potential paths simultaneously,
+subsequently eliminating blocked or inefficient routes, then constructing a preliminary solution outline
+ + 12 + Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6 + + + + + Initial Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6 Timestep i = 7 + 4 89 + 2 4 3 5 7 1 8 9 6 2 4 3 5 7 1 8 9 6 2 4 3 5 7 1 8 9 3 2 4 3 5 7 1 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3 2 4 1 6 7 5 8 9 3 + 7 3 1 + 6 7 8 6 3 4 1 5 4 6 7 8 6 3 4 1 5 4 8 7 9 6 3 4 1 5 2 8 7 9 6 3 4 1 5 2 6 7 9 6 3 4 1 5 2 6 7 9 9 3 4 1 5 2 6 7 9 8 3 4 1 5 2 6 7 9 8 3 4 1 5 2 + 2 6 5 1 2 7 9 7 3 4 6 5 1 2 8 9 7 3 4 6 5 1 2 8 8 7 3 4 6 5 1 2 8 9 7 3 4 6 5 3 2 1 8 7 6 4 9 5 1 2 8 8 7 3 4 8 5 3 2 1 9 7 6 4 8 5 3 2 1 9 7 6 4 + 67 8 3 4 8 6 7 2 1 2 5 3 4 8 6 7 2 1 9 5 3 4 8 6 7 2 1 5 5 3 4 8 6 7 2 1 5 5 3 4 9 6 7 2 1 5 5 3 4 8 6 7 2 1 8 5 3 4 9 6 7 2 1 8 5 3 4 9 6 7 2 1 8 + 3 4 7 2 8 3 1 8 6 4 8 7 2 5 3 1 5 6 4 8 7 2 5 3 1 5 6 4 9 7 2 5 3 1 1 6 4 6 7 2 5 3 8 1 6 4 6 7 2 5 3 1 1 6 4 6 7 2 8 3 8 1 6 4 6 7 2 8 3 5 1 6 4 9 +1 64 23 15649237 9 19645237 7 19645237 7 19645237 7 19645238 7 19645237 7 19648237 5 19648237 5 + 27 3 8 1 2 7 9 3 4 6 6 9 1 2 7 4 3 9 8 6 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 6 8 9 1 2 7 4 3 5 8 6 9 1 2 7 4 3 5 8 6 9 1 2 7 4 3 5 8 6 +46 12 468128 7 3 7 465128 9 7 3 465128 9 7 3 468128 9 7 3 468128 9 7 3 468128 9 3 7 465128 9 3 7 465128 9 3 7 +3 7 6 1 3979 964 21 3875 564 21 3879 564 21 3879 564 21 3875 864 21 3875 864 21 3875 964 21 3875 964 21 +[7666fa5d] Example Input [7666fa5d] Example Output [7666fa5d] Test Input Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 + + + + + [7b80bb43] Test Input Timestep i = 0 Timestep i = 1 Timestep i = 2 Timestep i = 3 Timestep i = 4 Timestep i = 5 Timestep i = 6 + [7b80bb43] Example Input [7b80bb43] Example Output + + + + + Figure 7: Visualization of intermediate predictions by HRM on benchmark tasks. Top: Maze- + Hard—blue cells indicate the predicted path. 
Middle: Sudoku-Extreme—bold cells represent ini- + tial givens; red highlights cells violating Sudoku constraints; grey shading indicates changes from + the previous timestep. Bottom: ARC-AGI-2 Task—left: provided example input-output pair; right: + intermediate steps solving the test input. + + followed by multiple refinement iterations. In Sudoku, the strategy resembles a depth-first search + approach, where the model appears to explore potential solutions and backtracks when it hits dead + ends. HRM uses a different approach for ARC tasks, making incremental adjustments to the board + and iteratively improving it until reaching a solution. Unlike Sudoku, which involves frequent + backtracking, the ARC solution path follows a more consistent progression similar to hill-climbing + optimization. + Importantly, the model shows that it can adapt to different reasoning approaches, likely choosing an + effective strategy for each particular task. Further research is needed to gain more comprehensive + insights into these solution strategies. + + + 4 Brain Correspondence + A key principle from systems neuroscience is that a brain region’s functional repertoire—its ability + to handle diverse and complex tasks—is closely linked to the dimensionality of its neural represen- + tations 75,76 . Higher-order cortical areas, responsible for complex reasoning and decision-making, + must handle a wide variety of tasks, demanding more flexible and context-dependent processing 77 . + In dynamical systems, this flexibility is often realized through higher-dimensional state-space tra- + jectories, which allow for a richer repertoire of potential computations 78 . 
This principle gives rise
+to an observable dimensionality hierarchy, where a region’s position in the processing hierarchy
+correlates with its effective dimensionality. To quantify this phenomenon, we can examine the
+Participation Ratio (PR), which serves as a standard measure of the effective dimensionality of a
+high-dimensional representation 79. The PR is calculated using the formula

    $$ \mathrm{PR} = \frac{\left( \sum_i \lambda_i \right)^2}{\sum_i \lambda_i^2} , $$

+where $\{\lambda_i\}$ are the eigenvalues of the covariance matrix of neural trajectories. Intuitively, a higher
+PR value signifies that variance is distributed more evenly across many dimensions, corresponding
+to a higher-dimensional representation. Conversely, a lower PR value indicates that variance is
+concentrated in only a few principal components, reflecting a more compact, lower-dimensional
+structure.

 [Figure 8 panels (a) to (f); panel (b) plots Participation Ratio (PR) against position in the hierarchy.]

+Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex. (a,b) are
+adapted from Posani et al. 74. (a) Anatomical illustration of mouse cortical areas, color-coded by
+functional modules. (b) Correlation between Participation Ratio (PR), a measure of effective neural
+dimensionality, and hierarchical position across different mouse cortical areas. Higher positions in
+the hierarchy (e.g., MOs, ACAd) exhibit significantly higher PR values compared to lower sensory
+areas (e.g., SSp-n), with a Spearman correlation coefficient of ρ = 0.79 (P = 0.0003). (c,d) Trained
+HRM. (c) PR scaling of the trained HRM with task diversity. The dimensionality of the high-level
+module (zH ) scales with the number of unique tasks (trajectories) included in the analysis,
+indicating an adaptive expansion of its representational capacity. In contrast, the low-level module’s
+(zL ) dimensionality remains stable. (d) PR values for the low-level (zL , PR = 30.22) and
+high-level (zH , PR = 89.95) modules of the trained HRM, computed from neural activity during 100
+unique Sudoku-solving trajectories. A clear dimensionality hierarchy is observed, with the
+high-level module operating in a substantially higher-dimensional space. (e,f) Analysis of Untrained
+Network. To verify that the dimensionality hierarchy is an emergent property of training, the same
+analyses were performed on an untrained HRM with random weights. (e) In contrast to the trained
+model’s scaling in (c), the dimensionality of both modules in the untrained model remains low and
+stable, failing to scale with the number of tasks. (f) Similarly, contrasting with the clear separation
+in (d), the PR values for the untrained model’s modules (zL , PR = 42.09; zH , PR = 40.75) are
+low and nearly identical, showing no evidence of hierarchical separation. This confirms that the
+observed hierarchical organization of dimensionality is a learned property that emerges through
+training, not an artifact of the model’s architecture.

+The dimensionality hierarchy can be observed, for example, in the mouse cortex, where the PR of
+population activity increases monotonically from low-level sensory areas to high-level associative
+areas, supporting this link between dimensionality and functional complexity 74 (Figure 8 (a,b)).
+We evaluated whether HRM reproduces this neuroscientific principle by calculating the PR for
+both recurrent modules after training on the Sudoku-Extreme-Full dataset. The PR computation
+used the covariance matrix derived from neural states gathered across multiple Sudoku-solving
+trajectories. The results show a striking parallel to the biological findings.
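+As a rough illustration of the Participation Ratio defined above, the following pure-Python sketch
+computes it from a list of state vectors. It is not the analysis code used for Figure 8; the function
+name and the toy inputs are hypothetical. It relies on two trace identities for a symmetric
+covariance matrix C: the sum of eigenvalues equals tr(C), and the sum of squared eigenvalues equals
+the sum of squared entries of C, so no eigendecomposition is needed.

```python
def participation_ratio(states):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 for the covariance of `states`.

    `states` is a sequence of equal-length vectors, one per recorded state.
    Assumes at least two states that are not all identical (otherwise the
    covariance is zero and the ratio is undefined).
    """
    n, d = len(states), len(states[0])
    mean = [sum(row[j] for row in states) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in states]
    # Sample covariance matrix (d x d), normalized by n - 1.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))                              # sum of eigenvalues
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))    # sum of squared eigenvalues
    return trace ** 2 / trace_sq


# Toy inputs: variance spread evenly over two axes versus confined to one axis.
spread = [(1, 0), (-1, 0), (0, 1), (0, -1)]
line = [(-1, 0), (0, 0), (1, 0)]
print("PR, two equal axes:", participation_ratio(spread))
print("PR, single axis:   ", participation_ratio(line))
```

+The two toy inputs illustrate the intuition given in the text: variance shared equally between two
+directions yields a PR near 2, while variance confined to one direction yields a PR near 1.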
The low-level module’s
+state (zL ) occupies a relatively small subspace with a participation ratio of 30.22, whereas the
+high-level module’s state (zH ) operates in a substantially larger subspace with a participation ratio of
+89.95, as shown in Figure 8(d). Furthermore, Figure 8(c) shows that increasing the number of
+unique tasks (trajectories) from 10 to 100 causes zH dimensionality to scale up accordingly, while
+zL dimensionality remains stable. These results suggest an emergent separation of representational
+capacity between the modules that parallels their functional roles.
+To confirm that this hierarchical organization is an emergent property of training, and not an artifact
+of the network’s architecture, we performed a control analysis: we initialized an identical HRM
+architecture with random weights and, without any training, measured the PR of its modules as the
+network processed the same task-specific inputs given to the trained model.
+The results, shown in Figure 8(e,f), reveal a stark contrast: the high-level and low-level modules of
+the untrained network exhibit no hierarchical separation, with their PR values remaining low and
+nearly indistinguishable from each other. This control analysis validates that the dimensionality
+hierarchy is an emergent property that arises as the model learns to perform complex reasoning.
+The high-to-low PR ratio in HRM (zH /zL ≈ 2.98) closely matches that measured in the mouse
+cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where
+last-layer features converge to a low-dimensional subspace 80,81,82. HRM therefore departs from the
+collapse pattern and instead fosters a high-dimensional representation in its higher module.
This +is significant because such representations are considered crucial for cognitive flexibility and are a +hallmark of higher-order brain regions like the prefrontal cortex (PFC), which is central to complex +reasoning. +This structural parallel suggests the model has discovered a fundamental organizational principle. +By learning to partition its representations into a high-capacity, high-dimensional subspace (zH ) + + 15 +and a more specialized, low-dimensional one (zL ), HRM autonomously discovers an organizational +principle that is thought to be fundamental for achieving robust and flexible reasoning in biological +systems. This provides a potential mechanistic explanation for the model’s success on complex, +long-horizon tasks that are intractable for models lacking such a differentiated internal structure. +We emphasize, however, that this evidence is correlational. While a causal link could be tested +via intervention (e.g., by constraining the H-module’s dimensionality), such methods are difficult +to interpret in deep learning due to potential confounding effects on the training process itself. +Thus, the causal necessity of this emergent hierarchy remains an important question for future +investigation. + + +5 Related Work +Reasoning and algorithm learning Given the central role of reasoning problems and their close +relation to algorithms, researchers have long explored neural architectures that enable algorithm +learning from training instances. This line of work includes Neural Turing Machines (NTM) 83 , +the Differentiable Neural Computer (DNC) 84 , and Neural GPUs 85 –all of which construct iterative +neural architectures that mimic computational hardware for algorithm execution, and are trained to +learn algorithms from data. Another notable work in this area is Recurrent Relational Networks +(RRN) 62 , which executes algorithms on graph representations through graph neural networks. 
+Recent studies have integrated algorithm learning approaches with Transformer-based architec-
+tures. Universal Transformers extend the standard Transformer model by introducing a recurrent
+loop over the layers and implementing an adaptive halting mechanism. Geiping et al. 86 demonstrate
+that looped Transformers can generalize to a larger number of recurrent steps during inference than
+what they were trained on. Shen et al. 16 propose adding continuous recurrent reasoning tokens
+to the Transformer. Finally, TransNAR 8 combines recurrent graph neural networks with language
+models.
+Building on the success of CoT-based reasoning, a line of work has introduced fine-tuning meth-
+ods that use reasoning paths from search algorithms (like A*) as SFT targets 87,71,70.
+We also mention adaptive halting mechanisms designed to allocate additional computational re-
+sources to more challenging problems. This includes the Adaptive Computation Time (ACT) for
+RNNs 88 and follow-up research like PonderNet 89, which aims to improve the stability of this allo-
+cation process.
+HRM further pushes the boundary of algorithm learning through a brain-inspired computational
+architecture that achieves exceptional data efficiency and model expressiveness, successfully dis-
+covering complex and diverse algorithms from just 1000 training examples.
+Brain-inspired reasoning architectures Developing a model with the reasoning power of the
+brain has long been a goal in brain-inspired computing. Spaun 90 is one notable example, which uses
+spiking neural networks to create distinct modules corresponding to brain regions like the visual
+cortex and prefrontal cortex. This design enables an architecture to perform a range of cognitive
+tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on hand-
+designed algorithms, which may limit its ability to learn new tasks.
Another significant model is the
+Tolman-Eichenbaum Machine (TEM) 91, which is inspired by the hippocampal-entorhinal system’s
+role in spatial and relational memory tasks. TEM proposes that medial entorhinal cells create a
+basis for structural knowledge, while hippocampal cells link this basis to sensory information. This
+allows TEM to generalize and explains the emergence of various cell types like grid, border, and
+place cells. Another approach involves neural sampling models 92, which view the neural signaling
+process as inference over a distribution, functioning similarly to a Boltzmann machine. These
+models often require hand-made rules to be set up for solving a specific reasoning task. In essence,
+while prior models are restricted to simple reasoning problems, HRM is designed to solve complex
+tasks that are hard for even advanced LLMs, without pre-training or task-specific manual design.
+Hierarchical memory The hierarchical multi-timescale structure also plays an important role in
+how the brain processes memory. Models such as Hierarchical Sequential Models 93 and Clockwork
+RNN 94 use multiple recurrent modules that operate at varying time scales to more effectively cap-
+ture long-range dependencies within sequences, thereby mitigating the forgetting issue in RNNs.
+Similar mechanisms have also been adopted in linear attention methods for memorizing long con-
+texts (see the Discussions section). Since HRM focuses on reasoning, full attention is applied for
+simplicity. Incorporating hierarchical memory into HRM could be a promising future direction.
+
+
+6 Discussions
+Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal
+Transformer 95, HRM is computationally universal when given sufficient memory and time.
In other words, it falls into the category of models that can simulate any Turing machine,
+overcoming the computational limitations of standard Transformers discussed previously in the in-
+troduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks,
+they suffer from premature convergence and memory-intensive BPTT. Therefore, in practice, their
+effective computational depth remains limited, though still deeper than that of a standard Trans-
+former. By resolving these two challenges and being equipped with adaptive computation, HRM
+could be trained on long reasoning processes, solve complex puzzles requiring intensive depth-first
+search and backtracking, and move closer to practical Turing-completeness.
+Reinforcement learning with chain-of-thought Beyond fine-tuning using human-annotated CoT,
+reinforcement learning (RL) represents another widely adopted training methodology. However,
+recent evidence suggests that RL primarily unlocks existing CoT-like capabilities rather than dis-
+covering fundamentally new reasoning mechanisms 96,97,98,99. Additionally, CoT-training with RL
+is known for its instability and data inefficiency, often requiring extensive exploration and careful
+reward design. In contrast, HRM takes feedback from dense gradient-based supervision rather than
+relying on a sparse reward signal. Moreover, HRM operates naturally in a continuous space, which
+is biologically plausible and avoids allocating the same computational resources to each token, even
+though tokens vary in their reasoning and planning complexity 16.
+Linear attention Recurrence has been explored not only for its capability in universal computa-
+tion, but also as a means to replace the attention mechanism in Transformers, which suffers from
+quadratic time and memory complexity 100.
Recurrent alternatives offer a more efficient design by
+processing input tokens sequentially and predicting the next token at each time step, similar to early
+RNN-based language models.
+Some linear-attention variants, such as Log-linear Attention 101, share an RNN-like state update that
+can be interpreted as propagating multi-timescale summary statistics, thereby retaining long-range
+context without the quadratic memory growth of standard self-attention. However, substituting the
+attention mechanism alone does not change the fact that Transformers are still fixed-depth and
+require CoT as a compensatory mechanism. Notably, linear attention can operate with a reduced
+key-value cache over extended contexts, making it more suitable for deployment on resource-
+constrained edge devices.
+
+
+7 Conclusion
+This work introduces the Hierarchical Reasoning Model, a brain-inspired architecture that lever-
+ages hierarchical structure and multi-timescale processing to achieve substantial computational
+depth without sacrificing training stability or efficiency. With only 27M parameters and train-
+ing on just 1000 examples, HRM effectively solves challenging reasoning problems such as ARC,
+Sudoku, and complex maze navigation, tasks that typically pose significant difficulties for contem-
+porary LLM and chain-of-thought models.
+Although the brain relies heavily on hierarchical structures to enable most cognitive processes,
+these concepts have largely remained confined to academic literature rather than being translated
+into practical applications. The prevailing AI approach continues to favor non-hierarchical models.
+Our results challenge this established paradigm and suggest that the Hierarchical Reasoning Model
+represents a viable alternative to the currently dominant chain-of-thought reasoning methods, ad-
+vancing toward a foundational framework capable of Turing-complete universal computation.
+Acknowledgements We thank Mingli Yuan, Ahmed Murtadha Hasan Mahyoub and Hengshuai +Yao for their insightful discussions and valuable feedback throughout the course of this work. + + + + + 18 +References + 1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. + http://www.deeplearningbook.org. + 2. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image + recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), + pages 770–778, 2015. + 3. Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold + circuits, 2023. + 4. Tom Bylander. Complexity results for planning. In Proceedings of the 12th International Joint + Conference on Artificial Intelligence - Volume 1, IJCAI’91, page 274–279, San Francisco, + CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558601600. + 5. William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In + Neural Information Processing Systems, 2023. + 6. David Chiang. Transformers in DLOGTIME-uniform TC0 . Transactions on Machine + Learning Research, 2025. + 7. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael + Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search + dynamics bootstrapping. In First Conference on Language Modeling, 2024. + 8. Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex + Vitvitskyi, Razvan Pascanu, and Petar Velivckovi’c. Transformers meet neural algorithmic + reasoners. ArXiv, abs/2406.09308, 2024. + 9. William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision + transformers. Transactions of the Association for Computational Linguistics, 11:531–545, + 2023. doi: 10.1162/tacl_a_00562. +10. Jason Wei, Yi Tay, et al. Chain-of-thought prompting elicits reasoning in large language + models, 2022. arXiv preprint arXiv:2201.11903. +11. 
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of + thought. In ICLR, 2024. +12. Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in + reasoning with large language models. ArXiv, abs/2402.08939, 2024. +13. Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer "attacks" on chain-of-thought + reasoning. In Annual Meeting of the Association for Computational Linguistics, 2024. +14. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius + Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data. + arXiv preprint arXiv:2211.04325, 2022. +15. Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, + Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive + survey on latent chain-of-thought reasoning, 2025. +16. Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. + Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.07423, 2024. + + + 19 +17. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a + tool for communication rather than thought. Nature, 630(8017):575–586, 2024. +18. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. + Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and + Machine Intelligence, 2024. +19. Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current + Opinion in Neurobiology, 55:82–89, 2019. ISSN 0959-4388. doi: https://doi.org/10.1016/j. + conb.2019.01.011. +20. John D Murray, Alberto Bernacchia, David J Freedman, Ranulfo Romo, Jonathan D Wallis, + Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, et al. + A hierarchy of intrinsic timescales across primate cortex. Nature neuroscience, 17(12):1661– + 1663, 2014. +21. 
Roxana Zeraati, Yan-Liang Shi, Nicholas A Steinmetz, Marc A Gieselmann, Alexander + Thiele, Tirin Moore, Anna Levina, and Tatiana A Engel. Intrinsic timescales in the + visual cortex change with selective attention and reflect spatial connectivity. Nature + communications, 14(1):1858, 2023. +22. Julia M Huntenburg, Pierre-Louis Bazin, and Daniel S Margulies. Large-scale gradients in + human cortical organization. Trends in cognitive sciences, 22(1):21–31, 2018. +23. Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by + feedforward and recurrent processing. Trends in neurosciences, 23(11):571–579, 2000. +24. Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J + Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012. +25. Klara Kaleb, Barbara Feulner, Juan Gallego, and Claudia Clopath. Feedback control guides + credit assignment in recurrent neural networks. Advances in Neural Information Processing + Systems, 37:5122–5144, 2024. +26. Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. + Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020. +27. François Chollet. On the measure of intelligence (abstraction and reasoning corpus), 2019. + arXiv preprint arXiv:1911.01547. +28. Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: + Technical report. ArXiv, abs/2412.04604, 2024. +29. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. +30. György Buzsáki. Gamma, alpha, delta, and theta oscillations govern cognitive processes. + International Journal of Psychophysiology, 39:241–248, 2000. +31. György Buzsáki. Rhythms of the Brain. Oxford university press, 2006. +32. Anja Pahor and Norbert Jaušovec. 
Theta–gamma cross-frequency coupling relates to the level + of human intelligence. Intelligence, 46:283–290, 2014. +33. Adriano BL Tort, Robert W Komorowski, Joseph R Manns, Nancy J Kopell, and Howard + Eichenbaum. Theta–gamma coupling increases during the learning of item–context + associations. Proceedings of the National Academy of Sciences, 106(49):20942–20947, 2009. + + + 20 +34. Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between + energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, + 2016. +35. Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert + Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent + networks of spiking neurons. Nature Communications, 11, 07 2020. doi: 10.1038/ + s41467-020-17236-y. +36. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in + Neural Information Processing Systems, pages 690–701, 2019. +37. Zhengyang Geng, Xinyu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training + implicit models. ArXiv, abs/2111.05177, 2021. +38. Katarina Begus and Elizabeth Bonawitz. The rhythm of learning: Theta oscillations as an + index of active learning in infancy. Developmental Cognitive Neuroscience, 45:100810, 2020. + ISSN 1878-9293. doi: https://doi.org/10.1016/j.dcn.2020.100810. +39. Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep Equilibrium + Optical Flow Estimation . In 2022 IEEE/CVF Conference on Computer Vision and Pattern + Recognition (CVPR), pages 610–620, 2022. +40. Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and + Thomas Moreau. Shine: Sharing the inverse estimate from the forward pass for bi-level + optimization and implicit models. ArXiv, abs/2106.00553, 2021. +41. Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by jacobian + regularization. 
In International Conference on Machine Learning, 2021. +42. Daniel Kahneman and P Egan. Thinking, fast and slow (farrar, straus and giroux, new york), + 2011. +43. Matthew D Lieberman. Social cognitive neuroscience: a review of core processes. Annu. Rev. + Psychol., 58(1):259–289, 2007. +44. Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. The brain’s default + network: anatomy, function, and relevance to disease. Annals of the new York Academy of + Sciences, 1124(1):1–38, 2008. +45. Marcus E Raichle. The brain’s default mode network. Annual review of neuroscience, 38(1): + 433–447, 2015. +46. Andrew Westbrook and Todd S Braver. Cognitive effort: A neuroeconomic approach. + Cognitive, Affective, & Behavioral Neuroscience, 15:395–415, 2015. +47. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT + Press, Cambridge, MA, 2018. +48. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan + Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. ArXiv, + abs/1312.5602, 2013. +49. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, + Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning, + 2025. + + + + 21 +50. Shuo Xie and Zhiyuan Li. Implicit bias of adamw: L inf norm constrained optimization. + ArXiv, abs/2404.04454, 2024. +51. Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the edge of + numerical stability. In The Thirteenth International Conference on Learning Representations, + 2025. +52. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural + information processing systems, pages 5998–6008, 2017. +53. Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta, + 2024. URL https://ai.meta.com/llama/. +54. 
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: + Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. +55. Noam M. Shazeer. Glu variants improve transformer. ArXiv, abs/2002.05202, 2020. +56. Biao Zhang and Rico Sennrich. Root mean square layer normalization. ArXiv, + abs/1910.07467, 2019. +57. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self- + normalizing neural networks. In Neural Information Processing Systems, 2017. +58. JAX Developers. jax.nn.initializers.lecun_normal. Google Research, 2025. URL + https://docs.jax.dev/en/latest/_autosummary/jax.nn.initializers.lecun_ + normal.html. Accessed June 22, 2025. +59. Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. + In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002. +60. Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, + Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and + Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. In Forty-first + International Conference on Machine Learning, 2024. +61. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. +62. Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Neural + Information Processing Systems, 2017. +63. Jieyi Long. Large language model guided tree-of-thought. ArXiv, abs/2305.08291, 2023. +64. Yilun Du, Jiayuan Mao, and Josh Tenenbaum. Learning iterative reasoning through energy + diffusion. ArXiv, abs/2406.11179, 2024. +65. Kyubyong Park. Can convolutional neural networks crack sudoku puzzles? https: + //github.com/Kyubyong/sudoku, 2018. +66. Single-digit techniques. https://hodoku.sourceforge.net/en/tech_singles.php. + Accessed: 2025-06-16. +67. Tom Dillion. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/ + tdoku/, 2025. +68. 
Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones. Sudoku-bench: + Evaluating creative reasoning with sudoku variants. arXiv preprint arXiv:2505.16135, 2025. +69. Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous + thought machines. arXiv preprint arXiv:2505.05522, 2025. + + 22 +70. DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. + Dualformer: Controllable fast and slow thinking by learning with randomized reasoning + traces, 2025. +71. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael + Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search + dynamics bootstrapping. In First Conference on Language Modeling, 2024. +72. Mubbasir Kapadia, Francisco Garcia, Cory D. Boatright, and Norman I. Badler. Dynamic + search on the gpu. In 2013 IEEE/RSJ International Conference on Intelligent Robots and + Systems, pages 3332–3337, 2013. doi: 10.1109/IROS.2013.6696830. +73. Isaac Liao and Albert Gu. Arc-agi without pretraining, 2025. URL https: + //iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_ + without_pretraining.html. +74. Lorenzo Posani, Shuqi Wang, Samuel P Muscinelli, Liam Paninski, and Stefano Fusi. + Rarely categorical, always high-dimensional: how the neural code changes along the cortical + hierarchy. bioRxiv, pages 2024–11, 2025. +75. Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K. + Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks. + Nature, 497:585–590, 2013. doi: 10.1038/nature12160. +76. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context- + dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, + 2013. doi: 10.1038/nature12742. +77. Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function. 
+ Annual Review of Neuroscience, 24(1):167–202, 2001. doi: 10.1146/annurev.neuro.24.1.167. +78. Wolfgang Maass. Real-time computing without stable states: a new framework for neural + computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi: + 10.1162/089976602760407955. +79. Ege Altan, Sara A. Solla, Lee E. Miller, and Eric J. Perreault. Estimating the dimensionality + of the manifold underlying multi-electrode neural recordings. PLoS Computational Biology, + 17(11):e1008591, 2021. doi: 10.1371/journal.pcbi.1008591. +80. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the + terminal phase of deep learning training. Proceedings of the National Academy of Sciences, + 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117. +81. Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su. Exploring deep neural networks via + layer–peeled model: Minority collapse in imbalanced training. Proceedings of the National + Academy of Sciences, 118(43):e2103091118, 2021. doi: 10.1073/pnas.2103091118. +82. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. + A geometric analysis of neural collapse with unconstrained features. In Advances in Neural + Information Processing Systems, volume 34 of NeurIPS, pages 29820–29834, 2021. +83. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014. +84. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka + Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John + Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. + Nature, 538(7626):471–476, 2016. + + 23 + 85. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016. + 86. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. + Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. 
Scaling up test-time + compute with latent reasoning: A recurrent depth approach, 2025. + 87. Tiedong Liu and Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic + tasks. ArXiv, abs/2305.14201, 2023. + 88. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv, + abs/1603.08983, 2016. + 89. Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. ArXiv, + abs/2107.05407, 2021. + 90. Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf, Yichuan + Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science, 338 + (6111):1202–1205, 2012. + 91. James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil + Burgess, and Timothy EJ Behrens. The tolman-eichenbaum machine: unifying space and + relational memory through generalization in the hippocampal formation. Cell, 183(5):1249– + 1263, 2020. + 92. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as + sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS + computational biology, 7(11):e1002211, 2011. + 93. Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term + dependencies. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural + Information Processing Systems, volume 8. MIT Press, 1995. + 94. Jan Koutník, Klaus Greff, Faustino J. Gomez, and Jürgen Schmidhuber. A clockwork rnn. In + International Conference on Machine Learning, 2014. + 95. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. + Universal transformers, 2018. arXiv preprint arXiv:1807.03819. + 96. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, + Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, + and Yelong Shen. 
Reinforcement learning for reasoning in large language models with one + training example, 2025. URL https://arxiv.org/abs/2504.20571. + 97. Niklas Muennighoff. s1: Simple test-time scaling. arXiv preprint arXiv:2502.23456, 2025. + 98. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, + Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng + Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025. + 99. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025. +100. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms + through structured state space duality. ArXiv, abs/2405.21060, 2024. +101. Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear + attention. arXiv preprint arXiv:2506.04761, 2025. + + + + + 24 +
\ No newline at end of file diff --git a/papers/txt/ptrm2025_probabilistic_trm.txt b/papers/txt/ptrm2025_probabilistic_trm.txt new file mode 100644 index 0000000..a00f31c --- /dev/null +++ b/papers/txt/ptrm2025_probabilistic_trm.txt @@ -0,0 +1,906 @@ + Probabilistic Tiny Recursive Model + + + Amin Sghaier Ali Parviz Alexia Jolicoeur-Martineau + Mila – Quebec AI Institute Mila – Quebec AI Institute Independent + ILLS & ETS Montreal + + {amin.sghaier, ali.parviz}@mila.quebec +arXiv:2605.19943v1 [cs.AI] 19 May 2026 + + + + + alexia.jolicoeur-martineau@mail.mcgill.ca + + + + Abstract + Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of + the parameters of modern large language models (LLMs) by iteratively refining a + latent state and final answer. While powerful, their deterministic recursion can lead + to convergence at suboptimal solutions, without escape mechanism. A common + workaround relies on task-specific input perturbations at test time combined with + answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task- + agnostic framework for test-time compute scaling that addresses this limitation + through stochastic exploration. PTRM injects Gaussian noise at each deep recursion + step, enabling parallel trajectories to explore diverse solution basins, and selects + among them using the model’s existing Q head (used for early stopping in the + original TRM). Without requiring retraining or task-specific augmentations, PTRM + enables substantial accuracy gains across benchmarks, including Sudoku-Extreme + (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to + 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs + (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters. 
+
+ [Figure 1 plot: two bar-chart panels, "PPBench Puzzles (sudoku, lightup, nurikabe, heyawake, and tapa)"
+ and "Sudoku-Extreme", with accuracy (%) on the vertical axis. Legend: chain-of-thought, pretrained;
+ LLM ensemble (best of 7 strongest LLMs, assumes access to a perfect verifier); direct prediction;
+ deterministic recursive prediction; probabilistic recursive prediction (ours).]
+ Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM
+ performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLM
+ by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36
+ points. On Sudoku-Extreme, PTRM reaches a state-of-the-art 98.75%.
+1 Introduction
+Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
+orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
+tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
+Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
+models. Rather than autoregressively generating chains of token-level reasoning, they recursively
+refine a latent state. This approach produces a single deterministic answer per input, fitting well with
+tasks where the answer is unique.
+Despite their strong performance, their deterministic inference does not make full use of their
+capabilities.
We show that many of TRM’s incorrect answers are from rollouts trapped in bad latent +space basins (i.e., regions of the latent space which decode to incorrect answers and from which the +deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work +on related models [5], suggests that TRM has the capabilities to solve significantly more problems +but is limited by its standard inference procedure. +Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This +is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique +answer. However, being non-deterministic, LLMs can be randomly sampled in order to form different +trajectories (including Chains of Thought and actual answer). By then selecting a trajectory using +a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage +test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time +scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic +decoded answer, and selecting among the answers using the model’s own Q head. +TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is +conventionally used only at training time for adaptive computation (ACT) [7]. It carries valuable +information that the standard inference procedure discards. +We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a +new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian +noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different +latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head +is used to select the one most likely to be correct. 
PTRM requires no training changes and no
+task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy
+gains across diverse reasoning benchmarks.
+
+2 Background: Tiny Recursive Model
+The Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a
+question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion
+consists of n updates to the latent state z followed by one update to the predicted answer y, all using
+the same two-layer network fθ: z ← fθ(x + y + z) n times, then y ← fθ(y + z).
+fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T
+latent recursions in sequence, with only the final one retaining gradients, allowing the model to
+leverage a large effective depth while keeping training efficient.
+Rather than doing one optimization step per sample, TRM is trained via deep supervision, which
+consists of keeping the previous latent state z and answer y as initialization (after being detached from
+the computational graph) for the next supervision step. This is done for up to Nsup supervision steps.
+The loss at each step is calculated using cross entropy between the predicted answer logits fO(y)
+(where fO is a linear output head) and the ground truth ytrue. This trains the network to progressively
+refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more
+steps than during training, providing a depth axis for test-time compute scaling (additional steps may
+correct otherwise-incorrect answers).
+Without a halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision
+steps rather than being replaced after each one. To avoid wasting compute on already-solved samples,
+an Adaptive Computation Time (ACT) halting mechanism is used.
This is done by adding a binary
+cross entropy loss between a halting logit q̂ = fQ (y) (where fQ is a linear Q head) and the binary exact
+correctness of the predicted answer ŷ = arg max fO (y): Lstep = CE(fO (y), ytrue) + BCE(q̂, 1[ŷ = ytrue]).
+The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5,
+improving data efficiency. During inference, the Q head is not used, and the model performs Nsup
+supervision steps to maximize answer correctness.
+While TRM is powerful, it sometimes gets stuck in incorrect solutions. In the next section, we
+investigate such failure cases in order to determine a way to remedy them.
+
+[Figure 2 plot omitted. Three columns: quick success, delayed success, failure. Top row: PC 1 vs. PC 2 projection of y. Bottom row: Q value (left axis) and cell accuracy (right axis) vs. supervision step. Legend: correct answer, incorrect answer, cell accuracy, start, end.]
+Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
+accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
+three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
+projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
+fade from light (early steps) to dark (later steps). A circle marks the start and a square marks the end.
+
+3 Problem: When Does TRM Fail?
+
+3.1 Analysis of failures and successes
+
+We present observations about TRM that motivate our method. 
In this section, we train a TRM on
+multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head
+behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent
+yt and the Q logit q̂t = fQ (yt) at every supervision step t = 1, . . . , Nsup, project the latents into
+the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction
+of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and
+Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe:
+Quick success: the trajectory transitions in a few steps from its starting location to a convergence
+region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima
+within the same few steps.
+Delayed success: the trajectory initially oscillates around one region and remains there for multiple
+supervision steps before sharply escaping to a different region where it converges. During the initial
+phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell
+accuracy spike together.
+Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never
+approaches 100%, and the Q value stays negative for all supervision steps.
+We refer to latent space regions that trajectories remain in across multiple supervision steps and
+exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are
+good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave
+similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories,
+when delayed successes find an escape to a good basin while failures remain stuck. 
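The three trajectory modes above can be told apart from the recorded Q logits alone. The following is a minimal sketch, not the authors' analysis code: the function name `classify_mode` and the `quick_steps` threshold are hypothetical choices made for illustration, and the rule (the Q logit turns positive and stays positive) is an assumption based on the qualitative description in this section.

```python
def classify_mode(q_logits, quick_steps=3):
    """Label a trajectory from its per-step Q logits.

    q_logits: list of Q logits, one per supervision step.
    quick_steps: hypothetical cutoff separating "quick" from "delayed" success.
    """
    settle = None
    for t in range(len(q_logits)):
        # First step from which the Q logit stays positive until the end.
        if all(q > 0 for q in q_logits[t:]):
            settle = t
            break
    if settle is None:
        return "failure"            # Q never settles above zero
    if settle < quick_steps:
        return "quick success"      # settles within the first few steps
    return "delayed success"        # long negative-Q phase, then escape


if __name__ == "__main__":
    print(classify_mode([-1.0, 2.0, 5.0, 6.0]))
    print(classify_mode([-3.0, -3.0, -2.0, -3.0, -2.0, 4.0, 6.0]))
    print(classify_mode([-3.0, -2.0, -4.0]))
```

A real analysis would more likely combine this with cell accuracy and the PCA trajectories, as Figure 2 does; the sign of the logit is used here only because it corresponds to the sigmoid(q̂) > 0.5 halting rule from Sec. 2.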
+
+3.2 The Q head tracks trajectory quality
+
+[Figure 3 plot omitted. Q value (left axis) and cell accuracy (right axis) vs. supervision step. Legend: Incorrect (28), Correct (69), cell accuracy (right axis).]
+Figure 3: Q value follows cell accuracy across reasoning. Mean Q value (solid, left axis) and mean
+cell accuracy (dashed, right axis) over supervision steps, aggregated over 100 PPBench validation
+puzzles, separated by final correctness (green: correct, red: incorrect).
+
+Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
+value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
+aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
+correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
+accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
+convergence, the Q logit sharply separates the two populations: q̂ ≈ +6 (sigmoid ≈ 1) for
+correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
+learned indicator of whether a trajectory has reached a good basin.
+Given the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
+can we leverage the Q head to identify better trajectories? The main challenge is that the standard
+TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
+problem. In the next section, we will show that by simply adding Gaussian noise to the latent state,
+we can sample different parallel trajectories and leverage the Q head to pick the best one. 
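The idea stated in the last sentence (perturb the latent with Gaussian noise, run several rollouts, keep the one the Q head scores highest) can be sketched on a toy recursion. This is an illustrative sketch under loud assumptions, not the paper's implementation: `rec`, `q_head`, and `decode` are hypothetical stand-ins for the TRM deep recursion step, the Q head fQ, and the output head fO, and the rollouts run sequentially rather than in parallel.

```python
import math
import random


def rec(x, z, y):
    """Toy stand-in for one deep recursion step (hypothetical, not the TRM network)."""
    z_new = [math.tanh(0.8 * zi + 0.5 * xi + 0.1 * yi) for zi, xi, yi in zip(z, x, y)]
    y_new = [0.5 * yi + 0.5 * zi for yi, zi in zip(y, z_new)]
    return z_new, y_new


def q_head(y, weights):
    """Toy linear Q head: a scalar logit computed from the answer latent y."""
    return sum(w * yi for w, yi in zip(weights, y))


def decode(y):
    """Toy output head: threshold each coordinate into a binary cell."""
    return tuple(1 if yi > 0 else 0 for yi in y)


def ptrm_infer(x, q_weights, K, D, sigma, seed=0):
    """Run K noisy rollouts of D steps; return the answer with the highest Q logit."""
    rng = random.Random(seed)
    answers, q_logits = [], []
    for _ in range(K):  # rollouts are independent; sequential here for clarity
        z = [0.0] * len(x)
        y = [0.0] * len(x)
        for _ in range(D):
            # Inject Gaussian noise into the latent before each recursion step.
            z = [zi + rng.gauss(0.0, sigma) for zi in z]
            z, y = rec(x, z, y)
        answers.append(decode(y))
        q_logits.append(q_head(y, q_weights))
    best = max(range(K), key=lambda k: q_logits[k])  # Q-head selection
    return answers[best], best, q_logits


if __name__ == "__main__":
    answer, best, q_logits = ptrm_infer([0.3, -0.2, 0.1], [1.0, -1.0, 0.5], K=8, D=3, sigma=0.5)
    print(answer, best)
```

Setting `sigma=0.0` makes every rollout identical, which recovers the deterministic single-trajectory behaviour that this section identifies as the limitation.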
+
+4 Method: Test-Time Compute Scaling via Stochastic Rollouts
+We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
+stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
+can be readily applied to any pretrained TRM model. Furthermore it requires no task-specific
+augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by
+σ) to the latent state input. The Q head fQ scores each candidate latent output, and the one with the
+highest Q value is selected and then decoded using the model’s output head fO. The algorithm in
+Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories
+to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
+axis for test-time scaling.
+
+4.1 Escaping bad basins
+
+In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
+latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
+for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
+K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
+projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
+a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
+observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
+at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
+needed to occasionally find an escape trajectory.
+
+PTRM Inference
+  Input: puzzle x, rollouts K, supervision steps D, noise scale σ
+  for k = 1, . . . , K in parallel do
+    Initialize z_0^{(k)}, y_0^{(k)}
+    for t = 1, . . . , D do
+      z_{t−1}^{(k)} += ϵ, ϵ ∼ N (0, σ^2 I)
+      z_t^{(k)}, y_t^{(k)} ← rec(x, z_{t−1}^{(k)}, y_{t−1}^{(k)})
+    end for
+    ŷ^{(k)} ← arg max fO (y_D^{(k)})
+    q̂^{(k)} ← fQ (y_D^{(k)})
+  end for
+  return ŷ^{(k∗)}, k∗ = arg max_k q̂^{(k)}
+
+[Figure 4 (right) diagram omitted. (a) Standard TRM (deterministic): puzzle, then D deep recursion steps along the depth axis, then answer. (b) PTRM (ours): K stochastic rollouts + Q-head selection; each of the rollouts k = 1, . . . , K along the width axis receives +ϵ Gaussian noise injection at each deep recursion step, and arg max_k Q_k gives the final answer.]
+Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
+PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
+rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
+
+4.2 Width scaling
+
+Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
+number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
+rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
+metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
+Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
+Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
+validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at
+K = 1 to 89.5% at K = 100, a gain of 13 percentage points. 
Across all tested K, the gap between +pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set. +By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same +range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct +solutions even when they are rare. +Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which +consists of running more deep recursions (supervision steps) at inference than the Nsup the model +was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises +PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only +provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth +and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are +independent and parallelizable while extra depth is sequential, width is the more practical scaling +axis. +PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next +section evaluates the method across multiple benchmarks and against several baselines, including +frontier LLMs. + +5 Experiments +This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against +the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs. +Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI 2 [4], +PTRM substantially boosts the performance of each pretrained TRM using only inference compute. 
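The three per-puzzle selection metrics from Sec. 4.2 (pass@K, best-Q@K, mode@K) can be made concrete with a small sketch. The function name `width_metrics` and the toy inputs are hypothetical; answers are assumed to be hashable (for example, a tuple of cell values), and ties in the Q logit or in the vote are broken by first occurrence, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def width_metrics(answers, q_logits, truth):
    """Score one puzzle given K rollouts.

    answers:  list of K decoded answers (hashable).
    q_logits: list of K Q-head logits, one per rollout.
    truth:    the ground-truth answer.
    Returns (pass_at_k, best_q_at_k, mode_at_k) as booleans.
    """
    correct = [a == truth for a in answers]
    # pass@K: oracle upper bound, true if any rollout is correct.
    pass_at_k = any(correct)
    # best-Q@K: is the rollout with the highest Q logit correct?
    best = max(range(len(answers)), key=lambda k: q_logits[k])
    best_q_at_k = correct[best]
    # mode@K: is the most frequent answer across rollouts correct?
    mode_answer, _ = Counter(answers).most_common(1)[0]
    mode_at_k = mode_answer == truth
    return pass_at_k, best_q_at_k, mode_at_k


if __name__ == "__main__":
    # One rare correct rollout ("B") with a high Q logit among three wrong ones.
    print(width_metrics(["A", "A", "B", "A"], [-2.0, -1.5, 3.0, -2.5], "B"))
```

The toy input mirrors the situation described in Sec. 4.2: the correct answer is rare, so majority voting misses it while Q-based selection finds it. Benchmark-level numbers would be the mean of these booleans over puzzles and seeds.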
+
+[Figure 5 plot omitted. PC 1 (53% var) vs. PC 2 (34% var). Legend: Correct (8), Incorrect (92), Start, End.]
+Figure 5: Stochastic rollouts escape bad basins. Principal plane projection of K = 100 independent
+rollouts of the same failed puzzle as in Figure 2 (right). 92 rollouts remain caught in the bad basin
+(red). 8 escape to a good basin and produce correct answers (green).
+
+[Figure 6 plot omitted. PPBench accuracy (%) vs. rollouts per puzzle K (log scale). Curves: pass@K, best-Q@K, mode@K.]
+Figure 6: Width scaling. pass@K, best-Q@K, and mode@K as K grows, averaged over 3 seeds on a
+held-out PPBench validation set. The Q head is a strong verifier on the tested puzzles, consistently
+outperforming selection of the most frequent answer.
+
+5.1 Setup
+
+Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
+(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
+selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
+fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
+and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
+nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
+We use the validation set to track performance during training and select the final checkpoint. We
+report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
+shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
+types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI 2 datasets.
+Models and inference. For each benchmark we use a standard TRM checkpoint. 
For Sudoku- + Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku), + and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running + D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step. + The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics + are averaged across three seeds. + Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone, + we report standard TRM performance (the same checkpoint as PTRM ran deterministically). For + each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and + ARC2 we additionally report the published direct prediction and TRM baselines from [1]. + Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s + wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9]) + so that cost = $2.50 · tpuzzle /3600, where tpuzzle is the time (in seconds) to complete a puzzle. + + + 5.2 Pencil Puzzle Bench + + 5.2.1 Per-puzzle accuracy + + Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2 + raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48) + gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes + from scaling width (stochastic rollouts). The largest improvements are on puzzle types where + + + 6 +the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to +97.8% and tapa from 40.0% to 80.0%. + + % accuracy # Params sudoku lightup nurikabe heyawake tapa agg. 
+ Direct prediction 27M 0.0 0.0 0.0 14.3 0.0 2.0 + TRM (K=1, D=16) 7M 46.7 87.5 74.1 85.7 40.0 62.6 + TRM (K=1, D=48) 7M 57.8 87.5 74.1 85.7 40.0 66.0 + PTRM, best-Q@K (K=100, D=16) 7M 93.3 100 88.9 85.7 80.0 89.8 + PTRM, best-Q@K (K=100, D=48) 7M 97.8 100 88.9 85.7 80.0 91.2 +Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as +the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points +over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises +accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer +trained on the same data. + + + +5.2.2 Comparison with frontier LLMs on golden set +PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response +from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct +and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM +substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively +verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while +it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn +verifier (the LLM could still self-verify by writing verification code within the same response). We +additionally observe that the agentic strategy was applied selectively in the published PPBench data: +across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried +with agentic. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our +golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high, +claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in +each strategy block. 
+We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved +if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against +PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under +any strategy, the ensemble counts it as solved, even though in practice we would not have access +to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that +even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble +cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct +divides total cost by the number of puzzles the ensemble solved. +Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 57 +points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%) +despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for +LLMs than PTRM. + +5.3 Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 + +For each benchmark we use the standard TRM checkpoint trained as described in [1] without +modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2). +Table 3 summarizes results on all three. +On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to +99.06% pass@K and 98.75% best-Q@K, achieving state of the art. +On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain +over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73% +(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows +that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on +the previous benchmarks. + + + 7 + % accuracy sudoku lightup nurikabe heyawake tapa agg. $/att. $/corr. 
+ Direct + gemini-3.1-pro 6.7 75.0 22.2 0.0 30.0 24.5 $0.40 $1.62 + gpt-5.2@xhigh 20.0 50.0 0.0 0.0 50.0 24.5 $1.79 $7.29 + claude-opus-4-6@thinking 0.0 87.5 44.4 0.0 60.0 34.7 $2.91 $8.40 + Any strategy (direct or agentic)† + gemini-3.1-pro 6.7 87.5 33.3 0.0 40.0 30.6 $10.38 $33.91 + gpt-5.2@xhigh 33.3 75.0 0.0 0.0 60.0 34.7 $3.09 $8.90 + claude-opus-4-6@thinking 0.0 87.5 44.4 0.0 70.0 36.7 $4.38 $11.92 + LLM ensemble† + Any strategy (direct or agentic) 46.7 100 44.4 0.0 80.0 55.1 $2.66 $38.51 + Ours, trained from scratch, 7M parameters + PTRM, best-Q@K 97.8 100 88.9 85.7 80.0 91.2 $0.001 $0.001 +Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt / +per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100 +wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate, +and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier. + + + + +On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them. +PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the +highest Q value within each augmentation, then vote across augmentations as in the standard pipeline. +With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to +15.97% over our deterministic TRM baseline, while matching it at pass@2. + + + Sudoku-Extreme Maze-Hard ARC-AGI-2 + Method # Params Acc. (%) Acc. (%) pass@1 pass@2 pass@100 + HRM 27M 55.0 74.5 – 5.0 – + TRM 5M / 7M† 87.4 85.3 – 7.8 – + Ours + Standard TRM, our reproduction 5M / 7M† 87.28 83.80 7.36 9.72 14.31 + PTRM 5M / 7M† 98.75 86.73 8.47 9.72 15.97 +Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100, +D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2. +pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. 
PTRM +shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M +for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att). + + +5.4 Q head selection as σ grows + +With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses. +For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises +pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset +we tested, it unlocks correct solutions well beyond the deterministic model’s reach. +TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku- +Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration +translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces +significantly more correct rollouts, but the existing Q head is not able to identify them, leaving +performance on the table. The gap between best-Q@K and pass@K represents headroom for a +stronger verifier which is left for future work. Appendix B reports the full σ sweep. + + + 8 +6 Related Work + +A long line of work explores recursive computation for iterative reasoning and representation re- +finement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep +Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation +steps to improve reasoning performance. More recent work has introduced methods to substantially +accelerate TRM training [15], while TRM-style recursive architectures have also been extended to +language modeling tasks [16]. +Building on this broader perspective of recursive computation, a growing body of work studies +latent-space reasoning through the reuse of hidden states. Hao et al. 
[17] propose continuous +“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively +generated and appended to the model context, enabling reasoning directly in latent space without +producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition +and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token +sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate +the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21]. +Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state +z is sampled instead of deterministic. This improves performance on multiple tasks, but requires +retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute +method where they only apply noise in the initial hidden state z, while we apply noise at every +supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme +dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that +Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no +improvement in accuracy on two datasets). +Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who +identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates +these attractors through a combination of task-specific training data augmentation, inference-time +input perturbations, and model bootstrapping across training checkpoints, thereby effectively in- +creasing test-time compute. However, these interventions are comparatively less general and less +computationally efficient. 
In contrast, we observe analogous basin structure in TRM across multiple +puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism: +injecting Gaussian noise into the latent state at each supervision step while using a single deterministic +checkpoint. + + +7 Conclusion + +In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for +Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales +test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth +scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on +task-specific data augmentations making it extremely easy to use and versatile. +By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide +selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM +nearly obtains twice the accuracy (91.2%; $0.001 cost) of ensemble of SOTA LLMs (55.1%; $38.51 +cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku (from 87.4% +to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI (from 7.8% to 8.47% pass@1). +Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test +on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited +computational resources. It is not guaranteed that the method works as well for all types of problems +(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller). +Future work. It would be interesting to understand why some puzzles benefit from test-time scaling +more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less +from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones. 
+Developing stronger verifiers than the existing Q head is an interesting direction for future work. + + + 9 +References + [1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv + preprint arXiv:2510.04871, 2025. + [2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and + Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025. + [3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. + [4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- + agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, + 2025. + [5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic + analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026. + [6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- + mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, + 2024. + [7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint + arXiv:1603.08983, 2016. + [8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv + preprint arXiv:2603.02119, 2026. + [9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026. + Accessed: 2026-05-01. +[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- + versal transformers. arXiv preprint arXiv:1807.03819, 2018. +[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, + Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic + recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025. 
+[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, + and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with + recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021. +[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, + and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation + without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242, + 2022. +[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning + of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems, + 37:97027–97052, 2024. +[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641, + 2026. +[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie, + Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation. + In Workshop on Latent {\&} Implicit Thinking {\textendash} Going Beyond CoT Reasoning, + 2026. +[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong + Tian. Training large language models to reason in a continuous latent space. arXiv preprint + arXiv:2412.06769, 2024. +[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, + Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. + Advances in neural information processing systems, 35:24824–24837, 2022. + + + 10 +[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning + by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint + arXiv:2505.12514, 2025. 
+[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny + Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring + faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023. +[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- + man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t + always say what they think. arXiv preprint arXiv:2505.05410, 2025. +[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive + reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026. +[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search: + Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Think- + ing – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id= + kKps9W1K7n. + + + + + 11 +A Implementation Details + +A.1 Compute + +We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no +additional training cost over standard TRM since it operates entirely at inference time. + +A.2 Models + +All experiments use the standard TRM backbone [1] with the released architecture and training recipes. +Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme +and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench. +Layout and hyperparameters are unchanged from TRM. + +A.3 PPBench dataset construction + +Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM. +The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we +detail how we built our training, validation, and golden splits. + +Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types. 
+Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8]. + +Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles +from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training +set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types +(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku +and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform +sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline +reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no +headroom to compare against PTRM). + +Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1) +trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s +solution trajectory rather than always the empty initial grid, while the label is always the fully solved +grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among +the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the +label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining +9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not +augmented. + +Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len = +100. Per-type sample counts are reported in Table 4. + + puzzle type train val golden + sudoku 7,810 97 15 + lightup 9,504 65 8 + nurikabe 15,180 55 9 + heyawake 42,108 70 7 + tapa 3,663 26 10 + shakashaka∗ 20,702 62 12 + total 98,967 375 61 +Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation. 
+∗ Shakashaka is included in training but excluded from per-puzzle accuracy reporting because deterministic TRM already solves all evaluated shakashaka puzzles.
+
+B Noise Ablation
+
+We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ, averaged over three random seeds.
+
+[Figure 7 plot. Panels: Sudoku-Extreme, Maze-Hard, ARC-AGI-2 (within-aug). x-axis: σ from 0.0 to 1.0. y-axis: accuracy (%). Series: pass@K, best-Q@K, mode@K, K = 1 baseline.]
+
+Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task, increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation to isolate the Q head’s verification abilities from the augmentation pipeline.
+
+On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then plateaus. On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the ceiling (maximum pass@K) on Sudoku-Extreme while best-Q@K peaks at 98.5% (within a point of pass@K’s peak of 99.3%).
On the other hand, the gap between best-Q@K and maximum pass@K is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close).
+
+C Q-guided Langevin sampling
+
+We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic search by additionally steering each rollout (using the Q head gradient) toward regions of high Q value. We ultimately found that the gain from this approach was entirely attributable to the Langevin noise term, with the gradient component contributing nothing measurable on top of the equivalent recurrent noise of Sec. 4. We document the approach here as a negative result.
+
+Motivation. The Q head is trained as a correctness predictor over latent states. Let f_Q(z) denote the head’s scalar output. We treated E(z) = −log sigmoid(f_Q(z)) as an energy function over latent space. Empirical observations during early experiments suggested that regions of low E correspond to good basins from which the decoded answer is likely correct. PCA visualizations of the latent dynamics showed that ∇_z f_Q points toward the good-basin region from both good-basin (correct) and bad-basin (incorrect) latents (Figure 8). This made ∇_z f_Q look like a valuable direction along which to push latents.
+
+Method. We sample from the target distribution p(z) ∝ e^{−E(z)} = sigmoid(f_Q(z)) via Langevin dynamics where at the end of each deep recursion step t = 1, . . . , D we apply N Langevin steps to the latent,
+
+    z \leftarrow z - \eta \nabla_z E(z) + \sqrt{2\eta}\,\xi, \qquad \xi \sim \mathcal{N}(0, I).
+
+The number of Langevin steps N is the additional scaling axis under this scheme.
+
+[Figure 8 plot. Panels: t = 0, t = 5, t = 10, t = 15. Legend: Correct (21), Incorrect (4), gradient arrows.]
+
+Figure 8: y latents and their ∇_z f_Q gradients projected into the principal plane at several recursive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇_z f_Q. From both good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization motivated the Langevin sampling experiment described below.
+
+Tractable gradient computation. TRM’s original Q head is a linear projection on a single token, f_Q(y) = w⊤ y[:, 0] + b, so its gradient with respect to this head’s input is a constant vector independent of z. For ∇_z f_Q to be input-dependent, the gradient must flow back through the last latent recursion. This works but requires backpropagating through a full latent recursion at every Langevin step, which scales poorly with N. To make guidance tractable for large N, we replaced the linear Q head with an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear network. With this head, ∇_z f_Q can be computed by backpropagating through the head alone, which is ∼8× faster per step and does not sacrifice accuracy.
+
+The gain came from the noise, not the gradient. Comparing Langevin sampling against a noise-only ablation (with the same √(2η) ξ, but with the −η ∇_z E(z) term zeroed out) produced essentially identical accuracy at matched N. The gradient component contributed nothing measurable on top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in Sec. 4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
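The Langevin update and its noise-only ablation from Appendix C can be sketched as follows. This is an illustration under stated assumptions, not the PTRM implementation: the Q head is replaced by a hypothetical linear stand-in f_Q(z) = w·z + b so that the gradient of E(z) = −log sigmoid(f_Q(z)) has a closed form, and the names `langevin_step`, `grad_energy`, and `use_gradient` are invented for this sketch. With a linear stand-in the gradient always points along w, which is the limitation the text describes for the original linear head.

```python
import math
import random


def q_logit(z, w, b):
    # Hypothetical linear stand-in for the Q head: f_Q(z) = w . z + b.
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def energy(z, w, b):
    # E(z) = -log sigmoid(f_Q(z)) = log(1 + exp(-f_Q(z))).
    return math.log1p(math.exp(-q_logit(z, w, b)))


def grad_energy(z, w, b):
    # For a linear head: dE/dz = -(1 - sigmoid(f_Q(z))) * w.
    scale = -(1.0 - sigmoid(q_logit(z, w, b)))
    return [scale * wi for wi in w]


def langevin_step(z, w, b, eta, rng, use_gradient=True):
    # One update z <- z - eta * grad E(z) + sqrt(2 * eta) * xi, xi ~ N(0, I).
    # use_gradient=False zeroes the drift term (the noise-only ablation).
    g = grad_energy(z, w, b) if use_gradient else [0.0] * len(z)
    noise_scale = math.sqrt(2.0 * eta)
    return [zi - eta * gi + noise_scale * rng.gauss(0.0, 1.0)
            for zi, gi in zip(z, g)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [0.2, 0.3, -0.4]
    for _ in range(5):  # N = 5 Langevin steps on a toy latent
        z = langevin_step(z, [0.5, -1.0, 2.0], 0.1, eta=0.05, rng=rng)
    print(z)
```

Running both variants with the same seed isolates the drift term: the two results differ by exactly −η ∇_z E(z), which is the quantity the ablation reports as having no measurable effect on accuracy.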
+
+D Per-puzzle accuracy on the PPBench validation set
+
+The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct comparability with the LLM evaluations from Waugh [8], who used that set. For a lower-variance complement, Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift, and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains are again on puzzles where the deterministic baseline has the most headroom (tapa from ∼40% to 71.8%, sudoku from ∼69% to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only marginally.
+
+    % accuracy                      # Params   sudoku   lightup   nurikabe   heyawake   tapa   agg.
+    Direct prediction               27M           0.0      10.0        4.0       14.0    0.0    6.2
+    TRM (K=1, D=16)                 7M           68.7      83.3       76.0       96.7   39.7   76.4
+    TRM (K=1, D=48)                 7M           74.0      84.0       76.7       98.0   41.0   78.3
+    PTRM, best-Q@K (K=100, D=48)    7M           93.3      93.3       84.7      100     71.8   90.4
+
+Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.
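The three selection metrics reported in Appendices B and D can be sketched per puzzle as below. The definitions are assumptions inferred from how the text uses the terms (pass@K as "any rollout is correct", best-Q@K as "the rollout with the highest Q value is correct", mode@K as "the most frequent answer is correct"); the function names are illustrative and answers are assumed to be hashable, for example tuples of tokens.

```python
from collections import Counter


def pass_at_k(answers, target):
    # True if at least one of the K rollouts decodes to the target.
    return any(a == target for a in answers)


def best_q_at_k(answers, q_values, target):
    # True if the rollout ranked highest by the Q head is correct.
    best = max(range(len(answers)), key=lambda i: q_values[i])
    return answers[best] == target


def mode_at_k(answers, target):
    # True if the most frequent answer across rollouts is correct.
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common == target


if __name__ == "__main__":
    answers = ["A", "B", "B", "C"]    # decoded answers of K = 4 rollouts
    q_values = [0.9, 0.2, 0.3, 0.1]   # Q head score per rollout
    print(pass_at_k(answers, "A"),
          best_q_at_k(answers, q_values, "A"),
          mode_at_k(answers, "A"))
```

Under these definitions best-Q@K can never exceed pass@K on a given puzzle, so the difference between the two averages is the verifier headroom discussed with Figure 7.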
\ No newline at end of file
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt
new file mode 100644
index 0000000..55cf994
--- /dev/null
+++ b/papers/txt/trm2025_tiny_recursive.txt
@@ -0,0 +1,796 @@
+Less is More: Recursive Reasoning with Tiny Networks
+
+Alexia Jolicoeur-Martineau
+Samsung SAIL Montréal
+alexia.j@samsung.com
+
+arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
+
+Abstract
+
+Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (∼1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
+
+1. Introduction
+
+While powerful, Large Language models (LLMs) can struggle on hard question-answer problems. Given that they generate their answer auto-regressively, there is a high risk of error since a single incorrect token can render an answer invalid. To improve their reliability, LLMs rely on Chain-of-thoughts (CoT) (Wei et al., 2022) and Test-Time Compute (TTC) (Snell et al., 2024).
+
+Figure 1. Tiny Recursion Model (TRM) recursively improves its predicted answer y with a tiny network. It starts with the embedded input question x and initial embedded answer y, and latent z.
For up to N_sup = 16 improvement steps, it tries to improve its answer y. It does so by i) recursively updating n times its latent z given the question x, current answer y, and current latent z (recursive reasoning), and then ii) updating its answer y given the current answer y and current latent z. This recursive process allows the model to progressively improve its answer (potentially addressing any errors from its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.
+
+CoTs seek to emulate human reasoning by having the LLM sample step-by-step reasoning traces prior to giving their answer. Doing so can improve accuracy, but CoT is expensive, requires high-quality reasoning data (which may not be available), and can be brittle since the generated reasoning may be wrong. To further improve reliability, test-time compute can be used by reporting the most common answer out of K or the highest-reward answer (Snell et al., 2024).
+
+However, this may not be enough. LLMs with CoT and TTC are not enough to beat every problem. While LLMs have made significant progress on ARC-AGI (Chollet, 2019) since 2019, human-level accuracy still has not been reached (6 years later, as of writing of this paper). Furthermore, LLMs struggle on the newer ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test accuracy with a high amount of TTC) (Chollet et al., 2025; ARC Prize Foundation, 2025b).
+
+An alternative direction has recently been proposed by Wang et al. (2025). They propose a new way forward through their novel Hierarchical Reasoning Model (HRM), which obtains high accuracy on puzzle tasks where LLMs struggle to make a dent (e.g., Sudoku solving, Maze pathfinding, and ARC-AGI). HRM is a supervised learning model with two main novelties: 1) recursive hierarchical reasoning, and 2) deep supervision.
+
+Recursive hierarchical reasoning consists of recursing multiple times through two small networks (f_L at high frequency and f_H at low frequency) to predict the answer. Each network generates a different latent feature: f_L outputs z_L and f_H outputs z_H. Both features (z_L, z_H) are used as input to the two networks. The authors provide some biological arguments in favor of recursing at different hierarchies based on the different temporal frequencies at which the brains operate and hierarchical processing of sensory inputs.
+
+Deep supervision consists of improving the answer through multiple supervision steps while carrying the two latent features as initialization for the improvement steps (after detaching them from the computational graph so that their gradients do not propagate). This provides residual connections, which emulate very deep neural networks that are too memory expensive to apply in one forward pass.
+
+An independent analysis on the ARC-AGI benchmark showed that deep supervision seems to be the primary driver of the performance gains (ARC Prize Foundation, 2025a). Using deep supervision doubled accuracy over single-step supervision (going from 19% to 39% accuracy), while recursive hierarchical reasoning only slightly improved accuracy over a regular model with a single forward pass (going from 35.7% to 39.0% accuracy). This suggests that reasoning across different supervision steps is worth it, but the recursion done in each supervision step is not particularly important.
+
+In this work, we show that the benefit from recursive reasoning can be massively improved, making it much more than incremental. We propose Tiny Recursive Model (TRM), an improved and simplified approach using a much smaller tiny network with only 2 layers that achieves significantly higher generalization than HRM on a variety of problems. In doing so, we improve the state-of-the-art test accuracy on Sudoku-Extreme from 55% to 87%, Maze-Hard from 75% to 85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2 from 5% to 8%.
+
+2. Background
+
+HRM is described in Algorithm 2. We discuss the details of the algorithm further below.
+
+2.1. Structure and goal
+
+The focus of HRM is supervised learning. Given an input, produce an output. Both input and output are assumed to have shape [B, L] (when the shape differs, padding tokens can be added), where B is the batch-size and L is the context-length.
+
+HRM contains four learnable components: the input embedding f_I(·; θ_I), low-level recurrent network f_L(·; θ_L), high-level recurrent network f_H(·; θ_H), and the output head f_O(·; θ_O). Once the input is embedded, the shape becomes [B, L, D] where D is the embedding size. Each network is a 4-layer Transformers architecture (Vaswani et al., 2017), with RMSNorm (Zhang & Sennrich, 2019), no bias (Chowdhery et al., 2023), rotary embeddings (Su et al., 2024), and SwiGLU activation function (Hendrycks & Gimpel, 2016; Shazeer, 2020).
+
+2.2. Recursion at two different frequencies
+
+Given the hyperparameters used by Wang et al. (2025) (n = 2 f_L steps, 1 f_H steps; done T = 2 times), a forward pass of HRM is done as follows:
+
+    x   ← f_I(x̃)
+    z_L ← f_L(z_L + z_H + x)    # without gradients
+    z_L ← f_L(z_L + z_H + x)    # without gradients
+    z_H ← f_H(z_L + z_H)        # without gradients
+    z_L ← f_L(z_L + z_H + x)    # without gradients
+    z_L ← z_L.detach()
+    z_H ← z_H.detach()
+    z_L ← f_L(z_L + z_H + x)    # with gradients
+    z_H ← f_H(z_L + z_H)        # with gradients
+    ŷ   ← argmax(f_O(z_H))
+
+where ŷ is the predicted output answer, z_L and z_H are either initialized embeddings or the embeddings of the previous deep supervision step (after detaching them from the computational graph). As can be seen, a forward pass of HRM consists of applying 6 function evaluations, where the first 4 function evaluations are detached from the computational graph and are not back-propagated through. The authors use n = 2 with T = 2 in all experiments, but HRM can be generalized by allowing for an arbitrary number of L steps (n) and recursions (T) as shown in Algorithm 2.
+
+    def hrm(z, x, n=2, T=2):  # hierarchical reasoning
+        zH, zL = z
+        with torch.no_grad():
+            for i in range(n * T - 2):
+                zL = L_net(zL, zH, x)
+                if (i + 1) % T == 0:
+                    zH = H_net(zH, zL)
+        # 1-step grad
+        zL = L_net(zL, zH, x)
+        zH = H_net(zH, zL)
+        return (zH, zL), output_head(zH), Q_head(zH)
+
+    def ACT_halt(q, y_hat, y_true):
+        target_halt = (y_hat == y_true)
+        loss = 0.5 * binary_cross_entropy(q[0], target_halt)
+        return loss
+
+    def ACT_continue(q, last_step):
+        if last_step:
+            target_continue = sigmoid(q[0])
+        else:
+            target_continue = sigmoid(max(q[0], q[1]))
+        loss = 0.5 * binary_cross_entropy(q[1], target_continue)
+        return loss
+
+    # Deep Supervision
+    for x_input, y_true in train_dataloader:
+        z = z_init
+        for step in range(N_sup):  # deep supervision
+            x = input_embedding(x_input)
+            z, y_pred, q = hrm(z, x)
+            loss = softmax_cross_entropy(y_pred, y_true)
+            # Adaptive computational time (ACT) using Q-learning
+            loss += ACT_halt(q, y_pred, y_true)
+            _, _, q_next = hrm(z, x)  # extra forward pass
+            loss += ACT_continue(q_next, step == N_sup - 1)
+            z = z.detach()
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+            if q[0] > q[1]:  # early-stopping
+                break
+
+Figure 2. Pseudocode of Hierarchical Reasoning Models (HRMs).
+
+2.3. Fixed-point recursion with 1-step gradient approximation
+
+Assuming that (z_L, z_H) reaches a fixed-point (z*_L, z*_H) through recursing from both f_L and f_H,
+
+    z*_L ≈ f_L(z*_L + z_H + x)
+    z*_H ≈ f_H(z_L + z*_H),
+
+the Implicit Function Theorem (Krantz & Parks, 2002) with the 1-step gradient approximation (Bai et al., 2019) is used to approximate the gradient by back-propagating only the last f_L and f_H steps. This theorem is used to justify only tracking the gradients of the last two steps (out of 6), which greatly reduces memory demands.
+
+2.4. Deep supervision
+
+To improve effective depth, deep supervision is used. This consists of reusing the previous latent features (z_H and z_L) as initialization for the next forward pass. This allows the model to reason over many iterations and improve its latent features (z_L and z_H) until it (hopefully) converges to the correct solution. At most N_sup = 16 supervision steps are used.
+
+2.5. Adaptive computational time (ACT)
+
+With deep supervision, each mini-batch of data samples must be used for N_sup = 16 supervision steps before moving to the next mini-batch. This is expensive, and there is a balance to be reached between optimizing a few data examples for many supervision steps versus optimizing many data examples with fewer supervision steps. To reach a better balance, a halting mechanism is incorporated to determine whether the model should terminate early. It is learned through a Q-learning objective that requires passing z_H through an additional head and running an additional forward pass (to determine if halting now rather than later would have been preferable). They call this method Adaptive computational time (ACT). It is only used during training, while the full N_sup = 16 supervision steps are done at test time to maximize downstream performance. ACT greatly diminishes the time spent per example (on average spending less than 2 steps on the Sudoku-Extreme dataset rather than the full N_sup = 16 steps), allowing more coverage of the dataset given a fixed number of training iterations.
+
+2.6. Deep supervision and 1-step gradient approximation replace BPTT
+
+Deep supervision and the 1-step gradient approximation provide a more biologically plausible and less computationally expensive alternative to Backpropagation Through Time (BPTT) (Werbos, 1974; Rumelhart et al., 1985; LeCun, 1985) for solving the temporal credit assignment (TCA) (Rumelhart et al., 1985; Werbos, 1988; Elman, 1990) problem (Lillicrap & Santoro, 2019). The implication is that HRM can learn what would normally require an extremely large network without having to back-propagate through its entire depth. Given the hyperparameters used by Wang et al. (2025) in all their experiments, HRM effectively reasons over n_layers (n + 1) T N_sup = 4 ∗ (2 + 1) ∗ 2 ∗ 16 = 384 layers of effective depth.
+
+2.7. Summary of HRM
+
+HRM leverages recursion from two networks at different frequencies (high frequency versus low frequency) and deep supervision to learn to improve its answer over multiple supervision steps (with ACT to reduce time spent per data example). This enables the model to imitate extremely large depth without requiring backpropagation through all layers. This approach obtains significantly higher performance on hard question-answer tasks that regular supervised models struggle with. However, this method is quite complicated, relying a bit too heavily on uncertain biological arguments and fixed-point theorems that are not guaranteed to be applicable. In the next section, we discuss those issues and potential targets for improvements in HRM.
+
+3. Target for improvements in Hierarchical Reasoning Models
+
+In this section, we identify key targets for improvements in HRM, which will be addressed by our proposed method, Tiny Recursion Models (TRMs).
+
+3.1. Implicit Function Theorem (IFT) with 1-step gradient approximation
+
+HRM only back-propagates through the last 2 of the 6 recursions. The authors justify this by leveraging the Implicit Function Theorem (IFT) and one-step approximation, which states that when a recurrent function converges to a fixed point, backpropagation can be applied in a single step at that equilibrium point.
+
+There are concerns about applying this theorem to HRM. Most importantly, there is no guarantee that a fixed-point is reached. Deep equilibrium models normally do fixed-point iteration to solve for the fixed point z* = f(z*) (Bai et al., 2019). However, in the case of HRM, it is not iterating to the fixed-point but simply doing forward passes of f_L and f_H. To make matters worse, HRM is only doing 4 recursions before stopping to apply the one-step approximation. After its first loop of two f_L and 1 f_H evaluations, it only applies a single f_L evaluation before assuming that a fixed-point is reached for both z_L and z_H (z*_L = f_L(z*_L + z_H + x) and z*_H = f_H(z*_L + z*_H)). Then, the one-step gradient approximation is applied to both latent variables in succession.
+
+The authors justify that a fixed-point is reached by depicting an example with n = 7 and T = 7 where the forward residuals are reduced over time (Figure 3 in Wang et al. (2025)). Even in this setting, which is different from the much smaller n = 2 and T = 2 used in every experiment of their paper, we observe the following:
+
+1. the residual for z_H is clearly well above 0 at every step
+2. the residual for z_L only becomes closer to 0 after many cycles, but it remains significantly above 0
+3. z_L is very far from converged after one f_L evaluation at T cycles, which is when the fixed-point is assumed to be reached and the 1-step gradient approximation is used
+
+Thus, while the application of the IFT and 1-step gradient approximation to HRM has some basis since the residuals do generally reduce over time, a fixed point is unlikely to be reached when the theorem is actually applied.
+
+In the next section, we show that we can bypass the need for the IFT and 1-step gradient approximation, thus bypassing the issue entirely.
+
+3.2. Twice the forward passes with Adaptive computational time (ACT)
+
+HRM uses Adaptive computational time (ACT) during training to optimize the time spent on each data sample. Without ACT, N_sup = 16 supervision steps would be spent on the same data sample, which is highly inefficient. They implement ACT through an additional Q-learning objective, which decides when to halt and move to a new data sample rather than keep iterating on the same data. This allows much more efficient use of time especially since the average number of supervision steps during training is quite low with ACT (less than 2 steps on the Sudoku-Extreme dataset as per their reported numbers).
+
+However, ACT comes at a cost. This cost is not directly shown in the HRM paper, but it is shown in their official code. The Q-learning objective relies on a halting loss and a continue loss. The continue loss requires an extra forward pass through HRM (with all 6 function evaluations). This means that while ACT optimizes time more efficiently per sample, it requires 2 forward passes per optimization step. The exact formulation is shown in Algorithm 2.
+
+In the next section, we show that we can bypass the need for two forward passes in ACT.
+
+3.3. Hierarchical interpretation based on complex biological arguments
+
+The HRM authors justify the two latent variables and two networks operating at different hierarchies based on biological arguments, which are very far from artificial neural networks. They even try to match HRM to actual brain experiments on mice. While interesting, this sort of explanation makes it incredibly hard to parse out why HRM is designed the way it is. Given the lack of an ablation table in their paper, the over-reliance on biological arguments and fixed-point theorems (that are not perfectly applicable), it is hard to determine what parts of HRM are helping what and why. Furthermore, it is not clear why they use two latent features rather than other combinations of features.
+
+In the next section, we show that the recursive process can be greatly simplified and understood in a much simpler manner that does not require any biological argument, fixed-point theorem, hierarchical interpretation, nor using two networks. It also explains why 2 is the optimal number of features (z_L and z_H).
+
+    def latent_recursion(x, y, z, n=6):
+        for i in range(n):  # latent reasoning
+            z = net(x, y, z)
+        y = net(y, z)  # refine output answer
+        return y, z
+
+    def deep_recursion(x, y, z, n=6, T=3):
+        # recursing T-1 times to improve y and z (no gradients needed)
+        with torch.no_grad():
+            for j in range(T - 1):
+                y, z = latent_recursion(x, y, z, n)
+        # recursing once to improve y and z
+        y, z = latent_recursion(x, y, z, n)
+        return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+    # Deep Supervision
+    for x_input, y_true in train_dataloader:
+        y, z = y_init, z_init
+        for step in range(N_supervision):
+            x = input_embedding(x_input)
+            (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+            loss = softmax_cross_entropy(y_hat, y_true)
+            loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+            if q_hat > 0:  # early-stopping
+                break
+
+Figure 3. Pseudocode of Tiny Recursion Models (TRMs).
+
+4. Tiny Recursion Models
+
+In this section, we present Tiny Recursion Models (TRMs). Contrary to HRM, TRM requires no complex mathematical theorem, hierarchy, nor biological arguments. It generalizes better while requiring only a single tiny network (instead of two medium-size networks) and a single forward pass for the ACT (instead of 2 passes). Our approach is described in Algorithm 3 and illustrated in Figure 1. We also provide an ablation in Table 1 on the Sudoku-Extreme dataset (a dataset of difficult Sudokus with only 1K training examples, but 423K test examples). Below, we explain the key components of TRMs.
+
+Table 1. Ablation of TRM on Sudoku-Extreme comparing % Test accuracy, effective depth per supervision step (T(n + 1)n_layers), number of Forward Passes (NFP) per optimization step, and number of parameters
+
+    Method                  Acc (%)   Depth   NFP   # Params
+    HRM                        55.0      24     2        27M
+    TRM (T = 3, n = 6)         87.4      42     1         5M
+    w/ ACT                     86.1      42     2         5M
+    w/ separate f_H, f_L       82.4      42     1        10M
+    no EMA                     79.9      42     1         5M
+    w/ 4-layers, n = 3         79.5      48     1        10M
+    w/ self-attention          74.7      42     1         7M
+    w/ T = 2, n = 2            73.7      12     1         5M
+    w/ 1-step gradient         56.5      42     1         5M
+
+4.1. No fixed-point theorem required
+
+HRM assumes that the recursions converge to a fixed-point for both z_L and z_H in order to leverage the 1-step gradient approximation (Bai et al., 2019). This allows the authors to justify only back-propagating through the last two function evaluations (1 f_L and 1 f_H). To bypass this theoretical requirement, we define a full recursion process as containing n evaluations of f_L and 1 evaluation of f_H:
+
+    z_L ← f_L(z_L + z_H + x)
+    ...
+    z_L ← f_L(z_L + z_H + x)
+    z_H ← f_H(z_L + z_H).
+
+Then, we simply back-propagate through the full recursion process.
+
+Through deep supervision, the model learns to take any (z_L, z_H) and improve it through a full recursion process, hopefully making z_H closer to the solution. This means that by the design of the deep supervision goal, running a few full recursion processes (even without gradients) is expected to bring us closer to the solution. We propose to run T − 1 recursion processes without gradient to improve (z_L, z_H) before running one recursion process with backpropagation.
+
+Thus, instead of using the 1-step gradient approximation, we apply a full recursion process containing n evaluations of f_L and 1 evaluation of f_H. This removes entirely the need to assume that a fixed-point is reached and the use of the IFT with 1-step gradient approximation. Yet, we can still leverage multiple backpropagation-free recursion processes to improve (z_L, z_H). With this approach, we obtain a massive boost in generalization on Sudoku-Extreme (improving TRM from 56.5% to 87.4%; see Table 1).
+
+4.2. Simpler reinterpretation of z_H and z_L
+
+HRM is interpreted as doing hierarchical reasoning over two latent features of different hierarchies due to arguments from biology. However, one might wonder why use two latent features instead of 1, 3, or more? And do we really need to justify these so-called "hierarchical" features based on biology to make sense of them? We propose a simple non-biological explanation, which is more natural, and directly answers the question of why there are 2 features.
+
+The fact of the matter is: z_H is simply the current (embedded) solution. The embedding is reversed by applying the output head and rounding to the nearest token using the argmax operation. On the other hand, z_L is a latent feature that does not directly correspond to a solution, but it can be transformed into a solution by applying z_H ← f_H(x, z_L, z_H). We show an example on Sudoku-Extreme in Figure 6 to highlight the fact that z_H does correspond to the solution, but z_L does not.
+
+Once this is understood, hierarchy is not needed; there is simply an input x, a proposed solution y (previously called z_H), and a latent reasoning feature z (previously called z_L). Given the input question x, current solution y, and current latent reasoning z, the model recursively improves its latent z. Then, given the current latent z and the previous solution y, the model proposes a new solution y (or stays at the current solution if it is already good).
+
+Although this has no direct influence on the algorithm, this re-interpretation is much simpler and more natural. It answers the question about why two features: remembering in context the question x, previous reasoning z, and previous answer y helps the model iterate on the next reasoning z and then the next answer y. If we were not passing the previous reasoning z, the model would forget how it got to the previous solution y (since z acts similarly as a chain-of-thought). If we were not passing the previous solution y, then the model would forget what solution it had and would be forced to store the solution y within z instead of using it for latent reasoning. Thus, we need both y and z separately, and there is no apparent reason why one would need to split z into multiple features.
+
+While this is intuitive, we wanted to verify whether using more or fewer features could be helpful. Results are shown in Table 2.
+
+More features (> 2): We tested splitting z into different features by treating each of the n recursions as producing a different z_i for i = 1, ..., n. Then, each z_i is carried across supervision steps. The approach is described in Algorithm 5. In doing so, we found performance to drop. This is expected because, as discussed, there is no apparent need for splitting z into multiple parts. It does not have to be hierarchical.
+
+Single feature: Similarly, we tested the idea of taking a single feature by only carrying z_H across supervision steps. The approach is described in Algorithm 4. In doing so, we found performance to drop. This is expected because, as discussed, it forces the model to store the solution y within z.
+
+Thus, we explored using more or fewer latent variables on Sudoku-Extreme, but found that having only y and z leads to better test accuracy in addition to being the simplest and most natural approach.
+
+Table 2. TRM on Sudoku-Extreme comparing % Test accuracy when using more or fewer latent features
+
+    Method               # of features   Acc (%)
+    TRM y, z (Ours)      2                  87.4
+    TRM multi-scale z    n + 1 = 7          77.6
+    TRM single z         1                  71.9
+
+4.3. Single network
+
+HRM uses two networks, one applied frequently as a low-level module (f_L) and one applied rarely as a high-level module (f_H). This requires twice the number of parameters compared to regular supervised learning with a single network.
+
+As mentioned previously, while f_L iterates on the latent reasoning feature z (z_L in HRM), the goal of f_H is to update the solution y (z_H in HRM) given the latent reasoning and current solution. Importantly, since z ← f_L(x + y + z) contains x but y ← f_H(y + z) does not contain x, the task to achieve (iterating on z versus using z to update y) is directly specified by the inclusion or lack of x in the inputs. Thus, we considered the possibility that both networks could be replaced by a single network doing both tasks. In doing so, we obtain better generalization on Sudoku-Extreme (improving TRM from 82.4% to 87.4%; see Table 1) while reducing the number of parameters by half. It turns out that a single network is enough.
+
+4.4. Less is more
+
+We attempted to increase capacity by increasing the number of layers in order to scale the model. Surprisingly, we found that adding layers decreased generalization due to overfitting. In doing the opposite, decreasing the number of layers while scaling the number of recursions (n) proportionally (to keep the amount of compute and emulated depth approximately the same), we found that using 2 layers (instead of 4 layers) maximized generalization. In doing so, we obtain better generalization on Sudoku-Extreme (improving TRM from 79.5% to 87.4%; see Table 1) while reducing the number of parameters by half (again).
+
+It is quite surprising that smaller networks are better, but 2 layers seems to be the optimal choice. Bai & Melas-Kyriazi (2024) also observed optimal performance for 2-layers in the context of deep equilibrium diffusion models; however, they had similar performance to the bigger networks, while we instead observe better performance with 2 layers. This may appear unusual, as with modern neural networks, generalization tends to directly correlate with model sizes. However, when data is too scarce and model size is large, there can be an overfitting penalty (Kaplan et al., 2020). This is likely an indication that there is too little data. Thus, using tiny networks with deep recursion and deep supervision appears to allow us to bypass a lot of the overfitting.
+
+4.5. Attention-free architecture for tasks with small fixed context length
+
+Self-attention is particularly good for long-context lengths when L ≫ D since it only requires a matrix of [D, 3D] parameters, even though it can account for the whole sequence. However, when focusing on tasks where L ≤ D, a linear layer is cheap, requiring only a matrix of [L, L] parameters. Taking inspiration from the MLP-Mixer (Tolstikhin et al., 2021), we can replace the self-attention layer with a multilayer perceptron (MLP) applied on the sequence length.
+
+4.6. No additional forward pass needed with ACT
+
+As previously mentioned, the implementation of ACT in HRM through Q-learning requires two forward passes, which slows down training. We propose a simple solution, which is to get rid of the continue loss (from the Q-learning) and only learn a halting probability through a Binary-Cross-Entropy loss of having reached the correct solution. By removing the continue loss, we remove the need for the expensive second forward pass, while still being able to determine when to halt with relatively good accuracy. We found no significant difference in generalization from this change (going from 86.1% to 87.4%; see Table 1).
+
+4.7. Exponential Moving Average (EMA)
+
+On small data (such as Sudoku-Extreme and Maze-Hard), HRM tends to overfit quickly and then diverge. To reduce this problem and improve stability, we integrate Exponential Moving Average (EMA) of the weights, a common technique in GANs and diffusion models to improve stability (Brock et al., 2018; Song & Ermon, 2020). We find that it prevents sharp collapse and leads to higher generalization (going from 79.9% to 87.4%; see Table 1).
+
+4.8. Optimal number of recursions
+
+We experimented with different numbers of recursions by varying T and n and found that T = 3, n = 3 (equivalent to 48 recursions) in HRM and T = 3, n = 6 in TRM (equivalent to 42 recursions) lead to optimal generalization on Sudoku-Extreme. More recursions could be helpful for harder problems (we have not tested it, given our limited resources); however, increasing either T or n incurs massive slowdowns. We show results at different n and T for HRM and TRM in Table 3. Note that TRM requires backpropagation through a full recursion process, thus increasing n too much leads to Out Of Memory (OOM) errors. However, this memory cost is well worth its price in gold.
Using an MLP +instead of self-attention, we obtain better generaliza- In the following section, we show our main results on +tion on Sudoku-Extreme (improving from 74.7% to multiple datasets comparing HRM, TRM, and LLMs. +87.4%; see Table 1). This worked well on Sudoku 9x9 +grids, given the small and fixed context length; how- 5. Results +ever, we found this architecture to be suboptimal for +tasks with large context length, such as Maze-Hard Following Wang et al. (2025), we test our approach +and ARC-AGI (both using 30x30 grids). We show on the following datasets: Sudoku-Extreme (Wang +results with and without self-attention for all experi- et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI- +ments. 1 (Chollet, 2019) and, ARC-AGI-2 (Chollet et al., 2025). + Results are presented in Tables 4 and 5. Hyperparame- + ters are detailed in Section 6. Datasets are discussed + below. + + + 7 + Recursive Reasoning with Tiny Networks + + ity of the MLP on large 30x30 grids). TRM with self- +Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM + attention obtains 85.3% accuracy on Maze-Hard, 44.6% +versus TRM matched at a similar effective depth per super- +vision step ( T (n + 1)nlayers ) + accuracy on ARC-AGI-1, and 7.8% accuracy on ARC- + AGI-2 with 7M parameters. This is significantly higher + HRM TRM + than the 74.5%, 40.3%, and 5.0% obtained by HRM us- + n = k, 4 layers n = 2k, 2 layers + ing 4 times the number of parameters (27M). + k T Depth Acc (%) Depth Acc (%) + 1 1 9 46.4 7 63.2 + 2 2 24 55.0 20 81.9 Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku- + 3 3 48 61.6 42 87.4 Extreme and Maze-Hard) + 4 4 80 59.5 72 84.2 Method # Params Sudoku Maze + 6 3 84 62.3 78 OOM Chain-of-thought, pretrained + 3 6 96 58.8 84 85.8 Deepseek R1 671B 0.0 0.0 + 6 6 168 57.5 156 OOM Claude 3.7 8K ? 0.0 0.0 + O3-mini-high ? 
0.0 0.0 + Direct prediction, small-sample training +Sudoku-Extreme consists of extremely difficult Su- Direct pred 27M 0.0 0.0 +doku puzzles (Dillion, 2025; Palm et al., 2018; Park, HRM 27M 55.0 74.5 +2018) (9x9 grid), for which only 1K training samples TRM-Att (Ours) 7M 74.7 85.3 +are used to test small-sample learning. Testing is done TRM-MLP (Ours) 5M/19M 1 87.4 0.0 +on 423K samples. Maze-Hard consists of 30x30 mazes +generated by the procedure by Lehnert et al. (2024) +whose shortest path is of length above 110; both the +training set and test set include 1000 mazes. Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries) + Method # Params ARC-1 ARC-2 +ARC-AGI-1 and ARC-AGI-2 are geometric puzzles in- Chain-of-thought, pretrained +volving monetary prizes. Each puzzle is designed to Deepseek R1 671B 15.8 1.3 +be easy for a human, yet hard for current AI models. Claude 3.7 16K ? 28.6 0.7 +Each puzzle task consists of 2-3 input–output demon- o3-mini-high ? 34.5 3.0 +stration pairs and 1-2 test inputs to be solved. The final Gemini 2.5 Pro 32K ? 37.0 4.9 +score is computed as the accuracy over all test inputs Grok-4-thinking 1.7T 66.7 16.0 +from two attempts to produce the correct output grid. Bespoke (Grok-4) 1.7T 79.6 29.4 +The maximum grid size is 30x30. ARC-AGI-1 con- Direct prediction, small-sample training +tains 800 tasks, while ARC-AGI-2 contains 1120 tasks. + Direct pred 27M 21.0 0.0 +We also augment our data with the 160 tasks from + HRM 27M 40.3 5.0 +the closely related ConceptARC dataset (Moskvichev +et al., 2023). We provide results on the public evalua- TRM-Att (Ours) 7M 44.6 7.8 +tion set for both ARC-AGI-1 and ARC-AGI-2. TRM-MLP (Ours) 19M 29.6 2.4 + +While these datasets are small, heavy data- +augmentation is used in order to improve gen- 6. Conclusion +eralization. Sudoku-Extreme uses 1000 shuffling +(done without breaking the Sudoku rules) augmenta- We propose Tiny Recursion Models (TRM), a simple +tions per data example. 
Maze-Hard uses 8 dihedral recursive reasoning approach that achieves strong gen- +transformations per data example. ARC-AGI uses eralization on hard tasks using a single tiny network +1000 data augmentations (color permutation, dihedral- recursing on its latent reasoning feature and progres- +group, and translations transformations) per data sively improving its final answer. Contrary to the +example. The dihedral-group transformations consist Hierarchical Reasoning Model (HRM), TRM requires +of random 90-degree rotations, horizontal/vertical no fixed-point theorem, no complex biological justi- +flips, and reflections. fications, and no hierarchy. It significantly reduces + the number of parameters by halving the number of +From the results, we see that TRM without self- layers and replacing the two networks with a single +attention obtains the best generalization on Sudoku- tiny network. It also simplifies the halting process, +Extreme (87.4% test accuracy). Meanwhile, TRM with removing the need for the extra forward pass. Over- +self-attention generalizes better on the other tasks +(probably due to inductive biases and the overcapac- 1 5M on Sudoku and 19M on Maze + + + + 8 + Recursive Reasoning with Tiny Networks + +all, TRM is much simpler than HRM, while achieving gan training for high fidelity natural image synthe- +better generalization. sis. arXiv preprint arXiv:1809.11096, 2018. +While our approach led to better generalization on 4 Chollet, F. On the measure of intelligence. arXiv +benchmarks, every choice made is not guaranteed to preprint arXiv:1911.01547, 2019. +be optimal on every dataset. For example, we found +that replacing the self-attention with an MLP worked Chollet, F., Knoop, M., Kamradt, G., Landers, B., +extremely well on Sudoku-Extreme (improving test ac- and Pinkard, H. Arc-agi-2: A new challenge +curacy by 10%), but poorly on other datasets. Different for frontier ai reasoning systems. 
arXiv preprint +problem settings may require different architectures arXiv:2505.11831, 2025. +or number of parameters. Scaling laws are needed +to parametrize these networks optimally. Although Chowdhery, A., Narang, S., Devlin, J., Bosma, M., +we simplified and improved on deep recursion, the Mishra, G., Roberts, A., Barham, P., Chung, H. W., +question of why recursion helps so much compared Sutton, C., Gehrmann, S., et al. Palm: Scaling lan- +to using a larger and deeper network remains to be guage modeling with pathways. Journal of Machine +explained; we suspect it has to do with overfitting, but Learning Research, 24(240):1–113, 2023. +we have no theory to back this explaination. Not all +our ideas made the cut; we briefly discuss some of the Dillion, T. Tdoku: A fast sudoku solver and gener- +failed ideas that we tried but did not work in Section 6. ator. https://t-dillon.github.io/tdoku/, +Currently, recursive reasoning models such as HRM 2025. +and TRM are supervised learning methods rather than + Elman, J. L. Finding structure in time. Cognitive science, +generative models. This means that given an input + 14(2):179–211, 1990. +question, they can only provide a single deterministic +answer. In many settings, multiple answers exist for a Fedus, W., Zoph, B., and Shazeer, N. Switch transform- +question. Thus, it would be interesting to extend TRM ers: Scaling to trillion parameter models with simple +to generative tasks. and efficient sparsity. Journal of Machine Learning Re- + search, 23(120):1–39, 2022. +Acknowledgements + Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep +Thank you Emy Gervais for your invaluable support equilibrium models. arXiv preprint arXiv:2310.18605, +and extra push. This research was enabled in part 2023. +by computing resources, software, and technical as- +sistance provided by Mila and the Digital Research Hendrycks, D. and Gimpel, K. Gaussian error linear +Alliance of Canada. units (gelus). 
arXiv preprint arXiv:1606.08415, 2016. + + Jang, Y., Kim, D., and Ahn, S. Hierarchical graph +References generation with k2-trees. In ICML 2023 Workshop on +ARC Prize Foundation. The Hidden Drivers of HRM’s Structured Probabilistic Inference Generative Modeling, + Performance on ARC-AGI. https://arcprize. 2023. + org/blog/hrm-analysis, 2025a. [Online; ac- + Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., + cessed 2025-09-15]. + Chess, B., Child, R., Gray, S., Radford, A., Wu, J., +ARC Prize Foundation. ARC-AGI Leaderboard. and Amodei, D. Scaling laws for neural language + https://arcprize.org/leaderboard, 2025b. models. arXiv preprint arXiv:2001.08361, 2020. + [Online; accessed 2025-09-24]. + Kingma, D. P. and Ba, J. Adam: A method for stochas- +Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium tic optimization. arXiv preprint arXiv:1412.6980, + models. Advances in neural information processing 2014. + systems, 32, 2019. + Krantz, S. G. and Parks, H. R. The implicit function +Bai, X. and Melas-Kyriazi, L. Fixed point diffusion theorem: history, theory, and applications. Springer + models. In Proceedings of the IEEE/CVF Conference Science & Business Media, 2002. + on Computer Vision and Pattern Recognition, pp. 9430– + 9440, 2024. LeCun, Y. Une procedure d’apprentissage ponr reseau + a seuil asymetrique. Proceedings of cognitiva 85, pp. +Brock, A., Donahue, J., and Simonyan, K. Large scale 599–604, 1985. + + 9 + Recursive Reasoning with Tiny Networks + +Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., all-mlp architecture for vision. Advances in neural + Rabbat, M., and Tian, Y. Beyond a*: Better planning information processing systems, 34:24261–24272, 2021. + with transformers via search dynamics bootstrap- + ping. arXiv preprint arXiv:2402.14083, 2024. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., + Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, +Lillicrap, T. P. and Santoro, A. Backpropagation I. Attention is all you need. 
Advances in neural + through time and the brain. Current opinion in neuro- information processing systems, 30, 2017. + biology, 55:82–89, 2019. + Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., +Loshchilov, I. and Hutter, F. Decoupled weight decay Lu, M., Song, S., and Yadkori, Y. A. Hierarchical + regularization. arXiv preprint arXiv:1711.05101, 2017. reasoning model. arXiv preprint arXiv:2506.21734, + 2025. +Moskvichev, A., Odouard, V. V., and Mitchell, M. The + conceptarc benchmark: Evaluating understanding Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., + and generalization in the arc domain. arXiv preprint Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought + arXiv:2305.07141, 2023. prompting elicits reasoning in large language mod- + els. Advances in neural information processing systems, +Palm, R., Paquet, U., and Winther, O. Recurrent re- + 35:24824–24837, 2022. + lational networks. Advances in neural information + processing systems, 31, 2018. Werbos, P. Beyond regression: New tools for predic- + tion and analysis in the behavioral sciences. PhD +Park, K. Can convolutional neural networks + thesis, Committee on Applied Mathematics, Harvard + crack sudoku puzzles? https://github.com/ + University, Cambridge, MA, 1974. + Kyubyong/sudoku, 2018. + Werbos, P. J. Generalization of backpropagation with +Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. + application to a recurrent gas market model. Neural + Grokking at the edge of numerical stability. arXiv + networks, 1(4):339–356, 1988. + preprint arXiv:2501.04697, 2025. + Zhang, B. and Sennrich, R. Root mean square layer +Rumelhart, D. E., Hinton, G. E., and Williams, R. J. + normalization. Advances in Neural Information Pro- + Learning internal representations by error propaga- + cessing Systems, 32, 2019. + tion. Technical report, 1985. +Shazeer, N. Glu variants improve transformer. arXiv + preprint arXiv:2002.05202, 2020. 
+Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, + Q., Hinton, G., and Dean, J. Outrageously large neu- + ral networks: The sparsely-gated mixture-of-experts + layer. arXiv preprint arXiv:1701.06538, 2017. +Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling + llm test-time compute optimally can be more effec- + tive than scaling model parameters. arXiv preprint + arXiv:2408.03314, 2024. +Song, Y. and Ermon, S. Improved techniques for train- + ing score-based generative models. Advances in + neural information processing systems, 33:12438–12448, + 2020. +Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, + Y. Roformer: Enhanced transformer with rotary + position embedding. Neurocomputing, 568:127063, + 2024. +Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, + L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., + Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An + + 10 + Recursive Reasoning with Tiny Networks + +Hyper-parameters and setup propagating through the whole n + 1 recursions makes + the most sense and works best. +All models are trained with the AdamW opti- +mizer(Loshchilov & Hutter, 2017; Kingma & Ba, 2014) We tried removing ACT with the option of stopping +with β 1 = 0.9, β 2 = 0.95, small learning rate warm- when the solution is reached, but we found that gen- +up (2K iterations), batch-size 768, hidden-size of 512, eralization dropped significantly. This can probably +Nsup = 16 max supervision steps, and stable-max loss be attributed to the fact that the model is spending +(Prieto et al., 2025) for improved stability. TRM uses an too much time on the same data samples rather than +Exponential Moving Average (EMA) of 0.999. HRM focusing on learning on a wide range of data samples. +uses n = 2, T = 2 with two 4-layers networks, while We tried weight tying the input embedding and out- +we use n = 6, T = 3 with one 2-layer network. 
put head, but this was too constraining and led to a +For Sudoku-Extreme and Maze-Hard, we train for 60k massive generalization drop. +epochs with learning rate 1e-4 and weight decay 1.0. We tried using TorchDEQ (Geng & Kolter, 2023) to +For ARC-AGI, we train for 100K epochs with learning replace the recursion steps by fixed-point iteration as +rate 1e-4 (with 1e-2 learning rate for the embeddings) done by Deep Equilibrium Models (Bai et al., 2019). +and weight decay 0.1. The numbers for Deepseek R1, This would provide a better justification for the 1-step +Claude 3.7 8K, O3-mini-high, Direct prediction, and gradient approximation. However, this slowed down +HRM from the Table 4 and 5 are taken from Wang et al. training due to the fixed-point iteration and led to +(2025). Both HRM and TRM add an embedding of worse generalization. This highlights the fact that +shape [0, 1, D ] on Sudoku-Extreme and Maze-Hard to converging to a fixed-point is not essential. +the input. For ARC-AGI, each puzzle (containing 2-3 +training examples and 1-2 test examples) at each data- +augmentation is given a specific embedding of shape +[0, 1, D ] and, at test-time, the most common answer +out of the 1000 data augmentations is given as answer. +Experiments on Sudoku-Extreme were ran with 1 L40S +with 40Gb of RAM for generally less than 36 hours. +Experiments on Maze-Hard were ran with 4 L40S with +40Gb of RAM for less than 24 hours. Experiments on +ARC-AGI were ran for around 3 days with 4 H100 +with 80Gb of RAM. + +Ideas that failed +In this section, we quickly mention a few ideas that +did not work to prevent others from making the same +mistake. +We tried replacing the SwiGLU MLPs by SwiGLU +Mixture-of-Experts (MoEs) (Shazeer et al., 2017; Fedus +et al., 2022), but we found generalization to decrease +massively. MoEs clearly add too much unnecessary +capacity, just like increasing the number of layers does. 
+Instead of back-propagating through the whole n + 1 +recursions, we tried a compromise between HRM 1- +step gradient approximation, which back-propagates +through the last 2 recursions. We did so by decou- +pling n from the number of last recursions k that we +back-propagate through. For example, while n = 6 +requires 7 steps with gradients in TRM, we can use +gradients for only the k = 4 last steps. However, we +found that this did not help generalization in any way, +and it made the approach more complicated. Back- + + + 11 + Recursive Reasoning with Tiny Networks + +Algorithms with different number of latent Example on Sudoku-Extreme +features + 8 3 1 + 9 6 8 7 +def latent recursion(x, z, n=6): 3 5 + for i in range(n+1): # latent recursion + z = net(x, z) 6 8 + return z 6 2 +def deep recursion(x, z, n=6, T=3): + 7 4 3 + # recursing T−1 times to improve z (no gradients needed) 9 4 + with torch.no grad(): + for j in range(T−1): + 2 4 1 + z = latent recursion(x, z, n) 6 2 5 7 + # recursing once to improve z + z = latent recursion(x, z, n) Input x + return z.detach(), output head(y), Q head(y) + 5 2 6 7 9 4 8 3 1 +# Deep Supervision +for x input, y true in train dataloader: 3 9 1 2 6 8 4 7 5 + z = z init 4 8 7 3 1 5 2 9 6 + for step in range(N supervision): + x = input embedding(x input) + 1 6 8 5 3 2 7 4 9 + z, y hat, q hat = deep recursion(x, z) 9 3 5 4 7 6 1 8 2 + loss = softmax cross entropy(y hat, y true) 7 4 2 9 8 1 5 6 3 + loss += binary cross entropy(q hat, (y hat == y true)) + z = z.detach() 8 7 3 1 5 9 6 2 4 + loss.backward() 2 5 9 6 4 7 3 1 8 + opt.step() + opt.zero grad() 6 1 4 8 5 3 9 5 7 + if q[0] > 0: # early−stopping + break Output y + 5 2 6 7 9 4 8 3 1 +Figure 4. Pseudocode of TRM using a single-z with deep 3 9 1 2 6 8 4 7 5 +supervision training in PyTorch. 
4 8 7 3 1 5 2 9 6 + 1 6 8 5 3 2 7 4 9 + 9 3 5 4 7 6 1 8 2 + 7 4 2 9 8 1 5 6 3 + 8 7 3 1 5 9 6 2 4 +def latent recursion(x, y, z, n=6): + for i in range(n): # latent recursion 2 5 9 6 4 7 3 1 8 + z[i] = net(x, y, z[0], ... , z[n−1]) 6 1 4 8 5 3 9 5 7 + y = net(y, z[0], ... , z[n−1]) # refine output answer + return y, z Tokenized z H (denoted y in TRM) +def deep recursion(x, y, z, n=6, T=3): 5 5 4 9 4 6 3 + # recursing T−1 times to improve y and z (no gradients needed) + with torch.no grad(): 4 3 1 4 6 5 + for j in range(T−1): 4 8 4 3 6 6 4 + y, z = latent recursion(x, y, z, n) + # recursing once to improve y and z 9 6 5 3 5 4 + y, z = latent recursion(x, y, z, n) 3 5 4 3 5 4 4 + return (y.detach(), z.detach()), output head(y), Q head(y) + 6 3 3 3 5 8 8 +# Deep Supervision 3 3 3 6 5 6 6 4 +for x input, y true in train dataloader: 7 5 6 3 3 6 6 + y, z = y init, z init + for step in range(N supervision): 4 3 4 8 3 6 6 4 + x = input embedding(x input) + (y, z), y hat, q hat = deep recursion(x, y, z) Tokenized z L (denoted z in TRM) + loss = softmax cross entropy(y hat, y true) + loss += binary cross entropy(q hat, (y hat == y true)) + loss.backward() + opt.step() + Figure 6. This Sudoku-Extreme example shows an input, ex- + opt.zero grad() pected output, and the tokenized z H and z L (after reversing + if q[0] > 0: # early−stopping the embedding and using argmax) for a pretrained model. + break + This highlights the fact that z H corresponds to the predicted + response, while z L is a latent feature that cannot be decoded +Figure 5. Pseudocode of TRM using multi-scale z with deep to a sensible output unless transformed into z H by f H . +supervision training in PyTorch. + + + + + 12 +
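As a worked check of the effective-depth expression T(n + 1)n_layers quoted in the Table 3 caption, the short Python sketch below recomputes the depth for several (k, T) settings, using n = k with 4 layers for HRM and n = 2k with 2 layers for TRM. The function name `effective_depth` and the dictionary layout are illustrative choices and are not taken from the paper's code; the expected depths are copied from the Table 3 rows with k ≥ 2.

```python
def effective_depth(T, n, n_layers):
    """Effective depth per supervision step: T * (n + 1) * n_layers."""
    return T * (n + 1) * n_layers


# (k, T) -> (HRM depth, TRM depth), copied from the Table 3 rows with k >= 2.
table3_depths = {
    (2, 2): (24, 20),
    (3, 3): (48, 42),
    (4, 4): (80, 72),
    (6, 3): (84, 78),
    (3, 6): (96, 84),
    (6, 6): (168, 156),
}

for (k, T), (hrm_depth, trm_depth) in table3_depths.items():
    # HRM column: n = k recursions with 4-layer networks.
    assert effective_depth(T, n=k, n_layers=4) == hrm_depth
    # TRM column: n = 2k recursions with a 2-layer network.
    assert effective_depth(T, n=2 * k, n_layers=2) == trm_depth

# TRM setting reported as best on Sudoku-Extreme: T = 3, n = 6, 2 layers.
print(effective_depth(T=3, n=6, n_layers=2))  # 42
```

Halving the layer count while doubling n keeps the two columns at comparable depth, which is the matching that Table 3 relies on.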
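The 8 dihedral transformations used to augment Maze-Hard can be enumerated as four 90-degree rotations, each taken with and without a flip. The following is a minimal pure-Python sketch under that assumption; `rotate90`, `flip_horizontal`, and `dihedral_transforms` are hypothetical helper names, and the authors' augmentation code may be organized differently.

```python
def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror every row left to right."""
    return [row[::-1] for row in grid]


def dihedral_transforms(grid):
    """Return the 8 images of a grid: 4 rotations, each with and without a flip."""
    images = []
    current = [list(row) for row in grid]
    for _ in range(4):
        images.append(current)
        images.append(flip_horizontal(current))
        current = rotate90(current)
    return images


example = [[1, 2],
           [3, 4]]
images = dihedral_transforms(example)
# Count how many of the 8 images are distinct for this asymmetric grid.
distinct = {tuple(map(tuple, g)) for g in images}
print(len(images), len(distinct))  # 8 8
```

For a grid with symmetries some of the 8 images coincide, so the number of distinct augmented examples can be smaller than 8.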
\ No newline at end of file |
