author YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
committer YurenHao0426 <blackhao0426@gmail.com> 2026-06-29 12:15:51 -0500
commit a6ec4288a2232988b130b2f00bb2565f81706966 (patch)
tree 1bb86e7f0b899b823b9e7fdf383e832d30a181e0 /papers
Recursive reasoning dynamics: analysis pipeline, paper drafts, toy models
Failure=more-chaotic (task-general under validity labeling) reduces to convergence/completeness detection; mechanism (transient chaos vs multistability vs input-induced) under investigation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Diffstat (limited to 'papers')
-rw-r--r-- papers/txt/engelken2023_gradient_flossing.txt 1879
-rw-r--r-- papers/txt/gram2025_generative_recursive.txt 1532
-rw-r--r-- papers/txt/gram2026_generative_recursive.txt 1532
-rw-r--r-- papers/txt/hrm2025_hierarchical_reasoning.txt 1302
-rw-r--r-- papers/txt/ptrm2025_probabilistic_trm.txt 906
-rw-r--r-- papers/txt/trm2025_tiny_recursive.txt 796
6 files changed, 7947 insertions, 0 deletions
diff --git a/papers/txt/engelken2023_gradient_flossing.txt b/papers/txt/engelken2023_gradient_flossing.txt
new file mode 100644
index 0000000..6d70608
--- /dev/null
+++ b/papers/txt/engelken2023_gradient_flossing.txt
@@ -0,0 +1,1879 @@
+ Gradient Flossing: Improving Gradient Descent
+ through Dynamic Control of Jacobians
+
+
+ Rainer Engelken
+ Zuckerman Mind Brain Behavior Institute
+ Columbia University
+ New York, USA
+ re2365@columbia.edu
+
+arXiv:2312.17306v1 [cs.LG] 28 Dec 2023
+
+
+
+ Abstract
+ Training recurrent neural networks (RNNs) remains a challenge due to the instability
+ of gradients across long time horizons, which can lead to exploding and
+ vanishing gradients. Recent research has linked these problems to the values
+ of Lyapunov exponents for the forward dynamics, which describe the growth or
+ shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a
+ novel approach to tackling gradient instability by pushing Lyapunov exponents
+ of the forward dynamics toward zero during learning. We achieve this by regularizing
+ Lyapunov exponents through backpropagation using differentiable linear
+ algebra. This enables us to "floss" the gradients, stabilizing them and thus improving
+ network training. We demonstrate that gradient flossing controls not only the
+ gradient norm but also the condition number of the long-term Jacobian, facilitating
+ multidimensional error feedback propagation. We find that applying gradient
+ flossing prior to training enhances both the success rate and convergence speed for
+ tasks involving long time horizons. For challenging tasks, we show that gradient
+ flossing during training can further increase the time horizon that can be bridged
+ by backpropagation through time. Moreover, we demonstrate the effectiveness of
+ our approach on various RNN architectures and tasks of variable temporal complexity.
+ Additionally, we provide a simple implementation of our gradient flossing
+ algorithm that can be used in practice. Our results indicate that gradient flossing
+ via regularizing Lyapunov exponents can significantly enhance the effectiveness of
+ RNN training and mitigate the exploding and vanishing gradients problem.
+
+
+ 1 Introduction
+ Recurrent neural networks are commonly used both in machine learning and computational neuroscience
+ for tasks that involve input-to-output mappings over sequences and dynamic trajectories.
+ Training is often achieved through gradient descent by the backpropagation of error information
+ across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time and recursively
+ applying the chain rule to calculate the gradient of the loss with respect to the network parameters.
+ Mathematically, evaluating the product of Jacobians of the recurrent state update describes how error
+ signals travel across time steps. When trained on tasks that have long-range temporal dependencies,
+ recurrent neural networks are prone to exploding and vanishing gradients [5, 6, 7, 8]. These arise from
+ the exponential amplification or attenuation of recursive derivatives of recurrent network states over
+ many time steps. Intuitively, to evaluate how an output error depends on a small parameter change at
+ a much earlier point in time, the error information has to be propagated through the recurrent network
+ states iteratively. Mathematically, this corresponds to a product of Jacobians that describe how
+ changes in one recurrent network state depend on changes in the previous network state. Together,
+ this product forms the long-term Jacobian. The singular value spectrum of the long-term Jacobian
+regulates how well error signals can propagate backwards along multiple time steps, allowing temporal
+credit assignment. A close mathematical correspondence of these singular values and the Lyapunov
+exponents of the forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents
+characterize the asymptotic average rate of exponential divergence or convergence of nearby initial
+conditions and are a cornerstone of dynamical systems theory [13, 14]. We will use this link to
+improve the trainability of RNNs.
+
+ 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
+
+Previous approaches that tackled the problem of exploding or vanishing gradients have suggested
+solutions at different levels. First, specialized units such as LSTM and GRU were introduced,
+which have additional latent variables that can be decoupled from the recurrent network states via
+multiplicative (gating) interactions. The gating interactions shield the latent memory state, which can
+therefore transport information across multiple time steps [5, 6, 15]. Second, exploding gradients can
+be avoided by gradient clipping, which re-scales the gradient norm [16] or their individual elements
+[17] if they become too large [18]. Third, normalization schemes like batch normalization prevent
+saturated nonlinearities that contribute to vanishing gradients [19]. Fourth, it was suggested that the
+problem of exploding/vanishing gradients can be ameliorated by specialized network architectures,
+for example, antisymmetric networks [20], orthogonal/unitary initializations [21, 22, 23], coupled
+oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent units [26], echo state networks [27, 28],
+(recurrent) highway networks [29, 30], and stable limit cycle neural networks [11, 31, 32]. Fifth, for
+large networks, a suitable choice of weights can guarantee a well-conditioned Jacobian at initialization
+[21, 33, 34, 35, 36, 37, 38, 39, 40, 41]. These initializations are based on mean-field methods, which
+become exact only in the large-network limit. Such initialization schemes have also been suggested
+for gated networks [40]. However, even when initializing the network with well-behaved gradients,
+gradients will typically not retain their stability during training once the network parameters have
+changed.
+Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a
+technique that keeps gradients well-behaved throughout training. Gradient flossing is based on
+a recently described link between the gradients of backpropagation through time and Lyapunov
+exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian
+[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close
+to zero during training. This improves not only the error gradient norm but also the condition number
+of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.
+We first demonstrate that the Lyapunov exponents can be controlled during training by including an
+additional loss term. We then demonstrate that gradient flossing improves the gradient norm and
+effective dimension of the gradient signal. We find empirically that gradient flossing improves test
+accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally,
+we find that gradient flossing during training further helps to bridge long-time horizons and show
+that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as
+dynamic mean-field theory for initialization, orthogonal initialization and gated units.
+Our contributions include:
+
+ • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in
+ recurrent neural networks based on regularization of Lyapunov exponents.
+
+ • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov
+ exponents.
+
+ • Empirical evidence that gradient flossing improves training on tasks that involve bridging
+ long time horizons.
+
+
+2 RNN Gradients and Lyapunov Exponents
+
+We begin by revisiting the established mathematical relationship between the gradients of the loss
+function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how
+it relates to the problem of vanishing and exploding gradients. In backpropagation through time,
+network parameters θ are iteratively updated by stochastic gradient descent such that a loss Lt is
+locally reduced [1, 2, 3, 4]. For RNN dynamics hs+1 = fθ (hs , xs+1 ), with recurrent network state
+h, external input x, and parameters θ, the gradient of the loss Lt with respect to θ is evaluated by
+ unrolling the network dynamics in time. The resulting expression for the gradient is given by:
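+\[
+\frac{\partial L_t}{\partial \theta}
+  = \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial \theta}
+  = \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau) \frac{\partial h_\tau}{\partial \theta}
+\tag{1}
+\]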
+
+where $T_t(h_\tau)$ is composed of a product of one-step Jacobians $D_s = \partial h_{s+1} / \partial h_s$:
+\[
+T_t(h_\tau) = \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} = \prod_{\tau'=\tau}^{t-1} D_{\tau'} \tag{2}
+\]
+
+Due to the chain of matrix multiplications in Tt , the gradients tend to vanish or explode exponentially
+with time. This complicates training particularly when the task loss at time t depends on inputs x
+or states h from many time steps prior, which creates long temporal dependencies [5, 6, 7, 8]. How
+well error signals can propagate back in time is constrained by the tangent space dynamics along
+trajectory ht , which dictate how local perturbations around each point on the trajectory stretch, rotate,
+shear, or compress as the system evolves.
+The singular values of the Jacobian’s product Tt , which determine how quickly gradients vanish or
+explode during backpropagation through time, are directly related to the Lyapunov exponents of the
+forward dynamics [9, 12]: Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN are defined as the asymptotic
+time-averaged logarithms of the singular values of the long-term Jacobian [13, 42, 43]
+\[
+\lambda_i = \lim_{t \to \infty} \frac{1}{t - \tau} \log(\sigma_{i,t}) \tag{3}
+\]
+
+where σi,t denotes the ith singular value of Tt (hτ ) with σ1,t ≥ σ2,t . . . σN,t (See Appendix I
+for details). This means that positive Lyapunov exponents in the forward dynamics correspond to
+exponentially exploding gradient modes, while negative Lyapunov exponents in the forward dynamics
+correspond to exponentially vanishing gradient modes.
+In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of
+infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the
+signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in
+the forward dynamics correspond to tangent space directions along which error signals are neither
+drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes in
+the tangent dynamics can propagate information reliably across many time steps.
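
As a small numerical illustration of Eq 3 (a sketch under stated assumptions, not code from the paper), finite-time Lyapunov exponents can be read off the singular values of an explicitly formed long-term Jacobian. The function name `finite_time_lyapunov` and the toy system with a constant diagonal Jacobian are hypothetical choices, picked so that the exact answer is known: each exponent is the logarithm of one diagonal entry. Forming the product explicitly is only reasonable for short horizons, because the product quickly becomes ill-conditioned.

```python
import numpy as np

def finite_time_lyapunov(jacobians):
    """Finite-time Lyapunov exponents from the singular values of the
    long-term Jacobian T = D_{t-1} ... D_tau (Eq 3 without the limit)."""
    n = jacobians[0].shape[0]
    T = np.eye(n)
    for D in jacobians:
        T = D @ T  # left-multiply by the next one-step Jacobian
    sigma = np.linalg.svd(T, compute_uv=False)  # singular values, descending
    return np.log(sigma) / len(jacobians)

# Hypothetical toy system: the same diagonal Jacobian at every step.
D = np.diag([1.1, 1.0, 0.5])
lam = finite_time_lyapunov([D] * 20)
print(lam)  # one growing, one neutral, and one shrinking direction
```

The positive exponent corresponds to an exploding gradient mode, the negative one to a vanishing mode, and the zero exponent to a neutral mode that carries error signals without amplification or attenuation.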
+
+3 Gradient Flossing: Idea and Algorithm
+We now leverage the mathematical connection established between Lyapunov exponents and the
+prevalent issue of exploding and vanishing gradients for regularizing the singular values of the
+long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing
+gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the corresponding
+directions in tangent space grow and shrink on average only slowly. This leads to a better-conditioned
+long-term Jacobian Tt (hτ ). We achieve this by using the sum of the squares of the k largest
+Lyapunov exponents λ1 , λ2 . . . λk as a loss function:
+\[
+L_{\mathrm{flossing}} = \sum_{i=1}^{k} \lambda_i^2 \tag{4}
+\]
+
+and evaluate the gradient obtained from backpropagation through time:
+\[
+\frac{\partial L_{\mathrm{flossing}}}{\partial \theta} = \sum_{i=1}^{k} \frac{\partial \lambda_i^2}{\partial \theta} \tag{5}
+\]
+
+This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its
+own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov
+exponents by directly evaluating the long-term Jacobian Tt (Eq 2), we use an established iterative
+reorthonormalization method involving QR decomposition that avoids directly evaluating the ill-
+conditioned long-term Jacobian [12, 44].
+First, we evolve an initially orthonormal system $Q_s = [q_1^s, q_2^s, \ldots, q_k^s]$ in the tangent space along the
+trajectory using the Jacobian $D_s = \partial h_{s+1} / \partial h_s$. This means calculating
+\[
+\tilde{Q}_{s+1} = D_s Q_s \tag{6}
+\]
+at every time step. Second, we extract the exponential growth rates using the QR decomposition,
+\[
+\tilde{Q}_{s+1} = Q_{s+1} R_{s+1},
+\]
+which decomposes $\tilde{Q}_{s+1}$ uniquely into the product of an orthonormal matrix $Q_{s+1}$ of size $N \times k$,
+so that $Q_{s+1}^\top Q_{s+1} = 1_{k \times k}$, and an upper triangular matrix $R_{s+1}$ of size $k \times k$ with positive diagonal
+elements. Note that the QR decomposition does not have to be applied at every step, just sufficiently
+often, i.e., once every $t_{\mathrm{ONS}}$ steps, such that $\tilde{Q}$ does not become ill-conditioned.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of Rs [43, 44]:
+\[
+\lambda_i = \lim_{t \to \infty} \frac{1}{t} \log \prod_{s=1}^{t} R^s_{ii} = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \log R^s_{ii}. \tag{7}
+\]
+This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal
+elements of the Rs -matrix of a QR decomposition of the iterated Jacobian. To propagate the gradient
+of the square of the Lyapunov exponents backward through time in gradient flossing, we used an
+analytical expression for the pullback of the QR decomposition [45]. The backward pass of the QR
+decomposition is given by [45, 46, 47, 48]
+\[
+\overline{\tilde{Q}} = \left[ \bar{Q} + Q \, \mathrm{copyltu}(M) \right] R^{-\top}, \tag{8}
+\]
+where $M = R \bar{R}^\top - \bar{Q}^\top Q$ and the copyltu function generates a symmetric matrix by copying
+the lower triangle of the input matrix to its upper triangle, with the element $[\mathrm{copyltu}(M)]_{ij} =
+M_{\max(i,j),\min(i,j)}$ [45, 46, 47, 48]. Here we denote the adjoint of a variable $T$ as $\bar{T} = \partial L / \partial T$. A simple
+implementation of this algorithm in pseudocode is:
+Algorithm 1 Algorithm for gradient flossing of k tangent space directions
+  initialize h, Q
+  for e = 1 → E do
+    for t = 1 → T do
+      h ← fθ(h, x)
+      D ← dh_t / dh_{t−1}
+      Q ← D · Q
+      if t ≡ 0 (mod t_ONS) then
+        Q, R ← qr(Q)
+        γ_i += log(R_ii)
+      end if
+    end for
+    λ_i = γ_i / T
+    θ_{e+1} ← θ_e − η ∂L_flossing/∂θ
+  end for
+For clarity, we described gradient flossing in terms of stochastic gradient descent, but we actually
+implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2 . An example
+implementation is available here. Note that this algorithm also works for different recurrent network
+architectures. In this case, the Jacobian D has size n × n, where n is the number of dynamic
+variables of the recurrent network model. For example, in case of a single recurrent network of
+N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix D can either be
+calculated analytically or it can be obtained via automatic differentiation.
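
The forward part of Algorithm 1 (Eqs 6 and 7) can be sketched in NumPy as follows. This is an illustrative sketch only, not the author's implementation: it estimates the k largest Lyapunov exponents from a given sequence of one-step Jacobians and does not include the backward pass or the parameter update. The function name `lyapunov_spectrum_qr`, its arguments, and the toy diagonal Jacobian are assumptions. NumPy's `qr` does not guarantee a positive diagonal of R, so the signs are fixed explicitly.

```python
import numpy as np

def lyapunov_spectrum_qr(jacobians, k, t_ons=1, seed=0):
    """Estimate the k largest Lyapunov exponents by iterated
    reorthonormalization (forward pass only, no gradients)."""
    rng = np.random.default_rng(seed)
    n = jacobians[0].shape[0]
    # Random initial orthonormal system Q of size n x k.
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    gamma = np.zeros(k)
    steps_counted = 0
    for t, D in enumerate(jacobians, start=1):
        Q = D @ Q                       # Eq 6: evolve the tangent vectors
        if t % t_ons == 0:              # reorthonormalize every t_ons steps
            Q, R = np.linalg.qr(Q)
            d = np.diag(R)
            Q = Q * np.sign(d)          # flip columns so that R_ii > 0
            gamma += np.log(np.abs(d))  # accumulate log growth factors
            steps_counted = t
    return gamma / steps_counted        # Eq 7 over the counted steps

# Hypothetical toy system with known exponents log(1.1), 0, log(0.5).
D = np.diag([1.1, 1.0, 0.5])
lam = lyapunov_spectrum_qr([D] * 2000, k=2, t_ons=5)
print(lam)
```

Only the steps up to the last reorthonormalization are counted, so a horizon that is not a multiple of `t_ons` does not bias the time average. For gradient flossing itself, the same loop would have to be written in a framework with a differentiable QR decomposition.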
+
+4 Gradient Flossing: Control of Lyapunov Exponents
+In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a
+target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with
+different weight variances. The N units of the recurrent neural network follow the dynamics
+ hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 . (9)
+
+[Figure 1, panels: A) schematic of the long-term Jacobian Tt (hτ ) as a product of one-step Jacobians.
+B) First Lyapunov exponent λ1 (1/τ) vs. epochs, target and actual λ1. C) Mean squared flossed
+Lyapunov exponent vs. epochs for k = 1, 16, 32 flossed exponents. D) Lyapunov exponents λi (1/τ)
+vs. i/N for k = 1, 16, 32 flossed exponents.]
+Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation
+A) Exploding and vanishing gradients in backpropagation through time arise from amplification/attenuation
+of the product of Jacobians that form the long-term Jacobian $T_t(h_\tau) = \prod_{\tau'=\tau}^{t-1} \partial h_{\tau'+1} / \partial h_{\tau'}$.
+B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the
+mean squared error between estimated first Lyapunov exponent and target Lyapunov exponent
+λ1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent
+weights $W_{ij} \sim N(0, g^2/N)$ where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing
+minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla
+RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note,
+the variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32
+with 10 network realizations. Error bars in C indicate the 25% and 75% percentiles and solid line
+shows median.
+
+
+The initial entries of W are drawn independently from a Gaussian distribution with zero mean and
+variance $g^2/N$, where g is a gain parameter that controls the heterogeneity of weights. We here use
+the transfer function ϕ(x) = tanh(x). (See appendix B for gradient flossing with ReLU and LSTM
+units). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d. Gaussian input
+xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
+flossing.
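
For the Vanilla RNN of Eq 9 with ϕ = tanh, the one-step Jacobian follows from the chain rule as $D_s = W \, \mathrm{diag}(1 - \tanh^2(h_s))$. The sketch below, which is an illustration and not the paper's code, implements the update and this Jacobian and checks the latter against central finite differences. The names `rnn_step` and `rnn_jacobian`, the seed, and the scalar input are assumptions; the network size and the weight statistics follow the setup described above.

```python
import numpy as np

def rnn_step(h, x, W, V):
    """One update of Eq 9: h_{s+1} = W tanh(h_s) + V x_{s+1} (scalar input x)."""
    return W @ np.tanh(h) + V * x

def rnn_jacobian(h, W):
    """One-step Jacobian D_s = W diag(1 - tanh(h_s)^2)."""
    return W * (1.0 - np.tanh(h) ** 2)  # scales column j of W by phi'(h_j)

rng = np.random.default_rng(0)
N, g = 32, 1.0
W = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # entries with variance g^2 / N
V = rng.normal(0.0, 1.0, size=N)                  # input weights ~ N(0, 1)
h = rng.normal(size=N)
x = rng.normal()                                  # i.i.d. Gaussian input sample

# Central finite-difference check of the analytical Jacobian.
eps = 1e-6
D_fd = np.stack(
    [(rnn_step(h + eps * e, x, W, V) - rnn_step(h - eps * e, x, W, V)) / (2 * eps)
     for e in np.eye(N)],
    axis=1,
)
print(np.max(np.abs(D_fd - rnn_jacobian(h, W))))
```

The input term does not depend on h, so it drops out of the Jacobian; the input still shapes the Lyapunov exponents indirectly through the trajectory along which the Jacobians are evaluated.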
+In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by
+gradient flossing to match a desired target value. The networks were initialized with 10 different values
+of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing, they quickly
+approached three different target values of the first Lyapunov exponent $\lambda_1^{\mathrm{target}} = \{-1, -0.5, 0\}$
+within less than 100 training epochs with batch size B = 1. We note that gradient flossing with
+positive target $\lambda_1^{\mathrm{target}}$ seems not to arrive at a positive Lyapunov exponent λ1 .
+Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during gradient
+descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in gradient flossing
+(see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has 1, 16, or 32
+Lyapunov exponents close to zero. We conclude that gradient flossing can selectively manipulate one,
+several, or all Lyapunov exponents before or during network training. Gradient flossing also works for
+RNNs of ReLU and LSTM units (see appendix B). Further, we find that the computational bottleneck
+of gradient flossing is the QR decomposition, which has a computational complexity of $O(N k^2)$,
+both in the forward pass and in the backward pass. Thus, gradient flossing of the entire Lyapunov
+spectrum is computationally expensive. However, as we will show, not all Lyapunov exponents need
+to be flossed and only short episodes of gradient flossing are sufficient for significantly improving the
+training performance.
+
+ 5 Gradient Flossing: Condition Number of the Long-Term Jacobian
+[Figure 2, panels: A) condition number κ2 (Tt (hτ )) vs. time horizon t, initial and after flossing,
+theory and simulations. B) κ2 (Tt (hτ )) vs. tangent space dimension m, initial and after flossing,
+theory and simulations. C) κ2 (theory) vs. κ2 (numerical) for k = 5, 10, 15.]
+Figure 2: Gradient flossing reduces condition number of the long-term Jacobian A) Condition
+number κ2 of long-term Jacobian Tt (hτ ) as a function of time horizon t − τ at initialization (blue)
+and after gradient flossing (orange). Direct numerical simulations are done with arbitrary precision
+floating point arithmetic (transparent lines) with 256 bits per float, asymptotic theory based on
+Lyapunov exponents (dashed lines) (Eq 10). B) Condition number for different number of tangent
+space dimensions m. Simulations (dots) and Lyapunov exponent based theory (dashed lines) at
+initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of
+tangent space dimensions available for backpropagation for a given condition number (Grey dotted
+line as a guide for the eye at $\kappa_2 = 10^5$.) First 15 Lyapunov exponents were flossed. C) Comparison of
+condition number obtained via direct numerical simulations vs. Lyapunov exponent-based. Colors
+denote the number of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80,
+epochs = 500, T = 500, gradient flossing for Ef = 500 epochs. Input xs identical to delayed XOR
+task in Fig 3D.
+
+A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient
+flossing improves the condition number of the long-term Jacobian which constrains the error signal
+propagation across long time horizons in backpropagation (Fig 2). The condition number κ2 of a
+linear map A measures how close the map is to being singular and is given by the ratio of the largest
+singular value σmax and the smallest singular value σmin , so $\kappa_2(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$. According to the
+rule of thumb given in [51], if $\kappa_2(A) = 10^p$, one can anticipate losing at least p digits of precision
+when solving the equation Ax = b. Note that the long-term Jacobian Tt is composed of a product of
+Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the condition number
+numerically, we use arbitrary-precision arithmetic with 256 bits per float. We find numerically that
+the condition number of Tt exponentially diverges with the number of time steps (Fig 2A). We
+compare the numerically measured condition number κ2 with an asymptotic approximation of the
+condition number based on Lyapunov exponents that are calculated in the forward pass and find a
+good match (Fig 2A).
+Our theoretical estimate of the condition number κ2 of an orthonormal system Q of size N × m that
+is temporally evolved by the long-term Jacobian Tt is:
+
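+\[
+\kappa_2(\tilde{Q}_{t+\tau}) = \kappa_2\left( T_t(h_\tau) Q_t \right) = \frac{\sigma_1(T_t(h_\tau))}{\sigma_m(T_t(h_\tau))} \approx \exp\left( (\lambda_1 - \lambda_m)(t - \tau) \right). \tag{10}
+\]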
+where σ1 (Tt (hτ )) and σm (Tt (hτ )) are the first and mth singular value of the long-term Jacobian.
+We note that this theoretical estimate of the condition number follows from the asymptotic definition
+of Lyapunov exponents and should be exact in the limit of long times. We find that gradient flossing
+reduces the condition number by a factor whose magnitude increases exponentially with time (orange
+in Fig 2A). Thus, we can expect that gradient flossing has a stronger effect on problems with a long
+time horizon to bridge. We will later confirm this numerically.
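
The estimate in Eq 10 is simple to evaluate once the Lyapunov exponents are known. The following sketch, which is illustrative and not the paper's code, compares it with a directly computed condition number for a hypothetical toy system with a constant diagonal Jacobian, where the estimate is exact. The function name `condition_number_estimate` and the toy values are assumptions. The base-10 logarithm of the result gives the number of digits lost according to the rule of thumb quoted above.

```python
import numpy as np

def condition_number_estimate(lyap, m, horizon):
    """Asymptotic estimate of Eq 10:
    kappa_2 ~ exp((lambda_1 - lambda_m) * (t - tau))."""
    lyap = np.sort(np.asarray(lyap, dtype=float))[::-1]  # lambda_1 >= lambda_2 >= ...
    return np.exp((lyap[0] - lyap[m - 1]) * horizon)

# Hypothetical toy system: constant diagonal Jacobian, so the exponents are exact.
D = np.diag([1.1, 1.0, 0.5])
lyap = np.log(np.diag(D))
horizon = 20

T = np.linalg.matrix_power(D, horizon)  # long-term Jacobian over 20 steps
kappa_direct = np.linalg.cond(T)        # 2-norm: sigma_max / sigma_min
kappa_theory = condition_number_estimate(lyap, m=3, horizon=horizon)
digits_lost = np.log10(kappa_theory)    # rule of thumb: kappa_2 = 10^p loses p digits

print(kappa_direct, kappa_theory, digits_lost)
```

Because the estimate depends only on the gap between the first and the mth exponent, pushing the first m exponents toward a common value, such as zero, shrinks the exponent of the estimate for any time horizon.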
+Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available
+for the backpropagation of error signals. Generally, the long-term Jacobian is ill-conditioned, however,
+the Lyapunov spectrum provides for a given number of tangent space dimensions an estimate of the
+condition number. This indicates how close to singular the gradient signal for a given number of
+tangent space dimensions is. Given a fixed acceptable condition number—determined, for example,
+by noise level or floating-point precision—we observe that gradient flossing increases the number of
+usable tangent space dimensions for backpropagation (Fig 2B).
+ Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents
+can even predict differences in condition number that originate from finite network size N (Fig 2C).
+We emphasize that this goes beyond mean-field methods, which become exact only in the large-
+network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G).
+
+6 Initial Gradient Flossing Improves Trainability
+[Figure 3, panels: A) delayed copy task and B) delayed XOR task, test loss vs. epochs with and
+without gradient flossing. C) delayed copy task and D) delayed XOR task, test loss vs. task
+difficulty (delay d).]
+Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons A) Test
+error for Vanilla RNNs trained on delayed copy task yt = xt−d for d = 40 with and without gradient
+flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed
+XOR task with yt = |xt−d/2 − xt−d |. C) Mean final test loss as a function of task difficulty (delay d)
+for delayed copy task. D) Mean final test loss as a function of task difficulty (delay d) for delayed
+XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = $10^4$, T = 300, gradient flossing
+for Ef = 500 epochs on k = 75 before training. Shaded regions in C and D indicate the 20% and
+80% percentiles and solid line shows mean. Dots are individual runs. Task loss: MSE(y, ŷ).
+
+We next present numerical results on two tasks with variable spatial and temporal complexity,
+demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. We
+call gradient flossing before training in the following preflossing. For preflossing, we first initialize the
+network randomly, then minimize $L_{\mathrm{flossing}} = \sum_{i=1}^{k} \lambda_i^2$ using the ADAM optimizer and subsequently
+train on the tasks. We deliberately do not use sequential MNIST or similar toy tasks commonly used
+to probe exploding/vanishing gradients, because we want a task where the structure of long-range
+dependencies in the data is transparent and can be varied as desired.
+First, we consider the delayed copy task, where a scalar stream of random input numbers x must be
+reproduced by the output y delayed by d time steps, i.e. yt = xt−d . Although the task itself is trivial
+and can be solved even by a linear network through a delay line (see appendix E), RNNs encounter
+vanishing gradients for large delays d during training even with ’critical’ initialization with g = 1.
+Our experiments show that gradient flossing can substantially improve the performance of RNNs
+on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail to train reliably beyond
+d = 20, Vanilla RNNs with gradient flossing can be reliably trained for d = 40 (Fig 3C). Note that
+we flossed here k = 40 Lyapunov exponents before training. We will later investigate the role of the
+number of flossed Lyapunov exponents.
+Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear input-
+output computation on a sequential stream of scalar inputs, i.e., yt = |xt−d/2 − xt−d |, where d
+denotes a time delay of d time steps (For details see appendix H). Fig 3D demonstrates that gradient
+flossing helps to train networks on a substantially longer delay d. We found similar improvements
+through gradient flossing for RNNs initialized with orthogonal weights (see appendix G).
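
The two tasks of this section can be written as target generators. The sketch below is an illustration under assumptions and not the paper's data pipeline: the input distribution (uniform on [0, 1)), the zero targets for the first d steps where the delayed input does not exist yet, and the function names `delayed_copy` and `delayed_xor` are all hypothetical choices, and an even delay d is assumed so that d/2 is an integer.

```python
import numpy as np

def delayed_copy(x, d):
    """Targets y_t = x_{t-d}; the first d targets are set to 0 (assumption)."""
    y = np.zeros_like(x)
    y[d:] = x[:-d]
    return y

def delayed_xor(x, d):
    """Targets y_t = |x_{t-d/2} - x_{t-d}| for even d;
    the first d targets are set to 0 (assumption)."""
    assert d > 0 and d % 2 == 0, "this sketch assumes a positive, even delay"
    y = np.zeros_like(x)
    y[d:] = np.abs(x[d // 2 : -(d // 2)] - x[:-d])
    return y

rng = np.random.default_rng(0)
T, d = 300, 40
x = rng.uniform(0.0, 1.0, size=T)  # assumed input distribution
y_copy = delayed_copy(x, d)
y_xor = delayed_xor(x, d)
print(y_copy[d:d + 3], y_xor[d:d + 3])
```

At every time step, all inputs of the last d steps are still needed for future targets, which makes explicit why longer delays demand both a longer time horizon and more stored values.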
+
+ 7 Gradient Flossing During Training
+[Figure 4, panels: A) delayed temporal XOR and B) delayed spatial XOR, test accuracy (%) vs. epochs
+for flossing during training, preflossing, and no flossing. C) and D) test accuracy (%) vs.
+complexity (delay d) for the same three conditions.]
+Figure 4: Gradient flossing during training further improves trainability
+A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task yt = xt−d/2 ⊕ xt−d
+with gradient flossing during training (green), preflossing (gradient flossing before training) (orange),
+and with no gradient flossing (blue) for d = 70. Solid lines are mean across 20 network realizations,
+individual network realizations shown in transparent fine lines. B) Same as A for delayed spatial
+XOR task with $y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$. Parameters (g = 1, batch size b = 16). C) Test accuracy
+as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a
+function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size
+b = 16, N = 80, epochs = $10^4$, T = 300, gradient flossing for Ef = 500 epochs on k = 75 before
+training and during training for green lines, and only before training for orange lines. Same plotting
+conventions as previous figure. Task loss: cross-entropy between y and ŷ.
+We next investigate the effects of gradient flossing during the training and find that gradient flossing
+during training can further improve trainability. We trained RNNs on two more challenging tasks
+with variable temporal complexity and performed gradient flossing either both during and before
+training, only before training, or not at all.
+Fig 4A shows the test accuracy for Vanilla RNNs training on the delayed temporal XOR task
+yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla RNNs
+falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the trainability
+can be improved, but still goes to chance level for d = 70. In contrast, for networks with gradient
+flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed
+for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 4B,
+D the networks have to perform the nonlinear XOR operation $y_t = x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$ on a
+three-dimensional binary input signal $x^1$, $x^2$, and $x^3$ and generate the correct output with a delay of
+d steps. While the solution of the task itself is not difficult and could even be implemented by hand
+(see appendix), the task is challenging for backpropagation through time because nonlinear temporal
+associations bridging long time horizons have to be formed. Again, we observe that gradient flossing
+before training improves the performance compared to baseline, but starts failing for long delays
+d > 60. In contrast, networks that are also flossed during training can solve even more difficult tasks
+(Fig 4D). We find that after gradient flossing, the norm of the error gradient with respect to initial
+conditions h0 is amplified (appendix C). Interestingly, gradient flossing can also be detrimental to
+task performance if it is continued throughout all training epochs (appendix C).
+We note that merely regularizing the spectral radius of the recurrent weight matrix W or the individual
+one-step Jacobians Ds numerically or based on mean-field theory does not yield such a training
+improvement. This suggests that taking the temporal correlations between Jacobians Ds into account
+is important for improving trainability.
+
+ 7.1 Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents
+
+We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in
+training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient
+flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d.
+We found that as the task becomes more difficult, networks where not enough Lyapunov exponents
+k are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring
+final test accuracy as a function of the number of flossed Lyapunov exponents, we observed that
+more Lyapunov exponents k have to be flossed to achieve 100% accuracy as the tasks become more
+difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function of
+both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same trend
+(Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents close to
+zero. We note that this might also partially be caused by the ’streaming’ nature of the task: in our
+tasks, longer delays automatically imply that more values have to be stored as at any moment all the
+values in the ’delay line’ have to be remembered to successfully solve the tasks. This is different from
+tasks where a single variable has to be stored and recalled after a long delay. It would be interesting
+to study tasks where the number of delay steps and the number of items in memory can be varied
+independently.
+Finally, we did the same analysis on networks with only preflossing (gradient flossing before training)
+and found the same trend (supplement Fig 7D), however, in that case even if all N Lyapunov
+exponents were flossed, thus k = N , they were not able to solve the most difficult tasks. This seems
+to indicate that gradient flossing during training cannot be replaced by just gradient flossing more
+Lyapunov exponents before training.
+
+
+
+
+Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers of
+flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number
+of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4
+for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing
+for Ef = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75%
+percentiles, solid lines are means, transparent dots are individual simulations, task loss: cross-entropy
+between y and ŷ.
+8 Limitations
+The mathematical connection between Lyapunov exponents and backpropagation through time
+exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be
+interesting to extend our analysis to finite-time Lyapunov exponents.
+Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians
+of different time periods t − τ , but the Lyapunov exponent only considers the asymptotic longest
+product. Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of
+the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside
+or towards the attractor.
+Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry
+of covariant Lyapunov vectors [53], which could be used to improve training performance, speed,
+and reliability. Additionally, it is important to investigate how sensitive the method is to the choice
+of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54].
+
+
+Finally, the computational cost of our method scales with O(N k^2), where N is the network size
+and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we
+suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is
+not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also
+calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov
+spectrum and investigate more efficient gradient flossing schedules.
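+
+The QR-based estimate of the first k Lyapunov exponents, with reorthonormalization only every few steps, can
+be sketched as follows. This is a hedged illustration rather than the author's implementation: the names
+rnn_jacobians, lyapunov_spectrum and qr_every are hypothetical, and the input-free tanh network is an
+assumption made to keep the example short.

```python
import numpy as np

def rnn_jacobians(W, h0, T):
    """One-step Jacobians D_s = W diag(phi'(h_s)) of the input-free map h_{s+1} = W tanh(h_s)."""
    h = h0.copy()
    for _ in range(T):
        yield W * (1.0 - np.tanh(h) ** 2)  # broadcasting scales column j by phi'(h_j)
        h = W @ np.tanh(h)

def lyapunov_spectrum(jacobians, k, qr_every=1, seed=0):
    """Estimate the k largest Lyapunov exponents from a sequence of one-step Jacobians."""
    jacobians = list(jacobians)
    n = jacobians[0].shape[0]
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # random orthonormal start
    log_r = np.zeros(k)
    for s, d in enumerate(jacobians, start=1):
        q = d @ q
        # Reorthonormalize only every qr_every steps (and at the final step).
        if s % qr_every == 0 or s == len(jacobians):
            q, r = np.linalg.qr(q)
            log_r += np.log(np.abs(np.diag(r)))
    return log_r / len(jacobians)

# Example: N = 80 and g = 1 as in the figure parameters, k = 5 exponents.
rng = np.random.default_rng(1)
N = 80
W = rng.standard_normal((N, N)) / np.sqrt(N)
lam = lyapunov_spectrum(rnn_jacobians(W, rng.standard_normal(N), 300), k=5, qr_every=5)
```

+Larger qr_every reduces the number of QR factorizations but lets the columns of Q become more nearly
+parallel between factorizations, so it should stay small enough to keep the system well-conditioned.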
+
+
+9 Discussion
+
+We tackle the problem of gradient signal propagation in recurrent neural networks through a dy-
+namical systems lens. We introduce a novel method called gradient flossing that addresses the
+problem of gradient instability during training. Our approach enhances gradient signal stability both
+before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian
+well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this,
+we combine established dynamical systems methods for calculating Lyapunov exponents with an
+analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability
+in a manner that is memory-efficient, numerically stable, and exact across long time horizons.
+Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55].
+Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For
+difficult temporal credit assignment problems, gradient flossing throughout training further enhances
+signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks
+with controllable time-complexity and show that it can be combined with other approaches to tackle
+exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal
+initialization and specialized single units, such as LSTMs.
+Prior research on exploding and vanishing gradients mainly focused on selecting network architectures
+that are less prone to exploding/vanishing gradients or finding parameter initializations that provide
+well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can
+be seen as a complementary approach that can further enhance gradient stability throughout training.
+Compared to the work on picking good parameter initializations based on random matrix theory [41]
+and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field
+theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient
+flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory
+and mean-field heuristics are usually confined to the limit of large networks [52], while gradient
+flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients
+of backpropagation through time has been described previously [9, 12] and has been spelled out
+analytically and studied numerically [10, 11, 56, 57, 58]. In contrast, we use Lyapunov exponents
+here not only as a diagnostic tool for gradient stability but also to show that they can directly be part
+of the cure for exploding and vanishing gradients.
+Future investigations could delve further into the roles of the second to Nth Lyapunov exponents
+in trainability, and how they are related to the task at hand, the rank of the parameter update, the
+dimensionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results
+suggest a trade-off between trainability across long time horizons and the nonlinear task demands
+that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent
+learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to
+feedforward networks, state-space models and transformers is a promising avenue for future research
+(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such
+as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks
+could be treated similarly. This could also provide a link between the stability of the network against
+adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that
+time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate that it
+may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture
+search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other
+model parameters, as well. For instance, gradients of Lyapunov exponents with respect to single-
+unit parameters could optimize the activation function and single-neuron biophysics in biologically
+plausible neuron models.
+
+
+
+ Acknowledgments and Disclosure of Funding
+I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G.
+Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S.
+Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript.
+I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex
+Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for
+the Global Brain (542939SPI), and the Swartz Foundation (2021-6).
+
+References
+ [1] P. WERBOS. Beyond Regression :. Ph. D. dissertation, Harvard University, 1974.
+ [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management
+ Science. MIT-Press, Cambridge, Mass, 8, 1985.
+ [3] Y. LECUN. Une procedure d’apprentissage pour reseau a seuil asymetrique. Proceedings of Cognitiva 85,
+ pages 599–604, 1985.
+ [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-
+ propagating errors. Nature, 323(6088):533, October 1986.
+ [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991.
+ [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–
+ 1780, 1997.
+ [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
+ IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
+ [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural
+ Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063.
+ [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+ arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427.
+[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics
+ with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022.
+[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół. Persistent learning signals and working
+ memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023.
+[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+ Physical Review Research, 5(4):043044, October 2023.
+[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics,
+ 57(3):617–656, July 1985.
+[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics.
+ Cambridge University Press, Cambridge, February 2016.
+[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of
+ Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS
+ Bibcode: 2014arXiv1409.1259C Type: article.
+[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
+ networks. In Proceedings of the 30th International Conference on International Conference on Machine
+ Learning - Volume 28, ICML’13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org.
+[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of
+ Technology, Faculty of Information Technology, Brno, CZ, 2012.
+[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
+ Google-Books-ID: omivDQAAQBAJ.
+[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
+ Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine
+ Learning, pages 448–456. PMLR, June 2015.
+
+
+ [20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View
+ on Recurrent Neural Networks. International Conference on Learning Representations, December 2018.
+
+[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of
+ learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv:
+ 1312.6120.
+
+[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In
+ Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June
+ 2016.
+
+[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley
+ Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978.
+ PMLR, July 2018.
+
+[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN):
+ An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page
+ arXiv:2010.00951, October 2020.
+
+[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney.
+ Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January
+ 2021.
+
+[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and
+ Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023.
+ ADS Bibcode: 2023arXiv230306349O Type: article.
+
+[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an
+ erratum note. Bonn, Germany: German National Research Center for Information Technology GMD
+ Technical Report, 148:34, 2001.
+
+[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks.
+ Neural Computation, 19(1):111–138, January 2007.
+
+[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances
+ in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
+
+[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway
+ Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198.
+ PMLR, July 2017.
+
+[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle
+ Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887,
+ November 2019. ISSN: 2576-2303.
+
+[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook
+ University, May 2023.
+
+[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15,
+ pages 315–323, April 2011.
+
+[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential
+ expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June
+ 2016. arXiv: 1606.05340.
+
+[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field
+ Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394
+ [cs, stat], August 2018. arXiv: 1806.05394.
+
+[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural
+ Networks. December 2018.
+
+[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information
+ Propagation. November 2016.
+
+[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In
+ Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
+
+
+ [39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training.
+ Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article.
+[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey
+ Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019.
+[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls
+ phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv:
+ 2002.00025.
+[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov exponents of
+ dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968.
+[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing
+ Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990.
+[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic
+ Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of
+ them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980.
+[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks.
+ Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph].
+[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application
+ in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode:
+ 2010arXiv1001.1654W Type: article.
+[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient $WY$ Representation for Products of
+ Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January
+ 1989.
+[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating
+ Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article.
+[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning
+ through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems,
+ volume 30. Curran Associates, Inc., 2017.
+[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in
+ Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979.
+[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007.
+ Google-Books-ID: ZUfVZELlrMEC.
+[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and
+ Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics,
+ 11(1):501–528, 2020.
+[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant
+ Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007.
+[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D:
+ Nonlinear Phenomena, 118(3):167–198, July 1998.
+[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential
+ Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+ 2018.
+[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents.
+    Master’s thesis, RWTH Aachen, Aachen/Juelich, 2021.
+[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents
+ for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied
+ Mathematics and Statistics, 8, 2022.
+[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks.
+ Physical Review X, 12(1):011011, January 2022.
+[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes. Routledge, New York, 1st edition,
+    March 1987.
+
+
+ [60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus
+ of Variations and Optimal Control Theory. Princeton University Press, December 2011.
+
+[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms
+ for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320–
+ 135:5353, January 2020.
+
+[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization. Technical
+ report, August 2019. ADS Bibcode: 2019arXiv190802729H Type: article.
+[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without
+ Normalization. Technical report, January 2019. ADS Bibcode: 2019arXiv190109321Z Type: article.
+
+[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical
+ Review Letters, 69(26):3717–3719, December 1992.
+
+[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent
+ neural networks. Physical Review E, 82(1):011903, July 2010.
+
+[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered
+ neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758.
+
+[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations
+ impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology,
+ 18(12):e1010590, December 2022.
+
+[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002. Google-Books-ID:
+ PfXoAwAAQBAJ.
+[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems.
+ World Scientific Pub Co Inc, Hackensack, NJ, September 2009.
+
+
+
+
+ A Backpropagation Through QR Decomposition
+The backward pass of the QR decomposition $A = QR$ is given by [45, 46, 47, 48]
+    $$\bar{A} = \left[\bar{Q} + Q\,\mathrm{copyltu}(M)\right] R^{-T} \qquad (11)$$
+where $M = R\bar{R}^{T} - \bar{Q}^{T} Q$ and the copyltu function generates a symmetric matrix by copying the lower
+triangle of the input matrix to its upper triangle, with the element $[\mathrm{copyltu}(M)]_{ij} = M_{\max(i,j),\min(i,j)}$
+[45, 46, 47, 48]. Adjoint variables are written here as $\bar{T} = \partial L/\partial T$.
+Using an analytical pullback is more memory-efficient and less computationally costly than directly doing
+automatic differentiation through the QR-decomposition. Moreover, from a practical perspective, for QR
+decomposition, often BLAS/LAPACK routines are utilized which are not amenable to common differentiable
+programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient
+flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu, available here.
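+
+The analytical pullback of Eq. 11 can be checked numerically. The following NumPy sketch is not the
+BackwardsLinalg.jl code; the names qr_backward and loss and the linear test loss are assumptions chosen
+here so that the adjoints of Q and R are known exactly. The result is compared against central finite
+differences.

```python
import numpy as np

def copyltu(m):
    """Symmetric matrix built by copying the lower triangle of m to its upper triangle."""
    return np.tril(m) + np.tril(m, -1).T

def qr_backward(q, r, q_bar, r_bar):
    """Adjoint of A for A = QR, with A of shape n x k, n >= k, and full column rank."""
    m = r @ r_bar.T - q_bar.T @ q
    x = q_bar + q @ copyltu(m)
    return np.linalg.solve(r, x.T).T  # equals x @ R^{-T} without forming the inverse

def loss(a, wq, wr):
    """Hypothetical linear loss, so that dL/dQ = wq and dL/dR = wr."""
    q, r = np.linalg.qr(a)
    return np.sum(wq * q) + np.sum(wr * r)

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 3))
wq = rng.standard_normal((5, 3))
wr = np.triu(rng.standard_normal((3, 3)))  # R is upper triangular, so is its adjoint
q, r = np.linalg.qr(a)
a_bar = qr_backward(q, r, wq, wr)

# Central finite differences over every entry of A.
eps = 1e-6
fd = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        e = np.zeros_like(a)
        e[i, j] = eps
        fd[i, j] = (loss(a + e, wq, wr) - loss(a - e, wq, wr)) / (2 * eps)
max_err = np.max(np.abs(a_bar - fd))
```

+The pullback needs only Q, R and their adjoints, so no intermediate Householder steps have to be stored.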
+
+B Further Details and Analysis of Gradient Flossing
+An example implementation of gradient flossing in Flux, a machine learning library in Julia is available here.
+We are actively developing implementations for other widely used differentiable programming frameworks.
+
+B.1 Gradient Flossing for recurrent LSTM and ReLU networks
+
+[Figure 6 plots. Panels A (LSTM) and C (ReLU): target and actual λ1 versus epochs (0 to 100). Panels B and D:
+squared λ1 versus epochs (0 to 100), logarithmic vertical axis.]
+
+Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks A) First
+Lyapunov exponent of LSTM network as a function of training epochs. Minimizing the mean
+squared error between estimated first Lyapunov exponent and target Lyapunov exponent λ1 = 0
+by gradient descent. First Lyapunov exponent of LSTM network (solid lines) converges to target
+value (thick dashed lines) within less than 100 epochs. 10 random LSTM RNNs were initialized with
+Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1).
+B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent
+LSTM networks over epochs. C) Same as A for recurrent ReLU network. Here networks were
+initialized with Gaussian recurrent weights Wij ∼ N (−0.1, g 2 /N ) where values of g were drawn
+g ∼ Unif(0, 1). D) Same as B for recurrent ReLU network. Parameters: network size N = 32 with 10
+network realizations. Shaded regions in B, D are 25% and 75% percentiles, solid line shows median.
+
+We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To
+this end, we generated random LSTM networks where the weights of all the different gates and biases were
+ independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance.
+Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The
+dynamics of each of the N LSTM units follows the map [6]:
+ ft = σg (Uf ht−1 + Wf xt + bf ) (12)
+ ot = σg (Uo ht−1 + Wo xt + bo ) (13)
+ it = σg (Ui ht−1 + Wi xt + bi ) (14)
+ c̃t = σh (Uc ht−1 + Wc xt + bc ) (15)
+ ct = ft ⊙ ct−1 + it ⊙ c̃t (16)
+ ht = ot ⊙ ϕ(ct ) (17)
+where ⊙ denotes the Hadamard product, σg (x) = 1/(1 + exp(−x)) is the sigmoid function, σh (x) = tanh(x) and
+entries of the matrices Ux are drawn from Ux ∼ N (0, gx^2 /N ). For simplicity, the bias terms bx are scalars.
+Subscripts f , o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state.
+In each LSTM unit, there are two dynamic variables c and h, and three gates f , o, and i that control the flow
+of signals into and out of the cell c. We set the values gih , gix , gf x , bf , gch , gcx , gcx , gox to be uniformly
+distributed between 0 and 1 and initialize bi , gf h , bc , bo as zero.
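+
+One step of the LSTM map in Eqs. 12 to 17 can be written compactly in NumPy. This is a sketch under stated
+assumptions: the function name lstm_step and the dictionary layout of the parameters are hypothetical, and
+ϕ in Eq. 17 is taken to be tanh, which the text does not state explicitly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, U, Wx, b):
    """One step of Eqs. (12)-(17); U, Wx and b are dicts keyed by 'f', 'o', 'i', 'c'."""
    f = sigmoid(U["f"] @ h_prev + Wx["f"] @ x + b["f"])        # forget gate, Eq. (12)
    o = sigmoid(U["o"] @ h_prev + Wx["o"] @ x + b["o"])        # output gate, Eq. (13)
    i = sigmoid(U["i"] @ h_prev + Wx["i"] @ x + b["i"])        # input gate, Eq. (14)
    c_tilde = np.tanh(U["c"] @ h_prev + Wx["c"] @ x + b["c"])  # candidate, Eq. (15)
    c = f * c_prev + i * c_tilde                               # cell state, Eq. (16)
    h = o * np.tanh(c)                                         # Eq. (17), assuming phi = tanh
    return h, c

# With all weights and biases at zero, every gate equals 0.5 and the candidate is 0.
N = 4
U = {key: np.zeros((N, N)) for key in "foic"}
Wx = {key: np.zeros((N, 1)) for key in "foic"}
b = {key: 0.0 for key in "foic"}  # scalar biases, as in the text
h, c = lstm_step(np.zeros(N), np.ones(N), np.zeros(1), U, Wx, b)
```

+In this degenerate setting the cell state is halved at each step, which makes the role of the forget gate
+as a per-unit decay factor explicit.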
+During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close
+to the target Lyapunov exponent λ1^target = 0 in fewer than 100 epochs as shown in Fig 6A. Fig 6B shows that
+the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov
+exponent of λ1^target = −1 is achieved after 100 gradient flossing steps only for a subset of random network
+realizations (not shown). We speculate that this behavior is influenced by the gating structure of LSTM units,
+which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (see also
+[9, 12, 41, 58]).
+For the recurrent ReLU networks, we considered the same vanilla RNN dynamics as in the main manuscript in
+Eq 9:
+    hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 .
+The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1
+and variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We use the transfer
+function ϕ(x) = max(x, 0). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d.
+Gaussian input xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
+flossing. We found that some ReLU networks initially had unstable dynamics with positive Lyapunov exponents
+(Fig 6C). However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that
+the squared Lyapunov exponents of ReLU networks converge towards zero.
+
+B.2 Additional Results for Different Numbers of Flossed Lyapunov Exponents
+In addition to the analysis in the main Fig 5, we did the same analysis on networks with only preflossing (gradient flossing
+before training) and found that more Lyapunov exponents k have to be flossed to achieve 100% accuracy as the
+tasks become more difficult (Fig 7D), however, in that case even if all N Lyapunov exponents were flossed, thus
+k = N , they were not able to solve the most difficult tasks. This seems to indicate that gradient flossing during
+training cannot be replaced by just gradient flossing more Lyapunov exponents before training.
+
+
+C Additional Results on Gradient Flossing Throughout Training
+We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient
+flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time
+steps is boosted. Moreover, subordinate singular values of the error norm of the recurrent weights are also
+boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally,
+we show that if gradient flossing is continued throughout training it can be detrimental to the accuracy. Finally,
+we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR
+task have a simple relationship to the delay d.
+
+
+D Gradient Flossing boosts the Gradient Norm for Long Time Horizons
+In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is
+important to note that the complete error gradient of backpropagation through time is composed of a summation
+of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent
+dynamics before reaching its target. Consequently, when the singular values of the long-term Jacobian are
+smaller than 1, the influence of the shorter loops typically dominates the long-term Jacobian.
+
+
+[Figure 7 plots. Panel A: test accuracy (%) versus complexity (delay d) for different k. Panel B: test accuracy (%)
+versus number of flossed Lyapunov exponents k for different delays. Panels C and D: test accuracy (%) over
+complexity (delay d) and number of flossed Lyapunov exponents k.]
+
+Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers of
+flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number of
+flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function
+of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400}). D) Same as C for preflossing only. Parameters: g = 1, batch
+size b = 16, N = 80, epochs = 10^4 for delayed temporal XOR, epochs = 5000 for delayed spatial
+XOR, T = 300, gradient flossing for Ef = 500 epochs before training and during training for A, B,
+C, and only before training for D. Shaded areas are 25% and 75% percentiles, solid lines are means,
+transparent dots are individual simulations, task loss: cross-entropy between y and ŷ.
+
+
+In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop
+length of backpropagation through time is necessary for finding the correct association. We were moreover
+careful in our task design not to have any additional signals in our task that might help to bridge the long time
+scale. In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops,
+even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate
+the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time
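+horizons, we focus on the gradient with respect to the initial conditions h0.
+\[
+\frac{\partial L_t}{\partial h_0}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial h_0}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \delta_{\tau 0}
+= \frac{\partial L_t}{\partial h_t}\, T_t(h_0) \tag{18}
+\]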
+
+We note that the sum conveniently collapses: the only summand that contributes is the longest ’loop’, i.e.,
+the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure
+that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we
+use the binary cross entropy loss, which makes the derivative ∂L_t/∂h_t trivial.
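+The collapse of the sum in Eq 18 can be checked numerically. The following NumPy sketch is only an
+illustration under stated assumptions: it uses a vanilla RNN without input or bias, h_{s+1} = tanh(W h_s),
+a toy loss L = Σ_i (h_t)_i, and a hypothetical helper long_term_jacobian. None of these choices is taken
+from the code accompanying this work. The sketch accumulates T_t(h0) as a product of one-step Jacobians
+and forms dL/dh0 from it.

```python
import numpy as np

def long_term_jacobian(W, h0, t):
    """Return T_t(h0) = D_{t-1} ... D_1 D_0 and the final state h_t.

    Assumes the illustrative update h_{s+1} = tanh(W @ h_s), whose one-step
    Jacobian is D_s = diag(1 - h_{s+1}**2) @ W.
    """
    h = h0.copy()
    T = np.eye(len(h0))
    for _ in range(t):
        h = np.tanh(W @ h)
        D = (1.0 - h**2)[:, None] * W  # diag(phi'(W h_s)) @ W
        T = D @ T                      # later Jacobians multiply from the left
    return T, h

rng = np.random.default_rng(0)
N, t = 8, 20                                          # small sizes, for illustration only
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))    # Gaussian weights with gain g = 1
h0 = rng.normal(size=N)
T, ht = long_term_jacobian(W, h0, t)

# Toy loss L = sum(h_t), so dL/dh_t is a vector of ones.
dL_dht = np.ones(N)
dL_dh0 = T.T @ dL_dht                                 # column-vector form of Eq 18
print(np.linalg.norm(dL_dh0))
```

+Because vectors are stored as columns here, the row-vector product ∂L_t/∂h_t T_t(h0) of Eq 18 appears as
+T.T @ dL_dht.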
+In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we
+compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70.
+One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}, while the other is trained
+without gradient flossing.
+
+For the network without gradient flossing, the gradient norm |dL/dh0| diminishes to extremely small values
+(< 10^−6) and remains small throughout training. In contrast, for the network trained with gradient flossing, each
+episode of gradient flossing causes the norm |dL/dh0| to spike to values above 10^−2. These findings
+are direct evidence that gradient flossing boosts the gradient norm, which helps to bridge long time horizons in
+challenging temporal credit assignment tasks. We observe that after several episodes of gradient flossing, the
+gradient norm |dL/dh0| of the networks stays around 10^−4 and eventually rises to values around 10^−2. Subsequently
+in training, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between
+gradient norm |dL/dh0| and training success consistently across numerous network realizations (Fig 8C and D).
+These findings suggest that the gradient norm |dL/dh0| can be a good predictor of learning success, sometimes
+hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient
+norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual
+growth of gradient norm over epochs before accuracy surpasses chance level (Fig 9A). Analogously, when
+plotting the accuracy as a function of epoch aligned with the last epoch with |dL/dh0| < 0.001, we observe for this
+task that the increase of gradient norm |dL/dh0| reliably precedes the epoch at which the accuracy surpasses the
+chance level (see Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector
+dL/dh0 with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap
+around the training epoch where the accuracy surpasses the chance level, both in networks with and without
+gradient flossing. This does not come as a surprise, as the first covariant Lyapunov vector measures the most unstable
+(or least stable) direction in the tangent space of a trajectory, and perturbations of h0 that have to travel over
+many epochs align with this direction.
+
+D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient
+To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient
+and how it is changed by gradient flossing. To this end, we decompose the recurrent weight gradient dL/dW
+into a weighted sum of outer products using singular value decomposition, with singular values σi(dL/dW) (Fig 10).
+As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term
+Jacobian Tt (hτ ), this allows us to directly link the effect of pushing Lyapunov exponents toward zero during
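+gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked:
+\[
+\frac{\partial L_t}{\partial W}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau=t-l}^{t-1} \left( \prod_{\tau'=\tau}^{t-1} \frac{\partial h_{\tau'+1}}{\partial h_{\tau'}} \right) \frac{\partial h_\tau}{\partial W}
+= \frac{\partial L_t}{\partial h_t} \sum_{\tau} T_t(h_\tau)\, \frac{\partial h_\tau}{\partial W} \tag{19}
+\]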
+
+We again note that different ’loops’ contribute to the total gradient expression and the Lyapunov exponents only
+characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only few of the
+summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important
+signals about the task to be contaminated by summands of both shorter and longer chains, which contribute
+irrelevant fluctuations.
+The singular values of the recurrent weight gradient σi(dL/dW) as a function of training epoch reveal that the
+subordinate singular values σ20 and σ40 exhibit peaks at the times of gradient flossing,
+while the first singular value σ1 only shows a slight peak (Fig 10A). This indicates that gradient flossing
+increases the effective rank of the recurrent weight gradient dL/dW. In other words, gradient flossing facilitates
+high-dimensional parameter updates. Our interpretation is that, as gradient flossing pushes Lyapunov exponents
+to zero, the different summands in the total gradient contribute more equitably, as long loops have neither a
+dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would
+happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank.
+In contrast, without gradient flossing, the subordinate singular values (in Fig 10A σ20 and σ40) rapidly diminish
+to extremely small values over training epochs and remain very small throughout training. Note, however, that the
+leading singular values σ1 are of comparable size irrespective of whether gradient flossing was performed or not.
+We note that, similar to the gradient norm of the loss with respect to the initial condition |dL/dh0|, the subordinate
+singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance
+level (Fig 10B). We confirmed this in multiple other network realizations and give here another example where the
+accuracy grows beyond chance only later during training (Fig 11).
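+To make the notion of effective rank concrete, the following NumPy sketch decomposes a matrix into singular
+values and summarizes them by the participation ratio (Σ σi^2)^2 / Σ σi^4. This particular measure is an
+assumption chosen for illustration; the analysis above reports individual singular values and does not
+prescribe it. The two random matrices are stand-ins, not actual weight gradients.

```python
import numpy as np

def participation_ratio(G):
    """Effective-rank proxy (sum s_i^2)^2 / sum s_i^4 from the singular values of G."""
    s = np.linalg.svd(G, compute_uv=False)
    p = s**2
    return p.sum()**2 / (p**2).sum()

rng = np.random.default_rng(1)
N = 80

# Stand-in for a gradient dominated by sigma_1: a single outer product.
low_rank = np.outer(rng.normal(size=N), rng.normal(size=N))
# Stand-in for a gradient to which many outer products contribute comparably.
high_rank = rng.normal(size=(N, N))

# The SVD expresses any matrix as a weighted sum of outer products.
U, s, Vt = np.linalg.svd(high_rank)
reconstruction = (U * s) @ Vt

print(participation_ratio(low_rank), participation_ratio(high_rank))
```

+A value near 1 means that a single outer product dominates, while larger values mean that the update spreads
+over many directions.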
+
+D.2 Gradient Flossing Throughout Training Can Be Detrimental
+We find that gradient flossing continued throughout all training epochs can be detrimental for performance
+(Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different
+conditions: we floss throughout training every 100 training epochs for 500 flossing epochs (red),
+we floss only early during training at training epochs e ∈ {0, 100, 200, 300, 400} (green), or we do not floss at
+all (blue).
+
+[Figure 8, panels A-D. A: gradient norm |dL/dh0| versus epochs (logarithmic scale). B: accuracy (%) versus
+epochs. C: gradient norm |dL/dh0| versus epochs (linear scale). D: accuracy (%) versus epochs. Legend: with
+flossing during training, without flossing.]
+
+Figure 8: Gradient flossing boosts norm of long-term Jacobian A) Gradient norm |dL/dh0| as
+a function of training epochs for networks without flossing (blue) and networks with flossing
+during training (orange). Error gradient norm is boosted after gradient flossing at epochs e ∈
+{0, 100, 200, 300, 400, 500}. In networks without gradient flossing, the gradient norms |dL/dh0| are
+much smaller overall. One of ten random network realizations is shown as a solid line, the other nine as
+transparent lines. B) Accuracy as a function of epoch, same depiction and network realizations as in
+A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+when the gradient norm |dL/dh0| becomes macroscopically large. C) Same as A in linear scale. Mean
+final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are
+different network realizations with gradient flossing during training. Black lines are without any
+gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network
+realizations the moment where the gradient norm |dL/dh0| becomes macroscopically large coincides with
+the moment the accuracy is beyond chance level. Parameters: g = 1, batch size b = 16, N = 80,
+epochs = 10^4, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before
+training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+We observe that after every episode of gradient flossing, the accuracy drops close to the chance level of 50%
+(Fig 12A, red line). Between flossing episodes, the accuracy quickly recovers but never reaches 100%. Simultaneously,
+when the accuracy drops, the test error jumps up (Fig 12B). We also observed that the gradient norm |dL/dh0| is
+initially boosted by gradient flossing, but the conditions become nearly indistinguishable once the gradient norm
+|dL/dh0| becomes macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal
+propagation across long time horizons and the network picks up the relevant gradient signal, further gradient
+flossing can be harmful to the actual task execution. We hypothesize that there might be (at least for the Vanilla
+networks considered here) a trade-off between the ability to bridge long time scales, which seems to require one or
+several Lyapunov exponents of the forward dynamics close to zero, and nonlinear task requirements, which require at
+least a fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′(x) < 1. It would be an
+interesting future research avenue to further investigate this potential trade-off also in other network architectures.
+
+D.3 Lyapunov Exponents after Training With and Without Gradient Flossing
+In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on
+the spatial delayed XOR task both with and without gradient flossing. We find for successful networks with
+gradient flossing a systematic relationship between the first Lyapunov exponent and the delay that can be fitted
+approximately by λ1(d) = −0.2 exp(−0.03 d). Unsuccessful networks with accuracy at chance level
+have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent.
+In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an
+
+[Figure 9, panels A-B. A: gradient norm |dL/dh0| (logarithmic scale) versus epochs. B: accuracy (%) versus epochs.]
+
+Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level
+A) Gradient norm |dL/dh0| as a function of training epochs for 20 network realizations with flossing
+during training. Epochs are aligned to the last epoch where accuracy is ≤ 50%. B) Same task and
+simulations as in A, but here accuracy as a function of epoch, for 20 network realizations with
+flossing during training. Epochs are aligned to the last epoch with |dL/dh0| < 0.001. Different colors
+are different network realizations. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4,
+T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents early during training at
+training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+analytically tractable temporal correlation structure, which might partially explain the less conclusive results.
+It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in
+[57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a
+2N × 2N Jacobian matrix that takes interactions between both memory and cell states into account is
+required [9, 12, 58].
+
+
+E Gradient Flossing for Linear Network
+We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train
+linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task,
+but not on tasks that require a nonlinear input-output operation, like the temporally delayed XOR task. A full
+analytical description of gradient flossing for linear networks would be a promising avenue for future research, as
+networks with linear dynamics can still have nonlinear learning dynamics [21]. However, this is beyond the scope
+of the presented work.
+
+
+F Computational Complexity of Gradient Flossing
+We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three
+main contributors to the computational cost (Table 1): First, the RNN step, which has a computational complexity
+of O(N^2 b) per time step, where N is the dimension of the recurrent network state (which in the case of Vanilla
+networks equals the number of units) and b is the batch size, both in the forward and backward pass. Second, the
+Jacobian step, which scales with O(N^2 k) per time step, where k is the number of flossed Lyapunov exponents.
+Third, the QR decomposition, which scales with O(N k^2), where k is the number of Lyapunov exponents
+considered.
+Together, this results in a total amortized cost of O(N^2 b T) per training epoch, where T is the number of
+training time steps, and a total amortized cost per flossing epoch of O(N^2 Tf (1 + k/tONS + k)), where Tf is
+the number of flossing time steps.
+
+[Figure 10, panels A-B. A: singular values σi(dL/dW) (logarithmic scale) versus epochs. B: accuracy (%) versus
+epochs. Legend: early training flossing, without flossing.]
+
+Figure 10: Gradient flossing decreases condition number of recurrent weight error gradient
+A) Singular values of the recurrent weight gradient σi(dL/dW) as a function of training epochs for singular
+values i ∈ {1, 20, 40} for networks without gradient flossing (blue) and early training gradient flossing
+(green). At epochs of gradient flossing, the subordinate singular values σ20 and σ40 are peaked, while
+the first singular value σ1 has only a slight peak. This indicates that gradient flossing increases the
+effective rank of the recurrent weight gradient dL/dW. B) Accuracy as a function of training epochs.
+Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+when the subordinate singular values σ20 and σ40 increase, which enables
+high-dimensional parameter updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^3,
+T = 300, gradient flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training.
+Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+
+
+In the case of preflossing, the total computational cost thus scales with O(N^2 [E b T + Ep Tf (1 + k/tONS + k)]),
+where E is the number of training epochs and Ep is the number of preflossing epochs.
+For gradient flossing during training (assuming that preflossing is also done), the amortized cost scales with
+O(N^2 [E b T + Ep Tp + Ef Tf (1 + k/tONS + k)]), where Ef is the total number of flossing epochs during
+training.
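+The scaling expressions above can be turned into a small calculator to see how the flossing cost grows with k.
+This is a rough sketch: the function names are hypothetical, all constant factors hidden by the O-notation are
+dropped, and the values chosen for Tf and tONS are assumptions, so the numbers indicate scaling only and not
+wall-clock time.

```python
def training_cost_per_epoch(N, b, T):
    """Leading-order operation count N^2 * b * T (constant factors dropped)."""
    return N**2 * b * T

def flossing_cost_per_epoch(N, k, T_f, t_ons):
    """Leading-order operation count N^2 * T_f * (1 + k/t_ons + k) (constants dropped)."""
    return N**2 * T_f * (1 + k / t_ons + k)

# N and b follow the figure captions; T_f and t_ons are assumed values.
N, b, T = 80, 16, 300
T_f, t_ons = 300, 10

for k in (1, 16, 75):
    relative = flossing_cost_per_epoch(N, k, T_f, t_ons) / training_cost_per_epoch(N, b, T)
    print(f"k = {k:2d}: flossing/training cost per epoch ~ {relative:.2f}")
```

+For large k the term linear in k dominates, so doubling the number of flossed exponents roughly doubles the
+cost of a flossing epoch in this model.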
+Empirically, we find that both the number of preflossing epochs Ep and the number of flossing episodes Ef necessary
+for training success are much smaller than the total number of training epochs E. For example, the preflossing for
+500 epochs in the numerical experiment of Fig 3 took ∼ 37 seconds, while the overall training on 10000 training
+epochs with batch size b = 16 took ∼ 1680 seconds. Thus only approximately 2.2% of the total training time
+was spent on gradient flossing. Moreover, Tp can be smaller than T , it just has to be long enough such that the
+temporal correlations in the task can be bridged. In case of the tasks discussed in the manuscript, this would be
+the delay d. It remains an important challenge to infer the suitable number of flossing time steps Tf for tasks
+with unknown temporal correlation structure.
+It would also be interesting to investigate how the CPU hours/wall-clock time/flops/Joule/CO2-emission spent
+on gradient flossing trade off against those spent on training networks with larger N. For this, we
+suggest first finding the smallest network that on median successfully trains on a binary temporal XOR task for
+a fixed given delay d and measuring the computational resources involved in training it, e.g., in terms of CPU
+hours, and then comparing it to a network with gradient flossing. This would be a promising analysis but is beyond
+our current computational budget. We will start such experiments and might be able to provide results during the
+reviewer period.
+
+[Figure 11, panels A-B. A: singular values σi(dL/dW) (logarithmic scale) versus epochs. B: accuracy (%) versus
+epochs. Legend: early training flossing, without flossing.]
+
+Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient
+Same as Fig 10 for a different network realization.
+
+
+
+G Additional controls
+
+We also investigate the effects of gradient flossing during training with orthogonal weight initializations
+and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to
+bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated
+the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal
+complexity and performed gradient flossing either both during and before training, only before training, or not at
+all.
+Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal
+XOR task y_t = x_{t−d/2} ⊕ x_{t−d} with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla
+RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can
+be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with
+gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for
+500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks
+have to perform the nonlinear XOR operation y_t = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d} on a three-dimensional binary
+input signal x^1, x^2, and x^3 and generate the correct output with a delay of d steps, identical to Fig 4 in the
+main text. Again, we observe, similar to networks with Gaussian initialization, that flossing before training improves
+the performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks
+that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and
+D, we trained the networks for only 5000 epochs, compared to 10000 epochs in networks with random Gaussian
+initialization, because for 10000 epochs, both networks with gradient flossing only before training and with
+gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal
+initialization seems to slightly improve performance for tasks with long time horizons to bridge and that gradient
+flossing can additionally boost the performance. Thus, orthogonal initialization and gradient flossing seem to go
+well together. It would be interesting to study whether orthogonal initialization also reduces the number of gradient
+flossing steps necessary to improve performance.
+
+
+
+H Additional Details on Training Tasks
+
+In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3:
+
+[Figure 12, panels A-C. A: accuracy (%) versus epochs. B: test error (logarithmic scale) versus epochs.
+C: gradient norm |dL/dh0| (logarithmic scale) versus epochs. Legend: with flossing throughout training, early
+training flossing, without flossing.]
+
+Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy as a
+function of training epochs for binary temporal delayed XOR task for gradient flossing throughout
+training every 100 training epochs (red). Accuracy drops close to chance level every time after
+gradient flossing but recovers quickly in between. Same for only 5 episodes of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400} (green) and no flossing at all (blue). B) Test error as a function
+of training epochs. C) Gradient norm |dL/dh0| as a function of training epochs for networks without
+gradient flossing (blue) and networks with flossing throughout training (red) and early training
+gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks
+without gradient flossing, the gradient norms |dL/dh0| are much smaller overall. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing for Ef = 500 epochs on
+k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+H.1 Copy task
+For the copy task, the target network readout at time t is yt = xt−d , where d denotes the delay. We chose the
+input to be sampled i.i.d. from a uniform distribution between 0 and 1.
+
+H.2 Temporal XOR task
+The temporal XOR task requires the target network readout yt at time t to be computed as follows:
+\[
+y_t = |x_{t-d/2} - x_{t-d}| \tag{20}
+\]
+where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output yt
+follows the truth table of the XOR digital logic gate (Table 2). Thus, the function f(xa, xb) = |xa − xb| can be
+seen as an analytical representation of the XOR gate. It is important to note that f(x, 0) = x only for x ≥ 0,
+and that this task requires a nonlinearity. The implementation can easily be constructed analytically: for example,
+using two rectified linear units ϕ(x) = max(x, 0), the output can be constructed by
+\[
+f(x_a, x_b) = |x_a - x_b| = \phi(x_a - x_b) + \phi(x_b - x_a). \tag{21}
+\]
+Together with a delay line to transmit the signal xt−d over time, this can solve the task.
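+The construction in Eq 20 and Eq 21 can be written out in a few lines of plain Python. This is a minimal sketch
+for illustration: the function names are hypothetical, the delay d is assumed to be even, and targets for t < d
+(where the delayed inputs do not yet exist) are simply omitted, which is an assumption about the boundary
+handling.

```python
import random

def relu(x):
    """Rectified linear unit phi(x) = max(x, 0)."""
    return max(x, 0)

def xor_via_relu(xa, xb):
    """|xa - xb| written as the sum of two rectified linear units (Eq 21)."""
    return relu(xa - xb) + relu(xb - xa)

def temporal_xor_targets(x, d):
    """Targets y_t = |x_{t-d/2} - x_{t-d}| for t = d, ..., len(x) - 1 (Eq 20)."""
    half = d // 2
    return [xor_via_relu(x[t - half], x[t - d]) for t in range(d, len(x))]

random.seed(0)
x = [random.randint(0, 1) for _ in range(30)]  # Bernoulli input sequence
y = temporal_xor_targets(x, d=10)

# Truth table of the XOR gate reproduced by the two-ReLU construction.
for xa in (0, 1):
    for xb in (0, 1):
        print(xa, xb, xor_via_relu(xa, xb))
```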
+
+[Figure 13, panels A-D. A: no gradient flossing, λ1 versus complexity (delay d). B: gradient flossing during
+training, λ1 versus complexity (delay d). C, D: λ10 versus complexity (delay d).]
+
+Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First
+Lyapunov exponents λ1 for Vanilla networks trained on spatial delayed XOR task as a function of
+the delay with no gradient flossing. Color-coded is the test accuracy at the end of training, where red
+corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with
+gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully
+trained networks can be approximated by the empirical fit λ1(d) = −0.2 exp(−0.03 d). (Protocol
+for gradient flossing during training same as main text Fig 4B). C) Same as A for tenth Lyapunov
+exponents λ10. D) Same as B for tenth Lyapunov exponents λ10. Same fit as in B also describes
+λ10. Parameters: g = 1, batch size b = 16, N = 80, epochs = 10^4, T = 300, gradient flossing for
+Ef = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR,
+loss: cross entropy(y, ŷ).
+
+
+I Additional Background on Lyapunov Exponents of RNNs
+
+An autonomous dynamical system is usually defined by a set of ordinary differential equations dh/dt =
+F(h), h ∈ RN in the case of continuous-time dynamics, or as a map hs+1 = f (hs ) in the case of discrete-time
+dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation,
+but everything directly extends to continuous-time systems [43]. Together with an initial condition h0 , the
+map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal
+perturbation h′0 = h0 + ϵu0 evolves in time. Chaotic systems are sensitive to initial conditions; almost all
+infinitesimal perturbations ϵu0 of the initial condition grow exponentially with time |ϵut | ≈ exp(λ1 t)|ϵu0 |.
+Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov
+exponent λ1 measures the average rate of exponential divergence or convergence of nearby initial conditions:
+\[
+\lambda_1(h_0) = \lim_{t \to \infty} \frac{1}{t} \lim_{\epsilon \to 0} \log \frac{\|\epsilon u_t\|}{\|\epsilon u_0\|} \tag{22}
+\]
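+For a one-dimensional map, the inner limit ϵ → 0 in Eq 22 reduces to multiplying the derivatives f′(x_s) along
+the trajectory, so λ1 becomes the time average of log |f′(x_s)|. The sketch below illustrates this on the
+logistic map x_{s+1} = r x_s (1 − x_s) with r = 4, a textbook chaotic system chosen here only as an example and
+not studied in this work; for it, λ1 = log 2 is known analytically. The function name and all numerical settings
+are assumptions.

```python
import math

def largest_lyapunov_exponent(f, df, x0, t, t_transient=1000):
    """Time-averaged log |f'(x_s)| along the trajectory x_{s+1} = f(x_s)."""
    x = x0
    for _ in range(t_transient):       # let the trajectory settle onto the attractor
        x = f(x)
    total = 0.0
    for _ in range(t):
        total += math.log(abs(df(x)))  # one-step growth of an infinitesimal perturbation
        x = f(x)
    return total / t

r = 4.0
lam = largest_lyapunov_exponent(
    f=lambda x: r * x * (1.0 - x),
    df=lambda x: r * (1.0 - 2.0 * x),
    x0=0.3,
    t=20000,
)
print(lam, math.log(2.0))  # numerical estimate next to the analytical value
```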
+In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial
+conditions as long as the initial conditions are in the basin of attraction of the attractor. Note that it is crucial
+to first take the limit ϵ → 0 and then t → ∞, as λ1(h0) would be trivially zero for a bounded attractor if the
+limits are exchanged, as lim_{t→∞} log(||ϵu_t||/||ϵu_0||) is bounded for finite perturbations even if the system is
+chaotic. To measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal
+perturbations us spanning the tangent space:
+
+
+ us+1 = Ds us (23)
+
+                                                       forward pass                                      backward pass
+RNN dynamics                                           O(N^2 b)                                          "
+Jacobian step                                          O(N^2 k)                                          "
+QR step                                                O(N k^2)                                          "
+total amortized costs per training epoch               O(N^2 b T)                                        "
+total amortized costs per gradient flossing epoch      O(N^2 Tf (1 + k/tONS + k))                        "
+total amortized costs of preflossing                   O(N^2 [E b T + Ep Tf (1 + k/tONS + k)])           "
+total amortized costs flossing during training         O(N^2 [E b T + Ep Tp + Ef Tf (1 + k/tONS + k)])   "
+
+Table 1: Computational cost for gradient flossing and training of RNNs
+N denotes the number of neurons, b is the batch size, T is the number of time steps in the forward pass of
+training, Tf is the number of time steps in the forward pass of flossing, tONS is the reorthonormalization
+interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, Ep is
+the number of preflossing epochs, Ef is the number of flossing epochs during training. Empirically,
+we find that the necessary numbers of preflossing epochs Ep and flossing episodes Ef are much smaller
+than the total number of training epochs E. Moreover, Tp can be smaller than T.
+
+ Table 2: XOR
+ input xt−d input xt−2d target output yt
+ 0 0 0
+ 0 1 1
+ 1 0 1
+ 1 1 0
+
+
+where the N × N Jacobian Ds (hs ) = df (hs )/dh characterizes the evolution of generic infinitesimal perturba-
+tions during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a
+fixed point, i.e., when hs+1 = f (hs ) = hs .
+We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian
+
+ Tt (h0 ) = Dt−1 (ht−1 ) . . . D1 (h1 )D0 (h0 ). (24)
+Note that Tt (h0 ) is a product of generally noncommuting matrices. The Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN
+are defined as the logarithms of the eigenvalues of the Oseledets matrix
+\[
+\Lambda(h_0) = \lim_{t \to \infty} \left[ T_t(h_0)^\top T_t(h_0) \right]^{\frac{1}{2t}}, \tag{25}
+\]
+
+where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term
+Jacobian Tt (h0 ). Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope
+spanned by the columns of Tt (h0 ). Thus, the exponential volume growth rate is given by the sum of the
+logarithms of its first k (sorted) eigenvalues. Oseledets’ multiplicative ergodic theorem guarantees the existence
+of the Oseledets matrix Λ(h0 ) for almost all initial conditions h0 [42]. In ergodic systems, the Lyapunov
+exponents λi do not depend on the initial condition h0 . However, for a numerical calculation of the Lyapunov
+spectrum, Eq 25 cannot be used directly because the long-term Jacobian Tt (h0 ) quickly becomes ill-conditioned,
+i.e., the ratio between its largest and smallest singular value diverges exponentially with time.
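+The exponential loss of conditioning can be seen in a toy case. The sketch below assumes a hypothetical constant
+Jacobian with one expanding and one contracting direction, so that Tt is simply the t-th matrix power; it is
+meant as an illustration only and is not one of the networks studied here.

```python
import numpy as np

# Hypothetical constant one-step Jacobian: expansion by 2, contraction by 1/2.
D = np.diag([2.0, 0.5])

def long_term_condition_number(D, t):
    """Ratio of largest to smallest singular value of T_t = D^t."""
    return np.linalg.cond(np.linalg.matrix_power(D, t))

for t in (1, 10, 25):
    print(t, long_term_condition_number(D, t))
```

+In this example the condition number grows like 4^t, so after a few dozen steps it approaches the inverse of
+double-precision machine epsilon and the smaller singular values can no longer be resolved.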
+
+
+J Algorithm for Calculating Lyapunov Spectrum of Rate Networks
+For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of an m-dimensional
+infinitesimal volume element is given by λ^(m) = Σ_{i=1}^{m} λi. Therefore, λ1 = λ^(1), λ2 = λ^(2) − λ1,
+λ3 = λ^(3) − λ1 − λ2, . . . [44]. The volume growth rates can be obtained via QR-decomposition.
+
+[Figure 14, panels A-D, orthogonal weight initialization. A: delayed temporal XOR, test accuracy (%) versus
+epochs. B: delayed spatial XOR, test accuracy (%) versus epochs. C, D: test accuracy (%) versus complexity
+(delay d). Legend: flossing during training, preflossing, no flossing.]
+
+Figure 14: Gradient flossing before and during training improves trainability for orthogonal
+nets
+A) Test accuracy for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR
+task y_t = x_{t−d/2} ⊕ x_{t−d} with gradient flossing during training (green), preflossing (orange), and
+with no gradient flossing (blue) for d = 70. Solid lines are means, transparent thin lines are individual
+network realizations. B) Same as A for delayed spatial XOR task with y_t = x^1_{t−d} ⊕ x^2_{t−d} ⊕ x^3_{t−d}.
+C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test
+accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 10^4 for delayed temporal XOR, epochs = 5000 for delayed
+spatial XOR, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training
+and during training for green lines, and only before training for orange lines. Shaded areas are 25%
+and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is
+cross-entropy between y and ŷ.
+
+
+First, we evolve an initially orthonormal system Qs = [q^1_s, q^2_s, . . . , q^k_s] in the tangent space along the
+trajectory using the Jacobian Ds:
+\[
+\tilde{Q}_{s+1} = D_s Q_s \tag{26}
+\]
+A continuous system can be transformed into a discrete system by considering a stroboscopic representation,
+where the trajectory is only considered at certain discrete time points. We use here the notation of discrete
+dynamical systems, where this corresponds to performing the product of Jacobians along the trajectory, Q̃_{s+1} =
+Ds Qs. We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time
+∆t = 1. The notation can be readily extended to continuous systems [43].
+Second, we extract the exponential growth rates using the QR-decomposition,
+\[
+\tilde{Q}_{s+1} = Q_{s+1} R^{s+1},
+\]
+which uniquely decomposes Q̃_{s+1} into an orthonormal matrix Q_{s+1} of size N × k, so that
+Q_{s+1}^⊤ Q_{s+1} = 1_{k×k}, and an upper triangular matrix R^{s+1} of size k × k with positive diagonal elements.
+Geometrically, Q_{s+1} describes the rotation of Qs caused by Ds, and the diagonal entries of R^{s+1} describe the
+stretching and shrinking of the columns of Qs, while the off-diagonal elements represent the shearing. Fig 15
+visualizes Ds and the QR-decomposition for k = 2.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of R^s:
+\[
+\lambda_i = \lim_{t \to \infty} \frac{1}{t} \log \prod_{s=1}^{t} R^s_{ii} = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \log R^s_{ii}. \tag{27}
+\]
+
+
+
+Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix
+$Q_s = [q^1_s, q^2_s, \ldots, q^m_s]$, whose columns are the axes of a k-dimensional cube, is rotated and distorted
+by the Jacobian $D_s$ into a k-dimensional parallelotope $\tilde{Q}_{s+1} = D_s Q_s$ embedded in $\mathbb{R}^N$. The
+figure illustrates this for k = 2, in which case the columns of $\tilde{Q}_{s+1}$ span a parallelogram, which
+can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the
+area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition
+reorthonormalizes $\tilde{Q}_{s+1}$ by decomposing it into the product of an orthonormal matrix
+$Q_{s+1} = [q^1_{s+1}, q^2_{s+1}, \ldots, q^m_{s+1}]$ and the upper-triangular matrix $R^{s+1}$. $Q_{s+1}$ describes the rotation of $Q_s$
+caused by $D_s$. The diagonal entries of $R^{s+1}$ give the stretching/shrinking along the columns of
+$Q_{s+1}$; thus the volume of the parallelotope formed by the first m columns of $\tilde{Q}_{s+1}$ is given by
+$V_m = \prod_{i=1}^{m} R^{s+1}_{ii}$. The time-averaged logarithms of the diagonal elements of $R^s$ give the Lyapunov
+spectrum: $\lambda_i = \lim_{t_{sim} \to \infty} \frac{1}{t_{sim}} \log \prod_{s=1}^{t_{sim}} R^s_{ii} = \lim_{t_{sim} \to \infty} \frac{1}{t_{sim}} \sum_{s=1}^{t_{sim}} \log R^s_{ii}$.
+
+
+Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often,
+i.e., once every $s_{ONS}$ steps such that $\tilde{Q}_{s+s_{ONS}} = D_{s+s_{ONS}-1} \cdot D_{s+s_{ONS}-2} \cdots D_s \cdot Q_s$ remains well-conditioned
+[44]. An appropriate reorthonormalization interval $s_{ONS} = t_{ONS}/\Delta t$ thus depends on the condition number, the
+ratio of the largest to the smallest singular value:
+    \kappa_2(\tilde{Q}_{s+s_{ONS}}) = \kappa_2(R^{s+s_{ONS}}) = \frac{\sigma_1(R^{s+s_{ONS}})}{\sigma_m(R^{s+s_{ONS}})} = \frac{R^{s+s_{ONS}}_{11}}{R^{s+s_{ONS}}_{mm}}.    (28)
+
+
+An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to
+converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25)
+[54]. A simple example of this algorithm in pseudocode is:
+
+Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum
+  initialize h, Q
+  evolve h until it is on attractor (avoid initial transient)
+  evolve Q until it converges to the eigenvectors of the backward Oseledets matrix
+  set γi = 0
+  for t = 1 → T do
+    h ← f(h)
+    D ← df/dh
+    Q ← D · Q
+    if t ≡ 0 (mod sONS) then
+      Q, R ← qr(Q)
+      γi += log(Rii)
+    end if
+  end for
+  λi = γi / T
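The steps of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`matmul`, `qr_gram_schmidt`, `lyapunov_spectrum`) are hypothetical, the QR step uses modified Gram-Schmidt only to stay dependency-free (a library QR routine would be the usual choice), the Jacobian is evaluated at the state before each update, and the initial transient is skipped for brevity. The toy system is the linear map h ← A h with A = [[1, 1], [1, 0]], whose exponents are ±log((1 + √5)/2) ≈ ±0.481.

```python
import math

def matmul(A, B):
    """Dense matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def qr_gram_schmidt(M):
    """Modified Gram-Schmidt on the columns of M.

    Returns Q (orthonormal columns, same shape as M) and the diagonal of R,
    which is positive by construction.
    """
    q_cols, r_diag = [], []
    for v in (list(c) for c in zip(*M)):
        for q in q_cols:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        r_diag.append(norm)
        q_cols.append([a / norm for a in v])
    return [list(row) for row in zip(*q_cols)], r_diag

def lyapunov_spectrum(step, jacobian, h, Q, T, s_ons=1):
    """Estimate the first k Lyapunov exponents, k = number of columns of Q."""
    gamma = [0.0] * len(Q[0])
    for t in range(1, T + 1):
        D = jacobian(h)      # D = df/dh at the current state (assumption)
        h = step(h)          # h <- f(h)
        Q = matmul(D, Q)     # Q <- D . Q
        if t % s_ons == 0:   # reorthonormalize every s_ons steps
            Q, r_diag = qr_gram_schmidt(Q)
            gamma = [g + math.log(r) for g, r in zip(gamma, r_diag)]
    return [g / T for g in gamma]

# Toy linear map h <- A h (hypothetical test system, not from the paper).
A = [[1.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

def linear_step(h):
    return [sum(a * x for a, x in zip(row, h)) for row in A]

lam = lyapunov_spectrum(linear_step, lambda h: A, [1.0, 0.0], identity, T=200)
print(lam)  # both entries should be close to +/- log of the golden ratio
```

Because the accumulated logarithms of the diagonal of R do not depend on how often the basis is reorthonormalized (only the conditioning of Q does), calling the same function with `s_ons=5` should give nearly the same estimate on this toy map.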
+
+
+It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge
+towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example
+of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization
+time interval tONS is to first obtain a rough estimate of the Lyapunov spectrum using a short simulation time tsim and
+a small tONS, and then repeat with a longer simulation time and a tONS chosen based on this rough
+estimate. Another strategy is to first iteratively adapt tONS on a short simulation
+run to get an acceptable condition number. It should be noted that a variety of other methods exist to
+estimate the Lyapunov spectrum [14, 43, 68, 69].
+
+ K Convergence of Lyapunov Exponents of RNNs
+In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov
+exponents λi for i = 1, 20, 60, 80 for different initial conditions but identical network realization.
+
+[Figure 16 plot: estimated λi (1/steps) versus steps, logarithmic step axis from 10^0 to 10^2.]
+
+Figure 16: Convergence of Lyapunov exponents. Convergence of selected Lyapunov exponents
+λi (i = 1, 20, 60, 80) with simulation time for ten different initial conditions of an identical network
+realization, for σ = 1 and g = 1. (Other parameters: N = 80, tsim = 100 steps, tONS = 1.)
+
+
+
+
+ \ No newline at end of file
diff --git a/papers/txt/gram2025_generative_recursive.txt b/papers/txt/gram2025_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2025_generative_recursive.txt
@@ -0,0 +1,1532 @@
+ Generative Recursive Reasoning
+
+
+ Junyeob Baek1†∗ Mingyu Jo1†∗ Minsu Kim1,2
+
+ Mengye Ren3 Yoshua Bengio2,4 Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+
+
+
+ 1 KAIST   2 Mila – Québec AI Institute
+ 3 New York University   4 Université de Montréal
+
+
+
+ Abstract
+ How should future neural reasoning systems implement extended computation?
+ Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+ sive sequence extension by performing iterative latent-state refinement with shared
+ transition functions. Yet existing RRMs are largely deterministic, following a
+ single latent trajectory and converging to a single prediction. We introduce Gen-
+ erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+ latent reasoning into probabilistic multi-trajectory computation. GRAM models
+ reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+ tive solution strategies, and inference-time scaling through both recursive depth
+ and parallel trajectory sampling. This yields a latent-variable generative model
+ supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+ unconditional generation via pθ (x). Trained with amortized variational inference,
+ GRAM improves over deterministic recurrent and recursive baselines on structured
+ reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+ unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+ 1 Introduction
+ A central question for future neural reasoning systems is how extended computation should be imple-
+ mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+ process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+ plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+ Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+ than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+ because it decouples reasoning depth from both parameter scale and output length: a compact model
+ can perform many steps of internal computation by repeatedly applying shared transition functions
+ over time.
+ Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+ potential of this approach in structured reasoning. Rather than producing a solution in a single
+ feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+ supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+ latent dynamics. These features make them well suited to problems requiring constraint propagation,
+ state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+ ∗ Equal contribution
+ † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn (sungjin.ahn@kaist.ac.kr)
+
+
+ Preprint.
+[Figure 1 panels: Input Task; Solution 1; Solution 2; (a) Deterministic RRMs; (b) GRAM (Ours).]
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens example with two valid solutions.
+Right: Given three independent runs of latent reasoning (τ1 , τ2 , τ3 ): (a) Prior RRMs (e.g., HRM, TRM) are
+deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories that reach multiple
+valid solutions y1 and y2 , while naturally enabling parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+ First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2 Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1 Architecture
+
+Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic
+latent reasoning trajectories. Given an input x, GRAM first computes an embedding
+    ex = fenc (x; θ),    (1)
+which is reused throughout the entire recursive computation. Starting from a fixed initial latent
+state z0 , the model evolves the latent state through learned stochastic transitions. The recursive
+computation is organized into two nested levels: inner and outer loops.
+
+[Figure 2 diagram labels: Encoder fenc; low-level updates fL (K times); high-level update fH; Prior pθ (· | ut );
+Posterior qϕ (· | ut , y); Decoder fdec; (A) CE loss; (B) KL Div.]
+Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation
+z = (h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic
+proposal ut , to which stochastic guidance ϵt is added: ht = ut + ϵt .
+
+At the inner level, a latent transition samples a new latent state conditioned on the previous
+latent state and the input embedding,
+    zt ∼ pθ (zt | zt−1 , ex ),    t = 1, . . . , T.    (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+    z_0^{(1)} \xrightarrow{T \text{ transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T \text{ transitions}} \cdots \xrightarrow{T \text{ transitions}} z_T^{(N_{sup})},    (3)
+where $z_t^{(n)}$ denotes the latent state at the t-th transition of the n-th supervision step, $z_0^{(1)}$ is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+($z_0^{(n+1)} := z_T^{(n)}$). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}(\mu_\theta(u_t), \sigma_\theta^2(u_t) I),    (4)
+    z_t = u_t + \epsilon_t.    (5)
+ We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+ lt,k = fL (ht−1 , lt,k−1 , ex ; θ), k = 1, . . . , K, (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+    u_t = f_H(h_{t-1}, l_t; \theta),    (7)
+    \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}(\mu_\theta(u_t), \sigma_\theta^2(u_t) I),    (8)
+    h_t = u_t + \epsilon_t,    (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions.³ Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
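The hierarchical transition in Equations (6) to (9) can be sketched as follows. This is a toy illustration under stated assumptions rather than the GRAM implementation: the update functions `f_L` and `f_H`, the guidance heads in `guidance_params`, the zero initial state, and all coefficients are hypothetical stand-ins for the learned Transformer modules. What the sketch is meant to show is the control flow: K deterministic low-level refinements with h held fixed, one deterministic high-level proposal u_t, and Gaussian guidance added only at the high level.

```python
import math
import random

def f_L(h, l, e_x):
    """Toy low-level update (assumption): elementwise mix of h, l and the input embedding."""
    return [math.tanh(0.5 * hi + 0.3 * li + 0.2 * ei) for hi, li, ei in zip(h, l, e_x)]

def f_H(h, l):
    """Toy high-level update (assumption): produces the deterministic proposal u_t."""
    return [math.tanh(0.6 * hi + 0.4 * li) for hi, li in zip(h, l)]

def guidance_params(u):
    """Toy heads for mu_theta(u_t) and sigma_theta(u_t); sigma is kept positive."""
    mu = [0.1 * ui for ui in u]
    sigma = [0.05 * math.exp(-abs(ui)) for ui in u]
    return mu, sigma

def transition(h_prev, l_prev, e_x, K, rng):
    """One latent transition z_{t-1} -> z_t with z = (h, l)."""
    l = l_prev
    for _ in range(K):                      # Eq. (6): K refinements, h held fixed
        l = f_L(h_prev, l, e_x)
    u = f_H(h_prev, l)                      # Eq. (7): deterministic proposal
    mu, sigma = guidance_params(u)
    eps = [rng.gauss(m, s) for m, s in zip(mu, sigma)]  # Eq. (8): stochastic guidance
    h = [ui + ei for ui, ei in zip(u, eps)]             # Eq. (9)
    return h, l

def rollout(e_x, T, K, seed):
    """Run T transitions from a fixed (here: zero) initial state."""
    rng = random.Random(seed)
    h, l = [0.0] * len(e_x), [0.0] * len(e_x)
    for _ in range(T):
        h, l = transition(h, l, e_x, K, rng)
    return h, l

# Three independent trajectories for the same input embedding.
e_x = [0.5, -1.0, 0.25]
for seed in range(3):
    print(rollout(e_x, T=4, K=3, seed=seed)[0])
```

With the same seed the rollout is reproducible, while different seeds give different terminal states for the same input, which is the property that width-based sampling relies on.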
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2 Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory $\tau = (z_0 \to \cdots \to z_{T_{Total}})$ consists of a sequence of latent variables, with
+$T_{Total} = T \times N_{sup}$. The conditional likelihood is defined as
+    p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)\, d\tau,    (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+    \log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi(\tau \mid x, y)}[\log p_\theta(y \mid \tau, x)] - \mathrm{KL}(q_\phi(\tau \mid x, y) \,\|\, p_\theta(\tau \mid x)).    (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+ ³ We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+ Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+    p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{Total}} p_\theta(z_t \mid z_{t-1}, x), \qquad q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{Total}} q_\phi(z_t \mid z_{t-1}, x, y).    (12)
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+$\epsilon_{1:T_{Total}}$, their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+$p_\theta(y \mid \tau, x) = p_\theta(y \mid z_{T_{Total}}, x)$. Therefore, the full trajectory-level ELBO can be written as
+    \mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi}[\log p_\theta(y \mid z_{T_{Total}}, x)] - \sum_{t=1}^{T_{Total}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)}[\mathrm{KL}(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t))].    (13)
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
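Each KL term in Equation (13) compares two noise distributions at the same proposal u_t. If both are diagonal Gaussians, the term has a closed form. The sketch below assumes exactly that: the text states that the prior is N(µθ(ut), σθ²(ut) I) and that both transitions add reparameterized Gaussian noise, but the diagonal form of the posterior and the function names are assumptions made here for illustration. Standard deviations are passed in directly.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total

def kl_sum_over_steps(posterior_params, prior_params):
    """Sum of per-transition KL terms along one sampled trajectory.

    Each argument is a list of (mu, sigma) pairs, one pair per transition t.
    A single sampled trajectory gives a one-sample estimate of the sum of
    expectations in Eq. (13).
    """
    return sum(
        kl_diag_gaussians(mq, sq, mp, sp)
        for (mq, sq), (mp, sp) in zip(posterior_params, prior_params)
    )

# Identical distributions give zero; shifting the mean by one standard deviation gives 0.5.
print(kl_diag_gaussians([0.0], [1.0], [0.0], [1.0]))
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))
```

In a truncated objective that keeps only the final transition of a supervision step, a single call to `kl_diag_gaussians` on that transition's parameters would play the role of the KL term.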
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+$z_{T-1}^{(n)} \to z_T^{(n)}$. This gives the following surrogate objective for each supervision step:
+    \mathcal{L}_{GRAM}^{(n)}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T^{(n)}, x) - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big)\big],    (14)
+where $z_T^{(n)}$ is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3 Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width, the focus of this section, we draw $\{\tau^{(i)}\}_{i=1}^{N} \sim p_\theta(\tau \mid x)$ from the
+learned prior and decode each terminal state into a candidate output $\hat{y}^{(i)} = f_{dec}(z_T^{(i)})$, exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
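The two selection rules described above can be sketched as follows. The function names and the example values are hypothetical; in particular, the `values` list stands in for the LPRM's predicted terminal values, which the sketch simply takes as given rather than computing from latent states. Candidates are assumed to be hashable (for example strings or tuples).

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent candidate (ties go to the one seen first)."""
    return Counter(candidates).most_common(1)[0][0]

def best_of_n(candidates, values):
    """Return the candidate whose predicted terminal value is highest."""
    best_index = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best_index]

# Four decoded candidates from four sampled trajectories (illustrative values).
candidates = ["A", "B", "A", "C"]
values = [0.62, 0.91, 0.55, 0.10]
print(majority_vote(candidates))      # A
print(best_of_n(candidates, values))  # B
```

The example is chosen so that the two rules disagree: voting favors the answer reached by the most trajectories, while value-guided selection can pick a minority answer if its predicted quality is higher.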
+
+3 Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+[Figure 3 bar charts: accuracy (%) on Sudoku, ARC-AGI-1, and ARC-AGI-2 for recursive models (Looped TF, HRM, TRM),
+GRAM (Ours), and LLMs (o3-mini-high, GPT 5.2 (low), Grok-4-thinking).]
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4 Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1 Challenging Puzzle Tasks
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
+extensive constraint propagation, and the ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+model results [40] are included as external reference points for benchmark difficulty, rather than
+as controlled baselines, since their training and inference settings are not directly comparable to
+task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+
+[Figure 4 plots. Left: accuracy versus iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and GRAM (Ours)
+with N = 1, 5, 10, 20, 50. Right: accuracy versus number of solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples); each
+iteration corresponds to one supervision step, which amounts to K× more flat iterations in Looped TF. (Right)
+Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experiment results, including more state-of-the-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs 90.5%), despite comparable computational budget. While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+ 4.2 Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
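The accuracy and coverage metrics for N-Queens can be illustrated with a small sketch. It is an assumption-laden example, not the paper's evaluation code: boards are encoded as one queen column per row, the set of valid solutions is enumerated for an unconstrained board (whereas in the benchmark the valid set would be the solutions consistent with each input), accuracy is computed as the fraction of valid samples rather than from a single sample, and all function names are hypothetical.

```python
from itertools import permutations

def is_valid_queens(cols):
    """cols[r] is the column of the queen in row r; valid if no two queens attack."""
    n = len(cols)
    if sorted(cols) != list(range(n)):      # one queen per column
        return False
    return all(
        abs(cols[i] - cols[j]) != j - i     # no shared diagonal
        for i in range(n)
        for j in range(i + 1, n)
    )

def all_solutions(n):
    """Enumerate every valid placement on an empty n x n board (fine for small n)."""
    return {p for p in permutations(range(n)) if is_valid_queens(p)}

def accuracy_and_coverage(samples, valid_solutions):
    """Accuracy: fraction of samples that are valid. Coverage: distinct valid found / total valid."""
    as_tuples = [tuple(s) for s in samples]
    accuracy = sum(s in valid_solutions for s in as_tuples) / len(as_tuples)
    coverage = len(set(as_tuples) & valid_solutions) / len(valid_solutions)
    return accuracy, coverage

# Three samples on a 4 x 4 board: one valid solution drawn twice, one invalid board.
valid = all_solutions(4)
samples = [[1, 3, 0, 2], [1, 3, 0, 2], [0, 1, 2, 3]]
print(accuracy_and_coverage(samples, valid))
```

The example separates the two metrics: repeating the same valid solution raises accuracy but not coverage, which is why a deterministic model can score reasonably on the former while staying low on the latter.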
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+
+
+ Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+ N-Queens Graph Coloring
+ 8×8 10 × 10 8-vertex 10-vertex
+ Method Rec. Gen. # Params Accuracy Coverage Accuracy Coverage Conflict↓ Coverage Conflict↓ Coverage
+ Direct Pred (8 layers) ✗ ✗ 27M 40.4±1.1 13.7±1.1 13.6±0.5 1.6±0.2 179.3±4.0 19.9±0.2 198.7±5.0 6.7±0.1
+ Direct Pred (32 layers) ✗ ✗ 100M 40.2±1.3 13.6±1.1 13.1±0.4 1.6±0.2 174.0±18.0 19.1±1.7 227.7±34.5 6.5±1.9
+ Looped TF ✓ ✗ 7M 68.4±3.7 23.6±1.9 50.0±7.6 6.2±3.2 136.0±16.1 20.5±1.5 157.3±9.0 7.2±0.7
+ HRM ✓ ✗ 27M 78.7±2.9 26.7±1.3 37.4±0.3 4.7±0.1 109.7±1.5 21.8±0.3 164.3±21.6 8.9±1.7
+ TRM ✓ ✗ 7M 66.8±5.7 36.1±22.5 17.5±11.2 2.0±1.3 109.3±3.1 22.3±0.6 170.7±17.9 6.8±0.3
+ AR ✗ ✓ 10.6M 96.3±1.0 84.8±0.8 90.0±2.2 53.2±0.8 19.0±11.3 83.0±0.7 61.3±8.3 40.0±0.3
+ MDLM ✗ ✓ 12.6M 96.1±1.5 87.2±0.6 74.3±6.6 47.4±2.2 2.7±0.6 84.5±4.0 12.0±7.0 48.2±1.4
+ GRAM (Ours) ✓ ✓ 10M 99.7±0.3 90.3±1.9 89.7±2.7 57.5±3.4 2.7±2.1 85.8±0.5 3.3±1.5 51.3±2.8
+
+
+
+
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
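+For reference, the conflict-edge metric for a single graph can be sketched as below. The edge-list and
+color-list representation is an assumption; how per-graph counts are aggregated into the values reported
+in Table 1 is not specified in this section.

```python
from typing import List, Sequence, Tuple


def count_conflict_edges(
    edges: Sequence[Tuple[int, int]], colors: List[int]
) -> int:
    """Number of edges whose two endpoints received the same color."""
    return sum(1 for u, v in edges if colors[u] == colors[v])


# A triangle needs three colors; reusing one color violates one edge.
triangle = [(0, 1), (1, 2), (0, 2)]
proper = count_conflict_edges(triangle, [0, 1, 2])
improper = count_conflict_edges(triangle, [0, 1, 0])
```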
+
+4.3 Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
+evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
+evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
+an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
+and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
+with existing literature, FID is calculated using real samples from the original standard MNIST.
+
+Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓). For iterative
+models, a step corresponds to a supervision step for TRM and GRAM, and a denoising step for D3PM. FID is
+calculated using real samples with original pixel values (0–255).
+ Method                 IS (↑)    FID (↓)
+ VAE                    1.70      86.28
+ D3PM (1000 steps)      1.86      74.03
+ TRM (16 steps)         1.00      303.29
+ GRAM (Ours)
+   8 steps              1.85      84.08
+   16 steps             1.89      77.79
+   32 steps             1.91      76.65
+   64 steps             1.95      75.39
+   128 steps            1.99      74.30
+   256 steps            2.04      73.34
+
+Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM achieves higher
+validity than D3PM with substantially fewer parameters and steps.
+
+Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denoising steps. Figure 7 shows qualitative examples,
+illustrating that the model produces diverse, fully valid boards from empty inputs without any explicit
+constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together,
+these results indicate that GRAM’s stochastic latent
+
+ Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+ steps. Each row corresponds to a different model (top to bottom: D3PM, shown from t = 0 to t = 1000; TRM
+ and GRAM, shown from t = 0 to t = 16). GRAM (bottom) progressively refines the generated image
+ through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+ Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+ components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+ recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential; removing
+ either significantly degrades performance.
+
+ (a) Architecture Ablation.
+ Model variant                 Sudoku           N-Queens
+ base (Looped TF)              61.25            71.30
+ + DS + HR (=HRM, TRM)         55.00 / 87.40    80.70 / 72.90
+ + SG                          65.64            86.30
+ + DS + SG                     73.90            100.00
+ + DS + HR + SG (=GRAM)        93.96            99.69
+
+ (b) Mechanism Ablation.
+ Model variant                 Sudoku           N-Queens
+ GRAM (ours)                   93.96            99.69
+ w/o stochastic guidance       82.87            72.91
+ stochasticity only            94.88            50.27
+ guide only                    0.00             0.00
+ w/ direct prediction          63.43            61.44
+ TRM w/ stochastic decoder     82.87            71.66
+ TRM w/ random init.           78.53            71.82
+
+
+ transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+ emerging as a natural byproduct of the recursive generative process.
+ Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+ at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+ to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+ advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+ additional samples are in Section D.4.
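+ The Sudoku validity metric used in this section (fraction of generated boards satisfying all constraints)
+ can be sketched as follows. The 9 × 9 list-of-lists board format and the function names are assumptions
+ for illustration only.

```python
from typing import List

Grid = List[List[int]]


def is_valid_sudoku(board: Grid) -> bool:
    """True if every row, column, and 3x3 box contains the digits 1-9 once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


def validity_rate(boards: List[Grid]) -> float:
    """Fraction of boards that satisfy all Sudoku constraints."""
    return sum(is_valid_sudoku(b) for b in boards) / len(boards)


# A known valid pattern, and a copy with one duplicated digit in the first row.
good = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
bad = [row[:] for row in good]
bad[0][0] = bad[0][1]
```

+ A checker of this kind is used only for evaluation; the generation itself, as stated above, runs without
+ any explicit constraint checker.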
+
+ 4.4 Ablation Study
+
+ We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+ Table 3 summarizes the results.
+ Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+ stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+ lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+ on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+ (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+ gains in every configuration, supporting our design of stochastic guidance as the core extension
+ introduced by GRAM.
+ Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+ learned distribution ϵt ∼ N (µθ , σθ2 I) in Equation (4). Removing guidance (N (0, σθ2 I)) maintains
+ Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+ However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+ navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+ tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
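+ The three noise variants compared in this ablation can be sketched with a reparameterized sampler.
+ This is a toy illustration under stated assumptions: mu and sigma stand in for the outputs of the learned
+ networks, and the function name and mode strings are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_noise(
    mu: Sequence[float],
    sigma: Sequence[float],
    mode: str = "full",
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw epsilon_t under one of the three ablation settings."""
    rng = rng or random.Random()
    out = []
    for m, s in zip(mu, sigma):
        if mode == "full":  # N(mu, sigma^2 I): stochasticity and guidance
            out.append(m + s * rng.gauss(0.0, 1.0))
        elif mode == "stochasticity_only":  # N(0, sigma^2 I): guidance removed
            out.append(s * rng.gauss(0.0, 1.0))
        elif mode == "guide_only":  # N(mu, 0): stochasticity removed
            out.append(m)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


mu, sigma = [1.0, -2.0], [0.5, 0.25]
full = sample_noise(mu, sigma, "full", random.Random(0))
noise = sample_noise(mu, sigma, "stochasticity_only", random.Random(0))
guide = sample_noise(mu, sigma, "guide_only")
```

+ With a shared seed, the full sample differs from the stochasticity-only sample by exactly the mean, which
+ makes explicit that guidance shifts the noise while stochasticity spreads it.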
+ Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+ TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red. (Rows shown: D3PM from t = 0 to t = 1000; GRAM from iter = 0 to iter = 16.)
+
+
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+5 Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
+to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
+depth-only scaling. Our ablations further reveal that stochastic guidance is a general-purpose extension
+that consistently improved every recursive architecture we tested, and that the gains stem specifically from
+the variational framework rather than from mere randomness, as naive stochastic alternatives applied to
+existing models yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+
+
+
+ Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining
+multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+ Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+ Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+ Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+ vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+ Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+ of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+ AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+ Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+ by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+ arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+ Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+ reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+ are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+ Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+ preprint arXiv:2510.04871, 2025.
+
+
+ [10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+ versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+ recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+ recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+ preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+ son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+ with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+ beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+ Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+ concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+ hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+ Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+ arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+ Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+ preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+ Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+ recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+ thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+ arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+ Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+ arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+ 1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+ Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+ decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+
+
+ [30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+ Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+ preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+ preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+ arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+ Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+ org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+ models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+ //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+ James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+ //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+ discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+ through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+ processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+ benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+ Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+ models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+ 2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+ Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+ neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+ Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+ processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+ arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+ Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+ discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+ of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+
+
+ [50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+ applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+ convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+ function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+ on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+ arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+ natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+ Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+ Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+ BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+ learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+ 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+ IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+ Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+ mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+ 13091446.
+
+
+
+
+ A Additional Method Details
+A.1 Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head $q_\psi : \mathbb{R}^D \to \mathbb{R}^2$ that maps the high-level state $h$
+to two scalar values, $q_\psi(h) = (q^{\text{halt}}, q^{\text{continue}})$. These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+After computing the latent state $z_T^{(n)}$ at the end of supervision step $n$, we form Q-learning targets:
+
+ • $\hat{q}_n^{\text{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct
+ prediction.
+
+ • $\hat{q}_n^{\text{continue}} = \max\left(q_{n+1}^{\text{halt}}, q_{n+1}^{\text{continue}}\right)$, the bootstrapped value of running one more supervision
+ step.
+
+The halt head is trained by regression to these targets:
+ $$\mathcal{L}_{\text{ACT}} = \sum_{n=1}^{N_{\text{sup}}} \left[ \left(q_n^{\text{halt}} - \hat{q}_n^{\text{halt}}\right)^2 + \left(q_n^{\text{continue}} - \hat{q}_n^{\text{continue}}\right)^2 \right]. \qquad (15)$$
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
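+As a worked illustration of the targets and of Equation (15), the sketch below computes the loss for a
+short trajectory. The function names are hypothetical, and the continue target at the final supervision
+step, where no step n + 1 exists, is exposed as a parameter because this section does not specify it.

```python
from typing import List, Sequence, Tuple


def act_targets(
    correct: Sequence[bool],
    q_halt: Sequence[float],
    q_continue: Sequence[float],
    terminal_continue_target: float = 0.0,  # assumption: not given in the text
) -> Tuple[List[float], List[float]]:
    """Build the halt and continue regression targets for each step."""
    n_steps = len(correct)
    halt_targets = [1.0 if c else 0.0 for c in correct]
    continue_targets = []
    for i in range(n_steps):
        if i + 1 < n_steps:
            # Bootstrapped value of running one more supervision step.
            continue_targets.append(max(q_halt[i + 1], q_continue[i + 1]))
        else:
            continue_targets.append(terminal_continue_target)
    return halt_targets, continue_targets


def act_loss(
    correct: Sequence[bool],
    q_halt: Sequence[float],
    q_continue: Sequence[float],
    terminal_continue_target: float = 0.0,
) -> float:
    """Sum of squared errors of both Q-values against their targets."""
    halt_t, cont_t = act_targets(
        correct, q_halt, q_continue, terminal_continue_target
    )
    return sum(
        (qh - th) ** 2 + (qc - tc) ** 2
        for qh, th, qc, tc in zip(q_halt, halt_t, q_continue, cont_t)
    )


# Two supervision steps: wrong after step 1, correct after step 2.
loss = act_loss([False, True], [0.2, 0.9], [0.6, 0.1])
```

+In a real training loop the targets would be treated as constants (no gradient flows through them), which
+is the usual convention for bootstrapped Q-learning targets and is assumed here rather than stated above.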
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+$n$, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\text{halt}} > q_n^{\text{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise,
+computation continues to the next supervision step, up to a maximum budget of $N_{\text{sup}}^{\max}$ steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only $q^{\text{halt}}$
+(halting when $\sigma(q^{\text{halt}}) > 0.5$, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
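+The halting rule can be sketched as a simple loop. This is a minimal illustration and not the released
+code: step_fn is a hypothetical callable standing in for one supervision step, returning a prediction
+together with the two Q-values.

```python
from typing import Callable, Tuple

# Hypothetical interface: step_fn(n) -> (prediction, q_halt, q_continue).
StepFn = Callable[[int], Tuple[str, float, float]]


def run_with_halting(step_fn: StepFn, max_steps: int) -> Tuple[str, int]:
    """Run supervision steps until q_halt > q_continue or the budget ends."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    prediction = ""
    for n in range(1, max_steps + 1):
        prediction, q_halt, q_continue = step_fn(n)
        if q_halt > q_continue:
            return prediction, n
    # Budget exhausted: return the prediction of the last step.
    return prediction, max_steps


def halt_simplified(q_halt: float) -> bool:
    """Simplified rule: sigmoid(q_halt) > 0.5 holds exactly when q_halt > 0."""
    return q_halt > 0.0


table = {1: ("a", 0.1, 0.5), 2: ("b", 0.7, 0.3), 3: ("c", 0.9, 0.0)}
early = run_with_halting(lambda n: table[n], max_steps=3)
exhausted = run_with_halting(lambda n: (f"y{n}", 0.0, 1.0), max_steps=4)
```

+Because each trajectory calls the rule independently, parallel samples can stop at different depths, as
+noted above.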
+
+A.2 Latent Process Reward Model (LPRM)
+
+To rank or select among sampled candidates, we train a value head $v_\psi(z_t)$ to predict the expected
+accuracy of the final output, conditioned on the current latent state $z_t$. The LPRM is trained jointly
+with the main objective via a regression loss:
+ $$\mathcal{L}_{\text{LPRM}} = \sum_{t=1}^{T} \left(v_\psi(z_t) - r\right)^2, \qquad (16)$$
+
+where $r \in [0, 1]$ denotes the accuracy of the final prediction for a given trajectory.
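+A small numerical instance of Equation (16) is given below. The function name is hypothetical and the
+value-head outputs are made-up numbers; the point is only that every latent state along a trajectory is
+regressed toward the same final accuracy r.

```python
from typing import Sequence


def lprm_loss(values: Sequence[float], final_accuracy: float) -> float:
    """Sum over t of (v(z_t) - r)^2 for a single trajectory."""
    if not 0.0 <= final_accuracy <= 1.0:
        raise ValueError("final_accuracy must lie in [0, 1]")
    return sum((v - final_accuracy) ** 2 for v in values)


# Value estimates that approach the true final accuracy r = 1.0 over time.
loss = lprm_loss([0.5, 0.75, 1.0], final_accuracy=1.0)
```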
+
+A.3 Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+ well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing $\mathcal{L}_{\text{GRAM}}$ effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO $\mathcal{L}_{\text{ELBO}}$ is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+across all $T_{\text{Total}}$ transitions; the surrogate objective is evaluated as the average of $\mathcal{L}_{\text{GRAM}}^{(n)}$ over the
+$N_{\text{sup}}$ supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+[Figure 8 plot. Panels: Sudoku-Extreme (left) and N-Queens 8 × 8 (right). Curves: GRAM (training objective) and ELBO (full). x-axis: Training Step; y-axis: ELBO.]
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since LELBO also serves as an indirect estimate of the negative log-likelihood,
+its consistent improvement provides evidence that GRAM optimizes a well-defined data likelihood,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B Training and Architecture Details
+B.1 Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated
+with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by $\sqrt{D}$ and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we
+use a [SwiGLU + SwiGLU] network in the Recursive Core instead of [Attention + SwiGLU],
+following [9]. To initialize z0 = (h0 , l0 ), we sample once from the standard Gaussian distribution
+N (0, I) and store the sampled value in the network checkpoint, so that the initial state z0 is
+fixed rather than resampled.
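+The update order of one outer step can be sketched as below. This is a toy illustration only: f_low and
+f_high are hypothetical stand-ins for the learned blocks fL and fH, states are plain lists of floats, and
+the stochastic guidance term is omitted.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Block = Callable[[Vector, Vector], Vector]


def outer_step(
    h: Vector, l: Vector, e_x: Vector, f_low: Block, f_high: Block, k: int
) -> Tuple[Vector, Vector]:
    """Refine the low-level state k times, then update the high-level state."""
    for _ in range(k):
        # The input embedding e_x is injected at every low-level iteration.
        context = [hi + ei for hi, ei in zip(h, e_x)]
        l = f_low(l, context)
    h = f_high(h, l)
    return h, l


# Toy blocks: averaging for the low level, addition for the high level.
def toy_low(l: Vector, ctx: Vector) -> Vector:
    return [0.5 * (a + b) for a, b in zip(l, ctx)]


def toy_high(h: Vector, l: Vector) -> Vector:
    return [a + b for a, b in zip(h, l)]


h1, l1 = outer_step([0.0], [0.0], [1.0], toy_low, toy_high, k=2)
```

+Note that h is held fixed during the k low-level iterations and only changes once per outer step, which is
+the two-timescale structure described above.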
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
+
+
+Table 4: Architecture components.
+Component       Module             Description
+Encoder         Token Embedding    vocab → D
+                Puzzle Embedding   16 tokens (optional, for ARC)
+                Position Encoding  RoPE or learned
+Recursive Core  fL, fH             [Attention + SwiGLU] × 2 layers
+                Iterations         K low-level, T high-level steps
+                µθ, σθ, µϕ, σϕ     SwiGLU MLP for each parameter
+Decoder         LM Head            Linear(D → vocab)
+                Q Head             Linear(D → 2) for halt
+                V Head             Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0 ∼ 255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
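+The preprocessing steps above (binarization, rescaling to [−1, 1], and non-overlapping patchification)
+can be sketched with NumPy. This is a hedged illustration rather than the authors' code: the function
+names are hypothetical, the convolutional blocks are omitted, the element ordering inside each patch is
+an assumption, and P = 2 is inferred from 28 × 28 MNIST images yielding 14 × 14 patches.

```python
import numpy as np


def binarize(pixels_uint8, threshold=0.5):
    """Scale 0..255 pixels to [0, 1] and threshold at 0.5 to get {0, 1} values."""
    scaled = pixels_uint8.astype(np.float32) / 255.0
    return (scaled >= threshold).astype(np.int64)


def to_signed(binary):
    """Map binary values {0, 1} to the range [-1, 1] (0 -> -1, 1 -> +1)."""
    return binary.astype(np.float32) * 2.0 - 1.0


def patchify(feat, p):
    """Split (B, C, H, W) features into non-overlapping p x p patches.

    Returns an array of shape (B, Np, p * p * C) with Np = (H // p) * (W // p).
    """
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 3, 5, 1)  # (B, H/p, W/p, p, p, C)
    return x.reshape(b, (h // p) * (w // p), p * p * c)


pixels = np.array([0, 127, 128, 255], dtype=np.uint8)
binary = binarize(pixels)
signed = to_signed(binary)

# A 28 x 28 single-channel image with P = 2 gives 14 * 14 = 196 patches.
patches = patchify(np.zeros((2, 1, 28, 28), dtype=np.float32), p=2)
```

+No 8-bit pixel value maps exactly to 0.5, so the choice between a strict and a non-strict threshold
+does not affect the binarized result.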
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+Stage     Layer / Operation        Output Dim.
+1. Norm.  Input Tokens             (B, C, H, W)
+          Linear Scaling [−1, 1]   (B, C, H, W)
+2. Conv   Conv2d 5 × 5 (p = 2)     (B, D/2, H, W)
+          SiLU → GN(32)
+          Conv2d 5 × 5 (p = 2)     (B, D/2, H, W)
+          SiLU → GN(32)
+3. Patch  Flatten Patches          (B, Np, P^2 · D/2)
+          Linear Projection        (B, Np, D)
+
+Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt, as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, Nhead = 8 attention heads, and FFN hidden dimension
+Dh = 512. Within a recursion step, meaning a latent transition zt → zt+1, we use K = 6 low-level (inner)
+steps for Sudoku [8] and K = 4 for all other tasks, with T = 3 high-level (outer) steps.
+
+B.2 Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized
+in Table 6).
+
+
+ • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+ (0=pad, 1=blank, 2–10=digits).
+ • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+ markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+ specific puzzle embeddings are prepended to distinguish different ARC tasks.
+ • N-Queens flattens an N × N board row-by-row into N^2 tokens with vocabulary size 3
+ (0=pad, 1=empty, 2=queen).
+ • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+ using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+ • For image generation on MNIST [15], images are quantized and processed via CNN-based
+ patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. The
+ patched input then forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3
+ (0=pad, 1=black, 2=white).
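+The token encodings listed above can be sketched as plain Python functions. The function names are
+hypothetical, and two details are assumptions not stated in the text: the upper triangle is traversed
+row by row, and a Sudoku digit d is mapped to token d + 1 (with 0 in the grid standing for a blank cell).

```python
def encode_sudoku(grid):
    """9 x 9 grid with 0 = blank, 1..9 = digits -> 81 tokens (1 = blank, 2..10 = digits)."""
    return [cell + 1 for row in grid for cell in row]


def encode_nqueens(board):
    """N x N board of 0/1 flags (1 = queen) -> N^2 row-major tokens (1 = empty, 2 = queen)."""
    return [2 if cell else 1 for row in board for cell in row]


def encode_graph(adj):
    """Strict upper triangle of an adjacency matrix -> n(n-1)/2 tokens (1 = no edge, 2 = edge)."""
    n = len(adj)
    return [2 if adj[i][j] else 1 for i in range(n) for j in range(i + 1, n)]


def encode_coloring(colors):
    """Per-node color ids 0..k-1 -> output tokens 3 + color_id."""
    return [3 + c for c in colors]


# Path graph 0 - 1 - 2: edges (0, 1) and (1, 2), no edge (0, 2).
path_adj = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
path_tokens = encode_graph(path_adj)
color_tokens = encode_coloring([0, 1, 0])

queens_tokens = encode_nqueens([[0, 1, 0, 0],
                                [0, 0, 0, 1],
                                [1, 0, 0, 0],
                                [0, 0, 1, 0]])
```

+For a graph with 8 nodes this yields 8 · 7 / 2 = 28 input tokens, and 45 for 10 nodes.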
+
+
+Table 6: Task-specific configurations.
+Task            Seq. Len   Vocab  Puzzle Emb  Encoding
+Sudoku          81         11     ✗           9×9 grid, row-major
+ARC-AGI         900        12     ✓           30×30 padded canvas
+N-Queens        N^2        3      ✗           N × N board
+Graph Coloring  n(n−1)/2   6      ✗           Strict adjacency upper triangle
+MNIST           196        3      ✗           14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10^−4, weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+ Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+ Task Epochs GPUs Time
+ Sudoku 50K 8 2h
+ ARC-AGI 200K 8 5 days
+ N-Queens (8×8) 3K 8 1h
+ N-Queens (10×10) 1K 8 3h
+ Graph Coloring (8 nodes) 5K 8 1.5h
+ Graph Coloring (10 nodes) 5K 8 6h
+ MNIST 1.8K 8 16h
+
+
+
+C Additional Details of Experiment Setup
+C.1 Challenging Puzzle Tasks
+
+C.1.1 Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2 Multi-solution Puzzle Tasks
+
+C.2.1 N-Queens Problem
+
+[Figure 9 panels: Input, Solution 1, Solution 2, Solution 3]
+
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must recover the positions of the 5 removed queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
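+The construction above can be sketched as follows: enumerate all complete solutions by backtracking,
+remove k queens from one solution, and count how many complete solutions are consistent with the
+remaining queens. The function names are hypothetical and this is an illustrative sketch, not the
+authors' generation script. A solution is stored as a tuple whose index is the row and whose value is
+the queen's column, so removing a queen amounts to clearing one row.

```python
def solve_n_queens(n):
    """Return all complete N-Queens solutions as tuples (index = row, value = column)."""
    solutions = []

    def place(row, cols):
        if row == n:
            solutions.append(tuple(cols))
            return
        for c in range(n):
            # A new queen must not share a column or a diagonal with earlier rows.
            if all(c != pc and abs(c - pc) != row - pr for pr, pc in enumerate(cols)):
                place(row + 1, cols + [c])

    place(0, [])
    return solutions


def make_puzzle(solution, removed_rows):
    """Keep only the queens whose rows were not removed (partial configuration)."""
    return {r: c for r, c in enumerate(solution) if r not in removed_rows}


def completions(puzzle, all_solutions):
    """All complete solutions that agree with the queens left on the board."""
    return [s for s in all_solutions if all(s[r] == c for r, c in puzzle.items())]


all_8 = solve_n_queens(8)
# Remove k = 5 queens (rows 0..4) from the first solution, leaving 3 queens.
puzzle = make_puzzle(all_8[0], removed_rows={0, 1, 2, 3, 4})
valid = completions(puzzle, all_8)
print(len(all_8), len(valid))
```

+The original board is always among the completions, so every generated instance has at least one valid
+solution.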
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}^L, where L = N^2, along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
+
+[Figure 10 plots: histograms for 8x8 N-Queens and 10x10 N-Queens; x-axis: Number of solutions; y-axis: Counts]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2 Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
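+Enumerating colorings and collapsing color permutations, as described above, can be sketched by brute
+force. The canonical form used here (relabeling colors in order of first appearance) is an assumption,
+since the text does not specify which canonicalization is used, and the function names are
+hypothetical.

```python
from itertools import product


def valid_colorings(adj, k=3):
    """All assignments of k colors to nodes such that adjacent nodes differ."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]]
    return [c for c in product(range(k), repeat=n)
            if all(c[i] != c[j] for i, j in edges)]


def canonical(coloring):
    """Relabel colors by first appearance, so permuted colorings coincide."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in coloring)


def canonical_solutions(adj, k=3):
    """Structurally distinct colorings, one representative per permutation class."""
    return sorted({canonical(c) for c in valid_colorings(adj, k)})


triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

triangle_solutions = canonical_solutions(triangle)
path_solutions = canonical_solutions(path)
```

+A graph is 3-colorable exactly when canonical_solutions returns a non-empty list, so the same routine
+can serve as the retention filter for small graphs. For 10 nodes the brute-force search covers
+3^10 = 59,049 assignments.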
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
+
+[Figure 11 panels: Input, Solution 1, Solution 2, Solution 3, Solution 4]
+
+
+
+
+ Figure 11: Graph Coloring Example
+
+
+[Figure 12 plots: histograms for Vertex 8 Graph Coloring and Vertex 10 Graph Coloring; x-axis: Number of solutions; y-axis: Counts]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D Additional Experiment Results
+D.1 Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+Method            #Params  Sudoku  ARC-1  ARC-2
+Large Reasoning Models
+Deepseek-R1       671B     0.0     15.8   1.3
+Claude 3.7 16k    N/A      0.0     28.6   0.7
+o3-mini-high      N/A      0.0     34.5   3.0
+GPT 5.2 (low)     N/A      –       55.7   9.7
+Grok-4-thinking   1.7T     –       66.7   16.0
+Gemini 3 Pro      N/A      –       75.0   31.1
+Recursive Models
+Direct Pred       27M      0.0     21.0   0.0
+Looped TF         7M       61.3    –      –
+HRM               27M      55.0    40.3   5.0
+TRM               7M       87.4    44.6   7.8
+GRAM (Ours)       10M      97.0    52.0   11.1
+Human Results
+Avg. Human        –        –       60.2   –
+Best Human        –        –       98.0   100.0
+
+
+D.2 Scales with Parallel Sampling on ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
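+Majority voting over sampled outputs, as used in this comparison, can be sketched as below. This is an
+illustrative assumption about the voting rule (exact match on the full predicted token sequence, ties
+broken by first occurrence); the function name is hypothetical and the sampled predictions are
+made-up toy values.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent full prediction among the sampled outputs."""
    counts = Counter(tuple(p) for p in predictions)
    winner, _ = counts.most_common(1)[0]
    return list(winner)


# Toy example: three sampled token sequences for one problem instance.
samples = [[2, 5, 5, 1], [2, 4, 5, 1], [2, 5, 5, 1]]
voted = majority_vote(samples)
```

+Voting on whole sequences rather than per token keeps the returned answer equal to one of the sampled
+trajectories' outputs.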
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+
+[Figure 13 plot: Accuracy vs. Number of Samples (N); curves: TRM, GRAM (=0.1), GRAM (=0.05), GRAM (=0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+[Figure 14 plot: Accuracy vs. Number of Samples (N); curves: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish; at Aug=50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3 Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
+
+
+D.4 Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+
+[Figure 15 plots: (a) N-Queens 8 × 8, (b) N-Queens 10 × 10; Coverage vs. Number of Solutions; curves: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+
+ Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+ space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+ increases.
+
+
+
+ As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+ from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+ A particularly compelling property of this process is the model’s ability to recover from initially
+ ambiguous or incorrect formations.
+ For instance, in the second row (generating the digit ’2’) and the last row (generating the digit ’1’),
+ the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+ as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+ resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+
+
+
+ t=0 t=1 t=2 t=4 t=6 t=7 t=9 t=11 t=12 t=14 t=16
+ Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated uncondi-
+ tionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its
+ recursive refinement process.
+
+
+
+
+ D.5 Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
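+The validity criterion and the two reported statistics can be sketched as follows. The function names
+are hypothetical and this is not the authors' evaluation script; the example board is the valid sample
+shown in Figure 17, and the invalid board is derived from it by overwriting one cell.

```python
def is_valid_sudoku(board):
    """True if every row, column, and 3 x 3 box holds the digits 1..9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [{board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return all(group == digits for group in rows + cols + boxes)


def validity_and_uniqueness(boards):
    """Validity rate over all boards, and fraction of unique boards among valid ones."""
    valid = [tuple(map(tuple, b)) for b in boards if is_valid_sudoku(b)]
    validity = len(valid) / len(boards)
    unique_fraction = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unique_fraction


valid_board = [
    [3, 6, 5, 4, 7, 8, 9, 2, 1],
    [9, 4, 1, 2, 5, 6, 8, 7, 3],
    [2, 7, 8, 9, 1, 3, 6, 5, 4],
    [5, 2, 9, 8, 4, 7, 3, 1, 6],
    [7, 3, 6, 1, 9, 2, 5, 4, 8],
    [1, 8, 4, 3, 6, 5, 2, 9, 7],
    [8, 9, 7, 5, 3, 4, 1, 6, 2],
    [4, 1, 3, 6, 2, 9, 7, 8, 5],
    [6, 5, 2, 7, 8, 1, 4, 3, 9],
]
# Overwrite one cell to create a duplicate 6 in the first row.
invalid_board = [row[:] for row in valid_board]
invalid_board[0][0] = 6

validity, unique_fraction = validity_and_uniqueness(
    [valid_board, valid_board, invalid_board])
```

+Because each group has exactly nine cells, requiring the set of its entries to equal {1, ..., 9} is
+equivalent to requiring each digit to appear exactly once.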
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10^−4, EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+ Method #Params Steps Validity(%)
+ D3PM-Uniform (Big) 55.1M 1000 91.33
+ D3PM-Uniform (Small) 15.9M 1000 29.24
+ D3PM-Absorb (Big) 55.1M 1000 79.18
+ D3PM-Absorb (Small) 15.9M 1000 21.88
+ GRAM (Ours) 10.9M 16 99.05
+
+
+ Empty input Valid sample Invalid sample
+ 3 6 5 4 7 8 9 2 1 4 3 8 2 9 1 7 6 5
+ 9 4 1 2 5 6 8 7 3 5 2 1 7 3 6 4 9 8
+ 2 7 8 9 1 3 6 5 4 7 6 9 4 5 8 1 2 3
+ 5 2 9 8 4 7 3 1 6 9 1 3 4 4 7 5 8 6
+ 7 3 6 1 9 2 5 4 8 2 5 4 6 8 9 3 1 7
+ 1 8 4 3 6 5 2 9 7 6 8 7 5 1 3 2 4 9
+ 8 9 7 5 3 4 1 6 2 3 7 2 9 6 4 8 5 1
+ 4 1 3 6 2 9 7 8 5 1 9 5 3 2 8 6 7 4
+ 6 5 2 7 8 1 4 3 9 8 4 6 1 7 5 9 3 2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+
+
+ D.6 Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+ Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (right bottom), others successfully reach
+the global optimum (left middle), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+ Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+Asset | Use in this paper | License / terms | Source link
+MNIST | Binarized MNIST generation experiments | Creative Commons Attribution-Share Alike 3.0 | https://keras.io/api/datasets/mnist/
+ARC-AGI-1 / original ARC | ARC-AGI reasoning benchmark | Apache License 2.0 | https://github.com/fchollet/ARC-AGI
+ARC-AGI-2 | ARC-AGI-2 reasoning benchmark / reference results | Apache License 2.0 | https://github.com/arcprize/ARC-AGI-2
+HRM repository | HRM baseline and Sudoku-Extreme-related reference implementation | Apache License 2.0 | https://github.com/sapientinc/HRM
+TinyRecursiveModels / TRM repository | TRM baseline and recursive reasoning reference implementation | MIT License | https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+MDLM repository | Masked diffusion baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/kuleshov-group/mdlm
+Google Research D3PM implementation | D3PM image-generation baseline reference implementation, if public code is used | Apache License 2.0 | https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+Looped Transformer repository | Looped Transformer baseline reference implementation, if public code is used | MIT License | https://github.com/Leiay/looped_transformer
+N-Queens | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+Graph Coloring | Synthetic multi-solution constraint satisfaction task generated by the authors | Not an external asset | N/A
+
+
+
+
+ \ No newline at end of file
diff --git a/papers/txt/gram2026_generative_recursive.txt b/papers/txt/gram2026_generative_recursive.txt
new file mode 100644
index 0000000..d5f299d
--- /dev/null
+++ b/papers/txt/gram2026_generative_recursive.txt
@@ -0,0 +1,1532 @@
+ Generative Recursive Reasoning
+
+
+ Junyeob Baek1†∗ Mingyu Jo1†∗ Minsu Kim1,2
+
+ Mengye Ren3 Yoshua Bengio2,4 Sungjin Ahn1,3†
+arXiv:2605.19376v2 [cs.AI] 20 May 2026
+
+
+
+
+ 1 KAIST   2 Mila – Québec AI Institute
+ 3 New York University   4 Université de Montréal
+
+
+
+ Abstract
+ How should future neural reasoning systems implement extended computation?
+ Recursive Reasoning Models (RRMs) offer a promising alternative to autoregres-
+ sive sequence extension by performing iterative latent-state refinement with shared
+ transition functions. Yet existing RRMs are largely deterministic, following a
+ single latent trajectory and converging to a single prediction. We introduce Gen-
+ erative Recursive reAsoning Models (GRAM), a framework that turns recursive
+ latent reasoning into probabilistic multi-trajectory computation. GRAM models
+ reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alterna-
+ tive solution strategies, and inference-time scaling through both recursive depth
+ and parallel trajectory sampling. This yields a latent-variable generative model
+ supporting conditional reasoning via pθ (y | x) and, with fixed or absent inputs,
+ unconditional generation via pθ (x). Trained with amortized variational inference,
+ GRAM improves over deterministic recurrent and recursive baselines on structured
+ reasoning and multi-solution constraint satisfaction tasks, while demonstrating an
+ unconditional generation capability. https://ahn-ml.github.io/gram-website
+
+
+ 1 Introduction
+ A central question for future neural reasoning systems is how extended computation should be imple-
+ mented. Large autoregressive models typically scale reasoning by extending a sequence-generation
+ process, whether intermediate computation is expressed explicitly as chain-of-thought tokens or im-
+ plicitly in hidden or latent representations [1–6]. A complementary direction is explored by Recursive
+ Reasoning Models (RRMs), which use repeated computation to refine a persistent latent state rather
+ than to append new elements to an output or reasoning sequence [7–9]. This approach is appealing
+ because it decouples reasoning depth from both parameter scale and output length: a compact model
+ can perform many steps of internal computation by repeatedly applying shared transition functions
+ over time.
+ Recent recursive reasoning models such as HRM [8] and TRM [9] provide early evidence for the
+ potential of this approach in structured reasoning. Rather than producing a solution in a single
+ feedforward pass, they perform extended computation through iterative latent-state refinement, deep
+ supervision across refinement steps, and reasoning-oriented recurrent designs such as hierarchical
+ latent dynamics. These features make them well suited to problems requiring constraint propagation,
+ state tracking, iterative correction, and multi-step inference. More broadly, they build on a principle
+ ∗ Equal contribution
+ † Correspondence to: Junyeob Baek (wnsdlqjtm@kaist.ac.kr), Mingyu Jo (mingyu.jo@kaist.ac.kr), Sungjin Ahn
+
+ (sungjin.ahn@kaist.ac.kr)
+
+
+ Preprint.
+[Figure 1 panels: an input task with Solution 1 and Solution 2; (a) Deterministic RRMs; (b) GRAM (Ours)]
+
+Figure 1: Comparison of Latent Reasoning Trajectories. Left: N-Queens example with two valid solutions.
+Right: Given three independent runs of latent reasoning (τ1, τ2, τ3): (a) Prior RRMs (e.g., HRM, TRM) are
+deterministic: all runs collapse to an identical trajectory, converging to a single solution and failing to explore
+alternatives, while (b) GRAM explores diverse trajectories that reach multiple
+valid solutions y1 and y2, naturally enabling parallel inference-time scaling.
+
+also explored in recurrent Transformer architectures such as Universal Transformers [10] and Looped
+Transformers [7]: shared Transformer blocks can be repeatedly applied to increase computational
+depth without increasing parameter count. Together, these models suggest that reasoning capability
+can emerge not only from scaling model size or generating longer traces, but also from the organization
+of computation itself.
+While recurrent latent-state refinement provides an appealing mechanism for efficiently increasing
+reasoning depth, depth alone is not sufficient for many reasoning problems. A capable reasoning
+system should also be able to maintain uncertainty, consider alternative hypotheses, and explore
+multiple possible solution strategies [11, 12]. This is especially important in settings where ambiguity
+or multiple valid solutions are intrinsic, and more generally in problems where a single refinement
+path may become trapped in a suboptimal reasoning trajectory. In this sense, future RRMs should be
+not only deep, in the sense of repeated refinement, but also wide, in the sense of maintaining and
+exploring multiple latent trajectories in parallel.
+Existing RRMs [7–10], however, remain fundamentally deterministic: given the same input and
+initialization, they follow a single latent trajectory and converge to a single prediction. This deter-
+ministic recursion collapses the space of plausible reasoning paths into a single attractor, leaving
+probabilistic multi-hypothesis latent reasoning largely unexplored within the RRM paradigm. This
+motivates the central question of our work: can recursive latent computation support probabilistic,
+generative, multi-hypothesis reasoning while preserving the efficiency of compact recurrent models?
+In this paper, we propose Generative Recursive reAsoning Models (GRAM), a framework that turns
+recursive latent reasoning into probabilistic multi-trajectory computation. GRAM treats the reasoning
+process itself as a stochastic latent trajectory: at each recursion step, the model samples a transition
+conditioned on the input and the current reasoning state, rather than deterministically updating to a
+single next state. Repeating this process defines a distribution over possible reasoning trajectories,
+allowing the model to maintain multiple hypotheses, explore alternative solution strategies, and scale
+inference not only by increasing recursive depth but also by sampling trajectories in parallel. From
+a probabilistic perspective, GRAM is a latent-variable generative model: it models pθ (y | x) by
+marginalizing over latent reasoning trajectories, while the same recursive process can also define an
+unconditional generative model pθ (x) when the input is fixed or absent.
+We evaluate GRAM on controlled reasoning and generation tasks that serve as probes of the ar-
+chitectural properties targeted by our formulation: recursive refinement, stochastic exploration,
+multi-solution coverage, and inference-time scaling. Given this goal, our experiments focus on
+comparisons with the most relevant deterministic recurrent and recursive latent reasoning baselines,
+including Looped Transformers, HRM, and TRM, rather than frontier-scale general-purpose LLMs
+whose training data, inference budgets, and external scaffolding are not directly comparable. Sudoku-
+Extreme [8] and ARC-AGI [13, 14] test structured reasoning under hard constraints and abstract
+transformations; N-Queens and Graph Coloring evaluate multi-solution recovery; and binarized
+MNIST [15] probes the unconditional generative interpretation.
+Our main contribution is to establish probabilistic multi-trajectory recursion as a design principle
+for future recurrent and recursive reasoning architectures. Concretely, we make three contributions.
+ First, we formulate recursive reasoning as a latent-variable generative process, where solutions are
+obtained by marginalizing over stochastic reasoning trajectories. Second, we introduce width-based
+inference-time scaling, enabling inference to scale not only with recursive depth but also with the
+number of sampled latent trajectories. Third, we provide empirical evidence that this formulation
+yields the intended architectural advantages over deterministic recurrent and recursive baselines,
+improving structured reasoning, multi-solution constraint satisfaction, and unconditional generation.
+
+2 Generative Recursive Reasoning Models
+In this section, we introduce Generative Recursive reAsoning Models (GRAM), an instantiation
+of probabilistic recursive reasoning. We describe the architecture in Section 2.1 and the training
+procedure in Section 2.2, with an architecture schematic shown in Figure 2.
+
+2.1 Architecture
+
+Overview. GRAM models the conditional distribution pθ (y | x) by marginalizing over stochastic
+latent reasoning trajectories. Given an input x, GRAM first computes an embedding
+ ex = fenc (x; θ), (1)
+which is reused throughout the entire recursive computation. Starting from a fixed initial latent
+state z0 , the model evolves the latent state through learned stochastic transitions. The recursive
+computation is organized into two nested levels: inner and outer loops.
+
+[Figure 2 diagram omitted. Components: encoder fenc (input x); low-level update fL (applied K times, lt−1 to lt ); high-level update fH (ht−1 to ut to ht ); prior pθ (· | ut ); posterior qϕ (· | ut , y); noise sample ϵt ; decoder fdec (output y). Losses: (A) CE loss, (B) KL divergence.]
+
+Figure 2: GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation z =
+(h, l). After K low-level refinements via fL , the high-level update fH produces a deterministic proposal ut , to
+which stochastic guidance ϵt is added: ht = ut + ϵt .
+
+At the inner level, a latent transition samples a new latent state conditioned on the previous
+latent state and the input embedding,
+ zt ∼ pθ (zt | zt−1 , ex ), t = 1, . . . , T. (2)
+At the end of the T transitions, the decoder produces a prediction, ŷ = arg max fdec (zT ; θ). We refer
+to the sequence of T transitions from the initial state z0 to the final state zT as a supervision step. A
+supervision step is the unit at which the decoder is invoked, and the training objective is applied, with
+gradients computed as described in Section 2.2.
+At the outer level, Nsup supervision steps are applied recursively, with the final state of one supervision
+step serving as the initial state of the next, thereby forming the full recursive computation:
+ z_0^{(1)} \xrightarrow{T\ \text{transitions}} z_T^{(1)} = z_0^{(2)} \xrightarrow{T\ \text{transitions}} \cdots \xrightarrow{T\ \text{transitions}} z_T^{(N_{\mathrm{sup}})}, (3)
+where z_t^{(n)} denotes the latent state at the t-th transition of the n-th supervision step, z_0^{(1)} is the
+fixed initial state, and the terminal state of one supervision step serves as the initial state of the next
+(z_0^{(n+1)} := z_T^{(n)}). This abstract formulation can be instantiated with various recurrent Transformer
+backbones, including flat designs such as Universal Transformers and Looped Transformers [10, 7],
+as well as hierarchical designs such as HRM and TRM [8, 9].
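+The nested structure of Equations (2) and (3) can be illustrated with a minimal sketch. It is not the
+authors' implementation: run_recursion, toy_transition, and toy_decode are hypothetical names, and the
+toy transition (a tanh update plus fixed-scale Gaussian noise) merely stands in for the learned modules.

```python
import numpy as np

def run_recursion(z0, e_x, transition, decode, T, n_sup, rng):
    """Apply n_sup supervision steps, each made of T stochastic latent transitions."""
    z = z0
    predictions = []
    for _ in range(n_sup):                 # outer loop: supervision steps
        for _ in range(T):                 # inner loop: z_t ~ p(z_t | z_{t-1}, e_x)
            z = transition(z, e_x, rng)
        predictions.append(decode(z))      # decoder is invoked once per supervision step
        # the terminal state z carries over as the initial state of the next step
    return z, predictions

# Hypothetical stand-ins for the learned modules (assumptions, for illustration only).
def toy_transition(z, e_x, rng):
    u = np.tanh(z + e_x)                               # deterministic update
    return u + 0.1 * rng.standard_normal(z.shape)      # additive Gaussian perturbation

def toy_decode(z):
    return int(np.argmax(z))                           # arg max over decoder scores

rng = np.random.default_rng(0)
z_final, preds = run_recursion(np.zeros(4), np.ones(4), toy_transition, toy_decode,
                               T=3, n_sup=2, rng=rng)
```

+With T = 3 and n_sup = 2 the sketch performs six transitions in total and returns one prediction per
+supervision step, mirroring TTotal = T × Nsup.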
+Stochastic Latent Transitions. Unlike prior recursive reasoning models (RRMs) that update the
+latent state deterministically and follow a single fixed trajectory [8, 9], GRAM defines pθ (zt |
+zt−1 , ex ) as a stochastic transition, so that repeated computation induces a distribution over latent
+reasoning trajectories. Concretely, GRAM realizes this transition as a learned stochastic residual
+perturbation around a deterministic update: at each transition, the model first computes a deterministic
+update ut from zt−1 and ex , then samples a conditional perturbation from a state-dependent Gaussian,
+and adds it to ut :
+ \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t)\, I\big), (4)
+ zt = ut + ϵt . (5)
+ We refer to ϵt as the learnable stochastic guidance. The mean µθ (ut ) encodes a state-dependent
+direction in which the trajectory is steered, while the variance σθ2 (ut ) controls the amount of ex-
+ploration. This design allows GRAM to capture uncertainty, prevent convergence to local minima,
+and support robust exploration of the solution space without discarding the deterministic refinement
+performed by ut .
+Hierarchical Instantiation. We instantiate the latent state with two interacting components, z =
+(h, l). The high-level component h is updated once per latent transition and carries abstract reasoning
+state, while the low-level component l is updated K times within a single transition and carries
+fine-grained intermediate computation. This decomposition separates the two roles across time scales,
+with h accumulating slowly across transitions and l refined rapidly within each one.
+With this hierarchical multi-scale structure, a single transition zt−1 → zt is computed as follows.
+The low-level component is first refined for K updates, with the high-level component held fixed:
+ lt,k = fL (ht−1 , lt,k−1 , ex ; θ), k = 1, . . . , K, (6)
+where lt,0 := lt−1 and we write lt := lt,K for the refined low-level component. The high-level
+component is then updated as a stochastic transition conditioned on the refined lt ,
+ ut = fH (ht−1 , lt ; θ), (7)
+ \epsilon_t \sim p_\theta(\epsilon_t \mid u_t) := \mathcal{N}\big(\mu_\theta(u_t),\, \sigma_\theta^2(u_t)\, I\big), (8)
+ ht = ut + ϵt , (9)
+and we set zt = (ht , lt ). Note that stochasticity is introduced only at the high level: the low-
+level refinement is fully deterministic, while the stochastic guidance signal ϵt acts on the slower,
+more abstract component of the latent state, where it can steer the overall reasoning trajectory
+across transitions³. Under this instantiation, the decoder reads only the high-level component, i.e.,
+fdec (zT ) = fdec (hT ). Additional architectural details are provided in Appendix B.1.
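+A minimal sketch of one hierarchical transition, following Equations (6)-(9), is given below. All
+module definitions (f_L, f_H, mu, log_sigma and the weight matrices) are toy assumptions chosen only
+so that the example runs; the paper's actual modules are Transformer blocks described in Appendix B.1.

```python
import numpy as np

def hierarchical_transition(h_prev, l_prev, e_x, f_L, f_H, mu, log_sigma, K, rng):
    """One transition z_{t-1} -> z_t for z = (h, l), following Eqs. (6)-(9)."""
    l = l_prev
    for _ in range(K):                              # Eq. (6): refine l with h held fixed
        l = f_L(h_prev, l, e_x)
    u = f_H(h_prev, l)                              # Eq. (7): deterministic proposal u_t
    noise = rng.standard_normal(u.shape)
    eps = mu(u) + np.exp(log_sigma(u)) * noise      # Eq. (8): reparameterized Gaussian sample
    h = u + eps                                     # Eq. (9): stochastic high-level state
    return h, l, u

# Toy stand-ins for the learned modules (assumptions, for illustration only).
d = 4
W_L = 0.1 * np.eye(d)
W_H = 0.2 * np.eye(d)
f_L = lambda h, l, e: np.tanh(l @ W_L + h + e)
f_H = lambda h, l: np.tanh(h @ W_H + l)
mu = lambda u: 0.1 * u                              # state-dependent mean of the guidance
log_sigma = lambda u: np.full_like(u, -2.0)         # constant log standard deviation

h0, l0, e_x = np.zeros(d), np.zeros(d), np.ones(d)
h_a, l_a, u_a = hierarchical_transition(h0, l0, e_x, f_L, f_H, mu, log_sigma,
                                        K=3, rng=np.random.default_rng(0))
h_b, l_b, u_b = hierarchical_transition(h0, l0, e_x, f_L, f_H, mu, log_sigma,
                                        K=3, rng=np.random.default_rng(1))
```

+Two calls with different random seeds yield the same low-level state and the same proposal ut but
+different high-level states, which reflects the statement that stochasticity enters only at the high level.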
+Modeling Unconditional Distribution. While the description so far focuses on the conditional
+setting pθ (y | x), the same recursive process can also be defined as an unconditional generative model
+pθ (x) when the input is replaced with an empty conditioning embedding. We use this formulation for
+generation tasks in Section 4.3.
+
+2.2 Training
+
+GRAM is trained to model the conditional distribution pθ (y | x), where each training example
+consists of an input x and its corresponding target y. As a probabilistic model, GRAM adopts a
+latent-variable formulation and is optimized by maximizing an evidence lower bound (ELBO) with
+respect to the generative parameters θ and variational parameters ϕ.
+Latent Variable Modeling. We model GRAM as a latent-variable probabilistic model pθ , where
+the full latent trajectory τ = (z0 → · · · → zTTotal ) consists of a sequence of latent variables, with
+TTotal = T × Nsup . The conditional likelihood is defined as
+ p_\theta(y \mid x) = \int p_\theta(y \mid \tau, x)\, p_\theta(\tau \mid x)\, d\tau, (10)
+
+where x denotes the input problem and y denotes the corresponding ground-truth output.
+Direct maximum likelihood estimation of log pθ (y | x) is intractable due to the marginalization
+over latent trajectories. We therefore introduce a variational posterior qϕ (τ | x, y) and optimize the
+evidence lower bound (ELBO), jointly training θ and ϕ via variational inference:
+ log pθ (y | x) ≥ Eqϕ (τ |x,y) [log pθ (y | τ, x)] − KL(qϕ (τ | x, y) ∥ pθ (τ | x)) . (11)
+
+During training, latent trajectories are sampled from the variational posterior qϕ (· | x, y), which has
+access to both the input problem x and the target output y. At inference time, where y is unavailable,
+trajectories are instead generated from the learned prior pθ (· | x).
+ ³ We also tried injecting noise into the low-level state, but found that it did not improve performance.
+
+ Both the prior and the posterior are modeled as conditional Markov processes over latent states:
+ p_\theta(\tau \mid x) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} p_\theta(z_t \mid z_{t-1}, x), \qquad q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^{T_{\mathrm{Total}}} q_\phi(z_t \mid z_{t-1}, x, y). (12)
+Here, z0 is a fixed initial state shared by the prior and posterior. Both transitions are implemented by
+adding reparameterized Gaussian noise ϵt after a deterministic update ut ; the posterior uses the same
+transition module as the prior, but samples from a target-conditioned noise distribution qϕ (ϵt | ut , y),
+whereas the prior uses pθ (ϵt | ut ).
+Since the two processes share the same Markov structure and all stochasticity is introduced through
+ϵ1:TTotal , their trajectory distributions can be equivalently represented in noise space. Moreover,
+since GRAM decodes the output only from the terminal latent state, the likelihood term satisfies
+pθ (y | τ, x) = pθ (y | zTTotal , x). Therefore, the full trajectory-level ELBO can be written as
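+ \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_{T_{\mathrm{Total}}}, x)\big] - \sum_{t=1}^{T_{\mathrm{Total}}} \mathbb{E}_{q_\phi(\epsilon_{<t} \mid x, y)}\big[\mathrm{KL}\big(q_\phi(\epsilon_t \mid u_t, y) \,\|\, p_\theta(\epsilon_t \mid u_t)\big)\big]. (13)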
+Here, ut = fH (ht−1 , lt ) denotes the deterministic high-level update before noise injection, as defined
+in Equation (7). Since ut depends on ht−1 , which is determined by the previously sampled noise
+variables ϵ<t := (ϵ1 , . . . , ϵt−1 ), the expectation averages over these ancestral samples.
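+Because both noise distributions in Equation (13) are Gaussian, each per-step KL term can be computed
+in closed form. The sketch below assumes diagonal covariances for both the posterior and the prior
+(the prior is stated as σ²I in Equation (8); the same form for the posterior is an assumption), and
+gaussian_kl_diag is a hypothetical helper name.

```python
import numpy as np

def gaussian_kl_diag(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    per_dim = (np.log(sigma_p / sigma_q)
               + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
               - 0.5)
    return float(np.sum(per_dim))

# Identical distributions: the KL term vanishes.
kl_same = gaussian_kl_diag(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))

# Unit-variance Gaussians whose means differ by 1 in each of 3 dimensions: 3 * 0.5.
kl_shift = gaussian_kl_diag(np.ones(3), np.ones(3), np.zeros(3), np.ones(3))
```

+The term is zero when the target-conditioned posterior matches the prior and grows as the posterior
+mean or scale moves away from it.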
+Practical Implementation. In practice, following previous recursive reasoning models [8, 9], we
+train GRAM with deep supervision over Nsup consecutive supervision steps, each consisting of T
+recursive latent transitions. This provides dense learning signals along the full latent trajectory, rather
+than supervising only the final state after TTotal = T × Nsup transitions. The terminal state of each
+step is reused as the initial state of the next step.
+Following standard practice for recurrent models with long computation chains, we apply truncated
+gradient propagation [16, 17], as used in recent recursive reasoning models [8, 9, 18]. In our
+implementation, gradients are propagated only through the final transition of each supervision step,
+z_{T-1}^{(n)} \to z_T^{(n)}. This gives the following surrogate objective for each supervision step:
+ \mathcal{L}_{\mathrm{GRAM}}^{(n)}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T^{(n)}, x)\big] - \mathrm{KL}\big(q_\phi(\epsilon_T^{(n)} \mid u_T^{(n)}, y) \,\|\, p_\theta(\epsilon_T^{(n)} \mid u_T^{(n)})\big), (14)
+where z_T^{(n)} is the terminal state of the current supervision step n, and gradients are stopped through
+preceding states. Thus, LGRAM should be viewed as a truncated surrogate objective rather than the
+exact ELBO; it introduces a biased but memory-efficient approximation to the full ELBO. Further
+analysis of this approximation is provided in Appendix A.3, and detailed training hyperparameters
+are listed in Appendix B.2.
+
+2.3 Inference-Time Scaling
+
+GRAM supports two complementary axes of inference-time scaling: depth, by varying the number
+of recursive transitions, and width, by sampling multiple latent reasoning trajectories in parallel.
+For depth, we follow prior recursive reasoning models [8, 9] in adopting adaptive computation
+time (ACT) [8–10], which allows each trajectory to terminate at a learned halting depth (details in
+Appendix A.1). For width, the focus of this section, we draw N trajectories \{\tau^{(i)}\}_{i=1}^{N} \sim p_\theta(\tau \mid x) from the
+learned prior and decode each terminal state into a candidate output \hat{y}^{(i)} = f_{\mathrm{dec}}(z_T^{(i)}), exploring
+multiple stochastic reasoning paths simultaneously rather than extending a single trajectory.
+To select among candidates, we use either majority voting or best-of-N with a Latent Process Reward
+Model (LPRM). The LPRM is a value head vψ (zt ) trained to predict the final quality of a trajectory
+from its latent state, using a regression target r ∈ [0, 1] given by the final prediction accuracy. At
+inference time, majority voting selects the most frequent prediction, whereas LPRM-guided selection
+chooses the candidate with the highest predicted terminal value. Details of LPRM training are
+provided in Appendix A.2. Overall, this procedure improves robustness and solution quality through
+parallel exploration, without increasing the sequential recursion length.
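+The two selection rules can be sketched as follows. The candidates and the value scores are made up
+for illustration; in the paper the scores would come from the LPRM value head, whose training is
+described in Appendix A.2. The function names are hypothetical.

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent candidate; ties go to the candidate seen first."""
    counts = Counter(candidates)
    return counts.most_common(1)[0][0]

def best_of_n(candidates, values):
    """Return the candidate whose predicted terminal value is highest."""
    best_index = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best_index]

# Four decoded candidates (tuples so they are hashable) and hypothetical value scores.
candidates = [(1, 2, 3), (1, 2, 4), (1, 2, 3), (2, 2, 3)]
values = [0.70, 0.95, 0.60, 0.10]

voted = majority_vote(candidates)        # picks the candidate that appears twice
picked = best_of_n(candidates, values)   # picks the candidate with score 0.95
```

+The example shows that the two rules can disagree: voting favours the repeated output, while
+value-guided selection can pick an output that was sampled only once.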
+
+3 Related Work
+Latent Reasoning. Latent reasoning aims to reduce the inefficiency and verbosity of explicit
+Chain-of-Thought (CoT) by shifting part or all of the reasoning process into latent or continuous
+
+[Figure 3 bar charts omitted. Panels: Sudoku, ARC-AGI-1, ARC-AGI-2; y-axis: Accuracy (%). Groups: Recursive Models (Looped TF, HRM, TRM), GRAM (Ours), and LLMs (o3-mini-high, GPT 5.2 (low), Grok-4-thinking).]
+
+Figure 3: Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently
+outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent
+transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are
+omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores are included
+only as external reference points for benchmark difficulty.
+
+representations [1–6]. By avoiding token-by-token generation of intermediate steps, such representa-
+tions can make reasoning traces more compact and reduce generation overhead. Existing approaches
+instantiate this idea through hidden states, latent or soft tokens, continuous thoughts, internal rea-
+soning traces, and recursive state updates for scaling test-time computation [4, 7, 19–23, 18, 24–26].
+However, many remain organized around autoregressive sequence generation, where additional
+computation is tied to generating more tokens, latent positions, or sequential reasoning states.
+Recursive Architectures. Recursive architectures perform iterative state updates and have evolved
+from RNNs to weight-sharing Transformers with adaptive computation [7, 10, 27–32, 25]. Recent
+recursive reasoning models show that increasing inference-time depth can outperform larger static
+models [8, 9, 18, 24]. GRAM builds on this line but formulates recurrence as a probabilistic process:
+instead of following a single deterministic refinement path, it maintains stochastic latent trajectories,
+enabling multi-path exploration and generative sampling.
+Probabilistic Latent State-Space Models. Probabilistic recurrent models use stochastic latent
+transitions to capture uncertainty and multimodal dynamics, often trained with variational infer-
+ence [33–38]. They have been widely used in sequential generative modeling, video prediction,
+and model-based reinforcement learning. GRAM shares this latent state-space view but reinterprets
+stochastic dynamics as computation rather than temporal observation modeling: latent transitions
+define possible reasoning trajectories, supporting multi-hypothesis exploration and both conditional
+pθ (y | x) and unconditional pθ (x) generation.
+
+
+4 Experiments
+
+GRAM is designed as an architecture for probabilistic recursive reasoning, rather than as a general-
+purpose large language reasoning model whose training data, inference budgets, prompting strategies,
+tool use, and external scaffolding are not directly comparable. Following prior work on recurrent and
+recursive reasoning models [8, 9], we therefore evaluate GRAM on standard structured reasoning
+tasks that probe the computational properties targeted by our formulation: iterative latent refinement,
+stochastic trajectory exploration, multi-solution coverage, and inference-time scaling.
+In the following, we first evaluate structured reasoning performance on Sudoku-Extreme and ARC-
+AGI (Section 4.1). We then assess multi-solution behavior on N-Queens and Graph Coloring
+(Section 4.2). Next, we examine the unconditional generative interpretation of GRAM on binarized
+MNIST (Section 4.3). Finally, we perform ablation studies to evaluate the impact of key design
+choices (Section 4.4).
+
+4.1 Challenging Puzzle Tasks
+
+[Figure 4 plots omitted. Left: accuracy vs. iterations (8, 16, 32, 128, 320) for Looped TF, HRM, TRM, and GRAM (Ours) with N = 1, 5, 10, 20, 50 samples. Right: accuracy vs. number of solutions (2 to 18) for Looped TF, HRM, TRM, and GRAM (N=1).]
+
+Figure 4: (Left) Inference-time scaling on Sudoku-Extreme. While both TRM and GRAM benefit from
+longer recursion (x-axis), GRAM additionally scales with parallel sampling (N = number of samples). Each
+iteration corresponds to a supervision step; for Looped TF, each one amounts to K× more flat iterations. (Right)
+Accuracy across number of solutions in N-Queens (8 × 8). Conventional deterministic recursive models suffer
+a sharp performance drop as the number of possible solutions increases, whereas GRAM maintains consistent
+performance.
+
+Setup. We evaluate on Sudoku-Extreme [8], which contains 9×9 puzzles with minimal clues requiring
+extensive constraint propagation, and ARC-AGI Challenge [13, 14], which tests abstract
+visual reasoning through few-shot pattern recognition. We compare against direct prediction
+(Transformer [39]) and recursive baselines (Looped TF [7], HRM [8], TRM [9]). Reported large reasoning
+ model results [40] are included as external reference points for benchmark difficulty, rather than
+ as controlled baselines, since their training and inference settings are not directly comparable to
+ task-specific recursive models. For the scaling analysis, all baselines (Looped TF, HRM, TRM) are
+ reproduced under identical settings following Yang et al. [7] and Jolicoeur-Martineau [9].
+ Stochastic Guidance Improves Reasoning. Figure 3 and Table 8 summarize our main results.
+ GRAM consistently outperforms prior recursive models across all benchmarks. We attribute this
+ improvement to the fundamental difference in how reasoning trajectories are utilized. While Looped
+ TF, HRM, and TRM are restricted to learning from a single deterministic path, GRAM leverages
+ stochastic transitions to explore diverse reasoning trajectories. By training on this richer distribution
+ of solution paths, GRAM acquires more robust reasoning capabilities, allowing it to navigate complex
+ problem spaces more effectively than models constrained to a single sequential refinement process.
+ Detailed experimental results, including more state-of-the-art methods, are provided in Appendix D.1.
+ Parallel Sampling Provides a New Test-time Scaling Axis. Figure 4 (left) shows that increasing the
+ number of parallel samples consistently improves performance across all iteration counts. Notably,
+ GRAM with N = 20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations,
+ including TRM (97.0% vs 90.5%), at a comparable computational budget (20 × 16 = 320 supervision steps in total). While deterministic
+ recursive models scale only through sequential refinement, GRAM leverages stochastic transitions to
+ explore multiple reasoning paths in parallel. To select the best trajectory, we employ a Latent Process
+ Reward Model (LPRM) that predicts output correctness (Section 2.3). This parallel scaling bypasses
+ the latency bottlenecks of depth-based scaling while achieving superior performance. Additional
+ analysis on the ARC-AGI Challenge is provided in Appendix D.2.
+
+ 4.2 Multi-solution Puzzle Tasks
+
+Setup. To evaluate whether GRAM can capture diverse solutions, we test on N-Queens (8 × 8,
+10 × 10) and Graph Coloring (8-vertex, 10-vertex) tasks, where multiple valid solutions exist for each
+input. We compare against direct prediction (Transformer [39]), recursive models (Looped TF [7],
+HRM [8], TRM [9]), and generative models (Autoregressive Transformer (AR), MDLM [41]). For
+N-Queens, we report accuracy (whether the output satisfies all constraints) and coverage (found /
+total valid solutions, with 20 samples). For Graph Coloring, we report conflict edges (number of
+constraint violations; lower is better) instead of accuracy. Detailed configurations are provided in
+Appendix C.2.
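+The three metrics can be made concrete with a short sketch. The board encoding (one column index per
+row), the edge-list graph format, and the function names are assumptions for illustration; the paper's
+exact data format is given in its Appendix C.2 and is not reproduced here.

```python
def queens_valid(cols):
    """cols[r] is the column of the queen in row r; True if no two queens attack each other."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False                                   # shared column or out-of-range index
    diag_down = {r + c for r, c in enumerate(cols)}    # anti-diagonals
    diag_up = {r - c for r, c in enumerate(cols)}      # main diagonals
    return len(diag_down) == n and len(diag_up) == n

def coverage(samples, valid_solutions):
    """Fraction of the valid solutions that appear at least once among the samples."""
    found = {s for s in samples if s in valid_solutions}
    return len(found) / len(valid_solutions)

def conflict_edges(edges, colors):
    """Number of edges whose two endpoints received the same color (lower is better)."""
    return sum(1 for a, b in edges if colors[a] == colors[b])

# 4-Queens has exactly two solutions; the samples below recover one of them.
solutions_4 = {(1, 3, 0, 2), (2, 0, 3, 1)}
samples = [(1, 3, 0, 2), (1, 3, 0, 2), (0, 1, 2, 3)]
acc = sum(queens_valid(s) for s in samples) / len(samples)
cov = coverage(samples, solutions_4)

# Triangle graph colored with two colors: exactly one edge is violated.
n_conflicts = conflict_edges([(0, 1), (1, 2), (0, 2)], [0, 0, 1])
```

+In this toy case two of the three samples are valid but identical, so accuracy is 2/3 while coverage is
+only 1/2, which is why the two quantities are reported separately.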
+ Deterministic Recursion Fails on Multi-Solution Tasks. Table 1 reveals that deterministic recursive
+ models structurally cannot capture multiple solutions, with coverage at most 36.1% across all tasks.
+ Figure 4 (right) further illustrates this limitation: as the number of valid solutions increases, all three
+ deterministic recursive baselines exhibit sharp accuracy degradation, whereas GRAM maintains
+ consistent performance regardless of solution count. This confirms that deterministic latent updates
+
+Table 1: Evaluation on N-Queens and Graph Coloring benchmarks. Rec. and Gen. indicate whether the
+model uses recursive computation and generative sampling, respectively. Values are mean ± standard deviation
+over runs. Accuracy: single-sample (%). Conflict: constraint-violating edges (↓). Coverage: unique valid
+solutions discovered with 20 samples (%).
+
+                                           N-Queens 8×8          N-Queens 10×10        Graph Coloring 8-vertex   Graph Coloring 10-vertex
+Method                  Rec. Gen. # Params Accuracy   Coverage   Accuracy   Coverage   Conflict↓    Coverage     Conflict↓    Coverage
+Direct Pred (8 layers)  ✗    ✗    27M      40.4±1.1   13.7±1.1   13.6±0.5   1.6±0.2    179.3±4.0    19.9±0.2     198.7±5.0    6.7±0.1
+Direct Pred (32 layers) ✗    ✗    100M     40.2±1.3   13.6±1.1   13.1±0.4   1.6±0.2    174.0±18.0   19.1±1.7     227.7±34.5   6.5±1.9
+Looped TF               ✓    ✗    7M       68.4±3.7   23.6±1.9   50.0±7.6   6.2±3.2    136.0±16.1   20.5±1.5     157.3±9.0    7.2±0.7
+HRM                     ✓    ✗    27M      78.7±2.9   26.7±1.3   37.4±0.3   4.7±0.1    109.7±1.5    21.8±0.3     164.3±21.6   8.9±1.7
+TRM                     ✓    ✗    7M       66.8±5.7   36.1±22.5  17.5±11.2  2.0±1.3    109.3±3.1    22.3±0.6     170.7±17.9   6.8±0.3
+AR                      ✗    ✓    10.6M    96.3±1.0   84.8±0.8   90.0±2.2   53.2±0.8   19.0±11.3    83.0±0.7     61.3±8.3     40.0±0.3
+MDLM                    ✗    ✓    12.6M    96.1±1.5   87.2±0.6   74.3±6.6   47.4±2.2   2.7±0.6      84.5±4.0     12.0±7.0     48.2±1.4
+GRAM (Ours)             ✓    ✓    10M      99.7±0.3   90.3±1.9   89.7±2.7   57.5±3.4   2.7±2.1      85.8±0.5     3.3±1.5      51.3±2.8
+
+
+
+
+cause mode collapse when multiple valid outputs exist for the same input. Additional coverage
+analysis is provided in Appendix D.3.
+Recursive Refinement Yields Sharper Constraint Satisfaction. While generative models (AR,
+MDLM) achieve high coverage, GRAM consistently attains higher accuracy with comparable diver-
+sity. On N-Queens, GRAM reaches 99.7% accuracy versus 96.3% (AR) and 96.1% (MDLM). The
+gap is more pronounced on Graph Coloring, where GRAM reduces conflict edges to 2.7 and 3.3 on 8-
+and 10-vertex tasks, compared to 19.0 and 61.3 for AR. This demonstrates that recursive refinement
+enables stricter constraint satisfaction than generative sampling alone.
+
+4.3 Exploring GRAM as an Unconditional Generator
+
+Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we
+evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards,
+evaluated by the fraction of generated boards satisfying Sudoku constraints) and image generation on
+binarized MNIST [15], where pixel values are thresholded to 0 or 1 (evaluated by Inception Score (IS) [42]
+and FID [43]). In both cases, the input is replaced by an empty conditioning signal and the model samples
+an output from its learned prior. Baselines include D3PM [44], a discrete diffusion model, on both tasks,
+and additionally a VAE [45] trained with binary reconstruction loss on MNIST. To ensure a fair comparison
+with existing literature, FID is calculated using real samples from the original standard MNIST.
+
+Table 2: Unconditional generation results on binarized MNIST. We report IS (↑) and FID (↓).
+For iterative models, a step corresponds to a supervision step for TRM and GRAM, and a denoising
+step for D3PM. FID is calculated using real samples with original pixel values (0–255).
+
+Method               IS (↑)   FID (↓)
+VAE                  1.70     86.28
+D3PM (1000 steps)    1.86     74.03
+TRM (16 steps)       1.00     303.29
+GRAM (Ours)
+  8 steps            1.85     84.08
+  16 steps           1.89     77.79
+  32 steps           1.91     76.65
+  64 steps           1.95     75.39
+  128 steps          1.99     74.30
+  256 steps          2.04     73.34
+
+Figure 5: Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM
+achieves higher validity than D3PM with substantially fewer parameters and steps.
+
+Generative Behavior Beyond Reasoning. GRAM extends from conditional reasoning to unconditional
+generation in two different domains. On Sudoku generation (Figure 5), GRAM produces valid boards
+with 99.05% validity using 10.9M parameters and 16 supervision steps, surpassing D3PM baselines
+that use up to 55.1M parameters and 1000 denoising steps. Figure 7 shows qualitative examples,
+illustrating that the model produces diverse, fully valid boards from empty inputs without any explicit
+constraint checker. On MNIST (Table 2), the deterministic baseline TRM exhibits mode collapse (FID
+303.29), whereas GRAM produces recognizable digits with IS and FID comparable to D3PM. Together,
+these results indicate that GRAM’s stochastic latent
+
+[Figure 6 images omitted. (a) Generation Process: rows D3PM (t = 0 to 1000), TRM and GRAM (t = 0 to 16); (b) Samples 1 to 4.]
+
+Figure 6: Visualization of the generation process and samples. (a) The generation process over recursion
+steps. Each row corresponds to a different model. GRAM (bottom) progressively refines the generated image
+through recursive latent updates, correcting initial errors. (b) Unconditional generated samples from each model.
+
+Table 3: Ablation study on Sudoku-Extreme and N-Queens (8 × 8). We evaluate with 5 samples. For (a),
+components are added cumulatively to the Looped TF baseline (DS = deep supervision, HR = hierarchical
+recursion, SG = stochastic guidance). For (b), both stochasticity and learned guidance are essential: removing
+either significantly degrades performance.
+
+(a) Architecture Ablation.
+Model variant               Sudoku          N-Queens
+base (Looped TF)            61.25           71.30
++ DS + HR (=HRM, TRM)       55.00 / 87.40   80.70 / 72.90
++ SG                        65.64           86.30
++ DS + SG                   73.90           100.00
++ DS + HR + SG (=GRAM)      93.96           99.69
+
+(b) Mechanism Ablation.
+Model variant               Sudoku          N-Queens
+GRAM (ours)                 93.96           99.69
+w/o stochastic guidance     82.87           72.91
+stochasticity only          94.88           50.27
+guide only                  0.00            0.00
+w/ direct prediction        63.43           61.44
+TRM w/ stochastic decoder   82.87           71.66
+TRM w/ random init.         78.53           71.82
+
+ transitions support generative modeling beyond symbolic reasoning, with constraint satisfaction
+ emerging as a natural byproduct of the recursive generative process.
+ Inference-Time Scaling Transfers to Generation. Table 2 further shows that increasing recursion
+ at inference improves generation quality monotonically (IS 1.85 → 2.04, FID 84.08 → 73.34 from 8
+ to 256 steps), even though training uses only 16 steps. This indicates that the iterative-refinement
+ advantage of recursive models carries over into the generative regime. Figure 6 visualizes this process;
+ additional samples are in Section D.4.
+
+ 4.4 Ablation Study
+
+ We ablate key design choices of GRAM on Sudoku-Extreme and N-Queens (8 × 8) using 5 samples.
+ Table 3 summarizes the results.
+ Stochastic Guidance Provides Consistent Gains Across Architectures. Table 3a shows that
+ stochastic guidance (SG) improves performance regardless of the underlying architecture: SG alone
+ lifts the flat Looped TF baseline, and combining SG with deep supervision already reaches 100%
+ on N-Queens. The full GRAM (with hierarchical recursion on top) achieves the best results overall
+ (93.96% / 99.69%). While the effect of hierarchical recursion is task-dependent, SG yields consistent
+ gains in every configuration, supporting our design of stochastic guidance as the core extension
+ introduced by GRAM.
+ Both Stochasticity and Guidance Are Essential. We ablate each component by modifying the
+ learned distribution ϵt ∼ N (µθ , σθ2 I) in Equation (4). Removing guidance (N (0, σθ2 I)) maintains
+ Sudoku performance (94.88%), indicating that stochasticity alone can enable diverse reasoning paths.
+ However, this variant collapses on N-Queens (50.27%), where structured guidance is necessary to
+ navigate multi-solution spaces. Removing stochasticity (N (µθ , 0)) fails completely (0.0% on both
+ tasks), as deterministic guidance conditioned on the target leads to severe overfitting.
+ Naive Stochasticity Does Not Help TRM. We test two simple approaches to add stochasticity to
+ TRM: (1) stochastic decoding, which samples from the output distribution instead of argmax, and
+
+[Figure 7 boards omitted. Rows: D3PM (t = 0, 125, 250, 375, 500, 625, 750, 875, 1000) and GRAM (iter = 0, 1, 2, 4, 6, 8, 10, 13, 16).]
+
+Figure 7: Qualitative examples of unconditional Sudoku generation by GRAM. Each board is independently
+sampled from an empty grid using the learned prior. GRAM produces diverse, complete boards satisfying all
+row, column, and box constraints, without an explicit constraint checker or search procedure. Incorrect digits are
+highlighted in red.
+
+
+
+(2) random initialization, which samples z0 from a Gaussian N (0, I) at each inference. Neither
+improves performance, demonstrating that GRAM’s gains stem from the variational framework rather
+than mere randomness.
+
+5 Conclusions and Limitations
+We introduced GRAM, a generative framework that transforms deterministic recursive architectures
+into probabilistic generative models capable of modeling both p(y | x) and p(x) via recursive amor-
+tized variational inference. For reasoning problems, introducing stochasticity into latent transitions
+enables diverse solution discovery and improved exploration compared to deterministic counterparts.
+Notably, we demonstrate that GRAM can leverage width-based inference-time scaling as a complement
+to depth: sampling multiple latent trajectories in parallel bypasses the latency bottleneck of
+depth-only scaling. Our ablations further show that stochastic guidance is a general-purpose extension
+that improved every recursive architecture we tested, and that the gains stem specifically from
+the variational framework rather than from mere randomness: naive stochastic alternatives applied to
+TRM yield no improvement.
+Beyond solution-seeking, GRAM also demonstrates potential as an unconditional generative model
+through recursion-based generation over inputs, with generation quality improving monotonically
+with recursive depth even beyond training-time steps. This suggests new directions for generative
+modeling via hierarchical recursion. Despite these strengths, the sequential nature of deep supervision
+limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM
+toward larger foundation models.
+
+ Acknowledgment
+This research was supported by the Brain Pool Plus Program (No. 2021H1D3A2A03103645) and
+the GRDC (Global Research Development Center) Cooperative Hub Program (RS-2024-00436165)
+through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and
+ICT (MSIT). This work was also supported by the Institute of Information & Communications
+Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-
+2024-00509279, Global AI Frontier Lab) and by the NYU-KAIST Global Innovation and Research
+Institute. Minsu Kim acknowledges funding from the KAIST Jang Young Sil Fellow Program. We
+are especially grateful to Gyubin and Seungju for their non-trivial contributions, and we thank the
+members of the MLML for valuable discussions and feedback throughout this project.
+
+Broader Impacts
+GRAM studies probabilistic recursive reasoning for structured reasoning and generation. By maintaining
+multiple latent trajectories, it may benefit tasks such as constraint satisfaction and scientific
+problem solving, where uncertainty and multiple valid solutions are common. It also suggests a way
+to improve reasoning through inference-time computation rather than parameter scaling alone. Its
+generality also entails risks: plausible but invalid generations may be mistaken for verified solutions
+in downstream decision-making pipelines, and multi-sample inference may increase computational
+and energy costs at scale. Since our experiments focus on controlled benchmarks, deployment in
+real-world or high-stakes settings would require rigorous validation, uncertainty calibration, and
+domain-specific safeguards.
+
+References
+ [1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+ Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+ Advances in neural information processing systems, 35:24824–24837, 2022.
+
+ [2] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
+ Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
+ vances in neural information processing systems, 36:11809–11822, 2023.
+
+ [3] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
+ Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph
+ of thoughts: Solving elaborate problems with large language models. In Proceedings of the
+ AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.
+
+ [4] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+ Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.06769, 2024.
+
+ [5] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+ by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+ arXiv:2505.12514, 2025.
+
+ [6] Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh
+ Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and
+ reasoning. arXiv preprint arXiv:2505.23648, 2025.
+
+ [7] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers
+ are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
+
+ [8] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+ Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+
+ [9] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+ preprint arXiv:2510.04871, 2025.
+
+ [10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+ versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
+[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
+[13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+[14] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
+ recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
+[16] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of
+ recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
+[17] Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv
+ preprint arXiv:1705.08209, 2017.
+[18] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold-
+ son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute
+ with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
+[19] Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation
+ beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025.
+[20] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong
+ Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous
+ concept space. arXiv preprint arXiv:2505.15778, 2025.
+[21] Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens,
+ hard truths. arXiv preprint arXiv:2509.19170, 2025.
+[22] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi:
+ Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint
+ arXiv:2502.21074, 2025.
+[23] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu
+ Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv
+ preprint arXiv:2505.18454, 2025.
+[24] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+ Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+ recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[25] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-
+ thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint
+ arXiv:2310.10845, 2023.
+[26] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster.
+ Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint
+ arXiv:2410.20672, 2024.
+[27] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
+ 1735–1780, 1997.
+[29] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
+ Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
+ decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+
+ [30] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
+ Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
+ preprint arXiv:1909.11942, 2019.
+[31] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv
+ preprint arXiv:1910.10073, 2019.
+[32] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+ arXiv:1603.08983, 2016.
+[33] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua
+ Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.
+ org/abs/1506.02216.
+[34] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural
+ models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571.
+[35] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https:
+ //arxiv.org/abs/1511.05121.
+[36] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
+ James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https:
+ //arxiv.org/abs/1811.04551.
+[37] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
+ discrete world models. arXiv preprint arXiv:2010.02193, 2020.
+[38] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
+ through world models. arXiv preprint arXiv:2301.04104, 2023.
+[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
+ processing systems, 30, 2017.
+[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI
+ benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22.
+[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu,
+ Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
+ models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
+[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
+ 2018.
+[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
+ Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
+ neural information processing systems, 30, 2017.
+[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg.
+ Structured denoising diffusion models in discrete state-spaces. Advances in neural information
+ processing systems, 34:17981–17993, 2021.
+[45] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
+ arXiv:1312.6114, 2013.
+[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+ Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+[47] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+[48] Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in
+ discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024.
+[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings
+ of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
+
+[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
+ applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
+[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
+ convolutional neural networks. Advances in neural information processing systems, 25, 2012.
+[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
+ function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
+[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
+ on computer vision (ECCV), pages 3–19, 2018.
+[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
+ arXiv:1711.05101, 2017.
+[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
+ natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models.
+ Advances in neural information processing systems, 33:12438–12448, 2020.
+[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica
+ Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/
+ BF02066689. URL https://doi.org/10.1007/BF02066689.
+[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep
+ learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE
+ 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885.
+ IEEE, 2019.
+[59] Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum
+ Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141.
+[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Com-
+ mun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:
+ 13091446.
+
+ A Additional Method Details
+A.1 Adaptive Computation Time
+
+GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each
+trajectory to terminate at a learned halting depth rather than running for a fixed number of supervision
+steps. We follow the Q-learning formulation introduced by HRM [8] and adopted in TRM [9].
+Halt head. The decoder includes an auxiliary head qψ : RD → R2 that maps the high-level state h
+to two scalar values, qψ (h) = (q halt , q continue ). These are interpreted as estimated Q-values for the
+binary action of halting or continuing computation at the current supervision step.
+Training. The halt head is trained jointly with the main objective via a temporal-difference loss.
+After computing the latent state $z_T^{(n)}$ at the end of supervision step n, we form Q-learning targets:
+
+ • $\hat{q}_n^{\text{halt}} = \mathbb{1}[\hat{y}^{(n)} = y]$, indicating whether decoding the current state would yield a correct
+ prediction.
+ • $\hat{q}_n^{\text{continue}} = \max\left(q_{n+1}^{\text{halt}}, q_{n+1}^{\text{continue}}\right)$, the bootstrapped value of running one more supervision
+ step.
+
+The halt head is trained by regression to these targets:
+ $$\mathcal{L}_{\text{ACT}} = \sum_{n=1}^{N_{\text{sup}}} \left[ \left(q_n^{\text{halt}} - \hat{q}_n^{\text{halt}}\right)^2 + \left(q_n^{\text{continue}} - \hat{q}_n^{\text{continue}}\right)^2 \right]. \quad (15)$$
+
+This auxiliary loss is added to the main training objective and contributes only through the halt head;
+it does not propagate gradients into the recursive core.
+Inference. At inference, computation proceeds one supervision step at a time. After each step
+n, we evaluate $q_\psi(h^{(n)})$ and halt if $q_n^{\text{halt}} > q_n^{\text{continue}}$, returning $\hat{y}^{(n)}$ as the prediction. Otherwise,
+computation continues to the next supervision step, up to a maximum budget of $N_{\text{sup}}^{\max}$ steps. Different
+trajectories sampled in parallel may therefore terminate at different depths, complementing the
+parallel-sampling scheme described in Section 2.3. In practice, we found that using only q halt
+(halting when σ(q halt ) > 0.5, without the continue branch) performs comparably while simplifying
+implementation; our released code uses this variant.
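+
+The halting rule and its regression targets can be sketched in a few lines of Python. This is a minimal
+illustration under stated assumptions, not the released implementation: the function names
+(act_targets, act_loss, halting_step) are hypothetical, Q-values are plain floats, and the handling
+of the target at the last supervision step (which has no step n+1) is left to the caller because the
+text does not specify it.

```python
import math


def sigmoid(x):
    """Logistic function used by the simplified halt-only rule."""
    return 1.0 / (1.0 + math.exp(-x))


def act_targets(is_correct, q_halt_next, q_continue_next):
    """Targets for supervision step n (hypothetical helper).

    q_hat_halt is the indicator that decoding now would be correct;
    q_hat_continue bootstraps from the Q-values at step n + 1.
    """
    q_hat_halt = 1.0 if is_correct else 0.0
    q_hat_continue = max(q_halt_next, q_continue_next)
    return q_hat_halt, q_hat_continue


def act_loss(predictions, targets):
    """Sum of squared errors over supervision steps, as in Equation (15).

    Both arguments are lists of (halt, continue) pairs, one per step.
    """
    total = 0.0
    for (q_halt, q_cont), (t_halt, t_cont) in zip(predictions, targets):
        total += (q_halt - t_halt) ** 2 + (q_cont - t_cont) ** 2
    return total


def halting_step(q_halt, q_continue=None, max_steps=16):
    """Return the 1-indexed supervision step at which a trajectory stops.

    With q_continue given, halt when q_halt > q_continue.
    Without it, use the simplified rule sigmoid(q_halt) > 0.5.
    If no step triggers a halt, stop at the budget.
    """
    budget = min(max_steps, len(q_halt))
    for n in range(budget):
        if q_continue is None:
            stop = sigmoid(q_halt[n]) > 0.5
        else:
            stop = q_halt[n] > q_continue[n]
        if stop:
            return n + 1
    return budget
```

+Because each sampled trajectory calls the halting rule on its own Q-values, parallel trajectories can
+stop at different depths, as described above.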
+
+A.2 Latent Process Reward Model (LPRM)
+
+To rank or select among sampled candidates, we train a value head vψ (zt ) to predict the expected
+accuracy of the final output, conditioned on the current latent state zt . The LPRM is trained jointly
+with the main objective via a regression loss:
+ $$\mathcal{L}_{\text{LPRM}} = \sum_{t=1}^{T} \left(v_\psi(z_t) - r\right)^2, \quad (16)$$
+
+where r ∈ [0, 1] denotes the accuracy of the final prediction for a given trajectory.
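+
+As a rough illustration of how such a value head could be used, the sketch below computes the
+regression loss of Equation (16) for one trajectory and picks the candidate with the highest predicted
+value. The names lprm_loss and select_best are hypothetical, and ranking candidates by the value
+predicted at their final latent state is an assumption; the text does not say which state is scored.

```python
def lprm_loss(values, r):
    """Squared-error loss of Equation (16) for one trajectory.

    values: value-head predictions v(z_t) for t = 1..T.
    r: accuracy of the trajectory's final prediction, in [0, 1].
    """
    return sum((v - r) ** 2 for v in values)


def select_best(candidates):
    """Pick the output whose predicted value is highest.

    candidates: list of (output, predicted_value) pairs, where the
    predicted value is assumed to come from the final latent state.
    """
    best_output, _ = max(candidates, key=lambda pair: pair[1])
    return best_output
```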
+
+A.3 Empirical Validation of the Surrogate Objective
+
+We further analyze the approximation introduced by the surrogate training objective LGRAM used in
+Section 2.2, both qualitatively and empirically.
+Truncation as a gradient approximation. We frame LGRAM as a gradient approximation rather than
+a separate variational objective. The full trajectory-level ELBO (Equation (13)) involves a sum of KL
+terms across all TTotal transitions, and computing its exact gradient requires backpropagation through
+the entire trajectory. To enable training with constant memory, we propagate gradients only through
+the final transition of each supervision step. This is a standard practice in recurrent latent variable
+models with long computation chains: ELBOs over truncated sequences are used, for example, in
+VRNN [33] and SRNN [34], while truncated latent imagination is used in Dreamer-family world
+models [37, 38]. Trading a small gradient bias for training stability via local truncation is therefore
+ well-precedented; what is specific to GRAM is applying this approximation at the level of recursive
+reasoning trajectories rather than temporal sequences.
+Empirical validation. To verify that optimizing LGRAM effectively drives improvement in the full
+variational bound, we compute both quantities on the validation set throughout training. The full
+ELBO LELBO is evaluated as in Equation (13), summing the reconstruction term and KL contributions
+across all TTotal transitions; the surrogate objective is evaluated as the average of $L_{\text{GRAM}}^{(n)}$ over the
+Nsup supervision steps. Figure 8 reports the results on Sudoku-Extreme and N-Queens 8 × 8.
+
+[Figure 8 plot: two panels, Sudoku-Extreme (left) and N-Queens 8 × 8 (right); x-axis: Training Step; y-axis: ELBO; curves: GRAM (training objective) and ELBO (full).]
+Figure 8: Full ELBO LELBO and surrogate objective LGRAM throughout training (plotted as −ELBO,
+smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease
+monotonically over training, indicating that gradient updates of LGRAM consistently improve the full variational
+bound. The two curves do not coincide because LELBO sums KL contributions across all TTotal transitions
+while LGRAM evaluates only the final-step KL of each supervision step; their gap reflects the cumulative KL
+across earlier transitions, not a failure of optimization. The N-Queens plot uses a log scale on the y-axis due to
+the large dynamic range.
+
+Both LELBO and LGRAM improve monotonically throughout training on both tasks. This indicates
+that, despite the truncation, gradient updates of LGRAM effectively drive improvement in the full
+variational bound. Since the ELBO is a lower bound on the data log-likelihood, its consistent
+improvement provides evidence that GRAM optimizes a well-defined likelihood objective,
+even though training relies on the surrogate.
+The gap between the two curves in Figure 8 reflects the structural difference between the two
+quantities — LELBO accumulates KL terms across all transitions while LGRAM evaluates only the
+final-step KL of each supervision step — rather than an optimization failure. This gap is consistent
+with LGRAM being a biased but useful surrogate for LELBO .
+
+B Training and Architecture Details
+B.1 Architecture Details
+
+GRAM consists of three components: Encoder, Recursive Core, and Decoder.
+Encoder. Input tokens are mapped to embeddings via a token embedding layer, optionally concatenated
+with puzzle embeddings (for ARC [13, 14]), and combined with positional encodings
+(RoPE) [46]. The embeddings are scaled by √D and prepended with 16 puzzle embedding tokens [8].
+Recursive Core. The core maintains two latent states: h (high-level) and l (low-level). For each outer
+step, the low-level state is refined K times via l ← fL (l, h + ex ), injecting the input embedding at
+each iteration. The high-level state is then updated via h ← fH (h, l). Both fL and fH share the same
+architecture: a stack of attention and SwiGLU [47] MLP layers. As an exception, for Sudoku tasks we
+use a [SwiGLU + SwiGLU] network for the Recursive Core instead of [Attention + SwiGLU],
+following [9]. To initialize z0 = (h0 , l0 ), we sample once from the standard Gaussian distribution
+N (0, I) and store the sample in the network checkpoint, so the same fixed z0 is reused whenever
+the checkpoint is loaded.
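+
+The two-level update schedule described above can be sketched as a nested loop. This is a minimal
+illustration, not the authors' code: recursive_core is a hypothetical name, the states are scalars
+instead of [B, L, D] tensors, and f_L and f_H are passed in as arbitrary callables standing in for
+the learned networks.

```python
def recursive_core(h, l, e_x, f_L, f_H, K, T):
    """Run T outer (high-level) steps, each with K inner (low-level) steps.

    h, l: high-level and low-level latent states.
    e_x: input embedding, re-injected at every inner iteration.
    f_L, f_H: update functions for the low-level and high-level states.
    """
    for _ in range(T):
        for _ in range(K):
            # Low-level refinement conditioned on the high-level state and input.
            l = f_L(l, h + e_x)
        # One high-level update per outer step, using the refined low-level state.
        h = f_H(h, l)
    return h, l


# Toy usage with additive updates (placeholders for the learned networks).
h_out, l_out = recursive_core(
    h=0.0, l=0.0, e_x=1.0,
    f_L=lambda l, context: l + context,
    f_H=lambda h, l: h + l,
    K=2, T=1,
)
```

+With K = 6 and T = 3, as in the Sudoku configuration, one latent transition makes 18 calls to f_L and
+3 calls to f_H.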
+Decoder. The decoder extracts content tokens from h (excluding puzzle embedding positions)
+and maps them to logits via a SwiGLU MLP head. An auxiliary head predicts halt decisions and
+correctness values from the first token of h.
+
+Table 4: Architecture components.
+Component        Module               Description
+Encoder          Token Embedding      vocab → D
+                 Puzzle Embedding     16 tokens (optional, for ARC)
+                 Position Encoding    RoPE or learned
+Recursive Core   fL , fH              [Attention + SwiGLU] × 2 layers
+                 Iterations           K low-level, T high-level steps
+                 µθ , σθ , µϕ , σϕ    SwiGLU MLP for each parameter
+Decoder          LM Head              Linear(D → vocab)
+                 Q Head               Linear(D → 2) for halt
+                 V Head               Linear(D → 1) for value
+
+
+
+Encoder and Decoder for Image Patches. In the MNIST [15] image generation task, we first
+construct a binarized dataset by normalizing the original discrete pixel values (0 to 255) to the
+continuous range [0, 1] and applying a threshold at 0.5. For the network architecture, we employ a
+convolutional patch encoder, following [48, 49].
+The encoding process proceeds in three stages. First, the discrete input tokens x ∈ {0, 1} are
+normalized to the range [−1, 1]. Second, to capture local spatial dependencies before patchification,
+the normalized image passes through a shallow convolutional encoder. This encoder consists of
+two stacked blocks, where each block comprises a 2D convolution [50, 51] with a 5 × 5 kernel and
+padding 2, a SiLU non-linearity [52], and Group Normalization (GN) [53]. Finally, the resulting
+feature map is divided into non-overlapping patches of size P × P and linearly projected to match
+the model’s hidden dimension D. The detailed architectural specifications and dimension transitions
+are summarized in Table 5.
+
+Table 5: Detailed architecture of the Image Patch Encoder for MNIST. H, W denote image resolution, C input
+channels, P patch size, Np the number of patches, and D the hidden dimension.
+Stage      Layer / Operation                        Output Dim.
+1. Norm.   Input Tokens                             (B, C, H, W)
+           Linear Scaling [−1, 1]                   (B, C, H, W)
+2. Conv    Conv2d 5 × 5 (p = 2), SiLU → GN(32)      (B, D/2, H, W)
+           Conv2d 5 × 5 (p = 2), SiLU → GN(32)      (B, D/2, H, W)
+3. Patch   Flatten Patches                          (B, Np , P² · D/2)
+           Linear Projection                        (B, Np , D)
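+
+The dimension transitions in Table 5 can be checked with a short shape calculation. The sketch below
+is illustrative only: patch_encoder_shapes and binarize are hypothetical names, the convolutions are
+assumed to use stride 1 (so a 5 × 5 kernel with padding 2 preserves H and W), and the strict ">"
+comparison at the 0.5 threshold is an assumption. The patch size P = 2 in the usage lines is inferred
+from the 14 × 14 = 196 MNIST tokens reported elsewhere in this appendix, not stated directly.

```python
def binarize(pixels, threshold=0.5):
    """Map 0..255 pixel values to {0, 1} by scaling to [0, 1] and thresholding."""
    return [[1 if p / 255.0 > threshold else 0 for p in row] for row in pixels]


def patch_encoder_shapes(B, C, H, W, P, D):
    """Return (stage name, output shape) pairs following Table 5."""
    if H % P != 0 or W % P != 0:
        raise ValueError("image size must be divisible by the patch size")
    if D % 2 != 0:
        raise ValueError("hidden dimension must be even")
    n_patches = (H // P) * (W // P)
    return [
        ("input tokens", (B, C, H, W)),
        ("linear scaling to [-1, 1]", (B, C, H, W)),
        ("conv block 1", (B, D // 2, H, W)),
        ("conv block 2", (B, D // 2, H, W)),
        ("flatten patches", (B, n_patches, P * P * (D // 2))),
        ("linear projection", (B, n_patches, D)),
    ]


# MNIST-sized example: 28 x 28 single-channel images, P = 2 (inferred), D = 512.
shapes = dict(patch_encoder_shapes(B=1, C=1, H=28, W=28, P=2, D=512))
```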
+
+
+Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are
+represented as sequences of shape [B, L], where B denotes the batch size and L the context length.
+Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and lt , as well
+as the decoder output, have shape [B, L, D], with embedding dimension D. The Transformer [39]
+backbone uses embedding dimension D = 512, Nhead = 8 attention heads, and FFN hidden dimension
+Dh = 512. Within a recursion step, i.e., a latent transition zt → zt+1 , we use K = 6 low-level (inner)
+steps for Sudoku [8] and K = 4 for all other tasks, with T = 3 high-level (outer) steps.
+
+B.2 Training Details
+
+Task Configuration. All tasks represent inputs and outputs as discrete token sequences (summarized
+in Table 6).
+
+ • For Sudoku [8], the 9×9 grid is flattened row-by-row into 81 tokens with vocabulary size 11
+ (0=pad, 1=blank, 2–10=digits).
+ • For ARC-AGI [13, 14], variable-size grids are padded to a fixed 30×30 canvas with EOS
+ markers, yielding 900 tokens and vocabulary size 12 (0=pad, 1=eos, 2–11=colors); task-
+ specific puzzle embeddings are prepended to distinguish different ARC tasks.
+ • N-Queens flattens an N × N board row-by-row into N 2 tokens with vocabulary size 3
+ (0=pad, 1=empty, 2=queen).
+ • Graph Coloring encodes the strict upper triangle of the adjacency matrix as n(n−1)/2 tokens,
+ using 0=PAD, 1=no-edge, and 2=edge for inputs and 3 + color_id for output colors.
+ • For image generation on MNIST [15], images are quantized and processed via CNN-based
+ patchification [50, 49], with the encoder applying patchify and the decoder unpatchify. The
+ patched input forms a flattened sequence of 14 × 14 = 196 tokens with vocabulary size 3 (0=pad,
+ 1=black, 2=white).
+
+
+ Table 6: Task-specific configurations.
+Task             Seq. Len    Vocab   Puzzle Emb   Encoding
+Sudoku           81          11      ✗            9×9 grid, row-major
+ARC-AGI          900         12      ✓            30×30 padded canvas
+N-Queens         N²          3       ✗            N × N board
+Graph Coloring   n(n−1)/2    6       ✗            Strict adjacency upper triangle
+MNIST            196         3       ✗            14×14 patches
+
+Training Details. We train all models using AdamW [54] with learning rate 10−4 , weight decay
+1.0, and gradient clipping at 1.0. The global batch size is 768. For stability, we apply exponential
+moving average (EMA) with decay 0.9999, following Brock et al. [55] and Song and Ermon [56].
+To prevent posterior collapse, we use a KL balance [37, 38] coefficient of 0.8. The number of deep
+supervision steps is Nsup = 16 for all tasks. The KL coefficient β is set to 0.1 (Sudoku), 0.04/0.1
+(ARC-AGI-1/2), 0.07/0.045 (N-Queens 8 × 8/10 × 10), 0.5/0.45 (Graph Coloring with 8/10 nodes),
+and 0.07 (MNIST). Task-specific training configurations are summarized in Table 7.
+
+ Table 7: Training configurations on NVIDIA RTX 4090 GPUs.
+ Task Epochs GPUs Time
+ Sudoku 50K 8 2h
+ ARC-AGI 200K 8 5 days
+ N-Queens (8×8) 3K 8 1h
+ N-Queens (10×10) 1K 8 3h
+ Graph Coloring (8 nodes) 5K 8 1.5h
+ Graph Coloring (10 nodes) 5K 8 6h
+ MNIST 1.8K 8 16h
+
+
+
+C Additional Details of Experiment Setup
+C.1 Challenging Puzzle Tasks
+
+C.1.1 Looped TF on ARC-AGI
+We report Looped Transformer [7] results on Sudoku-Extreme but omit them on ARC-AGI due to
+prohibitive training cost. Under the same setup used for our other recursive baselines (200K epochs,
+batch size 768, on 8× NVIDIA RTX Pro 6000 GPUs), training Looped TF on Sudoku-Extreme
+already takes 19 hours, and extrapolating to ARC-AGI — which uses substantially longer sequences
+and a larger training set — suggests approximately 97 days (≈ 776 GPU-days) for a full training run.
+This gap stems from two compounding factors. First, Looped TF lacks deep supervision: HRM,
+TRM, and GRAM perform Nsup gradient updates per trajectory (one per segment), whereas Looped
+TF performs only one update at the end of the full trajectory, slowing convergence. Second, Looped
+ TF lacks adaptive halting such as ACT [8–10], so every input must be processed for the maximum
+recursion depth, increasing per-example sequential compute. Both inefficiencies compound at
+ARC-AGI scale, making a full Looped TF training run impractical.
+
+C.2 Multi-solution Puzzle Tasks
+
+C.2.1 N-Queens Problem
+
+[Figure 9 image: four chessboard panels titled Input, Solution 1, Solution 2, Solution 3.]
+Figure 9: Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the
+full board, leaving 3 queens. The model must find positions for the 5 removed queens. This configuration
+admits exactly 3 valid solutions.
+
+Data Generation Details. The N-Queens problem requires placing N queens on an N × N
+chessboard such that no two queens attack each other—meaning no queens share the same row,
+column, or diagonal. Figure 9 illustrates an example where 5 queens are removed from an 8 × 8
+solution, resulting in a puzzle with 3 distinct valid completions.
+To construct the dataset, we first generated all valid complete N-Queens solutions for N = 8 and
+N = 10. We then created puzzle instances by removing a specific number of queens, treating the
+remaining partial configuration as the input and the original complete board as the target label. To
+generate instances yielding diverse valid completions, we removed k ∈ {5, 6, 7} queens for the 8 × 8
+setting and k ∈ {7, 8, 9} queens for the 10 × 10 setting. The distribution of solution counts for our
+generated dataset is shown in Figure 10.
+For evaluation, we employed an 85:15 train-test split. Crucially, to prevent data leakage and ensure
+the model learns to reason rather than memorize, the split was performed based on unique input
+configurations. This guarantees that no input pattern in the test set appears in the training set. Inputs
+are flattened into discrete 1D sequences x ∈ {0, 1, 2}L , where L = N 2 , along with zero-padded
+puzzle embedding tokens. Vocabulary mapping follows: padding (0), empty (1), and queen (2).
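+
+The generation procedure above can be sketched as follows. This is an illustrative reconstruction
+under stated assumptions, not the authors' pipeline: all function names are hypothetical, solutions
+are enumerated by simple backtracking, the removed queens are chosen uniformly at random, and the
+puzzle embedding tokens are omitted from the token sequence.

```python
import random


def solve_n_queens(n):
    """Enumerate all complete solutions; solution[r] is the queen's column in row r."""
    solutions = []

    def place(row, cols):
        if row == n:
            solutions.append(tuple(cols))
            return
        for c in range(n):
            # A new queen must not share a column or diagonal with earlier rows.
            if all(c != pc and abs(c - pc) != row - pr for pr, pc in enumerate(cols)):
                place(row + 1, cols + [c])

    place(0, [])
    return solutions


def make_puzzle(solution, k, rng):
    """Remove k queens from a full solution; return the kept {row: column} map."""
    removed = set(rng.sample(range(len(solution)), k))
    return {r: c for r, c in enumerate(solution) if r not in removed}


def completions(puzzle, solutions):
    """All full solutions that agree with the queens kept in the puzzle."""
    return [s for s in solutions if all(s[r] == c for r, c in puzzle.items())]


def to_tokens(n, queens):
    """Flatten a board row-by-row using 1 = empty and 2 = queen."""
    tokens = [1] * (n * n)
    for r, c in queens.items():
        tokens[r * n + c] = 2
    return tokens


all_solutions = solve_n_queens(8)
puzzle = make_puzzle(all_solutions[0], k=5, rng=random.Random(0))
valid = completions(puzzle, all_solutions)
x = to_tokens(8, puzzle)
```

+The length of valid is the solution count for that input, which is the quantity histogrammed in
+Figure 10. Splitting train and test by unique puzzle keys, as described above, keeps identical inputs
+out of both sets.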
+
+[Figure 10 plot: two histogram panels, 8x8 N-Queens (left) and 10x10 N-Queens (right); x-axis: Number of solutions; y-axis: Counts.]
+Figure 10: Distribution of the number of valid solutions for generated N-Queens instances. The dataset
+covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+C.2.2 Graph Coloring Problem
+Data Generation Details. The Graph Coloring problem requires assigning one of k colors to each
+node in a graph such that no two adjacent nodes share the same color. We consider graphs with
+N ∈ {8, 10} nodes and use k = 3 colors. Figure 11 illustrates an example instance with N = 8
+nodes and k = 3 colors.
+Graphs are generated using the Erdős–Rényi random graph model [57], following the generation
+pipeline from GNN-GCP [58]. Specifically, for each instance, edges are sampled independently
+with a fixed probability p, producing a symmetric adjacency matrix. We retain only graphs that are
+3-colorable.
+For each graph, we enumerate all valid 3-colorings and retain only canonical forms to eliminate
+redundant solutions under color permutation (e.g., swapping red and blue). This yields a set of
+structurally distinct solutions per input. The distribution of solution counts is shown in Figure 12.
+The final dataset consists of 7,002 training and 255 test instances for N = 8, and 13,465 training and
+192 test instances for N = 10.
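A minimal sketch of the sampling and canonicalization steps, under stated assumptions: the helper names are hypothetical, colorings are enumerated by brute force (feasible for N ≤ 10), and "canonical form" is taken to mean relabeling colors in order of first appearance, which is one common way to remove color-permutation duplicates and may differ from the authors' choice.

```python
import random
from itertools import product


def sample_graph(n, p, rng):
    """Erdos-Renyi sample: each undirected edge (u, v), u < v, is kept with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def canonical(coloring):
    """Relabel colors by order of first appearance, e.g. (2, 0, 2) -> (0, 1, 0)."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in coloring)


def canonical_colorings(n, edges, k=3):
    """All proper k-colorings of the graph, deduplicated up to color permutation."""
    found = set()
    for coloring in product(range(k), repeat=n):
        if all(coloring[u] != coloring[v] for u, v in edges):
            found.add(canonical(coloring))
    return sorted(found)


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
triangle_solutions = canonical_colorings(3, triangle)
path_solutions = canonical_colorings(3, path)
random_edges = sample_graph(8, 0.3, random.Random(0))
```

A graph is 3-colorable exactly when `canonical_colorings` returns a non-empty list, so the same routine can serve as the retention filter.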
+Input and Output Representation. The input graph is represented by extracting the upper triangular
+portion of the adjacency matrix (excluding the diagonal) and flattening it into a 1D sequence. The
+output is a sequence of length N , where each position encodes the assigned color for the corresponding
+node. Vocabulary mapping is as follows: PAD (0), no edge (1), edge (2), and colors (3, 4, 5) for red,
+blue, and green respectively.
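The representation above might be implemented as in the sketch below. The function names are hypothetical, and the row-major traversal of the upper triangle is an assumption, since the text does not state the order.

```python
PAD, NO_EDGE, EDGE = 0, 1, 2   # vocabulary stated in the text
COLOR_OFFSET = 3               # color indices 0, 1, 2 map to tokens 3, 4, 5


def encode_graph(adj):
    """Flatten the strict upper triangle of a symmetric adjacency matrix."""
    n = len(adj)
    return [EDGE if adj[i][j] else NO_EDGE for i in range(n) for j in range(i + 1, n)]


def encode_coloring(coloring):
    """Map per-node color indices to output tokens."""
    return [COLOR_OFFSET + c for c in coloring]


adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]  # a 4-cycle
input_tokens = encode_graph(adj)
output_tokens = encode_coloring([0, 1, 0, 1])
```

For N nodes the input has N(N-1)/2 tokens, so 28 for N = 8 and 45 for N = 10.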
+
+[Figure panels: Input, Solution 1, Solution 2, Solution 3, Solution 4]
+Figure 11: Graph Coloring Example
+
+
+[Plot: two histograms; panels: Vertex 8 Graph Coloring and Vertex 10 Graph Coloring; x-axis: Number of solutions; y-axis: Counts]
+Figure 12: Distribution of the number of valid solutions for generated graph coloring instances. The
+dataset covers a wide range of solution counts, testing the model’s ability to recover multiple valid outputs.
+
+
+D Additional Experiment Results
+D.1 Additional Results on Challenging Puzzle Benchmarks
+
+Table 8 reports test accuracy on three challenging puzzle benchmarks. Here we provide additional
+observations complementing the main text.
+
+GRAM Advances the Recursive-Reasoning Line. Across all three benchmarks, GRAM consis-
+tently outperforms prior recursive baselines (Looped TF, HRM, TRM) while using fewer parameters
+than HRM (10M vs. 27M). The complete failure of direct prediction on Sudoku and ARC-AGI-2 (0%
+in both cases) further confirms that recursive computation is essential for these tasks — single-pass
+models, regardless of capacity, cannot solve them. Together, these results indicate that GRAM’s gains
+arise from how recursive computation is organized (probabilistic, multi-trajectory) rather than from
+increased model capacity.
+
+Sudoku-Extreme Resists Parameter Scaling. All tested large reasoning models (LRMs), including
+Deepseek-R1 (671B), score 0% on Sudoku-Extreme. This suggests that pretrained capacity alone
+does not transfer to constraint-propagation reasoning, and that benchmarks like Sudoku-Extreme
+probe a fundamentally different axis from those captured by general-purpose LRMs. On ARC-AGI,
+more recent LRMs such as Gemini 3 Pro (75.0% on ARC-1, 31.1% on ARC-2) remain substantially
+ahead of all recursive models, highlighting that abstract few-shot reasoning still benefits from scale;
+we view these numbers as benchmark-difficulty reference points rather than controlled baselines.
+
+Table 8: Test accuracy (%) on Challenging Puzzle Benchmarks. GRAM significantly outperforms prior
+recursive models. All recursive model scores were obtained at 16 supervision steps.
+
+Method             #Params   Sudoku   ARC-1   ARC-2
+Large Reasoning Models
+Deepseek-R1        671B      0.0      15.8    1.3
+Claude 3.7 16k     N/A       0.0      28.6    0.7
+o3-mini-high       N/A       0.0      34.5    3.0
+GPT 5.2 (low)      N/A       –        55.7    9.7
+Grok-4-thinking    1.7T      –        66.7    16.0
+Gemini 3 Pro       N/A       –        75.0    31.1
+Recursive Models
+Direct Pred        27M       0.0      21.0    0.0
+Looped TF          7M        61.3     -       -
+HRM                27M       55.0     40.3    5.0
+TRM                7M        87.4     44.6    7.8
+GRAM (Ours)        10M       97.0     52.0    11.1
+Human Results
+Avg. Human         –         –        60.2    –
+Best Human         –         –        98.0    100.0
+
+
+D.2 Scaling with Parallel Sampling on the ARC-AGI Challenge
+
+To investigate the effect of GRAM’s sampling on the ARC-AGI-1 benchmark, we measured perfor-
+mance without relying on external data augmentation. Typically, TRM achieves its reported accuracy
+by generating 1,000 augmentations for a single problem and performing majority voting over the
+results. Because this augmentation process itself creates a wide variety of samples, we isolated
+the specific effect of generative sampling by performing inference solely on the original problem
+instance and conducting majority voting over multiple sampled paths. For a fair comparison, TRM
+was evaluated using the same hyperparameters as GRAM, including the number of epochs, learning
+rate, and the number of layers.
+As illustrated in Figure 13, removing augmentations causes a performance decline for both GRAM
+and TRM compared to the values reported in Table 8. However, in the case of GRAM, we observe
+that accuracy consistently improves as the model generates more parallel samples. This trend mirrors
+observations in Section 4.2, suggesting that increased inference-time compute through width scaling
+allows the model to explore more plausible reasoning trajectories and recover from initial errors,
+eventually leading to more robust solution discovery.
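The voting step over sampled paths can be illustrated with a short sketch. It assumes each sample is an output grid given as a list of rows and that ties are broken in favor of the grid seen first; `majority_vote` is a hypothetical helper, not code from GRAM or TRM.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent output grid and its vote count."""
    counts = Counter(tuple(map(tuple, grid)) for grid in samples)
    winner, votes = counts.most_common(1)[0]
    return [list(row) for row in winner], votes


samples = [
    [[1, 2], [3, 4]],
    [[1, 2], [3, 0]],   # a sampled path that ends in an error
    [[1, 2], [3, 4]],
]
prediction, votes = majority_vote(samples)
```

With more parallel samples, an occasional erroneous path is outvoted as long as the correct grid remains the single most frequent output.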
+
+Interaction between Augmentation and Sampling. A natural question arises: why not combine
+higher levels of augmentation with extensive parallel sampling? To address this, we conducted an
+ablation study examining the interaction between data augmentation and inference-time sampling.
+Figure 14 presents the results across varying augmentation levels (Aug=0 to Aug=50). Without
+augmentation (Aug=0), increasing the number of samples yields consistent accuracy improvements,
+demonstrating that stochastic sampling effectively explores diverse reasoning trajectories. However,
+as the level of augmentation increases, the marginal benefit of additional sampling diminishes
+substantially. At Aug=50, performance saturates regardless of sample count—accuracy remains
+nearly constant whether we draw 1 or 50 samples. This observation reveals that augmentation
+and sampling serve complementary rather than additive roles: both mechanisms enable the model
+to capture solution diversity, but through different means. When training data is limited, parallel
+sampling compensates by exploring varied reasoning paths at inference time. When training data
+is abundant through augmentation, the model has already internalized sufficient diversity during
+training, rendering additional inference-time exploration redundant. Consequently, scaling sampling
+beyond augmentation provides diminishing returns, justifying our experimental design choice to
+evaluate these two scaling axes separately.
+
+[Plot: Accuracy vs. Number of Samples (N), N from 1 to 500; legend: TRM, GRAM ( =0.1), GRAM ( =0.05), GRAM ( =0.04)]
+Figure 13: Effect of sampling on ARC-AGI-1 without data augmentation. To isolate the internal sampling
+effect, both models are evaluated on original problem instances without 1,000 augmentations. While removing
+augmentations causes an initial performance drop, GRAM exhibits robust scaling through generative sampling
+as the number of parallel samples N increases, outperforming the TRM baseline.
+
+
+[Plot: Accuracy vs. Number of Samples (N), N from 1 to 500; legend: Aug=0, Aug=5, Aug=10, Aug=50]
+Figure 14: Effect of augmentation on sampling efficiency. With limited augmentation (Aug=0), parallel
+sampling provides consistent gains. As augmentation increases, sampling benefits diminish: at Aug=50,
+performance saturates regardless of sample count, suggesting augmentation and sampling serve complementary
+roles in capturing solution diversity.
+
+
+
+D.3 Solution Coverage Analysis
+
+We analyze the ability of GRAM to capture the diversity of the solution space compared to determin-
+istic baselines. Figure 15 presents the solution coverage on 8 × 8 and 10 × 10 N-Queens tasks with
+respect to the total number of valid ground-truth solutions.
+As shown in Figure 15, deterministic recursive models (HRM and TRM) exhibit a sharp decline in
+coverage as the number of possible solutions increases. Since these models are constrained to a single
+fixed reasoning trajectory, they structurally fail to explore alternative paths, resulting in severe mode
+collapse in multi-solution landscapes.
+In contrast, GRAM effectively leverages its generative latent transitions to cover a broader range
+of solutions. As the number of parallel samples N increases (from 1 to 20), the solution coverage
+improves monotonically across both 8 × 8 and 10 × 10 settings. This empirical evidence confirms
+that GRAM’s stochastic guidance mechanism is essential for navigating complex problem spaces
+where multiple valid reasoning paths exist.
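The text does not give a formula for coverage, so the sketch below assumes one plausible definition: the fraction of distinct ground-truth solutions that appear at least once among the N sampled outputs. The function name and the toy data are hypothetical.

```python
def solution_coverage(samples, ground_truth):
    """Fraction of ground-truth solutions recovered by at least one sample."""
    truth = {tuple(s) for s in ground_truth}
    found = {tuple(s) for s in samples} & truth
    return len(found) / len(truth)


ground_truth = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (2, 1, 0)]

# A deterministic model returns the same answer however often it is queried.
deterministic = [(0, 1, 2)] * 20

# A stochastic model may return different valid answers, plus some invalid ones.
stochastic = [(0, 1, 2), (1, 0, 2), (1, 0, 2), (2, 2, 2)]

deterministic_coverage = solution_coverage(deterministic, ground_truth)
stochastic_coverage = solution_coverage(stochastic, ground_truth)
```

Under this definition a single-trajectory model is capped at 1/S for a puzzle with S solutions, which is consistent with the decline described for HRM and TRM.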
+
+
+D.4 Additional Generated Image Samples
+
+In this section, we provide further qualitative results demonstrating GRAM’s capability in uncon-
+ditional image generation. Figure 16 presents a diverse set of samples generated on the binarized
+MNIST dataset, visualized across the recursive inference steps t = 0 to t = 16.
+
+
+[Plot: Coverage vs. Number of Solutions; panels: (a) N-Queens 8 × 8, (b) N-Queens 10 × 10; legend: HRM, TRM, GRAM (N=1), GRAM (N=5), GRAM (N=10), GRAM (N=20)]
+Figure 15: Solution coverage analysis on N-Queens (8 × 8 and 10 × 10) with respect to the number of
+ground-truth solutions. While deterministic baselines (HRM, TRM) suffer from mode collapse as the solution
+space grows, GRAM demonstrates monotonic improvement in coverage as the number of parallel samples N
+increases.
+
+As observed in the main text, GRAM exhibits a distinct progressive refinement behavior. Starting
+from a black initialization, the model iteratively adds details and sharpens the structure of the digit.
+A particularly compelling property of this process is the model’s ability to recover from initially
+ambiguous or incorrect formations.
+For instance, in the second row (generating the digit ’2’) and the last row (generating the digit ’1’),
+the early predictions at t = 1 and t = 2 manifest as disjointed artifacts or incorrect shapes. However,
+as the recursion proceeds, GRAM effectively leverages its feedback loop to correct these initial errors,
+resolving the ambiguity and converging to a coherent, high-quality digit by t = 16.
+
+
+
+
+[Image grid: columns show recursive inference steps t=0, t=1, t=2, t=4, t=6, t=7, t=9, t=11, t=12, t=14, t=16]
+Figure 16: Additional generated samples from GRAM. We provide 8 additional samples generated
+unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample,
+visualized across its recursive refinement process.
+
+D.5 Additional Experiment Results on Unconditional Sudoku Generation
+
+In this section, we provide additional details on unconditional Sudoku generation. Unlike the
+conditional Sudoku-solving setting, where the input board contains given clues, the model receives an
+entirely blank board and samples a complete 9 × 9 Sudoku board from its learned prior. We evaluate
+each generated board using the standard Sudoku validity criterion: every row, column, and 3 × 3 box
+must contain the digits 1 through 9 exactly once. We report the validity rate over 100K generated
+boards. To check whether high validity comes from repeatedly producing the same board, we also
+compute the fraction of unique boards among valid samples.
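The two metrics described above can be sketched in a few lines. The function names are hypothetical; the valid and invalid boards are the two samples shown in Figure 17.

```python
def is_valid_sudoku(board):
    """True if every row, column, and 3 x 3 box contains the digits 1-9 exactly once."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


def validity_and_uniqueness(boards):
    """Validity rate over all boards, and fraction of unique boards among valid ones."""
    valid = [tuple(map(tuple, b)) for b in boards if is_valid_sudoku(b)]
    validity = len(valid) / len(boards)
    unique_fraction = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unique_fraction


VALID = [
    [3, 6, 5, 4, 7, 8, 9, 2, 1], [9, 4, 1, 2, 5, 6, 8, 7, 3], [2, 7, 8, 9, 1, 3, 6, 5, 4],
    [5, 2, 9, 8, 4, 7, 3, 1, 6], [7, 3, 6, 1, 9, 2, 5, 4, 8], [1, 8, 4, 3, 6, 5, 2, 9, 7],
    [8, 9, 7, 5, 3, 4, 1, 6, 2], [4, 1, 3, 6, 2, 9, 7, 8, 5], [6, 5, 2, 7, 8, 1, 4, 3, 9],
]
INVALID = [
    [4, 3, 8, 2, 9, 1, 7, 6, 5], [5, 2, 1, 7, 3, 6, 4, 9, 8], [7, 6, 9, 4, 5, 8, 1, 2, 3],
    [9, 1, 3, 4, 4, 7, 5, 8, 6], [2, 5, 4, 6, 8, 9, 3, 1, 7], [6, 8, 7, 5, 1, 3, 2, 4, 9],
    [3, 7, 2, 9, 6, 4, 8, 5, 1], [1, 9, 5, 3, 2, 8, 6, 7, 4], [8, 4, 6, 1, 7, 5, 9, 3, 2],
]
validity, unique_fraction = validity_and_uniqueness([VALID, VALID, INVALID])
```

Because a row of nine cells equals the set {1, ..., 9} only when no digit repeats, comparing sets is enough to enforce "exactly once".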
+For GRAM, we construct the unconditional training set from Sudoku-Extreme [8], the Sudoku
+benchmark used by HRM and TRM. We sample 50K complete solutions from the original training
+split, discard the clue patterns, and use an all-blank board as input with the complete solution as
+the target. No data augmentation is used. We train GRAM on this derived 50K-solution set for 200
+epochs with learning rate 10−4 , EMA decay 0.999, and KL coefficient 0.05. The resulting model
+contains 10.9M parameters and uses 16 inference steps.
+For D3PM baselines, we use a DiT-style Transformer backbone and evaluate two model sizes. D3PM-
+Big uses hidden dimension 768, 5 Transformer blocks, and 12 attention heads, yielding 55.1M
+parameters, while D3PM-Small uses hidden dimension 512, 3 Transformer blocks, and 8 attention
+heads, yielding 15.9M parameters. Both variants are trained on the same derived training set and
+generate boards with 1000 denoising steps.
+As shown in Table 9, GRAM achieves 99.05% validity, outperforming all D3PM baselines. The
+strongest D3PM baseline, D3PM-Uniform (Big), reaches 91.33% validity while using 55.1M parame-
+ters and 1000 denoising steps. In contrast, GRAM uses fewer parameters and only 16 inference steps.
+In all cases, the valid samples are unique under exact board matching, indicating that the reported
+validity is not due to simple repetition of a small set of boards. These results show that GRAM can
+generate highly constrained symbolic structures from an empty input, supporting its potential as a
+generator beyond conditional puzzle solving.
+Figure 17 illustrates the unconditional Sudoku generation setup. Starting from an empty board,
+the task is to generate complete boards, and validity is determined by whether the generated board
+satisfies all Sudoku constraints. Figure 7 shows qualitative examples of boards generated by GRAM.
+
+Table 9: Unconditional Sudoku generation. We report the ratio of generated boards satisfying Sudoku
+constraints over 100K samples. All valid boards are unique for all methods in this evaluation.
+Method                 #Params   Steps   Validity (%)
+D3PM-Uniform (Big)     55.1M     1000    91.33
+D3PM-Uniform (Small)   15.9M     1000    29.24
+D3PM-Absorb (Big)      55.1M     1000    79.18
+D3PM-Absorb (Small)    15.9M     1000    21.88
+GRAM (Ours)            10.9M     16      99.05
+
+
+Empty input: blank 9 × 9 board
+
+Valid sample             Invalid sample
+3 6 5 4 7 8 9 2 1        4 3 8 2 9 1 7 6 5
+9 4 1 2 5 6 8 7 3        5 2 1 7 3 6 4 9 8
+2 7 8 9 1 3 6 5 4        7 6 9 4 5 8 1 2 3
+5 2 9 8 4 7 3 1 6        9 1 3 4 4 7 5 8 6
+7 3 6 1 9 2 5 4 8        2 5 4 6 8 9 3 1 7
+1 8 4 3 6 5 2 9 7        6 8 7 5 1 3 2 4 9
+8 9 7 5 3 4 1 6 2        3 7 2 9 6 4 8 5 1
+4 1 3 6 2 9 7 8 5        1 9 5 3 2 8 6 7 4
+6 5 2 7 8 1 4 3 9        8 4 6 1 7 5 9 3 2
+
+
+Figure 17: Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate
+complete Sudoku boards. The valid sample satisfies all Sudoku constraints, while red entries in the invalid
+sample indicate cells involved in constraint violations.
+
+
+
+D.6 Visualizing Latent Recursion Process
+
+To understand how stochastic guidance shapes reasoning, we visualize latent trajectories during
+recursive computation. Specifically, we track the high-level state h at each supervision step throughout
+the recursion process. For visualization, we project these latent vectors into 2D using PCA [59] and
+interpolate unobserved states via K-D tree [60] to construct a continuous loss landscape.
+Figures 18 and 19 compare TRM and GRAM on the same Sudoku puzzle. TRM follows a single
+deterministic path from initialization to solution, offering no mechanism to escape if the trajectory
+enters a suboptimal region. In contrast, GRAM samples diverse trajectories that explore different
+regions of latent space before converging. While some trajectories become trapped in local minima
+(bright yellow regions), others successfully navigate toward the global optimum (dark blue regions).
+This diversity enables GRAM to discover valid solutions more reliably through parallel exploration.
+
+
+
+
+Figure 18: Latent reasoning trajectory of TRM. The red dot indicates the initial state h0 and the green dot
+indicates the final state hT . Background color represents the loss landscape: bright yellow corresponds to high
+loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with
+no ability to escape suboptimal trajectories.
+
+
+
+
+Figure 19: Latent reasoning trajectories of GRAM (50 samples). Using the same visualization scheme as
+Figure 18, we show 50 sampled trajectories from GRAM. The stochastic guidance enables diverse exploration
+of the latent space: while some trajectories converge to local minima (bottom right), others successfully reach
+the global optimum (middle left), demonstrating how parallel sampling improves solution discovery.
+
+
+
+
+Licenses
+Table 10: Existing assets, licenses, and source links. We list the existing datasets, benchmarks, and public
+reference implementations used or cited in our experiments. Synthetic N-Queens and Graph Coloring instances
+are generated by the authors and are therefore not external assets.
+
+Asset: MNIST
+  Use in this paper: Binarized MNIST generation experiments
+  License / terms: Creative Commons Attribution-Share Alike 3.0
+  Source link: https://keras.io/api/datasets/mnist/
+Asset: ARC-AGI-1 / original ARC
+  Use in this paper: ARC-AGI reasoning benchmark
+  License / terms: Apache License 2.0
+  Source link: https://github.com/fchollet/ARC-AGI
+Asset: ARC-AGI-2
+  Use in this paper: ARC-AGI-2 reasoning benchmark / reference results
+  License / terms: Apache License 2.0
+  Source link: https://github.com/arcprize/ARC-AGI-2
+Asset: HRM repository
+  Use in this paper: HRM baseline and Sudoku-Extreme-related reference implementation
+  License / terms: Apache License 2.0
+  Source link: https://github.com/sapientinc/HRM
+Asset: TinyRecursiveModels / TRM repository
+  Use in this paper: TRM baseline and recursive reasoning reference implementation
+  License / terms: MIT License
+  Source link: https://github.com/SamsungSAILMontreal/TinyRecursiveModels
+Asset: MDLM repository
+  Use in this paper: Masked diffusion baseline reference implementation, if public code is used
+  License / terms: Apache License 2.0
+  Source link: https://github.com/kuleshov-group/mdlm
+Asset: Google Research D3PM implementation
+  Use in this paper: D3PM image-generation baseline reference implementation, if public code is used
+  License / terms: Apache License 2.0
+  Source link: https://github.com/google-research/google-research/blob/master/d3pm/images/diffusion_categorical.py
+Asset: Looped Transformer repository
+  Use in this paper: Looped Transformer baseline reference implementation, if public code is used
+  License / terms: MIT License
+  Source link: https://github.com/Leiay/looped_transformer
+Asset: N-Queens
+  Use in this paper: Synthetic multi-solution constraint satisfaction task generated by the authors
+  License / terms: Not an external asset
+  Source link: N/A
+Asset: Graph Coloring
+  Use in this paper: Synthetic multi-solution constraint satisfaction task generated by the authors
+  License / terms: Not an external asset
+  Source link: N/A
+ \ No newline at end of file
diff --git a/papers/txt/hrm2025_hierarchical_reasoning.txt b/papers/txt/hrm2025_hierarchical_reasoning.txt
new file mode 100644
index 0000000..3cc1f2e
--- /dev/null
+++ b/papers/txt/hrm2025_hierarchical_reasoning.txt
@@ -0,0 +1,1302 @@
+ Hierarchical Reasoning Model
+ Guan Wang1,† , Jin Li1 , Yuhao Sun1 , Xing Chen1 , Changling Liu1 ,
+ Yue Wu1 , Meng Lu1,† , Sen Song2,† , Yasin Abbasi Yadkori1,†
+ 1 Sapient Intelligence, Singapore
+
+
+
+arXiv:2506.21734v3 [cs.AI] 4 Aug 2025
+
+ Abstract
+
+ Reasoning, the process of devising and executing complex goal-oriented action sequences,
+ remains a critical challenge in AI. Current large language models (LLMs) primarily employ
+ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive
+ data requirements, and high latency. Inspired by the hierarchical and multi-timescale pro-
+ cessing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel
+ recurrent architecture that attains significant computational depth while maintaining both train-
+ ing stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass
+ without explicit supervision of the intermediate process, through two interdependent recurrent
+ modules: a high-level module responsible for slow, abstract planning, and a low-level mod-
+ ule handling rapid, detailed computations. With only 27 million parameters, HRM achieves
+ exceptional performance on complex reasoning tasks using only 1000 training samples. The
+ model operates without pre-training or CoT data, yet achieves nearly perfect performance on
+ challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes.
+ Furthermore, HRM outperforms much larger models with significantly longer context windows
+ on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial
+ general intelligence capabilities. These results underscore HRM’s potential as a transformative
+ advancement toward universal computation and general-purpose reasoning systems.
+
+
+[Bar charts, y-axis: Accuracy %. Panels: ARC-AGI-1 (960 training examples), ARC-AGI-2 (1120 training examples), Sudoku-Extreme (9x9) (1000 training examples), Maze-Hard (30x30) (1000 training examples). Bars: Deepseek R1, Claude 3.7 8K, o3-mini-high, Direct pred, HRM. HRM scores by panel: 40.3, 5.0, 55.0, 74.5. Legend: Chain-of-thought, pretrained; Direct prediction, small-sample learning]
+
+
+ Figure 1: Left: HRM is inspired by hierarchical processing and temporal separation in the brain. It
+ has two recurrent networks operating at different timescales to collaboratively solve tasks. Right:
+ With only about 1000 training examples, the HRM (~27M parameters) surpasses state-of-the-art
+ CoT models on inductive benchmarks (ARC-AGI) and challenging symbolic tree-search puzzles
+ (Sudoku-Extreme, Maze-Hard) where CoT models failed completely. The HRM was randomly
+ initialized, and it solved the tasks directly from inputs without chain-of-thought reasoning.
+ 2 Tsinghua University. † Corresponding author. Contact: research@sapient.inc.
+ Code available at: github.com/sapientinc/HRM
+
+1 Introduction
+Deep learning, as its name suggests, emerged from the idea of stacking more layers to achieve
+increased representation power and improved performance 1,2 . However, despite the remarkable
+success of large language models, their core architecture is paradoxically shallow 3 . This imposes
+a fundamental constraint on their most sought-after capability: reasoning. The fixed depth of standard
+Transformers places them in computational complexity classes such as $AC^0$ or $TC^0$ 4 , preventing
+them from solving problems that require polynomial time 5,6 . LLMs are not Turing-complete
+and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic reasoning
+that is necessary for deliberate planning or symbolic manipulation tasks 7,8 . For example,
+our results on the Sudoku task show that increasing Transformer model depth can improve performance
+(see footnote 1), but performance remains far from optimal even with very deep models (see Figure 2),
+which supports the conjectured limitations of the LLM scaling paradigm 9 .
+The LLMs literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning 10 .
+CoT externalizes reasoning into token-level language by breaking down complex tasks into sim-
+pler intermediate steps, sequentially generating text using a shallow model 11 . However, CoT for
+reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions
+where a single misstep or misordering of the steps can derail the reasoning process entirely 12,13 . This
+dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result,
+CoT reasoning often requires a significant amount of training data and generates a large number of
+tokens for complex reasoning tasks, resulting in slow response times. A more efficient approach is
+needed to minimize these data requirements 14 .
+Towards this goal, we explore “latent reasoning”, where the model conducts computations within
+its internal hidden state space 15,16 . This aligns with the understanding that language is a tool for
+human communication, not the substrate of thought itself 17 ; the brain sustains lengthy, coherent
+chains of reasoning with remarkable efficiency in a latent space, without constant translation back
+to language. However, the power of latent reasoning is still fundamentally constrained by a model’s
+effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gra-
+dients, which plague training stability and effectiveness 1,18 . Recurrent architectures, a natural al-
+ternative for sequential tasks, often suffer from early convergence, rendering subsequent computa-
+tional steps inert, and rely on the biologically implausible, computationally expensive and memory
+intensive Backpropagation Through Time (BPTT) for training 19 .
+The human brain provides a compelling blueprint for achieving the effective computational depth
+that contemporary artificial models lack. It organizes computation hierarchically across cortical
+regions operating at different timescales, enabling deep, multi-stage reasoning 20,21,22 . Recurrent
+feedback loops iteratively refine internal representations, allowing slow, higher-level areas to
+guide, and fast, lower-level circuits to execute, subordinate processing while preserving global
+coherence 23,24,25 . Notably, the brain achieves such depth without incurring the prohibitive
+credit-assignment costs that backpropagation through time typically imposes on recurrent networks 19,26 .
+Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierarchical
+Reasoning Model (HRM). HRM is designed to significantly increase the effective computational
+depth. It features two coupled recurrent modules: a high-level (H) module for abstract,
+deliberate reasoning, and a low-level (L) module for fast, detailed computations. This structure
+avoids the rapid convergence of standard recurrent models through a process we term “hierarchical
+convergence.” The slow-updating H-module advances only after the fast-updating L-module
+has completed multiple computational steps and reached a local equilibrium, at which point the
+L-module is reset to begin a new computational phase.
+
+Footnote 1: Simply increasing the model width does not improve performance here.
+
+[Plots, y-axis: Accuracy %. Left panel: accuracy vs. Parameters (27M to 872M); legend: Scaling Width - 8 layers fixed, Scaling Depth - 512 hidden size fixed. Right panel: accuracy vs. Depth / Transformer layers computed (8 to 512); legend: Transformer, Recurrent Transformer, HRM]
+Figure 2: The necessity of depth for complex reasoning. Left: On Sudoku-Extreme Full, which
+requires extensive tree-search and backtracking, increasing a Transformer’s width yields no performance
+gain, while increasing depth is critical. Right: Standard architectures saturate, failing to
+benefit from increased depth. HRM overcomes this fundamental limitation, effectively using its
+computational depth to achieve near-perfect accuracy.
+
+Furthermore, we propose a one-step gradient approximation for training HRM, which offers im-
+proved efficiency and eliminates the requirement for BPTT. This design maintains a constant mem-
+ory footprint (O(1) compared to BPTT’s O(T ) for T timesteps) throughout the backpropagation
+process, making it scalable and more biologically plausible.
+Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and
+backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervi-
+sion, HRM learns to solve problems that are intractable for even the most advanced LLMs. For
+example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and
+optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% ac-
+curacy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark
+of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 exam-
+ples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance
+of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%)
+and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and con-
+text lengths, as shown in Figure 1. This represents a promising direction toward the development
+of next-generation AI reasoning systems with universal computational capabilities.
+
+
+2 Hierarchical Reasoning Model
+We present the HRM, inspired by three fundamental principles of neural computation observed in
+the brain:
+• Hierarchical processing: The brain processes information across a hierarchy of cortical areas.
+  Higher-level areas integrate information over longer timescales and form abstract representations,
+  while lower-level areas handle more immediate, detailed sensory and motor processing 20,22,21 .
+• Temporal Separation: These hierarchical levels in the brain operate at distinct intrinsic timescales,
+  reflected in neural rhythms (e.g., slow theta waves, 4–8 Hz and fast gamma waves, 30–100
+  Hz) 30,31 . This separation allows for stable, high-level guidance of rapid, low-level computations 32,33 .
+• Recurrent Connectivity: The brain features extensive recurrent connections. These feedback
+  loops enable iterative refinement, yielding more accurate and context-sensitive representations
+  at the cost of additional processing time. Additionally, the brain largely avoids the problematic
+  deep credit assignment problem associated with BPTT 19 .
+The HRM model consists of four learnable components: an input network $f_I(\cdot; \theta_I)$, a low-level
+recurrent module $f_L(\cdot; \theta_L)$, a high-level recurrent module $f_H(\cdot; \theta_H)$, and an output network
+$f_O(\cdot; \theta_O)$. The model’s dynamics unfold over $N$ high-level cycles of $T$ low-level timesteps each
+(see footnote 2). We index the total timesteps of one forward pass by $i = 1, \dots, N \times T$. The modules
+$f_L$ and $f_H$ each keep a hidden state, $z_L^i$ for $f_L$ and $z_H^i$ for $f_H$, which are initialized with the
+vectors $z_L^0$ and $z_H^0$, respectively.
+The HRM maps an input vector x to an output prediction vector ŷ as follows. First, the input x is
+projected into a working representation x̃ by the input network:
+
+\[
+\tilde{x} = f_I(x; \theta_I).
+\]
+At each timestep i, the L-module updates its state conditioned on its own previous state, the H-
+module’s current state (which remains fixed throughout the cycle), and the input representation.
+The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final
+state at the end of that cycle:
+
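+\[
+z_L^i = f_L\left(z_L^{i-1}, z_H^{i-1}, \tilde{x}; \theta_L\right),
+\]
+\[
+z_H^i =
+\begin{cases}
+f_H\left(z_H^{i-1}, z_L^{i-1}; \theta_H\right) & \text{if } i \equiv 0 \pmod{T}, \\
+z_H^{i-1} & \text{otherwise}.
+\end{cases}
+\]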
+
+Finally, after N full cycles, a prediction ŷ is extracted from the hidden state of the H-module:
+
+
+\[
+\hat{y} = f_O\left(z_H^{NT}; \theta_O\right).
+\]
+
+This entire N T -timestep process represents a single forward pass of the HRM. A halting mecha-
+nism (detailed later in this section) determines whether the model should terminate, in which case
+ŷ will be used as the final prediction, or continue with an additional forward pass.
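The update schedule of one forward pass can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than the HRM implementation: the states are scalars, the four modules are fixed `tanh` or linear maps with arbitrary coefficients, and the halting mechanism is omitted. Only the loop structure follows the equations above.

```python
import math


def hrm_forward(x, N, T, z_L0=0.0, z_H0=0.0):
    """One forward pass of N cycles x T low-level steps with toy scalar modules."""
    f_I = lambda x: 0.5 * x                                             # input network
    f_L = lambda z_L, z_H, x_t: math.tanh(0.5 * z_L + 0.8 * z_H + x_t)  # low-level module
    f_H = lambda z_H, z_L: math.tanh(0.6 * z_H + 0.9 * z_L)             # high-level module
    f_O = lambda z_H: 2.0 * z_H                                         # output network

    x_tilde = f_I(x)
    z_L, z_H = z_L0, z_H0
    h_update_steps = []
    for i in range(1, N * T + 1):
        # Both updates read the states from step i-1, as in the equations.
        z_L_next = f_L(z_L, z_H, x_tilde)
        if i % T == 0:
            z_H = f_H(z_H, z_L)          # H updates once per cycle
            h_update_steps.append(i)
        z_L = z_L_next                   # L updates at every step
    return f_O(z_H), h_update_steps


y_hat, h_steps = hrm_forward(x=1.0, N=3, T=4)
```

With N = 3 and T = 4 the L-module updates 12 times, while the H-module updates only at steps 4, 8, and 12.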
+Hierarchical convergence Although convergence is crucial for recurrent networks, standard RNNs
+are fundamentally limited by their tendency to converge too early. As the hidden state settles toward
+a fixed point, update magnitudes shrink, effectively stalling subsequent computation and capping
+the network’s effective depth. To preserve computational power, we actually want convergence to
+proceed very slowly–but engineering that gradual approach is difficult, since pushing convergence
+too far edges the system toward instability.
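The stalling effect described above can be seen in a one-dimensional toy recurrence. This is an illustration under our own assumptions, not an experiment from the paper: the update `z <- tanh(0.5 * z + 1.0)` is an arbitrary contractive map, and the residual is taken to be the absolute change in state between consecutive steps.

```python
import math


def forward_residuals(steps, z0=0.0):
    """Residuals |z_i - z_{i-1}| of a contractive scalar recurrence."""
    z = z0
    residuals = []
    for _ in range(steps):
        z_next = math.tanh(0.5 * z + 1.0)   # slope is at most 0.5, so a contraction
        residuals.append(abs(z_next - z))
        z = z_next
    return residuals


residuals = forward_residuals(30)
```

Each residual is at most half the previous one, so after a handful of steps the updates are numerically negligible and further iterations add almost no computation.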
+ 2
+ While inspired by temporal separation in the brain, our model’s “high-level” and “low-level” modules are concep-
+tual abstractions and do not map directly to specific neural oscillation frequencies.
+
+
+[Figure 3 plots. Top row: forward residual versus step index for HRM (H and L modules) and for a recurrent neural net, and versus layer index for a deep neural net. Bottom row: PCA trajectories (principal components) shaded by step index or layer index.]
+
+Figure 3: Comparison of forward residuals and PCA trajectories. HRM shows hierarchical conver-
+gence: the H-module steadily converges, while the L-module repeatedly converges within cycles
+before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid
+convergence with residuals quickly approaching zero. In contrast, the deep neural network experi-
+ences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
+
+HRM is explicitly designed to counteract this premature convergence through a process we term
+hierarchical convergence. During each cycle, the L-module (an RNN) exhibits stable convergence
+to a local equilibrium. This equilibrium, however, depends on the high-level state zH supplied
+during that cycle. After completing the T steps, the H-module incorporates the sub-computation’s
+outcome (the final state zL ) and performs its own update. This zH update establishes a fresh context
+for the L-module, essentially “restarting” its computational path and initiating a new convergence
+phase toward a different local equilibrium.
+This process allows the HRM to perform a sequence of distinct, stable, nested computations, where
+the H-module directs the overall problem-solving strategy and the L-module executes the intensive
+search or refinement required for each step. Although a standard RNN may approach convergence
+within T iterations, the hierarchical convergence benefits from an enhanced effective depth of N T
+steps. As empirically shown in Figure 3, this mechanism allows HRM both to maintain high
+computational activity (forward residual) over many steps (in contrast to a standard RNN, whose
+activity rapidly decays) and to enjoy stable convergence. This translates into better performance at
+any computation depth, as illustrated in Figure 2.
+Approximate gradient Recurrent models typically use BPTT to compute gradients. However,
+BPTT requires storing the hidden states from the forward pass and then combining them with
+gradients during the backward pass, which demands O(T ) memory for T timesteps. This heavy
+memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large-
+scale networks. Additionally, because retaining the full history trace through time is biologically
+implausible, it is unlikely that the brain implements BPTT 19 .
+Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state
+sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a
+mechanism could plausibly be implemented in the brain using only local learning rules 34,35 . Based
+ on this finding, we propose a one-step approximation of the HRM gradient–using the gradient of
+the last state of each module and treating other states as constant. The gradient path is, therefore,
+
+ Output head → final state of the H-module → final state of the L-module → input embedding
+
+The above method needs O(1) memory, does not require unrolling through time, and can be easily
+implemented with an autograd framework such as PyTorch, as shown in Figure 4. Given that
+each module only needs to back-propagate errors through its most recent local synaptic activity,
+this approach aligns well with the perspective that cortical credit assignment relies on short-range,
+temporally local mechanisms rather than on a global replay of activity patterns.
+The one-step gradient approximation is theoretically grounded in the mathematics of Deep
+Equilibrium Models (DEQ) 36 which employs the Implicit Function Theorem (IFT) to bypass BPTT, as
+detailed next. Consider an idealized HRM behavior where, during high-level cycle k, the L-module
+repeatedly updates until its state $z_L$ converges to a local fixed point $z_L^\star$. This fixed point, given
+the current high-level state $z_H^{k-1}$, can be expressed as
+
+    \[ z_L^\star = f_L\left(z_L^\star, z_H^{k-1}, \tilde{x}; \theta_L\right). \]
+
+The H-module then performs a single update using this converged L-state:
+
+    \[ z_H^k = f_H\left(z_H^{k-1}, z_L^\star; \theta_H\right). \]
+
+With a proper mapping $\mathcal{F}$, the updates to the high-level state can be written in a more compact
+form as $z_H^k = \mathcal{F}(z_H^{k-1}; \tilde{x}, \theta)$, where $\theta = (\theta_I, \theta_L)$, and the fixed-point can be written as
+$z_H^\star = \mathcal{F}(z_H^\star; \tilde{x}, \theta)$. Let $J_{\mathcal{F}} = \partial \mathcal{F} / \partial z_H$ be the Jacobian of $\mathcal{F}$, and assume that the matrix
+$I - J_{\mathcal{F}}$ is invertible at $z_H^\star$ and that the mapping $\mathcal{F}$ is continuously differentiable. The Implicit
+Function Theorem then allows us to calculate the exact gradient of fixed point $z_H^\star$ with respect to
+the parameters θ without explicit backpropagation:
+
+    \[ \frac{\partial z_H^\star}{\partial \theta} = \left(I - J_{\mathcal{F}}\big|_{z_H^\star}\right)^{-1} \frac{\partial \mathcal{F}}{\partial \theta}\bigg|_{z_H^\star}. \tag{1} \]
+
+    def hrm(z, x, N=2, T=2):
+        x = input_embedding(x)
+        zH, zL = z
+        with torch.no_grad():
+            for _i in range(N * T - 1):
+                zL = L_net(zL, zH, x)
+                if (_i + 1) % T == 0:
+                    zH = H_net(zH, zL)
+
+        # 1-step grad
+        zL = L_net(zL, zH, x)
+        zH = H_net(zH, zL)
+        return (zH, zL), output_head(zH)
+
+    # Deep Supervision
+    for x, y_true in train_dataloader:
+        z = z_init
+        for step in range(N_supervision):
+            z, y_hat = hrm(z, x)
+            loss = softmax_cross_entropy(y_hat, y_true)
+            z = z.detach()
+            loss.backward()
+            opt.step()
+            opt.zero_grad()
+
+Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep
+supervision training in PyTorch.
+
+
+Calculating the above gradient requires evaluating and inverting the matrix $(I - J_{\mathcal{F}})$, which can be
+computationally expensive. Given the Neumann series expansion,
+    \[ (I - J_{\mathcal{F}})^{-1} = I + J_{\mathcal{F}} + J_{\mathcal{F}}^2 + J_{\mathcal{F}}^3 + \dots, \]
+the so-called 1-step gradient 37 approximates the series by considering only its first term, i.e.
+$(I - J_{\mathcal{F}})^{-1} \approx I$, and leads to the following approximation of Equation (1):
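+    \[ \frac{\partial z_H^\star}{\partial \theta_H} \approx \frac{\partial f_H}{\partial \theta_H}, \quad \frac{\partial z_H^\star}{\partial \theta_L} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_L}, \quad \frac{\partial z_H^\star}{\partial \theta_I} \approx \frac{\partial f_H}{\partial z_L^\star} \cdot \frac{\partial z_L^\star}{\partial \theta_I}. \tag{2} \]
+The gradients of the low-level fixed point, $\partial z_L^\star / \partial \theta_L$ and $\partial z_L^\star / \partial \theta_I$, can also be approximated using another
+application of the 1-step gradient:
+    \[ \frac{\partial z_L^\star}{\partial \theta_L} \approx \frac{\partial f_L}{\partial \theta_L}, \quad \frac{\partial z_L^\star}{\partial \theta_I} \approx \frac{\partial f_L}{\partial \theta_I}. \tag{3} \]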
+By substituting Equation (3) back into Equation (2), we arrive at the final simplified gradients.
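+A scalar example makes the effect of truncating the Neumann series concrete. The sketch below assumes a hypothetical linear map F(z; θ) = a·z + θ with |a| < 1 (so the Jacobian is just a and ∂F/∂θ = 1); the function names are illustrative and nothing here is taken from the HRM code.
+
```python
def fixed_point(a, theta, iters=200):
    """Iterate z <- a*z + theta until (numerically) converged; requires |a| < 1."""
    z = 0.0
    for _ in range(iters):
        z = a * z + theta
    return z


def neumann_partial_sum(a, num_terms):
    """Sum of the first num_terms terms of 1 + a + a^2 + ... (scalar Neumann series)."""
    return sum(a ** k for k in range(num_terms))


a, theta, eps = 0.5, 3.0, 1e-4

# Exact gradient of the fixed point from the Implicit Function Theorem:
# dz*/dtheta = (1 - a)^(-1) * dF/dtheta, with dF/dtheta = 1 here.
exact_grad = 1.0 / (1.0 - a)

# Finite-difference check of the same quantity.
fd_grad = (fixed_point(a, theta + eps) - fixed_point(a, theta - eps)) / (2 * eps)

# 1-step gradient: keep only the first term of the series, (1 - a)^(-1) ~ 1.
one_step_grad = neumann_partial_sum(a, 1)

# Keeping more terms recovers the exact value.
many_term_grad = neumann_partial_sum(a, 50)
```
+
+In this toy case the exact gradient is 1/(1 − a) = 2 while the 1-step value is 1: the sign is preserved but the magnitude is underestimated, and the gap shrinks as |a| approaches 0.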
+Before defining our loss function, we must first introduce two key elements of our proposed
+method: deep supervision and adaptive computational time.
+Deep supervision Inspired by the principle that periodic neural oscillations regulate when learning
+occurs in the brain 38 , we incorporate a deep supervision mechanism into HRM, as detailed next.
+Given a data sample (x, y), we run multiple forward passes of the HRM model, each of which we
+refer to as a segment. Let M denote the total number of segments executed before termination.
+For each segment m ∈ {1, . . . , M }, let $z^m = (z_H^{mNT}, z_L^{mNT})$ represent the hidden state at the
+conclusion of segment m, encompassing both high-level and low-level state components.
+At each segment m, we apply a deep supervision step as follows:
+ 1. Given the state $z^{m-1}$ from the previous segment, compute the next state $z^m$ and its associated
+    output $\hat{y}^m$ through a forward pass in the HRM model:
+
+    \[ (z^m, \hat{y}^m) \leftarrow \mathrm{HRM}(z^{m-1}, x; \theta) \]
+
+ 2. Compute the loss for the current segment:
+
+    \[ L^m \leftarrow \textsc{Loss}(\hat{y}^m, y) \]
+
+ 3. Update parameters:
+
+    \[ \theta \leftarrow \textsc{OptimizerStep}(\theta, \nabla_\theta L^m) \]
+
+The crucial aspect of this procedure is that the hidden state $z^m$ is “detached” from the computation
+graph before being used as the input state for the next segment. Consequently, gradients from
+segment m + 1 do not propagate back through segment m, effectively creating a 1-step approxi-
+mation of the gradient of the recursive deep supervision process 39,40 . This approach provides more
+frequent feedback to the H-module and serves as a regularization mechanism, demonstrating supe-
+rior empirical performance and enhanced stability in deep equilibrium models when compared to
+more complex, Jacobian-based regularization techniques 39,41 . Figure 4 shows pseudocode of deep
+supervision training.
+Adaptive computational time (ACT) The brain dynamically alternates between automatic think-
+ing (“System 1”) and deliberate reasoning (“System 2”) 42 . Neuroscientific evidence shows that
+these cognitive modes share overlapping neural circuits, particularly within regions such as the
+prefrontal cortex and the default mode network 43,44 . This indicates that the brain dynamically mod-
+ulates the “runtime” of these circuits according to task complexity and potential rewards 45,46 .
+Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that en-
+ables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning
+algorithm 47 to adaptively determine the number of segments. A Q-head uses the final state of the
+H-module to predict the Q-values $\hat{Q}^m = (\hat{Q}^m_{\text{halt}}, \hat{Q}^m_{\text{continue}})$ of the “halt” and “continue” actions:
+
+    \[ \hat{Q}^m = \sigma\left(\theta_Q^\top z_H^{mNT}\right), \]
+where σ denotes the sigmoid function applied element-wise. The halt or continue action is chosen
+using a randomized strategy as detailed next. Let Mmax denote the maximum number of segments
+(a fixed hyperparameter) and Mmin denote the minimum number of segments (a random variable).
+The value of Mmin is determined stochastically: with probability ε, it is sampled uniformly from the
+set {2, · · · , Mmax } (to encourage longer thinking), and with probability 1 − ε, it is set to 1. The halt
+action is selected under two conditions: when the segment count surpasses the maximum threshold
+Mmax , or when the estimated halt value Q̂halt exceeds the estimated continue value Q̂continue and the
+segment count has reached at least the minimum threshold Mmin .
+The Q-head is updated through a Q-learning algorithm, which is defined on the following episodic
+Markov Decision Process (MDP). The state of the MDP at segment m is $z^m$, and the action space
+is {halt, continue}. Choosing the action “halt” terminates the episode and returns a binary reward
+indicating prediction correctness, i.e., $\mathbb{1}\{\hat{y}^m = y\}$. Choosing “continue” yields a reward of 0 and
+the state transitions to $z^{m+1}$. Thus, the Q-learning targets for the two actions $\hat{G}^m = (\hat{G}^m_{\text{halt}}, \hat{G}^m_{\text{continue}})$
+are given by
+
+    \[ \hat{G}^m_{\text{halt}} = \mathbb{1}\{\hat{y}^m = y\}, \]
+    \[ \hat{G}^m_{\text{continue}} = \begin{cases} \hat{Q}^{m+1}_{\text{halt}}, & \text{if } m \geq M_{\max}, \\ \max\left(\hat{Q}^{m+1}_{\text{halt}}, \hat{Q}^{m+1}_{\text{continue}}\right), & \text{otherwise}. \end{cases} \]
+ halt continue
+
+We can now define the loss function of our learning procedure. The overall loss for each supervision
+segment combines both the Q-head loss and the sequence-to-sequence loss:
+
+    \[ L^m_{\text{ACT}} = \textsc{Loss}(\hat{y}^m, y) + \textsc{BinaryCrossEntropy}(\hat{Q}^m, \hat{G}^m). \]
+
+Minimizing the above loss enables both accurate predictions and nearly optimal stopping decisions.
+Selecting the “halt” action ends the supervision loop. In practice, sequences are processed in
+batches, which can be easily handled by substituting any halted sample in the batch with a fresh
+sample from the dataloader.
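+The halting rule and Q-learning targets described above can be sketched with scalar Q-values. This is a minimal illustration, not the authors’ code: all function names are hypothetical, the phrase “surpasses the maximum threshold” is read as m ≥ Mmax , and the binary cross-entropy is written for a single scalar pair.
+
```python
import math
import random


def sample_m_min(m_max, epsilon, rng):
    """With probability epsilon draw M_min uniformly from {2, ..., M_max}; otherwise use 1."""
    if rng.random() < epsilon:
        return rng.randint(2, m_max)  # randint is inclusive on both ends
    return 1


def should_halt(m, q_halt, q_continue, m_min, m_max):
    """Halt at the segment cap, or when halting looks better and M_min has been reached."""
    # Assumption: the cap is enforced as m >= m_max.
    return m >= m_max or (q_halt > q_continue and m >= m_min)


def q_targets(is_correct, next_q_halt, next_q_continue, m, m_max):
    """Targets (G_halt, G_continue) for segment m, given the next segment's Q-values."""
    g_halt = 1.0 if is_correct else 0.0
    if m >= m_max:
        g_continue = next_q_halt
    else:
        g_continue = max(next_q_halt, next_q_continue)
    return g_halt, g_continue


def binary_cross_entropy(q, g, eps=1e-12):
    """Scalar binary cross-entropy between a predicted value q in (0, 1) and a target g."""
    q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(g * math.log(q) + (1.0 - g) * math.log(1.0 - q))


rng = random.Random(0)
m_min = sample_m_min(m_max=8, epsilon=0.1, rng=rng)
```
+
+The Q-head term of the loss is then the sum of `binary_cross_entropy` over the halt and continue components; how the two components are weighted or reduced is not specified in the text above.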
+Figure 5 presents a performance comparison between two HRM variants: one incorporating ACT
+and another employing a fixed computational step count equivalent to ACT’s Mmax parameter. It
+shows that ACT effectively adapts its computational resources based on task complexity, achieving
+significant computational savings with minimal impact on performance.
+Inference-time scaling An effective neural model should exploit additional computational re-
+sources during inference to enhance performance. As illustrated in Figure 5-(c), HRM seamlessly
+achieves inference-time scaling by simply increasing the computational limit parameter, Mmax ,
+without requiring further training or architectural modifications.
+Additional compute is especially effective for tasks that demand deeper reasoning. On Sudoku—
+a problem that often requires long-term planning—HRM exhibits strong inference-time scaling.
+On the other hand, we find that extra computational resources yield minimal gains on the ARC-AGI
+challenge, as solutions generally require only a few transformations.
+
+[Figure 5 plots. (a) ACT Compute Spent: mean compute steps versus M (Fixed) or Mmax (ACT). (b) ACT Performance: accuracy % versus M (Fixed) or Mmax (ACT). (c) Inference-time scaling: accuracy % versus inference Mmax for models trained with Mmax = 2, 4, and 8.]
+
+Figure 5: Effectiveness of Adaptive Computation Time (ACT) on Sudoku-Extreme-Full. (a)
+Mean compute steps used by models with ACT versus models with a fixed number of compute steps
+(M ). ACT maintains a low and stable number of average compute steps even as the maximum limit
+(Mmax ) increases. (b) Accuracy comparison. The ACT model achieves performance comparable
+to the fixed-compute model while utilizing substantially fewer computational steps on average. (c)
+Inference-time scalability. Models trained with a specific Mmax can generalize to higher compu-
+tational limits during inference, leading to improved accuracy. For example, a model trained with
+Mmax = 8 continues to see accuracy gains when run with Mmax = 16 during inference.
+
+Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is
+known to be prone to instability, often requiring stabilization techniques such as replay buffers
+and target networks 48 , which are absent in our design. Our approach, however, achieves stability
+through the intrinsic properties of our model and training procedure. Recent theoretical work by
+Gallici et al. 49 shows that Q-learning can achieve convergence if network parameters are bounded,
+weight decay is incorporated during training, and post-normalization layers are implemented. Our
+model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a
+layer normalization variant) and the AdamW optimizer. AdamW has been shown to solve an L∞ -
+constrained optimization problem, ensuring that model parameters remain bounded by 1/λ 50 .
+Architectural details We employ a sequence-to-sequence architecture for HRM. Both input and
+output are represented as token sequences: x = (x1 , . . . , xl ) and y = (y1 , . . . , yl′ ) respectively.
+The model includes an embedding layer fI that converts discrete tokens into vector representa-
+tions, and an output head fO (z; θO ) = softmax(θO z) that transforms hidden states into token prob-
+ability distributions ŷ. For small-sample experiments, we replace softmax with stablemax 51 to
+improve generalization performance. The sequence-to-sequence loss is averaged over all tokens,
+$\textsc{Loss}(\hat{y}, y) = -\frac{1}{l'} \sum_{i=1}^{l'} \log p(y_i)$, where $p(y_i)$ is the probability that distribution $\hat{y}_i$ assigns to token
+$y_i$. The initial hidden states $z^0$ are initialized by sampling from a truncated normal distribution with
+standard deviation of 1, truncation of 2, and kept fixed throughout training.
+Both the low-level and high-level recurrent modules fL and fH are implemented using encoder-only
+Transformer 52 blocks with identical architectures and dimensions. These modules take multiple
+inputs, and we use straightforward element-wise addition to combine them; more sophisticated
+merging techniques, such as gating mechanisms, could potentially improve performance and are
+left for future work. For all Transformer blocks in this work (including those in the baseline
+models), we incorporate the enhancements found in modern LLMs (based on Llama 53
+architectures). These improvements include Rotary Positional Encoding 54 , Gated Linear Units 55 ,
+RMSNorm 56 , and the removal of bias terms from linear layers.
+Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture
+with weights initialized via truncated LeCun Normal initialization 57,58,59 , while the scale and bias
+parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2
+optimizer 60 , a scale-invariant variant of Adam 61 , combined with a constant learning rate that includes
+linear warm-up.
+
+[Figure 6 panels: (a) ARC-AGI, (b) Sudoku-Hard, (c) Maze navigation, (d) Sudoku-Extreme subset difficulty. The solved Sudoku grid shown in the figure is:]
+ 7 8 4 1 2 5 9 6 3
+ 2 6 1 3 8 9 7 4 5
+ 3 5 9 6 4 7 8 1 2
+ 5 3 8 4 9 6 1 2 7
+ 4 1 6 2 7 3 5 9 8
+ 9 7 2 8 5 1 4 3 6
+ 6 9 3 5 1 8 2 7 4
+ 8 4 7 9 6 2 3 5 1
+ 1 2 5 7 3 4 6 8 9
+
+Figure 6: Left: Visualization of benchmark tasks. Right: Difficulty of Sudoku-Extreme examples.
+
+
+3 Results
+This section begins by describing the ARC-AGI, Sudoku, and Maze benchmarks, followed by an
+overview of the baseline models and their results. Figure 6-(a,b,c) presents a visual representa-
+tion of the three benchmark tasks, which are selected to evaluate various reasoning abilities in AI
+models.
+
+3.1 Benchmarks
+ARC-AGI Challenge The ARC-AGI benchmark evaluates general fluid intelligence through IQ-
+test-like puzzles that require inductive reasoning 27 . The initial version, ARC-AGI-1, presents chal-
+lenges as input-label grid pairs that force AI systems to extract and generalize abstract rules from
+just a few examples. Each task provides a few input–output demonstration pairs (usually 2–3) and
+a test input. An AI model has two attempts to produce the correct output grid. Although some be-
+lieve that mastering ARC-AGI would signal true artificial general intelligence, its primary purpose
+is to expose the current roadblocks in AGI progress. In fact, both conventional deep learning meth-
+ods and CoT techniques have faced significant challenges with ARC-AGI-1, primarily because it
+requires the ability to generalize to entirely new tasks 28 .
+Addressing the limitations identified in ARC-AGI-1, ARC-AGI-2 significantly expands the bench-
+mark by providing a more comprehensive and carefully refined collection of tasks. These new
+tasks emphasize deeper compositional reasoning, multi-step logic, contextual rule application, and
+symbolic abstraction. Human calibration studies show these tasks are challenging but doable for
+people, while being much harder for current AI systems, offering a clearer measure of general
+reasoning abilities 29 .
+ Sudoku-Extreme Sudoku is a 9×9 logic puzzle, requiring each row, column, and 3×3 block to
+contain the digits 1–9 exactly once. A prediction is considered correct if it exactly matches the
+puzzle’s unique solution. Sudoku’s complex logical structure makes it a popular benchmark for
+evaluating logical reasoning in machine learning 62,63,64 .
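+The Sudoku constraints and the exact-match correctness criterion stated above can be written down directly. The sketch below is illustrative only (the function names are hypothetical); the example board is the solved grid that appears with Figure 6.
+
```python
def is_valid_solution(grid):
    """True if every row, column, and 3x3 block contains the digits 1-9 exactly once."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    blocks = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(unit == digits for unit in rows + cols + blocks)


def is_correct_prediction(prediction, solution):
    """A prediction counts as correct only if it matches the unique solution exactly."""
    return prediction == solution


SOLVED = [
    [7, 8, 4, 1, 2, 5, 9, 6, 3],
    [2, 6, 1, 3, 8, 9, 7, 4, 5],
    [3, 5, 9, 6, 4, 7, 8, 1, 2],
    [5, 3, 8, 4, 9, 6, 1, 2, 7],
    [4, 1, 6, 2, 7, 3, 5, 9, 8],
    [9, 7, 2, 8, 5, 1, 4, 3, 6],
    [6, 9, 3, 5, 1, 8, 2, 7, 4],
    [8, 4, 7, 9, 6, 2, 3, 5, 1],
    [1, 2, 5, 7, 3, 4, 6, 8, 9],
]

# Swapping two cells within a row keeps the row valid but breaks two columns.
broken = [row[:] for row in SOLVED]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]
```
+
+Because each puzzle has a unique solution, a fully filled prediction that satisfies the constraints and agrees with the givens is the solution, so constraint checking and exact matching coincide for complete boards.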
+The most frequently used Sudoku dataset in research, namely the Kaggle dataset 65 , can be fully
+solved using elementary single-digit techniques 66 . The minimal 17-clue puzzles 62 , another widely-
+used collection, might seem more challenging due to its small number of clues. However, this
+perception is misleading—since 17 represents the minimum number of clues required to guarantee
+a unique Sudoku solution, these hints need to be highly orthogonal to each other. This orthogonal
+arrangement leads to many direct, easily-resolved solution paths 67 .
+We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforemen-
+tioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally
+difficult for human players:
+• Easy puzzles compiled from Kaggle, 17-clue, plus unbiased samples from the Sudoku puzzle
+ distribution 67 : totaling 1 149 158 puzzles.
+• Challenging puzzles compiled from Magictour 1465, Forum-Hard and Forum-Extreme subsets:
+ totaling 3 104 157 puzzles.
+The compiled data then undergo a strict 90/10 train-test split, ensuring that the test set puzzles
+cannot be derived through equivalent transformations of any training samples. Sudoku-Extreme is
+a down-sampled subset of this data containing 1000 training examples. We use Sudoku-Extreme in
+our main experiments (Figure 1), which focuses on small-sample learning scenarios. To guarantee
+convergence and control overfitting effects in our analysis experiments (Figures 2, 3 and 5), we use
+the complete training data, Sudoku-Extreme-Full, containing 3 831 994 examples.
+We measure puzzle difficulty by counting the number of search backtracks (“guesses”) required
+by a smart Sudoku solver program tdoku, which uses propositional logic to reduce the number of
+guesses 67 . Our Sudoku-Extreme dataset exhibits a mean difficulty of 22 backtracks per puzzle, sig-
+nificantly higher than existing datasets, including recent handmade puzzles Sudoku-Bench 68 which
+average just 0.45 backtracks per puzzle. These subset complexity levels are shown in Figure 6-(d).
+Maze-Hard This task involves finding the optimal path in a 30×30 maze, making it interpretable
+and frequently used for training LLMs in search tasks 69,70,71 . We adopt the instance generation
+procedure of Lehnert et al. 71 , but introduce an additional filter to retain only those instances whose
+difficulty exceeds 110. Here, “difficulty” is defined as the length of the shortest path, which aligns
+with the linear time complexity of the wavefront breadth-first search algorithm on GPUs 72 . A path
+is considered correct if it is valid and optimal—that is, the shortest route from the start to the goal.
+The training and test set both include 1000 examples.
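+The correctness criterion for Maze-Hard (a path must be valid and shortest) can be checked with a breadth-first search. The sketch below is a CPU illustration under assumptions that are not in the text: mazes are lists of strings with `#` marking walls, moves are 4-connected, and path length is counted in moves. It is not the GPU wavefront algorithm cited above.
+
```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Breadth-first search; returns the number of moves on a shortest path, or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None


def is_valid_optimal_path(grid, path, start, goal):
    """True if path runs from start to goal through open cells and has minimal length."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:  # consecutive cells must be adjacent
            return False
    return len(path) - 1 == shortest_path_length(grid, start, goal)


maze = [
    "..#.",
    ".##.",
    "....",
]
best = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (1, 3), (0, 3)]
detour = [(0, 0), (0, 1), (0, 0)] + best[1:]
```
+
+Under this definition, the difficulty filter described above would keep only instances for which `shortest_path_length` exceeds 110.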
+
+3.2 Evaluation Details
+For all benchmarks, HRM models were initialized with random weights and trained in the sequence-
+to-sequence setup using the input-output pairs. The two-dimensional input and output grids were
+flattened and then padded to the maximum sequence length. The resulting performance is shown in
+Figure 1. Remarkably, HRM attains these results with just ~1000 training examples per task—and
+without pretraining or CoT labels.
+For the ARC-AGI challenge, we start with (1) all demonstration and test input-label pairs from the
+training set, and (2) all demonstration pairs along with test inputs from the evaluation set. The
+dataset is augmented by applying translations, rotations, flips, and color permutations to the puz-
+zles. Each task example is prepended with a learnable special token that represents the puzzle it
+belongs to. At test time, we proceed as follows for each test input in the evaluation set: (1) Gener-
+ate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to
+obtain a prediction. (2) Choose the two most popular predictions as the final outputs.³ All reported
+results are obtained by comparing the outputs with the withheld test labels from the evaluation set.
+We augment Sudoku puzzles by applying band and digit permutations, while data augmentation is
+disabled for Maze tasks. Both tasks undergo only a single inference pass.
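+The test-time procedure for ARC-AGI (solve augmented variants, undo each augmentation, then keep the two most frequent predictions) can be sketched as a simple vote. The helpers below are hypothetical and use a single augmentation, a 90-degree rotation, purely for illustration; grids are tuples of tuples so that they can be counted.
+
```python
from collections import Counter


def rotate_cw(grid):
    """Rotate a grid (tuple of row tuples) 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def rotate_ccw(grid):
    """Inverse of rotate_cw: rotate 90 degrees counter-clockwise."""
    return tuple(zip(*grid))[::-1]


def top_two_predictions(predictions):
    """Return the two most frequent predictions, most frequent first."""
    counts = Counter(predictions)
    return [grid for grid, _ in counts.most_common(2)]


grid_a = ((1, 2), (3, 4))
grid_b = ((1, 2), (3, 0))
grid_c = ((0, 0), (0, 0))

# Pretend these came back from solving rotated variants, still in rotated form.
rotated_outputs = [rotate_cw(g) for g in (grid_a, grid_b, grid_a, grid_c, grid_b, grid_a)]

# Undo the augmentation before voting so that equivalent answers are counted together.
votes = [rotate_ccw(g) for g in rotated_outputs]
final_outputs = top_two_predictions(votes)
```
+
+Returning two candidates matches the two attempts that ARC-AGI allows for each test input.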
+For ARC-AGI, the scores of the CoT models are taken from the official leaderboard 29 , while for
+Sudoku and Maze, the scores are obtained by evaluating through the corresponding API.
+In Figure 1, the baselines are grouped based on whether they are pre-trained and use CoT, or neither.
+The “Direct pred” baseline means using “direct prediction without CoT and pre-training”, which
+retains the exact training setup of HRM but swaps in a Transformer architecture. Interestingly, on
+ARC-AGI-1, “Direct pred” matches the performance of Liao and Gu 73 , who built a carefully de-
+signed, domain-specific equivariant network for learning the ARC-AGI task from scratch, without
+pre-training. By substituting the Transformer architecture with HRM’s hierarchical framework and
+implementing ACT, we achieve more than a twofold performance improvement.
+On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and the
+baseline methods is significant, as the baselines almost never manage to solve the tasks. These
+benchmarks that demand lengthy reasoning traces are particularly difficult for CoT-based methods.
+With only 1000 training examples, the “Direct pred” baseline—which employs an 8-layer Trans-
+former identical in size to HRM—fails entirely on these challenging reasoning problems. When
+trained on the larger Sudoku-Extreme-Full dataset, however, “Direct pred” can solve some easy
+Sudoku puzzles and reaches 16.9% accuracy (see Figure 2). Lehnert et al. 71 showed that a large
+vanilla Transformer model with 175M parameters, trained on 1 million examples across multiple
+trials, achieved only marginal success on 30x30 Maze tasks, with accuracy below 20% using the
+pass@64 evaluation metric.
+
+3.3 Visualization of intermediate timesteps
+Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intrigu-
+ing question: what underlying reasoning algorithms does the HRM neural network actually imple-
+ment? Addressing this question is important for enhancing model interpretability and developing a
+deeper understanding of the HRM solution space.
+While a definitive answer lies beyond our current scope, we begin our investigation by analyzing
+state trajectories and their corresponding solution evolution. More specifically, at each timestep
+i and given the low-level and high-level state pair ($z_L^i$ and $z_H^i$) we perform a preliminary forward
+pass through the H-module to obtain $\bar{z}^i = f_H(z_H^i, z_L^i; \theta_H)$ and its corresponding decoded prediction
+$\bar{y}^i = f_O(\bar{z}^i; \theta_O)$. The prediction $\bar{y}^i$ is then visualized in Figure 7.
+In the Maze task, HRM appears to initially explore several potential paths simultaneously, subsequently
+eliminating blocked or inefficient routes, then constructing a preliminary solution outline
+followed by multiple refinement iterations. In Sudoku, the strategy resembles a depth-first search
+approach, where the model appears to explore potential solutions and backtracks when it hits dead
+ends. HRM uses a different approach for ARC tasks, making incremental adjustments to the board
+and iteratively improving it until reaching a solution. Unlike Sudoku, which involves frequent
+backtracking, the ARC solution path follows a more consistent progression similar to hill-climbing
+optimization.
+Importantly, the model shows that it can adapt to different reasoning approaches, likely choosing an
+effective strategy for each particular task. Further research is needed to gain more comprehensive
+insights into these solution strategies.
+
+ ³ The ARC-AGI allows two attempts for each test input.
+
+[Figure 7 panels: intermediate predictions for a Maze-Hard instance (timesteps i = 0 to 6), a Sudoku-Extreme puzzle (initial board, then timesteps i = 0 to 7), and ARC-AGI-2 tasks 7666fa5d (timesteps i = 0 to 4) and 7b80bb43 (timesteps i = 0 to 6), each ARC task shown with its example input, example output, and test input.]
+
+Figure 7: Visualization of intermediate predictions by HRM on benchmark tasks. Top: Maze-Hard
+(blue cells indicate the predicted path). Middle: Sudoku-Extreme (bold cells represent initial
+givens; red highlights cells violating Sudoku constraints; grey shading indicates changes from
+the previous timestep). Bottom: ARC-AGI-2 task (left: provided example input-output pair; right:
+intermediate steps solving the test input).
+
+
+ 4 Brain Correspondence
+ A key principle from systems neuroscience is that a brain region’s functional repertoire—its ability
+ to handle diverse and complex tasks—is closely linked to the dimensionality of its neural represen-
+ tations 75,76 . Higher-order cortical areas, responsible for complex reasoning and decision-making,
+ must handle a wide variety of tasks, demanding more flexible and context-dependent processing 77 .
+ In dynamical systems, this flexibility is often realized through higher-dimensional state-space tra-
+ jectories, which allow for a richer repertoire of potential computations 78 . This principle gives rise
+ to an observable dimensionality hierarchy, where a region’s position in the processing hierarchy
+
+ 13
+ (a) (c) (e)
+
+
+
+
+ (b) (d) (f)
+ 5.0
+
+ 4.5
+ Participation Ratio (PR)
+
+
+
+
+ 4.0
+
+ 3.5
+
+ 3.0
+
+ 2.5
+
+ 2.0
+
+ 0 20 40
+ Position in the hierarchy
+
+
+Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex. (a,b) are
+adapted from Posani et al. 74 . (a) Anatomical illustration of mouse cortical areas, color-coded by
+functional modules. (b) Correlation between Participation Ratio (PR), a measure of effective neural
+dimensionality, and hierarchical position across different mouse cortical areas. Higher positions in
+the hierarchy (e.g., MOs, ACAd) exhibit significantly higher PR values compared to lower sensory
+areas (e.g., SSp-n), with a Spearman correlation coefficient of ρ = 0.79 (P = 0.0003). (c,d) Trained
+HRM. (c) PR scaling of the trained HRM with task diversity. The dimensionality of the high-
+level module (zH ) scales with the number of unique tasks (trajectories) included in the analysis,
+indicating an adaptive expansion of its representational capacity. In contrast, the low-level module’s
+(zL ) dimensionality remains stable. (d) PR values for the low-level (zL , PR = 30.22) and high-
+level (zH , PR = 89.95) modules of the trained HRM, computed from neural activity during 100
+unique Sudoku-solving trajectories. A clear dimensionality hierarchy is observed, with the high-
+level module operating in a substantially higher-dimensional space. (e,f) Analysis of Untrained
+Network. To verify that the dimensionality hierarchy is an emergent property of training, the same
+analyses were performed on an untrained HRM with random weights. (e) In contrast to the trained
+model’s scaling in (c), the dimensionality of both modules in the untrained model remains low and
+stable, failing to scale with the number of tasks. (f) Similarly, contrasting with the clear separation
+in (d), the PR values for the untrained model’s modules (zL , PR = 42.09; zH , PR = 40.75) are
+low and nearly identical, showing no evidence of hierarchical separation. This confirms that the
+observed hierarchical organization of dimensionality is a learned property that emerges through
+training, not an artifact of the model’s architecture.
+
+ correlates with its effective dimensionality. To quantify this phenomenon, we can examine the
+Participation Ratio (PR), which serves as a standard measure of the effective dimensionality of a
+high-dimensional representation 79 . The PR is calculated using the formula
+$$\mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2},$$
+
+where {λi } are the eigenvalues of the covariance matrix of neural trajectories. Intuitively, a higher
+PR value signifies that variance is distributed more evenly across many dimensions, corresponding
+to a higher-dimensional representation. Conversely, a lower PR value indicates that variance is
+concentrated in only a few principal components, reflecting a more compact, lower-dimensional
+structure.
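The participation ratio defined above can be computed directly from a matrix of recorded states. The following is a minimal sketch, assuming the states are stacked as rows of a NumPy array; the function name `participation_ratio` and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np


def participation_ratio(states: np.ndarray) -> float:
    """Participation ratio of states with shape (n_samples, n_features).

    Illustrative helper (hypothetical name): PR = (sum_i l_i)^2 / sum_i l_i^2,
    where l_i are the eigenvalues of the sample covariance matrix.
    """
    # Center each feature so the covariance is taken about the mean state.
    centered = states - states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (states.shape[0] - 1)
    # The covariance matrix is symmetric, so eigvalsh applies.
    eigvals = np.linalg.eigvalsh(cov)
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))
```

For roughly isotropic data in d dimensions the value approaches d, while data confined to a single direction gives a value near 1. The same quantity can also be written as tr(C)^2 / tr(C^2), which avoids the eigendecomposition.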
+The dimensionality hierarchy can be observed, for example, in the mouse cortex, where the PR of
+population activity increases monotonically from low-level sensory areas to high-level associative
+areas, supporting this link between dimensionality and functional complexity 74 (Figure 8 (a,b)).
+We evaluated whether HRM reproduces this neuroscientific principle by calculating the PR for
+both recurrent modules after training on the Sudoku-Extreme Full dataset. The PR computation
+used the covariance matrix derived from neural states gathered across multiple Sudoku-solving
+trajectories. The results show a striking parallel to the biological findings. The low-level module’s
+state (zL ) occupies a relatively small subspace with a participation ratio of 30.22, whereas the high-
+level module’s state (zH ) operates in a substantially larger subspace with a participation ratio of
+89.95, as shown in Figure 8(d). Furthermore, Figure 8(c) shows that increasing the number of
+unique tasks (trajectories) from 10 to 100 causes zH dimensionality to scale up accordingly, while
+zL dimensionality remains stable. These results suggest an emergent separation of representational
+capacity between the modules that parallels their functional roles.
+To confirm that this hierarchical organization is an emergent property of training, and not an artifact
+of the network’s architecture, we performed a control analysis: we initialized an identical HRM
+architecture with random weights and, without any training, measured the PR of its modules as the
+network processed the same task-specific inputs given to the trained model.
+The results, shown in Figure 8(e,f), reveal a stark contrast: the high-level and low-level modules of
+the untrained network exhibit no hierarchical separation, with their PR values remaining low and
+nearly indistinguishable from each other. This control analysis validates that the dimensionality
+hierarchy is an emergent property that arises as the model learns to perform complex reasoning.
+The high-to-low PR ratio in HRM (zH /zL ≈ 2.98) closely matches that measured in the mouse
+cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where
+last-layer features converge to a low-dimensional subspace 80,81,82 . HRM therefore departs from the
+collapse pattern and instead fosters a high-dimensional representation in its higher module. This
+is significant because such representations are considered crucial for cognitive flexibility and are a
+hallmark of higher-order brain regions like the prefrontal cortex (PFC), which is central to complex
+reasoning.
+This structural parallel suggests that, by learning to partition its representations into a
+high-capacity, high-dimensional subspace (zH ) and a more specialized, low-dimensional one (zL ),
+HRM autonomously discovers an organizational principle that is thought to be fundamental for
+achieving robust and flexible reasoning in biological
+systems. This provides a potential mechanistic explanation for the model’s success on complex,
+long-horizon tasks that are intractable for models lacking such a differentiated internal structure.
+We emphasize, however, that this evidence is correlational. While a causal link could be tested
+via intervention (e.g., by constraining the H-module’s dimensionality), such methods are difficult
+to interpret in deep learning due to potential confounding effects on the training process itself.
+Thus, the causal necessity of this emergent hierarchy remains an important question for future
+investigation.
+
+
+5 Related Work
+Reasoning and algorithm learning Given the central role of reasoning problems and their close
+relation to algorithms, researchers have long explored neural architectures that enable algorithm
+learning from training instances. This line of work includes Neural Turing Machines (NTM) 83 ,
+the Differentiable Neural Computer (DNC) 84 , and Neural GPUs 85 , all of which construct iterative
+neural architectures that mimic computational hardware for algorithm execution, and are trained to
+learn algorithms from data. Another notable work in this area is Recurrent Relational Networks
+(RRN) 62 , which executes algorithms on graph representations through graph neural networks.
+Recent studies have integrated algorithm learning approaches with Transformer-based architec-
+tures. Universal Transformers extend the standard Transformer model by introducing a recurrent
+loop over the layers and implementing an adaptive halting mechanism. Geiping et al. 86 demonstrate
+that looped Transformers can generalize to a larger number of recurrent steps during inference than
+what they were trained on. Shen et al. 16 propose adding continuous recurrent reasoning tokens
+to the Transformer. Finally, TransNAR 8 combines recurrent graph neural networks with language
+models.
+Building on the success of CoT-based reasoning, a line of work has introduced fine-tuning methods
+that use reasoning paths from search algorithms (like A*) as SFT targets 87,71,70 .
+We also mention adaptive halting mechanisms designed to allocate additional computational re-
+sources to more challenging problems. This includes the Adaptive Computation Time (ACT) for
+RNNs 88 and follow-up research like PonderNet 89 , which aims to improve the stability of this allo-
+cation process.
+HRM further pushes the boundary of algorithm learning through a brain-inspired computational
+architecture that achieves exceptional data efficiency and model expressiveness, successfully dis-
+covering complex and diverse algorithms from just 1000 training examples.
+Brain-inspired reasoning architectures Developing a model with the reasoning power of the
+brain has long been a goal in brain-inspired computing. Spaun 90 is one notable example, which uses
+spiking neural networks to create distinct modules corresponding to brain regions like the visual
+cortex and prefrontal cortex. This design enables an architecture to perform a range of cognitive
+tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on hand-
+designed algorithms, which may limit its ability to learn new tasks. Another significant model is the
+Tolman-Eichenbaum Machine (TEM) 91 , which is inspired by the hippocampal-entorhinal system’s
+role in spatial and relational memory tasks. TEM proposes that medial entorhinal cells create a
+basis for structural knowledge, while hippocampal cells link this basis to sensory information. This
+ allows TEM to generalize and explains the emergence of various cell types like grid, border, and
+place cells. Another approach involves neural sampling models 92 , which view the neural signaling
+process as inference over a distribution, functioning similarly to a Boltzmann machine. These
+models often require hand-made rules to be set up for solving a specific reasoning task. In essence,
+while prior models are restricted to simple reasoning problems, HRM is designed to solve complex
+tasks that are hard for even advanced LLMs, without pre-training or task-specific manual design.
+Hierarchical memory The hierarchical multi-timescale structure also plays an important role in
+how the brain processes memory. Models such as Hierarchical Sequential Models 93 and Clockwork
+RNN 94 use multiple recurrent modules that operate at varying time scales to more effectively cap-
+ture long-range dependencies within sequences, thereby mitigating the forgetting issue in RNNs.
+Similar mechanisms have also been adopted in linear attention methods for memorizing long con-
+texts (see the Discussions section). Since HRM focuses on reasoning, full attention is applied for
+simplicity. Incorporating hierarchical memory into HRM could be a promising future direction.
+
+
+6 Discussions
+Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal
+Transformer 95 , HRM is computationally universal when given sufficient memory and time.
+In other words, it falls into the category of models that can simulate any Turing machine,
+overcoming the computational limitations of standard Transformers discussed previously in the in-
+troduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks,
+they suffer from premature convergence and memory-intensive BPTT. Therefore, in practice, their
+effective computational depth remains limited, though still deeper than that of a standard Trans-
+former. By resolving these two challenges and being equipped with adaptive computation, HRM
+could be trained on long reasoning processes, solve complex puzzles requiring intensive depth-first
+search and backtracking, and move closer to practical Turing-completeness.
+Reinforcement learning with chain-of-thought Beyond fine-tuning using human-annotated CoT,
+reinforcement learning (RL) represents another widely adopted training methodology. However,
+recent evidence suggests that RL primarily unlocks existing CoT-like capabilities rather than dis-
+covering fundamentally new reasoning mechanisms 96,97,98,99 . Additionally, CoT-training with RL
+is known for its instability and data inefficiency, often requiring extensive exploration and careful
+reward design. In contrast, HRM takes feedback from dense gradient-based supervision rather than
+relying on a sparse reward signal. Moreover, HRM operates naturally in a continuous space, which
+is biologically plausible and avoids allocating the same computational resources to each token, even
+though tokens vary in their reasoning and planning complexity 16 .
+Linear attention Recurrence has been explored not only for its capability in universal computa-
+tion, but also as a means to replace the attention mechanism in Transformers, which suffers from
+quadratic time and memory complexity 100 . Recurrent alternatives offer a more efficient design by
+processing input tokens sequentially and predicting the next token at each time step, similar to early
+RNN-based language models.
+Some linear-attention variants, such as Log-linear Attention 101 , share an RNN-like state-update that
+can be interpreted as propagating multi-timescale summary statistics, thereby retaining long-range
+context without the quadratic memory growth of standard self-attention. However, substituting the
+attention mechanism alone does not change the fact that Transformers are still fixed-depth, and
+ require CoT as a compensatory mechanism. Notably, linear attention can operate with a reduced
+key-value cache over extended contexts, making it more suitable for deployment on
+resource-constrained edge devices.
+
+
+7 Conclusion
+This work introduces the Hierarchical Reasoning Model, a brain-inspired architecture that lever-
+ages hierarchical structure and multi-timescale processing to achieve substantial computational
+depth without sacrificing training stability or efficiency. With only 27M parameters and train-
+ing on just 1000 examples, HRM effectively solves challenging reasoning problems such as ARC,
+Sudoku, and complex maze navigation, tasks that typically pose significant difficulties for
+contemporary LLM and chain-of-thought models.
+Although the brain relies heavily on hierarchical structures to enable most cognitive processes,
+these concepts have largely remained confined to academic literature rather than being translated
+into practical applications. The prevailing AI approach continues to favor non-hierarchical models.
+Our results challenge this established paradigm and suggest that the Hierarchical Reasoning Model
+represents a viable alternative to the currently dominant chain-of-thought reasoning methods, ad-
+vancing toward a foundational framework capable of Turing-complete universal computation.
+Acknowledgements We thank Mingli Yuan, Ahmed Murtadha Hasan Mahyoub and Hengshuai
+Yao for their insightful discussions and valuable feedback throughout the course of this work.
+
+
+References
+ 1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
+ http://www.deeplearningbook.org.
+ 2. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
+ recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
+ pages 770–778, 2015.
+ 3. Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold
+ circuits, 2023.
+ 4. Tom Bylander. Complexity results for planning. In Proceedings of the 12th International Joint
+ Conference on Artificial Intelligence - Volume 1, IJCAI’91, page 274–279, San Francisco,
+ CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558601600.
+ 5. William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In
+ Neural Information Processing Systems, 2023.
+ 6. David Chiang. Transformers in DLOGTIME-uniform TC0 . Transactions on Machine
+ Learning Research, 2025.
+ 7. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+ Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+ dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+ 8. Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex
+    Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers meet neural algorithmic
+ reasoners. ArXiv, abs/2406.09308, 2024.
+ 9. William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision
+ transformers. Transactions of the Association for Computational Linguistics, 11:531–545,
+ 2023. doi: 10.1162/tacl_a_00562.
+10. Jason Wei, Yi Tay, et al. Chain-of-thought prompting elicits reasoning in large language
+ models, 2022. arXiv preprint arXiv:2201.11903.
+11. William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of
+ thought. In ICLR, 2024.
+12. Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in
+ reasoning with large language models. ArXiv, abs/2402.08939, 2024.
+13. Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer "attacks" on chain-of-thought
+ reasoning. In Annual Meeting of the Association for Computational Linguistics, 2024.
+14. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius
+ Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data.
+ arXiv preprint arXiv:2211.04325, 2022.
+15. Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang,
+ Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive
+ survey on latent chain-of-thought reasoning, 2025.
+16. Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu.
+ Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.07423, 2024.
+ 17. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a
+ tool for communication rather than thought. Nature, 630(8017):575–586, 2024.
+18. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei.
+ Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and
+ Machine Intelligence, 2024.
+19. Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current
+ Opinion in Neurobiology, 55:82–89, 2019. ISSN 0959-4388. doi: https://doi.org/10.1016/j.
+ conb.2019.01.011.
+20. John D Murray, Alberto Bernacchia, David J Freedman, Ranulfo Romo, Jonathan D Wallis,
+ Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, et al.
+ A hierarchy of intrinsic timescales across primate cortex. Nature neuroscience, 17(12):1661–
+ 1663, 2014.
+21. Roxana Zeraati, Yan-Liang Shi, Nicholas A Steinmetz, Marc A Gieselmann, Alexander
+ Thiele, Tirin Moore, Anna Levina, and Tatiana A Engel. Intrinsic timescales in the
+ visual cortex change with selective attention and reflect spatial connectivity. Nature
+ communications, 14(1):1858, 2023.
+22. Julia M Huntenburg, Pierre-Louis Bazin, and Daniel S Margulies. Large-scale gradients in
+ human cortical organization. Trends in cognitive sciences, 22(1):21–31, 2018.
+23. Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by
+ feedforward and recurrent processing. Trends in neurosciences, 23(11):571–579, 2000.
+24. Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J
+ Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
+25. Klara Kaleb, Barbara Feulner, Juan Gallego, and Claudia Clopath. Feedback control guides
+ credit assignment in recurrent neural networks. Advances in Neural Information Processing
+ Systems, 37:5122–5144, 2024.
+26. Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
+ Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.
+27. François Chollet. On the measure of intelligence (abstraction and reasoning corpus), 2019.
+ arXiv preprint arXiv:1911.01547.
+28. Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024:
+ Technical report. ArXiv, abs/2412.04604, 2024.
+29. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+30. György Buzsáki. Gamma, alpha, delta, and theta oscillations govern cognitive processes.
+ International Journal of Psychophysiology, 39:241–248, 2000.
+31. György Buzsáki. Rhythms of the Brain. Oxford university press, 2006.
+32. Anja Pahor and Norbert Jaušovec. Theta–gamma cross-frequency coupling relates to the level
+ of human intelligence. Intelligence, 46:283–290, 2014.
+33. Adriano BL Tort, Robert W Komorowski, Joseph R Manns, Nancy J Kopell, and Howard
+ Eichenbaum. Theta–gamma coupling increases during the learning of item–context
+ associations. Proceedings of the National Academy of Sciences, 106(49):20942–20947, 2009.
+ 34. Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between
+ energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11,
+ 2016.
+35. Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert
+ Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent
+ networks of spiking neurons. Nature Communications, 11, 07 2020. doi: 10.1038/
+ s41467-020-17236-y.
+36. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in
+ Neural Information Processing Systems, pages 690–701, 2019.
+37. Zhengyang Geng, Xinyu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training
+ implicit models. ArXiv, abs/2111.05177, 2021.
+38. Katarina Begus and Elizabeth Bonawitz. The rhythm of learning: Theta oscillations as an
+ index of active learning in infancy. Developmental Cognitive Neuroscience, 45:100810, 2020.
+ ISSN 1878-9293. doi: https://doi.org/10.1016/j.dcn.2020.100810.
+39. Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep Equilibrium
+ Optical Flow Estimation . In 2022 IEEE/CVF Conference on Computer Vision and Pattern
+ Recognition (CVPR), pages 610–620, 2022.
+40. Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and
+ Thomas Moreau. Shine: Sharing the inverse estimate from the forward pass for bi-level
+ optimization and implicit models. ArXiv, abs/2106.00553, 2021.
+41. Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by jacobian
+ regularization. In International Conference on Machine Learning, 2021.
+42. Daniel Kahneman and P Egan. Thinking, fast and slow (farrar, straus and giroux, new york),
+ 2011.
+43. Matthew D Lieberman. Social cognitive neuroscience: a review of core processes. Annu. Rev.
+ Psychol., 58(1):259–289, 2007.
+44. Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. The brain’s default
+ network: anatomy, function, and relevance to disease. Annals of the new York Academy of
+ Sciences, 1124(1):1–38, 2008.
+45. Marcus E Raichle. The brain’s default mode network. Annual review of neuroscience, 38(1):
+ 433–447, 2015.
+46. Andrew Westbrook and Todd S Braver. Cognitive effort: A neuroeconomic approach.
+ Cognitive, Affective, & Behavioral Neuroscience, 15:395–415, 2015.
+47. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT
+ Press, Cambridge, MA, 2018.
+48. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
+ Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. ArXiv,
+ abs/1312.5602, 2013.
+49. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja,
+ Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning,
+ 2025.
+ 50. Shuo Xie and Zhiyuan Li. Implicit bias of adamw: L inf norm constrained optimization.
+ ArXiv, abs/2404.04454, 2024.
+51. Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the edge of
+ numerical stability. In The Thirteenth International Conference on Learning Representations,
+ 2025.
+52. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+ Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural
+ information processing systems, pages 5998–6008, 2017.
+53. Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta,
+ 2024. URL https://ai.meta.com/llama/.
+54. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
+ Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
+55. Noam M. Shazeer. Glu variants improve transformer. ArXiv, abs/2002.05202, 2020.
+56. Biao Zhang and Rico Sennrich. Root mean square layer normalization. ArXiv,
+ abs/1910.07467, 2019.
+57. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
+ normalizing neural networks. In Neural Information Processing Systems, 2017.
+58. JAX Developers. jax.nn.initializers.lecun_normal. Google Research, 2025. URL
+ https://docs.jax.dev/en/latest/_autosummary/jax.nn.initializers.lecun_
+ normal.html. Accessed June 22, 2025.
+59. Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop.
+ In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
+60. Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak,
+ Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and
+ Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. In Forty-first
+ International Conference on Machine Learning, 2024.
+61. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
+62. Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Neural
+ Information Processing Systems, 2017.
+63. Jieyi Long. Large language model guided tree-of-thought. ArXiv, abs/2305.08291, 2023.
+64. Yilun Du, Jiayuan Mao, and Josh Tenenbaum. Learning iterative reasoning through energy
+ diffusion. ArXiv, abs/2406.11179, 2024.
+65. Kyubyong Park. Can convolutional neural networks crack sudoku puzzles? https:
+ //github.com/Kyubyong/sudoku, 2018.
+66. Single-digit techniques. https://hodoku.sourceforge.net/en/tech_singles.php.
+ Accessed: 2025-06-16.
+67. Tom Dillion. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/
+ tdoku/, 2025.
+68. Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones. Sudoku-bench:
+ Evaluating creative reasoning with sudoku variants. arXiv preprint arXiv:2505.16135, 2025.
+69. Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous
+ thought machines. arXiv preprint arXiv:2505.05522, 2025.
+ 70. DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng.
+ Dualformer: Controllable fast and slow thinking by learning with randomized reasoning
+ traces, 2025.
+71. Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul McVay, Michael
+ Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search
+ dynamics bootstrapping. In First Conference on Language Modeling, 2024.
+72. Mubbasir Kapadia, Francisco Garcia, Cory D. Boatright, and Norman I. Badler. Dynamic
+ search on the gpu. In 2013 IEEE/RSJ International Conference on Intelligent Robots and
+ Systems, pages 3332–3337, 2013. doi: 10.1109/IROS.2013.6696830.
+73. Isaac Liao and Albert Gu. Arc-agi without pretraining, 2025. URL https:
+ //iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_
+ without_pretraining.html.
+74. Lorenzo Posani, Shuqi Wang, Samuel P Muscinelli, Liam Paninski, and Stefano Fusi.
+ Rarely categorical, always high-dimensional: how the neural code changes along the cortical
+ hierarchy. bioRxiv, pages 2024–11, 2025.
+75. Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K.
+ Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks.
+ Nature, 497:585–590, 2013. doi: 10.1038/nature12160.
+76. Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-
+ dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84,
+ 2013. doi: 10.1038/nature12742.
+77. Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function.
+ Annual Review of Neuroscience, 24(1):167–202, 2001. doi: 10.1146/annurev.neuro.24.1.167.
+78. Wolfgang Maass. Real-time computing without stable states: a new framework for neural
+ computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:
+ 10.1162/089976602760407955.
+79. Ege Altan, Sara A. Solla, Lee E. Miller, and Eric J. Perreault. Estimating the dimensionality
+ of the manifold underlying multi-electrode neural recordings. PLoS Computational Biology,
+ 17(11):e1008591, 2021. doi: 10.1371/journal.pcbi.1008591.
+80. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the
+ terminal phase of deep learning training. Proceedings of the National Academy of Sciences,
+ 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117.
+81. Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su. Exploring deep neural networks via
+ layer–peeled model: Minority collapse in imbalanced training. Proceedings of the National
+ Academy of Sciences, 118(43):e2103091118, 2021. doi: 10.1073/pnas.2103091118.
+82. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu.
+ A geometric analysis of neural collapse with unconstrained features. In Advances in Neural
+ Information Processing Systems, volume 34 of NeurIPS, pages 29820–29834, 2021.
+83. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014.
+84. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
+ Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John
+ Agapiou, et al. Hybrid computing using a neural network with dynamic external memory.
+ Nature, 538(7626):471–476, 2016.
+ 85. Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
+ 86. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R.
+ Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time
+ compute with latent reasoning: A recurrent depth approach, 2025.
+ 87. Tiedong Liu and Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic
+ tasks. ArXiv, abs/2305.14201, 2023.
+ 88. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv,
+ abs/1603.08983, 2016.
+ 89. Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. ArXiv,
+ abs/2107.05407, 2021.
+ 90. Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf, Yichuan
+ Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science, 338
+ (6111):1202–1205, 2012.
+ 91. James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil
+ Burgess, and Timothy EJ Behrens. The tolman-eichenbaum machine: unifying space and
+ relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–
+ 1263, 2020.
+ 92. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as
+ sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS
+ computational biology, 7(11):e1002211, 2011.
+ 93. Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term
+ dependencies. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural
+ Information Processing Systems, volume 8. MIT Press, 1995.
+ 94. Jan Koutník, Klaus Greff, Faustino J. Gomez, and Jürgen Schmidhuber. A clockwork rnn. In
+ International Conference on Machine Learning, 2014.
+ 95. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
+ Universal transformers, 2018. arXiv preprint arXiv:1807.03819.
+ 96. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng,
+ Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du,
+ and Yelong Shen. Reinforcement learning for reasoning in large language models with one
+ training example, 2025. URL https://arxiv.org/abs/2504.20571.
+ 97. Niklas Muennighoff. s1: Simple test-time scaling. arXiv preprint arXiv:2502.23456, 2025.
+ 98. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu,
+ Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng
+ Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025.
+ 99. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025.
+100. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms
+ through structured state space duality. ArXiv, abs/2405.21060, 2024.
+101. Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear
+ attention. arXiv preprint arXiv:2506.04761, 2025.
+ \ No newline at end of file
diff --git a/papers/txt/ptrm2025_probabilistic_trm.txt b/papers/txt/ptrm2025_probabilistic_trm.txt
new file mode 100644
index 0000000..a00f31c
--- /dev/null
+++ b/papers/txt/ptrm2025_probabilistic_trm.txt
@@ -0,0 +1,906 @@
+ Probabilistic Tiny Recursive Model
+
+
+ Amin Sghaier Ali Parviz Alexia Jolicoeur-Martineau
+ Mila – Quebec AI Institute Mila – Quebec AI Institute Independent
+ ILLS & ETS Montreal
+
+ {amin.sghaier, ali.parviz}@mila.quebec
+arXiv:2605.19943v1 [cs.AI] 19 May 2026
+
+
+
+
+ alexia.jolicoeur-martineau@mail.mcgill.ca
+
+
+
+ Abstract
+ Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of
+ the parameters of modern large language models (LLMs) by iteratively refining a
+ latent state and final answer. While powerful, their deterministic recursion can lead
+ to convergence at suboptimal solutions, with no escape mechanism. A common
+ workaround relies on task-specific input perturbations at test time combined with
+ answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-
+ agnostic framework for test-time compute scaling that addresses this limitation
+ through stochastic exploration. PTRM injects Gaussian noise at each deep recursion
+ step, enabling parallel trajectories to explore diverse solution basins, and selects
+ among them using the model’s existing Q head (used for early stopping in the
+ original TRM). Without requiring retraining or task-specific augmentations, PTRM
+ enables substantial accuracy gains across benchmarks, including Sudoku-Extreme
+ (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to
+ 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs
+ (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
+
+
+
+ [Figure 1 plot: two bar charts of accuracy (%). Left panel: PPBench Puzzles (sudoku, lightup,
+ nurikabe, heyawake, and tapa), with bars for Direct pred, gemini-3.1-pro, gpt-5.2@xhigh,
+ claude-opus-4-6, LLM ensemble, TRM, and PTRM (ours). Right panel: Sudoku-Extreme, with bars for
+ Direct pred, Deepseek R1, o3-mini-high, Claude 3.7 8K, HRM, TRM, and PTRM (ours). Legend:
+ chain-of-thought, pretrained; LLM ensemble; direct prediction; deterministic recursive prediction;
+ probabilistic recursive prediction (ours). LLM ensemble: best of 7 strongest LLMs, assuming access
+ to a perfect verifier.]
+
+ Figure 1: PTRM performance comparison. On various PPBench puzzles, PTRM boosts TRM
+ performance by 28.6 points without any retraining. It outperforms the strongest single frontier LLMs
+ by 56.5 points and an ensemble of the seven strongest LLMs (assuming a perfect verifier) by 36
+ points. On Sudoku-Extreme, PTRM reaches a state-of-the-art 98.75%.
+ 1 Introduction
+Tiny Recursive Models (TRM) [1] achieve strong performance on complex reasoning puzzles with
+orders of magnitude fewer parameters than the large language models (LLMs) they outperform on
+tasks like Sudoku-Extreme [2] and ARC-AGI [3, 4]. TRM and its predecessor Hierarchical Reasoning
+Model (HRM) [2] represent an emerging architectural alternative to standard autoregressive reasoning
+models. Rather than autoregressively generating chains of token-level reasoning, they recursively
+refine a latent state. This approach produces a single deterministic answer per input, fitting well with
+tasks where the answer is unique.
+Despite their strong performance, their deterministic inference does not make full use of their
+capabilities. We show that many of TRM’s incorrect answers are from rollouts trapped in bad latent
+space basins (i.e., regions of the latent space which decode to incorrect answers and from which the
+deterministic recursions cannot escape). This observation, which aligns with recent mechanistic work
+on related models [5], suggests that TRM has the capabilities to solve significantly more problems
+but is limited by its standard inference procedure.
+Although each puzzle has a unique correct answer, many distinct latent trajectories can reach it. This
+is analogous to reasoning LLMs, where many reasoning trajectories can lead to the same unique
+answer. However, being non-deterministic, LLMs can be randomly sampled in order to form different
+trajectories (including Chains of Thought and actual answer). By then selecting a trajectory using
+a voting mechanism or based on the answer’s projected value (via a verifier), LLMs can leverage
+test-time compute to achieve very high accuracy [6]. We propose a way to achieve similar test-time
+scaling performance gains by sampling stochastic latent trajectories, each producing a deterministic
+decoded answer, and selecting among the answers using the model’s own Q head.
+TRM’s Q head is trained jointly (as a correctness classifier) with the rest of the network and is
+conventionally used only at training time for adaptive computation time (ACT) [7]. It carries valuable
+information that the standard inference procedure discards.
+We propose Probabilistic TRM (PTRM), a test-time compute scaling framework that introduces a
+new width scaling axis. At inference we run K parallel rollouts per puzzle, each receiving Gaussian
+noise injected into the latent at every deep recursion step. The noise causes rollouts to follow different
+latent trajectories and settle in different basins. Among the resulting candidate answers, the Q head
+is used to select the one most likely to be correct. PTRM requires no training changes and no
+task-specific test-time augmentation, yet, as illustrated in Figure 1, delivers substantial accuracy
+gains across diverse reasoning benchmarks.
+
+2 Background: Tiny Recursive Model
+Tiny Recursive Model (TRM) is a single network that iteratively refines a predicted answer y to a
+question x through recursive updates of a reasoning latent z. Specifically, a single latent recursion
+consists of n updates to the latent state z followed by one update to the predicted answer y, all using
+the same two-layer network fθ : z ← fθ (x + y + z) n times, then y ← fθ (y + z).
+fθ distinguishes the two update types by whether the input includes x. A deep recursion runs T
+latent recursions in sequence, with only the final one retaining gradients, allowing the model to
+leverage a large effective depth while keeping training efficient.
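The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `make_toy_net` is a hypothetical stand-in for the two-layer network fθ, the function names are ours, and gradient handling (only the final latent recursion retains gradients) is omitted because the sketch covers the forward pass only.

```python
import numpy as np


def make_toy_net(dim, seed=0):
    """Hypothetical stand-in for the two-layer network f_theta (random weights)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    w2 = rng.normal(scale=dim ** -0.5, size=(dim, dim))

    def f_theta(h):
        return np.tanh(np.tanh(h @ w1) @ w2)

    return f_theta


def latent_recursion(f_theta, x, y, z, n):
    """n latent updates z <- f(x + y + z), then one answer update y <- f(y + z)."""
    for _ in range(n):
        z = f_theta(x + y + z)
    y = f_theta(y + z)
    return y, z


def deep_recursion(f_theta, x, y, z, n, T):
    """Run T latent recursions in sequence (forward pass only)."""
    for _ in range(T):
        y, z = latent_recursion(f_theta, x, y, z, n)
    return y, z
```

One deep recursion therefore applies the network T · (n + 1) times, which is where the large effective depth comes from.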
+Rather than doing one optimization step per sample, TRM is trained via deep supervision, which
+consists in keeping the previous latent state z and answer y as initialization (after being detached from
+the computational graph) for the next supervision step. This is done for up to Nsup supervision steps.
+The loss at each step is calculated using cross entropy between the predicted answer logits fO (y)
+(where fO is a linear output head) and the ground truth ytrue . This trains the network to progressively
+refine its prediction across reasoning steps. At inference, the recurrence can be unrolled for more
+steps than during training, providing a depth axis for test-time compute scaling (additional steps may
+correct otherwise-incorrect answers).
+
+ [Figure 2 plots: three columns (Quick success, Delayed success, Failure). Top row: PCA projection
+ of y (PC 1 vs. PC 2), with markers for correct answer, incorrect answer, start, and end. Bottom
+ row: Q value (left axis) and cell accuracy (right axis) vs. supervision step (0 to 15).]
+ Figure 2: TRM Trajectory Modes. PCA projection of y (top) and Q value (solid, left axis) with cell
+ accuracy (dashed, right axis) across supervision steps (bottom) for three PPBench puzzles, illustrating
+ three trajectory modes (left to right): quick success, delayed success, and failure (Sec. 3). Latents are
+ projected into the principal plane per puzzle, so PC axes are not comparable across plots. Trajectories
+ fade from light (early steps) to dark (later steps). A circle marks the start and a square marks the end.
+
+Without a halting mechanism during training, each puzzle stays in the mini-batch for Nsup supervision
+steps rather than being replaced after each one. To avoid wasting compute on already-solved samples,
+an Adaptive Computation Time (ACT) halting mechanism is used. This is done by adding a binary
+cross entropy loss between a halting logit q̂ = fQ(y) (where fQ is a linear Q head) and the binary exact
+correctness of the predicted answer ŷ = arg max fO(y): Lstep = CE(fO(y), ytrue) + BCE(q̂, 1[ŷ = ytrue]).
+The Q head thus allows the supervision loop to halt early on samples where sigmoid(q̂) > 0.5,
+improving data efficiency. During inference, the Q head is not used, and the model performs Nsup
+supervision steps to maximize answer correctness.
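The per-step loss Lstep = CE(fO(y), ytrue) + BCE(q̂, 1[ŷ = ytrue]) can be written out as a short NumPy sketch for a single puzzle. The function name, the array shapes, and the choice to average the cross entropy over cells are assumptions made for illustration; the source does not specify the reduction.

```python
import numpy as np


def step_loss(answer_logits, q_logit, y_true):
    """CE(f_O(y), y_true) + BCE(q_hat, 1[y_hat == y_true]) for one puzzle.

    answer_logits: (cells, classes) float array, i.e. f_O(y)
    q_logit:       scalar halting logit q_hat = f_Q(y)
    y_true:        (cells,) integer labels
    """
    # Cross entropy averaged over cells (assumed reduction), via a stable log-softmax.
    shifted = answer_logits - answer_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(y_true)), y_true].mean()

    # Halting target: 1 only if every cell of the decoded answer is correct.
    y_hat = answer_logits.argmax(axis=-1)
    target = float(np.array_equal(y_hat, y_true))

    # Binary cross entropy on a logit: softplus(q) - target * q.
    bce = np.logaddexp(0.0, q_logit) - target * q_logit
    return ce + bce, target
```

Because the target is exact-match correctness of the whole answer, the Q head is trained as a per-puzzle correctness classifier rather than a per-cell one.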
+ While TRM is powerful, it sometimes gets stuck in incorrect solutions. In the next section, we
+ investigate such failure cases in order to determine a way to remedy them.
+
+
+ 3 Problem: When Does TRM Fail?
+
+ 3.1 Analysis of failures and successes
+
+ We present observations about TRM that motivate our method. In this section, we train a TRM on
+ multiple Pencil Puzzle Bench (PPBench) [8] puzzles and inspect the latent dynamics and Q head
+ behavior across supervision steps on a held-out validation set. For each puzzle, we record the latent
+ yt and the Q logit q̂t = fQ (yt ) at every supervision step t = 1, . . . , Nsup , project the latents into
+ the principal plane (PCA per puzzle), and jointly plot the Q value alongside cell accuracy (fraction
+ of correct cells in the predicted answer) over supervision steps. Figure 2 shows paired PCA and
+ Q/cell-accuracy plots for three representative puzzles, illustrating three trajectory modes we observe:
+ Quick success: the trajectory transitions in a few steps from its starting location to a convergence
+ region and remains there. Cell accuracy and the Q value rise together and saturate near their maxima
+ within the same few steps.
+ Delayed success: the trajectory initially oscillates around one region and remains there for multiple
+ supervision steps before sharply escaping to a different region where it converges. During the initial
+ phase, the Q value is negative, and at the step where the trajectory escapes, both Q value and cell
+accuracy spike together.
+Failure: the trajectory oscillates in a bounded region without converging. Cell accuracy never reaches
+near 100%, and the Q value stays negative for all supervision steps.
+We refer to latent space regions that trajectories remain in across multiple supervision steps and
+exhibit similar cell accuracy throughout as basins. Basins where cell accuracy is near-maximal are
+good basins and basins where it is not are bad basins. Initially, failures and delayed successes behave
+similarly (both are caught in bad basins with negative Q). They diverge only later in their trajectories,
+when delayed successes find an escape to a good basin while failures remain stuck.
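The per-puzzle principal-plane projection used in Sec. 3.1 can be sketched with a plain SVD. This is an illustrative sketch only: it assumes each latent yt has been flattened into one row of a (steps, dim) array, and the function name is ours.

```python
import numpy as np


def principal_plane(latents):
    """Project a (steps, dim) trajectory onto its top two principal components.

    Returns the (steps, 2) coordinates and the fraction of variance
    explained by PC 1 and PC 2.
    """
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions; singular values come sorted descending.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:2]
```

Since the basis is fitted separately for each puzzle, coordinates from different puzzles live in different planes, which is why PC axes are not comparable across plots.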
+
+3.2 The Q head tracks trajectory quality
+
+ [Figure 3 plot: mean Q value (left axis) and mean cell accuracy (right axis) vs. supervision step
+ (0 to 14); legend: Incorrect (28), Correct (69), Cell accuracy (right axis).]
+ Figure 3: Q value follows cell accuracy across reasoning. Mean Q value (solid, left axis) and mean
+ cell accuracy (dashed, right axis) over supervision steps, aggregated over 100 PPBench validation
+ puzzles, separated by final correctness (green: correct, red: incorrect).
+
+Across all three modes (failures, delayed successes, and quick successes), we find that the Q head’s
+value closely tracks cell accuracy at every supervision step. To further confirm this, Figure 3
+aggregates trajectories from 100 PPBench validation puzzles, separating them by final-answer
+correctness. The aggregate view corroborates the per-puzzle observation: mean Q and mean cell
+accuracy rise together on correct trajectories and remain mostly flat on incorrect ones. Moreover, at
+convergence, the Q logit sharply separates the two populations where q̂ ≈ +6 (sigmoid ≈ 1) for
+correct trajectories and q̂ ≈ −6 (sigmoid ≈ 0) for incorrect ones. The Q head is therefore a reliable
+learned indicator of whether a trajectory has reached a good basin.
+Given the Q head’s ability to distinguish good from bad trajectories, a natural question follows:
+can we leverage the Q head to identify better trajectories? The main challenge is that the standard
+TRM is inherently deterministic, and thus cannot be used to sample different trajectories for a given
+problem. In the next section, we will show that by simply adding Gaussian noise to the latent state,
+we can sample different parallel trajectories and leverage the Q head to pick the best one.
+
+4 Method: Test-Time Compute Scaling via Stochastic Rollouts
+We propose Probabilistic TRM (PTRM), an inference-time procedure that makes the TRM recursion
+stochastic and selects the best of K resulting trajectories. PTRM requires no special training and
+can be readily applied to any pretrained TRM model. Furthermore, it requires no task-specific
+augmentations. PTRM works as follows: at each supervision step, we add Gaussian noise (scaled by
+σ) to the latent state input. The Q head fQ scores each candidate latent output, and the one with the
+highest Q value is selected and then decoded using the model’s output head fO . The algorithm in
+Figure 4 (left) states this formally. PTRM offers two complementary benefits: 1) it enables trajectories
+to escape bad basins where deterministic TRM remains stuck, and 2) it introduces width as a new
+axis for test-time scaling.
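The procedure just described (noise at every supervision step, then Q-head selection) can be sketched as follows. This is a hedged sketch under stated assumptions: `rec`, `f_O`, and `f_Q` are caller-supplied callables standing in for a deep recursion step, the output head, and the Q head; the sequential loop over rollouts replaces what would be a batched, parallel computation in practice; and the function name and signature are ours.

```python
import numpy as np


def ptrm_inference(rec, f_O, f_Q, x, z0, y0, K, D, sigma, seed=0):
    """Run K noisy rollouts of D supervision steps and keep the highest-Q answer.

    rec(x, z, y) -> (z, y): one deep recursion step (assumed interface)
    f_O(y) -> logits with classes on the last axis; f_Q(y) -> scalar Q logit
    """
    rng = np.random.default_rng(seed)
    best_q, best_answer = -np.inf, None
    for _ in range(K):  # rollouts are independent, so this loop could be batched
        z, y = z0.copy(), y0.copy()
        for _ in range(D):
            # Inject Gaussian noise into the latent before each deep recursion step.
            z = z + rng.normal(scale=sigma, size=z.shape)
            z, y = rec(x, z, y)
        q = float(f_Q(y))
        if q > best_q:
            best_q = q
            best_answer = np.argmax(f_O(y), axis=-1)
    return best_answer, best_q
```

Setting sigma to 0 and K to 1 recovers the deterministic rollout, so the same code path covers the baseline.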
+
+4.1 Escaping bad basins
+
+PTRM Inference
+    Input: puzzle x, rollouts K, supervision steps D, noise scale σ
+    for k = 1, . . . , K in parallel do
+        Initialize z_0^(k), y_0^(k)
+        for t = 1, . . . , D do
+            z_{t−1}^(k) += ϵ,  ϵ ∼ N(0, σ² I)
+            z_t^(k), y_t^(k) ← rec(x, z_{t−1}^(k), y_{t−1}^(k))
+        end for
+        ŷ^(k) ← arg max f_O(y_D^(k))
+        q̂^(k) ← f_Q(y_D^(k))
+    end for
+    return ŷ^(k*), k* = arg max_k q̂^(k)
+
+[Figure 4 diagram (right): (a) Standard TRM (deterministic): puzzle → ··· → answer along a depth
+axis of D deep recursion steps. (b) PTRM (ours): K stochastic rollouts + Q-head selection; along the
+width axis, rollouts k = 1, . . . , K each receive Gaussian noise injection (+ϵ) at every deep
+recursion step, and the final answer is chosen by arg max_k Q_k.]
+
+Figure 4: Left: PTRM inference procedure (the rec() function refers to a deep recursion step). Right:
+PTRM mechanism. (a) Standard TRM: a single deterministic rollout. (b) PTRM: K stochastic latent
+rollouts with Gaussian noise ϵ at each deep recursion step, with the Q head selecting the final answer.
+
+In Sec. 3, we found that some failed deterministic trajectories are caught in bad solution basins in
+latent space, with no way to escape. PTRM lets us test whether stochastic perturbations are enough
+for some of the rollouts of a previously failed puzzle to reach a good solution basin. Figure 5 shows
+K=100 independent rollouts, from the same failed puzzle used in Figure 2 (which fails at K=1),
+projected into the principal plane. Most rollouts (92%) remain stuck in the same bad basin, while
+a minority (8%) escape to a distinct region in latent space and produce correct answers. We also
+observe that recurrent noise creates a per-rollout probability of escape: at K = 5 no rollouts escape,
+at K = 25 one does, and at K = 100 eight do. This confirms that noise provides the stochasticity
+needed to occasionally find an escape trajectory.
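To make the per-rollout escape probability concrete, here is a small worked calculation. It assumes rollouts are independent with a fixed escape probability p, and it takes p = 0.08 from the single K = 100 run above (8 of 100 rollouts), which is a rough illustrative estimate rather than a measured rate.

```python
def prob_at_least_one_escape(p, K):
    """Probability that at least one of K independent rollouts escapes."""
    return 1.0 - (1.0 - p) ** K


# Illustrative estimate only: 8 escapes observed among 100 rollouts.
p_hat = 8 / 100
for K in (5, 25, 100):
    chance = prob_at_least_one_escape(p_hat, K)
    print(f"K={K:>3}: expected escapes = {p_hat * K:.1f}, P(at least one) = {chance:.3f}")
```

Under this assumption the chance of at least one escape is roughly one in three at K = 5 and close to certain at K = 100, which is in line with seeing no escapes at K = 5 and several at K = 100.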
+
+4.2 Width scaling
+
+Since more rollouts per puzzle compound the chance that at least one reaches a good basin, the
+number of rollouts K is a natural quantity to scale. Given K independent rollouts, pass@K (any
+rollout correct) is the oracle upper bound and best-Q@K (the rollout with highest q̂ is correct) is a
+metric available at inference without a correctness oracle. The choice of Q as selector is motivated by
+Sec. 3’s observation that Q accurately separates correct from incorrect trajectories (Figure 3).
+Figure 6 shows pass@K and best-Q@K as K grows, averaged over 3 seeds on the held-out PPBench
+validation set (sudoku, nurikabe, tapa, lightup, and heyawake). Both metrics rise from 76.4% at
+K = 1 to 89.5% at K = 100, a gain of 13 percentage points. Across all tested K, the gap between
+pass@K and best-Q@K stays under 1pp, making the Q head a strong verifier on this validation set.
+By contrast, mode@K (most frequent answer across rollouts) rises by only 1.3pp over the same
+range, showing that the width-scaling gains come mostly from the Q head’s ability to identify correct
+solutions even when they are rare.
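The three selection metrics can be stated precisely for a single puzzle with a few lines of Python. This is a sketch with hypothetical names; it assumes each decoded answer is hashable (for example a tuple of cell values) and that ties in Q value or vote count are broken by first occurrence.

```python
from collections import Counter


def width_metrics(answers, q_values, truth):
    """Return (pass@K, best-Q@K, mode@K) as booleans for one puzzle.

    answers:  decoded answers of the K rollouts (hashable)
    q_values: Q logits of the K rollouts
    truth:    the ground-truth answer
    """
    correct = [a == truth for a in answers]
    pass_at_k = any(correct)  # oracle: any rollout correct
    best = max(range(len(answers)), key=lambda k: q_values[k])
    best_q_at_k = correct[best]  # rollout with the highest Q value
    mode_answer, _ = Counter(answers).most_common(1)[0]
    mode_at_k = mode_answer == truth  # most frequent answer
    return pass_at_k, best_q_at_k, mode_at_k


# A rare correct answer with a high Q value: voting misses it, Q selection finds it.
print(width_metrics(["a", "a", "b"], [-6.0, -5.0, 6.0], truth="b"))
```

Averaging each boolean over puzzles gives the accuracy curves; the example shows the case that drives the gap between best-Q@K and mode@K, where the correct answer is in the minority.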
+Interaction with depth scaling. Depth is another scaling axis already supported by TRM, which
+consists of running more deep recursions (supervision steps) at inference than the Nsup the model
+was trained on. On the deterministic baseline (K=1), tripling the depth from 16 to 48 steps raises
+PPBench validation accuracy from 76.4% to 79.5% (+3.1pp). At higher K, depth scaling only
+provides additional gains on specific puzzle types such as sudoku (+4pp at K = 100). Both depth
+and width scaling can be seen as ways to explore the model’s solution space. Since rollouts are
+independent and parallelizable while extra depth is sequential, width is the more practical scaling
+axis.
+PTRM unlocks a simple and task-agnostic recipe for scaling TRM test-time compute. The next
+section evaluates the method across multiple benchmarks and against several baselines, including
+frontier LLMs.
+
+5 Experiments
+This section evaluates PTRM’s performance on diverse reasoning benchmarks. We compare against
+the deterministic TRM baseline, a non-recursive direct-prediction baseline, and frontier LLMs.
+Across several PPBench puzzles [8], Sudoku-Extreme [2], Maze-Hard [2], and ARC-AGI-2 [4],
+PTRM substantially boosts the performance of each pretrained TRM using only inference compute.
+
+[Figure 5 plot: scatter of rollouts in the principal plane, PC 1 (53% var) vs. PC 2 (34% var);
+legend: Correct (8), Incorrect (92), Start, End.]
+Figure 5: Stochastic rollouts escape bad basins. Principal plane projection of K = 100 independent
+rollouts of the same failed puzzle as in Figure 2 (right). 92 rollouts remain caught in the bad basin
+(red). 8 escape to a good basin and produce correct answers (green).
+
+[Figure 6 plot: PPBench accuracy (%) vs. rollouts per puzzle K (log scale; K = 1, 5, 10, 25, 100);
+curves: pass@K, best-Q@K, mode@K.]
+Figure 6: Width scaling. pass@K, best-Q@K, and mode@K as K grows, averaged over 3 seeds on a
+held-out PPBench validation set. The Q head is a strong verifier on the tested puzzles, consistently
+outperforming selection of the most frequent answer.
+
+
+ 5.1 Setup
+
+Datasets. Pencil Puzzle Bench (PPBench) [8] consists of 62,231 constraint-satisfaction pencil puzzles
+(from 94 puzzle types). From the full PPBench dataset, 300 puzzles (15 puzzles from 20 types)
+selected by Waugh [8] are held out to form the golden set. From the remainder we hold out a
+fixed-size validation set of 100 puzzles per puzzle type (50 for tapa, due to its smaller base size),
+and the rest forms the training set. We filter all three sets to puzzles of six types (sudoku, lightup,
+nurikabe, shakashaka, heyawake, and tapa) of grid size 9×9 for sudoku, and 10×10 for the rest.
+We use the validation set to track performance during training and select the final checkpoint. We
+report per-puzzle accuracy on five of these types on the golden set (TRM already reaches 100% on
+shakashaka, so we omit it from the reported results), with aggregate scores sample-weighted across
+types. We also report results on the Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 datasets.
+ Models and inference. For each benchmark we use a standard TRM checkpoint. For Sudoku-
+ Extreme we use the TRM-MLP variant (which the TRM paper showed to be stronger on Sudoku),
+ and for the other datasets, we use TRM-Att. PTRM inference uses K parallel rollouts each running
+ D supervision steps with Gaussian noise of scale σ added to the latent state at each supervision step.
+ The selected configuration (K, D, σ) varies by benchmark and is given alongside each result. Metrics
+ are averaged across three seeds.
+ Baselines. To isolate the contribution of PTRM’s stochastic rollouts from the underlying backbone,
+ we report standard TRM performance (the same checkpoint as PTRM, run deterministically). For
+ each dataset, we report the performance of frontier LLMs. For Sudoku-Extreme, Maze-Hard, and
+ ARC-AGI-2 we additionally report the published direct prediction and TRM baselines from [1].
+ Cost estimation. PPBench provides the dollar cost per attempt for each LLM. We convert PTRM’s
+ wall-clock to a comparable dollar figure using a single H100 at $2.50/hr (standard cloud pricing [9])
+ so that cost = $2.50 · tpuzzle /3600, where tpuzzle is the time (in seconds) to complete a puzzle.
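The conversion from wall-clock time to dollars is a one-line formula; the sketch below simply encodes it. The 1.44-second figure in the example is a hypothetical value chosen to show the scale at which the cost is about a tenth of a cent; the source does not report the per-puzzle time.

```python
H100_DOLLARS_PER_HOUR = 2.50  # hourly rate quoted in the text


def puzzle_cost(t_puzzle_seconds, dollars_per_hour=H100_DOLLARS_PER_HOUR):
    """Dollar cost of one puzzle: rate * t_puzzle / 3600."""
    return dollars_per_hour * t_puzzle_seconds / 3600.0


# Hypothetical timing, for illustration only.
print(f"${puzzle_cost(1.44):.4f} per puzzle at 1.44 s on one H100")
```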
+
+
+ 5.2 Pencil Puzzle Bench
+
+ 5.2.1 Per-puzzle accuracy
+
+ Table 1 reports per-puzzle accuracy on the PPBench golden set. PTRM at K=100, D=48, σ=0.2
+ raises aggregate best-Q@K from 62.6% to 91.2%. Increasing supervision depth alone (K=1, D=48)
+ gives a small boost over the standard TRM baseline (K=1, D=16). Most of the gain comes
+ from scaling width (stochastic rollouts). The largest improvements are on puzzle types where
+ the deterministic baseline performed the worst (most headroom): sudoku improves from 46.7% to
+97.8% and tapa from 40.0% to 80.0%.
+
+ % accuracy                     # Params  sudoku  lightup  nurikabe  heyawake  tapa  agg.
+ Direct prediction              27M       0.0     0.0      0.0       14.3      0.0   2.0
+ TRM (K=1, D=16)                7M        46.7    87.5     74.1      85.7      40.0  62.6
+ TRM (K=1, D=48)                7M        57.8    87.5     74.1      85.7      40.0  66.0
+ PTRM, best-Q@K (K=100, D=16)   7M        93.3    100      88.9      85.7      80.0  89.8
+ PTRM, best-Q@K (K=100, D=48)   7M        97.8    100      88.9      85.7      80.0  91.2
+Table 1: PPBench per-puzzle accuracy on the golden set. PTRM uses the same backbone as
+the deterministic TRM. Scaling depth alone (K=1, D=48) lifts aggregate accuracy by 3.4 points
+over the standard D=16 baseline. Combining depth with K=100 stochastic (σ=0.2) rollouts raises
+accuracy by 28.6 percentage points overall. The direct-prediction baseline is a larger transformer
+trained on the same data.
+
+
+
+5.2.2 Comparison with frontier LLMs on golden set
+PPBench reported per-puzzle results for several frontier LLMs using two strategies: 1) direct response
+from a single prompt, and 2) multi-turn agentic strategy with verification. We report results for direct
+and any (best of any strategy attempted, including agentic). The agentic strategy gives the LLM
+substantially more resources than PTRM has access to. It provides the LLM the ability to iteratively
+verify each move with a perfect verifier. The direct strategy is the fairer comparison since, while
+it may use the model provider’s reasoning harness, it does not have direct access to a multi-turn
+verifier (the LLM could still self-verify by writing verification code within the same response). We
+additionally observe that the agentic strategy was applied selectively in the published PPBench data:
+across the LLMs we compare against, only 9.6% of direct failures on the golden set were retried
+with agentic. We restrict the comparison to the 7 strongest LLMs that attempted every puzzle in our
+golden set: claude-opus-4-6@thinking, gpt-5.2@xhigh, gemini-3.1-pro, gpt-5.2@high,
+claude-sonnet-4-6@thinking, gpt-5.2@medium, and kimi-k2.5. Table 2 lists the top 3 in
+each strategy block.
+We additionally report an ensemble score formed from these 7 LLMs where a puzzle counts as solved
+if at least one of them solved it via any strategy. This ensemble setup is deliberately stacked against
+PTRM. It assumes a perfect verifier since, if any of the 7 LLMs produced a correct answer under
+any strategy, the ensemble counts it as solved, even though in practice we would not have access
+to an oracle verifier. Although it is not deployable, we include the ensemble to demonstrate that
+even under these heavily favorable conditions, frontier LLMs fall well short of PTRM. Ensemble
+cost-per-attempt averages over the attempts of all 7 models on each puzzle, and cost-per-correct
+divides total cost by the number of puzzles the ensemble solved.
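The ensemble scoring rule can be made explicit with a short sketch. The data layout (a dict from puzzle id to a list of per-attempt `(solved, cost)` pairs pooled across the 7 models) and the function name are assumptions for illustration, and the numbers in the example are made up, not taken from PPBench.

```python
def ensemble_stats(results):
    """Score an oracle ensemble.

    results: dict mapping puzzle id -> list of (solved: bool, cost: float),
             one entry per attempt by any model under any strategy.
    """
    n_attempts = sum(len(attempts) for attempts in results.values())
    total_cost = sum(cost for attempts in results.values() for _, cost in attempts)
    # A puzzle counts as solved if at least one attempt solved it (perfect verifier).
    n_solved = sum(any(ok for ok, _ in attempts) for attempts in results.values())
    return {
        "accuracy": n_solved / len(results),
        "cost_per_attempt": total_cost / n_attempts,
        "cost_per_correct": total_cost / n_solved if n_solved else float("inf"),
    }


# Made-up example: two puzzles, two attempts each.
toy = {"p1": [(True, 1.0), (False, 3.0)], "p2": [(False, 2.0), (False, 2.0)]}
print(ensemble_stats(toy))
```

Cost per correct grows much faster than cost per attempt because every failed attempt by every model is charged against the few puzzles the ensemble solves.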
+Table 2 reports the comparison. PTRM exceeds the strongest single LLM (direct strategy) by 56.5
+points aggregate (91.2% vs. 34.7%), and exceeds the LLM ensemble by 36 points (91.2% vs. 55.1%)
+despite the ensemble’s stacked advantages. Cost per attempt is several orders of magnitude higher for
+LLMs than PTRM.
+
+5.3 Sudoku-Extreme, Maze-Hard, and ARC-AGI-2
+
+For each benchmark we use the standard TRM checkpoint trained as described in [1] without
+modification (TRM-MLP for Sudoku-Extreme and TRM-Att for Maze-Hard and ARC-AGI-2).
+Table 3 summarizes results on all three.
+On Sudoku-Extreme, PTRM at K=100, D=64, σ=0.3 raises the deterministic baseline of 87.3% to
+99.06% pass@K and 98.75% best-Q@K, achieving state of the art.
+On Maze-Hard, PTRM at K=100, D=16, σ=1.0 reaches 95.63% pass@K, an 11.83 point gain
+over the 83.8% deterministic baseline. mode@K gives the best PTRM accuracy here at 86.73%
+(+2.93 points), with best-Q@K slightly behind at 85.17% (+1.37 points). While pass@K shows
+that PTRM is able to unlock several correct answers, the Q head identifies them less reliably than on
+the previous benchmarks.
+
+ % accuracy                          sudoku  lightup  nurikabe  heyawake  tapa  agg.  $/att.  $/corr.
+ Direct
+   gemini-3.1-pro                    6.7     75.0     22.2      0.0       30.0  24.5  $0.40   $1.62
+   gpt-5.2@xhigh                     20.0    50.0     0.0       0.0       50.0  24.5  $1.79   $7.29
+   claude-opus-4-6@thinking          0.0     87.5     44.4      0.0       60.0  34.7  $2.91   $8.40
+ Any strategy (direct or agentic)†
+   gemini-3.1-pro                    6.7     87.5     33.3      0.0       40.0  30.6  $10.38  $33.91
+   gpt-5.2@xhigh                     33.3    75.0     0.0       0.0       60.0  34.7  $3.09   $8.90
+   claude-opus-4-6@thinking          0.0     87.5     44.4      0.0       70.0  36.7  $4.38   $11.92
+ LLM ensemble†
+   Any strategy (direct or agentic)  46.7    100      44.4      0.0       80.0  55.1  $2.66   $38.51
+ Ours, trained from scratch, 7M parameters
+   PTRM, best-Q@K                    97.8    100      88.9      85.7      80.0  91.2  $0.001  $0.001
+Table 2: PTRM vs. frontier LLMs on PPBench golden. Per-puzzle accuracy and per-attempt /
+per-correct cost on the golden set. LLM costs are from PPBench. PTRM cost is estimated from H100
+wall-clock (Sec. 5.1). The direct and agentic blocks list the 3 highest scoring LLMs on aggregate,
+and the ensemble row uses all 7 listed in Sec. 5.2.2. † Assumes access to a perfect verifier.
+
+
+
+
+On ARC-AGI-2, the standard inference pipeline applies data augmentations and votes across them.
+PTRM adds K stochastic rollouts per augmentation. For selection, we pick the rollout with the
+highest Q value within each augmentation, then vote across augmentations as in the standard pipeline.
+With K=25 and σ=0.2, PTRM lifts pass@1 from 7.36% to 8.47% and pass@100 from 14.31% to
+15.97% over our deterministic TRM baseline, while matching it at pass@2.
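The two-stage selection used for ARC-AGI-2 (best Q within each augmentation, then a vote across augmentations) can be sketched as below. The sketch assumes each rollout's answer has already been mapped back to the original puzzle frame and is hashable, and that the final vote is a simple count; the function name and data layout are ours, and the actual voting rule of the standard pipeline may differ.

```python
from collections import Counter


def select_with_augmentation_voting(rollouts_per_aug, top_k=2):
    """Pick the highest-Q rollout per augmentation, then vote across augmentations.

    rollouts_per_aug: list (one entry per augmentation) of lists of
                      (answer, q_value) pairs for the K rollouts.
    Returns up to top_k answers ordered by vote count.
    """
    votes = Counter()
    for rollouts in rollouts_per_aug:
        answer, _ = max(rollouts, key=lambda pair: pair[1])
        votes[answer] += 1
    return [answer for answer, _ in votes.most_common(top_k)]


# Three augmentations with two rollouts each (made-up values).
example = [
    [("A", 2.0), ("B", -1.0)],
    [("B", 0.5), ("A", 3.0)],
    [("C", 1.0), ("A", -4.0)],
]
print(select_with_augmentation_voting(example))
```

Here the Q head only ever ranks rollouts that share an augmentation, so the cross-augmentation step stays identical to the deterministic pipeline.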
+
+
+                                            Sudoku-Extreme  Maze-Hard  ARC-AGI-2
+ Method                          # Params   Acc. (%)        Acc. (%)   pass@1  pass@2  pass@100
+ HRM                             27M        55.0            74.5       –       5.0     –
+ TRM                             5M / 7M†   87.4            85.3       –       7.8     –
+ Ours
+ Standard TRM, our reproduction  5M / 7M†   87.28           83.80      7.36    9.72    14.31
+ PTRM                            5M / 7M†   98.75           86.73      8.47    9.72    15.97
+Table 3: Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 results. For Sudoku-Extreme, K=100,
+D=64, σ=0.3. For Maze-Hard, K=100, D=16, σ=1.0. For ARC-AGI-2, K=25, D=16, σ=0.2.
+pass@k for ARC-AGI-2 reports the top-k predictions from the augmentation-voting pipeline. PTRM
+shows an accuracy improvement over standard TRM across all 3 benchmarks. † Following [1], 5M
+for Sudoku-Extreme (TRM-MLP), 7M for Maze-Hard and ARC-AGI-2 (TRM-Att).
+
+
+5.4 Q head selection as σ grows
+
+With a higher σ value, PTRM finds many correct solutions that the deterministic inference misses.
+For instance, on Maze-Hard, the deterministic model solves 83.8% of puzzles, but PTRM raises
+pass@K to nearly 96%. The extent to which PTRM helps depends on the task, but on every dataset
+we tested, it unlocks correct solutions well beyond the deterministic model’s reach.
+TRM’s jointly trained Q head serves as a strong verifier on most tasks. On PPBench and Sudoku-
+Extreme, best-Q@K reaches values within a point of the saturated pass@K, so PTRM’s exploration
+translates directly into accuracy gains. On Maze-Hard, more exploration (higher σ) produces
+significantly more correct rollouts, but the existing Q head is not able to identify them, leaving
+performance on the table. The gap between best-Q@K and pass@K represents headroom for a
+stronger verifier which is left for future work. Appendix B reports the full σ sweep.
+
+ 6 Related Work
+
+A long line of work explores recursive computation for iterative reasoning and representation re-
+finement. Early examples include Universal Transformers [10], Mixture-of-Recursions [11], Deep
+Thinking models [12, 13, 14], and HRM [2], all of which investigate the use of repeated computation
+steps to improve reasoning performance. More recent work has introduced methods to substantially
+accelerate TRM training [15], while TRM-style recursive architectures have also been extended to
+language modeling tasks [16].
+Building on this broader perspective of recursive computation, a growing body of work studies
+latent-space reasoning through the reuse of hidden states. Hao et al. [17] propose continuous
+“thinking tokens” derived from Chain-of-Thought (CoT) traces [18], which are autoregressively
+generated and appended to the model context, enabling reasoning directly in latent space without
+producing intermediate textual outputs. Similarly, Zhu et al. [19] formalize learning by superposition
+and demonstrate improvements on tasks such as graph reachability. By avoiding explicit token
+sampling and implicitly representing multiple reasoning trajectories, these approaches may mitigate
+the unfaithfulness and backtracking often observed in standard autoregressive reasoning [20, 21].
+Related to our work, Baek et al. [22] propose a generative version of TRM where the hidden state
+z is sampled rather than computed deterministically. This improves performance on multiple tasks, but requires
+retraining. Efstathiou and Balwani [23] (concurrent work) propose a similar test-time compute
+method where they only apply noise in the initial hidden state z, while we apply noise at every
+supervision step. Furthermore, they test their method on a small subset of the Sudoku-Extreme
+dataset, and treat it as a proof-of-concept that needs to be developed and tested further. Note that
+Baek et al. [22] also tested applying noise to the initial z with TRM and obtained negative results (no
+improvement in accuracy on two datasets).
+Our observations in Sec. 3 are consistent with the mechanistic analysis of Ren and Liu [5], who
+identify spurious fixed points in HRM’s latent dynamics on Sudoku-Extreme. Their method mitigates
+these attractors through a combination of task-specific training data augmentation, inference-time
+input perturbations, and model bootstrapping across training checkpoints, thereby effectively in-
+creasing test-time compute. However, these interventions are comparatively less general and less
+computationally efficient. In contrast, we observe analogous basin structure in TRM across multiple
+puzzle types and achieve attractor escape using a substantially simpler, task-agnostic mechanism:
+injecting Gaussian noise into the latent state at each supervision step while using a single deterministic
+checkpoint.
+
+
+7 Conclusion
+
+In this work, we introduced Probabilistic TRM (PTRM), a novel test-time scaling paradigm for
+Tiny Recursive Models (TRM) through parallel exploration and selection. This approach scales
+test-time compute using width (K parallel rollouts), yielding substantially larger gains than depth
+scaling (increasing deep recursion steps) alone. PTRM requires no retraining and does not rely on
+task-specific data augmentations making it extremely easy to use and versatile.
+By scaling both width and depth, PTRM obtains significant gains in accuracy when tested on a wide
+selection of puzzles. On PPBench (Sudoku, Lightup, Nurikabe, Heyawake, Tapa puzzles), PTRM
+nearly obtains twice the accuracy (91.2%; $0.001 cost) of ensemble of SOTA LLMs (55.1%; $38.51
+cost) at less than 0.0001x the cost. Furthermore, PTRM improves accuracy on Sudoku (from 87.4%
+to 98.75%), Maze-Hard (from 83.80% to 86.73%), and ARC-AGI-2 (from 7.8% to 8.47% pass@1).
+Limitations. Our experiments focus on reasoning puzzles rather than general tasks. We only test
+on a subset of PPBench puzzles. We are limited to puzzles with a small grid-size due to limited
+computational resources. It is not guaranteed that the method works as well for all types of problems
+(e.g., accuracy gains on ARC-AGI-2 and Heyawake are smaller).
+Future work. It would be interesting to understand why some puzzles benefit from test-time scaling
+more than others. We suspect that problems that are harder to verify (e.g., ARC-AGI-2) benefit less
+from PTRM because the Q head may struggle to distinguish correct solutions from incorrect ones.
+Developing stronger verifiers than the existing Q head is an interesting direction for future work.
+
+
+ References
+ [1] Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv
+ preprint arXiv:2510.04871, 2025.
+ [2] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and
+ Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+ [3] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+ [4] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-
+ agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831,
+ 2025.
+ [5] Zirui Ren and Ziming Liu. Are your reasoning models reasoning or guessing? a mechanistic
+ analysis of hierarchical reasoning models. arXiv preprint arXiv:2601.10679, 2026.
+ [6] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti-
+ mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,
+ 2024.
+ [7] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
+ arXiv:1603.08983, 2016.
+ [8] Justin Waugh. Pencil puzzle bench: A benchmark for multi-step verifiable reasoning. arXiv
+ preprint arXiv:2603.02119, 2026.
+ [9] Vast.ai. Rent h100 pcie gpus on vast.ai. https://vast.ai/pricing/gpu/H100-PCIE, 2026.
+ Accessed: 2026-05-01.
+[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
+ versal transformers. arXiv preprint arXiv:1807.03819, 2018.
+[11] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch,
+ Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic
+ recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
+[12] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum,
+ and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with
+ recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021.
+[13] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum,
+ and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation
+ without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242,
+ 2022.
+[14] Jay Bear, Adam Prugel-Bennett, and Jonathon Hare. Rethinking deep thinking: Stable learning
+ of algorithms using lipschitz constraints. Advances in Neural Information Processing Systems,
+ 37:97027–97052, 2024.
+[15] Navid Hakimi. Form follows function: Recursive stem model. arXiv preprint arXiv:2603.15641,
+ 2026.
+[16] Yinxi Li, Jiaao Chen, Fang Wu, Jiakai Yu, Heli Qi, Weihao Xuan, Haokai Zhao, Pengyu Nie,
+ Di Jin, and Xiangru Tang. Learning multi-step reasoning via persistent latent state propagation.
+ In Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning,
+ 2026.
+[17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong
+ Tian. Training large language models to reason in a continuous latent space. arXiv preprint
+ arXiv:2412.06769, 2024.
+[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
+ Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
+ Advances in neural information processing systems, 35:24824–24837, 2022.
+
+
+[19] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning
+ by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint
+ arXiv:2505.12514, 2025.
+[20] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny
+ Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring
+ faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
+[21] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul-
+ man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t
+ always say what they think. arXiv preprint arXiv:2505.05410, 2025.
+[22] Junyeob Baek, Mingyu Jo, Minsu Kim, Yoshua Bengio, and Sungjin Ahn. Generative recursive
+ reasoning models. ICLR 2026 Workshop on AI with Recursive Self-Improvement, 2026.
+[23] Andreas Efstathiou and Aishwarya Balwani. Recursive reasoning as attractor landscape search:
+ Mechanistic dynamics of the tiny recursive model. Workshop on Latent & Implicit Thinking
+ – Going Beyond CoT Reasoning, 2026. URL https://openreview.net/forum?id=kKps9W1K7n.
+
+
+
+
+ A Implementation Details
+
+A.1 Compute
+
+We train and evaluate all models on a single NVIDIA H100 80GB GPU. PTRM introduces no
+additional training cost over standard TRM since it operates entirely at inference time.
+
+A.2 Models
+
+All experiments use the standard TRM backbone [1] with the released architecture and training recipes.
+Following the TRM paper, we use the MLP variant (TRM-MLP, 5M parameters) for Sudoku-Extreme
+and the attention variant (TRM-Att, 7M parameters) for Maze-Hard, ARC-AGI-2, and PPBench.
+Layout and hyperparameters are unchanged from TRM.
+
+A.3 PPBench dataset construction
+
+Sudoku-Extreme, Maze-Hard, and ARC-AGI-2 use the same checkpoints and data splits as TRM.
+The PPBench dataset is more recent and has previously been used only with frontier LLMs, so we
+detail how we built our training, validation, and golden splits.
+
+Source. PPBench contains 62,231 constraint-satisfaction pencil puzzles spanning 94 puzzle types.
+Of these, 300 puzzles (15 puzzles × 20 types) are held out as the golden benchmark set by Waugh [8].
+
+Filtering. From the remaining 61,931 puzzles we hold out a validation set by sampling 100 puzzles
+from each puzzle type (50 for tapa, due to its smaller base size), and the rest forms the training
+set. We then filter all three sets (training, validation, golden) to retain only puzzles of six types
+(sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) at fixed grid sizes: 9×9 for sudoku
+and 10×10 for the others. Sudoku grids are padded with a pad token to 10×10, giving a uniform
+sequence length of seq_len = 100 across all six puzzle types. The deterministic TRM baseline
+reaches 100% accuracy on shakashaka, so we exclude it from per-puzzle accuracy reporting (no
+headroom to compare against PTRM).
+
+Augmentation. Each training puzzle is expanded into 10 examples using two augmentations: 1)
+trajectory sampling, where the input is set to a random intermediate solve state along the puzzle’s
+solution trajectory rather than always the empty initial grid, while the label is always the fully solved
+grid; and 2) dihedral transformation, where a random dihedral transformation of a square grid, among
+the 8 possibilities given by 4 rotations × 2 {identity, reflection}, is applied to both the input and the
+label. For each puzzle, the first example is the unaugmented (initial state, solved) pair. The remaining
+9 are randomly sampled (trajectory and dihedral transform). Validation and golden splits are not
+augmented.
+
+Resulting splits. The merged multi-type splits use a unified vocabulary of 294 tokens and seq_len =
+100. Per-type sample counts are reported in Table 4.
+
+ puzzle type train val golden
+ sudoku 7,810 97 15
+ lightup 9,504 65 8
+ nurikabe 15,180 55 9
+ heyawake 42,108 70 7
+ tapa 3,663 26 10
+ shakashaka∗ 20,702 62 12
+ total 98,967 375 61
+Table 4: Per-puzzle-type sample counts in the PPBench splits used in training and evaluation.
+∗ Shakashaka is included in training but excluded from per-puzzle accuracy reporting because deter-
+ministic TRM already solves all evaluated shakashaka puzzles.
+
+
+
+ B Noise Ablation
+
+ We ablate the inference noise level σ on three benchmarks at K=25 (K=100 for Maze-Hard) and
+ D=16 to keep the sweep tractable. For Sudoku-Extreme we randomly sample 1000 puzzles from the
+ test set for the same reason. Figure 7 shows pass@K, best-Q@K, and mode@K as a function of σ,
+ averaged over three random seeds.
+
+ [Figure 7 plot omitted. Panels: Sudoku-Extreme, Maze-Hard, ARC-AGI-2 (within-aug). x-axis: σ (0.0 to 1.0);
+ y-axis: accuracy (%). Series: pass@K, best-Q@K, mode@K, and the K = 1 baseline.]
+
+
+ Figure 7: pass@K, best-Q@K, and mode@K across σ per rollout batch. On every task,
+ increasing the inference noise consistently produces more correct rollouts (pass@K, blue) up to
+ a task-dependent σ value. The Q head (best-Q@K, orange) tracks the pass@K ceiling closely
+ on Sudoku-Extreme and leaves a larger gap on Maze-Hard and ARC-AGI-2. The shaded region
+ represents the verifier headroom (accuracy that a better verifier could extract). mode@K (green) has
+ the edge over the Q head only on Maze-Hard. For ARC-AGI-2, metrics are per puzzle/augmentation
+ to isolate the Q head’s verification abilities from the augmentation pipeline.
+
+ On Maze-Hard pass@K climbs from 83.8% (deterministic) to nearly 96% by σ≈1.0 and then
+ plateaus. On Sudoku-Extreme it is already near its ceiling at σ=0.1 and stays roughly flat across the
+ sweep. On ARC-AGI-2 it peaks near σ=0.6 before declining. Q head selection nearly matches the
+ ceiling (maximum pass@K) on Sudoku-Extreme: best-Q@K peaks at 98.5%, within a point of
+ pass@K’s peak of 99.3%. On the other hand, the gap between best-Q@K and maximum pass@K
+ is more pronounced on Maze-Hard and ARC-AGI-2 (headroom a stronger verifier could close).
+
+
+ C Q-guided Langevin sampling
+
+ We initially explored Langevin sampling (using the Q head gradient) as a more principled exploration
+ mechanism than the Gaussian noise injection used in PTRM. The idea is to better guide the stochastic
+ search by additionally steering each rollout (using the Q head gradient) toward regions of high Q
+ value. We ultimately found that the gain from this approach was entirely attributable to the Langevin
+ noise term, with the gradient component contributing nothing measurable on top of the equivalent
+ recurrent noise of Sec. 4. We document the approach here as a negative result.
+
+ Motivation. The Q head is trained as a correctness predictor over latent states. Let fQ (z) denote
+ the head’s scalar output. We treated E(z) = − log sigmoid(fQ (z)) as an energy function over latent
+ space. Empirical observations during early experiments suggested that regions of low E correspond
+ to good basins from which the decoded answer is likely correct. PCA visualizations of the latent
+ dynamics showed that ∇z fQ points toward the good-basin region from both good-basin (correct) and
+ bad-basin (incorrect) latents (Figure 8). This made ∇z fQ look like a valuable direction along which
+ to push latents.
+
+ Method. We sample from the target distribution p(z) ∝ exp(−E(z)) = sigmoid(fQ (z)) via Langevin
+ dynamics: at the end of each deep recursion step t = 1, . . . , D, we apply N Langevin steps to
+ the latent,
+     z ← z − η ∇z E(z) + √(2η) ξ,    ξ ∼ N (0, I).
+ The number of Langevin steps N is the additional scaling axis under this scheme.
+
+
+[Figure 8 plot omitted. Panels: t = 0, t = 5, t = 10, t = 15. Legend: Correct (21), Incorrect (4), Q.]
+
+
+
+
+Figure 8: y latents and their ∇z fQ gradients projected into the principal plane at several recur-
+sive/supervision steps, for multiple rollouts (using recurrent noise) of a single puzzle (correct rollouts
+in green, incorrect in red). Arrows are drawn at each latent in the direction of ∇z fQ . From both
+good-basin and bad-basin latents, gradients point toward the good-basin region. This visualization
+motivated the Langevin sampling experiment described below.
+
+
+Tractable gradient computation. TRM’s original Q head is a linear projection on a single token,
+fQ (y) = w⊤ y[:, 0]+b, so its gradient with respect to this head’s input is a constant vector independent
+of z. For ∇z fQ to be input-dependent, the gradient must flow back through the last latent recursion.
+This works but requires backpropagating through a full latent recursion at every Langevin step, which
+scales poorly with N . To make guidance tractable for large N , we replaced the linear Q head with
+an attention-pooled variant that reads the full latent and produces a scalar through a small nonlinear
+network. With this head, ∇z fQ can be computed by backpropagating through the head alone, which
+is ∼8× faster per step and does not sacrifice accuracy.
+
+The gain came from the noise, not the gradient. Comparing Langevin sampling against a noise-
+only ablation (with the same √(2η) ξ, but with the −η ∇z E(z) term zeroed out) produced essentially
+identical accuracy at matched N . The gradient component contributed nothing measurable on
+top of the equivalent recurrent noise. This prompted us to focus on the noise-only formulation in
+Sec. 4, which is much more impactful since it is: 1) significantly simpler (no retraining, no test-time
+backpropagation), 2) applicable to any TRM checkpoint out of the box, and 3) equally effective.
+
+D Per-puzzle accuracy on the PPBench validation set
+The main paper reports per-puzzle accuracy on the PPBench golden set (Table 1) for direct compara-
+bility with the LLM evaluations from Waugh [8] who used that set. For a lower-variance complement,
+Table 5 reports results on our validation set (313 puzzles across the five reported types vs. 49 for
+golden). Trends match the golden-set results: depth scaling alone (K=1, D=48) provides a small lift,
+and combining depth with stochastic rollouts (K=100, D=48, σ=0.2) raises aggregate best-Q@K
+from 76.4% to 90.4%, a 14.0 percentage-point improvement. The biggest gains again are on puzzles
+where the deterministic baseline has the most headroom (tapa ∼ 40% to 71.8%, sudoku ∼ 69%
+to 93.3%). Types where the baseline is already near ceiling (heyawake at 96.7%) increase only
+marginally.
+
+ % accuracy # Params sudoku lightup nurikabe heyawake tapa agg.
+ Direct prediction 27M 0.0 10.0 4.0 14.0 0.0 6.2
+ TRM (K=1, D=16) 7M 68.7 83.3 76.0 96.7 39.7 76.4
+ TRM (K=1, D=48) 7M 74.0 84.0 76.7 98.0 41.0 78.3
+ PTRM, best-Q@K (K=100, D=48) 7M 93.3 93.3 84.7 100 71.8 90.4
+Table 5: PPBench per-puzzle accuracy on the validation set. PTRM uses the same backbone as the
+deterministic TRM. Results on the larger validation set follow the same trends as on the golden set.
+
+
+
+
+ \ No newline at end of file
diff --git a/papers/txt/trm2025_tiny_recursive.txt b/papers/txt/trm2025_tiny_recursive.txt
new file mode 100644
index 0000000..55cf994
--- /dev/null
+++ b/papers/txt/trm2025_tiny_recursive.txt
@@ -0,0 +1,796 @@
+ Less is More: Recursive Reasoning with Tiny Networks
+
+
+
+ Alexia Jolicoeur-Martineau
+ Samsung SAIL Montréal
+ alexia.j@samsung.com
+
+
+ Abstract
+arXiv:2510.04871v1 [cs.LG] 6 Oct 2025
+
+
+
+
+ Hierarchical Reasoning Model (HRM) is a
+ novel approach using two small neural net-
+ works recursing at different frequencies. This
+ biologically inspired method beats Large Lan-
+ guage models (LLMs) on hard puzzle tasks
+ such as Sudoku, Maze, and ARC-AGI while
+ trained with small models (27M parameters)
+ on small data (∼ 1000 examples). HRM holds
+ great promise for solving hard problems with
+ small networks, but it is not yet well un-
+ derstood and may be suboptimal. We pro-
+ pose Tiny Recursive Model (TRM), a much
+ simpler recursive reasoning approach that
+ achieves significantly higher generalization
+ than HRM, while using a single tiny network
+ with only 2 layers. With only 7M parameters,
+ TRM obtains 45% test-accuracy on ARC-AGI-
+ 1 and 8% on ARC-AGI-2, higher than most
+ LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5
+ Pro) with less than 0.01% of the parameters.
+
+
+
+ 1. Introduction
+ While powerful, Large Language models (LLMs) can
+ struggle on hard question-answer problems. Given
+ that they generate their answer auto-regressively, there
+ is a high risk of error since a single incorrect token can
+ render an answer invalid. To improve their reliabil-
+ ity, LLMs rely on Chain-of-thoughts (CoT) (Wei et al.,
+ 2022) and Test-Time Compute (TTC) (Snell et al., 2024).
+ CoTs seek to emulate human reasoning by having the
+ LLM sample step-by-step reasoning traces prior to
+ giving their answer. Doing so can improve accuracy,
+ but CoT is expensive, requires high-quality reasoning
+ data (which may not be available), and can be brittle
+ since the generated reasoning may be wrong. To fur-
+ ther improve reliability, test-time compute can be used
+ by reporting the most common answer out of K or the
+ highest-reward answer (Snell et al., 2024).
+
+ Figure 1. Tiny Recursion Model (TRM) recursively improves
+ its predicted answer y with a tiny network. It starts with the
+ embedded input question x and initial embedded answer
+ y, and latent z. For up to Nsup = 16 improvement steps,
+ it tries to improve its answer y. It does so by i) recursively
+ updating n times its latent z given the question x, current
+ answer y, and current latent z (recursive reasoning), and
+ then ii) updating its answer y given the current answer y
+ and current latent z. This recursive process allows the model
+ to progressively improve its answer (potentially address-
+ ing any errors from its previous answer) in an extremely
+ parameter-efficient manner while minimizing overfitting.
+
+ Recursive Reasoning with Tiny Networks
+
+However, this may not be enough. LLMs with CoT
+and TTC are not enough to beat every problem. While
+LLMs have made significant progress on ARC-AGI
+(Chollet, 2019) since 2019, human-level accuracy still
+has not been reached (6 years later, as of writing of
+this paper). Furthermore, LLMs struggle on the newer
+ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test
+accuracy with a high amount of TTC) (Chollet et al.,
+2025; ARC Prize Foundation, 2025b).
+
+An alternative direction has recently been proposed by
+Wang et al. (2025). They propose a new way forward
+through their novel Hierarchical Reasoning Model
+(HRM), which obtains high accuracy on puzzle tasks
+where LLMs struggle to make a dent (e.g., Sudoku
+solving, Maze pathfinding, and ARC-AGI). HRM is a
+supervised learning model with two main novelties: 1)
+recursive hierarchical reasoning, and 2) deep supervision.
+
+Recursive hierarchical reasoning consists of recurs-
+ing multiple times through two small networks ( f L at
+high frequency and f H at low frequency) to predict the
+answer. Each network generates a different latent fea-
+ture: f L outputs z L and f H outputs z H . Both features
+(z L , z H ) are used as input to the two networks. The
+authors provide some biological arguments in favor of
+recursing at different hierarchies based on the different
+temporal frequencies at which the brains operate and
+hierarchical processing of sensory inputs.
+
+Deep supervision consists of improving the answer
+through multiple supervision steps while carrying the
+two latent features as initialization for the improve-
+ment steps (after detaching them from the computa-
+tional graph so that their gradients do not propagate).
+This provides residual connections, which emulate
+very deep neural networks that are too memory ex-
+pensive to apply in one forward pass.
+
+An independent analysis on the ARC-AGI benchmark
+showed that deep supervision seems to be the primary
+driver of the performance gains (ARC Prize Founda-
+tion, 2025a). Using deep supervision doubled accuracy
+over single-step supervision (going from 19% to 39%
+accuracy), while recursive hierarchical reasoning only
+slightly improved accuracy over a regular model with
+a single forward pass (going from 35.7% to 39.0% ac-
+curacy). This suggests that reasoning across different
+supervision steps is worth it, but the recursion done
+in each supervision step is not particularly important.
+
+In this work, we show that the benefit from recursive
+reasoning can be massively improved, making it much
+more than incremental. We propose Tiny Recursive
+Model (TRM), an improved and simplified approach
+using a much smaller tiny network with only 2 lay-
+ers that achieves significantly higher generalization
+than HRM on a variety of problems. In doing so, we
+improve the state-of-the-art test accuracy on Sudoku-
+Extreme from 55% to 87%, Maze-Hard from 75% to
+85%, ARC-AGI-1 from 40% to 45%, and ARC-AGI-2
+from 5% to 8%.
+
+2. Background
+
+HRM is described in Algorithm 2. We discuss the
+details of the algorithm further below.
+
+2.1. Structure and goal
+
+The focus of HRM is supervised learning. Given an
+input, produce an output. Both input and output are
+assumed to have shape [ B, L] (when the shape differs,
+padding tokens can be added), where B is the batch-
+size and L is the context-length.
+
+HRM contains four learnable components: the in-
+put embedding f I (·; θ I ), low-level recurrent network
+f L (·; θ L ), high-level recurrent network f H (·; θ H ), and
+the output head fO (·; θO ). Once the input is embedded,
+the shape becomes [ B, L, D ] where D is the embedding
+size. Each network is a 4-layer Transformer architec-
+ture (Vaswani et al., 2017), with RMSNorm (Zhang
+& Sennrich, 2019), no bias (Chowdhery et al., 2023),
+rotary embeddings (Su et al., 2024), and SwiGLU acti-
+vation function (Hendrycks & Gimpel, 2016; Shazeer,
+2020).
+
+2.2. Recursion at two different frequencies
+
+Given the hyperparameters used by Wang et al. (2025)
+(n = 2 f L steps, 1 f H steps; done T = 2 times), a
+forward pass of HRM is done as follows:
+
+    x ← f I ( x̃ )
+    z L ← f L (z L + z H + x ) # without gradients
+    z L ← f L (z L + z H + x ) # without gradients
+    z H ← f H (z L + z H ) # without gradients
+    z L ← f L (z L + z H + x ) # without gradients
+    z L ← z L .detach()
+    z H ← z H .detach()
+    z L ← f L (z L + z H + x ) # with gradients
+    z H ← f H (z L + z H ) # with gradients
+    ŷ ← argmax( fO (z H ))
+
+where ŷ is the predicted output answer, z L and z H are
+either initialized embeddings or the embeddings of
+the previous deep supervision step (after detaching
+them from the computational graph). As can be seen,
+
+
+def hrm(z, x, n=2, T=2):  # hierarchical reasoning
+    zH, zL = z
+    with torch.no_grad():
+        for i in range(n * T - 2):
+            zL = L_net(zL, zH, x)
+            if (i + 1) % T == 0:
+                zH = H_net(zH, zL)
+    # 1-step grad
+    zL = L_net(zL, zH, x)
+    zH = H_net(zH, zL)
+    return (zH, zL), output_head(zH), Q_head(zH)
+
+def ACT_halt(q, y_hat, y_true):
+    target_halt = (y_hat == y_true)
+    loss = 0.5 * binary_cross_entropy(q[0], target_halt)
+    return loss
+
+def ACT_continue(q, last_step):
+    if last_step:
+        target_continue = sigmoid(q[0])
+    else:
+        target_continue = sigmoid(max(q[0], q[1]))
+    loss = 0.5 * binary_cross_entropy(q[1], target_continue)
+    return loss
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    z = z_init
+    for step in range(N_sup):  # deep supervision
+        x = input_embedding(x_input)
+        z, y_pred, q = hrm(z, x)
+        loss = softmax_cross_entropy(y_pred, y_true)
+        # Adaptive computational time (ACT) using Q-learning
+        loss += ACT_halt(q, y_pred, y_true)
+        _, _, q_next = hrm(z, x)  # extra forward pass
+        loss += ACT_continue(q_next, step == N_sup - 1)
+        z = z.detach()
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q[0] > q[1]:  # early-stopping
+            break
+
+Figure 2. Pseudocode of Hierarchical Reasoning Models
+(HRMs).
+
+a forward pass of HRM consists of applying 6 function
+evaluations, where the first 4 function evaluations are
+detached from the computational graph and are not
+back-propagated through. The authors use n = 2
+with T = 2 in all experiments, but HRM can be gener-
+alized by allowing for an arbitrary number of L steps
+(n) and recursions (T) as shown in Algorithm 2.
+
+2.3. Fixed-point recursion with 1-step gradient
+ approximation
+
+Assuming that (z L , z H ) reaches a fixed-point (z∗L , z∗H )
+through recursing from both f L and f H ,
+
+    z∗L ≈ f L (z∗L + z H + x )
+    z∗H ≈ f H (z L + z∗H ) ,
+
+the Implicit Function Theorem (Krantz & Parks, 2002)
+with the 1-step gradient approximation (Bai et al.,
+2019) is used to approximate the gradient by back-
+propagating only the last f L and f H steps. This theo-
+rem is used to justify only tracking the gradients of
+the last two steps (out of 6), which greatly reduces
+memory demands.
+
+2.4. Deep supervision
+
+To improve effective depth, deep supervision is used.
+This consists of reusing the previous latent features
+(z H and z L ) as initialization for the next forward pass.
+This allows the model to reason over many iterations
+and improve its latent features (z L and z H ) until it
+(hopefully) converges to the correct solution. At most
+Nsup = 16 supervision steps are used.
+
+2.5. Adaptive computational time (ACT)
+
+With deep supervision, each mini-batch of data sam-
+ples must be used for Nsup = 16 supervision steps
+before moving to the next mini-batch. This is expen-
+sive, and there is a balance to be reached between
+optimizing a few data examples for many supervision
+steps versus optimizing many data examples with fewer
+supervision steps. To reach a better balance, a halting
+mechanism is incorporated to determine whether the
+model should terminate early. It is learned through
+a Q-learning objective that requires passing the z H
+through an additional head and running an additional
+forward pass (to determine if halting now rather than
+later would have been preferable). They call this
+method Adaptive computational time (ACT). It is only
+used during training, while the full Nsup = 16 super-
+vision steps are done at test time to maximize down-
+stream performance. ACT greatly diminishes the time
+spent per example (on average spending less than 2
+steps on the Sudoku-Extreme dataset rather than the
+full Nsup = 16 steps), allowing more coverage of the
+dataset given a fixed number of training iterations.
+
+2.6. Deep supervision and 1-step gradient
+ approximation replace BPTT
+
+Deep supervision and the 1-step gradient approxima-
+tion provide a more biologically plausible and less
+computationally expensive alternative to Backpropa-
+gation Through Time (BPTT) (Werbos, 1974; Rumel-
+hart et al., 1985; LeCun, 1985) for solving the temporal
+credit assignment (TCA) (Rumelhart et al., 1985; Wer-
+bos, 1988; Elman, 1990) problem (Lillicrap & Santoro,
+2019). The implication is that HRM can learn what
+would normally require an extremely large network
+without having to back-propagate through its entire
+depth. Given the hyperparameters used by Wang et al.
+(2025) in all their experiments, HRM effectively rea-
+sons over nlayers (n + 1) T Nsup = 4 × (2 + 1) × 2 × 16 =
+384 layers of effective depth.
+
+
+2.7. Summary of HRM
+
+HRM leverages recursion from two networks at dif-
+ferent frequencies (high frequency versus low fre-
+quency) and deep supervision to learn to improve
+its answer over multiple supervision steps (with ACT
+to reduce time spent per data example). This enables
+the model to imitate extremely large depth without
+requiring backpropagation through all layers. This
+approach obtains significantly higher performance on
+hard question-answer tasks that regular supervised
+models struggle with. However, this method is quite
+complicated, relying a bit too heavily on uncertain
+biological arguments and fixed-point theorems that
+are not guaranteed to be applicable. In the next sec-
+tion, we discuss those issues and potential targets for
+improvements in HRM.
+
+3. Target for improvements in Hierarchical
+ Reasoning Models
+
+In this section, we identify key targets for improve-
+ments in HRM, which will be addressed by our pro-
+posed method, Tiny Recursion Models (TRMs).
+
+3.1. Implicit Function Theorem (IFT) with 1-step
+ gradient approximation
+
+HRM only back-propagates through the last 2 of the 6
+recursions. The authors justify this by leveraging the
+Implicit Function Theorem (IFT) and one-step approx-
+imation, which states that when a recurrent function
+converges to a fixed point, backpropagation can be
+applied in a single step at that equilibrium point.
+
+There are concerns about applying this theorem to
+HRM. Most importantly, there is no guarantee that
+a fixed-point is reached. Deep equilibrium models
+normally do fixed-point iteration to solve for the fixed
+point z∗ = f (z∗ ) (Bai et al., 2019). However, in the case
+of HRM, it is not iterating to the fixed-point but simply
+doing forward passes of f L and f H . To make matters
+worse, HRM is only doing 4 recursions before stopping
+to apply the one-step approximation. After its first
+loop of two f L and 1 f H evaluations, it only applies a
+single f L evaluation before assuming that a fixed-point
+is reached for both z L and z H (z∗L = f L (z∗L + z H + x )
+and z∗H = f H (z∗L + z∗H )). Then, the one-step gradient
+approximation is applied to both latent variables in
+succession.
+
+The authors justify that a fixed-point is reached by
+depicting an example with n = 7 and T = 7 where
+the forward residuals are reduced over time (Figure 3
+in Wang et al. (2025)). Even in this setting, which is
+different from the much smaller n = 2 and T = 2 used
+in every experiment of their paper, we observe the
+following:
+
+ 1. the residual for z H is clearly well above 0 at every
+    step
+
+ 2. the residual for z L only becomes closer to 0 after
+    many cycles, but it remains significantly above 0
+
+ 3. z L is very far from converged after one f L evalu-
+    ation at T cycles, which is when the fixed-point
+    is assumed to be reached and the 1-step gradient
+    approximation is used
+
+Thus, while the application of the IFT theorem and
+1-step gradient approximation to HRM has some basis
+since the residuals do generally reduce over time, a
+fixed point is unlikely to be reached when the theorem
+is actually applied.
+
+In the next section, we show that we can bypass the
+need for the IFT theorem and 1-step gradient approxi-
+mation, thus bypassing the issue entirely.
+
+3.2. Twice the forward passes with Adaptive
+ computational time (ACT)
+
+HRM uses Adaptive computational time (ACT) during
+training to optimize the time spent on each data sam-
+ple. Without ACT, Nsup = 16 supervision steps would
+be spent on the same data sample, which is highly in-
+efficient. They implement ACT through an additional
+Q-learning objective, which decides when to halt and
+move to a new data sample rather than keep iterating
+on the same data. This allows much more efficient
+use of time especially since the average number of su-
+pervision steps during training is quite low with ACT
+(less than 2 steps on the Sudoku-Extreme dataset as
+per their reported numbers).
+
+However, ACT comes at a cost. This cost is not directly
+shown in the HRM paper, but it is shown in their of-
+ficial code. The Q-learning objective relies on a halting
+loss and a continue loss. The continue loss requires an
+extra forward pass through HRM (with all 6 function
+evaluations). This means that while ACT optimizes
+time more efficiently per sample, it requires 2 forward
+passes per optimization step. The exact formulation is
+shown in Algorithm 2.
+
+In the next section, we show that we can bypass the
+need for two forward passes in ACT.
+
+
+3.3. Hierarchical interpretation based on complex biological arguments
+
+The HRM authors justify the two latent variables and two networks operating at different
+hierarchies based on biological arguments, which are very far from artificial neural
+networks. They even try to match HRM to actual brain experiments on mice. While
+interesting, this sort of explanation makes it incredibly hard to parse out why HRM is
+designed the way it is. Given the lack of an ablation table in their paper and the
+over-reliance on biological arguments and fixed-point theorems (that are not perfectly
+applicable), it is hard to determine which parts of HRM help, and why. Furthermore, it is
+not clear why they use two latent features rather than other combinations of features.
+
+In the next section, we show that the recursive process can be greatly simplified and
+understood in a much simpler manner that does not require any biological argument,
+fixed-point theorem, hierarchical interpretation, or second network. It also explains why 2
+is the optimal number of features (z_L and z_H).
+
+def latent_recursion(x, y, z, n=6):
+    for i in range(n):  # latent reasoning
+        z = net(x, y, z)
+    y = net(y, z)  # refine output answer
+    return y, z
+
+def deep_recursion(x, y, z, n=6, T=3):
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T-1):
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat > 0:  # early-stopping
+            break
+
+Figure 3. Pseudocode of Tiny Recursion Models (TRMs).
+
+4. Tiny Recursion Models
+
+In this section, we present Tiny Recursion Models (TRMs). Contrary to HRM, TRM requires no
+complex mathematical theorem, hierarchy, or biological arguments. It generalizes better
+while requiring only a single tiny network (instead of two medium-size networks) and a
+single forward pass for the ACT (instead of 2 passes). Our approach is described in
+Algorithm 3 and illustrated in Figure 1. We also provide an ablation in Table 1 on the
+Sudoku-Extreme dataset (a dataset of difficult Sudokus with only 1K training examples, but
+423K test examples). Below, we explain the key components of TRMs.
+
+Table 1. Ablation of TRM on Sudoku-Extreme comparing % test accuracy, effective depth per
+supervision step (T(n + 1)n_layers), number of forward passes (NFP) per optimization step,
+and number of parameters.
+
+Method                    Acc (%)   Depth   NFP   # Params
+HRM                        55.0      24      2     27M
+TRM (T = 3, n = 6)         87.4      42      1     5M
+  w/ ACT                   86.1      42      2     5M
+  w/ separate f_H, f_L     82.4      42      1     10M
+  no EMA                   79.9      42      1     5M
+  w/ 4-layers, n = 3       79.5      48      1     10M
+  w/ self-attention        74.7      42      1     7M
+  w/ T = 2, n = 2          73.7      12      1     5M
+  w/ 1-step gradient       56.5      42      1     5M
+
+4.1. No fixed-point theorem required
+
+HRM assumes that the recursions converge to a fixed-point for both z_L and z_H in order to
+leverage the 1-step gradient approximation (Bai et al., 2019). This allows the authors to
+justify only back-propagating through the last two function evaluations (1 f_L and 1 f_H).
+To bypass this theoretical requirement, we define a full recursion process as containing n
+evaluations of f_L and 1 evaluation of f_H:
+
+    z_L ← f_L(z_L + z_H + x)
+    ...
+    z_L ← f_L(z_L + z_H + x)
+    z_H ← f_H(z_L + z_H).
+
+Then, we simply back-propagate through the full recursion process.
+
+Through deep supervision, the model learns to take any (z_L, z_H) and improve it through a
+full recursion process, hopefully making z_H closer to the solution. This means that by the
+design of the deep supervision goal, running a few full recursion processes (even without
+gradients) is expected to bring us closer to the solution. We propose to run T − 1
+recursion processes without gradient to improve (z_L, z_H) before running one recursion
+process with backpropagation.
+
+Thus, instead of using the 1-step gradient approximation, we apply a full recursion process
+containing n evaluations of f_L and 1 evaluation of f_H. This removes entirely the need to
+assume that a fixed-point
+is reached and the use of the IFT with the 1-step gradient approximation. Yet, we can still
+leverage multiple backpropagation-free recursion processes to improve (z_L, z_H). With this
+approach, we obtain a massive boost in generalization on Sudoku-Extreme (improving TRM from
+56.5% to 87.4%; see Table 1).
+
+4.2. Simpler reinterpretation of z_H and z_L
+
+HRM is interpreted as doing hierarchical reasoning over two latent features of different
+hierarchies due to arguments from biology. However, one might wonder why use two latent
+features instead of 1, 3, or more? And do we really need to justify these so-called
+"hierarchical" features based on biology to make sense of them? We propose a simple
+non-biological explanation, which is more natural, and directly answers the question of why
+there are 2 features.
+
+The fact of the matter is: z_H is simply the current (embedded) solution. The embedding is
+reversed by applying the output head and rounding to the nearest token using the argmax
+operation. On the other hand, z_L is a latent feature that does not directly correspond to
+a solution, but it can be transformed into a solution by applying z_H ← f_H(x, z_L, z_H).
+We show an example on Sudoku-Extreme in Figure 6 to highlight the fact that z_H does
+correspond to the solution, but z_L does not.
+
+Once this is understood, hierarchy is not needed; there is simply an input x, a proposed
+solution y (previously called z_H), and a latent reasoning feature z (previously called
+z_L). Given the input question x, current solution y, and current latent reasoning z, the
+model recursively improves its latent z. Then, given the current latent z and the previous
+solution y, the model proposes a new solution y (or stays at the current solution if it is
+already good).
+
+Although this has no direct influence on the algorithm, this re-interpretation is much
+simpler and more natural. It answers the question about why two features: remembering in
+context the question x, previous reasoning z, and previous answer y helps the model iterate
+on the next reasoning z and then the next answer y. If we were not passing the previous
+reasoning z, the model would forget how it got to the previous solution y (since z acts
+similarly to a chain-of-thought). If we were not passing the previous solution y, then the
+model would forget what solution it had and would be forced to store the solution y within
+z instead of using it for latent reasoning. Thus, we need both y and z separately, and
+there is no apparent reason why one would need to split z into multiple features.
+
+While this is intuitive, we wanted to verify whether using more or fewer features could be
+helpful. Results are shown in Table 2.
+
+More features (> 2): We tested splitting z into different features by treating each of the
+n recursions as producing a different z_i for i = 1, ..., n. Then, each z_i is carried
+across supervision steps. The approach is described in Algorithm 5. In doing so, we found
+performance to drop. This is expected because, as discussed, there is no apparent need for
+splitting z into multiple parts. It does not have to be hierarchical.
+
+Single feature: Similarly, we tested the idea of taking a single feature by only carrying
+z_H across supervision steps. The approach is described in Algorithm 4. In doing so, we
+found performance to drop. This is expected because, as discussed, it forces the model to
+store the solution y within z.
+
+Thus, we explored using more or fewer latent variables on Sudoku-Extreme, but found that
+having only y and z led to better test accuracy in addition to being the simplest and most
+natural approach.
+
+Table 2. TRM on Sudoku-Extreme comparing % test accuracy when using more or fewer latent
+features.
+
+Method              # of features   Acc (%)
+TRM y, z (Ours)     2               87.4
+TRM multi-scale z   n + 1 = 7       77.6
+TRM single z        1               71.9
+
+4.3. Single network
+
+HRM uses two networks, one applied frequently as a low-level module (f_L) and one applied
+rarely as a high-level module (f_H). This requires twice the number of parameters compared
+to regular supervised learning with a single network.
+
+As mentioned previously, while f_L iterates on the latent reasoning feature z (z_L in HRM),
+the goal of f_H is to update the solution y (z_H in HRM) given the latent reasoning and
+current solution. Importantly, since z ← f_L(x + y + z) contains x but y ← f_H(y + z) does
+not contain x, the task to achieve (iterating on z versus using z to update y) is directly
+specified by the inclusion or lack of x in the inputs. Thus, we considered the possibility
+that both networks could be replaced by a single network doing both tasks. In doing so, we
+obtain better generalization on Sudoku-Extreme (improving TRM from 82.4% to 87.4%; see
+Table 1) while reducing the number of parameters by half. It turns out that a single
+network is enough.
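+
+The recursion described in Sections 4.2 and 4.3 can be sketched in plain Python. This is a
+toy illustration under stated assumptions, not the authors' implementation: `net` is a
+hypothetical scalar stand-in for the shared network (it sums its inputs and applies tanh),
+and the `calls` counter exists only to make the n + 1 evaluations visible. The point is
+that one function serves both roles, and only the presence or absence of x tells them apart.

```python
import math

calls = {"with_x": 0, "without_x": 0}  # illustrative bookkeeping only


def net(*inputs):
    """Hypothetical stand-in for the single shared network."""
    return math.tanh(sum(inputs))


def latent_recursion(x, y, z, n=6):
    """Run n latent updates of z, then one update of the answer y."""
    for _ in range(n):
        z = net(x, y, z)          # x is included: refine the latent z
        calls["with_x"] += 1
    y = net(y, z)                 # x is omitted: refine the answer y
    calls["without_x"] += 1
    return y, z


y, z = latent_recursion(x=0.5, y=0.0, z=0.0, n=6)
```

+With n = 6 this makes six calls that see x and one that does not, matching the n
+evaluations of f_L and 1 evaluation of f_H in a full recursion process.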
+
+
+4.4. Less is more
+
+We attempted to increase capacity by increasing the number of layers in order to scale the
+model. Surprisingly, we found that adding layers decreased generalization due to
+overfitting. In doing the opposite, decreasing the number of layers while scaling the
+number of recursions (n) proportionally (to keep the amount of compute and emulated depth
+approximately the same), we found that using 2 layers (instead of 4 layers) maximized
+generalization. In doing so, we obtain better generalization on Sudoku-Extreme (improving
+TRM from 79.5% to 87.4%; see Table 1) while reducing the number of parameters by half
+(again).
+
+It is quite surprising that smaller networks are better, but 2 layers seem to be the
+optimal choice. Bai & Melas-Kyriazi (2024) also observed optimal performance for 2 layers
+in the context of deep equilibrium diffusion models; however, they had similar performance
+to the bigger networks, while we instead observe better performance with 2 layers. This may
+appear unusual, as with modern neural networks, generalization tends to directly correlate
+with model sizes. However, when data is too scarce and model size is large, there can be an
+overfitting penalty (Kaplan et al., 2020). This is likely an indication that there is too
+little data. Thus, using tiny networks with deep recursion and deep supervision appears to
+allow us to bypass a lot of the overfitting.
+
+4.5. Attention-free architecture for tasks with small fixed context length
+
+Self-attention is particularly good for long context lengths when L ≫ D since it only
+requires a matrix of [D, 3D] parameters, even though it can account for the whole sequence.
+However, when focusing on tasks where L ≤ D, a linear layer is cheap, requiring only a
+matrix of [L, L] parameters. Taking inspiration from the MLP-Mixer (Tolstikhin et al.,
+2021), we can replace the self-attention layer with a multilayer perceptron (MLP) applied
+on the sequence length. Using an MLP instead of self-attention, we obtain better
+generalization on Sudoku-Extreme (improving from 74.7% to 87.4%; see Table 1). This worked
+well on Sudoku 9x9 grids, given the small and fixed context length; however, we found this
+architecture to be suboptimal for tasks with large context length, such as Maze-Hard and
+ARC-AGI (both using 30x30 grids). We show results with and without self-attention for all
+experiments.
+
+4.6. No additional forward pass needed with ACT
+
+As previously mentioned, the implementation of ACT in HRM through Q-learning requires two
+forward passes, which slows down training. We propose a simple solution, which is to get
+rid of the continue loss (from the Q-learning) and only learn a halting probability through
+a Binary-Cross-Entropy loss of having reached the correct solution. By removing the
+continue loss, we remove the need for the expensive second forward pass, while still being
+able to determine when to halt with relatively good accuracy. We found no significant
+difference in generalization from this change (going from 86.1% to 87.4%; see Table 1).
+
+4.7. Exponential Moving Average (EMA)
+
+On small data (such as Sudoku-Extreme and Maze-Hard), HRM tends to overfit quickly and then
+diverge. To reduce this problem and improve stability, we integrate an Exponential Moving
+Average (EMA) of the weights, a common technique in GANs and diffusion models to improve
+stability (Brock et al., 2018; Song & Ermon, 2020). We find that it prevents sharp collapse
+and leads to higher generalization (going from 79.9% to 87.4%; see Table 1).
+
+4.8. Optimal number of recursions
+
+We experimented with different numbers of recursions by varying T and n and found that
+T = 3, n = 3 (equivalent to 48 recursions) in HRM and T = 3, n = 6 (equivalent to 42
+recursions) in TRM lead to optimal generalization on Sudoku-Extreme. More recursions could
+be helpful for harder problems (we have not tested it, given our limited resources);
+however, increasing either T or n incurs massive slowdowns. We show results at different n
+and T for HRM and TRM in Table 3. Note that TRM requires backpropagation through a full
+recursion process, thus increasing n too much leads to Out Of Memory (OOM) errors. However,
+this memory cost is well worth the price.
+
+In the following section, we show our main results on multiple datasets comparing HRM, TRM,
+and LLMs.
+
+5. Results
+
+Following Wang et al. (2025), we test our approach on the following datasets:
+Sudoku-Extreme (Wang et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-1 (Chollet,
+2019), and ARC-AGI-2 (Chollet et al., 2025). Results are presented in Tables 4 and 5.
+Hyperparameters are detailed in Section 6. Datasets are discussed below.
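+
+The effective depth quoted in Section 4.8 and Table 1 is T(n + 1)n_layers. As a quick check
+of that arithmetic, the sketch below recomputes the depth for a few configurations named in
+the text. The function name `effective_depth` is ours, chosen for illustration only.

```python
def effective_depth(T, n, n_layers):
    """Effective depth per supervision step: T * (n + 1) * n_layers."""
    return T * (n + 1) * n_layers


hrm_default = effective_depth(T=2, n=2, n_layers=4)  # HRM row of Table 1
trm_default = effective_depth(T=3, n=6, n_layers=2)  # TRM (T = 3, n = 6)
trm_4layer = effective_depth(T=3, n=3, n_layers=4)   # "w/ 4-layers, n = 3"
trm_small = effective_depth(T=2, n=2, n_layers=2)    # "w/ T = 2, n = 2"

print(hrm_default, trm_default, trm_4layer, trm_small)  # 24 42 48 12
```

+The (n + 1) factor counts n latent updates plus one answer update per recursion process,
+which is why the TRM default of T = 3, n = 6 with 2 layers gives a depth of 42.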
+
+
+
+Table 3. % Test accuracy on Sudoku-Extreme dataset. HRM versus TRM matched at a similar
+effective depth per supervision step (T(n + 1)n_layers).
+
+            HRM (n = k, 4 layers)    TRM (n = 2k, 2 layers)
+k    T      Depth    Acc (%)         Depth    Acc (%)
+1    1        9       46.4             7       63.2
+2    2       24       55.0            20       81.9
+3    3       48       61.6            42       87.4
+4    4       80       59.5            72       84.2
+6    3       84       62.3            78       OOM
+3    6       96       58.8            84       85.8
+6    6      168       57.5           156       OOM
+
+Sudoku-Extreme consists of extremely difficult Sudoku puzzles (Dillion, 2025; Palm et al.,
+2018; Park, 2018) (9x9 grid), for which only 1K training samples are used to test
+small-sample learning. Testing is done on 423K samples. Maze-Hard consists of 30x30 mazes
+generated by the procedure of Lehnert et al. (2024), whose shortest path has length above
+110; both the training set and test set include 1000 mazes.
+
+ARC-AGI-1 and ARC-AGI-2 are geometric puzzles involving monetary prizes. Each puzzle is
+designed to be easy for a human, yet hard for current AI models. Each puzzle task consists
+of 2-3 input–output demonstration pairs and 1-2 test inputs to be solved. The final score
+is computed as the accuracy over all test inputs from two attempts to produce the correct
+output grid. The maximum grid size is 30x30. ARC-AGI-1 contains 800 tasks, while ARC-AGI-2
+contains 1120 tasks. We also augment our data with the 160 tasks from the closely related
+ConceptARC dataset (Moskvichev et al., 2023). We provide results on the public evaluation
+set for both ARC-AGI-1 and ARC-AGI-2.
+
+While these datasets are small, heavy data augmentation is used in order to improve
+generalization. Sudoku-Extreme uses 1000 shuffling augmentations (done without breaking the
+Sudoku rules) per data example. Maze-Hard uses 8 dihedral transformations per data example.
+ARC-AGI uses 1000 data augmentations (color permutation, dihedral-group, and translation
+transformations) per data example. The dihedral-group transformations consist of random
+90-degree rotations, horizontal/vertical flips, and reflections.
+
+From the results, we see that TRM without self-attention obtains the best generalization on
+Sudoku-Extreme (87.4% test accuracy). Meanwhile, TRM with self-attention generalizes better
+on the other tasks (probably due to inductive biases and the overcapacity of the MLP on
+large 30x30 grids). TRM with self-attention obtains 85.3% accuracy on Maze-Hard, 44.6%
+accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-AGI-2 with 7M parameters. This is
+significantly higher than the 74.5%, 40.3%, and 5.0% obtained by HRM using 4 times the
+number of parameters (27M).
+
+Table 4. % Test accuracy on Puzzle Benchmarks (Sudoku-Extreme and Maze-Hard).
+
+Method               # Params    Sudoku   Maze
+Chain-of-thought, pretrained
+  Deepseek R1        671B         0.0      0.0
+  Claude 3.7 8K      ?            0.0      0.0
+  O3-mini-high       ?            0.0      0.0
+Direct prediction, small-sample training
+  Direct pred        27M          0.0      0.0
+  HRM                27M         55.0     74.5
+  TRM-Att (Ours)     7M          74.7     85.3
+  TRM-MLP (Ours)     5M/19M¹     87.4      0.0
+
+¹ 5M on Sudoku and 19M on Maze.
+
+Table 5. % Test accuracy on ARC-AGI Benchmarks (2 tries).
+
+Method                 # Params   ARC-1   ARC-2
+Chain-of-thought, pretrained
+  Deepseek R1          671B       15.8     1.3
+  Claude 3.7 16K       ?          28.6     0.7
+  o3-mini-high         ?          34.5     3.0
+  Gemini 2.5 Pro 32K   ?          37.0     4.9
+  Grok-4-thinking      1.7T       66.7    16.0
+  Bespoke (Grok-4)     1.7T       79.6    29.4
+Direct prediction, small-sample training
+  Direct pred          27M        21.0     0.0
+  HRM                  27M        40.3     5.0
+  TRM-Att (Ours)       7M         44.6     7.8
+  TRM-MLP (Ours)       19M        29.6     2.4
+
+6. Conclusion
+
+We propose Tiny Recursion Models (TRM), a simple recursive reasoning approach that achieves
+strong generalization on hard tasks using a single tiny network recursing on its latent
+reasoning feature and progressively improving its final answer. Contrary to the
+Hierarchical Reasoning Model (HRM), TRM requires no fixed-point theorem, no complex
+biological justifications, and no hierarchy. It significantly reduces the number of
+parameters by halving the number of layers and replacing the two networks with a single
+tiny network. It also simplifies the halting process, removing the need for the extra
+forward pass. Overall, TRM is much simpler than HRM, while achieving better generalization.
+
+While our approach led to better generalization on 4 benchmarks, not every choice made is
+guaranteed to be optimal on every dataset. For example, we found that replacing the
+self-attention with an MLP worked extremely well on Sudoku-Extreme (improving test accuracy
+by 10%), but poorly on other datasets. Different problem settings may require different
+architectures or numbers of parameters. Scaling laws are needed to parametrize these
+networks optimally. Although we simplified and improved on deep recursion, the question of
+why recursion helps so much compared to using a larger and deeper network remains to be
+explained; we suspect it has to do with overfitting, but we have no theory to back this
+explanation. Not all our ideas made the cut; we briefly discuss some of the failed ideas
+that we tried but did not work in Section 6.
+
+Currently, recursive reasoning models such as HRM and TRM are supervised learning methods
+rather than generative models. This means that given an input question, they can only
+provide a single deterministic answer. In many settings, multiple answers exist for a
+question. Thus, it would be interesting to extend TRM to generative tasks.
+
+Acknowledgements
+
+Thank you Emy Gervais for your invaluable support and extra push. This research was enabled
+in part by computing resources, software, and technical assistance provided by Mila and the
+Digital Research Alliance of Canada.
+
+References
+
+ARC Prize Foundation. The Hidden Drivers of HRM’s Performance on ARC-AGI.
+  https://arcprize.org/blog/hrm-analysis, 2025a. [Online; accessed 2025-09-15].
+ARC Prize Foundation. ARC-AGI Leaderboard. https://arcprize.org/leaderboard, 2025b.
+  [Online; accessed 2025-09-24].
+Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in neural
+  information processing systems, 32, 2019.
+Bai, X. and Melas-Kyriazi, L. Fixed point diffusion models. In Proceedings of the IEEE/CVF
+  Conference on Computer Vision and Pattern Recognition, pp. 9430–9440, 2024.
+Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural
+  image synthesis. arXiv preprint arXiv:1809.11096, 2018.
+Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
+Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. Arc-agi-2: A new
+  challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
+Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P.,
+  Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with
+  pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
+Dillion, T. Tdoku: A fast sudoku solver and generator. https://t-dillon.github.io/tdoku/,
+  2025.
+Elman, J. L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
+Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter
+  models with simple and efficient sparsity. Journal of Machine Learning Research,
+  23(120):1–39, 2022.
+Geng, Z. and Kolter, J. Z. Torchdeq: A library for deep equilibrium models. arXiv preprint
+  arXiv:2310.18605, 2023.
+Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint
+  arXiv:1606.08415, 2016.
+Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with k2-trees. In ICML 2023
+  Workshop on Structured Probabilistic Inference Generative Modeling, 2023.
+Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S.,
+  Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv
+  preprint arXiv:2001.08361, 2020.
+Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
+  arXiv:1412.6980, 2014.
+Krantz, S. G. and Parks, H. R. The implicit function theorem: history, theory, and
+  applications. Springer Science & Business Media, 2002.
+LeCun, Y. Une procedure d’apprentissage pour reseau a seuil asymetrique. Proceedings of
+  cognitiva 85, pp. 599–604, 1985.
+Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y. Beyond
+  a*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint
+  arXiv:2402.14083, 2024.
+Lillicrap, T. P. and Santoro, A. Backpropagation through time and the brain. Current
+  opinion in neurobiology, 55:82–89, 2019.
+Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint
+  arXiv:1711.05101, 2017.
+Moskvichev, A., Odouard, V. V., and Mitchell, M. The conceptarc benchmark: Evaluating
+  understanding and generalization in the arc domain. arXiv preprint arXiv:2305.07141,
+  2023.
+Palm, R., Paquet, U., and Winther, O. Recurrent relational networks. Advances in neural
+  information processing systems, 31, 2018.
+Park, K. Can convolutional neural networks crack sudoku puzzles?
+  https://github.com/Kyubyong/sudoku, 2018.
+Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. Grokking at the edge of numerical
+  stability. arXiv preprint arXiv:2501.04697, 2025.
+Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by
+  error propagation. Technical report, 1985.
+Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
+Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J.
+  Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv
+  preprint arXiv:1701.06538, 2017.
+Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be
+  more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
+Song, Y. and Ermon, S. Improved techniques for training score-based generative models.
+  Advances in neural information processing systems, 33:12438–12448, 2020.
+Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with
+  rotary position embedding. Neurocomputing, 568:127063, 2024.
+Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung,
+  J., Steiner, A., Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An all-mlp architecture
+  for vision. Advances in neural information processing systems, 34:24261–24272, 2021.
+Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł.,
+  and Polosukhin, I. Attention is all you need. Advances in neural information processing
+  systems, 30, 2017.
+Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A.
+  Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
+Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al.
+  Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
+  information processing systems, 35:24824–24837, 2022.
+Werbos, P. Beyond regression: New tools for prediction and analysis in the behavioral
+  sciences. PhD thesis, Committee on Applied Mathematics, Harvard University, Cambridge,
+  MA, 1974.
+Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market
+  model. Neural networks, 1(4):339–356, 1988.
+Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural
+  Information Processing Systems, 32, 2019.
+
+
+Hyper-parameters and setup
+
+All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017; Kingma & Ba,
+2014) with β1 = 0.9, β2 = 0.95, a small learning rate warm-up (2K iterations), batch size
+768, hidden size of 512, N_sup = 16 max supervision steps, and stable-max loss (Prieto et
+al., 2025) for improved stability. TRM uses an Exponential Moving Average (EMA) of 0.999.
+HRM uses n = 2, T = 2 with two 4-layer networks, while we use n = 6, T = 3 with one 2-layer
+network.
+
+For Sudoku-Extreme and Maze-Hard, we train for 60K epochs with learning rate 1e-4 and
+weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning rate 1e-4 (with 1e-2
+learning rate for the embeddings) and weight decay 0.1. The numbers for Deepseek R1, Claude
+3.7 8K, O3-mini-high, Direct prediction, and HRM in Tables 4 and 5 are taken from Wang et
+al. (2025). Both HRM and TRM add an embedding of shape [0, 1, D] on Sudoku-Extreme and
+Maze-Hard to the input. For ARC-AGI, each puzzle (containing 2-3 training examples and 1-2
+test examples) at each data augmentation is given a specific embedding of shape [0, 1, D]
+and, at test time, the most common answer out of the 1000 data augmentations is given as
+the answer.
+
+Experiments on Sudoku-Extreme were run with 1 L40S with 40 GB of RAM for generally less
+than 36 hours. Experiments on Maze-Hard were run with 4 L40S with 40 GB of RAM for less
+than 24 hours. Experiments on ARC-AGI were run for around 3 days with 4 H100 with 80 GB of
+RAM.
+
+Ideas that failed
+
+In this section, we quickly mention a few ideas that did not work to prevent others from
+making the same mistake.
+
+We tried replacing the SwiGLU MLPs by SwiGLU Mixture-of-Experts (MoEs) (Shazeer et al.,
+2017; Fedus et al., 2022), but we found generalization to decrease massively. MoEs clearly
+add too much unnecessary capacity, just like increasing the number of layers does.
+
+Instead of back-propagating through the whole n + 1 recursions, we tried a compromise
+between full back-propagation and the HRM 1-step gradient approximation, which
+back-propagates through the last 2 recursions. We did so by decoupling n from the number of
+last recursions k that we back-propagate through. For example, while n = 6 requires 7 steps
+with gradients in TRM, we can use gradients for only the k = 4 last steps. However, we
+found that this did not help generalization in any way, and it made the approach more
+complicated. Back-propagating through the whole n + 1 recursions makes the most sense and
+works best.
+
+We tried removing ACT with the option of stopping when the solution is reached, but we
+found that generalization dropped significantly. This can probably be attributed to the
+fact that the model is spending too much time on the same data samples rather than focusing
+on learning on a wide range of data samples.
+
+We tried weight tying the input embedding and output head, but this was too constraining
+and led to a massive generalization drop.
+
+We tried using TorchDEQ (Geng & Kolter, 2023) to replace the recursion steps by fixed-point
+iteration as done by Deep Equilibrium Models (Bai et al., 2019). This would provide a
+better justification for the 1-step gradient approximation. However, this slowed down
+training due to the fixed-point iteration and led to worse generalization. This highlights
+the fact that converging to a fixed-point is not essential.
+
+
+
+Algorithms with different numbers of latent features
+
+def latent_recursion(x, z, n=6):
+    for i in range(n+1):  # latent recursion
+        z = net(x, z)
+    return z
+
+def deep_recursion(x, z, n=6, T=3):
+    # recursing T-1 times to improve z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T-1):
+            z = latent_recursion(x, z, n)
+    # recursing once to improve z
+    z = latent_recursion(x, z, n)
+    # with a single feature, both heads read the answer from z
+    return z.detach(), output_head(z), Q_head(z)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    z = z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        z, y_hat, q_hat = deep_recursion(x, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        z = z.detach()
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat > 0:  # early-stopping
+            break
+
+Figure 4. Pseudocode of TRM using a single z with deep supervision training in PyTorch.
+
+def latent_recursion(x, y, z, n=6):
+    for i in range(n):  # latent recursion
+        z[i] = net(x, y, z[0], ... , z[n-1])
+    y = net(y, z[0], ... , z[n-1])  # refine output answer
+    return y, z
+
+def deep_recursion(x, y, z, n=6, T=3):
+    # recursing T-1 times to improve y and z (no gradients needed)
+    with torch.no_grad():
+        for j in range(T-1):
+            y, z = latent_recursion(x, y, z, n)
+    # recursing once to improve y and z
+    y, z = latent_recursion(x, y, z, n)
+    return (y.detach(), z.detach()), output_head(y), Q_head(y)
+
+# Deep Supervision
+for x_input, y_true in train_dataloader:
+    y, z = y_init, z_init
+    for step in range(N_supervision):
+        x = input_embedding(x_input)
+        (y, z), y_hat, q_hat = deep_recursion(x, y, z)
+        loss = softmax_cross_entropy(y_hat, y_true)
+        loss += binary_cross_entropy(q_hat, (y_hat == y_true))
+        loss.backward()
+        opt.step()
+        opt.zero_grad()
+        if q_hat > 0:  # early-stopping
+            break
+
+Figure 5. Pseudocode of TRM using multi-scale z with deep supervision training in PyTorch.
+
+Example on Sudoku-Extreme
+
+Input x (given digits, row by row; empty cells omitted)
+    8 3 1
+    9 6 8 7
+    3 5
+    6 8
+    6 2
+    7 4 3
+    9 4
+    2 4 1
+    6 2 5 7
+
+Output y
+    5 2 6 7 9 4 8 3 1
+    3 9 1 2 6 8 4 7 5
+    4 8 7 3 1 5 2 9 6
+    1 6 8 5 3 2 7 4 9
+    9 3 5 4 7 6 1 8 2
+    7 4 2 9 8 1 5 6 3
+    8 7 3 1 5 9 6 2 4
+    2 5 9 6 4 7 3 1 8
+    6 1 4 8 5 3 9 5 7
+
+Tokenized z_H (denoted y in TRM)
+    5 2 6 7 9 4 8 3 1
+    3 9 1 2 6 8 4 7 5
+    4 8 7 3 1 5 2 9 6
+    1 6 8 5 3 2 7 4 9
+    9 3 5 4 7 6 1 8 2
+    7 4 2 9 8 1 5 6 3
+    8 7 3 1 5 9 6 2 4
+    2 5 9 6 4 7 3 1 8
+    6 1 4 8 5 3 9 5 7
+
+Tokenized z_L (denoted z in TRM; row by row, empty cells omitted)
+    5 5 4 9 4 6 3
+    4 3 1 4 6 5
+    4 8 4 3 6 6 4
+    9 6 5 3 5 4
+    3 5 4 3 5 4 4
+    6 3 3 3 5 8 8
+    3 3 3 6 5 6 6 4
+    7 5 6 3 3 6 6
+    4 3 4 8 3 6 6 4
+
+Figure 6. This Sudoku-Extreme example shows an input, expected output, and the tokenized
+z_H and z_L (after reversing the embedding and using argmax) for a pretrained model. This
+highlights the fact that z_H corresponds to the predicted response, while z_L is a latent
+feature that cannot be decoded to a sensible output unless transformed into z_H by f_H.
+
+
+
+
+ \ No newline at end of file